
Governed Memory: The Missing Infrastructure for Enterprise AI Agents
Schema-Enforced Memory, Semantic Routing, and Consistency for Enterprise Multi-Agent Systems
The Memory Governance Gap
When multiple agents share a memory substrate across thousands of entity records, retrieval quality is necessary but insufficient. The real challenges are structural.
Schema Compliance
One agent stores "deal value" as free text; another expects a typed number. Downstream systems break silently.
Context Appropriateness
Your support agent gets the full brand playbook when it only needs the escalation policy. Wasted tokens, conflicting instructions.
Delivery Redundancy
The same compliance policy injected into every step of a multi-turn execution, consuming context window budgets.
Quality Opacity
No mechanism to detect that extraction quality is degrading or that governance routing is missing critical policies. Failures are silent.
RAG addresses retrieval relevance. It leaves four gaps open: no governance over what is stored, no organizational context routing, no session-aware delivery, and no quality feedback loop.
What the Paper Introduces
Four integrated mechanisms that close the memory governance gap — each validated through controlled experiments and production deployment.
Dual Memory Model
Capture everything, lose nothing
A single extraction pass simultaneously produces both open-set atomic facts (vector-embedded) and schema-enforced typed properties. Neither modality alone is sufficient: the combination preserves the 38% of facts that only open-set extraction captures, facts a schema-only system would silently discard.
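A minimal sketch of what a dual extraction output could look like, assuming a toy typed schema; all names here are illustrative, not the paper's API:

```python
from dataclasses import dataclass, field

SCHEMA = {"deal_value": float, "close_date": str}  # assumed typed schema

@dataclass
class ExtractionResult:
    atomic_facts: list = field(default_factory=list)   # open-set, for embedding
    typed_props: dict = field(default_factory=dict)    # schema-enforced

def extract(raw: dict) -> ExtractionResult:
    result = ExtractionResult()
    for key, value in raw.items():
        # Every statement is kept as an open-set fact, regardless of schema.
        result.atomic_facts.append(f"{key}: {value}")
        # Only values that coerce to the schema type become typed properties.
        if key in SCHEMA:
            try:
                result.typed_props[key] = SCHEMA[key](value)
            except (TypeError, ValueError):
                pass  # schema violation: the fact still survives in the open set
    return result

r = extract({"deal_value": "25000", "sentiment": "champion is hesitant"})
# "sentiment" has no schema slot, yet it is not lost: it lives on as an atomic fact.
```

Note how "deal value as free text" from the earlier failure mode is resolved here: the typed slot gets a real float, while off-schema observations remain retrievable.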
Governance Routing
The right context to the right agent
A tiered router selects which organizational context — policies, guidelines, templates — reaches each agent. A fast path (~200–400ms, zero LLM tokens) for real-time agents and a full two-stage path (~2–5s, chain-of-thought analysis) for batch workflows. Progressive delivery tracks what's been injected and sends only deltas.
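An illustrative sketch of the delta-only delivery idea described above: a session tracks which guideline IDs have already been injected, so each step receives only what is new. The fast path here is a plain tag lookup (zero LLM tokens); all function and field names are assumptions:

```python
GUIDELINES = {
    "g1": ("support", "Escalation policy: page on-call after 2 failed replies."),
    "g2": ("support", "Tone guide: plain language, no jargon."),
    "g3": ("sales",   "Discovery call checklist."),
}

def fast_route(agent_role: str) -> set[str]:
    """Fast path: tag lookup only, no LLM involved."""
    return {gid for gid, (role, _) in GUIDELINES.items() if role == agent_role}

class Session:
    def __init__(self):
        self.injected: set[str] = set()

    def deliver(self, agent_role: str) -> list[str]:
        """Progressive delivery: return only guidelines not yet injected."""
        selected = fast_route(agent_role)
        delta = selected - self.injected
        self.injected |= selected
        return [GUIDELINES[g][1] for g in sorted(delta)]

s = Session()
first = s.deliver("support")   # both support guidelines injected
second = s.deliver("support")  # delta is empty: nothing is re-sent
```

The second call returning nothing is the point: repeated steps of a multi-turn execution stop paying context-window cost for policies already in play.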
Reflection-Bounded Retrieval
Completeness checking with bounded cost
An iterative protocol checks evidence completeness and generates targeted follow-up queries within bounded rounds. But we report an honest finding: when data is absent from the store, no amount of retrieval sophistication helps. Data completeness outweighs retrieval sophistication.
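A minimal sketch of a reflection-bounded loop under stated assumptions: the completeness check and follow-up query generation are stand-in stubs, and `MAX_ROUNDS` is an invented cap, but the control flow mirrors the protocol described above:

```python
MAX_ROUNDS = 3

def retrieve(query: str, store: list[str]) -> list[str]:
    # Stand-in retriever: substring match in place of vector search.
    return [f for f in store if query.lower() in f.lower()]

def missing_aspects(question_terms: list[str], evidence: list[str]) -> list[str]:
    # Stand-in completeness check: which aspects have no supporting evidence?
    joined = " ".join(evidence).lower()
    return [t for t in question_terms if t.lower() not in joined]

def reflective_retrieve(question_terms: list[str], store: list[str]):
    evidence, rounds = [], 0
    pending = list(question_terms)
    while pending and rounds < MAX_ROUNDS:
        for term in pending:
            evidence.extend(retrieve(term, store))
        pending = missing_aspects(question_terms, evidence)
        rounds += 1
    return evidence, pending  # anything left pending is absent from the store

store = ["budget approved at 50k", "migration to Postgres planned"]
ev, unresolved = reflective_retrieve(["budget", "timeline"], store)
```

The run also illustrates the honest finding: "timeline" stays unresolved through every round because no such fact exists in the store, and extra retrieval rounds cannot manufacture it.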
Self-Improving Schema Lifecycle
Schemas that get better automatically
AI-assisted authoring bootstraps schemas from natural language. Automated evaluation scores every interaction against domain-specific rubrics. A three-phase refinement pipeline diagnoses underperforming properties and generates targeted improvements — all without human intervention.
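A hypothetical sketch of the diagnosis step in such a pipeline: per-property evaluation scores are aggregated, and properties falling below a threshold are flagged for regeneration. The rubric scores and threshold are invented for illustration:

```python
from statistics import mean

def diagnose(scores_by_property: dict[str, list[float]],
             threshold: float = 0.7) -> list[str]:
    """Flag schema properties whose average evaluation score is too low."""
    return sorted(
        prop for prop, scores in scores_by_property.items()
        if mean(scores) < threshold
    )

flagged = diagnose({
    "deal_value": [0.90, 0.95, 0.88],  # healthy extraction quality
    "close_date": [0.40, 0.55, 0.50],  # consistently underperforming
})
```

Flagged properties would then feed the later phases (targeted prompt or definition improvements), closing the loop without a human reviewing every interaction.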
Measured Against the Industry's Most Rigorous Benchmark
LoCoMo tests long-term conversational memory across 272 sessions and 1,542 questions — spanning single-hop recall, multi-hop reasoning, temporal understanding, and open-ended inference.
LoCoMo Overall Accuracy
On the largest category in the benchmark (841 questions), the system outperforms the human baseline of 75.4% by +8.2 percentage points — reasoning about accumulated knowledge rather than merely retrieving facts.
Across 30 conflict pairs where the same entity changed its database, budget, or cloud provider, the system surfaced the fresh claim in 83.3% of cases — with recency decay scoring stale entries 10–100× lower than fresh ones.
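A sketch of recency-decay scoring under an assumed exponential decay with a 30-day half-life; the paper reports stale entries scoring 10–100× lower, but the exact decay function and half-life here are assumptions:

```python
HALF_LIFE_DAYS = 30.0

def decayed_score(similarity: float, age_days: float) -> float:
    """Downweight a memory's similarity score by its age."""
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)
    return similarity * decay

fresh = decayed_score(0.8, age_days=1)     # recent claim
stale = decayed_score(0.9, age_days=200)   # conflicting claim, 200 days old
ratio = fresh / stale                       # fresh claim dominates by ~2 orders
```

Even though the stale entry starts with higher raw similarity (0.9 vs 0.8), decay pushes it far below the fresh claim, which is how a conflict pair resolves toward the newer value.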
Four Layers, One Feedback Loop
Content enters at Layer 1, flows up through governance and retrieval, and Layer 4 feeds refined schemas back — closing the quality loop. Each layer can be independently configured.
Three Metrics Worth Understanding
Not abstract benchmarks. Each one answers a question a production team actually asks — with a real-world example of what it looks like when it fails.
Defect Rate
“How often does the system hallucinate or contradict itself because of noisy retrieval?”
A meeting moved from Tuesday to Friday. If the system still reports Tuesday — because a stale memory scored higher in vector similarity — that's a Temporal Defect. Standard RAG carries an 8.4% baseline rate of this. Quality gates bring it to 6.3% — a 25% relative reduction in hallucinations caused by noisy context.
Signal-to-Noise Ratio
“For every useful rule injected into an agent's context, how much noise comes with it?”
Standard RAG scores 1.1:1 — roughly equal parts signal and noise. The LLM must reason through irrelevant, outdated, or contradictory context to find the useful part. Governed Memory's reasoning gate achieves 4.2:1. The downstream LLM is never confused by stale context because the gate correctly discarded it with 94.5% precision.
Compliance Rate
“Can the system be tricked into leaking sensitive data under pressure?”
50 deliberate attempts to get the system to reveal a CEO's personal cell phone number across easy, medium, and hard difficulty tiers. Zero leaks — not because a guardrail triggered every time, but because the reasoning gate maintained a negative constraint baseline that scrubbed the response regardless of how the query was framed.
18 Capabilities. Built for Production Enterprise.
Most enterprise memory deployments — RAG pipelines, single-agent stores, retrieval frameworks — address 2 to 4 of these. Governed Memory addresses all 18, in production, as an integrated system.
Every Result Comes From Production APIs, Not Lab Experiments
Deployed across multiple organizations, spanning sales, support, and research workflows. All experiments executed against the production API using synthetic identities.
3,800 scoped results from 500 queries spanning 5 query types, all under adversarial conditions. The 2.74% flagged were all false positives — contacts sharing name tokens like 'Aisha Chen' and 'Aisha Singh'. Hard email-key pre-filtering enforces ownership before vector search runs.
Governance-aware hybrid routing with zero LLM tokens. 65% of the guideline library discarded before the reasoning gate — surgical precision at production speed.
Scrubs secrets and PII before AND after LLM extraction. 4 sensitivity tiers across API keys, financial PII, identity PII, and contact data. 3 anonymization strategies: redact, mask, or hash.
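A sketch of the three named anonymization strategies (redact, mask, hash) applied to a matched PII span; the phone-number pattern and the strategy selection are illustrative assumptions, not the production rules:

```python
import hashlib
import re

def redact(value: str) -> str:
    return "[REDACTED]"

def mask(value: str) -> str:
    # Keep two characters at each end, star out the middle.
    return value[:2] + "*" * max(len(value) - 4, 0) + value[-2:]

def hash_(value: str) -> str:
    # Stable pseudonym: same input always maps to the same token.
    return hashlib.sha256(value.encode()).hexdigest()[:12]

PHONE = re.compile(r"\+?\d[\d\-\s]{7,}\d")

def scrub(text: str, strategy=mask) -> str:
    return PHONE.sub(lambda m: strategy(m.group()), text)

out = scrub("Call the CEO at +1 415-555-0142", strategy=redact)
```

Running the same `scrub` both before extraction (on raw input) and after (on LLM output) is what gives the before-AND-after guarantee: a number the model echoes back still hits the same filter.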
When we ingested a contact's email thread after their discovery call and follow-up call, the email thread added zero new memories. Every fact was already known. 162 of 195 extraction attempts were duplicates.
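A minimal sketch of ingest-time deduplication that would produce that behavior: each candidate fact is normalized and checked against known facts before being stored, so a later channel that repeats earlier call notes adds nothing. The normalization rules are illustrative:

```python
def normalize(fact: str) -> str:
    # Case- and whitespace-insensitive key; real systems would go further
    # (semantic similarity), but this shows the gate's shape.
    return " ".join(fact.lower().split())

class Store:
    def __init__(self):
        self.seen: set[str] = set()
        self.facts: list[str] = []

    def ingest(self, fact: str) -> bool:
        key = normalize(fact)
        if key in self.seen:
            return False  # duplicate: nothing new is stored
        self.seen.add(key)
        self.facts.append(fact)
        return True

s = Store()
s.ingest("Budget approved at $50k")          # from the discovery call
added = s.ingest("budget  approved at $50K") # email thread repeats it
```

The second ingest returns `False` and the store stays at one fact, which is the "zero new memories" outcome described above.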
Defining a New Discipline
This paper introduces terminology and abstractions intended to serve as a reference architecture for enterprise memory governance — a discipline that didn't have a name until now.
Read the Full Paper
The overview above covers the what. The paper covers the why, the how, and the honest limitations — including the results that surprised us.
- Why reflection only helps when the data is already there — and what to invest in instead
- Why 38% of enterprise knowledge is permanently lost in a schema-only system
- How naming a policy 'Sales: Discovery Call' vs 'Policy_1' doubles its discovery rate
- 7 formal algorithms with pseudocode — 16 controlled experiments with ground truth
- Deliberate negative results reported transparently
Hamed Taheri · Personize AI · February 2026