Krish Kankure

Machine Learning & Data Engineer


SSMs as Streaming Context Compressors Instead of RAG for Long-Running LLM Inputs

Most production LLM systems handle long inputs with some combination of truncation, chunking, summarization, and retrieval-augmented generation (RAG). That works well for lookup-heavy tasks, but it starts to feel awkward for inputs that are continuously growing or streaming, or whose important information is distributed across time and not easily retrievable by semantic search.

I’ve been thinking about a different framing:

What if we used an SSM (e.g., Mamba-style architecture) as a learned, streaming compression layer for LLM inputs, instead of relying primarily on RAG to recover context later?

Not necessarily a full replacement in every setting—but potentially a better primitive for certain workloads.


What We Usually Do Today for Long Context

When an input exceeds the effective context budget (or simply becomes too expensive), most systems do one or more of the following:

1) Naive Truncation / Sliding Window

Keep the most recent N tokens, drop the rest.

  • Pros: simple, fast, deterministic
  • Cons: catastrophic forgetting of early facts, poor long-horizon reasoning

This is acceptable for short-turn chat, but weak for long-lived agents, logs, meetings, research sessions, or stream processing.

2) RAG (Chunk -> Embed -> Retrieve -> Inject)

The standard approach:

  1. Split corpus/history into chunks
  2. Embed chunks
  3. Store vectors (and often metadata)
  4. At query time, retrieve top-k chunks
  5. Inject retrieved chunks back into the prompt

This is powerful because it externalizes memory and avoids paying attention cost over everything all the time.
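The five steps above can be sketched in a few dozen lines. The bag-of-words "embedding" here is a toy stand-in for a real embedding model, and the class names are illustrative, not any particular library's API:

```python
import math
from collections import Counter

def embed(text: str) -> dict[str, float]:
    # Toy stand-in for a real embedding model: normalized bag of words.
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def chunk(text: str, size: int = 40) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

class VectorStore:
    def __init__(self):
        self.items: list[tuple[dict[str, float], str]] = []

    def add(self, text: str) -> None:
        for c in chunk(text):                    # 1) split, 2) embed, 3) store
            self.items.append((embed(c), c))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        qv = embed(query)                        # 4) top-k by similarity
        ranked = sorted(self.items, key=lambda it: cosine(qv, it[0]),
                        reverse=True)
        return [c for _, c in ranked[:k]]        # 5) inject into the prompt

store = VectorStore()
store.add("the deploy failed because the database migration timed out")
store.add("lunch options near the office include tacos and ramen")
hits = store.retrieve("why did the deploy fail", k=1)
```

Even this toy version shows where the structural assumptions live: relevance is whatever the similarity function says it is.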

But RAG has structural assumptions:

  • Information is chunkable
  • The right chunk can be retrieved from a query
  • Similarity search approximates relevance
  • Re-injected chunks are enough for reconstruction

Those assumptions break down more often than people admit.

3) Hierarchical Summarization / Memory Buffers

Compress older context into summaries (sometimes recursively), then carry summaries forward.

  • Pros: cheap, simple, model-agnostic
  • Cons: lossy drift, summary bias, irreversible omissions, difficult update semantics

4) Long-Context Transformers

Just buy a bigger window (128k, 1M+, etc.) and hope for the best.

This helps, but:

  • cost still grows with context
  • latency grows
  • attention over lots of irrelevant tokens is wasteful
  • “available in window” != “reliably used”

Where RAG Feels Misaligned

RAG is excellent when the problem is retrieval of a mostly static knowledge base.

It is less natural when the problem is compressing a temporal stream into a usable latent memory.

Examples where this matters:

  • A coding agent watching a repo + terminal + logs + diffs over hours
  • A research assistant ingesting papers incrementally and preserving evolving hypotheses
  • A meeting agent tracking commitments, conflicts, and unresolved threads across live transcripts
  • A monitoring agent processing event streams where rare anomalies depend on long-range state
  • Any system where context is expected to grow continuously rather than be queried as static docs

In these settings, RAG makes you repeatedly:

  • serialize experience into chunks,
  • hope future queries retrieve the right chunks,
  • spend tokens rehydrating context that the system already “saw.”

That feels like treating memory as document search, even when the problem is really sequential state tracking.


The Alternative Framing: SSM as a Learned Compression Layer

State Space Models (SSMs), especially modern selective variants like Mamba-style architectures, are interesting here because they are naturally oriented toward sequence processing with compact state.

The core idea:

  • Feed a long or streaming sequence into an SSM encoder
  • Maintain a rolling hidden state / compressed representation
  • Periodically emit compact memory tokens, summaries, or latent slots
  • Hand only the compressed representation (plus recent raw tokens) to the LLM

So instead of this:

raw stream -> chunk -> embed -> vector DB -> retrieve -> prompt

you can do:

raw stream -> SSM compressor -> compact memory -> prompt

Or hybrid:

raw stream -> SSM compressor + event indexing -> LLM (with optional retrieval fallback)
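The interface difference matters more than any particular model. A minimal sketch of the compressor path, where the exponential-moving-average update is a placeholder for a trained selective SSM state transition:

```python
# Hypothetical interface for "raw stream -> SSM compressor -> compact memory".
# The EMA update is a placeholder; a real system would use a trained
# selective SSM (e.g., a Mamba-style block) as the state transition.
class StreamingCompressor:
    def __init__(self, dim: int = 4, decay: float = 0.9):
        self.state = [0.0] * dim
        self.decay = decay

    def ingest(self, token_vec: list[float]) -> None:
        # Rolling state update: state evolves as tokens arrive.
        # Note: no re-chunking or re-embedding of past inputs.
        self.state = [self.decay * s + (1 - self.decay) * x
                      for s, x in zip(self.state, token_vec)]

    def emit_memory(self) -> list[float]:
        # Compact representation handed to the LLM with recent raw tokens.
        return list(self.state)

comp = StreamingCompressor(dim=2)
for vec in ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]):
    comp.ingest(vec)
memory = comp.emit_memory()
```

The key property is that each new token costs O(1) state work, regardless of how long the stream has been running.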


Why SSMs (and Why Mamba-like Models Specifically)?

Transformers are great at flexible token-token interaction, but they are not the only way to process sequences. SSM families are compelling for compression because they can support:

  • Streaming-friendly updates (state evolves as tokens arrive)
  • Linear-time sequence processing (important for long inputs)
  • Compact recurrent state (natural memory bottleneck)
  • Learned selectivity (in selective SSM variants, the model can gate what matters)

A selective SSM can, in principle, learn behavior like:

  • keep durable facts
  • discard repetitive noise
  • retain unresolved goals/constraints
  • update beliefs when contradictions appear
  • preserve temporal structure better than chunk-wise averaging

That is exactly what we want from a context compressor.
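The selectivity can be illustrated with a scalar gated recurrence. This is a didactic simplification: in a real selective SSM the gates are learned functions of the input, whereas here they are hand-set so that salient inputs overwrite the state and small noisy inputs barely touch it:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Simplified selective update: h_t = a(x_t) * h_{t-1} + b(x_t) * x_t.
# The gates a, b depend on the current input (here: its magnitude);
# in a trained selective SSM they would be learned projections.
def selective_step(h: float, x: float) -> float:
    importance = sigmoid(12.0 * (abs(x) - 0.5))  # gate opens for |x| > 0.5
    a = 1.0 - importance                         # retain old state on noise
    b = importance                               # admit salient inputs
    return a * h + b * x

h = 0.0
h = selective_step(h, 2.0)        # a salient "fact" mostly overwrites state
after_fact = h
for _ in range(50):
    h = selective_step(h, 0.01)   # repetitive noise: state barely moves
```

After fifty noise steps, the state still sits near the salient value; with a non-selective (fixed-gate) recurrence, the same fifty steps would wash it out.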


Proposed Architecture (Technical Sketch)

Here’s a practical system design direction.

1) Dual-Path Context Pipeline

Path A: Recent Raw Context (High Fidelity)

Keep a normal recency window (e.g., last 4k–16k tokens).

This is where exact wording and local reasoning happen.

Path B: SSM Compressed Memory (Long Horizon)

Process the full stream through an SSM encoder that maintains persistent state and emits compressed memory artifacts.

Possible outputs:

  • memory tokens (fixed number of learned vectors projected into LLM token space)
  • structured slots (facts / entities / tasks / unresolved questions)
  • episodic summaries (time-bucketed compressed states)
  • change deltas (what changed since prior state)

Then the LLM prompt gets:

  • recent raw window
  • compressed memory representation
  • maybe a small set of retrieved raw snippets if needed for exact quotes

This preserves both precision and history.


2) Compression Modes

Different tasks need different compression semantics.

A) Latent Memory Tokenization

The SSM outputs M latent vectors per segment (or per time interval), which are projected to embeddings consumable by the LLM.

This is the closest analog to “token compression.”

  • Efficient
  • Differentiable end-to-end (in theory)
  • But less interpretable/debuggable

B) Structured Memory Extraction

The SSM drives a structured memory state:

  • entities
  • attributes
  • constraints
  • tasks
  • decisions
  • open loops
  • contradictions

This is easier to inspect and test, but harder to train robustly.

C) Hierarchical Episodic Compression

The SSM emits a compressed state at multiple timescales:

  • short-term state (seconds/minutes)
  • mid-term episodes (conversation section / file section)
  • long-term global state (session-wide)

This is useful for streaming or agentic systems.


3) Query-Time Use

At generation time, the LLM can consume:

  • recent raw tokens
  • SSM memory tokens
  • optional retrieval hits

This matters because SSM compression is not guaranteed to preserve exact phrasing. For quote-sensitive or compliance-sensitive tasks, you still want a path to source text.

So the system can operate like:

  1. Use SSM memory for broad situational awareness
  2. Use retrieval only when exact reconstruction is needed
  3. Keep recent window for local reasoning

That makes RAG a precision tool, not the primary long-context substrate.
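A hypothetical trigger policy for that division of labor might look like the following; the cue list and threshold are illustrative placeholders, not a tested heuristic:

```python
# Fall back to retrieval only when the query demands exact reconstruction
# or the compressed memory's self-reported confidence is low.
EXACTNESS_CUES = ("quote", "verbatim", "cite", "exact", "word-for-word")

def needs_retrieval(query: str, memory_confidence: float,
                    threshold: float = 0.7) -> bool:
    wants_exact = any(cue in query.lower() for cue in EXACTNESS_CUES)
    return wants_exact or memory_confidence < threshold

# Broad situational questions ride on SSM memory; exactness triggers RAG.
route_a = needs_retrieval("quote the exact error message", 0.95)
route_b = needs_retrieval("what is the current plan?", 0.9)
route_c = needs_retrieval("what is the current plan?", 0.4)
```

How to obtain a trustworthy `memory_confidence` signal is itself an open problem (see the observability discussion below).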


How This Differs from “Just Summarization”

This is not just “generate summaries every N tokens.”

A learned SSM compressor can be trained (or fine-tuned) to optimize for downstream utility under a token budget, rather than human-readable prose quality.

That means the objective can target things like:

  • answer accuracy after compression
  • retention of constraints
  • temporal consistency
  • action-state correctness in agents
  • anomaly detection recall in streams

Human-readable summaries are optimized for humans. Compressed memory for LLMs should be optimized for machine reusability.

Those are related, but not the same.


Training / Optimization Ideas

This is the hard part—and the interesting part.

Objective 1: Reconstruction (Weak Baseline)

Train the SSM compressor to preserve enough information to reconstruct original segments.

Problem: reconstruction overweights surface form and may waste capacity on unimportant details.

Objective 2: Task-Conditioned Compression (Better)

Train compression so an LLM can still solve downstream tasks under a strict token budget.

Example setup:

  • Full-context teacher answers a task
  • Student sees recent window + SSM-compressed memory
  • Optimize student to match teacher / ground truth

This directly aligns compression with utility.
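A toy, fully deterministic illustration of why utility-aligned compression differs from recency: the "teacher" sees the full stream, the "student" sees only a fixed memory budget, and an update-aware compressor beats a sliding window at matching the teacher. All names here are invented for the example:

```python
def teacher_answer(stream, key):
    # Full-context teacher: sees the entire stream of (key, value) facts.
    latest = {}
    for k, v in stream:
        latest[k] = v
    return latest.get(key)

def keep_last(stream, budget):
    # Recency-only compression: a sliding window over facts.
    return stream[-budget:]

def dedupe_latest(stream, budget):
    # Update-aware compression: one slot per key, newest value wins.
    latest = {}
    for k, v in stream:
        latest[k] = v
    return list(latest.items())[-budget:]

def student_utility(compress, stream, queries, budget):
    # Student answers only from compressed memory; score = teacher agreement.
    memory = dict(compress(stream, budget))
    hits = sum(memory.get(q) == teacher_answer(stream, q) for q in queries)
    return hits / len(queries)

# A durable fact stated once, then buried under a churn of log updates.
stream = [("goal", "ship v2")] + [(f"log{i % 3}", i) for i in range(50)]
queries = ["goal", "log0", "log1", "log2"]
naive = student_utility(keep_last, stream, queries, budget=4)
update_aware = student_utility(dedupe_latest, stream, queries, budget=4)
```

The sliding window loses the early durable fact (`naive` = 0.75), while the update-aware policy keeps it under the same budget (`update_aware` = 1.0). Training a learned compressor against this kind of teacher-agreement objective is the non-toy version.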

Objective 3: Contrastive Retrieval-of-State

Given a query at time t, compressed state should help discriminate relevant past events/facts from distractors.

This encourages “queryable memory” without explicit chunk retrieval.
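The discrimination objective is essentially an InfoNCE-style loss over scores: the compressed state should score the true past event above distractors. A minimal version, assuming some scoring function between state and candidate events has already produced the logits:

```python
import math

def info_nce(score_true: float, score_distractors: list[float]) -> float:
    # -log softmax probability of the true item among true + distractors,
    # computed with a max-shift for numerical stability.
    logits = [score_true] + score_distractors
    z = max(logits)
    denom = sum(math.exp(s - z) for s in logits)
    return -(score_true - z - math.log(denom))

# Higher margin between true and distractor scores -> lower loss.
good = info_nce(2.0, [0.0, -1.0])
bad = info_nce(0.0, [2.0, -1.0])
```

Minimizing this pushes the compressed state toward being "queryable" without any explicit chunk store.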

Objective 4: Change Sensitivity / Update Correctness

For streaming domains, explicitly penalize stale memory:

  • if a fact changes, compressed state should update quickly
  • if a contradiction appears, old belief should be downweighted

This is where recurrent/stateful models have a conceptual advantage over static chunk embeddings.


Where This Could Shine

1) Streaming / Growing Inputs

The bigger the sequence and the more incremental the updates, the worse repeated re-chunk + re-embed pipelines feel.

SSMs naturally support incremental state updates.

2) Agent Memory

Agents often need:

  • what was attempted
  • what failed
  • current plan
  • constraints
  • environment state

That’s sequential state, not just document retrieval.

3) Log / Telemetry / Event Streams

Many signals are distributed and temporal. Semantic chunk retrieval can miss cross-event patterns that matter.

A learned stateful compressor may retain latent dynamics better.

4) Long Coding Sessions

RAG over codebases is good for static retrieval, but interactive coding sessions produce evolving state:

  • edits
  • compiler errors
  • test outcomes
  • attempted fixes
  • environment drift

A streaming memory layer could track this more naturally.


Tradeoffs and Failure Modes (Important)

This idea is promising, but it is not free magic.

1) Interpretability Drops

RAG returns chunks you can inspect. An SSM latent state is opaque.

That makes debugging harder:

  • What did the compressor retain?
  • Why did it forget a fact?
  • Why did it hallucinate a stale constraint?

You’ll likely need strong observability tools:

  • memory probes
  • attribution diagnostics
  • synthetic retention tests
  • state diff visualizations
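A synthetic retention test is the easiest of these to build first: plant a fact early, stream filler past it, then probe whether the system can still answer. Here `compress` and `answer_from_memory` are hypothetical hooks into your pipeline, shown with a trivially lossless baseline as a sanity check:

```python
def retention_test(compress, answer_from_memory,
                   fact=("deadline", "friday"), filler_turns=100) -> bool:
    # Plant one durable fact, then bury it under filler.
    stream = [f"note: the {fact[0]} is {fact[1]}"]
    stream += [f"filler message {i}" for i in range(filler_turns)]
    memory = compress(stream)
    return answer_from_memory(memory, f"what is the {fact[0]}?") == fact[1]

# Baseline "compressor" that keeps everything (should always pass):
def compress(stream):
    return " | ".join(stream)

def answer_from_memory(memory, question):
    key = question.removeprefix("what is the ").rstrip("?")
    for part in memory.split(" | "):
        if key in part:
            return part.rsplit(" ", 1)[-1]
    return None

passed = retention_test(compress, answer_from_memory)
```

Sweeping `filler_turns` against a real compressor turns this into a retention curve, which is a far more actionable diagnostic than end-task accuracy alone.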

2) Exactness Can Suffer

RAG is good at exact text recall (if retrieval works). Latent compression is lossy by design.

If your workload needs:

  • verbatim quotes
  • legal text fidelity
  • citations to exact passages
  • compliance-grade traceability

then pure SSM compression is risky unless paired with source retrieval.

3) Training Complexity

RAG systems are mostly engineering. SSM compression becomes a modeling problem.

You now need:

  • training data
  • objectives
  • evaluation harnesses
  • drift detection
  • potentially domain adaptation

This increases system complexity substantially.

4) Domain Specificity

A compressor trained on coding streams may perform poorly on legal docs or medical transcripts.

RAG generalizes better across domains because embeddings + retrieval are relatively domain-agnostic (at least compared to a specialized learned compressor).

5) Catastrophic Forgetting in State

Stateful models can silently accumulate error or stale beliefs over long horizons.

Without resets/checkpoints/grounding, compressed memory can drift.

This suggests a practical design:

  • periodic re-grounding from source
  • episodic snapshots
  • checksums / consistency passes
  • retrieval fallback for verification

6) Integration Friction with Existing LLM APIs

Most vendor LLM APIs are token-in/token-out. Feeding custom latent memory representations is not always supported directly.

In practice, you may need to:

  • project latent state into pseudo-tokens
  • decode state into compact structured text
  • run a small adapter model before the main LLM

That adds latency and engineering overhead.


A More Realistic Near-Term Position: SSM + RAG, Not SSM vs RAG

I don’t think the strongest version of this idea is “replace RAG everywhere.”

The stronger claim is:

Use SSM/Mamba-style models as the primary streaming compression and state-tracking layer, and use RAG as an exact recall / provenance layer.

That gives you:

  • SSM memory for continuity and evolving state
  • RAG for precise retrieval and source grounding
  • recent raw window for local reasoning fidelity

This combination matches how long-running systems actually fail:

  • they lose state continuity (SSM helps)
  • they miss exact evidence (RAG helps)
  • they overflow recency (window helps)

Evaluation: How I’d Test This

If I were building this, I’d avoid vague “it feels smarter” claims and benchmark aggressively.

Metrics That Matter

  • Answer accuracy under fixed token budget
  • Constraint retention rate (long-horizon facts/goals)
  • Update correctness (when facts change mid-stream)
  • Temporal reasoning accuracy
  • Hallucinated stale-memory rate
  • Latency and cost per generated token
  • Recovery performance when exact quote/citation is required

Benchmark Tasks

  • Long synthetic transcripts with fact updates and distractors
  • Multi-hour coding session replays (edits/tests/errors)
  • Agent trajectories (tools, actions, failures, revised plans)
  • Streaming incident logs with delayed causal signals

And compare:

  1. Sliding window only
  2. RAG only
  3. Summarization memory
  4. SSM compression only
  5. Hybrid SSM + RAG (likely best practical baseline)

Implementation Thoughts (Pragmatic)

A practical v1 does not need end-to-end differentiability with the frontier LLM.

You can build a useful prototype by:

  1. SSM compressor service

    • consumes stream incrementally
    • emits compact memory state every N tokens / events
  2. Memory materializer

    • converts latent/hidden state into compact textual memory (or structured JSON)
    • e.g., active constraints, entities, unresolved issues, recent changes
  3. LLM orchestrator

    • prompt = recent window + compact memory + optional retrieved snippets
  4. RAG fallback

    • only triggered when confidence is low or exact evidence is needed

This gets you most of the architectural benefits while remaining compatible with standard LLM APIs.
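The wiring between the four components can be sketched as follows. Every component internal here is a placeholder assumption; only the composition is the point:

```python
def materialize(state: dict) -> str:
    # Component 2: latent/structured state -> compact textual memory.
    lines = [f"- {k}: {v}" for k, v in state.items()]
    return "MEMORY:\n" + "\n".join(lines)

def orchestrate(recent_window: str, state: dict, query: str,
                retrieve=None, confidence: float = 1.0) -> str:
    # Component 3: prompt = compact memory + recent window (+ evidence).
    prompt = materialize(state) + "\n\nRECENT:\n" + recent_window
    # Component 4: retrieval fallback only on low confidence.
    if retrieve is not None and confidence < 0.7:
        prompt += "\n\nEVIDENCE:\n" + "\n".join(retrieve(query))
    return prompt + "\n\nQUERY: " + query

# Component 1 (the SSM compressor service) would produce `state`; here it
# is a hand-written structured memory for illustration.
state = {"open_task": "fix flaky test", "constraint": "no API changes"}
prompt = orchestrate("test_login failed again on CI", state,
                     "what should I try next?")
```

Because the memory crosses the component boundary as plain text, this skeleton works against any token-in/token-out vendor API.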


The Deeper Thesis

RAG treats long context as a search problem.

For many modern workloads—especially streaming, growing, agentic workloads—long context is increasingly a state estimation problem.

SSMs (including Mamba-like models) are interesting not because they “beat transformers” in the abstract, but because they offer a better inductive bias for continuous compression of evolving sequence state.

That makes them a compelling candidate for the layer before the LLM:

  • not the final reasoner,
  • but the memory system that decides what survives.

Conclusion

RAG is the default for long-context systems because it’s modular, inspectable, and works surprisingly well. But it is not the only way to manage context—and for growing/streaming inputs, it may not even be the right first primitive.

A learned SSM/Mamba-style compression layer could provide:

  • lower-cost long-horizon continuity,
  • better temporal state tracking,
  • and a cleaner abstraction for persistent memory in agents.

The tradeoff is real: more modeling complexity, less interpretability, and the need for careful grounding mechanisms.

My current view is simple:

  • If your problem is static knowledge lookup: RAG is still king.
  • If your problem is evolving sequence state: SSM-based compression deserves serious attention.
  • If you care about reliability: hybridize.

Open Questions I’m Interested In

  • What is the best interface from SSM state to an LLM: pseudo-tokens, structured memory, or adapter model?
  • How do we evaluate “memory quality” beyond downstream accuracy?
  • Can selective SSMs learn robust update semantics (overwrite stale facts, preserve stable constraints)?
  • What is the right hybrid trigger policy for invoking retrieval versus trusting compressed memory?
  • How domain-specific does the compressor become in practice?

If You’re Building Long-Running AI Systems

If your inputs are expected to grow over time (sessions, logs, code loops, streams), it may be worth modeling context as state compression rather than only document retrieval.

That shift in framing alone changes the architecture you design.