Designing a Smarter Spring AI RAG Pipeline with Query Rewriting, History Compression, and Document Fusion

Published: 2026-06-10

Large language models are powerful, but in production Q&A systems they still fail in familiar ways: they miss relevant facts, misread ambiguous questions, repeat redundant evidence, or confidently answer from weak retrieval. Retrieval-augmented generation is meant to reduce those failure modes, yet a basic RAG setup often runs into four practical problems:

Low recall: a single query does not always surface all relevant documents.
Ambiguous user input: natural-language questions are often incomplete, vague, or dependent on earlier turns in a conversation.
Redundant or conflicting documents: multiple retrievers or query variants can return overlapping or inconsistent material.
Edge-case handling: when retrieval returns nothing or only low-similarity matches, the system needs to degrade gracefully instead of collapsing.

Spring AI addresses these issues through a modular advanced RAG design. The most important building blocks are:

MultiQueryExpander
RewriteQueryTransformer
CompressionQueryTransformer
ConcatenationDocumentJoiner
ContextualQueryAugmenter

Used together, these components form a pipeline that improves answer quality while giving developers room to trade off precision, recall, latency, and resilience.

Where advanced RAG starts to matter

A simple retriever-plus-generator flow is rarely enough once a system moves beyond demos.

In an enterprise knowledge base, the main risk is not latency but precision: if terminology is strict and documentation is fragmented, retrieval must be both broad enough to find the right material and strict enough to avoid wrong evidence.

In customer support, the pressure is different. Response time is critical, conversations are multi-turn, and users often ask follow-up questions with pronouns or omitted context. Here, the pipeline has to stay fast while still resolving ambiguity.

The Spring AI components above target exactly those real-world constraints.

Query-side optimization: first make the question clearer, then broaden it

Two components sit at the heart of query improvement: RewriteQueryTransformer and MultiQueryExpander. They solve different problems and work best when combined.

`MultiQueryExpander`: improving recall through query variants

The role of MultiQueryExpander is straightforward: take one user query and generate several semantically related variants, then retrieve against each of them. This increases document coverage and helps recover information that a single wording might miss.

This is especially useful in cases such as:

domains with many equivalent terms, such as medical or legal vocabulary
ambiguous words, where multiple meanings are plausible
short or underspecified queries that need broader semantic expansion

A user asking how to optimize the JVM, for example, may benefit from variants focused on memory tuning, garbage collection, or performance configuration.

Key parameters

chatClientBuilder: required; defines the client used to interact with the LLM
numberOfQueries: how many variants to generate; default is 3, and in practice 2–5 is a useful range
includeOriginal: whether the original query should remain in the expanded set; in Spring AI 1.0 the default is true, and setting it to false avoids a duplicate retrieval when a variant already covers the original wording
promptTemplate: custom prompt control for how the variants are produced, such as more formal, domain-specific, or concise wording

How to tune it

A larger number of variants usually improves recall, but it also increases retrieval cost and may create more duplicate documents downstream.

A practical pattern is to scale the count with query complexity:

simple or specific queries: 2 variants
normal knowledge-base queries: 3 variants
highly ambiguous or sparse queries: 4–5 variants
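The scaling pattern above can be sketched as a small heuristic. This is only an illustration: the class name, the word-count cut-offs, and the list of vague words are assumptions made for the example, not part of Spring AI.

```java
import java.util.Arrays;
import java.util.Set;

/** Hypothetical heuristic: decide how many query variants to request. */
final class VariantCountHeuristic {

    // Assumed marker words for context-dependent queries; tune for your own domain.
    private static final Set<String> VAGUE_WORDS = Set.of("it", "this", "that", "they", "them");

    private VariantCountHeuristic() {}

    static int variantsFor(String query) {
        // Split on anything that is not a letter or digit, then ignore empty tokens.
        String[] tokens = query.toLowerCase().trim().split("[^\\p{L}\\p{N}]+");
        long words = Arrays.stream(tokens).filter(t -> !t.isEmpty()).count();
        boolean vague = Arrays.stream(tokens).anyMatch(VAGUE_WORDS::contains);

        if (vague || words <= 3) {
            return 5; // highly ambiguous or sparse query
        }
        if (words >= 12) {
            return 2; // long, specific query (assumed cut-off)
        }
        return 3;     // normal knowledge-base query
    }
}
```

The returned value would then be passed to the expander as numberOfQueries.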

Domain-specific prompt templates also matter. If the system serves a specialized corpus, the expansion prompt should encourage terminology normalization and known domain synonyms rather than free-form paraphrasing.

`RewriteQueryTransformer`: improving precision through semantic cleanup

If MultiQueryExpander broadens the search, RewriteQueryTransformer sharpens it first.

Its job is to rewrite a single query into a form that the retrieval layer can interpret more reliably. That may involve:

removing ambiguity
eliminating redundant phrasing
normalizing terms
filling in omitted context from conversation history

Typical examples include rewriting a vague query such as “the price of Apple” into either the company or the fruit depending on context, or converting a rambling request into a concise retrieval-oriented expression.

Key parameters

chatClientBuilder: required; enables LLM-based rewriting
promptTemplate: must include {query} and {history} placeholders
preserveHistory: whether the output keeps the original dialog history; default is true, which is important in multi-turn scenarios

Why this matters in conversation

Consider a short dialogue:

User: “Recommend a programming language suitable for beginners.”
Assistant: “Python is a great choice because its syntax is simple and easy to learn.”
Current query: “What data-processing libraries does it have?”

Without rewriting, the retriever sees a pronoun-heavy question and may fail to connect “it” to Python. Passing both the query and conversation history into RewriteQueryTransformer lets the system rewrite it as “What data-processing libraries does Python have?”

That single change can dramatically improve retrieval quality.
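To make the mechanics concrete, here is a minimal sketch of how a rewrite prompt with {query} and {history} placeholders could be rendered before it is sent to the model. The class name and the template wording are hypothetical; only the two placeholder names come from the description above.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RewritePromptSketch {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(query|history)\\}");

    // Hypothetical template wording; only the placeholder names are taken from the article.
    static final String DEFAULT_TEMPLATE =
            "Rewrite the current question so it is understandable without the conversation.\n"
          + "Conversation so far:\n{history}\n"
          + "Current question: {query}\n"
          + "Rewritten question:";

    private RewritePromptSketch() {}

    static String render(String template, List<String> historyTurns, String query) {
        if (!template.contains("{query}") || !template.contains("{history}")) {
            throw new IllegalArgumentException("template must contain {query} and {history}");
        }
        String history = String.join("\n", historyTurns);
        // Single pass, so placeholder-like text inside the inputs is never substituted again.
        return PLACEHOLDER.matcher(template).replaceAll(
                m -> Matcher.quoteReplacement(m.group(1).equals("query") ? query : history));
    }
}
```

Given the dialogue above, the rendered prompt contains both earlier turns and the pronoun-based question, which gives the model what it needs to resolve "it" to Python.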

Performance advice

Two practical optimizations are worth applying early:

Cache common rewrites for frequent ambiguous phrases such as “What is it?” or “How do I do that?”
Use lightweight pre-detection so obviously clear queries can bypass rewriting entirely
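Both optimizations can be combined in a thin wrapper around the rewriting call. The sketch below is an assumption-laden illustration: CachedRewriter, the pronoun pattern, and the contextKey argument (for example a conversation topic) are invented for the example, and the BinaryOperator stands in for the real LLM-backed rewrite. The context key is part of the cache key because the same vague phrase can mean different things in different conversations.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.regex.Pattern;

final class CachedRewriter {

    // Assumed pre-detection rule: only queries containing a bare pronoun need rewriting.
    private static final Pattern CONTEXT_DEPENDENT =
            Pattern.compile("\\b(it|this|that|they|them|those)\\b", Pattern.CASE_INSENSITIVE);

    private final BinaryOperator<String> llmRewrite; // (contextKey, query) -> rewritten query
    private final Map<String, String> cache;

    CachedRewriter(BinaryOperator<String> llmRewrite, int maxEntries) {
        this.llmRewrite = llmRewrite;
        // Access-ordered LinkedHashMap acting as a small LRU cache.
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized String rewrite(String contextKey, String query) {
        if (!CONTEXT_DEPENDENT.matcher(query).find()) {
            return query; // clear enough: bypass the LLM entirely
        }
        String key = contextKey + "::" + query.trim().toLowerCase(Locale.ROOT);
        return cache.computeIfAbsent(key, k -> llmRewrite.apply(contextKey, query));
    }
}
```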

Why the combination works

The strongest pattern is not to use either component alone but to arrange them as a two-stage query pipeline:

RewriteQueryTransformer clarifies and standardizes the user’s intent
MultiQueryExpander generates several retrieval-friendly variants from that cleaned-up query

This “precision first, recall second” design is generally more stable than expanding raw user input directly.
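As a rough sketch, the two stages compose like ordinary functions. The class below is hypothetical and uses plain functional interfaces as stand-ins for the real transformer and expander, so the ordering logic can be read in isolation.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

final class TwoStageQueryPipeline {

    private final Function<String, String> rewriter;       // stand-in for the rewrite step
    private final Function<String, List<String>> expander; // stand-in for the expansion step

    TwoStageQueryPipeline(Function<String, String> rewriter,
                          Function<String, List<String>> expander) {
        this.rewriter = rewriter;
        this.expander = expander;
    }

    List<String> process(String rawQuery) {
        // Stage 1: precision. Clarify the raw input first.
        String clarified = rewriter.apply(rawQuery);

        // Stage 2: recall. Expand the clarified query, never the raw one.
        Set<String> queries = new LinkedHashSet<>();
        queries.add(clarified);
        queries.addAll(expander.apply(clarified));
        return List.copyOf(queries); // insertion order kept, exact duplicates dropped
    }
}
```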

A useful tuning guide looks like this:

<table> <thead> <tr> <th>Scenario</th> <th>Rewrite style</th> <th>Multi-query count</th> <th>Expected benefit</th> </tr> </thead> <tbody> <tr> <td>Professional knowledge base</td> <td>Terminology-focused</td> <td>3–4</td> <td>Better matching on specialized terms</td> </tr> <tr> <td>General Q&A</td> <td>Natural and concise</td> <td>2–3</td> <td>Balance between quality and cost</td> </tr> <tr> <td>Highly ambiguous queries</td> <td>Context-enriched</td> <td>4–5</td> <td>Better coverage for unclear intent</td> </tr> </tbody> </table>

Compression for multi-turn chat: keeping history useful without drowning the model

As conversations grow longer, history becomes both valuable and expensive. Long chat context creates at least three problems:

the model struggles to identify which earlier turns matter to the current question
long history can reduce retrieval effectiveness when embeddings are generated from bloated context
the generation model spends too many tokens on old context instead of answering the current query

CompressionQueryTransformer exists to solve this by condensing dialogue history into a compact summary that keeps only the information relevant to the present turn.

What it does well

Its main strengths are:

reducing context length, often by around 50%–70%
emphasizing relevant historical details
preserving enough conversational continuity to avoid logical breaks

Key parameters

chatClientBuilder: required for LLM-based compression
promptTemplate: determines what should be preserved in the compressed history
maxHistoryTokens: maximum token budget for the compressed output; for a model like GPT-3.5-turbo, a practical range is roughly 500–1000 depending on the overall prompt budget

A typical support-chat example

Imagine a ten-turn e-commerce support exchange. After all that context, the user asks: “How do I claim the discount you mentioned earlier?”

If the raw history is long, CompressionQueryTransformer first checks whether it exceeds maxHistoryTokens (say the limit is 500). If it does, the component generates a compressed summary that keeps only the key facts related to the discount and discards unrelated parts of the earlier conversation.

The result is a shorter, more focused history that may cut context length by about 65% while keeping the discount-related facts that the retriever and answer generator need.

Better compression strategies

A layered strategy tends to work better than compressing everything equally:

keep the most recent 3 turns in full
summarize older turns, but only if they are relevant to the current query

This preserves local coherence while preventing history bloat.
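A minimal sketch of this layered strategy follows. The LayeredHistory class is hypothetical; the relevance test and the summarizer are passed in as functions because in practice they would be backed by an embedding comparison or an LLM call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

final class LayeredHistory {

    private LayeredHistory() {}

    static List<String> compress(List<String> turns,
                                 int keepRecent,
                                 Predicate<String> relevantToQuery,
                                 Function<List<String>, String> summarizer) {
        int split = Math.max(0, turns.size() - keepRecent);

        // Older turns: keep only those relevant to the current query, then summarize them.
        List<String> olderRelevant = turns.subList(0, split).stream()
                .filter(relevantToQuery)
                .toList();

        List<String> result = new ArrayList<>();
        if (!olderRelevant.isEmpty()) {
            result.add("Summary of earlier turns: " + summarizer.apply(olderRelevant));
        }
        // Recent turns: kept verbatim to preserve local coherence.
        result.addAll(turns.subList(split, turns.size()));
        return result;
    }
}
```

With keepRecent set to 3, a six-turn exchange collapses to at most one summary line plus the last three turns.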

Where it should sit in the pipeline

In most conversational systems, the best order is:

CompressionQueryTransformer
RewriteQueryTransformer
retrieval

Compressing history before rewriting gives the rewriter a cleaner and more focused context window.

Caching compressed summaries by conversation ID is also worthwhile, especially when users ask several follow-up questions against the same session state.

Document-side processing: deduplication, ordering, and merge control

Once multiple retrievers or multiple query variants are involved, the retrieval layer often returns a noisy set of results. This is where ConcatenationDocumentJoiner becomes important.

`ConcatenationDocumentJoiner`: combining evidence without flooding the model

This component addresses three recurring problems:

duplicate documents: the same item may be returned by several query variants
conflicting statements: different sources may disagree
messy structure: metadata such as origin, confidence, or ranking may be inconsistent across documents

Its purpose is not just to concatenate text. It optimizes the final evidence bundle through:

configurable deduplication
confidence-aware merge ordering
metadata preservation and normalization
a total-length limit so the final context stays inside the LLM window

Key parameters

deduplicationStrategy: NONE, CONTENT_HASH, or SEMANTIC_SIMILARITY
similarityThreshold: used for semantic deduplication, typically 0.85–0.95
maxTotalLength: maximum merged character count; for GPT-3.5-turbo, staying within about 8000 characters is a reasonable guideline depending on the full prompt layout
separator: document separator, defaulting to \n\n---\n\n, which helps the model recognize boundaries between sources
preserveMetadata: default true; useful for traceability and conflict handling

Choosing the right deduplication strategy

<table> <thead> <tr> <th>Strategy</th> <th>How it works</th> <th>Strengths</th> <th>Weaknesses</th> <th>Best fit</th> </tr> </thead> <tbody> <tr> <td>CONTENT_HASH</td> <td>Exact hash match on document content</td> <td>Efficient, roughly O(n), no loss of precision</td> <td>Cannot detect near-duplicates</td> <td>Structured content such as API docs or records</td> </tr> <tr> <td>SEMANTIC_SIMILARITY</td> <td>Cosine similarity over embeddings</td> <td>Can catch paraphrased or near-duplicate content</td> <td>More expensive, roughly O(n²), depends on embedding quality</td> <td>Unstructured text such as articles or reviews</td> </tr> <tr> <td>NONE</td> <td>No deduplication</td> <td>No extra overhead</td> <td>Can produce heavy redundancy</td> <td>Low-latency scenarios or clean sources</td> </tr> </tbody> </table>

A mixed data environment may require custom logic: exact hashing for structured sources, semantic similarity for free text.
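For the free-text branch, semantic deduplication boils down to pairwise cosine comparisons against the documents already kept. The sketch below assumes embeddings have been computed elsewhere; the SemanticDedup class and its Embedded record are hypothetical names for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class SemanticDedup {

    /** A document paired with a precomputed embedding (assumed to exist already). */
    record Embedded(String text, double[] vector) {}

    private SemanticDedup() {}

    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("embedding dimensions differ");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0; // treat zero vectors as dissimilar to everything
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Keeps the first document of each near-duplicate group; O(n^2) comparisons in the worst case. */
    static List<Embedded> dedupe(List<Embedded> docs, double similarityThreshold) {
        List<Embedded> kept = new ArrayList<>();
        for (Embedded candidate : docs) {
            boolean duplicate = false;
            for (Embedded existing : kept) {
                if (cosine(candidate.vector(), existing.vector()) >= similarityThreshold) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

Because the first occurrence wins, sorting the input by retrieval score beforehand means the highest-scoring member of each group is the one that survives.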

More than deduplication

Merge order matters as much as filtering. If the joiner can sort by confidence or source priority, the final document bundle becomes more useful for generation. Preserving metadata is also important when the system needs to explain where an answer came from or when it must mark contradictory passages.
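One way to picture confidence-aware merging with a length cap is the sketch below. EvidenceMerger and the Scored record are hypothetical; the "[source: ...]" prefix is simply one assumed way to keep origin metadata visible to the model.

```java
import java.util.Comparator;
import java.util.List;

final class EvidenceMerger {

    /** A retrieved passage with its origin and a confidence or similarity score. */
    record Scored(String source, String text, double score) {}

    private EvidenceMerger() {}

    static String merge(List<Scored> docs, int maxTotalLength, String separator) {
        List<Scored> ordered = docs.stream()
                .sorted(Comparator.comparingDouble(Scored::score).reversed())
                .toList();

        StringBuilder out = new StringBuilder();
        for (Scored doc : ordered) {
            // Keep the origin next to the text so answers stay traceable.
            String block = "[source: " + doc.source() + "]\n" + doc.text();
            int extra = (out.length() == 0 ? 0 : separator.length()) + block.length();
            if (out.length() + extra > maxTotalLength) {
                break; // stop at the budget instead of truncating a document mid-text
            }
            if (out.length() > 0) {
                out.append(separator);
            }
            out.append(block);
        }
        return out.toString();
    }
}
```

Stopping at the first document that does not fit is a design choice: it guarantees that every included passage is complete, at the price of occasionally leaving some budget unused.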

Retrieval integration

A common end-to-end pattern is:

generate multiple query variants
retrieve in parallel across the variants or sources
pass all results into ConcatenationDocumentJoiner
deduplicate, sort, trim, and merge
forward the cleaned document set to the next stage

Parallel retrieval and caching are often the biggest practical wins here.
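The parallel step might look like the following sketch, which fans out one retrieval task per variant and gathers the results in variant order. ParallelRetrieval is a hypothetical helper, and the retriever is modeled as a plain function from query to document texts.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;

final class ParallelRetrieval {

    private ParallelRetrieval() {}

    static List<String> retrieveAll(List<String> variants,
                                    Function<String, List<String>> retriever,
                                    ExecutorService executor) {
        // Start every retrieval before waiting on any of them.
        List<CompletableFuture<List<String>>> futures = variants.stream()
                .map(v -> CompletableFuture.supplyAsync(() -> retriever.apply(v), executor))
                .toList();

        // Join in variant order; distinct() drops exact duplicates before the joiner runs.
        return futures.stream()
                .flatMap(f -> f.join().stream())
                .distinct()
                .toList();
    }
}
```

Collecting the futures into a list first matters: joining inside the same stream that creates them would run the retrievals one after another.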

Contextual augmentation: what to do when retrieval is weak

Even after rewriting, expansion, and document merging, retrieval can still fail in softer ways. Results may not be empty, but they may be only weakly related. Or the retrieved documents may answer part of the question while leaving a crucial gap.

ContextualQueryAugmenter is designed for exactly these situations.

What it adds

It supports three kinds of recovery:

query augmentation: generate a better second-pass query from the current query, conversation history, and retrieved document summary
information completion: create follow-up retrieval requests when the evidence is incomplete
context bridging: reconnect the current question with earlier dialogue when the chain of meaning is broken

Key parameters

relevanceThreshold: default 0.7; below this, results are treated as insufficiently relevant
maxRetries: default 2; limits how many augmentation attempts are allowed
augmentTemplate: must include {query}, {history}, {documentSummary}, and {issue}

Example: low-relevance first retrieval

Suppose a user asks: “Which vector databases are supported by Spring AI RAG?”

The first retrieval pass returns documents with an average score of 0.62, below the relevance threshold of 0.7. At that point, ContextualQueryAugmenter can trigger a second query that is more explicit, such as one that spells out the framework and clarifies that the question is about RAG module integration with vector databases.

A stronger rewritten query may then produce a second retrieval round with an average score of 0.89, turning a weak result set into one that is actually usable.

This component is especially valuable when the retriever needs help inferring missing context from user wording.
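The retry behavior in this example can be sketched as a bounded loop driven by relevanceThreshold and maxRetries. Everything below is illustrative: AugmentedRetrieval, Hit, and Outcome are hypothetical types, and the augmenter function stands in for the LLM call that produces a more explicit query.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.UnaryOperator;

final class AugmentedRetrieval {

    record Hit(String text, double score) {}

    record Outcome(String finalQuery, List<Hit> hits, int retriesUsed, boolean relevant) {}

    private AugmentedRetrieval() {}

    static double averageScore(List<Hit> hits) {
        return hits.stream().mapToDouble(Hit::score).average().orElse(0.0);
    }

    static Outcome retrieve(String query,
                            Function<String, List<Hit>> retriever,
                            UnaryOperator<String> augmenter,
                            double relevanceThreshold,
                            int maxRetries) {
        String current = query;
        List<Hit> hits = retriever.apply(current);
        int retries = 0;
        // Re-query only while results are weak and the retry budget is not exhausted.
        while (averageScore(hits) < relevanceThreshold && retries < maxRetries) {
            current = augmenter.apply(current);
            hits = retriever.apply(current);
            retries++;
        }
        return new Outcome(current, hits, retries, averageScore(hits) >= relevanceThreshold);
    }
}
```

The relevant flag in the outcome lets the caller decide whether to answer normally or fall back to a lower-certainty response.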

Failure handling and graceful degradation

A production RAG system cannot assume that every retrieval, merge, and LLM call will succeed. The pipeline needs clear fallback behavior.

Typical failure cases include:

<table> <thead> <tr> <th>Failure scenario</th> <th>Detection</th> <th>Handling strategy</th> <th>Example response</th> </tr> </thead> <tbody> <tr> <td>No retrieval results</td> <td>documents.isEmpty()</td> <td>Retry with query expansion, then inform the user and suggest alternatives</td> <td>“No relevant documents were found. You may want to try…”</td> </tr> <tr> <td>Very low similarity</td> <td>average score &lt; 0.5</td> <td>Use ContextualQueryAugmenter, reduce answer certainty</td> <td>“Based on limited information, a possible answer is…”</td> </tr> <tr> <td>Conflicting documents</td> <td>contradiction detected</td> <td>Mark conflicts, present differing viewpoints, ask for clarification</td> <td>“Different documents describe this differently…”</td> </tr> <tr> <td>LLM invocation failure</td> <td>catch ChatClientException</td> <td>Retry up to 3 times, switch to backup model, or return retrieval-only output</td> <td>“The AI service is currently busy. Here are relevant document excerpts…”</td> </tr> </tbody> </table>

The key idea is that failure should not feel like a crash. It should become a controlled downgrade in capability.
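For the LLM-failure row, a controlled downgrade could be sketched as follows. ResilientGenerator is a hypothetical helper; the two model calls are plain functions, and any RuntimeException is treated as an invocation failure for the sake of the example.

```java
import java.util.List;
import java.util.function.Function;

final class ResilientGenerator {

    private ResilientGenerator() {}

    static String answer(String prompt,
                         List<String> excerpts,
                         Function<String, String> primaryModel,
                         Function<String, String> backupModel,
                         int maxAttempts) {
        // Step 1: retry the primary model a bounded number of times.
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return primaryModel.apply(prompt);
            } catch (RuntimeException e) {
                // A real system would log the failure and back off before retrying.
            }
        }
        // Step 2: switch to the backup model.
        try {
            return backupModel.apply(prompt);
        } catch (RuntimeException e) {
            // Step 3: retrieval-only output, so the user still receives the evidence.
            return "The AI service is currently busy. Here are relevant document excerpts:\n"
                    + String.join("\n", excerpts);
        }
    }
}
```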

Putting the modules together: a complete advanced RAG flow

The advanced Spring AI pipeline can be understood as a layered system.

Foundation layer

ChatClient powers all LLM-dependent components: MultiQueryExpander, RewriteQueryTransformer, CompressionQueryTransformer, and ContextualQueryAugmenter
VectorStore supports the retrievers and also enables semantic deduplication in ConcatenationDocumentJoiner

Query-processing layer

CompressionQueryTransformer prepares compact conversation history
RewriteQueryTransformer clarifies the current user question
MultiQueryExpander creates retrieval variants from the rewritten query

Document-processing layer

one or more Retriever instances fetch candidate documents
ConcatenationDocumentJoiner deduplicates, sorts, normalizes, and merges the result set

Augmentation and generation layer

ContextualQueryAugmenter reacts to weak retrieval or missing information
the generation model answers from the final evidence bundle
a cross-cutting exception handler can watch the entire pipeline for failures

This creates a closed loop from query intake to optimization, retrieval, evidence fusion, answer generation, and fallback handling.

Event-driven coordination and loose coupling

A useful architectural detail in this style of pipeline is event-driven collaboration. Instead of tightly binding each module to the next, components can publish and subscribe to events.

For example, after MultiQueryExpander finishes generating variants, it can emit a query-expanded event that retrievers listen for. This keeps modules loosely coupled and makes it easier to add caching, logging, observability, or conditional routing without rewriting the entire flow.
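The idea can be shown with a deliberately tiny in-process event bus. The names QueryEvents, QueryExpanded, and Bus are hypothetical; in a Spring application the framework's own application-event mechanism would be the more natural choice, and this sketch only shows the shape of the interaction.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

final class QueryEvents {

    /** Event emitted once query variants have been generated. */
    record QueryExpanded(String originalQuery, List<String> variants) {}

    /** Minimal synchronous publish/subscribe bus. */
    static final class Bus {
        private final List<Consumer<QueryExpanded>> listeners = new CopyOnWriteArrayList<>();

        void subscribe(Consumer<QueryExpanded> listener) {
            listeners.add(listener);
        }

        void publish(QueryExpanded event) {
            // The publisher knows nothing about who consumes the event.
            listeners.forEach(listener -> listener.accept(event));
        }
    }

    private QueryEvents() {}
}
```

A retriever and a logger can subscribe independently, and adding a cache or a metrics listener later does not touch the expander.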

Performance bottlenecks and how to manage them

Profiling an advanced RAG system usually reveals three categories of cost.

1. Compute-heavy steps

LLM calls in query rewriting, expansion, compression, and augmentation, often costing 500–2000 ms each
vector similarity computations, particularly semantic deduplication, which can reach O(n²) complexity with respect to document count

2. I/O-heavy steps

retrieval across several data sources, where network latency adds up
large document merge operations that increase memory pressure and string-handling overhead

3. Resource contention

exhausted ChatClient connection pools under concurrency
vector-database connection limits causing retrieval queuing

Practical optimization strategies

Reduce LLM overhead

Batch generation where possible. If MultiQueryExpander can request several variants in a single call rather than separate calls, API overhead drops sharply.

Use model tiering. Not every task deserves the most capable model.

lightweight models are often enough for history compression
stronger models are better reserved for complex augmentation or conflict resolution

Control retrieval and document cost

Limit returned documents. A retriever with a sensible cap such as top 10 often performs better than one dumping dozens of weak matches into the merge stage.

Pre-filter using metadata. Excluding irrelevant source types or categories before similarity search can cut both cost and noise.

Adopt incremental deduplication. Let ConcatenationDocumentJoiner deduplicate as documents arrive, instead of loading everything first and merging later. This reduces peak memory usage.
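Incremental deduplication can be sketched with a running set of content hashes, so each document is checked as it arrives. The class below is a hypothetical illustration; whitespace normalization before hashing is an added assumption that keeps trivially reformatted copies from slipping through.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class IncrementalHashDedup {

    private final Set<String> seenHashes = new HashSet<>();
    private final List<String> kept = new ArrayList<>();

    /** Returns true if the document was new and has been kept. */
    boolean offer(String content) {
        String normalized = content.strip().replaceAll("\\s+", " ");
        if (seenHashes.add(sha256(normalized))) {
            kept.add(content);
            return true;
        }
        return false; // duplicate: dropped immediately, never buffered
    }

    List<String> documents() {
        return List.copyOf(kept);
    }

    private static String sha256(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}
```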

Improve concurrency and caching

Component-level caching works especially well for stateless transformations such as query rewrites.

Pool resources carefully. A ChatClient pool configured for around 20 concurrent requests, combined with timeouts and retry logic, is far more stable than ad hoc connection creation. Reusing vector-database TCP connections similarly reduces handshake overhead.
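A simple way to cap concurrent LLM requests is a semaphore with an acquire timeout. This is a sketch under assumptions: BoundedLlmCalls is a hypothetical wrapper, the suppliers stand in for real client calls, and the fallback represents whatever downgrade path the system uses when no slot frees up in time.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class BoundedLlmCalls {

    private final Semaphore permits;
    private final long acquireTimeoutMillis;

    BoundedLlmCalls(int maxConcurrent, long acquireTimeoutMillis) {
        this.permits = new Semaphore(maxConcurrent);
        this.acquireTimeoutMillis = acquireTimeoutMillis;
    }

    <T> T call(Supplier<T> llmCall, Supplier<T> fallback) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(acquireTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            acquired = false;
        }
        if (!acquired) {
            return fallback.get(); // no slot in time: degrade instead of queuing forever
        }
        try {
            return llmCall.get();
        } finally {
            permits.release();
        }
    }
}
```

Constructing it as new BoundedLlmCalls(20, 500) would match the figure of roughly 20 concurrent requests mentioned above; the 500 ms wait is an arbitrary example value.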

What good performance looks like

Useful KPIs for this kind of pipeline include:

end-to-end latency target: P95 below 2000 ms
LLM calls per user turn: fewer than 3 after optimization, compared with 5–7 before
peak memory usage: below 512 MB when processing large document sets such as a 100-page corpus slice
concurrency target: roughly 50 simultaneous users, with added latency under load staying within about 100 ms
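Since the latency target is stated as a P95, it helps to be explicit about how that number is computed. The sketch below uses the nearest-rank definition, which is one common convention and an assumption here; monitoring tools may interpolate differently.

```java
import java.util.Arrays;

final class LatencyStats {

    private LatencyStats() {}

    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }
}
```

With 20 samples, the P95 is the 19th smallest value, so a single slow outlier does not decide whether the 2000 ms target is met.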

A representative single-node benchmark on an 8-core, 16 GB machine shows the impact of incremental optimization:

<table> <thead> <tr> <th>Optimization step</th> <th>Average latency</th> <th>Average LLM calls per turn</th> <th>Peak memory</th> <th>Supported concurrent users</th> </tr> </thead> <tbody> <tr> <td>Unoptimized</td> <td>3200 ms</td> <td>6</td> <td>980 MB</td> <td>20</td> </tr> <tr> <td>+ Batched LLM generation</td> <td>2500 ms</td> <td>3</td> <td>950 MB</td> <td>25</td> </tr> <tr> <td>+ Model tiering</td> <td>1800 ms</td> <td>3</td> <td>820 MB</td> <td>30</td> </tr> <tr> <td>+ Component caching</td> <td>1200 ms</td> <td>1.2</td> <td>750 MB</td> <td>40</td> </tr> <tr> <td>+ Resource pooling</td> <td>950 ms</td> <td>1.2</td> <td>680 MB</td> <td>50</td> </tr> </tbody> </table>

The pattern is clear: the largest gains come not from one dramatic change, but from coordinated reductions across LLM calls, retrieval volume, and infrastructure overhead.

Configuration patterns by business scenario

Enterprise knowledge base: accuracy first

This setup prioritizes:

precise retrieval over speed
strict preservation of conversational context
full metadata retention for traceability and source authority

Typical tuning choices include:

stronger terminology-focused rewrite prompts
3–4 expanded queries
semantic deduplication with a stricter threshold
higher relevance thresholds before documents are accepted
preservation of source metadata throughout the merge stage

Customer support: speed first

This setup aims for:

response times under 1000 ms
fast ambiguity resolution in multi-turn chat
acceptable precision trade-offs in exchange for fluid interaction

Common adjustments include:

only 2 query variants from MultiQueryExpander
lower maxHistoryTokens, around 300, in CompressionQueryTransformer
CONTENT_HASH for faster deduplication
a more aggressive downgrade path in exception handling
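The two scenarios can be captured as named presets so the trade-offs stay visible in one place. The sketch is hypothetical: RagProfiles and its fields are invented for illustration, and where the text gives a range or no number at all (for example the enterprise history budget), the concrete value is an assumption.

```java
final class RagProfiles {

    enum Dedup { NONE, CONTENT_HASH, SEMANTIC_SIMILARITY }

    /** One bundle of tuning values for the whole pipeline. */
    record Profile(int numberOfQueries,
                   int maxHistoryTokens,
                   Dedup dedup,
                   double relevanceThreshold,
                   int maxRetries) {}

    private RagProfiles() {}

    /** Accuracy first: broader expansion, semantic deduplication, stricter relevance gate. */
    static Profile enterpriseKnowledgeBase() {
        // 4 variants is the upper end of the 3-4 range; 800 tokens and 0.8 are assumed values.
        return new Profile(4, 800, Dedup.SEMANTIC_SIMILARITY, 0.8, 2);
    }

    /** Speed first: fewer variants, smaller history budget, cheap hashing, no retries. */
    static Profile customerSupport() {
        // 2 variants, about 300 history tokens and CONTENT_HASH follow the list above.
        return new Profile(2, 300, Dedup.CONTENT_HASH, 0.7, 0);
    }
}
```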

Looking ahead: adaptive and multimodal RAG

The current modular design also points toward future improvements.

Adaptive pipelines

A more dynamic system could tune itself at runtime using reinforcement signals or rules such as:

selecting deduplication strategy based on query type
increasing or decreasing query-variant count based on historical success
automatically switching to cheaper models when latency rises

Multimodal retrieval

The same pipeline ideas can be extended beyond text:

MultiQueryExpander could generate cross-modal queries that include text and image descriptions
ConcatenationDocumentJoiner could merge text with image metadata
ContextualQueryAugmenter could help resolve multimodal ambiguity, such as “What feature does this icon represent?”

Knowledge graph integration

A knowledge graph can act as an additional retrieval source:

retrievers query both vector stores and graph data
the joiner merges entity relationships with textual evidence
augmentation logic uses graph structure to resolve document conflicts more reliably

A practical selection guide

When choosing which components to enable, the business objective should drive the design.

<table> <thead> <tr> <th>Goal</th> <th>Recommended combination</th> <th>Key setting</th> </tr> </thead> <tbody> <tr> <td>Maximize recall</td> <td>MultiQueryExpander with 5 variants + semantic deduplication</td> <td>similarityThreshold = 0.85</td> </tr> <tr> <td>Maximize precision</td> <td>RewriteQueryTransformer + hash deduplication</td> <td>relevanceThreshold = 0.8</td> </tr> <tr> <td>Smooth multi-turn dialogue</td> <td>CompressionQueryTransformer + ContextualQueryAugmenter</td> <td>maxHistoryTokens = 500</td> </tr> <tr> <td>Low-latency serving</td> <td>2 query variants + hash deduplication + lightweight model</td> <td>maxRetries = 0</td> </tr> <tr> <td>High reliability</td> <td>full component set + stronger exception handling</td> <td>maxRetries = 3 and a backup model</td> </tr> </tbody> </table>

Common problems and likely fixes

<table> <thead> <tr> <th>Symptom</th> <th>Likely cause</th> <th>Recommended action</th> </tr> </thead> <tbody> <tr> <td>Answer contains incorrect information</td> <td>low-relevance documents entered generation, or conflicts were not handled</td> <td>raise relevanceThreshold to 0.85 and strengthen conflict detection</td> </tr> <tr> <td>Response time exceeds 3 seconds</td> <td>too many LLM calls, or too many retrieved documents</td> <td>cut query variants to 2–3 and reduce maxTotalLength</td> </tr> <tr> <td>Conversation context gets lost</td> <td>history compression is too aggressive, or preserveHistory = false</td> <td>increase maxHistoryTokens to around 800 and ensure history is preserved</td> </tr> <tr> <td>Retrieval output is highly repetitive</td> <td>wrong deduplication strategy, or query variants are too similar</td> <td>switch to semantic deduplication and improve the expansion prompt</td> </tr> </tbody> </table>

The real value of advanced RAG in Spring AI

The strength of this design is not any single component. It is the configurable trade-off system they create together.

Advanced RAG is always balancing three competing forces:

precision
recall
latency

Spring AI’s modular approach makes that balance explicit. You can start with a minimal pipeline, then add rewriting, history compression, query expansion, document fusion, and contextual augmentation only where the business case justifies the cost.

A sensible rollout path is usually:

build the basic retrieval-and-generation flow
add query rewriting for clarity
add query expansion for recall
add document joining and deduplication
introduce contextual augmentation and stronger exception handling
tune everything with real usage data rather than assumptions

Done well, the result is a RAG system that keeps the generative flexibility of LLMs while reducing hallucinations, improving evidence quality, and remaining stable under real production conditions.
