Large language models are powerful, but in production Q&A systems they still fail in familiar ways: they miss relevant facts, misread ambiguous questions, repeat redundant evidence, or confidently answer from weak retrieval. Retrieval-augmented generation (RAG) is meant to reduce those failure modes, yet a basic RAG setup often runs into four practical problems:
- Low recall: a single query does not always surface all relevant documents.
- Ambiguous user input: natural-language questions are often incomplete, vague, or dependent on earlier turns in a conversation.
- Redundant or conflicting documents: multiple retrievers or query variants can return overlapping or inconsistent material.
- Edge-case handling: when retrieval returns nothing or only low-similarity matches, the system needs to degrade gracefully instead of collapsing.
Spring AI addresses these issues through a modular advanced RAG design. The most important building blocks are:
- MultiQueryExpander
- RewriteQueryTransformer
- CompressionQueryTransformer
- ConcatenationDocumentJoiner
- ContextualQueryAugmenter
Used together, these components form a pipeline that improves answer quality while giving developers room to trade off precision, recall, latency, and resilience.
Where advanced RAG starts to matter
A simple retriever-plus-generator flow is rarely enough once a system moves beyond demos.
In an enterprise knowledge base, the main risk is not latency but precision: if terminology is strict and documentation is fragmented, retrieval must be both broad enough to find the right material and strict enough to avoid wrong evidence.
In customer support, the pressure is different. Response time is critical, conversations are multi-turn, and users often ask follow-up questions with pronouns or omitted context. Here, the pipeline has to stay fast while still resolving ambiguity.
The Spring AI components above target exactly those real-world constraints.
Query-side optimization: first make the question clearer, then broaden it
Two components sit at the heart of query improvement: RewriteQueryTransformer and MultiQueryExpander. They solve different problems and work best when combined.
MultiQueryExpander: improving recall through query variants
The role of MultiQueryExpander is straightforward: take one user query and generate several semantically related variants, then retrieve against each of them. This increases document coverage and helps recover information that a single wording might miss.
This is especially useful in cases such as:
- domains with many equivalent terms, such as medical or legal vocabulary
- ambiguous words, where multiple meanings are plausible
- short or underspecified queries that need broader semantic expansion
A user asking how to optimize the JVM, for example, may benefit from variants focused on memory tuning, garbage collection, or performance configuration.
Key parameters
- chatClientBuilder: required; defines the client used to interact with the LLM
- numberOfQueries: how many variants to generate; default is 3, and in practice 2–5 is a useful range
- includeOriginal: whether the original query should remain in the expanded set; Spring AI's builder defaults this to true, so set it to false when the variants already cover the original wording and you want to avoid duplicate retrieval
- promptTemplate: custom prompt control for how the variants are produced, such as more formal, domain-specific, or concise wording
How to tune it
A larger number of variants usually improves recall, but it also increases retrieval cost and may create more duplicate documents downstream.
A practical pattern is to scale the count with query complexity:
- simple or specific queries: 2 variants
- normal knowledge-base queries: 3 variants
- highly ambiguous or sparse queries: 4–5 variants
Domain-specific prompt templates also matter. If the system serves a specialized corpus, the expansion prompt should encourage terminology normalization and known domain synonyms rather than free-form paraphrasing.
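The scaling rule above can be sketched as a small heuristic. This is a minimal illustration, not part of Spring AI: the `VariantCountPolicy` class, the `QueryComplexity` enum, the vague-word list, and the word-count cut-offs are all assumptions made for this example. The returned number would simply be fed into whatever sets the expander's variant count.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not a Spring AI class: maps a rough complexity guess to a variant count. */
final class VariantCountPolicy {

    enum QueryComplexity { SIMPLE, NORMAL, AMBIGUOUS }

    // Assumed markers of an underspecified query; tune these against real traffic.
    private static final Set<String> VAGUE_WORDS =
            Set.of("it", "this", "that", "they", "them", "thing", "stuff");

    /** Rough heuristic: pronouns or very short queries count as ambiguous, long ones as specific. */
    static QueryComplexity classify(String query) {
        String[] words = query.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (String word : words) {
            if (VAGUE_WORDS.contains(word.replaceAll("[^a-z]", ""))) {
                return QueryComplexity.AMBIGUOUS;
            }
        }
        if (words.length <= 3) {
            return QueryComplexity.AMBIGUOUS;
        }
        return words.length >= 8 ? QueryComplexity.SIMPLE : QueryComplexity.NORMAL;
    }

    /** Follows the 2 / 3 / 4–5 guideline; 4 is used as the lower end of the last band. */
    static int variantCount(String query) {
        switch (classify(query)) {
            case SIMPLE: return 2;
            case NORMAL: return 3;
            default: return 4;
        }
    }
}
```

A classifier this crude will misjudge some queries, so it is best treated as a starting point and replaced by signals from real retrieval logs.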
RewriteQueryTransformer: improving precision through semantic cleanup
If MultiQueryExpander broadens the search, RewriteQueryTransformer sharpens it first.
Its job is to rewrite a single query into a form that the retrieval layer can interpret more reliably. That may involve:
- removing ambiguity
- eliminating redundant phrasing
- normalizing terms
- filling in omitted context from conversation history
Typical examples include rewriting a vague query such as “the price of Apple” into either the company or the fruit depending on context, or converting a rambling request into a concise retrieval-oriented expression.
Key parameters
- chatClientBuilder: required; enables LLM-based rewriting
- promptTemplate: must include {query} and {history} placeholders
- preserveHistory: whether the output keeps the original dialog history; default is true, which is important in multi-turn scenarios
Why this matters in conversation
Consider a short dialogue:
- User: “Recommend a programming language suitable for beginners.”
- Assistant: “Python is a great choice because its syntax is simple and easy to learn.”
- Current query: “What data-processing libraries does it have?”
Without rewriting, the retriever sees a pronoun-heavy question and may fail to connect “it” to Python. Passing both the query and conversation history into RewriteQueryTransformer lets the system rewrite it as “What data-processing libraries does Python have?”
That single change can dramatically improve retrieval quality.
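To make the placeholder mechanics concrete, here is a minimal sketch of filling a rewrite prompt that contains {query} and {history}. It is plain Java and does not call Spring AI; the `RewritePromptSketch` class and its `render` method are hypothetical names used only for illustration, and the rendered string is what would be sent to the LLM for rewriting.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: fills {query} and {history} in a rewrite prompt template. */
final class RewritePromptSketch {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(query|history)\\}");

    static String render(String template, String query, List<String> history) {
        // Fail fast if the template is missing a required placeholder.
        if (!template.contains("{query}") || !template.contains("{history}")) {
            throw new IllegalArgumentException("template must contain {query} and {history}");
        }
        String joinedHistory = String.join("\n", history);
        // Single pass over the template, so placeholder-like text inside the
        // query or the history is never substituted a second time.
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String value = matcher.group(1).equals("query") ? query : joinedHistory;
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

For the dialogue above, the history list would hold the two earlier turns and the query would be the pronoun-based follow-up; the LLM then has what it needs to resolve "it" to Python.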
Performance advice
Two practical optimizations are worth applying early:
- Cache common rewrites for frequent ambiguous phrases such as “What is it?” or “How do I do that?”
- Use lightweight pre-detection so obviously clear queries can bypass rewriting entirely
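Both optimizations can be sketched in a few lines. The example below is an assumption-laden illustration rather than a Spring AI feature: `CachingRewriter`, the pronoun pattern used for pre-detection, and the `contextKey` argument are all hypothetical. The context key is included in the cache key because the correct rewrite of a phrase like "What is it?" depends on what the conversation is about.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BinaryOperator;
import java.util.regex.Pattern;

/** Hypothetical wrapper: skips rewriting for clear queries and caches the rest. */
final class CachingRewriter {

    // Assumed pre-detection rule: only queries containing a pronoun need rewriting.
    private static final Pattern NEEDS_CONTEXT = Pattern.compile(
            "\\b(it|its|this|that|they|them|these|those)\\b", Pattern.CASE_INSENSITIVE);

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final BinaryOperator<String> llmRewrite; // (contextKey, query) -> rewritten query

    CachingRewriter(BinaryOperator<String> llmRewrite) {
        this.llmRewrite = llmRewrite;
    }

    String rewrite(String contextKey, String query) {
        if (!NEEDS_CONTEXT.matcher(query).find()) {
            return query; // obviously clear: bypass the LLM call entirely
        }
        // Cache per context, since the same phrase rewrites differently per topic.
        String cacheKey = contextKey + "\n" + query.trim().toLowerCase(Locale.ROOT);
        return cache.computeIfAbsent(cacheKey, key -> llmRewrite.apply(contextKey, query));
    }
}
```

In a real deployment the map would need eviction or a time-to-live; an unbounded cache is acceptable only for a sketch.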
Why the combination works
The strongest pattern is not using either component alone, but arranging them as a two-stage query pipeline:
- RewriteQueryTransformer clarifies and standardizes the user’s intent
- MultiQueryExpander generates several retrieval-friendly variants from that cleaned-up query
This “precision first, recall second” design is generally more stable than expanding raw user input directly.
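The two-stage ordering can be expressed as simple function composition. The sketch below uses plain Java functional interfaces as stand-ins for the rewriter and the expander; `TwoStageQueryPipeline` is a hypothetical name, and wiring in the real components is left to the application.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of "rewrite first, then expand". */
final class TwoStageQueryPipeline {

    private final UnaryOperator<String> rewriter;            // stand-in for the rewrite step
    private final Function<String, List<String>> expander;   // stand-in for the expansion step

    TwoStageQueryPipeline(UnaryOperator<String> rewriter,
                          Function<String, List<String>> expander) {
        this.rewriter = rewriter;
        this.expander = expander;
    }

    /** Returns the clarified query followed by its distinct variants, in order. */
    List<String> process(String rawQuery) {
        String clarified = rewriter.apply(rawQuery);
        // LinkedHashSet drops variants identical to the clarified query while keeping order.
        Set<String> queries = new LinkedHashSet<>();
        queries.add(clarified);
        queries.addAll(expander.apply(clarified));
        return new ArrayList<>(queries);
    }
}
```

Because the expander only ever sees the clarified query, ambiguity in the raw input is not multiplied across the variants.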
A useful tuning guide looks like this:
<table> <thead> <tr> <th>Scenario</th> <th>Rewrite style</th> <th>Multi-query count</th> <th>Expected benefit</th> </tr> </thead> <tbody> <tr> <td>Professional knowledge base</td> <td>Terminology-focused</td> <td>3–4</td> <td>Better matching on specialized terms</td> </tr> <tr> <td>General Q&A</td> <td>Natural and concise</td> <td>2–3</td> <td>Balance between quality and cost</td> </tr> <tr> <td>Highly ambiguous queries</td> <td>Context-enriched</td> <td>4–5</td> <td>Better coverage for unclear intent</td> </tr> </tbody> </table>
Compression for multi-turn chat: keeping history useful without drowning the model
As conversations grow longer, history becomes both valuable and expensive. Long chat context creates at least three problems:
- the model struggles to identify which earlier turns matter to the current question
- long history can reduce retrieval effectiveness when embeddings are generated from bloated context
- the generation model spends too many tokens on old context instead of answering the current query
CompressionQueryTransformer exists to solve this by condensing dialogue history into a compact summary that keeps only the information relevant to the present turn.
What it does well
Its main strengths are:
- reducing context length, often by around 50%–70%
- emphasizing relevant historical details
- preserving enough conversational continuity to avoid logical breaks
Key parameters
- chatClientBuilder: required for LLM-based compression
- promptTemplate: determines what should be preserved in the compressed history
- maxHistoryTokens: maximum token budget for the compressed output; for a model like GPT-3.5-turbo, a practical range is roughly 500–1000 depending on the overall prompt budget
A typical support-chat example
Imagine a ten-turn e-commerce support exchange. After all that context, the user asks: “How do I claim the discount you mentioned earlier?”
If the raw history is long, CompressionQueryTransformer first checks whether it exceeds maxHistoryTokens—say the limit is 500. If it does, the component generates a compressed summary that keeps only the key facts related to the discount and discards unrelated parts of the earlier conversation.
The result is a shorter, more focused history that may cut context length by about 65% while preserving exactly what the retriever and answer generator need.
Better compression strategies
A layered strategy tends to work better than compressing everything equally:
- keep the most recent 3 turns in full
- summarize older turns, but only if they are relevant to the current query
This preserves local coherence while preventing history bloat.
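A minimal sketch of the layered strategy, combined with the token-budget check described earlier, might look like the following. Everything here is illustrative: `LayeredHistoryCompressor` is a hypothetical class, the whitespace-based token estimate is a crude stand-in for a real tokenizer, and the relevance test and summarizer are passed in as functions because in practice both would be backed by an LLM or an embedding model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: keep recent turns verbatim, summarize older relevant ones. */
final class LayeredHistoryCompressor {

    static final int RECENT_TURNS_KEPT = 3;

    /** Crude token estimate (whitespace-separated words); a real tokenizer would differ. */
    static int estimateTokens(List<String> turns) {
        int total = 0;
        for (String turn : turns) {
            String trimmed = turn.trim();
            if (!trimmed.isEmpty()) {
                total += trimmed.split("\\s+").length;
            }
        }
        return total;
    }

    static List<String> compress(List<String> turns, int maxHistoryTokens,
                                 Predicate<String> relevantToQuery,
                                 UnaryOperator<String> summarize) {
        // Under budget: nothing to do.
        if (estimateTokens(turns) <= maxHistoryTokens) {
            return new ArrayList<>(turns);
        }
        int split = Math.max(0, turns.size() - RECENT_TURNS_KEPT);
        List<String> compressed = new ArrayList<>();
        // Older turns survive only if relevant, and only in summarized form.
        for (String older : turns.subList(0, split)) {
            if (relevantToQuery.test(older)) {
                compressed.add(summarize.apply(older));
            }
        }
        // The most recent turns are kept in full.
        compressed.addAll(turns.subList(split, turns.size()));
        return compressed;
    }
}
```

Note that this sketch does not re-check the budget after compression; a stricter version would summarize again or drop turns if the result is still too long.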
Where it should sit in the pipeline
In most conversational systems, the best order is:
- CompressionQueryTransformer
- RewriteQueryTransformer
- retrieval
Compressing history before rewriting gives the rewriter a cleaner and more focused context window.
Caching compressed summaries by conversation ID is also worthwhile, especially when users ask several follow-up questions against the same session state.
Document-side processing: deduplication, ordering, and merge control
Once multiple retrievers or multiple query variants are involved, the retrieval layer often returns a noisy set of results. This is where ConcatenationDocumentJoiner becomes important.
ConcatenationDocumentJoiner: combining evidence without flooding the model
This component addresses three recurring problems:
- duplicate documents: the same item may be returned by several query variants
- conflicting statements: different sources may disagree
- messy structure: metadata such as origin, confidence, or ranking may be inconsistent across documents
Its purpose is not just to concatenate text. It optimizes the final evidence bundle through:
- configurable deduplication
- confidence-aware merge ordering
- metadata preservation and normalization
- a total-length limit so the final context stays inside the LLM window
Key parameters
- deduplicationStrategy: NONE, CONTENT_HASH, or SEMANTIC_SIMILARITY
- similarityThreshold: used for semantic deduplication, typically 0.85–0.95
- maxTotalLength: maximum merged character count; for GPT-3.5-turbo, staying within about 8000 characters is a reasonable guideline depending on the full prompt layout
- separator: document separator, defaulting to \n\n---\n\n, which helps the model recognize boundaries between sources
- preserveMetadata: default true; useful for traceability and conflict handling
Choosing the right deduplication strategy
<table> <thead> <tr> <th>Strategy</th> <th>How it works</th> <th>Strengths</th> <th>Weaknesses</th> <th>Best fit</th> </tr> </thead> <tbody> <tr> <td>CONTENT_HASH</td>
<td>Exact hash match on document content</td>
<td>Efficient, roughly O(n), no loss of precision</td>
<td>Cannot detect near-duplicates</td>
<td>Structured content such as API docs or records</td>
</tr>
<tr>
<td>SEMANTIC_SIMILARITY</td>
<td>Cosine similarity over embeddings</td>
<td>Can catch paraphrased or near-duplicate content</td>
<td>More expensive, roughly O(n²), depends on embedding quality</td>
<td>Unstructured text such as articles or reviews</td>
</tr>
<tr>
<td>NONE</td>
<td>No deduplication</td>
<td>No extra overhead</td>
<td>Can produce heavy redundancy</td>
<td>Low-latency scenarios or clean sources</td>
</tr>
</tbody>
</table>
A mixed data environment may require custom logic: exact hashing for structured sources, semantic similarity for free text.
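The difference between the two active strategies is easiest to see side by side. The sketch below is a self-contained illustration, not the joiner's actual implementation: `DedupSketch` is a hypothetical class, documents are reduced to plain strings, and the embedding function is supplied by the caller because a real system would obtain vectors from an embedding model.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of exact-hash and semantic-similarity deduplication. */
final class DedupSketch {

    /** Exact dedup, roughly O(n): keeps the first document for each SHA-256 content hash. */
    static List<String> byContentHash(List<String> docs) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String doc : docs) {
            if (seen.add(sha256(doc))) {
                kept.add(doc);
            }
        }
        return kept;
    }

    /** Near-duplicate dedup, roughly O(n²): drops a document if it is too close to a kept one. */
    static List<String> bySemanticSimilarity(List<String> docs,
                                             Function<String, double[]> embed,
                                             double similarityThreshold) {
        List<String> kept = new ArrayList<>();
        List<double[]> keptVectors = new ArrayList<>();
        for (String doc : docs) {
            double[] vector = embed.apply(doc);
            boolean duplicate = false;
            for (double[] other : keptVectors) {
                if (cosine(vector, other) >= similarityThreshold) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(doc);
                keptVectors.add(vector);
            }
        }
        return kept;
    }

    /** Cosine similarity; assumes both vectors have the same length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0; // a zero vector is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    private static String sha256(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}
```

For the mixed environment mentioned above, one option is to route each document to one of these two methods based on a source-type field in its metadata.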
More than deduplication
Merge order matters as much as filtering. If the joiner can sort by confidence or source priority, the final document bundle becomes more useful for generation. Preserving metadata is also important when the system needs to explain where an answer came from or when it must mark contradictory passages.
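Ordering, metadata retention, and the length cap can be combined in one short merge routine. This is a sketch under stated assumptions: `JoinSketch` and `ScoredDoc` are hypothetical types, the score is assumed to be a retrieval confidence where higher is better, and the source label is the only metadata carried into the output. It requires Java 16 or later because it uses a record.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: confidence-ordered merge with a total-length cap. */
final class JoinSketch {

    record ScoredDoc(String content, String source, double score) {}

    static final String SEPARATOR = "\n\n---\n\n";

    static String join(List<ScoredDoc> docs, int maxTotalLength) {
        // Highest-confidence documents first, so trimming removes the weakest evidence.
        List<ScoredDoc> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingDouble(ScoredDoc::score).reversed());

        StringBuilder out = new StringBuilder();
        for (ScoredDoc doc : sorted) {
            // Keep the source label so the answer can be traced back to its origin.
            String piece = "[source: " + doc.source() + "]\n" + doc.content();
            int extra = (out.length() == 0 ? 0 : SEPARATOR.length()) + piece.length();
            if (out.length() + extra > maxTotalLength) {
                break; // stop at the first document that does not fit, never cut mid-document
            }
            if (out.length() > 0) {
                out.append(SEPARATOR);
            }
            out.append(piece);
        }
        return out.toString();
    }
}
```

Stopping at the first document that does not fit is a deliberate simplification; skipping it and trying smaller, lower-ranked documents is an equally valid policy.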
Retrieval integration
A common end-to-end pattern is:
- generate multiple query variants
- retrieve in parallel across the variants or sources
- pass all results into ConcatenationDocumentJoiner
- deduplicate, sort, trim, and merge
- forward the cleaned document set to the next stage
Parallel retrieval and caching are often the biggest practical wins here.
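Parallel retrieval across variants can be sketched with the standard `CompletableFuture` API. The retriever is modeled as a plain function from query to document strings, and `ParallelRetrievalSketch` is a hypothetical name; only exact duplicates are removed here, with the fuller deduplication left to the joining stage.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;

/** Hypothetical sketch: one retrieval task per query variant, merged in variant order. */
final class ParallelRetrievalSketch {

    static List<String> retrieveAll(List<String> variants,
                                    Function<String, List<String>> retriever,
                                    ExecutorService pool) {
        // Start every retrieval before waiting on any of them.
        List<CompletableFuture<List<String>>> futures = new ArrayList<>();
        for (String variant : variants) {
            futures.add(CompletableFuture.supplyAsync(() -> retriever.apply(variant), pool));
        }
        // Join in variant order so the merged result is deterministic.
        // join() rethrows a retriever failure wrapped in CompletionException.
        Set<String> merged = new LinkedHashSet<>();
        for (CompletableFuture<List<String>> future : futures) {
            merged.addAll(future.join());
        }
        return new ArrayList<>(merged);
    }
}
```

With this shape, total retrieval time approaches that of the slowest variant rather than the sum of all variants, provided the pool has enough threads.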
Contextual augmentation: what to do when retrieval is weak
Even after rewriting, expansion, and document merging, retrieval can still fail in softer ways. Results may not be empty, but they may be only weakly related. Or the retrieved documents may answer part of the question while leaving a crucial gap.
ContextualQueryAugmenter is designed for exactly these situations.
What it adds
It supports three kinds of recovery:
- query augmentation: generate a better second-pass query from the current query, conversation history, and retrieved document summary
- information completion: create follow-up retrieval requests when the evidence is incomplete
- context bridging: reconnect the current question with earlier dialogue when the chain of meaning is broken
Key parameters
- relevanceThreshold: default 0.7; below this, results are treated as insufficiently relevant
- maxRetries: default 2; limits how many augmentation attempts are allowed
- augmentTemplate: must include {query}, {history}, {documentSummary}, and {issue}
Example: low-relevance first retrieval
Suppose a user asks: “Which vector databases are supported by Spring AI RAG?”
The first retrieval pass returns documents with an average score of 0.62, below the relevance threshold of 0.7. At that point, ContextualQueryAugmenter can trigger a second query that is more explicit, such as one that spells out the framework and clarifies that the question is about RAG module integration with vector databases.
A stronger rewritten query may then produce a second retrieval round with an average score of 0.89, turning a weak result set into one that is actually usable.
This component is especially valuable when the retriever needs help inferring missing context from user wording.
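The threshold-and-retry behavior in this example can be sketched as a bounded loop. The code below is an illustration of the control flow only: `AugmentationLoopSketch` and `Retrieval` are hypothetical types, the retriever and the augmentation step are passed in as functions, and the average score is assumed to be computed by the retriever. It requires Java 16 or later because it uses a record.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: re-query with an augmented question while relevance is too low. */
final class AugmentationLoopSketch {

    record Retrieval(String query, List<String> docs, double averageScore) {}

    static Retrieval retrieveWithAugmentation(String query,
                                              Function<String, Retrieval> retriever,
                                              UnaryOperator<String> augment,
                                              double relevanceThreshold,
                                              int maxRetries) {
        Retrieval best = retriever.apply(query);
        String current = query;
        // Retry only while the best result is below the threshold and retries remain.
        for (int attempt = 0;
             attempt < maxRetries && best.averageScore() < relevanceThreshold;
             attempt++) {
            current = augment.apply(current);
            Retrieval next = retriever.apply(current);
            if (next.averageScore() > best.averageScore()) {
                best = next; // never trade a better result for a worse one
            }
        }
        return best;
    }
}
```

If the loop ends with the score still below the threshold, the caller should treat the result as weak evidence and lower the certainty of the generated answer accordingly.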
Failure handling and graceful degradation
A production RAG system cannot assume that every retrieval, merge, and LLM call will succeed. The pipeline needs clear fallback behavior.
Typical failure cases include:
<table> <thead> <tr> <th>Failure scenario</th> <th>Detection</th> <th>Handling strategy</th> <th>Example response</th> </tr> </thead> <tbody> <tr> <td>No retrieval results</td> <td>documents.isEmpty()</td>
<td>Retry with query expansion, then inform the user and suggest alternatives</td>
<td>“No relevant documents were found. You may want to try…”</td>
</tr>
<tr>
<td>Very low similarity</td>
<td>average score &lt; 0.5</td>
<td>Use ContextualQueryAugmenter, reduce answer certainty</td>
<td>“Based on limited information, a possible answer is…”</td>
</tr>
<tr>
<td>Conflicting documents</td>
<td>contradiction detected</td>
<td>Mark conflicts, present differing viewpoints, ask for clarification</td>
<td>“Different documents describe this differently…”</td>
</tr>
<tr>
<td>LLM invocation failure</td>
<td>catch ChatClientException</td>
<td>Retry up to 3 times, switch to backup model, or return retrieval-only output</td>
<td>“The AI service is currently busy. Here are relevant document excerpts…”</td>
</tr>
</tbody>
</table>
The key idea is that failure should not feel like a crash. It should become a controlled downgrade in capability.
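A controlled downgrade for the empty-result and LLM-failure rows can be sketched as a short fallback chain. This is a minimal illustration with hypothetical names (`FallbackSketch`, `primaryModel`, `backupModel`); it catches `RuntimeException` as a generic stand-in, since the exact exception type thrown by a given client is not assumed here, and the response strings are placeholders.

```java
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: retry the primary model, then a backup, then retrieval-only output. */
final class FallbackSketch {

    static final String NO_RESULTS = "No relevant documents were found.";
    static final String BUSY_PREFIX =
            "The AI service is currently busy. Here are relevant document excerpts:";

    static String answer(String prompt, List<String> docs,
                         UnaryOperator<String> primaryModel,
                         UnaryOperator<String> backupModel,
                         int maxAttempts) {
        if (docs.isEmpty()) {
            return NO_RESULTS; // nothing to ground an answer on
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return primaryModel.apply(prompt);
            } catch (RuntimeException e) {
                // Swallow and retry; a real system would log and back off here.
            }
        }
        try {
            return backupModel.apply(prompt);
        } catch (RuntimeException e) {
            // Last resort: return the evidence itself instead of failing outright.
            return BUSY_PREFIX + "\n" + String.join("\n", docs);
        }
    }
}
```

Each step down the chain gives the user less, but none of them surfaces as an unhandled error.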
Putting the modules together: a complete advanced RAG flow
The advanced Spring AI pipeline can be understood as a layered system.
Foundation layer
- ChatClient powers all LLM-dependent components: MultiQueryExpander, RewriteQueryTransformer, CompressionQueryTransformer, and ContextualQueryAugmenter
- VectorStore supports the retrievers and also enables semantic deduplication in ConcatenationDocumentJoiner
Query-processing layer
- CompressionQueryTransformer prepares compact conversation history
- RewriteQueryTransformer clarifies the current user question
- MultiQueryExpander creates retrieval variants from the rewritten query
Document-processing layer
- one or more Retriever instances fetch candidate documents
- ConcatenationDocumentJoiner deduplicates, sorts, normalizes, and merges the result set
Augmentation and generation layer
- ContextualQueryAugmenter reacts to weak retrieval or missing information
- the generation model answers from the final evidence bundle
- a cross-cutting exception handler can watch the entire pipeline for failures
This creates a closed loop from query intake to optimization, retrieval, evidence fusion, answer generation, and fallback handling.
Event-driven coordination and loose coupling
A useful architectural detail in this style of pipeline is event-driven collaboration. Instead of tightly binding each module to the next, components can publish and subscribe to events.
For example, after MultiQueryExpander finishes generating variants, it can emit a query-expanded event that retrievers listen for. This keeps modules loosely coupled and makes it easier to add caching, logging, observability, or conditional routing without rewriting the entire flow.
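The publish/subscribe idea can be shown with a tiny in-process event bus. This is a sketch only: `QueryEventBus` and `QueryExpandedEvent` are hypothetical names, and a Spring application would more likely use the framework's own application-event mechanism. It requires Java 16 or later because it uses a record.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical sketch: listeners react to a "query expanded" event without knowing the publisher. */
final class QueryEventBus {

    record QueryExpandedEvent(String original, List<String> variants) {}

    // CopyOnWriteArrayList allows subscribing while events are being delivered.
    private final List<Consumer<QueryExpandedEvent>> listeners = new CopyOnWriteArrayList<>();

    void subscribe(Consumer<QueryExpandedEvent> listener) {
        listeners.add(listener);
    }

    /** Delivers the event synchronously to every listener, in subscription order. */
    void publish(QueryExpandedEvent event) {
        for (Consumer<QueryExpandedEvent> listener : listeners) {
            listener.accept(event);
        }
    }
}
```

A retriever, a logger, and a metrics collector can all subscribe to the same event, and adding one never requires touching the expander.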
Performance bottlenecks and how to manage them
Profiling an advanced RAG system usually reveals three categories of cost.
1. Compute-heavy steps
- LLM calls in query rewriting, expansion, compression, and augmentation, often costing 500–2000 ms each
- vector similarity computations, particularly semantic deduplication, which can reach O(n²) complexity with respect to document count
2. I/O-heavy steps
- retrieval across several data sources, where network latency adds up
- large document merge operations that increase memory pressure and string-handling overhead
3. Resource contention
- exhausted ChatClient connection pools under concurrency
- vector-database connection limits causing retrieval queuing
Practical optimization strategies
Reduce LLM overhead
Batch generation where possible. If MultiQueryExpander can request several variants in a single call rather than separate calls, API overhead drops sharply.
Use model tiering. Not every task deserves the most capable model.
- lightweight models are often enough for history compression
- stronger models are better reserved for complex augmentation or conflict resolution
Control retrieval and document cost
Limit returned documents. A retriever with a sensible cap such as top 10 often performs better than one dumping dozens of weak matches into the merge stage.
Pre-filter using metadata. Excluding irrelevant source types or categories before similarity search can cut both cost and noise.
Adopt incremental deduplication. Let ConcatenationDocumentJoiner deduplicate as documents arrive, instead of loading everything first and merging later. This reduces peak memory usage.
Improve concurrency and caching
Component-level caching works especially well for stateless transformations such as query rewrites.
Pool resources carefully. A ChatClient pool configured for around 20 concurrent requests, combined with timeouts and retry logic, is far more stable than ad hoc connection creation. Reusing vector-database TCP connections similarly reduces handshake overhead.
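One simple way to cap concurrent LLM calls is a semaphore with a bounded wait. The sketch below is not a connection pool and not a Spring AI facility; `BoundedLlmCaller` is a hypothetical wrapper that limits how many calls are in flight and rejects callers that wait too long for a slot. Retry logic is omitted to keep the example short.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical sketch: at most maxConcurrent LLM calls run at once. */
final class BoundedLlmCaller {

    private final Semaphore permits;
    private final long acquireTimeoutMs;

    BoundedLlmCaller(int maxConcurrent, long acquireTimeoutMs) {
        this.permits = new Semaphore(maxConcurrent);
        this.acquireTimeoutMs = acquireTimeoutMs;
    }

    <T> T call(Supplier<T> llmCall) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(acquireTimeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new RejectedExecutionException("interrupted while waiting for an LLM slot", e);
        }
        if (!acquired) {
            throw new RejectedExecutionException(
                    "no LLM slot free within " + acquireTimeoutMs + " ms");
        }
        try {
            return llmCall.get();
        } finally {
            permits.release(); // always free the slot, even if the call throws
        }
    }
}
```

Constructed as `new BoundedLlmCaller(20, 500)`, this would match the "around 20 concurrent requests" guideline, with the 500 ms wait being an arbitrary illustrative value.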
What good performance looks like
Useful KPIs for this kind of pipeline include:
- end-to-end latency target: P95 below 2000 ms
- LLM calls per user turn: fewer than 3 after optimization, compared with 5–7 before
- peak memory usage: below 512 MB when processing large document sets such as a 100-page corpus slice
- concurrency target: roughly 50 simultaneous users, with latency under load growing by no more than about 100 ms
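Checking the P95 latency target requires a percentile over recorded request times. The sketch below uses the nearest-rank definition, which is one of several common conventions; `LatencyStats` is a hypothetical helper and the samples are assumed to be end-to-end latencies in milliseconds.

```java
import java.util.Arrays;

/** Hypothetical helper: nearest-rank percentile over latency samples. */
final class LatencyStats {

    /** Returns the smallest sample such that at least p percent of samples are at or below it. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMs.clone(); // do not reorder the caller's array
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    static boolean meetsP95Target(long[] samplesMs, long targetMs) {
        return percentile(samplesMs, 95) < targetMs;
    }
}
```

Averages hide tail latency, which is why the target is stated as a percentile rather than a mean.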
A representative single-node benchmark on an 8-core, 16 GB machine shows the impact of incremental optimization:
<table> <thead> <tr> <th>Optimization step</th> <th>Average latency</th> <th>LLM calls</th> <th>Peak memory</th> <th>Supported concurrency</th> </tr> </thead> <tbody> <tr> <td>Unoptimized</td> <td>3200ms</td> <td>6</td> <td>980MB</td> <td>20</td> </tr> <tr> <td>Batched LLM generation</td> <td>2500ms</td> <td>3</td> <td>950MB</td> <td>25</td> </tr> <tr> <td>+ Model tiering</td> <td>1800ms</td> <td>3</td> <td>820MB</td> <td>30</td> </tr> <tr> <td>+ Component caching</td> <td>1200ms</td> <td>1.2</td> <td>750MB</td> <td>40</td> </tr> <tr> <td>+ Resource pooling</td> <td>950ms</td> <td>1.2</td> <td>680MB</td> <td>50</td> </tr> </tbody> </table>
The pattern is clear: the largest gains come not from one dramatic change, but from coordinated reductions across LLM calls, retrieval volume, and infrastructure overhead.
Configuration patterns by business scenario
Enterprise knowledge base: accuracy first
This setup prioritizes:
- precise retrieval over speed
- strict preservation of conversational context
- full metadata retention for traceability and source authority
Typical tuning choices include:
- stronger terminology-focused rewrite prompts
- 3–4 expanded queries
- semantic deduplication with a stricter threshold
- higher relevance thresholds before documents are accepted
- preservation of source metadata throughout the merge stage
Customer support: speed first
This setup aims for:
- response times under 1000 ms
- fast ambiguity resolution in multi-turn chat
- acceptable precision trade-offs in exchange for fluid interaction
Common adjustments include:
- only 2 query variants from MultiQueryExpander
- lower maxHistoryTokens, around 300, in CompressionQueryTransformer
- CONTENT_HASH for faster deduplication
- a more aggressive downgrade path in exception handling
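The two scenarios can be captured as named configuration profiles. The sketch below is illustrative only: `RagProfiles` and `Profile` are hypothetical types, and several values are assumptions chosen from ranges given elsewhere in this article (for example, 800 history tokens and a 0.8 relevance threshold for the knowledge base) rather than recommended settings. It requires Java 16 or later because it uses a record.

```java
/** Hypothetical sketch: one settings bundle per business scenario. */
final class RagProfiles {

    enum Dedup { NONE, CONTENT_HASH, SEMANTIC_SIMILARITY }

    record Profile(int queryVariants, int maxHistoryTokens, Dedup dedup,
                   double relevanceThreshold, int maxRetries) {}

    // Accuracy first: more variants, semantic dedup, stricter relevance gate (assumed values).
    static final Profile ENTERPRISE_KB =
            new Profile(4, 800, Dedup.SEMANTIC_SIMILARITY, 0.8, 2);

    // Speed first: fewer variants, short history, hash dedup, no augmentation retries.
    static final Profile CUSTOMER_SUPPORT =
            new Profile(2, 300, Dedup.CONTENT_HASH, 0.7, 0);
}
```

Keeping the settings in one immutable bundle makes it straightforward to switch profiles per tenant or per channel and to compare them in A/B tests.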
Looking ahead: adaptive and multimodal RAG
The current modular design also points toward future improvements.
Adaptive pipelines
A more dynamic system could tune itself at runtime using reinforcement signals or rules such as:
- selecting deduplication strategy based on query type
- increasing or decreasing query-variant count based on historical success
- automatically switching to cheaper models when latency rises
Multimodal retrieval
The same pipeline ideas can be extended beyond text:
- MultiQueryExpander could generate cross-modal queries that include text and image descriptions
- ConcatenationDocumentJoiner could merge text with image metadata
- ContextualQueryAugmenter could help resolve multimodal ambiguity, such as “What feature does this icon represent?”
Knowledge graph integration
A knowledge graph can act as an additional retrieval source:
- retrievers query both vector stores and graph data
- the joiner merges entity relationships with textual evidence
- augmentation logic uses graph structure to resolve document conflicts more reliably
A practical selection guide
When choosing which components to enable, the business objective should drive the design.
<table> <thead> <tr> <th>Goal</th> <th>Recommended combination</th> <th>Key setting</th> </tr> </thead> <tbody> <tr> <td>Maximize recall</td> <td>MultiQueryExpander with 5 variants + semantic deduplication</td>
<td>similarityThreshold = 0.85</td>
</tr>
<tr>
<td>Maximize precision</td>
<td>RewriteQueryTransformer + hash deduplication</td>
<td>relevanceThreshold = 0.8</td>
</tr>
<tr>
<td>Smooth multi-turn dialogue</td>
<td>CompressionQueryTransformer + ContextualQueryAugmenter</td>
<td>maxHistoryTokens = 500</td>
</tr>
<tr>
<td>Low-latency serving</td>
<td>2 query variants + hash deduplication + lightweight model</td>
<td>maxRetries = 0</td>
</tr>
<tr>
<td>High reliability</td>
<td>full component set + stronger exception handling</td>
<td>maxRetries = 3 and a backup model</td>
</tr>
</tbody>
</table>
Common problems and likely fixes
<table> <thead> <tr> <th>Symptom</th> <th>Likely cause</th> <th>Recommended action</th> </tr> </thead> <tbody> <tr> <td>Answer contains incorrect information</td> <td>low-relevance documents entered generation, or conflicts were not handled</td> <td>raise relevanceThreshold to 0.85 and strengthen conflict detection</td>
</tr>
<tr>
<td>Response time exceeds 3 seconds</td>
<td>too many LLM calls, or too many retrieved documents</td>
<td>cut query variants to 2–3 and reduce maxTotalLength</td>
</tr>
<tr>
<td>Conversation context gets lost</td>
<td>history compression is too aggressive, or preserveHistory = false</td>
<td>increase maxHistoryTokens to around 800 and ensure history is preserved</td>
</tr>
<tr>
<td>Retrieval output is highly repetitive</td>
<td>wrong deduplication strategy, or query variants are too similar</td>
<td>switch to semantic deduplication and improve the expansion prompt</td>
</tr>
</tbody>
</table>
The real value of advanced RAG in Spring AI
The strength of this design is not any single component but the configurable trade-off system the components create together.
Advanced RAG is always balancing three competing forces:
- precision
- recall
- latency
Spring AI’s modular approach makes that balance explicit. You can start with a minimal pipeline, then add rewriting, history compression, query expansion, document fusion, and contextual augmentation only where the business case justifies the cost.
A sensible rollout path is usually:
- build the basic retrieval-and-generation flow
- add query rewriting for clarity
- add query expansion for recall
- add document joining and deduplication
- introduce contextual augmentation and stronger exception handling
- tune everything with real usage data rather than assumptions
Done well, the result is a RAG system that keeps the generative flexibility of LLMs while reducing hallucinations, improving evidence quality, and remaining stable under real production conditions.