Grounding a support agent in real product data
The first time I pointed a raw LLM at our toy store’s support inbox, it told a customer our return window was 90 days. It is 30. My wife caught it before it went out, but the fact that the system generated that answer with complete confidence bothered me for days. The model was not lying, exactly. It was just averaging across thousands of return policies it had seen during training and producing something plausible. For a real store with real customers, plausible is dangerous.
This is Part 2 of the AI E-commerce Store series, in which I am building a production AI system for my family’s educational toy store using Google ADK 2.0 and WooCommerce.
This is the problem that Retrieval-Augmented Generation solves. Instead of letting the model answer from memory, you pull relevant documents from your own data and include them in the prompt. The model generates based on that retrieved context, not its training data. But I found that basic RAG on its own has its own set of failure modes, and the four patterns from Generative AI Design Patterns (O’Reilly, 2025) gave me a framework for thinking about those failures systematically.
Starting with basic retrieval
This is where Basic RAG (Pattern 6 from the book) comes in. The foundation is simple: before the LLM generates anything, retrieve relevant information from your own data and include it as context.
For our support agent, the retrieval corpus is the product catalog (names, descriptions, prices, specs, stock status), store policies (returns, shipping, warranties), FAQs with verified answers, and customer-specific order history scoped per session.
The process has three steps.

Chunking splits your source documents into pieces. For product data, a natural boundary is one product per chunk: name, price, description, specs, and stock status together. For policy documents, I chunk by section and keep headers for context.

Embedding converts each chunk into a vector that captures its meaning. You run each chunk through an embedding model like text-embedding-004 and store the vectors in a search index.

Retrieval takes the customer’s query, embeds it with the same model, and finds the nearest chunks by vector similarity. Those chunks become the generation context.
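The three steps can be sketched end to end in a few lines. This is a toy illustration, not production code: the embedding here is a plain bag-of-words vector over a shared vocabulary, standing in for a real embedding model such as text-embedding-004, and the product chunks are made up.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


# Chunking: one product per chunk, all fields kept together.
chunks = [
    "Product: Junior Engineer Building Blocks\nPrice: $24.99\nAges: 4-6",
    "Product: Plush Dinosaur Friend\nPrice: $14.99\nAges: 3 and up",
]

# Toy "embedding": normalized bag-of-words over a shared vocabulary.
# A real system would call an embedding model here instead.
vocab = sorted({w for c in chunks for w in tokenize(c)})


def embed(text: str) -> list[float]:
    counts = Counter(tokenize(text))
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


# Indexing: embed every chunk once, up front.
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query: str, k: int = 1) -> list[str]:
    """Embed the query with the same model, return nearest chunks."""
    q = embed(query)
    ranked = sorted(
        index,
        key=lambda item: sum(a * b for a, b in zip(q, item[1])),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:k]]
```

The important structural point survives even in the toy version: chunks are embedded once at indexing time, queries are embedded with the same function, and nearest-neighbor similarity picks the generation context.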
Here is a basic RAG tool in ADK 2.0, pulling live data from WooCommerce:
from google.adk import Agent
from pydantic import BaseModel
from woocommerce import API as WooAPI
woo = WooAPI(
url="https://yourstore.com",
consumer_key="ck_xxx",
consumer_secret="cs_xxx",
version="wc/v3",
)
class RetrievedContext(BaseModel):
chunks: list[str]
sources: list[str]
relevance_scores: list[float]
def retrieve_product_info(query: str) -> RetrievedContext:
"""Search WooCommerce products and return relevant context."""
products = woo.get(
"products", params={"search": query, "per_page": 10}
).json()
chunks, sources = [], []
for p in products:
chunk = f"Product: {p['name']}\nPrice: ${p['price']}\n"
chunk += f"Description: {p['short_description']}\n"
        # wc/v3 exposes stock via `stock_status`, not an `in_stock` boolean
        chunk += f"Stock: {'In stock' if p['stock_status'] == 'instock' else 'Out of stock'}"
chunks.append(chunk)
sources.append(p["permalink"])
return RetrievedContext(
chunks=chunks,
sources=sources,
relevance_scores=[1.0] * len(chunks),
)
support_agent = Agent(
name="basic_support",
model="gemini-2.5-flash",
instruction="""Answer the customer question using ONLY the provided
product context. If the answer is not in the context, say
"I don't have that information" and offer to connect them
with a human agent. Never invent prices or availability.""",
tools=[retrieve_product_info],
)
This already kills the worst hallucinations. The agent can only reference products that actually exist in the catalog, at their current prices, with their real stock status. But I wanted to do better.
Making search understand what customers actually mean
Keyword search breaks fast in e-commerce, something I learned early. A customer searching for “STEM toys for 5 year olds” will not match a product named “Junior Engineer Building Blocks Ages 4-6” through keywords alone. The gap between how customers describe what they want and how products are named is enormous.
Semantic indexing (Pattern 7) fixes this by embedding both queries and products into a shared vector space, matching on meaning instead of keywords. But e-commerce data has quirks that make this harder than it sounds.
Product specs live in tables: dimensions, weight, age ranges, materials. Normal chunking strategies designed for paragraphs break table structure apart. What I found works is serializing table rows into sentences during indexing. “The Junior Engineer set weighs 1.2 kg, is recommended for ages 4-6, and contains 127 pieces” embeds much better than raw HTML table markup.
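That serialization step is a small helper. The spec field names and sentence templates below are made up for illustration; adapt them to whatever columns your catalog actually has.

```python
# Hypothetical spec fields; map each to a sentence fragment.
TEMPLATES = {
    "weight": "weighs {}",
    "age_range": "is recommended for ages {}",
    "piece_count": "contains {} pieces",
}


def serialize_specs(name: str, specs: dict[str, str]) -> str:
    """Render one spec-table row as prose before embedding it."""
    clauses = [TEMPLATES[k].format(v) for k, v in specs.items() if k in TEMPLATES]
    if len(clauses) > 1:
        clauses[-1] = "and " + clauses[-1]
    return f"The {name} " + ", ".join(clauses) + "."


sentence = serialize_specs(
    "Junior Engineer set",
    {"weight": "1.2 kg", "age_range": "4-6", "piece_count": "127"},
)
```

Each table row becomes one sentence appended to the product chunk, so the value stays attached to its label instead of floating loose in markup.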
Then there is domain jargon. Educational toy retail has its own language. “Montessori-aligned” and “open-ended play” mean specific things to parents who use these terms. Your embedding model needs to capture these relationships. Fine-tuned embeddings or curated metadata fields added to chunk text before embedding help here.
Synonyms are another headache. Customers say “building blocks,” but our catalog says “construction set.” They ask about “robot kits,” we stock “programmable robotics.” Including category hierarchies and related terms in the chunk text helps the embedding model connect these without maintaining explicit synonym dictionaries.
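Enriching chunk text before embedding is mechanical. A minimal sketch, with made-up category and term lists:

```python
def enrich_chunk(chunk: str, categories: list[str], related_terms: list[str]) -> str:
    """Append the category path and related terms before embedding, so a
    query for "building blocks" can land on a "construction set"."""
    lines = [chunk]
    if categories:
        lines.append("Categories: " + " > ".join(categories))
    if related_terms:
        lines.append("Also known as: " + ", ".join(related_terms))
    return "\n".join(lines)
```

The related terms never reach the customer; they exist only to pull the chunk’s vector closer to the vocabulary customers actually use.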
And then there are numbers. Embeddings are genuinely bad at numerical comparison. “Toys under $30” is a filter, not a semantic query. My approach is hybrid: use semantic search for concept matching, then apply structured filters for numerical constraints. WooCommerce’s API supports price range filters natively, so I handle this at the retrieval layer instead of hoping the embedding model can do arithmetic.
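A sketch of that split, assuming regex extraction is good enough for the common phrasings (a production system might use the LLM itself to extract constraints). WooCommerce’s products endpoint accepts min_price and max_price directly, so the numeric part never touches the embedding model:

```python
import re


def build_search_params(query: str) -> dict[str, str]:
    """Split a query into a semantic part and structured price filters.
    'STEM toys under $30' becomes search='STEM toys' plus max_price=30."""
    params: dict[str, str] = {"per_page": "10"}

    m = re.search(r"under\s*\$?(\d+)", query, re.IGNORECASE)
    if m:
        params["max_price"] = m.group(1)
        query = query[: m.start()] + query[m.end():]

    m = re.search(r"over\s*\$?(\d+)", query, re.IGNORECASE)
    if m:
        params["min_price"] = m.group(1)
        query = query[: m.start()] + query[m.end():]

    params["search"] = query.strip()
    return params
```

The returned dict plugs straight into the `params` argument of the `woo.get("products", ...)` call from the retrieval tool above.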
The core insight I keep coming back to: indexing quality determines retrieval quality, which determines generation quality. Time spent on good chunk construction pays off at every downstream step.
Cleaning up retrieval results
What surprised me was how noisy raw retrieval results are. You get duplicate products because the same item appears in multiple categories, chunks that are only partly relevant because one paragraph in a policy document matters but the rest does not, and sometimes irrelevant results that barely cleared the similarity threshold.
Node Postprocessing (Pattern 10) sits between retrieval and generation. It cleans the context before the LLM ever sees it.
Reranking is the big one. Initial retrieval is fast but approximate (think HNSW-based vector search). A reranker, usually a cross-encoder model, takes each query-chunk pair and produces a more accurate relevance score. It is slower, but your top results get significantly better.

Entity resolution matters for multi-turn support. When a customer says “that toy” or “the one I asked about earlier,” you need to map those references to specific products, which means maintaining conversation state and resolving vague references back to product IDs.
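Entity resolution can start very simply. This is a deliberately naive sketch: it tracks only the single most recent product by name and matches a fixed list of vague phrases, where a production version would track WooCommerce product IDs per session and likely use the LLM to resolve references.

```python
class ConversationState:
    """Remember the last product discussed; rewrite vague references."""

    VAGUE_REFS = ("that toy", "that one", "the one i asked about")

    def __init__(self) -> None:
        self.last_product: str | None = None

    def note_product(self, name: str) -> None:
        self.last_product = name

    def resolve(self, query: str) -> str:
        """Replace vague references with the remembered product name."""
        if not self.last_product:
            return query
        resolved = query
        for ref in self.VAGUE_REFS:
            idx = resolved.lower().find(ref)
            if idx != -1:
                resolved = resolved[:idx] + self.last_product + resolved[idx + len(ref):]
        return resolved
```

Running `resolve` on the query before retrieval means the embedding model sees “Junior Engineer Building Blocks” instead of “that toy,” which is the difference between a hit and a miss.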
Contextual compression is worth doing too. A retrieved policy document might be 500 tokens, but only one sentence actually answers the question. Compression extracts just the relevant part, reducing noise and leaving room for other useful context in the prompt. And deduplication: the same product appearing in “New Arrivals,” “STEM Toys,” and “Ages 4-6” categories will produce three near-identical chunks. Deduplicate by product ID.
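Compression can be prototyped with nothing more than sentence scoring. The sketch below keeps the sentences with the most word overlap against the query; a production system would use an LLM extractor or a cross-encoder instead of raw overlap counts.

```python
import re


def compress(chunk: str, query: str, keep: int = 2) -> str:
    """Keep only the `keep` sentences most relevant to the query."""
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))

    def score(sentence: str) -> int:
        return len(query_words & set(re.findall(r"[a-z0-9]+", sentence.lower())))

    sentences = re.split(r"(?<=[.!?])\s+", chunk)
    top = set(sorted(sentences, key=score, reverse=True)[:keep])
    # Emit survivors in their original order, not score order.
    return " ".join(s for s in sentences if s in top)
```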
Here is a postprocessing step in the ADK workflow:
def postprocess_chunks(retrieved: RetrievedContext) -> RetrievedContext:
    """Rerank and filter retrieved chunks."""
    keep: list[int] = []
    seen_names = set()
    for i, chunk in enumerate(retrieved.chunks):
        product_name = chunk.split("\n")[0]
        if (
            product_name not in seen_names
            and retrieved.relevance_scores[i] > 0.5
        ):
            keep.append(i)
            seen_names.add(product_name)
    keep = keep[:5]
    # Slice all three lists by the same surviving indices so chunks,
    # sources, and scores stay aligned after filtering.
    return RetrievedContext(
        chunks=[retrieved.chunks[i] for i in keep],
        sources=[retrieved.sources[i] for i in keep],
        relevance_scores=[retrieved.relevance_scores[i] for i in keep],
    )
In production you would replace the name-based deduplication with proper product ID matching and use a cross-encoder reranker instead of the threshold filter. But the structure matters. Postprocessing is a separate, testable step, not something you hope the LLM figures out on its own.
Making uncertainty visible
The last pattern in this pipeline, Trustworthy Generation (Pattern 11), addresses something I think gets overlooked. There is a gap between “the model used the right data” and “the customer knows the model used the right data.” Even a perfectly grounded answer feels unreliable if there is no way to verify it.
Citations are the simplest version of this. Every factual claim should link to its source. “This set is recommended for ages 4-6 (product page)” is something the customer can check. “This set is for younger kids” is not. I tell the model to cite source URLs from the retrieved context for every factual statement.
Confidence detection is more subtle. Not every query has a clear answer in your data. When the model hedges (“I think,” “it might be,” “I’m not sure”), that is a signal to escalate rather than guess. You can detect this in code and route uncertain responses to human agents.
Out-of-domain detection matters too. A customer asking your toy store agent about mortgage rates is obviously off-topic. But what about “is this toy safe for children with latex allergies?” That is close to your domain but probably not in your product data. The agent needs to know the boundary between “I can answer this” and “I need to hand this off.”
There is also Corrective RAG, or CRAG. This adds a check after retrieval. A lightweight classifier evaluates whether the retrieved documents actually match the query. If relevance is low, the system can reformulate and retry, fall back to web search, or acknowledge the gap and escalate.
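The shape of that check is easy to sketch. Here the relevance scorer is plain word overlap, standing in for the trained relevance classifier a real CRAG setup would use; the threshold of 0.3 is an arbitrary placeholder you would tune.

```python
import re


def relevance(query: str, chunk: str) -> float:
    """Fraction of query words present in the chunk. A crude stand-in
    for a trained relevance classifier."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    c = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    return len(q & c) / (len(q) or 1)


def crag_route(query: str, chunks: list[str], threshold: float = 0.3) -> str:
    """Answer from retrieval if anything clears the bar, otherwise fall
    back (reformulate and retry, web search, or human escalation)."""
    if chunks and max(relevance(query, c) for c in chunks) >= threshold:
        return "answer"
    return "fallback"
```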
Here is the confidence check as an ADK workflow node:
from google.adk import Event
def confidence_check(response: str):
"""Route low-confidence responses to human escalation."""
low_confidence_phrases = [
"I don't have",
"I'm not sure",
"I cannot confirm",
]
is_confident = not any(
phrase in response for phrase in low_confidence_phrases
)
if is_confident:
return Event(output=response, message=response)
return Event(
message=f"{response}\n\nWould you like me to connect you with a team member?",
state={"needs_escalation": True},
)
What I like about this approach is that it turns the support agent from a black box into a system with visible failure modes. When it does not know something, it says so and offers a way forward. That honesty is worth more than a hundred correct answers.
The full pipeline
Each pattern handles a specific failure mode. Combined, they form a pipeline that retrieves, cleans, generates, and verifies, all as separate testable nodes in an ADK 2.0 Workflow.
from google.adk import Agent, Workflow, Event
from pydantic import BaseModel
from woocommerce import API as WooAPI
woo = WooAPI(
url="https://yourstore.com",
consumer_key="ck_xxx",
consumer_secret="cs_xxx",
version="wc/v3",
)
class RetrievedContext(BaseModel):
chunks: list[str]
sources: list[str]
relevance_scores: list[float]
def retrieve_product_info(query: str) -> RetrievedContext:
"""Search WooCommerce products and return relevant context."""
products = woo.get(
"products", params={"search": query, "per_page": 10}
).json()
chunks, sources = [], []
for p in products:
chunk = f"Product: {p['name']}\nPrice: ${p['price']}\n"
chunk += f"Description: {p['short_description']}\n"
        # wc/v3 exposes stock via `stock_status`, not an `in_stock` boolean
        chunk += f"Stock: {'In stock' if p['stock_status'] == 'instock' else 'Out of stock'}"
chunks.append(chunk)
sources.append(p["permalink"])
return RetrievedContext(
chunks=chunks,
sources=sources,
relevance_scores=[1.0] * len(chunks),
)
def postprocess_chunks(node_input: RetrievedContext) -> RetrievedContext:
    """Rerank and filter. Remove duplicates, resolve entities."""
    keep: list[int] = []
    seen_names = set()
    for i, chunk in enumerate(node_input.chunks):
        product_name = chunk.split("\n")[0]
        if (
            product_name not in seen_names
            and node_input.relevance_scores[i] > 0.5
        ):
            keep.append(i)
            seen_names.add(product_name)
    keep = keep[:5]
    # Slice all three lists by the same surviving indices so chunks,
    # sources, and scores stay aligned after filtering.
    return RetrievedContext(
        chunks=[node_input.chunks[i] for i in keep],
        sources=[node_input.sources[i] for i in keep],
        relevance_scores=[node_input.relevance_scores[i] for i in keep],
    )
support_agent = Agent(
name="grounded_support",
model="gemini-2.5-flash",
instruction="""Answer the customer question using ONLY the provided
product context. If the answer is not in the context, say
"I don't have that information" and offer to connect them
with a human agent. Always cite your sources with product
links. Never invent prices, availability, or policy details.""",
output_schema=str,
mode="task",
)
def confidence_check(node_input: str):
"""Check for hedging language. Escalate if uncertain."""
low_confidence = ["I don't have", "I'm not sure", "I cannot confirm"]
is_confident = not any(phrase in node_input for phrase in low_confidence)
if is_confident:
return Event(output=node_input, message=node_input)
return Event(
message=f"{node_input}\n\nWould you like me to connect you with a team member?",
state={"needs_escalation": True},
)
root_agent = Workflow(
name="support_rag_pipeline",
edges=[
(
"START",
retrieve_product_info,
postprocess_chunks,
support_agent,
confidence_check,
),
],
)
The flow reads top to bottom: query comes in, products are fetched from WooCommerce, chunks are deduplicated and filtered, the agent writes a grounded response, and the confidence check decides whether to deliver the answer or escalate.
Each node can be tested on its own, and that composability is what makes the approach practical. Need to add policy document retrieval? Add another retrieval node before postprocessing. Want to test a different reranker? Swap the postprocessing node. Need multilingual support? Add a translation node after the confidence check. The graph structure makes the pipeline extensible without rewriting what already works.
I keep coming back to one thing: start with prices, stock status, and policies. These are the claims that cause real damage when they are wrong. You can always improve retrieval quality later, but getting the factual foundation right is what keeps your customers’ trust intact.
The next post covers style transfer and template generation for product descriptions.
Thanks for reading.