Building a Retail Knowledge Graph: Replacing Your SQL Analyst with Natural Language Queries

When people talk about giving non-technical teams access to their data through natural language, they almost always reach for the same solution: embed everything into a vector store and let an LLM do semantic search.

That works fine for unstructured content - documents, notes, support tickets. But retail operational data is not unstructured. It's deeply, intentionally relational. Products belong to categories, categories belong to departments, suppliers provide products, products live in stores, stores have inventory levels that change hourly. The relationships are the data. Flatten everything into embeddings and you destroy exactly what makes it queryable.

This is why, for a retail client managing a multi-store operation across KSA, we built a knowledge graph instead.

Why a Graph, Not a Vector Store

The question worth answering upfront: if you already have a SQL database, why not just put a natural language layer on top of that?

You can. Text-to-SQL is a real approach and it works reasonably well for simple queries. The problem shows up when business users ask questions that require traversing multiple relationships - and in retail, that's almost every interesting question.

"Which suppliers are providing products that are consistently understocked in our top-performing stores?"

In SQL, that's four or five JOINs across the products, inventory, stores, suppliers, and sales tables, with subqueries to calculate store performance and define "consistently understocked." The correct query is non-trivial to write and fragile to maintain as the schema evolves. More importantly, it's nearly impossible to generate reliably from natural language, because the LLM has to infer the full join path from schema documentation alone.

In a knowledge graph, that same question is a traversal. You start at the stores node, follow the performance edges to find the top performers, traverse to their inventory nodes, filter for understock conditions, and follow back to suppliers. The structure of the query mirrors the structure of the question. That alignment is what makes natural language prompting reliable.

Graph databases also handle what's called multi-hop traversal naturally - "find me all products two supplier-hops away from a supplier that had a logistics issue last month." Try expressing that as a SQL query and see what you get.

Modeling Retail Data as a Graph

Before writing a line of Python, we spent time with the client mapping their domain into nodes and relationships. This is the step most teams rush, and it's where most knowledge graph projects go wrong.

For this client, the core entity types - nodes in Neo4j terms - were:

Product (SKU, name, unit cost, category ID)
Category (name, department)
Supplier (name, region, lead time, reliability score)
Store (name, location, tier)
InventorySnapshot (quantity on hand, reorder threshold, date)
Transaction (quantity sold, date, channel)

The relationships between them carried as much meaning as the nodes themselves:

(Supplier)-[:SUPPLIES {since, contract_type}]->(Product)
(Product)-[:BELONGS_TO]->(Category)
(Store)-[:STOCKS]->(Product)
(Store)-[:HAS_INVENTORY {date}]->(InventorySnapshot)
(InventorySnapshot)-[:FOR_PRODUCT]->(Product)
(Transaction)-[:AT_STORE]->(Store)
(Transaction)-[:FOR_PRODUCT]->(Product)

A few things worth noting here. First, relationships in Neo4j can carry properties - the SUPPLIES relationship holds since and contract_type, which lets you ask "which products are we sourcing from suppliers whose contracts expire in the next 90 days?" without those attributes living on either the supplier or product node. Second, we modeled inventory as a snapshot node rather than a property on the product, because inventory changes constantly. Snapshots let you query historical states without destroying current data.

Getting this model right upfront saved significant pain later. Every time we changed the schema after data ingestion started, it meant a migration.

The Stack: Neo4j, LangChain, Gemini

Once the graph was designed and loaded, the integration layer used three components:

Neo4j as the graph database. We used Neo4j AuraDB (their managed cloud service) for this project, which removed infrastructure overhead and gave us the Cypher query interface we needed.

LangChain for the orchestration layer, specifically the Neo4jGraph connector and GraphCypherQAChain. This chain handles the core loop: take a natural language question, generate a Cypher query, execute it against the graph, and synthesize the results into a natural language answer.

Gemini (specifically Gemini 1.5 Pro) as the LLM doing the Cypher generation and answer synthesis. The choice here was partly client-driven - they were already on Google Cloud - but Gemini's long context window was genuinely useful for injecting the full graph schema into the system prompt without worrying about truncation.

Getting Cypher Generation Right

This is where the real engineering work lives, and where most implementations fall apart.

Out of the box, GraphCypherQAChain works like this: it injects your graph schema into a prompt template, appends the user's question, and asks the LLM to generate a Cypher query. That generated query runs against Neo4j, and the results come back as the basis for the answer.

The first problem we hit: Gemini would generate syntactically valid Cypher that referenced relationship types that didn't exist in our schema. [:SUPPLIED_BY] instead of [:SUPPLIES], [:PART_OF] instead of [:BELONGS_TO]. These queries would execute without error but return empty results, which the LLM would then confidently interpret as "no data found." Silently wrong is worse than loudly broken.

The fix was a three-part prompt engineering approach.

First, explicit schema injection. LangChain's default schema extraction is sometimes too abstract - it lists node labels and relationship types but doesn't show the LLM how they connect directionally. We replaced the default with a hand-written schema block that showed full path patterns:

Graph Schema:
Node labels: Product, Category, Supplier, Store, InventorySnapshot, Transaction

Relationship patterns (use these EXACTLY):
(:Supplier)-[:SUPPLIES]->(:Product)
(:Product)-[:BELONGS_TO]->(:Category)
(:Store)-[:STOCKS]->(:Product)
(:Store)-[:HAS_INVENTORY]->(:InventorySnapshot)
(:InventorySnapshot)-[:FOR_PRODUCT]->(:Product)
(:Transaction)-[:AT_STORE]->(:Store)
(:Transaction)-[:FOR_PRODUCT]->(:Product)

CRITICAL: Only use relationship types listed above. Do not invent new relationship names.

Second, few-shot examples. We included four worked examples in the system prompt - a natural language question paired with the correct Cypher query. These examples covered the query patterns that showed up most often: filtering by node property, aggregating transaction data, and multi-hop traversals. Gemini's Cypher accuracy went from around 60% to above 90% after this change. Few-shot prompting consistently outperforms zero-shot for structured query generation tasks, regardless of model.

Third, a validation layer before execution. Before any generated Cypher query hits the database, it runs through a lightweight validator that checks: are all relationship types in the allowed set? Does the query contain a RETURN clause? Is the pattern structure valid? If not, we re-prompt once with the failed query and the validation error appended. This gave us a safety net for the cases where even good prompting produced something off.

Handling Ambiguous Questions

One thing that doesn't get discussed enough in text-to-query systems: what happens when the question is genuinely ambiguous?

"Show me the slow-moving products" sounds specific. But slow-moving relative to what - the category average, the store's historical average, a fixed velocity threshold? The graph can answer any of these, but it needs to know which one you mean.

We handled this with a clarification layer upstream of the Cypher generation. If the incoming query contains terms that map to multiple interpretations in our domain vocabulary ("slow-moving", "top products", "recent", "underperforming"), the system asks one targeted clarifying question before generating anything. This felt like friction when we first designed it, but the client's operations team preferred one clarifying question to one confidently wrong answer.

What the Client Actually Got

The business outcome was straightforward: their operations manager could open a dashboard, type a question, and get a real answer - sourced from live graph data - in under three seconds. No SQL, no waiting for the analytics team, no scheduled reports.

The queries they ran most often were exactly the multi-hop type that SQL would have made painful: supplier-to-stock coverage questions, category-level performance by store tier, reorder priority lists weighted by supplier lead time. These had previously required analyst involvement or just didn't get asked at all.

The analyst didn't disappear. They moved from writing queries to maintaining the graph model, tuning the prompt layer, and handling the genuinely complex analyses that require judgment rather than data retrieval. That's a better use of someone with that skill set.

What to Know Before You Build This

A knowledge graph is not always the right call. If your questions are mostly lookup queries - "what's the current stock of SKU X?" - a well-indexed SQL database with a thin natural language layer will serve you better and cost less to build. Graphs earn their complexity when your questions are about connections, paths, and multi-entity relationships.

The data modeling phase will take longer than you expect. Budget for it. Getting the entity and relationship schema right before ingestion is the difference between a system that answers questions well and one that answers them confidently wrong.

Cypher generation is the bottleneck, not graph performance. Neo4j at this scale is fast. The iteration work goes into the prompt layer - schema injection, few-shot examples, validation, and domain vocabulary disambiguation. Plan accordingly.

If you're evaluating whether a knowledge graph makes sense for your data - or whether text-to-SQL or a hybrid approach would serve you better - talk to us. We'll tell you which one fits, and why.