structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. In practice, this intuitive approach leads to surprisingly poor performance. Modern embedding models are typically based on the BERT architecture, essentially the encoder half of a Transformer, and are trained on large corpora of unstructured text with the primary goal of capturing semantic meaning. They deliver excellent retrieval performance on natural language, but JSON is not natural language: even though embedding JSON directly may look simple and elegant, a generic embedding model applied to raw JSON objects falls far short of peak performance.
Deep dive
Tokenization
The first step is tokenization, which splits the text into tokens, generally subword units. Most modern embedding models use Byte-Pair Encoding (BPE) or WordPiece tokenization. These algorithms are optimized for natural language, breaking words into common sub-components. When a tokenizer encounters raw JSON, it struggles with the high frequency of non-alphanumeric characters. For example, "usd": 10, is not viewed as a key-value pair; instead, it is fragmented:
- The quotes ("), colon (:), and comma (,) become standalone structural tokens
- usd and 10 become isolated tokens, stripped of their relationship to each other
This creates a low signal-to-noise ratio. In natural language, almost all tokens contribute to the semantic “signal”, while in JSON (and other structured formats) a significant share of tokens is “wasted” on structural syntax that carries zero semantic value.
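A crude way to see this fragmentation without loading a real tokenizer is to split on word/punctuation boundaries. This is only a rough approximation of BPE behavior, not the actual model tokenizer, but the structural tokens show up the same way:

```python
import re

def rough_tokenize(text):
    # Approximate a subword tokenizer: words and standalone punctuation marks.
    # Real BPE splits differently, but punctuation still becomes separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(rough_tokenize('"usd": 10,'))
# ['"', 'usd', '"', ':', '10', ','] - only 2 of the 6 tokens carry meaning
```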
Attention calculation
The core power of Transformers lies in the attention mechanism, which allows the model to weigh the importance of each token relative to the others.
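As a toy illustration, a single attention head without learned projections can be sketched in a few lines (simplified for clarity; real models apply learned query/key/value matrices per head):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Numerically stable softmax over each row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Each "token" attends to every token (including itself) and mixes their values.
tokens = np.eye(3)  # three toy token vectors
out = attention(tokens, tokens, tokens)
print(out.shape)  # (3, 3)
```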
In the sentence The price is 10 US dollars or 9 euros, attention can easily link the value 10 to the concept price because these relationships are well-represented in the model’s pre-training data and the model has seen this linguistic pattern millions of times. On the other hand, in the raw JSON:
"price": {
    "usd": 10,
    "eur": 9
}
the model encounters structural syntax it was not primarily optimized to “read”. Without the linguistic connector, the resulting vector will fail to capture the true intent of the data, as the relationships between the key and the value are obscured by the format itself.
Mean Pooling
The final step in generating a single embedding representation of the document is mean pooling. Mathematically, the final embedding E is the centroid of the N token vectors e_1, e_2, …, e_N in the document:

E = (e_1 + e_2 + … + e_N) / N
This is where the JSON tokens become a mathematical liability. If 25% of the tokens in the document are structural markers (braces, quotes, colons), the final vector is heavily influenced by the “meaning” of punctuation. As a result, the vector is effectively “pulled” away from its true semantic center in the vector space by these noise tokens. When a user submits a natural language query, the distance between the “clean” query vector and “noisy” JSON vector increases, directly hurting the retrieval metrics.
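A toy example with made-up 2-D vectors shows how structural “noise” tokens pull the pooled centroid away from the content, reducing similarity to a clean query:

```python
import numpy as np

# Toy illustration (made-up vectors): mean pooling with and without
# structural "noise" tokens pulling the centroid away from the content.
content = np.array([[1.0, 0.0], [0.8, 0.2]])  # semantic tokens
noise = np.array([[0.0, 1.0], [0.1, 0.9]])    # braces, quotes, colons

clean = content.mean(axis=0)
noisy = np.vstack([content, noise]).mean(axis=0)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 0.0])  # a "clean" natural-language query vector
print(cos(query, clean) > cos(query, noisy))  # True: noise lowers similarity
```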
Flatten it
Now that we understand these limitations of JSON, we need to figure out how to resolve them. The most straightforward approach is to flatten the JSON and convert it into natural language.
Let’s consider the typical product object:
{
    "skuId": "123",
    "description": "This is a test product used for demonstration purposes",
    "quantity": 5,
    "price": {
        "usd": 10,
        "eur": 9
    },
    "availableDiscounts": ["1", "2", "3"],
    "giftCardAvailable": "true",
    "category": "demo product"
    ...
}
This is a simple object with attributes like description, price, and so on. Let’s apply tokenization to it and see how it looks:

Now, let’s convert it into text to make the embedding model’s job easier. To do that, we can define a template and substitute the JSON values into it. For example, this template could be used to describe the product:
Product with SKU {skuId} belongs to the category "{category}"
Description: {description}
It has a quantity of {quantity} available
The price is {price.usd} US dollars or {price.eur} euros
Available discount ids include {availableDiscounts as comma-separated list}
Gift cards are {giftCardAvailable ? "available" : "not available"} for this product
So the final result will look like:
Product with SKU 123 belongs to the category "demo product"
Description: This is a test product used for demonstration purposes
It has a quantity of 5 available
The price is 10 US dollars or 9 euros
Available discount ids include 1, 2, and 3
Gift cards are available for this product
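The template substitution above can be sketched as a small Python function. The function name and the discount/gift-card formatting helpers are illustrative, not part of any library:

```python
def flatten(p):
    # Substitute JSON values into the natural-language template.
    # Join discount ids with an Oxford comma: "1, 2, and 3".
    discounts = ", ".join(p["availableDiscounts"][:-1]) + ", and " + p["availableDiscounts"][-1]
    gift = "available" if p["giftCardAvailable"] == "true" else "not available"
    return (
        f'Product with SKU {p["skuId"]} belongs to the category "{p["category"]}"\n'
        f'Description: {p["description"]}\n'
        f'It has a quantity of {p["quantity"]} available\n'
        f'The price is {p["price"]["usd"]} US dollars or {p["price"]["eur"]} euros\n'
        f'Available discount ids include {discounts}\n'
        f'Gift cards are {gift} for this product'
    )

product = {
    "skuId": "123",
    "description": "This is a test product used for demonstration purposes",
    "quantity": 5,
    "price": {"usd": 10, "eur": 9},
    "availableDiscounts": ["1", "2", "3"],
    "giftCardAvailable": "true",
    "category": "demo product",
}
print(flatten(product))
```

Note how every value now sits next to a linguistic connector ("belongs to", "The price is"), which is exactly the pattern the embedding model saw during pre-training.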
And apply the tokenizer to it:

Not only does it have about 14% fewer tokens, but it is also a much clearer form that preserves the semantic meaning and required context.
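The 14% figure comes from the real model tokenizer; as a rough sanity check, even the crude word/punctuation split used earlier shows the flattened form needs fewer tokens (this proxy will not reproduce the exact BPE counts):

```python
import re

json_text = ('{"skuId": "123", "description": "This is a test product used for '
             'demonstration purposes", "quantity": 5, "price": {"usd": 10, "eur": 9}, '
             '"availableDiscounts": ["1", "2", "3"], "giftCardAvailable": "true", '
             '"category": "demo product"}')

flat_text = '''Product with SKU 123 belongs to the category "demo product"
Description: This is a test product used for demonstration purposes
It has a quantity of 5 available
The price is 10 US dollars or 9 euros
Available discount ids include 1, 2, and 3
Gift cards are available for this product'''

def rough_tokens(text):
    # Crude stand-in for a BPE tokenizer: words and individual punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

print(len(rough_tokens(json_text)), len(rough_tokens(flat_text)))
```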
Let’s measure the results
Note: Complete, reproducible code for this experiment is available in the Google Colab notebook
Now let’s measure retrieval performance for both options. To keep it simple, we will focus on standard retrieval metrics (Recall@k, Precision@k, and MRR), using a generic embedding model (all-MiniLM-L6-v2) and the Amazon ESCI dataset with 5,000 random queries and the 3,809 associated products.
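These metrics are standard; for reference, they can be computed with a few lines (a sketch, not the notebook’s exact code):

```python
def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved ids that are relevant.
    return len(set(retrieved[:k]) & set(relevant)) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant ids that appear in the top-k.
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(retrieved, relevant):
    # Reciprocal rank of the first relevant id; 0 if none is retrieved.
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["p2", "p7", "p1"]  # ranked ids returned by the index
relevant = {"p1", "p9"}         # ground-truth relevant ids for the query
print(precision_at_k(retrieved, relevant, 3))  # 1/3
print(recall_at_k(retrieved, relevant, 3))     # 0.5
print(mrr(retrieved, relevant))                # first hit at rank 3 -> 1/3
```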
all-MiniLM-L6-v2 is a popular choice: small (22.7M parameters) yet fast and accurate, which makes it a good fit for this experiment.
For the dataset, a version of Amazon ESCI is used, specifically milistu/amazon-esci-data, which is available on Hugging Face and contains a collection of Amazon products and search-query data.
The flattening function used for text conversion is:
def flatten_product(product):
    return (
        f"Product {product['product_title']} from brand {product['product_brand']}"
        f" and product id {product['product_id']}"
        f" and description {product['product_description']}"
    )
A sample of the raw JSON data is:
{
    "product_id": "B07NKPWJMG",
    "product_title": "RoWood 3D Puzzles for Adults, Wooden Mechanical Gear Kits for Teens Kids Age 14+",
    "product_description": " Specifications
        Model Number: Rowood Treasure box LK502
        Average build time: 5 hours
        Total Pieces: 123
        Model weight: 0.69 kg
        Box weight: 0.74 KG
        Assembled size: 100*124*85 mm
        Box size: 320*235*39 mm
        Certificates: EN71,-1,-2,-3,ASTMF963
        Recommended Age Range: 14+
        Contents
        Plywood sheets
        Metal Spring
        Illustrated instructions
        Accessories
        MADE FOR ASSEMBLY
        -Follow the instructions provided in the booklet and assembly 3d puzzle with some exciting and engaging fun. Fell the pride of self creation getting this exquisite wooden work like a pro.
        GLORIFY YOUR LIVING SPACE
        -Revive the enigmatic charm and cheer your parties and get-togethers with an experience that is unique and interesting.
    ",
    "product_brand": "RoWood",
    "product_color": "Treasure Box"
}
For the vector search, two FAISS indexes are created: one for the flattened text and one for the JSON-formatted text. Both indexes are flat, meaning they compare distances against every stored entry (exhaustive search) instead of using an Approximate Nearest Neighbour (ANN) structure. This is important to ensure that the retrieval metrics are not affected by ANN approximation error.
D = 384
index_json = faiss.IndexFlatIP(D)
index_flatten = faiss.IndexFlatIP(D)
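Since a flat inner-product index performs an exhaustive scan, its scoring step is equivalent to a plain matrix-vector product over normalized vectors. A NumPy sketch with random stand-in vectors (not real embeddings) shows the mechanics:

```python
import numpy as np

D = 384
rng = np.random.default_rng(0)
# Toy corpus: 10 unit-normalized "document" vectors standing in for embeddings.
docs = rng.normal(size=(10, D)).astype("float32")
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# A query close to document 3 (small random perturbation, then renormalize).
query = docs[3] + 0.01 * rng.normal(size=D).astype("float32")
query /= np.linalg.norm(query)

# IndexFlatIP performs exactly this exhaustive inner-product scan.
scores = docs @ query
top = np.argsort(-scores)[:3]
print(int(top[0]))  # 3 - the perturbed source document ranks first
```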
To reduce the dataset, 5,000 random queries were selected, and all corresponding products were embedded and added to the indexes. The collected metrics are as follows:

Retrieval metrics for the all-MiniLM-L6-v2 embedding model on the Amazon ESCI dataset. The flattened approach consistently yields higher scores across all key retrieval metrics (Precision@10, Recall@10, and MRR). Image by author

And the performance change of the flattened version is:

The analysis confirms that embedding raw structured data into a generic vector space is a suboptimal approach: adding a simple preprocessing step that flattens structured data into natural language consistently delivers a significant improvement in retrieval metrics (boosting Recall@k and Precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is essential for achieving peak performance in semantic retrieval.
References
[1] Full experiment code https://colab.research.google.com/drive/1dTgt6xwmA6CeIKE38lf2cZVahaJNbQB1?usp=sharing
[2] Model https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
[3] Amazon ESCI dataset. Specific version used: https://huggingface.co/datasets/milistu/amazon-esci-data
The original dataset available at https://www.amazon.science/code-and-datasets/shopping-queries-dataset-a-large-scale-esci-benchmark-for-improving-product-search
[4] FAISS https://ai.meta.com/tools/faiss/


