Insights

Your AI doesn’t have a thinking problem; it has a structure problem

Everyone’s obsessed with the power of large language models, but most AI systems aren’t failing because the models are weak. They’re failing because the data feeding them is a mess. One of the most critical, least talked about design decisions? How you chunk the information behind the scenes. Get that wrong, and even the smartest AI will serve up nonsense with confidence.

In the race to embed AI into platforms, workflows, and customer experiences, too many organisations overlook how the structure of their content shapes the performance of their models. This isn’t a back-end technical quirk. It’s a front-line issue that determines accuracy, relevance, and trust, especially in environments where evidence matters and decisions carry weight.

What is chunking, and why does it matter?

In a retrieval-augmented generation (RAG) system, the architecture powering many modern AI tools, the large language model doesn’t generate answers from scratch. It pulls relevant information from a vectorised knowledge base, built by splitting documents into chunks and converting those chunks into high-dimensional embeddings.

The chunk size (e.g. 8000 tokens vs 2000 or 1000) and overlap (how many tokens are repeated between chunks) directly impact what gets retrieved. Too large, and the relevant passage is buried in unrelated context. Too small, and meaning is fragmented across chunks.
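
To make that concrete, here is a minimal sketch of fixed-size chunking with overlap. It is illustrative only: the whitespace split stands in for a real tokeniser, and the file name is a placeholder.

```python
# Minimal sketch of fixed-size chunking with overlap (illustrative only).
# The whitespace split is a crude stand-in for a real tokeniser, and the
# file name is a placeholder.

def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `chunk_size` tokens over the document,
    repeating `overlap` tokens between consecutive chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

# Example: 2000-token chunks with a 250-token overlap.
document = open("policy_document.txt", encoding="utf-8").read()  # placeholder file
tokens = document.split()  # rough whitespace "tokens" for illustration
chunks = chunk_tokens(tokens, chunk_size=2000, overlap=250)
print(f"{len(tokens)} tokens -> {len(chunks)} chunks")
```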

It’s a bit like slicing a cake: cut it in the wrong place, and you miss the cherry on top. In AI terms, that cherry might be the one paragraph containing the clause, policy, or insight your system desperately needs to find.

What we tested and why it matters

To explore the real impact of chunk configuration on retrieval performance, we ran a controlled experiment using a FastAPI-based semantic search system across six long-form documents from government and research sources. These documents varied in length, structure, and topic, providing a realistic testbed for evaluating retrieval accuracy.

Each document was indexed and queried using three primary chunk sizes:

  • 8000 tokens (100%)
  • 2000 tokens (25%)
  • 1000 tokens (12.5%)

Rather than applying a fixed overlap across the board, the overlap was dynamically scaled based on chunk size to reflect realistic implementation practices:

  • Smaller chunks received proportionally larger overlaps (up to 25%) to preserve context and meaning
  • Larger chunks required less overlap (as low as 7.5%) to reduce redundancy and minimise memory load

This approach allowed us to test not just raw chunk size, but how contextual continuity affected semantic precision.
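
The full scaling logic isn’t reproduced here, but the sketch below captures the idea, assuming the ratios implied in this piece: roughly 25% overlap at 1000 tokens, 12.5% (250 tokens) at 2000, and 7.5% at 8000. The exact schedule used in the experiment may differ.

```python
# Assumed overlap schedule, inferred from the ratios described in the text.
# The real experiment may have used a different mapping.
OVERLAP_RATIOS = {1000: 0.25, 2000: 0.125, 8000: 0.075}

def overlap_for(chunk_size: int) -> int:
    """Overlap in tokens for a given chunk size under the assumed schedule."""
    return int(chunk_size * OVERLAP_RATIOS[chunk_size])

for size in (1000, 2000, 8000):
    print(f"{size}-token chunks -> {overlap_for(size)}-token overlap")
# 1000 -> 250, 2000 -> 250, 8000 -> 600
```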

We used cosine similarity as our benchmark, a standard metric that measures how semantically close each retrieved chunk is to the original query. Our goal was to identify the configurations that consistently delivered accurate, relevant results across a range of inputs.
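
For readers who want the mechanics: cosine similarity compares the direction of two embedding vectors rather than their magnitude, so a score close to 1.0 means the chunk and the query point the same way in embedding space. A minimal sketch, with the embedding model itself assumed rather than shown:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_chunk(query_vec: np.ndarray, chunk_vecs: list[np.ndarray]) -> tuple[int, float]:
    """Return the index and similarity score of the chunk closest to the query."""
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    top = int(np.argmax(scores))
    return top, scores[top]
```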

What we found confirmed a persistent challenge in AI pipeline design: poor chunking can break even the most advanced models.

When chunk size and overlap are poorly tuned, the downstream effects compound:

  • User trust is eroded when results are inconsistent, irrelevant, or partial, especially in high-stakes contexts like law, policy, health, or infrastructure
  • Performance suffers as the model struggles to find coherent units of meaning, increasing the likelihood of hallucinated completions or fragmented responses
  • Infrastructure costs escalate, with smaller or overlapping chunks inflating the total number of embeddings stored and retrieved, multiplying memory usage, storage, and compute time

These aren’t minor inefficiencies. In enterprise and public-sector settings, they undermine the credibility, scalability, and cost-effectiveness of AI deployments.

That’s why chunking isn’t a mere optimisation task; it’s a strategic design decision.

The sweet spot was at 2000 tokens

What we found was clear: 2000-token chunks with a 250-token overlap consistently performed best, offering the strongest balance of precision, stability, and semantic relevance.

  • Cosine similarity scores peaked at 0.8677, outperforming both larger chunks (which diluted relevance) and smaller ones (which introduced fragmentation)
  • 1000-token chunks plateaued early, offering no gain in accuracy despite increasing the volume of embeddings and the computational load
  • Overlap was most effective at moderate chunk sizes. At 2000 tokens, it added helpful context. But at smaller sizes, it merely introduced noise, degrading signal quality

These findings challenge a common assumption in AI design: that smaller chunks automatically improve precision. In reality, more data is not better if it’s poorly structured. When chunking is treated as an afterthought or left to defaults, it can quietly undermine retrieval accuracy, inflate costs, and reduce confidence in the system’s outputs.

This isn’t just an engineering concern. It’s a strategic lever. Getting it right means your AI retrieves exactly what matters, when it matters.

The cost of getting it wrong

When AI retrieves the wrong chunk or fails to retrieve the right one, the system doesn’t just stumble. It misleads, omits, or invents. In high-accountability settings like government, research, or regulation, these aren’t edge-case bugs. They’re trust failures.

Misaligned chunking leads to:

  • Missing critical context in legal, scientific, or procedural queries
  • Hallucinated answers, as the model tries to “fill in the blanks”
  • Poor user trust and slow adoption, even when the core model is sound
  • Escalating infrastructure costs, driven by bloated embeddings and excessive retrieval

To the user, it looks like your AI doesn’t work, or isn’t performing as efficiently or effectively as it should. In reality, your data pipeline is broken, and the model is only as good as the structure it retrieves from.

How to fix it

Optimising chunk size isn’t about chasing technical perfection; it’s about creating reliable, responsive systems that people trust. Fortunately, improving retrieval precision doesn’t require a full rebuild. It requires a smarter approach to structure.

  • Start with 2000-token chunks and a 250-token overlap; this configuration consistently hits the semantic sweet spot
  • Avoid overly large or small chunks. Large chunks dilute meaning. Small chunks fragment context and inflate system load
  • Explore semantic or sentence-based chunking, using NLP techniques to respect natural boundaries (a sketch follows this list)
  • Use hybrid scoring: combine sparse (BM25) and dense (vector) retrieval to improve relevance
  • Continuously evaluate using real documents, queries, and feedback loops
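
As a starting point for the sentence-based option above, here is a hedged sketch that packs whole sentences into chunks up to a word budget, so no chunk cuts a sentence in half. The regex splitter is a crude stand-in for a proper NLP sentence tokeniser (spaCy or NLTK, for example), and the word count is a rough proxy for tokens.

```python
import re

def sentence_chunks(text: str, max_words: int = 2000) -> list[str]:
    """Group consecutive sentences into chunks of at most `max_words` words,
    so chunk boundaries always fall between sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```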

Why this matters now

As AI moves from experimental to embedded, guiding decisions in justice, health, policy, climate, and beyond, the invisible layers beneath the surface matter more than ever.

The future of AI performance won’t be defined by bigger models alone. It will be shaped by the discipline, design, and data practices that support them.

Get chunking wrong, and the system will mislead with confidence. Get it right, and your AI can inform, support, and scale with clarity and trust.