Chapter 3 — Advanced Chunking Frameworks

Third post in the chapter-by-chapter walkthrough of LLM Primer III: Enhancing Enterprise AI with RAG. Chunking is where naive choices most quietly degrade everything that follows, and where two recent techniques have changed what is possible at the frontier.

Why this chapter exists

Once the documents are parsed, the next decision is also the most consequential: how to break them into pieces small enough to embed and large enough to still mean something on their own. This is chunking. A chunk that splits a definition from its qualifier will retrieve confidently and be wrong. A chunk that bundles five unrelated topics dilutes every embedding it touches. The retrieval system you build on top can only recover what the chunking step preserved, and the failure modes here are quiet — the retriever still returns candidates, the model still produces fluent answers, and only the user notices the answers are subtly wrong.

One line: chunking is fundamentally a labeling problem, not a cutting problem — a chunk is a unit of retrieval, and a unit of retrieval needs enough self-contained context to be findable.

3.1 The chunking spectrum

It helps to order the strategies by how much they know about the document. At one end, fixed-size chunking knows nothing — count tokens, cut. Fast, deterministic, and acceptable for stylistically uniform short text (chat transcripts, FAQ entries, customer reviews). On structured technical documents it is a quiet disaster. Recursive chunking applies a prioritized list of separators — paragraphs, then line breaks, then sentences, then words — splitting on the highest-priority boundary that fits under the target size. Nearly as cheap as fixed-size and substantially better. This is the right default for most teams.
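The separator cascade described above can be sketched in a few lines. This is a minimal illustration, not any particular library's splitter: the function name `recursive_chunk`, the separator list, and the use of character length as a stand-in for token count are all assumptions made for the example.

```python
from typing import List, Sequence

# Coarse-to-fine boundaries: paragraphs, line breaks, sentences, words.
SEPARATORS = ("\n\n", "\n", ". ", " ")


def recursive_chunk(text: str, max_len: int,
                    separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split text into pieces of at most max_len characters,
    preferring the coarsest boundary that still fits."""
    text = text.strip()
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No boundary left to try: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    parts = [p for p in text.split(sep) if p.strip()]
    if len(parts) == 1:
        # This separator does not divide the text; try the next finer one.
        return recursive_chunk(text, max_len, finer)

    chunks: List[str] = []
    current = ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate  # keep merging neighbours while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_len:
            current = part
        else:
            # A single part is still too large: split it on finer boundaries.
            chunks.extend(recursive_chunk(part, max_len, finer))
    if current:
        chunks.append(current)
    return chunks


doc = ("Alpha one. Alpha two.\n\n"
       "Beta one is a much longer sentence than the others. Beta two.")
for piece in recursive_chunk(doc, 40):
    print(repr(piece))
```

The short first paragraph survives as one chunk, while the long sentence in the second paragraph is broken at word boundaries only because no coarser boundary fits. A production splitter would measure size with the embedding model's tokenizer rather than `len`, and would keep the separator that this sketch drops at chunk edges.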

Semantic chunking moves the decision from syntax to meaning: embed each sentence, walk the sequence, mark topic boundaries where adjacent-sentence similarity drops below a threshold. It works well on long-form prose where structural cues are weak (analyst reports, interview transcripts) and badly on structured technical documents where dense cross-references and repeated boilerplate confuse the sentence embeddings. Structure-aware chunking treats the parsed document as a tree and chunks along it — by section, by header level, by code function. Done well it is the most faithful form of chunking; done without a layout-aware parser upstream it produces nothing different from recursive chunking, because the structure was never extracted. These four are alternatives, not a stack to deploy together.
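The boundary rule for semantic chunking (embed, walk, cut where adjacent similarity drops) can be sketched as follows. The `toy_embed` function is a bag-of-words stand-in, assumed here only so the example runs without a model; in practice `embed` would be a real sentence-embedding model, and the threshold of 0.2 is an arbitrary illustrative value.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(sentence: str) -> Vector:
    """Stand-in for a sentence-embedding model: word counts as a sparse vector."""
    return dict(Counter(re.findall(r"[a-z0-9]+", sentence.lower())))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: List[str], threshold: float,
                    embed: Callable[[str], Vector] = toy_embed) -> List[str]:
    """Start a new chunk wherever adjacent-sentence similarity falls below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            groups.append([sentence])    # similarity dropped: topic boundary
        else:
            groups[-1].append(sentence)  # same topic: extend the current chunk
    return [" ".join(g) for g in groups]


sentences = [
    "The deductible rose in 2024.",
    "The deductible applies per claim.",
    "Branch offices close at noon on Fridays.",
    "Branch offices reopen on Mondays.",
]
for chunk in semantic_chunks(sentences, threshold=0.2):
    print(chunk)
```

The sketch also makes the failure mode on technical documents easy to see: repeated boilerplate raises similarity between sentences that belong to different topics, so the threshold stops marking real boundaries.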

3.2 The overlap myth and the context cliff

Almost every tutorial recommends chunk overlap of 15–20%. The intuition is correct as far as it goes — overlap prevents boundary losses — but the curve flattens fast. The first 10% recovers most of the benefit. Past about 25%, accuracy is essentially flat while cost climbs on three axes: embedding bills scale linearly with chunk count, index size and query latency grow, and the retriever’s top results start being near-duplicates of each other. A user query matches a passage in chunks A, B, and C; the context window is consumed without new information arriving; the reranker spends its budget reordering variants of the same content. A team reaching for 30–40% overlap should treat that as a signal that the chunker is wrong, not that the overlap is too low.
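The cost side of that curve is simple arithmetic. With a sliding window, the stride is the chunk size times one minus the overlap, so chunk count (and with it the embedding bill) grows roughly as 1 / (1 - overlap). A small sketch, where the 100,000-token document and 500-token chunks are illustrative numbers rather than figures from the book:

```python
import math


def chunk_count(doc_tokens: int, chunk_size: int, overlap: float) -> int:
    """Number of sliding-window chunks needed to cover doc_tokens tokens."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    if doc_tokens <= chunk_size:
        return 1
    stride = max(1, round(chunk_size * (1.0 - overlap)))
    return math.ceil((doc_tokens - chunk_size) / stride) + 1


baseline = chunk_count(100_000, 500, 0.0)
for overlap in (0.0, 0.1, 0.2, 0.4):
    n = chunk_count(100_000, 500, overlap)
    print(f"overlap={overlap:.0%}  chunks={n}  cost multiplier={n / baseline:.2f}x")
```

At 40% overlap the same document produces 333 chunks instead of 200, about two thirds more to embed, store, and search, for the flat accuracy described above.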

Related but distinct is the context cliff: the sharp drop in retrieval quality when a chunk loses the anchor terms that made it findable. Consider a paragraph that opens with “The 2023 amendment to Policy 47-B required all branches to…” and then describes the requirement in subsequent sentences. Cut after the opening, and the requirement-describing chunk no longer mentions the policy, the amendment, or the year. It will retrieve confidently for unrelated queries and miss the canonical ones. Because retrieval is top-k, the chunk either surfaces or it does not; there is no graceful degradation. The cliff is the dominant failure mode in technical corpora, where pronouns and short forms carry the antecedent forward through a section.
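One rough way to spot chunks that have fallen off the cliff is to check each chunk for the anchor terms of the section it came from. The sketch below assumes the anchors are already known (for example from a section title); the function name `missing_anchors` and the two sample chunks are hypothetical, written for this illustration.

```python
from typing import Dict, List


def missing_anchors(chunks: List[str], anchors: List[str]) -> Dict[int, List[str]]:
    """Map each chunk index to the anchor terms that chunk never mentions."""
    report: Dict[int, List[str]] = {}
    for i, chunk in enumerate(chunks):
        lowered = chunk.lower()
        missing = [a for a in anchors if a.lower() not in lowered]
        if missing:
            report[i] = missing
    return report


# Hypothetical paragraph, cut after its opening sentence.
chunks = [
    "The 2023 amendment to Policy 47-B changed the filing rules.",
    "It applies to every branch. Reports are due quarterly.",
]
print(missing_anchors(chunks, ["Policy 47-B", "2023", "amendment"]))
```

The second chunk is flagged as missing all three anchors: it reads as a requirement with no policy, amendment, or year attached, which is exactly the chunk that will miss the canonical queries. Substring matching is a crude proxy and will not catch paraphrased anchors.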

3.3 Matching chunk size to query type

Chunk size is often debated as if there were a single right answer. There is not, because the right answer depends on what queries the system will receive. A factoid query (“what was the deductible on policy 47-B in 2024?”) wants 150–300 tokens, narrow enough to be precise, wide enough to disambiguate. A reasoning query (“summarize the changes between 2023 and 2024 versions, and explain how they affect renewal”) wants 800–1,200 tokens to preserve the connective tissue within a section. Taken end to end, those ranges put the optimal sizes roughly 3–8x apart, and production traffic is usually a mix.

Two productive responses. Multi-granularity indexing stores the same corpus at multiple chunk sizes and routes queries by intent classification. Hierarchical retrieval indexes small chunks for precision but returns their parent sections for context — one index, conditioned at query time, more common in production because it degrades gracefully when intent classification is wrong. The parent-document pattern is one of the highest-value techniques in the production retrieval literature.
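The parent-document pattern can be sketched with a toy index: match on small child chunks, return the deduplicated parent sections. Everything here is an assumption for illustration, including the word-overlap scorer (a stand-in for vector similarity), the function names, and the sample sections.

```python
import re
from typing import Dict, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(sections: Dict[str, str], child_size: int) -> List[Tuple[str, str]]:
    """Split each parent section into small word windows, remembering the parent id."""
    index: List[Tuple[str, str]] = []
    for parent_id, body in sections.items():
        words = body.split()
        for i in range(0, len(words), child_size):
            index.append((parent_id, " ".join(words[i:i + child_size])))
    return index


def retrieve_parents(query: str, index: List[Tuple[str, str]],
                     sections: Dict[str, str], k: int = 2) -> List[str]:
    """Rank small chunks for precision, then return their parent sections once each."""
    q = tokens(query)
    # Word overlap stands in for embedding similarity in this sketch.
    ranked = sorted(index, key=lambda item: len(q & tokens(item[1])), reverse=True)
    parent_ids: List[str] = []
    for parent_id, _child in ranked[:k]:
        if parent_id not in parent_ids:  # several children may share one parent
            parent_ids.append(parent_id)
    return [sections[p] for p in parent_ids]


sections = {
    "deductible": ("The deductible for Policy 47-B is set per claim. It was "
                   "revised in the 2024 amendment and affects renewal pricing."),
    "hours": "Branch offices close at noon on Fridays. They reopen on Monday morning.",
}
index = build_index(sections, child_size=8)
for section in retrieve_parents("2024 amendment renewal", index, sections):
    print(section)
```

Both top-ranked children belong to the same section, so the caller receives that section once rather than two overlapping fragments. No intent classifier is involved, which is why a wrong guess about query type costs nothing here.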

3.4 Contextual retrieval and late chunking

The frontier is the recognition that the chunk and the embedding are separable concerns. Two recent techniques exploit that separation in opposite directions. Contextual retrieval, popularized by Anthropic in 2024, sends each chunk plus the full document to a cheap LLM and asks for a one- or two-sentence description of where the chunk sits (“This chunk discusses the change to deductible calculations introduced in the 2024 amendment to Policy 47-B”), then prepends that description to the chunk text before embedding. The chunk becomes findable for queries the underlying text never named. On Anthropic’s evaluation, the reported reduction in top-20 retrieval failures is 35% from contextual embeddings alone, 49% when contextual BM25 is added for hybrid search, and 67% with reranking on top. The trick that makes it economical is prompt caching: send the document once, process each chunk against the cached version.
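The indexing-time flow of contextual retrieval can be sketched without committing to any particular LLM client. In the example below, `describe` is an injected callable standing in for the model call, `stub_describe` is a fake that only makes the sketch runnable, and the prompt wording in `build_prompt` is hypothetical: it is neither Anthropic's published prompt nor the book's template.

```python
from typing import Callable, List


def build_prompt(document: str, chunk: str) -> str:
    """Hypothetical prompt wording. The document comes first so that the
    shared prefix is the part a prompt cache could reuse across chunks."""
    return (
        f"<document>\n{document}\n</document>\n"
        f"<chunk>\n{chunk}\n</chunk>\n"
        "In one or two sentences, say where this chunk sits in the document."
    )


def contextualize(document: str, chunks: List[str],
                  describe: Callable[[str], str]) -> List[str]:
    """Prepend a short situating description to each chunk before embedding."""
    enriched = []
    for chunk in chunks:
        context = describe(build_prompt(document, chunk)).strip()
        enriched.append(f"{context}\n\n{chunk}" if context else chunk)
    return enriched


def stub_describe(prompt: str) -> str:
    """Fake LLM for the sketch: echoes the document's first line as its title."""
    title = prompt.splitlines()[1]
    return f"This chunk is from the document titled '{title}'."


document = "Policy 47-B: 2024 Amendment\nThe deductible is now computed per claim."
chunks = ["The deductible is now computed per claim."]
for text in contextualize(document, chunks, stub_describe):
    print(text)
```

The enriched string, not the raw chunk, is what gets embedded (and indexed for keyword search, if hybrid retrieval is in use). The raw chunk can still be stored separately for display.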

Late chunking, introduced by Jina AI in 2024, attacks the same problem from the other side. The full document is passed through a long-context embedding model in one pass, producing token-level embeddings already contextualized over the whole document. Only then is the document chunked, with each chunk’s embedding pooled from its now-contextualized tokens. No extra LLM calls; the embeddings inherit document-level context implicitly. The constraint is that the embedding model has to support it natively (jina-embeddings-v3/v4 and some research models do) and the document has to fit in the model’s window. For documents that fit, late chunking matches contextual retrieval at a fraction of the indexing cost. For documents that do not, contextual retrieval is more general. The two are not mutually exclusive, and serious production systems often run both with a deduplication step on top.
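The pooling step at the heart of late chunking is small enough to show directly. The sketch assumes the long-context model has already produced one contextualized vector per token, and that chunk boundaries are given as token spans; mean pooling is assumed as the pooling rule, and the two-dimensional vectors are toy values chosen so the result is easy to check by hand.

```python
from typing import List, Tuple

Vector = List[float]


def late_chunk_embeddings(token_vectors: List[Vector],
                          spans: List[Tuple[int, int]]) -> List[Vector]:
    """Mean-pool already-contextualized token vectors over each [start, end) span."""
    pooled: List[Vector] = []
    for start, end in spans:
        window = token_vectors[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        dim = len(window[0])
        pooled.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return pooled


# Toy stand-in for the output of one full-document pass (4 tokens, 2 dimensions).
token_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
spans = [(0, 2), (2, 4)]  # chunk boundaries chosen after the embedding pass
print(late_chunk_embeddings(token_vectors, spans))  # [[2.0, 0.0], [0.0, 3.0]]
```

The order of operations is the whole point: in conventional chunking each chunk is embedded in isolation, whereas here every token vector has already attended to the full document before the spans are drawn.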

Worth holding onto: a useful test for any chunk in any production system — if a stranger read it with no other context, could they say what document it came from, what subject it addresses, and what role it plays? If the answer is no, the chunk is on the wrong side of the cliff, and retrieval over it is operating on luck. Contextual retrieval and late chunking exist to make the answer yes at scale.

What Chapter 3 sets up

Chunking turns a parsed document into a population of retrievable units. Each unit needs somewhere to live — stored, indexed, queried at low latency, updated as the corpus changes. That somewhere is the vector database, and the choice of vector database is a different kind of decision than the chunking decision. Chunking is a software problem with software costs. Database selection is a software problem with infrastructure, operational, and regulatory consequences, and the wrong choice can take six months to undo.

Next — Chapter 4: Selecting the Right Vector Database. Purpose-built versus extension architectures, the managed leaders, the open-source field, and the three axes — residency, ops, and total cost — that usually decide the real choice.

Want the full picture? The book walks the chunking cost surface honestly (index-time versus per-query cost, embedding-model coupling, multi-granularity patterns) and includes the duplicate-recall diagnostic and the contextual-retrieval prompt template that close the cliff cleanly in production.