Chapter 8 — Using LLMs in Applications: Chatbots, Code, Extraction, and Agents

Published on: 2026-02-25 Last updated on: 2026-06-05 Version: 3
Chapter 8 — Using LLMs in Applications: Chatbots, Code, Extraction, and Agents

Chapter 8 — Using LLMs in Applications

This is Part 8 of a series walking through LLM Primer I: How Generative AI Works. Yesterday we covered embeddings, RAG, and multimodality. Today we look at how LLMs actually show up in shipping products — the patterns that work, the ones that don't, and the new wave of agentic systems where the model is the driver.


A chatbot is not just a model

The most common mistake people make about chatbots is thinking the model is the product. It isn't. The model is a component. The product is the system that wraps the model: prompt templates, conversation history management, safety filters, retrieval layers, tool integrations, fallback policies, user interface.

Most of the engineering effort in a production chatbot goes into the surrounding system, not the model. A well-engineered chatbot using a mid-tier model usually beats a poorly-engineered chatbot using a frontier model. The book walks through the architectural patterns that actually work, including how to manage conversation state, when to summarize old turns versus keep them verbatim, and how to layer safety controls.

Key idea: The model is rarely the bottleneck. The bottleneck is usually context management, retrieval quality, or evaluation rigor. Engineering effort spent on these almost always pays off more than upgrading to a bigger model.

Summarization and search

Two of the highest-impact LLM use cases are both about condensing information. Summarization shrinks long text into a shorter version while preserving the meaning. Semantic search finds relevant material across a large corpus by intent rather than keyword.

The interesting modern pattern is to combine them. A user asks a question. The system retrieves relevant documents. The model summarizes the retrieved material into a focused answer. This is what most "AI search" products actually do under the hood. When it works, it feels like magic. When it fails, it's almost always because the retrieval step missed the relevant material, not because the model couldn't summarize it.

Code generation

Programming languages are formal languages with strict grammar and clear feedback. That makes them an especially good fit for LLMs. A model that has seen large amounts of code learns to predict completions that compile, function signatures that match conventions, and idioms that resemble the surrounding code.

Modern code assistants are a particular kind of RAG system: they retrieve relevant context from the codebase being edited and feed it to the model along with the user's request. The model is genuinely good at this. The book is realistic about both the upside (real productivity gains in well-scoped tasks) and the downside (subtle correctness issues that are hard to spot in fluent-looking code).

Knowledge extraction

The reverse of writing is reading. Knowledge extraction is the pattern where you give the model an unstructured document and ask it to produce structured data — pull the invoice number, date, and total from this PDF; extract the candidate's job history from this resume; identify the chemical compounds mentioned in this paper.

This is one of the most directly useful business applications of LLMs, and it's relatively safe because the output can be validated against a schema. The book walks through how to design the prompt and the validation layer together so that malformed model outputs are caught and retried rather than silently corrupting downstream systems.

Evaluation, in production

Because LLM outputs are probabilistic, you cannot test them the way you test deterministic software. There's no single right answer to compare against. Evaluation blends several techniques: automated metrics where possible, scoring by a stronger model, structured human review, A/B testing in production, and ongoing monitoring of drift.

This section also introduces the named benchmarks that show up everywhere in LLM research and product announcements: MMLU, GPQA-Diamond, HumanEval, SWE-bench, MMMU, LiveBench, GSM8K, MATH, ARC-AGI, BFCL, IFEval. The book includes a one-paragraph reference for each, so you can read any model comparison and know what's actually being measured.

The new pattern: agentic systems

This section is new in the 2026 edition, because it's where the field has moved fastest. In an agentic system, the model is in the driver's seat. Instead of just producing text, it decides when to call a calculator, when to query a database, when to invoke a search tool, when to ask a clarifying question — and what to do with the results.

The mechanism is structured tool calling. Each available tool is described to the model as a function signature with a description and a schema for its arguments. The model can emit a structured tool invocation instead of plain prose. The surrounding system parses the invocation, executes the tool, returns the result, and the model decides what to do next. The loop continues until the model declares the task complete.

This pattern raises new engineering concerns the book takes seriously. Agentic loops can consume resources unpredictably. Tool failures propagate into model behavior. Safety considerations are amplified, because the model now influences the world rather than just describing it. The book walks through how to design tool inventories, evaluate step-level correctness, and contain runaway loops.

Important: The shift from chatbots to agentic systems isn't just architectural — it's a shift in what you're trusting the model to do. A chatbot generates text you can review before acting on. An agent takes actions in the world before you see the result. The safety properties are categorically different.

What Chapter 8 sets up

By the end of Chapter 8, you have a practical playbook for the major LLM application patterns. You know what shape of system to build for each kind of problem, what evaluation looks like in each case, and how to read the benchmark numbers that vendors publish about their models. The next chapter takes the natural next step: what does it cost to run all of this at scale?


Next up — Chapter 9: Performance, Scaling, and Costs. Tomorrow we look at the operational realities. Latency, throughput, cost per request, quantization, on-device deployment, and how to think about model size when most of your business won't actually benefit from the largest available model.

Want the full picture? The book includes a dedicated benchmarks reference and a deep treatment of agentic patterns, both new in the 2026 edition. Grab LLM Primer I on Amazon →

SHO
SHO
CTO of Receipt Roller Inc., he builds innovative AI solutions and writes to make large language models more understandable, sharing both practical uses and behind-the-scenes insights.