From One Shot to a Pipeline: Evolving DOCX → JSON (V1 → V2)

Why change what works?
A common first version of “turn this Word file into JSON” is simple: read the text, send all of it to the LLM once, parse JSON back. It ships fast and works on small docs. In production, you hit context limits, unstable JSON, and a model forced to guess structure that the document already provides explicitly.
V2 keeps the same goal, but instead of letting the model do everything, it splits the work: code extracts structure first, the model improves content, and then the system combines and scores the final result.
This post walks through V1’s limits, V2’s architecture, how to run it, a before/after example, and what improves in practice (plus an honest word on cost).
The problem with V1: one LLM call to extract “everything”
A typical V1 pipeline:
Strip paragraphs from the DOCX into a big string.
One prompt: “Here is the full document. Output JSON with title, sections, bullets, …”
Validate the response against a schema (if you are lucky) and return.
Limitations:
| Issue | What goes wrong |
|---|---|
| Context | Long documents exceed the model window, or the tail gets truncated. Quality drops at the end, or the call fails. |
| Responsibility blur | The model must invent document structure (sections, order) from flat text, even though Word already has styles: Heading 1–2, list styles, etc. |
| Single point of failure | One bad response (malformed JSON, rate limit, timeout) fails the entire run. There is no natural “retry this page only” boundary. |
| All-or-nothing parsing | Fixing JSON often means re-running the full prompt, burning tokens. |
| Debuggability | When output is wrong, the prompt is a wall of text; you cannot easily see “which part of the doc” broke the model. |
V1 is not “wrong” for a prototype or a 2-page brief, but it does not scale in robustness or observability for real file sizes and real SLAs.
V2 architecture: separate concerns
V2 is a modular pipeline. The LLM’s job is narrower: to improve and label content that is already structured in memory, not to rediscover the whole document from a blob.
End-to-end flow (as implemented in spirit):
1. extract_blocks(): Walk paragraphs with python-docx; use style names to label headings (with level), list/bullet lines, and normal paragraphs.
2. build_sections(): Fold blocks into a list of sections; each new heading starts a section; content before the first heading can be grouped (e.g. under "Introduction").
3. chunk_sections(): Pack sections into chunks under a character budget (e.g. ~1500 characters of section text) so each LLM call stays small. Oversized sections are isolated with a warning in logs rather than silently breaking layout.
4. LLM enrichment (per chunk): For each chunk, a prompt asks for a JSON array of sections: clearer wording, optional keywords, etc., without merging or splitting sections arbitrarily. Retries (e.g. 3 attempts with short backoff) protect against transient API/parsing flakiness.
5. merge_results(): Concatenate chunk results: combined sections, deduplicated top-level keywords, optional title/notes if the model returns document-shaped data.
6. compute_confidence(): A code-defined document-level score (e.g. from coverage of source text in the result plus a small structural signal), so confidence is not only “what the model guessed.”
7. Validate schema: e.g. Pydantic, with title and sections holding typed content blocks (paragraph/bullet), not one opaque string.
8. Output JSON: Print to stdout (or redirect to a file).
Conceptually, steps 4–5 are the “smart part in small boxes,” while steps 1–3 and 6–7 are “boring, testable, deterministic.”
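The section-folding step is the easiest one to pin down in code. Here is a minimal sketch, assuming blocks are dicts with "type", "text", and (for headings) "level"; the `default_heading` parameter and the exact section shape are illustrative assumptions, not the project's confirmed implementation.

```python
def build_sections(blocks: list[dict], default_heading: str = "Introduction") -> list[dict]:
    """Fold a flat block list into sections; each heading opens a new section."""
    sections: list[dict] = []
    current = None
    for block in blocks:
        if block["type"] == "heading":
            # A heading always starts a fresh section.
            current = {"heading": block["text"], "level": block.get("level", 1), "content": []}
            sections.append(current)
            continue
        if current is None:
            # Content before the first heading is grouped under a default section.
            current = {"heading": default_heading, "level": 1, "content": []}
            sections.append(current)
        current["content"].append({"type": block["type"], "text": block["text"]})
    return sections
```

Because this function is pure and deterministic, it can be unit-tested with hand-written block lists, with no DOCX file or model call involved.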
Illustrative code (V2 shape)
Extraction: blocks, not a single string
def extract_blocks(docx_path) -> list[dict]:
    # for each non-empty paragraph:
    #   heading → { "type": "heading", "text": "...", "level": n }
    #   list    → { "type": "bullet", "text": "..." }
    #   else    → { "type": "paragraph", "text": "..." }
    ...
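One way to fill in that outline is sketched below. It relies on python-docx's `Document(path).paragraphs`, `paragraph.style.name`, and `paragraph.text`. The helper `classify_paragraph` and the style-name rules ("Heading N" is a heading, any style starting with "List" is a bullet) are assumptions made for illustration, and real templates with custom styles would need their own mapping.

```python
import re
from typing import Optional


def classify_paragraph(style_name: str, text: str) -> Optional[dict]:
    """Map a Word style name to a block dict; return None for empty paragraphs."""
    text = text.strip()
    if not text:
        return None
    match = re.fullmatch(r"Heading (\d+)", style_name)
    if match:
        return {"type": "heading", "text": text, "level": int(match.group(1))}
    if style_name.startswith("List"):
        # Covers built-in styles such as "List Bullet" and "List Number".
        return {"type": "bullet", "text": text}
    # Everything else (Normal, Title, custom styles) is treated as a paragraph.
    return {"type": "paragraph", "text": text}


def extract_blocks(docx_path: str) -> list[dict]:
    """Read a DOCX file and return its non-empty paragraphs as typed blocks."""
    # Imported lazily so classify_paragraph can be tested without python-docx.
    from docx import Document

    blocks = []
    for para in Document(docx_path).paragraphs:
        block = classify_paragraph(para.style.name, para.text)
        if block is not None:
            blocks.append(block)
    return blocks
```

Keeping the classification rule separate from file I/O means the style mapping can be tested with plain strings.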
Chunking: keep sections intact
def chunk_sections(sections, max_chars=1500) -> list[list[dict]]:
    # Greedy packing by estimated character length; oversized sections
    # may form their own chunk (log a warning).
    ...
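A minimal sketch of that greedy packing follows. It assumes each section is a dict with a "heading" string and a "content" list of blocks carrying "text"; the `section_length` helper and the log message are illustrative choices, not taken from the project.

```python
import logging

logger = logging.getLogger(__name__)


def section_length(section: dict) -> int:
    """Estimate a section's size as heading length plus all block text lengths."""
    return len(section["heading"]) + sum(len(b["text"]) for b in section["content"])


def chunk_sections(sections: list[dict], max_chars: int = 1500) -> list[list[dict]]:
    """Greedily pack whole sections into chunks that stay under max_chars."""
    chunks: list[list[dict]] = []
    current: list[dict] = []
    current_len = 0
    for section in sections:
        size = section_length(section)
        if size > max_chars:
            logger.warning(
                "Section %r is %d chars, over the %d budget; isolating it",
                section["heading"], size, max_chars,
            )
        # Close the current chunk if this section would push it over budget.
        if current and current_len + size > max_chars:
            chunks.append(current)
            current, current_len = [], 0
        current.append(section)
        current_len += size
    if current:
        chunks.append(current)
    return chunks
```

Sections are never split, so an oversized section ends up alone in its own chunk: the chunk before it is closed, and the next section cannot join it because the budget is already exceeded.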
Orchestration: loop chunks, then merge and score
results = [safe_generate(build_prompt(chunk)) for chunk in chunks]
final = merge_results(results)
original_text = " ".join(b.get("text", "") for b in blocks)
final["confidence"] = compute_confidence(final["sections"], original_text)
validated = DocumentSchema(**final)
print(json.dumps(validated.model_dump(), indent=2, ensure_ascii=False))
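The merge step in that loop could look roughly like the sketch below. It makes several assumptions: a chunk result is either a JSON array of sections or a document-shaped dict, a failed chunk comes back as None, and keywords are deduplicated case-insensitively with first-seen order preserved.

```python
def merge_results(results: list) -> dict:
    """Combine per-chunk results into one document-shaped dict."""
    merged = {"title": "", "sections": [], "keywords": [], "notes": []}
    seen = set()
    for result in results:
        if not result:
            continue  # assumption: a chunk that failed all retries yields None
        if isinstance(result, list):
            result = {"sections": result}  # bare array of sections
        sections = result.get("sections", [])
        merged["sections"].extend(sections)
        if not merged["title"] and result.get("title"):
            merged["title"] = result["title"]  # first non-empty title wins
        merged["notes"].extend(result.get("notes", []))
        # Collect top-level keywords plus any attached to individual sections.
        keywords = list(result.get("keywords", []))
        for section in sections:
            keywords.extend(section.get("keywords", []))
        for keyword in keywords:
            key = keyword.strip().lower()
            if key and key not in seen:
                seen.add(key)
                merged["keywords"].append(keyword.strip())
    return merged
```

Since chunks are processed in document order, plain concatenation keeps sections in their original sequence.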
V1 (conceptual) for comparison
text = extract_text(path) # one big string
data = generate_json_and_parse(extract_json(build_prompt(text)))
How to run the project (CLI)
python src/app.py path/to/document.docx
Debug mode (append chunk prompts/outputs to files in the CWD for inspection):
python src/app.py path/to/document.docx --debug
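The post does not show how src/app.py reads its arguments. A minimal argparse sketch, assuming exactly one positional path and the --debug flag shown above (the `build_parser` name and help texts are hypothetical):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for: python src/app.py path/to/document.docx [--debug]."""
    parser = argparse.ArgumentParser(description="Convert a DOCX file to structured JSON.")
    parser.add_argument("docx_path", help="Path to the input .docx file")
    parser.add_argument(
        "--debug",
        action="store_true",
        help="Write chunk prompts and outputs to files in the current directory",
    )
    return parser
```

Calling `build_parser().parse_args()` in the entry point would then expose `args.docx_path` and `args.debug` to the rest of the pipeline.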
Example: input vs output (V1 “shape” vs V2 “shape”)
Input (DOCX): a short file with a heading, two body paragraphs, and a bullet list—styled as Heading 1, Body, and List in Word.
V1-style output (illustrative): one JSON object, often with one content string and a separate bullets list, because the model guessed structure from newlines.
{
  "title": "",
  "sections": [
    {
      "heading": "Scope",
      "content": "This document defines the API contract for internal services.",
      "bullets": [
        "All new endpoints must be versioned.",
        "Breaking changes require a major version bump."
      ]
    }
  ],
  "keywords": ["api", "versioning"],
  "confidence": 0.82,
  "notes": ["Headings are clear; list is consistent."]
}
V2-style output (illustrative): sections carry a unified content array of typed blocks—closer to how the document is actually built and better for UIs, search snippets, and accessibility.
{
  "title": "",
  "sections": [
    {
      "heading": "Scope",
      "type": "general",
      "content": [
        { "type": "paragraph", "text": "This document defines the API contract for internal services." },
        { "type": "bullet", "text": "All new endpoints must be versioned." },
        { "type": "bullet", "text": "Breaking changes require a major version bump." }
      ]
    }
  ],
  "keywords": ["api", "versioning", "backwards compatibility"],
  "notes": [],
  "confidence": 0.9
}
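To make the difference between the two shapes concrete, here is a small sketch that converts a V1-style section into the V2-style typed content array. The function name `to_typed_content` is hypothetical and this is not part of the pipeline itself.

```python
def to_typed_content(v1_section: dict) -> dict:
    """Convert a V1 section (content string + bullets list) into typed blocks."""
    content = []
    text = v1_section.get("content", "").strip()
    if text:
        content.append({"type": "paragraph", "text": text})
    # V1 keeps bullets in a separate list, so their original position relative
    # to the paragraphs is unknown; here they are simply appended at the end.
    content.extend({"type": "bullet", "text": b} for b in v1_section.get("bullets", []))
    return {"heading": v1_section["heading"], "content": content}
```

The comment in the middle is the point: once paragraphs and bullets live in separate fields, their interleaving is lost, whereas a single ordered array of typed blocks preserves it.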
What actually improves over V1
Scalability
V1 scales poorly with document length: one context window, one failure domain.
V2 breaks work into chunks bounded by size, so longer documents are handled in multiple bounded calls. Throughput is sequential unless you add parallelization later.
Reliability
Per-chunk retries reduce the impact of transient network/API/parse issues compared to a single do-or-die call.
Isolation is clearer: a bad chunk is easier to pinpoint in logs and debug output than “something in the 40-page prompt.”
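A per-chunk retry wrapper might look like the sketch below. The `call_model` parameter is a hypothetical stand-in for whatever function sends a prompt to the LLM and returns raw text, and the linear backoff and the None return on final failure are assumptions rather than the project's confirmed behavior.

```python
import json
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def safe_generate(
    prompt: str,
    call_model: Callable[[str], str],
    attempts: int = 3,
    backoff_seconds: float = 1.0,
) -> Optional[list]:
    """Call the model and parse a JSON array, retrying on any failure."""
    for attempt in range(1, attempts + 1):
        try:
            parsed = json.loads(call_model(prompt))
            if not isinstance(parsed, list):
                raise ValueError("expected a JSON array of sections")
            return parsed
        except Exception as exc:  # broad on purpose: network, API, and parse errors
            logger.warning("Attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt < attempts:
                time.sleep(backoff_seconds * attempt)  # short linear backoff
    return None  # caller decides how to handle a chunk that never succeeded
```

Injecting `call_model` keeps the retry logic testable with a fake function that fails a set number of times before returning valid JSON.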
Cost efficiency (be precise)
V1 often = one (large) call.
V2 often = several (smaller) calls, so total cost can go up on long docs. The efficiency is of a different kind: you spend tokens on refinement and keywords over already-structured chunks, not on the model re-deriving headings and lists from scratch. Structure extraction moves to python-docx on the CPU, where its cost is negligible, instead of being paid for in model tokens.
For a 10-line doc, V2 can be more calls than V1; for a 200-page doc, V1 may be infeasible while V2 is viable—that is the real trade.
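A back-of-the-envelope way to see the trade is to count input characters. This rough sketch assumes a fixed instruction overhead per prompt (the 400-character figure is invented for illustration), ignores output tokens, and treats the call count as a lower bound, since sections are never split across chunks.

```python
import math


def compare_input_chars(doc_chars: int, max_chars: int = 1500, prompt_overhead_chars: int = 400) -> dict:
    """Estimate call count and total input characters for V1 vs V2."""
    calls = max(1, math.ceil(doc_chars / max_chars))  # lower bound on V2 calls
    return {
        "v2_calls": calls,
        "v1_input_chars": prompt_overhead_chars + doc_chars,
        "v2_input_chars": calls * prompt_overhead_chars + doc_chars,
    }
```

Under these assumptions, V2 pays the instruction overhead once per chunk, so its input grows faster than V1's as documents get longer. The estimate says nothing about whether V1's single prompt still fits in the context window, which is the constraint that matters for large files.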
Debuggability
Structured logging per chunk (Chunk i/n), optional --debug files for prompts and outputs, and a scorer that is inspectable code (coverage, structure) rather than a black-box model number.
Real-world use cases
RAG / search indexing: Typed blocks map cleanly to chunks for embeddings (per section or per block).
Migration and CMS import: Preserving headings and lists is easier when they are not flattened before the model sees them.
Policy and compliance docs: Repeatable, logged processing plus versionable code paths (extraction, merge, score).
Product / API docs in Word: Multi-section handoffs to API portals or static sites as JSON, not a single unstyled dump.
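For the RAG and search case in the list above, typed blocks flatten naturally into index records. A minimal sketch, assuming the V2 output shape shown earlier; the record fields and the id format are hypothetical and would be adapted to the target vector store.

```python
def to_index_records(document: dict, source_id: str) -> list[dict]:
    """Flatten a V2 document into one record per typed block, ready for embedding."""
    records = []
    for s_idx, section in enumerate(document.get("sections", [])):
        for b_idx, block in enumerate(section.get("content", [])):
            records.append({
                "id": f"{source_id}#s{s_idx}b{b_idx}",  # stable, position-based id
                "heading": section.get("heading", ""),  # context for retrieval
                "type": block["type"],
                "text": block["text"],
            })
    return records
```

Carrying the heading on every record lets a retriever show where a snippet came from without re-reading the source file.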
Takeaway
V1 is the fastest way to prove that “DOCX can become JSON.” V2 is what you build when the file is long, the structure matters, and you need retries, logs, and a confidence story that does not rest entirely on the last token the model generated.
The evolution is the same as elsewhere in software: push determinism to code, keep the LLM in a small, testable box, and merge with explicit, auditable rules at the end.
Feedback and pull requests welcome.

