From One Shot to a Pipeline: Evolving DOCX → JSON (V1 → V2)


Why change what works?

A common first version of “turn this Word file into JSON” is simple: read the text, send all of it to the LLM once, parse JSON back. It ships fast and works on small docs. In production, you hit context limits and unstable JSON, and you make the LLM guess structure that the document already provides explicitly.
V2 keeps the same goal, but instead of letting the model do everything, it splits the work: code extracts structure first, the model improves content, and then the system combines and scores the final result.

This post walks through V1’s limits, V2’s architecture, how to run it, a before/after example, and what improves in practice (plus an honest word on cost).

The problem with V1: one LLM call to extract “everything”

A typical V1 pipeline:

  1. Strip paragraphs from the DOCX into a big string.

  2. One prompt: “Here is the full document. Output JSON with title, sections, bullets, …”

  3. Validate the response against a schema (if you are lucky) and return.

Limitations (issue: what goes wrong):

  • Context: Long documents exceed the model window, or the tail gets truncated. Quality drops at the end, or the call fails.

  • Responsibility blur: The model must invent document structure (sections, order) from flat text, even though Word already has styles: Heading 1–2, list styles, etc.

  • Single point of failure: One bad response (malformed JSON, rate limit, timeout) fails the entire run. There is no natural “retry this page only” boundary.

  • All-or-nothing parsing: Fixing JSON often means re-running the full prompt, burning tokens.

  • Debuggability: When output is wrong, the prompt is a wall of text; you cannot easily see “which part of the doc” broke the model.

V1 is not “wrong” for a prototype or a 2-page brief. What it lacks is robustness and observability at real file sizes and under real SLAs.


V2 architecture: separate concerns

V2 is a modular pipeline. The LLM’s job is narrower: improve and label content that is already structured in memory, not to rediscover the whole document from a blob.

End-to-end flow (as implemented in spirit):

  1. extract_blocks() — Walk paragraphs with python-docx; use style names to label headings (with level), list/bullet lines, and normal paragraphs.

  2. build_sections() — Fold blocks into a list of sections; each new heading starts a section; content before the first heading can be grouped (e.g. under "Introduction").

  3. chunk_sections() — Pack sections into chunks under a character budget (e.g. ~1500 characters of section text) so each LLM call stays small. Oversized sections are isolated with a warning in logs rather than silently breaking layout.

  4. LLM enrichment (per chunk) — For each chunk, a prompt asks for a JSON array of sections: clearer wording, optional keywords, etc., without merging or splitting sections arbitrarily. Retries (e.g. 3 attempts with short backoff) protect against transient API/parsing flakiness.

  5. merge_results() — Concatenate chunk results: combined sections, deduplicated top-level keywords, optional title / notes if the model returns document-shaped data.

  6. compute_confidence() — A code-defined document-level score (e.g. from coverage of source text in the result plus a small structural signal), so confidence is not only “what the model guessed.”

  7. Validate schema — e.g. Pydantic: title, sections with typed content blocks (paragraph / bullet), not one opaque string.

  8. Output JSON — Print to stdout (or redirect to a file).

Conceptually, steps 4–5 are the “smart part in small boxes,” while steps 1–3 and 6–7 are “boring, testable, deterministic.”
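Step 2 is the simplest of the deterministic steps to picture. The following is a minimal sketch of what build_sections() might look like, assuming the block dictionaries described in step 1; the default_heading parameter and the "level" key on sections are assumptions for illustration, not details confirmed by the project.

```python
def build_sections(blocks: list[dict], default_heading: str = "Introduction") -> list[dict]:
    """Fold a flat list of blocks into sections; each heading starts a new one."""
    sections: list[dict] = []
    current = None
    for block in blocks:
        if block["type"] == "heading":
            current = {"heading": block["text"], "level": block.get("level", 1), "content": []}
            sections.append(current)
            continue
        if current is None:
            # Content before the first heading is grouped under a default section.
            current = {"heading": default_heading, "level": 1, "content": []}
            sections.append(current)
        current["content"].append({"type": block["type"], "text": block["text"]})
    return sections
```

Because the function is pure (lists in, lists out), it can be unit-tested without a DOCX file or a model call.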

Illustrative code (V2 shape)

Extraction: blocks, not a single string

def extract_blocks(docx_path) -> list[dict]:
    # for each non-empty paragraph:
    #   heading → { "type": "heading", "text": "...", "level": n }
    #   list    → { "type": "bullet", "text": "..." }
    #   else    → { "type": "paragraph", "text": "..." }
    ...
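One way to fill in that outline is sketched below. It assumes the document uses Word's built-in English style names ("Heading 1", "List Bullet", "List Paragraph", and so on); templates with custom or localized style names would need a different mapping. The classify_paragraph helper is a hypothetical name introduced here so the labeling logic can be tested without a file.

```python
import re
from typing import Optional

def classify_paragraph(style_name: str, text: str) -> Optional[dict]:
    """Map one paragraph's style name and text to a typed block (None if empty)."""
    text = text.strip()
    if not text:
        return None
    match = re.fullmatch(r"Heading (\d+)", style_name)
    if match:
        return {"type": "heading", "text": text, "level": int(match.group(1))}
    # Assumption: list styles start with "List" (e.g. "List Bullet", "List Paragraph").
    if style_name.startswith("List"):
        return {"type": "bullet", "text": text}
    return {"type": "paragraph", "text": text}

def extract_blocks(docx_path) -> list[dict]:
    # python-docx is imported lazily so classify_paragraph stays usable without it.
    from docx import Document
    doc = Document(docx_path)
    blocks = (classify_paragraph(p.style.name, p.text) for p in doc.paragraphs)
    return [b for b in blocks if b is not None]
```

Keeping the style-to-type mapping in its own function means a new template usually requires changing one small, tested function rather than the prompt.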

Chunking: keep sections intact

def chunk_sections(sections, max_chars=1500) -> list[list[dict]]:
    # Greedy packing by estimated character length; oversized sections
    # may form their own chunk (log a warning).
    ...
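A minimal sketch of that greedy packing follows. The way section size is estimated (heading length plus the length of each block's text) is an assumption, as is the section_length helper name; the actual project may count characters differently.

```python
import logging

logger = logging.getLogger(__name__)

def section_length(section: dict) -> int:
    """Estimate a section's size as heading length plus the text of its blocks."""
    blocks = section.get("content", [])
    return len(section.get("heading", "")) + sum(len(b.get("text", "")) for b in blocks)

def chunk_sections(sections: list[dict], max_chars: int = 1500) -> list[list[dict]]:
    chunks, current, used = [], [], 0
    for section in sections:
        size = section_length(section)
        if size > max_chars:
            logger.warning("Section %r is %d chars (budget %d); it gets its own chunk",
                           section.get("heading"), size, max_chars)
        # Start a new chunk when adding this section would exceed the budget.
        if current and used + size > max_chars:
            chunks.append(current)
            current, used = [], 0
        current.append(section)
        used += size
    if current:
        chunks.append(current)
    return chunks
```

Sections are never split, so an oversized section ends up alone in a chunk that exceeds the budget; the warning makes that visible in the logs.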

Orchestration: loop chunks, then merge and score

import json

results = [safe_generate(build_prompt(chunk)) for chunk in chunks]

final = merge_results(results)

original_text = " ".join(b.get("text", "") for b in blocks)
final["confidence"] = compute_confidence(final["sections"], original_text)

validated = DocumentSchema(**final)
print(json.dumps(validated.model_dump(), indent=2, ensure_ascii=False))
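The merge step is where per-chunk results become one document. As a rough sketch, assuming each chunk result is either a bare list of sections or a document-shaped dict with optional title, keywords, and notes, merge_results() could look like this; the case-insensitive keyword deduplication and "first non-empty title wins" rule are assumptions, not confirmed behavior.

```python
def merge_results(results: list) -> dict:
    """Concatenate per-chunk results into one document-shaped dict."""
    final = {"title": "", "sections": [], "keywords": [], "notes": []}
    seen = set()
    for result in results:
        # A chunk may return a bare list of sections or a document-shaped dict.
        doc = {"sections": result} if isinstance(result, list) else result
        final["title"] = final["title"] or doc.get("title", "")
        final["sections"].extend(doc.get("sections", []))
        final["notes"].extend(doc.get("notes", []))
        for keyword in doc.get("keywords", []):
            key = keyword.strip().lower()
            if key and key not in seen:  # dedupe case-insensitively, keep first spelling
                seen.add(key)
                final["keywords"].append(keyword.strip())
    return final
```

Chunk order is preserved, so section order in the output follows section order in the source document.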

V1 (conceptual) for comparison

text = extract_text(path) # one big string

data = generate_json_and_parse(extract_json(build_prompt(text)))

How to run the project (CLI)

python src/app.py path/to/document.docx

Debug mode (append chunk prompts/outputs to files in the CWD for inspection):

python src/app.py path/to/document.docx --debug
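For readers wiring up a similar entry point, the two invocations above map onto a small argparse setup. This is a sketch only: the parse_args name and the help strings are hypothetical, and the real src/app.py may parse its arguments differently.

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Convert a DOCX file to structured JSON.")
    parser.add_argument("docx_path", help="path to the input .docx file")
    parser.add_argument("--debug", action="store_true",
                        help="write chunk prompts and outputs to files in the CWD")
    return parser.parse_args(argv)
```

Passing argv explicitly, as in parse_args(["report.docx", "--debug"]), keeps the parser testable without touching sys.argv.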

Example: input vs output (V1 “shape” vs V2 “shape”)

Input (DOCX): a short file with a heading, two body paragraphs, and a bullet list—styled as Heading 1, Body, and List in Word.

V1-style output (illustrative): one JSON object, often with one content string and a separate bullets list, because the model guessed structure from newlines.

{
  "title": "",
  "sections": [
    {
      "heading": "Scope",
      "content": "This document defines the API contract for internal services.",
      "bullets": [
        "All new endpoints must be versioned.",
        "Breaking changes require a major version bump."
      ]
    }
  ],
  "keywords": ["api", "versioning"],
  "confidence": 0.82,
  "notes": ["Headings are clear; list is consistent."]
}

V2-style output (illustrative): sections carry a unified content array of typed blocks—closer to how the document is actually built and better for UIs, search snippets, and accessibility.

{
  "title": "",
  "sections": [
    {
      "heading": "Scope",
      "type": "general",
      "content": [
        { "type": "paragraph", "text": "This document defines the API contract for internal services." },
        { "type": "bullet", "text": "All new endpoints must be versioned." },
        { "type": "bullet", "text": "Breaking changes require a major version bump." }
      ]
    }
  ],
  "keywords": ["api", "versioning", "backwards compatibility"],
  "notes": [],
  "confidence": 0.9
}
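The typed-block shape is what the schema step enforces. The post names Pydantic for this; the sketch below deliberately swaps in standard-library dataclasses so it runs without dependencies, and only illustrates the shape of the checks. The class names (Block, Section, DocumentModel) and the 0 to 1 confidence range are assumptions.

```python
from dataclasses import dataclass, field

ALLOWED_BLOCK_TYPES = {"paragraph", "bullet"}

@dataclass
class Block:
    type: str
    text: str

    def __post_init__(self):
        if self.type not in ALLOWED_BLOCK_TYPES:
            raise ValueError(f"unknown block type: {self.type!r}")
        if not isinstance(self.text, str):
            raise ValueError("block text must be a string")

@dataclass
class Section:
    heading: str
    content: list
    type: str = "general"

    def __post_init__(self):
        # Convert raw dicts into Block instances, validating each one.
        self.content = [b if isinstance(b, Block) else Block(**b) for b in self.content]

@dataclass
class DocumentModel:
    title: str
    sections: list
    keywords: list = field(default_factory=list)
    notes: list = field(default_factory=list)
    confidence: float = 0.0

    def __post_init__(self):
        self.sections = [s if isinstance(s, Section) else Section(**s) for s in self.sections]
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
```

Constructing DocumentModel(**final) then fails loudly on an unknown block type or an out-of-range confidence instead of letting a malformed document reach downstream consumers.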

What actually improves over V1

Scalability

  • V1 scales poorly with document length: one context window, one failure domain.

  • V2 breaks work into chunks bounded by size, so longer documents are handled in multiple bounded calls. Throughput is sequential unless you add parallelization later.
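If sequential throughput becomes the bottleneck, chunk calls are independent and can be fanned out. A minimal sketch using the standard library's thread pool is shown here; enrich_chunks and its enrich_one callback are hypothetical names, and the worker count should respect whatever rate limits the model API imposes.

```python
from concurrent.futures import ThreadPoolExecutor

def enrich_chunks(chunks, enrich_one, max_workers=4):
    """Run enrich_one over chunks concurrently; results keep chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so merging stays deterministic.
        return list(pool.map(enrich_one, chunks))
```

Threads fit here because each call spends most of its time waiting on network I/O rather than using the CPU.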

Reliability

  • Per-chunk retries reduce the impact of transient network/API/parse issues compared to a single do-or-die call.

  • Isolation is clearer: a bad chunk is easier to pinpoint in logs and debug output than “something in the 40-page prompt.”
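The per-chunk retry can be small. The sketch below assumes the model call is passed in as a function (call_model, a hypothetical parameter added here so the example runs without an API client) and that the reply is expected to be JSON; the linear backoff and the broad exception handling are simplifications.

```python
import json
import time

def safe_generate(prompt, call_model, attempts=3, backoff_seconds=1.0):
    """Call the model and parse its JSON reply, retrying on any failure."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return json.loads(call_model(prompt))
        except Exception as exc:  # broad on purpose: API errors and parse errors alike
            last_error = exc
            if attempt < attempts:
                time.sleep(backoff_seconds * attempt)  # short, linearly growing pause
    raise RuntimeError(f"chunk failed after {attempts} attempts") from last_error
```

Because the retry wraps one chunk, a persistent failure names that chunk's prompt rather than the whole document.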

Cost efficiency (be precise)

  • V1 often = one (large) call.

  • V2 often = several (smaller) calls, so total cost can go up on long docs. The efficiency is of a different kind: you spend tokens on refinement and keywords over already-structured chunks, not on the model re-deriving headings and lists from scratch. Structure extraction runs as python-docx work on the CPU (negligible cost) instead of consuming model tokens.

  • For a 10-line doc, V2 can be more calls than V1; for a 200-page doc, V1 may be infeasible while V2 is viable—that is the real trade.

Debuggability

  • V2 adds structured logging per chunk (Chunk i/n), optional --debug files for prompts and outputs, and a scorer that is inspectable code (coverage, structure) rather than a black-box model number.
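To make "inspectable code" concrete, here is one possible shape for such a scorer. It is a sketch under stated assumptions: coverage is measured as the share of source words that reappear in the output, the structural signal is the share of sections with both a heading and content, and the 0.8/0.2 weights are arbitrary illustration values, not the project's.

```python
import re

def _words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))

def compute_confidence(sections: list[dict], original_text: str) -> float:
    source = _words(original_text)
    if not source:
        return 0.0
    output_text = " ".join(
        [s.get("heading", "") for s in sections]
        + [b.get("text", "") for s in sections for b in s.get("content", [])]
    )
    # Coverage: fraction of distinct source words that survive into the output.
    coverage = len(source & _words(output_text)) / len(source)
    # Structure: fraction of sections with a heading and at least one block.
    well_formed = sum(1 for s in sections if s.get("heading") and s.get("content"))
    structure = well_formed / len(sections) if sections else 0.0
    return round(0.8 * coverage + 0.2 * structure, 2)
```

Since the model is asked to reword content, word-level coverage will rarely reach 1.0; the value is most useful for spotting chunks where large parts of the source vanished.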

Real-world use cases

  • RAG / search indexing: Typed blocks map cleanly to chunks for embeddings (per section or per block).

  • Migration and CMS import: Preserving headings and lists is easier when they are not flattened before the model sees them.

  • Policy and compliance docs: Repeatable, logged processing plus versionable code paths (extraction, merge, score).

  • Product / API docs in Word: Multi-section handoffs to API portals or static sites as JSON, not a single unstyled dump.
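As an illustration of the indexing use case, typed blocks can be flattened into one record per block before embedding. The record fields and the to_index_records name below are hypothetical; a real index would choose its own IDs and metadata.

```python
def to_index_records(document: dict) -> list[dict]:
    """Flatten typed blocks into one record per block for embedding or search."""
    records = []
    for s_idx, section in enumerate(document.get("sections", [])):
        for b_idx, block in enumerate(section.get("content", [])):
            records.append({
                "id": f"s{s_idx}-b{b_idx}",            # stable position-based ID
                "heading": section.get("heading", ""),  # kept as retrieval context
                "block_type": block["type"],
                "text": block["text"],
            })
    return records
```

Carrying the section heading on every record lets a retriever show where a matching bullet or paragraph came from.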

Takeaway

V1 is the fastest way to prove that “DOCX can become JSON.” V2 is what you build when the file is long, the structure matters, and you need retries, logs, and a confidence story that does not rest entirely on the last token the model generated.

The evolution is the same as elsewhere in software: push determinism to code, keep the LLM in a small, testable box, and merge with explicit, auditable rules at the end.

GitHub Repo

Feedback and pull requests welcome