Skip to main content

Command Palette

Search for a command to run...

From Word to JSON: A First-Pass DOCX Pipeline with an LLM

Updated
5 min read
From Word to JSON: A First-Pass DOCX Pipeline with an LLM

v1 experiment: extract text from Word, ask a model to structure it, then validate the result. Nothing fancy yet — and that is the point.

Why this exists

Word documents are easy for people to write and painful for programs to treat as data. Parsing .docx XML by hand gets you raw text and styles, but “what does this document mean?” — sections, headings, keywords — is a different problem. This project is a first attempt at a simple pipeline:

  1. Read a .docx and pull out plain text (non-empty paragraphs).

  2. Send that text to a language model with a fixed JSON schema in the prompt.

  3. Parse whatever the model returns, with a small fallback if the model wraps JSON in extra text.

  4. Validate with Pydantic so bad output fails fast instead of silently polluting downstream code. The goal is a working baseline you can improve: better extraction, richer schema, stronger prompts, or different models — without rewriting everything from scratch.

The goal is a working baseline you can improve: better extraction, richer schema, stronger prompts, or different models — without rewriting everything from scratch.

What it does (high level)

Stack (v1)

  • Python 3.12 — tested on macOS.

  • python-docx — paragraph-level text extraction.

  • OpenRouter — LLM inference through an OpenAI-compatible API (OPENROUTER_API_KEY in .env).

  • pydanticDocumentSchema as the contract for output.

  • openai client pointed at OpenRouter’s base URL.

Run

python src/app.py path/to/file.docx

The program prints pretty-printed JSON to stdout and logs each step (extract → prompt → generate → parse → validate).

The pipeline in five steps

  1. Extract — Open the document, keep only non-empty paragraphs, join with newlines. Simple and predictable; no tables or inline objects in this v1.

  2. Build prompt — Embed the target JSON shape and rules (only JSON, double quotes, empty values when missing, confidence 0–1, notes explaining confidence).

  3. Generate — Call the model with a short system message (“return ONLY valid JSON”) and temperature=0 for repeatability.

  4. Parse — Try json.loads on the full string; if that fails, take the substring from the first { to the last } and parse again (handles occasional preamble or markdown fences).

  5. Validate — Instantiate DocumentSchema so structure and types are enforced before anything else trusts the data.

This separation matters: extraction stays dumb; structure is the model’s job; correctness is Pydantic’s job.

The schema (v1)

The model is steered toward something like:

  • title — string.

  • sections — list of objects with heading, content, and bullets.

  • keywords, confidence, notes — plus a built-in “scoring guide” in the prompt so the model can self-rate how well the source text supported the structure.

That is enough to prove the idea and to feed a UI, search index, or another pipeline — while staying easy to extend later (e.g. metadata, source spans, or hierarchical outlines).

What works well in a first version

  • End-to-end path from file to typed JSON in one command.

  • Clear logging for debugging when the model or parser misbehaves.

  • Pydantic at the end — a hard boundary so you do not build on invalid JSON.

  • Pragmatic JSON recovery when the model does not return a bare object.

Honest limitations (and why that is OK)

A first pass should be slightly uncomfortable; otherwise you have over-engineered before you have learned anything. Some known constraints:

  • Paragraph-only extraction — Complex layouts, tables, headers/footers, and images are not represented in the text passed to the model.

  • Token and length — Very long documents may need chunking, summarization, or a higher max_tokens budget; v1 is tuned for “see it work” rather than “ingest a book.”

  • Model choice — The current setup uses a specific OpenRouter model; quality, cost, and latency will vary; swapping models is a knob you can turn without changing the overall design.

  • Trust — The model can still hallucinate structure; confidence and notes help, but they are not a substitute for human review or ground-truth checks where stakes are high.

Treat these as a backlog, not failure: each limitation maps to a concrete next step.

Where to take it next

When you are ready to improve beyond v1, good next moves include:

  • Richer DOCX parsing (tables, styles as hints for headings, list detection).

  • Chunking long files with merge logic for a single JSON result or multiple artifacts.

  • Stricter or looser schema depending on product needs (e.g. citations, IDs, per-section confidence).

  • Evals on a small set of real documents — measure JSON validity rate and field-level agreement with a human-labeled set.

  • API or batch CLI if this moves from a script to a service.

Closing

This project is a deliberately small first slice: Word in, structured JSON out, validation at the end. The architecture is easy to reason about and to replace piece by piece as you learn what your documents and users actually need.

If you are building something similar, start with the same split — extract, prompt, parse, validate — and only add complexity when the data forces you to.

Github Repo

Feedback and pull requests welcome as this grows into a proper tool.