From Word to JSON: A First-Pass DOCX Pipeline with an LLM

v1 experiment: extract text from Word, ask a model to structure it, then validate the result. Nothing fancy yet — and that is the point.
Why this exists
Word documents are easy for people to write and painful for programs to treat as data. Parsing .docx XML by hand gets you raw text and styles, but “what does this document mean?” — sections, headings, keywords — is a different problem. This project is a first attempt at a simple pipeline:
Read a
.docxand pull out plain text (non-empty paragraphs).Send that text to a language model with a fixed JSON schema in the prompt.
Parse whatever the model returns, with a small fallback if the model wraps JSON in extra text.
Validate with Pydantic so bad output fails fast instead of silently polluting downstream code. The goal is a working baseline you can improve: better extraction, richer schema, stronger prompts, or different models — without rewriting everything from scratch.
The goal is a working baseline you can improve: better extraction, richer schema, stronger prompts, or different models — without rewriting everything from scratch.
What it does (high level)
Stack (v1)
Python 3.12 — tested on macOS.
python-docx— paragraph-level text extraction.OpenRouter — LLM inference through an OpenAI-compatible API (
OPENROUTER_API_KEYin.env).pydantic—DocumentSchemaas the contract for output.openaiclient pointed at OpenRouter’s base URL.
Run
python src/app.py path/to/file.docx
The program prints pretty-printed JSON to stdout and logs each step (extract → prompt → generate → parse → validate).
The pipeline in five steps
Extract — Open the document, keep only non-empty paragraphs, join with newlines. Simple and predictable; no tables or inline objects in this v1.
Build prompt — Embed the target JSON shape and rules (only JSON, double quotes, empty values when missing,
confidence0–1,notesexplaining confidence).Generate — Call the model with a short system message (“return ONLY valid JSON”) and
temperature=0for repeatability.Parse — Try
json.loadson the full string; if that fails, take the substring from the first{to the last}and parse again (handles occasional preamble or markdown fences).Validate — Instantiate
DocumentSchemaso structure and types are enforced before anything else trusts the data.
This separation matters: extraction stays dumb; structure is the model’s job; correctness is Pydantic’s job.
The schema (v1)
The model is steered toward something like:
title— string.sections— list of objects withheading,content, andbullets.keywords,confidence,notes— plus a built-in “scoring guide” in the prompt so the model can self-rate how well the source text supported the structure.
That is enough to prove the idea and to feed a UI, search index, or another pipeline — while staying easy to extend later (e.g. metadata, source spans, or hierarchical outlines).
What works well in a first version
End-to-end path from file to typed JSON in one command.
Clear logging for debugging when the model or parser misbehaves.
Pydantic at the end — a hard boundary so you do not build on invalid JSON.
Pragmatic JSON recovery when the model does not return a bare object.
Honest limitations (and why that is OK)
A first pass should be slightly uncomfortable; otherwise you have over-engineered before you have learned anything. Some known constraints:
Paragraph-only extraction — Complex layouts, tables, headers/footers, and images are not represented in the text passed to the model.
Token and length — Very long documents may need chunking, summarization, or a higher
max_tokensbudget; v1 is tuned for “see it work” rather than “ingest a book.”Model choice — The current setup uses a specific OpenRouter model; quality, cost, and latency will vary; swapping models is a knob you can turn without changing the overall design.
Trust — The model can still hallucinate structure;
confidenceandnoteshelp, but they are not a substitute for human review or ground-truth checks where stakes are high.
Treat these as a backlog, not failure: each limitation maps to a concrete next step.
Where to take it next
When you are ready to improve beyond v1, good next moves include:
Richer DOCX parsing (tables, styles as hints for headings, list detection).
Chunking long files with merge logic for a single JSON result or multiple artifacts.
Stricter or looser schema depending on product needs (e.g. citations, IDs, per-section confidence).
Evals on a small set of real documents — measure JSON validity rate and field-level agreement with a human-labeled set.
API or batch CLI if this moves from a script to a service.
Closing
This project is a deliberately small first slice: Word in, structured JSON out, validation at the end. The architecture is easy to reason about and to replace piece by piece as you learn what your documents and users actually need.
If you are building something similar, start with the same split — extract, prompt, parse, validate — and only add complexity when the data forces you to.
Feedback and pull requests welcome as this grows into a proper tool.


