Community
From a messy PDF to a clean Excel — and a finished analysis report — with Codex

The task almost never stops at "turn this PDF into a spreadsheet." That's the visible half. The real job is the question waiting on the other side: which vendor did we spend the most on, is the monthly trend up or down, which line items are outliers. Copying numbers by hand is the slow, error-prone part — and it's exactly the part you can hand off. With a coding agent like Codex, the same session can read the document, build a clean Excel file, and then write the analysis on top of it. About 20–30 minutes the first time; less once the prompt is yours.
What makes this work is that Codex isn't just chatting — it runs code locally (Python, pandas, openpyxl) in its sandbox, reads the file on your disk, and writes a real .xlsx back to a folder you choose. Nothing about it requires you to know Python; you describe the columns you want and check the result.
What you can throw at it
The source format matters less than you'd think. Common ones:
- A digital PDF with tables — financial statements, invoices, exported reports. The text is selectable; extraction is near-lossless.
- A scanned PDF or a photo of a table — needs an OCR pass first. Accuracy is good for clean scans, shakier for crumpled receipts; you spot-check.
- A pile of files — twelve monthly statements, a folder of CSVs. Codex can loop over all of them and stack the result into one sheet.
- A messy spreadsheet someone already made — merged cells, totals rows mixed into data, three header rows. Cleaning this is often the bigger win than extraction.
The recipe
Prereqs: Codex CLI installed, and the source file saved somewhere you can point at (e.g. ~/Downloads/statement.pdf). About 20–30 minutes the first time.
Step 1 — Extract to a clean table. In a Codex session, be specific about the shape you want, not just "convert this":
Read ~/Downloads/statement.pdf. Pull every transaction into a table with these columns: date (YYYY-MM-DD), description, category, amount (number, no currency symbol). Skip subtotal and header rows. Show me the first 10 rows before writing anything.
Asking to see the first rows first is the whole trick — you catch a misread column or a shifted total before it propagates into 400 rows. Naming the exact columns stops the agent from inventing its own schema.
Step 2 — Validate the extraction. Don't skip this. Quick checks that catch most errors:
- Row count — "how many rows did you extract?" Compare to a rough count from the source.
- The total — "sum the amount column; does it match the printed total on the PDF?" A mismatch usually means a totals row leaked into the data or a multi-page table dropped a page.
- Spot-check the edges — eyeball the first and last few rows against the document. Date formats and negative numbers (credits, refunds) are the usual offenders.
Step 3 — Write the Excel, formatted. Once the table is right:
Write this to ~/Desktop/transactions.xlsx. Bold header row, freeze the top row, format the amount column as currency with two decimals, and add a totals row at the bottom.
You get a real spreadsheet — openable in Excel, Numbers, or Google Sheets — not a CSV you still have to clean up.
Step 4 — Go past the table: ask for the analysis. This is where it stops being a conversion tool. The clean data is already in the session, so:
Now analyze this. Total spend by category, the top 5 vendors by amount, month-over-month trend, and flag any transaction more than 2x the median. Put each as its own sheet, add a bar chart for spend-by-category, and write a 5-bullet plain-English summary on the first sheet.
What comes back is a workbook with pivots, a chart, and a written readout — the thing the PDF was standing between you and. If a cut isn't useful, ask for a different one; the data doesn't move, only the questions do.
Variations
- Bank/card statements → a spending picture — categorize transactions, then ask for monthly totals and the categories that grew fastest
- Twelve monthly reports → one trend — point Codex at the folder, have it stack them with a
monthcolumn, then chart the trend across the year - A survey CSV → a summary deck of numbers — counts, percentages, cross-tabs by segment, exported as a clean summary sheet
- A scanned table → digital — run the OCR pass first, validate harder (Step 2 matters more here), then proceed as normal
Where it doesn't fit
- Anything that has to be legally exact — tax filings, audited figures: the agent is a fast first pass, not the final word. Reconcile against the source.
- Genuinely large data — past a few hundred thousand rows you want a database and SQL, not a spreadsheet
- A report you'll regenerate weekly forever — if it's truly recurring, have Codex write a small script you re-run, instead of redoing the conversation each time
- Documents where the layout *is* the meaning — a contract, a form with conditional logic; extracting a table loses the part that mattered
Don’t take our word — try it yourself
For $3, watch Codex / Claude actually get something done — automate a repetitive chore, build a small working web page, or whip up a little tool. Plenty for one real task.
3 days · 2 image credits · one key for both Claude and Codex
Get new posts in your inbox
New recipes and field notes, plus the occasional product update. No spam, unsubscribe anytime.