From a messy PDF to a clean Excel — and a finished analysis report — with Codex

May 29, 2026

A macOS desktop screenshot: the Codex app open at about three-quarters screen over a cluttered desktop of scattered PDF files, with a navy sidebar of conversations, a chat thread asking it to extract data, and an in-app preview showing a clean spreadsheet and a bar chart; the mouse pointer rests near the chart.

The task almost never stops at "turn this PDF into a spreadsheet." That's the visible half. The real job is the question waiting on the other side: which vendor did we spend the most on, is the monthly trend up or down, which line items are outliers. Copying numbers by hand is the slow, error-prone part — and it's exactly the part you can hand off. With a coding agent like Codex, the same session can read the document, build a clean Excel file, and then write the analysis on top of it. About 20–30 minutes the first time; less once the prompt is yours.

What makes this work is that Codex isn't just chatting — it runs code locally (Python, pandas, openpyxl) in its sandbox, reads the file on your disk, and writes a real .xlsx back to a folder you choose. Nothing about it requires you to know Python; you describe the columns you want and check the result.

What you can throw at it

The source format matters less than you'd think. Common ones:

A digital PDF with tables — financial statements, invoices, exported reports. The text is selectable; extraction is near-lossless.
A scanned PDF or a photo of a table — needs an OCR pass first. Accuracy is good for clean scans, shakier for crumpled receipts; you spot-check.
A pile of files — twelve monthly statements, a folder of CSVs. Codex can loop over all of them and stack the result into one sheet.
A messy spreadsheet someone already made — merged cells, totals rows mixed into data, three header rows. Cleaning this is often the bigger win than extraction.

The recipe

Prereqs: Codex CLI installed, and the source file saved somewhere you can point at (e.g. ~/Downloads/statement.pdf). About 20–30 minutes the first time.

Step 1 — Extract to a clean table. In a Codex session, be specific about the shape you want, not just "convert this":

Read ~/Downloads/statement.pdf. Pull every transaction into a table with these columns: date (YYYY-MM-DD), description, category, amount (number, no currency symbol). Skip subtotal and header rows. Show me the first 10 rows before writing anything.

Asking to see the first rows first is the whole trick — you catch a misread column or a shifted total before it propagates into 400 rows. Naming the exact columns stops the agent from inventing its own schema.

Step 2 — Validate the extraction. Don't skip this. Quick checks that catch most errors:

Row count — "how many rows did you extract?" Compare to a rough count from the source.
The total — "sum the amount column; does it match the printed total on the PDF?" A mismatch usually means a totals row leaked into the data or a multi-page table dropped a page.
Spot-check the edges — eyeball the first and last few rows against the document. Date formats and negative numbers (credits, refunds) are the usual offenders.

Step 3 — Write the Excel, formatted. Once the table is right:

Write this to ~/Desktop/transactions.xlsx. Bold header row, freeze the top row, format the amount column as currency with two decimals, and add a totals row at the bottom.

You get a real spreadsheet — openable in Excel, Numbers, or Google Sheets — not a CSV you still have to clean up.

Step 4 — Go past the table: ask for the analysis. This is where it stops being a conversion tool. The clean data is already in the session, so:

Now analyze this. Total spend by category, the top 5 vendors by amount, month-over-month trend, and flag any transaction more than 2x the median. Put each as its own sheet, add a bar chart for spend-by-category, and write a 5-bullet plain-English summary on the first sheet.

What comes back is a workbook with pivots, a chart, and a written readout — the thing the PDF was standing between you and. If a cut isn't useful, ask for a different one; the data doesn't move, only the questions do.

Variations

Bank/card statements → a spending picture — categorize transactions, then ask for monthly totals and the categories that grew fastest
Twelve monthly reports → one trend — point Codex at the folder, have it stack them with a month column, then chart the trend across the year
A survey CSV → a summary deck of numbers — counts, percentages, cross-tabs by segment, exported as a clean summary sheet
A scanned table → digital — run the OCR pass first, validate harder (Step 2 matters more here), then proceed as normal

Where it doesn't fit

Anything that has to be legally exact — tax filings, audited figures: the agent is a fast first pass, not the final word. Reconcile against the source.
Genuinely large data — past a few hundred thousand rows you want a database and SQL, not a spreadsheet
A report you'll regenerate weekly forever — if it's truly recurring, have Codex write a small script you re-run, instead of redoing the conversation each time
Documents where the layout *is* the meaning — a contract, a form with conditional logic; extracting a table loses the part that mattered

Don’t take our word — try it yourself

For $3, watch Codex / Claude actually get something done — automate a repetitive chore, build a small working web page, or whip up a little tool. Plenty for one real task.

3 days · 2 image credits · one key for both Claude and Codex

Try it for $3 →

Get new posts in your inbox

New recipes and field notes, plus the occasional product update. No spam, unsubscribe anytime.