From EDGAR Filings to a Structured Database using Claude Code
Video 3 in a series on Claude Code
Part 3 of a series on AI coding tools for empirical research, accompanying my Markus Academy video series.
In the last post, we pulled structured data from a government API—homeownership rates by age, already in nice tabular form. But a lot of research data isn’t like that. It’s buried in PDFs, HTML filings, regulatory documents, earnings call transcripts. Getting from “the information exists somewhere on the internet” to “I have a queryable dataset” is often the hardest part of a project.
So we’re going to do exactly that: scrape SEC EDGAR, parse messy 10-K filings, extract specific sections, build a structured database, and run a simple analysis – all through Claude Code. The whole pipeline takes about 30 minutes, and it produces an actual finding.
The question
Here’s what motivated this: the 2025 tariff escalation has been the dominant policy story in markets. If you follow trade-exposed firms at all, you’ve seen the stock price reactions, the earnings call commentary, the analyst reports. But how did firms change their formal risk disclosures in response? Item 1A of the 10-K – the Risk Factors section – is where companies are legally required to describe material risks. It’s the section that lawyers agonize over. When a new risk emerges, it shows up here.
So the question is simple: did firms increase their tariff-related language in the Risk Factors sections of their 10-K filings after the 2025 tariff escalation? This connects to the broader literature on policy uncertainty measurement (think Baker, Bloom & Davis) and risk factor disclosure (e.g., Campbell et al., 2014). But today we’re just doing the descriptive work – building the dataset and seeing what’s there.
Scraping EDGAR
I start Claude Code in an empty directory. Unlike the last post where I was vague about the data source, here I know exactly where the data lives – SEC EDGAR – and I can give Claude some useful context about how it works.
> ❯ I want to build a dataset of 10-K Risk Factors from SEC EDGAR
to study how tariff-related risk disclosure changed 2022–2025.
Here's what I know about EDGAR:
- Every company has a CIK (Central Index Key)
- The EDGAR filing API is at https://efts.sec.gov/LATEST/
- Rate limit: 10 requests/second, must include User-Agent header
I want 10-Ks for ~30 trade-exposed firms — manufacturing, retail,
tech hardware. Caterpillar, Nike, Apple, Deere, Ford, 3M, Intel
If you remember from the first video, I talked about how Claude’s performance depends on how clear and precise you are. The more precise I am, the less it has to search randomly for things. Claude needs to know about rate limits and User-Agent headers before it writes the code, not after it gets a 403 error. (Of course, it still screwed up on this; I wasn’t insistent enough.)
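To make the rate limit and User-Agent points concrete, here is a minimal sketch of a request throttle. It is not the code Claude wrote; the class name, the helper name, and the "name plus email" header format are my own illustrative assumptions. The clock and sleep functions are injectable so the logic can be checked without waiting or touching the network.

```python
import time


class RateLimiter:
    """Enforce a minimum gap between calls so we stay under max_per_second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first call

    def wait(self):
        """Block just long enough to respect the minimum interval."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def edgar_headers(name, email):
    """Build an identifying User-Agent header (the exact format is an assumption)."""
    return {"User-Agent": f"{name} {email}"}
```

In use, you would call `limiter.wait()` before every request and pass `edgar_headers(...)` to whatever HTTP client you prefer. Setting the limiter a little below 10 requests per second leaves some margin.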
Plan mode and interactive design
Something notable happened right away that didn’t occur in the previous videos: Claude entered plan mode. It looked at the prompt, recognized this was a substantial multi-step project, and said it wanted to design the approach before writing any code. It researched the EDGAR API—fetching documentation, looking at examples on GitHub, understanding the filing index structure—before proposing anything.
Then Claude did something I want to highlight: it asked me questions. Three of them:
Database format: “Do you want a SQLite database? A flat file? Both?” I told it I wanted DuckDB—it’s similar to SQLite but it’s how I prefer to do things.
Keyword approach: “Do you want keyword search, full text extraction with flagging, or LLM classification (which would cost more)?” I chose keyword search—just look for words related to tariffs.
User-Agent: “What entity should we use for the User-Agent header? Do you want to fill it in later or provide it now?” I gave it my name and email right there.
Markus also had a good suggestion during the conversation: add “liberation day” to the tariff keywords, since April 2, 2025 was the major tariff announcement. Claude immediately updated the keyword list to include “liberation day” and “liberation day tariff.”
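The keyword approach can be sketched in a few lines. This is an illustration, not Claude’s actual code: the keyword list below is assembled from terms mentioned in this post and the real dictionary may differ, and the function name and snippet window are hypothetical.

```python
import re

# Illustrative keyword list drawn from terms mentioned in this post (an assumption).
TARIFF_KEYWORDS = [
    "tariff", "trade war", "trade dispute", "trade tension",
    "anti-dumping", "countervailing dut", "liberation day",
]


def find_keyword_hits(text, keywords=TARIFF_KEYWORDS, window=60):
    """Return (keyword, snippet) pairs for each case-insensitive match in text."""
    hits = []
    for kw in keywords:
        # Leading \b avoids matching inside another word; there is no trailing \b,
        # so "tariff" also matches "tariffs".
        pattern = re.compile(r"\b" + re.escape(kw), re.IGNORECASE)
        for m in pattern.finditer(text):
            start = max(0, m.start() - window)
            end = min(len(text), m.end() + window)
            hits.append((kw, text[start:end].strip()))
    return hits
```

One caveat with this simple version: a phrase like “liberation day tariff” counts once for “liberation day” and once for “tariff,” so you would want to decide how to handle overlapping terms before reporting totals.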
After this planning phase, Claude saved its plan to a file—just like the intentional compaction approach I described in post 1. The plan included the target company list, the database schema, the keyword dictionary, and the pipeline architecture. Then it cleared the context window but kept the plan on disk, so it could start fresh with the full context budget while retaining all the design decisions we’d made. This is exactly the pattern we talked about: research → write plan to file → new session reads plan → execute.
Building and running the pipeline
Claude wrote about 480 lines of Python to implement the pipeline. It has all the phases: CIK lookup, filing index queries, document downloading, HTML parsing with BeautifulSoup, and structured storage. The practical features matter—it caches downloaded HTML so you don’t re-download on subsequent runs, it logs errors without stopping the batch, and it tracks every download attempt in metadata.
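The three practical features (cache, log errors without stopping, track every attempt) fit in one small function. This is a sketch under my own assumptions, not an excerpt from the 480-line script: `fetch` stands in for any callable that takes a URL and returns bytes, and the three-column log layout is hypothetical.

```python
import csv
from pathlib import Path


def fetch_with_cache(url, dest, fetch, log_path):
    """Download url to dest unless it is already on disk; log the outcome.

    `fetch` is any callable taking a URL and returning bytes (hypothetical;
    swap in your HTTP client). Errors are logged and swallowed so that one
    bad filing does not stop the whole batch.
    """
    dest = Path(dest)
    if dest.exists():
        status = "cached"
    else:
        try:
            data = fetch(url)
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_bytes(data)
            status = "downloaded"
        except Exception as exc:  # record the failure and keep going
            status = f"error: {exc}"
    # Append one row per attempt: url, local path, outcome.
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow([url, str(dest), status])
    return status
```

Because the cache check comes first, re-running the batch after fixing a problem only touches the files that are still missing.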
One useful feature: you can queue a follow-up question while Claude is running a long process. While the downloader was chugging through filings, I typed my next question and it waited until the current task finished before answering.
> Run it for the full list. Let's see how it goes.
The script found CIKs for 29 out of 30 firms. The one failure was Gap Inc., which was listed under the ticker “GPS” in our list. Claude couldn’t find a CIK for it, so I asked what happened. It turned out that Gap Inc. had rebranded and appeared under “GAP” in some datasets, not “GPS.” Claude looked up the correct CIK (39911), updated the script to use the right ticker, and re-ran. Because the script had caching built in, it only needed to download four new files; the rest were already on disk. The re-run took about 43 seconds.
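The ticker-to-CIK step is where the Gap problem surfaced, so here is a rough sketch of it. I am assuming the lookup table has the shape of the SEC’s published company_tickers.json file (a dict of records with `cik_str`, `ticker`, and `title` fields); I have not confirmed that this is the source Claude’s script used, and both function names are hypothetical.

```python
def build_cik_map(company_tickers):
    """Map upper-case ticker -> 10-digit zero-padded CIK string.

    Assumes records shaped like {"cik_str": 320193, "ticker": "AAPL", ...}.
    """
    return {
        row["ticker"].upper(): str(row["cik_str"]).zfill(10)
        for row in company_tickers.values()
    }


def lookup_ciks(tickers, cik_map):
    """Split tickers into those with a CIK and those that need a manual look."""
    found, missing = {}, []
    for ticker in tickers:
        cik = cik_map.get(ticker.upper())
        if cik is None:
            missing.append(ticker)
        else:
            found[ticker] = cik
    return found, missing
```

Returning the misses explicitly, rather than silently dropping them, is what makes a stale ticker like “GPS” show up right away.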
I also demonstrated something useful during the download: I accidentally hit escape and interrupted Claude mid-process. But because of how the context window works, I could just type “please continue” and it picked up right where it left off. (You can also hit Ctrl+O to toggle the “thinking” view and see Claude’s internal reasoning at any point.)
Extracting and structuring
Raw HTML files sitting in folders aren’t research data. You can’t query them, you can’t join them to Compustat, you can’t do anything without loading everything into memory and parsing it every time. I need a structured database.
Claude wrote a parser that handles the formatting variations – and there are a lot. Some filings are wrapped in HTML with <div> tags and CSS styling. Some are plain text with dots and dashes for formatting. The section headers sometimes say “ITEM 1A” in all caps, sometimes “Item 1A.” with a period, sometimes they’re inside an <a> tag. Claude’s parser tries multiple regex patterns and falls back gracefully.
The first pass extracted Item 1A from 112 out of 120 filings. I asked why eight were missing. Claude investigated and found that Honeywell and Intel had a formatting edge case: “Item 1A” and “Risk Factors” appeared on separate lines with a newline between them, but the regex required them on the same line. Claude fixed the regex to allow newlines and re-ran: 119 out of 120. The one remaining holdout was 3M, which had an unusual filing structure that would need manual attention.
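The newline fix is easy to illustrate. The patterns below are my own simplified sketch, not Claude’s parser: they assume the HTML has already been reduced to plain text, and they use the next item heading (1B or 2) as an assumed end marker.

```python
import re

# \s matches newlines, so "Item 1A" and "Risk Factors" may sit on separate lines.
START = re.compile(r"item\s+1a\.?\s*[:.\-]?\s*risk\s+factors", re.IGNORECASE)
# Assumed end marker: the next item heading (1B or 2).
END = re.compile(r"item\s+(?:1b|2)\b", re.IGNORECASE)


def extract_item_1a(text):
    """Return the longest span between an Item 1A heading and the next item heading.

    Taking the longest candidate is a heuristic for skipping the short
    table-of-contents entry. Returns None if no heading is found.
    """
    best = None
    for m in START.finditer(text):
        end = END.search(text, m.end())
        stop = end.start() if end else len(text)
        section = text[m.end():stop].strip()
        if best is None or len(section) > len(best):
            best = section
    return best
```

A pattern that used a literal space between “1A” and “Risk Factors” would miss the second heading style entirely, which is the kind of failure described above.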
This is the “95% fast, last 5% requires judgment” pattern. For a quick analysis, 119 out of 120 is fine. For a published paper, I’d go back and hand-check 3M’s filing. Either way, the database is built.
Querying the database
Now the payoff. I have a structured DuckDB database and I can actually ask questions of the data.
> Which firms mentioned tariffs the most?
Claude wrote a query and came back with specific results. Deere (John Deere) leads with the most tariff mentions, increasing from three to four over the 2022-2025 period. Manufacturing and tech hardware firms mention tariffs most frequently; retail firms mention them least. Walmart only started mentioning tariffs in 2025. AGO dropped from four mentions to zero.
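For a sense of what a query like this looks like, here is a sketch. Two loud caveats: I am using Python’s built-in sqlite3 rather than DuckDB so that it runs with no extra installs, and the two-table schema and the rows are made up purely to exercise the query. They are not the project’s schema or results.

```python
import sqlite3


def top_tariff_firms(conn, limit=5):
    """Return (ticker, total_mentions) rows, highest total first."""
    return conn.execute(
        """
        SELECT f.ticker, SUM(k.n_mentions) AS total
        FROM keyword_counts AS k
        JOIN filings AS f ON f.filing_id = k.filing_id
        GROUP BY f.ticker
        ORDER BY total DESC, f.ticker
        LIMIT ?
        """,
        (limit,),
    ).fetchall()


# Hypothetical schema with placeholder firms and counts, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE filings (filing_id INTEGER PRIMARY KEY, ticker TEXT, fiscal_year INTEGER);
    CREATE TABLE keyword_counts (filing_id INTEGER, keyword TEXT, n_mentions INTEGER);
    INSERT INTO filings VALUES (1, 'AAA', 2024), (2, 'AAA', 2025), (3, 'BBB', 2025);
    INSERT INTO keyword_counts VALUES
        (1, 'tariff', 2), (2, 'tariff', 3), (2, 'trade war', 1), (3, 'tariff', 1);
""")
print(top_tariff_firms(conn))
```

The SQL itself is plain enough that it should carry over to DuckDB with little or no change.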
The context snippets are useful too. In earlier filings, you see generic language: “changes in trade policy could adversely affect our business.” In 2025, you see specific language referencing Section 301, specific tariff rates, and particular product categories. The specificity of the language tracks the specificity of the policy.
Extending to 2010
I then asked Claude to download the 2010 10-Ks for the same set of firms and run the same tariff keyword analysis. The comparison was interesting: even in 2010, 18 out of the roughly 30 companies mentioned tariffs. But the intensity was different—25 total tariff paragraphs across 18 firms in 2010, versus 26 to 35 per year in the 2022-2025 period across roughly the same number of firms. The vocabulary had also shifted: in 2010, the terms were narrower—“tariff,” “trade restrictions,” “anti-dumping,” “countervailing duties.” By 2025, the language had expanded to include “trade tension,” “trade dispute,” “trade war,” “liberation day.” The breadth of the risk disclosure, not just its frequency, had changed.
This is a finding. Not a causal result – you’d need more structure for that – but a clean descriptive fact that would motivate a deeper analysis. If you were interested in the cross-section (which types of firms saw the biggest increases), you could extend this to all S&P 500 firms without changing the pipeline. The cost of expanding the sample is low. We did the whole thing in about 30 minutes.
The final pipeline
project/
├── data/
│ ├── firm_list.csv (input: tickers)
│ ├── filings/
│ │ ├── AAPL/
│ │ │ ├── AAPL_2022_10K.html
│ │ │ ├── AAPL_2023_10K.html
│ │ │ └── ...
│ │ ├── CAT/
│ │ └── ...
│ ├── filings.db (DuckDB database)
│ └── metadata.csv (download tracking)
├── figures/
│ └── tariff_disclosure.pdf
├── download_filings.py
├── extract_and_structure.py
├── analyze.py
├── requirements.txt
├── Justfile
└── README.md
The key thing is the database. filings.db is the reusable research asset – not the raw HTML files, not the text extractions. Once you have a structured database, you can:
Add new keywords without re-parsing all the filings
Join to other datasets (stock returns around filing dates, firm characteristics from Compustat)
Run SQL queries directly instead of writing Python parsing code every time
Hand the database to a coauthor who doesn’t need to re-run the scraping pipeline
This is the pattern for any text-as-data project: scrape → parse → structure → analyze. The raw files are an intermediate product. The database is the deliverable.
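To see why the database makes new keywords cheap, consider this sketch. It assumes the extracted Item 1A text is stored in a table; the table and column names are hypothetical, the rows are invented, and I use sqlite3 in place of DuckDB so the example runs on a stock Python install.

```python
import sqlite3


def count_keyword(conn, keyword):
    """Count case-insensitive occurrences of keyword in each stored Item 1A text."""
    rows = conn.execute(
        "SELECT ticker, fiscal_year, item_1a_text FROM risk_factors"
    ).fetchall()
    needle = keyword.lower()
    return {(t, y): text.lower().count(needle) for t, y, text in rows}


# Invented rows in a hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE risk_factors (ticker TEXT, fiscal_year INTEGER, item_1a_text TEXT)"
)
conn.executemany(
    "INSERT INTO risk_factors VALUES (?, ?, ?)",
    [
        ("AAA", 2024, "Changes in trade policy could adversely affect our business."),
        ("AAA", 2025, "Export controls and new export controls rules may raise costs."),
    ],
)
print(count_keyword(conn, "export controls"))
```

Nothing here touches the raw HTML or the parser: a new keyword is one function call against text that is already extracted.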
A few things I’ve learned
Front-load context for scraping tasks. Tell Claude about rate limits, authentication, API structure, and data format before it writes the code. With EDGAR specifically, mentioning the User-Agent requirement and the 10 req/sec limit upfront avoids a whole round of debugging.
Let Claude ask you questions. This was different from the earlier videos. Rather than specifying every detail upfront, I gave Claude the high-level task and it asked me about the design decisions: database format, keyword approach, user-agent identity. These are research decisions, and I should be the one making them, but I don’t need to anticipate every one in the initial prompt.
Build the database schema yourself. Or at least approve it. What tables do you need? What keywords matter? What metadata should you track? Claude proposed a schema, but the choices—like tracking keyword counts with context snippets for spot-checking—reflect domain knowledge about what I’d need downstream. That’s the researcher’s job.
Resume capability is not optional. When you’re downloading 100+ files over the network, things will fail. Timeouts, rate limit errors, missing filings, server glitches. The Gap Inc. ticker issue is a good example: because the script cached everything, fixing the ticker and re-running only required downloading four new files instead of re-doing all 120.
Quality reports save you later. The extraction quality report – which filings parsed successfully, word count distributions, flagged suspiciously short extractions – is how we caught the Honeywell/Intel regex issue and the 3M edge case. Without it, you’d discover months later that part of your sample has garbage data.
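A quality report does not need to be elaborate. Here is a minimal sketch of the idea; the function name, the input format, and the 500-word threshold for “suspiciously short” are all my own illustrative choices rather than what the pipeline actually used.

```python
from statistics import median


def quality_report(extractions, min_words=500):
    """Summarize extraction results.

    `extractions` maps a filing label to its extracted text, or None if
    parsing failed. `min_words` is an arbitrary illustrative threshold.
    """
    failed = [k for k, v in extractions.items() if v is None]
    counts = {k: len(v.split()) for k, v in extractions.items() if v is not None}
    short = [k for k, n in counts.items() if n < min_words]
    return {
        "n_total": len(extractions),
        "n_parsed": len(counts),
        "failed": failed,
        "suspiciously_short": short,
        "median_words": median(counts.values()) if counts else None,
    }
```

The failed list tells you what the parser missed outright, and the short list catches the quieter problem where the parser matched a heading but grabbed almost nothing after it.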
DuckDB (or SQLite) is underrated for research. Most economists I know work with CSVs and dataframes. But for text-as-data projects, a SQL-queryable database is a much better fit. Zero configuration (it’s a single file), queryable with SQL (which Claude writes fluently), and it handles the relational structure naturally – filings have many keywords, firms have many filings. You don’t need Postgres or a server. Just a .db file. I prefer DuckDB, but SQLite works too.
The finding matters. It’s tempting to treat scraping as a pure infrastructure task – “I built a pipeline, done.” But the whole point is to learn something. Even a simple descriptive analysis (tariff keywords increased, the vocabulary shifted from narrow trade terms to broader policy-risk language, and Walmart didn’t mention tariffs until 2025) gives you confidence that the pipeline is working and the data is real.
What’s next
In the next post, we tackle what happens when your dataset gets big – too big for RAM, too slow for a single machine. We’ll cover DuckDB, parallelization, and running jobs on remote clusters, all orchestrated through Claude Code.


