I Built a Finance Website

Part 1 - how to get company financial data

October 08, 2025

I built a website, myfinsight.com, that provides financial insights into US public companies. This series of articles covers querying earnings reports from the Securities and Exchange Commission (SEC) and publishing the data to the website, Mailchimp newsletters, and social media such as X and Threads.

I have a data pipeline that runs every day to query new SEC filings. The new filings are then posted on the website so that it is always up to date with the latest company filings. Getting the data from the SEC is not difficult, and you don’t need to be an expert in web scraping. The daily workflow of the pipeline goes as follows:

1. Get the list of today’s company filings at https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent
2. For each company filing, there is a filing page, something like https://www.sec.gov/Archives/edgar/data/907471/000090747125000090/0000907471-25-000090-index.html. Download all the “Data Files” listed on this page.
3. Parse the downloaded files to get the data you want. The difficulty of this part depends on how much detail you want from the files.
4. Upload the parsed data to Google BigQuery.
5. Clear the cache of the website so the new data can be loaded from BigQuery to the website.
6. Send newsletters about today’s company filings (with summaries and beautiful diagrams) to subscribers via the Mailchimp API.
7. Pick one company filing and post its summary and diagram on X and Threads via their APIs.

This article is about steps #1 and #2 above. Note that this is not the only way to get SEC filings. The SEC also provides its own set of APIs that you can call directly instead of scraping and downloading from the website. If you just want simple facts, for example, the net income of a company, then the API would suffice. Alternatively, you can download the compressed historical files from the SEC. The file size is around 2 GB for each day. If you just want certain companies and/or certain types of filings, downloading and processing such big files is not necessary. Furthermore, that data is delayed by one day; today’s filings only become available the next day.
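For the API route mentioned above, a minimal sketch might look like the following. It assumes the companyconcept endpoint on data.sec.gov and a JSON response whose "units" object maps a unit name such as USD to a list of reported values. The function names and the sample payload are illustrative, so check them against the SEC's API documentation before relying on them.

```python
import json
import urllib.request


def concept_url(cik: int, tag: str, taxonomy: str = "us-gaap") -> str:
    """Build the companyconcept URL; the CIK is zero-padded to 10 digits."""
    return (
        "https://data.sec.gov/api/xbrl/companyconcept/"
        f"CIK{cik:010d}/{taxonomy}/{tag}.json"
    )


def fetch_concept(cik: int, tag: str, user_agent: str) -> dict:
    """Download one concept for one company (network call, not run here)."""
    request = urllib.request.Request(
        concept_url(cik, tag), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def values_by_period(payload: dict, unit: str = "USD") -> list:
    """Return (period end, value) pairs for one unit, oldest first."""
    rows = payload.get("units", {}).get(unit, [])
    return sorted((row["end"], row["val"]) for row in rows)


# Hypothetical payload with the assumed shape, for illustration only.
sample = {
    "units": {
        "USD": [
            {"end": "2025-06-30", "val": 24986, "form": "10-Q"},
            {"end": "2025-03-31", "val": 21000, "form": "10-Q"},
        ]
    }
}

print(concept_url(907471, "NetIncomeLoss"))
print(values_by_period(sample))
```

In real use, fetch_concept(907471, "NetIncomeLoss", "example@email.com") would supply the payload instead of the hand-written sample.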

Get the list of today’s company filings

10-K and 10-Q are among the most important filings that people care about. 10-K is the annual filing and 10-Q is the quarterly filing. To get the latest 10-Q filings, one can use the URL:

https://www.sec.gov/cgi-bin/browse-edgar?type=10-Q&action=getcurrent

For 10-K filings, just replace 10-Q with 10-K in the URL. If you go to this URL, you can see a table of 10-K/Q filings. One can use the pandas library to convert this HTML table into a pandas DataFrame:

import io
import pandas as pd
import requests

response = requests.get(
    "https://www.sec.gov/cgi-bin/browse-edgar?type=10-Q&action=getcurrent",
    headers={"User-Agent": "example@email.com"},
)
response.raise_for_status()  # fail early on an HTTP error
tables = pd.read_html(
    io.StringIO(response.content.decode("utf-8")), extract_links="all"
)

where example@email.com is a placeholder; I usually put my work email so that the SEC can tell who is making the requests, although I doubt they actually check. Note that the read_html method returns a list of tables. The page contains multiple tables, which might not be obvious because of the invisible borders. The main table we care about is the 6th table, or tables[5], which contains the recent filings. Don’t worry, the SEC website is rarely updated; tables[5] has been the correct table for the past few years, so a program that relies on this index should keep working for a long time.

Notice that the code snippet above specifies extract_links="all". With this option, the hyperlinks in the table are returned along with the cell text. This is important because we need the URL of each filing's page. With some text processing, you can get something like:

[Figure: Company filings]

And there you have it: today’s filings (2025-09-17) with company names and their filing URLs (/Archives/edgar/data/…)
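As one possible piece of that text processing, the sketch below turns a relative filing link into an absolute URL and pulls out the company's CIK and the accession number. The regular expression is an assumption inferred from the example URL used in this article, and parse_filing_link is a hypothetical helper name.

```python
import re
from urllib.parse import urljoin

SEC_BASE = "https://www.sec.gov"

# Assumed layout: /Archives/edgar/data/<CIK>/<folder>/<accession>-index.html
INDEX_PATTERN = re.compile(
    r"/Archives/edgar/data/(?P<cik>\d+)/\d+/"
    r"(?P<accession>\d{10}-\d{2}-\d{6})-index\.html?$"
)


def parse_filing_link(relative_url: str) -> dict:
    """Extract the CIK, accession number, and absolute URL of a filing page."""
    match = INDEX_PATTERN.search(relative_url)
    if match is None:
        raise ValueError(f"not a filing index link: {relative_url!r}")
    return {
        "cik": int(match["cik"]),
        "accession": match["accession"],
        "url": urljoin(SEC_BASE, relative_url),
    }


link = "/Archives/edgar/data/907471/000090747125000090/0000907471-25-000090-index.html"
info = parse_filing_link(link)
print(info["cik"], info["accession"])
print(info["url"])
```

Raising on an unexpected link, rather than skipping it silently, makes a change in the page layout visible as soon as it happens.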

Get the actual filing of each company

Once we have the filing URL, we can download the filing data from it. Let’s look at an example filing page:

https://www.sec.gov/Archives/edgar/data/907471/000090747125000090/0000907471-25-000090-index.html

[Figure: Company filings]

The first table (with the first column “Seq” 1, 2, 3, 4, 5, 11 in this example) contains the files that are meant for humans, for example, HTML webpages and pictures. The second table (with “Seq” 6-10 and 99) contains the files meant for machines to process; these are the files we want.
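To collect the download links of the second table, something like the sketch below could work. It uses only the standard library and assumes that the table is marked up as <table summary="Data Files">, which should be verified against the live page. The sample HTML and its file names are made up for illustration.

```python
from html.parser import HTMLParser


class DataFileLinks(HTMLParser):
    """Collect hrefs found inside <table summary="Data Files">."""

    def __init__(self):
        super().__init__()
        self.in_data_table = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "table":
            # Assumption: the data table is identified by its summary attribute.
            self.in_data_table = attributes.get("summary") == "Data Files"
        elif tag == "a" and self.in_data_table and attributes.get("href"):
            self.links.append(attributes["href"])

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_data_table = False


# Hypothetical, heavily trimmed filing index page.
SAMPLE_INDEX = """
<table class="tableFile" summary="Document Format Files">
  <tr><td>1</td><td><a href="/Archives/edgar/data/907471/000090747125000090/report.htm">report.htm</a></td></tr>
</table>
<table class="tableFile" summary="Data Files">
  <tr><td>6</td><td><a href="/Archives/edgar/data/907471/000090747125000090/sample.xsd">sample.xsd</a></td></tr>
  <tr><td>7</td><td><a href="/Archives/edgar/data/907471/000090747125000090/sample_cal.xml">sample_cal.xml</a></td></tr>
</table>
"""

parser = DataFileLinks()
parser.feed(SAMPLE_INDEX)
print(parser.links)
```

The parser does not handle nested tables, which is enough for a flat page like the sample but may need adjusting for other layouts.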

So what are these files? If we look at the “Type” column, there are 6 different types of files:

XML file

This is the most important file that contains all the “facts”. Facts are financial numbers with contexts. For example, “net income” is a fact in an income statement. “Net income” is labeled “us-gaap:NetIncomeLoss” in this file, and you can find something like:

<us-gaap:NetIncomeLoss contextRef="ref_xyz" unitRef="usd">24986</us-gaap:NetIncomeLoss>

This means that the net income is $24986 in this filing; the value itself is a bare number, and the currency comes from the unit that unitRef points to. If you search for the key ref_xyz in the file, you can find the context it refers to, which gives the period associated with this number, e.g. this quarter, and sometimes dimensions, e.g. “Taiwan”, meaning this income originated from sales in Taiwan.

This file also contains the HTML tables as seen in the actual filing.
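A minimal sketch of reading such a fact together with its context, using only the standard library, is shown below. The sample document is hypothetical and heavily trimmed, and find_facts is an illustrative helper rather than part of any library. Because the us-gaap namespace changes with the taxonomy year, the sketch matches facts by local name.

```python
import xml.etree.ElementTree as ET

XBRLI = "{http://www.xbrl.org/2003/instance}"
XBRLDI = "{http://xbrl.org/2006/xbrldi}"

# Hypothetical instance document containing one context and one fact.
SAMPLE = """<xbrli:xbrl
    xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:xbrldi="http://xbrl.org/2006/xbrldi"
    xmlns:us-gaap="http://fasb.org/us-gaap/2025">
  <xbrli:context id="ref_xyz">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.sec.gov/CIK">0000907471</xbrli:identifier>
      <xbrli:segment>
        <xbrldi:explicitMember dimension="srt:StatementGeographicalAxis">country:TW</xbrldi:explicitMember>
      </xbrli:segment>
    </xbrli:entity>
    <xbrli:period>
      <xbrli:startDate>2025-04-01</xbrli:startDate>
      <xbrli:endDate>2025-06-30</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>
  <us-gaap:NetIncomeLoss contextRef="ref_xyz" unitRef="usd" decimals="0">24986</us-gaap:NetIncomeLoss>
</xbrli:xbrl>"""


def find_facts(root, local_name):
    """Return every fact with the given local name, joined with its context."""
    contexts = {c.get("id"): c for c in root.iter(f"{XBRLI}context")}
    facts = []
    for element in root.iter():
        is_target = element.tag.rpartition("}")[2] == local_name
        if not is_target or element.get("contextRef") is None:
            continue
        context = contexts[element.get("contextRef")]
        period = context.find(f"{XBRLI}period")
        # Duration facts have start/end dates; point-in-time facts have an instant.
        end = period.findtext(f"{XBRLI}endDate") or period.findtext(f"{XBRLI}instant")
        dimensions = {
            member.get("dimension"): member.text.strip()
            for member in context.iter(f"{XBRLDI}explicitMember")
        }
        facts.append({
            "value": float(element.text),
            "start": period.findtext(f"{XBRLI}startDate"),
            "end": end,
            "dimensions": dimensions,
        })
    return facts


facts = find_facts(ET.fromstring(SAMPLE), "NetIncomeLoss")
print(facts)
```

A filing usually reports the same concept for several periods and dimensions, so the function returns a list instead of a single value.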

EX-101.SCH file

This file is relatively boring. It lists all the financial tables in this filing. For example, the balance sheet is usually listed as “http://laab.com/role/CondensedBalanceSheets” (roleURI):

<link:roleType roleURI="http://laab.com/role/CondensedBalanceSheets">
  <link:definition>110200 - Statement - CONDENSED BALANCE SHEETS</link:definition>
</link:roleType>

It could be useful when you want to use roleURI to find the same table across different files.
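As a rough sketch, the roleURI-to-table mapping can be read with the standard library as follows. The sample schema is trimmed, and its second roleType is made up for illustration. Splitting the definition into a sort code, a category, and a title assumes the "code - category - title" pattern seen in the snippet above.

```python
import xml.etree.ElementTree as ET

LINK = "{http://www.xbrl.org/2003/linkbase}"

# Trimmed sample; the second roleType is hypothetical.
SAMPLE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:link="http://www.xbrl.org/2003/linkbase">
  <xs:annotation><xs:appinfo>
    <link:roleType roleURI="http://laab.com/role/CondensedBalanceSheets">
      <link:definition>110200 - Statement - CONDENSED BALANCE SHEETS</link:definition>
    </link:roleType>
    <link:roleType roleURI="http://laab.com/role/CondensedStatementsOfOperations">
      <link:definition>110300 - Statement - CONDENSED STATEMENTS OF OPERATIONS</link:definition>
    </link:roleType>
  </xs:appinfo></xs:annotation>
</xs:schema>"""


def role_definitions(xml_text):
    """Map each roleURI to its (sort code, category, title) triple."""
    roles = {}
    for role in ET.fromstring(xml_text).iter(f"{LINK}roleType"):
        definition = role.findtext(f"{LINK}definition", default="").strip()
        # Split at most twice so that a title containing " - " stays whole.
        roles[role.get("roleURI")] = tuple(definition.split(" - ", 2))
    return roles


roles = role_definitions(SAMPLE)
print(roles["http://laab.com/role/CondensedBalanceSheets"])
```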

EX-101.CAL file

This file specifies the mathematical relationship among different facts. Without this piece of information, it is impossible to draw the income statement like this:

[Figure: Meta filings]

For example, if you look at the top right corner, you can see that “Net income” is “Income before income tax” minus “Provision for income tax”. In this file, “Net income”, “Income before income tax”, and “Provision for income tax” are labeled as “us-gaap_NetIncomeLoss”, “us-gaap_IncomeBeforeTaxes”, and “us-gaap_IncomeTax”, respectively, each with a “loc_” prefix in the arcs. Their mathematical relationship is given as

<link:calculationArc weight="1.0" xlink:from="loc_us-gaap_NetIncomeLoss" xlink:to="loc_us-gaap_IncomeBeforeTaxes"/>
<link:calculationArc weight="-1.0" xlink:from="loc_us-gaap_NetIncomeLoss" xlink:to="loc_us-gaap_IncomeTax"/>

Here we see two calculationArc elements, each of which links two facts. The first calculationArc links from us-gaap_NetIncomeLoss to us-gaap_IncomeBeforeTaxes, and the second calculationArc links from us-gaap_NetIncomeLoss to us-gaap_IncomeTax, or

us-gaap_NetIncomeLoss → us-gaap_IncomeBeforeTaxes
us-gaap_NetIncomeLoss → - us-gaap_IncomeTax

Note the minus sign in the second relation, which comes from the weight="-1.0" in the previous snippet. Combining the two relations, we get us-gaap_NetIncomeLoss = us-gaap_IncomeBeforeTaxes - us-gaap_IncomeTax

By parsing all calculationArc, we can construct the financial statements as in the diagram shown above.
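A small sketch of that parsing step, assuming the arcs look like the two shown earlier: it builds a parent-to-children map and then rolls child values up into the parent using the weights. The sample linkbase, its role URI, and the two input values are made up for illustration, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

LINK = "{http://www.xbrl.org/2003/linkbase}"
XLINK = "{http://www.w3.org/1999/xlink}"

SAMPLE = """<link:linkbase xmlns:link="http://www.xbrl.org/2003/linkbase"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <link:calculationLink xlink:role="http://laab.com/role/CondensedStatementsOfOperations">
    <link:calculationArc weight="1.0" xlink:from="loc_us-gaap_NetIncomeLoss" xlink:to="loc_us-gaap_IncomeBeforeTaxes"/>
    <link:calculationArc weight="-1.0" xlink:from="loc_us-gaap_NetIncomeLoss" xlink:to="loc_us-gaap_IncomeTax"/>
  </link:calculationLink>
</link:linkbase>"""


def calculation_tree(xml_text):
    """Map each parent to its list of (child, weight) pairs."""
    tree = defaultdict(list)
    # Simplification: arcs from all tables are merged into a single tree.
    for arc in ET.fromstring(xml_text).iter(f"{LINK}calculationArc"):
        child = arc.get(f"{XLINK}to")
        tree[arc.get(f"{XLINK}from")].append((child, float(arc.get("weight"))))
    return dict(tree)


def roll_up(parent, tree, values):
    """Compute a parent as the weighted sum of its children, recursively."""
    total = 0.0
    for child, weight in tree[parent]:
        child_value = roll_up(child, tree, values) if child in tree else values[child]
        total += weight * child_value
    return total


tree = calculation_tree(SAMPLE)
# Hypothetical leaf values, chosen to reproduce the net income used earlier.
values = {"loc_us-gaap_IncomeBeforeTaxes": 30000.0, "loc_us-gaap_IncomeTax": 5014.0}
net_income = roll_up("loc_us-gaap_NetIncomeLoss", tree, values)
print(net_income)
```

In a real filing, each calculationLink belongs to one table, so a fuller version would keep one tree per xlink:role rather than merging them.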

EX-101.DEF file

This file specifies the “definition” of each financial table: for example, which facts are listed in the table and which dimensions the table uses. The income statement can contain a fact such as net income (loss), which can be reported along a dimension such as Taiwan, meaning the table shows the income generated from Taiwan.

EX-101.LAB file

This file gives all the “labels” of facts. Depending on the company, the same fact can be labeled with different text in the filing. For example, net income (loss) can be labeled “Net Income (Loss)” by one company, and “Net Income (Loss) Attributable to Parent Company” by another.

EX-101.PRE file

This file gives the “presentation” of each table. Even in the same filing, a fact can be named differently in different tables. For example, in the EX-101.LAB file, net income loss can have two labels “Net Income (Loss)” and “Net Income (Loss) - Taiwan”. The EX-101.PRE file will tell you which table uses which label for each fact.
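The interplay between the two files can be sketched as below. The sketch assumes that the LAB file attaches several labels to a concept, distinguished by xlink:role, and that a presentationArc in the PRE file may name one of those roles in its preferredLabel attribute. Both sample documents, the role URIs under laab.com, and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET

LINK = "{http://www.xbrl.org/2003/linkbase}"
XLINK = "{http://www.w3.org/1999/xlink}"
STANDARD = "http://www.xbrl.org/2003/role/label"
TERSE = "http://www.xbrl.org/2003/role/terseLabel"
NS = 'xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:xlink="http://www.w3.org/1999/xlink"'
LOC = '<link:loc xlink:label="loc_NI" xlink:href="sample.xsd#us-gaap_NetIncomeLoss"/>'

LAB_SAMPLE = f"""<link:linkbase {NS}><link:labelLink>
  {LOC}
  <link:label xlink:label="lab_NI" xlink:role="{STANDARD}">Net Income (Loss)</link:label>
  <link:label xlink:label="lab_NI" xlink:role="{TERSE}">Net Income (Loss) - Taiwan</link:label>
  <link:labelArc xlink:from="loc_NI" xlink:to="lab_NI"/>
</link:labelLink></link:linkbase>"""

PRE_SAMPLE = f"""<link:linkbase {NS}>
  <link:presentationLink xlink:role="http://laab.com/role/IncomeStatement">
    {LOC}
    <link:presentationArc xlink:from="loc_Abstract" xlink:to="loc_NI"/>
  </link:presentationLink>
  <link:presentationLink xlink:role="http://laab.com/role/SegmentInformation">
    {LOC}
    <link:presentationArc xlink:from="loc_Abstract" xlink:to="loc_NI" preferredLabel="{TERSE}"/>
  </link:presentationLink>
</link:linkbase>"""


def locators(link):
    """Map each locator label to the concept named in its href fragment."""
    return {
        loc.get(f"{XLINK}label"): loc.get(f"{XLINK}href").split("#", 1)[1]
        for loc in link.iter(f"{LINK}loc")
    }


def load_labels(lab_xml):
    """Map (concept, label role) to the label text."""
    labels = {}
    for link in ET.fromstring(lab_xml).iter(f"{LINK}labelLink"):
        concepts = locators(link)
        for arc in link.iter(f"{LINK}labelArc"):
            concept = concepts[arc.get(f"{XLINK}from")]
            for label in link.iter(f"{LINK}label"):
                if label.get(f"{XLINK}label") == arc.get(f"{XLINK}to"):
                    labels[(concept, label.get(f"{XLINK}role"))] = label.text
    return labels


def displayed_labels(pre_xml, labels):
    """Map (table role, concept) to the label that table displays."""
    shown = {}
    for link in ET.fromstring(pre_xml).iter(f"{LINK}presentationLink"):
        concepts = locators(link)
        for arc in link.iter(f"{LINK}presentationArc"):
            concept = concepts[arc.get(f"{XLINK}to")]
            # Fall back to the standard label when no preferredLabel is given.
            role = arc.get("preferredLabel", STANDARD)
            shown[(link.get(f"{XLINK}role"), concept)] = labels.get((concept, role))
    return shown


shown = displayed_labels(PRE_SAMPLE, load_labels(LAB_SAMPLE))
for key, text in shown.items():
    print(key, "->", text)
```

Concepts are matched through the href fragment rather than the locator label, since locator labels are local to each file.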

My pipeline processes all 6 files to get the correct label for each fact in each table and also their inter-relationships. However, if you only care about, say, net income, then you can just locate “us-gaap:NetIncomeLoss” in the XML file and read the value. In the next article, I will go over how I use Python to parse these files and upload the results to Google BigQuery.
