
I Built a Finance Website

Part 2 - how to parse company financial data
November 08, 2025

[Figure: Building a Finance Website]

I built a website, myfinsight.com, that provides financial insights into US public companies. This series of articles covers querying earnings reports from the Securities and Exchange Commission (SEC) and publishing the data to the website, Mailchimp newsletters, and social media such as X and Threads.

In the last article, we talked about how to get company financial data. In this article, we will discuss how to parse financial data. Let’s first look at Meta’s income statement for the quarterly period ended June 30, 2025:

[Figure: Meta’s income statement]

An income statement usually has “Revenue”, “Costs and expenses”, and “Net income”. Of all the financial statements, it is usually the easiest one for most people to understand. As the table shows, net income is essentially revenue minus costs and taxes.

Now you might ask, what is there to parse? Don’t we just download the table, upload it to the cloud, and serve it on the website? You are right: if we only want to show the numbers, as the Yahoo Finance website does, then copying and pasting the numbers is all we need to do. However, we don’t just want the numbers; we also want deeper insights. For example, what is the growth rate of net income? To answer that, we have to make sure that in our database, the net income values from all the different quarterly reports are labeled with the same name, so that we can query them and calculate the difference.
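To make the labeling point concrete, here is a minimal sketch of how a growth rate could be computed once every quarter stores net income under the same label. The function name growth_rate, the period keys, and the two values are all made-up placeholders for illustration; they are not taken from any filing or from the website's actual code.

```python
# Values keyed by (element label, fiscal period).
# The numbers are made-up placeholders, not real filing data.
facts = {
    ("us-gaap:NetIncomeLoss", "2025-Q1"): 100.0,
    ("us-gaap:NetIncomeLoss", "2025-Q2"): 125.0,
}


def growth_rate(facts, label, previous, current):
    """Return the fractional change of `label` between two periods."""
    old = facts[(label, previous)]
    new = facts[(label, current)]
    return (new - old) / abs(old)


rate = growth_rate(facts, "us-gaap:NetIncomeLoss", "2025-Q1", "2025-Q2")
print(f"{rate:.1%}")  # 25.0%
```

The lookup only works because both quarters use the identical label. If one report stored the value under a different name, the query would fail with a KeyError.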

We might also want to visualize the revenue sources. What is the main revenue source for Meta? Which country has the most sales? To answer these questions, we also need to understand the relationships among the numbers in the table.

[Figure: Meta’s revenue breakdown]

If we can derive and parse all the mathematical relationships in the table, then we can plot a diagram like the one below to show where the revenue comes from and where it goes. This is much easier to digest than the numbers in the table.

[Figure: Visualization of Meta’s income]

OK, now we know we have to parse the data somehow to get financial insights. Can we just tell AI to do it? Yes, we can, and most of the time AI does a very good job. However, I chose not to, for a number of reasons:

  1. I am poor and I don’t have that much money for tokens. There are thousands of public companies that file reports every quarter, and I don’t intend to spend too much money on a hobby project.
  2. It might be hard to control the quality of the AI output. How do you debug it if the output is wrong or hallucinated? It is challenging to use AI for parsing financial data or doing math calculations, which have exactly one right answer, zero error margin, and no tolerance for creativity. With AI models evolving every day, it is also risky to delegate much of the business logic to a third party.
  3. Luckily, as we will see below, the submitted financial data is already in a well-defined XML structure from which we can easily get the math relations.

Getting the math relations

As mentioned in the last article, the math relations can be found in the EX-101.CAL file. Take the income statement again:

[Figure: Meta’s income statement]

The corresponding math structure parsed from the file is:

[Figure: Meta’s income statement as a tree]

This is a tree structure with each top element connected to one or more child elements. If you look closely, it is almost like an inverted table with the net income at the top and the revenue at the bottom.
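As a rough sketch of how such a tree could be read from a calculation linkbase, the snippet below collects parent-to-child relations with their weights. The XML fragment is hand-written in the general shape of a linkbase (locators plus calculationArc elements) and is not copied from Meta's filing. The function name parse_calculation_arcs and the assumption that each href fragment looks like prefix_ElementName are mine.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

LINK = "http://www.xbrl.org/2003/linkbase"
XLINK = "http://www.w3.org/1999/xlink"

# Hand-written illustration, NOT taken from a real EX-101.CAL file.
CAL_XML = f"""
<link:linkbase xmlns:link="{LINK}" xmlns:xlink="{XLINK}">
  <link:calculationLink xlink:type="extended" xlink:role="http://example.com/role/IncomeStatement">
    <link:loc xlink:type="locator" xlink:href="us-gaap.xsd#us-gaap_NetIncomeLoss" xlink:label="loc_net"/>
    <link:loc xlink:type="locator" xlink:href="us-gaap.xsd#us-gaap_IncomeTaxExpenseBenefit" xlink:label="loc_tax"/>
    <link:loc xlink:type="locator" xlink:href="us-gaap.xsd#us-gaap_IncomeLossFromContinuingOperations" xlink:label="loc_pre"/>
    <link:calculationArc xlink:type="arc" xlink:from="loc_net" xlink:to="loc_tax" weight="-1.0" order="1"/>
    <link:calculationArc xlink:type="arc" xlink:from="loc_net" xlink:to="loc_pre" weight="1.0" order="2"/>
  </link:calculationLink>
</link:linkbase>
"""


def parse_calculation_arcs(xml_text):
    """Map each parent element to a list of (child element, weight)."""
    root = ET.fromstring(xml_text)
    children = defaultdict(list)
    for link in root.iter(f"{{{LINK}}}calculationLink"):
        # A locator ties an arc endpoint label to the element named
        # in the href fragment (assumed form: prefix_ElementName).
        names = {}
        for loc in link.findall(f"{{{LINK}}}loc"):
            fragment = loc.get(f"{{{XLINK}}}href").split("#", 1)[1]
            names[loc.get(f"{{{XLINK}}}label")] = fragment.replace("_", ":", 1)
        # Each arc points from a parent locator to a child locator.
        for arc in link.findall(f"{{{LINK}}}calculationArc"):
            parent = names[arc.get(f"{{{XLINK}}}from")]
            child = names[arc.get(f"{{{XLINK}}}to")]
            children[parent].append((child, float(arc.get("weight"))))
    return dict(children)


relations = parse_calculation_arcs(CAL_XML)
for parent, kids in relations.items():
    print(parent, kids)
```

A real file contains one calculationLink per statement, so in practice the relations would probably be grouped by the xlink:role of each link rather than merged into one dictionary.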

For convenience, we can convert the tree diagram above into the text format below:

(us-gaap:NetIncomeLoss) weight: 1.0
 | (us-gaap:IncomeTaxExpenseBenefit) weight: -1.0
 | (us-gaap:IncomeLossFromContinuingOperations) weight: 1.0
 | | (us-gaap:NonoperatingIncomeExpense) weight: 1.0
 | | (us-gaap:OperatingIncomeLoss) weight: 1.0
 | | | (us-gaap:CostsAndExpenses) weight: -1.0
 | | | | (us-gaap:CostOfRevenue) weight: 1.0
 | | | | (us-gaap:GeneralAndAdministrativeExpense) weight: 1.0
 | | | | (us-gaap:ResearchAndDevelopmentExpense) weight: 1.0
 | | | | (us-gaap:SellingAndMarketingExpense) weight: 1.0
 | | | (us-gaap:Revenue) weight: 1.0

In this notation, the number of | characters in front of an element denotes its depth in the tree. The element “us-gaap:NetIncomeLoss” has depth 0 because there is no | in front of it. The elements “us-gaap:IncomeTaxExpenseBenefit” and “us-gaap:IncomeLossFromContinuingOperations” have depth 1 because there is one | in front of each. As in the tree diagram, an element connects to the elements one level deeper that are listed beneath it; that is, the root element us-gaap:NetIncomeLoss connects to us-gaap:IncomeTaxExpenseBenefit and us-gaap:IncomeLossFromContinuingOperations.
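The depth rule can be sketched in a few lines of Python. This is only an illustration of the notation: the helper names parse_line and parents are hypothetical, and the input is the first five lines of the tree above.

```python
import re

TREE_TEXT = """\
(us-gaap:NetIncomeLoss) weight: 1.0
 | (us-gaap:IncomeTaxExpenseBenefit) weight: -1.0
 | (us-gaap:IncomeLossFromContinuingOperations) weight: 1.0
 | | (us-gaap:NonoperatingIncomeExpense) weight: 1.0
 | | (us-gaap:OperatingIncomeLoss) weight: 1.0
"""

LINE = re.compile(
    r"^(?P<bars>[ |]*)\((?P<label>[^)]+)\) weight: (?P<weight>-?\d+(?:\.\d+)?)$"
)


def parse_line(line):
    """Return (depth, label, weight) for one line of the notation."""
    match = LINE.match(line.rstrip())
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    depth = match.group("bars").count("|")  # one | per level of depth
    return depth, match.group("label"), float(match.group("weight"))


def parents(rows):
    """Map each label to the label of its parent (None for the root)."""
    stack = []  # stack[d] holds the most recent label seen at depth d
    result = {}
    for depth, label, _weight in rows:
        del stack[depth:]  # drop anything at this depth or deeper
        result[label] = stack[-1] if stack else None
        stack.append(label)
    return result


rows = [parse_line(line) for line in TREE_TEXT.splitlines()]
parent_of = parents(rows)
print(parent_of["us-gaap:OperatingIncomeLoss"])
# us-gaap:IncomeLossFromContinuingOperations
```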

The connections in this tree structure tell us the math relations among the elements. Both “us-gaap:IncomeTaxExpenseBenefit” and “us-gaap:IncomeLossFromContinuingOperations” contribute to the root us-gaap:NetIncomeLoss because they are connected to it. We can also see that us-gaap:IncomeTaxExpenseBenefit has a weight of -1, meaning a negative contribution. Therefore, from this relation we know that

us-gaap:NetIncomeLoss = us-gaap:IncomeLossFromContinuingOperations - us-gaap:IncomeTaxExpenseBenefit

You might wonder what the prefix “us-gaap” is. It means that these elements, e.g. “us-gaap:NetIncomeLoss”, are standard elements from the US GAAP Financial Reporting Taxonomy (GAAP stands for “Generally Accepted Accounting Principles”). Because these standard elements are widely used, you can very likely find “us-gaap:NetIncomeLoss” in different companies’ filings, which makes it easy to compare companies. Note that if there are standard elements, there can also be non-standard elements. For example, Netflix’s financial reports include “nflx:ContentAssets”, which refers to the company’s library of licensed and self-produced films and shows that are recorded on the balance sheet as assets. This type of asset is unique to Netflix, and therefore the company needs to create its own element to represent it.

Populate values

The values can be found in the XML file as mentioned in the last article. By looking up each element from the tree in the file, we can have something like:

 $18337 (us-gaap:NetIncomeLoss) w:1.0
 | $2197 (us-gaap:IncomeTaxExpenseBenefit) w:-1.0
 | $20534 (us-gaap:IncomeLossFromContinuingOperations) w:1.0
 | | $93 (us-gaap:NonoperatingIncomeExpense) w:1.0
 | | $20441 (us-gaap:OperatingIncomeLoss) w:1.0
 | | | $27075 (us-gaap:CostsAndExpenses) w:-1.0
 | | | | $2663 (us-gaap:GeneralAndAdministrativeExpense) w:1.0
 | | | | $2979 (us-gaap:SellingAndMarketingExpense) w:1.0
 | | | | $12942 (us-gaap:ResearchAndDevelopmentExpense) w:1.0
 | | | | $8491 (us-gaap:CostOfRevenue) w:1.0
 | | | $47516 (us-gaap:Revenue) w:1.0

We have replaced “weight” with “w” to shorten the expression. It is easy to check whether the numbers are correct. For example, us-gaap:CostsAndExpenses has four children, so the total costs and expenses must equal the sum of their values, that is

$27075 = $2663 + $2979 + $12942 + $8491
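The same check can be run for every parent in the tree at once: each parent's value should equal the weighted sum of its children. Below is a minimal sketch using the values and weights from the tree above. The function name find_mismatches and the tolerance parameter are my own choices for illustration.

```python
# Values and weights copied from the populated tree above.
values = {
    "us-gaap:NetIncomeLoss": 18337,
    "us-gaap:IncomeTaxExpenseBenefit": 2197,
    "us-gaap:IncomeLossFromContinuingOperations": 20534,
    "us-gaap:NonoperatingIncomeExpense": 93,
    "us-gaap:OperatingIncomeLoss": 20441,
    "us-gaap:CostsAndExpenses": 27075,
    "us-gaap:GeneralAndAdministrativeExpense": 2663,
    "us-gaap:SellingAndMarketingExpense": 2979,
    "us-gaap:ResearchAndDevelopmentExpense": 12942,
    "us-gaap:CostOfRevenue": 8491,
    "us-gaap:Revenue": 47516,
}
weights = {label: 1.0 for label in values}
weights["us-gaap:IncomeTaxExpenseBenefit"] = -1.0
weights["us-gaap:CostsAndExpenses"] = -1.0

children = {
    "us-gaap:NetIncomeLoss": [
        "us-gaap:IncomeTaxExpenseBenefit",
        "us-gaap:IncomeLossFromContinuingOperations",
    ],
    "us-gaap:IncomeLossFromContinuingOperations": [
        "us-gaap:NonoperatingIncomeExpense",
        "us-gaap:OperatingIncomeLoss",
    ],
    "us-gaap:OperatingIncomeLoss": [
        "us-gaap:CostsAndExpenses",
        "us-gaap:Revenue",
    ],
    "us-gaap:CostsAndExpenses": [
        "us-gaap:GeneralAndAdministrativeExpense",
        "us-gaap:SellingAndMarketingExpense",
        "us-gaap:ResearchAndDevelopmentExpense",
        "us-gaap:CostOfRevenue",
    ],
}


def find_mismatches(children, values, weights, tolerance=0.5):
    """Return (parent, reported, expected) where the weighted sum disagrees."""
    mismatches = []
    for parent, kids in children.items():
        expected = sum(weights[kid] * values[kid] for kid in kids)
        if abs(values[parent] - expected) > tolerance:
            mismatches.append((parent, values[parent], expected))
    return mismatches


print(find_mismatches(children, values, weights))  # []
```

An empty list means every relation in the tree holds. The tolerance is there because reported numbers are rounded, so real filings may be off by a unit in places.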

Populate labels

It is not hard to guess what most of these elements are. For example, us-gaap:CostOfRevenue is simply the cost of revenue. However, some are not that easy to understand, for example us-gaap:IncomeTaxesPaidNet. Therefore, we need to find the text, or label, for each element. The file we want to look at is EX-101.LAB, which we also discussed in the last article. After populating the texts we get

 $18337 Net income, w:1.0
 | $2197 Provision for income taxes, w:-1.0
 | $20534 Income before provision for income taxes, w:1.0
 | | $93 Interest and other income, net, w:1.0
 | | $20441 Income (loss) from operations, w:1.0
 | | | $27075 Total costs and expenses, w:-1.0
 | | | | $2663 General and administrative, w:1.0
 | | | | $2979 Marketing and sales, w:1.0
 | | | | $12942 Research and development, w:1.0
 | | | | $8491 Cost of revenue, w:1.0
 | | | $47516 Revenue, w:1.0
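For readers curious how the text for each element might be looked up, here is a hedged sketch that reads a label linkbase. The XML fragment is hand-written in the general shape of such a file and is not copied from Meta's EX-101.LAB. The function name parse_labels is hypothetical, and the sketch keeps only labels with the standard label role, ignoring other roles a real file may contain.

```python
import xml.etree.ElementTree as ET

LINK = "http://www.xbrl.org/2003/linkbase"
XLINK = "http://www.w3.org/1999/xlink"
STANDARD_ROLE = "http://www.xbrl.org/2003/role/label"

# Hand-written illustration, NOT taken from a real EX-101.LAB file.
LAB_XML = f"""
<link:linkbase xmlns:link="{LINK}" xmlns:xlink="{XLINK}">
  <link:labelLink xlink:type="extended" xlink:role="http://www.xbrl.org/2003/role/link">
    <link:loc xlink:type="locator" xlink:href="us-gaap.xsd#us-gaap_NetIncomeLoss" xlink:label="loc_net"/>
    <link:label xlink:type="resource" xlink:label="lab_net" xlink:role="{STANDARD_ROLE}" xml:lang="en-US">Net income</link:label>
    <link:labelArc xlink:type="arc" xlink:from="loc_net" xlink:to="lab_net"/>
    <link:loc xlink:type="locator" xlink:href="us-gaap.xsd#us-gaap_IncomeTaxExpenseBenefit" xlink:label="loc_tax"/>
    <link:label xlink:type="resource" xlink:label="lab_tax" xlink:role="{STANDARD_ROLE}" xml:lang="en-US">Provision for income taxes</link:label>
    <link:labelArc xlink:type="arc" xlink:from="loc_tax" xlink:to="lab_tax"/>
  </link:labelLink>
</link:linkbase>
"""


def parse_labels(xml_text, role=STANDARD_ROLE):
    """Map each element to its human-readable text for the given role."""
    root = ET.fromstring(xml_text)
    labels = {}
    for link in root.iter(f"{{{LINK}}}labelLink"):
        # Locator label -> element name (assumed form: prefix_ElementName).
        elements = {}
        for loc in link.findall(f"{{{LINK}}}loc"):
            fragment = loc.get(f"{{{XLINK}}}href").split("#", 1)[1]
            elements[loc.get(f"{{{XLINK}}}label")] = fragment.replace("_", ":", 1)
        # Label resource id -> text, keeping only the requested role.
        texts = {
            res.get(f"{{{XLINK}}}label"): res.text
            for res in link.findall(f"{{{LINK}}}label")
            if res.get(f"{{{XLINK}}}role") == role
        }
        # Each arc joins one element locator to one label resource.
        for arc in link.findall(f"{{{LINK}}}labelArc"):
            source = arc.get(f"{{{XLINK}}}from")
            target = arc.get(f"{{{XLINK}}}to")
            if source in elements and target in texts:
                labels[elements[source]] = texts[target]
    return labels


labels = parse_labels(LAB_XML)
print(labels["us-gaap:NetIncomeLoss"])  # Net income
```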

Data structures

Since the financial filings are already in a tree structure, we should use the same data structure when we process the data. Above we talked a lot about “elements”. These elements are essentially “nodes” in a tree data structure. Each node should have a GAAP label, the human-readable text, its value, and its weight. We can define this node in Python as

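import dataclasses


@dataclasses.dataclass
class Node:
    text: str  # human-readable text, e.g. "Net income"
    label: str  # taxonomy label, e.g. "us-gaap:NetIncomeLoss"
    value: float  # reported value
    weight: float  # contribution to the parent: 1.0 adds, -1.0 subtracts
    to_nodes: list["Node"]  # child nodes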

where to_nodes lists the node’s child nodes. For example, for Meta’s income statement above, we have

tax = Node(
    "Provision for income taxes",
    "us-gaap:IncomeTaxExpenseBenefit",
    2197,
    -1.0,
    [],
)
income_loss = Node(
    "Income before provision for income taxes",
    "us-gaap:IncomeLossFromContinuingOperations",
    20534,
    1.0,
    [],  # its own children are omitted here for brevity
)
net_income = Node(
    "Net income",
    "us-gaap:NetIncomeLoss",
    18337,
    1.0,
    [tax, income_loss],
)

Since the node net_income has two children, tax (with weight -1) and income_loss (with weight 1), we can derive the equation: net_income = income_loss - tax.

With the class Node, it is very easy to plot the tree structure and derive other mathematical relations.
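As a small illustration of that claim, the sketch below prints a node tree in the text notation used earlier and derives the equation for a parent from its children's weights. The helpers render and equation are hypothetical names, and equation assumes every weight is either 1 or -1, as in the examples in this article.

```python
import dataclasses


@dataclasses.dataclass
class Node:
    text: str
    label: str
    value: float
    weight: float
    to_nodes: list["Node"] = dataclasses.field(default_factory=list)


def render(node, depth=0):
    """Return the tree as lines in the '| $value text, w:weight' notation."""
    line = " " + "| " * depth + f"${node.value:.0f} {node.text}, w:{node.weight}"
    lines = [line]
    for child in node.to_nodes:
        lines.extend(render(child, depth + 1))
    return lines


def equation(node):
    """Express a node as a signed sum of its children (weights of +1 or -1)."""
    # Stable sort: positive contributions first, then negative ones.
    ordered = sorted(node.to_nodes, key=lambda child: child.weight < 0)
    parts = []
    for child in ordered:
        sign = "-" if child.weight < 0 else "+"
        parts.append(f"{sign} {child.label}")
    return f"{node.label} = " + " ".join(parts).removeprefix("+ ")


tax = Node("Provision for income taxes", "us-gaap:IncomeTaxExpenseBenefit", 2197, -1.0)
pretax = Node(
    "Income before provision for income taxes",
    "us-gaap:IncomeLossFromContinuingOperations",
    20534,
    1.0,
)
net_income = Node("Net income", "us-gaap:NetIncomeLoss", 18337, 1.0, [tax, pretax])

print("\n".join(render(net_income)))
print(equation(net_income))
```

The to_nodes field gets a default_factory here only so that leaf nodes can be written without an explicit empty list; that is a convenience of this sketch and not part of the class defined above.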

Final notes

I presented a very simplified view of financial filings. A real filing contains much more data than I covered here. For example, a value usually has a unit and a scale: the net income of $18337 above is actually in millions of US dollars. A value can also have a dimension: net sales, for example, can be broken down along a “country” dimension with one value per country. Interested readers can check out https://www.xbrl.org for how financial filings are structured, and rules such as https://xbrl.us/data-rule for guidance on financial filings.
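One possible way to carry the unit, scale, and dimensions alongside a value is sketched below. The class name Fact and all of its fields are assumptions for illustration. They are not how the filings themselves encode this information, nor necessarily how the website stores it.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class Fact:
    label: str
    value: float  # number as displayed in the statement
    unit: str  # e.g. "USD"
    scale: int = 0  # power of ten; 6 means the value is in millions
    dimensions: tuple = ()  # e.g. (("country", "US"),); empty for a total

    def absolute_value(self):
        """Return the value with the scale applied."""
        return self.value * 10 ** self.scale


net_income = Fact("us-gaap:NetIncomeLoss", 18337, "USD", scale=6)
print(net_income.absolute_value())  # 18337000000
```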

