
I built a website, myfinsight.com, that provides financial insights into US public companies. This series of articles is about querying earnings reports from the Securities and Exchange Commission (SEC) and publishing the data to the website, Mailchimp newsletters, and social media such as X and Threads.
In the last article, we talked about how to get company financial data. In this article,
we will discuss how to parse financial data. Let’s first look at Meta’s income statement
for the quarterly period ended June 30, 2025:
Meta’s income statement
An income statement usually has “Revenue”, “Costs and expenses”, and “Net income”. Of all the financial statements, the income statement is usually the easiest for most people to understand. As the table shows, net income is revenue minus costs and taxes.
Now you might ask, what is there to parse? Don’t we just download the table, upload it to the cloud, and then serve it on the website? You are right: if we only want the numbers, like the Yahoo Finance website, then this is all we need to do: copy and paste the numbers. However, we don’t just want the numbers; we also want deeper insights. For example, what is the growth rate of net income? To answer that, we have to make sure that in our database the net income from every quarterly report is "labeled" with the same name, so that we can query the values and calculate the difference.
We might also want to visualize the revenue sources. What is the main revenue source for Meta? Which country has the most sales? Therefore, we also need to understand the relationship among numbers in the table.
Meta’s revenue breakdown
If we can derive and parse all the mathematical relationships in the table, then we can plot a diagram like the one below to show where the revenue comes from and where it goes. This is much easier to digest than the numbers in the table.
Visualization of Meta’s income
OK, now we know we have to somehow parse the data to get financial insights. Can we just tell AI to do it? Yes, we can, and most of the time AI does a very good job. However, I choose not to do it for a number of reasons:
As mentioned in the last article,
the math relations can be found in the EX-101.CAL file. For the income statement:
Meta’s income statement
The corresponding math structure parsed from the file is:
Meta’s income statement
This is a tree structure with each top element connected to one or more child elements. If you look closely, it is almost like an inverted table with the net income at the top and the revenue at the bottom.
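As a rough sketch of how such parent-child relations could be read from a calculation linkbase, the snippet below parses a small hand-written XML fragment with Python's standard library. The fragment is a simplified illustration, not Meta's actual EX-101.CAL file: the `href` values, the role URI, and the locator labels are placeholders, and `parse_calculation_arcs` is a hypothetical helper name. The namespaces and the `calculationArc` attributes (`xlink:from`, `xlink:to`, `weight`) follow the XBRL linkbase conventions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Standard XBRL linkbase namespace, plus the xlink attribute prefix.
NS = {"link": "http://www.xbrl.org/2003/linkbase"}
XLINK = "{http://www.w3.org/1999/xlink}"

# Hand-written, simplified fragment for illustration only (placeholder hrefs and role).
SAMPLE_CAL = """\
<link:linkbase xmlns:link="http://www.xbrl.org/2003/linkbase"
               xmlns:xlink="http://www.w3.org/1999/xlink">
  <link:calculationLink xlink:type="extended"
                        xlink:role="http://example.com/role/IncomeStatement">
    <link:loc xlink:type="locator" xlink:label="loc_NetIncomeLoss"
              xlink:href="us-gaap.xsd#us-gaap_NetIncomeLoss"/>
    <link:loc xlink:type="locator" xlink:label="loc_IncomeTaxExpenseBenefit"
              xlink:href="us-gaap.xsd#us-gaap_IncomeTaxExpenseBenefit"/>
    <link:loc xlink:type="locator" xlink:label="loc_IncomeLossFromContinuingOperations"
              xlink:href="us-gaap.xsd#us-gaap_IncomeLossFromContinuingOperations"/>
    <link:calculationArc xlink:type="arc"
        xlink:arcrole="http://www.xbrl.org/2003/arcrole/summation-item"
        xlink:from="loc_NetIncomeLoss" xlink:to="loc_IncomeTaxExpenseBenefit"
        weight="-1.0" order="1"/>
    <link:calculationArc xlink:type="arc"
        xlink:arcrole="http://www.xbrl.org/2003/arcrole/summation-item"
        xlink:from="loc_NetIncomeLoss" xlink:to="loc_IncomeLossFromContinuingOperations"
        weight="1.0" order="2"/>
  </link:calculationLink>
</link:linkbase>
"""


def parse_calculation_arcs(xml_text):
    """Return {parent element: [(child element, weight), ...]}."""
    root = ET.fromstring(xml_text)
    children = defaultdict(list)
    for calc_link in root.findall("link:calculationLink", NS):
        # Each locator maps a local label to the concept named after '#' in its href.
        concept_of = {}
        for loc in calc_link.findall("link:loc", NS):
            concept = loc.get(XLINK + "href").split("#")[-1].replace("_", ":", 1)
            concept_of[loc.get(XLINK + "label")] = concept
        # Each arc connects a parent locator to a child locator with a weight.
        for arc in calc_link.findall("link:calculationArc", NS):
            parent = concept_of[arc.get(XLINK + "from")]
            child = concept_of[arc.get(XLINK + "to")]
            children[parent].append((child, float(arc.get("weight"))))
    return dict(children)


relations = parse_calculation_arcs(SAMPLE_CAL)
for parent, kids in relations.items():
    print(parent, "->", kids)
```

A real file contains one `calculationLink` per statement and many more arcs, so a production parser would also need to keep the statements apart by their role.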
For convenience, we can convert the tree diagram above into the text format below:
(us-gaap:NetIncomeLoss) weight: 1.0
| (us-gaap:IncomeTaxExpenseBenefit) weight: -1.0
| (us-gaap:IncomeLossFromContinuingOperations) weight: 1.0
| | (us-gaap:NonoperatingIncomeExpense) weight: 1.0
| | (us-gaap:OperatingIncomeLoss) weight: 1.0
| | | (us-gaap:CostsAndExpenses) weight: -1.0
| | | | (us-gaap:CostOfRevenue) weight: 1.0
| | | | (us-gaap:GeneralAndAdministrativeExpense) weight: 1.0
| | | | (us-gaap:ResearchAndDevelopmentExpense) weight: 1.0
| | | | (us-gaap:SellingAndMarketingExpense) weight: 1.0
| | | (us-gaap:Revenue) weight: 1.0
In this notation, the number of | characters denotes the depth of the node or element. The element “us-gaap:NetIncomeLoss” has
a depth of 0 because there is no | in front of it. The elements “us-gaap:IncomeTaxExpenseBenefit” and
“us-gaap:IncomeLossFromContinuingOperations” have a depth of 1 because there is one | in front of each of them.
As in the tree structure, the element of depth 0 connects to the elements of depth 1 directly below it, that is,
the root element us-gaap:NetIncomeLoss connects to us-gaap:IncomeTaxExpenseBenefit and
us-gaap:IncomeLossFromContinuingOperations.
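To make the depth rule concrete, here is a minimal sketch that reads a few lines of this text notation and recovers which element connects to which. The helper names `parse_line` and `find_children` are made up for this illustration; they are not part of any library.

```python
def parse_line(line):
    """Split one line of the text notation into (depth, element, weight)."""
    tokens = line.split()
    depth = 0
    while tokens[depth] == "|":  # count the leading | markers
        depth += 1
    element = tokens[depth].strip("()")
    weight = float(tokens[-1])
    return depth, element, weight


def find_children(parsed):
    """Map each element to its direct children using the depth values."""
    children = {}
    stack = []  # stack[d] is the most recent element seen at depth d
    for depth, element, _ in parsed:
        children[element] = []
        del stack[depth:]  # drop elements that are not ancestors of this one
        if stack:
            children[stack[-1]].append(element)
        stack.append(element)
    return children


lines = [
    "(us-gaap:NetIncomeLoss) weight: 1.0",
    "| (us-gaap:IncomeTaxExpenseBenefit) weight: -1.0",
    "| (us-gaap:IncomeLossFromContinuingOperations) weight: 1.0",
    "| | (us-gaap:NonoperatingIncomeExpense) weight: 1.0",
]
parsed = [parse_line(line) for line in lines]
children = find_children(parsed)
print(children["us-gaap:NetIncomeLoss"])
```

The stack keeps only the ancestors of the current line, so an element is always attached to the closest preceding element that is one level shallower.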
The connections in this tree structure tell us the math relations among elements. Both “us-gaap:IncomeTaxExpenseBenefit” and “us-gaap:IncomeLossFromContinuingOperations” contribute to the root us-gaap:NetIncomeLoss because they are connected to it. We can also see that us-gaap:IncomeTaxExpenseBenefit has a weight of -1, meaning a negative contribution. Therefore, from this relation we know that
us-gaap:NetIncomeLoss = us-gaap:IncomeLossFromContinuingOperations - us-gaap:IncomeTaxExpenseBenefit
You might wonder what the prefix “us-gaap” is. The “us-gaap” prefix means that these elements, e.g. “us-gaap:NetIncomeLoss”, are standard elements from the US GAAP Financial Reporting Taxonomy (GAAP stands for “Generally Accepted Accounting Principles”). Because these standard elements are widely used by different companies, you are very likely to find “us-gaap:NetIncomeLoss” in other companies’ filings, which makes it easy to compare companies. Note that if there are standard elements, there can also be non-standard elements. For example, Netflix’s financial reports include “nflx:ContentAssets”, which refers to the company's library of licensed and self-produced films and shows that are recorded on the balance sheet as assets. This type of asset is unique to Netflix, and therefore the company needs to create its own element to represent it.
The values can be found in the XML file as mentioned in the last article. By looking up each element from the tree in the file, we get something like:
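In code, telling standard elements from company-specific ones can be as simple as looking at the prefix. The sketch below assumes that “us-gaap” is the only standard prefix of interest, which is a simplification; `split_element` and `is_standard` are hypothetical helper names.

```python
STANDARD_PREFIXES = {"us-gaap"}  # assumption: only this prefix counts as standard here


def split_element(name):
    """Split 'prefix:LocalName' into its two parts."""
    prefix, _, local_name = name.partition(":")
    return prefix, local_name


def is_standard(name):
    """True if the element comes from a standard taxonomy prefix."""
    return split_element(name)[0] in STANDARD_PREFIXES


for name in ["us-gaap:NetIncomeLoss", "nflx:ContentAssets"]:
    print(name, "standard" if is_standard(name) else "company-specific")
```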
$18337 (us-gaap:NetIncomeLoss) w:1.0
| $2197 (us-gaap:IncomeTaxExpenseBenefit) w:-1.0
| $20534 (us-gaap:IncomeLossFromContinuingOperations) w:1.0
| | $93 (us-gaap:NonoperatingIncomeExpense) w:1.0
| | $20441 (us-gaap:OperatingIncomeLoss) w:1.0
| | | $27075 (us-gaap:CostsAndExpenses) w:-1.0
| | | | $2663 (us-gaap:GeneralAndAdministrativeExpense) w:1.0
| | | | $2979 (us-gaap:SellingAndMarketingExpense) w:1.0
| | | | $12942 (us-gaap:ResearchAndDevelopmentExpense) w:1.0
| | | | $8491 (us-gaap:CostOfRevenue) w:1.0
| | | $47516 (us-gaap:Revenue) w:1.0
We have replaced “weight” with “w” to shorten the expression. It is easy to check whether the numbers are correct. For example, us-gaap:CostsAndExpenses has four children, so the costs and expenses must be the sum of their values, that is
$27075 = $2663 + $2979 + $12942 + $8491
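The same check can be run over the whole tree at once. The following sketch stores the tree above as nested tuples of (value, element, weight, children) and reports every parent whose value differs from the weighted sum of its children. The tuple layout and the function name `find_mismatches` are choices made for this example only.

```python
# (value, element, weight, children), copied from the tree above
tree = (18337, "us-gaap:NetIncomeLoss", 1.0, [
    (2197, "us-gaap:IncomeTaxExpenseBenefit", -1.0, []),
    (20534, "us-gaap:IncomeLossFromContinuingOperations", 1.0, [
        (93, "us-gaap:NonoperatingIncomeExpense", 1.0, []),
        (20441, "us-gaap:OperatingIncomeLoss", 1.0, [
            (27075, "us-gaap:CostsAndExpenses", -1.0, [
                (2663, "us-gaap:GeneralAndAdministrativeExpense", 1.0, []),
                (2979, "us-gaap:SellingAndMarketingExpense", 1.0, []),
                (12942, "us-gaap:ResearchAndDevelopmentExpense", 1.0, []),
                (8491, "us-gaap:CostOfRevenue", 1.0, []),
            ]),
            (47516, "us-gaap:Revenue", 1.0, []),
        ]),
    ]),
])


def find_mismatches(node):
    """Return (element, reported value, computed value) for every parent that does not add up."""
    value, element, _, children = node
    mismatches = []
    if children:
        # A parent's value should equal the sum of weight * value over its children.
        total = sum(w * v for v, _, w, _ in children)
        if total != value:
            mismatches.append((element, value, total))
    for child in children:
        mismatches.extend(find_mismatches(child))
    return mismatches


print(find_mismatches(tree))
```

An empty list means every parent in the tree equals the weighted sum of its children. With real filings, a small tolerance may be needed because reported values are rounded.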
It is not hard to guess what these elements are. For example, us-gaap:CostOfRevenue is simply the cost of revenue. However, sometimes an element is not that easy to understand, for example us-gaap:IncomeTaxesPaidNet. Therefore, we need to find the text, or label, for each element. The file we want to look at is EX-101.LAB, which we also discussed in the last article. After populating the texts we get
$18337 Net income, w:1.0
| $2197 Provision for income taxes, w:-1.0
| $20534 Income before provision for income taxes, w:1.0
| | $93 Interest and other income, net, w:1.0
| | $20441 Income (loss) from operations, w:1.0
| | | $27075 Total costs and expenses, w:-1.0
| | | | $2663 General and administrative, w:1.0
| | | | $2979 Marketing and sales, w:1.0
| | | | $12942 Research and development, w:1.0
| | | | $8491 Cost of revenue, w:1.0
| | | $47516 Revenue, w:1.0
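For readers curious how the texts might be pulled out of a label linkbase, here is a hedged sketch using Python's standard library. The XML fragment is hand-written and simplified (placeholder `href` values and locator names), not Meta's actual EX-101.LAB file, and `parse_labels` is a hypothetical helper. It keeps only labels with the standard label role; real filings usually carry several roles per element (terse, total, and so on).

```python
import xml.etree.ElementTree as ET

NS = {"link": "http://www.xbrl.org/2003/linkbase"}
XLINK = "{http://www.w3.org/1999/xlink}"
STANDARD_ROLE = "http://www.xbrl.org/2003/role/label"

# Hand-written, simplified fragment for illustration only (placeholder hrefs).
SAMPLE_LAB = """\
<link:linkbase xmlns:link="http://www.xbrl.org/2003/linkbase"
               xmlns:xlink="http://www.w3.org/1999/xlink">
  <link:labelLink xlink:type="extended" xlink:role="http://www.xbrl.org/2003/role/link">
    <link:loc xlink:type="locator" xlink:label="loc_NetIncomeLoss"
              xlink:href="us-gaap.xsd#us-gaap_NetIncomeLoss"/>
    <link:label xlink:type="resource" xlink:label="lab_NetIncomeLoss"
                xlink:role="http://www.xbrl.org/2003/role/label"
                xml:lang="en-US">Net income</link:label>
    <link:labelArc xlink:type="arc"
                   xlink:arcrole="http://www.xbrl.org/2003/arcrole/concept-label"
                   xlink:from="loc_NetIncomeLoss" xlink:to="lab_NetIncomeLoss"/>
    <link:loc xlink:type="locator" xlink:label="loc_IncomeTaxExpenseBenefit"
              xlink:href="us-gaap.xsd#us-gaap_IncomeTaxExpenseBenefit"/>
    <link:label xlink:type="resource" xlink:label="lab_IncomeTaxExpenseBenefit"
                xlink:role="http://www.xbrl.org/2003/role/label"
                xml:lang="en-US">Provision for income taxes</link:label>
    <link:labelArc xlink:type="arc"
                   xlink:arcrole="http://www.xbrl.org/2003/arcrole/concept-label"
                   xlink:from="loc_IncomeTaxExpenseBenefit" xlink:to="lab_IncomeTaxExpenseBenefit"/>
  </link:labelLink>
</link:linkbase>
"""


def parse_labels(xml_text):
    """Return {element: human-readable text} for labels with the standard role."""
    root = ET.fromstring(xml_text)
    labels = {}
    for label_link in root.findall("link:labelLink", NS):
        # Locator label -> element name, e.g. "us-gaap:NetIncomeLoss".
        concept_of = {
            loc.get(XLINK + "label"): loc.get(XLINK + "href").split("#")[-1].replace("_", ":", 1)
            for loc in label_link.findall("link:loc", NS)
        }
        # Resource label -> text, restricted to the standard label role.
        text_of = {
            res.get(XLINK + "label"): res.text
            for res in label_link.findall("link:label", NS)
            if res.get(XLINK + "role") == STANDARD_ROLE
        }
        # Arcs connect an element locator to its label resource.
        for arc in label_link.findall("link:labelArc", NS):
            target = arc.get(XLINK + "to")
            if target in text_of:
                labels[concept_of[arc.get(XLINK + "from")]] = text_of[target]
    return labels


labels = parse_labels(SAMPLE_LAB)
print(labels)
```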
Since the financial filings are already in a tree structure, we should use the same data structure when we process the data. Above, we talked a lot about “elements”. These elements are essentially “nodes” in a tree data structure. Each node should have a GAAP label, the human-readable text, its value, and its weight. We can define this node in Python as
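import dataclasses

@dataclasses.dataclass
class Node:
    text: str  # human-readable text, e.g. "Net income"
    label: str  # GAAP label, e.g. "us-gaap:NetIncomeLoss"
    value: float  # reported value
    weight: float  # contribution to the parent node: 1.0 or -1.0
    to_nodes: list["Node"]  # child nodes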
where to_nodes lists the node’s child nodes. For example, in Meta’s income statement above, we have
tax = Node(
    "Provision for income taxes",
    "us-gaap:IncomeTaxExpenseBenefit",
    2197,
    -1.0,
    []
)
income_loss = Node(
    "Income before provision for income taxes",
    "us-gaap:IncomeLossFromContinuingOperations",
    20534,
    1.0,
    []
)
net_income = Node(
    "Net income",
    "us-gaap:NetIncomeLoss",
    18337,
    1.0,
    [tax, income_loss]
)
Since the node net_income has two children, tax (with weight -1) and income_loss,
we can derive the equation: net_income = income_loss - tax.
With the class Node, it is very easy to plot the tree structure and derive other mathematical relations.
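As an illustration, the sketch below repeats the Node definition so that it runs on its own, then adds two small helpers: `render`, which prints a node and its children in the text notation used earlier, and `equation`, which writes out the relation between a node and its children. Both helper names are made up for this example, and `equation` assumes every weight is either 1.0 or -1.0. It needs Python 3.9 or later.

```python
import dataclasses


@dataclasses.dataclass
class Node:
    text: str
    label: str
    value: float
    weight: float
    to_nodes: list["Node"]


def render(node, depth=0):
    """Return the tree as lines of text, one | per level of depth."""
    lines = ["| " * depth + f"${node.value} {node.text}, w:{node.weight}"]
    for child in node.to_nodes:
        lines.extend(render(child, depth + 1))
    return lines


def equation(node):
    """Write the parent as a signed sum of its children (weights assumed to be +1 or -1)."""
    # Put positive contributions first so the equation reads naturally.
    children = sorted(node.to_nodes, key=lambda child: child.weight, reverse=True)
    terms = " ".join(("- " if c.weight < 0 else "+ ") + c.label for c in children)
    return f"{node.label} = {terms.removeprefix('+ ')}"


net_income = Node("Net income", "us-gaap:NetIncomeLoss", 18337, 1.0, [
    Node("Provision for income taxes", "us-gaap:IncomeTaxExpenseBenefit", 2197, -1.0, []),
    Node("Income before provision for income taxes",
         "us-gaap:IncomeLossFromContinuingOperations", 20534, 1.0, []),
])

print("\n".join(render(net_income)))
print(equation(net_income))
```

Keeping the children in a plain list means the same recursive walk can serve both plotting and validation.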
I presented a very simplified view of the financial filings. In a real filing, there is much more data that I didn’t mention here. For example, a value usually has a unit: the net income of $18337 above is actually in millions of dollars. A value can also have a dimension: for example, sales can be reported along a “country” dimension that has multiple values. Interested readers can check out https://www.xbrl.org for how financial filings are structured, and rules such as https://xbrl.us/data-rule for guidance on financial filings.