ISSS608 2016-17 T1 Assign2 Li Dan
The Data
- Data Source: US Stocks Fundamentals (XBRL).
- 7- year time series
- 1,907,883 observations
- 9619 US listed companies
- 8529 unique indicators
This data is all about companies’ fundamentals which are extracted from the financial reports such as balance sheet and income statement, indicating the company’s health and growth.
The Objective
It is aimed to build an interactive application which enables stock investors to compare key indicators across companies and industries. This could help them to better understand the well-being of companies’ business and prospects, and pick promising stock portfolios to invest.
Tools Used
- JMP
- Tableau
- High D
- R
Get the Data Ready
Data blending
The dataset contains 9000+ companies, making it hard for users to explore. Plus, it is not meaningful to compare indicators across all the companies because the data scale of these indicators may differ from industry to industry. Thus, additional dataset is downloaded from US Stocks Fundamentals API to categorize the companies into industries.
The picture below shows existing columns in 2 files separately. The ‘data blended’ is joined by ‘company_id’ from 3 files. it further includes company names and industry sectors for each company id in file1. In addition, the column ‘2010’ & ’2016’ are dropped because of the large amount of missing values.
After joining the table, it is observed in the following table that, among 9616 companies, more than 30%(2985/9616) of the companies are from manufacturing industry; while 18 % (1771/9616) of them are from finance and insurance industry.
Select indicators
One challenge of this dataset is that we have 8500+ unique indicators across 9600+ companies, with each company having 198 indicators on average and each indicator averagely occurs 224 times among all companies. Such high dimension attribute makes the dataset hard to be visualized and explored.
With the intention of dimension deduction, the distribution of the indicators examined as shown in the picture followed. It is found that the frequency that the 8529 indicators occur is highly right skewed with a skewness value of 6.74. Notably, only around 43 (0.5%*8529) indicators occur more than 5215 times, signifying that there exist many indicators with very low frequency. This make those indicators incomparable across the 9600+ companies and they are thus can be dropped.
Besides, based on financial domain knowledge, the 8500+ indicators all have their own level of significance and measure the performance of a company in different dimensions such as the ability to pay current and long term debt and the ability to collect cash. Particularly in our case, the focus is more on the profitability and returns on shares as an investment because it is intended to screen stocks for investors.
With this concern to select indicators with high frequency and high significance in terms of measuring return on buying shares, the following 16 indicators are selected. Note that, even with low frequency, ‘inventoryNet’ and ‘SalesRevenueNet’ are reminded to calculate the acid-ratio and net profit margin mainly for manufacturing industry.
See below for remarks of these indicators
Select companies
After 16 indicators are selected, 9591 companies remained. However, these 16 indicators may not all available for each company which should have 80 cells to display these indicators across 5 years if no missing value exist. By checking the missing pattern of the 16 indicators by company id, nearly half of the company has 40+ missing value (un-shaded), making the comparison of indicators across companies less meaningful. Thus companies with more than half missing values of 16 indicators are dropped while the other 4734 companies are remained(Shaded).
Plotting preparation
Data structure manipulation
As it is intended to visualize the data using parallel coordinates, the 5 columns of ‘year’ are stacked, with each row representing 1 year of data. The column ‘Data’ is then split by the ‘indicator_id’ and grouped by ‘company id’ and ‘year’. See the screenshot below.
Excluding outliers
Balance Sheet Equation The balance sheet equation, Total Assets – Total Liabilities = Stockholders’ Equity, is fundamental accounting equation that is universally accepted and strictly followed. While applying the equation to our dataset, there exist 4593 or 19.4% of observations fail to satisfy the equation, they are thus eliminated because of the inaccuracy.
EPS The indicator ‘EarningsPerShareBasic”(EPS) has a reasonable range, based on the current EPS ranking info at Marketinout. it is possible to have an EPS of 15,000, but an EPS of 64 million has never occurred in the history of financial markets.Besides, it is also impossible to have an EPS of -1.7 million as the no company holds such a high share price to drop.
To identify and remove the abnormal value of EPS, the rows with EPS from the minimum to -100 and from 100 to the maximum value are filtered to be checked. By checking 35 filtered rows manually within the dataset or through the official financial reports, it is found 21 records are abnormal. For instance, it is almost impossible for value for ‘Regal Beloit Corp’ to have an EPS of 4680k in 2012 while the value is less than 4 for the other years.
The EPS value for those records are removed and replaced by missing value to remain the info for other columns. With this, the data range after excluding outliers is much smaller, see below for EPS distribution before and after excluding outliers.
Adding calculated fields
To create parallel coordinates with tableau, one challenge is that the data range of the selected indicators are significantly different. For example, the range of ‘EarningsPerShare’, as processed before, is from -89.99 to 3681 while the range of ‘Assets’ is from 0 to 2573 billion. To encounter this problem, 12 of the indicators are standardized to a scale of [0,1] using the formula (X- Col Minimum(X)) / (Col Maximum( X )-Col Minimum( X)). In addition, 5 ratios are calculated for investors to gain a more comprehensive understanding of the company. Their formula and function are shown as follow.
The Storyboard
After getting the data is prepared, it has the following attributes.
- 19077 rows
- 37 columns including 12 standardized indicators and 5 calculated ratios
- 4194 companies across 18 industries
Check the link for the Tableau App, note that, because of the big data size, the app loads quite slowly in tableau server, it is suggested you to download the workbook to your local system and explore the storyboard with presentation view.
Dashboard 1: Stock Overview
The first dashboard presents a combination of parallel coordinates and box plots applied to the 12 selected indicators. This is to give a general view of how companies perform in terms of these indicators across or within industries for the past 5 years.
Demo to use
- Step1: For Industry filter, un-check (All), select ‘Manufacturing’
- Step2: For Year filter, un-check(All), select ‘2015’
- Step3: Hover to the spike of ‘liabilitiesCurrent’ in the parallel coordinates and click to select. Alternatively, click on the highlighter, type 'Apple.Inc' and select it from the list.
Follow the steps above, you will get the something similar to the picture below.
When hovering to the spike of ‘liabilitiesCurrent’ in the parallel coordinates, the tooltip shows that the company is Apple and its 5 calculated indicators. Also it displays the normalized value of ‘liabilitiesCurrent’. As the data are internally linked, when the spike point is selected the other data points linked with it will all be highlighted, and the actual value of all the indicator are showing in the boxplots. Note that for box plots, if the data point is highlighted but no value is showing, that data point is a missing value. Not surprisingly, Apple performed quite well in 2015 with an EPS of 9 which is above the upper whisker of 6, ranking at No. 1 in both Net income and operating income within manufacturing and information industry.
Note that this dashboard is visualizing each row of raw data, not summarized or subset data, it is thus quite slow to process based on user selection. with the concern to improve processing efficiency, R is explored in this case, see Parallel Corrdinates + Boxplots Using R
Dashboard 2: Indicator Trend By Company
With the second dashboard, users can further explore their interested companies. It enables users to examine the trend of 6 indicators for specific companies, as compared with the industrial median values. Note that the median metric is used in this case for its less bias as compared with average metric. The 6 indicators are EPS, along with the other 5 calculated ratios: Current ratio, Quick Ratio, Debt ratio, Time-interest-earned Ratio and Net Profit Margin.
Demo to use
- Step1: Click on the triangle on the company name filter
- Step2: Enter ‘Apple Inc’ or select ‘Apple Inc’ in the dropdown list
- Step3: Hover to any of the bars
Following the demo, by hovering on EarningsPerShare in 2015, the linked data for other 5 indicators in 2015 are highlighted with their corresponding values. While the orange vertical lines represent industrial median values, it is a benchmark to know the company’s performance within the industry.
By looking at the trend of those indicators, Apple’s EPS surged to 44.64 in 2012 while collapsed to 6.49 in 2015, followed by a slight recover in 2015. Despite such big fluctuations, its EPS has been far above the median EPS of manufacturing industry, which is $0.69.
Comparing current ratio, quick ratio and debt ratio, it indicates that Apple was having less short-term and long-term assets but more short-term and long-term liabilities. Especially, after 2014, its debt ratio went up beyond the industrial medians which were 0.49 and 0.58 for 2014 and 2015 respectively. The Time-interest-earned Ratio for Apple crush 73% from 2013 to 2015. Apple’s operating income in 2013 was able to cover 360 times of interest expense at that year, but it became 97 times in 2015. Though this value '97' is still far above the industrial median, it signals Apple’s declined ability in paying long term debts.
Dashboard 3: Financial Status By Company
The third dashboard is especially designed for users to examine financial status of specific companies.
Demo to use
- Step 1: Un-check (All) for industry filter and select only ‘Manufacturing’ or with any other industries.
- Step 2: Type “Apple Inc” in Highlight Company
- Step 3: Click the rectangle facing in the right to activate the animation
The above GIF shows how Apple’s assets and liabilities surged in recent 5 years, as compared with companies from manufacturing industry and information industry. One thing to note is that, from 2011 to 2015, the debt ratio also increased from 34% to 59%, while current ratio decreased from 1.61 to 1.11. This basically means that volume of Apple's Debt increased faster than its volume of liabilities, leading to higher debt proportions in both its capital structure and financial structure by the end of fiscal year 2015. Further information is needed to measure if more debt is financially good or bad, as Apple could be raising money for new technology development and may recoup the cost soon or it was just having problems paying its debt.
Conclusion
About Fundamental analysis
When it comes to selecting stock portfolios to invest, there no “right way” or even no right portfolios to invest. But by focusing on certain fundamental metrics to be compared across companies and industries, investors can always put red flags on companies with earnings problems, decreased cash flow or too much debt. With this, they would be more confident on predicting the stock movement of companies and be in the path of cashing the stock investment.
About The Tools
- Limitation on Tableau
- Hard to construct boxplots for different columns that can align horizontally
- Doesn’t support flexible multiple axis for parallel coordinates plot
- Filtering becomes extremely slow the dashboard is hosted in Tableau Public because of the data size
- Limitation on R
-Slow computation for parallel coordinates using GGPLOT2
-Only can interactively show the data points with PLOTLY, cannot highlight each line of observation
- Limitation on High D
-Support parallel coordinates plot that has different axis for each variable, but doesn’t support to display the only filtered info, without other info shaded