ISSS608 2016-17 T1 Assign2 CHIA Yong Jian
Dataset Chosen
The dataset chosen is the US Stocks Fundamental Data (XBRL).
Theme of Interest and Motivation
Stock markets allow companies to raise equity from a pool of investors (https://www.theguardian.com/sustainable-business/stock-markets-no-longer-fit-purpose). They also allow fund managers and retail investors to grow their capital through short-term or long-term investments in those companies.
For this dataset, I will explore the following main questions, drilling down from the macro to the micro:
- Overview of the US Market - Understand what is "out there" for fund managers and investors to invest in, taking into account the sector and market capitalisation of each company.
- Use of Parallel Coordinates to explore relationships for some main items in each financial statement.
Data Sources
The following data sources are used:
- From https://www.kaggle.com/usfundamentals/us-stocks-fundamentals - (a) indicators_by_company.csv - Provides the core information of indicators as reported by companies to the U.S. Securities and Exchange Commission, (b) companies.csv - Provides the mapping of the company name to the company ID.
- From http://usfundamentals.com/ - (a) companies-names-industries.csv - Provides the NAICS industry sector (http://www.census.gov/eos/www/naics/) information for each company
- From http://www.fasb.org/jsp/FASB/Page/SectionPage&cid=1176164335312 - (a) Taxonomy_2016.xlsx - Provides some information on labels
Discussion of Results
Discussion on Workings/Process
Data Sources and Analysis
Data Challenges:
- Information scattered across multiple files - for example, the indicators_by_company.csv file does not contain the company name. A join is necessary to bring in the company name.
- Incomplete information in each file - not all companies have data across all available indicators
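Both challenges can be illustrated with a minimal sketch. It uses tiny in-memory stand-ins for indicators_by_company.csv and companies.csv; the column names (company_id, indicator_id, name_latest, one column per year) and all values are assumptions made for illustration, not taken from the actual files.

```python
import csv
import io

# Hypothetical stand-ins for the two files; column names and values are
# assumptions for illustration only.
INDICATORS_CSV = """company_id,indicator_id,2014,2015
1001,Assets,500,550
1001,Revenues,,120
1002,Assets,900,
"""
COMPANIES_CSV = """company_id,name_latest
1001,Alpha Corp
1002,Beta Inc
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def join_company_names(indicator_rows, company_rows):
    """Attach the company name to every indicator row via company_id."""
    names = {row["company_id"]: row["name_latest"] for row in company_rows}
    joined = []
    for row in indicator_rows:
        merged = dict(row)
        # Rows without a matching company keep an empty name.
        merged["company_name"] = names.get(row["company_id"], "")
        joined.append(merged)
    return joined


def missing_cells(rows, year_columns):
    """Count empty year cells, i.e. indicators a company did not report."""
    return sum(1 for row in rows for year in year_columns if row[year] == "")


joined = join_company_names(load_rows(INDICATORS_CSV), load_rows(COMPANIES_CSV))
gaps = missing_cells(joined, ["2014", "2015"])
```

Here `joined` carries a company_name on every row, and `gaps` counts the year cells left empty, which gives a rough sense of how incomplete the indicator data is.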
Original preparations performed in JMP:
Step 1. Open indicators_by_company.csv, perform a transpose
Step 3. Join to companies-names-industries.csv. Some records have missing NAICS industry names. These gaps are inherent in the source data and are recoded to "Not Available". A sample of the column names is shown below. Save to a SAS7BDAT file
Step 4. Open in Tableau
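The transpose and recode steps were performed in JMP; the following is only a rough Python sketch of the same idea, not the JMP procedure itself. It assumes the transpose produces one row per company and year with one column per indicator, which the original does not state. The record layout, the `sectors` lookup and all values are hypothetical.

```python
from collections import defaultdict

# Hypothetical long-format records: (company_id, indicator_id, year, value).
long_rows = [
    ("1001", "Assets", "2015", 550.0),
    ("1001", "Revenues", "2015", 120.0),
    ("1002", "Assets", "2015", 900.0),
]

# Hypothetical sector lookup; None marks a missing NAICS industry name.
sectors = {"1001": "Manufacturing", "1002": None}


def pivot_indicators(rows):
    """Collect indicator values into one mapping per (company, year)."""
    wide = defaultdict(dict)
    for company_id, indicator_id, year, value in rows:
        wide[(company_id, year)][indicator_id] = value
    return dict(wide)


def recode_sector(sector):
    """Replace a missing sector name with the "Not Available" label."""
    return sector if sector else "Not Available"


wide = pivot_indicators(long_rows)
table = [
    {
        "company_id": company_id,
        "year": year,
        "sector": recode_sector(sectors.get(company_id)),
        **values,
    }
    for (company_id, year), values in sorted(wide.items())
]
```

Each entry of `table` holds only the indicators a company actually reported, so company 1002 has no Revenues key and its sector reads "Not Available".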
However, the above steps generated a huge export file (of a few GBs), which caused loading issues in Tableau. The process was therefore reviewed.
For Tree Map:
- The files are joined in Tableau instead. The Taxonomy file is not joined, as not all labels in the original indicators_by_company file are available in the Taxonomy file.
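One way to check how many indicator labels a taxonomy lookup would cover is a simple set difference. This is a minimal sketch under stated assumptions: the indicator IDs and the taxonomy entries below are hypothetical, and the real Taxonomy_2016.xlsx would first need to be exported to a form Python can read.

```python
# Hypothetical indicator IDs as they might appear in indicators_by_company.csv.
indicator_ids = {"Assets", "Revenues", "CustomCompanyTag"}

# Hypothetical ID-to-label entries taken from the taxonomy workbook.
taxonomy_labels = {"Assets": "Assets", "Revenues": "Revenues"}


def label_coverage(ids, labels):
    """Return the share of IDs with a label and the sorted IDs without one."""
    unlabeled = sorted(ids - set(labels))
    covered = len(ids) - len(unlabeled)
    return covered / len(ids), unlabeled


coverage, unlabeled = label_coverage(indicator_ids, taxonomy_labels)
```

If `unlabeled` is not empty, joining on the taxonomy would leave some indicators without a label, which is consistent with the decision not to join that file.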
For Parallel Coordinates:
- The charts are generated directly in JMP, as Tableau does not have a native Parallel Coordinates capability
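For readers unfamiliar with the chart type, a parallel coordinates plot draws each record as a polyline across one vertical axis per variable, and a common convention is to scale each axis independently to the range 0 to 1. The sketch below shows only that scaling step. Whether JMP applies exactly this scaling is not stated here, and the statement items and values are hypothetical.

```python
# Hypothetical financial statement items for three companies.
rows = [
    {"Assets": 100.0, "Liabilities": 40.0, "Revenues": 10.0},
    {"Assets": 300.0, "Liabilities": 90.0, "Revenues": 30.0},
    {"Assets": 200.0, "Liabilities": 40.0, "Revenues": 50.0},
]


def scale_axes(records, axes):
    """Min-max scale each axis separately; return one polyline per record."""
    bounds = {
        axis: (min(r[axis] for r in records), max(r[axis] for r in records))
        for axis in axes
    }
    polylines = []
    for record in records:
        line = []
        for axis in axes:
            low, high = bounds[axis]
            # A constant axis has no spread, so place its points mid-axis.
            line.append(0.5 if high == low else (record[axis] - low) / (high - low))
        polylines.append(line)
    return polylines


polylines = scale_axes(rows, ["Assets", "Liabilities", "Revenues"])
```

Each inner list gives the heights at which one company's line crosses the Assets, Liabilities and Revenues axes, so items with very different magnitudes can be compared on the same chart.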
Tools Utilised
- SAS JMP Pro 12
- Tableau 10
Data Visualization Link
Will be updated asap.