Network Analysis of Interlocking Directorates/Project Overview

From Analytics Practicum

Data Files

The data used in this project were extracted from the OneSource database, with access granted by SMU Li Ka Shing Library. OneSource was chosen because it contains comprehensive information on public and private companies and industries worldwide, including company profiles, news, financial data, executive profiles, analyst reports, and business and trade articles. Besides extracting the lists of companies and executives, we can also trace back to a company’s financial profile or related news at any time, which makes OneSource a useful source of information for this project. Details of the data collection and preparation are discussed below.

The data set extracted from OneSource consists of 2 files: a list of companies in Singapore and a list of executives in those companies.

List of Companies

The company list table contains 50,334 records, where each record represents a company and its relevant information. While extracting, we occasionally found discrepancies between the total number of companies shown in OneSource and the total number we could extract. However, as such occurrences were few, we still considered OneSource reliable for data collection. We categorized the companies by industry according to ISIC Rev. 4 (International Standard Industrial Classification of All Economic Activities), a standard structure for categorizing businesses provided by the United Nations Statistics Division.


Step 1: Filling up missing data

Through a quick scan of our data, we observed that many companies had empty cells under the parent company and parent country columns. As this information is useful to our analysis, we searched the Internet for the companies’ profiles to find their parent companies and filled in the empty values accordingly. For companies on which we were unable to find information, we assumed that they had no parent company and that their parent country was Singapore. Therefore, companies with no parent company have their “Parent Company” cell filled with their own name and their “Parent Country” cell filled with “Singapore”.
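The default-filling rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the procedure the team actually ran (the filling was done by hand after Internet searches); the function name is hypothetical, and the column names “Company Name”, “Parent Company” and “Parent Country” are assumed to match the extracted file.

```python
def fill_parent_defaults(rows):
    """Return copies of the rows with blank parent fields filled in.

    Assumption: each row is a dict with "Company Name", "Parent Company"
    and "Parent Country" keys, and blanks are empty strings or None.
    """
    filled = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is left untouched
        if not (row.get("Parent Company") or "").strip():
            # No known parent: the company is treated as its own parent.
            row["Parent Company"] = row["Company Name"]
        if not (row.get("Parent Country") or "").strip():
            # No known parent country: default to Singapore.
            row["Parent Country"] = "Singapore"
        filled.append(row)
    return filled
```

Because the defaults are applied only to blank cells, any value found through the manual search is preserved.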

Step 2: Editing inconsistent postal codes

We also faced an issue with inconsistent postal code data from OneSource. As our team aims to explore a position-based approach in our future analysis, postal codes are important to us. A Singapore postal code normally consists of six digits; hence, we checked for postal code values that were not 6 digits long and performed Internet searches to fill in the invalid or empty cells. If a search did not return results, we filled in the cell as “NA”. As this required manually searching the Internet for more than 1,200 records, this step took a considerable amount of time and effort.
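The six-digit check can be expressed as a small validation routine. The sketch below only flags records for manual lookup; it does not replace the Internet searches. The function names and the “Postal Code” column name are assumptions made for illustration.

```python
import re

# A valid Singapore postal code is taken here to be exactly six ASCII digits.
SIX_DIGITS = re.compile(r"[0-9]{6}")


def is_valid_postal_code(value):
    """True if the cell holds exactly six digits (surrounding spaces ignored)."""
    return bool(SIX_DIGITS.fullmatch(str(value or "").strip()))


def rows_needing_lookup(rows, column="Postal Code"):
    """Return the rows whose postal code is missing or not six digits long."""
    return [row for row in rows if not is_valid_postal_code(row.get(column))]
```

A code that lost digits during export would fail the check and be flagged along with the empty cells.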

As it is impractical to analyze 50,000 postal codes, our team also derived a new attribute named “Postal Sector”. Every Singapore postal code contains 6 digits with the first 2 digits indicating an address’s postal sector. These 2 digits can range from “01” to “82” and they can be further classified into one of the 28 postal districts in Singapore. Hence, our team was able to determine the postal sector of every company with a postal code and further classify them into the 28 different postal districts of Singapore.
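Deriving the “Postal Sector” attribute amounts to taking the first two digits of a valid postal code. The following is a minimal sketch under that assumption; the function names are hypothetical, and the sector-to-district lookup is deliberately left out because the mapping table is not reproduced here.

```python
from collections import Counter


def postal_sector(postal_code):
    """Return the two-digit postal sector, or None if it cannot be derived."""
    code = str(postal_code or "").strip()
    if len(code) != 6 or not (code.isascii() and code.isdigit()):
        return None  # covers "NA" and any malformed value
    sector = code[:2]
    # Sectors are stated to range from "01" to "82".
    return sector if "01" <= sector <= "82" else None


def companies_per_sector(postal_codes):
    """Count companies by postal sector, skipping codes with no sector."""
    sectors = (postal_sector(code) for code in postal_codes)
    return Counter(s for s in sectors if s is not None)
```

Keeping the sector as a string preserves the leading zero in sectors such as “01”.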

Final company 1.PNG Final company 2.PNG

List of Executives

The attributes we extracted from OneSource include: Executive Name, Executive Title, Company Name, the Industry that the company belongs to, the Postal Code of the company’s headquarters, and the Country (Singapore). In OneSource, the executive list is very detailed and comprehensive, ranging from the most senior executives down to lower levels such as heads of department. Given the scope of the project, we only considered high-level management executives, namely: Board of Directors, Senior Officers & C-Level, Executive Vice Presidents, Senior Vice Presidents, and Vice Presidents.

The Executive List contains 16 attributes, yet we found that some columns were unnecessary and some attributes were missing. Therefore, we cleaned the data as described below:

Step 1: Filling in the Executives’ Full names

The data extracted from OneSource did not provide the full names of the executives in a single field. Instead, there were 3 columns: First, Middle and Last names. However, these values were inconsistent, as some cells contained a single name part while others contained the executive’s full name. Hence, we created a new column that merges the three columns into one full name.
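One way to merge the three name columns is sketched below. Because some cells already held the full name, the sketch drops repeated name parts when joining; this de-duplication is an assumption about how such cells should be handled, not a description of the team's spreadsheet formula, and the function name is hypothetical.

```python
def build_full_name(first, middle, last):
    """Join the First, Middle and Last cells into one full name.

    Name parts that repeat across cells (compared case-insensitively)
    are kept only once, in the order they first appear.
    """
    seen = set()
    tokens = []
    for cell in (first, middle, last):
        for token in str(cell or "").split():
            key = token.lower()
            if key not in seen:
                seen.add(key)
                tokens.append(token)
    return " ".join(tokens)
```

A name that genuinely repeats a part would be shortened by this rule, so the output would still need a manual check.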

Step 2: Correct Executive Names

Multiple variations of the same executive name were also a recurring issue for our group. Some companies record their executives only by the name given at birth, while others record both the birth name and the name the executive more commonly goes by, so several variations of the same name can occur. For instance, the name “Adrian Chan Pengee” has multiple variations such as “Adrian Chan”, “Chan Pengee”, “Adrian Pengee Chan”, etc. To address this issue, our team used Excel pivot tables to identify the duplicated values, then manually checked and updated them to the full name, such as “Adrian Chan Pengee”. This part of the data cleaning gives us cleaner data and lets us correctly represent the nodes and connections in our later visualizations.
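The team identified variants with Excel pivot tables and manual checks. As an alternative illustration only, candidate variants can be shortlisted programmatically by comparing the sets of name parts: two names are flagged when one name's parts are all contained in the other's. The function names are hypothetical, and a flagged pair is only a candidate that still needs the manual confirmation described above, since two different people can share name parts.

```python
from itertools import combinations


def name_tokens(name):
    """Lower-cased set of the parts of a name, ignoring order."""
    return frozenset(name.lower().split())


def candidate_variants(names):
    """Return pairs of distinct names where one token set contains the other."""
    unique = sorted(set(names))
    pairs = []
    for a, b in combinations(unique, 2):
        tokens_a, tokens_b = name_tokens(a), name_tokens(b)
        if tokens_a and tokens_b and (tokens_a <= tokens_b or tokens_b <= tokens_a):
            pairs.append((a, b))
    return pairs
```

The pairwise comparison grows quadratically with the number of distinct names, so on a large executive list it would be sensible to compare only names that share at least one part.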

Final dataset.png

Initial Findings

To gain a broader perspective of Singapore’s corporate environment, our team first conducted several univariate analyses on our dataset.

Firstly, the team finds that Singapore is by far the most common country of origin for both parent and ultimate parent companies. 85.7% of parent companies are based in Singapore, while the US is a distant second with only 3.3%. We also observe that the frequencies of parent country and ultimate parent country do not differ much, staying within 1% of each other. However, it should be noted that this pattern may partly arise from our default assumption that a company’s parent country is Singapore whenever OneSource provided no data. Future research can therefore examine how well this assumption holds.

Secondly, we identified the top 10 industries in Singapore by number of listed companies. The largest industry by number of companies is the Wholesale industry. As seen in Annex B, wholesale companies make up 16.7% of Singapore companies, followed by manufacturing (12.8%) and business and management services (8.9%). The team also observed that Singapore’s corporate environment is dominated by private independent companies: 66.0% of Singapore companies are private independents, followed by private subsidiaries, which form 27.5% of the dataset.
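Percentages such as those quoted above come from a simple frequency count over one column. A minimal sketch of that calculation is given below; the function name is hypothetical, and this is not the tool the team used.

```python
from collections import Counter


def category_shares(values, top_n=None):
    """Return (category, count, percent) tuples, most frequent first.

    Percentages are rounded to one decimal place; top_n=None keeps all
    categories.
    """
    counts = Counter(values)
    total = sum(counts.values())
    return [
        (category, count, round(100 * count / total, 1))
        for category, count in counts.most_common(top_n)
    ]
```

Passing the “Parent Country” column would give the country shares, and passing the industry column with top_n=10 would give a top-10 industry ranking.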

Gephi vs NodeXL

Our team primarily used 2 visualization tools, NodeXL and Gephi, to analyze interlocking directorates in Singapore. NodeXL is an open-source Excel plugin that lets casual analysts conduct network analysis on small to moderate datasets. Gephi is also a popular open-source network visualization tool commonly used for exploratory data analysis, but unlike NodeXL it is capable of handling large datasets.

As our data was extracted to Excel files, it was simpler and more intuitive for us to conduct the initial analysis within Excel itself via the NodeXL plugin. With a user interface similar to Excel, we were able to easily create derived columns like “Company Name” with common Excel formulas. NodeXL also offers common layout options such as Fruchterman-Reingold, which allowed us to derive basic patterns from our data. However, as the data size that NodeXL can handle is limited, our analyses with NodeXL were restricted to intra-industry and intra-district analyses.

Our team used Gephi to run analyses on large datasets, such as interlocking directorates in whole districts and inter-industry linkages. Analyses with Gephi were less intuitive and required more data preparation. Before importing, the data have to be separated into 2 CSV files, one for the node table and one for the edge table. The node table must have “id” and “Label” columns so that Gephi can identify the nodes properly, while the edge table must have “Source”, “Target” and “Type” columns to create the network.
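The split into a node table and an edge table can be sketched as follows, assuming the network links each executive to the companies they serve. The function names, the `exec:` and `co:` Id prefixes and the extra `Kind` column are illustrative choices rather than the team's actual layout; the capitalised `Id` header follows the header Gephi writes in its own table exports.

```python
import csv
import io


def build_tables(records):
    """Build Gephi-style node and edge rows from (executive, company) pairs."""
    nodes = {}
    edges = set()
    for executive, company in records:
        # Prefix the Ids so an executive and a company with the same
        # name cannot collide.
        exec_id, company_id = "exec:" + executive, "co:" + company
        nodes[exec_id] = (executive, "Executive")
        nodes[company_id] = (company, "Company")
        edges.add((exec_id, company_id))  # the set drops duplicate links
    node_rows = [(node_id, label, kind) for node_id, (label, kind) in sorted(nodes.items())]
    edge_rows = [(source, target, "Undirected") for source, target in sorted(edges)]
    return node_rows, edge_rows


def to_csv_text(header, rows):
    """Serialise a header and rows to CSV text ready to be saved to a file."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```

The node text would be produced with `to_csv_text(["Id", "Label", "Kind"], node_rows)` and the edge text with `to_csv_text(["Source", "Target", "Type"], edge_rows)`, each saved to its own CSV file before import.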

Table 4.PNG

While more data preparation was needed for Gephi, it had significant advantages over NodeXL in terms of the tools it provided. Firstly, Gephi contains many different layouts which can be applied to different scenarios. For descriptive purposes, one may choose between layouts such as “Circular” and “Radial Axis”, which allow us to order our nodes in different ways. Gephi also contains more advanced force-based layouts such as Yifan Hu and its own Force Atlas layouts. These layouts let us look at our data in a multitude of ways and derive different insights. Results from these analyses can then be converted to GraphML files which can be easily analyzed by Gephi.

Secondly, while NodeXL only allows us to filter our graphs by size through its dynamic filter function, Gephi contains a library of useful filters. Not only does Gephi allow us to filter the graph based on node and edge attributes such as “Postal Code”, it also allows us to nest filters. This lets us visualize networks such as “manufacturing firms in postal district 22 with betweenness centrality above 1”. Gephi also contains topology filters such as “Ego network”, which allows us to zoom in on a single company or executive and its primary, secondary and tertiary connections. The “Giant Component” filter also allowed us to easily observe whether a dominant linkage exists within a particular network, and this will be further analyzed in the report below.
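To make the two topology filters concrete, the sketch below shows what an ego network and a giant component are, using a plain breadth-first search over an undirected edge list. It illustrates the concepts only and is not Gephi's implementation; all function names are hypothetical.

```python
from collections import deque


def build_adjacency(edges):
    """Turn undirected (a, b) edges into a dict of neighbour sets."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def ego_network(adjacency, root, depth=1):
    """Return the nodes within `depth` steps of `root`, including `root`."""
    distance = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if distance[node] == depth:
            continue  # do not expand beyond the requested depth
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return set(distance)


def giant_component(adjacency):
    """Return the node set of the largest connected component."""
    remaining = set(adjacency)
    largest = set()
    while remaining:
        start = next(iter(remaining))
        # A depth equal to the node count reaches the whole component.
        component = ego_network(adjacency, start, depth=len(adjacency))
        remaining -= component
        if len(component) > len(largest):
            largest = component
    return largest
```

With depth set to 1, 2 or 3, `ego_network` returns the primary, secondary or tertiary connections of the chosen company or executive.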

Table 5.PNG

Although Gephi has a wide range of visualization tools, a severe limitation of this software is its inability to display more than one node shape. Unlike NodeXL, where we could observe 3 different attributes using shape, size and color, in Gephi we could observe at most 2 attributes, using size and color. This can be disruptive to our analysis, as we have to constantly zoom in and read a node’s name to check whether it is an executive or a company.

In summary, both NodeXL and Gephi have their pros and cons but they have aided us tremendously in our analysis because they are complementary tools.