Arisaig Final Progress

From Analytics Practicum


ANALYSIS APPROACH & CHALLENGES FACED

Data Understanding & Representation
As noted in the description of the project's nature, the project uses multi-dimensional data. The main categories of data are:

  • Spatial (geographical) data
  • Temporal (time-series) data

This creates challenges both in understanding how the different kinds of data interact and in representing the patterns and relationships found within the data. The main challenges addressed in this thesis are discussed in the following parts:

  1. Visual Representation of Spatial Data
  2. Visual Representation of Temporal Data
  3. Visual Representation of High Dimensional Data
  4. Interactivity of Multi-Linked Views
  5. Context Provision

Visual Representation of Spatial Data

Spatial data offers the opportunity to show how geographical proximity may create inter-dependencies between countries, which would give investors better insights for decision-making. The difficulty lies in representing this information so that it remains comprehensible to budding analysts and investors. To cater to this need, the project implemented a choropleth map for the visualization of spatial data.

The choropleth provides an overall map view of how the various countries and regions perform with respect to the variable in question. Its limitation is that, because it ranks countries through colour intensity, it can only show one variable at a time (univariate). Its usefulness can be improved by integrating a zoomable user interface (ZUI), which allows the user to zoom in on a selected region for a focused view of the visualisation. In addition, the choropleth is more interactive when it responds to user clicks and selections on the map, allowing the user to view only the information that is of interest. Together, these designs supplement the analytical value of the choropleth map.

Arisaig Asia zoom.png



Visual Representation of Temporal Data

Temporal data is characterised by time periods or time intervals. Such data is typically handled by treating time as another numerical variable and representing it along a linear time axis. In this project, time is instead explored through a time slider that represents the time periods in the dataset. In other words, rather than assigning time to a specific axis, the project approaches the representation of temporal data through animation. This frees up an axis, so that an additional variable can be shown for more insightful analysis. The gist of this approach is that the visualisations show the data of one specific year at a time, and moving the time slider causes them to update to show the data of the newly selected year.

Visual Representation of High Dimensional Data

Parallel Coordinates

The data obtained after the collection phase is continuous, which means that it can be compared and sorted. However, representing high dimensional data is not easy and could be overwhelming to lay users, which is a consideration the project has to bear in mind. To show such data, the project employed a parallel coordinates plot to view multiple dimensions at once.

The benefit of the parallel coordinates plot is that one can see the overall patterns across the selected attributes (e.g. high birth rate, high life expectancy and low population density). It also shows how a country “stands” in comparison to other countries, both regionally and worldwide. However, since parallel coordinates show everything to the users, the plot may be perceived as a “messy chunk of lines” by untrained users. The difficulty therefore lies in making sure that there are sufficient guides and walkthroughs to train the users to see the patterns the graph is capable of revealing. In addition to providing training, the project has incorporated interactivity to make data exploration easier. The ideas adopted include brushing to show or hide lines, highlighting the selected country, and reordering the axes to suit the user.

Parallel coordinates have other limitations as well. One that could affect analytical exploration is that the plot cannot tell the user the spread and distribution of the countries along each axis. To overcome this problem, the project introduced two enhancements that can be overlaid onto the parallel coordinates plot: a box plot and a histogram. The box plot provides statistical insights such as the median and interquartile range, while the histogram allows the users to see the frequency distribution along each axis. The overlays also respond to events, so they can be a useful tool for the users during their exploration.

Arisaig Boxplotandhistogram.png


Nevertheless, these graphs do have their limitations. For example, a histogram is created for each axis, but the range of the histogram varies from axis to axis, so a direct comparison cannot be made across dimensions. It is therefore necessary to standardize all the axes to provide an equal and fair comparison. In addition, throughout the team’s data exploration process, the data showed signs of skewed distributions, which could be misleading and give a poor picture of the actual situation. Transformation techniques (such as logarithmic, square and square root) were therefore introduced. These enhancements are optional, which caters to the different analytical levels of the users.
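
As a rough illustration of the standardization and optional transformations described above, the following sketch rescales one axis to z-scores and applies a selectable transformation. The function names and sample values are assumptions made for this example and are not taken from the project code.

```javascript
// Hypothetical helpers; names and sample data are illustrative only.

// Rescale values to z-scores (mean 0, standard deviation 1) so that
// axes with very different ranges become comparable.
function standardize(values) {
  const n = values.length;
  const mean = values.reduce((sum, v) => sum + v, 0) / n;
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / n;
  const sd = Math.sqrt(variance);
  // A constant axis has no spread; map every value to 0 instead of dividing by zero.
  return values.map((v) => (sd === 0 ? 0 : (v - mean) / sd));
}

// Optional transformations for skewed axes.
const transforms = {
  none: (v) => v,
  log: (v) => Math.log10(v), // only meaningful for v > 0
  square: (v) => v * v,
  sqrt: (v) => Math.sqrt(v), // only meaningful for v >= 0
};

function transformAxis(values, kind) {
  return values.map((v) => transforms[kind](v));
}

// A heavily skewed axis becomes evenly spaced after a log transformation.
const skewed = [1, 10, 100, 1000];
const logged = transformAxis(skewed, "log");
const zScores = standardize(logged);
```

Applying the transformation before standardizing keeps the two steps independent, so a user could switch either one on or off.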

Scatterplot

Another visualisation the project adopted for displaying multi-dimensional data is the scatter plot. Its purpose is to strike a better balance between providing the overall picture and providing a detailed representation. The scatter plot allows the users to make a simple comparison of the countries on the two axes based on their positions on the plot. Using temporal data alongside the graph as animation can help to highlight patterns of periodic behaviour between non-time-related dimensions. Admittedly, the animation may not convey the overall trend as easily as a time-series graph. Hence, the project incorporated time-series data as an overlay for each country, shown upon click. The purpose is to inform the user of the trajectory of each country’s growth over time, while still maintaining the two axis variables.

Arisaig Scatter selected.png



Data exploration showed that linear scaling of the axes is a challenge and makes for a poor visualisation in the scatter plot. This is because the data is heavily skewed towards the lower end of the scale, with huge outliers (e.g. the USA in terms of Total GDP). As such, hardly any analytical insight can be derived from the linear scale alone. Nevertheless, the linear scale is still necessary, as it shows the raw situation and thereby reveals the major players in the dataset. What this shows is the need for additional scaling options that spread out the data points to improve both visibility and usability. The project therefore adopted logarithmic scaling, which places greater emphasis on the lower end of the scale.

Arisaig Scatterplot scaling.png



To further improve the scatter plot’s ease of use, the project also moved away from D3’s dynamic color assignment. Colors are instead predefined for the various regions and remain consistent even after the graphs are reloaded. The reason is that D3’s dynamic color assignment can change when graphs are loaded, which means that green could represent Asia Pacific at one moment and yellow the next. This would be very confusing and would distract the users from finding more valuable insights.
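
A minimal sketch of a predefined colour mapping is shown below. The region names, hex values and fallback colour are assumptions chosen for illustration; the project's actual palette is not given in the text.

```javascript
// Hypothetical region names and colours, for illustration only.
const REGION_COLOURS = Object.freeze({
  "Asia Pacific": "#2ca02c",
  "Europe": "#1f77b4",
  "Latin America": "#ff7f0e",
  "Middle East and Africa": "#d62728",
});

// Neutral colour for any region that is not in the table.
const FALLBACK_COLOUR = "#999999";

// The same region always maps to the same colour, regardless of the
// order in which the data happens to be loaded.
function colourForRegion(region) {
  return Object.prototype.hasOwnProperty.call(REGION_COLOURS, region)
    ? REGION_COLOURS[region]
    : FALLBACK_COLOUR;
}
```

Because the lookup does not depend on the order in which regions are first encountered, reloading or filtering the data cannot reshuffle the colours.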

Interactivity of Multi-Linked Views

Each graphical representation has its limitations, which can be overcome by supplementing it with other visualisations that provide the missing information. This is achieved through the use of multi-linked views, where events triggered on one visualisation affect the other visualisations. The technique adopted is “dynamic queries”: highly interactive systems that enable visualizations to be manipulated as the user interacts with the visualizations or with the enhancements the project has added. For instance, a region that was faded on the scatter plot immediately becomes visible and colored upon a click on the choropleth map. Such an approach provides coordinated views where changes are reflected in real time, helping the analysts and users in their exploration process.

In addition, the use of three types of visualisation, univariate (choropleth), bivariate (scatter plot) and multivariate (parallel coordinates), provides three views that display different aspects of the data. This supports a progressive approach to data exploration that allows users to gain insights according to their level of expertise without being overwhelmed with information. Navigation of the data also benefits when the visualizations are all displayed in preassigned locations, which speeds up exploration by saving the user from performing the same or similar operations multiple times.

Context Provision

The challenge in visual analytics is “communicating” to users how to best utilize the visualization aids. At the end of the day, a comprehensive analysis tool would be useless if the user were unable to use it unguided. Hence the importance of introducing the context of the data, in other words “visual data stories”. The idea of the storytelling feature is to set the context so that the user understands how the graphs behave and what information they can tell when used individually and/or together. As users come to better understand how to tell stories with visualizations, new possibilities open up. One such example is tracking urban population percentage against annual disposable income in the scenario of beer industry investment (3.1.2 Packaged Food: Investment), where users are able to shortlist potential countries while also picking out outliers such as the Muslim countries.

Arisaig Context ss.png



TECHNICAL CHALLENGES

Linking Components for Multiple Linked View

Multiple linked views are essential in the development of a data visualization, with benefits discussed in our review of related works. Enabling multiple linked views involves linking and relating the data from one view to another. Data is shared and a filter applied on one view should similarly be applied to other views. The data architecture should then be designed to take into account the requirements of sharing data.

Usage of Crossfilter Library

To facilitate easy sharing and manipulation of multidimensional data, Crossfilter, a JavaScript library, was utilized. Crossfilter allows for exploration of multivariate datasets in the browser and supports extremely fast interaction with coordinated views, which is ideal for our implementation of the multiple linked views. Crossfilter works by creating dimensions on which the data is filtered, and by filtering and reducing the data incrementally, which is faster than traditional data retrieval that starts from scratch each time.

Crossfilter works with D3, a JavaScript library which allows us to create interactive visualizations on the web. As our dataset consists of high dimensional, spatial and temporal data, Crossfilter can help manage our data in three dimensions: variable, space and time. Crossfilter creates a 3D array and an API which allows fast access to the dataset.

Challenges with Natural Order Sorting

When filtering data on a specific variable, new dimensions are created using the crossfilter.dimension(value) function. The dimension orders values by their natural order according to JavaScript’s comparison operators. As the dataset contains some null values, the values are treated as strings by JavaScript even though they are numerical. As a result, sorting errors occurred; for instance, 9.9 was regarded as larger than 89. To overcome this issue, we converted non-empty values into numbers by multiplying them by 1, after which the values are compared correctly.

Arisaig 1.png
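
The sorting problem can be reproduced in plain JavaScript, independently of Crossfilter. The sketch below is not the project's code; the sample values and the helper name toNumberOrNull are assumptions.

```javascript
// Values as they might arrive from a CSV file: everything is a string,
// and an empty string marks a missing value.
const raw = ["89", "9.9", "", "120"];

// Strings are compared character by character, so "9.9" ranks above "89"
// because "9" comes after "8".
const stringComparison = "9.9" > "89";

// Convert non-empty values to numbers; multiplying a numeric string by 1
// coerces it to a number. Missing values are kept as null.
function toNumberOrNull(v) {
  return v === "" || v === null || v === undefined ? null : v * 1;
}

const numeric = raw
  .map(toNumberOrNull)
  .filter((v) => v !== null)
  .sort((a, b) => a - b); // numeric order: 9.9, 89, 120
```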



Applying Multiple Filters

Crossfilter filters and reduces the data incrementally, as this provides greater speed. The consequence is that filters applied previously remain in place, so the dataset gets smaller with each additional filter. This is sometimes not the desired result: for instance, when the user changes the variables on the scatter plot, we want to filter from the complete dataset. The problem that arose was that the number of countries shown in the scatter plot decreased over time.

Arisaig 2.png



To prevent this problem, filters have to be cleared and disposed of when new dimensions are created. This ensures that no unwanted filters are still applied to the dataset. The filters were cleared and disposed of using the following code:

Arisaig 3.png
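
The effect of leftover filters can be illustrated with a deliberately small stand-in written in plain JavaScript. It is not Crossfilter and not the project's code; the class and its method names are assumptions. In Crossfilter itself, the corresponding calls on a dimension are filterAll() to clear its filter and dispose() to remove the dimension.

```javascript
// A tiny stand-in for incremental filtering (hypothetical, not Crossfilter).
class FilteredData {
  constructor(rows) {
    this.rows = rows;
    this.filters = new Map(); // dimension name -> predicate
  }

  // Register (or replace) the filter for one dimension.
  filter(name, predicate) {
    this.filters.set(name, predicate);
  }

  // Analogue of clearing and disposing of a dimension's filter.
  clear(name) {
    this.filters.delete(name);
  }

  // Rows that pass every filter that is still registered.
  visible() {
    const predicates = [...this.filters.values()];
    return this.rows.filter((row) => predicates.every((p) => p(row)));
  }
}

const data = new FilteredData([
  { country: "A", gdp: 5, debt: null },
  { country: "B", gdp: 7, debt: 3 },
  { country: "C", gdp: null, debt: 2 },
]);

data.filter("gdp", (r) => r.gdp !== null);
data.filter("debt", (r) => r.debt !== null);
const withStaleFilter = data.visible().length; // only B passes both filters

data.clear("debt"); // the user switched away from the debt axis
const afterClearing = data.visible().length; // A and B pass the gdp filter
```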



Sequence of Drawing Visualizations

The different views in our application represent visualizations of univariate, bivariate and multivariate data. As such, each view applies a different number of dimensions to the Crossfilter object. The sequence of loading the visualizations becomes extremely important, as the visualization with only one dimension (i.e. the choropleth) should be loaded before the visualizations with multiple dimensions (i.e. the scatter plot and parallel coordinates). If the scatter plot were drawn before the choropleth, countries with null values on one of its axes would be removed, and the choropleth would lose some of its data, causing inaccuracy.

Arisaig 4.png



The functions that draw the scatter plot (_drawScatterAxes) and parallel coordinates (_drawParallel) are called inside the choropleth’s _drawMap function to ensure that the visualizations are drawn in the correct order.

Coordinating Views upon Input Change

Interactions on one view affect the other views, and the visualizations should provide visual feedback to tell users of the change.

For instance, when a country is selected on any of the views, the map zooms in to the country’s region. In the scatter plot, the country’s bubble is drawn with a thicker outline and the other bubbles in the same region are highlighted, while the remaining bubbles become more transparent. In the parallel coordinates, the country’s line is bold and dark blue, whereas the lines of the other countries in the same region are highlighted in light blue.

Arisaig 5.png



There are many ways of visually representing a change in selection, and trial and error was used to find the most suitable method. Changing the colour and transparency lets users identify the other countries within the region, which can help them identify trend patterns within the region. Using bold lines makes it easy to identify the particular country that was selected.

Global variables were used to store the selected country and region so that the correct selection can be made in each of the views. When users change their inputs, the global variables are updated and the views are updated correspondingly.

Arisaig 6.png



Our dataset also contains temporal data, and a slider is provided for users to select the year of interest. When the slider moves, the visualizations are animated and updated to reflect the values of the corresponding year. As the year changes and the data for that year is retrieved from the Crossfilter object, we need to make sure that the options selected earlier, such as the region, axes and selected country, remain active.

Arisaig 7.png
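
One way to picture the shared selection state is sketched below: a single state object holds the year, region and country, and every registered view is redrawn whenever part of it changes. The field names, function names and the logging "views" are assumptions for illustration and do not reproduce the project's code.

```javascript
// Hypothetical shared state; the field names are assumptions.
const state = { year: 2000, region: null, country: null };
const views = [];

// Each view registers a redraw callback that receives the current state.
function registerView(redraw) {
  views.push(redraw);
}

// Merge the change into the shared state, then redraw every view.
// Fields that are not part of the change keep their previous values.
function updateState(change) {
  Object.assign(state, change);
  views.forEach((redraw) => redraw(state));
}

// Stand-ins for real views: they only record what they would draw.
const drawLog = [];
registerView((s) => drawLog.push(`map:${s.year}:${s.country}`));
registerView((s) => drawLog.push(`scatter:${s.year}:${s.country}`));

updateState({ country: "Japan", region: "Asia Pacific" });
updateState({ year: 2005 }); // moving the slider keeps the earlier selection
```

Because only the changed fields are merged, moving the time slider leaves the selected region and country in place, which is the behaviour the paragraph above asks for.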



Creating Flexible Components for Reusability

Besides creating an application for the consumer symposium, our application is also designed as an analytics tool to help the Arisaig analysts in the long run. As such, our application must be flexible and cater for the uploading of new datasets.

Loading New Data Files & Display of Data Labels

Two files facilitate the loading of new datasets: a CSV file which serves as the dataset template, and a data dictionary file which lists the names of the columns in the dataset and their data labels. The use of these CSV files allows Arisaig to change the dataset without the need for programming.

Arisaig 8.png



The columns Country, Year and Region are fixed columns which must be included in the dataset. The other columns can be changed according to Arisaig’s needs. As we are using the Crossfilter library, the data needs to be in a format where each row represents the data for a particular Country and Year, as these represent the space and time dimensions of the 3D array created using Crossfilter.

Arisaig 9.png



The data dictionary stores the variable name, which corresponds to the column name in the dataset. The label is the text to be displayed in the application, as the column names used in the dataset are often abbreviations and may not be easily understood by a user of the application. These two columns are required, as our application uses them to retrieve the information and to display the label texts to the user. The description column is optional and describes what each column represents.

To display the data label, we retrieve the list of column names from the Variable_Name column. For each Variable_Name, the label is retrieved and returned to be printed in the application. The lists of options in our application are also generated dynamically each time the data file is loaded. The labelling of the visualization axes is also done using this method. The following code snippet shows an example of how the method was carried out programmatically.

Arisaig 10.png



When the data dictionary file is loaded, the variable names and their label texts are loaded and populated into the dropdown boxes. The variable names and labels are also saved as separate arrays, which the application uses when retrieving the data label for a selected variable. The next code snippet shows how the data labels are retrieved.

Arisaig 11.png
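
For readers without access to the figures, a label lookup of this kind might be sketched as follows. The Variable_Name column comes from the text above; the Label and Description column names, the sample rows and the function names are assumptions.

```javascript
// Rows as they might look after parsing the data dictionary CSV.
// The variable names and labels here are invented for the example.
const dictionary = [
  { Variable_Name: "GDP_TOT", Label: "Total GDP", Description: "Gross domestic product" },
  { Variable_Name: "POP_URB", Label: "Urban Population (%)", Description: "Share of urban population" },
  { Variable_Name: "INC_DISP", Label: "Annual Disposable Income", Description: "" },
];

// Build the lookup once, when the dictionary file is loaded.
const labelByVariable = new Map(
  dictionary.map((row) => [row.Variable_Name, row.Label])
);

// Return the display label for a column, falling back to the raw
// column name when the dictionary has no entry for it.
function labelFor(variableName) {
  return labelByVariable.has(variableName)
    ? labelByVariable.get(variableName)
    : variableName;
}

// Entries for a dropdown box: the value is the column name and the
// visible text is the label.
const dropdownOptions = dictionary.map((row) => ({
  value: row.Variable_Name,
  text: row.Label,
}));
```

A map keyed by variable name avoids keeping two parallel arrays in step, although parallel arrays as described above would work equally well.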



We have also included other features in our application to allow flexibility.

Selecting Variables for Visualization

All the visualizations give the user the flexibility to select the variables to be visualized. As we do not know what kind of variables will be used in the dataset in future, hard-coding the variables is not possible. This required modifying existing code, as most of the code in the examples we obtained is written for fixed axes.

Arisaig 12.png



When a new variable is selected, the data has to be filtered again using Crossfilter. Traditionally, the dimensions needed are declared when creating the Crossfilter object. However, as we do not know the dimensions needed until they are loaded and selected by the user, we created dynamic dimensions to be used by the Crossfilter object instead. These dynamic dimensions are local variables of the methods and are disposed and cleared after use. The following code snippet shows how the temporary variable is used in creating a Crossfilter dimension.

Arisaig 13.png
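
The idea of building a dimension from a variable that is only known at run time can be sketched without the library. Crossfilter's dimension() takes a value accessor function, so the essential step is creating that accessor from the selected column name. The helper name, column names and sample rows below are assumptions.

```javascript
// Create a value accessor for whichever column the user selected.
// With Crossfilter, a function like this would be passed to dimension().
function makeAccessor(variableName) {
  return (row) => row[variableName];
}

const rows = [
  { Country: "A", Year: 2000, POP: 10, GDP: 5 },
  { Country: "B", Year: 2000, POP: 20, GDP: 7 },
];

// The selected variable arrives at run time, for example from a dropdown.
const selectedVariable = "POP";
const selectedValues = rows.map(makeAccessor(selectedVariable));

// Switching variable only requires a new accessor, not new code.
const otherValues = rows.map(makeAccessor("GDP"));
```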



Representing Missing Values

The original dataset obtained was for 30 years of data, from year 2000 to year 2030. The intention of obtaining 30 years of data was so that extrapolation of the graph could be done. However, when we looked at the data, we noticed that there were many empty values in the dataset. As a result, we constrained the time range of the data to year 2000 to year 2015, where most of the data was available. Based on the dataset, the application now shows a description of the trend from 2000 to 2015.

Missing values in the dataset have also changed the way that data is presented in our visualisations. If left unchecked, missing values lead to NaN values in JavaScript. For each of the visualisations, missing values had to be dealt with specially.

For the choropleth, missing values were given a value of 0 and assigned the colour of the lightest intensity. A country cannot be left with an empty value, as it would then not be represented on the map. Even when a country does not have a value for the corresponding variable or year, we still want it to be shown on the map, as the map is intended for a comparison of all the countries around the world.

Arisaig 14.png



For the scatter plot, if the country has empty values for either of the axes, the country will not be shown in the scatter plot. Unlike the choropleth, we cannot assign the value of 0 for the variable that is empty as 0 is a value and that can affect the accuracy of analysis when using the scatter plot to spot trends. As such, it is more desirable to not show the country at all so that bivariate analysis can still be made on the countries with available information.

In the figure below, (1) shows all the countries, as they have values available for both axes. If one of the variables is changed to External Debt, for which the information is not readily available for some countries, the number of countries shown decreases, as in (2). Even with the limited number of countries, the user is still able to draw conclusions from the scatter plot; in this case, we can see that there is a positive correlation between external debt and total gross domestic product.

Arisaig 15.jpg



For the parallel coordinates plot, if a country has a missing value on any of the axes, there is no corresponding point on that axis. As with the scatter plot, we cannot set the empty values to 0, as 0 is a valid value which some of the countries have. For instance, some countries have 0% Real GDP Growth, so if we were to set empty values to 0, it would be impossible to differentiate between the countries that do not have available data and those that have valid data. The path of a country in the parallel coordinates plot is created by drawing lines between the points on the axes. Thus, if a country does not have a value on one of the axes, a line cannot be drawn between that axis and its adjacent axes, so there will be a break in the country’s line.

Arisaig 16.png
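
The three treatments of missing values described above can be summarised in a short sketch. This is an illustrative reconstruction under assumed names and data shapes, not the project's code.

```javascript
// A value counts as missing if it is null, undefined, empty or NaN.
const isMissing = (v) =>
  v === null || v === undefined || v === "" || Number.isNaN(v);

// Choropleth: keep every country on the map; a missing value becomes 0
// and therefore falls into the lightest colour bracket.
function choroplethValue(v) {
  return isMissing(v) ? 0 : v;
}

// Scatter plot: drop a country if either axis value is missing.
function scatterPoints(rows, xVar, yVar) {
  return rows.filter((r) => !isMissing(r[xVar]) && !isMissing(r[yVar]));
}

// Parallel coordinates: split a country's line wherever an axis value is
// missing, so the gap shows up as a break in the line. A lone point with
// no neighbour cannot form a line and produces no segment.
function lineSegments(row, axes) {
  const segments = [];
  let current = [];
  for (const axis of axes) {
    if (isMissing(row[axis])) {
      if (current.length > 1) segments.push(current);
      current = [];
    } else {
      current.push({ axis, value: row[axis] });
    }
  }
  if (current.length > 1) segments.push(current);
  return segments;
}

const country = { a: 1, b: 0, c: null, d: 4, e: 5 };
const segments = lineSegments(country, ["a", "b", "c", "d", "e"]);
// Two segments: a-b and d-e. The 0 on axis b is kept as a valid value.
```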



Implementation of Choropleth
Selection of Map Library

In creating the choropleth, we explored several D3 map libraries. One that we considered was worldMap, created by Fred Torghele. The library was considered because it allows easy zooming into regions, which supported our idea of viewing the trend in a particular region. The example creates the SVG in D3.js and then develops the application using THREE.js, a JavaScript 3D library, which allows the animation of the map. However, because it makes use of 3D graphics, rendering the map takes longer than rendering shapes created entirely in D3.js.

Arisaig 17.png



Another map library we identified was Datamaps, a lightweight D3 library for displaying web visualisations of geographical data. The library includes out-of-the-box support for choropleths, and the build contains the world map. It is built heavily on D3.js, so with our D3 knowledge we are able to manipulate the map freely. For instance, the map library did not provide a zoom feature out of the box. However, since D3 provides a zoom behaviour, we added it to the code of the Datamaps library to allow zooming into specific regions.

Naming of Countries

When using the Datamaps library, we found some discrepancies between the naming convention used in its country list and the one used in our dataset. For instance, the Datamaps country list used Dem. Rep. of Congo, while our dataset spells it as Democratic Republic of Congo. When the country name differs, the colour cannot be mapped, as the country cannot be found. To solve this issue, we renamed the entries in the Datamaps country list to match those in our dataset. As future users of our application will be modifying the existing dataset, they can reuse the same country names and be assured that the application will continue to work properly.

Coloration of Choropleth

One concern when colouring the choropleth is that differences in colour intensity should be easily detected by the naked eye. In the initial iterations, the differences were not apparent, because the D3 linear scale was used to retrieve each country’s colour, which resulted in a linear gradient of colours. The colours in a gradient blend into one another, which makes it harder for users to sort countries by colour and make comparisons visually. To make the colour differences more distinct, we used D3’s categorical scale for the coloration of the countries instead. We selected 8 stepped colours which were easy to distinguish and assigned one to each of the brackets. With this change, it is now easier for users to visually compare the colour intensity of the countries.

Many papers, including [10], have discussed ways of classifying countries on a choropleth. Three widely used methods of data classification are equal intervals, equal frequencies, and statistically optimal classification. We used the equal interval method, as it allows users to tell visually which countries have values similar to each other. With the other two methods, countries having the same values might be split into different brackets, which can be misleading. However, we found that the values of some of the variables were skewed, which resulted in many countries falling at the extreme ends of the colour scale. To spread the values out better, we applied a log transformation to the values.

Arisaig 18.png



Dynamic Legend

Since the variables to be used for the choropleth are not fixed, the range of the variable’s values is also not fixed. Hence, the range used for colouring the choropleth must be generated dynamically, which can be done easily using D3’s minimum and maximum functions. In addition, a legend is provided to let users know the colour ranges used for colouring the map. When the variable is changed, the legend needs to be updated dynamically, and the bracketing of the colours also has to be done dynamically.
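
The equal-interval bracketing and the matching legend can be sketched in plain JavaScript as follows. The bracket count of 8 comes from the text; the function names, sample values and legend wording are assumptions, and the project itself relied on D3 for this.

```javascript
const BRACKET_COUNT = 8; // the text mentions 8 stepped colours

// Boundaries of equal-width brackets between the current minimum and
// maximum; recomputed whenever the selected variable changes.
function equalIntervalBreaks(values, brackets = BRACKET_COUNT) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const step = (max - min) / brackets;
  return Array.from({ length: brackets + 1 }, (_, i) => min + i * step);
}

// Index of the bracket (0 = lightest) that a value falls into.
function bracketIndex(value, breaks) {
  const brackets = breaks.length - 1;
  const min = breaks[0];
  const max = breaks[brackets];
  if (max === min) return 0; // all values identical
  const i = Math.floor(((value - min) / (max - min)) * brackets);
  return Math.min(i, brackets - 1); // the maximum belongs to the top bracket
}

// One legend entry per bracket, generated from the same boundaries.
function legendLabels(breaks) {
  return breaks.slice(0, -1).map((lower, i) => `${lower} to ${breaks[i + 1]}`);
}

const sampleValues = [0, 10, 35, 80];
const breaks = equalIntervalBreaks(sampleValues);
const labels = legendLabels(breaks);
```

Deriving the legend from the same boundaries that colour the map keeps the two consistent when the variable changes.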

Implementation of Scatter Plot
Linear/Log scale switching

Arisaig 19.png



Because the values are skewed, most of the examined dimensions require logarithmic scaling to space out the data points evenly. When the scale is changed from linear to log, the data has to be filtered again to remove empty and zero values before the transformation, because log(0) is undefined (in JavaScript, Math.log(0) evaluates to -Infinity), which breaks the log scale.
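
A small sketch of the scale switch is given below: rows without a positive value are removed first, and the same domain is then mapped to pixels either linearly or logarithmically. The helper names, the pixel range and the sample rows are assumptions; the project used D3's scales rather than hand-written ones.

```javascript
// Linear mapping from a data domain to a pixel range.
function linearScale([d0, d1], [r0, r1]) {
  return (v) => r0 + ((v - d0) / (d1 - d0)) * (r1 - r0);
}

// Logarithmic mapping; the domain must be strictly positive.
function logScale([d0, d1], [r0, r1]) {
  const l0 = Math.log10(d0);
  const l1 = Math.log10(d1);
  return (v) => r0 + ((Math.log10(v) - l0) / (l1 - l0)) * (r1 - r0);
}

// Keep only rows whose value can be placed on a log axis.
function positiveOnly(rows, variable) {
  return rows.filter((r) => typeof r[variable] === "number" && r[variable] > 0);
}

const sampleRows = [
  { country: "A", gdp: 1 },
  { country: "B", gdp: 10 },
  { country: "C", gdp: 0 },
  { country: "D", gdp: null },
  { country: "E", gdp: 10000 },
];

const usable = positiveOnly(sampleRows, "gdp"); // A, B and E remain
const xLinear = linearScale([1, 10000], [0, 400]);
const xLog = logScale([1, 10000], [0, 400]);
// On the linear axis, B sits almost on top of A; on the log axis it is
// a quarter of the way along, so small values are spread out.
```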

Trend Line

Showing the trend line requires data points to be present before a connecting path can be drawn to link them together. However, because values are missing for several years, as mentioned earlier, the line would be broken if empty values were also considered. The project therefore removes the empty values and draws the path to the next available point. For example, if 2003 is missing, a connection is drawn between 2002 and 2004. The benefit of this approach is that the trend line is much easier to follow; the limitation is that it could lead to a false interpretation of the trend.
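
The approach of skipping missing years can be sketched as follows. The data shape, the helper names and the pixel mappings are assumptions made for the example.

```javascript
// Keep only the years that have a value, in chronological order.
function trendPoints(series) {
  return series
    .filter((p) => p.value !== null && p.value !== undefined)
    .sort((a, b) => a.year - b.year);
}

// Build an SVG path description that joins consecutive available points.
// x and y convert a year and a value into pixel positions.
function toPath(points, x, y) {
  return points
    .map((p, i) => `${i === 0 ? "M" : "L"}${x(p.year)},${y(p.value)}`)
    .join("");
}

const series = [
  { year: 2002, value: 5 },
  { year: 2003, value: null }, // missing year
  { year: 2004, value: 7 },
];

const available = trendPoints(series);
const pathData = toPath(
  available,
  (year) => (year - 2000) * 10,
  (value) => 100 - value * 10
);
// 2002 is joined directly to 2004: "M20,50L40,30"
```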

Dynamic Bubble Sizing

To enable users to compare and identify profitable markets easily, population size was used as the variable determining bubble size. A good visualization practice when encoding data as size is to keep the difference between sizes visible so that comparisons can be made easily. Instead of assigning fixed values to the bubble sizes, the domain of values is changed dynamically when the variables are updated. When there are fewer countries, the sizes of the bubbles change accordingly, which prevents the bubbles from appearing too similar in size to one another.
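
A possible form of dynamic bubble sizing is sketched below: the radius scale is rebuilt from the populations of the countries currently shown. The square-root mapping (so that bubble area, rather than radius, tracks population) and the pixel range are assumptions of this sketch and are not stated in the text.

```javascript
// Build a radius function from the populations currently on display.
// Using the square root makes bubble area grow with population.
function radiusScale(populations, [rMin, rMax] = [4, 40]) {
  const lo = Math.sqrt(Math.min(...populations));
  const hi = Math.sqrt(Math.max(...populations));
  return (population) =>
    hi === lo
      ? (rMin + rMax) / 2 // a single size when all populations are equal
      : rMin + ((Math.sqrt(population) - lo) / (hi - lo)) * (rMax - rMin);
}

// All countries shown: the domain spans 1 to 100 (units are arbitrary).
const allCountries = radiusScale([1, 25, 100]);

// After filtering, only smaller countries remain. Rebuilding the scale
// lets the largest remaining bubble use the full radius range again.
const filtered = radiusScale([1, 25]);
```

With the full set, a population of 25 gets a mid-sized bubble; after filtering, the same country becomes the largest bubble, which keeps size differences visible.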

Implementation of Parallel Coordinates
Customizable Axes

To allow users to select the axes for the parallel coordinates, an array was created to store the selected axes. The data was then filtered according to the selected axes. The array of selected axes is used for drawing the parallel coordinates, the box plot, as well as the histogram. Based on the array of selected axes, the corresponding label texts are also retrieved.

Box Plot/Histogram Overlay

To enhance the usefulness of parallel coordinates, box plot and histogram options were provided for the users. Existing D3 examples of parallel coordinates do not show implementations of box plot and histograms on parallel coordinates.

The challenge of implementing the box plot and histogram lies in obtaining the data for each axis in the parallel coordinates and creating a box plot or histogram for each set of data. For the box plot, the 25th percentile, median and 75th percentile were obtained and plotted on the axis, and SVG lines were drawn to create the box plot. For the histogram, the range of values of the axis was obtained and separated into bins, and the histogram was then scaled to the axis. The width of the histogram must not exceed the distance between adjacent axes. The difficulty is that users can select between 2 and 8 axes to display, so there is no fixed value for the width of the histogram. We therefore created a formula to dynamically determine the maximum width of the histogram bins according to the user’s selection of axes. The flexibility of the axes also meant that there was no fixed position at which to plot the box plot and histogram. To overcome this issue, we also created a formula to determine the centre point of each axis, which indicates where the box plot and histogram are to be plotted.

Arisaig Image29.png
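
The calculations behind the overlays might look like the sketch below. The actual formulas used in the project are not given in the text, so the even axis spacing, the quantile interpolation and all names here are assumptions.

```javascript
// Quantile of an ascending numeric array with linear interpolation
// (p is between 0 and 1).
function quantile(sorted, p) {
  const pos = (sorted.length - 1) * p;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Statistics needed to draw the box on one axis.
function boxStats(values) {
  const s = [...values].sort((a, b) => a - b);
  return { q1: quantile(s, 0.25), median: quantile(s, 0.5), q3: quantile(s, 0.75) };
}

// Count how many values fall into each of binCount equal-width bins.
function histogramCounts(values, binCount) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const width = (max - min) / binCount || 1; // avoid a zero-width bin
  const counts = new Array(binCount).fill(0);
  for (const v of values) {
    const i = Math.min(Math.floor((v - min) / width), binCount - 1);
    counts[i] += 1;
  }
  return counts;
}

// Assumed layout rule: axes are evenly spaced across the plot, and a
// histogram bar may be at most as wide as the gap between two axes.
function axisLayout(plotWidth, axisCount) {
  const gap = plotWidth / (axisCount - 1); // axisCount is between 2 and 8
  const centres = Array.from({ length: axisCount }, (_, i) => i * gap);
  return { centres, maxBarWidth: gap };
}

// Scale the counts so that the tallest bin uses the full allowed width.
function barWidths(counts, maxBarWidth) {
  const peak = Math.max(...counts);
  return counts.map((c) => (peak === 0 ? 0 : (c / peak) * maxBarWidth));
}

const stats = boxStats([5, 1, 4, 2, 3]);
const counts = histogramCounts([1, 2, 3, 4, 5], 2);
const layout = axisLayout(700, 8);
const widths = barWidths(counts, layout.maxBarWidth);
```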