Arisaig Progress

From Analytics Practicum
Revision as of 11:08, 26 February 2015 by Sean.chua.2011 (talk | contribs) (added data cleaning)


PROJECT PROGRESS
Review of Previous Work


Research has been undertaken to:
1) Understand what factors are of interest in understanding consumption patterns
2) Determine which kinds of visualization are best suited to displaying changes in trends over time

(1) Factors of Interest in Understanding Consumption Patterns

An article by Minakshi Trivedi [1] discussed an analysis of changes in consumer behavior patterns for healthy food. The article argued that both aggregated and disaggregated data should be taken into account when looking into trends. As an example, it noted that the baby food market as a whole (aggregated data) saw fairly flat sales growth from 2005 to 2006 and only a 3.1% increase in 2007. However, when the data is disaggregated, it can be seen that over the same period organic baby food sales saw a 16.4% jump followed by a 21.6% jump in 2007. The increase in popularity did not spill over to potato chips or other snack foods, and was thus unobservable at the aggregate level. Therefore, for our project, we should look not only at overall consumption patterns, but also at consumption patterns in smaller product categories.
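
To illustrate why a fast-growing sub-category can be invisible at the aggregate level, here is a minimal sketch in Python. The category names and sales figures are hypothetical placeholders chosen only to show the arithmetic; they are not taken from the article or from our datasets.

```python
def growth_pct(previous, current):
    """Year-on-year growth as a percentage."""
    return (current - previous) / previous * 100


# Hypothetical sales by category and year (arbitrary units).
sales = {
    "organic": {2005: 100.0, 2006: 120.0},
    "conventional": {2005: 900.0, 2006: 890.0},
}

# Aggregate view: sum all categories for each year.
total = {year: sum(cat[year] for cat in sales.values()) for year in (2005, 2006)}

aggregate_growth = growth_pct(total[2005], total[2006])
category_growth = {
    name: growth_pct(cat[2005], cat[2006]) for name, cat in sales.items()
}

print(f"aggregate: {aggregate_growth:.1f}%")
for name, pct in category_growth.items():
    print(f"{name}: {pct:.1f}%")
```

In this made-up case the small category grows quickly while the total barely moves, which is the effect the article describes.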

The same article also found that spatial relationships (spatial correlation, spatial causality or spatial interaction) are significantly important to consumption patterns. The authors observed that areas in the same cluster produce similar test results. What this means for us is that countries that are geographically close to each other may display similar trends in consumption patterns because “near things are more related than distant things”. There is likely to be more interaction between countries that are geographically close to each other, and hence they are more likely to influence one another. This is an interesting point, and we will observe whether such a pattern does appear in our visualizations.

The article also discussed some demographic factors that were significant to consumption behavior. Interestingly, marriage rate was a significant factor in the consumption pattern of beer. Income and population density were also significant factors: locations with higher income are associated with higher consumption, and as population density increases, healthy consumption decreases. These findings might aid us in interpreting the results of our project.

Worldwide, attitudes towards dietary habits have been changing, with a shift from basic staples to packaged and fast food [2]. Changes to agricultural practices have increased food capacity and reduced seasonal dependence, resulting in considerable changes to food consumption patterns in developing countries.

A rise of almost 400 kcal a day was observed globally (with some exceptions of negative growth in developing nations), along with a shift in available calories towards meat, sugar and vegetable oils. Urbanization has also affected food consumption by shifting dietary behavior in favor of the fast-food industry, which provides quick access to cheap take-away meals. From a nutrition perspective, the major consequences of urbanization are a profound shift towards higher food energy, more fats and oils, and more animal protein from meat and dairy foods.

(2) Visualizations Displaying Change in Trends Over Time

In this research area, we have two goals: (1) visualization for changes over time and (2) previous visualization work on consumption patterns over time.

In the research on visualization over time, one of the most notable visualizations is by Hans Rosling in “The best stats you’ve ever seen”. Rosling's presentation combined visualization with animation to bring forth the impact. Some of the graphs used are:
1. Stacked graph
2. Bubble graph

Through the use of animation, the presentation is able to simplify the visualization and so avoid clutter and viewer overload. Additionally, through differentiation methods such as color coding and size differentiation, it allows users to focus easily on what matters to them. The tools used in Hans Rosling's presentation show that it is not enough to showcase performance on a yearly basis; it is also necessary to show the path of change, i.e. from one point to another. This is because it allows simpler techniques, such as line graphs, to be used to derive information.

As for previous visualization work, one example is CME Group's research on “Global Consumption, Production and Trade Patterns”. In this research, CME Group made its case by overlaying graphs on one another. Specifically, it overlaid bar charts on a world map visualization.

Arisaig LitReview1.png



There could be many criticisms of the visualization in this article, e.g. poor use of the world map. However, it gives us the insight that such a visualization could be used to display additional information about the major players around the world.



Data Cleaning & Exploration


One challenge we faced while collecting the data was that it came from many different datasets. Although the datasets came from two main sources, they had to be downloaded individually. We also had to toggle the settings to ensure that we obtained the right data for our project. The datasets were also available in many different currencies. We obtained data in constant prices so that we could make accurate comparisons of the changes over the years. For future implementation, we can allow users to toggle between fixed and floating currency for different kinds of analysis.



(1) Euromonitor Data

Arisaig Clean1.png



Arisaig Clean2.png




First, we have to find and select the specific dataset we are interested in, and check for all country data.

Arisaig Clean3.png



Next, we modify the data to give us current prices in USD, and copy it into an Excel document.

Arisaig Clean4.png




(2) World Bank Data

Arisaig Clean21.png



Arisaig Clean22.png




Copy the data from the World Bank site into an Excel sheet, manually clicking through each tab.


Arisaig Clean23.png



(3) Consolidation


The first step we took in preparing the data was to understand our dataset. We looked at each category of data we have and decided which ones are important to us and which ones we do not need. A large amount of data is missing for the years before 2000, and using a dataset with large amounts of missing data would render our analysis inaccurate. Hence, we decided to focus on data from 2000 onwards. This still leaves a large amount of data for us to use (30 years of data), which is sufficient for us to perform analysis and visualization.
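
As a minimal sketch of the check described above, the snippet below computes the share of missing values per year and keeps only rows from 2000 onwards. The record layout, country names and values are hypothetical placeholders rather than our actual data, and the function names are ours for illustration only.

```python
# Hypothetical long-format records: (country, year, value).
# None marks a missing value.
records = [
    ("Singapore", 1998, None),
    ("Singapore", 1999, None),
    ("Singapore", 2000, 41.2),
    ("Singapore", 2001, 43.0),
    ("Malaysia", 1998, None),
    ("Malaysia", 1999, 12.5),
    ("Malaysia", 2000, 13.1),
    ("Malaysia", 2001, 13.4),
]


def missing_share_by_year(rows):
    """Return {year: fraction of rows in that year whose value is missing}."""
    totals, missing = {}, {}
    for _country, year, value in rows:
        totals[year] = totals.get(year, 0) + 1
        if value is None:
            missing[year] = missing.get(year, 0) + 1
    return {year: missing.get(year, 0) / totals[year] for year in sorted(totals)}


def keep_from_year(rows, start_year=2000):
    """Keep only rows whose year is start_year or later."""
    return [row for row in rows if row[1] >= start_year]


shares = missing_share_by_year(records)
recent = keep_from_year(records)

for year, share in shares.items():
    print(year, f"{share:.0%} missing")
```

Looking at the per-year share first makes the choice of cut-off year something that can be checked rather than assumed.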

Obtaining the data as separate datasets meant that a lot of work had to be done to combine the data into one file to work with. The different datasets also had different formatting of the data, so we first decided on a standardized format for our data file.

Next, as we are interested in patterns across regions, we classify the countries according to region. We use the regions defined by the World Bank to ensure consistency. The GDP of each country is also given in its local currency, so we have to convert all of them into a common currency (USD) so that we can perform comparisons.
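
The two steps above (tagging each country with a region and converting GDP to USD) can be sketched as follows. This is only an illustration: the region names follow World Bank groupings, but the exchange rates and GDP figures are made-up placeholders, and `tag_row` and `to_usd` are hypothetical helper names.

```python
# Country -> region lookup (region names follow World Bank groupings).
REGION_BY_COUNTRY = {
    "Singapore": "East Asia & Pacific",
    "India": "South Asia",
    "Brazil": "Latin America & Caribbean",
}

# Placeholder exchange rates (local currency units per 1 USD), NOT real rates.
LOCAL_UNITS_PER_USD = {
    "Singapore": 1.25,
    "India": 60.0,
    "Brazil": 2.5,
}


def to_usd(country, amount_local):
    """Convert an amount in the country's local currency to USD."""
    return amount_local / LOCAL_UNITS_PER_USD[country]


def tag_row(country, gdp_local):
    """Build one standardized row with region and GDP in USD."""
    return {
        "country": country,
        "region": REGION_BY_COUNTRY[country],
        "gdp_usd": to_usd(country, gdp_local),
    }


row = tag_row("India", 120.0)
print(row)
```

A `KeyError` here signals a country missing from one of the lookup tables, which is a useful early warning that the country list is incomplete.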

First, we have to prepare a comprehensive list matching each country to a region.

Arisaig Clean31.png



We then replicate the list 31 times, once for each year from 2000 to 2030.
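
Replicating the country list once per year is equivalent to taking the Cartesian product of the country list and the year range. A minimal sketch, assuming a two-country list purely for illustration:

```python
from itertools import product

# Hypothetical excerpt of the country-to-region list.
country_regions = [
    ("Singapore", "East Asia & Pacific"),
    ("India", "South Asia"),
]

# 2000 to 2030 inclusive gives 31 years.
years = range(2000, 2031)

# One row per (year, country) pair; years vary slowest, so the full
# country list appears once for 2000, then again for 2001, and so on.
skeleton = [
    {"country": country, "region": region, "year": year}
    for year, (country, region) in product(years, country_regions)
]

print(len(skeleton), "rows")
print(skeleton[0])
print(skeleton[-1])
```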

Arisaig Clean32.png



With this, a VLOOKUP is used in each column to retrieve, from its respective sheet, the value corresponding to the supplied year and country. Null values and dash placeholders are replaced with an empty string.
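
For readers more familiar with code than spreadsheets, the lookup step can be sketched as a dictionary lookup keyed on (country, year). This is an analogue of the VLOOKUP approach, not the workbook itself; the sheet names, values and the exact set of placeholder strings are assumptions.

```python
# Values treated as "no data" and replaced with an empty string (assumed set).
PLACEHOLDERS = {None, "", "-", "–", "null", "NULL"}

# Hypothetical source sheets: sheet name -> {(country, year): raw value}.
sheets = {
    "gdp_usd": {("Singapore", 2000): 95.8, ("India", 2000): "-"},
    "population": {("Singapore", 2000): 4.0},
}


def clean(value):
    """Replace null and dash placeholders with an empty string."""
    return "" if value in PLACEHOLDERS else value


def lookup(sheet_name, country, year):
    """Fetch the value for (country, year) from one sheet, cleaned."""
    return clean(sheets[sheet_name].get((country, year)))


def fill_row(country, year):
    """Build one consolidated row with a column per source sheet."""
    row = {"country": country, "year": year}
    for sheet_name in sheets:
        row[sheet_name] = lookup(sheet_name, country, year)
    return row


print(fill_row("Singapore", 2000))
print(fill_row("India", 2000))
```

Keys that are absent from a sheet and keys whose value is a dash both end up as an empty string, so downstream steps only need to handle one kind of missing value.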


(4) Data dictionary

While consolidating the data, we also created a data dictionary to aid us in keeping track of our variables. The data dictionary will also make it easy for us to educate our client on the variables used and help them in understanding our data. Our data dictionary contains the name of the column (label) in the dataset, what the column represents, and details of the label. The details contain elaboration on the type of data it contains, as well as the units and currency that the data is in.
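
As an illustration of the structure described above, a data dictionary entry can be represented as a small record, with a helper that flags dataset columns that have not yet been documented. The field names and the example entries are hypothetical and do not reproduce our actual dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    label: str        # column name as it appears in the dataset
    description: str  # what the column represents
    data_type: str    # e.g. "text", "integer", "decimal"
    unit: str         # unit or currency, empty if not applicable


# Hypothetical entries for illustration only.
ENTRIES = [
    DictionaryEntry("country", "Name of the country", "text", ""),
    DictionaryEntry("year", "Calendar year of the observation", "integer", ""),
    DictionaryEntry("gdp_usd", "Gross domestic product", "decimal", "USD"),
]

DATA_DICTIONARY = {entry.label: entry for entry in ENTRIES}


def undocumented_columns(columns):
    """Return the columns that have no entry in the data dictionary."""
    return [name for name in columns if name not in DATA_DICTIONARY]


print(undocumented_columns(["country", "year", "gdp_usd", "mystery"]))
```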