IS428 2016-17 Term1 Assign2 Heng Yi Teng Mabel

From Visual Analytics for Business Intelligence
Revision as of 16:54, 25 September 2016 by Mabel.heng.2012

Abstract

Question

Data Exploration

I used JMP for all data cleaning.

Initial columns removed

The following columns were removed as they were irrelevant to the question I wanted to answer:

  • quantity
  • packaging
  • url
  • creator
  • created_t
  • created_datetime
  • last_modified_t
  • last_modified_datetime
  • image_url
  • image_small_url
  • product_name

The following columns were removed as similar information was already captured in other columns:

  • countries
  • countries_tags
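The cleaning itself was done in JMP. Purely as an illustration, the same column removal could be sketched in Python with the standard library. The helper name drop_columns and the sample row are hypothetical and not part of the original workflow; only the column names come from the lists above.

```python
# Columns named in the two lists above.
IRRELEVANT = [
    "quantity", "packaging", "url", "creator", "created_t",
    "created_datetime", "last_modified_t", "last_modified_datetime",
    "image_url", "image_small_url", "product_name",
]
REDUNDANT = ["countries", "countries_tags"]


def drop_columns(rows, columns):
    """Return copies of the row dicts without the named columns."""
    unwanted = set(columns)
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]


# Hypothetical sample row, made up for illustration only.
sample = [{
    "code": "001",
    "url": "http://example.org",
    "countries": "France",
    "countries_en": "France",
}]

cleaned = drop_columns(sample, IRRELEVANT + REDUNDANT)
print(cleaned)
```

Columns that are listed but absent from a row are simply ignored, so the same call works even if the export has already lost some of them.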

Columns removed due to missing data

The following columns were removed because more than 50% of the rows were missing data and fewer than 1,000 rows had a value: origins, origins_tags, manufacturing_places, manufacturing_places_tags, labels, labels_tags, labels_en, emb_codes, emb_codes_tags, first_packaging_geo, cities, cities_tags, stores, allergens, allergens_en, traces, traces_tags, traces_en, serving_size, no_nutriments, additives_tags, additives_en, ingredients_from_palm_oil, ingredients_from_palm_oil_tags, ingredients_that_may_be_from_palm_oil, ingredients_that_may_be_from_palm_oil_tags, nutrition_grade_uk, nutrition_grade_fr, energy_from_fat_100g, omega_3_fat_100g, sucrose, glucose, fructose, lactose, maltose, starch, caffeine, carbon_footprint_100g
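The missing-data rule can be sketched as a small check, again in Python rather than JMP. This assumes the rule means "more than half of the rows are empty and fewer than 1,000 rows hold a value", and that a missing value appears as an empty string or None. The function name sparse_columns, its parameters, and the sample rows are hypothetical.

```python
def sparse_columns(rows, min_count=1000, max_missing_share=0.5):
    """Return names of columns that are mostly empty and have few usable values."""
    total = len(rows)
    names = {name for row in rows for name in row}
    flagged = []
    for name in sorted(names):
        # Count rows that actually hold a value in this column.
        present = sum(1 for row in rows if row.get(name) not in (None, ""))
        missing_share = (total - present) / total if total else 0.0
        if missing_share > max_missing_share and present < min_count:
            flagged.append(name)
    return flagged


# Hypothetical rows, made up for illustration only.
rows = [
    {"origins": "", "countries_en": "France"},
    {"origins": "", "countries_en": "Germany"},
    {"origins": "Spain", "countries_en": "Spain"},
]

flagged = sparse_columns(rows)
print(flagged)
```

Both conditions must hold for a column to be flagged, so a sparse column with 1,000 or more usable values would be kept under this reading of the rule.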

Reformatting countries_en

Next, I cleaned up the countries_en variable.

I noticed that some products were linked to more than one country under countries_en. This is likely because the Open Food Facts database is a French initiative and France is experimenting with mandatory country-of-origin labelling for milk and meat in processed foods, so a food might be labelled with more than one country. For simplicity's sake, I removed products labelled with more than one country. Simply using the first country in each product's list would have caused some countries to be over-represented, since the countries were listed in alphabetical order.
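To illustrate the filtering and the alphabetical-order concern, here is a minimal Python sketch (the actual step was done in JMP). It assumes that multiple countries in countries_en are separated by commas. The helper names and the four sample rows are hypothetical.

```python
from collections import Counter


def split_countries(value, sep=","):
    """Split a countries_en value into a list of country names."""
    return [part.strip() for part in value.split(sep) if part.strip()]


def keep_single_country(rows, column="countries_en"):
    """Keep only rows that are labelled with exactly one country."""
    return [row for row in rows
            if len(split_countries(row.get(column, ""))) == 1]


# Hypothetical rows, made up for illustration only.
rows = [
    {"countries_en": "France"},
    {"countries_en": "Belgium,France"},
    {"countries_en": "Belgium,Germany"},
    {"countries_en": "Germany"},
]

single = keep_single_country(rows)

# Alternative that was rejected: take the first listed country for every row.
first_listed = Counter(split_countries(r["countries_en"])[0] for r in rows)

print(single)
print(first_listed)
```

In this made-up sample, taking the first listed country credits Belgium with two products even though no product is labelled with Belgium alone, which is the kind of skew that alphabetical ordering introduces.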