IS428 2016-17 Term1 Assign2 Heng Yi Teng Mabel
Contents
Abstract
Question
Data Exploration
I used JMP for all data cleaning.
Initial columns removed
The following columns were removed as they were irrelevant to the question I wanted to answer:
- quantity
- packaging
- url
- creator
- created_t
- created_datetime
- last_modified_t
- last_modified_datetime
- image_url
- image_small_url
- product_name
The following columns were removed as similar information was already captured in other columns:
- countries
- countries_tags
Columns removed due to missing data
The following columns were removed because more than 50% of their rows were missing data and fewer than 1,000 values were available: origins, origins_tags, manufacturing_places, manufacturing_places_tags, labels, labels_tags, labels_en, emb_codes, emb_codes_tags, first_packaging_geo, cities, cities_tags, stores, allergens, allergens_en, traces, traces_tags, traces_en, serving_size, no_nutriments, additives_tags, additives_en, ingredients_from_palm_oil, ingredients_from_palm_oil_tags, ingredients_that_may_be_from_palm_oil, ingredients_that_may_be_from_palm_oil_tags, nutrition_grade_uk, nutrition_grade_fr, energy_from_fat_100g, omega_3_fat_100g, sucrose, glucose, fructose, lactose, maltose, starch, caffeine, carbon_footprint_100g
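The cleaning itself was done in JMP, but the removal rule can be sketched in Python for anyone who wants to reproduce it. This is only an illustration: the function name `sparse_columns`, its parameters, the toy rows, and the `sugars_100g` column are all assumptions made for the example, and empty strings or `None` are assumed to mark missing values.

```python
def sparse_columns(rows, columns, max_missing_share=0.5, min_count=1000):
    """Return the columns that meet both removal conditions:
    more than `max_missing_share` of rows are missing a value,
    and fewer than `min_count` values are present."""
    total = len(rows)
    flagged = []
    for col in columns:
        # Assumption: a missing value is stored as None or an empty string.
        present = sum(1 for row in rows if row.get(col) not in (None, ""))
        missing_share = (total - present) / total if total else 0.0
        if missing_share > max_missing_share and present < min_count:
            flagged.append(col)
    return flagged


# Toy rows with made-up values (not taken from the Open Food Facts export).
rows = [
    {"countries_en": "France", "origins": "", "sugars_100g": "12.0"},
    {"countries_en": "Germany", "origins": "", "sugars_100g": "3.5"},
    {"countries_en": "Spain", "origins": "Spain", "sugars_100g": ""},
]

to_drop = sparse_columns(rows, ["countries_en", "origins", "sugars_100g"])
print(to_drop)
```

In this toy table only `origins` is flagged, since two of its three rows are empty. Both conditions must hold, so a column that is mostly empty but still has 1,000 or more values would be kept.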
Reformatting countries_en
I wanted to clean up the variable, countries_en.
I noticed that some products were linked to more than one country under countries_en. This is likely because the Open Food Facts database is a French initiative, and France is experimenting with mandatory country-of-origin labelling for milk and meat in processed foods, so a product may be labelled with more than one country. For simplicity, I removed products labelled with more than one country. Simply keeping the first country in each list would have caused some countries to be over-represented, since the countries were listed in alphabetical order.
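As a rough sketch of this filter outside JMP, the snippet below keeps only products with exactly one country in countries_en. It assumes the countries are stored as a comma-separated string; the helper name `has_single_country` and the sample products are hypothetical.

```python
def has_single_country(value, sep=","):
    """True if the countries_en string names exactly one country.
    Assumption: multiple countries are separated by `sep`."""
    names = [name.strip() for name in value.split(sep) if name.strip()]
    return len(names) == 1


# Made-up sample rows for illustration only.
products = [
    {"countries_en": "France"},
    {"countries_en": "Belgium,France"},
    {"countries_en": "Germany"},
    {"countries_en": "France,Switzerland"},
]

# Keep only products tied to a single country.
single = [p for p in products if has_single_country(p["countries_en"])]

# The alternative that was rejected: keep the first listed country.
first_listed = [p["countries_en"].split(",")[0].strip() for p in products]

print([p["countries_en"] for p in single])
print(first_listed)
```

The `first_listed` result shows the bias mentioned above: "Belgium,France" would be counted under Belgium purely because of alphabetical ordering, so countries early in the alphabet would gain products at the expense of the others.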