Dangy Proposal
Project Objective |
Data preparation |
Data extracted directly from the various sources is mostly in CSV and GeoJSON format. A key challenge in manipulating the data was translating the Chinese characters and ensuring that the translations were accurate.
Translation of Chinese Characters
Our first step was to translate the JSON files directly with Google Translate. However, we found that this altered the original GeoJSON structure, leaving some parentheses missing. We therefore tried an existing Python library, googletrans, instead. Unfortunately, we ran into limitations such as its 15,000-character limit.
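To illustrate the character limit, the sketch below groups strings into batches that each stay under a limit. It is not what our script does; it only shows one common way such a limit can be worked around. The function name `batch_by_chars` is hypothetical, and separators between strings are not counted towards the limit.

```python
from typing import Iterable, Iterator, List


def batch_by_chars(texts: Iterable[str], limit: int = 15000) -> Iterator[List[str]]:
    """Yield lists of strings whose combined length stays within `limit`."""
    batch: List[str] = []
    size = 0
    for text in texts:
        if len(text) > limit:
            # A single string over the limit cannot be batched; fail loudly.
            raise ValueError("a single string exceeds the character limit")
        if batch and size + len(text) > limit:
            # Adding this string would overflow, so close the current batch.
            yield batch
            batch, size = [], 0
        batch.append(text)
        size += len(text)
    if batch:
        yield batch


# Small demonstration with an artificially low limit of 10 characters.
demo_batches = list(batch_by_chars(["a" * 6, "b" * 6, "c" * 3], limit=10))
```

Each batch could then be sent to a translation service as one request, at the cost of having to split the response back into individual strings.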
We settled on a safer approach: writing our own Python script. We used the Selenium module to automate feeding the raw content directly into the Google translation engine and writing the output into proper JSON or CSV data structures.
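The structure-preserving part of such a script can be sketched as follows. This is a minimal illustration, not our actual Selenium script: the browser automation is replaced by a stand-in `stub_translate` function, and the property name `COUNTYNAME`, the sample feature, and the function name `translate_properties` are all assumptions made for the example. The idea is to translate only the string values under each feature's `properties`, so that keys, geometry, and brackets are never touched.

```python
import copy
import json
from typing import Callable


def translate_properties(collection: dict, translate: Callable[[str], str]) -> dict:
    """Return a copy of a GeoJSON FeatureCollection with property strings translated."""
    result = copy.deepcopy(collection)  # leave the input untouched
    for feature in result.get("features", []):
        props = feature.get("properties") or {}
        for key, value in props.items():
            if isinstance(value, str):
                props[key] = translate(value)
    return result


# Hypothetical sample data; real files contain many features with polygons.
SAMPLE = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"COUNTYNAME": "臺北市", "COUNTYID": "A"},
            "geometry": {"type": "Point", "coordinates": [121.56, 25.03]},
        }
    ],
}

# Stand-in for the real translation step (assumption for this sketch).
STUB_TABLE = {"臺北市": "Taipei City"}


def stub_translate(text: str) -> str:
    return STUB_TABLE.get(text, text)


translated = translate_properties(SAMPLE, stub_translate)
print(json.dumps(translated, ensure_ascii=False, indent=2))
```

Because the file is parsed and re-serialised with the `json` module rather than translated as raw text, the output is valid JSON by construction.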
Accuracy of Translation
The Google translation engine does not offer a translation for every word in our JSON data files. Our team encountered a few words that remained untranslated after running the script, so manual translation was necessary.
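One way to find the words that still need manual translation is to scan the translated values for remaining Chinese characters. The sketch below is illustrative only: the function name `find_untranslated` and the sample values are hypothetical, and the check assumes that the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) covers the characters in our data.

```python
import re
from typing import Iterable, List

# Matches any character in the basic CJK Unified Ideographs block (assumption:
# this range is sufficient for the place names in the data).
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def find_untranslated(values: Iterable[str]) -> List[str]:
    """Return the distinct values that still contain Chinese characters."""
    return sorted({value for value in values if CJK_PATTERN.search(value)})


# Hypothetical output of the translation script.
sample_values = ["Taipei City", "嘉義 County", "Hualien County", "連江縣"]
needs_manual = find_untranslated(sample_values)
```

The resulting list can be handed to a team member for manual translation.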
The Taiwan geographical data we sourced uses slightly different county names from the Google translations we received. For example, Google Translate returns “Taipei City” while the Taiwan geographical data contains only “Taipei”. Further data transformation was therefore required to standardise the county names. Our team created a dictionary of the words with translation discrepancies and replaced them using VLOOKUP in Excel.
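The same dictionary lookup can also be expressed in Python instead of Excel. The sketch below is an alternative illustration, not the method we used: the names `NAME_FIXES`, `standardise`, and `GEO_NAMES` are hypothetical, and only the “Taipei City” entry comes from our data.

```python
from typing import Iterable, List

# Maps a Google translation to the name used in the geographical data.
# Only the first entry is taken from the text; further entries would be added
# as discrepancies are found.
NAME_FIXES = {
    "Taipei City": "Taipei",
}


def standardise(name: str) -> str:
    """Return the geographical-data name, or the cleaned input if no fix exists."""
    cleaned = name.strip()
    return NAME_FIXES.get(cleaned, cleaned)


def unmatched_names(names: Iterable[str], geo_names: Iterable[str]) -> List[str]:
    """List names that still do not appear in the geographical data after fixing."""
    known = set(geo_names)
    return [name for name in names if standardise(name) not in known]


# Hypothetical example values.
GEO_NAMES = ["Taipei", "Hualien County"]
translated_names = ["Taipei City ", "Hualien County", "Kinmen"]
fixed = [standardise(name) for name in translated_names]
still_unmatched = unmatched_names(translated_names, GEO_NAMES)
```

Listing the names that remain unmatched makes it easier to spot discrepancies that still need an entry in the dictionary.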
Project Prototype |