ISSS608 2016-17 T1 Assign2 Nguyen Tien Duong Conclusion
Take away notes
Averaging Survey Result
Because the survey asks respondents to rate each statement on a scale from 1 to 5, from strongly disagreed to strongly agreed, it is tempting for a researcher to reach for an "averaging method". However, that way of "preparing" the data not only produces a meaningless number but also creates confusion.
Firstly, the 1-5 scale is only an index (ID) for each response category, as follows:
1: Strongly Disagreed
2: Disagreed
3: Neutral
4: Agreed
5: Strongly Agreed
If any "averaging" operation is considered, it must be applied to the values the indices stand for, such as "(2*agreed + 3*strongly_agreed)/(2+3)", which unfortunately is undefined.
Worse, some attempt to calculate "(2 * [4] + 3 * [5]) / (2+3) = 4.6". This result is neither "agreed" nor "strongly agreed". It is equally undefined.
Secondly, averaging leads to confusion. Some may argue that "4.6 means 'more agreed' than just agreed". That is wrong, because no continuous range is defined here and there is no "more agreed" category on the scale. Furthermore, consider an extreme case with only:
1: Strongly Disagreed (50%)
2: Disagreed (0%)
3: Neutral (0%)
4: Agreed (0%)
5: Strongly Agreed (50%)
If 50% of the responses are [1], 50% are [5] and none are [2], [3] or [4], the same method gives an average of 3, and we could conclude that the entire population homogeneously feels "Neutral" toward the question. That is absolutely wrong: the population is extreme and polarized!
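The extreme case can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the names LABELS, summarize and polarized are made up for illustration, and the 100 responses are constructed to match the 50%/50% split above, not taken from the survey.

```python
from collections import Counter
from statistics import mean

# Response codes and their categories, as listed in the scale above.
LABELS = {
    1: "Strongly Disagreed",
    2: "Disagreed",
    3: "Neutral",
    4: "Agreed",
    5: "Strongly Agreed",
}

def summarize(responses):
    """Return the share of responses in each category."""
    counts = Counter(responses)
    total = len(responses)
    return {LABELS[code]: counts.get(code, 0) / total for code in sorted(LABELS)}

# Constructed example: half the answers are 1, half are 5.
polarized = [1] * 50 + [5] * 50

average = mean(polarized)            # the misleading single number
distribution = summarize(polarized)  # the shares per category

print("average code:", average, "->", LABELS[average])
for label, share in distribution.items():
    print(f"{label}: {share:.0%}")
```

The average lands on the code for "Neutral", although the share of "Neutral" responses is zero. The per-category shares keep the polarization visible.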
In short: NO AVERAGING for this type of survey result.
Relationship among dimensions
Most of the time, researchers wish to find the correlation between any two dimensions. That can be done using statistical analysis. However, visual analytics can perform even better when it provides an overview and a drill-down process.
First of all, even statisticians need to visualize their data before applying any analytical method. They need a sense of the data. Therefore, visualizing an overview of the data is critical.
Secondly, statistical methods are great at providing an indicative number, but weak at exploration. For example, it is truly hard to know how responses flow from one dimension to the other dimensions, and how they are distributed further down. That job can be done visually by looking at a parallel sets plot.
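The bands of a parallel sets plot are, in effect, counts of each combination of categories. The sketch below shows one way those counts might be computed for two dimensions. The records are toy values invented for illustration (not Wiki4HE data), and the names records and flow_counts are hypothetical.

```python
from collections import Counter

# Toy records (hypothetical): (DOMAIN label, answer code on the 1-5 scale).
records = [
    ("Sciences", 4),
    ("Sciences", 5),
    ("Sciences", 4),
    ("Law & Politics", 2),
    ("Law & Politics", 4),
]

# Each (domain, answer) pair is one band between the two axes;
# its count is the width of that band.
flow_counts = Counter(records)

for (domain, answer), n in sorted(flow_counts.items()):
    print(f"{domain} -> answer {answer}: {n}")
```

Adding a third dimension to each tuple extends the same counting one level further down, which is the drill-in that the plot makes visible at a glance.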
Data exploration and Preparation
Again, this is fundamental: it cannot be postponed, and must be done at the very beginning and throughout the entire project.
This process may incur plenty of work: internal and external research, cleaning up the data and making rational assumptions. Without it, there is little hope for the project to succeed.
An obvious example comes from the Wiki4HE dataset. The definition provided by the data source is as below:
Attribute Information:
AGE: numeric
GENDER: 0=Male; 1=Female
DOMAIN: 1=Arts & Humanities; 2=Sciences; 3=Health Sciences; 4=Engineering & Architecture; 5=Law & Politics
PhD: 0=No; 1=Yes
YEARSEXP (years of university teaching experience): numeric
UNIVERSITY: 1=UOC; 2=UPF
UOC_POSITION (academic position of UOC members): 1=Professor; 2=Associate; 3=Assistant; 4=Lecturer; 5=Instructor; 6=Adjunct
OTHER (main job in another university for part-time members): 1=Yes; 2=No
OTHER_POSITION (work as part-time in another university and UPF members): 1=Professor; 2=Associate; 3=Assistant; 4=Lecturer; 5=Instructor; 6=Adjunct
There is no definition for DOMAIN = 6, which nevertheless has records in the dataset. Shall we mark it as "missing" or "invalid" data, ignore the entire row, or find out what it actually is?
The first instinct inclines to the easy option: treat it as missing or remove the data. But the result shows that a dominant share of the records have "DOMAIN = 6", which calls for finding the actual value behind this missing definition.
External data or extra background information may be useful in such cases. After reading the research papers published by the project and doing a little reverse engineering to map the results back to the raw data, it turned out that "DOMAIN = 6" is equivalent to "Social Science" (40% of the dataset).
A quick learning point here: do plan time to grind the data to the finest extent. Revisit if needed.
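A check like the following could surface such undefined codes early. It is a minimal sketch: the helper undefined_codes and the sample list are hypothetical, and the sample values are invented so that code 6 makes up 40% of them, mirroring the share mentioned above.

```python
from collections import Counter

# DOMAIN codes as documented by the data source.
DOMAIN_LABELS = {
    1: "Arts & Humanities",
    2: "Sciences",
    3: "Health Sciences",
    4: "Engineering & Architecture",
    5: "Law & Politics",
}

def undefined_codes(values, labels):
    """Return the share of each code that has no documented label."""
    counts = Counter(values)
    total = len(values)
    return {code: n / total for code, n in counts.items() if code not in labels}

# Invented sample of DOMAIN values (not the real dataset).
sample = [6, 6, 1, 2, 6, 4, 6, 3, 5, 2]

unknown = undefined_codes(sample, DOMAIN_LABELS)
print(unknown)  # codes missing from the documentation, with their share

# Once background research identifies the code, extend the mapping
# instead of dropping the rows.
resolved_labels = {**DOMAIN_LABELS, 6: "Social Science"}
print(undefined_codes(sample, resolved_labels))
```

Reporting the share alongside the code helps decide whether removing the rows is acceptable or whether the definition is worth tracking down.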