Latest revision as of 16:31, 26 September 2016

Abstract

There is a rough relationship between the academic level of university faculty (observed via PhD and position) and how they rate Wikipedia: the deeper their domain expertise, the lower the scores. However, the most important factor affecting the scores is whether the respondent is a registered Wikipedia user. This is largely due to usage factors related to behaviour and habit, such as contribution and Wikipedia usage. Possible reasons are that some functions are available only to registered users (behaviour scores), that registering suggests the respondent already thinks well of Wikipedia and wants to contribute (a subjectively positive view of Wikipedia), and that after contributing, their quality scores should also rise.

Motivation and Problems

Wikipedia, arguably the best free online encyclopedia in the world, has helped a great many people and enjoys a high reputation. From a student's perspective, the quality of its pages is quite good. But because it is an open platform that anyone can edit, academic accuracy and quality may be a concern for readers, since they often lack the specialized domain knowledge to spot mistakes or incorrect descriptions. So let's look at survey data from university faculty to find out what they think about Wikipedia in their own domains, so as to make Wikipedia better.
Problems addressed:

Is the academic quality of Wikipedia pages an existing concern?

What is the pattern of the scores given by university faculty? What factors are involved (e.g. position, experience, PhD or not)?

How can Wikipedia be made better (recommendations)?

Approaches

The data was obtained from https://archive.ics.uci.edu/ml/datasets/wiki4HE

It contained the following information that can be utilized:

Age - The age of the faculty member (numeric).
Gender - Male or Female.
Yearsexp - The length of university teaching experience, in years.
PhD - Whether the faculty member holds a PhD.
Domain - The academic domain of the faculty member.
Position - The academic position of the faculty member (e.g. Professor, Associate Professor).

From the above data, extra fields were derived:

Position - Created by summarizing the position information.
Total average - Derived by averaging all the scores. Since a higher score means better in all categories, we can use it as the overall score for Wikipedia.

Type of chart used: Treemap, Slopegraph, Box Plot.

Interactive Data Visualization

Here is the link to the Tableau Public Display: https://public.tableau.com/shared/KQMSPRQ3H?:display_count=yes. You can view my effort and explore the data yourself!

Tools Utilized

The tools used in this report are:

Tableau 10.0 (for data analysis)
JMP 12.0 (for data preparation)
Excel 2016 (for data preparation)

Data Preparation

Import Data into Excel

After downloading the data, we find that it cannot be viewed correctly in Excel, so we need to change the original file before loading it into Excel and checking its patterns.

Opening the original data in Notepad shows that all the values are separated by “;”, a delimiter that Excel does not seem to recognize automatically. The solution is to replace every “;” with a tab: type a tab and copy it, open the Replace function with “Ctrl” + “H”, enter “;” in “Find what”, paste the copied tab into “Replace with”, and click “Replace all”. Save the document and open it in Excel, and everything is now in good order.
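As an alternative to the find-and-replace workaround, the delimiter can be declared when the file is read. The following is a minimal Python sketch using only the standard library. The sample text and its values are made up for illustration, and the function name `read_semicolon_rows` is hypothetical.

```python
import csv
import io

def read_semicolon_rows(text):
    """Parse semicolon-delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return list(reader)

# Tiny made-up sample in the same shape as the raw file (values are illustrative).
sample = "AGE;GENDER;PhD;PU1;PU2;PU3\n40;0;1;4;4;3\n37;1;0;5;?;4\n"
rows = read_semicolon_rows(sample)
print(rows[0]["AGE"], rows[1]["PU2"])
```

For the real download, the same `csv.DictReader` call can be given a file object opened with `open(path, newline="")`, where `path` is wherever the file was saved.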

Create New Score Factors

In order to compare the section scores, we need to create new columns that calculate the score of each section (e.g. PU=average(Pu1,Pu2,Pu3)) and the Total Average. For each section:

 1.  Create a new column and put the section name in its first row.
 2.  Use the average function to calculate the section score.

Of course we can do this in Tableau or JMP, but it is much more time-consuming because of the formula-writing workload.
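The same section averages can be sketched in Python. This is an illustration rather than the workflow used in this report: the helper names are hypothetical, the sample values are made up, and it assumes that Total Average is the mean over all individual items (not the mean of the section means).

```python
import re
from statistics import mean

def section_of(column):
    """Return the section prefix of an item column, e.g. 'PU1' -> 'PU'; None otherwise."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", column)
    return match.group(1).upper() if match else None

def section_scores(row):
    """Average the items of each section, plus an overall Total Average."""
    groups = {}
    for column, value in row.items():
        section = section_of(column)
        if section is not None:
            groups.setdefault(section, []).append(float(value))
    scores = {section: mean(values) for section, values in groups.items()}
    # Assumption: Total Average is the mean over all individual items.
    scores["Total Average"] = mean(v for values in groups.values() for v in values)
    return scores

# Illustrative values only, not taken from the dataset.
example = {"PU1": "4", "PU2": "5", "PU3": "3", "PEU1": "5", "PEU2": "4", "PEU3": "3"}
scores = section_scores(example)
print(scores)
```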

Correct Wrong Column Names

The OTHERSTATUS and OTHER_POSITION column names are misplaced: each sits above the other's data. We need to fix this by exchanging the two column titles.
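The header fix can also be scripted. The sketch below exchanges two names in a header list. The function name is hypothetical and the column order shown is only illustrative.

```python
def swap_header_names(header, first, second):
    """Return a copy of the header list with the two names exchanged."""
    swapped = list(header)
    i, j = swapped.index(first), swapped.index(second)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped

# Column order here is illustrative; only the two swapped names matter.
header = ["UNIVERSITY", "UOC_POSITION", "OTHER_POSITION", "OTHERSTATUS", "USERWIKI"]
fixed = swap_header_names(header, "OTHER_POSITION", "OTHERSTATUS")
print(fixed)
```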

Missing Data

When checking the data, we find a lot of “?” entries, which represent missing values. We can use JMP to replace every “?” with a blank.

After that we should delete all the rows with missing values. Before doing so, however, we should deal with the position columns, since many of their missing values are reasonable.
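Outside JMP, the same replacement can be sketched in a few lines of Python. The function name and the sample rows are hypothetical, and `None` stands in for a blank cell.

```python
def blank_missing(rows, marker="?"):
    """Return copies of the rows with the missing-value marker replaced by None."""
    return [
        {column: (None if value == marker else value) for column, value in row.items()}
        for row in rows
    ]

# Made-up rows for illustration.
raw_rows = [
    {"AGE": "40", "UOC_POSITION": "2", "PU1": "4"},
    {"AGE": "37", "UOC_POSITION": "?", "PU1": "?"},
]
clean_rows = blank_missing(raw_rows)
print(clean_rows[1])
```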

Create Position

I consider that a faculty member's position should not depend on which university they are in, so I take their position to be the higher of “UOC_POSITION” and “OTHER_POSITION”.

According to the data description, a higher position has a smaller number. We can therefore create a new column “Position” with a formula that picks the higher of the two positions.

Then clear the formula from the new column and delete the columns “University”, “UOC_POSITION”, “OTHERSTATUS” and “OTHER_POSITION”, since we no longer need them in this analysis.
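The formula itself is not reproduced in this text, so the following Python sketch is only an assumption about its logic: take the numerically smaller (more senior) of the two position codes and ignore missing ones. The function name `derive_position` is hypothetical.

```python
def derive_position(uoc_position, other_position):
    """Pick the more senior (numerically smaller) of two position codes.

    Missing codes (None, empty string or '?') are ignored; returns None if both are missing.
    """
    codes = [int(code) for code in (uoc_position, other_position)
             if code not in (None, "", "?")]
    return min(codes) if codes else None

print(derive_position("2", "4"))   # both present: the smaller code is kept
print(derive_position(None, "3"))  # only one present
```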

Change Labels

According to the data description, the data creator used numbers to represent many categorical variables (e.g. for PhD, 1 means Yes and 0 means No). So we should change the categorical variables back to their original meanings, so that the following analysis is easier and more user friendly.

Here is how to do it in JMP: “Cols” > “Utilities” > “Recode”. The picture shows an example of how to recode one of the columns.
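For readers without JMP, the recoding can be sketched in Python using the code-to-label mappings listed in the Data Dictionary of this report. The names `LABELS` and `recode` are hypothetical, and the sample row is made up.

```python
# Code-to-label mappings copied from the Data Dictionary section.
LABELS = {
    "GENDER": {"0": "Male", "1": "Female"},
    "DOMAIN": {"1": "Arts & Humanities", "2": "Sciences", "3": "Health Sciences",
               "4": "Engineering & Architecture", "5": "Law & Politics"},
    "PhD": {"0": "No", "1": "Yes"},
    "USERWIKI": {"0": "No", "1": "Yes"},
}

def recode(row, labels=LABELS):
    """Return a copy of the row with coded categories replaced by their labels.

    Columns without a mapping, and codes not listed in a mapping, are left unchanged.
    """
    return {column: labels.get(column, {}).get(value, value)
            for column, value in row.items()}

coded = {"AGE": "40", "GENDER": "1", "DOMAIN": "5", "PhD": "0", "USERWIKI": "1"}
labelled = recode(coded)
print(labelled)
```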

Delete Rows With Missing Data

OK, now it is time to get rid of the missing values. For missing values in the score sections, I decided to delete every row that has at least one missing value.

Now the data preparation is done. Export the file in .xlsx format for further analysis with other data analytics software.
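The row-wise deletion can be sketched as a simple filter. This is a Python illustration with a hypothetical function name and made-up rows. It treats `None`, an empty string and “?” as missing.

```python
def drop_incomplete(rows, score_columns):
    """Keep only the rows in which every score column has a value."""
    missing = (None, "", "?")
    return [row for row in rows
            if all(row.get(column) not in missing for column in score_columns)]

# Made-up rows for illustration.
survey = [
    {"PU1": "4", "PU2": "5", "QU1": "3"},
    {"PU1": "4", "PU2": None, "QU1": "3"},
    {"PU1": "?", "PU2": "2", "QU1": "3"},
]
complete = drop_incomplete(survey, ["PU1", "PU2", "QU1"])
print(len(complete))
```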

Data Analysis

With the questions kept in mind, let's explore the data now!

Because we want to see the relationships among more than three categorical dimensions, one of the best ways to start is a Treemap, which gives us a feel for the data. Mosaic Plots and Parallel Sets are also good choices for this scenario, but they are not supported in Tableau, so I didn't include them.

Treemap

Gender-Domain-PhD

To explore score patterns among different domains together with a variable representing academic level, while keeping the Treemap readable (a hierarchy of four or more layers looks bad), I designed a Treemap with the structure Gender – Domain – PhD, size as Number of Records, and colour as Avg. Total Average.

According to the split in the treemap, in every domain and for both genders, except in the Health Science domain, those with a PhD gave Wikipedia a lower total average score. In other words, those who have achieved the highest academic degree do not admire Wikipedia quite as much as others do. A possible explanation is that they have deeper knowledge of their domain, so they are able to spot the weaknesses of Wikipedia pages contributed by other people.
The pair with the biggest difference is the Male Health Science section, which is the exception to the previous observation. The differences between the pairs in the Female section are smaller than those of the respective pairs in the Male section.
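The numbers behind each treemap cell are a record count (size) and an average of Total Average (colour) per combination of dimensions. A minimal Python sketch of that aggregation follows. The function name is hypothetical and the rows are made up, so the printed values say nothing about the real data.

```python
from collections import defaultdict

def treemap_cells(rows, dimensions, measure="Total Average"):
    """Group rows by the given dimensions; return {key: (count, mean of measure)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key][0] += 1
        totals[key][1] += float(row[measure])
    return {key: (count, total / count) for key, (count, total) in totals.items()}

# Made-up rows for illustration.
faculty = [
    {"Gender": "Male", "Domain": "Sciences", "PhD": "Yes", "Total Average": "2.5"},
    {"Gender": "Male", "Domain": "Sciences", "PhD": "Yes", "Total Average": "3.5"},
    {"Gender": "Male", "Domain": "Sciences", "PhD": "No", "Total Average": "3.5"},
]
cells = treemap_cells(faculty, ["Gender", "Domain", "PhD"])
for key, (count, average) in cells.items():
    print(key, count, average)
```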

Gender–Position–Registered

This treemap checks the distribution of registered users among different positions, to see what kind of university faculty has a higher Wikipedia registration percentage. Its structure is Gender–Position–Registered (Userwiki means Registered), with size as Number of Records and colour as Avg. Total Average.

We can find a pattern similar to the one in the previous treemap: regardless of position or gender, registered users gave Wikipedia a higher total average score. A possible explanation is that registered users like using Wikipedia (otherwise they would not have registered) and have contributed to Wikipedia in their own domains, so they consider the quality of its pages high.
The sections' average scores are also related to position. Given the meaning behind the positions, it seems that the higher the position, the lower the score. The potential reason is the same as before, since position also reflects mastery of domain knowledge.
No professor is a registered user, and the registration rate is low: even the highest rates are below 22% (male Associate 21%, male Lecturer 21%).

Slopegraph

We have now identified one important factor that strongly affects the scores (Userwiki, i.e. whether the faculty member is a registered Wikipedia user), while the question of how academic level affects the scores is still unsolved. I therefore decided to use Slopegraphs to see how the scores change when the key factors change. That gives us a deeper and more detailed view.

Slopegraph: How scores change from Userwiki=No to =Yes

Main Findings

The highest scores are in item section SA (Sharing Attitude), which is related less to Wikipedia than to the respondents' own value system. Luckily, they all think it is important to share via media other than journals and books.
The only line in the graph that clearly decreases is Avg. Peu1, which means registered users find Wikipedia less user-friendly than unregistered users do. This may be because Wikipedia pages are hard to edit, since editing is the main function used mainly by registered users. Even so, the Peu1 score is still very high.
As for the items at the bottom (Pf1, Use1, Vis3, Use2, Exp4), most of them concern Wikipedia functions that need registration, so the reason is quite straightforward. However, Vis3, which means the respondent cites Wikipedia in academic papers, also increased a lot when Userwiki changed from No to Yes.
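Each line in such a slopegraph connects two group means of one item, and the sign of their difference tells whether the line rises or falls. A minimal Python sketch of that calculation follows. The function name is hypothetical and the rows are made up, so the output does not reproduce the findings above.

```python
from statistics import mean

def slope_endpoints(rows, factor, items, levels=("No", "Yes")):
    """For each item, return (mean at levels[0], mean at levels[1], change).

    Raises statistics.StatisticsError if either level has no rows.
    """
    result = {}
    for item in items:
        left = mean(float(r[item]) for r in rows if r[factor] == levels[0])
        right = mean(float(r[item]) for r in rows if r[factor] == levels[1])
        result[item] = (left, right, right - left)
    return result

# Made-up responses for illustration.
responses = [
    {"USERWIKI": "No", "PEU1": "5", "EXP4": "1"},
    {"USERWIKI": "No", "PEU1": "4", "EXP4": "1"},
    {"USERWIKI": "Yes", "PEU1": "4", "EXP4": "3"},
]
slopes = slope_endpoints(responses, "USERWIKI", ["PEU1", "EXP4"])
for item, (no_mean, yes_mean, change) in slopes.items():
    print(item, no_mean, yes_mean, change)
```

The same function can be reused for the PhD slopegraph by passing `"PhD"` as the factor.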

Slopegraph: How scores change from PhD=No to =Yes

Main Findings

The obvious drops are in JR and Vis3. JR is Job relevance, so the drop means that Wikipedia is less relevant to the jobs of PhDs. And of course PhDs cannot cite Wikipedia much, since its quality is not assured.
But Pf3, which means publishing academic content on open platforms, increases slightly, which is good.
Looking at the quality scores Qu1 to Qu5, we find a slight drop when PhD changes from No to Yes. Except for Qu4, which is negatively worded, a drop in the other four scores indicates a worse opinion, which is consistent with our assumption. But the drop is too small to draw a solid conclusion here.

SlopeGraph: Domain – Section Scores

This graph is meant to show the differences in scores between domains and the features of each domain. To get a better view of the changes, I used a slopegraph. Although the line slopes can only show changes between neighbouring domains, we can still draw some interesting insights from it.

Main Findings

The Law & Politics domain has relatively low scores in almost all sections. This may be due to a feature of the domain: it is hard to learn and write about, so the quality can hardly be improved by people from other domains.
The Use and Pf scores are the lowest. Use means use behaviour and Pf means user participation in open platforms. That means Wikipedia still has room to improve, and needs to get these expert users more involved.
The highest score is in the SA section, which we have already discussed.

Position- Experience distribution

To get an exact numeric view of the relationship among Position, Experience and scores, I added this box plot to show the distribution and to reinforce the previous observation about Position and scores.

Using the dual axis, we can find a rough relationship: the shorter the experience, the higher the position, and the lower the scores, which is consistent with our observation in the previous Treemap.
However, the differences are really not obvious.

Conclusion

Our assumption is roughly true: a higher academic level means lower scores for Wikipedia. Quality concerns are part of the reason, though the effect is not obvious, while other Wikipedia usage behaviours (e.g. contribution, job relevance) explain much of it. Contribution from the sampled university faculty is low, although they believe that sharing is important.

Interesting findings are:

The highest scores are in item section SA (Sharing Attitude), which is related less to Wikipedia than to the respondents' own value system. Luckily, most university faculty think it is important to share via media other than journals and books. However, the scores related to contributing online material (Exp4, Pf1 and Pf3) are all relatively low.
There is only one clearly decreasing line in the Userwiki slopegraph (Avg. Peu1), which means Wikipedia is less user-friendly for registered users than for unregistered users. This may be because Wikipedia pages are hard to edit, since editing is the main function used mainly by registered users. Even so, the Peu1 score is still very high.
The Law & Politics domain has relatively low scores in almost all score sections. This may be due to a feature of the domain: it is hard to learn and write about, so the quality can hardly be improved by people from other domains.

Recommendation

Improve Wikipedia's user friendliness for registered users, especially the page-editing UX, so that expert but busy university faculty won't be scared away.
Wikipedia should keep strengthening its relationships with well-known universities by developing tools for faculty, so that there are more touchpoints with universities and active participation in Wikipedia becomes more attractive to faculty. If possible, Wikipedia could also use the high-quality materials developed by faculty and students.

Data Dictionary

AGE: numeric
GENDER: 0=Male; 1=Female
DOMAIN: 1=Arts & Humanities; 2=Sciences; 3=Health Sciences; 4=Engineering & Architecture; 5=Law & Politics
PhD: 0=No; 1=Yes
YEARSEXP (years of university teaching experience): numeric
UNIVERSITY: 1=UOC; 2=UPF
UOC_POSITION (academic position of UOC members): 1=Professor; 2=Associate; 3=Assistant; 4=Lecturer; 5=Instructor; 6=Adjunct
OTHER (main job in another university for part-time members): 1=Yes; 2=No
OTHER_POSITION (work as part-time in another university and UPF members): 1=Professor; 2=Associate; 3=Assistant; 4=Lecturer; 5=Instructor; 6=Adjunct
USERWIKI (Wikipedia registered user): 0=No; 1=Yes

The following survey items are Likert scale (1-5) ranging from strongly disagree / never (1) to strongly agree / always (5)

Perceived Usefulness

PU1: The use of Wikipedia makes it easier for students to develop new skills
PU2: The use of Wikipedia improves students' learning
PU3: Wikipedia is useful for teaching

Perceived Ease of Use

PEU1: Wikipedia is user-friendly
PEU2: It is easy to find in Wikipedia the information you seek
PEU3: It is easy to add or edit information in Wikipedia

Perceived Enjoyment

ENJ1: The use of Wikipedia stimulates curiosity
ENJ2: The use of Wikipedia is entertaining

Quality

QU1: Articles in Wikipedia are reliable
QU2: Articles in Wikipedia are updated
QU3: Articles in Wikipedia are comprehensive
QU4: In my area of expertise, Wikipedia has a lower quality than other educational resources
QU5: I trust in the editing system of Wikipedia

Visibility

VIS1: Wikipedia improves visibility of students' work
VIS2: It is easy to have a record of the contributions made in Wikipedia
VIS3: I cite Wikipedia in my academic papers

Social Image

IM1: The use of Wikipedia is well considered among colleagues
IM2: In academia, sharing open educational resources is appreciated
IM3: My colleagues use Wikipedia

Sharing attitude

SA1: It is important to share academic content in open platforms
SA2: It is important to publish research results in other media than academic journals or books
SA3: It is important that students become familiar with online collaborative environments

Use behaviour

USE1: I use Wikipedia to develop my teaching materials
USE2: I use Wikipedia as a platform to develop educational activities with students
USE3: I recommend my students to use Wikipedia
USE4: I recommend my colleagues to use Wikipedia
USE5: I agree my students use Wikipedia in my courses

Profile 2.0

PF1: I contribute to blogs
PF2: I actively participate in social networks
PF3: I publish academic content in open platforms

Job relevance

JR1: My university promotes the use of open collaborative environments in the Internet
JR2: My university considers the use of open collaborative environments in the Internet as a teaching merit

Behavioral intention

BI1: In the future I will recommend the use of Wikipedia to my colleagues and students
BI2: In the future I will use Wikipedia in my teaching activity

Incentives

INC1: To design educational activities using Wikipedia, it would be helpful: a best practices guide
INC2: To design educational activities using Wikipedia, it would be helpful: getting instruction from a colleague
INC3: To design educational activities using Wikipedia, it would be helpful: getting specific training
INC4: To design educational activities using Wikipedia, it would be helpful: greater institutional recognition

Experience

EXP1: I consult Wikipedia for issues related to my field of expertise
EXP2: I consult Wikipedia for other academic related issues
EXP3: I consult Wikipedia for personal issues
EXP4: I contribute to Wikipedia (editions, revisions, articles improvement...)
EXP5: I use wikis to work with my students

Difference between revisions of "ISSS608 2016-17 T1 Assign2 Li Nanxun"


Contents

Abstract

Motivation and Problems

Approaches

Interactive Data Visualization

Tools Utilized

Data Preparation

Data Analysis

Treemap

Gender-Domain-PhD

Gender–Position–Registered

Slopegraph

Slopegraph: How scores change from Userwiki=No to =Yes

Slopegraph: How scores change from PhD=No to =Yes

SlopeGraph: Domain – Section Scores

Position- Experience distribution

Conclusion

Recommendation

Data Dictionary


@@ Line 3: / Line 3: @@
 <br/>
 There is a rough relationship between the academic level (Observed via PhD and Position) of university faculty and how they think about Wikipedia: the more profound in domain, the lower the scores. In fact, the most important factor that affects the scores is whether the user is registered or not. This is largely due to the usage factors related with behaviour and habit like contribution, Wiki usage. The detailed reasons may be that they cannot use some of the functions unless they are registered users(behaviour scores), and once they registered, which should means they think Wikipedia is good and want to contribute to Wiki(subjective positive view of Wiki), and after their contribution, the quality scores should also increase.
-Interesting findings are:
-* The highest scores are shown in Item section SA (Sharing Attitude), which is not so related with Wikipedia, but their themselves value system. Luckily, most of university faculty think it is important to share via other media other than journals and books. However, the scores related with online material contribution( Exp4, Pf1 and Pf3)are all relatively low.
-* There is only one obvious decreasing line (Avg. Peu1)in the graph, which means Wikipedia user friendliness for those who are registered users is not as good as for unregistered users!!! So that may because the Wikipage is so hard to edit, since this is the main function used only by registered users. Anyway, the Peu1 score is still very high.
-* Law & Politics domain has relatively low scores for almost all of the score sections. This may due to the domain features that it is hard to learn and write, the quality can hardly be improved by people from other domains.
 <br/>
@@ Line 39: / Line 34: @@
 <br/>
 Type of chart used: Treemap, Slopegraph, Box Plot.
+=Interactive Data Visualization=
+Here is the link to
+[https://public.tableau.com/shared/KQMSPRQ3H?:display_count=yes Tableau Public Display]
+. You can view my effort and explore the data yourself!
 =Tools Utilized=
-<br/>
 In this report, Tools used are:
 * Tableau 10.0 (for data analysis)
@@ Line 50: / Line 48: @@
 =Data Preparation=
 <br/>
-==“Revise” “Month”==
-First, let’s look at the variables which are not consistent with common sense.
+'''Import Data into Excel'''
-After loading the resale price excel file, we can find that in the “Dimensions” field, variable “Month” is recognized as a string variable rather than a continuous date variable.
+After downloading the data, we can notice that it cannot be viewed correctly in Excel, so we need to make some changes in the original file before loading it into Excel and checking its patterns.
-So we need to change its format to a date variable which should be recognized by Tableau.
-Actually we don’t revise “Month” directly, but we choose to create a new “Month” to take its place—in the following section, we will use the new variable “Registration Date” all the time and ignore the original one.
-And here is how I do it:
-.	Right click the “Dimensions” field, and select “Create Calculated Field”
-.	Then put in the new variable name and the formula in respective places of the pop-up window.
-[[File:LI NANXUN Assign1 SS1.png]]
-Because we use the “DATEADD“ function, its variable type is automatically recognized as date by Tableau.
+After checking the original data via Notepad, we can find all the pieces of data are split by “;”. The split sign seems cannot be recognized automatically by Excel.  The solution for this is to replace all the “;” with tab: enter a tab and copy it somewhere else, open the Replace function by “Ctrl” + “H”, “Find what” “;” , “Replace with” the tab you just copy and “Replace all”, save the document and open it in Excel. And you can find everything is in good order now.
 <br/>
-==Generate Comparable Price==
+'''Create New Score Factors'''
-What we can get from the original data sheet is the total price of the flat, which is not quite good for our analysis in terms of the big differences of floor areas of the flats (range from 40 m2 to 192 m2). So based on common sense, I chose to use price per m2 (“Price per sqm”) as the main price indicator in the following analysis.
-Here is how I do it:
-.    Right click the “Measures” field, and select “Create Calculated Field”
+In order to compare the section scores, we need to create new columns to calculate each section (i.e. PU=average(Pu1,Pu2,Pu3)) and Total Average.  The way to do this is for each section:
-.    Then put in the new variable name and the formula ([Resale Price]/[Flat Area (sqm)]) in respective places of the pop-up window.
+.  Create a new column. Name the column as per the section in the first row.
+.  Use average function to calculate the section score.
-And the new measure is created.
+Of course we can do this in Tableau or JMP, but it is much more time-consuming due to the function writing workload.
 <br/>
-=Resale Public Housing Supply Analysis=
+'''Correct Wrong Column Names'''
-For this part of analysis, I will not only focus on 2015 and 20161H but take all the available data in the data resource into consideration (2012March to 2016June), so that we can have a deeper understanding about the market history, and more easier to distinguish the interesting patterns and findings.
-<br/>
-==Supply Share Analysis for 2015 and 2016 1H==
-This will show the whole picture of the shares of SG resale public housing supply of 2015 & 2016 1H.
+The OTHERSTATUS and OTHER_POSITION are misplaced.  We need to modify the names by changing the two title columns.
 <br/>
-===Generate The Chart===
-.	Drag “Number of Records” from “Measures” to “Columns”.
-.	Drag “Town” to “Rows”.
-.	Drag “Registration Date” to “Columns” and put it on the left of “SUM(Registration Date)”.
-.	“Fit with width”.
-.	“Show Me” “Horizontal Bars”.
+'''Missing Data'''
-.	Sort the chart by 2015 in descending order.
+When checking the data, we can find a lot of “?”, which mean missing values. We can use JMP to replace all the “?” with blank:
-.	Drag “Registration Date” to “Filter”, select “#Year” in the pop-up window, and check “2015” and “2016” in the pop-up window after the previous one.
+[[File:ISSS608 Assignment2 Missing values.png|400px]]
-.	Drag “Flat Type” to “Marks”, change its type from “Detail” to “Colour”
-.	Change the title to “Resale Supply Shares of 2015 and 2016 1H”.
-.	Change the legend colours to colour-blind-friendly.
-.	Because the supply of 2016 is only the half data, in order to make it more visually comparable, I changed the axis of 2016 by clicking “Edit Axis” and choose “Independent axis ranges for each row or column”
-[[File:Assign1 LI NANXUN 1.png|800px]]
-<br/>
-===Main Findings===
-•	Most of the transactions are located in the new towns which are far away from CBD and like JuRong West, Tampines and Woodlands
-•	On the other hand, the less traded regions (top 3) are Bukit Timah, Marine Parade and Central Area, which are all located in the centre of the country.
-•	Within one region, the most traded flat type is normally “4 ROOM” for high transacted regions and “3 Room” for the less transacted regions.
-•	“1 ROOM” and “MULTI-GENERATION” are so few traded that we can hardly recognize them in the chart above.
-•	And “2 ROOM” are also traded much less than the other types.
 <br/>
+After that we should delete all the rows with missing values. But before doing that, we should treat the position first, since there are a lot of reasonable missing values in the columns related to position.
-==Supply Trend breakdown—Town==
-In this section, we will dig the relationship between supply and town deeper.
 <br/>
-===By number of records===
-I am sure that many people are curious about the relationship with flat location and transaction volume. And the findings will tell us the pattern of the resale public housing transactions in terms of location.
+'''Create Position'''
-<br/>
-====Generate The Chart====
-.	Drag “Number of Records” from “Measures” to “Drop field here”.
+I consider the positions should have no matter with which university they are in, so I recognize their positions as the higher one in the “UOC_POSITION” and “OTHER_POSITON”.
-.	Drag “Town” to “Columns”.
+According to the data description, higher position means smaller number. Then we can create a new column “Position” with formula as followed.
-.	Drag “Registration Date” to “Columns” and put it on the left of “Town”.
+[[File:ISSS608 Assignment2 Position.png|400px]]
-.	“Fit with width”
-.	Change the title to “Resale Flat Transaction Volume Trend by Town”
-.	Change the legend colours to colour-blind-friendly. And the colour will show the closer years in darker colours.
-.	In order to show the total transaction volume changing trend, I add average region transaction volume line for each year. Because the region number for each year is constant, then the total supply can be reflected by the average line without changing the axis.
-[[File:Assign1 LI NANXUN 2.png|800px]]
 <br/>
-====Main findings====
+Then clear the formula for the new column, delete Column “University”, “UOC_POSITION”, “OTHERSTATUS” and “OTHER_POSITION”, we don’t need these columns in this analysis anymore.
-•	The Supply Volume experienced a big drop in 2013 and rebounded in 2015.
-•	The supply ranking changes a little bit volatile.
-<br/>
-===By percentage of total===
-In order to see location preference trends of the market, I used the ‘percentage of total’ of the numbers of the transaction records.
-The percentage numbers are generated by comparing with the average of the average town prices of each year. The reason why we don't directly plot the average town prices is because the total market performance can affect the visual impression. For example, say one town is more preferred in 2013, but because of the entire market of 2013 went down, the average price of this town dropped slightly. Although we can find that the price went stronger than other regions, but what you see is a dropping line within other dropping lines, and this line drop just slower than others. If we compare the percentage, then we will see a increasing line with other lines relatively remain the same(horizontal line), which will make the location preference easier to distinguish.
 <br/>
-====Generate The Chart====
-.	Keep using the previous sheet. We just need to do some changes on the previous one.
-.	Drag “Town” from “Columns” to “Marks”, delete “Year(Registration Date)”, and change the type of “Town” from “Detail” to “Colour”.
-.	“Show Me” “Lines(Continuous)”.
-.	Add “Quick Table Calculation” to the “SUM(Number of Records)” in “Rows” as “Percentage of total” and “Edit Calculation Table” from “Table(across)” to  “Table (down)”
-.	Change the title to “Resale Flat Transaction Volume Trend by Town (percentage of total)”
-[[File:Assign1 LI NANXUN 3.png|800px]]
-<br/>
-====Main Findings====
-* For 2012 to 2016, the supply in terms of location changes a little bit volatile, which is the same observation in 5.2.1.
+'''Change Labels'''
-* Although the changes are obvious, the volumes of each region remain relatively the same level.
+According to the data description, the data creator used numbers to represent a lot of category variables (i.e. for PhD, 1 means Yes, 0 means No). So we should change the category variables back to their original means so that the following analysis can be easier and user friendly.
-* For Sengkang, Choa Chu Kang, and Punggol, they have experienced increase at least 2 years.
+Here is how to do it:  “Cols” “Utilities” “Recode”. And the picture is the example about how to revise one of the columns.
-<br/>
-==Supply Trend Breakdown— Flat Type==
+[[File:ISSS608 Assignment2 Recode.png|400px]]
-After regional difference, lets dig deeper the relationship between the supply trend and the flat types.
 <br/>
-===By number of records===
-<br/>
-====Generate The Chart====
-.	Drag “Number of Records” from “Measures” to “Drop field here”.
-.	Drag “Flat Type” to “Columns”.
-.	Drag “Registration Date” to “Columns” and put it on the left of “Town”.
-.	“Entire View”
-.	Change the title to “Resale Public Housing Supply Trend (Flat Type)”
+'''Delete Rows With Missing Data'''
-.	Drag “Registration Date” to “Marks”, and change its type from “Detail” to “Colour”. Change the legend colours to colour-blind-friendly. And the colour will show the closer years in darker colours.
+OK, now it is time to get rid of missing values. For the missing value shown in the score sections, I decided to directly delete the entire rows which have at least one missing value.
-.	Then I would like to add a running percentage of total reference line for audience. Drag “Number of Records” from “Measures” to “Rows”. Set the right “SUM(Registration Date)” “Duel Axis” and add “Quick Table Calculation”, “Edit Table Calculation” , “Add secondary calculation” ,”Percent of Total”, “Pane (across then down)”
+[[File:ISSS608 Assignment2 Delete Missing Data.png|700px]]
-And the chart is done.
-[[File:Assign1 LI NANXUN 4.png|800px]]
-<br/>
-====Main Findings====
-•	Volume Distribution in terms of flat types is very stable by looking at the accumulated percentage supply of total of the types, the volume ranking never change.
-•	The most traded type is “4 ROOM”, followed by “3 ROOM” and “5 ROOM”. And these three types count more than 90% of the market.
 <br/>
+And now, the data preparation is done. Export the file as .xlsx format for further analysis with other data analytics software.
-===By percentage of total===
+=Data Analysis=
+With the questions kept in mind, let's explore the data now!
-In order to see flat type preference trends of the market, I used the ‘percentage of total’ of the numbers of the transaction records. The reason why using this method is the same as the supply trend breakdown by town.
-<br/>
-====Generate The Chart====
-.	Keep using the previous sheet. We just need to do some changes on the previous one.
+Because we want to see the relationships among more than 3 different category dimensions, one of the best way to start with should be Treemap. Treemap can let us have the feeling about the data. Mosaic Plot and Parallel Set are also good choices for this scenario, but they are not supported in Tableau, so I didn't include them.
+==Treemap==
-.	Delete the right “SUM(Number of Records)”.
-.	Drag “Town” from “Columns” to “Marks”, delete “Year(Registration Date)”, and change the type of “Town” from “Detail” to “Colour”.
+===Gender-Domain-PhD===
+To explore score patterns across domains while including a variable that represents academic level, and to keep the treemap readable (a hierarchy of four or more layers looks bad), I designed a treemap with the structure Gender - Domain - PhD, with size as Number of Records and colour as Avg. Total Average.
-.	“Show Me” “Line (continuous)”
+[[File:ISSS608 Assignment2 Treemap1.2.png|800px]]
-.	Add “Quick Table Calculation” to the “AVG(Price per sqm)” in “Rows” as “Percentage of total” and “Edit Calculation Table” from “Table(across)” to  “Table (down)”
+* According to the split in the treemap, regardless of domain or gender (with the exception of the Health Science domain), faculty with a PhD gave Wikipedia a lower total average score. In other words, those who have achieved the highest academic degree do not admire Wikipedia as much as others do. A possible explanation is that they have deeper domain knowledge, so they are better able to spot weaknesses in Wikipedia pages contributed by other people.
+* The pair with the biggest difference is the male Health Science section, which is the exception to the previous observation. The differences between the pairs in the female section are smaller than those between the respective pairs in the male section.
-.	Change the title to “Resale Public Housing Supply Trend % (Flat Type)”
+===Gender–Position–Registered===
+This treemap checks how registered users are distributed across positions, to see which kind of university faculty has the higher Wikipedia registration percentage. Its structure is Gender - Position - Registered (Userwiki means registered), with size as Number of Records and colour as Avg. Total Average.
-[[File:Assign1 LI NANXUN 5.png|800px]]
+[[File:ISSS608 Assignment2 Treemap2.2.png|800px]]
-<br/>
-====Main Findings====
-* The transaction pattern of the major types of flat don’t have big change, they are all fluctuating slightly.
+* We find a pattern similar to the one in the previous treemap: regardless of position or gender, registered users gave Wikipedia a higher total average score. A possible explanation is that registered users presumably like using Wikipedia (otherwise they would not have registered) and have contributed to Wikipedia in their own domains, so they see the quality of its pages as high.
+* The sections' average scores are also related to position. Given what the positions mean, it seems that the higher the position, the lower the score. The potential reason is the same as before, since position also reflects mastery of domain knowledge.
+* No professor is a registered user. The registration rate is low overall: even the highest is below 22% (male Associate 21%, male Lecturer 21%).
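The registration percentages quoted above were read off the treemap. As a rough cross-check outside Tableau, the share of registered users per group could be computed as in the sketch below. The column names `GENDER`, `POSITION` and `USERWIKI`, the 0/1 coding, and the sample rows are assumptions made for illustration, not the survey data.

```python
from collections import defaultdict

def registration_rate(rows, group_keys):
    """Return {group: share of rows with USERWIKI == 1} for each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [registered, total]
    for row in rows:
        key = tuple(row[k] for k in group_keys)
        if row["USERWIKI"] == 1:
            counts[key][0] += 1
        counts[key][1] += 1
    return {key: reg / total for key, (reg, total) in counts.items()}

# Made-up rows for illustration only.
rows = [
    {"GENDER": "M", "POSITION": "Associate", "USERWIKI": 1},
    {"GENDER": "M", "POSITION": "Associate", "USERWIKI": 0},
    {"GENDER": "M", "POSITION": "Associate", "USERWIKI": 0},
    {"GENDER": "M", "POSITION": "Associate", "USERWIKI": 0},
    {"GENDER": "F", "POSITION": "Professor", "USERWIKI": 0},
]
rates = registration_rate(rows, ["GENDER", "POSITION"])
print(rates)
```

Grouping by the same dimensions as the treemap (gender, then position) makes the numbers directly comparable with the chart.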
-* For the most transacted type—“4 ROOM”, its portion of total is on an uptrend. On the opposite, the second transacted type—“3 ROOM” is experiencing a slight drop starting from 2014.
+==Slopegraph==
-<br/>
+We have identified one important factor that strongly affects the scores (Userwiki, i.e. whether or not the faculty member is a registered Wikipedia user), but the question of how academic level affects the scores remains open. I therefore decided to use slopegraphs to see how the scores change when these key factors change, which gives a more detailed view.
-==Supply Trend Breakdown—Month==
+===Slopegraph: How scores change from Userwiki=No to =Yes===
-In China, there is obvious time-series pattern shown in the entire housing market. And there is a proverb for this, saying "Golden September, Silver October". So what is the time-series pattern of SG resale housing market? In order to make the chart more info-rich, I add flat type in as a marks to show whether or not the different flat types have different time-series patterns.
-<br/>
-===Generate The Chart===
-.	Create a new sheet.
+[[File:ISSS608 Assignment2 Slopegraph1.2.png|400px]]
-.	Drag “Number of Records” from “Measures” to “Rows”.
+'''Main Findings'''
+* The highest scores are in the item section SA (Sharing Attitude), which is related less to Wikipedia than to the respondents' own value system. Fortunately, they all think it is important to share through media other than journals and books.
+* The only line in the graph that clearly decreases is Avg. Peu1, which means registered users find Wikipedia less user-friendly than unregistered users do. This may be because Wikipedia pages are hard to edit, since editing is the main function used only by registered users. Even so, the Peu1 score is still very high.
+* Most of the items at the bottom (Pf1, User1, Vis3, Use2, Exp4) concern the use of wiki functions that require registration, so the reason for their increase is straightforward. However, Vis3, which means the user has cited Wikipedia in academic papers, also increased a lot when Userwiki changed from No to Yes.
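Each line in a slopegraph like the one above connects two group averages for one survey item. A minimal sketch of how those endpoints and their difference could be computed is given below. The function name `slope_values`, the 0/1 coding of the grouping column, and the sample rows are assumptions for illustration only.

```python
from statistics import mean

def slope_values(rows, group_col, item_cols):
    """For each item, return (mean when group=0, mean when group=1, change)."""
    result = {}
    for item in item_cols:
        start = mean(r[item] for r in rows if r[group_col] == 0)
        end = mean(r[item] for r in rows if r[group_col] == 1)
        result[item] = (start, end, end - start)
    return result

# Made-up rows for illustration only (not survey data).
rows = [
    {"USERWIKI": 0, "Peu1": 5, "Vis3": 1},
    {"USERWIKI": 0, "Peu1": 4, "Vis3": 2},
    {"USERWIKI": 1, "Peu1": 4, "Vis3": 3},
    {"USERWIKI": 1, "Peu1": 4, "Vis3": 3},
]
slopes = slope_values(rows, "USERWIKI", ["Peu1", "Vis3"])
for item, (start, end, change) in slopes.items():
    print(item, start, end, change)
```

A negative change corresponds to a downward line in the slopegraph and a positive change to an upward line; the same function works for the PhD comparison by passing a different grouping column.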
-.	Drag “Registration Date” to “Filter”, select “#Year” in the pop-up window, and check the years you want to observe in the pop-up window after the previous one.
-.	Drag “Registration Date” to “Columns”.
+===Slopegraph: How scores change from PhD=No to =Yes===
-.	Drag “Registration Date” to “Columns” on the right side, and change its type from year to month.
+[[File:ISSS608 Assignment2 Slopegraph2.2.png|400px]]
-.	Change the title.
+'''Main Findings'''
+* The obvious drops are in JR and Vis3. JR is Job Relevance, so PhDs find Wikipedia less relevant to their jobs. Likewise, PhDs cite Wikipedia less, presumably because its quality is not assured.
+* Pf3, which means publishing academic content on open platforms, increases slightly, which is good.
+* Looking at the quality scores Qu1 to Qu5, we find a slight drop when PhD changes from No to Yes. Except for Qu4, a drop in these scores indicates a worse view of quality, which is consistent with our assumption. However, the drop is too small to support a solid conclusion.
-.	Drag “Flat Type” to “Marks”, change its type to “Colour”.
+===Slopegraph: Domain – Section Scores===
+This graph is meant to reveal the differences and characteristics of each domain's scores. To give a better view of the changes, I used a slopegraph. Although the line slopes can only show the changes between neighbouring domains, we can still draw some interesting insights from them.
-[[File:Assign1 LI NANXUN 6.png|800px]]
+[[File:ISSS608 Assignment2 Slopegraph3.2.png|600px]]
-<br/>
-===Main Findings===
-•	The fluctuation shows that the supply has time series pattern. Normally there are peaks in October, and around April and May. And there is no apparent difference in terms of different flat types.
+'''Main Findings'''
+* The Law & Politics domain has relatively low scores in almost all sections. This may be due to the nature of the domain: it is hard to learn and write about, so the quality can hardly be improved by people from other domains.
+* The Use and Pf scores are the lowest. Use means user behaviour and Pf means user participation in open platforms. This suggests that Wikipedia still has room to improve by getting these expert users more involved.
+* The highest score is in the SA section, which was discussed above.
 <br/>
+==Position-Experience distribution==
+To get an exact numeric view of the relationship among Position, Experience and scores, I added this box plot to show the distribution and to reinforce the previous observation about Position and scores.
-=Resale Public Housing Price Analysis=
+[[File:ISSS608 Assignment2 Distribution.png|600px]]
-For this section, we mainly focus on Question 2,3,4. Price of a flat is one of the most important factors that people care about. So what are the price? Are they going up or down(Trend)? How other important factors are affecting the price? We will answer these questions in the following sections.
-<br/>
-==Price Distribution of 2015==
-Analysing the Price Distribution of 2015 will give us a direct but in-depth understanding of the market.
-<br/>
-===General (no bins)===
-First of all, let’s look at the whole price distribution of 2015. This shows the transaction volumes for each price(bin as S$1).
-<br/>
-====Generate The Chart====
-.	Create a new sheet.
-.	Drag “Price per sqm” from “Measures” to “Columns”. Set from “sum” to “Dimension”
-.	Drag “Number of Records” to “Rows”.
-.	Drag “Registration Date” to “Filter”, select “#Year” in the pop-up window, and only check “2015” in the pop-up window after the previous one.
-.	“Fit with width”
-.	Change the title to “Resale Public Housing Price Distribution 2015”
+* Using the dual axis, we can see a rough relationship: the shorter the experience and the higher the position, the lower the scores, which is consistent with our observation in the previous treemap.
+* However, the differences are small.
-.	Add Reference Line by right click the X axis. Change the settings as followed screenshot.
+=Conclusion=
-And the chart finished.
-[[File:Assign1 LI NANXUN 7.png|800px]]
-<br/>
-====Main Findings====
-•	The average price of 2015 is S$4817/m2,  which is right biased due to the fat right tail.
-•	What we can find is, the resale prices are affected by the human preference—the most traded prices are multiples of 100, especially those which are multiples of 500.
 <br/>
+Our assumption is roughly true: a higher academic level means lower scores for Wikipedia. Quality concerns account for part of this, though not clearly, while Wikipedia usage behaviour (e.g. contribution, job relevance) explains much of it.
+Contribution from the sampled university faculty is low, although they believe that sharing is important.
-===General (With bins)===
-Histogram is commonly used to display distribution. Instead of the discrete-like chart above, I will show easy-accepted distribution chart.
-====Generate The Chart====
-.	Create a new sheet.
+Interesting findings are:
-.	Right click “Price per sqm” in “Measures”, “Create” “Bins”, and set the size of bins as 100, because we noticed the price pattern from the chart above.
+* The highest scores are in the item section SA (Sharing Attitude), which is related less to Wikipedia than to the respondents' own value system. Fortunately, most university faculty think it is important to share through media other than journals and books. However, the scores related to contributing online material (Exp4, Pf1 and Pf3) are all relatively low.
+* Only one line in the graph clearly decreases (Avg. Peu1), which means registered users find Wikipedia less user-friendly than unregistered users do. This may be because Wikipedia pages are hard to edit, since editing is the main function used only by registered users. Even so, the Peu1 score is still very high.
-.	Drag “Price per sqm(bin)” from “Dimensions” to “Columns”. Set from “sum” to “Dimension”
+* The Law & Politics domain has relatively low scores in almost all score sections. This may be due to the nature of the domain: it is hard to learn and write about, so the quality can hardly be improved by people from other domains.
-.	Drag “Number of Records” to “Rows”.
-.	Drag “Registration Date” to “Filter”, select “#Year” in the pop-up window, and check the years you want to observe in the pop-up window after the previous one.
-.	Drag “Registration Date” to “Columns” on the left of “Price per sqm(bin)”.
-.	“Entire View”
-.	Change the title to “Resale Public Housing Price Distribution”
-.	In Order to compare the price distribution between 2015 and 20161H, we need to add “Quick Table Calculation” “Percent of Total”, and “Compute Using” from “Table(across)” to “Pane”.
-[[File:Assign1 LI NANXUN 8.png|800px]]
-<br/>
-====Main Findings====
-* The Trading price per m2 largely following a normal distribution with a fat right tail.
-* The Price Distribution Patterns of 2015 and 20161H are very similar.
 <br/>
-===Break down by Town===
+=Recommendation=
+* Improve Wikipedia's user-friendliness for registered users, especially the page-editing UX, so that expert but busy university faculty are not scared away.
-I would like to show the price distribution of 2015 broken down by Town with transaction volumes put together. This will show how the price and the transaction volume are related.
+* Wikipedia should keep strengthening its relationships with well-known universities by developing tools for faculty, so that there are more touchpoints with universities and more incentive for faculty to participate actively. If possible, Wikipedia could also use the high-quality materials developed by faculty and students.
-<br/>
-====Generate The Chart====
-.	Create a new sheet.
-.	Drag “Price per sqm” from “Measures” to “Rows”. Set from “sum” to “avg.”
-.	Drag “Town” from “Dimensions” to “Columns”.
-.	Drag “Registration Date” to “Filter”, select “#Year” in the pop-up window, and only check “2015” in the pop-up window after the previous one.
-.	“Fit with width”
-.	Sort in descending order.
-.	“Show me” “Box Plot”
-.	Drag “Town” from “Columns” to “Marks”.
-.	Set “Price per sqm” from “avg.” to “Dimension”.
-.	Drag “Town” from “Marks” back to “Columns”
-.	In order to reflect the price and supply relationship, I would like to add a dual axis of the supply volumes of respective regions. Drag “Number of Records” to “Rows” and put it on the right of “Price per sqm”. Set “Dual Axis”.
-.	Set the configuration of “SUM(Number of Records)” in “Marks” to make it easier to recognize. (shapes, colours, size)
-.	Change the title.
-[[File:Assign1 LI NANXUN 9.png|800px]]
-<br/>
-====Main Findings====
-•	The price and supply show a rough relationship that the cheaper the price, the higher the supply, but it is not very straightforward.
-•	For the most expensive areas, the price ranges are commonly wider if the supply is not too low.
-<br/>
-===Break down by Flat Type===
-Beside Location, the Flat Type can also affect the price of flats. And based on the previous effort, we can easily get the result by slightly changing several places. So, why not have a try!
-<br/>
-====Chart====
-The way to generate the chart is relatively the same with the above one. What we need to change is using “Flat Type” instead of “Town”.
-[[File:Assign1 LI NANXUN 10.png|800px]]
 <br/>
-====Main Findings====
-•	As we can see, the more the ROOMs of a flat, the less expensive the average price (We should only look at the “2 ROOM” to “EXECUTIVE”, because their transaction volumes are big enough to tell the market. “1 ROOM”’s and “MULTI-GENERATION”’s transaction volumes are too small to be representatives of the market), which is consistent with economies of scale—more ROOMs mean bigger floor area, and that will make the building cost and other costs that are allocated to per m2 cheaper.
-•	The most transacted top 3 types of the market are respectively “4 ROOM”, “3 ROOM” and “5 ROOM”, but their prices ranking (low to high) is not consistent with the transaction volume ranking, which means the economical one is not always the most traded.  And that means other factors like people’s usage needs and market supply pattern are also affecting people’s property preference, which is very reasonable in real-life property transaction.
-<br/>
-==Price Trend by Town==
-<br/>
-===By average prices===
-In order to fully use the data, I would like to analysis the price trend from 2012 to 2016, which will contain the answer for the compulsory question 3, which is Question 4 in my report.
-<br/>
-====Generate The Chart====
-.	Create a new sheet.
-.	Drag “Price per sqm” from “Measures” to “Rows”. Set from sum to avg.
-.	Drag “Town” to “Columns”.
-.	Sort the chart by Town in descending order.
-.	Drag “Registration Date” to “Columns” and put it on the left of “Town”.
-.	“Entire View”
-.	Change the title
-.	Drag “Registration Date” to “Marks” and change its type to colour, and then change the legend colours to colour-blind-friendly by right clicking it and selecting “Edit Colour”.
-.	In order to make it clearer, I added in three reference lines. One is based on the Central Area Average Price in 2012 and one is based on Choa Chu Kang’s average price of 2016, and the average prices of the regions’ average prices respective year (that’s why this average price of 2015 is not the same as mentioned in 6.1.1.).
-[[File:Assign1 LI NANXUN 11.png|800px]]
-<br/>
-.2.1.2.	Main findings
-* The town price ranking is relatively stable, and the prices for the central areas are more expensive than those new remote areas, which are consistent with our common sense.
-* As you can see, starting from 2015, for those expensive areas (especially Central Area), the price per m2 actually goes even higher, and for those economical areas, the price per m2 is going lower. To sum up, the high higher, the low lower.
-<br/>
-===By percentage of average price of the respective year===
-In order to see location preference trends of the market, I used the ‘percentage of total’ of the numbers of the transaction records.
-<br/>
-====Generate The Chart====
-.	Keep using the previous sheet. We just need to do some changes on the previous one.
-.	Drag “Town” from “Columns” to “Marks”, delete “Year(Registration Date)”, and change the type of “Town” from “Detail” to “Colour”.
-.	“Show Me” “Lines(continuous)”.
-.	Add “Quick Table Calculation” to the “AVG(Price per sqm)” in “Rows” as “Percentage of total” and “Edit Calculation Table” from “Table(across)” to  “Table (down)”
-.	Change the title
-[[File:Assign1 LI NANXUN 12.png|800px]]
-<br/>
-====Main Findings====
-* For 2012 to 2016 1H, the central area’s price changes much more dramatically than other areas, experiencing a big jump in 2015 and dropped back slightly in the first half of 2016.
-* If looking at the lines closer, we can find the same observation that the high higher, the low lower. The high increased more obviously than the low drop.
-<br/>
-==Price Trend by Flat Type==
-The way to generate the statistic charts is relatively the same as 6.2, the only difference is that we use “Flat Type” instead of “Town”.
-<br/>
-===By average prices===
-<br/>
-====Chart====
-The way to generate the chart is relatively the same with the above one. What we need to change is using “Flat Type” instead of “Town”.
-[[File:Assign1 LI NANXUN 13.png|800px]]
-<br/>
-====Main Findings====
-•	The transaction prices are dropping during the years. This can be observed by looking at the average prices for each year(average price for each type, can be recognized as a price level indicator.)
-<br/>
-===By percentage of the average price of the respective year===
-In order to see flat type preference trends of the market, I used the ‘percentage of total’ of the average of the flat price of the respective year. By looking at this chart, we will understand the pattern how the prices of different flat types change through the last 4.5 years.
-<br/>
-====Chart====
-The way to generate the chart is relatively the same with the above one. What we need to change is using “Flat Type” instead of “Town”.
-[[File:Assign1 LI NANXUN 14.png|800px]]
-<br/>
-====Main Findings====
-•	Because we have discussed about the “1 ROOM” and “MULTI-GENERATION”, which are seldom traded and their average prices cannot show the market, we should ignore them by hiding them. After hiding the records, we can easily find that the average prices of each type are very stable during the years when compared with the average prices of the respective years.
-=Finding Summary=
-<br/>
-''General observations''
-The entire resale public housing market is relatively stable with price dropping slightly and volume rebounding from 2013's dive.
-For Supply:
-* Most of the transactions are located in the new towns which are far away from CBD and like JuRong West, Tampines and Woodlands
-* On the other hand, the less traded regions (top 3) are Bukit Timah, Marine Parade and Central Area, which are all located in the centre of the country.
-* Within one region, the most traded flat type is normally “4 ROOM” for high transacted regions and “3 Room” for the less transacted regions.“1 ROOM” and “MULTI-GENERATION” are so few traded that we can hardly recognize them in the chart above.And “2 ROOM” are also traded much less than the other types.
-* The Supply Volume experienced a big drop in 2013 and rebounded in 2015.
-* Volume Distribution in terms of flat types is very stable, and the volume ranking never change.
-* The most traded type is “4 ROOM”, followed by “3 ROOM” and “5 ROOM” when talking about the total supply of each year. These three flat types count more than 90% of the market.
-* The transaction pattern of the major types of flat don’t have big change, they are all fluctuating slightly.
-* For the most transacted type—“4 ROOM”, its portion of total is on an uptrend. On the opposite, the second transacted type—“3 ROOM” is experiencing a slight drop starting from 2014.
-* The fluctuation shows that the supply has time series pattern. Normally there is a peak in October, and around April and May. And there is no apparent difference in terms of different flat types.
-<br/>
-For Price:
-* The Trading price per m2 largely following a normal distribution with a fat right tail. The average price of 2015 is S$4817/m2, which is right biased due to the fat right tail.
-* What we can find is, the resale prices are affected by the human preference—the most traded prices are multiples of 100, especially those which are multiples of 500.
-* The Price Distribution Patterns of 2015 and 20161H are very similar.
-* As we can see, the more the ROOMs of a flat, the less expensive the average price which is consistent with economies of scale—more ROOMs mean bigger floor area, and that will make the building cost and other costs that are allocated to per m2 cheaper.
-* The most transacted top 3 types of the market are respectively “4 ROOM”, “3 ROOM” and “5 ROOM”, but their prices ranking (low to high) is not consistent with the transaction volume ranking, which means the economical one is not always the most traded.  And that means other factors like people’s usage needs and market supply pattern are also affecting people’s property preference, which is very reasonable in real-life property transaction.
-* The town price ranking is relatively stable, and the prices for the central areas are more expensive than those new remote areas, which are consistent with our common sense.
-* As you can see, starting from 2015, for those expensive areas (especially Central Area), the price per m2 actually goes even higher, and for those economical areas, the price per m2 is going lower. To sum up, the high higher, the low lower. The high increased more obviously than the low drop.
-* For 2012 to 2016 1H, the central area’s price changes much more dramatically than other areas, experiencing a big jump in 2015 and dropped back slightly in the first half of 2016.
-* The transaction prices are dropping during the years.
-* The average prices of each type are very stable during the years when compared with the average prices of the respective years.
-<br/>
 =Data Dictionary=
 <br/>