Difference between revisions of "T15 Final Delivery"

From Analytics Practicum
Jump to navigation Jump to search
Line 58: Line 58:
  
 
=== Data Preparation ===
 
=== Data Preparation ===
 +
The data preparation follows 3 basic steps.
 +
 +
Step 1: Aggregate student test score
 +
The original student test score table records the correct and incorrect answers by each student. We convert these into percentage score of each student by dividing the total number of correct answers by the total number of questions student attempted. Questions that are labeled 7 and 8 are not applicable to that student, and thus not taken into account when calculating percentage score.
 +
Finally, we calculate the average test score for each school, in each subject based on student percentage score.
 +
<br />
 +
[[File:3.3.1.png|800px]]
 +
<br />
 +
Step 2: Joining data tables
 +
Score data and school data tables are joined by matching school ID. Since the granularity of these 2 tables is not the same – for score table, each record represents one student; for school table, each record represents one school – hence the need to aggregate data in score table in Step 1. The aggregated score will then be joined with school data table.
 +
Joining score table and student table is more straightforward, simply by matching school ID and student ID. As the result of this step, student score will be used as measure for their academic performance, and other factors from school and student data tables will be used to help explain the difference in their performance.
 +
<br />
 +
[[File:3.3.2.png|800px]]
 +
Step 3: Data cleaning and standardization
 +
Missing records in PISA data follow a consistent set of labels: ‘I’/’Invalid’, ‘M’/ ‘Missing’ and ‘N’/’No response’. In JMP Pro, we used Standardize Attribute to convert these values into a common null, so that JMP will recognize them all as missing data and process accordingly when building models.
 +
<br />
 +
[[File:3.3.3.png|800px]]
 +
<br />
 +
Upon checking for missing data pattern, some attributes are revealed to have more than 50% null values. These will be excluded from analysis. An example is given below: out of 5,546 records in student table, 3,721 records in Mathematics Interest (Anchored) are missing, so this attribute will not be chosen for analysis.<br />
 +
[[File:3.3.4.png|800px]]
 +
<br />
 +
Lastly, attributes that do not help differentiating the entities in data are also excluded. For example, all attributes regarding “Acculturation” aspects only have ‘N’ values (No response) and these too will be excluded.
 +
 
== Methodology ==
 
== Methodology ==
 
=== Framework of analysis ===
 
=== Framework of analysis ===

Revision as of 13:58, 17 April 2016

G15PISA HOME.png

HOME

 

ABOUT US

 

PROJECT OVERVIEW

 

FINAL DELIVERY

 

PROJECT MANAGEMENT

 

DOCUMENTATION


Dataset

Data Retrieval

The data used in this project is questionnaire result from the latest PISA survey in 2012. All raw data files are publicly available on PISA website (https://pisa2012.acer.edu.au/downloads.php). However, the raw data is in flat file text format, where a fix number of characters represent a value (eg. first 3 letters indicate country code), as follows:

3 1.1.png

The raw data in this form is not ready for cleaning and analysis. PISA database has scripts to convert the raw text data into table forms.

  • Download raw questionnaire results (zipped text files) from PISA 2012 website and extract
  • Retrieve SAS programs for appropriate data files
  • Open the SAS scripts in SAS Enterprise Guide
  • Ensure that the path to raw text files are correct

3.1.2.png

  • Run the programs in SAS Enterprise Guide to get output SAS data table
  • Export the output SAS data table in desired formats (.sas7bdat, .csv and so on); display labels as column names for easy interpretation later on.

Data Extraction

Only Singapore data is of interest for our scope of project, therefore only the records with Country code ‘SGP’ are extracted. This process gives us the following for analysis:

  • School survey results - 172 secondary schools
  • Student survey results - 5,546 records, approx. 35 students per school
  • Student test score in Math, Science, Reading, Computer-based assessment

Each of the 3 mentioned data tables have a rich set of attributes. The summary below shows an overview of aspects that the data covers.
3.2.1.png

Data Preparation

The data preparation follows 3 basic steps.

Step 1: Aggregate student test score The original student test score table records the correct and incorrect answers by each student. We convert these into percentage score of each student by dividing the total number of correct answers by the total number of questions student attempted. Questions that are labeled 7 and 8 are not applicable to that student, and thus not taken into account when calculating percentage score. Finally, we calculate the average test score for each school, in each subject based on student percentage score.
3.3.1.png
Step 2: Joining data tables Score data and school data tables are joined by matching school ID. Since the granularity of these 2 tables is not the same – for score table, each record represents one student; for school table, each record represents one school – hence the need to aggregate data in score table in Step 1. The aggregated score will then be joined with school data table. Joining score table and student table is more straightforward, simply by matching school ID and student ID. As the result of this step, student score will be used as measure for their academic performance, and other factors from school and student data tables will be used to help explain the difference in their performance.
3.3.2.png Step 3: Data cleaning and standardization Missing records in PISA data follow a consistent set of labels: ‘I’/’Invalid’, ‘M’/ ‘Missing’ and ‘N’/’No response’. In JMP Pro, we used Standardize Attribute to convert these values into a common null, so that JMP will recognize them all as missing data and process accordingly when building models.
3.3.3.png
Upon checking for missing data pattern, some attributes are revealed to have more than 50% null values. These will be excluded from analysis. An example is given below: out of 5,546 records in student table, 3,721 records in Mathematics Interest (Anchored) are missing, so this attribute will not be chosen for analysis.
3.3.4.png
Lastly, attributes that do not help differentiating the entities in data are also excluded. For example, all attributes regarding “Acculturation” aspects only have ‘N’ values (No response) and these too will be excluded.

Methodology

Framework of analysis

Techniques of analysis and variable selection

K-means clustering Partition analysis for school profiling Constructing regression model

School-level Findings

Exploratory Data Analysis

Profiling high performance and low performance schools

Multiple regression model

Student-level Findings

Exploratory Data Analysis

Multiple regression model

Discussion

General recommendation

Gaps identified in data for future research efforts

Conclusion