Difference between revisions of "T15 Final Delivery"

From Analytics Practicum
Jump to navigation Jump to search
Line 84: Line 84:
 
== Methodology ==
 
== Methodology ==
 
=== Framework of analysis ===
 
=== Framework of analysis ===
 +
Understanding that student performance is affected by a school, family and personal factors, the analysis is carried out based on this assumption. At the school level, the steps to analyze data can be summarized as such:<br />
 +
[[File:4.1.1.png|800px]]<br />
 +
The rationale of such a framework is to first have a good overview of school performance in Singapore, then slowly expanding the analysis to consider more factors as we formulate more in-depth business questions, based on an increasingly better understanding of the data and context. The framework is applied iteratively so as to refine the models.
 +
 +
In the first step, the goal is to have an overview of academic performance of schools in Singapore, hence distribution analysis is done using school test scores. Thereafter, having identified the spread of performance, we separate schools into different segments, based on their academic results, and study each of the segments to observe how school factors might differentiate their students’ performance. Features that separate high-performing from low-performing schools are to be identified, and we aim to profile the schools using those features. Finally, a regression model is constructed to explain in greater details to what extent the factors might cause student performance to change.
 +
 +
At the student level, the framework and techniques applied are largely the same as with schools. However, understanding that student data is more atomic and also has a richer set of attribute, we use score as the direct measure to build models, instead of segmenting students based on their performance as done with schools.
 +
 +
<br />
 +
[[File:4.1.2.png|800px]]<br />
 +
 
=== Techniques of analysis and variable selection ===
 
=== Techniques of analysis and variable selection ===
 
'''K-means clustering'''
 
'''K-means clustering'''

Revision as of 14:02, 17 April 2016

G15PISA HOME.png

HOME

 

ABOUT US

 

PROJECT OVERVIEW

 

FINAL DELIVERY

 

PROJECT MANAGEMENT

 

DOCUMENTATION


Dataset

Data Retrieval

The data used in this project is questionnaire result from the latest PISA survey in 2012. All raw data files are publicly available on PISA website (https://pisa2012.acer.edu.au/downloads.php). However, the raw data is in flat file text format, where a fix number of characters represent a value (eg. first 3 letters indicate country code), as follows:

3 1.1.png

The raw data in this form is not ready for cleaning and analysis. PISA database has scripts to convert the raw text data into table forms.

  • Download raw questionnaire results (zipped text files) from PISA 2012 website and extract
  • Retrieve SAS programs for appropriate data files
  • Open the SAS scripts in SAS Enterprise Guide
  • Ensure that the path to raw text files are correct

3.1.2.png

  • Run the programs in SAS Enterprise Guide to get output SAS data table
  • Export the output SAS data table in desired formats (.sas7bdat, .csv and so on); display labels as column names for easy interpretation later on.

Data Extraction

Only Singapore data is of interest for our scope of project, therefore only the records with Country code ‘SGP’ are extracted. This process gives us the following for analysis:

  • School survey results - 172 secondary schools
  • Student survey results - 5,546 records, approx. 35 students per school
  • Student test score in Math, Science, Reading, Computer-based assessment

Each of the 3 mentioned data tables have a rich set of attributes. The summary below shows an overview of aspects that the data covers.
3.2.1.png

Data Preparation

The data preparation follows 3 basic steps.

Step 1: Aggregate student test score The original student test score table records the correct and incorrect answers by each student. We convert these into percentage score of each student by dividing the total number of correct answers by the total number of questions student attempted. Questions that are labeled 7 and 8 are not applicable to that student, and thus not taken into account when calculating percentage score. Finally, we calculate the average test score for each school, in each subject based on student percentage score.
3.3.1.png
Step 2: Joining data tables Score data and school data tables are joined by matching school ID. Since the granularity of these 2 tables is not the same – for score table, each record represents one student; for school table, each record represents one school – hence the need to aggregate data in score table in Step 1. The aggregated score will then be joined with school data table. Joining score table and student table is more straightforward, simply by matching school ID and student ID. As the result of this step, student score will be used as measure for their academic performance, and other factors from school and student data tables will be used to help explain the difference in their performance.
3.3.2.png

Step 3: Data cleaning and standardization Missing records in PISA data follow a consistent set of labels: ‘I’/’Invalid’, ‘M’/ ‘Missing’ and ‘N’/’No response’. In JMP Pro, we used Standardize Attribute to convert these values into a common null, so that JMP will recognize them all as missing data and process accordingly when building models.
3.3.3.png
Upon checking for missing data pattern, some attributes are revealed to have more than 50% null values. These will be excluded from analysis. An example is given below: out of 5,546 records in student table, 3,721 records in Mathematics Interest (Anchored) are missing, so this attribute will not be chosen for analysis.
3.3.4.png
Lastly, attributes that do not help differentiating the entities in data are also excluded. For example, all attributes regarding “Acculturation” aspects only have ‘N’ values (No response) and these too will be excluded.

Methodology

Framework of analysis

Understanding that student performance is affected by a school, family and personal factors, the analysis is carried out based on this assumption. At the school level, the steps to analyze data can be summarized as such:
4.1.1.png
The rationale of such a framework is to first have a good overview of school performance in Singapore, then slowly expanding the analysis to consider more factors as we formulate more in-depth business questions, based on an increasingly better understanding of the data and context. The framework is applied iteratively so as to refine the models.

In the first step, the goal is to have an overview of academic performance of schools in Singapore, hence distribution analysis is done using school test scores. Thereafter, having identified the spread of performance, we separate schools into different segments, based on their academic results, and study each of the segments to observe how school factors might differentiate their students’ performance. Features that separate high-performing from low-performing schools are to be identified, and we aim to profile the schools using those features. Finally, a regression model is constructed to explain in greater details to what extent the factors might cause student performance to change.

At the student level, the framework and techniques applied are largely the same as with schools. However, understanding that student data is more atomic and also has a richer set of attribute, we use score as the direct measure to build models, instead of segmenting students based on their performance as done with schools.


4.1.2.png

Techniques of analysis and variable selection

K-means clustering Partition analysis for school profiling Constructing regression model

School-level Findings

Exploratory Data Analysis

Profiling high performance and low performance schools

Multiple regression model

Student-level Findings

Exploratory Data Analysis

Multiple regression model

Discussion

General recommendation

Gaps identified in data for future research efforts

Conclusion