T15 Final Delivery

From Analytics Practicum

Dataset

Data Retrieval

The data used in this project consists of the questionnaire results from the latest PISA survey, conducted in 2012. All raw data files are publicly available on the PISA website (https://pisa2012.acer.edu.au/downloads.php). However, the raw data comes in flat text files, where a fixed number of characters represents each value (e.g. the first 3 characters indicate the country code), as follows:

3.1.1.png

The raw data in this form is not ready for cleaning and analysis. The PISA database provides SAS programs to convert the raw text data into tabular form:

  • Download the raw questionnaire results (zipped text files) from the PISA 2012 website and extract them
  • Retrieve the SAS programs for the appropriate data files
  • Open the SAS scripts in SAS Enterprise Guide
  • Ensure that the paths to the raw text files are correct

3.1.2.png

  • Run the programs in SAS Enterprise Guide to produce the output SAS data tables
  • Export the output SAS data tables in the desired formats (.sas7bdat, .csv and so on), displaying labels as column names for easier interpretation later on.
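The official route uses the SAS programs above. As an illustrative alternative, the fixed-width layout can also be parsed directly with pandas; the column offsets and names below are made up for illustration only — the real layout comes from the PISA SAS control files.

```python
import io
import pandas as pd

# Toy flat-file content: first 3 characters are the country code,
# the rest are hypothetical fields (not the real PISA layout).
raw = io.StringIO(
    "SGP0001A\n"
    "SGP0002B\n"
    "USA0003A\n"
)

# colspecs: (start, end) character offsets for each field
df = pd.read_fwf(
    raw,
    colspecs=[(0, 3), (3, 7), (7, 8)],
    names=["CNT", "SCHOOLID", "FORM"],
    dtype=str,  # keep leading zeros in IDs
)
```

With the real files, `raw` would be the path to an extracted text file and the colspecs would follow the PISA codebook.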

Data Extraction

Only Singapore data is of interest within the scope of this project; therefore, only the records with country code 'SGP' are extracted. This process gives us the following data for analysis:

  • School survey results - 172 secondary schools
  • Student survey results - 5,546 records, approx. 35 students per school
  • Student test scores in Mathematics, Science, Reading and Computer-based assessment

Each of the 3 data tables mentioned has a rich set of attributes. The summary below gives an overview of the aspects that the data covers.
3.2.1.png
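The extraction step amounts to a simple filter on the country code. A pandas sketch on a toy table (`CNT` and `STIDSTD` follow PISA naming conventions, but the rows here are invented):

```python
import pandas as pd

# Toy student table; the real PISA tables carry the same CNT column.
students = pd.DataFrame({
    "CNT": ["SGP", "SGP", "USA", "FIN"],
    "STIDSTD": ["001", "002", "003", "004"],
})

# keep only Singapore records
sgp = students[students["CNT"] == "SGP"].reset_index(drop=True)
```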

Data Preparation

The data preparation follows 3 basic steps.

Step 1: Aggregate student test scores
The original student test score table records the correct and incorrect answers given by each student. We convert these into a percentage score for each student by dividing the number of correct answers by the total number of questions the student attempted. Questions labeled 7 or 8 are not applicable to that student, and are therefore not taken into account when calculating the percentage score. Finally, we calculate the average test score for each school in each subject, based on the student percentage scores.
3.3.1.png
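Step 1 can be sketched in pandas as follows. The toy table uses 1 for correct, 0 for incorrect, and 7/8 for not-applicable items as described above; the column names are illustrative.

```python
import pandas as pd

# Toy item-response table: 1 = correct, 0 = incorrect, 7/8 = not applicable
scores = pd.DataFrame({
    "SCHOOLID": ["S1", "S1", "S2"],
    "Q1": [1, 0, 1],
    "Q2": [1, 7, 1],
    "Q3": [0, 1, 8],
})

items = scores[["Q1", "Q2", "Q3"]]
attempted = items.isin([0, 1])                     # exclude 7/8 codes
scores["PCT"] = items.where(attempted).sum(axis=1) / attempted.sum(axis=1)

# average percentage score per school
school_avg = scores.groupby("SCHOOLID")["PCT"].mean()
```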
Step 2: Joining data tables
The score data and school data tables are joined by matching school ID. The granularity of these 2 tables is not the same: in the score table each record represents one student, while in the school table each record represents one school; hence the need to aggregate the score table in Step 1. The aggregated scores are then joined with the school data table. Joining the score table and the student table is more straightforward, simply by matching school ID and student ID. As a result of this step, student scores will be used as the measure of academic performance, and the other factors from the school and student data tables will be used to help explain the differences in performance.
3.3.2.png
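A minimal pandas sketch of the joins in Step 2, using toy tables at the two granularities described (one row per school in the school table, one row per student elsewhere; values are invented):

```python
import pandas as pd

school_avg = pd.DataFrame({"SCHOOLID": ["S1", "S2"], "AVG_SCORE": [0.58, 0.91]})
schools    = pd.DataFrame({"SCHOOLID": ["S1", "S2"], "SC_SIZE": [900, 1200]})
students   = pd.DataFrame({"SCHOOLID": ["S1", "S1", "S2"],
                           "STIDSTD": ["001", "002", "003"],
                           "PCT": [0.67, 0.50, 0.91]})

# school-level: aggregated scores joined to school attributes on school ID
school_table = school_avg.merge(schools, on="SCHOOLID")

# student-level: student scores joined to school attributes on school ID
# (the real join also matches student ID across the student tables)
student_table = students.merge(schools, on="SCHOOLID")
```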

Step 3: Data cleaning and standardization
Missing records in the PISA data follow a consistent set of labels: 'I'/'Invalid', 'M'/'Missing' and 'N'/'No response'. In JMP Pro, we used Standardize Attribute to convert these values into a common null, so that JMP recognizes them all as missing data and processes them accordingly when building models.
3.3.3.png
Upon checking the missing-data pattern, some attributes turn out to have more than 50% null values. These are excluded from the analysis. An example is given below: out of 5,546 records in the student table, 3,721 records for Mathematics Interest (Anchored) are missing, so this attribute is not chosen for analysis.
3.3.4.png
Lastly, attributes that do not help differentiate the entities in the data are also excluded. For example, all attributes regarding "Acculturation" only have 'N' values (No response), and these too are excluded.
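The cleaning in Step 3 was done with JMP Pro's Standardize Attribute; an analogous pandas sketch, with invented column names, looks like this:

```python
import numpy as np
import pandas as pd

# Toy table using the PISA missing-value labels described above;
# ACC1 mimics an "Acculturation"-style column with only 'N' responses.
df = pd.DataFrame({
    "ST01": ["A", "M", "B", "N"],
    "ST02": ["I", "Invalid", "C", "Missing"],
    "ACC1": ["N", "N", "N", "N"],
})

MISSING = ["I", "Invalid", "M", "Missing", "N", "No response"]
df = df.replace(MISSING, np.nan)   # convert all labels to a common null

# drop attributes with more than 50% missing values
keep = df.columns[df.isna().mean() <= 0.5]
df = df[keep]
```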

Methodology

Framework of analysis

Understanding that student performance is affected by school, family and personal factors, the analysis is carried out based on this assumption. At the school level, the steps to analyze the data can be summarized as follows:
4.1.1.png
The rationale for this framework is to first gain a good overview of school performance in Singapore, then gradually expand the analysis to consider more factors as we formulate more in-depth business questions, based on an increasingly better understanding of the data and context. The framework is applied iteratively so as to refine the models.

In the first step, the goal is to gain an overview of the academic performance of schools in Singapore, so a distribution analysis is done using school test scores. Thereafter, having identified the spread of performance, we separate the schools into different segments based on their academic results, and study each segment to observe how school factors might differentiate their students' performance. Features that separate high-performing from low-performing schools are identified, and we aim to profile the schools using those features. Finally, a regression model is constructed to explain in greater detail to what extent the factors might cause student performance to change.

At the student level, the framework and techniques applied are largely the same as for schools. However, since the student data is more atomic and also has a richer set of attributes, we use the score as a direct measure to build models, instead of segmenting students by performance as was done with the schools.


4.1.2.png

Techniques of analysis and variable selection

K-means clustering
K-means clustering is used to segment the schools based on their students' performance. To this end, in JMP Pro, we perform Cluster analysis with the average school scores in each subject as response variables. For the k-value, numbers 3-6 are tried. This is because, upon segmenting the schools, we will study the profile of each cluster and ultimately devise recommendations accordingly; hence each cluster should contain a reasonably large number of data points, and too many clusters are not favourable. The configuration in JMP Pro is shown as follows:
4.2.1.1.png
Running the analysis gives us the following clustering results:
Table 1: Results from k-means clustering
4.2.1.2.png
Statistically speaking, k = 4 gives the best result. However, k = 3 is eventually chosen for further analysis because, as shown below, with 4 clusters the lowest-performing schools simply get segmented further, which does not give much additional insight. With k = 3, we already achieve a clear segmentation, and it makes logical sense to consider the schools as low-performing, average and high-performing.
Table 2: Results from k-means clustering (k=3 & k=4)
4.2.1.3.png
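The clustering above was done in JMP Pro; an analogous sketch with scikit-learn's KMeans, on synthetic school scores in the four subjects and assuming k = 3, would be:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic school-level scores (math, reading, science, CBA) in [0, 1],
# drawn as three artificial performance bands — not real PISA values.
rng = np.random.default_rng(0)
low  = rng.normal(0.45, 0.03, size=(20, 4))
mid  = rng.normal(0.60, 0.03, size=(20, 4))
high = rng.normal(0.80, 0.03, size=(20, 4))
X = np.vstack([low, mid, high])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_   # cluster assignment per school
```

In practice, k would be varied over 3-6 and the cluster sizes and means compared, as described above.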


Partition analysis for school profiling
Using the clusters from the previous step as the response variable, we use a decision tree to profile the schools, as below. This analysis splits the data according to school cluster, using the attribute that gives the highest LogWorth, meaning the attribute that has the greatest impact in differentiating the schools.
4.2.2.1.png
To observe the combined effect of all school factors on student performance, we use 225 attributes, all of which have been checked for missing data (attributes with more than 50% missing data are excluded). Running the analysis results in a decision tree with an RSquare value of 0.703 and a total of 20 splits. The details of this decision tree are discussed in a later section.

Table 3: Decision tree analysis results
4.2.2.2.png
Using the tree, it is thus possible to identify the features that differentiate the high-performing and low-performing clusters of schools. The results are discussed in the later sections.
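JMP's Partition platform splits on LogWorth; scikit-learn's decision tree splits on impurity instead, so the sketch below is analogous rather than identical. The attribute names and values are invented for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy school attributes with the k-means cluster label as response.
df = pd.DataFrame({
    "STRATIO":  [12, 14, 11, 25, 28, 30, 18, 19, 20, 17],  # student-teacher ratio (illustrative)
    "SCMATEDU": [ 2,  2,  3, -1, -2, -1,  0,  1,  0,  1],  # resource quality index (illustrative)
    "CLUSTER":  ["high", "high", "high", "low", "low", "low",
                 "avg", "avg", "avg", "avg"],
})

X = df[["STRATIO", "SCMATEDU"]]
y = df["CLUSTER"]

# shallow tree: each split surfaces the most discriminating attribute
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
acc = tree.score(X, y)
```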



Constructing regression model
Creating dummy variables for categorical attributes
For regression analysis to account for categorical attributes, they need to be converted into numeric form. Partition analysis can be used to generate dummy variables for this purpose. Using the school cluster (for school-level analysis) and the student score (for student-level analysis) as response variables, Partition analysis generates dummy variables for the selected categorical attributes such that the difference between nodes is greatest.
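The report generates its dummies via Partition analysis in JMP; a plainer alternative, shown here only for illustration, is one-hot encoding with pandas (the column name is made up):

```python
import pandas as pd

# Toy categorical school attribute
schools = pd.DataFrame({"SC_LOCATION": ["City", "Town", "City", "Rural"]})

# one dummy (0/1) column per category
dummies = pd.get_dummies(schools["SC_LOCATION"], prefix="LOC")
```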
School-level
There are several regression models to be built. For analyzing performance at the school level, the following configuration is used in JMP:
4.2.3.1.png
The response variable in this case is the overall mean score of each school, but we extend the analysis to cover all subject scores: Mathematics, Reading, Science and Computer-based assessment. The purpose is to compare the impact, if any, that the various factors have on students' performance in the different subjects. We will also be able to compare the extent of the impact and how relatively important the factors are. The type of regression used is Stepwise, with a p-value threshold as the stopping rule. This allows us to interactively refine the regression model by adding or removing attributes as necessary. Attributes selected to build the regression model should have a p-value (Prob>F) of 0.01 or less.
4.2.3.2.png
Student-level
At the student level of analysis, the subject scores are used as response variables, and the factors used to build the regression models are the numeric attributes plus the dummy variables for categorical attributes generated in the previous section. The stepwise configuration for student-level regression analysis is as follows:
4.2.4.1.png
Similar to the school-level analysis, we use a p-value threshold as the stopping rule. Variables selected for the models must have a p-value (Prob>F) less than 0.01.
4.2.4.2.png
Using the standard techniques discussed above, we build on the student-level analysis by separating students in low-performing schools from those in high-performing schools, and building separate regression models for each group. This is to discover in more depth what constitutes the difference in performance between students at the individual level, separate from the school factors. We have also noticed that some students from low-performing schools still perform reasonably well; this group of students needs to be studied more carefully. Understanding how personal factors can overcome environmental factors when the environment is less than ideal could be the key to improving the performance of students in Singapore.

School-level Findings

Exploratory Data Analysis

Profiling high performance and low performance schools

Multiple regression model

Student-level Findings

Exploratory Data Analysis

Multiple regression model

Discussion

General recommendation

Gaps identified in data for future research efforts

Conclusion