Difference between revisions of "GeoEstate PROPOSAL"

From Geospatial Analytics and Applications

Latest revision as of 23:53, 14 April 2019

GeoEstate logo.png
GeoEstate


Project Description

Our project aims to provide an easy way for end users to predict the resale prices of apartments, condominiums and executive condominiums from inputs such as the postal code, floor area and type of apartment. To achieve this, we use three regression models: geographically weighted regression, spatial autoregression and multiple linear regression. Users can read about the methodology behind each model and select the one that best fits their situation, or simply compare the R-squared values to see which model fits the data points best.

Furthermore, we aim to support extensive exploratory data analysis, letting users see how variables such as the age and floor area (in square feet) of a property correlate with resale prices. Through this, we aim to educate interested consumers and real estate agents alike on what truly matters when pricing resale property.


Project Motivation

How do you know if you are getting a reasonable price for your apartment? Real estate agents have vested interests, so for anyone who wants to be an educated consumer, taking an agent's word for the price of a property may not be enough. Today, websites like PropertyGuru appear to give some indication of which prices are competitive. However, this can be misleading, as a listing is only a snapshot in time.

What if you were able to predict the price of the property you want to sell, or conversely, the dream property you wish to purchase, using masses of data accumulated over past years?

Through our application, we aim to educate consumers on the value of property using rigorous statistical methods.


Storyboard

StoryBoard1.png StoryBoard2.png StoryBoard3.png StoryBoard4.png



Data sources
Data | Source | Data Type/Method
2014 Master Plan Planning Subzone (Web) | Data.gov.sg | SHP
URA Private Residential Property Transactions | Ura.gov.sg | CSV; data was geocoded using Google Geocoding API; postal code was geocoded using OneMap API
Pre-School Locations | Data.gov.sg | KML; converted to Shapefile
Primary/Secondary School Locations | Data.gov.sg | CSV; data was geocoded using OneMap API
MRT/LRT Station Locations | LTA Datamall (Direct Download) | SHP
Supermarket Locations | Data.gov.sg | KML; converted to Shapefile
Shopping Mall Locations | Wikipedia | Text; data was converted to Shapefile after geocoding using OneMap API
Park Locations | Data.gov.sg | KML; converted to Shapefile
Sports Facilities Locations | Data.gov.sg | KML; converted to Shapefile
Hawker Centre Locations | Public food centres: (1) Data.gov.sg. Private food centres: (2) Kopitiam, (3) Koufu, (4) Food Junction, (5) Food Republic | 1: KML, converted to Shapefile. 2 - 5: Text, data scraped from sites and geocoded using OneMap API

Data Transformation


No. Required Data Action
1. Count of facilities

Property prices are affected by the points of interest located nearby, so we needed a way to count the facilities around the property being queried.

To achieve this, we created a buffer of radius 1 km around the property using the gBuffer function from the rgeos package in R. We then used the over function from the sp package to count how many points of interest fell within the buffer.

2. Freehold duration left

Another variable that we felt might affect the prediction was the effective tenure remaining: a property with less than 20 years left would be worth much less than a property with 50 years left.

REALIS provides only the tenure and the building completion year. From these, we determine whether the tenure is 60, 99, 100 or 999 years, then subtract the property's age (2015 - completion year) to get the effective tenure value.
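
The two transformations above can be sketched in a few lines. This is a minimal Python illustration rather than the R workflow the project used (gBuffer and over): it assumes coordinates are already in a projected system measured in metres, and it approximates the buffer test with a straight-line distance check. The function names `count_within_buffer` and `effective_tenure` are hypothetical.

```python
import math


def count_within_buffer(prop, facilities, radius_m=1000.0):
    """Count facilities within radius_m metres of the property.

    Assumes (x, y) coordinates in a projected system measured in metres;
    a distance check stands in for the buffer-and-overlay step.
    """
    return sum(1 for f in facilities if math.dist(prop, f) <= radius_m)


def effective_tenure(tenure_years, completion_year, reference_year=2015):
    """Tenure length minus the age of the building at the reference year."""
    age = reference_year - completion_year
    return tenure_years - age


# Illustrative values only, not taken from the project data.
property_xy = (30000.0, 30000.0)
schools = [(30500.0, 30000.0), (30000.0, 31000.0), (31000.0, 31000.0)]
nearby_schools = count_within_buffer(property_xy, schools)
years_left = effective_tenure(99, 1995)
```

A polygon buffer only approximates a circle, so counts for facilities lying very close to the 1 km boundary may differ slightly between the two approaches.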


Literature Review

1. A Spatial Analysis of House Prices in the Kingdom of Fife, Scotland

(By: Julia Zmölnig, Melanie N Tomintz, Stewart A Fotheringham)

GeoEstate interpolation.jpg

Aim of Study: to analyse the spatial variations in house price adjustments due to economic conditions, and to quantify and describe patterns in the variations of house prices in the study area of Fife, Scotland

Methodology:

Spatial interpolation: using points with known values to estimate values at unknown points. Three main methods were used:

  • Diffusion Interpolation with Boundaries
  • Inverse-distance weighting
  • Deterministic ordinary Kriging (Most accurate)

Learning Points:

  • House price hot spots migrate from year to year, so multiple models are required if the study period spans several years
  • In the study area, the economic downturn led to an increase in property prices despite additional supply from unemployed people

Areas for Improvement:

  • The data lacked information such as the size and type of each property. This could be approximated via interpolation, but it still hurts the accuracy of the model
  • Using a different model, such as Geographically Weighted Regression (GWR), to identify spatial patterns apparent in the study area


2. Statistical analysis of the relationship between public transport accessibility and flat prices in Riga

(By: Dmitry Pavlyuk)

GeoEstate public transport accessibility.JPG

Aim of Study: to examine the relationship between public transport accessibility and residential land value in Riga, Latvia

Methodology:

  • Geographically Weighted Regression (GWR)
  • Global Regression Model

Learning Points:

  • Within the city centre, accessibility has no significant relationship with flat prices, as the city centre is already rich in transport routes and new routes have a diminishing impact
  • For the population with higher income, higher public transport accessibility may lead to lower property prices
  • Overall, GWR performed significantly better than global regression
    • A variable that has no significant relationship in one model might be significant in another. For example, the influence of the first floor on the price was insignificant in the global regression model, but it showed a local dependency in GWR.

Areas for Improvement:

  • Transport, which was the main focus of the study, had only a limited overall impact on prices
  • Manhattan distance could be used to approximate the actual distance travelled, rather than straight-line distance


Methods used

1. Multiple linear regression (MLR)

MLR, or the hedonic price model, assumes that a property is valued by the sum of its characteristics, such as size, neighborhood, accessibility and proximity to facilities. Under this generalization, the pricing model can be simplified to the expression below.

GeoEstate MLR eq.png

We included MLR in our project for two reasons: it is the model traditionally used by the real estate sites we are evaluating against, and we wanted to compare the performance of a non-spatial predictor with spatial models such as GWR and SAR.
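
Since the MLR expression is only available as an image, a generic form of a hedonic regression is sketched here for reference. The notation is assumed and may not match the original figure: P_i is the price of property i, x_{ik} is its k-th characteristic, \beta_k are the coefficients and \varepsilon_i is the error term.

```latex
P_i = \beta_0 + \sum_{k=1}^{K} \beta_k \, x_{ik} + \varepsilon_i
```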


2. Geographically Weighted Regression (GWR)

GWR uses the same regression formula as MLR. However, when estimating the coefficients of the model for a flat i, we assume that flats located near flat i have a higher influence on the coefficients than flats located far from it. In regression terms, flats near point i receive larger weights. The bi-square function is usually used to calculate the weight of point j in the model for point i:

GeoEstate GWR eq.png

where distance ij is the distance from point i to point j. The bandwidth controls how quickly the weight decreases with distance. There are two main approaches to bandwidth selection: fixed and adaptive. A fixed bandwidth is selected once for all data points, while an adaptive bandwidth changes from point to point depending on data density. The adaptive bandwidth is illustrated in the diagram below.

GeoEstate GWR bw.png
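
The weighting scheme described above can be sketched as follows. This assumes the commonly used form of the bi-square kernel, (1 - (d/b)^2)^2 for distances d below the bandwidth b and 0 otherwise, which may differ in notation from the equation image. The adaptive bandwidth is taken here as the distance to the k-th nearest point, which is one common convention and not necessarily the one the project used. All function names are hypothetical.

```python
import math


def bisquare_weight(distance, bandwidth):
    """Bi-square kernel: (1 - (d/b)^2)^2 inside the bandwidth, 0 outside."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    if distance >= bandwidth:
        return 0.0
    ratio = distance / bandwidth
    return (1.0 - ratio ** 2) ** 2


def adaptive_bandwidth(target, points, k):
    """Distance from target to its k-th nearest point (k counts from 1).

    If target is itself in points, it counts as the nearest point (distance 0).
    """
    distances = sorted(math.dist(target, p) for p in points)
    return distances[k - 1]


def gwr_weights(target, points, bandwidth):
    """Weight of every point in the local model fitted at target."""
    return [bisquare_weight(math.dist(target, p), bandwidth) for p in points]


# Illustrative coordinates in metres, not taken from the project data.
flats = [(0.0, 0.0), (300.0, 400.0), (600.0, 800.0), (3000.0, 4000.0)]
target = (0.0, 0.0)
bandwidth = adaptive_bandwidth(target, flats, k=3)
weights = gwr_weights(target, flats, bandwidth)
```

With an adaptive bandwidth defined this way, the k-th nearest point sits exactly on the bandwidth and receives a weight of zero, so dense areas get a small bandwidth and sparse areas a large one.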


3. Spatial Autoregression (SAR)

As defined in Anselin (2001), spatial autocorrelation is the "coincidence of value similarity with locational similarity". Because of this, properties in close proximity to one another tend to have similar transaction prices, which is what the SAR model captures.

First, homeowners in nearby locations tend to follow their neighbors' improvement activities, which results in similar dwelling sizes, vintages, designs and other structural characteristics.

Second, spatial autocorrelation arises from the locational amenities shared by houses in the same neighborhood (Basu and Thibodeau, 1998; Militino et al, 2004), such as schools, hawker centers and shopping malls. Last, real estate agents tend to value houses by referring to neighborhood conditions, a practice which also results in similar housing values in nearby locations.

Overall the formula could be generalized to:

GeoEstate SAR eq.png

Here, Wy represents the influence of the neighboring properties on the final transaction price of the target property.
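
To make the Wy term concrete, the sketch below builds a row-standardised spatial weight matrix W and computes the spatial lag of a vector of prices. It assumes the spatial lag form commonly written as y = ρWy + Xβ + ε, which may differ in notation from the equation image, and it defines neighbors by a distance threshold, which is an illustrative choice rather than the project's documented method. The function names are hypothetical.

```python
import math


def row_standardised_weights(points, threshold):
    """Build W with w_ij = 1/n_i for each of the n_i neighbors j of point i.

    Two points are neighbors if they are at most `threshold` apart.
    A point with no neighbors keeps an all-zero row.
    """
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = [
            j for j in range(n)
            if j != i and math.dist(points[i], points[j]) <= threshold
        ]
        for j in neighbours:
            weights[i][j] = 1.0 / len(neighbours)
    return weights


def spatial_lag(weights, values):
    """Compute Wy: for each point, the weighted average of its neighbors' values."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


# Illustrative coordinates (metres) and prices, not taken from the project data.
properties = [(0.0, 0.0), (0.0, 100.0), (0.0, 200.0), (5000.0, 5000.0)]
prices = [1000.0, 1200.0, 1400.0, 900.0]
W = row_standardised_weights(properties, threshold=150.0)
lagged_prices = spatial_lag(W, prices)
```

With row standardisation, each entry of Wy is simply the mean price of a property's neighbors, so an isolated property such as the last one above has a lag of zero.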


Tools & Technology

GeoEstate tech stack.png

Project Timeline

GeoEstate timeline.jpg

Challenges


No. Key Challenges Mitigation
1. Unfamiliarity with R, its packages and R Shiny
  1. Self-directed learning with online resources such as DataCamp
  2. Browsing community forums (Stack Overflow / discuss.onemap) for help
  3. Reading the official documentation for the various packages
2. Limited number of OneMap API calls for a standard account
  1. Writing an R script to catch timeouts and wait before retrying
  2. Querying OneMap only for distinct records to reduce the number of duplicate requests
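
The two mitigations for the API limit can be illustrated with a short sketch. It is written in Python rather than R and does not call the real OneMap API: `lookup` is a hypothetical callable supplied by the caller that returns a result for a postal code and raises `TimeoutError` on a timeout. The retry count and wait time are placeholder values.

```python
import time


def geocode_unique(postal_codes, lookup, max_retries=3, wait_seconds=1.0):
    """Geocode each distinct postal code once, waiting and retrying on timeouts.

    `lookup` is a hypothetical function (postal code -> result) that raises
    TimeoutError when the service times out. Codes that still fail after
    max_retries attempts are stored as None.
    """
    results = {}
    # dict.fromkeys drops duplicates while keeping first-seen order.
    for code in dict.fromkeys(postal_codes):
        for attempt in range(max_retries):
            try:
                results[code] = lookup(code)
                break
            except TimeoutError:
                if attempt == max_retries - 1:
                    results[code] = None
                else:
                    time.sleep(wait_seconds)
    return results
```

Deduplicating before querying means a postal code shared by many transactions costs only one request, and the results can be joined back to the full table afterwards.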