ANLY482 AY2017-18T2 Group06 Project Overview

Proprietary trading has long relied on computers to help automate and execute trades. Data scientists, more commonly known on Wall Street as quants, have developed huge statistical models for the purpose of this automation. These models, though complex, are largely static; as the market changes, a regular occurrence in financial markets, they no longer perform as well as they once did.


As technology advances, we are entering an era of artificial intelligence and machine learning. Systems can now analyse large amounts of data at enormous speed and improve themselves in the process. Evolutionary computation and deep learning are expected to recognise changes in the market automatically and adapt in ways that earlier statistical models could not.



 

MOTIVATION

The team's motivation for this project is primarily an interest in undertaking a challenging project in an area of research that has become a hot topic in the finance industry: algorithm-centric funds. The opportunity to learn and put into practice an area of machine learning not covered in our academic curriculum was appealing. Algorithm-centric funds are expected to take on a huge role in automated trading, causing a notable shift in the trading markets. The opportunity pH7 Global has given us allows us to tap into their expertise in trading financial instruments and the existing market data they have collected, giving us a whole new experience of applying analytics to the financial markets.

 

OBJECTIVES

Utilising the minute tick data from our sponsor, we would like to discover useful and practical insights that will allow traders to make more informed trading decisions. We will also build a predictive model for the currency pair.

The team and our sponsor pH7 Global have identified two areas of focus for this project:

1. Preliminary Data Analysis and Information Research
2. Predictive Algorithm Modeling and Strategy Testing

At the end of the project, the team aims to design a unique predictive model based on the data insights discovered during the analysis.

METHODOLOGY

Our methodology is a five-step approach to predictive and explanatory modelling for the USD/JPY 1-minute chart.

Exploratory Segment

1. Data Collection
In the initial data collection phase, we must ensure that we have all the fields needed for modelling in the later stages.

2. Data Cleaning + Transformation
In the data cleaning and transformation phase, the data is transformed into the statistical and analytical features needed for prediction in the later stages.

3. Initial Data Exploration
In this phase, the data is explored and the modelling approach is determined based on the nature of the dataset. Necessary preparations, such as checking the variables for multicollinearity, are carried out before any modelling is done. Given the nature of our dataset, careful data exploration is essential. A sketch of such a check is shown below.
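As an illustration, a minimal R sketch of a multicollinearity check, assuming the minute data has been loaded into a data frame named ohlc with numeric Open, High, Low and Close columns (the name and layout are our assumptions, not the sponsor's actual schema):

<pre>
# Pairwise correlations among candidate predictors; pairs with
# |r| close to 1 flag potential multicollinearity.
num_cols <- ohlc[, c("Open", "High", "Low", "Close")]
round(cor(num_cols, use = "complete.obs"), 3)
</pre>

For raw price columns these correlations will naturally sit close to 1, which is one reason the transformation step above matters before modelling.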

Iterative Segment

4. Model Building
In this phase we create the model, determining the predictor and target variables. We experiment with multiple approaches based on our initial understanding of the dataset after the exploration, ranging from visualisations to machine learning algorithms, to achieve our client's objectives. One possible framing is sketched below.
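For instance, a minimal R sketch of one way to frame a target and predictors, reusing the ohlc data frame from the previous sketch; the lagged-return framing here is our illustrative assumption, not the final model:

<pre>
# Supervised framing: predict the next minute's close from the
# current close and two lagged one-minute returns.
ret <- c(NA, diff(ohlc$Close) / head(ohlc$Close, -1))
model_df <- data.frame(
  target = c(ohlc$Close[-1], NA),      # next minute's close
  close  = ohlc$Close,
  ret_1  = ret,                        # return over the last minute
  ret_2  = c(NA, head(ret, -1))        # return one minute earlier
)
model_df <- na.omit(model_df)
fit <- lm(target ~ close + ret_1 + ret_2, data = model_df)
summary(fit)
</pre>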

5. Model Validation
We propose to sample the data in several ways in order to validate our model, using the three-way approach to model validation known as “train, test and validate”. Given the nature of the project, we want to avoid overfitting and bias in our models, so we aim for a rigorous testing process with a larger sample size to avoid such issues. A sketch of the split follows.
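Because the data is a time series, the split should be chronological rather than random to avoid look-ahead bias; a minimal R sketch, where the 70/15/15 proportions are our assumption:

<pre>
# Chronological train/validate/test split on the framed data.
n         <- nrow(model_df)
train_end <- floor(0.70 * n)
valid_end <- floor(0.85 * n)
train <- model_df[1:train_end, ]
valid <- model_df[(train_end + 1):valid_end, ]
test  <- model_df[(valid_end + 1):n, ]
</pre>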

We will also use benchmark metrics to test whether the predictive model is satisfactory. Should it fall short, we will go back to phase 4 (model building) or phase 2 (data cleaning and transformation) and rebuild the model until the results are satisfactory.
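A minimal R sketch of common benchmark metrics for this check, assuming vectors actual and predicted on the held-out set (the variable names are ours):

<pre>
# Standard point-forecast error metrics on held-out data.
err  <- actual - predicted
rmse <- sqrt(mean(err^2))               # root mean squared error
mae  <- mean(abs(err))                  # mean absolute error
mape <- mean(abs(err / actual)) * 100   # mean absolute percentage error
c(RMSE = rmse, MAE = mae, MAPE = mape)
</pre>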


DATA

Data Source

The dataset given to us consists of time series data for the same period at multiple timeframes, covering a two-year period: 1 July 2015 to 30 June 2017.

The data fields include:
- Timestamp (timestamp of the observation)
- High (highest price of the currency pair for the minute)
- Low (lowest price of the currency pair for the minute)
- Open (opening price of the currency pair for the minute)
- Close (closing price of the currency pair for the minute)

Data1.PNG

To access our client's database, we used R scripts in RStudio to connect directly to the AWS servers and retrieve the data as needed for our analysis. This gave us the flexibility to choose the time periods we wanted to work with. A sketch of this retrieval is shown below.
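A minimal R sketch of this kind of retrieval, assuming a MySQL-compatible database; the host, credentials, table and column names are placeholders, not the sponsor's actual setup:

<pre>
library(DBI)
library(RMySQL)

# Connect to the client's database on AWS (placeholder credentials).
con <- dbConnect(RMySQL::MySQL(),
                 host     = "example.rds.amazonaws.com",
                 dbname   = "forex",
                 user     = "analyst",
                 password = Sys.getenv("DB_PASSWORD"))

# Pull one currency pair for a chosen time window.
ohlc <- dbGetQuery(con, "
  SELECT Timestamp, Open, High, Low, Close
  FROM usdjpy_m1
  WHERE Timestamp BETWEEN '2015-07-01' AND '2017-06-30'
  ORDER BY Timestamp")
dbDisconnect(con)
</pre>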

The resulting retrieval of two years' worth of minute tick data for one currency pair comes to close to 750,000 rows.

Data2.PNG

Below are initial visualisations of two sets of the data in Tableau, using the original data without any transformation:

Data3.PNG
Data4.PNG

Initial observations revealed gaps in the dataset, which turned out to be the weekend periods when the market is closed. Future analysis of the data will take this into consideration; a sketch of filtering these periods out follows.
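Where these gaps need to be handled explicitly, a minimal R sketch of dropping the weekend rows, assuming the ohlc data frame from the retrieval sketch with a POSIXct Timestamp (a rough approximation, since the FX weekend actually runs from Friday evening to Sunday evening):

<pre>
# POSIXlt wday: 0 = Sunday, 6 = Saturday (locale-independent).
ts <- as.POSIXlt(ohlc$Timestamp)
ohlc_open <- ohlc[!(ts$wday %in% c(0, 6)), ]
</pre>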

Visualising the minute tick data is restricted to time periods of at most one month due to the volume of data. The result of this visualisation is shown below:

Data5.PNG

Data Cleaning and Preparation

For our data cleaning and preparation, we used the following software to visualise the data and perform ETL into other forms:

1. JMP Pro
2. Tableau
3. SQL Server Data Tools 2015 (MSSQL)
4. Microsoft Excel

Through the visualisations seen earlier in the report, we realised that data transformation was needed before all the data could be visualised. The data was therefore prepared in MSSQL to produce a day-aggregated dataset for analysis at the daily level; an equivalent aggregation is sketched below.
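The team performed this aggregation in MSSQL; as an illustration, a minimal equivalent sketch in R, assuming the ohlc data frame above, ordered by Timestamp:

<pre>
# Aggregate minute bars into daily bars: first open, highest high,
# lowest low and last close per calendar day.
ohlc$Day <- as.Date(ohlc$Timestamp)
daily <- do.call(rbind, lapply(split(ohlc, ohlc$Day), function(d) {
  data.frame(Day   = d$Day[1],
             Open  = d$Open[1],
             High  = max(d$High),
             Low   = min(d$Low),
             Close = d$Close[nrow(d)])
}))
</pre>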

Our initial methodology was to use clustering to identify clusters of currency pairs that could be treated as baskets for investment. As the price levels of USDJPY and the other pairs differ vastly, the values had to be transformed into percentage changes and standard deviations for clustering; a sketch of this transformation follows.
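A minimal R sketch of that transformation on the day-aggregated data, assuming the daily data frame above:

<pre>
# Daily percentage change and its standard deviation, which make
# pairs with very different price levels comparable.
pct_change <- 100 * diff(daily$Close) / head(daily$Close, -1)
sd_pct     <- sd(pct_change)
</pre>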

However, our client does not have 15-20 or more currency pairs in their database, so we will focus on forecasting with these five currency pairs instead. The team chose the ARIMA forecasting method, under which this transformation is not required, as the ARIMA model applies its own differencing to the data. A sketch of such a fit is shown below.
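A minimal R sketch of an ARIMA fit on the daily closes using the forecast package; the 30-day horizon is our illustrative choice:

<pre>
library(forecast)

# auto.arima selects the (p, d, q) orders, including the
# differencing step, automatically.
fit <- auto.arima(daily$Close)
fc  <- forecast(fit, h = 30)   # forecast 30 days ahead
plot(fc)
</pre>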

The image below shows the result of our first data transformation.

Data5.PNG

We performed further data transformations to allow us to visualise the data differently and derive new insights:

Data5.PNG
Data5.PNG

As seen in the visualisations of the same currency pair and time period, we can observe the trends and price movements of USDJPY across the entire two-year period.

This provided additional discoveries: we observed significant shifts in price movements and their variation throughout the time period. It also allows us to compare multiple currency pairs visually to spot prominent similarities and trends between them. Although nothing of significance was identified through these visualisation charts, we could identify time periods where increasing the granularity of the data points would allow deeper analysis for our forecasting.


 


SCOPE OF WORK

We intend to adopt the following steps in our analysis:

• Discover insights within the provided data
• Collect the currency-pair data and ensure it is relevant for modelling
• Ensure the accuracy of the data by checking for multicollinearity during the data exploration stage
• Identify approaches, ranging from visualisation to machine learning algorithms, to determine predictor and target variables
• Validate the model through the “train, test and validate” approach
• Use a large data sample to prevent overfitting and bias in our model
• Design a unique predictive model
• Utilise benchmark metrics to test the success rate of the predictive model

It is important to note that the scope of the project is versatile and can be extended to address additional questions pH7 might have about the dataset.
