Difference between revisions of "Group09 Proposal"

From Visual Analytics and Applications
(Created page with "<div style=background:#565656 border:#565656> 350px <font size = 6;text-align:center; color="#FFFFFF"> Visualization on Stock Market Data </font> </div...")
 
Line 12: Line 12:
 
;
 
;
  
[[Group09_Poster| <font color="#FFFFFF">PosterData Preparation</font>]]
+
[[Group09_Poster| <font color="#FFFFFF">PosterData</font>]]
  
 
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#565656; text-align:center;" width="25%" |  
 
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#565656; text-align:center;" width="25%" |  
Line 29: Line 29:
 
<br/>
 
<br/>
  
<font size="5">'''To be a Visual Detective'''</font>
+
<font size="5">'''Visualization on Stock Market Data'''</font>
  
The assignments require you to apply the concepts, methods and techniques you have learned in class to solve real-world problems using visual analytics techniques. Students should also use the assignments to gain hands-on experience with the data visualisation toolkits I have shared with you to complete the assignment.
 
 
=Overview=
 
 
Air pollution is an important risk factor for health in Europe and worldwide. A recent review of the global burden of disease showed that it is one of the top ten risk factors for health globally.  Worldwide an estimated 7 million people died prematurely because of pollution; in the European Union (EU) 400,000 people suffer a premature death. The Organisation for Economic Cooperation and Development (OECD) predicts that in 2050 outdoor air pollution will be the top cause of environmentally related deaths worldwide. In addition, air pollution has also been classified as a leading environmental cause of cancer. 
 
 
Air quality in Bulgaria is a big concern: measurements show that citizens all over the country breathe in air that is considered harmful to health. For example, concentrations of PM2.5 and PM10 are much higher than what the EU and the World Health Organization (WHO) have set to protect health.
 
 
Bulgaria had the highest PM2.5 concentrations of all EU-28 member states in urban areas over a three-year average. For PM10, Bulgaria also leads the most polluted countries, with a daily mean concentration of 77 μg/m3 (the EU limit value is 50 μg/m3).
 
 
According to the WHO, 60 percent of the urban population in Bulgaria is exposed to dangerous (unhealthy) levels of particulate matter (PM10).
 
[[File:WechatIMGd.png|thumb]]
 
 
=The Task=
 
 
In this assignment, you are required to use a visual analytics approach to reveal spatio-temporal patterns of air quality in Sofia City and to identify issues of concern.
 
 
==Task 1: Spatio-temporal Analysis of Official Air Quality==
 
 
Characterize the past and most recent situation with respect to air quality measures in Sofia City. What does a typical day look like for Sofia city? Do you see any trends of possible interest in this investigation?  What anomalies do you find in the official air quality dataset? How do these affect your analysis of potential problems to the environment? 
 
 
Your submission for this question should contain no more than 10 images and 1000 words.
 
 
==Task 2: Spatio-temporal Analysis of Citizen Science Air Quality Measurements ==
 
 
Using appropriate data visualisation, you are required to answer the following types of questions:
 
 
* Characterize the sensors’ coverage, performance and operation. Are they well distributed over the entire city? Are they all working properly at all times? Can you detect any unexpected behaviors of the sensors through analyzing the readings they capture? Limit your response to no more than 4 images and 600 words.
 
* Now turn your attention to the air pollution measurements themselves.  Which part of the city shows relatively higher readings than others?  Are these differences time dependent? Limit your response to no more than 6 images and 800 words.
 
 
==Task 3==
 
 
Urban air pollution is a complex issue.  There are many factors affecting the air quality of a city.  Some of the possible causes are:
 
 
* Local energy sources.  For example, according to [http://unmaskmycity.org/project/sofia/ Unmask My City], a global initiative by doctors, nurses, public health practitioners, and allied health professionals dedicated to improving air quality and reducing emissions in our cities, Bulgaria’s main sources of PM10, and fine particle pollution PM2.5 (particles 2.5 microns or smaller) are household burning of fossil fuels or biomass, and transport. 
 
* Local meteorology such as temperature, pressure, rainfall, humidity, wind etc
 
* Local topography
 
* Complex interactions between local topography and meteorological characteristics.
 
* Transboundary pollution for example the haze that intruded into Singapore from our neighbours.
 
 
In this third task, you are required to reveal the relationships between the factors mentioned above and the air quality measure detected in Task 1 and Task 2.  Limit your response to no more than 5 images and 600 words. 
 
 
 
=The Data Sets=
 
 
Four major data sets in zipped file format are provided for this assignment, they are:
 
 
* Official air quality measurements (5 stations in the city)(EEA Data.zip) – as per EU guidelines on air quality monitoring see the data description [https://drive.google.com/file/d/1v5yCL-LdriDwa65qXPbFL7b0tydylDlb/view HERE…]
 
* Citizen science air quality measurements (Air Tube.zip) , incl. temperature, humidity and pressure (many stations) and topography (gridded data).
 
* Meteorological measurements (1 station)(METEO-data.zip): Temperature; Humidity; Wind speed; Pressure; Rainfall; Visibility
 
* Topography data (TOPO-DATA)
 
 
They can be downloaded by clicking on this [https://storage.cloud.google.com/global-datathon-2018/sofia-air/air-sofia.zip link].
 
 
 
=Visualisation Software=
 
 
To perform the visual analysis, students are encouraged to explore any one or a combination of the following software:
 
*Tableau
 
*JMP Pro
 
*Qlik Sense
 
*Microsoft Power BI
 
 
One of the goals of this assignment is for you to learn to use and evaluate the effectiveness of these visual analytics tools.
 
 
 
=Submission details=
 
 
This is an individual assignment. You are required to work on the assignment and prepare submission individually. Your completed assignment is due on '''18th November 2018, by 11.59pm mid-night'''.
 
 
You need to edit your assignment in the appropriate wiki page of the Assignment Dropbox. The title of the wiki page should be in the form of: ISSS608_2018-19_T1_Assign_FullName.
 
 
The assignment wiki page should include the URL link to the web-based interactive data visualization system prepared.
 
  
 +
= Background =
 +
An interesting study by researchers Emre Soyer and Robin Hogarth posed the same question about a dataset to three groups of economists. The results show that the group given only the graph performed better on the question than the group given the dataset and a standard statistical analysis of the data. This result suggests that data visualization provides context and an accurate representation of the numbers, and helps the user extract important information from data quickly and efficiently.
 +
In our experience, when we access a trading platform to invest, we need to deal with a large amount of price data to make an investment decision, including the opening price, the closing price, and the highest and lowest prices of the day. The K line (candlestick chart) is the most popular chart of stock data for investors to refer to. However, a new investor who is not experienced with or sensitive to financial data may be distracted by the various price data and be unable to make appropriate financial decisions.
 +
Therefore, visualization of stock market data is useful for technical stock market analysis and helps investors gain a comprehensive understanding of how the stock market is changing, which leads to our analysis objective for this project.
 +
= Project Objective =
 +
This project will focus on visualizing stock market data from several perspectives.
 +
Firstly, we will apply various visualization techniques, such as scatter plots and treemaps, to derive useful and interesting insights about the IPO companies. We will also map the companies and visualize their characteristics geographically, for example by showing companies by market share in different regions.
 +
Secondly, to discover the trend for each issuer and find the best model to predict its future stock values, we will perform time-series analysis in R on a sequential dataset. Visualizing our time-series data also enables us to make inferences about important components such as trend (a long-term increase or decrease) and seasonality (a pattern that repeats over a fixed period of time).
 +
= Data Source =
 +
The data set we are using consists of many attributes, but we will mainly focus on the following characteristics in our analysis.
 +
{| class="wikitable"
 +
|-
 +
| Company name || Indicates the issuer name of the stock
 +
|-
 +
| Stock Code || Indicates the stock symbol
 +
|-
 +
| Market type || Indicates the market of the issuer (Growth market, Shang Hai, Shen Zhen)
 +
|-
 +
| IPO date || Indicates the IPO date of the issuer
 +
|-
 +
| Total capital stock || Indicates the number of common and preferred shares that a company is authorized to issue
 +
|-
 +
| Region || Indicates the location of the issuer
 +
|-
 +
| IPO price || Indicates the share price on the IPO date
 +
|-
 +
| Total No of shares issued || Indicates the total number of shares issued on the IPO date
 +
|-
 +
| IPO value || Equals the company's IPO price multiplied by the total number of shares issued
 +
|}
  
=Reference=
+
= Methodology =
 +
Histogram:
 +
A histogram is an accurate graphical representation of the distribution of numeric data. The variable is cut into several bins, and the number of observations per bin is represented by the height of the bar. The shape of the histogram can differ according to the number of bins you set.
 +
We will visualize the distribution of the IPO value by market type (Growth market, Shang Hai and Shen Zhen), and users can interact with the chart to customize their own visualization results.
 +
Treemap:
 +
A treemap displays hierarchical data as a set of nested rectangles. Each branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing sub-branches. A leaf node’s rectangle has an area proportional to a specified dimension of the data. It is an alternative way of visualizing the structure of a tree diagram.
 +
We will use the treemap package in R to visualize the data.
 +
Scatter Plot:
 +
A scatter plot displays the values of two sets of data on two dimensions. Each dot represents an observation. The position on the X (horizontal) and Y (vertical) axes represents the values of the two variables. It is useful for studying the relationship between the two variables. It is common to provide even more information using colors or shapes (to show groups, or a third variable). It is also possible to map another variable to the size of each dot, which makes a bubble plot. If you have many dots and struggle with overplotting, consider using a 2D density plot.
 +
Time-series Analysis:
 +
In R, we can use the ts() function to convert a numeric vector into an R time series object. The format is ts(vector, start=, end=, frequency=), where start and end are the times of the first and last observations and frequency is the number of observations per unit time (1=annual, 4=quarterly, 12=monthly, etc.).
 +
Also, the forecast package provides functions for the automatic selection of exponential smoothing and ARIMA models. The ets() function supports both additive and multiplicative models. The auto.arima() function can handle both seasonal and nonseasonal ARIMA models. Models are chosen to optimize one of several fit criteria.
 +
Portfolio Analysis:
 +
The PerformanceAnalytics package consolidates functions to compute many of the most widely used performance metrics. tidyquant integrates this functionality so it can be used at scale using the split, apply, combine framework within the tidyverse. Two primary functions integrate the performance analysis functionality:
 +
tq_performance implements the performance analysis functions in a tidy way, enabling scaling analysis using the split, apply, combine framework.
 +
tq_portfolio provides a useful tool set for aggregating a group of individual asset returns into one or many portfolios.
  
* [https://wiki.smu.edu.sg/1617t1ISSS608g1/ISSS608_2016-17_T1_Assign3_Ong_Han_Ying Dino Holmes Series]
+
= Future Work =
* [https://wiki.smu.edu.sg/1617t3isss608g1/ISSS608_2016-17_T3_Assign_GUAN_YIFEI Mystery at the Wildlife Preserve]
+
Because of data limitations, we are only able to extract stock data from SZSE and SSE. Therefore, a lack of representativeness will be the main shortcoming of our analysis. In the future, we will extend our analysis to foreign stock markets such as NYSE and LSE.
* [https://wiki.smu.edu.sg/1718t1isss608g1/ISSS608_2017-18_T1_Assign_RACHEL_TONG Sickness in SmartPolis]
+
Also, to validate our analysis, we need to test our visualization together with professionals who use technical chart analysis, then modify our analysis and apply it to real financial market decision making.
* [https://wiki.smu.edu.sg/1718t3isss608/ISSS608_2017-18_T3_Assign_Tan_Yong_Ying Suspense at the Wildlife Preserve]
 

Revision as of 11:37, 22 November 2018

Visualization on Stock Market Data


Background

An interesting study by researchers Emre Soyer and Robin Hogarth posed the same question about a dataset to three groups of economists. The results show that the group given only the graph performed better on the question than the group given the dataset and a standard statistical analysis of the data. This result suggests that data visualization provides context and an accurate representation of the numbers, and helps the user extract important information from data quickly and efficiently.

In our experience, when we access a trading platform to invest, we need to deal with a large amount of price data to make an investment decision, including the opening price, the closing price, and the highest and lowest prices of the day. The K line (candlestick chart) is the most popular chart of stock data for investors to refer to. However, a new investor who is not experienced with or sensitive to financial data may be distracted by the various price data and be unable to make appropriate financial decisions.

Therefore, visualization of stock market data is useful for technical stock market analysis and helps investors gain a comprehensive understanding of how the stock market is changing, which leads to our analysis objective for this project.

Project Objective

This project will focus on visualizing stock market data from several perspectives.

Firstly, we will apply various visualization techniques, such as scatter plots and treemaps, to derive useful and interesting insights about the IPO companies. We will also map the companies and visualize their characteristics geographically, for example by showing companies by market share in different regions.

Secondly, to discover the trend for each issuer and find the best model to predict its future stock values, we will perform time-series analysis in R on a sequential dataset. Visualizing our time-series data also enables us to make inferences about important components such as trend (a long-term increase or decrease) and seasonality (a pattern that repeats over a fixed period of time).
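To make the trend and seasonality components concrete, here is a minimal sketch of a classical additive decomposition. It is written in plain Python for illustration rather than with the R tooling the project plans to use, and the function names, the odd-window restriction and the toy price series are all assumptions of this sketch, not part of the project data.

```python
def centered_moving_average(values, window):
    """Estimate the trend with a centered moving average.

    The window must be odd so that it is centered on each point.
    Positions too close to either edge get None.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    trend = [None] * len(values)
    for i in range(half, len(values) - half):
        trend[i] = sum(values[i - half:i + half + 1]) / window
    return trend


def seasonal_component(values, trend, period):
    """Average the detrended values at each position of the cycle."""
    sums = [0.0] * period
    counts = [0] * period
    for i, (value, level) in enumerate(zip(values, trend)):
        if level is None:
            continue  # no trend estimate at the edges
        sums[i % period] += value - level
        counts[i % period] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


# Hypothetical series: rises by 1 per step, plus a repeating 3-step pattern.
pattern = [1.0, 0.0, -1.0]
prices = [float(i) + pattern[i % 3] for i in range(9)]

trend = centered_moving_average(prices, window=3)
seasonal = seasonal_component(prices, trend, period=3)
```

On this toy series the moving average recovers the steady rise, and the per-position averages recover the repeating pattern. Real stock prices are far noisier, so the sketch only illustrates what the two components mean.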

Data Source

The data set we are using consists of many attributes, but we will mainly focus on the following characteristics in our analysis.

{| class="wikitable"
|-
| Company name || Indicates the issuer name of the stock
|-
| Stock Code || Indicates the stock symbol
|-
| Market type || Indicates the market of the issuer (Growth market, Shang Hai, Shen Zhen)
|-
| IPO date || Indicates the IPO date of the issuer
|-
| Total capital stock || Indicates the number of common and preferred shares that a company is authorized to issue
|-
| Region || Indicates the location of the issuer
|-
| IPO price || Indicates the share price on the IPO date
|-
| Total No of shares issued || Indicates the total number of shares issued on the IPO date
|-
| IPO value || Equals the company's IPO price multiplied by the total number of shares issued
|}
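The derived IPO value attribute can be sketched in a few lines. The sketch assumes that the IPO value is the IPO price multiplied by the total number of shares issued. The record type, field names and sample companies are hypothetical and only illustrate the calculation and a per-market total.

```python
from dataclasses import dataclass


@dataclass
class IpoRecord:
    company: str
    market: str          # e.g. "Shang Hai" or "Shen Zhen"
    ipo_price: float     # share price on the IPO date
    shares_issued: int   # total number of shares issued on the IPO date


def ipo_value(record):
    """Assumed definition: IPO price times total shares issued."""
    return record.ipo_price * record.shares_issued


def total_ipo_value_by_market(records):
    """Sum the IPO value of all records within each market type."""
    totals = {}
    for record in records:
        totals[record.market] = totals.get(record.market, 0.0) + ipo_value(record)
    return totals


# Hypothetical sample rows, not taken from the project data set.
sample = [
    IpoRecord("Company A", "Shen Zhen", 10.0, 1_000_000),
    IpoRecord("Company B", "Shang Hai", 5.5, 2_000_000),
    IpoRecord("Company C", "Shen Zhen", 20.0, 500_000),
]
totals = total_ipo_value_by_market(sample)
```

Totals grouped this way are the kind of aggregate that a treemap or a per-market comparison chart would display.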

Methodology

Histogram:

A histogram is an accurate graphical representation of the distribution of numeric data. The variable is cut into several bins, and the number of observations per bin is represented by the height of the bar. The shape of the histogram can differ according to the number of bins you set.

We will visualize the distribution of the IPO value by market type (Growth market, Shang Hai and Shen Zhen), and users can interact with the chart to customize their own visualization results.

Treemap:

A treemap displays hierarchical data as a set of nested rectangles. Each branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing sub-branches. A leaf node’s rectangle has an area proportional to a specified dimension of the data. It is an alternative way of visualizing the structure of a tree diagram.

We will use the treemap package in R to visualize the data.

Scatter Plot:

A scatter plot displays the values of two sets of data on two dimensions. Each dot represents an observation. The position on the X (horizontal) and Y (vertical) axes represents the values of the two variables. It is useful for studying the relationship between the two variables. It is common to provide even more information using colors or shapes (to show groups, or a third variable). It is also possible to map another variable to the size of each dot, which makes a bubble plot. If you have many dots and struggle with overplotting, consider using a 2D density plot.

Time-series Analysis:

In R, we can use the ts() function to convert a numeric vector into an R time series object. The format is ts(vector, start=, end=, frequency=), where start and end are the times of the first and last observations and frequency is the number of observations per unit time (1=annual, 4=quarterly, 12=monthly, etc.).

Also, the forecast package provides functions for the automatic selection of exponential smoothing and ARIMA models. The ets() function supports both additive and multiplicative models. The auto.arima() function can handle both seasonal and nonseasonal ARIMA models. Models are chosen to optimize one of several fit criteria.

Portfolio Analysis:

The PerformanceAnalytics package consolidates functions to compute many of the most widely used performance metrics. tidyquant integrates this functionality so it can be used at scale using the split, apply, combine framework within the tidyverse. Two primary functions integrate the performance analysis functionality:

tq_performance implements the performance analysis functions in a tidy way, enabling scaling analysis using the split, apply, combine framework.

tq_portfolio provides a useful tool set for aggregating a group of individual asset returns into one or many portfolios.
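As a rough illustration of what aggregating individual asset returns into a portfolio means, the sketch below computes a weighted sum of per-period returns in plain Python. It is not the tidyquant implementation. The function name, the fixed-weight assumption (weights reset every period) and the sample returns are all hypothetical.

```python
def portfolio_returns(asset_returns, weights):
    """Combine per-asset period returns into one portfolio return series.

    asset_returns maps an asset name to a list of period returns.
    weights maps the same names to portfolio weights that sum to 1.
    The weights are assumed to be held fixed in every period.
    """
    if set(asset_returns) != set(weights):
        raise ValueError("asset_returns and weights must cover the same assets")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    lengths = {len(series) for series in asset_returns.values()}
    if len(lengths) != 1:
        raise ValueError("all return series must have the same length")
    n_periods = lengths.pop()
    return [
        sum(weights[name] * asset_returns[name][t] for name in weights)
        for t in range(n_periods)
    ]


# Hypothetical two-asset example with equal weights.
returns = {
    "Stock A": [0.10, -0.05],
    "Stock B": [0.02, 0.04],
}
combined = portfolio_returns(returns, {"Stock A": 0.5, "Stock B": 0.5})
```

With equal weights, each portfolio return is simply the average of the two asset returns for that period. Performance metrics would then be computed on the combined series.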

Future Work

Because of data limitations, we are only able to extract stock data from SZSE and SSE. Therefore, a lack of representativeness will be the main shortcoming of our analysis. In the future, we will extend our analysis to foreign stock markets such as NYSE and LSE. Also, to validate our analysis, we need to test our visualization together with professionals who use technical chart analysis, then modify our analysis and apply it to real financial market decision making.