Group03 Report
Tourism Investigator
Abstract
Where previous generations saw travel as a luxury, the current generation views it as an essential. This is especially prevalent among millennials, who perceive travel as a fulfilling experience that enhances their standard of living and as an avenue for exposure to various cultures. Not to be left behind, the silver generation has also been travelling more. This is largely due to affordability, in contrast to their younger years, when travel was a luxury item. Apart from widening their horizons and experiences, travel also allows senior citizens to spend more quality time with their children whilst on holiday.
Apart from tourists, business travellers make up another sizeable group. Globalisation has led many global corporations to set up regional headquarters closer to their respective marketplaces, and executives are at times required to travel to meet and discuss business strategies and directions. The availability of tele-conferencing removes the need to travel for minor meetings, but critical political and economic decisions are still made face to face.
Tourism Landscape in Singapore
Over the years, Singapore’s recognition on the global stage has been compelling. As the crown jewel of the Formula One race circuit, the backdrop of the successful Hollywood film “Crazy Rich Asians” and the host of the memorable North Korea-United States Summit, Singapore has positioned herself as a neutral yet vibrant destination, which has led to hordes of visitors setting foot on her sunny shores. It is no surprise that the tourism sector has been developing into a growth engine for Singapore’s economy. In 2017, Singapore’s tourism sector attained record highs in both tourist arrivals and spending. According to data released by the Singapore Tourism Board, the number of arrivals increased by 6.2 per cent to 17.4 million, while tourism receipts increased by 3.9 per cent to S$26.8 billion. The increasing affordability of travel, with the prevalence of low-cost carriers globally, has contributed to this favourable trend.
Beyond tourism, Singapore is also an ideal venue for the conduct of business. Singapore has consistently been ranked among the top, if not the top, Asian cities for hosting Meetings, Incentive Travel, Conventions & Exhibitions (MICE) events. Its prime geographical location and stable political climate are the two main reasons it is a leading destination for international MICE events. In 2017, a total of 935 international meetings took place in Singapore.
Objective and Motivations
During our exploratory analysis of the data on tourist arrivals into Singapore, we noticed that the arrival patterns of tourists and business travellers from different countries are heterogeneous. The analyses available from The World Bank and the Singapore Tourism Board provide a macro-view of overall tourism activity. As such, our team aims to address this gap by shifting the analysis to the country level. A keen understanding of each country’s travel behaviour can reveal its travellers’ preferences, which is essential for local businesses devising plans to attract more tourism receipts and boost their revenue. Analysts who can grapple with the data and turn the insights into actionable business decisions will see their businesses flourish. In addition, beyond analysis, we aim to provide a forecast of visitors’ future travel and expenditure patterns. This will allow local businesses to be better prepared to capture the tourism dollars in the next few years.
Dataset and Data Preparation
The CEIC database contains a wide range of tourism data relating to Singapore. However, to investigate our premise, we had to narrow our search and work with datasets that illustrate each country’s tourist behaviour pattern. As such, we narrowed the relevant datasets to tourist arrival numbers, the mode of arrival and the expenditure level. The datasets were further whittled down so that only countries with consistent yearly data points were selected. In summary, forty-seven countries with mainly arrival data and twenty countries with annual expenditure data were chosen, based on consistency over the period from 2007 to 2017. Twenty countries have appropriate data on both arrival and expenditure.
Data Preparation
After the aforementioned datasets were chosen, we carried out a series of transformations to ensure that the data could portray the story we want to tell. The first step was to melt the datasets and separate the id=Data field into three columns, Date, Country and Arrival, using the bind_rows() function. Next, we added a new column indicating the type of arrival, using the rep() function, to differentiate between Total, Sea, Air and Land arrivals. Lastly, we added a column named “Year” to the table so that the data can be plotted on an annual basis. The tourist expenditure data were treated in the same manner to ensure consistency and to allow quick manipulation at later stages.
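The reshaping steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the two small wide tables, their column names and their values are made up, and base R's rbind() and a hand-written to_long() helper (hypothetical) stand in for melt and bind_rows() so that the sketch runs without extra packages.

```r
# Hypothetical wide tables: one row per month, one column per country.
# Column names and values are invented for illustration only.
wide_total <- data.frame(
  Date = as.Date(c("2017-01-01", "2017-02-01")),
  China = c(100, 120),
  India = c(80, 90)
)
wide_air <- data.frame(
  Date = as.Date(c("2017-01-01", "2017-02-01")),
  China = c(70, 85),
  India = c(60, 65)
)

# Melt one wide table into long form with columns Date, Country, Arrival.
to_long <- function(wide) {
  countries <- setdiff(names(wide), "Date")
  data.frame(
    Date = rep(wide$Date, times = length(countries)),
    Country = rep(countries, each = nrow(wide)),
    Arrival = unlist(wide[countries], use.names = FALSE)
  )
}

long_total <- to_long(wide_total)
long_air <- to_long(wide_air)

# Tag each table with its arrival type using rep(), then stack them.
long_total$Type <- rep("Total", nrow(long_total))
long_air$Type <- rep("Air", nrow(long_air))
arrivals <- rbind(long_total, long_air)

# Add a Year column so the data can be plotted on an annual basis.
arrivals$Year <- as.integer(format(arrivals$Date, "%Y"))
```

Keeping every arrival type in one long table with a Type column means a single filter or grouping step can switch between Total, Sea, Air and Land views.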
Design Framework and Visualisation Methodologies
Dashboard Overview
As the first point of contact with users, the dashboard provides a quick snapshot of Singapore’s overall tourist arrivals against those of the selected country. The top left-hand side shows the total tourist expenditure per capita (SGD) and total arrivals (persons), whereas the top right-hand side shows the same measures for the selected country. In addition, the months with the highest and lowest tourist arrivals for the selected year are shown. At the bottom of the page, the countries that exhibit a similar arrival and expenditure pattern for the selected year are listed. This is done via k-means clustering, and the number of clusters used to group the twenty countries with both arrival and expenditure data has been fixed at six for consistency. We used the R cluster package to derive the optimal number of clusters given our dataset and its attributes. As most of our users may not be technically versed in the mechanics of clustering, we selected six based on the results shown above; adding more clusters beyond that does not model the data much better.
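A minimal sketch of this clustering step is given below. It assumes simulated arrival and expenditure figures for twenty hypothetical countries, and it uses base R's kmeans() with a within-cluster sum of squares ("elbow") curve rather than the cluster package mentioned above, so it should be read as an illustration of the idea and not as the application's code.

```r
set.seed(42)  # k-means starts from random centres

# Hypothetical data: 20 "countries" with arrival and expenditure figures.
countries <- data.frame(
  arrival = c(rnorm(10, mean = 100, sd = 10), rnorm(10, mean = 300, sd = 20)),
  expenditure = c(rnorm(10, mean = 50, sd = 5), rnorm(10, mean = 150, sd = 10))
)
scaled <- scale(countries)  # put both variables on a comparable scale

# Total within-cluster sum of squares for k = 1..8 (the "elbow" curve).
wss <- sapply(1:8, function(k) {
  kmeans(scaled, centers = k, nstart = 25)$tot.withinss
})

# Final fit with the fixed number of clusters used in the dashboard.
fit <- kmeans(scaled, centers = 6, nstart = 25)
table(fit$cluster)
```

Plotting wss against k shows where extra clusters stop reducing the within-cluster variation by much, which is the kind of evidence a fixed cluster count can be based on.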
Indicator display
Indicator exploration
The first part of our application showcases the indicators used in the analysis that follows: what structure the data follows, which countries were selected, and eventually which variables are used for modelling.
One of the most basic views is a straightforward data table, which is generated using the package DT. This is shown below.
Relationship display
Indicator correlation matrix
After getting an overview of the trends of the various indicators, the next step for the user is to understand the extent of the correlations between indicators and to perform a clustering of the 31 countries based on selected indicators.
The first tab shows the correlation matrix, which helps to find highly or weakly correlated indicators. In case the time range is a concern, users can choose the range of years as well as the indicators they would like to exclude from the correlation analysis.
The corrplot package is used to display the correlations between indicators. Blue indicates positive correlation, while red denotes negative correlation. In addition, the width of each ellipse represents the size of the correlation. Compared to the heatmaply correlation matrix, corrplot is more intuitive for viewing correlations. It also allows us to reduce visual clutter by displaying only half of the traditional correlation matrix, split along the diagonal.
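The half-matrix display described above can be sketched as follows. The built-in mtcars data stand in for the indicator table, which is an assumption, and the corrplot call is guarded so that the sketch still runs where the package is not installed.

```r
# Built-in mtcars stands in for the indicator table (an assumption).
indicators <- mtcars[, c("mpg", "disp", "hp", "wt")]
corr <- cor(indicators)

# Keep only the upper triangle, mirroring the half-matrix display.
upper <- corr
upper[lower.tri(upper)] <- NA
print(round(upper, 2))

# Draw the ellipse plot only if corrplot is installed.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corr, method = "ellipse", type = "upper")
}
```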
Number of clusters
When it comes to clustering, one of the dilemmas faced is effectively deciding the optimal number of clusters. We provide two ways to do this in our application as seen below:
The two methods provide different results on the final suggested number of clusters, which can then be input by the user into later sections for clustering analysis.
Country clustering
This panel provides three main parameters: the transformation, the distance method and the agglomeration method. The previously selected indicators are passed on to compute clusters of countries. Within each cluster, countries show similarities on the indicators relating to the selected topic. For example, in the chart below, when we select education-related indicators as input, the heatmap shows that “Spain”, “Netherlands”, “Ireland”, “Sweden”, “France” and “Argentina” are grouped into one cluster, and that this cluster performs better on the education variables for the empowerment of women. For this function, heatmaply is suitable because it can represent the values of each indicator in shades from the same colour palette.
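The three parameters map naturally onto three steps of a hierarchical clustering. The sketch below illustrates them with base R only; the built-in USArrests data stand in for the country-by-indicator table, and the specific choices (standardisation, Euclidean distance, Ward linkage, four clusters) are assumptions made for the example.

```r
# Built-in USArrests stands in for the country-by-indicator table (an assumption).
x <- scale(USArrests)                 # transformation: standardise each indicator
d <- dist(x, method = "euclidean")    # distance method
hc <- hclust(d, method = "ward.D2")   # agglomeration method
groups <- cutree(hc, k = 4)           # cut the tree into 4 clusters

# Mean of each standardised indicator per cluster, to compare cluster profiles.
aggregate(as.data.frame(x), by = list(cluster = groups), FUN = mean)
```

Comparing the per-cluster means is a simple way to see which cluster performs better on a given set of indicators, which is what the heatmap conveys visually.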
Variable Importance 1 (Cross-Sectional Analysis)
In this section, we provide users with more specific information to aid the final selection of variables. The first tab in the menu displays the correlation matrix plot seen in an earlier tab, with an accompanying VIF table (using the vif() function from the car package). These values immediately show the multicollinearity of the regression model each time users change their input variables, so they can easily determine which indicators to put into the model (as a rule of thumb, the VIF should be less than 10).
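To make the VIF values concrete, the sketch below computes them by hand from the definition, VIF_j = 1 / (1 - R_j^2), instead of calling vif() from the car package. The built-in mtcars data and the chosen predictors are assumptions for illustration.

```r
# Built-in mtcars stands in for the indicator data (an assumption).
predictors <- mtcars[, c("disp", "hp", "wt", "qsec")]

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors.
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
vif_manual[vif_manual > 10]  # predictors that breach the rule of thumb
```

The same values appear on the diagonal of the inverse of the predictors' correlation matrix, which offers a quick cross-check.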
To give a better idea of the rate of change in the indicators, the changes are provided in tabular form, based on the selected start and end years. This table is shown in the second tab. While the model results are an important reference, we have opted to compress the visual real estate of this section within the app by enabling scrolling with CSS styling; this allows us to better showcase other important results on the same page.
The modelling section contains the parameter options on the left side, and the regression model summary and two kinds of plots on the right side. Any change to a parameter updates every result, from left to right.
The packages MASS and car, amongst others, were used mainly for the multiple regression model because they offer many handy statistical functions that are useful for interpreting the model results. Diagnostic plots help check the assumptions of the econometric model for violations (e.g. non-normality, non-constant error variance and nonlinearity). The car package has a function, influencePlot(), to display influential observations. However, we abandoned it in favour of Plotly, which we found far superior in terms of visualisation. This is shown at the bottom right of the screenshot, where the three countries that may affect the response variable significantly are clearly seen in the graph.
In the last tab, as shown above, the function regsubsets() in the leaps library is used for regression subset selection. Users can view the models ranked according to different scoring criteria (we provide BIC, R2 and adjusted R2 values) by plotting the results of regsubsets(). The R function step() is used for stepwise variable selection; users can choose between the p-value and AIC criteria and set the regression direction and significance level. These parameters affect the updated model summary output.
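A minimal sketch of AIC-based stepwise selection with base R's step() is shown below. The built-in mtcars data and the starting formula are assumptions, and only the AIC criterion with backward direction is illustrated here.

```r
# Built-in mtcars stands in for the indicator data (an assumption).
full <- lm(mpg ~ disp + hp + wt + qsec + drat, data = mtcars)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log.
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)   # variables that survive selection
AIC(full)          # AIC of the starting model
AIC(reduced)       # AIC of the selected model
```

Changing direction to "forward" or "both" alters the search path, which is why the regression direction is exposed as a parameter.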
As we want to find the dominant factors in each year, the method extracts the independent variables that meet the selected method’s conditions, together with their coefficients. We also focus on retrieving information that reflects how important the selected variables are to the response variable. The relaimpo package provides a function to calculate the relative importance of each predictor and to obtain bootstrap measures of relative importance. However, as with the car package, we found its plotting capabilities lacking. To address this, we defined a new function that uses the correlations between the original predictors and new orthogonal variables, together with the regression coefficients of the dependent variable Y on the orthogonal variables, to derive each predictor’s percentage contribution to the overall R2 value. This is then visualised as a simple pie chart with plotly.
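One well-known calculation that matches this description is Johnson's relative weights, sketched below. Whether the application's function follows exactly this formulation is an assumption; the built-in mtcars data and the chosen predictors are likewise used only for illustration.

```r
# Built-in mtcars stands in for the indicator data (an assumption).
X <- as.matrix(mtcars[, c("disp", "hp", "wt", "qsec")])
y <- mtcars$mpg

rxx <- cor(X)     # correlations among the predictors
rxy <- cor(X, y)  # correlations between predictors and response

# lambda = rxx^(1/2): correlations between the predictors and their
# closest orthogonal counterparts.
eig <- eigen(rxx)
lambda <- eig$vectors %*% diag(sqrt(eig$values)) %*% t(eig$vectors)

# Standardised coefficients of y on the orthogonal variables.
beta <- solve(lambda, rxy)

# Raw weights sum to the model R^2; rescale them to percentages.
raw <- as.vector(lambda^2 %*% beta^2)
names(raw) <- colnames(X)
pct <- 100 * raw / sum(raw)
round(pct, 1)
```

Because the raw weights add up to the model's R2, the percentages can be drawn directly as slices of a pie chart.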
Variable Importance 2 (Panel Analysis) 
Apart from the earlier cross-sectional analysis, the base dataset contains indicators that have panel data. We therefore created a separate section that performs analysis on panel data based on the year range and the response variables selected by users.
The menu featured above mainly uses the plm package to perform panel data analysis based on the year range and response variable the user selects. A unit root test determines whether the data are stationary; the Dickey-Fuller test lets users check for stochastic trends. Its null hypothesis is that the series has a unit root (i.e. is non-stationary). If a unit root is present, the application takes the first difference of the variable. In addition, when the heteroscedasticity test detects heteroscedasticity, a robust covariance matrix is used to account for it.
The plm package makes it convenient to compare different models by calling its model-fitting functions. Typically, fixed-effects models (with effects for country, year and so on removed), random-effects models and pooled models are compared. In the larger scheme of things, our application fits these models and compares them with one another before providing a summary output of the optimal one.
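The idea behind comparing pooled and fixed-effects models can be sketched with base R alone. The example below uses a simulated panel and country dummies in lm() instead of the plm package, so every variable name and value in it is hypothetical; in plm, the corresponding fits would come from plm() with model set to "pooling" or "within".

```r
set.seed(1)

# Hypothetical balanced panel: 5 countries observed over 10 years.
panel <- expand.grid(year = 2008:2017, country = paste0("C", 1:5))
alpha <- c(C1 = 0, C2 = 5, C3 = 10, C4 = 15, C5 = 20)  # country effects
panel$x <- rnorm(nrow(panel))
panel$y <- alpha[as.character(panel$country)] + 2 * panel$x +
  rnorm(nrow(panel), sd = 0.5)

# Pooled OLS ignores the country structure.
pooled <- lm(y ~ x, data = panel)

# Fixed effects via country dummies (least-squares dummy variables).
fixed <- lm(y ~ x + factor(country), data = panel)

# F-test: do the country effects improve on the pooled model?
comparison <- anova(pooled, fixed)
comparison
coef(fixed)["x"]  # slope after removing country effects
```

A small p-value in the F-test favours the fixed-effects specification over pooling, which is the kind of comparison the application automates before reporting the preferred model.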
Next to the results tab, we have also added a visual of the estimated individual effects, using countries as independent variables, in the form of a bar chart. Because the bars are sorted in descending order, users can quickly identify which countries’ data have affected the response variable most. No additional colouring or shading has been applied, to minimise visual clutter.
Summary and Future Work
Our application allows users to carry out detailed exploratory analysis and forecasting of tourist expenditure and arrival data for Singapore. However, more insights could have been generated had we been able to obtain other relevant datasets for each country, such as breakdowns by age group, duration of stay, purpose of stay, type of expenditure and type of accommodation. Currently these data are available only at the aggregated level, i.e. collectively for all visitors into Singapore. Should such data be made available, we could create in-depth profiles of the various groups of visitors into Singapore, and businesses would be able to run targeted marketing and offers to better incentivise tourists and travellers to spend more.
Acknowledgements
We would like to sincerely thank Professor Kam Tin Seong for his unwavering support and clear guidance to improve our application. The relevant changes would not have been possible if not for his recommendations and advice.