ISSS608 2016 17 T1 Project Team 7 Report
Motivation
TalkingData, China’s largest third-party mobile data platform, understands that everyday choices and behaviors paint a picture of who we are and what we value. Currently, TalkingData is seeking to leverage behavioral data from more than 70% of the 500 million mobile devices active daily in China to help its clients better understand and interact with their audiences. In this project, we are challenged to visualize users’ demographic characteristics based on their app usage, geolocation, and mobile device properties. Doing so will help millions of developers and brand advertisers around the world pursue data-driven marketing efforts which are relevant to their users and catered to their preferences.
Review and critique of past works
Design framework
Age Segment
In the first step, we transform each user’s age into a nominal attribute (an age group). This helps us better understand our users by age group.
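As a minimal sketch of this binning step: the cut points and labels below are assumptions chosen only to match the "younger than 23" and "older than 39" groups mentioned later in this report, and `age_segment` is a hypothetical helper name, not the project's actual function.

```python
from bisect import bisect_right

# Hypothetical cut points: each value is the first age of the next group.
AGE_CUTS = [23, 27, 29, 32, 39]
AGE_LABELS = ["22-", "23-26", "27-28", "29-31", "32-38", "39+"]


def age_segment(age):
    """Return the nominal age-group label for a numeric age."""
    # bisect_right counts how many cut points are <= age,
    # which is exactly the index of the group the age falls into.
    return AGE_LABELS[bisect_right(AGE_CUTS, age)]


ages = [19, 23, 28, 31, 38, 39, 55]
segments = [age_segment(a) for a in ages]
```

Keeping the cut points in one list makes it easy to try a different grouping without touching the function.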
The graph shown below is a basic
Get Location
In the second step, we obtain the geographical information of every event. To achieve this, we use a Python library named Shapely. Below is a screenshot of how we do this in the source code.
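The idea behind this lookup can be sketched without any dependency. The project itself uses Shapely, whose `Polygon.contains(Point)` performs the same kind of test; the sketch below swaps in a plain ray-casting point-in-polygon check instead. The region names and rectangular boundaries are hypothetical placeholders, not real province shapes.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    polygon is an ordered list of (lon, lat) vertices; the last vertex
    is implicitly connected back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def locate(lon, lat, regions):
    """Return the name of the first region containing the point, else None."""
    for name, polygon in regions.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Hypothetical regions: two adjacent rectangles, not real boundaries.
REGIONS = {
    "region_a": [(100.0, 20.0), (110.0, 20.0), (110.0, 30.0), (100.0, 30.0)],
    "region_b": [(110.0, 20.0), (120.0, 20.0), (120.0, 30.0), (110.0, 30.0)],
}

event_region = locate(105.0, 25.0, REGIONS)
```

Events whose coordinates fall outside every region return `None`, so they can be filtered out or reported separately.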
Time Flag
In the third step, we transform the time attribute into a numeric attribute and group the values into bins (time flags). This helps us build a matrix for each user. Below is a screenshot of the key function that does this.
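A minimal sketch of such a time-flag function follows. The four six-hour periods, the timestamp format, and the name `time_flag` are all assumptions made for illustration; the report does not state the actual bin boundaries.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical period boundaries in hours: flag i covers
# the half-open interval [BOUNDS[i], BOUNDS[i + 1]).
BOUNDS = [0, 6, 12, 18, 24]


def time_flag(timestamp):
    """Map a 'YYYY-MM-DD HH:MM:SS' timestamp to a numeric time flag."""
    hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").hour
    # bisect_right counts the boundaries <= hour; subtracting one
    # turns that count into a zero-based period index.
    return bisect_right(BOUNDS, hour) - 1


flags = [time_flag(t) for t in ("2016-05-01 00:55:25", "2016-05-01 09:30:00")]
```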
Ratio Matrix
After assigning a time flag to every event record, we can build each user’s behavioral matrix. For each user, we first count the number of events that happened in each time period. We then arrange these counts as a matrix and normalize it so that the entries of each user’s matrix sum to one.
Finally, we get a matrix like this:
In the matrix shown above, the last row is the users’ device ID and the others are the ratios of events that happened in each time period.
With such a matrix, we can see in which time periods our users mainly use their mobile phones.
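The counting and normalization can be sketched as follows. The number of time periods (`N_FLAGS = 4`), the device IDs, and the input layout of `(device_id, time_flag)` pairs are assumptions made for illustration.

```python
from collections import defaultdict

N_FLAGS = 4  # assumed number of time periods


def ratio_matrix(events):
    """Build per-device event ratios from (device_id, time_flag) pairs.

    Returns {device_id: [ratio for flag 0, ratio for flag 1, ...]},
    where each device's ratios sum to one.
    """
    counts = defaultdict(lambda: [0] * N_FLAGS)
    for device_id, flag in events:
        counts[device_id][flag] += 1
    # Divide each count by the device's total number of events.
    return {
        device_id: [c / sum(row) for c in row]
        for device_id, row in counts.items()
    }


# Hypothetical event records.
events = [
    ("dev_a", 0), ("dev_a", 1), ("dev_a", 1), ("dev_a", 3),
    ("dev_b", 2), ("dev_b", 2),
]
ratios = ratio_matrix(events)
```

Normalizing per device removes the effect of how many events a device produced overall, so heavy and light users with the same daily rhythm end up with similar rows.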
User Clustering
After building the ratio matrix, we cluster our users to put them into different groups. In this case, we use the K-means function of scikit-learn, a Python library. We finally obtained three meaningful clusters, each with a behaviour pattern different from the others. The details are discussed in the insights part.
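To show what the clustering step computes, here is a dependency-free sketch of Lloyd's algorithm, the standard K-means iteration. It is not the project's code, which relies on scikit-learn. The farthest-point initialisation is a simplification chosen to keep the example deterministic, and the sample rows are made-up ratio vectors.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, n_iter=100):
    """Cluster rows into k groups; returns (labels, centroids)."""
    # Deterministic farthest-point initialisation (an assumption made
    # for reproducibility, not scikit-learn's default seeding).
    centroids = [list(rows[0])]
    while len(centroids) < k:
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(far))

    labels = None
    for _ in range(n_iter):
        # Assignment step: attach every row to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up ratio rows: two morning-heavy, two night-heavy, two evenly spread.
rows = [
    [0.05, 0.85, 0.05, 0.05], [0.00, 0.90, 0.10, 0.00],
    [0.05, 0.05, 0.05, 0.85], [0.00, 0.10, 0.00, 0.90],
    [0.25, 0.25, 0.25, 0.25], [0.20, 0.30, 0.25, 0.25],
]
labels, centroids = kmeans(rows, k=3)
```

With scikit-learn, the corresponding call would be along the lines of `KMeans(n_clusters=3).fit_predict(rows)`. The cluster numbers themselves are arbitrary; only the grouping is meaningful.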
Demonstration
To visualize our data and better understand users’ behavior, we built a dashboard that shows the volume of events under selected conditions. The graph shown below is a screenshot of it.
There are six modules in this dashboard. Users can interact with the dashboard to see the volume of events under customized conditions. In the following sections, we explain each part of the dashboard.
Number of Events’ Variation
The graph shown below is a screenshot of “Number of Events’ Variation”. This is a line chart of the volume of events over time.
In this chart, we can see how the volume of events varies over time. There is a clear daily cycle. In general, people mainly use their cell phones between 9:00 am and midnight, and there are several peaks within this period. Between midnight and 9:00 am, there are only low peaks.
Activation per Person
Next, we have a tree map of the average activation per person. A screenshot of the tree map is shown below.
This tree map is zoomable. The first level is brand and the second level is time flag. The size of each square depends on the number of activations per person. In general, HTC has the largest value here, about 20, while the others have only around 15. This is a meaningful finding, since HTC is now a very unpopular brand in China, yet other unpopular brands do not show a value as high as HTC’s.
We can infer that HTC’s users rely more heavily on their cell phones, although the brand is becoming a minority choice.
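The metric behind the tree map can be sketched as below, assuming that "activation per person" means the number of events divided by the number of distinct devices for each brand. That definition, the input layout, and the brand and device names are assumptions for illustration, not taken from the project's source.

```python
from collections import defaultdict


def events_per_device(events):
    """Average number of events per distinct device, grouped by brand.

    events is an iterable of (brand, device_id) pairs, one per event.
    """
    totals = defaultdict(int)    # brand -> number of events
    devices = defaultdict(set)   # brand -> distinct device IDs seen
    for brand, device_id in events:
        totals[brand] += 1
        devices[brand].add(device_id)
    return {brand: totals[brand] / len(devices[brand]) for brand in totals}


# Hypothetical event records.
events = [
    ("brand_x", "d1"), ("brand_x", "d1"), ("brand_x", "d1"), ("brand_x", "d2"),
    ("brand_y", "d3"), ("brand_y", "d3"), ("brand_y", "d3"),
]
averages = events_per_device(events)
```

A brand with few devices can still rank highest on this metric, which is the effect the tree map highlights.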
Chinese Province
Below is a screenshot of the bar chart of event volume by province.
China is a developing country, and its economy is not balanced across provinces. The chart reflects this. The event volumes of Beijing and Shanghai are even higher than those of some provinces, although they are only cities.
In general, the values in this chart are related to the economic status and population of each province or city.
(P.S.: Since the data was collected from a mainland telecom operator, HK’s value is quite low here.)
Brand
Below is a bar chart of the volume of events for each brand.
In this bar chart, we can see that the top three brands are HUAWEI, XIAOMI and SAMSUNG. There is no iPhone because the data only contains Android users (for security reasons, Apple does not allow companies to collect users’ data easily).
This is quite different from other countries. In many countries, the most popular Android phone brand is Samsung, while in China it is HUAWEI, and SAMSUNG has only about one-fourth of HUAWEI’s volume. As discussed for the tree map, HTC is quite an unpopular brand here, but it has a much higher average number of events per user.
XIAOMI takes second place overall, but in certain age groups it is more popular than HUAWEI.
Map
We also built a heat map, shown below:
In this heat map, we can see that most events happened in the southeast part of China, which has a stronger economy than other parts.
As discussed before, a province or city with a larger population or a stronger economy has a higher volume of events. If we combine this map with the bar chart of brands, we find that different parts of China prefer different brands. For example, HUAWEI is the most popular brand in Guangdong, while XIAOMI is the most popular in Sichuan. Both provinces have a high volume of events.
This is valuable for vendors deciding which brand of cell phone to sell in which part of China.
Cluster & Age Segment
Finally, we have two pie charts for the clusters and age groups, shown below:
For the age segments, people older than 39 form the largest group, while people younger than 23 form the smallest. It seems that, in China, young people, especially middle school students, are less likely to have or use their own cell phones.
For the clusters, we have three. Cluster-2 and cluster-3 are almost the same size, while the remaining cluster accounts for about 60% of the total. Cluster-2 and cluster-3 mainly contain people whose cell phone use is concentrated in certain time periods. The usage of people in cluster-1 is much more evenly spread over time than in the other two clusters. More details of these clusters are introduced in the insights part.
Discussion
Future Work
Installation guide
User Guide