IS480 Team wiki: 2018T1 analyteaka research
Under the PDPA's guidance, we are not legally obligated to care for personal data. However, we follow its best-practice tips by exploring the following:
1. Set out how the personal data in our custody can be well protected.
2. Classify the personal data to better manage housekeeping.
3. Set clear timelines for the retention of the various personal data, and cease to retain documents containing personal data that is no longer required for business or legal purposes.
4. For the transfer of personal data overseas, use contractual agreements with the organisations involved in the transfer to provide a comparable standard of protection overseas.
The classification below is based on our interpretation of Federal Information Processing Standards (FIPS) Publication 199, published by the National Institute of Standards and Technology (as summarised by Carnegie Mellon University). It reflects the level of impact on the company if confidentiality, integrity or availability is compromised.
Potential impact table

| Security objective | Low | Moderate | High |
|---|---|---|---|
| Confidentiality | Leakage of information could be expected to have a limited adverse effect on the company's operations, assets or individuals. | Leakage of information could be expected to have a serious adverse effect on the company's operations, assets or individuals. | Leakage of information could be expected to have a severe or catastrophic adverse effect on the company's operations, assets or individuals. |
| Integrity | Unauthorized modification or destruction of information could be expected to have a limited adverse effect on the company's operations, assets or individuals. | Unauthorized modification or destruction of information could be expected to have a serious adverse effect on the company's operations, assets or individuals. | Unauthorized modification or destruction of information could be expected to have a severe or catastrophic adverse effect on the company's operations, assets or individuals. |
| Availability | The disruption of the information or system could be expected to have a limited adverse effect on the company's operations, assets or individuals. | The disruption of the information or system could be expected to have a serious adverse effect on the company's operations, assets or individuals. | The disruption of the information or system could be expected to have a severe or catastrophic adverse effect on the company's operations, assets or individuals. |
Based on the above tips and impact levels, we decided to split the data into 5 classes:
- Class 1: contains at least 2 high impacts
- Class 2: contains at least 1 high impact
- Class 3: contains at least 1 moderate impact
- Class 4: contains no high or moderate impacts
- Class 5: contains no high or moderate impacts and is easily accessible public data
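The class assignment rules above can be sketched as a small helper. This is a hypothetical illustration, not part of our codebase: it assumes each data field carries three impact ratings (confidentiality, integrity, availability) valued "high", "moderate" or "low", with "low" treated as no significant impact, and a separate flag for publicly available data.

```python
def data_class(impacts, publicly_available=False):
    """Map a field's impact ratings to a data class (1-5).

    impacts: list of three ratings ("high"/"moderate"/"low") for
    confidentiality, integrity and availability.
    """
    highs = impacts.count("high")
    moderates = impacts.count("moderate")
    if highs >= 2:
        return 1                      # Class 1: at least 2 high impacts
    if highs >= 1:
        return 2                      # Class 2: at least 1 high impact
    if moderates >= 1:
        return 3                      # Class 3: at least 1 moderate impact
    return 5 if publicly_available else 4  # Class 4/5: no significant impact
```

For example, a field rated high/high/moderate would land in Class 1, while an all-low field backed by public data would land in Class 5.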
| Class | Description | Examples | Handling |
|---|---|---|---|
| 1 | Highly confidential data | CVV code, credit card number | Never stored or processed. |
| 2 | Uniquely personally identifiable information | Fingerprints, eye scan, session token, NRIC, password | Never stored; processed and then discarded. |
| 3 | Personally identifiable information | Date of birth, email, address | Store only the hashed value. |
| 4 | Non-personally identifiable information | State, city, region, subzone | Can be stored as is. |
| 5 | Publicly available website content | Item details, category, item price | Can be stored as is. |
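As a minimal sketch of the "store only hashed value" rule for Class 3 data: the snippet below salts and hashes a value with SHA-256 from Python's standard library. The email address is a made-up example. Note that for low-entropy PII a plain hash can still be brute-forced, so a keyed hash (HMAC) or a slow key-derivation function would be preferable in practice.

```python
import hashlib
import os

def hash_pii(value: str, salt: bytes) -> str:
    # SHA-256 over salt + value; only the hex digest is stored, never the raw PII.
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()

salt = os.urandom(16)                      # random salt kept alongside the digest
digest = hash_pii("jane@example.com", salt)  # hypothetical Class 3 value
```

The same value and salt always produce the same digest, which still allows exact-match lookups without retaining the original data.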
Data Analytics is the process of examining data sets in order to draw conclusions about the information they contain, increasingly with the aid of specialized systems and software.
Typical mechanisms: Database (only Data)
Typical timeframe: Offline
The outcome of analytics is informed business decisions that verify or disprove scientific models, theories and hypotheses. Typical goals are to improve efficiency, optimise processes, increase revenue, etc.
The hardest part of an analytics project is asking the question. As Robert Half once said, "Asking the right questions takes as much skill as giving the right answers."
| Type | Focus |
|---|---|
| Descriptive analytics | Insight into the past |
| Predictive analytics | Understanding the future |
| Prescriptive analytics | Advice on possible outcomes |
Based on the above details, our modules are split into the respective sections:
- Customer Profile module
- Store Profile module
- Staff Profile module
- Machine Learning module
- Data Visualisation module
- Analytics and Reporting module
What's Machine Learning?
In a nutshell, it is using algorithms to parse data, learn from it, and then make a determination or prediction. More specifically, it's a field of computer science that uses statistical techniques to give computer systems the ability to "learn" (i.e. progressively improve performance on a specific task) from data, without being explicitly programmed.
However, there are some misconceptions about machine learning.
- It's not logic-based; it's statistics-based.
- It's not a solution without proper understanding and expectations.
- AI versus ML versus deep learning (source):
  - AI: human intelligence exhibited by machines
  - ML: an approach to achieving AI
  - Deep learning: a technique for implementing ML
- Lastly, there's nothing new about the concept of machine learning (it existed as early as the 1950s). It just became much more relevant due to the rise of IoT devices and the potential to store endless data.
What we would like to achieve out of machine learning for this project is predictive analytics (predicting the outcome based on historical and current data). To do so, there are a few potential techniques we can use:
- Classification (decision tree, hierarchical, random forest, logistic/linear regression, etc.)
- Clustering (EM, k-Means, Canopy, hierarchical, COBWEB, etc.)
- Recommendations (filtered association, Apriori, etc.)
To better understand how we can apply these techniques, we would need to better understand the different categories of machine learning.
The types of machine learning algorithms (source)
Machine learning algorithms can be broken down into three main categories.
1. Supervised learning
In supervised learning, the input training dataset contains both the input and the desired output, called labels. In other words, the dataset contains examples of the answers we desire. An example would be a spam filter: it is provided with many example emails along with their labels (spam or not spam), from which it learns to classify new emails.
Supervised learning is also usable for predicting a numeric value based on a set of features (the variables that potentially affect the end result). An example would be a sales price: it depends on the season, location, targeted segments and cost, so the training set would require both the sales price and the features. The training process continues until the model achieves the desired level of accuracy on the training data. For our case, we adopt a 70/15/15 split of the dataset into training/testing/validation sets.
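The 70/15/15 split mentioned above can be sketched with the standard library alone. This is an illustrative helper (the function name and fixed seed are our own choices, not part of the project code); it shuffles deterministically so the split is reproducible.

```python
import random

def split_70_15_15(rows, seed=42):
    """Shuffle rows deterministically and split 70% train / 15% test / 15% validation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)       # seeded shuffle for reproducibility
    n = len(rows)
    n_train = int(n * 0.70)
    n_test = int(n * 0.15)
    train = rows[:n_train]
    test = rows[n_train:n_train + n_test]
    validate = rows[n_train + n_test:]      # remainder goes to validation
    return train, test, validate

train, test, validate = split_70_15_15(range(100))
```

Shuffling before splitting matters: if the rows were ordered by date or store, an unshuffled split would give the model a biased view of the data.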
Examples of supervised learning algorithms:
1. Linear regression
2. Logistic regression
3. Support vector machines
4. Decision trees and random forest
5. k-nearest neighbors
6. Neural network
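As a toy illustration of one algorithm from the list, here is a minimal k-nearest neighbors classifier in plain Python (a sketch we wrote for this page, not our production code): it predicts a query's label by majority vote among the k closest labelled training points.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs; query: a feature tuple.

    Returns the majority label among the k nearest training points.
    """
    # Sort training points by Euclidean distance to the query, keep the k closest.
    neighbours = sorted(train, key=lambda fl: math.dist(fl[0], query))[:k]
    labels = [label for _, label in neighbours]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical labelled data: two well-separated groups.
train_set = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
             ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
```

A query near the origin gets label "a"; one near (5, 5) gets "b". This mirrors the supervised setup described above: the training set contains both the inputs and the desired outputs.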
2. Unsupervised learning
For unsupervised learning, the dataset has no labels (no right or wrong answers). Rather, the goal is to look for relationships/correlations within the data. For instance, diapers and beer have a statistically significant correlation (in some cases), which has led stores to promote cross-selling of beer and diapers. Similarly, we make use of unsupervised learning to look for such correlations, even if they sound absurd.
Examples of unsupervised learning algorithms:
- k-Means - This is the most well-known clustering algorithm, commonly taught in introductory data science and machine learning classes (e.g. IS424). It's relatively easy to understand and implement.
Steps for performing k-Means:
1. Select a value (known as the k-value) and randomly select that many initial center points. In the gif above, there are 3 distinct clusters as the k-value is set to 3. (To determine the k-value, there are a few potential techniques, such as the elbow method.)
2. For each data point, calculate the distance between that point and each group's center. The point is then classified into the group whose center is closest to it.
3. Based on the newly classified points, recompute each group's center by taking the mean of all the vectors in the group.
4. Repeat these steps for a set number of iterations or until the group centers don't change much between iterations (for our case, we use 100 iterations).
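These steps translate almost line-for-line into plain Python. The sketch below is illustrative (fixed seed and 100 iterations are arbitrary choices for this example, matching the iteration count mentioned above):

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k random data points as the initial centers.
    centres = rng.sample(points, k)
    for _ in range(iterations):
        # Step 2: assign each point to the group whose center is closest.
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centres[c]))
            groups[i].append(p)
        # Step 3: recompute each center as the mean of its group's points.
        for i, g in enumerate(groups):
            if g:  # guard against a center that lost all its points
                centres[i] = tuple(sum(dim) / len(g) for dim in zip(*g))
    # Step 4: repetition is handled by the fixed iteration count above.
    return centres, groups

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, groups = kmeans(pts, k=2)
```

On these six points, the two obvious groups of three are recovered regardless of which points were picked as initial centers.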
Pros:
- Very fast: linear complexity, O(n).
- Suitable for large datasets.
Cons:
- You need to determine the optimal k-value.
- The random choice of initial cluster centers can produce a different result on every run, so the results can lack consistency.
A potential alternative for smaller datasets would be k-medians, which uses the median vector of the group instead of recomputing the mean value of the cluster (the extra computation needed makes it better suited to smaller datasets).
- Mean shift clustering - A sliding-window-based algorithm that attempts to find the densest areas of data points. In other words, this centroid-based algorithm's goal is to locate the center point of each cluster (as shown in the gif).
Steps for performing mean shift:
1. Start with a random point surrounded by a window of a given radius, and slowly shift the window towards a higher-density region. Continue this process until the window reaches the position of highest possible density.
2. Repeat the process with multiple starting points until every data point lies within a window. When multiple windows overlap, the one containing the most points is preserved.
3. The data points are then clustered according to the window they reside in.
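A minimal one-dimensional sketch of the shifting step (our own toy illustration, not library code): each point is repeatedly moved to the mean of its neighbours within the radius, so it climbs towards the densest region; points that settle on the same spot belong to the same cluster.

```python
def mean_shift_1d(points, radius, steps=50):
    """Shift each point to the mean of its neighbours until it settles on a mode."""
    modes = []
    for p in points:
        x = p
        for _ in range(steps):
            # Window = all original points within `radius` of the current position.
            neighbours = [q for q in points if abs(q - x) <= radius]
            x = sum(neighbours) / len(neighbours)  # shift to the window's mean
        modes.append(round(x, 3))                  # round to merge float jitter
    return sorted(set(modes))

# Two dense regions around 1.1 and 5.1; radius small enough not to bridge them.
modes = mean_shift_1d([1.0, 1.1, 1.2, 5.0, 5.1, 5.2], radius=1.0)
```

The two surviving modes correspond to the two cluster centers, with no k-value supplied, which is the advantage listed below.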
Pros:
- No k-value needed; clusters are discovered automatically.
- Relatively fast.
- Suitable for large datasets.
Cons:
- The selection of the radius (window size) can be non-trivial.
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
DBSCAN is a density-based clustering algorithm similar to mean shift, but with a couple of notable advantages.
Steps for performing DBSCAN
1. Start with a random point that has not been visited. Points in its neighborhood are grouped together so long as their distance is within the epsilon value (epsilon: a numeric distance value that determines the furthest a point can be and still be considered part of the cluster).
2. If there is a sufficient number of points (according to minPoints), the neighborhood is grouped with other neighborhoods. If the distance is further than the epsilon value or the number of points falls below minPoints, the point is labeled as noise.
3. Steps 1 and 2 are repeated until every point has been visited and labeled as part of a cluster or as noise.
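These steps can be sketched in plain Python (an illustrative, unoptimised version we wrote for this page; real use would call a library implementation). Each point ends up labeled with a cluster id, or -1 for noise:

```python
import math

def dbscan(points, eps, min_points):
    """Label each point with a cluster id, or -1 for noise."""
    labels = {p: None for p in points}
    neighbours = lambda p: [q for q in points if math.dist(p, q) <= eps]
    cluster = 0
    for p in points:
        if labels[p] is not None:
            continue                       # already visited
        seeds = neighbours(p)
        if len(seeds) < min_points:
            labels[p] = -1                 # noise (may later become a border point)
            continue
        labels[p] = cluster                # p is a core point: start a new cluster
        queue = list(seeds)
        while queue:                       # expand the cluster outwards
            q = queue.pop()
            if labels[q] in (None, -1):
                was_noise = labels[q] == -1
                labels[q] = cluster
                # Only unvisited core points spread the cluster further.
                if not was_noise and len(neighbours(q)) >= min_points:
                    queue.extend(neighbours(q))
        cluster += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]                           # lone point far from both clusters
labels = dbscan(pts, eps=1.5, min_points=3)
```

The two dense squares each form a cluster, and the isolated point at (50, 50) is flagged as noise, which illustrates the outlier-handling advantage listed below.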
Pros:
- No preset number of clusters needed.
- Identifies outliers as noise.
- Able to find arbitrarily sized and shaped clusters quite well.
Cons:
- Not suitable for huge datasets (acceptable for our use case).
- A single distance threshold (epsilon) and minPoints setting may not suit every cluster, as density can vary from cluster to cluster.
There are plenty of other clustering methods, such as agglomerative hierarchical clustering, BIRCH or Gaussian mixture models. However, for our project's use case, we will stick to the three most popular and commonly used algorithms above.
What can we use clustering for? We can make use of clustering to:
1. Figure out which customer segments to target every quarter.
2. Drive Facebook targeted advertising.
Instead of asking store managers for feedback and recommendations, we can cluster customers into multiple categories for targeted marketing.
Association rule learning:
- Apriori
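The beer-and-diapers correlation mentioned earlier is exactly what association rule mining surfaces. The sketch below shows only the first Apriori step, counting the support of item pairs, with made-up baskets; the full algorithm also prunes candidates level by level and derives confidence for the rules.

```python
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (fraction of transactions) >= min_support."""
    counts = {}
    for t in transactions:
        # Sort so each pair is counted under one canonical ordering.
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] = counts.get(pair, 0) + 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

baskets = [                      # hypothetical transaction data
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["beer", "bread"],
    ["diapers", "bread"],
]
pairs = frequent_pairs(baskets, min_support=0.5)
```

Here only (beer, diapers) clears the 50% support threshold, so it would be the seed for candidate rules like "diapers → beer".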
3. Reinforcement learning
Reinforcement learning is rather special: the learning system, or agent, needs to learn to make specific decisions. The agent observes the environment it is exposed to, selects and performs actions (from a list of acceptable actions), and gets rewards in return (or penalties for failing). The goal is to choose the actions that maximize the reward over time. After countless iterations, the system learns the best strategy, called a policy, on its own. An example is DeepMind's AlphaGo: the system learned a winning policy through millions of iterations (games), including playing against itself.
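A full game-playing agent is far beyond a short snippet, but the reward-maximisation loop can be illustrated with a classic minimal setting we have chosen for this sketch: a multi-armed bandit with an epsilon-greedy policy. The reward values and parameters below are invented; the point is the trial-and-error loop that balances exploration against exploiting the best estimate so far.

```python
import random

def epsilon_greedy(true_rewards, episodes=5000, epsilon=0.1, seed=1):
    """Learn which action yields the highest average reward by trial and error."""
    rng = random.Random(seed)
    n = len(true_rewards)
    estimates = [0.0] * n           # running estimate of each action's reward
    counts = [0] * n
    for _ in range(episodes):
        # Explore a random action with probability epsilon, else exploit.
        if rng.random() < epsilon:
            a = rng.randrange(n)
        else:
            a = max(range(n), key=lambda i: estimates[i])
        reward = true_rewards[a] + rng.gauss(0, 0.1)   # noisy reward signal
        counts[a] += 1
        estimates[a] += (reward - estimates[a]) / counts[a]  # running mean
    return estimates

est = epsilon_greedy([0.2, 0.5, 0.9])   # hypothetical per-action average rewards
```

After enough episodes the agent's estimates single out the highest-paying action, a tiny analogue of a policy learned purely from rewards.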
Problems with machine learning
Machine learning isn't the solution for everything; there are flaws involved. Fortunately, the field has advanced greatly and has developed ways of addressing these flaws.
1. Algorithm selection: developing or choosing the relevant ML algorithm to use for training and learning. How do we decide which algorithm to use?
2. Loss mitigation: we have training data, but how do we make the model learn while keeping the prediction loss to a minimum?
3. Use-case identification: how do we identify a suitable use case?