IS480 Team wiki: 2012T1 6-bit Technical Overview

Revision as of 14:57, 5 December 2012 by Huiling.ong.2010
6-bit logo.png
6-bit's Chapalang! is a social utility that connects people with friends and new friends
by offering a place for exchanging ideas and information on its public domain.

Technical Resources

6-bit Technical.png

Data Architecture

6-bit dataarch.png
In this data architecture, there are 2 Operational Data Sources (ODS), 4 data stores and 1 consuming portal, chapalang.com.

Operational Data Source (ODS)

Operational data are obtained through the Facebook Graph API, which provides data such as a user’s posts, likes, friends and basic information, and are stored in the Staging Area. Additionally, user activity data captured on the portal are also stored in the Staging Area for a further Extract, Transform, Load (ETL) cycle.

Staging Area

Each data store is hosted independently by design. The Staging Area data store holds the data crawled from the ODS. Because the crawling load depends heavily on the login frequency and concurrency of users, the Staging Area should be independently hosted so that usage spikes do not cause operational problems elsewhere.

Additionally, it will be configured to optimize write performance, since writing is its primary activity. Hence, the Staging Area is not designed to be normalized, and will instead have a database structure that matches the incoming data structure.

There is also a Time To Live (TTL) to provide suitable expiry of the data stored in the Staging Area. As the objective of each crawl is to obtain the latest data of a user, the Staging Area should not store outdated data. For the purpose of Chapalang! operations in this rudimentary stage, a user’s data in the Staging Area is dropped each time the user logs in and is replaced with new data.
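
The drop-and-replace behaviour and the TTL described above can be sketched as follows. This is a minimal illustration only, assuming an SQLite database and a hypothetical `staging` table that keeps each crawled item as raw JSON. The table, column and function names are not taken from the Chapalang! code base.

```python
import json
import sqlite3
import time


def open_staging(path=":memory:"):
    """Open a hypothetical staging store (schema is an assumption)."""
    conn = sqlite3.connect(path)
    # Payloads are kept as raw JSON so the table mirrors the incoming
    # structure instead of being normalized.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging ("
        "user_id TEXT NOT NULL, kind TEXT NOT NULL, "
        "payload TEXT NOT NULL, crawled_at REAL NOT NULL)"
    )
    return conn


def replace_user_data(conn, user_id, items, now=None):
    """Drop the user's previous crawl and store the new one in one transaction."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM staging WHERE user_id = ?", (user_id,))
        conn.executemany(
            "INSERT INTO staging VALUES (?, ?, ?, ?)",
            [(user_id, kind, json.dumps(payload), now) for kind, payload in items],
        )


def purge_expired(conn, ttl_seconds, now=None):
    """Delete rows older than the TTL and return how many were removed."""
    now = time.time() if now is None else now
    with conn:
        cur = conn.execute(
            "DELETE FROM staging WHERE crawled_at < ?", (now - ttl_seconds,)
        )
    return cur.rowcount
```

In this sketch, `replace_user_data` would be called from the login handler after each crawl, while `purge_expired` would run on a schedule to remove data belonging to users who have not logged in recently.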

Data Warehouse

After the crawled data are stored in the Staging Area, they go through an ETL cycle to be archived in a Data Warehouse. The Data Warehouse is independently hosted as it is optimized for storage instead of read or write activities.

The purpose of the Data Warehouse is to store all data for an indefinite period of time, unless the administrator decides to drop data determined to be outdated and irrelevant. As such, the Data Warehouse has a highly normalized database structure, which removes redundancy in the stored data. A highly normalized structure is usually sub-optimal for Online Transaction Processing (OLTP) purposes.

Data Mart (OLTP)

Working directly on the Data Warehouse may be too intensive: its highly normalized structure is optimized for storage, so queries would consume a lot of resources and have long lead times due to the multiple joins, re-indexing and sub-queries. As with the other data stores, the OLTP Data Mart will be hosted independently to ensure isolated uptime and reliable performance for the operations of our portal, Chapalang.com.

As such, materialized views of denormalized data, possibly covering selected time periods, are created for the purpose of transactional processing. Denormalization introduces redundant data or data groups to optimize the read performance of a database. These materialized views will periodically be replaced and stored in an OLTP Data Mart, which supports the direct transactional queries of the portal.
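
As a rough illustration of the periodic rebuild, the sketch below uses SQLite, which has no native materialized views, so a plain table refreshed inside one transaction stands in for the view. The `users`, `posts` and `post_feed` tables and the `since` cut-off are hypothetical names chosen for the example, not the project's actual schema.

```python
import sqlite3


def build_schema(conn):
    """Create hypothetical normalized tables plus the denormalized feed table."""
    conn.executescript(
        """
        CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (
            post_id INTEGER PRIMARY KEY,
            user_id INTEGER REFERENCES users(user_id),
            body TEXT,
            created_at TEXT
        );
        -- Denormalized: the author name is copied next to every post.
        CREATE TABLE post_feed (
            post_id INTEGER, author TEXT, body TEXT, created_at TEXT
        );
        """
    )


def refresh_post_feed(conn, since):
    """Replace the feed with posts created on or after `since` (ISO date)."""
    with conn:  # readers see either the old feed or the new one
        conn.execute("DELETE FROM post_feed")
        conn.execute(
            "INSERT INTO post_feed "
            "SELECT p.post_id, u.name, p.body, p.created_at "
            "FROM posts p JOIN users u ON u.user_id = p.user_id "
            "WHERE p.created_at >= ?",
            (since,),
        )
    return conn.execute("SELECT COUNT(*) FROM post_feed").fetchone()[0]
```

The portal would then query `post_feed` directly without joins, at the cost of the data being only as fresh as the last refresh.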

Data Mart (OLAP)

Online Analytical Processing (OLAP) supports the basis of Chapalang Analytics, a set of Business Intelligence and Analytics tools which performs analysis of data collected about a user from various ODS.

Chapalang Analytics predicts user behaviours, personalities and preferences, and matches them with a generated set of metadata tags of products or discussion content, before finally generating recommendations of products or content suitable for a user and publishing them to the user’s view.
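
The matching step can be pictured with a very simple rule: rank items by how many of their metadata tags overlap with the tags predicted for a user. The sketch below is only an assumed stand-in for that idea. The function name, the overlap rule and the sample tags are hypothetical, and the actual Chapalang Analytics logic is not described on this page.

```python
def rank_by_tag_overlap(user_tags, items, limit=3):
    """Rank item names by the number of tags shared with the user.

    `items` maps an item name to its collection of metadata tags.
    Items sharing no tag with the user are left out.
    """
    user_tags = set(user_tags)
    scored = [(len(user_tags & set(tags)), name) for name, tags in items.items()]
    matches = [(score, name) for score, name in scored if score > 0]
    # Highest overlap first; ties broken alphabetically for stable output.
    matches.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in matches[:limit]]


if __name__ == "__main__":
    catalog = {
        "dslr": {"camera", "photo"},
        "guidebook": {"travel", "camera"},
        "blender": {"kitchen"},
    }
    print(rank_by_tag_overlap({"camera", "travel"}, catalog))
```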

In its process of generating recommendations, Chapalang Analytics adopts a combination of analytical techniques such as Sentiment Analysis and Rule-Based Analysis. Recommendations are generated on-the-fly, integrating large amounts of data from multiple sources, as and when users hit a page which requires content filtering in favour of recommendations. Hence, the high amount of on-demand computing resources it requires supports the case for hosting the OLAP Data Mart independently.

In addition, another advantage of hosting the OLAP Data Mart independently is that if the OLAP Data Mart fails or its performance deteriorates for any reason, the portal will continue to function, falling back to sub-optimal random recommendations for the users while the administrator is notified about the anomaly. Therefore, it reduces the impact of a single point of failure and isolates the bottleneck in the event of catastrophic occurrences.
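
A minimal sketch of this fallback to random recommendations is shown below. The `analytics` callable, the `notify` hook and the catalog are hypothetical stand-ins for the OLAP-backed service, the administrator alert and the portal's content list. Catching every `Exception` is a simplification for the example.

```python
import random


def recommend(user_id, catalog, analytics, limit=3, notify=print, rng=random):
    """Return analytics-driven recommendations, or random ones on failure."""
    try:
        return analytics(user_id)[:limit]
    except Exception as exc:  # broad on purpose: any analytics failure degrades
        notify(f"analytics unavailable for {user_id}: {exc}")
        # Sub-optimal fallback: a random sample of the catalog.
        return rng.sample(list(catalog), min(limit, len(catalog)))
```

Because the failure is handled at the call site, the page still renders while the anomaly is reported.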

System Architecture

In our setup, the database, web and application servers run independently to offer performance optimization and future scalability.

6-bit systemarch.png

There are a total of 4 database servers, running as virtual machines on a single powerful server. The purpose of the segregation is to allow different configurations for different database servers, optimizing each for storage or for performance. In this manner, database servers which consume high resources for indexing or creating views from big data will not cannibalize the resources reserved or prioritized for operational purposes.

Additionally, virtual servers allow advanced network configurations which enable each virtual machine to have custom network facilities. Bandwidth shaping can be designed to place operational and analytical servers on different networks to avoid cross-interference in performance.

Correspondingly, there is an Analytics Server which performs complex analysis on the data from the Operational Data Sources. It is independent because it is expected to perform data analysis on-the-fly, which takes up a high amount of resources.

Lastly, there is only 1 web server at this stage. The web server stores the application files, as well as media files uploaded by users. It can be scaled out further with more web servers behind a load balancer, and a separate storage server, when consumption of the service grows.

Scope Prioritization

As we progress in the development, we conduct several rounds of tests and consultations with stakeholders. In the process, many recommendations are offered, and we need a methodology to evaluate what we can adopt, what we have to reject, and what we may have to sacrifice if we adopt something.
Hence, we have come up with a Scope Prioritization Decision Model to help us cope with scope changes.


6-bit Scope Prior.png

The first step in our scope prioritization is to list all proposed functions and categorize them into 3 groups. Group A contains the basic utility functions, Group B consists of the main features, and Group C groups the supplementary features. Categorizing features into different groups allows us to be more objective in deciding which features offer more value.

Decision Table

6-bit DecisionTable.png

Using the table, we evaluate a group of functions based on a set of criteria which includes the following:

  • Time required
  • Ease of implementation
  • Combined risk
  • Alignment with business strategy (whether it matches the existing strategy)
  • Competitive advantage (whether it provides additional value to the existing strategy)

For each function-criterion cell, we assign a rank value between 1 and 3. Finally, we tabulate the scores to calculate the overall benefit. The overall benefit value helps us decide whether to add or drop a proposed function.
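
The tabulation can be sketched as below. The page does not state how the ranks are combined or which end of the scale is favourable, so this example assumes an unweighted sum in which 3 is the most favourable rank for every criterion. The criterion keys, function names and sample ranks are all hypothetical.

```python
CRITERIA = (
    "time_required",
    "ease_of_implementation",
    "combined_risk",
    "business_strategy_alignment",
    "competitive_advantage",
)


def overall_benefit(ranks):
    """Sum the 1-3 ranks of one function across all criteria (assumed rule)."""
    for criterion in CRITERIA:
        if ranks.get(criterion) not in (1, 2, 3):
            raise ValueError(f"{criterion} needs a rank of 1, 2 or 3")
    return sum(ranks[criterion] for criterion in CRITERIA)


def prioritize(functions):
    """Order function names from highest to lowest overall benefit."""
    return sorted(
        functions,
        key=lambda name: (-overall_benefit(functions[name]), name),
    )


if __name__ == "__main__":
    proposed = {
        "forum search": dict(zip(CRITERIA, (3, 3, 2, 3, 2))),
        "video chat": dict(zip(CRITERIA, (1, 1, 1, 2, 3))),
    }
    for name in prioritize(proposed):
        print(name, overall_benefit(proposed[name]))
```

With five criteria, the overall benefit under this assumption ranges from 5 to 15, and functions near the bottom of the ordering are the first candidates to drop.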