IS480 Team wiki: 2012T1 6-bit Project Management UT4
Chapalang.com offers a place for exchanging ideas and information on its public domain.
http://www.chapalang.com
Contents
- 1 Schedule
- 2 Testing
- 2.1 Test Cases
- 2.2 Test Plans
- 2.3 User Testing
- 2.3.1 User Testing 4
- 3 Milestones
- 4 Schedule Metric
- 5 Bug Metric
- 6 Risk & Mitigation
Schedule
Planned Schedule
Meeting Minutes
Team Meeting Minutes
Supervisor Meeting Minutes
Meeting Minute 1 | Meeting Minute 2 | Meeting Minute 3 | Meeting Minute 4 | Meeting Minute 5 | Meeting Minute 6 | Meeting Minute 7 | Meeting Minute 8 | Meeting Minute 9 | Meeting Minute 10 | Meeting Minute 11
Testing
Test Cases
Test Plans
Test Plan 1 on 17 September 2012
Test Plan 2 on 28 September 2012
Test Plan 3 on 19 October 2012
Test Plan 4 on 4 November 2012
User Testing
User Testing 1 | User Testing 2 | User Testing 3 | User Testing 4 |
User Testing 4
Test Description
User Test 4 focuses on scalability, performance and analytics testing of the system. It is a two-part test session: the first part, on scalability and performance, does not require physical testers; the second part, on inter-rater reliability, requires rating judges.
The scalability and performance test focuses on the bottleneck functions: the discussion forums and the marketplace. The terms “performance” and “scalability” are commonly used interchangeably, but the two are distinct: performance measures the speed with which a single request can be executed, while scalability measures the ability of a request to maintain its performance under increasing load.
Additionally, an Inter-Rater Reliability Test is performed on the Personalized Dashboard to determine the concordance between the personalized results and the actual personalities of the user stereotypes.
Testers Background
Scalability & Performance Testing
As this test does not require physical testers, the test environment is described below instead.
Inter-Rater Reliability Test
Testers will assume the role of raters, or judges, for our Inter-Rater Reliability Test. A total of 20 people participated, with an equal male-to-female ratio. Testers are stratified across diverse backgrounds, intended to represent the personality stereotypes we designed.
Personality stereotypes include characteristics such as gender, age group, education, personality traits, online activity, mobility and topics of interest.
Test Groups
There is no test grouping employed in this test.
Test Procedures
Scalability & Performance Testing
The first part of the test covers scalability and performance. Chapalang Benchmark is configured to record the time taken by each controller method, capturing a timestamp just before the method starts and another as soon as it ends. The results will be used to study the performance of the system and application at different scales of operation.
Additionally, a custom application is used to perform a series of activities on the forum and marketplace, simulating an arbitrary number of concurrent users on the system (the load).
Subsequently, we will study the benchmark timing data to understand the performance differences under different loads, as sketched below.
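As an illustration only, a minimal Python sketch of such a timing and load harness is shown below; the endpoint URLs, user counts and request mix are hypothetical placeholders, not the actual Chapalang Benchmark instrumentation.

```python
import time
import statistics
from concurrent.futures import ThreadPoolExecutor

import requests  # third-party HTTP client

# Hypothetical endpoints standing in for the forum and marketplace
# controller methods under test; the real Chapalang URLs differ.
ENDPOINTS = [
    "http://localhost/forum/topics",
    "http://localhost/marketplace/listings",
]

def timed_request(url):
    """Record wall-clock time around one request, mirroring the
    before-start / after-end timestamps taken by the benchmark."""
    start = time.perf_counter()
    requests.get(url, timeout=30)
    return time.perf_counter() - start

def run_load(url, concurrent_users, requests_per_user):
    """Simulate an arbitrary number of concurrent users hitting one URL."""
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        jobs = [url] * concurrent_users * requests_per_user
        return list(pool.map(timed_request, jobs))

for url in ENDPOINTS:
    for load in (1, 10, 50):  # increasing scales of operation
        timings = run_load(url, concurrent_users=load, requests_per_user=5)
        print(f"{url} @ {load} users: "
              f"median {statistics.median(timings) * 1000:.0f} ms, "
              f"max {max(timings) * 1000:.0f} ms")
```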
Inter-Rater Reliability Test
Inter-Rater Reliability (IRR) is the degree of agreement among raters. It scores how much homogeneity, or consensus, there is in the ratings given by judges.
The first rater is the system itself, which will generate a list of 10 product recommendations and 10 discussion-topic recommendations in descending order of relevance to a target user. Every recommendation is tied to a specific, distinct order number.
The second rater is a human tester, who will be provided with the same list of 10 products and 10 discussion topics generated by the system in relevance to him or herself. To mitigate the effects of Experimenter’s Bias, the items are presented in a randomized order without any intended logic. The second rater is expected to reorder the items according to his or her preferences, in descending order.
Subsequently, we will use Spearman’s Rank Correlation Coefficient to assess the reliability of our personalized dashboard, which features the product and topic recommendations.
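For illustration, the comparison step can be computed with SciPy as sketched below; the tester’s ranks here are made-up sample values, not actual test data.

```python
from scipy.stats import spearmanr

# Rank positions 1-10 over the same 10 recommended products.
# First rater: the system's recommendation order (1 = most relevant).
system_ranks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Second rater: an illustrative human reordering of the same items.
tester_ranks = [2, 1, 3, 5, 4, 7, 6, 10, 8, 9]

# rho near 1 indicates strong agreement between the two raters.
rho, p_value = spearmanr(system_ranks, tester_ranks)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.4f})")
```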
Test Instruction
Inter-Rater Reliability Test
This is a sample output of the first rater, for a product recommendation test.
This is a sample input sheet for the second rater, on product recommendation test.
Ranks are in descending order of relevance: 1 represents the most relevant item, while 10 represents the least relevant item.
Test Results
Scalability & Performance Test
Figure: User Test 4 scalability and performance results.
Inter-Rater Reliability Test
To evaluate the test results, we rely on the statistical model called Spearman’s Rank Correlation Coefficient (SRCC). The model is appended below:
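For $n$ ranked items with no ties, where $x_i$ and $y_i$ are the ranks assigned by the two raters and $d_i = x_i - y_i$ is the per-item rank difference, the standard formula is:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n(n^{2} - 1)}$$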
In short, the SRCC model takes the rank ratings from 2 different raters, represented by $x_i$ and $y_i$ respectively. The conventional coefficient of determination, by contrast, takes absolute data to find a statistical, data-driven correlation between 2 inputs.
However, since we are interested in the consensus of human judgment on the data, SRCC is a suitable model of analysis.
The SRCC model assumes that the rating scale is ordinal, i.e. a ranked scale. This assumption aligns with our 1–10 ranking, which is incremental and sequential. Additionally, the SRCC model considers only the relative positions of the ratings: for example, (1, 2, 1, 3) is considered perfectly correlated with (2, 3, 2, 4). This is acceptable in our test because each rank is distinct and exhaustive; no repeated or unused ranks are allowed.
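The relative-position example above can be checked directly (SciPy assigns tied values their average rank, so both sequences reduce to the same rank vector):

```python
from scipy.stats import spearmanr

rho, _ = spearmanr([1, 2, 1, 3], [2, 3, 2, 4])
print(rho)  # 1.0: a constant shift preserves relative positions,
            # so the two sequences are perfectly correlated
```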
The following is a sample of data tabulation in visual form.
Click Data Analysis (User Test 1 vs. User Test 2 – Forum Functions Only)
Additionally, click data from each test session has been collected and analysed, and compared against the results of User Test 1.
The box-plot above represents 3 sets of data comparing the number of clicks per task, for discussion forum functions only. UT1 represents the results from User Test 1, UT2A the results from Group A testers of User Test 2, and UT2B the results from Group B testers of User Test 2. For fair comparison, the results from User Test 2 have been drilled down to consist of forum-function data only.
The number of clicks taken per tester to accomplish a forum-related task in User Test 2 ranges from 1 to 3, with a median of 2 clicks: a decrease from the median of 3 clicks in User Test 1, as well as a smaller variance. Additionally, it can be observed that there is no significant difference in the results between Group A and Group B users.
Preliminarily, we can observe an improvement in the user experience for Group A users between the 2 tests. The improvement can be broadly attributed to the refinements made, as well as the high learnability of the system interface design. However, this observation is not conclusive and more data is required.
Additional statistics were computed: the median time spent to accomplish a forum-related task is 4 seconds for Group A testers and 5 seconds for Group B testers. Again, this is a significant decrease from User Test 1, where testers spent a median of 10 seconds on each task.
Based on these findings, we can reasonably conclude that user experience improved between User Test 1 and User Test 2, which we attribute to the improvements made and the high learnability of the system. In addition, the improved user experience is shared between both Group A and Group B users, possibly suggesting that the improved system does not require much training or impose a steep learning curve.
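As a sketch of how such summary statistics can be derived from raw click logs, the snippet below computes per-group medians and variance; the record format and values are hypothetical, not actual test data.

```python
import statistics

# Hypothetical per-task records: (group, clicks_per_task, seconds_per_task).
records = [
    ("UT2A", 2, 4), ("UT2A", 1, 4), ("UT2A", 3, 5), ("UT2A", 2, 4),
    ("UT2B", 2, 5), ("UT2B", 2, 6), ("UT2B", 3, 5), ("UT2B", 1, 4),
]

for group in ("UT2A", "UT2B"):
    clicks = [c for g, c, _ in records if g == group]
    seconds = [s for g, _, s in records if g == group]
    print(f"{group}: median clicks = {statistics.median(clicks)}, "
          f"median seconds = {statistics.median(seconds)}, "
          f"click variance = {statistics.pvariance(clicks):.2f}")
```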
Click Data Analysis (Group A vs. Group B – Marketplace Functions)
Next, we also study the difference in user experience between Group A and Group B testers on marketplace functions, based on the click data, which measures the number of clicks involved per task and the time taken in seconds between tasks.
In the box-plot diagram above, UT2A refers to User Test 2 Group A testers, while UT2B refers to User Test 2 Group B testers. Each box-plot represents the data of a specific group of users, with the results computed from the number of clicks or the time pertaining to forum or marketplace functions.
Comparing marketplace functions, both Group A and Group B testers made a median of 2 clicks to accomplish each task. While Group B testers show a wider variance of up to 4 clicks, this can be broadly attributed to outliers, user experimentation, or the learning curve involved in getting used to the functions and the placement of interface elements.
The result when comparing time taken is consistent with the preliminary conclusion drawn from the number of clicks per task. The median times taken by Group A and Group B testers for forum and marketplace functions are within the range of 4 to 5 seconds; the difference between the medians is insignificant.
Overall, the results are consistent across forum and marketplace functions, and between testers from both User Tests and test groups. They are also consistent with our earlier preliminary conclusion that the improvements made between the two User Tests resulted in an improved user experience, and that the interface design offers a good level of learnability.
While there are limitations to this test, such as externalities like network performance, the computing habits of testers and the response time of each user, the macro results provide a reasonable sample for the test’s objective.
Milestones
Schedule Metric
Bug Metric
Bug Log: Click Here