ANLY482 Enigma Methodology


Application Overview

Application overview.png

Data Wrangling

The data cleaning section aims to automate the current manual data verification process in the spreadsheet. The user simply uploads the raw data file in Excel format to our application portal, and two data files are generated automatically: one clean data set, and one data set containing only the suspicious records, with an additional column holding the verification comments. Once the errors have been fixed, the user uploads the suspicious data set to the same window again, and the application appends any rows that now pass verification to the clean data file. This iterative process continues until all errors in the data set have been cleaned and the data set is ready for report generation.

Firstly, since the raw data file can be large, we use the R library "openxlsx" to maintain application performance; it reads a massive data set in Excel format more efficiently than the "xlsx" package:

Upload.png
Figure 1. R code to upload file from a relative path for data cleaning with “openxlsx”
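
A minimal sketch of this step, assuming a hypothetical relative path "data/raw_data.xlsx"; the real filename and sheet may differ:

    library(openxlsx)

    # read the first sheet of the raw Excel file from a relative path;
    # openxlsx parses the workbook directly (no Java dependency),
    # which keeps large files manageable
    df_raw <- read.xlsx("data/raw_data.xlsx", sheet = 1)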

To provide a better user experience, we use Shiny, as shown below, to build a user interface for file uploading; "read.xlsx" then reads from the uploaded file's relative path, which is more dynamic.

Interface.png
Figure 2. User interface for file uploading of data cleaning
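
A compact sketch of such an upload interface, with a hypothetical input ID "rawfile"; the actual portal layout may differ:

    library(shiny)
    library(openxlsx)

    ui <- fluidPage(
      fileInput("rawfile", "Upload raw data file (.xlsx)", accept = ".xlsx")
    )

    server <- function(input, output) {
      df_raw <- reactive({
        req(input$rawfile)  # wait until a file has been chosen
        # Shiny stores the upload in a temporary location; read from there
        # (older Shiny versions may need the temp file renamed to end in .xlsx)
        read.xlsx(input$rawfile$datapath, sheet = 1)
      })
    }

    shinyApp(ui, server)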

Once the data file has been uploaded successfully, we perform data cleaning according to various business rules provided by KTPH. As required, we add a new column called "Verification" to the existing data frame to store comments on any suspicious record; its default value is empty.

Verification.png
Figure 3. R code to create a new column in a data frame with an empty default value
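
The column itself is a one-liner; assuming the data frame is named df_soc as elsewhere in this section:

    # add a "Verification" column with an empty comment for every record
    df_soc$Verification <- ""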

Some procedures require keeping only a certain part of the original raw data set, so we use the 'filter' function from the "dplyr" library. This function keeps only the records with particular values in a given column. For example, the R code below returns a new data frame, "df_soc_vt", containing only the records from "df_soc" whose "visit_type" is AB, CD, EF or GH.

Filter.PNG
Figure 4. R code to filter according to a column value
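
A sketch of that filter, using the column values named above:

    library(dplyr)

    # keep only records whose visit_type is one of the four values of interest
    df_soc_vt <- filter(df_soc, visit_type %in% c("AB", "CD", "EF", "GH"))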

Furthermore, most of the cleaning procedures need to check a certain column's value for each row of the data. For example, in the "df_soc_vt" data frame, a record is considered an error if its value in the "col_name_A" column is "XXX", in which case the system adds a comment message to the "Verification" column.

Nrow.PNG
Figure 5. R code to check a column value and assign comment message if necessary

This part of the code checks each row of "df_soc_vt" in a for-loop; "nrow" is an efficient way to obtain the total number of rows in the data frame, and the "paste" function writes the comment into the data frame cell.
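
A sketch of that pattern, assuming the hypothetical column "col_name_A" from the example above:

    for (i in 1:nrow(df_soc_vt)) {
      if (df_soc_vt$col_name_A[i] == "XXX") {
        # append a comment to any existing remarks for this record
        df_soc_vt$Verification[i] <- paste(df_soc_vt$Verification[i],
                                           "invalid value in col_name_A;")
      }
    }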

Another useful function for data cleaning is "grepl", which picks out the records that contain a target substring. For instance, we need to check whether any row contains "ABC" in the "wclass" column.

Grepl.PNG
Figure 6. R code to check the partial value of a column
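
A vectorized sketch of the same check, using the column named above:

    # TRUE for every row whose wclass value contains the substring "ABC"
    hits <- grepl("ABC", df_soc_vt$wclass)
    df_soc_vt$Verification[hits] <- paste(df_soc_vt$Verification[hits],
                                          "wclass contains ABC;")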

Lastly, the system separates the clean and suspicious records, saves them in different data frames, and writes them back to CSV files. This is done by checking the "Verification" column; the R built-in "write.table" function then writes each data frame to a CSV file efficiently, as shown below:

Write.png
Figure 7. R code to get clean data set and write back to a csv file
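
A sketch of the split-and-write step, with hypothetical output filenames:

    # records with an empty Verification column are clean; the rest are suspicious
    df_clean      <- df_soc_vt[df_soc_vt$Verification == "", ]
    df_suspicious <- df_soc_vt[df_soc_vt$Verification != "", ]

    write.table(df_clean, "clean_data.csv", sep = ",", row.names = FALSE)
    write.table(df_suspicious, "suspicious_data.csv", sep = ",", row.names = FALSE)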

To summarize, these are the functions used most frequently in the data cleaning process. Depending on the business rules, we adjust them accordingly to meet various needs.

Report Generation

The report generation section aims to automatically retrieve relevant information from the cleaned data set based on user input, compute the required attribute values, and output an Excel report with a fixed format.

Cleaned dataset uploading and handling

Upon launching our application, a user interface powered by 'Shiny' is displayed (Figure 8). The user is prompted to input attributes including Month, Year, Institution and Citizenship Type. In addition, the user is required to upload the cleaned ABC-201612.xlsx file verified by the data cleaning section.

Report interface.png
Figure 8. Report generation user interface

The uploaded clean data set is captured by the 'Shiny' UI section (Figure 9.1) and passed to the 'Shiny' server section (Figure 9.2), where it is transformed into a data frame, df_soc, using the R built-in function 'read.csv'.

9.1.png
Figure 9.1. Shiny UI section for file upload

9.2.png
Figure 9.2. Shiny server section for file handling

Data Transformation

Meanwhile, we realized that the values in the subspec_desc column (clean data set) do not align with those in the C department column (report). For example, while a department is marked as 'A1' in the subspec_desc column, it is reflected as 'A2' in the C department column. Both refer to the same department, yet the name mismatch may cause confusion. As a result, we created a mapping table by iterating through all rows of df_soc and mapping each subspec_desc value to its respective C department value.

10.PNG
Figure 10. Mapping table for subspec_desc to C Department

Data Filtering based on User Inputs

Building on this, we filter df_soc based on the user inputs of institution and patient citizenship type, with the 'filter' function from the 'dplyr' library (Figure 11).

11.png
Figure 11. Filter input data frame by institution and patient citizenship type
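
A sketch of that two-condition filter; the column names and input IDs here are hypothetical stand-ins for the inputs shown in Figure 8:

    library(dplyr)

    df_filtered <- filter(df_soc,
                          institution      == input$institution,
                          citizenship_type == input$citizenship)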

Report Required Attribute Computation

Computation of the report's required attributes starts by filtering the data frame by department and class with the 'filter' function from the 'dplyr' library. We then sum the patient days for that data frame with the R built-in function 'sum' and store the output values in predefined data frames (Figure 12). After iterating through all departments and classes, we collate all data frames into one output data frame named df_F10 (Figure 13), which serves as the data source for the generated report.

12.png
Figure 12. Data-frame filtering and summing for patient days

13.png
Figure 13. Data-frame collation

Excel Report Generation and Formatting

The 'createWorkbook' and 'addWorksheet' functions from the 'openxlsx' library are used to generate an Excel workbook containing a worksheet, which is later saved to the user's desktop (Figure 14.1). Meanwhile, other 'openxlsx' functions such as 'setColWidths', 'setRowHeights' and 'mergeCells' are used for Excel cell formatting (Figure 14.2).

14.1.png
Figure 14.1. Excel workbook and worksheet generation

14.2.png
Figure 14.2. Excel cell formatting functions

Building on this, 'openxlsx' functions such as 'createStyle' and 'addStyle' are used to standardize the styling elements of the output Excel report (Figure 15). Together, these functions ensure that the output Excel report's format is in line with the standard template.

15.png
Figure 15. Excel styling functions

Eventually, the 'writeData' function from the 'openxlsx' library is used to write the data from the previously generated data frame, df_F10, into the created worksheet (Figure 16).

16.png
Figure 16. Output data from collated data-frame to Excel worksheet

Lastly, to facilitate version control for the user, an extra line of remark is added to the output Excel report (Figure 17). It captures the current time with the 'Sys.time()' function and the input data set's filename with the 'inFile$name' attribute (Figure 18).

17.png
Figure 17. Version Control remark

18.png
Figure 18. Capturing current time and input dataset filename

In the end, an Excel report with a customized filename and formatting, containing the computed data output, is generated on the user's desktop. A status message is displayed on the user interface, signalling that the report has been stored on the user's desktop (Figure 19).

19.png
Figure 19. Report Generation status message

Data Visualization

We use the R libraries DT and sparkline to produce interactive data tables with inline graphs inside the table cells, and Shiny to create the visualization dashboard.

Data Preparation

As we would like to visualize the changes in the number of cases for each clinical department over multiple months, we first need to organize the data such that records for the same clinical department but different time periods are grouped together. In the data file, however, the records are organized in time sequence. Hence, we first extract a list of unique months from the data records, using the lubridate library to process the dates as shown in the screenshot below.

20.png
Figure 20. Extract unique month from the data record

Then we use a for-loop to iterate through the records and group them by clinical department, as sketched below.
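
A minimal version of that grouping loop, with a hypothetical "department" column:

    # split the records into one data frame per clinical department
    grouped <- list()
    for (dept in unique(df$department)) {
      grouped[[dept]] <- df[df$department == dept, ]
    }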

Data Column Transformation

After the for-loop, we obtain a data frame in which each data column holds a vector of values, for example private_new_case = c(1,2,3,4,5,6). However, the sparkline library cannot interpret this format, so we need to process the data columns further. Here, we concatenate each original data column into one single string using the paste() function, as shown below. After the transformation, the column value is in the format "1,2,3,4,5,6".

21.png
Figure 21. Concatenate the data vector into one single string

DT + Shiny

Finally, we integrate the data table with Shiny. In the server logic, we define the CSS format for the sparkline plot. We also define the targets for the plot, which indicate the data columns to which the sparkline configuration should be applied. Lastly, in output$tbl, we pass the data frame we constructed into as.data.frame(); in this case, dt is the data frame used for the visualization.

22.png
Figure 22. Integrate DT with Shiny

Visualization Dashboard

The dashboard is designed and implemented with the help of the DT and Shiny libraries. The datatable function in the DT library allows us to create an HTML widget that displays rectangular data (a matrix or data frame) using the JavaScript library DataTables. The DT library also provides search and sort features, which allow the user to filter out specific data rows for further analysis. The created HTML widget is then incorporated into the Shiny app for display.

Below is a screenshot of the visualization dashboard. The leftmost column lists the clinical departments and serves as the identifier. The remaining columns display the changes in the number of cases as sparkline plots; for example, private_new_case presents the changes in the number of new cases in the private group. By default, 10 entries are shown on the first page, but the user can select the number of entries to display via the dropdown list. Moreover, the search box allows the user to filter for a specific clinical department; it responds immediately as the user types the first few letters, without requiring the complete name.

23.png
Figure 23. Visualization Dashboard