ANLY482 Enigma Methodology
Application Overview
Data Wrangling
The data cleaning section aims to automate the current manual data verification process performed in the spreadsheet. The user simply uploads the raw data file in Excel format to our application portal, and two data files are generated automatically: one clean data set and one data set containing only the suspicious records, with an additional column holding the verification information. Once the errors have been fixed, the user uploads the "suspicious data set" to the same window again, and the application appends any rows that pass the verification process to the clean data file. This process repeats iteratively until all errors in the dataset have been cleaned and the dataset is ready for report generation.
Firstly, as the raw data file can be large, we use the R library "openxlsx" to maintain application performance; it allows the system to read a massive data set in Excel format more efficiently than the "xlsx" package:
Figure 1. R code to upload file from a relative path for data cleaning with “openxlsx”
To provide a better user experience, we use R Shiny, as shown below, to build a user interface for file uploading; "read.xlsx" then reads from the uploaded file's relative path, which is more dynamic than a hard-coded location.
Figure 2. User interface for file uploading of data cleaning
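A minimal sketch of this upload-and-read flow follows; the input id "file1" and the sheet index are illustrative assumptions:

    library(shiny)
    library(openxlsx)

    ui <- fluidPage(
      # "file1" is an assumed input id for the raw Excel upload
      fileInput("file1", "Upload raw data file (.xlsx)")
    )

    server <- function(input, output) {
      observeEvent(input$file1, {
        # read.xlsx reads from the uploaded file's temporary (relative) path;
        # this assumes the datapath keeps the .xlsx extension, as recent
        # Shiny versions do
        df_soc <- read.xlsx(input$file1$datapath, sheet = 1)
      })
    }

    shinyApp(ui, server)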
Once the data file has been uploaded successfully, we perform data cleaning according to the various business rules provided by KTPH. As required, we add a new column called "Verification" to the existing data frame to store the comments for any suspicious record; its default value is empty.
Figure 3. R code to create a new column of a data frame with default value as empty
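The assignment itself is a one-liner, where "df_soc" stands for the uploaded data frame:

    # add a "Verification" column with an empty string as the default value
    df_soc$Verification <- ""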
Some of the procedures require keeping only a certain part of the original raw data set, so we use the 'filter' function from the "dplyr" library. This function keeps only the records with particular values in a given column. For example, the R code below returns a new data frame "df_soc_vt", which contains only the records from "df_soc" whose "visit_type" is AB, CD, EF or GH.
Figure 4. R code to filter according to a column value
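A sketch of this filtering step, mirroring the example described above:

    library(dplyr)

    # keep only records whose visit_type is one of the four accepted codes
    df_soc_vt <- filter(df_soc, visit_type %in% c("AB", "CD", "EF", "GH"))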
Furthermore, most of the cleaning procedures need to check a certain column's value for each row of the data. For example, in the "df_soc_vt" data frame, a record is considered an error if its value in the "col_name_A" column is "XXX", in which case the system adds a comment message to the "Verification" column.
Figure 5. R code to check a column value and assign comment message if necessary
This part of the code checks each row of "df_soc_vt" in a for-loop; "nrow" is an efficient way to obtain the total number of rows in the data frame, and the "paste" function writes the comment into the corresponding data frame cell.
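A minimal sketch of this row-by-row check; the exact comment text is an assumption:

    for (i in 1:nrow(df_soc_vt)) {
      if (df_soc_vt$col_name_A[i] == "XXX") {
        # append a comment to the Verification cell for this suspicious row
        df_soc_vt$Verification[i] <- paste(df_soc_vt$Verification[i],
                                           "invalid col_name_A value")
      }
    }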
Another useful function for data cleaning is "grepl", which picks out the records that contain the target information. For instance, we need to check whether any row contains "ABC" in the "wclass" column.
Figure 6. R code to check the partial value of a column
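A sketch of the "grepl" check, again with an assumed comment message:

    for (i in 1:nrow(df_soc_vt)) {
      # grepl returns TRUE when "ABC" appears anywhere in the wclass value
      if (grepl("ABC", df_soc_vt$wclass[i])) {
        df_soc_vt$Verification[i] <- paste(df_soc_vt$Verification[i],
                                           "wclass contains ABC")
      }
    }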
Lastly, the system separates the clean and suspicious records, saves them in different data frames, and writes them back to csv files. This is done by checking the "Verification" column; the R built-in "write.table" function then writes each data frame into a csv file efficiently, as shown below:
Figure 7. R code to get clean data set and write back to a csv file
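A sketch of this final split-and-write step; the output filenames are assumptions:

    # records with an empty Verification column are considered clean
    df_clean      <- df_soc_vt[df_soc_vt$Verification == "", ]
    df_suspicious <- df_soc_vt[df_soc_vt$Verification != "", ]

    write.table(df_clean, "clean_data.csv",
                sep = ",", row.names = FALSE)
    write.table(df_suspicious, "suspicious_data.csv",
                sep = ",", row.names = FALSE)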
To summarize, these are the functions used most frequently in the data cleaning process. Depending on the business rules, we adjust them accordingly to meet various needs.
Report Generation
The report generation section aims to automatically retrieve relevant information from the cleaned dataset based on user input, compute the required attribute values and output an Excel report with a fixed format.
Cleaned Dataset Uploading and Handling
Upon launching our application, a user interface powered by 'Shiny' is displayed (Figure 8). The user is prompted to input attributes including Month, Year, Institution and Citizenship Type. In addition, the user is required to upload the cleaned ABC-201612.xlsx verified by the data cleaning section.
Figure 8. Report generation user interface
The uploaded clean dataset is captured by the 'Shiny' ui section (Figure 9.1) and passed over to the 'Shiny' server section (Figure 9.2), where it is transformed into a data frame, df_soc, using the R built-in function 'read.csv'.
Figure 9.1. Shiny UI section for file upload
Figure 9.2. Shiny server section for file handling
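A condensed sketch of the upload-and-read flow in Figures 9.1 and 9.2; the input ids and widget choices are assumptions:

    library(shiny)

    ui <- fluidPage(
      selectInput("month", "Month", month.name),   # assumed input ids
      numericInput("year", "Year", value = 2016),
      fileInput("cleanFile", "Upload cleaned dataset (.csv)")
    )

    server <- function(input, output) {
      observeEvent(input$cleanFile, {
        inFile <- input$cleanFile
        # transform the uploaded file into the df_soc data frame
        df_soc <- read.csv(inFile$datapath, stringsAsFactors = FALSE)
      })
    }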
Data Transformation
Meanwhile, we realized that the values in the subspec_desc column (clean dataset) do not align with those in the C department column (report). For example, while a department is marked as 'A1' in the subspec_desc column, it is reflected as 'A2' in the C department column. Both refer to the same department, yet the name mismatch may cause confusion. As a result, we created a mapping table and iterate through all rows of df_soc to map every subspec_desc value to its respective C department value.
Figure 10. Mapping table for subspec_desc to C Department
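One way to express such a mapping table is a named vector lookup applied to every row; the department codes here are the placeholder values from the example above, with one pair per department in practice:

    # named vector: subspec_desc value -> C department value
    dept_map <- c("A1" = "A2")

    for (i in 1:nrow(df_soc)) {
      df_soc$subspec_desc[i] <- dept_map[[df_soc$subspec_desc[i]]]
    }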
Data Filtering based on User Inputs
Building on this, we filter df_soc based on the user inputs of institution and patient citizenship type with the 'filter' function from the 'dplyr' library (Figure 11).
Figure 11. Filter input data frame by institution and patient citizenship type
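A sketch of this filter; the column names and input ids are assumptions:

    library(dplyr)

    df_soc <- filter(df_soc,
                     institution      == input$institution,
                     citizenship_type == input$citizenship)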
Report Required Attribute Computation
Computation of the report's required attributes starts by filtering the data frame based on department and class with the 'filter' function from the 'dplyr' library. After that, we sum up the patient days for that data frame with the R built-in function 'sum' and store the output values in predefined data frames (Figure 12). After iterating through all departments and classes, we collate all data frames into one output data frame named df_F10 (Figure 13), which serves as the data source for the generated report.
Figure 12. Data-frame filtering and summing for patient days
Figure 13. Data-frame collation
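A sketch of the filter-sum-collate loop; the column names, including the patient_days field, are assumptions:

    library(dplyr)

    df_F10 <- data.frame()
    for (dept in unique(df_soc$c_department)) {
      for (cls in unique(df_soc$class)) {
        subset_df <- filter(df_soc, c_department == dept, class == cls)
        # sum patient days for this department/class combination
        total_days <- sum(subset_df$patient_days, na.rm = TRUE)
        df_F10 <- rbind(df_F10,
                        data.frame(department   = dept,
                                   class        = cls,
                                   patient_days = total_days))
      }
    }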
Excel Report Generation and Formatting
The 'createWorkbook' and 'addWorksheet' functions from the 'openxlsx' library are used to generate an Excel workbook containing a worksheet on the desktop (Figure 14.1). Meanwhile, other 'openxlsx' functions such as 'setColWidths', 'setRowHeights' and 'mergeCells' are used for Excel cell formatting (Figure 14.2).
Figure 14.1. Excel workbook and worksheet generation
Figure 14.2. Excel cell formatting functions
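A sketch of the workbook creation and cell formatting calls; the sheet name and dimensions are assumptions:

    library(openxlsx)

    wb <- createWorkbook()
    addWorksheet(wb, "Report")   # assumed sheet name

    setColWidths(wb, sheet = "Report", cols = 1:10, widths = 15)
    setRowHeights(wb, sheet = "Report", rows = 1, heights = 30)
    mergeCells(wb, sheet = "Report", cols = 1:10, rows = 1)  # title row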
Building on this, 'openxlsx' functions such as 'createStyle' and 'addStyle' are used to standardize the styling elements in the output Excel report (Figure 15). Together, these functions ensure that the format of our output Excel report is in line with the standard template.
Figure 15. Excel styling functions
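A sketch of the styling calls; the style attributes shown are placeholders for the template's actual settings:

    headerStyle <- createStyle(fontSize = 12, textDecoration = "bold",
                               halign = "center",
                               border = "TopBottomLeftRight")
    addStyle(wb, sheet = "Report", style = headerStyle,
             rows = 1:2, cols = 1:10, gridExpand = TRUE)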
Finally, the 'writeData' function from the 'openxlsx' library is used to write the data from the previously generated data frame, df_F10, into the created worksheet (Figure 16).
Figure 16. Output data from collated data-frame to Excel worksheet
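A sketch of writing the collated data frame and saving the workbook; the output path and start row are assumptions:

    writeData(wb, sheet = "Report", x = df_F10, startRow = 3, startCol = 1)
    saveWorkbook(wb, file = "~/Desktop/report_F10.xlsx", overwrite = TRUE)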
Lastly, in order to facilitate version control for the user, an extra remark line is added to the output Excel report (Figure 17). It captures the current time with the 'Sys.time()' function and the input dataset filename with the 'inFile$name' attribute (Figure 18).
Figure 17. Version Control remark
Figure 18. Capturing current time and input dataset filename
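A sketch of composing and placing the remark; inFile is assumed to hold the Shiny upload object, and the placement row is illustrative:

    remark <- paste("Generated on", Sys.time(),
                    "from input file", inFile$name)
    writeData(wb, sheet = "Report", x = remark,
              startRow = nrow(df_F10) + 5, startCol = 1)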
In the end, an Excel report with a customized filename and formatting, containing the computed data output, is generated on the user's desktop. A status message is displayed on the user interface, signalling that the report has been stored on the user's desktop (Figure 19).
Figure 19. Report Generation status message
Data Visualization
We use the R libraries DT and sparkline to produce interactive data tables with inline graphs inside the table cells, and Shiny to create the visualization dashboard.
Data Preparation
As we would like to visualize the changes in the number of cases for each clinical department across multiple months, we first need to organize the data so that records for the same clinical department but different time periods are grouped together. In the data file, however, the data are organized in time sequence. Hence, we first extract a list of unique months from the data records, using the lubridate library to process the dates as shown in the screenshot below.
Figure 20. Extract unique month from the data record
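A sketch of the month extraction with lubridate; the date column name and its day-month-year format are assumptions:

    library(lubridate)

    # parse the date column and truncate each date to the first day
    # of its month, then collect the distinct months
    df$month <- floor_date(dmy(df$visit_date), unit = "month")
    unique_months <- sort(unique(df$month))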
Then we use a for-loop to iterate through the records and group them by clinical department, as sketched below.
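A minimal sketch of the grouping loop, assuming a department column and a private_new_case count per record:

    depts <- unique(df$department)
    dt <- data.frame(department = depts, stringsAsFactors = FALSE)
    dt$private_new_case <- vector("list", length(depts))

    for (i in seq_along(depts)) {
      rows <- df[df$department == depts[i], ]
      rows <- rows[order(rows$month), ]   # keep the months in order
      dt$private_new_case[[i]] <- rows$private_new_case
    }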
Data Column Transformation
After the for-loop, we get a data frame in which each data column holds a vector of values per department, for example private_new_case = c(1,2,3,4,5,6). However, this format cannot be interpreted by the sparkline library, so we need to process the data columns further. Here, we concatenate each original data column into one single string using the paste() function, as shown below. After the transformation, the original data column is in the format "1,2,3,4,5,6".
Figure 21. Concatenate the data vector into one single string
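The collapse itself is one call per column:

    # turn each per-department vector into a "1,2,3,4,5,6"-style string
    dt$private_new_case <- sapply(dt$private_new_case, paste, collapse = ",")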
DT + Shiny
Finally, we integrate the data table with Shiny. In the server logic, we define the CSS format for the sparkline plot. We also define the targets for the plot, which indicate the data columns to which the sparkline configuration should be applied. Lastly, in output$tbl, we pass the data frame we constructed to as.data.frame(); in this case, dt is the data frame we use for the visualization.
Figure 22. Integrate DT with Shiny
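A sketch of this integration following the common DT-plus-sparkline pattern; the target column indices and the CSS class name "spark" are assumptions:

    library(shiny)
    library(DT)
    library(sparkline)

    # wrap the series columns in a span so the draw callback can find them
    cd <- list(list(targets = 1:5, render = JS(
      "function(data, type, full){ return '<span class=spark>' + data + '</span>' }")))

    # after each table draw, turn the comma-separated strings into line charts
    cb <- JS("function(oSettings, json){
      $('.spark:not(:has(canvas))').sparkline('html', { type: 'line' });
    }")

    ui <- fluidPage(dataTableOutput("tbl"))

    server <- function(input, output) {
      output$tbl <- renderDataTable({
        d <- datatable(as.data.frame(dt), rownames = FALSE,
                       options = list(columnDefs = cd, fnDrawCallback = cb))
        # attach the sparkline JavaScript dependency to the widget
        d$dependencies <- append(d$dependencies,
                                 htmlwidgets:::getDependency("sparkline"))
        d
      })
    }

    shinyApp(ui, server)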
Visualization Dashboard
The dashboard is designed and implemented with the help of the DT and Shiny libraries. The datatable function in the DT library allows us to create an HTML widget that displays rectangular data (a matrix or data frame) using the JavaScript library DataTables. The DT library also provides search and sort features, which allow the user to filter for specific data rows for further analysis. The created HTML widget is then incorporated into the Shiny app for display.
Below is a screenshot of the visualization dashboard. The leftmost column lists the clinical departments and serves as the identifier. The remaining columns display the changes in the number of cases as sparkline plots; for example, private_new_case presents the changes in the number of new cases in the private group. By default, 10 entries are shown on the first page, but the user can select the number of entries to display via the dropdown list. Moreover, the search box allows the user to filter for a specific clinical department. The search is responsive: it matches as soon as the user types the first few letters, without requiring the complete name.