ISSS608 2015 16T1 Group2 Report
- 1 Application Motivation
- 2 Review and Critic on Past Works
- 3 Design Framework
- 4 Technical Processes
- 4.1 Data Scraping
- 4.2 Data Processing
- 4.3 Text Analytics
- 4.4 Visualization Work
- 5 Demonstration: Sample Test Cases
- 6 Discussion
- 7 Future Work
- 8 Installation Guide
- 9 User Guide
The application aims to encourage users who are interested in local politics to explore and read more about Singapore Parliamentary discussions. It does this by transforming a database of text-based discussion minutes into an interactive online browser that provides effective visualizations of both quantitative and qualitative data. The table below shows an initial draft of our product.
Review and Critic on Past Works
Official Hansard Records
The first source we referred to was the official Singapore Parliamentary Debates records (Hansard) viewer.
Secondly, it takes a significant amount of time to go through each article to gauge whether it is of use to the reader. This includes finding out which speakers or parties were involved, and whether the article is a simple question and answer or an elaborate debate. It is very difficult to understand the profile of any speaker without reading through all the articles.
Thirdly, there is no "download all" button for easier retrieval. At the time of our project, the data for the 12th Parliament was not yet available on CD-ROM.
Attendance records & speaking frequency
Next, we have recently seen an example on the use of data in journalism: http://themiddleground.sg/2015/08/12/tmg-exclusive-speaking-truth-parliament/
While this is a great first step in collating data from the records, it is still limited to attendance and speaking frequency. This might not be sufficient to evaluate the performance of MPs in Parliamentary discussions.
The core aim of our project is to provide politically savvy Singaporean voters with a tool to review MPs' participation in Parliament without having to go through thick stacks of meeting minutes. In other words, we want to give users a quick way to keep up with trends in Parliament.
However, we also strongly feel that there is a limit to how far quantitative variables can help a voter choose his/her MP. Thus, we want to ensure that our tools always point the user to the original conversation links, so that he/she can read further and catch up on the qualitative side as well.
Regarding the evaluation of MPs, we ran a small informal survey on what constitutes a good MP in terms of behavior in Parliament. In general, three aspects recurred strongly:
- Interest (Whether the MP speaks about areas that I am concerned about)
- Knowledge (The depth of knowledge that each MP has in his/her specialized field)
- Quality of conversation (Whether the MP engages in constructive debate in parliament)
The first stage of the project was to brainstorm the topic. We wanted a topic that is meaningful and adds significant value for users. At the same time, we wanted to apply the visualization concepts that we had learnt during the course.
After one week of topic research, we selected Singapore Parliament debates as our project. We chose this topic because we observed that, during the recent Singapore election, people focused heavily on media coverage and speeches, but not enough work was done on evaluating the MPs' past discussions in Parliament.
During the initial phase, we considered:
1) What is the aim and scope of the project?
2) What would the audience be interested in about Parliament? Topics? Their MPs? Opposition parties?
3) Where do we get the data, and what kind of data preparation is needed?
4) How do we extract the data from web content that is in textual form?
5) What types of scraping tools are available, and which are the most suitable?
6) Which visualization tools are suitable for our project? Tableau, Tibco Spotfire or Datawatch?
7) What kinds of graphs are we going to use? Tree Map, Sunburst or Heat Map?
In the second phase, scraping the data into Excel format was a tedious and time-consuming process. As the data has over 3,000 records, care needed to be taken at every step. If a mistake was made, the scraping process had to be repeated all over again. At this point, we loaded the data into WordStat to see what kinds of topics were possible, as well as the distribution of conversation types.
Due to limited space on a modern browser, the design of the layout is very important in communicating only the key facts and figures to the reader. Therefore, we wanted our visualizations to be focused on only the three aspects of each MP, looking at one MP's profile at a time.
Mainly we were interested in visualizations that could "summarize" lengthy texts such as books or news:
Final End Product
With the Hansard browser, we decided to split the usage into three modes:
Mode 1: Conversations started by speakers
Mode 2: Conversations participated by speakers
Mode 3: Topics of conversations participated over time
The visual flow for the layout is always set from left to right, from topics distribution to conversation list in the middle, and then to the speakers who have participated or started the discussions.
As the key point of the first two modes was to encourage users to read into the logs, the lists are always placed in the center of the screen, taking up the majority of the screen space. This also allows users to quickly glance through the titles of the conversations. The sort options for the lists are ordered by date first, followed by exchanges and lastly word count, that is, from the most useful for insights to the least.
While starting a conversation might be important, participating is equally if not more important. However, the visuals for mode 2 are less intuitive, given that each topic might have multiple participants. Filtering by "participated by Chinese MP" will also return values for other races, as MPs of other races have participated in the same conversations. This part requires more time to explain in the user manual.
The date chart was placed at the bottom to give a rough guide to when the discussions took place and how many there were, across both topics and speakers.
The "Exchanges" and "wordcount" metrics were not relevant for mode 1 and are only shown in mode 2. They also received the additional distribution indicator for those interested. For Mode 3, we focused mainly on the top 6 discussed topics, but as mentioned in the user guide, users are free to customize the order and the graphs that they wish to have.
In the network graph, we wanted to achieve three main insights for the user:
- Where different parties stand in terms of conversations
- In terms of clusters, which MPs or conversations are more specialized or "niche"
- Visualize any patterns between starting/participating in conversations
We used the three primary RGB colors for the parties to provide the most differentiation between them. We sized the nodes according to the out-degree, that is, the number of conversations participated in. Conversations were given a duller grey so that the party colors stand out more.
For edges, we chose yellow for "started" and cyan for "participated", using color temperature to scale the level of participation from high to low.
Using the ForceAtlas 2 layout, we were able to achieve a relatively good spread of speakers and conversations. We then removed the parliament speakers from the graph to provide better node size scaling, but left the admin conversations in as reference.
We merged the conversation title and MP names into the same column so that the mouseover option will show both options correctly.
For this chart we used the modularity function to split the nodes into various groups. We deleted the administrative posts to give a better perspective on the main debates. We kept the edge coloring scheme consistent, to be in line with the previous graph.
We colored the conversation nodes too, so that when an MP node is clicked, users can quickly gauge the breadth of coverage through the location distribution and variety of colors that it is connected to.
The grouping option was set to the cluster variable so that users can view specific clusters in isolation.
Sidebar: When any node is clicked, the sidebar shows the full details of the MP, or the URL of the conversation, so that users can readily access the link to it.
Hive plots & Co-occurrence Matrix
Here, we mainly want to illustrate patterns in interactivity: whether certain "pairings" are more likely to happen than others, and how groups of variables (GRC, gender, race, etc.) interact with each other.
Here, we split the nodes by party, and sized according to the degree of each node (number of connections). While the inter-party relationships are clear, it is rather hard to see intra-party connections as they overlap the nodes.
With the 6 axis charts, we can have a clearer view of the intra party relationships. We did other variants that sorted by gender and/or race, but found that the most effective version is by:
Through the feedback we got from the poster presentation, we feel that while the hive plots are great for an overview, it is hard to quickly compare between individuals, as it requires the user to mouse over/click each name and retain a visual memory before moving to the next.
Thus we used the co-occurrence matrix to show all connections at a single glance. On top of the demo provided by Mike Bostock, we added more options for both sorting and coloring the nodes. This allows the user to jump between pairs of variables to find patterns in interactions.
Unique Conversation Links
Discussion Content Scraping
Once we had the links, we used Kimonolabs to scrape the content of each conversation. Kimonolabs is able to recognize each component through HTML tags and copy these as separate fields for each record. It was also able to scrape the names (text in bold) and the body of each conversation paragraph. Due to the slow speed, we split the tasks across different scraper instances running simultaneously and combined the data through Kimonolabs' APIs.
Our first export was in .csv format and required a significant amount of table rework, as the tables of speakers and conversations were mixed together. Later, we found out that it was possible to export the pulled tables into Google Drive as separate sheets, which reduced the hassle significantly.
Matching speaker names to each paragraph
After scraping the content of each conversation post, we broke the data down in Excel to assign a speaker to each paragraph. For rows that do not start with a speaker name, we look for the content before a colon ":".
Paragraphs that do not start with a name are assigned to the preceding speaker. If no name is detected in a conversation's first paragraph, we classify the paragraph as ParlStatus.
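The assignment rule above was applied in Excel; as an illustration only, the same logic can be sketched in Python. The regular expression, the function name `assign_speakers` and the sample paragraphs are assumptions made for this sketch and are not taken from the project files. A real record would need a stricter pattern, since any capitalised text before a colon is treated as a name here.

```python
import re

# Assumed pattern: a capitalised, name-like prefix followed by a colon.
SPEAKER_PREFIX = re.compile(r"^([A-Z][^:]{1,60}):\s*(.*)$", re.DOTALL)


def assign_speakers(paragraphs):
    """Return (speaker, text) pairs for the paragraphs of one conversation."""
    rows = []
    current = "ParlStatus"  # used until the first named speaker appears
    for para in paragraphs:
        stripped = para.strip()
        match = SPEAKER_PREFIX.match(stripped)
        if match:
            # A new speaker starts here; remember the name for later rows.
            current = match.group(1).strip()
            text = match.group(2).strip()
        else:
            # No name found: carry the previous speaker forward.
            text = stripped
        rows.append((current, text))
    return rows


# Made-up sample paragraphs, not real Hansard text.
sample = [
    "[Mr Speaker in the Chair]",
    "Mr Tan: I have a question.",
    "It concerns transport.",
    "The Minister for Transport: Thank you.",
]
result = assign_speakers(sample)
```

Carrying the last seen name forward mirrors the "assign to the preceding speaker" rule, and the initial value covers the ParlStatus case for unnamed opening paragraphs.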
Consolidating a list of speakers
We gathered the scraped names and manually bundled them under a single ID per MP, as each MP had multiple aliases within the conversation records.
We cleaned up any other status-related paragraphs with a "parlstatus" column in Excel. These paragraphs are tagged as parliament status messages instead of being assigned to any speaker.
We also added other profile data for the MPs such as gender, race, GRC during the 12th Parliament for later use.
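The alias bundling was done by hand; the lookup it produces can be sketched as a small Python mapping. The alias table, the IDs and the names below are invented for illustration, and the normalisation rule (lower-casing, dropping full stops, collapsing spaces) is an assumption rather than the rule actually used.

```python
def normalise(name):
    """Lower-case a name, drop full stops and collapse repeated spaces."""
    return " ".join(name.lower().replace(".", "").split())


def build_alias_index(alias_table):
    """Map every normalised alias to its speaker ID."""
    index = {}
    for speaker_id, aliases in alias_table.items():
        for alias in aliases:
            index[normalise(alias)] = speaker_id
    return index


def resolve(name, index):
    """Return the speaker ID for a scraped name, or 'UNKNOWN' if unmapped."""
    return index.get(normalise(name), "UNKNOWN")


# Hypothetical alias table: one ID per MP, several spellings each.
aliases = {
    1: ["Mr Tan Ah Kow", "Mr. Tan Ah Kow", "The Minister for Examples"],
    2: ["Ms Lee Bee Lian", "Ms  Lee Bee Lian"],
}
index = build_alias_index(aliases)
```

Names that come back as 'UNKNOWN' are the ones that would still need manual review.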
Technical review of available methods
In text analytics, we looked at various ways of creating topics. Initially, we considered using hierarchical topic modelling as proposed by David M. Blei, to enable users to look at either macro topics or drill down into smaller topics. We did not pursue this as there was no readily available code to do it.
We loaded the data into WordStat to visualize the top words used, and ran initial trials of topic clustering. We did this by combining the conversation title with the conversation text in R, and then exporting the result in .csv format.
WordStat is part of the QDA Miner package developed by Provalis Research http://provalisresearch.com/. QDA Miner handles the files and records, while the WordStat module handles the text analytics.
We first skimmed the top occurring words, sorted by TF-IDF https://en.wikipedia.org/wiki/Tf%E2%80%93idf , looking for any words that we could add to the stopword list. We then used the clustering function to group topics via LDA, and observed the top words given per topic. We also read the conversations in each topic to see whether they made sense in that grouping. Usually, we ended up with some "junk" topics that did not seem to mean anything, so we repeated the first step of stopword removal.
We repeated this process many times and once we could reliably get decent topics, we exported the stopword list as a text file to be used in the next step.
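The TF-IDF ranking used for the stopword review can be illustrated with a short Python sketch. This is not WordStat's implementation: the weighting below (term frequency as a share of the document, inverse document frequency as the natural log of N over document frequency) is one common variant chosen as an assumption, and the token lists are made up.

```python
import math
from collections import Counter


def tfidf_by_term(documents):
    """Sum TF-IDF over all documents for each term.

    documents is a list of token lists, one list per conversation.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    scores = Counter()
    for tokens in documents:
        if not tokens:
            continue  # skip empty conversations to avoid dividing by zero
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            scores[term] += tf * idf
    return scores


# Made-up token lists standing in for three conversations.
docs = [
    ["the", "minister", "transport"],
    ["the", "minister", "health"],
    ["the", "school", "fees"],
]
scores = tfidf_by_term(docs)
```

A term that appears in every conversation scores zero under this weighting, which is the kind of word that would be a candidate for the stopword list.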
Next, we loaded the text data into GraphLab through Python in order to generate the topics we needed through the text clustering API. We also included the new stopword list we had from WordStat. We loaded the detected topics back into Excel to name each topic.
The GIF file above highlights the steps we took in Python notebook to generate the topics and assign to each conversation.
Sentiment Analysis Attempt
We also performed sentiment analysis using GraphLab, but the trained result did not prove useful, as the majority of the vocabulary used is fairly neutral. This could be revisited in the future with better training samples that are ranked by level of agreement.
We used Keshif to visualize our dataset, as it provided a quick way to visualize key metrics of our data while also providing a list of conversations and direct links.
Because our data included multiple responses (one conversation, multiple speakers), we had to add extra HTML code in order to bring in the data successfully. Thankfully, there are various code examples available to refer to. We would also like to highlight that the guides had a particularly useful section on effective data preparation.
We loaded the speaker-conversation data into Gephi, with one file listing the speakers and conversations (nodes), and the other listing the links between each speaker and each conversation (edges). We also recorded whether each "link" or "edge" comes from the first paragraph of a conversation, to identify when a speaker initiates a conversation.
From here, we used ForceAtlas 2 to plot the layout, sizing the nodes by "out-degree", meaning the number of conversations each speaker has participated in.
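As a rough illustration of the edge table and the out-degree measure described above, the sketch below derives both from paragraph-level rows. The row layout `(conversation_id, paragraph_no, speaker_id)`, the function names and the sample IDs are assumptions for this example; the actual files loaded into Gephi may be structured differently.

```python
from collections import Counter


def build_edges(rows):
    """Build one speaker-to-conversation edge per pair, typed by role.

    rows holds (conversation_id, paragraph_no, speaker_id) tuples, with
    paragraph_no starting at 1 for the opening paragraph.
    """
    edges = {}
    for conv, para_no, speaker in sorted(rows):
        key = (speaker, conv)
        kind = "started" if para_no == 1 else "participated"
        # Keep "started" even if the same speaker talks again later on.
        if edges.get(key) != "started":
            edges[key] = kind
    return edges


def out_degree(edges):
    """Number of distinct conversations each speaker is linked to."""
    return Counter(speaker for speaker, _conv in edges)


# Made-up rows: two conversations, two speakers.
rows = [
    ("C1", 1, "S1"),
    ("C1", 2, "S2"),
    ("C1", 3, "S1"),
    ("C2", 1, "S2"),
]
edges = build_edges(rows)
degrees = out_degree(edges)
```

Each speaker-conversation pair yields a single edge, so the out-degree counts conversations rather than paragraphs.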
We also used the Modularity function to detect "communities" within the network, to identify clusters of conversations and speakers.
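To give a feel for what a modularity score measures, the sketch below computes the standard modularity value Q for a given split of an undirected, unweighted graph. It only scores a partition that is supplied to it; it does not search for communities the way Gephi's function does, and the toy graph (two triangles joined by one edge) is invented for illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges is a list of (node, node) pairs; community maps node -> label.
    Q = sum over communities of (internal_edges / m) - (degree_sum / 2m)^2.
    """
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for a, b in edges:
        degree_sum[community[a]] += 1
        degree_sum[community[b]] += 1
        if community[a] == community[b]:
            internal[community[a]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Toy graph: two triangles joined by a single bridge edge.
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
split = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
lumped = {n: "A" for n in range(1, 7)}

q_split = modularity(toy_edges, split)
q_lumped = modularity(toy_edges, lumped)
```

Splitting the toy graph into its two triangles scores higher than lumping every node together, which is the sense in which a community has many internal connections and relatively few outward ones.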
While the speaker-conversation network graph was useful, we wanted to see how diverse a speaker is in terms of speaking with different MPs. We initially tried to use the same network graphs, but because of the large number of connections it was difficult to see any patterns. (refer to diagrams to the left)
Thus, we turned to the hive plot, which is a more effective way to show heavily overlapping network connections, or "hairballs".
The challenge was that the data we had on hand linked each conversation to each speaker. We needed to transform it into a speaker-to-speaker format, in which two speakers who have taken part in the same conversation appear in the same row. After that, we could further transform it into standard hive plot edge and node tables.
We found a fairly efficient way to do this transformation using the Association node in SAS Enterprise Miner. First, we extracted the Speaker ID and Conversation ID from our Speaker-Conv Connection table. Then we ran the Association node with Maximum Items set to two, Minimum Confidence Level to zero, Support Count to one, and Maximum Number of Items to Process to 10000. This gave us the rules table, which we reformatted into node and edge tables using SAS JMP and Excel.
Note 1: Here we only pick four conversation sections: "Clarification", "Oral Answers to Questions", "Written Answers to Questions" and "Written Answers to Questions for Oral Answer Not Answered by Time". These are the sections where speakers have a choice to participate (as opposed to budget debates), so they let us see the interaction choices between MPs.
Note 2: One drawback of this kind of hive plot is that we cannot see how many times each pair of speakers has interacted. Thus, we used a co-occurrence matrix to address this in the next stage.
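For readers without access to SAS, the conversation-to-speaker to speaker-to-speaker transformation can also be sketched in plain Python. This is an alternative illustration, not the association-rule route used in the project; the function name and the sample IDs are assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def speaker_pairs(links):
    """Count how many conversations each unordered pair of speakers shares.

    links is an iterable of (conversation_id, speaker_id) rows.
    """
    by_conversation = defaultdict(set)
    for conv, speaker in links:
        by_conversation[conv].add(speaker)  # a set ignores repeated rows

    pairs = Counter()
    for speakers in by_conversation.values():
        # Sorting makes (a, b) and (b, a) fall under the same key.
        for a, b in combinations(sorted(speakers), 2):
            pairs[(a, b)] += 1
    return pairs


# Made-up links: S1 and S2 meet in two conversations.
links = [
    ("C1", "S1"), ("C1", "S2"), ("C1", "S3"),
    ("C2", "S1"), ("C2", "S2"), ("C2", "S1"),
]
pairs = speaker_pairs(links)
```

The keys of the result give the edge list needed for a hive plot, while the counts give the pair frequencies that a co-occurrence matrix can display.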
Adapting the code from Mike Bostock's demo: http://bost.ocks.org/mike/miserables/, we constructed our own by creating JSON files with the node/edge data from Gephi.
- Node indices in the JSON files start from 0, so we had to subtract 1 from all our existing user IDs.
- Sorting is only possible with integers, so we had to recode variables such as party, gender and race into dummy integer values.
- We tried various coloring themes, but the party grouping provides the most useful split for insights.
- We also added a "color by" option to improve the matrix's ability to reveal patterns between groups.
- For community detection, we limited the result to 4 groups to reduce the number of colors on the matrix.
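The first two points above (zero-based indices and integer recoding) can be sketched as follows. The output shape, a `nodes` list with `name` and `group` plus a `links` list with `source`, `target` and `value`, follows the JSON used by the Les Misérables demo; the function name, the field names on the input side and the sample speakers are assumptions, and the sketch assumes speaker IDs run contiguously from 1.

```python
import json


def to_matrix_json(speakers, pair_counts):
    """Build matrix JSON from 1-based speaker IDs and pair counts.

    speakers is a list of dicts with 'id' (1..N), 'name' and 'party';
    pair_counts maps (id_a, id_b) to the number of shared conversations.
    """
    # Recode the party label into a dummy integer so it can be sorted on.
    parties = sorted({s["party"] for s in speakers})
    party_code = {party: i for i, party in enumerate(parties)}

    ordered = sorted(speakers, key=lambda s: s["id"])
    nodes = [
        {"name": s["name"], "group": party_code[s["party"]]} for s in ordered
    ]
    # Shift IDs by -1 because links refer to positions in the nodes list.
    links = [
        {"source": a - 1, "target": b - 1, "value": n}
        for (a, b), n in sorted(pair_counts.items())
    ]
    return json.dumps({"nodes": nodes, "links": links})


# Hypothetical speakers and counts.
speakers = [
    {"id": 1, "name": "MP A", "party": "PAP"},
    {"id": 2, "name": "MP B", "party": "WP"},
    {"id": 3, "name": "MP C", "party": "PAP"},
]
payload = json.loads(to_matrix_json(speakers, {(1, 2): 3, (2, 3): 1}))
```

Other variables such as gender or race could be recoded the same way and stored as extra integer fields on each node.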
Test Charts in Tableau
We wanted to test the data for creating an MP's profile and see if there are any insights to obtain before implementing in D3.
We ran the following metrics in Tableau:
1) % share of speaker's topic: of all conversations participated in by the speaker, what percentage does each topic take (e.g. 4 Health discussions out of 100 discussions participated in by the speaker)
2) % share of topic: of all conversations on a single topic, in how many did the speaker participate (e.g. 4 of all 20 Health discussions)
3) Patterns in topics over time:
We did not pursue this, as we realized that it is very difficult to compare between speakers without a lot of back and forth. The second metric was also very low for most speakers across all topics, and might not provide a strong differentiating factor between MPs.
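The two share metrics tested in Tableau can be written out explicitly, which makes the different denominators easier to see. The sketch below is an illustration with made-up rows; the triple layout `(conversation_id, topic, speaker)` and the function name are assumptions.

```python
def share_metrics(rows, speaker, topic):
    """Return (share of speaker's conversations, share of topic).

    rows holds (conversation_id, topic, speaker) triples, one per
    speaker taking part in a conversation.
    """
    speaker_convs = {c for c, _t, s in rows if s == speaker}
    topic_convs = {c for c, t, _s in rows if t == topic}
    both = speaker_convs & topic_convs

    # Metric 1: of the speaker's conversations, the share on this topic.
    share_of_speaker = len(both) / len(speaker_convs) if speaker_convs else 0.0
    # Metric 2: of the topic's conversations, the share with this speaker.
    share_of_topic = len(both) / len(topic_convs) if topic_convs else 0.0
    return share_of_speaker, share_of_topic


# Made-up participation rows.
rows = [
    ("C1", "Health", "S1"), ("C1", "Health", "S2"),
    ("C2", "Health", "S2"),
    ("C3", "Transport", "S1"),
    ("C4", "Transport", "S1"),
    ("C5", "Health", "S3"),
    ("C6", "Transport", "S1"),
]
by_speaker, by_topic = share_metrics(rows, "S1", "Health")
```

In this sample, S1 joins one of three Health conversations but that conversation is only a quarter of S1's own activity, so the two metrics differ.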
Demonstration: Sample Test Cases
Below are two sample test cases for the two types of users that we expect to be early adopters of this tool.
Test case 1: Politically savvy user evaluating two MPs
Interest areas & depth of knowledge
To see the distribution of topics across two MPs:
1) Topics started
With the lock feature, we can easily compare the absolute number of conversations started under each topic by the two MPs. We can also view the percentage of all conversations by clicking on the axis. The date chart below shows that the locked MPs have consistently started more topics, with an increase in the most recent year.
2) Topics participated
In the topics participated page, we can see a much larger picture of the conversations in which the two MPs have interacted with others. We can also filter by transport to see the timeline for just transport-related topics.
3) Focus on topic trends
As we dive deeper, the third mode allows us to focus only on key topics' timeline.
Quality of conversations
1) Exchange and speaker counts:
If we revisit mode 2 on the browser, we can also see the number of exchanges and speakers for conversations that these two MPs have participated in. We can then focus on the two MPs by selecting both as OR filter criteria. We can see some slight differences in the conversations that both MPs have participated in. If desired, the user can also filter on a single MP, sort by exchanges or number of speakers, and read the actual conversation text.
2) Network visualization:
With the network chart, we can easily search for the two MPs and see that they start different topics, while having some slight overlap in transport-related conversations.
Test case 2: Data journalists reviewing how parties perform in parliament
1) Conversations by topics started
Here we can see that there are some topic differences between NMP and WP. Upon drilling down, we can also see that the number of topics raised by NMPs has been relatively lower in the last year.
While the decline in recent years remains the same, we see that although the Workers' Party has been involved in more conversations, those conversations also tend to have fewer participants.
Interactions in parliament sessions
1) Network graph colored by party
As we can see on the network graph, WP tends to be more spread out than NMPs. This might imply that WP covers a greater breadth of topics.
2) Hive plots
- Looking into the three axis plots, we see that while WP and NMP are very active in conversing with PAP members, they rarely connect with each other.
- An interesting insight is that only a select portion of PAP MPs seems to interact with WP MPs, but this is not the case with NMPs.
- On the 6 axis plots, we can see that there are plenty of intra-PAP communications, whereas this is not so for NMP and opposition parties.
3) Co-occurrence matrix
Looking into the co-occurrence matrix, we can confirm that there is some sort of pattern for WP-PAP interactions, indicated by "stripes" on the matrix. We can also observe relatively sparse intra-NMP interaction.
What has the audience learned from your work?
From discussions and demo sessions during poster presentation:
- Coverage of arts & culture topics in parliament is relatively isolated from the rest of the MPs.
- There are certain NMPs or opposition MPs who actively cover a significant range of topics despite the lack of press reports.
- Ministers tend to be very specialized in their own fields, and are usually on the receiving end of questions.
- Relatively more interactions between opposition MPs and Ministers in Parliament.
- Some of the NMPs tend to cover overlapping topics, not as spread out as opposition members.
- Process on scraping the content from the official servers.
- Usage of network graphs to show interpersonal connections.
- Fundamentals of topic modeling.
What new insights or practices has your system enabled?
- A new macro view of how conversations and MPs interact with each other through the use of network charts.
- A differentiation between specialized MPs and those that cover a wide spectrum.
- Ability to catch up easily on discussions without having to go through the archives.
- An application for readers who want to focus on single MPs to those who are interested in the entire political spectrum.
Explore increasing the number of parliament records and figure out how to handle changing GRC memberships over different batches. Currently, we are still limited to content from the 12th Parliament.
With Parliament sessions to reopen next year, we will also need to figure out a more efficient way of capturing new arrivals and corrections from the parliament records.
In the ideal scenario, we will have a direct connection to the Parliament records, as well as topic tags added to each parliament conversation. This will, however, depend greatly on our discussions with the parliament office.
Data preparation for network charts will continue to be lengthy and is expected to be done only once every 3-4 months.
Data inclusion for the Keshif browser, however, will be relatively easy by appending the required data into the google sheet or database.
Fine tune topic modeling to be based on trained samples instead of the “bag-of-words” approach
While the topics currently generated are relatively reliable, they are not 100% accurate. We foresee that a "bag of words" approach alone might not be sufficient for topics that share similar words used in different contexts.
Further work would be to manually tag conversations with "topics" assigned as per Ministry, and classify using naive Bayes or support vector machine methods to obtain better accuracy.
Our initial efforts at sentiment detection did not provide any useful predictions. We figure that this is due to the political language used in Parliament discussions. We plan to generate manually labelled samples on whether a paragraph can be viewed as "Agree", "Disagree" or "Neutral", to provide a much clearer view of the flow of discussions.
Future work for the chart could involve putting topics and conversations onto the axis, connecting with each MP and his/her connected ministries to the topics they covered in parliament, or adding sorting by GRC’s.
Explore trellis layout for hive charts in order to compare across different demographics
Sunburst diagrams to show the general flow of discussions by tagging agree/disagree/end in each paragraph. This will help to see the difference between simple Q&A's to lengthy debates.
Further Development Work
To perform user testing with potential audiences of this tool
For online usage
No installation is required. All charts are hosted on a live server (see the project page). All raw data for the browser can be accessed from the Google Drive link.
To run the network graphs locally, users will need to install Gephi and the SigmaJS exporter plug-in to bring it online.
Before embarking on using the browser, please watch the video below on Keshif functions such as filtering or customizing the layout of the charts:
5 Minute Tutorial on Keshif functions
How to read charts in Conversations participated by each MP
In mode 2, the browser is showing the speakers that are present in each conversation:
This is why, when a filter is added (e.g. for WP), the data might still show values for the other criteria. In the example below, the filter has selected conversations that the Aljunied GRC MPs have participated in. Out of the 439 conversations, Tampines GRC MPs have participated in 67 of them. All other speaker variables act the same way and will display all possible data for those who have participated.
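The filter behaviour described above can be reproduced in a few lines of Python, which may help when explaining it to new users. This is an illustration of the counting logic only, not Keshif's own code; the data layout, the function name and the tiny sample are assumptions.

```python
from collections import Counter


def filter_and_count(conversations, attribute, value):
    """Keep conversations where any speaker matches, then recount.

    conversations maps a conversation ID to a list of speaker dicts.
    Returns the kept conversations and, for every value of the attribute,
    the number of kept conversations with at least one such speaker.
    """
    kept = {
        conv: speakers
        for conv, speakers in conversations.items()
        if any(s[attribute] == value for s in speakers)
    }
    counts = Counter()
    for speakers in kept.values():
        # Count each attribute value once per conversation.
        for seen in {s[attribute] for s in speakers}:
            counts[seen] += 1
    return kept, counts


# Made-up sample: three conversations, speakers from two GRCs.
conversations = {
    "C1": [{"grc": "Aljunied"}, {"grc": "Tampines"}],
    "C2": [{"grc": "Aljunied"}],
    "C3": [{"grc": "Tampines"}],
}
kept, counts = filter_and_count(conversations, "grc", "Aljunied")
```

In the sample, filtering on Aljunied keeps two conversations, and Tampines still shows a count of one because a Tampines speaker took part in one of them.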
Unique features for Hansard dataset
- Sort options: For the conversation list, we have added the sort by using "Number of speakers", "Exchanges", and "Wordcount". Users can use these sort options to gauge the level of participation in each conversation.
- Additional info button: Users can also quickly view the full details on each conversation by clicking on the circular logo.
- External link: As stated, each item in the conversation list also carries an external link to the original article on the official servers.
- Additional topic trend lines: For mode 3, there are more topics available in the variable drawer. Users can customize the topics and the order in which they appear on the dashboard.
Explanation of network graph colored by party
In the first chart, each non-grey node represents an MP, colored by party. MPs are connected to the conversations that they have started (yellow line) or have participated in (blue line). The speaker nodes are sized by the number of conversations they are connected with. Users can mouse over a node to quickly see its name/title, or click on it to see its direct connections.
The layout of each node in the graph is calculated by an algorithm called Force Atlas 2 in Gephi. This helps us to group conversations and speakers that are likely to happen together to be near each other. In this way, we can quickly also see that transport conversations will be closely linked to the Transport Minister.
This layout helps to visualize the estimated coverage of each MP: MPs on the outer edges tend to be more specialized. On the other hand, conversation nodes in the middle tend to involve many MPs from different cliques. Parties that cover a wide range of topics will have members dispersed around the entire graph, while other parties might focus on certain areas.
Colored by Cluster
In the second graph, we remove the administrative posts, reorganize the layout and color the conversation and speaker nodes by "community". We do this via a function in Gephi called modularity. This function helps to group conversations and speakers that have many connections with each other, but relatively less "outwards". Just like social networks, this will help to group MPs that tend to talk more together.
In the image above, we demonstrated how users can filter the graph to see members of each community and their discussions. We also compared specialized MPs on the outer rings who only focus on specific topics (connected mainly to conversations of his/her own node color), versus MPs who covered a wide range of topics (connected to many node colors).
1. Axis Description: The axes are colored by party: blue=PAP, green=NMP, red=WP, yellow=SPP. Each party is shown on one axis. The curves between each pair of nodes represent the conversation connections.
2. Search Function
3. Group Filtering
- Six axis Hive Plot
1. Axis Description: The axes are colored by party: blue=PAP, green=NMP, red=WP, yellow=SPP. Each party is shown on two axes. Curves between two axes of the same color represent conversations that happened within a party. Otherwise, each node can only be connected to nodes on the neighbouring axis. Currently, the same speaker node appears in different locations on the two axes. Further work is required to meet the standard hive plot design.
2. Search Function
3. Group Filtering
Co-occurrence matrix plot
In this plot, we chart whether each MP has interacted with another in at least one conversation in the Singapore Parliament question and answer discussions.
Each colored cell represents two MPs that have spoken in the same conversation; darker cells indicate MPs that co-occurred more frequently.
By default, cells are colored by party if the two interacting MPs are from the same party (e.g. PAP to PAP), whereas they are monochrome black for interactions between parties (e.g. WP to PAP).
You can use the drop-down menus to reorder the matrix and/or re-color the nodes to explore different patterns in the data.