A sentiment analysis of Singapore Presidential Election 2011 using Twitter data with census correction

Sentiment analysis is a new area in text analytics where it focuses on the analysis and understanding of the emotions from the text patterns. This new form of analysis has been widely adopted in customer relation management especially in the context …

Authors: Murphy Choy, Michelle L.F. Cheong, Ma Nang Laik

Murphy Choy (1), Michelle L.F. Cheong (2), Ma Nang Laik (3), Koo Ping Shung (4)

(1) Instructor, School of Information System, Singapore Management University
(2) Practice Associate Professor, School of Information System, Singapore Management University
(3) Practice Assistant Professor, School of Information System, Singapore Management University
(4) Manager, School of Information System, Singapore Management University

Abstract

Sentiment analysis is a new area in text analytics that focuses on analysing and understanding the emotions expressed in text patterns. This new form of analysis has been widely adopted in customer relationship management, especially in the context of complaint management. With increasing interest in this technology, more and more companies are adopting it and using it to champion their marketing efforts. However, sentiment analysis using Twitter has remained extremely difficult to manage due to sampling bias. In this paper, we discuss the application of reweighting techniques in conjunction with online sentiment divisions to predict the vote percentage that each candidate will receive. We also discuss in depth the various aspects of using sentiment analysis to predict outcomes, as well as the potential pitfalls in the estimation due to the anonymous nature of the internet.

Introduction

Social media has been widely adopted by many private enterprises to market their products and services. With the successful campaign of President Obama in the 2008 US presidential election, social media platforms such as Facebook, Twitter and other sites have been catapulted to great success as the leading platforms for engaging voters. Many political analysts (Stirland, 2008; Pasek, 2006; Xenos, 2007) have attributed his success to the active use of social media to engage voters, especially younger voters whose concerns are usually ignored or given less importance (Pasek, 2006). This, combined with the poor adoption of social media by McCain and Palin, increased his overall advantage in attracting more voters (Stirland, 2008). Extremely intricate planning and grassroots activities sealed his campaign success (Stirland, 2008).

While originally dismissed as random ranting by youths, Twitter has now become the tool of choice for voicing opinions and linking up with people. Twitter was originally developed as a micro-blogging tool where users can post a very short blog entry online to update the people linked to them about their status and opinions. All status updates and opinions are reflected on the Twitter online boards, or can be searched and extracted using the Twitter search API (Twitter, 2011). The Twitter search capabilities allow almost real-time searching of information that is itself streamed in real time. In the earlier days of Twitter, most tweets or messages were personal opinions. With the development of the Twitter market, a variety of uses such as news and product marketing have proliferated rapidly. Political associations and various interest groups have successfully used it to voice their opinions and political positions and to gather support from online audiences. While many analysts believe that Twitter is not very useful (Pearanalytics, 2009), others have lauded its immense potential (Skemp, 2009).

There are three major objectives to be achieved using Twitter information. The first aim is to assess the amount of information regarding political election events in a conservative yet well-connected country.
The second aim is to develop the methodology to reconcile the online information with the political events that transpire, in order to examine how well the information is reflected. The last objective is to use the information gained to predict the incoming president.

Background to the Singapore Presidential Election 2011

In this study, we collected 16,616 tweets from Twitter spanning the period from 17 August to 25 August, which covers the nomination period as well as the campaigning period. This is the second contested Presidential election since the first one was held in 1993. After serving as President for 12 years, President S.R. Nathan decided to step down due to his age. This resulted in several individuals from various segments of society coming forward to compete for the position. The President of Singapore is the head of state and the role is largely ceremonial in nature. While the President holds several veto powers, the role has limited executive rights and privileges. After the screening process by the Presidential Elections Committee, 4 individuals were identified as eligible candidates from among the applicants, and all 4 took part in the Presidential election.

The 4 candidates are Dr Tan Cheng Bock, Dr Tony Tan Keng Yam, Tan Jee Say and Tan Kin Lian. Dr Tan Cheng Bock is a doctor who was a Member of Parliament for 26 years. He has served in various roles ranging from managing director of private companies to advisor for counseling centers. Dr Tony Tan is a former Deputy Prime Minister of Singapore with experience in political appointments ranging from Education to National Defense. He has served as Deputy Chairman of GIC Singapore, the sovereign wealth fund of the nation. Mr Tan Jee Say is a former civil servant turned investment banker. He has served as the Principal Private Secretary to the then Deputy Prime Minister Goh Chok Tong (former Prime Minister of Singapore) as well as in the administrative arms of the civil service. Mr Tan Kin Lian is the former CEO of NTUC Income Insurance. He has held various directorships at different companies and is a member of various professional bodies. Dr Tony Tan, Dr Tan Cheng Bock and Mr Tan Kin Lian are all former members of the People's Action Party (PAP), while Mr Tan Jee Say is a former member of the Singapore Democratic Party.

The Presidential Election takes place some 3 months after the General Elections 2011, which are widely viewed as a historic moment in the political history of Singapore. It was the first time that all the wards except one (Tanjong Pagar GRC) were contested by opposition parties. A record number of political parties were involved in the election process. It was also the first time that social media was considered a legal means of political campaigning and thus fell under the legislation and regulation of the election rules, and the first time that a cooling-off day was implemented. The incumbent party, the PAP, managed to retain its strong majority although there was a negative swing of 7%. It marked the second time in the history of Singapore that a high-ranking and important minister was defeated. Social media was noted to have a certain level of influence over the campaign period, during which several incidents on social media were widely criticised and highlighted the ferocious nature of cyber-political campaigning (The Economist, 2011; Scoop, 2011; Fang, 2011). It also highlighted how some online channels are influencing the people (The Economist, 2011). Candidates, while non-partisan, are still linked by social media to their political affiliations.
Literature Review

The growth of Twitter has increased its overall interest to researchers from disciplines such as sociology, marketing and computer science. There has been a multitude of publications in this area, notably in marketing. Among the different research groups, one group of researchers studies the effect of social media on the market (Honeycutt and Herring, 2009; Nielsen Media Research, 2009). Research has shown that there is a huge variation in the intensity and usage of Twitter. The uses of Twitter range from conversations (Honeycutt and Herring, 2009) to word-of-mouth marketing (Jansen et al., 2009). The main theme of the research done so far is the generic nature of Twitter operating in a function, rather than a specialized evaluation of political themes (Tumasjan et al., 2010).

There is widespread discussion and research about web forums, blogs and Twitter as alternative forms of political debate. Some researchers have acknowledged the quality of the more prominent political blogs (Woodley, 2008), while others have doubted the capability of blogs to aggregate and convey information (Sunstein, 2008). Research has also shown that while there is active participation in many political discussion forums (Fang, 2011) and blogs (Jansen and Koop, 2009), the population actively participating is very small. At the same time, there was no additional information about the overall relevance of Twitter in this case (Tumasjan et al., 2010).

Most of the current literature focuses on the effect of social media on the actual population for issues such as politics, public policies and causes. The literature covered acknowledges the lack of recognition of the influence of the non-online population on the political landscape (Drezner and Farrell, 2008). Several case studies have found that online information has been quite successful as an indicator of electoral success (Williams and Gulati, 2008). There is very little research on the use of Twitter for this purpose (Tumasjan et al., 2010). Therefore the goals of this paper are to:

• Assess the amount of information regarding political election events in a conservative yet well-connected country.
• Develop the methodology to reconcile the online information with the political events that transpire, in order to examine how well the information is reflected.
• Use the information gained to predict the incoming president.

Data and Methodology

We collected Twitter information from the start of the campaign on 17 August 2011 to 25 August 2011. The information was gathered from the Twitter search engine with the help of the Google API. The data collected is based on the candidates' names (only the English names are used, due to the difficulty of assessing Chinese tweets), with a total of 16,616 tweets collected. Repeated tweets from the same user are purged, and further de-duplication is done between the different searches. This ensures that a proper and unadulterated collection of tweets is used for analysis.

To extract the sentiments from the data automatically, a customized corpus was created and developed for this analysis. While there are several corpora and programs available online for conducting sentiment analysis, most of them are not suitable for this context. Part of the issue is the particularly complex abbreviations that are peculiar to this campaign. The localized version of English is slightly different from US or UK English, and certain words are used differently. Another problem is the massive use of parodies, which also affects the standard corpora used in the analysis of text information. To this end, a new corpus assembled from online sources, dictionaries and the earlier General Elections 2011 campaign was used.
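The collection and scoring steps described above can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration: the tiny `LEXICON`, the sample tweets, the name-matching rule and the helper names `deduplicate`, `score` and `aggregate` are hypothetical and are not the authors' customized corpus or pipeline. It only shows the general shape of purging repeated tweets per user and then summing lexicon-based sentiment per candidate.

```python
from collections import defaultdict

# Hypothetical mini-lexicon; the paper's customized corpus is not reproduced here.
LEXICON = {"support": 1, "good": 1, "trust": 1, "bad": -1, "boring": -1, "liar": -1}

# Candidate codes as used in the paper's tables, matched on English names only.
CANDIDATES = {
    "TT": "tony tan",
    "TCB": "tan cheng bock",
    "TJS": "tan jee say",
    "TKL": "tan kin lian",
}


def deduplicate(tweets):
    """Keep the first copy of each (user, whitespace-normalized lowercase text) pair."""
    seen = set()
    unique = []
    for user, text in tweets:
        key = (user, " ".join(text.lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append((user, text))
    return unique


def score(text):
    """Sum lexicon weights over the tokens of one tweet (positive minus negative)."""
    tokens = (tok.strip(".,!?#@") for tok in text.lower().split())
    return sum(LEXICON.get(tok, 0) for tok in tokens)


def aggregate(tweets):
    """Add each unique tweet's score to every candidate named in that tweet."""
    totals = defaultdict(int)
    for _, text in deduplicate(tweets):
        lowered = text.lower()
        tweet_score = score(text)
        for code, name in CANDIDATES.items():
            if name in lowered:
                totals[code] += tweet_score
    return dict(totals)


# Made-up sample tweets as (user, text) pairs; the second is a repeat by the same user.
tweets = [
    ("u1", "I support Tan Cheng Bock!"),
    ("u1", "I support  Tan Cheng Bock!"),
    ("u2", "Tony Tan is good, trust him"),
    ("u3", "Tan Kin Lian speech was boring"),
]

print(aggregate(tweets))
```

A word-list approach like this is only a stand-in: it would not handle the local abbreviations or parodies mentioned above, which is the stated reason for building a campaign-specific corpus.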
Note that this corpus is strictly for this campaign and is not applicable to other campaigns. In this analysis, we also focus on the sentiment in each tweet and ignore some of the more complicated aspects of tweeting (Tumasjan et al., 2010). Due to the possible bias in the data, additional information in the form of the census (DOS, 2010) and surveys from government bodies (IDA, 2010) is collected to correct the inherent bias in the online data.

To estimate the votes, we have developed the following census recorrection framework. The framework requires several key pieces of information and rests on 2 assumptions.

Assumptions

1. The people who voted in the general elections will most likely vote along party lines.
2. The online sentiment is representative of the people who are expressing their views.

Both assumptions are made out of necessity. If we do not assume that people will vote along party lines, we would be positing a number of swing voters that we cannot estimate accurately. While online or offline polls might give some information, that information is uncertain and could carry a potentially huge margin of error. The second assumption means that the emotion expressed online represents the viewpoint of the individual, and likewise that of the people on the ground. This assumption is necessary to avoid further complication in the estimation of the sentiment.

To calculate the percentage of votes that each candidate should receive, we use the formula below.

[Formula not recoverable from the source text]

From the above, we obtain the constituent components as below.

[Equation (1) not recoverable from the source text]
[Equation (2) not recoverable from the source text]
[Equation (3) not recoverable from the source text]

Combining equations (1), (2) and (3),

[Equation (4) not recoverable from the source text]

Consolidating the information from each age group, we obtain

[Equation (5) not recoverable from the source text]

We will use the above framework to model the Presidential Election 2011.

Result

In this section, we evaluate all the information made available and use the framework to calculate the estimated percentage of votes for each candidate. Based on the preliminary information from the census and surveys, Table 1 below shows the population percentages as well as the computer literacy rate.
| Age Group (Years) | Total ('000) | % Pop | Comp Literacy (IDA, 2009) |
|---|---|---|---|
| 0 - 4 | 194.4 | 5.15% | 99% |
| 5 - 9 | 215.7 | 5.72% | 99% |
| 10 - 14 | 244.3 | 6.48% | 99% |
| 15 - 19 | 263.8 | 6.99% | 97% |
| 20 - 24 | 247.2 | 6.55% | 97% |
| 25 - 29 | 272.6 | 7.23% | 95% |
| 30 - 34 | 298.7 | 7.92% | 95% |
| 35 - 39 | 320.0 | 8.48% | 76% |
| 40 - 44 | 309.4 | 8.20% | 76% |
| 45 - 49 | 323.5 | 8.58% | 76% |
| 50 - 54 | 303.0 | 8.03% | 44% |
| 55 - 59 | 248.7 | 6.59% | 44% |
| 60 - 64 | 192.0 | 5.09% | 14% |
| 65 - 69 | 111.5 | 2.96% | 14% |
| 70 - 74 | 92.6 | 2.46% | 14% |
| 75 - 79 | 65.2 | 1.73% | 14% |
| 80 - 84 | 39.8 | 1.06% | 14% |
| 85 & Over | 29.2 | 0.77% | 14% |
| Total | 3,771.7 | | |

Table 1: Population percentages and computer literacy rate

However, due to the secrecy of the voting process and the lack of polls, we were not able to obtain direct information on the popular support for the parties. As a result, we use some of the information obtained from the internet about the voting populace. One key piece of information is that the older age groups are the more staunch supporters of the PAP. At the same time, the General Elections 2011 also indicated that the younger generations are less supportive, with general support for the PAP at 60.1%. Using this information, we use a simple linear programming formulation to calculate the support for the PAP.

[Objective function not recoverable from the source text]

Constraints:

[Constraints not recoverable from the source text]

Using this optimization process, we obtain the estimated PAP support shown in Table 2. While these are the numbers we derived, there are several possible solutions to this problem.

| Age Group (Years) | Est. % of PAP Supporter | % of Pop Online | % Non Social Media | % Social Media |
|---|---|---|---|---|
| 20 - 24 | 43.3% | 6.4% | 2.9% | 3.4% |
| 25 - 29 | 46.0% | 6.9% | 4.5% | 2.4% |
| 30 - 34 | 48.8% | 7.5% | 4.9% | 2.6% |
| 35 - 39 | 51.9% | 6.4% | 5.4% | 1.1% |
| 40 - 44 | 55.0% | 6.2% | 5.2% | 1.1% |
| 45 - 49 | 58.4% | 6.5% | 5.4% | 1.1% |
| 50 - 54 | 62.0% | 3.5% | 3.3% | 0.2% |
| 55 - 59 | 65.8% | 2.9% | 2.7% | 0.2% |
| 60 - 64 | 69.9% | 0.7% | 0.7% | 0.0% |
| 65 - 69 | 74.2% | 0.4% | 0.4% | 0.0% |
| 70 - 74 | 78.8% | 0.3% | 0.3% | 0.0% |
| 75 - 79 | 83.6% | 0.2% | 0.2% | 0.0% |
| 80 - 84 | 88.7% | 0.1% | 0.1% | 0.0% |
| 85 & Over | 94.2% | 0.1% | 0.1% | 0.0% |
| Overall % | 60.0% | | | |

Table 2: Estimated percentage of PAP supporters and the breakdown of the online population into social-media-inactive and social-media-active percentages.

The population online is then calculated using the computer literacy percentages. The active and non-active social media populations are further separated using the information from surveys (IDA, 2009). For the off-tweet population, we separate them into PAP and non-PAP supporters by the same ratio as in the General Elections 2011, as shown in Table 3.

| Age Group (Years) | PAP % | Opp % |
|---|---|---|
| 20 - 24 | 1.4% | 1.8% |
| 25 - 29 | 2.2% | 2.6% |
| 30 - 34 | 2.6% | 2.7% |
| 35 - 39 | 3.8% | 3.6% |
| 40 - 44 | 3.9% | 3.2% |
| 45 - 49 | 4.4% | 3.1% |
| 50 - 54 | 4.8% | 3.0% |
| 55 - 59 | 4.2% | 2.2% |
| 60 - 64 | 3.5% | 1.5% |
| 65 - 69 | 2.2% | 0.8% |
| 70 - 74 | 1.9% | 0.5% |
| 75 - 79 | 1.4% | 0.3% |
| 80 - 84 | 0.9% | 0.1% |
| 85 & Over | 0.7% | 0.0% |
| Total Pop % | 38.0% | 25.3% |

Table 3: Offline population spread between party lines.

Using the online sentiments, we calculate the overall sentiment for each candidate within their respective party line.

| Grouping | Candidate | Sentiment (Value) | % Split |
|---|---|---|---|
| 1 | TT | 275 | 49.1% |
| 1 | TCB | 285 | 50.9% |
| 2 | TJS | 356 | 59.3% |
| 2 | TKL | 244 | 40.7% |

Table 4: Candidates' online Twitter positive sentiments

There are several points to note about the sentiment value. The sentiment value is the aggregated positive and negative emotion from the tweets. One or multiple candidates may be endorsed in a tweet, and there could be mixed endorsements.
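As a rough check on how the figures fit together, the sketch below derives the "% Split" column of Table 4 from the sentiment values and then allocates one age group (20 - 24) among the four candidates. The allocation rule is an inference from the published tables rather than a statement of the authors' exact formula: it assumes that the offline segment is the population share minus the social-media-active share, that each segment is first divided along party lines using the estimated PAP support from Table 2, and that grouping 1 receives the PAP-line share while grouping 2 receives the remainder. The function names are illustrative.

```python
# Sentiment values from Table 4, keyed by grouping number.
SENTIMENT = {
    1: {"TT": 275, "TCB": 285},
    2: {"TJS": 356, "TKL": 244},
}


def sentiment_split(values):
    """Normalize raw sentiment values within one grouping so they sum to 1."""
    total = sum(values.values())
    return {name: value / total for name, value in values.items()}


def allocate(segment_pct, pap_support, splits):
    """Divide one population segment (in % of total population) among candidates.

    Assumption: the segment is split along party lines by `pap_support`,
    then within each line by the sentiment split of that grouping.
    """
    shares = {}
    for name, split in splits[1].items():
        shares[name] = segment_pct * pap_support * split
    for name, split in splits[2].items():
        shares[name] = segment_pct * (1 - pap_support) * split
    return shares


splits = {group: sentiment_split(values) for group, values in SENTIMENT.items()}

# Age group 20 - 24: 6.55% of the population (Table 1), 3.4% active on
# social media and an estimated 43.3% PAP support (Table 2).
pop_pct, social_pct, pap_support = 6.55, 3.4, 0.433

offline = allocate(pop_pct - social_pct, pap_support, splits)
online = allocate(social_pct, pap_support, splits)
overall = {name: offline[name] + online[name] for name in offline}

for name in overall:
    print(
        f"{name}: offline {offline[name]:.2f}%, "
        f"online {online[name]:.2f}%, overall {overall[name]:.2f}%"
    )
```

Under these assumptions the overall figures come out close to the 20 - 24 row of Table 5, with small differences that are presumably due to rounding in the published inputs. Because both segments use the same party-line ratio, the overall figure reduces to population share times party-line share times sentiment split.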
Correlation analysis between the candidates also indicated that there are no strong correlations between the tweets of the candidates. Tweets which contain endorsements for all candidates were removed, as they provide no useful information. Given the splits in Table 4, we then apply the information to the framework.

| Age Group (Years) | Offline TT | Offline TCB | Offline TJS | Offline TKL | Online TT | Online TCB | Online TJS | Online TKL | Overall TT | Overall TCB | Overall TJS | Overall TKL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 - 24 | 0.66% | 0.69% | 1.04% | 0.72% | 0.73% | 0.76% | 1.15% | 0.79% | 1.39% | 1.45% | 2.20% | 1.52% |
| 25 - 29 | 1.09% | 1.13% | 1.54% | 1.07% | 0.54% | 0.56% | 0.77% | 0.53% | 1.63% | 1.69% | 2.31% | 1.60% |
| 30 - 34 | 1.27% | 1.32% | 1.60% | 1.11% | 0.63% | 0.65% | 0.80% | 0.55% | 1.90% | 1.97% | 2.39% | 1.66% |
| 35 - 39 | 1.88% | 1.95% | 2.10% | 1.46% | 0.28% | 0.29% | 0.31% | 0.21% | 2.16% | 2.24% | 2.41% | 1.67% |
| 40 - 44 | 1.93% | 2.01% | 1.89% | 1.32% | 0.29% | 0.30% | 0.28% | 0.19% | 2.21% | 2.30% | 2.18% | 1.51% |
| 45 - 49 | 2.14% | 2.23% | 1.83% | 1.27% | 0.32% | 0.33% | 0.27% | 0.19% | 2.46% | 2.56% | 2.11% | 1.46% |
| 50 - 54 | 2.37% | 2.46% | 1.74% | 1.21% | 0.08% | 0.08% | 0.06% | 0.04% | 2.44% | 2.54% | 1.80% | 1.25% |
| 55 - 59 | 2.06% | 2.15% | 1.29% | 0.90% | 0.07% | 0.07% | 0.04% | 0.03% | 2.13% | 2.21% | 1.33% | 0.92% |
| 60 - 64 | 1.73% | 1.80% | 0.90% | 0.62% | 0.01% | 0.02% | 0.01% | 0.01% | 1.74% | 1.81% | 0.90% | 0.63% |
| 65 - 69 | 1.07% | 1.11% | 0.45% | 0.31% | 0.01% | 0.01% | 0.00% | 0.00% | 1.07% | 1.12% | 0.45% | 0.31% |
| 70 - 74 | 0.94% | 0.98% | 0.31% | 0.21% | 0.01% | 0.01% | 0.00% | 0.00% | 0.95% | 0.99% | 0.31% | 0.21% |
| 75 - 79 | 0.70% | 0.73% | 0.17% | 0.12% | 0.01% | 0.01% | 0.00% | 0.00% | 0.71% | 0.74% | 0.17% | 0.12% |
| 80 - 84 | 0.46% | 0.47% | 0.07% | 0.05% | 0.00% | 0.00% | 0.00% | 0.00% | 0.46% | 0.48% | 0.07% | 0.05% |
| 85 & Over | 0.35% | 0.37% | 0.03% | 0.02% | 0.00% | 0.00% | 0.00% | 0.00% | 0.36% | 0.37% | 0.03% | 0.02% |
| Total Pop % | | | | | | | | | 21.61% | 22.47% | 18.65% | 12.92% |
| Election % | | | | | | | | | 28.6% | 29.7% | 24.7% | 17.1% |

Table 5: Estimated results

We then validated the model's predictions against the final result, which was released on the morning of 28 August 2011, as shown in Table 6.

| Candidate | Predicted | Actual |
|---|---|---|
| TT | 28.60% | 35.19% |
| TCB | 29.70% | 34.85% |
| TJS | 24.70% | 25.04% |
| TKL | 17.10% | 4.91% |

Table 6: Result comparison

From the results, we can see the difference between the predicted and the actual values.
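The "Election %" row of Table 5 appears to be the "Total Pop %" row rescaled so that the four candidates sum to 100%, since the age groups from 20 upwards cover only part of the population. The short sketch below, with illustrative variable and function names, performs that rescaling and lists the signed gap against the actual results in Table 6.

```python
# "Total Pop %" row of Table 5: share of the whole population per candidate.
total_pop = {"TT": 21.61, "TCB": 22.47, "TJS": 18.65, "TKL": 12.92}

# Actual vote shares from Table 6.
actual = {"TT": 35.19, "TCB": 34.85, "TJS": 25.04, "TKL": 4.91}


def to_election_share(pop_shares):
    """Rescale population shares so that the candidates sum to 100%."""
    total = sum(pop_shares.values())
    return {name: 100 * share / total for name, share in pop_shares.items()}


predicted = to_election_share(total_pop)

# Signed error in percentage points: positive means the model over-predicted.
errors = {name: predicted[name] - actual[name] for name in actual}

for name in actual:
    print(
        f"{name}: predicted {predicted[name]:.1f}%, "
        f"actual {actual[name]:.2f}%, error {errors[name]:+.2f} points"
    )
```

Rounded to one decimal place, the rescaled shares reproduce the predicted column of Table 6, and the largest gap by far is the over-prediction for TKL.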
While we have successfully calculated the thin margin between the top two candidates, the model did not succeed in predicting the right winner. At the same time, one of the candidates (TKL) received far fewer votes than predicted (4.91% against 17.10%). We offer some explanations for the discrepancies.

The first problem, which the model is currently unable to address, is the percentage of swing voters. This group of voters does not vote along party lines and often changes position depending on the policies of the parties. At the same time, due to the non-partisan nature of the presidency, some of the swing voters might not have voted as they had done previously. This could account for the differences between the actual and predicted values.

The second problem is fake Twitter sentiment. This can be attributed to 2 major causes. The first source of such fake sentiment is astroturfing. The other source may be that voters' online sentiments do not truly reflect their choice of candidate. This seems to be a peculiar situation that does not occur elsewhere.

Conclusion

From the results, the framework has been able to predict the top two contenders in the four-cornered fight. While the predicted winner did not emerge as president due to the small margin, the estimation of the small margin of votes between the contenders indicates the model's ability to model the scenario realistically. The framework has been able to convert the Twitter information into a realistic prediction. At the same time, this is the first time that Twitter information has been collected in a conservative, highly connected environment where a significant portion of the population is not involved in the Twitter movement.
The analysis indicates that, given proper recalibration using census information, Twitter information can translate into fairly accurate information about the political landscape even where Twitter users are not as common. Further work will be needed to modify the model so that swing voters are calibrated into the equation. At the same time, more important work will be needed to handle the issue of astroturfing as well as the psychology of voters.

Bibliography

Tumasjan, A.; Sprenger, T. O.; Sandner, P. G.; and Welpe, I. M. 2010. Predicting elections with Twitter: What 140 characters reveal about political sentiment. In International AAAI Conference on Weblogs and Social Media. Washington, DC.

Böhringer, M., and Richter, A. 2009. Adopting Social Software to the Intranet: A Case Study on Enterprise Microblogging. In Proceedings of the 9th Mensch & Computer Conference, 293-302. Berlin.

Fang, Charlene (7 May 2011). "Charlene Fang: Why this Singapore General Election is important". CNN Go. Retrieved 7 August 2011.

Drezner, D. W., and Farrell, H. 2007. Introduction: Blogs, politics and power: a special issue of Public Choice. Public Choice, 134(1-2): 1-13.

Department of Statistics Singapore 2011, Census 2010.

Farrell, H., and Drezner, D. W. 2008. The power and politics of blogs. Public Choice, 134(1-2): 15-30.

Honeycutt, C., and Herring, S. C. 2009. Beyond microblogging: Conversation and collaboration via Twitter. In 42nd Hawaii International Conference on System Sciences, 1-10. Hawaii.

Huberman, B. A.; Romero, D. M.; and Wu, F. 2008. Social Networks that Matter: Twitter Under the Microscope. Retrieved on December 15, 2009 from http://ssrn.com/abstract=1313405

IDA 2010 survey on household computer usage, 2010.

Jansen, B. J.; Zhang, M.; Sobel, K.; and Chowdury, A. 2009. Twitter power: Tweets as electronic word of mouth. Journal of the American Society for Information Science and Technology, 60: 1-20.

Jansen, H. J., and Koop, R. 2005. Pundits, Ideologues, and Ranters: The British Columbia Election Online. Canadian Journal of Communication, 30(4): 613-632.

Java, A.; Song, X.; Finin, T.; and Tseng, B. 2007. Why we twitter: understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, 56-65. San Jose, CA: ACM.

Koop, R., and Jansen, H. J. 2009. Political Blogs and Blogrolls in Canada: Forums for Democratic Deliberation? Social Science Computer Review, 27(2): 155-173.

McKenna, L., and Pole, A. 2007. What do bloggers do: an average day on an average political blog. Public Choice, 134(1-2): 97-108.

Nielsen Media Research. 2009. Das Phänomen Twitter in Deutschland. Retrieved December 15, 2009 from http://de.nielsen.com/news/NielsenPressemeldung04.08.2009-Twitter.shtml

Pasek, Josh, Daniel Romer, and Kathleen Hall Jamieson. "America's Youth and Community Engagement: How Use of Mass Media Is Related to Civic Activity and Political Awareness in 14- to 22-Year-Olds." Communication Research 33 (2006): 115-35.

pearanalytics 2009. Twitter Study. Retrieved December 15, 2009 from http://www.pearanalytics.com/wpcontent/uploads/2009/08/Twitter-Study-August-2009.pdf

"Social media dents invincibility of Singapore's rulers". Scoop. 9 May 2011. Retrieved 7 August 2011.

Skemp, K. 2009. All A-Twitter about the Massachusetts Senate Primary. Retrieved December 15, 2009 from http://bostonist.com/2009/12/01/massachusetts_senate_primary_debate_twitter.php

Sunstein, Cass. 2007. Neither Hayek nor Habermas. Public Choice, 134(1-2): 87-95.

"A Singaporean minister again in a hot seat". Straits Times. 28 April 2011. Retrieved 8 May 2011. http://www.menafn.com/qn_news_story.asp?StoryId={d88efbec-96e4-4fe4-a726-349b3147c5b8}

Stirland, S. "Obama's Secret Weapons: Internet, Databases and Psychology". October 29, 2008. http://www.wired.com/threatlevel/2008/10/obamas-secret-w/ Retrieved on Aug 26, 2011.

The Economist, Banyan Tree, Low expectations. Retrieved 28 August 2011. http://www.economist.com/node/18681827?story_id=18681827&fsrc=rss

Twitter, 2011. https://dev.twitter.com/docs/api/1/get/search, August 09, 2011. Retrieved on Aug 26, 2011.

Williams, C., and Gulati, G. 2008. What is a Social Network Worth? Facebook and Vote Share in the 2008 Presidential Primaries. In Annual Meeting of the American Political Science Association, 1-17. Boston, MA.

Xenos, M., and W. Lance Bennett. "The Disconnection in Online Politics: The Youth Political Sphere and US Election Sites, 2002-2004." Information, Communication, and Society 10 (2007): 443-64.
