KeyXtract Twitter Model - An Essential Keywords Extraction Model for Twitter Designed using NLP Tools

Reading time: 6 minutes

📝 Original Info

  • Title: KeyXtract Twitter Model - An Essential Keywords Extraction Model for Twitter Designed using NLP Tools
  • ArXiv ID: 1708.02912
  • Date: 2017-08-10
  • Authors: Tharindu Weerasooriya, Nandula Perera, S.R. Liyanage (University of Kelaniya, Sri Lanka)

📝 Abstract

Since a tweet is limited to 140 characters, it is ambiguous and difficult for traditional Natural Language Processing (NLP) tools to analyse. This research presents KeyXtract which enhances the machine learning based Stanford CoreNLP Part-of-Speech (POS) tagger with the Twitter model to extract essential keywords from a tweet. The system was developed using rule-based parsers and two corpora. The data for the research was obtained from a Twitter profile of a telecommunication company. The system development consisted of two stages. At the initial stage, a domain specific corpus was compiled after analysing the tweets. The POS tagger extracted the Noun Phrases and Verb Phrases while the parsers removed noise and extracted any other keywords missed by the POS tagger. The system was evaluated using the Turing Test. After it was tested and compared against Stanford CoreNLP, the second stage of the system was developed addressing the shortcomings of the first stage. It was enhanced using Named Entity Recognition and Lemmatization. The second stage was also tested using the Turing test and its pass rate increased from 50.00% to 83.33%. The performance of the final system output was measured using the F1 score. Stanford CoreNLP with the Twitter model had an average F1 of 0.69 while the improved system had an F1 of 0.77. The accuracy of the system could be improved by using a complete domain specific corpus. Since the system used linguistic features of a sentence, it could be applied to other NLP tools.


📄 Full Content

Proceedings of the 10th KDU International Research Conference, Sri Lanka

KeyXtract Twitter Model - An Essential Keywords Extraction Model for Twitter Designed using NLP Tools

Tharindu Weerasooriya1#, Nandula Perera2, S.R. Liyanage3
1#Department of Statistics and Computer Science, University of Kelaniya, Sri Lanka
2Department of English, University of Kelaniya, Sri Lanka
3Department of Software Engineering, University of Kelaniya, Sri Lanka

Abstract— Since a tweet is limited to 140 characters, it is ambiguous and difficult for traditional Natural Language Processing (NLP) tools to analyse. This research presents KeyXtract which enhances the machine learning based Stanford CoreNLP Part-of-Speech (POS) tagger with the Twitter model to extract essential keywords from a tweet. The system was developed using rule-based parsers and two corpora. The data for the research was obtained from a Twitter profile of a telecommunication company. The system development consisted of two stages. At the initial stage, a domain specific corpus was compiled after analysing the tweets. The POS tagger extracted the Noun Phrases and Verb Phrases while the parsers removed noise and extracted any other keywords missed by the POS tagger. The system was evaluated using the Turing Test. After it was tested and compared against Stanford CoreNLP, the second stage of the system was developed addressing the shortcomings of the first stage. It was enhanced using Named Entity Recognition and Lemmatization. The second stage was also tested using the Turing test and its pass rate increased from 50.00% to 83.33%. The performance of the final system output was measured using the F1 score. Stanford CoreNLP with the Twitter model had an average F1 of 0.69 while the improved system had an F1 of 0.77. The accuracy of the system could be improved by using a complete domain specific corpus. Since the system used linguistic features of a sentence, it could be applied to other NLP tools.

Keywords — Natural Language Processing, Stanford CoreNLP, Tweet Analysis, Named Entity Recognition, Lemmatization, Keyword Extraction, Turing Test

I. INTRODUCTION

Natural Language Processing (NLP) has seen unprecedented development over the past two decades (Zitouni, 2014). Keyword extraction in NLP is used during Question and Answering (Q&A) processes.
In understanding a question, humans extract keywords that are vital in synthesizing the answer. These specific words can also be used to back-formulate the question. In NLP, POS tags can be used to extract key ideas from a sentence. One of the most fertile grounds on which to put NLP to the test is Twitter. A tweet might be ambiguous and is not always grammatically correct. Hence, conventional POS tagging methods cannot be used to extract keywords from a tweet.
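Before any tagging can work on such noisy input, rule-based cleanup is typically applied. The sketch below is a minimal, hypothetical illustration of that idea (it is not the paper's parser): it strips URLs and @-mentions and unwraps hashtags so a POS tagger sees plain words.

```python
import re

def strip_tweet_noise(tweet: str) -> str:
    """Remove URLs and @-mentions, and unwrap hashtags to plain words."""
    tweet = re.sub(r"https?://\S+", "", tweet)   # drop URLs
    tweet = re.sub(r"@\w+", "", tweet)           # drop @-mentions
    tweet = re.sub(r"#(\w+)", r"\1", tweet)      # keep the word behind a hashtag
    return " ".join(tweet.split())               # normalize whitespace

print(strip_tweet_noise("@DialogAxiata my #data is slow http://t.co/xyz"))
# → "my data is slow"
```

A real system would also need to handle retweet markers, emoticons, and elongated spellings, but the same substitution-rule pattern extends naturally to those cases.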
Corporate giants often answer customer support requests through Twitter™, which has 320 million active users per month (Twitter Usage / Company Facts, 2016). In Sri Lanka, Dialog Axiata is a prominent telecommunication company that provides this service. Automating this process is challenging for a machine, as interpreting a tweet could be problematic.
This research presents KeyXtract, a new utilization of Stanford CoreNLP (Manning et al., 2014), a widely used machine learning based NLP tool. The research was conducted in two stages. The Twitter Model for KeyXtract presented in this paper extends the system developed in Stage 1 of the research. In the first stage (Weerasooriya, Perera and Liyanage, 2016), Stanford CoreNLP was enhanced using parsers (to extract essential keywords using the linguistic features of a sentence) and a domain specific corpus (consisting of 206 words). The second stage presented in this paper consists of improvements made based on the evaluation results of Stage 1. The Turing test was used to evaluate the success of this method in imitating human logic, and its performance was measured using the F1 score.
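The F1 comparison between extracted keywords and a human gold standard can be computed set-wise. The following is a small sketch of that calculation (the sample keyword sets are invented for illustration, not taken from the paper's data):

```python
def f1_score(extracted, gold):
    """F1 between an extracted keyword set and a human gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)          # keywords both sides agree on
    if tp == 0:
        return 0.0
    precision = tp / len(extracted)     # fraction of extracted that are correct
    recall = tp / len(gold)             # fraction of gold that were found
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: 2 of 3 extracted keywords match the gold set.
print(f1_score({"data", "slow", "connection"}, {"data", "connection", "outage"}))
# → 0.6666...
```

Averaging this score over a test set of tweets yields figures comparable to the 0.69 vs. 0.77 reported in the abstract.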
II. RELATED WORK

A. Extracting keywords
Mitkov and Ha state that to extract a “ ‘keyword phrase’, a list of semantically close terms including a noun phrase, verb phrase, adjective phrase and adverb phrase” (Mitkov and Ha, 1999) should be considered. In the current study, Noun Phrases (NP) and Verb Phrases (VP) are used in keyword extraction.

B. Current tools in NLP and POS Tagging
Currently, Stanford CoreNLP (version 3.6.0) (Manning et al., 2014), OpenNLP (version 1.6.0) (Welcome to Apache OpenNLP, 2013) and NLP4J (version 1.1.3) (emorynlp/nlp4j: NLP tools developed by Emory University, 2016) are the widely used machine learning based open-source NLP tools for Java. These are the NLP tools with the highest level of accuracy. The NLP tool named ANNIE POS tagger (included with GATE, version 8.2) (Cunningham et al., 2001) uses a rule-based approach in contrast to machine learning based approaches.
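To make the NP/VP extraction idea concrete, here is a toy chunker over already-POS-tagged tokens (Penn Treebank tags). It is a simplified stand-in, not the paper's Stanford CoreNLP pipeline: it groups runs of determiners/adjectives/nouns into NPs and runs of verbs into VPs.

```python
def chunk_phrases(tagged):
    """Group consecutive (word, tag) pairs into simple NP/VP chunks."""
    chunks, i = [], 0
    while i < len(tagged):
        tag = tagged[i][1]
        if tag.startswith(("DT", "JJ", "NN")):
            start = i
            while i < len(tagged) and tagged[i][1].startswith(("DT", "JJ", "NN")):
                i += 1
            # Keep the run only if it actually contains a noun.
            if any(t.startswith("NN") for _, t in tagged[start:i]):
                chunks.append(("NP", " ".join(w for w, _ in tagged[start:i])))
        elif tag.startswith("VB"):
            start = i
            while i < len(tagged) and tagged[i][1].startswith("VB"):
                i += 1
            chunks.append(("VP", " ".join(w for w, _ in tagged[start:i])))
        else:
            i += 1
    return chunks

tagged = [("the", "DT"), ("signal", "NN"), ("keeps", "VBZ"),
          ("dropping", "VBG"), ("in", "IN"), ("Colombo", "NNP")]
print(chunk_phrases(tagged))
# → [('NP', 'the signal'), ('VP', 'keeps dropping'), ('NP', 'Colombo')]
```

A production system delegates both tagging and chunking to a trained parser; the rule-based sketch only shows why NP/VP spans are natural keyword candidates.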

…(Full text truncated)…


