Reconstruction of Protein-Protein Interaction Pathways
by Mining Subject-Verb-Objects Intermediates
Maurice HT Ling1,2, Christophe Lefevre3,
Kevin R. Nicholas2, Feng Lin1
1 BioInformatics Research Centre, Nanyang Technological University, Singapore
2 CRC for Innovative Dairy Products, Department of Zoology,
The University of Melbourne, Australia
3 Victorian Bioinformatics Consortium, Monash University, Australia
mauriceling@acm.org, k.nicholas@zoology.unimelb.edu.au,
Chris.Lefevre@med.monash.edu.au, ASFLIN@ntu.edu.sg
Abstract. The exponential increase in the publication rate of new articles is
limiting researchers' access to relevant literature. This has prompted the use
of text mining tools to extract key biological information. Previous studies have
reported extensive modification of existing generic text processors to process
biological text. However, this requirement for modification has not been
examined. In this study, we constructed Muscorian using MontyLingua, a
generic text processor. It uses a previously proposed two-layered
generalization-specialization paradigm in which text is generically processed
into a suitable intermediate format before domain-specific data extraction
techniques are applied at the specialization layer. Evaluation using a corpus
and experts indicated 86-90% precision and approximately 30% recall in
extracting protein-protein interactions, which is comparable to previous
studies using either specialized biological text processing tools or modified
existing tools. Our study also demonstrates the flexibility of the two-layered
generalization-specialization paradigm by using the same generalization layer
for two specialized information extraction tasks.
Keywords: biomedical literature analysis, protein-protein interaction,
montylingua
1 Introduction
PubMed currently indexes more than 16 million papers, with about one million and
1.2 million papers added in 2005 and 2006, respectively. A simple keyword
search in PubMed showed that nearly 900 thousand papers on mouse and more than
1.3 million papers on rat research have been indexed to date, and in the last
four years more than 150 thousand papers have been published on each of mouse and
rat research. This growth in the volume of research papers indexed in PubMed
over the last 10 years makes it difficult for researchers to maintain an active and
productive assessment of relevant literature. Information extraction (IE) has been
used as a tool to analyze biological text to derive assertions on specific biological
domains [30], such as protein phosphorylation [19] or entity interactions [1].
IE tools used for mining information from biological text can be classified
into generic tools intended for general application and tools that treat
biological text as specialized text requiring domain-specific processing.
The latter view has led to the development of specialized part-of-speech (POS)
tag sets (such as SPECIALIST [28]), POS taggers (such as MedPost [33]),
ontologies [11], text processors (such as MedLEE [15]), and full IE systems,
such as GENIES [16], MedScan [29], MeKE [4], Arizona Relation Parser [10], and
GIS [5]. An alternative approach assumes that biological text is not specialized
enough to warrant re-development of tools and that adaptation of existing or
generic tools will suffice. To this end, BioRAT [12] modified GATE [8],
MedTAKMI [36] modified TAKMI [27], which was originally used in call centres,
and Santos [31] used the Link grammar parser [32].
Although both approaches demonstrated similar performance, both developing
specialized systems and modifying existing systems were time-consuming [20]. Work by
Grover [17] suggested that native generic tools may be used for biological text, and a
recent review highlighted successful uses of a generic text processing system,
MontyLingua [14, 23], for a number of purposes [22]. For example, MontyLingua has
been used to process published economics papers for concept extraction [35]. However,
the need to modify generic text processors has not been formally examined, and the
question of whether an unmodified, generic text processor can be used in biological
text analysis with comparable performance remains to be assessed.
In this study, we evaluated a native, generic text processing system, MontyLingua
[23], in a two-layered generalization-specialization architecture [29] where the
generalization layer processes biological text into an intermediate knowledge
representation for the specialization layer to extract genic or entity-entity interactions.
This system demonstrated 86.1% precision on the Learning Logic in Languages 2005
evaluation data [9], and 88.1% and 90.7% precision in extracting protein-protein
binding and activation interactions, respectively. Our results were comparable to
previous work that modified generic text processing systems, which reported precision
ranging from 53% [24] to 84% [5], suggesting this modification may
…(Full text truncated)…
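As an illustration of the two-layered generalization-specialization architecture described in the Introduction, the following is a minimal sketch and not the Muscorian implementation. It assumes the generalization layer has already produced subject-verb-object (SVO) intermediates; the `SVO` type, the sample triples, the protein name list, and the function names are all hypothetical and chosen only for this example. No MontyLingua call is shown. The specialization layer keeps SVOs whose verb matches the interaction of interest and whose subject and object are both known protein names, and precision and recall are then computed against a hand-made gold standard.

```python
from typing import NamedTuple


class SVO(NamedTuple):
    """Hypothetical intermediate format: one subject-verb-object triple."""
    subject: str
    verb: str
    obj: str


# Assumed output of the generalization layer (hand-written for illustration).
SVOS = [
    SVO("CREB", "bind", "CBP"),
    SVO("insulin", "activate", "Akt"),
    SVO("the authors", "bind", "the samples"),  # not a protein interaction
    SVO("p53", "inhibit", "MDM2"),
]

# Hypothetical dictionary of protein names, stored in lower case.
PROTEINS = {"creb", "cbp", "insulin", "akt", "p53", "mdm2"}


def extract_interactions(svos, verb, proteins):
    """Specialization layer: keep SVOs with the target verb whose
    subject and object are both known protein names."""
    return [
        (s.subject, s.verb, s.obj)
        for s in svos
        if s.verb == verb
        and s.subject.lower() in proteins
        and s.obj.lower() in proteins
    ]


def precision_recall(predicted, gold):
    """Precision = TP / |predicted|, recall = TP / |gold|."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# The same intermediates serve two specialized extraction tasks.
binding = extract_interactions(SVOS, "bind", PROTEINS)
activation = extract_interactions(SVOS, "activate", PROTEINS)
```

In this sketch the same list of SVO intermediates is reused for the binding and activation tasks, mirroring the claim that one generalization layer can serve more than one specialization task. Only the verb and the name filter change between tasks.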