Bioinformatics, 2015, 0–0
doi: 10.1093/bioinformatics
Letter to the editor
Searching PubMed for articles relevant to clinical interpretation of rare
human genetic variants
Andrew J. McMurry, PhD 1*
1 The Apache Software Foundation Dept. 9660 Los Angeles, CA 90084-9660 U.S.A.
*To whom correspondence should be addressed.
Contact: AndyMC@apache.org
To the editor:
While the speed and cost of genome sequencing have improved
dramatically, the task of interpreting gene sequences for
clinical purposes remains challenging [1-4]. Thousands of
investigations into the pathogenicity of genetic variants have
been completed and reported in peer-reviewed studies. However,
which studies should be reviewed for each patient genome? The
task of matching sequenced variants to the available evidence
quickly exceeds human capacity.
Numerous challenges delay clinical interpretation of human
genetic variants. To name a few: (1) unstructured PubMed
articles are the most abundant source of evidence, yet their
variant annotations are difficult to query uniformly; (2)
variants can be reported in many different ways, for example as
a DNA sequence change or a protein modification; (3)
annotations have drifted over time across genome reference
assemblies and transcript alignments; and (4) no single
laboratory has sufficient numbers of human samples,
necessitating precompetitive efforts to share evidence for
clinical interpretation.
Meaningful progress at the US National Library of Medicine and
elsewhere has improved the capability to query and mine genetic
databases [5-7] and PubMed articles [8-10] for rare genetic
variants. Used together, these tools can extract genetic
variants from article text [10] and translate these mentions
into formats recognized by the Human Genome Variation Society
(HGVS) standards organization [11]. The implications of this
capability are profound: PubMed articles can be indexed to
provide near-real-time lookup of articles relevant to a
specific genetic variant.
Search PubMed for articles relevant to a patient genome
As an example, imagine that a physician wants to rule out
breast cancer for a patient and orders genetic testing
according to the approved guidelines for breast cancer [12]. In
this example, a very rare genetic variant of unknown
pathogenicity is found in the BRCA2 gene. The standard list of
widely cited databases is then checked: BIC, the Breast Cancer
Information Core [13]; ClinVar, the NCBI repository of clinical
variants [5]; and HGMD, the Human Gene Mutation Database [14].
Rare genetic sequence variants typically have no annotations
available in any structured database, resulting in either a
laborious review of potentially thousands of genetics articles
or an uncertain interpretation: Variant of Unknown Significance
(VUS).
To aid clinical interpretation, Natural Language Processing
(NLP) tools can be combined to extract mentions of genetic
variants from PubMed article text [10]. Variant NLP tools exist
for both rare and common genetic variants, germline and
somatic. Relatively common genetic variants, those with a Minor
Allele Frequency (MAF) greater than 1%, may have standardized
nomenclature and annotation available in dbSNP [7].
Intuitively, rare variants are seldom available in gene
databases and thus present the greatest challenge for
interpretation.
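The kind of extraction described above can be illustrated with a minimal sketch. It is not how tmVar or SETH work internally. It assumes only two simplified HGVS-style patterns (a coding DNA substitution and a three-letter protein substitution), and the function name extract_mentions and the example sentence are hypothetical.

```python
import re

# Simplified patterns for illustration only. Real tools handle many more
# forms (deletions, insertions, free-text and deprecated notations).
DNA_SUB = re.compile(r"\bc\.(\d+)([ACGT])>([ACGT])\b")
PROTEIN_SUB = re.compile(r"\bp\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\b")


def extract_mentions(text):
    """Return (molecule_type, mention) pairs in order of appearance."""
    hits = []
    for kind, pattern in (("DNA", DNA_SUB), ("protein", PROTEIN_SUB)):
        for match in pattern.finditer(text):
            # Keep the start offset so results can be sorted by position.
            hits.append((match.start(), kind, match.group(0)))
    return [(kind, mention) for _, kind, mention in sorted(hits)]


# Hypothetical sentence, not taken from any real article.
sentence = "A BRCA2 variant, c.68A>G (p.Asp23Gly), was observed in one family."
mentions = extract_mentions(sentence)
```

A sketch like this misses the improperly formatted mentions that dedicated tools are designed to catch, which is one reason to prefer them in practice.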
From DNA to RNA to protein
Results from published genetic studies may describe the variant
of interest in terms of the genomic variant, DNA coding
sequence, RNA transcript, or change in the resulting protein
[15,16,17]. Modern variant tools such as tmVar [18] and SETH
[9] extract mentions for all of these molecular types. These
tools capture not only standard HGVS mentions of DNA, RNA, and
protein variants, but also improperly formatted mentions.
Mappings between molecular types can be achieved using a wide
range of open source tools. Importantly, variant mapping tools
enable lookups of variants annotated against previous genomic
references: older articles annotated using outdated genome and
transcript assemblies can still be indexed and searched.
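To make the idea of mapping between molecular types concrete, the following sketch translates a single-base coding substitution into a one-letter protein change using the standard genetic code. It is a toy under strong assumptions: the coding sequence is supplied directly, and transcript versions, introns, strand, and insertions or deletions are all ignored. The function name protein_change and the nine-base sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def protein_change(cds, pos, ref, alt):
    """Map a coding substitution (1-based pos) to a one-letter protein change."""
    if cds[pos - 1] != ref:
        raise ValueError("reference base does not match the coding sequence")
    start = (pos - 1) // 3 * 3           # first base of the affected codon
    old_codon = cds[start:start + 3]
    mutated = cds[:pos - 1] + alt + cds[pos:]
    new_codon = mutated[start:start + 3]
    residue = start // 3 + 1             # 1-based amino acid position
    return f"p.{CODON_TABLE[old_codon]}{residue}{CODON_TABLE[new_codon]}"


# Hypothetical toy coding sequence: ATG GAT CCT (Met, Asp, Pro).
toy_cds = "ATGGATCCT"
change = protein_change(toy_cds, 5, "A", "G")  # GAT becomes GGT
```

Real mappings depend on the reference assembly and transcript alignment in use, which is why older annotations need dedicated mapping tools rather than a direct translation like this one.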
Open source variant tools for searching PubMed
Multiple open source packages can be combined [10] to provide
as comprehensive a literature search as possible (Table 1).
PubMed abstracts are routinely indexed by NCBI for gene names
using GNorm+ [19,20] and for textual mentions of genetic
variants using NCBI tmVar [18]. NCBI conveniently provides
downloadable files of the indexes on the NCBI public FTP site,
which can be quickly downloaded and indexed in a local database
[8]. For articles where abstract text or full text is
available, SETH [9] can be used to extract variant mentions
according to current and deprecated HGVS nomenclature.
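As a rough illustration of indexing such mentions in a local database, the sketch below loads (article, gene, variant) rows into SQLite and looks up the articles for one variant. The table layout, the function names build_index and articles_for, and the placeholder rows are assumptions made for this example. They do not reflect the actual format of the NCBI download files.

```python
import sqlite3


def build_index(rows):
    """Load (pmid, gene, variant) tuples into an in-memory SQLite index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mention (pmid TEXT, gene TEXT, variant TEXT)")
    # The index on (gene, variant) is what makes per-variant lookups fast.
    conn.execute("CREATE INDEX idx_variant ON mention (gene, variant)")
    conn.executemany("INSERT INTO mention VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def articles_for(conn, gene, variant):
    """Return the sorted, de-duplicated article IDs mentioning a variant."""
    cur = conn.execute(
        "SELECT DISTINCT pmid FROM mention "
        "WHERE gene = ? AND variant = ? ORDER BY pmid",
        (gene, variant),
    )
    return [row[0] for row in cur]


# Placeholder rows with made-up article identifiers.
rows = [
    ("PMID_A", "BRCA2", "c.68A>G"),
    ("PMID_B", "BRCA2", "c.68A>G"),
    ("PMID_B", "BRCA2", "c.68A>G"),
    ("PMID_C", "BRCA1", "c.5A>G"),
]
conn = build_index(rows)
hits = articles_for(conn, "BRCA2", "c.68A>G")
```

In a fuller pipeline, variant strings would first be normalized to a common HGVS form so that DNA-level and protein-level mentions of the same change resolve to the same lookup key.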
Table 1. Open access variant tools useful for searc