Transfer Learning for Scientific Data Chain
Extraction in Small Chemical Corpus with
BERT-CRF Model
Na Pang1,2, Li Qian1,2, Weimin Lyu3, Jin-Dong Yang4
1 National Science Library, Chinese Academy of Science, Beijing 100190, China
2 Department of Library, Information and Archives Management, University of
Chinese Academy of Science, Beijing 100190, China
pangna@mail.las.ac.cn
3 City University of New York, New York, USA
4 Center of Basic Molecular Science (CBMS), Department of Chemistry,
Tsinghua University, Beijing, 100084, China
Abstract. Computational chemistry has developed rapidly in recent years,
driven by breakthroughs in AI. Thanks to progress in natural language
processing, researchers can extract more fine-grained knowledge from
publications to advance computational chemistry. Because existing work
and corpora for chemical entity extraction are largely restricted to
biomedicine and the life sciences rather than chemistry itself, we build
a new corpus in the chemical bond field annotated with 7 types of
entities: compound, solvent, method, bond, reaction, pKa and pKa value.
This paper presents a novel BERT-CRF model that builds scientific
chemical data chains by extracting these 7 chemical entities and their
relations from publications, and we propose a joint model to extract the
entities and relations simultaneously. Experimental results on our
Chemical Special Corpus show that the model achieves state-of-the-art,
competitive NER performance.
Keywords: transfer learning · pre-training · fine-tuning · entity
extraction · relation extraction · scientific data chain extraction · BERT-CRF.
1 Introduction
Recently, AI has stimulated the application of chemistry in many fields, such
as computational chemistry and synthetic chemistry. Several studies have
highlighted the significance of AI’s role in chemistry. Scientists have used
deep neural networks and Monte Carlo tree search to plan chemical syntheses
and discover more retrosynthetic routes in a short time[1], proposed a machine
learning method that performs chemical reactions and analysis faster than they
could be performed manually and predicts the reactivity of possible reagent
combinations[2], and borrowed word2vec from NLP to create Atom2Vec, an
unsupervised model that predicts materials properties[3]. There is no doubt
that AI is revolutionizing our understanding of chemistry. In chemistry, and
especially in computational chemistry, the chemical bond energy (pKa) is
essential, yet most values reported in scientific papers are extracted
manually by experts, and no existing work attempts to extract pKa values with
NLP methods.
In particular, we consider three challenges in applying scientific data
chain extraction: (1) existing corpora may not suit our task because they
focus on general chemicals or drugs; (2) popular chemical NER systems use
machine learning or deep learning methods, which require abundant training
data; (3) unlike state-of-the-art methods that extract triplets
{E1, relation, E2}, our entities are not confined to triplets: some are
irrelevant to relation extraction, and some participate in 1:n or n:1
relations rather than 1:1 relations. This difference makes extracting
scientific chemical data chains a significantly harder task.
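The contrast in challenge (3) between fixed triplets and data chains can be made concrete with a small data-structure sketch. It is an illustration only, not the representation used in this paper: the class names `Entity` and `DataChain`, the helper functions, the relation label `related_to`, and the placeholder entity texts are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    text: str
    type: str  # one of the 7 annotated types, e.g. "compound", "pKa value"


@dataclass(frozen=True)
class DataChain:
    # Hypothetical container: one relation that may link more than two entities.
    members: Tuple[Entity, ...]


def chains_containing(entity: Entity, chains: List[DataChain]) -> List[DataChain]:
    """Return every chain the entity takes part in (1:n when more than one)."""
    return [c for c in chains if entity in c.members]


def to_triplets(chain: DataChain, relation: str = "related_to"):
    """Flatten a chain into pairwise triplets; the grouping of members is lost."""
    m = chain.members
    return [(m[i].text, relation, m[j].text)
            for i in range(len(m)) for j in range(i + 1, len(m))]


# Placeholder entities, not taken from any real publication.
compound = Entity("compound A", "compound")
s1, s2 = Entity("solvent S1", "solvent"), Entity("solvent S2", "solvent")
v1, v2 = Entity("value v1", "pKa value"), Entity("value v2", "pKa value")
method = Entity("method M", "method")  # extracted, but part of no relation

# The same compound appears in two chains: a 1:n relation.
chains = [DataChain((compound, s1, v1)), DataChain((compound, s2, v2))]
entities = [compound, s1, v1, s2, v2, method]

# Entities that were extracted but belong to no chain.
linked = {e for c in chains for e in c.members}
unlinked = [e for e in entities if e not in linked]

print(len(chains_containing(compound, chains)))
print([e.text for e in unlinked])
print(to_triplets(chains[0]))
```

Flattening a three-member chain yields three pairwise triplets, and once the triplets from several chains are pooled, nothing records which solvent belongs with which value. Keeping the chain as one unit preserves that grouping.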
The first challenge is caused by corpus accessibility. Currently, most
named entity extraction experiments and corpora are in biomedicine or the
life sciences and focus on extracting chemical drugs. Moreover, some corpora,
such as the PubMed corpus and the Sciborg corpora[18], may not be accessible.
Considering the need to extract chemical bond energy automatically to promote
the development of computational chemistry, and to address semantic problems
and numerous unknown words, we create a new corpus of papers from the
chemical bond field.
The second challenge stems from the data requirements of state-of-the-art
deep learning architectures. Deep learning methods usually require large
amounts of data to train a good model, but existing corpora for data chain
extraction are both hard to obtain and small in scale. Worse, most corpora
focus on fields other than chemistry. Given this situation, we also try to
use transfer learning to relieve the problem, pre-training on a large
out-of-domain corpus before training on the chemistry-specific in-domain corpus.
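The two-stage idea, training on out-of-domain data first and then continuing on a small in-domain set, can be sketched with a deliberately tiny stand-in. This is not the BERT-CRF model of this paper: a token-level perceptron is swapped in so that the sketch runs without any downloads. The feature set, the two-label tag set, and the toy sentences are all assumptions made for illustration.

```python
from collections import defaultdict

LABELS = ["O", "VALUE"]  # hypothetical tag set for this toy example


def features(tokens, i):
    """Toy features for token i: its word, the previous word, and a digit flag."""
    word = tokens[i]
    prev = tokens[i - 1].lower() if i > 0 else "<s>"
    has_digit = any(ch.isdigit() for ch in word)
    return [f"w={word.lower()}", f"prev={prev}", f"digit={has_digit}"]


def predict(weights, feats):
    """Pick the label with the highest summed feature weight (ties go to 'O')."""
    scores = {y: sum(weights[(f, y)] for f in feats) for y in LABELS}
    return max(LABELS, key=lambda y: scores[y])


def tag(weights, tokens):
    return [predict(weights, features(tokens, i)) for i in range(len(tokens))]


def train(weights, data, epochs=5):
    """Perceptron updates applied in place, so training can resume from
    weights learned on another corpus."""
    for _ in range(epochs):
        for tokens, gold_tags in data:
            for i, gold in enumerate(gold_tags):
                feats = features(tokens, i)
                guess = predict(weights, feats)
                if guess != gold:
                    for f in feats:
                        weights[(f, gold)] += 1.0
                        weights[(f, guess)] -= 1.0
    return weights


# Toy corpora invented for this sketch, not taken from any real dataset.
OUT_OF_DOMAIN = [
    ("The price is 42".split(), ["O", "O", "O", "VALUE"]),
    ("It costs 7 dollars".split(), ["O", "O", "VALUE", "O"]),
]
IN_DOMAIN = [
    ("pKa of X2 is 9.5".split(), ["O", "O", "O", "O", "VALUE"]),
]

# Stage 1: "pre-train" on the larger out-of-domain data.
weights = train(defaultdict(float), OUT_OF_DOMAIN)
before = tag(weights, IN_DOMAIN[0][0])

# Stage 2: continue training the same weights on the in-domain data.
train(weights, IN_DOMAIN)
after = tag(weights, IN_DOMAIN[0][0])

print(before)
print(after)
```

In this toy setting the first stage already transfers the cue that digits signal a value, but it mislabels the compound-like token `X2`, which also contains a digit. The second stage corrects that while starting from the transferred weights rather than from zero.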
The third challenge is caused by the aim of our project and the
characteristics of our corpus. In our project, we extract not only the
entities that take part in relations but also the irrelevant entities, to
help researchers read and confirm the relations extracted by our system.
Moreover, a relation with multiple entities is more complex than a
traditional triplet. For this reason,
we construct our own tagging scheme to extract more extensive
…(Full text truncated)…