Transfer Learning for Scientific Data Chain Extraction in Small Chemical Corpus with BERT-CRF Model

📝 Original Info

  • Title: Transfer Learning for Scientific Data Chain Extraction in Small Chemical Corpus with BERT-CRF Model
  • ArXiv ID: 1905.05615
  • Date: 2019-05-15
  • Authors: Na Pang, Li Qian, Weimin Lyu, Jin-Dong Yang

📝 Abstract

Computational chemistry has developed rapidly in recent years, driven by growth and breakthroughs in AI. Thanks to progress in natural language processing, researchers can extract more fine-grained knowledge from publications to advance computational chemistry. Because existing work and corpora for chemical entity extraction are largely confined to biomedicine and the life sciences rather than chemistry itself, we build a new corpus in the chemical bond field, annotated for 7 types of entities: compound, solvent, method, bond, reaction, pKa, and pKa value. This paper presents a novel BERT-CRF model that builds scientific chemical data chains by extracting these 7 chemical entities and their relations from publications. We also propose a joint model that extracts the entities and relations simultaneously. Experimental results on our Chemical Special Corpus show that we achieve state-of-the-art and competitive NER performance.

📄 Full Content

Transfer Learning for Scientific Data Chain Extraction in Small Chemical Corpus with BERT-CRF Model

Na Pang1,2, Li Qian1,2, Weimin Lyu3, Jin-Dong Yang4

1 National Science Library, Chinese Academy of Science, Beijing 100190, China
2 Department of Library, Information and Archives Management, University of Chinese Academy of Science, Beijing 100190, China
pangna@mail.las.ac.cn
3 City University of New York, New York, USA
4 Center of Basic Molecular Science (CBMS), Department of Chemistry, Tsinghua University, Beijing, 100084, China

Abstract. Computational chemistry has developed rapidly in recent years, driven by growth and breakthroughs in AI. Thanks to progress in natural language processing, researchers can extract more fine-grained knowledge from publications to advance computational chemistry. Because existing work and corpora for chemical entity extraction are largely confined to biomedicine and the life sciences rather than chemistry itself, we build a new corpus in the chemical bond field, annotated for 7 types of entities: compound, solvent, method, bond, reaction, pKa, and pKa value. This paper presents a novel BERT-CRF model that builds scientific chemical data chains by extracting these 7 chemical entities and their relations from publications. We also propose a joint model that extracts the entities and relations simultaneously. Experimental results on our Chemical Special Corpus show that we achieve state-of-the-art and competitive NER performance.

Keywords: transfer learning · pre-training · fine-tuning · entity extraction · relation extraction · scientific data chain extraction · BERT-CRF.

1 Introduction

Recently, AI has stimulated the application of chemistry in many fields, such as computational chemistry and synthetic chemistry. Several studies have highlighted the significance of AI's role in chemistry. Scientists have used deep neural networks and Monte Carlo tree search to plan chemical syntheses and discover more retrosynthetic routes in a short time[1], proposed a machine learning method that performs chemical reactions and analysis faster than they could be performed manually and predicts the reactivity of possible reagent combinations[2], and borrowed word2vec from NLP to create the unsupervised Atom2Vec model to predict material properties[3]. There is no doubt that AI is revolutionizing our understanding of chemistry.

In chemistry, especially in computational chemistry, the chemical bond energy (pKa) is essential, yet most values in scientific papers are extracted manually by experts, and no existing work attempts to extract pKa with NLP methods. In particular, we consider three challenges in applying scientific data chain extraction: (1) existing corpora may not suit our task because they focus on general chemicals or drugs; (2) popular chemical NER systems rely on machine learning or deep learning methods, which require abundant training data; (3) unlike state-of-the-art methods that extract triplets {E1, relation, E2}, our entities are not confined to triplets: some are irrelevant to relation extraction, and some have 1:n or n:1 relations rather than 1:1 relations. This difference makes extracting scientific chemical data chains a significantly harder task.

The first challenge is caused by corpus accessibility. Currently, most named entity extraction experiments and corpora are in the field of biomedicine or life science and focus on extracting chemical drugs. Moreover, some corpora may not be accessible, such as the PubMed corpus and the Sciborg corpora[18]. Considering the need to extract chemical bond energy automatically to promote the development of computational chemistry, and to address semantic problems and numerous unknown words, we create a new corpus of papers in the chemical bond field.

The second challenge is caused by the capability of state-of-the-art deep learning architectures. Deep learning methods usually require big data to train a good model; however, existing corpora for data chain extraction are both hard to obtain and small in scale. What's worse, most corpora focus on fields other than chemistry.
Considering this situation, we also use transfer learning to relieve this challenge by pre-training on a large out-of-domain corpus before training on an in-domain chemistry corpus.

The third challenge is caused by the aim of our project and the characteristics of our corpus. In our project, we extract not only the entities that participate in relations but also the irrelevant entities, to help researchers read and confirm the relations extracted by our system. Moreover, a relation involving multiple entities is more complex than a traditional triplet. For this reason, we construct our own tagging scheme to extract more extensive

…(Full text truncated)…
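
The third challenge in the introduction can be made concrete with a small sketch. The paper's actual tagging scheme falls in the truncated part of the text, so the following is not the authors' scheme. It assumes a plain BIO tagging over the seven entity types named in the abstract, plus a hypothetical `DataChain` container that, unlike a triplet, lets one compound carry several values of the same type (a 1:n relation). The label spellings, the function names `decode_bio` and `build_chain`, the naive grouping rule, and the example sentence are all illustrative assumptions.

```python
from dataclasses import dataclass, field

# The seven entity types listed in the abstract; the label spellings are assumptions.
ENTITY_TYPES = ["COMPOUND", "SOLVENT", "METHOD", "BOND", "REACTION", "PKA", "PKA_VALUE"]

# Plain BIO label set: a B-/I- pair per type, plus "O" for tokens outside any entity.
BIO_LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]


def decode_bio(tokens, tags):
    """Collect (entity_type, text) spans from parallel token and BIO tag lists."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "I" and etype == current_type:
            current_tokens.append(token)  # continue the open span
            continue
        if current_type is not None:  # close the span that was open
            spans.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):  # a stray I- tag is treated as a span start
            current_type, current_tokens = etype, [token]
        else:
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


@dataclass
class DataChain:
    """Hypothetical container: one compound linked to any number of other entities."""
    compound: str
    linked: dict = field(default_factory=dict)  # entity type -> list of texts


def build_chain(spans):
    """Naive grouping (an assumption): attach every other span to the first compound."""
    compounds = [text for etype, text in spans if etype == "COMPOUND"]
    if not compounds:
        return None
    chain = DataChain(compound=compounds[0])
    for etype, text in spans:
        if etype != "COMPOUND":
            chain.linked.setdefault(etype, []).append(text)
    return chain


# Illustrative sentence with hand-written tags, not taken from the paper's corpus.
tokens = ["The", "pKa", "of", "acetic", "acid", "is", "4.76", "in", "water",
          "and", "12.3", "in", "DMSO", "."]
tags = ["O", "B-PKA", "O", "B-COMPOUND", "I-COMPOUND", "O", "B-PKA_VALUE", "O",
        "B-SOLVENT", "O", "B-PKA_VALUE", "O", "B-SOLVENT", "O"]

chain = build_chain(decode_bio(tokens, tags))
print(chain)
```

Here a single compound ends up with two pKa values and two solvents, which a single {E1, relation, E2} triplet cannot express. A real system would also need to pair each value with its solvent, a step this sketch deliberately leaves out.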

Reference

This content is AI-processed based on ArXiv data.
