Transfer Learning for Scientific Data Chain Extraction in Small Chemical Corpus with BERT-CRF Model

February 23, 2026

📝 Original Info

Title: Transfer Learning for Scientific Data Chain Extraction in Small Chemical Corpus with BERT-CRF Model
ArXiv ID: 1905.05615
Date: 2019-05-15
Authors: Na Pang, Li Qian, Weimin Lyu, Jin-Dong Yang

📝 Abstract

Computational chemistry has developed quickly in recent years, driven by rapid growth and breakthroughs in AI. Thanks to progress in natural language processing, researchers can extract more fine-grained knowledge from publications to support that development. Because existing work and corpora for chemical entity extraction are largely restricted to biomedicine and the life sciences rather than chemistry itself, we build a new corpus in the chemical bond field annotated with 7 types of entities: compound, solvent, method, bond, reaction, pKa and pKa value. This paper presents a novel BERT-CRF model that builds scientific chemical data chains by extracting these 7 chemical entities and their relations from publications. We also propose a joint model to extract the entities and relations simultaneously. Experimental results on our Chemical Special Corpus demonstrate that we achieve state-of-the-art and competitive NER performance.

💡 Deep Analysis

Deep Dive into Transfer Learning for Scientific Data Chain Extraction in Small Chemical Corpus with BERT-CRF Model.
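
To make the "BERT-CRF" idea concrete, here is a minimal sketch of what a linear-chain CRF layer contributes on top of per-token scores from an encoder such as BERT: instead of picking the best label for each token independently, it decodes the best label sequence with the Viterbi algorithm. The function name `viterbi_decode`, the three-label set, and all scores are assumptions invented for illustration; the paper's actual layer, label set, and learned parameters are not shown in this excerpt.

```python
from typing import Dict, List, Sequence, Tuple


def viterbi_decode(
    emissions: Sequence[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: Sequence[str],
) -> List[str]:
    """Return the highest-scoring label sequence under a linear-chain CRF.

    emissions[t][y]     : score of label y at token t (e.g. from an encoder)
    transitions[(a, b)] : score of moving from label a to label b
                          (pairs that are not listed score 0.0)
    """
    if not emissions:
        return []
    # best[y] = score of the best path that ends in label y at the current token
    best = {y: emissions[0][y] for y in labels}
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = {}, {}
        for y in labels:
            # Best previous label for reaching y at this position
            prev = max(labels, key=lambda p: best[p] + transitions.get((p, y), 0.0))
            new_best[y] = best[prev] + transitions.get((prev, y), 0.0) + scores[y]
            pointers[y] = prev
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label
    path = [max(labels, key=lambda y: best[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical labels and scores, invented for this example.
labels = ["O", "B-compound", "I-compound"]
emissions = [
    {"O": 2.0, "B-compound": 0.5, "I-compound": 0.0},
    {"O": 0.2, "B-compound": 1.0, "I-compound": 1.2},
    {"O": 0.1, "B-compound": 0.0, "I-compound": 1.5},
]
# Strongly penalize an entity continuation that follows an outside token.
transitions = {("O", "I-compound"): -10.0}

greedy = [max(labels, key=lambda y: scores[y]) for scores in emissions]
decoded = viterbi_decode(emissions, transitions, labels)
```

In this toy input the per-token argmax (`greedy`) starts an entity with an `I-` tag directly after `O`, which is not a valid BIO sequence, while the decoded path uses the transition penalty to open the entity with `B-compound` instead.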

📄 Full Content

Transfer Learning for Scientific Data Chain Extraction in Small Chemical Corpus with BERT-CRF Model

Na Pang (1,2), Li Qian (1,2), Weimin Lyu (3), Jin-Dong Yang (4)

1 National Science Library, Chinese Academy of Science, Beijing 100190, China
2 Department of Library, Information and Archives Management, University of Chinese Academy of Science, Beijing 100190, China
pangna@mail.las.ac.cn
3 City University of New York, New York, USA
4 Center of Basic Molecular Science (CBMS), Department of Chemistry, Tsinghua University, Beijing, 100084, China

Abstract. Computational chemistry has developed quickly in recent years, driven by rapid growth and breakthroughs in AI. Thanks to progress in natural language processing, researchers can extract more fine-grained knowledge from publications to support that development. Because existing work and corpora for chemical entity extraction are largely restricted to biomedicine and the life sciences rather than chemistry itself, we build a new corpus in the chemical bond field annotated with 7 types of entities: compound, solvent, method, bond, reaction, pKa and pKa value. This paper presents a novel BERT-CRF model that builds scientific chemical data chains by extracting these 7 chemical entities and their relations from publications. We also propose a joint model to extract the entities and relations simultaneously. Experimental results on our Chemical Special Corpus demonstrate that we achieve state-of-the-art and competitive NER performance.

Keywords: transfer learning · pre-training · fine-tuning · entity extraction · relation extraction · scientific data chain extraction · BERT-CRF

1 Introduction

Recently, AI has stimulated the application of chemistry in many fields, such as computational chemistry and synthetic chemistry. Several studies have highlighted the significance of AI's role in chemistry. Scientists have used deep neural networks and Monte Carlo tree search to plan chemical syntheses and discover more retrosynthetic routes in a short time [1]; proposed a machine learning method to perform chemical reactions and analysis faster than they could be performed manually and to predict the reactivity of possible reagent combinations [2]; and borrowed word2vec from NLP to create Atom2Vec, an unsupervised machine that predicts materials properties [3]. There is no doubt that AI is revolutionizing our understanding of chemistry.

In chemistry, especially in computational chemistry, the chemical bond energy (pKa) is essential, yet most values in scientific papers are extracted manually by experts, and no existing work attempts to extract pKa with NLP methods. In particular, we consider three challenges in scientific data chain extraction: (1) the existing corpora may not suit our task because they focus on general chemicals or drugs; (2) popular chemical NER systems use machine learning or deep learning methods, which require abundant training data; (3) unlike state-of-the-art methods that extract triplets {E1, relation, E2}, our entities are not confined to triplets: some are irrelevant to relation extraction, and some participate in 1:n or n:1 rather than 1:1 relations. These differences make extracting scientific chemical data chains a significantly tougher task.

The first challenge is caused by corpus accessibility. Currently, most named entity extraction experiments and corpora are in biomedicine or the life sciences and focus on extracting chemical drugs. The corpora may also not be accessible, for example the PubMed corpus and the Sciborg corpora [18]. Considering the need to extract chemical bond energies automatically to promote the development of computational chemistry, and to address semantic problems and numerous unknown words, we create a new corpus of papers from the chemical bond field.

The second challenge is caused by the capability of state-of-the-art deep learning architectures. Deep learning methods usually require large amounts of data to train a good model, but the existing corpora for data chain extraction are both hard to obtain and small in scale. Worse, most corpora focus on fields other than chemistry. Given this situation, we also use transfer learning to relieve the problem, pre-training on a large out-of-domain corpus before training on the chemistry-specific in-domain corpus.

The third challenge is caused by the aim of our project and the characteristics of our corpus. In our project, we extract not only the entities that participate in relations but also the irrelevant entities, to help researchers read and confirm the relations extracted by our system. Moreover, a relation involving multiple entities is more complex than a traditional triplet. For this reason, we construct our own tagging scheme to extract more extensive

…(Full text truncated)…
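
The third challenge in the introduction (entities that do not fit 1:1 triplets) can be illustrated with a small sketch. It assumes a plain BIO tagging scheme, which is not the authors' own scheme since that is cut off above, and it groups entities with a naive positional heuristic rather than the paper's joint model. The label name `pKa_value`, the helper names, and the example sentence are all hypothetical and do not come from the corpus.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


def bio_to_spans(tokens: Sequence[str], tags: Sequence[str]) -> List[Tuple[str, str]]:
    """Collect (entity_type, text) spans from BIO tags.

    An I- tag that does not continue the open entity is treated like O.
    """
    spans: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)  # continue the open entity
            continue
        if current_type is not None:  # close the open entity, if any
            spans.append((current_type, " ".join(current_tokens)))
        if tag.startswith("B-"):
            current_type, current_tokens = tag[2:], [token]
        else:
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


@dataclass
class DataChain:
    """One compound linked to several (pKa value, solvent) pairs: a 1:n relation."""
    compound: Optional[str]
    measurements: List[Tuple[str, str]] = field(default_factory=list)


def build_chain(spans: Sequence[Tuple[str, str]]) -> DataChain:
    """Naive heuristic (illustration only): pair each value with the next solvent."""
    compound = next((text for kind, text in spans if kind == "compound"), None)
    chain = DataChain(compound=compound)
    pending_value: Optional[str] = None
    for kind, text in spans:
        if kind == "pKa_value":
            pending_value = text
        elif kind == "solvent" and pending_value is not None:
            chain.measurements.append((pending_value, text))
            pending_value = None
    return chain


# Invented example sentence with hand-written tags.
tokens = ["The", "pKa", "of", "acetic", "acid", "is", "4.76", "in", "water",
          "and", "12.3", "in", "DMSO", "."]
tags = ["O", "B-pKa", "O", "B-compound", "I-compound", "O", "B-pKa_value", "O",
        "B-solvent", "O", "B-pKa_value", "O", "B-solvent", "O"]

spans = bio_to_spans(tokens, tags)
chain = build_chain(spans)
```

Here a single compound ends up linked to two value and solvent pairs, which a lone {E1, relation, E2} triplet cannot hold without repeating the compound or losing the association between each value and its solvent.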

