Towards Accurate Word Segmentation for Chinese Patents
Si Li1, Nianwen Xue2
1Beijing University of Posts and Telecommunications, Beijing, China.
2Brandeis University, Massachusetts, USA.
Email: lisi@bupt.edu.cn, xuen@brandeis.edu
A patent is a property right for an invention, granted by the government to the inventor. An invention is a solution to a specific technological problem, so patents often have a high concentration of scientific and technical terms that are rare in everyday language. Chinese word segmentation models trained on currently available everyday-language data sets perform poorly on patents because they cannot effectively recognize these scientific and technical terms. In this paper we describe a pragmatic approach to Chinese word segmentation on patents: we train a character-based semi-supervised sequence labeling model with features extracted from a manually segmented corpus of 142 patents, enhanced with information extracted from the Chinese TreeBank. Experiments show that our model reaches an F1 score of 95.08% on a held-out test set and 96.59% on the development set, compared with an F1 score of 91.48% on the development set for a model trained on the Chinese TreeBank alone. We also experimented with existing domain adaptation techniques; the results show that the amount of target-domain data and the selected features both affect the performance of these techniques.
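As a rough illustration of what feature extraction for a character-based tagger typically involves (the paper's exact feature templates are not given in this excerpt, so the five-character window of unigrams and bigrams below, in the style of Xue 2003, is an assumption), consider:

```python
# A hedged sketch of character-context features commonly used by
# character-based Chinese word segmenters. The feature names (c-2,
# c-1, c0, ...) and the unigram/bigram window are illustrative, not
# the paper's actual templates.

def char_features(chars, i):
    """Return feature strings for the character at position i."""
    # Pad both ends so the window is well-defined at the boundaries.
    pad = ["<S>", "<S>"] + list(chars) + ["</S>", "</S>"]
    j = i + 2  # index of the current character in the padded sequence
    feats = {
        "c-2": pad[j - 2], "c-1": pad[j - 1], "c0": pad[j],
        "c+1": pad[j + 1], "c+2": pad[j + 2],
        "c-1c0": pad[j - 1] + pad[j],   # bigram ending at c0
        "c0c+1": pad[j] + pad[j + 1],   # bigram starting at c0
    }
    return [f"{k}={v}" for k, v in feats.items()]

# Example: features for the first character of "中国人民".
# Left-context positions fall on the <S> padding symbols.
feats = char_features("中国人民", 0)
```

Features like these would then be fed to a discriminative sequence model, which scores candidate tag sequences over the whole sentence.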
Introduction
Patents are exclusive rights granted by a sovereign state to an inventor in exchange for detailed public disclosure of an invention. By analyzing large amounts of patent data, one can potentially gain insights into new technological trends for purposes of technology forecasting or competitor monitoring. With the large number of patent filings, it is increasingly hard for human analysts to manually examine the patents to identify technological trends, and there is a pressing need for Natural Language Processing techniques to automate the process. This article is concerned with processing Chinese patents with natural language techniques, which poses its own unique challenges. It is well known that Chinese text does not come with natural word delimiters, and the first step for many Chinese language processing tasks is word segmentation, the automatic determination of word boundaries in Chinese text. Tremendous progress has been made in this area over the last decade or so, due to the availability of large-scale human-segmented corpora coupled with better statistical modeling techniques. On the data side, there exist several large-scale human-annotated corpora based on established word segmentation standards, including the Chinese TreeBank (Xue et al. 2005), the Sinica Balanced Corpus (Chen et al. 1996), the PKU People's Daily Corpus (Duan et al. 2003), and the LIVAC balanced corpus (T'sou et al. 1997), developed in mainland China, Hong Kong, Taiwan and the United States. These corpora were used in a series of international Chinese word segmentation bake-offs (http://www.sighan.org/) that played a crucial role in advancing the state of the art in Chinese word segmentation. Another driver of the improvement in Chinese word segmentation accuracy is the evolution of statistical modeling techniques.
Dictionaries used to play a central role in early heuristics-based word segmentation techniques such as maximum match, where entries in a dictionary are matched against strings in an unsegmented input sentence (Chen and Liu 1992). Their role was affirmed in statistical finite-state models (Sproat et al. 1996), where dictionaries are used to build segmentation graphs for a sentence and statistics are then used to search for the best word segmentation path. Modern word segmentation systems have moved away from dictionary-based approaches in favor of character tagging approaches, where each character is assigned a label indicating its position within a word. This allows word segmentation to be modeled as a sequence labeling problem, and lends itself to advanced discriminative sequence modeling techniques such as Maximum Entropy Markov Models (Xue 2003) and Conditional Random Fields (Peng et al. 2004), which can take advantage of a large feature space. More recently, perceptron-based systems have also produced very competitive performance (Zhang and Clark 2007). With these better modeling techniques, state-of-the-art systems routinely report accuracy in the high 90s, with a few recent systems reporting F1 scores of over 98% (Sun 2011; Zeng et al. 2013b). Chinese word segmentation is far from a solved problem, however, and significant challenges remain. Advanced word segmentation systems perform very well in domains such as newswire, where there is a large amount of human-annotated training data. There is often a rapid degradation in performance when systems trained on one domain (let us
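The character-tagging formulation described above can be made concrete with a small sketch. The BMES label set (B = begin, M = middle, E = end of a multi-character word, S = single-character word) is one common choice for such taggers; this excerpt does not state the paper's exact tag set, and the tagger itself (e.g. a CRF) is assumed, so only the conversion between segmentations and tag sequences is shown:

```python
# Converting between a word segmentation and a per-character BMES tag
# sequence, the representation that lets segmentation be treated as
# sequence labeling.

def words_to_tags(words):
    """Convert a list of segmented words to per-character BMES tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def tags_to_words(chars, tags):
    """Recover words from characters and their predicted BMES tags."""
    words, buf = [], ""
    for ch, t in zip(chars, tags):
        buf += ch
        if t in ("E", "S"):   # a word ends after an E or S tag
            words.append(buf)
            buf = ""
    if buf:                    # tolerate a malformed final tag sequence
        words.append(buf)
    return words

# Example: "中国人民" segmented as 中国 / 人民 maps to B E B E.
tags = words_to_tags(["中国", "人民"])
```

A trained tagger predicts such a tag for every character; decoding the tags with `tags_to_words` then yields the segmented sentence.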