Semantic Similarity Strategies for Job Title Classification

February 23, 2026

Reading time: 5 minutes

📝 Abstract

Automatic and accurate classification of items enables numerous downstream applications in many domains. These applications range from faceted browsing of items to product recommendations and big data analytics. In the online recruitment domain, we refer to classifying job ads into pre-defined or custom occupation categories as job title classification. A large-scale job title classification system can power various downstream applications such as semantic search, job recommendations and labor market analytics. In this paper, we discuss experiments conducted to improve our in-house job title classification system. The classification component of the system is a two-stage coarse and fine level classifier cascade that classifies input text such as job titles and/or job ads to one of the thousands of job titles in our taxonomy. To improve classification accuracy and effectiveness, we experiment with semantic representation strategies such as averaged Word2Vec (W2V) vectors and document similarity measures such as Word Mover's Distance (WMD). Our initial results show an overall improvement in the accuracy of Carotene [1].

📄 Content

Yun Zhu, Faizan Javed, Ozgur Ozturk
Careerbuilder LLC., 5550-A Peachtree Pkwy, Norcross, GA 30092
{yun.zhu, faizan.javed, ozgur.ozturk}@careerbuilder.com

Keywords: Semantic Enrichment; Supervised Learning; Job Title Classification

INTRODUCTION

Many e-commerce and web properties need to automatically classify millions of items into thousands of categories with a high level of accuracy. Such large-scale item classification systems have many downstream applications such as product recommendations, faceted search, semantic search and big data analytics. In the online recruitment domain, classification of job ads can power applications such as labor market analytics, job recommendations, and semantic search. We refer to classifying job ads (text documents composed of title, description and requirements fields) into pre-defined or custom occupation categories as job title classification. For automatic job title classification, we developed a system called Carotene, which has: i) a taxonomy discovery component that leverages clustering techniques to discover job titles from data sets and create a custom job title taxonomy, and ii) a two-stage coarse and fine level classifier that uses an SVM-KNN cascade to classify input text to the most appropriate job titles in our custom taxonomy. More details on the system can be found in [1].

The focus of this paper is on improvements to the classification component of the system based on semantically rich document representations (e.g., enriching vectors) and similarity measures (e.g., WMD). These semantic enrichment strategies replace the bag of words (BOW) representation that is most commonly used in text classification with semantically related terms (or concepts) derived from resources that establish relations between entities and concepts. Semantic representations are more adept than BOW representations at handling synonyms, polysemous words and multi-word expressions. In this paper, we present experimental results for job title representation using Word2Vec, Doc2Vec and WMD.

The rest of the paper is arranged as follows. Section 2 discusses related work on semantic enrichment strategies for text classification as well as their application in various domains. Section 3 briefly describes the three methods we experiment with. Section 4 explains the use case and reports the performance results.
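To illustrate why an averaged word-vector representation can handle synonyms better than BOW, the following is a minimal sketch and not the Carotene implementation. The three-dimensional vectors in `TOY_VECTORS` are hand-made for illustration, and the helper names `average_vector`, `cosine` and `nearest_title` are hypothetical; a real system would load embeddings trained with Word2Vec on job-ad text.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for this example only.
TOY_VECTORS = {
    "software":   [0.90, 0.10, 0.00],
    "developer":  [0.80, 0.20, 0.10],
    "engineer":   [0.70, 0.30, 0.10],
    "programmer": [0.85, 0.15, 0.05],
    "registered": [0.00, 0.90, 0.20],
    "nurse":      [0.10, 0.95, 0.10],
}


def average_vector(title):
    """Average the vectors of the in-vocabulary tokens of a title.

    Returns None when no token of the title is in the vocabulary.
    """
    tokens = [t for t in title.lower().split() if t in TOY_VECTORS]
    if not tokens:
        return None
    dim = len(next(iter(TOY_VECTORS.values())))
    sums = [0.0] * dim
    for token in tokens:
        for i, value in enumerate(TOY_VECTORS[token]):
            sums[i] += value
    return [s / len(tokens) for s in sums]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_title(query, candidates):
    """Return the candidate title whose averaged vector is closest to the query."""
    q = average_vector(query)
    if q is None:
        return None
    best, best_score = None, -1.0
    for candidate in candidates:
        c = average_vector(candidate)
        if c is None:
            continue
        score = cosine(q, c)
        if score > best_score:
            best, best_score = candidate, score
    return best


taxonomy = ["Software Engineer", "Registered Nurse"]
print(nearest_title("programmer", taxonomy))
```

With these toy vectors, the query "programmer" shares no token with either taxonomy entry, so a BOW overlap would be zero for both, while the averaged vectors still place it closest to "Software Engineer".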

RELATED WORK

Enriching vectors and semantic kernels are the two most commonly used semantic enrichment techniques for text classification. In the enriching vectors approach [2], a document representation is enriched with some or all of the following: hypernyms, synonyms and related concepts. Semantic kernels [3] leverage a semantic proximity matrix to transform document representations into linearly separable semantic representations. Semantic kernels are usually used with SVMs and applied before classifier training. In [4], enriching vectors and semantic kernels were used to classify medical text. The authors observed that enriching vectors performed better than BOW-based approaches, while semantic kernels degraded performance by introducing noise into document representations. However, [5] details a semantic kernel approach that improved the performance of the SVM algorithm when the dimensionality of the input feature space is large and training data is scarce. For computing similarities between documents,

This content is AI-processed based on ArXiv data.
