Scraping and Clustering Techniques for the Characterization of Linkedin Profiles
The socialization of the web has taken on a new dimension since the emergence of the Online Social Network (OSN) concept. Because every Internet user becomes a potential content creator, a large amount of data must be managed. This paper explores the most popular professional OSN: LinkedIn. A scraping technique was implemented to collect around 5 million public profiles. Applying natural language processing (NLP) techniques to classify the educational background and to cluster the professional background of the collected profiles allowed us to provide some insights about this OSN’s users and to evaluate the relationships between educational degrees and professional careers.
💡 Research Summary
This paper presents a large‑scale empirical study of LinkedIn, the world’s leading professional social network, by collecting and analyzing roughly five million publicly available user profiles. The authors first built a custom web‑scraping pipeline that respects LinkedIn’s robots.txt and terms of service while employing techniques such as rotating IP addresses, request throttling, and user‑agent spoofing to avoid detection. The raw HTML pages are parsed with BeautifulSoup and regular expressions to extract structured fields: full name, current job title, employment dates, education level, field of study, and institution name. Missing values, which affect about 12 % of the records, are imputed using multiple imputation methods to preserve statistical validity.
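The robots.txt check and request throttling mentioned above can be sketched with Python's standard library alone. This is an illustration under stated assumptions, not the authors' pipeline: the robots.txt content, the `example-bot` agent name, the example.com URLs, and the `Throttle` class are all hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download the live file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


class Throttle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept


parser = build_parser(ROBOTS_TXT)
# Check whether the (hypothetical) agent may fetch each URL before requesting it.
allowed_public = parser.can_fetch("example-bot", "https://example.com/in/some-profile")
allowed_private = parser.can_fetch("example-bot", "https://example.com/private/data")
```

In a crawler loop, `Throttle.wait()` would be called before each request, and any URL for which `can_fetch` returns `False` would be skipped.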
In the preprocessing stage, the textual components (job descriptions, degree statements, and school names) undergo normalization (lower‑casing, removal of special characters, handling of mixed‑language tokens), tokenization, and morphological analysis. Korean text is processed with KoNLPy, while English and mixed segments are handled by spaCy; a custom stop‑word list retains domain‑specific terminology.
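The normalization steps described above (lower-casing, special-character removal, stop-word filtering that keeps domain terms) can be sketched as follows. The stop-word list and the `normalize` function are hypothetical, and the sketch uses whitespace tokenization only; it does not reproduce the morphological analysis that the summary attributes to KoNLPy and spaCy.

```python
import re
from typing import List

# Hypothetical stop-word list: generic function words only, so domain terms
# such as "senior" or "engineer" are deliberately retained.
STOP_WORDS = {"a", "an", "and", "at", "for", "in", "of", "the", "to", "with"}


def normalize(text: str) -> List[str]:
    """Lower-case, strip special characters, tokenize on whitespace, drop stop words."""
    lowered = text.lower()
    # Keep word characters (Unicode-aware, so Hangul survives), whitespace,
    # and '+' / '#' so that tokens like "c++" and "c#" are preserved.
    cleaned = re.sub(r"[^\w\s+#]", " ", lowered)
    tokens = cleaned.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


tokens_en = normalize("Senior Software Engineer at ACME, Inc.")
tokens_mixed = normalize("데이터 분석가 (Data Analyst)")
```

Because `\w` matches Unicode letters in Python 3 string patterns, mixed Korean and English input passes through the same function without a separate code path.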
The educational background is classified in two steps. First, a rule‑based keyword matcher combined with a logistic‑regression classifier determines the degree tier (Bachelor, Master, PhD, Other). Second, a fine‑tuned multilingual BERT model categorizes the field of study into eight broad domains (e.g., Engineering, Business, Humanities, Natural Sciences). Cross‑validation yields an overall accuracy of 94 % and an F1‑score of 0.92, indicating robust performance despite the noisy input.
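The rule-based half of the degree-tier step might look like the sketch below. The keyword patterns and the `degree_tier` function are hypothetical illustrations; the logistic-regression classifier and the BERT field-of-study model mentioned above are not reproduced here.

```python
import re

# Hypothetical keyword patterns, checked from the highest tier to the lowest
# so that a string mentioning several degrees maps to the most advanced one.
DEGREE_PATTERNS = [
    ("PhD", re.compile(r"\b(ph\.?\s?d|doctor(ate|al)?|d\.?phil)\b", re.I)),
    ("Master", re.compile(r"\b(master(s|'s)?|m\.?sc|m\.?s|m\.?a|mba|m\.?eng)\b", re.I)),
    ("Bachelor", re.compile(r"\b(bachelor(s|'s)?|b\.?sc|b\.?s|b\.?a|b\.?eng)\b", re.I)),
]


def degree_tier(degree_text: str) -> str:
    """Return the first matching tier, or "Other" when no keyword matches."""
    for tier, pattern in DEGREE_PATTERNS:
        if pattern.search(degree_text):
            return tier
    return "Other"
```

A matcher like this is cheap and transparent but brittle: "Doctor of Medicine", for example, would be labeled "PhD" by these patterns, which is one reason to pair such rules with a learned classifier.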
For professional background clustering, the authors vectorize the job‑description text using TF‑IDF, reduce dimensionality with Uniform Manifold Approximation and Projection (UMAP), and apply HDBSCAN, a density‑based algorithm that does not require a pre‑specified number of clusters. This process discovers twelve coherent clusters, which the authors label as “Data Science & AI”, “Consulting & Strategy”, “Software Development”, “Marketing & Advertising”, “Finance & Accounting”, “Human Resources & Recruiting”, “Legal & Regulatory”, “Education & Research”, “Healthcare”, “Manufacturing & Operations”, “Sales & Business Development”, and “Other”.
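To make the vectorization step concrete, here is a small pure-Python TF-IDF sketch with cosine similarity. It assumes whitespace tokenization and a smoothed IDF, $\ln\frac{1+N}{1+\mathrm{df}} + 1$; the function names and sample job titles are hypothetical, and the UMAP and HDBSCAN stages are not reproduced.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[str]) -> List[Dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical job titles used only for illustration.
vecs = tfidf([
    "machine learning engineer",
    "deep learning researcher",
    "corporate tax accountant",
])
```

Titles that share vocabulary end up with non-zero similarity, while titles with no terms in common score zero; a density-based clusterer then groups points that sit close together in the reduced space.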
Statistical relationship analysis between education and professional clusters is performed using contingency tables, chi‑square tests, and Cramér’s V. The results reveal a moderate association (Cramér’s V ≈ 0.31). Notably, individuals holding Master’s or PhD degrees are heavily concentrated in the Data Science & AI cluster (over 45 % of that group), whereas those with only a Bachelor’s degree are most prevalent in Marketing & Advertising (≈ 38 %). Field‑of‑study patterns are also evident: Computer Engineering graduates dominate Software Development (≈ 52 %), while Business graduates cluster in Consulting & Strategy (≈ 47 %).
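For reference, Cramér's V is computed from the chi-square statistic of an $r \times c$ contingency table with $n$ observations as $V = \sqrt{\chi^2 / \big(n \cdot \min(r-1,\, c-1)\big)}$. The sketch below implements this in plain Python; the function names and the example table are hypothetical and are not the paper's data.

```python
import math
from typing import List


def chi_square(table: List[List[float]]) -> float:
    """Pearson chi-square statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells in empty rows or columns
                stat += (observed - expected) ** 2 / expected
    return stat


def cramers_v(table: List[List[float]]) -> float:
    """Cramér's V: 0 means no association, 1 means perfect association."""
    n = sum(sum(row) for row in table)
    k = min(len(table) - 1, len(table[0]) - 1)
    return math.sqrt(chi_square(table) / (n * k))


# Made-up counts: rows are degree tiers, columns are job clusters.
example_table = [
    [120, 60, 20],
    [50, 90, 60],
    [10, 40, 150],
]
v = cramers_v(example_table)
```

Because V is normalized by both the sample size and the table dimensions, it stays comparable across tables of different shapes, unlike the raw chi-square statistic, which grows with $n$.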
The discussion interprets these findings in the context of talent acquisition, educational policy, and career transition dynamics. The concentration of advanced degrees in AI‑related roles underscores the growing demand for highly specialized talent, while the dispersion of bachelor‑level graduates across more traditional business functions suggests a broader entry‑level labor market. The authors acknowledge several limitations: reliance on publicly visible profiles introduces selection bias; scraping inevitably misses some profiles; and the NLP pipeline, though multilingual, struggles with heavily code‑mixed or non‑standard language.
Future work is outlined to address these gaps. The authors propose integrating privacy‑preserving techniques for accessing non‑public data, extending the analysis to multimodal signals (profile images, network graphs), and developing temporal models to capture career trajectories over time. By combining richer data sources with advanced machine‑learning methods, they aim to build a more nuanced map of the global professional ecosystem as reflected on LinkedIn.