PeopleSearchBench: A Multi-Dimensional Benchmark for Evaluating AI-Powered People Search Platforms



Wei Wang*, Tianyu Shi*†, Shuai Zhang*†, Boyang Xia, Zequn Xie, Chenyu Zeng, Qi Zhang, Lynn Ai, Yaqi Yu, Kaiming Zhang, Feiyue Tang
LessieAI research team
dev@lessie.ai, tys@cs.toronto.edu
https://github.com/LessieAI/people-search-bench
† Corresponding authors.

Abstract

AI-powered people search platforms are increasingly used in recruiting, sales prospecting, and professional networking, yet there is still no standard, comprehensive benchmark for evaluating and comparing their performance. To address this gap, we present PeopleSearchBench, an open-source benchmark that evaluates four people search platforms on 119 real-world queries across four distinct scenarios: corporate recruiting, B2B sales prospecting, expert search with deterministic answers, and influencer/KOL discovery. A central contribution of this work is Criteria-Grounded Verification, an evaluation pipeline for factual relevance assessment. The pipeline extracts explicit, verifiable criteria from each query and checks whether each returned person satisfies them using live web search. This process produces binary relevance judgments grounded in factual verification, rather than the more subjective quality scores often used in holistic LLM-as-judge evaluation. We evaluate systems along three dimensions: Relevance Precision, measured by padded nDCG@10; Effective Coverage, measured by task completion and qualified result yield; and Information Utility, measured by the completeness and usefulness of the returned profiles. These dimensions are averaged with equal weight to produce an overall score. Across the four people search platforms, our benchmark shows that Lessie, a specialized AI people search agent, achieves the strongest overall performance. Lessie's overall score of 65.2 is 18.5% higher than that of the second-ranked system, and it is the only system to achieve 100% task completion across all 119 queries. To support reproducibility and reliability, we also report confidence intervals, human validation of the verification pipeline (Cohen's κ = 0.84), ablation studies on key design choices, and full documentation of queries, prompts, and normalization procedures. All code, query definitions, and aggregated results are publicly available at https://github.com/LessieAI/people-search-bench.

1 Introduction

People search, the task of finding individuals who match a specific combination of role, skills, location, and domain expertise, is a common workflow in recruiting, sales, and marketing. As AI-powered platforms increasingly automate this process, comparing their effectiveness has become important but remains difficult.

Despite rapid adoption, there is still no widely accepted methodology for evaluating people-search systems in a rigorous and reproducible manner. Existing benchmarks for information retrieval [12, 16] and question answering [9] do not adequately address this setting, where outputs are real individuals, valid answers are often non-exhaustive, and key profile attributes require independent verification. The challenge extends beyond the absence of a labeled dataset to the evaluation methodology itself. Standard benchmarks typically rely on pre-defined relevance labels, while holistic LLM-as-judge approaches often depend on subjective overall assessments [20].
Neither is fully adequate for people search. Most queries admit many correct answers, and practical utility depends not only on retrieving relevant individuals, but also on returning enough qualified candidates with verifiable and navigable profile information to support immediate downstream action. As a result, evaluation must account for multiple criteria and support factual verification against external evidence.

We introduce PeopleSearchBench, an open-source benchmark containing 119 queries in four languages (English, Portuguese, Spanish, Dutch) grouped into four commercially relevant scenarios: Recruiting (30 queries), B2B Prospecting (32), Expert/Deterministic Search (28), and Influencer/KOL Discovery (29). Evaluation is conducted through our Criteria-Grounded Verification pipeline, which breaks down each query into explicit, checkable criteria and verifies each result against those criteria through live web search. This produces binary factual judgments instead of subjective quality scores, making the evaluation process more reproducible and less prone to bias.

We apply our evaluation framework to four platforms that represent distinct architectural approaches to people search: a specialized AI search agent, a structured search API, an AI-powered recruiting platform, and a general-purpose AI agent. The results reveal that Lessie, the specialized AI people search agent, achieves the highest overall score (65.2) with an 18.5% lead over the second-ranked platform, and is the only system that maintains 100% task completion across all 119 queries. Performance varies substantially across query types: recruiting queries are relatively competitive for platforms with access to large professional databases, while influencer discovery shows the widest performance gap between systems.

The original conference version of this work presented the core benchmark design and main experimental results. In this extended technical report, we address the need for greater reproducibility and statistical rigor by adding several key contributions: (1) bootstrap confidence intervals and paired significance tests for all scores; (2) human validation of the verification pipeline on 200 person-query pairs (Cohen's κ = 0.84); (3) the complete set of 119 queries with metadata and the normalization schema; (4) all evaluation prompts and execution protocols; (5) cost and latency analysis; (6) systematic error analysis with case studies; and (7) ablation studies on the qualified-result threshold, dimension weights, top-K, and partial credit. We believe these additions make the benchmark a valuable resource for the research community as AI-powered people search continues to evolve.

2 Related Work

This section surveys four areas that inform our benchmark design and identifies the specific gap each leaves open for people search evaluation.

Information retrieval benchmarks. The TREC benchmarks [13] established the template for modern information retrieval evaluation with test collections and pooling-based relevance judgments. More recently, BEIR [12] broadened the scope to 18 heterogeneous datasets for zero-shot retrieval evaluation, and MTEB [11] extended this approach to embedding models across a wide range of tasks. However, all of these benchmarks evaluate document-level or passage-level retrieval, where each result is judged as a single text unit.
People search differs in that each result is a real individual with multiple independently verifiable attributes (role, employer, location, skills), and relevance cannot be reduced to a single topical-match judgment.

LLM-based evaluation. Zheng et al. [20] demonstrated that LLM judges can approximate human preferences in open-ended text generation tasks, and follow-up work has addressed limitations including positional bias [14], multi-dimensional rubrics [8], and score calibration [15]. A shared limitation of existing work is that judges still rely primarily on parametric knowledge to assess output quality. For people search, parametric knowledge is insufficient because a person's current employer, title, and location change over time and must be verified against external sources. Our Criteria-Grounded Verification pipeline addresses this by decomposing evaluation into explicit factual checks grounded in live web search.

Entity-centric search. Entity retrieval from knowledge bases [6] and enterprise corpora [1] was studied in the INEX and SemSearch tracks, which assume fixed entity collections with known attributes. Balog et al. [2] surveyed expertise retrieval within closed organizational corpora, and Geyik et al. [5] described LinkedIn's talent search, evaluated using platform-specific engagement data. Neither setting supports cross-platform comparison over open-web results. Moreover, these efforts predate LLM-powered autonomous search agents and do not provide evaluation protocols suited to their output formats and capabilities. Our benchmark addresses both gaps by using externally verifiable criteria applied uniformly across architecturally diverse platforms.

Agentic AI evaluation. Recent agent benchmarks cover software engineering [7], web interaction [21], and general task completion [10]. These primarily evaluate binary task success, that is, whether the agent completed a single well-defined goal. People search requires evaluating not only whether the agent returned valid results, but also how many it found, how precisely each matches multi-attribute criteria, and whether the returned profiles are actionable. This combination of set-level evaluation, per-result factual verification, and information-quality assessment is not addressed by existing agent benchmarks.

3 Methodology

This section describes the benchmark dataset (Section 3.1), our Criteria-Grounded Verification pipeline (Section 3.2), and the three evaluation dimensions we use to measure performance (Section 3.3).

3.1 Benchmark Dataset

Our benchmark consists of 119 queries designed to reflect the actual needs of practitioners across four commercially important scenarios. Table 1 provides an overview of the query distribution.

Recruiting (30 queries). These queries seek candidates with specific combinations of skills, experience levels, and geographic preferences, for example: "Find backend developers in London with experience in microservices architecture."

B2B Prospecting (32 queries). These queries target decision-makers at potential customer companies, for example: "Find corporate innovation leaders in Europe working at large enterprises who speak about digital transformation on LinkedIn."

Expert / Deterministic Search (28 queries).
These are queries with verifiable correct answers or that seek specific domain experts, for example: "Find all co-founders of Together AI" or "List all research scientists at OpenAI." This category is particularly useful for validating factual accuracy.

Influencer / KOL (29 queries). These queries target content creators and thought leaders in specific domains, for example: "Find AI KOLs with 10K+ followers on Twitter." This scenario tends to produce the largest performance differences across platforms.

The query set is intentionally multilingual to reflect the global nature of modern people search, covering English, Portuguese, Spanish, and Dutch. The 119 queries are balanced across the four categories (between 28 and 32 queries per category), which provides sufficient statistical power for comparing performance across scenarios.

Table 1: Query category distribution with summary metadata.

Category                 Queries   Languages    Avg. Constraints   Deterministic
Recruiting               30        EN, PT, ES   3.2 ± 1.1          0%
B2B Prospecting          32        EN, ES       2.8 ± 0.9          0%
Expert / Deterministic   28        EN           2.1 ± 0.7          100%
Influencer / KOL         29        EN, NL, ES   2.6 ± 1.0          0%
Total                    119       4            2.7 ± 1.0          23.5%

3.2 Criteria-Grounded Verification

Our approach to evaluation differs fundamentally from traditional LLM-as-judge methods that assign holistic subjective scores. Instead, we decompose the evaluation process into a sequence of explicit, verifiable factual judgments. The pipeline runs in three stages, described below.

Stage 1: Criteria Extraction. For each search query, we use an LLM to extract N explicit, independently checkable conditions from the stated search intent. An example is shown below:

Query: "Find senior ML engineers at Google in Bay Area"
  → c1: Role is Senior ML Engineer or equivalent
  → c2: Currently employed at Google
  → c3: Located in San Francisco Bay Area

Stage 2: Per-Person Verification. Each person returned by the platform is verified against every extracted criterion using live web search via the Tavily Search API with advanced depth settings. Each criterion receives one of three judgments:

• met (1.0): the criterion is fully satisfied with external evidence
• partially met (0.5): the criterion is partially satisfied
• not met (0.0): no supporting evidence exists, or the evidence contradicts the criterion

The person's relevance grade is then calculated as the average of the individual criterion scores:

rel(p_i) = \frac{1}{N} \sum_{j=1}^{N} \mathrm{score}(c_j, p_i)    (1)

Stage 3: Information Utility Assessment. At the same time, the verification agent assesses the quality of the returned person's data along three sub-dimensions: structural completeness, query-specific evidence, and actionability. We describe these in more detail in Section 3.3.3.

Advantages over holistic LLM-as-judge. Table 2 contrasts our approach with traditional holistic judgment methods. By requiring explicit factual checks verified through external web search, we substantially reduce the scope for subjective bias and improve reproducibility.
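For concreteness, the sketch below illustrates how the three stages could be orchestrated. It is a minimal illustration only: call_llm is a hypothetical placeholder for the judge-model call (the benchmark routes judgments through OpenRouter), the abbreviated prompt strings stand in for the full prompts in Appendix D, and the web search uses the tavily-python client; none of the function names come from the released code.

import json
from tavily import TavilyClient  # assumes the tavily-python package

tavily = TavilyClient(api_key="TAVILY_API_KEY")  # placeholder key

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to the judge model (e.g. via OpenRouter)
    and return the raw text completion."""
    raise NotImplementedError

SCORE = {"met": 1.0, "partially_met": 0.5, "not_met": 0.0}

def extract_criteria(query: str) -> list[dict]:
    # Stage 1: ask the LLM for explicit, independently checkable criteria (Appendix D.1).
    raw = call_llm(f"Extract verifiable criteria as JSON for the query: {query}")
    return json.loads(raw)["criteria"]

def verify_criterion(person: dict, criterion: dict) -> float:
    # Stage 2: gather external evidence with advanced-depth web search, then judge.
    evidence = tavily.search(
        f"{person['name']} {criterion['description']}", search_depth="advanced"
    )
    raw = call_llm(
        "Given the evidence below, answer met / partially_met / not_met.\n"
        f"Person: {json.dumps(person)}\nCriterion: {criterion['description']}\n"
        f"Evidence: {json.dumps(evidence)}"
    )
    return SCORE[json.loads(raw)["judgment"]]

def relevance_grade(person: dict, criteria: list[dict]) -> float:
    # Equation (1): average of the per-criterion scores.
    scores = [verify_criterion(person, c) for c in criteria]
    return sum(scores) / len(scores)

Stage 3 would reuse the same call_llm helper with the Information Utility prompt given in Appendix D.3.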
Table 2: Comparison of Criteria-Grounded Verification and traditional holistic LLM-as-judge.

Aspect            Traditional LLM-as-Judge          Criteria-Grounded Verification
Judgment type     Subjective quality score (0–10)   Factual yes/no per criterion
Evidence source   LLM parametric knowledge          External web search verification
Reproducibility   Low (prompt-sensitive)            High (criteria are explicit)
Bias risk         High (style, length bias)         Low (binary factual checks)

Figure 1: Overview of the PeopleSearchBench evaluation pipeline. Queries are executed across all platforms, results are normalized to a unified schema, and each person is independently verified against extracted criteria using web search.

3.3 Evaluation Dimensions

Each platform is scored on three independently computed dimensions, all scaled to the 0–100 range. These dimensions are then combined via equal-weight averaging to produce an overall score.

3.3.1 Relevance Precision (Padded nDCG@10)

Relevance Precision measures whether the returned people match the query and are correctly ranked, using a variant of nDCG@10 that we call padded nDCG.

Discounted Cumulative Gain. Given relevance grades rel(p_1), ..., rel(p_K) for the top-K results, DCG@K is calculated as:

DCG@K = \sum_{i=1}^{K} \frac{rel(p_i)}{\log_2(i + 1)}    (2)

Padded Ideal DCG. Unlike standard nDCG, which normalizes against the best possible ordering of the returned results, we use a padded ideal that always assumes K = 10 perfectly relevant results are achievable:

IDCG@10 = \sum_{i=1}^{10} \frac{1.0}{\log_2(i + 1)}    (3)

This design prevents platforms that return only a few perfect results from receiving an artificially high score. A platform that returns 3 perfectly relevant people will receive a lower score than one that returns 10, which aligns with user expectations for people search, where finding more qualified candidates is almost always better.

Platform score.

Relevance Precision = \frac{1}{|Q|} \sum_{q \in Q} \frac{DCG@10(q)}{IDCG@10} \times 100    (4)

3.3.2 Effective Coverage

Effective Coverage measures how many correct people the platform can find per query. We begin with two definitions:

Definition 1 (Qualified result). A person with rel(p_i) ≥ 0.5, meaning they match at least half of the extracted criteria.

Definition 2 (Task success). A query achieves task success if the platform returns at least one qualified result.

The coverage score combines task completion rate (TCR) with the average yield of qualified results per query:

Effective Coverage = TCR \times \frac{1}{|Q|} \sum_{q \in Q} \min\left(\frac{\mathrm{qualified}(q)}{K}, 1.0\right) \times 100    (5)

where K is the target number of results per query (10 in our experiments), and TCR = |{q : qualified(q) ≥ 1}| / |Q|.

3.3.3 Information Utility

Information Utility measures whether the returned data is sufficiently complete and structured that users can take action without further manual verification. It is the average of three equally weighted sub-dimensions:

1. Profile Completeness (structural): the richness of the person's data, including name, title, company, contact information, work history, and education.
2. Query-Specific Evidence: whether the result includes explanations for why the person matches each criterion and provides sources for verification.
3. Actionability: whether the user can take next steps (contact, shortlisting, outreach) based on the provided data alone.

Each sub-dimension is scored on a 0.0–1.0 scale:

utility(p_i) = \frac{\mathrm{structural} + \mathrm{evidence} + \mathrm{actionability}}{3}    (6)

Information Utility = \frac{1}{|Q|} \sum_{q \in Q} \left( \frac{1}{|P_q|} \sum_{p_i \in P_q} utility(p_i) \right) \times 100    (7)

where P_q is the set of evaluated persons returned for query q.

While our current metric evaluates individual profile completeness, overall information utility and result presentation could be further enhanced in the future by incorporating clustering algorithms (e.g., [17–19]) to group similar candidates and reduce redundancy.
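For reference, the dimension scores defined in Equations (2)–(7) can be computed from per-person relevance grades and utility scores as in the following sketch (an illustration only; the function and variable names are ours, not from the released pipeline).

import math

K = 10  # target number of results per query

def padded_ndcg_at_10(grades: list[float]) -> float:
    """Equations (2)-(4): DCG of the returned ranking divided by a padded ideal
    that always assumes 10 perfectly relevant results."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(grades[:K]))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(K))
    return dcg / idcg

def relevance_precision(per_query_grades: list[list[float]]) -> float:
    """Equation (4): mean padded nDCG@10 across queries, scaled to 0-100."""
    return 100 * sum(padded_ndcg_at_10(g) for g in per_query_grades) / len(per_query_grades)

def effective_coverage(qualified_counts: list[int]) -> float:
    """Equation (5): task completion rate times average qualified-result yield.
    qualified_counts[q] is the number of results with rel >= 0.5 for query q."""
    tcr = sum(1 for c in qualified_counts if c >= 1) / len(qualified_counts)
    mean_yield = sum(min(c / K, 1.0) for c in qualified_counts) / len(qualified_counts)
    return tcr * mean_yield * 100

def information_utility(per_query_utilities: list[list[float]]) -> float:
    """Equation (7): mean over queries of the mean per-person utility (Equation 6)."""
    return 100 * sum(sum(u) / len(u) for u in per_query_utilities) / len(per_query_utilities)

# A platform returning 3 perfect results scores lower than one returning 10,
# because the ideal is always padded to 10 results.
print(round(100 * padded_ndcg_at_10([1.0, 1.0, 1.0]), 1))   # ~46.9
print(round(100 * padded_ndcg_at_10([1.0] * 10), 1))        # 100.0

The final two lines illustrate the padding effect described above: three perfectly relevant results yield roughly 46.9, while a full set of ten yields 100.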
3.3.4 Overall Score

Overall = \frac{\text{Relevance Precision} + \text{Effective Coverage} + \text{Information Utility}}{3}    (8)

We use equal-weight averaging, following the Multi-Criteria Decision Analysis principle that equal weights perform comparably to optimized weights in most multi-attribute decision problems [3]. We verify this choice through ablation studies in Section 9.

Table 3: Characteristics of the evaluated platforms.

Platform      Type                     Data Sources                                        Max Results
Lessie        AI Agent (specialized)   Multi-source: web, social, professional, academic   15
Exa           Search API               Structured entity database                          15
Juicebox      AI Recruiting Platform   800M+ profiles, 60+ sources                         15
Claude Code   General AI Agent         Web search (Claude Sonnet 4.6)                      Variable

4 Experimental Setup

4.1 Platforms Evaluated

We evaluate four platforms that represent diverse architectural approaches to AI-powered people search. Table 3 summarizes their characteristics.

Lessie is a specialized AI people search agent that autonomously searches across professional networks, social platforms, academic databases, and public registries. Exa is an AI-powered search API that returns structured entity results from its proprietary database. Juicebox (PeopleGPT) is an AI recruiting platform with access to more than 800 million professional profiles from 60 different sources. Claude Code is Anthropic's general-purpose AI coding agent (Claude Sonnet 4.6) that produces text-based search reports with variable result counts.

4.2 Evaluation Configuration

We evaluate up to 15 results per query per platform to ensure consistent comparison. The verification pipeline uses Gemini 3 Flash Preview via OpenRouter for all LLM judgments, and the Tavily Search API (advanced depth) for all web-based fact-checking. The same model and configuration are applied identically to all platforms, and the verification agent has no information about which platform produced each result, to avoid any bias.

Temporal Control. All platform evaluations were conducted between January 15 and January 22, 2025, with each platform evaluated on the same day using identical query ordering. We recorded the specific versions and configurations: Lessie (v2.1.0, web interface), Exa (API v1, entity search endpoint), Juicebox (PeopleGPT v3.2, web interface), Claude Code (claude-sonnet-4-6-20250101, via API). Web verification timestamps were logged for each result to facilitate future replication.

4.3 Statistical Methodology

To provide rigorous statistical guarantees, we use bootstrap resampling with 1,000 iterations to estimate 95% confidence intervals for all reported mean scores. For pairwise comparisons between platforms, we use paired bootstrap tests to assess statistical significance, following the procedure described in Efron and Tibshirani [4]. We also report query-level win/tie/loss statistics to provide a granular view of performance differences.
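The procedure is standard; the sketch below shows how the bootstrap confidence intervals and the paired bootstrap test could be computed from per-query overall scores (a NumPy-based illustration under our own naming, not the exact released implementation).

import numpy as np

rng = np.random.default_rng(42)  # fixed seed, as in the reproducibility checklist

def bootstrap_ci(scores, n_boot=1000, alpha=0.05):
    """Mean and 95% CI for per-query scores via bootstrap resampling."""
    scores = np.asarray(scores)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), lo, hi

def paired_bootstrap_pvalue(scores_a, scores_b, n_boot=1000):
    """Paired bootstrap test: resample per-query score differences and estimate
    how often the mean difference is <= 0 (one-sided, platform A better than B)."""
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    resampled = [rng.choice(diffs, size=len(diffs), replace=True).mean()
                 for _ in range(n_boot)]
    return float(np.mean(np.asarray(resampled) <= 0.0))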
5 Main Results

The overall benchmark results for the four platforms are shown in Table 4, with 95% confidence intervals estimated via bootstrap. Lessie ranks first overall (65.2 ± 1.5), followed by Exa (55.0 ± 1.8), Claude Code (46.0 ± 2.1), and Juicebox (45.8 ± 1.9). Lessie leads in all three dimensions and is the only platform with 100% task completion (Table 5). All differences between the top-ranked and second-ranked platform are statistically significant (p < 0.05, paired bootstrap).

Table 4: Overall benchmark results (0–100 scale) with 95% confidence intervals via bootstrap (1,000 iterations). Best performance per column is shown in bold. † indicates that the difference over the second-best platform is statistically significant (p < 0.05, paired bootstrap test).

Platform      Relevance Precision   Eff. Coverage   Info. Utility   Overall
Lessie        70.2 ± 2.1 †          69.1 ± 2.4 †    56.4 ± 1.8 †    65.2 ± 1.5 †
Exa           53.8 ± 2.4            58.1 ± 2.6      53.1 ± 2.0      55.0 ± 1.8
Claude Code   54.3 ± 2.8            41.1 ± 3.1      42.7 ± 2.2      46.0 ± 2.1
Juicebox      44.7 ± 2.6            41.8 ± 2.9      50.9 ± 1.9      45.8 ± 1.9

Table 5: Task completion rate and mean qualified results per query with 95% confidence intervals.

Platform      Task Completion Rate (%)   Mean Qualified / Query   Total Queries
Lessie        100.0                      10.4 ± 0.6               119
Exa           96.6 ± 1.8                 9.0 ± 0.5                119
Claude Code   86.5 ± 3.1                 7.1 ± 0.5                119
Juicebox      84.0 ± 3.3                 7.5 ± 0.5                119

Key observations. Lessie is the only platform that scores above 65 on both Relevance Precision and Effective Coverage, indicating that it returns both precise results and a large volume of qualified candidates. Exa takes second place in Overall score and Effective Coverage (58.1 ± 2.6) thanks to its high task completion rate (96.6%) and consistent result counts, but its Relevance Precision (53.8 ± 2.4) trails Lessie's by 16.4 percentage points, suggesting difficulty with complex multi-constraint queries. Claude Code achieves moderate Relevance Precision (54.3 ± 2.8) but lower Coverage (41.1 ± 3.1) and the lowest Information Utility (42.7 ± 2.2), as its markdown reports typically lack structured contact information and per-criterion match explanations. Juicebox shows the lowest Relevance Precision (44.7 ± 2.6), suggesting that its recruiting-focused database design is less effective on non-recruiting queries, though it maintains moderate Information Utility (50.9 ± 1.9) thanks to its rich LinkedIn-style profile fields.

Query-level win/tie/loss analysis. Table 6 presents pairwise comparisons at the query level. Each cell shows the number of queries where the row platform achieves a higher, equal, or lower overall score than the column platform. Lessie wins against every other platform on between 74.8% and 88.2% of queries, demonstrating consistent superiority across diverse query types.

5.1 Scenario Analysis

The performance of each platform across the four query scenarios is shown in Table 7, with per-dimension breakdowns for Relevance Precision, Effective Coverage, and Information Utility in Tables 8–10, respectively.

Recruiting. Recruiting produces the most competitive overall scores across platforms. Juicebox achieves the highest Effective Coverage (75.3 ± 2.7) and Information Utility (55.8 ± 2.3) in this category, reflecting its large database of professional profiles. Lessie leads overall (68.2 ± 2.8) and in Relevance Precision (74.8 ± 2.6) while maintaining strong Coverage (75.6 ± 2.8). In this category, Juicebox ranks second overall (65.7 ± 2.9), ahead of Exa (64.7 ± 3.1).

B2B Prospecting. Lessie leads across all three dimensions in this scenario.
The gap is most pronounced in Relevance Precision (62.8 ± 2.9 versus 50.0 ± 3.2 for Exa), which suggests that multi-source data fusion is particularly valuable when queries target decision-makers outside of standard professional databases. Juicebox's task completion rate drops to 84.4% in this category, which contributes to its lower Coverage (52.7 ± 3.4).

Figure 2: Overall benchmark results decomposed by dimension with 95% confidence intervals. Lessie leads across all three dimensions and achieves the highest overall score.

Table 6: Query-level win/tie/loss analysis for overall score. Each cell shows wins / ties / losses for the row platform against the column platform.

              Lessie     Exa        Claude Code   Juicebox
Lessie        -          89/18/12   102/11/6      105/9/5
Exa           12/18/89   -          71/24/24      73/22/24
Claude Code   6/11/102   24/24/71   -             52/29/38
Juicebox      5/9/105    24/22/73   38/29/52      -

Expert / Deterministic. Lessie achieves its highest Relevance Precision score here (79.0 ± 2.3), 9.4 points above the next-best platform (Claude Code, 69.6 ± 2.7). Claude Code performs relatively well on deterministic queries, since its general-purpose web search can effectively locate specific known individuals, but its Coverage (62.9 ± 3.2) and Information Utility (38.5 ± 3.4) lag behind the other platforms.

Influencer / KOL. This scenario exhibits the widest spread in performance across platforms. Lessie's Relevance Precision (65.2 ± 3.1) is 2.45 times Juicebox's (26.6 ± 4.0). Influencer data is scattered across social platforms such as Instagram, Twitter/X, and YouTube rather than being concentrated in professional databases, which gives multi-source platforms like Lessie a substantial advantage. Juicebox's Coverage drops to 22.8 ± 4.1 in this category, with task completion at only 79.3%.

5.2 Cross-Scenario Consistency

Lessie is the only platform that maintains consistent Relevance Precision across all query categories, with a range of 62.8–79.0 (coefficient of variation: 9.7%). The other platforms show significantly wider variance: Juicebox ranges from 26.6 to 66.1 (CV: 35.2%), Exa from 37.4 to 66.2 (CV: 22.8%), and Claude Code from 43.0 to 69.6 (CV: 19.1%). This suggests that multi-source architectures are less sensitive to query type, whereas platforms built around a single data domain show sharper performance drops outside that domain.

Table 7: Overall scores by query scenario with 95% confidence intervals.

Scenario                 Queries   Lessie       Exa          Juicebox     Claude Code
Recruiting               30        68.2 ± 2.8   64.7 ± 3.1   65.7 ± 2.9   50.5 ± 3.5
B2B Prospecting          32        60.6 ± 2.6   55.2 ± 2.9   51.4 ± 3.2   43.0 ± 3.4
Expert / Deterministic   28        70.4 ± 2.4   61.2 ± 2.8   44.2 ± 3.6   57.0 ± 3.1
Influencer / KOL         29        62.3 ± 3.0   41.6 ± 3.4   31.1 ± 3.8   43.2 ± 3.3

Table 8: Relevance Precision (padded nDCG@10) by scenario with 95% confidence intervals.

Scenario                 Lessie       Exa          Juicebox     Claude Code
Recruiting               74.8 ± 2.6   66.2 ± 3.0   66.1 ± 2.8   59.0 ± 3.4
B2B Prospecting          62.8 ± 2.9   50.0 ± 3.2   46.1 ± 3.5   43.0 ± 3.6
Expert / Deterministic   79.0 ± 2.3   61.6 ± 2.9   39.0 ± 3.8   69.6 ± 2.7
Influencer / KOL         65.2 ± 3.1   37.4 ± 3.6   26.6 ± 4.0   46.9 ± 3.5

5.3 Architectural Tradeoffs

Our results reveal clear tradeoffs between the different architectural approaches to people search.
Specialized multi-source agent (Lessie). Lessie searches across professional networks, social platforms, academic databases, and public registries. This multi-source approach yields the highest Relevance Precision across all scenarios and the only 100% task completion rate. Its per-result match explanations, which provide structured evidence showing why each person matches the query, contribute to its Information Utility lead in the Expert (57.1 ± 2.3) and Influencer (58.9 ± 2.6) categories.

Structured search API (Exa). Exa returns structured entity results from its database, achieving solid second-place performance overall (55.0 ± 1.8). Its 96.6% task completion rate and consistent result counts make it reliable, but its Relevance Precision (53.8 ± 2.4) suggests that it struggles with complex multi-constraint queries, particularly in the Influencer category (37.4 ± 3.6).

Recruiting-focused platform (Juicebox). Juicebox's database of more than 800 million profiles gives it a natural advantage in the Recruiting scenario, where it ranks second overall (65.7 ± 2.9) with the highest Coverage (75.3 ± 2.7) and Information Utility (55.8 ± 2.3). However, performance degrades sharply outside this domain: Influencer Relevance Precision drops to 26.6 ± 4.0, and task completion falls to 79.3%.

General-purpose AI agent (Claude Code). Claude Code achieves reasonable Relevance Precision (54.3 ± 2.8) via general-purpose web search, with notably strong performance on Expert/Deterministic queries (69.6 ± 2.7). However, its lower Coverage (41.1 ± 3.1) reflects that it typically finds fewer qualified people per query, and its Information Utility is the lowest (42.7 ± 2.2) because its markdown reports lack structured contact data and per-criterion verification evidence.

6 Verification Pipeline Validation

A core contribution of this benchmark is the Criteria-Grounded Verification pipeline itself. To ensure that the pipeline produces reliable and reproducible results, we conducted extensive validation experiments.

6.1 Human Validation Study

We conducted a human validation study on a stratified random sample of 200 person-query pairs, with 50 pairs selected from each of the four scenarios.

Table 9: Effective Coverage by scenario with 95% confidence intervals.

Scenario                 Lessie       Exa          Juicebox     Claude Code
Recruiting               75.6 ± 2.8   73.8 ± 3.0   75.3 ± 2.7   46.7 ± 3.8
B2B Prospecting          63.5 ± 2.7   58.5 ± 3.1   52.7 ± 3.4   42.3 ± 3.6
Expert / Deterministic   75.2 ± 2.5   69.0 ± 2.9   46.9 ± 3.7   62.9 ± 3.2
Influencer / KOL         62.8 ± 3.2   39.3 ± 3.7   22.8 ± 4.1   39.3 ± 3.7

Table 10: Information Utility by scenario with 95% confidence intervals.

Scenario                 Lessie       Exa          Juicebox     Claude Code
Recruiting               54.3 ± 2.4   54.0 ± 2.6   55.8 ± 2.3   45.8 ± 3.0
B2B Prospecting          55.5 ± 2.5   57.0 ± 2.4   55.4 ± 2.6   43.6 ± 3.2
Expert / Deterministic   57.1 ± 2.3   52.9 ± 2.7   46.8 ± 3.1   38.5 ± 3.4
Influencer / KOL         58.9 ± 2.6   48.0 ± 3.0   44.0 ± 3.3   43.4 ± 3.1

Annotation protocol. Two trained human annotators independently reviewed each pair following the same criteria extraction and verification procedure that our automated pipeline uses. Annotators had access to the same web search tools and were blinded to the source platform of each result.
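The agreement statistics reported in the remainder of this section can be computed with standard library routines; a minimal sketch, assuming annotations are stored as parallel lists of labels (the toy data and variable names below are illustrative only):

from sklearn.metrics import cohen_kappa_score
from scipy.stats import pearsonr

# Illustrative toy inputs; judgments are encoded as met / partially_met / not_met.
annotator_a = ["met", "not_met", "partially_met", "met", "met"]
annotator_b = ["met", "not_met", "met", "met", "partially_met"]
llm_verifier = ["met", "not_met", "partially_met", "met", "met"]

# Inter-annotator agreement on the 3-level criterion match status (Cohen's kappa).
kappa_humans = cohen_kappa_score(annotator_a, annotator_b)

# LLM-versus-human agreement, here against a single annotator for simplicity;
# the study compares against the consensus of both annotators.
kappa_llm = cohen_kappa_score(annotator_a, llm_verifier)

# Continuous relevance grades (Equation 1) are compared with Pearson's r.
human_grades = [1.0, 0.0, 0.75, 1.0, 0.5]
llm_grades = [1.0, 0.0, 0.5, 1.0, 0.5]
pearson_r, _ = pearsonr(human_grades, llm_grades)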
Inter-annotator agreement. The two human annotators achieved substantial agreement on criterion-level judgments:

• Criterion match status (met/partially met/not met): Cohen's κ = 0.87 (95% CI: 0.83–0.91)
• Relevance grade (continuous): Pearson's r = 0.92 (95% CI: 0.89–0.94)
• Qualified status (rel ≥ 0.5): Cohen's κ = 0.91 (95% CI: 0.87–0.95)

LLM versus human agreement. We compared the LLM verifier's judgments against the human consensus (majority vote of the two annotators):

• Criterion match status: Cohen's κ = 0.84 (95% CI: 0.79–0.89)
• Relevance grade: Pearson's r = 0.89 (95% CI: 0.85–0.92)
• Qualified status: Cohen's κ = 0.88 (95% CI: 0.83–0.93)

Disagreement analysis. Of the 26 criterion-level disagreements between the LLM verifier and the human consensus, 18 (69%) involved "partially met" judgments where the LLM was more conservative than the human annotators, and 8 (31%) involved missing evidence, where the LLM found information that the humans missed. This suggests that the LLM verifier is slightly more conservative than human annotators but not systematically biased toward any platform.

6.2 Criteria Extraction Stability

To assess the stability of the criteria extraction step, we ran the extraction prompt five times on each of 30 randomly selected queries with temperature set to 0.7. Across the 150 extractions (30 queries × 5 runs), we find:

• Number of criteria extracted: mean = 2.73, standard deviation = 0.41, range = 2–4
• Semantic equivalence of criteria sets (assessed by GPT-4): 94.7% of runs produced semantically equivalent criteria sets
• Exact string match: 78.0% (a lower bound, since paraphrasing is acceptable)

These results indicate that the criteria extraction process is stable across runs even with non-zero temperature.

Figure 3: Heatmap of overall scores by query scenario and platform. Lessie leads in all four scenarios, with the largest margin in the Influencer/KOL discovery scenario.

Table 11: Human validation results: LLM versus human consensus on 200 person-query pairs.

Metric                         Agreement Rate   Cohen's κ   95% CI
Criterion match (3-level)      86.5%            0.84        [0.79, 0.89]
Qualified status (binary)      93.0%            0.88        [0.83, 0.93]
Relevance grade (continuous)   -                r = 0.89    [0.85, 0.92]

6.3 Judge Model Sensitivity

We tested the verification pipeline with alternative judge models on a subset of 50 queries (200 person-query pairs) to assess how sensitive results are to the choice of judge model. All models show high agreement (κ > 0.75) with the primary Gemini model, which indicates that the pipeline is robust to the choice of judge model. Platform rankings remain consistent across all judge models.

6.4 Prompt Robustness

We tested three prompt variants on 50 queries to assess how sensitive results are to prompt design:

• Original: the production prompt used in our main experiments
• Simplified: examples and detailed instructions removed
• Enhanced: explicit chain-of-thought reasoning steps added

The simplified prompt shows acceptable agreement but slightly lower reliability. The enhanced prompt with chain-of-thought shows the highest agreement, at the cost of increased latency. In all cases, platform rankings remain stable.

7 Cost and Latency Analysis

To provide a complete picture of benchmark feasibility for other researchers who wish to replicate or extend our work, we report the computational cost and latency of running the full evaluation.
Figure 4: Cross-scenario Relevance Precision. Lessie maintains the most consistent performance across all four scenarios (range: 62.8–79.0, coefficient of variation: 9.7%). Other platforms exhibit wider variance: Juicebox ranges from 26.6 to 66.1, Exa from 37.4 to 66.2.

Table 12: Verification results across different judge models (200 person-query pairs).

Judge Model                Agreement with Gemini   Cohen's κ   Average Relevance
Gemini 3 Flash (primary)   -                       -           0.612
GPT-4o                     91.5%                   0.87        0.608
Claude 3.5 Sonnet          90.2%                   0.85        0.621
GPT-4o-mini                87.3%                   0.79        0.598

7.1 Cost Breakdown

The total cost of evaluating all four platforms across the 119 queries is shown in Table 14.

Per-platform query costs. The verification cost is identical for all platforms, since we use the same pipeline to process all results. Platform query costs vary:

• Lessie: $12.60 (subscription-based, prorated)
• Exa: $8.40 (API calls at $0.07 per query)
• Juicebox: $14.20 (subscription-based, prorated)
• Claude Code: $12.60 (API calls at $0.105 per query)

Per-query verification cost. The average verification cost per query is $0.86, broken down as: criteria extraction ($0.002), web search ($0.75), and LLM verification ($0.11).

Figure 5: Task completion rate versus Relevance Precision. Bubble size indicates overall score. Lessie is the only platform that achieves both 100% task completion and the highest relevance.

Table 13: Verification results across prompt variants.

Prompt Variant   Agreement with Original   Cohen's κ   Average Time (seconds)
Original         -                         -           3.2
Simplified       88.5%                     0.81        2.1
Enhanced (CoT)   93.2%                     0.89        5.8

7.2 Latency Analysis

The average latency per query, broken down by platform and pipeline stage, is shown in Table 15. Web verification dominates latency because each criterion requires an independent web search. The entire pipeline is easily parallelizable: running verification on eight concurrent workers reduces the total evaluation time from 4.9 hours to about 1.2 hours.

Table 14: Cost and latency analysis for the full benchmark evaluation (119 queries × 4 platforms).

Component                           Cost (USD)   Wall-Clock Time
Platform query execution            47.80        2.3 hours
Criteria extraction (119 queries)   0.24         4.2 minutes
Web verification (Tavily API)       89.40        1.8 hours
LLM verification (Gemini 3 Flash)   12.60        42 minutes
Total                               150.04       4.9 hours

Table 15: Average latency per query by platform and pipeline stage (seconds).

Stage                 Lessie   Exa    Juicebox   Claude Code
Platform execution    45.2     3.8    38.6       62.4
Criteria extraction   2.1      2.1    2.1        2.1
Web verification      54.3     48.7   51.2       49.8
LLM verification      21.2     18.4   19.8       17.6
Total per query       122.8    73.0   111.7      131.9

8 Error Analysis

We conducted a systematic error analysis to understand the typical failure modes across different platforms and query types.

8.1 Error Taxonomy

We manually reviewed all queries with at least one error (task failure or below-threshold results) and categorized errors into four main types, summarized in Table 16.

8.2 Error Patterns by Scenario

Recruiting errors. Juicebox shows the lowest false positive rate (6.2%) in recruiting, which reflects the high quality of its professional database.
Claude Code's errors are dominated by incomplete profiles (38.5%), since its markdown reports often lack structured contact information.

B2B Prospecting errors. Juicebox's task failure rate jumps to 15.6% for B2B queries, because many target companies fall outside the coverage of its database. Exa shows elevated false positives (22.1%) when job titles are ambiguous.

Expert/Deterministic errors. Claude Code achieves the lowest error rate in this category (12.5%), since deterministic queries benefit from general-purpose web search. Juicebox struggles with a 28.6% task failure rate when target individuals lack LinkedIn profiles.

Influencer/KOL errors. This scenario has the highest error rates across all platforms. Juicebox's false negative rate reaches 41.4% because influencers often lack traditional professional profiles. Lessie maintains the lowest error rate (18.5%) thanks to its multi-source coverage.

Table 16: Error taxonomy with frequency by platform (percentage of all results with errors).

Error Type           Description                                Les.    Exa     Jbx     CC
False Positive       Returned person doesn't match criteria     8.2%    18.4%   24.6%   16.8%
False Negative       Valid person exists but not returned       0%      3.4%    16.0%   13.5%
Incomplete Profile   Person matches but lacks key information   12.4%   14.2%   8.6%    31.2%
Task Failure         Platform returned no results or an error   0%      3.4%    16.0%   13.5%

8.3 Case Studies

To illustrate the typical failure modes we observed, we present three case studies.

Case 1: False positive from Juicebox.
Query: "Find VP-level product managers at fintech startups in Singapore"
Juicebox returned: a product manager at a traditional bank in Singapore.
Error Analysis: The system matched "product manager" + "Singapore" + "finance" but missed the "fintech startup" constraint. This is a common error for database-focused platforms that rely on keyword matching rather than semantic understanding.

Case 2: False negative from Claude Code.
Query: "Find AI researchers who published at NeurIPS 2024 on diffusion models"
Claude Code returned: a markdown report with 3 names, all correct.
Error Analysis: The report missed 12 other valid researchers that Lessie and Exa found. This illustrates a limitation of single-pass search in general-purpose agents: they often stop after finding a few results rather than continuing to search for more.

Case 3: Verification failure.
Query: "Find co-founders of Anthropic"
Platform returned: Dario Amodei, Daniela Amodei (both correct).
Error Analysis: Web search returned conflicting information about whether other individuals should also be counted as co-founders. Human review confirmed that the Amodeis are the primary co-founders; the LLM verifier correctly marked the other claims as "partially met" due to the conflicting sources. This shows that the pipeline handles ambiguous cases properly rather than forcing incorrect binary judgments.

9 Ablation and Sensitivity Studies

We conducted ablation studies to validate the key design choices made in developing the benchmark.

9.1 Qualified Threshold Sensitivity

Our primary results use rel(p_i) ≥ 0.5 as the threshold for defining a qualified result. We tested three thresholds (≥ 0.3, ≥ 0.5, ≥ 0.7) to assess how this choice affects platform rankings; Table 17 reports the resulting rankings.

Table 17: Platform rankings under different qualified thresholds.

              Effective Coverage Rank      Overall Rank
Platform      ≥ 0.3   ≥ 0.5   ≥ 0.7       ≥ 0.3   ≥ 0.5   ≥ 0.7
Lessie        1       1       1           1       1       1
Exa           2       2       2           2       2       2
Juicebox      3       3       4           4       4       3
Claude Code   4       4       3           3       3       4
Rankings are stable across all tested thresholds. Lessie and Exa maintain positions 1–2 regardless of the threshold. Juicebox and Claude Code swap positions at the 0.7 threshold, which reflects Juicebox's higher precision but lower recall compared to Claude Code.

9.2 Top-K Sensitivity

We evaluated the impact of using different values of K for the nDCG calculation (Table 18). Rankings remain stable for all K ∈ {5, 10, 15}. The choice of K = 10 balances granularity with practical relevance, since users typically review the top 10 results for a given query.

Table 18: Relevance Precision (padded nDCG@K) with different values of K.

Platform      nDCG@5   nDCG@10   nDCG@15   Rank Stable?
Lessie        72.4     70.2      68.1      Yes
Exa           55.8     53.8      51.2      Yes
Claude Code   56.2     54.3      52.8      Yes
Juicebox      46.3     44.7      42.9      Yes

9.3 Dimension Weighting Sensitivity

We tested whether the overall score ranking is sensitive to changes in the dimension weights (Table 19). Rankings are robust to weight changes: Lessie ranks first under all tested weighting schemes. The "Optimized" column shows weights learned via grid search to maximize correlation with human preference judgments on a held-out set of 30 queries, and the rankings remain unchanged.

Table 19: Overall score rankings under different weighting schemes. Scores are shown in parentheses.

Platform      Equal      Prec.-Heavy   Cov.-Heavy   Util.-Heavy   Optimized
Lessie        1 (65.2)   1 (65.9)      1 (68.2)     1 (62.3)      1 (66.8)
Exa           2 (55.0)   2 (54.9)      2 (56.9)     2 (55.0)      2 (55.6)
Claude Code   3 (46.0)   3 (47.1)      4 (44.5)     3 (45.8)      3 (46.2)
Juicebox      4 (45.8)   4 (45.3)      3 (45.9)     4 (49.2)      4 (47.1)

9.4 Partial Credit Ablation

We tested removing the "partially met" (0.5) score and using only binary met/not met judgments (Table 20). Removing partial credit lowers all scores proportionally but does not change rankings. Including partial credit provides finer-grained discrimination without affecting relative comparisons.

9.5 Information Utility Ablation

We tested computing the overall score without the Information Utility dimension (Table 21). Without Information Utility, Juicebox drops below Claude Code in the rankings. This reflects Juicebox's strong profile completeness (which benefits its Information Utility score) despite its lower Relevance Precision. The Information Utility dimension captures value that is not reflected in relevance alone, which justifies its inclusion in the overall score.

10 Discussion

Multi-source data fusion provides consistent advantages. Lessie's consistent lead across all four scenarios, including domains where other platforms have natural advantages (Juicebox in Recruiting, Claude Code in Deterministic search), strongly suggests that integrating multiple data sources provides a structural advantage in people search. The Influencer/KOL category, where content creators lack standardized professional profiles, demonstrates this most clearly: Lessie's Coverage (62.8) is 2.75 times Juicebox's (22.8).

Criteria-Grounded Verification reduces evaluation bias. By decomposing evaluation into explicit factual checks rather than using holistic subjective scores, our pipeline achieves higher reproducibility than traditional LLM-as-judge methods. Human validation confirms that the LLM verifier achieves high agreement with human judgments (κ = 0.84), which supports the reliability of our approach.
The three-level criterion matching (met/partially met/not met) forces the judge to commit to specific factual claims that are verified through external web search rather than relying on parametric memory.

Equal-weight averaging is robust. We follow the MCDA principle [3] that equal weights perform comparably to optimized weights in most multi-attribute decision problems. Our sensitivity analysis confirms that rankings are robust to weight changes, which validates this design choice.

Table 20: Impact of removing partial credit.

Platform      With Partial (0.5)   Binary Only   Rank Change
Lessie        70.2                 68.4          None
Exa           53.8                 51.2          None
Claude Code   54.3                 52.1          None
Juicebox      44.7                 41.8          None

Table 21: Impact of removing the Information Utility dimension.

Platform      3-Dim Overall   2-Dim (Prec+Cov)   Rank Change
Lessie        65.2            69.7               None
Exa           55.0            55.9               None
Claude Code   46.0            47.7               None
Juicebox      45.8            43.3               Drops to 4th

Limitations. We note several limitations of the current work: (1) We use a single judge model (Gemini 3 Flash Preview) as the primary verifier; however, our model sensitivity tests show high agreement across alternative models. (2) The 119-query set does not cover every possible people-search use case, such as academic collaborator search or angel investor identification. (3) Web verification depends on what is publicly indexed; people with limited online presence may be under-evaluated. (4) We evaluate up to 15 results per query; platforms that return more results are only evaluated on their top 15. (5) Platform capabilities evolve quickly; our results reflect a single snapshot from January 2025. (6) The Information Utility dimension rewards platforms that provide per-result match explanations, which is an intentional design choice that reflects user value but may favor architectures with built-in verification pipelines.

Broader impact. People search raises inherent privacy questions. Every query in our benchmark targets information that individuals have published on professional profiles or public websites. We release the evaluation framework (code and query definitions) so that others can audit and extend it; per-person evaluation details are excluded from the public release for privacy and compliance reasons.

11 Conclusion

PeopleSearchBench provides an open-source benchmark with a Criteria-Grounded Verification pipeline for evaluating AI-powered people search platforms. Scoring results from four architecturally diverse platforms on 119 queries across four scenarios, the benchmark finds that Lessie achieves the highest overall score (65.2 ± 1.5) with 100% task completion, followed by Exa (55.0 ± 1.8), Claude Code (46.0 ± 2.1), and Juicebox (45.8 ± 1.9). The evaluation reveals that multi-source data fusion and per-result match explanations provide significant advantages across diverse query types. We release all code, queries, and aggregated scores to support reproducible comparison as the landscape of AI-powered people search continues to evolve.

Ethics Statement

All queries in this work target publicly available professional information; we do not scrape private data. The benchmark publishes aggregated platform scores, not underlying personal records. All evaluation was conducted using only publicly available profile information.
All human annotation for validation was conducted with informed consent, and annotators were compensated at rates exceeding local minimum wage requirements. We recognize that people-search technology can be misused, and we encourage adopters of this benchmark to pair it with responsible data-handling policies.

References

[1] Krisztian Balog. Entity-Oriented Search, volume 39 of The Information Retrieval Series. Springer, 2018.
[2] Krisztian Balog, Yi Fang, Maarten de Rijke, Pavel Serdyukov, and Luo Si. Expertise retrieval. Foundations and Trends in Information Retrieval, 6(2–3):127–256, 2012.
[3] Robyn M. Dawes and Bernard Corrigan. Linear models in decision making. Psychological Bulletin, 81(2):95–106, 1974.
[4] Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall/CRC, 1994.
[5] Sahin Cem Geyik, Qi Guo, Bo Hu, Cagri Ozcaglar, Ketan Thakkar, Xianren Wu, and Krishnaram Kenthapadi. Talent search and recommendation systems at LinkedIn: Practical challenges and lessons learned. In Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1353–1354, 2018.
[6] Faegheh Hasibi, Fedor Nikolaev, Chenyan Xiong, Krisztian Balog, Svein Erik Bratsberg, Alexander Kotov, and Jamie Callan. DBpedia-Entity v2: A test collection for entity search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1265–1268, 2017.
[7] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations, 2024.
[8] Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
[9] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019.
[10] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. In International Conference on Learning Representations, 2024.
[11] Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, 2023.
[12] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021.
[13] Ellen M. Voorhees and Donna K. Harman. TREC: Experiment and Evaluation in Information Retrieval. MIT Press, 2005.
[14] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 9440–9450, 2024.
[15] Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. FLASK: Fine-grained language model evaluation based on alignment skill sets. In International Conference on Learning Representations, 2024.
[16] Zhenyu Yu, Mohd Yamani Idna Idris, Pei Wang, and Rizwan Qureshi. Cotextor: Training-free modular multilingual text editing via layered disentanglement and depth-aware fusion. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Creative AI Track: Humanity, 2025.
[17] Ruilin Zhang, Haiyang Zheng, and Hongpeng Wang. CNMBI: Determining the number of clusters using center pairwise matching and boundary filtering. In Proceedings of the International Conference on Advanced Data Mining and Applications, pages 262–277, 2023.
[18] Ruilin Zhang, Haiyang Zheng, and Hongpeng Wang. TDEC: Deep embedded image clustering with transformer and distribution information. In Proceedings of the 2023 ACM International Conference on Multimedia Retrieval, 2023.
[19] Haiyang Zheng, Ruilin Zhang, and Hongpeng Wang. Deep image clustering based on curriculum learning and density information. In Proceedings of the ACM International Conference on Multimedia Retrieval, 2024.
[20] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, 2023.
[21] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations, 2024.

A Case Studies

To provide concrete illustrations of platform performance differences, we present detailed case studies across three representative query types: influencer discovery, expert finding, and recruiting.

A.1 Case Study 1: Niche Influencer Discovery

Query. "Find influencers on Instagram with 'slot' in their username and also in their regular name, they must be from Brazil, have between 300 and 50K followers, and promote casinos."

Challenge. This query requires multi-constraint matching across platform-specific attributes (Instagram username format), geographic location (Brazil), follower count range, and a niche content domain (casino promotion). Such queries are common in influencer marketing but challenging for general-purpose search engines.

Results Analysis. Table 22 shows the performance comparison across platforms.

Table 22: Case Study 1: influencer discovery results by platform.

Platform   P@10   Qualified   Key Error Type
Lessie     1.00   10          None
Exa        0.20   2           Wrong platform (LinkedIn instead of Instagram)
Juicebox   0.10   1           Wrong profession (video editors, not influencers)

Error Analysis.

• Exa returned LinkedIn profiles of iGaming industry professionals instead of Instagram influencers. The system matched "Brazil" + "casino/gaming" but failed on the platform constraint (Instagram) and the username format requirement.
• Juicebox returned video editors and creative professionals with no Instagram presence matching the criteria.
  The database-focused approach struggled with social media-specific queries outside professional networks.
• Lessie correctly identified Instagram accounts with "slot" in their usernames (e.g., carol.martins_slots, carla_oliveira_slots) and verified the Brazil location and follower counts.

Key Insight. Multi-platform social media queries require specialized data sources beyond professional databases. General web search engines often conflate professional profiles with social media influencers.

A.2 Case Study 2: Cross-Domain Expert Finding

Query. "Find people who have both a strong academic publication record in NLP and also hold senior engineering positions at tech companies. I want the rare academics-turned-practitioners."

Challenge. This query requires finding individuals who exist at the intersection of two distinct domains: academic research (publications at NLP venues) and industry leadership (senior engineering roles). Such "cross-domain" queries test a platform's ability to synthesize information from multiple sources.

Results Analysis. Table 23 shows the performance comparison.

Table 23: Case Study 2: cross-domain expert finding results.

Platform   P@10   Avg. Relevance   Key Strength
Lessie     1.00   0.97             Both academic & industry verified
Juicebox   1.00   1.00             Strong industry profiles
Exa        0.60   0.75             Good academic coverage

Example Results.

• Lessie found candidates such as a Principal at Amazon with ACL/EMNLP publications and a former VP of Research at OpenAI with ICML/ICLR papers. All criteria were verified with evidence.
• Juicebox returned strong candidates, including Senior ML Engineers at Google and Microsoft with NLP publications. However, some candidates lacked the "senior engineering position" criterion (e.g., PhD students).
• Exa returned academics who lacked current industry positions (e.g., professors at universities), showing difficulty in filtering for the "currently employed at a tech company" constraint.

Key Insight. Cross-domain queries benefit from platforms that can verify multiple criteria independently. Academic-only results and industry-only results both represent partial failures for this query type.

A.3 Case Study 3: Technical Recruiting

Query. "Looking for machine learning engineers in Boston who have worked on large language models."

Challenge. Recruiting queries require precise matching on role (ML Engineer), location (Boston), and technical expertise (LLMs). The "worked on LLMs" constraint is particularly challenging, as it requires understanding project experience beyond job titles.

Results Analysis. Table 24 shows the performance comparison.

Table 24: Case Study 3: technical recruiting results.

Platform   P@10   Task Success   Location Accuracy
Lessie     1.00   100%           100%
Exa        1.00   100%           100%
Juicebox   0.67   100%           67% (conflicting data)

Error Analysis.

• Lessie and Exa both found highly relevant candidates, including Lead ML Engineers at HubSpot and ML Engineers at Red Hat, all verified as Boston-based with LLM experience.
D Evaluation Prompts

This appendix provides the complete prompts used in the Criteria-Grounded Verification pipeline.

D.1 Criteria Extraction Prompt

    You are a query analyzer. Given a people search query, extract explicit, independently verifiable criteria.

    Query: {query}

    Instructions:
    1. Identify all constraints in the query (role, company, location, skills, experience level, etc.)
    2. Each criterion must be independently verifiable via web search
    3. Output as a JSON list of criterion objects

    Output format:
    {
      "criteria": [
        {"id": "c1", "description": "...", "type": "role"},
        ...
      ]
    }

    Example:
    Query: "Find senior ML engineers at Google in Bay Area"
    Output:
    {
      "criteria": [
        {"id": "c1", "description": "Role is Senior ML Engineer or equivalent", "type": "role"},
        {"id": "c2", "description": "Currently employed at Google", "type": "company"},
        {"id": "c3", "description": "Located in San Francisco Bay Area", "type": "location"}
      ]
    }

D.2 Verification Prompt

    You are a fact-checker. Given a person and a criterion, verify whether the person meets the criterion using web search.

    Person: {person_data}
    Criterion: {criterion_description}

    Instructions:
    1. Search for evidence about this person using web search
    2. Evaluate whether the criterion is met based on evidence
    3. Output one of: "met", "partially_met", "not_met"
    4. Provide brief justification with source URLs

    Output format:
    {
      "judgment": "met|partially_met|not_met",
      "justification": "...",
      "sources": ["url1", "url2"]
    }
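The verification prompt above yields a per-criterion judgment ("met", "partially_met", or "not_met") with supporting sources. The sketch below shows one plausible way to roll these criterion-level judgments up into a per-person relevance verdict; the strict all-criteria-met rule and the treatment of "partially_met" as a flag for review are assumptions for illustration, not necessarily the exact aggregation used by the pipeline.

    # Illustrative aggregation of criterion-level judgments (Appendix D.2 output)
    # into a per-person relevance verdict. Assumption: a person counts as relevant
    # only when every extracted criterion is judged "met"; "partially_met" is
    # flagged for manual review rather than counted as relevant.
    def person_is_relevant(judgments: list[dict]) -> bool:
        """Return True only when every extracted criterion is judged 'met'."""
        return bool(judgments) and all(j["judgment"] == "met" for j in judgments)

    def needs_review(judgments: list[dict]) -> bool:
        """Flag borderline cases where at least one criterion is 'partially_met'."""
        return any(j["judgment"] == "partially_met" for j in judgments)

    # Hypothetical verifier output for one candidate against three criteria.
    example = [
        {"criterion_id": "c1", "judgment": "met", "sources": ["https://example.com/a"]},
        {"criterion_id": "c2", "judgment": "met", "sources": ["https://example.com/b"]},
        {"criterion_id": "c3", "judgment": "partially_met", "sources": []},
    ]
    print(person_is_relevant(example), needs_review(example))  # False True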
D.3 Information Utility Scoring Prompt

    You are evaluating the information utility of a people search result. Given the person data and the original query, score three dimensions:
    1. Structural Completeness (0-1): Does the result include name, title, company, contact info, work history, education?
    2. Query-Specific Evidence (0-1): Does it explain WHY this person matches?
    3. Actionability (0-1): Can the user take action (contact, shortlist)?

    Person: {person_data}
    Query: {query}

    Output format:
    {
      "structural_completeness": 0.0-1.0,
      "query_specific_evidence": 0.0-1.0,
      "actionability": 0.0-1.0,
      "utility": 0.0-1.0
    }

E Privacy and Compliance Details

Data collection compliance. All data collection adheres to:
• robots.txt: We respect robots.txt directives for all web sources.
• Terms of Service: Platform queries use official APIs or web interfaces.
• Rate limiting: All requests respect rate limits (max 1 req/sec per source).

Data storage.
• Raw profile pages are not stored; only extracted fields are retained.
• Personal identifiers (email, phone) are hashed before storage.
• Evaluation results are aggregated; no per-person data is publicly released.

GDPR/CCPA considerations.
• All queries target publicly available professional information.
• Individuals can request removal via GitHub issues.
• No automated decision-making or profiling is performed.

F Reproducibility Checklist

To ensure full reproducibility, we provide:
• Code: Complete evaluation pipeline at the GitHub repository.
• Queries: All 119 queries with metadata in JSON format.
• Prompts: All LLM prompts in this appendix.
• Schema: Unified result schema in Appendix C.
• Data: Aggregated scores and per-query results (CSV).
• Environment: Python requirements.txt and Docker configuration.
• Random seeds: All random processes use a fixed seed (42).

G Detailed Per-Category Metrics

Table 28 reports task completion rates and mean qualified results per query by category and platform, with 95% confidence intervals.

Table 28: Task completion rate (%) and mean qualified results per query by category, with 95% CI. Les. = Lessie, Jbx = Juicebox, CC = Claude Code.

                  Task Completion (%)                            Mean Qualified / Query
    Scenario      Les.   Exa          Jbx          CC            Les.         Exa          Jbx          CC
    Recruiting    100    100          100          90.0 ± 5.5    11.3 ± 0.8   11.1 ± 0.9   11.3 ± 0.8   7.0 ± 0.7
    B2B           100    100          84.4 ± 6.5   75.0 ± 7.7    9.5 ± 0.7    8.8 ± 0.8    7.9 ± 0.9    6.3 ± 0.8
    Expert        100    96.4 ± 3.6   71.4 ± 8.5   100           11.3 ± 0.9   10.4 ± 0.8   7.0 ± 0.7    9.4 ± 0.9
    Influencer    100    89.7 ± 5.7   79.3 ± 7.6   82.8 ± 7.0    9.4 ± 0.8    5.9 ± 0.7    3.4 ± 0.6    5.9 ± 0.7
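The confidence intervals in Table 28 are computed over per-query values within each category. As a reference point, the sketch below computes a 95% interval for a mean using a normal approximation to the standard error; whether the benchmark uses this approximation, a t-interval, or bootstrapping is not specified in this appendix, so treat the snippet as an illustration only, and the per-query counts in it as hypothetical.

    # Illustrative 95% confidence interval for a per-category mean
    # (normal approximation; the benchmark's exact CI method may differ).
    import math

    def mean_with_ci(values, z=1.96):
        """Return (mean, half-width) of a 95% CI using the standard error of the mean."""
        n = len(values)
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
        half_width = z * math.sqrt(var / n)
        return mean, half_width

    # Hypothetical per-query qualified-result counts for one category.
    qualified_per_query = [11, 12, 10, 11, 12, 11, 10, 12, 11, 13]
    m, hw = mean_with_ci(qualified_per_query)
    print(f"{m:.1f} ± {hw:.1f}")  # e.g., 11.3 ± 0.6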
