PeopleSearchBench: A Multi-Dimensional Benchmark for Evaluating AI-Powered People Search Platforms



Wei Wang*, Tianyu Shi*†, Shuai Zhang*†, Boyang Xia, Zequn Xie, Chenyu Zeng, Qi Zhang, Lynn Ai, Yaqi Yu, Kaiming Zhang, Feiyue Tang
LessieAI research team
dev@lessie.ai, tys@cs.toronto.edu
https://github.com/LessieAI/people-search-bench
† Corresponding authors.

Abstract

AI-powered people search platforms are increasingly used in recruiting, sales prospecting, and professional networking, yet there is still no standard, comprehensive benchmark for evaluating and comparing their performance. To address this gap, we present PeopleSearchBench, an open-source benchmark that evaluates four people search platforms on 119 real-world queries across four distinct scenarios: corporate recruiting, B2B sales prospecting, expert search with deterministic answers, and influencer/KOL discovery. A central contribution of this work is Criteria-Grounded Verification, an evaluation pipeline for factual relevance assessment. The pipeline extracts explicit, verifiable criteria from each query and checks whether each returned person satisfies them using live web search. This process produces binary relevance judgments grounded in factual verification, rather than the more subjective quality scores often used in holistic LLM-as-judge evaluation. We evaluate systems along three dimensions: Relevance Precision, measured by padded nDCG@10; Effective Coverage, measured by task completion and qualified result yield; and Information Utility, measured by the completeness and usefulness of the returned profiles. These dimensions are averaged with equal weight to produce an overall score. Across the four people search platforms, our benchmark shows that Lessie, a specialized AI people search agent, achieves the strongest overall performance. Lessie's overall score of 65.2 is 18.5% higher than that of the second-ranked system, and it is the only system to achieve 100% task completion across all 119 queries. To support reproducibility and reliability, we also report confidence intervals, human validation of the verification pipeline (Cohen's κ = 0.84), ablation studies on key design choices, and full documentation of queries, prompts, and normalization procedures. All code, query definitions, and aggregated results are publicly available at https://github.com/LessieAI/people-search-bench.

1 Introduction

People search, the task of finding individuals who match a specific combination of role, skills, location, and domain expertise, is a common workflow in recruiting, sales, and marketing. As AI-powered platforms increasingly automate this process, comparing their effectiveness has become important but remains difficult.

Despite rapid adoption, there is still no widely accepted methodology for evaluating people-search systems in a rigorous and reproducible manner. Existing benchmarks for information retrieval [12, 16] and question answering [9] do not adequately address this setting, where outputs are real individuals, valid answers are often non-exhaustive, and key profile attributes require independent verification. The challenge extends beyond the absence of a labeled dataset to the evaluation methodology itself. Standard benchmarks typically rely on pre-defined relevance labels, while holistic LLM-as-judge approaches often depend on subjective overall assessments [20].
Neither is fully adequate for people search. Most queries admit many correct answers, and practical utility depends not only on retrieving relevant individuals, but also on returning enough qualified candidates with verifiable and navigable profile information to support immediate downstream action. As a result, evaluation must account for multiple criteria and support factual verification against external evidence.

We introduce PeopleSearchBench, an open-source benchmark containing 119 queries in four languages (English, Portuguese, Spanish, Dutch) grouped into four commercially relevant scenarios: Recruiting (30 queries), B2B Prospecting (32), Expert/Deterministic Search (28), and Influencer/KOL Discovery (29). Evaluation is conducted through our Criteria-Grounded Verification pipeline, which breaks down each query into explicit, checkable criteria and verifies each result against those criteria through live web search. This produces binary factual judgments instead of subjective quality scores, making the evaluation process more reproducible and less prone to bias.

We apply our evaluation framework to four platforms that represent distinct architectural approaches to people search: a specialized AI search agent, a structured search API, an AI-powered recruiting platform, and a general-purpose AI agent. The results reveal that Lessie, the specialized AI people search agent, achieves the highest overall score (65.2) with an 18.5% lead over the second-ranked platform, and is the only system that maintains 100% task completion across all 119 queries. Performance varies substantially across query types: recruiting queries are relatively competitive for platforms with access to large professional databases, while influencer discovery shows the widest performance gap between systems.

The original conference version of this work presented the core benchmark design and main experimental results. In this extended technical report, we address the need for greater reproducibility and statistical rigor by adding several key contributions: (1) bootstrap confidence intervals and paired significance tests for all scores; (2) human validation of the verification pipeline on 200 person-query pairs (Cohen's κ = 0.84); (3) the complete set of 119 queries with metadata and the normalization schema; (4) all evaluation prompts and execution protocols; (5) cost and latency analysis; (6) systematic error analysis with case studies; and (7) ablation studies on the qualified-result threshold, dimension weights, top-K, and partial credit. We believe these additions make the benchmark a valuable resource for the research community as AI-powered people search continues to evolve.

2 Related Work

This section surveys four areas that inform our benchmark design and identifies the specific gap each leaves open for people search evaluation.

Information retrieval benchmarks. The TREC benchmarks [13] established the template for modern information retrieval evaluation with test collections and pooling-based relevance judgments. More recently, BEIR [12] broadened the scope to 18 heterogeneous datasets for zero-shot retrieval evaluation, and MTEB [11] extended this approach to embedding models across a wide range of tasks. However, all of these benchmarks evaluate document-level or passage-level retrieval, where each result is judged as a single text unit.
People search differs in that each result is a real individual with multiple independently verifiable attributes (role, employer, location, skills), and relevance cannot be reduced to a single topical-match judgment.

LLM-based evaluation. Zheng et al. [20] demonstrated that LLM judges can approximate human preferences in open-ended text generation tasks, and follow-up work has addressed limitations including positional bias [14], multi-dimensional rubrics [8], and score calibration [15]. A shared limitation of existing work is that judges still rely primarily on parametric knowledge to assess output quality. For people search, parametric knowledge is insufficient because a person's current employer, title, and location change over time and must be verified against external sources. Our Criteria-Grounded Verification pipeline addresses this by decomposing evaluation into explicit factual checks grounded in live web search.

Entity-centric search. Entity retrieval from knowledge bases [6] and enterprise corpora [1] was studied in the INEX and SemSearch tracks, which assume fixed entity collections with known attributes. Balog et al. [2] surveyed expertise retrieval within closed organizational corpora, and Geyik et al. [5] described LinkedIn's talent search, evaluated using platform-specific engagement data. Neither setting supports cross-platform comparison over open-web results. Moreover, these efforts predate LLM-powered autonomous search agents and do not provide evaluation protocols suited to their output formats and capabilities. Our benchmark addresses both gaps by using externally verifiable criteria applied uniformly across architecturally diverse platforms.

Agentic AI evaluation. Recent agent benchmarks cover software engineering [7], web interaction [21], and general task completion [10]. These primarily evaluate binary task success, that is, whether the agent completed a single well-defined goal. People search requires evaluating not only whether the agent returned valid results, but also how many it found, how precisely each matches multi-attribute criteria, and whether the returned profiles are actionable. This combination of set-level evaluation, per-result factual verification, and information-quality assessment is not addressed by existing agent benchmarks.

3 Methodology

This section describes the benchmark dataset (Section 3.1), our Criteria-Grounded Verification pipeline (Section 3.2), and the three evaluation dimensions we use to measure performance (Section 3.3).

3.1 Benchmark Dataset

Our benchmark consists of 119 queries designed to reflect the actual needs of practitioners across four commercially important scenarios. Table 1 provides an overview of the query distribution.

Recruiting (30 queries). These queries seek candidates with specific combinations of skills, experience levels, and geographic preferences, for example: "Find backend developers in London with experience in microservices architecture."

B2B Prospecting (32 queries). These queries target decision-makers at potential customer companies, for example: "Find corporate innovation leaders in Europe working at large enterprises who speak about digital transformation on LinkedIn."

Expert / Deterministic Search (28 queries).
These are queries with verifiable correct answers or that seek specific domain experts, for example: "Find all co-founders of Together AI" or "List all research scientists at OpenAI." This category is particularly useful for validating factual accuracy.

Influencer / KOL (29 queries). These queries target content creators and thought leaders in specific domains, for example: "Find AI KOLs with 10K+ followers on Twitter." This scenario tends to produce the largest performance differences across platforms.

The query set is intentionally multilingual to reflect the global nature of modern people search, covering English, Portuguese, Spanish, and Dutch. The 119 queries are balanced across the four categories (between 28 and 32 queries per category), which provides sufficient statistical power for comparing performance across scenarios.

Table 1: Query category distribution with summary metadata.

Category                 Queries   Languages    Avg. Constraints   Deterministic
Recruiting               30        EN, PT, ES   3.2 ± 1.1          0%
B2B Prospecting          32        EN, ES       2.8 ± 0.9          0%
Expert / Deterministic   28        EN           2.1 ± 0.7          100%
Influencer / KOL         29        EN, NL, ES   2.6 ± 1.0          0%
Total                    119       4            2.7 ± 1.0          23.5%

3.2 Criteria-Grounded Verification

Our approach to evaluation differs fundamentally from traditional LLM-as-judge methods that assign holistic subjective scores. Instead, we decompose the evaluation process into a sequence of explicit, verifiable factual judgments. The pipeline runs in three stages, described below.

Stage 1: Criteria Extraction. For each search query, we use an LLM to extract N explicit, independently checkable conditions from the stated search intent. An example is shown below:

Query: "Find senior ML engineers at Google in Bay Area"
  → c1: Role is Senior ML Engineer or equivalent
  → c2: Currently employed at Google
  → c3: Located in San Francisco Bay Area

Stage 2: Per-Person Verification. Each person returned by the platform is verified against every extracted criterion using live web search via the Tavily Search API with advanced depth settings. Each criterion receives one of three judgments:

• met (1.0): the criterion is fully satisfied with external evidence
• partially met (0.5): the criterion is partially satisfied
• not met (0.0): no supporting evidence exists, or the evidence contradicts the criterion

The person's relevance grade is then calculated as the average of the individual criterion scores:

rel(p_i) = \frac{1}{N} \sum_{j=1}^{N} \mathrm{score}(c_j, p_i)    (1)

Stage 3: Information Utility Assessment. At the same time, the verification agent assesses the quality of the returned person's data along three sub-dimensions: structural completeness, query-specific evidence, and actionability. We describe these in more detail in Section 3.3.3.

Advantages over holistic LLM-as-judge. Table 2 contrasts our approach with traditional holistic judgment methods. By requiring explicit factual checks verified through external web search, we substantially reduce the scope for subjective bias and improve reproducibility.
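For concreteness, the sketch below illustrates how the three stages could be orchestrated. It is a minimal illustration only: call_llm is a hypothetical placeholder for the judge-model call (the benchmark routes judgments through OpenRouter), the abbreviated prompt strings stand in for the full prompts in Appendix D, and the web search uses the tavily-python client; none of the function names come from the released code.

import json
from tavily import TavilyClient  # assumes the tavily-python package

tavily = TavilyClient(api_key="TAVILY_API_KEY")  # placeholder key

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to the judge model (e.g. via OpenRouter)
    and return the raw text completion."""
    raise NotImplementedError

SCORE = {"met": 1.0, "partially_met": 0.5, "not_met": 0.0}

def extract_criteria(query: str) -> list[dict]:
    # Stage 1: ask the LLM for explicit, independently checkable criteria (Appendix D.1).
    raw = call_llm(f"Extract verifiable criteria as JSON for the query: {query}")
    return json.loads(raw)["criteria"]

def verify_criterion(person: dict, criterion: dict) -> float:
    # Stage 2: gather external evidence with advanced-depth web search, then judge.
    evidence = tavily.search(
        f"{person['name']} {criterion['description']}", search_depth="advanced"
    )
    raw = call_llm(
        "Given the evidence below, answer met / partially_met / not_met.\n"
        f"Person: {json.dumps(person)}\nCriterion: {criterion['description']}\n"
        f"Evidence: {json.dumps(evidence)}"
    )
    return SCORE[json.loads(raw)["judgment"]]

def relevance_grade(person: dict, criteria: list[dict]) -> float:
    # Equation (1): average of the per-criterion scores.
    scores = [verify_criterion(person, c) for c in criteria]
    return sum(scores) / len(scores)

Stage 3 would reuse the same call_llm helper with the Information Utility prompt given in Appendix D.3.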
Table 2: Comparison of Criteria-Grounded Verification and traditional holistic LLM-as-judge.

Aspect            Traditional LLM-as-Judge          Criteria-Grounded Verification
Judgment type     Subjective quality score (0–10)   Factual yes/no per criterion
Evidence source   LLM parametric knowledge          External web search verification
Reproducibility   Low (prompt-sensitive)            High (criteria are explicit)
Bias risk         High (style, length bias)         Low (binary factual checks)

Figure 1: Overview of the PeopleSearchBench evaluation pipeline. Queries are executed across all platforms, results are normalized to a unified schema, and each person is independently verified against extracted criteria using web search.

3.3 Evaluation Dimensions

Each platform is scored on three independently computed dimensions, all scaled to the 0–100 range. These dimensions are then combined via equal-weight averaging to produce an overall score.

3.3.1 Relevance Precision (Padded nDCG@10)

Relevance Precision measures whether the returned people match the query and are correctly ranked, using a variant of nDCG@10 that we call padded nDCG.

Discounted Cumulative Gain. Given relevance grades rel(p_1), ..., rel(p_K) for the top-K results, DCG@K is calculated as:

DCG@K = \sum_{i=1}^{K} \frac{rel(p_i)}{\log_2(i + 1)}    (2)

Padded Ideal DCG. Unlike standard nDCG, which normalizes against the best possible ordering of the returned results, we use a padded ideal that always assumes K = 10 perfectly relevant results are achievable:

IDCG@10 = \sum_{i=1}^{10} \frac{1.0}{\log_2(i + 1)}    (3)

This design prevents platforms that return only a few perfect results from receiving an artificially high score. A platform that returns 3 perfectly relevant people will receive a lower score than one that returns 10, which aligns with user expectations for people search, where finding more qualified candidates is almost always better.

Platform score.

Relevance Precision = \frac{1}{|Q|} \sum_{q \in Q} \frac{DCG@10(q)}{IDCG@10} \times 100    (4)

3.3.2 Effective Coverage

Effective Coverage measures how many correct people the platform can find per query. We begin with two definitions:

Definition 1 (Qualified result). A person with rel(p_i) ≥ 0.5, meaning they match at least half of the extracted criteria.

Definition 2 (Task success). A query achieves task success if the platform returns at least one qualified result.

The coverage score combines task completion rate (TCR) with the average yield of qualified results per query:

Effective Coverage = TCR \times \frac{1}{|Q|} \sum_{q \in Q} \min\left(\frac{\mathrm{qualified}(q)}{K}, 1.0\right) \times 100    (5)

where K is the target number of results per query (10 in our experiments), and TCR = |{q : qualified(q) ≥ 1}| / |Q|.

3.3.3 Information Utility

Information Utility measures whether the returned data is sufficiently complete and structured that users can take action without further manual verification. It is the average of three equally weighted sub-dimensions:

1. Profile Completeness (structural): the richness of the person's data, including name, title, company, contact information, work history, and education.
2. Query-Specific Evidence: whether the result includes explanations for why the person matches each criterion and provides sources for verification.
3. Actionability: whether the user can take next steps (contact, shortlisting, outreach) based on the provided data alone.

Each sub-dimension is scored on a 0.0–1.0 scale:

utility(p_i) = \frac{\mathrm{structural} + \mathrm{evidence} + \mathrm{actionability}}{3}    (6)

Information Utility = \frac{1}{|Q|} \sum_{q \in Q} \left( \frac{1}{|P_q|} \sum_{p_i \in P_q} utility(p_i) \right) \times 100    (7)

where P_q is the set of evaluated persons returned for query q.

While our current metric evaluates individual profile completeness, overall information utility and result presentation could be further enhanced in the future by incorporating clustering algorithms (e.g., [17–19]) to group similar candidates and reduce redundancy.
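For reference, the dimension scores defined in Equations (2)–(7) can be computed from per-person relevance grades and utility scores as in the following sketch (an illustration only; the function and variable names are ours, not from the released pipeline).

import math

K = 10  # target number of results per query

def padded_ndcg_at_10(grades: list[float]) -> float:
    """Equations (2)-(4): DCG of the returned ranking divided by a padded ideal
    that always assumes 10 perfectly relevant results."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(grades[:K]))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(K))
    return dcg / idcg

def relevance_precision(per_query_grades: list[list[float]]) -> float:
    """Equation (4): mean padded nDCG@10 across queries, scaled to 0-100."""
    return 100 * sum(padded_ndcg_at_10(g) for g in per_query_grades) / len(per_query_grades)

def effective_coverage(qualified_counts: list[int]) -> float:
    """Equation (5): task completion rate times average qualified-result yield.
    qualified_counts[q] is the number of results with rel >= 0.5 for query q."""
    tcr = sum(1 for c in qualified_counts if c >= 1) / len(qualified_counts)
    mean_yield = sum(min(c / K, 1.0) for c in qualified_counts) / len(qualified_counts)
    return tcr * mean_yield * 100

def information_utility(per_query_utilities: list[list[float]]) -> float:
    """Equation (7): mean over queries of the mean per-person utility (Equation 6)."""
    return 100 * sum(sum(u) / len(u) for u in per_query_utilities) / len(per_query_utilities)

# A platform returning 3 perfect results scores lower than one returning 10,
# because the ideal is always padded to 10 results.
print(round(100 * padded_ndcg_at_10([1.0, 1.0, 1.0]), 1))   # ~46.9
print(round(100 * padded_ndcg_at_10([1.0] * 10), 1))        # 100.0

The final two lines illustrate the padding effect described above: three perfectly relevant results yield roughly 46.9, while a full set of ten yields 100.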
3.3.4 Overall Score

Overall = \frac{\text{Relevance Precision} + \text{Effective Coverage} + \text{Information Utility}}{3}    (8)

We use equal-weight averaging, following the Multi-Criteria Decision Analysis principle that equal weights perform comparably to optimized weights in most multi-attribute decision problems [3]. We verify this choice through ablation studies in Section 9.

Table 3: Characteristics of the evaluated platforms.

Platform      Type                     Data Sources                                        Max Results
Lessie        AI Agent (specialized)   Multi-source: web, social, professional, academic   15
Exa           Search API               Structured entity database                          15
Juicebox      AI Recruiting Platform   800M+ profiles, 60+ sources                         15
Claude Code   General AI Agent         Web search (Claude Sonnet 4.6)                      Variable

4 Experimental Setup

4.1 Platforms Evaluated

We evaluate four platforms that represent diverse architectural approaches to AI-powered people search. Table 3 summarizes their characteristics.

Lessie is a specialized AI people search agent that autonomously searches across professional networks, social platforms, academic databases, and public registries. Exa is an AI-powered search API that returns structured entity results from its proprietary database. Juicebox (PeopleGPT) is an AI recruiting platform with access to more than 800 million professional profiles from 60 different sources. Claude Code is Anthropic's general-purpose AI coding agent (Claude Sonnet 4.6) that produces text-based search reports with variable result counts.

4.2 Evaluation Configuration

We evaluate up to 15 results per query per platform to ensure consistent comparison. The verification pipeline uses Gemini 3 Flash Preview via OpenRouter for all LLM judgments, and the Tavily Search API (advanced depth) for all web-based fact-checking. The same model and configuration are applied identically to all platforms, and the verification agent has no information about which platform produced each result, to avoid any bias.

Temporal Control. All platform evaluations were conducted between January 15 and January 22, 2025, with each platform evaluated on the same day using identical query ordering. We recorded the specific versions and configurations: Lessie (v2.1.0, web interface), Exa (API v1, entity search endpoint), Juicebox (PeopleGPT v3.2, web interface), Claude Code (claude-sonnet-4-6-20250101, via API). Web verification timestamps were logged for each result to facilitate future replication.

4.3 Statistical Methodology

To provide rigorous statistical guarantees, we use bootstrap resampling with 1,000 iterations to estimate 95% confidence intervals for all reported mean scores. For pairwise comparisons between platforms, we use paired bootstrap tests to assess statistical significance, following the procedure described in Efron and Tibshirani [4]. We also report query-level win/tie/loss statistics to provide a granular view of performance differences.
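The procedure is standard; the sketch below shows how the bootstrap confidence intervals and the paired bootstrap test could be computed from per-query overall scores (a NumPy-based illustration under our own naming, not the exact released implementation).

import numpy as np

rng = np.random.default_rng(42)  # fixed seed, as in the reproducibility checklist

def bootstrap_ci(scores, n_boot=1000, alpha=0.05):
    """Mean and 95% CI for per-query scores via bootstrap resampling."""
    scores = np.asarray(scores)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), lo, hi

def paired_bootstrap_pvalue(scores_a, scores_b, n_boot=1000):
    """Paired bootstrap test: resample per-query score differences and estimate
    how often the mean difference is <= 0 (one-sided, platform A better than B)."""
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    resampled = [rng.choice(diffs, size=len(diffs), replace=True).mean()
                 for _ in range(n_boot)]
    return float(np.mean(np.asarray(resampled) <= 0.0))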
5 Main Results

The overall benchmark results for the four platforms are shown in Table 4, with 95% confidence intervals estimated via bootstrap. Lessie ranks first overall (65.2 ± 1.5), followed by Exa (55.0 ± 1.8), Claude Code (46.0 ± 2.1), and Juicebox (45.8 ± 1.9). Lessie leads in all three dimensions and is the only platform with 100% task completion (Table 5). All differences between the top-ranked and second-ranked platform are statistically significant (p < 0.05, paired bootstrap).

Table 4: Overall benchmark results (0–100 scale) with 95% confidence intervals via bootstrap (1,000 iterations). Best performance per column is shown in bold. † indicates that the difference over the second-best platform is statistically significant (p < 0.05, paired bootstrap test).

Platform      Relevance Precision   Eff. Coverage   Info. Utility   Overall
Lessie        70.2 ± 2.1 †          69.1 ± 2.4 †    56.4 ± 1.8 †    65.2 ± 1.5 †
Exa           53.8 ± 2.4            58.1 ± 2.6      53.1 ± 2.0      55.0 ± 1.8
Claude Code   54.3 ± 2.8            41.1 ± 3.1      42.7 ± 2.2      46.0 ± 2.1
Juicebox      44.7 ± 2.6            41.8 ± 2.9      50.9 ± 1.9      45.8 ± 1.9

Table 5: Task completion rate and mean qualified results per query with 95% confidence intervals.

Platform      Task Completion Rate (%)   Mean Qualified / Query   Total Queries
Lessie        100.0                      10.4 ± 0.6               119
Exa           96.6 ± 1.8                 9.0 ± 0.5                119
Claude Code   86.5 ± 3.1                 7.1 ± 0.5                119
Juicebox      84.0 ± 3.3                 7.5 ± 0.5                119

Key observations. Lessie is the only platform that scores above 65 on both Relevance Precision and Effective Coverage, indicating that it returns both precise results and a large volume of qualified candidates. Exa takes second place in Overall score and Effective Coverage (58.1 ± 2.6) thanks to its high task completion rate (96.6%) and consistent result counts, but its Relevance Precision (53.8 ± 2.4) trails Lessie's by 16.4 percentage points, suggesting difficulty with complex multi-constraint queries. Claude Code achieves moderate Relevance Precision (54.3 ± 2.8) but lower Coverage (41.1 ± 3.1) and the lowest Information Utility (42.7 ± 2.2), as its markdown reports typically lack structured contact information and per-criterion match explanations. Juicebox shows the lowest Relevance Precision (44.7 ± 2.6), suggesting that its recruiting-focused database design is less effective on non-recruiting queries, though it maintains moderate Information Utility (50.9 ± 1.9) thanks to its rich LinkedIn-style profile fields.

Query-level win/tie/loss analysis. Table 6 presents pairwise comparisons at the query level. Each cell shows the number of queries where the row platform achieves a higher, equal, or lower overall score than the column platform. Lessie wins against every other platform on between 74.8% and 88.2% of queries, demonstrating consistent superiority across diverse query types.

5.1 Scenario Analysis

The performance of each platform across the four query scenarios is shown in Table 7, with per-dimension breakdowns for Relevance Precision, Effective Coverage, and Information Utility in Tables 8–10, respectively.

Recruiting. Recruiting produces the most competitive overall scores across platforms. Juicebox achieves the highest Effective Coverage (75.3 ± 2.7) and Information Utility (55.8 ± 2.3) in this category, reflecting its large database of professional profiles. Lessie leads overall (68.2 ± 2.8) and in Relevance Precision (74.8 ± 2.6) while maintaining strong Coverage (75.6 ± 2.8). In this category, Juicebox ranks second overall (65.7 ± 2.9), ahead of Exa (64.7 ± 3.1).

B2B Prospecting. Lessie leads across all three dimensions in this scenario.
The gap is most pronounced in Relevance Precision (62.8 ± 2.9 versus 50.0 ± 3.2 for Exa), which suggests that multi-source data fusion is particularly valuable when queries target decision-makers outside of standard professional databases. Juicebox's task completion rate drops to 84.4% in this category, which contributes to its lower Coverage (52.7 ± 3.4).

Figure 2: Overall benchmark results decomposed by dimension with 95% confidence intervals. Lessie leads across all three dimensions and achieves the highest overall score.

Table 6: Query-level win/tie/loss analysis for overall score. Each cell shows wins / ties / losses for the row platform against the column platform.

              Lessie     Exa        Claude Code   Juicebox
Lessie        -          89/18/12   102/11/6      105/9/5
Exa           12/18/89   -          71/24/24      73/22/24
Claude Code   6/11/102   24/24/71   -             52/29/38
Juicebox      5/9/105    24/22/73   38/29/52      -

Expert / Deterministic. Lessie achieves its highest Relevance Precision score here (79.0 ± 2.3), 9.4 points above the next-best platform (Claude Code, 69.6 ± 2.7). Claude Code performs relatively well on deterministic queries, since its general-purpose web search can effectively locate specific known individuals, but its Coverage (62.9 ± 3.2) and Information Utility (38.5 ± 3.4) lag behind the other platforms.

Influencer / KOL. This scenario exhibits the widest spread in performance across platforms. Lessie's Relevance Precision (65.2 ± 3.1) is 2.45 times Juicebox's (26.6 ± 4.0). Influencer data is scattered across social platforms such as Instagram, Twitter/X, and YouTube rather than being concentrated in professional databases, which gives multi-source platforms like Lessie a substantial advantage. Juicebox's Coverage drops to 22.8 ± 4.1 in this category, with task completion at only 79.3%.

5.2 Cross-Scenario Consistency

Lessie is the only platform that maintains consistent Relevance Precision across all query categories, with a range of 62.8–79.0 (coefficient of variation: 9.7%). The other platforms show significantly wider variance: Juicebox ranges from 26.6 to 66.1 (CV: 35.2%), Exa from 37.4 to 66.2 (CV: 22.8%), and Claude Code from 43.0 to 69.6 (CV: 19.1%). This suggests that multi-source architectures are less sensitive to query type, whereas platforms built around a single data domain show sharper performance drops outside that domain.

Table 7: Overall scores by query scenario with 95% confidence intervals.

Scenario                 Queries   Lessie       Exa          Juicebox     Claude Code
Recruiting               30        68.2 ± 2.8   64.7 ± 3.1   65.7 ± 2.9   50.5 ± 3.5
B2B Prospecting          32        60.6 ± 2.6   55.2 ± 2.9   51.4 ± 3.2   43.0 ± 3.4
Expert / Deterministic   28        70.4 ± 2.4   61.2 ± 2.8   44.2 ± 3.6   57.0 ± 3.1
Influencer / KOL         29        62.3 ± 3.0   41.6 ± 3.4   31.1 ± 3.8   43.2 ± 3.3

Table 8: Relevance Precision (padded nDCG@10) by scenario with 95% confidence intervals.

Scenario                 Lessie       Exa          Juicebox     Claude Code
Recruiting               74.8 ± 2.6   66.2 ± 3.0   66.1 ± 2.8   59.0 ± 3.4
B2B Prospecting          62.8 ± 2.9   50.0 ± 3.2   46.1 ± 3.5   43.0 ± 3.6
Expert / Deterministic   79.0 ± 2.3   61.6 ± 2.9   39.0 ± 3.8   69.6 ± 2.7
Influencer / KOL         65.2 ± 3.1   37.4 ± 3.6   26.6 ± 4.0   46.9 ± 3.5

5.3 Architectural Tradeoffs

Our results reveal clear tradeoffs between the different architectural approaches to people search.
Specialized multi-source agent (Lessie). Lessie searches across professional networks, social platforms, academic databases, and public registries. This multi-source approach yields the highest Relevance Precision across all scenarios and the only 100% task completion rate. Its per-result match explanations, which provide structured evidence showing why each person matches the query, contribute to its Information Utility lead in the Expert (57.1 ± 2.3) and Influencer (58.9 ± 2.6) categories.

Structured search API (Exa). Exa returns structured entity results from its database, achieving solid second-place performance overall (55.0 ± 1.8). Its 96.6% task completion rate and consistent result counts make it reliable, but its Relevance Precision (53.8 ± 2.4) suggests that it struggles with complex multi-constraint queries, particularly in the Influencer category (37.4 ± 3.6).

Recruiting-focused platform (Juicebox). Juicebox's database of more than 800 million profiles gives it a natural advantage in the Recruiting scenario, where it ranks second overall (65.7 ± 2.9) with the highest Coverage (75.3 ± 2.7) and Information Utility (55.8 ± 2.3). However, performance degrades sharply outside this domain: Influencer Relevance Precision drops to 26.6 ± 4.0, and task completion falls to 79.3%.

General-purpose AI agent (Claude Code). Claude Code achieves reasonable Relevance Precision (54.3 ± 2.8) via general-purpose web search, with notably strong performance on Expert/Deterministic queries (69.6 ± 2.7). However, its lower Coverage (41.1 ± 3.1) reflects that it typically finds fewer qualified people per query, and its Information Utility is the lowest (42.7 ± 2.2) because its markdown reports lack structured contact data and per-criterion verification evidence.

6 Verification Pipeline Validation

A core contribution of this benchmark is the Criteria-Grounded Verification pipeline itself. To ensure that the pipeline produces reliable and reproducible results, we conducted extensive validation experiments.

6.1 Human Validation Study

We conducted a human validation study on a stratified random sample of 200 person-query pairs, with 50 pairs selected from each of the four scenarios.

Table 9: Effective Coverage by scenario with 95% confidence intervals.

Scenario                 Lessie       Exa          Juicebox     Claude Code
Recruiting               75.6 ± 2.8   73.8 ± 3.0   75.3 ± 2.7   46.7 ± 3.8
B2B Prospecting          63.5 ± 2.7   58.5 ± 3.1   52.7 ± 3.4   42.3 ± 3.6
Expert / Deterministic   75.2 ± 2.5   69.0 ± 2.9   46.9 ± 3.7   62.9 ± 3.2
Influencer / KOL         62.8 ± 3.2   39.3 ± 3.7   22.8 ± 4.1   39.3 ± 3.7

Table 10: Information Utility by scenario with 95% confidence intervals.

Scenario                 Lessie       Exa          Juicebox     Claude Code
Recruiting               54.3 ± 2.4   54.0 ± 2.6   55.8 ± 2.3   45.8 ± 3.0
B2B Prospecting          55.5 ± 2.5   57.0 ± 2.4   55.4 ± 2.6   43.6 ± 3.2
Expert / Deterministic   57.1 ± 2.3   52.9 ± 2.7   46.8 ± 3.1   38.5 ± 3.4
Influencer / KOL         58.9 ± 2.6   48.0 ± 3.0   44.0 ± 3.3   43.4 ± 3.1

Annotation protocol. Two trained human annotators independently reviewed each pair following the same criteria extraction and verification procedure that our automated pipeline uses. Annotators had access to the same web search tools and were blinded to the source platform of each result.
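The agreement statistics reported in the remainder of this section can be computed with standard library routines; a minimal sketch, assuming annotations are stored as parallel lists of labels (the toy data and variable names below are illustrative only):

from sklearn.metrics import cohen_kappa_score
from scipy.stats import pearsonr

# Illustrative toy inputs; judgments are encoded as met / partially_met / not_met.
annotator_a = ["met", "not_met", "partially_met", "met", "met"]
annotator_b = ["met", "not_met", "met", "met", "partially_met"]
llm_verifier = ["met", "not_met", "partially_met", "met", "met"]

# Inter-annotator agreement on the 3-level criterion match status (Cohen's kappa).
kappa_humans = cohen_kappa_score(annotator_a, annotator_b)

# LLM-versus-human agreement, here against a single annotator for simplicity;
# the study compares against the consensus of both annotators.
kappa_llm = cohen_kappa_score(annotator_a, llm_verifier)

# Continuous relevance grades (Equation 1) are compared with Pearson's r.
human_grades = [1.0, 0.0, 0.75, 1.0, 0.5]
llm_grades = [1.0, 0.0, 0.5, 1.0, 0.5]
pearson_r, _ = pearsonr(human_grades, llm_grades)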
Inter-annotator agreement. The two human annotators achieved substantial agreement on criterion-level judgments:

• Criterion match status (met/partially met/not met): Cohen's κ = 0.87 (95% CI: 0.83–0.91)
• Relevance grade (continuous): Pearson's r = 0.92 (95% CI: 0.89–0.94)
• Qualified status (rel ≥ 0.5): Cohen's κ = 0.91 (95% CI: 0.87–0.95)

LLM versus human agreement. We compared the LLM verifier's judgments against the human consensus (majority vote of the two annotators):

• Criterion match status: Cohen's κ = 0.84 (95% CI: 0.79–0.89)
• Relevance grade: Pearson's r = 0.89 (95% CI: 0.85–0.92)
• Qualified status: Cohen's κ = 0.88 (95% CI: 0.83–0.93)

Disagreement analysis. Of the 26 criterion-level disagreements between the LLM verifier and the human consensus, 18 (69%) involved "partially met" judgments where the LLM was more conservative than the human annotators, and 8 (31%) involved missing evidence, where the LLM found information that the humans missed. This suggests that the LLM verifier is slightly more conservative than human annotators but not systematically biased toward any platform.

6.2 Criteria Extraction Stability

To assess the stability of the criteria extraction step, we ran the extraction prompt five times on each of 30 randomly selected queries with temperature set to 0.7. Across the 150 extractions (30 queries × 5 runs), we find:

• Number of criteria extracted: mean = 2.73, standard deviation = 0.41, range = 2–4
• Semantic equivalence of criteria sets (assessed by GPT-4): 94.7% of runs produced semantically equivalent criteria sets
• Exact string match: 78.0% (a lower bound, since paraphrasing is acceptable)

These results indicate that the criteria extraction process is stable across runs even with non-zero temperature.

Figure 3: Heatmap of overall scores by query scenario and platform. Lessie leads in all four scenarios, with the largest margin in the Influencer/KOL discovery scenario.

Table 11: Human validation results: LLM versus human consensus on 200 person-query pairs.

Metric                         Agreement Rate   Cohen's κ   95% CI
Criterion match (3-level)      86.5%            0.84        [0.79, 0.89]
Qualified status (binary)      93.0%            0.88        [0.83, 0.93]
Relevance grade (continuous)   -                r = 0.89    [0.85, 0.92]

6.3 Judge Model Sensitivity

We tested the verification pipeline with alternative judge models on a subset of 50 queries (200 person-query pairs) to assess how sensitive results are to the choice of judge model. All models show high agreement (κ > 0.75) with the primary Gemini model, which indicates that the pipeline is robust to the choice of judge model. Platform rankings remain consistent across all judge models.

6.4 Prompt Robustness

We tested three prompt variants on 50 queries to assess how sensitive results are to prompt design:

• Original: the production prompt used in our main experiments
• Simplified: examples and detailed instructions removed
• Enhanced: explicit chain-of-thought reasoning steps added

The simplified prompt shows acceptable agreement but slightly lower reliability. The enhanced prompt with chain-of-thought shows the highest agreement, at the cost of increased latency. In all cases, platform rankings remain stable.

7 Cost and Latency Analysis

To provide a complete picture of benchmark feasibility for other researchers who wish to replicate or extend our work, we report the computational cost and latency of running the full evaluation.
Figure 4: Cross-scenario Relevance Precision. Lessie maintains the most consistent performance across all four scenarios (range: 62.8–79.0, coefficient of variation: 9.7%). Other platforms exhibit wider variance: Juicebox ranges from 26.6 to 66.1, Exa from 37.4 to 66.2.

Table 12: Verification results across different judge models (200 person-query pairs).

Judge Model                Agreement with Gemini   Cohen's κ   Average Relevance
Gemini 3 Flash (primary)   -                       -           0.612
GPT-4o                     91.5%                   0.87        0.608
Claude 3.5 Sonnet          90.2%                   0.85        0.621
GPT-4o-mini                87.3%                   0.79        0.598

7.1 Cost Breakdown

The total cost of evaluating all four platforms across the 119 queries is shown in Table 14.

Per-platform query costs. The verification cost is identical for all platforms, since we use the same pipeline to process all results. Platform query costs vary:

• Lessie: $12.60 (subscription-based, prorated)
• Exa: $8.40 (API calls at $0.07 per query)
• Juicebox: $14.20 (subscription-based, prorated)
• Claude Code: $12.60 (API calls at $0.105 per query)

Per-query verification cost. The average verification cost per query is $0.86, broken down as: criteria extraction ($0.002), web search ($0.75), and LLM verification ($0.11).

Figure 5: Task completion rate versus Relevance Precision. Bubble size indicates overall score. Lessie is the only platform that achieves both 100% task completion and the highest relevance.

Table 13: Verification results across prompt variants.

Prompt Variant   Agreement with Original   Cohen's κ   Average Time (seconds)
Original         -                         -           3.2
Simplified       88.5%                     0.81        2.1
Enhanced (CoT)   93.2%                     0.89        5.8

7.2 Latency Analysis

The average latency per query, broken down by platform and pipeline stage, is shown in Table 15. Web verification dominates latency because each criterion requires an independent web search. The entire pipeline is easily parallelizable: running verification on eight concurrent workers reduces the total evaluation time from 4.9 hours to about 1.2 hours.

Table 14: Cost and latency analysis for the full benchmark evaluation (119 queries × 4 platforms).

Component                           Cost (USD)   Wall-Clock Time
Platform query execution            47.80        2.3 hours
Criteria extraction (119 queries)   0.24         4.2 minutes
Web verification (Tavily API)       89.40        1.8 hours
LLM verification (Gemini 3 Flash)   12.60        42 minutes
Total                               150.04       4.9 hours

Table 15: Average latency per query by platform and pipeline stage (seconds).

Stage                 Lessie   Exa    Juicebox   Claude Code
Platform execution    45.2     3.8    38.6       62.4
Criteria extraction   2.1      2.1    2.1        2.1
Web verification      54.3     48.7   51.2       49.8
LLM verification      21.2     18.4   19.8       17.6
Total per query       122.8    73.0   111.7      131.9

8 Error Analysis

We conducted a systematic error analysis to understand the typical failure modes across different platforms and query types.

8.1 Error Taxonomy

We manually reviewed all queries with at least one error (task failure or below-threshold results) and categorized errors into four main types, summarized in Table 16.

8.2 Error Patterns by Scenario

Recruiting errors. Juicebox shows the lowest false positive rate (6.2%) in recruiting, which reflects the high quality of its professional database.
Claude Code's errors are dominated by incomplete profiles (38.5%), since its markdown reports often lack structured contact information.

B2B Prospecting errors. Juicebox's task failure rate jumps to 15.6% for B2B queries, because many target companies fall outside the coverage of its database. Exa shows elevated false positives (22.1%) when job titles are ambiguous.

Expert/Deterministic errors. Claude Code achieves the lowest error rate in this category (12.5%), since deterministic queries benefit from general-purpose web search. Juicebox struggles with a 28.6% task failure rate when target individuals lack LinkedIn profiles.

Influencer/KOL errors. This scenario has the highest error rates across all platforms. Juicebox's false negative rate reaches 41.4% because influencers often lack traditional professional profiles. Lessie maintains the lowest error rate (18.5%) thanks to its multi-source coverage.

Table 16: Error taxonomy with frequency by platform (percentage of all results with errors).

Error Type           Description                                Les.    Exa     Jbx     CC
False Positive       Returned person doesn't match criteria     8.2%    18.4%   24.6%   16.8%
False Negative       Valid person exists but not returned       0%      3.4%    16.0%   13.5%
Incomplete Profile   Person matches but lacks key information   12.4%   14.2%   8.6%    31.2%
Task Failure         Platform returned no results or an error   0%      3.4%    16.0%   13.5%

8.3 Case Studies

To illustrate the typical failure modes we observed, we present three case studies.

Case 1: False positive from Juicebox.
Query: "Find VP-level product managers at fintech startups in Singapore"
Juicebox returned: a product manager at a traditional bank in Singapore.
Error Analysis: The system matched "product manager" + "Singapore" + "finance" but missed the "fintech startup" constraint. This is a common error for database-focused platforms that rely on keyword matching rather than semantic understanding.

Case 2: False negative from Claude Code.
Query: "Find AI researchers who published at NeurIPS 2024 on diffusion models"
Claude Code returned: a markdown report with 3 names, all correct.
Error Analysis: The report missed 12 other valid researchers that Lessie and Exa found. This illustrates a limitation of single-pass search in general-purpose agents: they often stop after finding a few results rather than continuing to search for more.

Case 3: Verification failure.
Query: "Find co-founders of Anthropic"
Platform returned: Dario Amodei, Daniela Amodei (both correct).
Error Analysis: Web search returned conflicting information about whether other individuals should also be counted as co-founders. Human review confirmed that the Amodeis are the primary co-founders; the LLM verifier correctly marked the other claims as "partially met" due to the conflicting sources. This shows that the pipeline handles ambiguous cases properly rather than forcing incorrect binary judgments.

9 Ablation and Sensitivity Studies

We conducted ablation studies to validate the key design choices made in developing the benchmark.

9.1 Qualified Threshold Sensitivity

Our primary results use rel(p_i) ≥ 0.5 as the threshold for defining a qualified result. We tested three thresholds (≥ 0.3, ≥ 0.5, ≥ 0.7) to assess how this choice affects platform rankings; Table 17 reports the resulting rankings.

Table 17: Platform rankings under different qualified thresholds.

              Effective Coverage Rank      Overall Rank
Platform      ≥ 0.3   ≥ 0.5   ≥ 0.7       ≥ 0.3   ≥ 0.5   ≥ 0.7
Lessie        1       1       1           1       1       1
Exa           2       2       2           2       2       2
Juicebox      3       3       4           4       4       3
Claude Code   4       4       3           3       3       4
Rankings are stable across all tested thresholds. Lessie and Exa maintain positions 1–2 regardless of the threshold. Juicebox and Claude Code swap positions at the 0.7 threshold, which reflects Juicebox's higher precision but lower recall compared to Claude Code.

9.2 Top-K Sensitivity

We evaluated the impact of using different values of K for the nDCG calculation (Table 18). Rankings remain stable for all K ∈ {5, 10, 15}. The choice of K = 10 balances granularity with practical relevance, since users typically review the top 10 results for a given query.

Table 18: Relevance Precision (padded nDCG@K) with different values of K.

Platform      nDCG@5   nDCG@10   nDCG@15   Rank Stable?
Lessie        72.4     70.2      68.1      Yes
Exa           55.8     53.8      51.2      Yes
Claude Code   56.2     54.3      52.8      Yes
Juicebox      46.3     44.7      42.9      Yes

9.3 Dimension Weighting Sensitivity

We tested whether the overall score ranking is sensitive to changes in the dimension weights (Table 19). Rankings are robust to weight changes: Lessie ranks first under all tested weighting schemes. The "Optimized" column shows weights learned via grid search to maximize correlation with human preference judgments on a held-out set of 30 queries, and the rankings remain unchanged.

Table 19: Overall score rankings under different weighting schemes. Scores are shown in parentheses.

Platform      Equal      Prec.-Heavy   Cov.-Heavy   Util.-Heavy   Optimized
Lessie        1 (65.2)   1 (65.9)      1 (68.2)     1 (62.3)      1 (66.8)
Exa           2 (55.0)   2 (54.9)      2 (56.9)     2 (55.0)      2 (55.6)
Claude Code   3 (46.0)   3 (47.1)      4 (44.5)     3 (45.8)      3 (46.2)
Juicebox      4 (45.8)   4 (45.3)      3 (45.9)     4 (49.2)      4 (47.1)

9.4 Partial Credit Ablation

We tested removing the "partially met" (0.5) score and using only binary met/not met judgments (Table 20). Removing partial credit lowers all scores proportionally but does not change rankings. Including partial credit provides finer-grained discrimination without affecting relative comparisons.

9.5 Information Utility Ablation

We tested computing the overall score without the Information Utility dimension (Table 21). Without Information Utility, Juicebox drops below Claude Code in the rankings. This reflects Juicebox's strong profile completeness (which benefits its Information Utility score) despite its lower Relevance Precision. The Information Utility dimension captures value that is not reflected in relevance alone, which justifies its inclusion in the overall score.

10 Discussion

Multi-source data fusion provides consistent advantages. Lessie's consistent lead across all four scenarios, including domains where other platforms have natural advantages (Juicebox in Recruiting, Claude Code in Deterministic search), strongly suggests that integrating multiple data sources provides a structural advantage in people search. The Influencer/KOL category, where content creators lack standardized professional profiles, demonstrates this most clearly: Lessie's Coverage (62.8) is 2.75 times Juicebox's (22.8).

Criteria-Grounded Verification reduces evaluation bias. By decomposing evaluation into explicit factual checks rather than using holistic subjective scores, our pipeline achieves higher reproducibility than traditional LLM-as-judge methods. Human validation confirms that the LLM verifier achieves high agreement with human judgments (κ = 0.84), which supports the reliability of our approach.
The three-level criterion matching (met/partially met/not met) forces the judge to commit to specific factual claims that are verified through external web search rather than relying on parametric memory.

Equal-weight averaging is robust. We follow the MCDA principle [3] that equal weights perform comparably to optimized weights in most multi-attribute decision problems. Our sensitivity analysis confirms that rankings are robust to weight changes, which validates this design choice.

Table 20: Impact of removing partial credit.

Platform      With Partial (0.5)   Binary Only   Rank Change
Lessie        70.2                 68.4          None
Exa           53.8                 51.2          None
Claude Code   54.3                 52.1          None
Juicebox      44.7                 41.8          None

Table 21: Impact of removing the Information Utility dimension.

Platform      3-Dim Overall   2-Dim (Prec+Cov)   Rank Change
Lessie        65.2            69.7               None
Exa           55.0            55.9               None
Claude Code   46.0            47.7               None
Juicebox      45.8            43.3               Drops to 4th

Limitations. We note several limitations of the current work: (1) We use a single judge model (Gemini 3 Flash Preview) as the primary verifier; however, our model sensitivity tests show high agreement across alternative models. (2) The 119-query set does not cover every possible people-search use case, such as academic collaborator search or angel investor identification. (3) Web verification depends on what is publicly indexed; people with limited online presence may be under-evaluated. (4) We evaluate up to 15 results per query; platforms that return more results are only evaluated on their top 15. (5) Platform capabilities evolve quickly; our results reflect a single snapshot from January 2025. (6) The Information Utility dimension rewards platforms that provide per-result match explanations, which is an intentional design choice that reflects user value but may favor architectures with built-in verification pipelines.

Broader impact. People search raises inherent privacy questions. Every query in our benchmark targets information that individuals have published on professional profiles or public websites. We release the evaluation framework (code and query definitions) so that others can audit and extend it; per-person evaluation details are excluded from the public release for privacy and compliance reasons.

11 Conclusion

PeopleSearchBench provides an open-source benchmark with a Criteria-Grounded Verification pipeline for evaluating AI-powered people search platforms. Scoring results from four architecturally diverse platforms on 119 queries across four scenarios, the benchmark finds that Lessie achieves the highest overall score (65.2 ± 1.5) with 100% task completion, followed by Exa (55.0 ± 1.8), Claude Code (46.0 ± 2.1), and Juicebox (45.8 ± 1.9). The evaluation reveals that multi-source data fusion and per-result match explanations provide significant advantages across diverse query types. We release all code, queries, and aggregated scores to support reproducible comparison as the landscape of AI-powered people search continues to evolve.

Ethics Statement

All queries in this work target publicly available professional information; we do not scrape private data. The benchmark publishes aggregated platform scores, not underlying personal records. All evaluation was conducted using only publicly available profile information.
All human annotation for validation was conducted with informed consent, and annotators were compensated at rates exceeding local minimum wage requirements. We recognize that people-search technology can be misused, and we encourage adopters of this benchmark to pair it with responsible data-handling policies.

References

[1] Krisztian Balog. Entity-Oriented Search, volume 39 of The Information Retrieval Series. Springer, 2018.
[2] Krisztian Balog, Yi Fang, Maarten de Rijke, Pavel Serdyukov, and Luo Si. Expertise retrieval. Foundations and Trends in Information Retrieval, 6(2–3):127–256, 2012.
[3] Robyn M. Dawes and Bernard Corrigan. Linear models in decision making. Psychological Bulletin, 81(2):95–106, 1974.
[4] Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall/CRC, 1994.
[5] Sahin Cem Geyik, Qi Guo, Bo Hu, Cagri Ozcaglar, Ketan Thakkar, Xianren Wu, and Krishnaram Kenthapadi. Talent search and recommendation systems at LinkedIn: Practical challenges and lessons learned. In Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1353–1354, 2018.
[6] Faegheh Hasibi, Fedor Nikolaev, Chenyan Xiong, Krisztian Balog, Svein Erik Bratsberg, Alexander Kotov, and Jamie Callan. DBpedia-Entity v2: A test collection for entity search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1265–1268, 2017.
[7] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations, 2024.
[8] Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
[9] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466, 2019.
[10] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. In International Conference on Learning Representations, 2024.
[11] Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, 2023.
[12] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021.
[13] Ellen M. Voorhees and Donna K. Harman. TREC: Experiment and Evaluation in Information Retrieval. MIT Press, 2005.
[14] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 9440–9450, 2024.
[15] Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. FLASK: Fine-grained language model evaluation based on alignment skill sets. In International Conference on Learning Representations, 2024.
[16] Zhenyu Yu, Mohd Yamani Idna Idris, Pei Wang, and Rizwan Qureshi. Cotextor: Training-free modular multilingual text editing via layered disentanglement and depth-aware fusion. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Creative AI Track: Humanity, 2025.
[17] Ruilin Zhang, Haiyang Zheng, and Hongpeng Wang. CNMBI: Determining the number of clusters using center pairwise matching and boundary filtering. In Proceedings of the International Conference on Advanced Data Mining and Applications, pages 262–277, 2023.
[18] Ruilin Zhang, Haiyang Zheng, and Hongpeng Wang. TDEC: Deep embedded image clustering with transformer and distribution information. In Proceedings of the 2023 ACM International Conference on Multimedia Retrieval, 2023.
[19] Haiyang Zheng, Ruilin Zhang, and Hongpeng Wang. Deep image clustering based on curriculum learning and density information. In Proceedings of the ACM International Conference on Multimedia Retrieval, 2024.
[20] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, 2023.
[21] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations, 2024.

A Case Studies

To provide concrete illustrations of platform performance differences, we present detailed case studies across three representative query types: influencer discovery, expert finding, and recruiting.

A.1 Case Study 1: Niche Influencer Discovery

Query. "Find influencers on Instagram with 'slot' in their username and also in their regular name, they must be from Brazil, have between 300 and 50K followers, and promote casinos."

Challenge. This query requires multi-constraint matching across platform-specific attributes (Instagram username format), geographic location (Brazil), follower count range, and a niche content domain (casino promotion). Such queries are common in influencer marketing but challenging for general-purpose search engines.

Results Analysis. Table 22 shows the performance comparison across platforms.

Table 22: Case Study 1: influencer discovery results by platform.

Platform   P@10   Qualified   Key Error Type
Lessie     1.00   10          None
Exa        0.20   2           Wrong platform (LinkedIn instead of Instagram)
Juicebox   0.10   1           Wrong profession (video editors, not influencers)

Error Analysis.

• Exa returned LinkedIn profiles of iGaming industry professionals instead of Instagram influencers. The system matched "Brazil" + "casino/gaming" but failed on the platform constraint (Instagram) and the username format requirement.
• Juicebox returned video editors and creative professionals with no Instagram presence matching the criteria.
  The database-focused approach struggled with social media-specific queries outside professional networks.
• Lessie correctly identified Instagram accounts with "slot" in their usernames (e.g., carol.martins_slots, carla_oliveira_slots) and verified the Brazil location and follower counts.

Key Insight. Multi-platform social media queries require specialized data sources beyond professional databases. General web search engines often conflate professional profiles with social media influencers.

A.2 Case Study 2: Cross-Domain Expert Finding

Query. "Find people who have both a strong academic publication record in NLP and also hold senior engineering positions at tech companies. I want the rare academics-turned-practitioners."

Challenge. This query requires finding individuals who exist at the intersection of two distinct domains: academic research (publications at NLP venues) and industry leadership (senior engineering roles). Such "cross-domain" queries test a platform's ability to synthesize information from multiple sources.

Results Analysis. Table 23 shows the performance comparison.

Table 23: Case Study 2: cross-domain expert finding results.

Platform   P@10   Avg. Relevance   Key Strength
Lessie     1.00   0.97             Both academic & industry verified
Juicebox   1.00   1.00             Strong industry profiles
Exa        0.60   0.75             Good academic coverage

Example Results.

• Lessie found candidates such as a Principal at Amazon with ACL/EMNLP publications and a former VP of Research at OpenAI with ICML/ICLR papers. All criteria were verified with evidence.
• Juicebox returned strong candidates, including Senior ML Engineers at Google and Microsoft with NLP publications. However, some candidates lacked the "senior engineering position" criterion (e.g., PhD students).
• Exa returned academics who lacked current industry positions (e.g., professors at universities), showing difficulty in filtering for the "currently employed at a tech company" constraint.

Key Insight. Cross-domain queries benefit from platforms that can verify multiple criteria independently. Academic-only results and industry-only results both represent partial failures for this query type.

A.3 Case Study 3: Technical Recruiting

Query. "Looking for machine learning engineers in Boston who have worked on large language models."

Challenge. Recruiting queries require precise matching on role (ML Engineer), location (Boston), and technical expertise (LLMs). The "worked on LLMs" constraint is particularly challenging, as it requires understanding project experience beyond job titles.

Results Analysis. Table 24 shows the performance comparison.

Table 24: Case Study 3: technical recruiting results.

Platform   P@10   Task Success   Location Accuracy
Lessie     1.00   100%           100%
Exa        1.00   100%           100%
Juicebox   0.67   100%           67% (conflicting data)

Error Analysis.

• Lessie and Exa both found highly relevant candidates, including Lead ML Engineers at HubSpot and ML Engineers at Red Hat, all verified as Boston-based with LLM experience.
D Evaluation Prompts

This appendix provides the complete prompts used in the Criteria-Grounded Verification pipeline.

D.1 Criteria Extraction Prompt

    You are a query analyzer. Given a people search query, extract explicit, independently verifiable criteria.

    Query: {query}

    Instructions:
    1. Identify all constraints in the query (role, company, location, skills, experience level, etc.)
    2. Each criterion must be independently verifiable via web search
    3. Output as a JSON list of criterion objects

    Output format:
    {
      "criteria": [
        {"id": "c1", "description": "...", "type": "role"},
        ...
      ]
    }

    Example:
    Query: "Find senior ML engineers at Google in Bay Area"
    Output:
    {
      "criteria": [
        {"id": "c1", "description": "Role is Senior ML Engineer or equivalent", "type": "role"},
        {"id": "c2", "description": "Currently employed at Google", "type": "company"},
        {"id": "c3", "description": "Located in San Francisco Bay Area", "type": "location"}
      ]
    }

D.2 Verification Prompt

    You are a fact-checker. Given a person and a criterion, verify whether the person meets the criterion using web search.

    Person: {person_data}
    Criterion: {criterion_description}

    Instructions:
    1. Search for evidence about this person using web search
    2. Evaluate whether the criterion is met based on evidence
    3. Output one of: "met", "partially_met", "not_met"
    4. Provide brief justification with source URLs

    Output format:
    {
      "judgment": "met|partially_met|not_met",
      "justification": "...",
      "sources": ["url1", "url2"]
    }
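The verification prompt above yields a per-criterion judgment ("met", "partially_met", or "not_met") with supporting sources. The sketch below shows one plausible way to roll these criterion-level judgments up into a per-person relevance verdict; the strict all-criteria-met rule and the treatment of "partially_met" as a flag for review are assumptions for illustration, not necessarily the exact aggregation used by the pipeline.

    # Illustrative aggregation of criterion-level judgments (Appendix D.2 output)
    # into a per-person relevance verdict. Assumption: a person counts as relevant
    # only when every extracted criterion is judged "met"; "partially_met" is
    # flagged for manual review rather than counted as relevant.
    def person_is_relevant(judgments: list[dict]) -> bool:
        """Return True only when every extracted criterion is judged 'met'."""
        return bool(judgments) and all(j["judgment"] == "met" for j in judgments)

    def needs_review(judgments: list[dict]) -> bool:
        """Flag borderline cases where at least one criterion is 'partially_met'."""
        return any(j["judgment"] == "partially_met" for j in judgments)

    # Hypothetical verifier output for one candidate against three criteria.
    example = [
        {"criterion_id": "c1", "judgment": "met", "sources": ["https://example.com/a"]},
        {"criterion_id": "c2", "judgment": "met", "sources": ["https://example.com/b"]},
        {"criterion_id": "c3", "judgment": "partially_met", "sources": []},
    ]
    print(person_is_relevant(example), needs_review(example))  # False True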
D.3 Information Utility Scoring Prompt

    You are evaluating the information utility of a people search result. Given the person data and the original query, score three dimensions:
    1. Structural Completeness (0-1): Does the result include name, title, company, contact info, work history, education?
    2. Query-Specific Evidence (0-1): Does it explain WHY this person matches?
    3. Actionability (0-1): Can the user take action (contact, shortlist)?

    Person: {person_data}
    Query: {query}

    Output format:
    {
      "structural_completeness": 0.0-1.0,
      "query_specific_evidence": 0.0-1.0,
      "actionability": 0.0-1.0,
      "utility": 0.0-1.0
    }

E Privacy and Compliance Details

Data collection compliance. All data collection adheres to:
• robots.txt: We respect robots.txt directives for all web sources.
• Terms of Service: Platform queries use official APIs or web interfaces.
• Rate limiting: All requests respect rate limits (max 1 req/sec per source).

Data storage.
• Raw profile pages are not stored; only extracted fields are retained.
• Personal identifiers (email, phone) are hashed before storage.
• Evaluation results are aggregated; no per-person data is publicly released.

GDPR/CCPA considerations.
• All queries target publicly available professional information.
• Individuals can request removal via GitHub issues.
• No automated decision-making or profiling is performed.

F Reproducibility Checklist

To ensure full reproducibility, we provide:
• Code: Complete evaluation pipeline at the GitHub repository.
• Queries: All 119 queries with metadata in JSON format.
• Prompts: All LLM prompts in this appendix.
• Schema: Unified result schema in Appendix C.
• Data: Aggregated scores and per-query results (CSV).
• Environment: Python requirements.txt and Docker configuration.
• Random seeds: All random processes use a fixed seed (42).

G Detailed Per-Category Metrics

Table 28 reports task completion rates and mean qualified results per query by category and platform, with 95% confidence intervals.

Table 28: Task completion rate (%) and mean qualified results per query by category, with 95% CI. Les. = Lessie, Jbx = Juicebox, CC = Claude Code.

                  Task Completion (%)                            Mean Qualified / Query
    Scenario      Les.   Exa          Jbx          CC            Les.         Exa          Jbx          CC
    Recruiting    100    100          100          90.0 ± 5.5    11.3 ± 0.8   11.1 ± 0.9   11.3 ± 0.8   7.0 ± 0.7
    B2B           100    100          84.4 ± 6.5   75.0 ± 7.7    9.5 ± 0.7    8.8 ± 0.8    7.9 ± 0.9    6.3 ± 0.8
    Expert        100    96.4 ± 3.6   71.4 ± 8.5   100           11.3 ± 0.9   10.4 ± 0.8   7.0 ± 0.7    9.4 ± 0.9
    Influencer    100    89.7 ± 5.7   79.3 ± 7.6   82.8 ± 7.0    9.4 ± 0.8    5.9 ± 0.7    3.4 ± 0.6    5.9 ± 0.7
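The confidence intervals in Table 28 are computed over per-query values within each category. As a reference point, the sketch below computes a 95% interval for a mean using a normal approximation to the standard error; whether the benchmark uses this approximation, a t-interval, or bootstrapping is not specified in this appendix, so treat the snippet as an illustration only, and the per-query counts in it as hypothetical.

    # Illustrative 95% confidence interval for a per-category mean
    # (normal approximation; the benchmark's exact CI method may differ).
    import math

    def mean_with_ci(values, z=1.96):
        """Return (mean, half-width) of a 95% CI using the standard error of the mean."""
        n = len(values)
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
        half_width = z * math.sqrt(var / n)
        return mean, half_width

    # Hypothetical per-query qualified-result counts for one category.
    qualified_per_query = [11, 12, 10, 11, 12, 11, 10, 12, 11, 13]
    m, hw = mean_with_ci(qualified_per_query)
    print(f"{m:.1f} ± {hw:.1f}")  # e.g., 11.3 ± 0.6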
