PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits
Understanding human behavior traits is central to applications in human-computer interaction, computational social science, and personalized AI systems. Such understanding often requires integrating multiple modalities to capture nuanced patterns and relationships. However, existing resources rarely provide datasets that combine behavioral descriptors with complementary modalities such as facial attributes and biographical information. To address this gap, we present PersonaX, a curated collection of multimodal datasets designed to enable comprehensive analysis of public traits across modalities. PersonaX consists of (1) CelebPersona, featuring 9,444 public figures from diverse occupations, and (2) AthlePersona, covering 4,181 professional athletes across seven major sports leagues. Each dataset includes behavioral trait assessments inferred by three high-performing large language models, alongside facial imagery and structured biographical features. We analyze PersonaX at two complementary levels. First, we abstract high-level trait scores from text descriptions and apply five statistical independence tests to examine their relationships with other modalities. Second, we introduce a novel causal representation learning (CRL) framework tailored to multimodal and multi-measurement data, providing theoretical identifiability guarantees. Experiments on both synthetic and real-world data demonstrate the effectiveness of our approach. By unifying structured and unstructured analysis, PersonaX establishes a foundation for studying LLM-inferred behavioral traits in conjunction with visual and biographical attributes, advancing multimodal trait analysis and causal reasoning. The code is available at https://github.com/lokali/PersonaX.
💡 Research Summary
PersonaX is a newly introduced multimodal resource that brings together large‑scale behavioral trait assessments, facial visual data, and structured biographical metadata for two distinct populations: public figures (CelebPersona) and professional athletes (AthlePersona). The CelebPersona subset contains 9,444 individuals derived from the CelebA image collection, while AthlePersona comprises 4,181 male athletes from seven major sports leagues (NBA, NFL, NHL, PGA, Premier League, Bundesliga, ATP). For each individual, the authors provide (1) textual descriptions and Big‑Five trait scores generated by three state‑of‑the‑art large language models (ChatGPT‑4o, Gemini‑2.5‑Pro, Llama‑4‑Maverick), (2) a 1024‑dimensional facial embedding (the raw images are not released), and (3) a set of structured attributes such as birth date, nationality, height, weight, occupation, and geographic coordinates. All textual and visual data are transformed into high‑dimensional embeddings and further obfuscated through an invertible transformation to preserve privacy while still enabling downstream analysis.
The paper’s contributions are twofold. First, it releases the PersonaX datasets, filling a gap in existing resources that rarely combine behavioral descriptors with visual and biographical modalities. Second, it proposes a two‑level analytical framework. At the structured level, the authors apply five statistical independence tests—including a modified Kernel Conditional Independence (KCI) test, HSIC‑based tests, distance correlation, and others—to examine relationships between continuous variables (e.g., height, weight, trait scores) and categorical variables (e.g., sport, gender). These tests reveal non‑linear associations such as a positive link between conscientiousness and body mass index in athletes.
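Of the tests named above, distance correlation is the most self-contained to illustrate: it can be computed from pairwise distance matrices alone and, unlike Pearson correlation, it detects non‑linear dependence. The sketch below is a minimal NumPy implementation for 1‑D samples, not the authors' code, which also includes a modified KCI test and HSIC‑based tests:

```python
import numpy as np

def distance_correlation(x, y):
    """Empirical distance correlation (Székely et al.) between two 1-D samples.
    The population value is zero iff x and y are independent, so it also
    captures non-linear dependence that Pearson correlation misses."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)

    def double_center(z):
        d = np.abs(z - z.T)  # pairwise absolute differences
        return d - d.mean(axis=0) - d.mean(axis=1, keepdims=True) + d.mean()

    A, B = double_center(x), double_center(y)
    dcov2 = (A * B).mean()                      # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return 0.0 if denom == 0 else float(np.sqrt(max(dcov2, 0.0) / denom))
```

For example, with `y = x**2` and symmetric `x`, Pearson correlation is near zero while distance correlation is clearly positive, which is the kind of non‑linear association (e.g., conscientiousness vs. body mass index) the summary describes.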
At the unstructured level, the authors introduce a novel Causal Representation Learning (CRL) framework tailored to multimodal, multi‑measurement settings. The CRL model maps each modality into a latent space, assumes a directed acyclic causal graph among the latent variables, and learns modality‑specific linear or non‑linear decoders. Identifiability is guaranteed by leveraging (i) the three independent LLM‑derived trait score sets, (ii) the high‑dimensional structure of facial embeddings, and (iii) the linear independence of the structured attributes. Theoretical proofs show that both the latent recovery step and the causal graph estimation step are uniquely identifiable under mild assumptions.
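The multi‑measurement setting the CRL framework targets can be illustrated with synthetic data: latent traits follow a linear SCM over a DAG, and each modality observes them through its own decoder. The following is a minimal NumPy sketch with hypothetical dimensions, not the authors' model or code:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 1000, 5                             # samples, latent traits (hypothetical)

# Latent linear SCM: z = B z + e, with B strictly lower-triangular so the
# implied causal graph among the latents is a DAG in the chosen ordering.
B = np.tril(rng.normal(scale=0.5, size=(d, d)), k=-1)
e = rng.laplace(size=(n, d))               # non-Gaussian exogenous noise
z = e @ np.linalg.inv(np.eye(d) - B).T     # solves z = B z + e row-wise

# Three measurement sets (analogous to trait scores from three LLMs): each is
# a distinct noisy linear view of the same latent variables.
views = []
for _ in range(3):
    W = rng.normal(size=(d, d))            # modality-specific decoder
    views.append(z @ W.T + 0.1 * rng.normal(size=(n, d)))
```

Having several independent views of the same latents is what makes the latent recovery step tractable: classical multi‑view arguments pin the latent space down up to benign indeterminacies, which is the flavor of identifiability guarantee the summary attributes to the paper.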
Empirical evaluation covers both synthetic data, where ground‑truth latent variables and causal graphs are known, and the real PersonaX data. In synthetic experiments, CRL outperforms baseline ICA and GAN‑based methods, improving structural recovery accuracy by more than 12%. On the real datasets, CRL successfully uncovers plausible causal directions, for example indicating that higher openness scores tend to be associated with lower facial asymmetry. The authors also conduct a thorough LLM selection study, measuring generation time, missing/indecisive rates, privacy preservation, output formatting, factual accuracy, and context consistency. Results show that ChatGPT‑4o and Gemini‑2.5‑Pro achieve the best trade‑off between consistency (low standard deviation across repeated runs) and factual correctness.
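Structural recovery is commonly scored by comparing the estimated adjacency matrix against the ground truth, for instance with the structural Hamming distance (SHD). The paper's exact metric is not specified in this summary, so the following is an illustrative sketch of SHD over binary adjacency matrices:

```python
import numpy as np

def shd(A_true, A_est):
    """Structural Hamming distance between two binary adjacency matrices.

    Counts missing edges, extra edges, and reversed edges, with each
    reversal counted as a single error rather than two flipped entries."""
    A_true = np.asarray(A_true, dtype=int)
    A_est = np.asarray(A_est, dtype=int)
    diff = (A_true != A_est)
    # A reversed edge flips two mirrored entries; subtract one of the pair.
    reversals = (A_true == 1) & (A_est == 0) & (A_true.T == 0) & (A_est.T == 1)
    return int(diff.sum() - reversals.sum())
```

A lower SHD means the estimated graph is closer to the true one; "improving structural recovery accuracy" in graph-learning papers typically corresponds to reducing a score of this kind.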
Ethical considerations are addressed comprehensively. All source materials are publicly available and used under non‑commercial licenses; raw images and text are never released, only transformed embeddings. The paper discusses potential bias across race, gender, and nationality, and provides distribution statistics to enable bias audits. Privacy safeguards meet GDPR and CCPA standards through reversible transformations that keep personally identifiable information hidden while allowing research use.
In summary, PersonaX supplies the community with a richly annotated multimodal dataset linking LLM‑inferred behavioral traits to visual and biographical signals, and it offers a rigorous statistical‑causal pipeline for exploring inter‑modal relationships. The work opens avenues for downstream applications such as personalized human‑computer interaction, fairness‑aware trait prediction, and causal analysis in computational social science. Future work may expand the dataset to include women, non‑binary individuals, and broader occupational domains, as well as integrate the learned causal graphs into adaptive AI systems that respect diversity, equality, and user autonomy.