"Not in My Backyard": LLMs Uncover Online and Offline Social Biases Against Homelessness
Homelessness is a persistent social challenge that affects millions worldwide; over 876,000 people experienced homelessness in the U.S. in 2025. Social bias is a significant barrier to alleviating it, shaping public perception and influencing policymaking. Because online textual media and offline city council discourse both reflect and shape public opinion, they offer valuable signals for identifying and tracking social biases against people experiencing homelessness (PEH). We present a new, manually annotated multi-domain dataset compiled from Reddit, X (formerly Twitter), news articles, and city council meeting minutes across ten U.S. cities. Our 16-category multi-label taxonomy creates a challenging long-tail classification problem: some categories appear in less than 1% of samples, while others exceed 70%. We find that small human-annotated datasets (1,702 samples) are insufficient for training effective classifiers, whether used to fine-tune encoder models or as few-shot examples for LLMs. To address this, we use GPT-4.1 to generate pseudo-labels on a larger unlabeled corpus. Training on this expanded dataset enables even small encoder models (ModernBERT, 150M parameters) to reach 35.23 macro-F1, approaching GPT-4.1’s 41.57. This demonstrates that data quantity matters more than model size, enabling low-cost, privacy-preserving deployment without reliance on commercial APIs. Our results reveal that negative bias against PEH is prevalent both offline and online (especially on Reddit), with “not in my backyard” narratives drawing the highest engagement. These findings uncover a form of ostracism that directly affects poverty-reduction policymaking and provide actionable insights for practitioners addressing homelessness.
💡 Research Summary
The paper investigates social bias against people experiencing homelessness (PEH) by analyzing both online textual media (Reddit, X formerly Twitter, news articles) and offline discourse (city council meeting minutes) across ten U.S. cities. The authors first compile a manually annotated gold‑standard dataset of 1,702 samples, each labeled with a 16‑category multi‑label taxonomy that expands on the existing OATH frames to capture nuances such as genuine vs. rhetorical questions, factual claims, opinion expression, and an explicit “racist” category. Human annotators (three per item) achieve an average inter‑annotator agreement of 78.38%, and soft labeling (majority vote) is used to create binary vectors for each sample.
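The majority-vote aggregation described above can be sketched as follows. This is a minimal illustration (not the authors' code), assuming each of the three annotators marks a set of applicable category indices and a category is kept when at least two annotators select it:

```python
# Minimal sketch of majority-vote label aggregation: three annotators each
# mark which of the 16 taxonomy categories apply to a sample; a category
# enters the final binary label vector if at least two annotators chose it.

NUM_CATEGORIES = 16  # size of the multi-label taxonomy


def majority_vote(annotations: list[set[int]], threshold: int = 2) -> list[int]:
    """Collapse per-annotator category sets into one binary label vector."""
    vector = []
    for category in range(NUM_CATEGORIES):
        votes = sum(category in ann for ann in annotations)
        vector.append(1 if votes >= threshold else 0)
    return vector


# Example: all three annotators agree on category 3; only one flags category 7,
# so it is dropped from the final vector.
labels = majority_vote([{3, 7}, {3}, {3}])
```

The threshold of two out of three corresponds to a simple majority; the function names and structure here are illustrative, not taken from the paper.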
The study asks three research questions: (RQ1) whether a small gold set suffices for training effective classifiers, (RQ2) whether GPT‑generated pseudo‑labels can enable low‑cost, privacy‑preserving local models, and (RQ3) how homelessness‑related bias varies across media platforms and what societal implications arise. Initial experiments show that fine‑tuning transformer encoders (BERT, RoBERTa, ModernBERT) on the gold set yields poor macro‑F1 scores (≈25–32), and even few‑shot prompting of GPT‑4.1 improves performance only marginally (≈43 macro‑F1). This confirms that the gold set is valuable for evaluation but insufficient for training.
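The macro-F1 metric used throughout these comparisons averages per-category F1 scores with equal weight, which is why the long-tail taxonomy is so punishing: a rare category like “racist” counts as much as a frequent one. A minimal pure-Python sketch (not the paper's evaluation code) makes this concrete:

```python
# Minimal sketch of macro-F1 for multi-label classification. Each category's
# F1 is computed independently, then averaged with equal weight, so rare
# long-tail categories drag the score down even when frequent categories
# are predicted well.


def macro_f1(y_true: list[list[int]], y_pred: list[list[int]]) -> float:
    num_categories = len(y_true[0])
    f1_scores = []
    for c in range(num_categories):
        tp = sum(t[c] and p[c] for t, p in zip(y_true, y_pred))
        fp = sum((not t[c]) and p[c] for t, p in zip(y_true, y_pred))
        fn = sum(t[c] and (not p[c]) for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_categories


# Toy example with two categories: the first is predicted perfectly,
# the second (a "rare" category) is missed entirely.
y_true = [[1, 1], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [0, 0]]
score = macro_f1(y_true, y_pred)  # (1.0 + 0.0) / 2 = 0.5
```

Missing one rare category entirely halves the score here, mirroring how categories under 1% prevalence cap the macro-F1 numbers reported in the paper.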
To overcome data scarcity, the authors employ GPT‑4.1 in zero‑shot mode to generate pseudo‑labels for a much larger unlabeled corpus (tens of thousands of documents). They then fine‑tune local language models (LLaMA 3.2 3B, Qwen 2.5 7B, Phi‑4 Mini) using LoRA, a low‑rank adaptation technique that updates only a small subset of parameters. When ModernBERT (150 M parameters) is fine‑tuned on the pseudo‑labeled data, it reaches a macro‑F1 of 35.23, a 38 % relative improvement over gold‑only training and approaching the performance of the proprietary GPT‑4.1 (41.57 macro‑F1). This empirical result supports the authors’ claim that “data quantity matters more than model size,” enabling inexpensive, privacy‑preserving deployment without reliance on commercial APIs.
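The parameter savings behind LoRA can be seen in a toy sketch. This is an illustration of the low-rank update idea with made-up dimensions, not the paper's training setup: instead of updating a full weight matrix W, LoRA trains two small factors A and B and applies W_eff = W + (alpha / r) · B @ A, so only r · (d_in + d_out) parameters are updated instead of d_out · d_in:

```python
# Minimal sketch of the LoRA low-rank update (toy dimensions for
# illustration; real models use hidden sizes in the thousands).

d_in, d_out, r, alpha = 8, 8, 2, 4


def matmul(X, Y):
    """Plain list-of-lists matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


# Frozen base weight (identity here) plus trainable low-rank factors.
# Constant 0.1 entries stand in for learned values; training would
# optimize A and B while leaving W untouched.
W = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]
A = [[0.1] * d_in for _ in range(r)]   # r x d_in
B = [[0.1] * r for _ in range(d_out)]  # d_out x r

delta = matmul(B, A)  # full d_out x d_in update built from few parameters
scale = alpha / r
W_eff = [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

full_params = d_out * d_in        # 64 parameters if fine-tuned directly
lora_params = r * (d_in + d_out)  # 32 parameters with LoRA at rank 2
```

At realistic sizes (e.g., a 3B-parameter model with rank 8 or 16 adapters), this gap is what makes fine-tuning LLaMA 3.2 3B or Qwen 2.5 7B feasible on modest hardware.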
Platform‑specific analysis reveals distinct bias patterns. Reddit exhibits the highest engagement with the “not in my backyard” (NIMBY) narrative, accounting for over 40 % of the discourse and generating the most up‑votes and comments. X (Twitter) shows a higher prevalence of hateful or racist frames, while news articles tend to blend policy discussion with societal critique. City‑council minutes contain more formal policy and budget language but still reflect negative bias when discussing shelter siting. The authors also compare smaller cities (e.g., South Bend, Indiana) with larger media‑visible cities (e.g., San Francisco), finding that NIMBY sentiment is especially intense in smaller locales where residents feel a direct impact.
The paper acknowledges limitations. Pseudo‑labels inherit any biases present in GPT‑4.1, potentially contaminating downstream models. Extreme class imbalance remains problematic for rare categories such as “racist” (<1 % of samples). Geolocation inference for X users, while multi‑staged, is not perfectly accurate, which could affect regional bias analyses. Future work is suggested in human‑LLM collaborative labeling, advanced sampling or loss‑reweighting to address imbalance, and extending the methodology to multilingual or non‑U.S. contexts.
In conclusion, the study provides a novel multi‑domain dataset, demonstrates that large‑scale pseudo‑labeling can compensate for limited human annotations, and shows that modest encoder models can achieve near‑state‑of‑the‑art performance when trained on sufficient data. The findings highlight that online “not in my backyard” sentiment is a dominant driver of public opposition to homelessness interventions, especially on Reddit, and that such bias can directly influence policy decisions. The work offers actionable insights for policymakers, NGOs, and technologists seeking to monitor and mitigate stigma against PEH while maintaining data privacy and cost‑effectiveness.