The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage
Traffic stops are among the most frequent police-civilian interactions, and body-worn cameras (BWCs) provide a unique record of how these encounters unfold. Respect is a central dimension of these interactions, shaping public trust and perceived legitimacy, yet its interpretation is inherently subjective and shaped by lived experience, rendering community-specific perspectives a critical consideration. Leveraging unprecedented access to Los Angeles Police Department BWC footage, we introduce the first large-scale traffic-stop dataset annotated with respect ratings and free-text rationales from multiple perspectives. By sampling annotators from police-affiliated, justice-system-impacted, and non-affiliated Los Angeles residents, we enable the systematic study of perceptual differences across diverse communities. To this end, we (i) develop a domain-specific evaluation rubric grounded in procedural justice theory, LAPD training materials, and extensive fieldwork; (ii) introduce a rubric-driven preference data construction framework for perspective-consistent alignment; and (iii) propose a perspective-aware modeling framework that predicts personalized respect ratings and generates annotator-specific rationales for both officers and civilian drivers from traffic-stop transcripts. Across all three annotator groups, our approach improves both rating prediction performance and rationale alignment. Our perspective-aware framework enables law enforcement to better understand diverse community expectations, providing a vital tool for building public trust and procedural legitimacy.
💡 Research Summary
The paper investigates how the concept of “respect” is perceived differently across community groups during police traffic stops, leveraging a large corpus of Los Angeles Police Department (LAPD) body‑worn camera (BWC) footage. The authors assembled a novel dataset of roughly 1,000 traffic‑stop videos recorded between September 2021 and September 2022, and recruited annotators from three distinct backgrounds: police‑affiliated individuals (GPA), justice‑system‑impacted individuals (GJI), and non‑affiliated community members (GNA). Each video was annotated with a 1‑to‑5 respect rating (1 = very disrespectful, 5 = very respectful) for both the officer and the driver, together with a free‑text rationale explaining the rating. In total, 1,362 rationales were collected, with group‑level statistics showing modest differences in mean ratings and rationale length.
To capture the nuanced, domain‑specific criteria that inform respect judgments, the authors built a “respect rubric” grounded in procedural justice theory, LAPD training manuals, extensive fieldwork (surveys, focus groups, ride‑alongs), and prior literature. The rubric organizes respect into three overlapping core categories—Emotions, Professionalism, and Communication—plus a set of Contextual Moderators that capture situational factors (e.g., threats, environmental noise). Each category lists concrete positive and negative behaviors (e.g., warmth vs. anger in Emotions; proper greetings vs. abrupt commands in Professionalism). This rubric serves two purposes: (1) it provides annotators with clear guidance, and (2) it enables systematic, interpretable evaluation of rationales.
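The rubric's structure lends itself to a simple machine-readable form. The sketch below is illustrative only: the four category names come from the summary above, but the specific behavior entries and the flattening into K binary dimensions are assumptions, not the paper's actual rubric contents.

```python
# Illustrative encoding of the respect rubric. Category names follow the
# paper; the behavior entries under each polarity are hypothetical examples.
RUBRIC = {
    "Emotions": {
        "positive": ["warmth", "calmness"],
        "negative": ["anger", "impatience"],
    },
    "Professionalism": {
        "positive": ["proper_greeting", "clear_instructions"],
        "negative": ["abrupt_commands"],
    },
    "Communication": {
        "positive": ["active_listening"],
        "negative": ["interrupting"],
    },
    "Contextual Moderators": {
        "positive": ["acknowledges_threat"],
        "negative": ["ignores_environment"],
    },
}

def rubric_dimensions(rubric):
    """Flatten the rubric into an ordered list of K binary dimensions."""
    dims = []
    for category, polarities in rubric.items():
        for polarity, behaviors in polarities.items():
            for behavior in behaviors:
                dims.append(f"{category}/{polarity}/{behavior}")
    return dims

DIMS = rubric_dimensions(RUBRIC)
K = len(DIMS)  # total number of binary rubric dimensions
```

Flattening the rubric this way gives each rationale a fixed-length binary vector, which is what the LLM judge described next operates on.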
For automated evaluation, the authors implemented an LLM‑as‑a‑judge system using LLaMA‑3‑70B. Given a rationale, the judge produces a binary activation vector over the K rubric dimensions, indicating which respect‑related elements are present. Generated rationales are compared to reference rationales via macro‑averaged precision, recall, and F1 on these vectors, yielding a rubric‑based quality metric that aligns closely with human judgments.
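The rubric-based metric described above can be sketched as a macro-average over the K binary dimensions. This is a minimal stand-alone implementation assuming one activation vector per rationale; it is not taken from the paper's code.

```python
def macro_prf1(preds, refs):
    """Macro-averaged precision, recall, and F1 over K rubric dimensions.

    preds, refs: lists of equal-length binary vectors (one per rationale),
    where element k indicates whether rubric dimension k is activated.
    Per-dimension scores are computed across rationales, then averaged.
    """
    K = len(refs[0])
    precisions, recalls, f1s = [], [], []
    for k in range(K):
        tp = sum(p[k] and r[k] for p, r in zip(preds, refs))
        fp = sum(p[k] and not r[k] for p, r in zip(preds, refs))
        fn = sum(not p[k] and r[k] for p, r in zip(preds, refs))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    return sum(precisions) / K, sum(recalls) / K, sum(f1s) / K
```

Macro-averaging weights each rubric dimension equally, so rarely activated dimensions (e.g., contextual moderators) count as much as common ones.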
The core modeling contribution is a “perspective‑aware” framework. Each data point consists of a conversation transcript (the prompt) conditioned on an annotator’s group label and demographic attributes. The model is trained to (1) predict the annotator‑specific respect rating and (2) generate a rationale that reflects that annotator’s perspective. To improve alignment, the authors synthesize a rubric‑grounded preference dataset: a Generator module samples candidate rationales from target models, an Augmenter paraphrases human rationales, and the Judge module filters candidates based on rubric alignment. This preference data is used to fine‑tune the model with reinforcement‑style alignment.
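The preference-data pipeline above can be sketched as follows. Here `generate_candidates`, `paraphrase`, and `judge_activations` are hypothetical stand-ins for the Generator, Augmenter, and Judge LLM calls, and pairing the best-aligned against the worst-aligned candidate by rubric F1 is an illustrative assumption about the filtering step.

```python
def f1(pred, ref):
    """F1 between two binary rubric-activation vectors."""
    tp = sum(a and b for a, b in zip(pred, ref))
    denom = sum(pred) + sum(ref)
    return 2 * tp / denom if denom else 0.0

def build_preference_pairs(transcript, human_rationale,
                           generate_candidates, paraphrase, judge_activations):
    """Construct one (chosen, rejected) preference pair for a transcript.

    The three callables are stand-ins for the paper's Generator, Augmenter,
    and Judge modules; only the pairing logic is implemented here.
    """
    ref_vec = judge_activations(human_rationale)       # Judge: reference vector
    candidates = generate_candidates(transcript)       # Generator: model samples
    candidates.append(paraphrase(human_rationale))     # Augmenter: paraphrase
    scored = sorted(candidates,
                    key=lambda c: f1(judge_activations(c), ref_vec),
                    reverse=True)
    # Best-aligned candidate becomes "chosen", worst-aligned "rejected".
    return {"prompt": transcript, "chosen": scored[0], "rejected": scored[-1]}
```

Pairs in this (prompt, chosen, rejected) form are the standard input format for preference-based fine-tuning methods.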
Empirical results show consistent gains across all three groups. Rating prediction error (MAE) improves by roughly 12‑15 % over baselines that ignore annotator perspective, and rubric‑F1 scores for generated rationales increase by about 8 %. Analysis of group‑specific behavior reveals that GJI annotators weight emotional cues (e.g., empathy, calmness) more heavily, while GPA annotators emphasize procedural professionalism (e.g., proper greetings, clear instructions). The rubric‑based judge agrees with human evaluation at κ ≈ 0.84, demonstrating its reliability and potential to reduce annotation costs.
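For reference, the κ reported above is Cohen's kappa, which corrects raw agreement for chance. Below is a minimal implementation for two binary label sequences (e.g., judge vs. human activations on one rubric dimension); the paper's exact aggregation across dimensions may differ.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length binary label sequences."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa1 = sum(a) / n                             # rate of label 1, rater a
    pb1 = sum(b) / n                             # rate of label 1, rater b
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)       # chance agreement
    return (po - pe) / (1 - pe) if pe != 1 else 1.0
```

A κ near 0.84 indicates agreement well beyond chance, which is why the authors treat the LLM judge as a viable substitute for costly human evaluation.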
In summary, the paper makes three major contributions: (1) a publicly released, multi‑annotator traffic‑stop dataset with respect ratings and rationales; (2) a theory‑driven, domain‑specific respect rubric and an LLM‑based automatic evaluator; (3) a perspective‑aware language model that can predict personalized respect scores and generate group‑aligned explanations. The work highlights the importance of modeling subjective social constructs as multi‑perspective phenomena rather than forcing a single “ground truth,” and it offers a concrete tool for law‑enforcement agencies to understand and address community‑specific expectations, thereby supporting trust‑building and procedural legitimacy. Future directions include extending the approach to other police‑civilian interactions, applying it to different jurisdictions, and integrating real‑time feedback mechanisms for on‑the‑ground use.