SentenceRacer: A Game with a Purpose for Image Sentence Annotation


Recently, datasets that contain sentence descriptions of images have enabled models that can automatically generate image captions. However, collecting these datasets is still very expensive. Here, we present SentenceRacer, an online game that gathers and verifies descriptions of images at no cost. Similar to the game Hangman, players compete to uncover words in a sentence that ultimately describes an image. SentenceRacer both generates sentences and verifies that they are accurate descriptions. We show that SentenceRacer generates annotations of higher quality than those generated on Amazon Mechanical Turk (AMT).


💡 Research Summary

The paper introduces SentenceRacer, an online multiplayer game designed to collect and verify natural‑language descriptions of images at virtually no monetary cost. The motivation stems from the high expense of building large‑scale image‑caption datasets such as Microsoft COCO or Flickr30K, which traditionally rely on Amazon Mechanical Turk (AMT) for both generation and verification of captions. SentenceRacer adapts the classic “Hangman” mechanic: in each round a designated “leader” writes a sentence describing a displayed image, after which stop‑words are removed and the remaining words are hidden. The other players act as guessers, typing words they think belong in the hidden sentence. Correct guesses reveal the word’s position in the sentence and award points to both the guesser and the leader. This dual‑reward system incentivizes leaders to craft detailed, easily guessable sentences, while simultaneously providing an implicit verification signal: if all words are eventually guessed, the sentence is considered verified without any external annotator.

System Design and Gameplay

  • Minimum of three participants per game.
  • Leader writes a full sentence; stop‑words are stripped.
  • Guessers see each other’s guesses in real time but cannot communicate directly with the leader.
  • Time‑limited rounds; each correct guess yields points and reveals the word.
  • A sentence is automatically verified when every hidden word has been guessed.
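The round mechanic above can be sketched in a few lines of Python. This is an illustrative sketch only: the class and method names, the scoring of one point per revealed word, and the stop‑word list are our own assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a SentenceRacer round: stop-words stay visible,
# content words are hidden, correct guesses reveal positions and score
# points for both guesser and leader.

STOP_WORDS = {"a", "an", "the", "is", "are", "of", "on", "in", "and", "with"}

class Round:
    def __init__(self, sentence: str):
        self.tokens = sentence.lower().split()
        # Indices of still-hidden content words.
        self.hidden = {i for i, w in enumerate(self.tokens) if w not in STOP_WORDS}
        self.scores = {}  # player name -> points

    def guess(self, player: str, word: str) -> list:
        """Reveal every hidden position matching `word`; award one point
        per revealed word to both the guesser and the leader."""
        hits = [i for i in self.hidden if self.tokens[i] == word.lower()]
        for i in hits:
            self.hidden.discard(i)
            self.scores[player] = self.scores.get(player, 0) + 1
            self.scores["leader"] = self.scores.get("leader", 0) + 1
        return hits

    def verified(self) -> bool:
        # The game's verification rule: all hidden words have been guessed.
        return not self.hidden

    def board(self) -> str:
        # What guessers see: blanks for hidden words, stop-words in the clear.
        return " ".join("_" * len(w) if i in self.hidden else w
                        for i, w in enumerate(self.tokens))
```

For example, with the leader sentence "a dog is chasing the red ball", guessers initially see `a ___ is _______ the ___ ____`, and the round verifies once `dog`, `chasing`, `red`, and `ball` have all been guessed.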

Data Collection Protocol
The authors recruited ten groups of four volunteers each. Each group played the game on ten randomly selected COCO images, repeating the process ten times per image. This yielded 49 fully verified sentences (i.e., all words guessed). To assess external validity, these sentences were sent to AMT where three independent workers judged whether each sentence accurately described the image; a majority vote (≥2/3) determined AMT verification.
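The AMT verification rule used for external validation reduces to a simple majority vote, sketched below (the function name and signature are our own):

```python
def amt_verified(judgments, threshold=2):
    """Majority vote over independent binary worker judgments:
    a sentence passes if at least `threshold` of the workers
    (here 2 of 3) judged it an accurate description."""
    return sum(judgments) >= threshold
```

So `amt_verified([True, True, False])` passes, while `amt_verified([True, False, False])` does not.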

Key Findings

  1. Verification Overlap – 87.8 % of sentences verified by the game were also verified by AMT, whereas only 54.9 % of the non‑verified game sentences received AMT approval.
  2. Higher Verification Rate – On the same set of images, the proportion of verified sentences from SentenceRacer (87.8 %) slightly exceeded that of pure AMT collection (85.5 %).
  3. Correlation with Remaining Blanks – As the number of unrevealed words (“blanks”) decreased, the likelihood of AMT verification rose sharply (e.g., 0 blanks → 87.8 % verified vs. 4 blanks → 42.8 %). This suggests the game’s internal verification is a reliable proxy for human judgment.
  4. Sentence Quality – Quality was operationalized as the count of distinct objects, object‑object relationships, and object attributes mentioned per caption. SentenceRacer captions averaged 2.98 objects, 1.88 relationships, and 1.45 attributes, all significantly higher than AMT captions (2.30, 1.02, 1.17 respectively; p < 0.01 for objects, p < 0.001 for relationships, marginal for attributes). The reward structure appears to push players toward richer, more informative descriptions.
  5. User Experience – Survey responses indicated participants found SentenceRacer more fun and engaging than a standard AMT captioning task, attributing enjoyment to the social interaction and fast‑paced nature of the game.
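The proxy analysis behind finding 3 amounts to grouping sentences by the number of blanks remaining at round end and computing the AMT verification rate within each group. A minimal sketch, with our own function name and made‑up toy records (not the paper's data):

```python
from collections import defaultdict

def rate_by_blanks(records):
    """records: iterable of (blanks_remaining, amt_verified) pairs.
    Returns {blanks_remaining: fraction AMT-verified}."""
    counts = defaultdict(lambda: [0, 0])  # blanks -> [verified, total]
    for blanks, ok in records:
        counts[blanks][0] += int(ok)
        counts[blanks][1] += 1
    return {b: v / t for b, (v, t) in sorted(counts.items())}
```

Running this over the study's annotations would reproduce the trend reported above, with the verification rate falling as the number of unrevealed blanks grows.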

Limitations and Future Directions

  • Group Size Requirement – The need for at least three concurrent players may limit scalability in settings where real‑time participation is hard to guarantee.
  • Overly Long Sentences – Leaders might be tempted to write overly long sentences to maximize points, risking player fatigue or reduced guessing accuracy.
  • Verification Granularity – The binary “all words guessed” criterion does not capture grammatical correctness or subtle semantic errors; additional linguistic checks could be integrated.
  • Diversity of Language – The authors propose introducing a list of taboo words to force lexical variety, and exploiting idle time between rounds to crowdsource auxiliary tasks such as grounding mentioned objects within the image.

Conclusion
SentenceRacer demonstrates that a well‑designed gamified interface can simultaneously generate and validate image captions, achieving comparable or superior quality to traditional paid crowdsourcing while delivering a more enjoyable participant experience. By turning the verification step into an inherent part of gameplay, the system reduces overall annotation cost and opens avenues for richer multimodal datasets. The authors suggest extending the framework to incorporate diversity‑enhancing constraints and to multiplex additional annotation tasks during gameplay, potentially further amplifying the utility of this approach for computer‑vision research.

