Exploring Models and Data for Image Question Answering
This work aims to address the problem of image-based question-answering (QA) with new models and datasets. In our work, we propose to use neural networks and visual semantic embeddings, without intermediate stages such as object detection and image segmentation, to predict answers to simple questions about images. Our model performs 1.8 times better than the only published results on an existing image QA dataset. We also present a question generation algorithm that converts image descriptions, which are widely available, into QA form. We used this algorithm to produce an order-of-magnitude larger dataset with more evenly distributed answers. A suite of baseline results on this new dataset is also presented.
💡 Research Summary
The paper tackles the problem of image‑based question answering (Image QA) by introducing both a new family of neural models and a large‑scale automatically generated dataset.
Model design. The authors build on a pre‑trained 19‑layer VGG network (ImageNet) and extract the 4096‑dimensional activation from the last fully‑connected layer. This visual vector is linearly projected into a 300‑ or 500‑dimensional space that matches the dimensionality of the word embeddings. The projected image vector is then treated as a “word” in the question sequence and fed to a recurrent neural network. The main architecture, called VIS + LSTM, inserts the image at the beginning (or end) of the token stream and runs a standard LSTM; a variant, 2‑VIS + BLSTM, uses two image insertions (start and end) and a bidirectional LSTM. Additional baselines include a multinomial logistic regression over concatenated image features and a bag‑of‑words question vector (IMG + BOW), a blind bag‑of‑words classifier (BOW), a blind LSTM that sees only the question, a “deaf” image‑only classifier (IMG), an image‑plus‑prior model that combines visual predictions with empirical object‑color priors, and a K‑nearest‑neighbor method that matches both image and text features. All models output a single‑word answer via a softmax layer, turning the task into a multi‑class classification problem.
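The image-as-first-word idea can be made concrete with a small NumPy sketch. This is a minimal illustration of the VIS + LSTM forward pass, not the authors' implementation: all weights are random stand-ins for trained parameters, and the dimensions (except the 4096‑d image feature and 500‑d embedding) are toy values chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the paper uses 4096-d VGG features and 300/500-d embeddings.
IMG_DIM, EMB_DIM, HID_DIM, VOCAB, ANSWERS = 4096, 500, 64, 100, 10

# Hypothetical random parameters standing in for trained weights.
W_proj = rng.normal(0, 0.01, (EMB_DIM, IMG_DIM))   # image -> embedding space
E = rng.normal(0, 0.01, (VOCAB, EMB_DIM))          # word embedding table
Wx = rng.normal(0, 0.01, (4 * HID_DIM, EMB_DIM))   # LSTM input weights
Wh = rng.normal(0, 0.01, (4 * HID_DIM, HID_DIM))   # LSTM recurrent weights
b = np.zeros(4 * HID_DIM)
W_out = rng.normal(0, 0.01, (ANSWERS, HID_DIM))    # softmax answer classifier

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c):
    # One standard LSTM cell update (input, forget, output, candidate gates).
    z = Wx @ x + Wh @ h + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def vis_lstm(img_feat, question_ids):
    # Treat the projected image vector as the first "word" of the question.
    tokens = [W_proj @ img_feat] + [E[t] for t in question_ids]
    h = c = np.zeros(HID_DIM)
    for x in tokens:
        h, c = lstm_step(x, h, c)
    logits = W_out @ h
    p = np.exp(logits - logits.max())
    return p / p.sum()   # distribution over single-word answers

probs = vis_lstm(rng.normal(size=IMG_DIM), [5, 17, 42])
print(probs.shape, float(probs.sum()))   # -> (10,) 1.0
```

The 2‑VIS + BLSTM variant would simply append a second projected image vector after the last question token and run the sequence through a bidirectional LSTM instead.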
Dataset creation. Recognizing that the only publicly available Image QA corpus (DAQUAR) is small (≈1.4 k images, ≈12.5 k QA pairs) and heavily biased toward a few answer classes, the authors devise an automatic question‑generation pipeline that converts image captions into QA pairs. Using the Stanford parser, they split compound sentences, replace indefinite articles with definite ones, and apply WH‑movement rules to move interrogative words to the front. Four question types are generated: (1) object questions (“What …?”) by replacing a noun with “what”, (2) number questions (“How many …?”) by extracting numerals, (3) color questions (“What is the color of …?”) by locating color adjectives, and (4) location questions (“Where is …?”) by focusing on prepositional phrases beginning with “in”. WordNet and NLTK provide noun and adjective categories; post‑processing removes overly frequent or extremely rare answers, reducing the mode frequency from 24.98 % to 7.30 % in the test split. The resulting COCO‑QA dataset contains 117 684 QA pairs (78 736 for training, 38 948 for testing), with an average question length of 9.65 tokens and a maximum of 55 tokens. Answers are restricted to a single word, enabling straightforward accuracy and WUPS evaluation.
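To make the generation rules concrete, here is a deliberately simplified sketch of the color‑question rule. The real pipeline relies on the Stanford parser and WordNet for part‑of‑speech and category decisions; this toy version just scans for a color adjective from a small hand‑made list and takes the word that follows it as the noun, so the function name and word list are illustrative assumptions, not the paper's code.

```python
import re

# Toy color list; the paper's pipeline uses WordNet/NLTK categories instead.
COLORS = {"red", "blue", "green", "yellow", "black", "white", "brown"}

def color_question(caption):
    """Simplified color-question rule: find a color adjective, take the
    word that follows it as the noun, and emit a QA pair whose answer is
    the color word itself."""
    tokens = re.findall(r"[a-z]+", caption.lower())
    for i, tok in enumerate(tokens):
        if tok in COLORS and i + 1 < len(tokens):
            noun = tokens[i + 1]
            return ("what is the color of the %s ?" % noun, tok)
    return None   # no color adjective found -> no question generated

print(color_question("A red umbrella stands on the sandy beach."))
# -> ('what is the color of the umbrella ?', 'red')
```

A parser-based implementation would additionally handle adjectives separated from their noun, compound sentences, and article replacement, which is exactly what the WH‑movement and splitting rules above provide.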
Experiments and results. The authors evaluate on both DAQUAR (restricted to single‑word answers) and COCO‑QA using two metrics: plain accuracy and Wu‑Palmer similarity (WUPS) at thresholds 0.9 and 0.0. On DAQUAR, VIS + LSTM achieves 34.41 % accuracy and 46.05 % WUPS@0.9, well above the 12.73 % accuracy of the previously published Multi‑World approach. On COCO‑QA, the recurrent models reach the mid‑50 % accuracy range (53.31 % for VIS + LSTM, with the bidirectional 2‑VIS + BLSTM slightly higher), and the IMG + BOW logistic regression is a surprisingly strong competitor. On DAQUAR, the blind models (BOW, LSTM) obtain around 32 % accuracy, while IMG + BOW, which combines visual and bag‑of‑words question features, scores 34.17 %, confirming that visual information helps but that the best performance comes from joint visual‑linguistic modeling. The K‑nearest‑neighbor baseline (k = 31) performs poorly (≈31 % accuracy), indicating that simple memorization of training examples is insufficient. Notably, the mode‑guessing baseline drops to only about 7 % on COCO‑QA, demonstrating that the dataset’s answer distribution is far better balanced than DAQUAR’s and that models cannot rely on trivial priors.
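The WUPS metric deserves a brief illustration. For single‑word answers it reduces to the Wu‑Palmer similarity between the predicted and ground‑truth words, down‑weighted by a factor of 0.1 when the similarity falls below the threshold (following Malinowski & Fritz); at threshold 0.0 every similarity counts fully, which is why WUPS@0.0 is always the most lenient score. The sketch below assumes a toy similarity table in place of WordNet's actual `wup_similarity`, so the numbers are illustrative only.

```python
def wups_score(pred, truth, sim, threshold):
    """Single-word WUPS: Wu-Palmer similarity of predicted vs. ground-truth
    word, scaled by 0.1 when it falls below the threshold."""
    s = 1.0 if pred == truth else sim(pred, truth)
    return s if s >= threshold else 0.1 * s

def wups(preds, truths, sim, threshold):
    # Average the per-example scores over the dataset.
    return sum(wups_score(p, t, sim, threshold)
               for p, t in zip(preds, truths)) / len(preds)

# Toy similarity table standing in for WordNet's Wu-Palmer similarity.
SIM = {frozenset(("cat", "kitten")): 0.9, frozenset(("cat", "truck")): 0.3}
toy_sim = lambda a, b: SIM.get(frozenset((a, b)), 0.0)

print(wups(["cat", "cat"], ["kitten", "truck"], toy_sim, 0.9))
# -> 0.465  (0.9 passes the 0.9 threshold; 0.3 is scaled to 0.03)
```

At threshold 0.0 the same two predictions would score (0.9 + 0.3) / 2 = 0.6, showing how the 0.9 threshold penalizes loosely related answers much more sharply.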
Contributions and impact. The paper’s main contributions are: (1) a simple yet effective end‑to‑end architecture that maps image features into the same embedding space as words and feeds them directly to an LSTM, eliminating the need for intermediate detection or segmentation; (2) an automatic, linguistically motivated pipeline for turning image captions into high‑quality QA pairs, yielding a dataset an order of magnitude larger than previous resources and with a much flatter answer distribution; (3) a thorough empirical comparison of multiple baselines, showing that joint visual‑language models substantially outperform both blind and deaf alternatives. The work paves the way for future research on multi‑word answer generation, attention mechanisms that focus on relevant image regions, and more complex reasoning (e.g., relational or logical inference) in Image QA. The released code and dataset (GitHub, COCO‑QA download link) provide a solid foundation for the community to build upon.