Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images
In this paper, we address the task of learning novel visual concepts, and their interactions with other concepts, from a few images with sentence descriptions. Using linguistic context and visual features, our method is able to efficiently hypothesize the semantic meaning of new words and add them to its word dictionary so that they can be used to describe images which contain these novel concepts. Our method has an image captioning module based on m-RNN with several improvements. In particular, we propose a transposed weight sharing scheme, which not only improves performance on image captioning, but also makes the model more suitable for the novel concept learning task. We propose methods to prevent overfitting the new concepts. In addition, three novel concept datasets are constructed for this new task. In the experiments, we show that our method effectively learns novel visual concepts from a few examples without disturbing the previously learned concepts. The project page is http://www.stat.ucla.edu/~junhua.mao/projects/child_learning.html
💡 Research Summary
The paper introduces a novel task called Novel Visual Concept learning from Sentences (NVCS), which aims to enable an image captioning system to acquire new visual concepts and their linguistic labels from only a handful of images paired with sentence descriptions. Traditional captioning models rely on a fixed vocabulary learned from large datasets; adding a new word typically requires retraining the entire network, which is computationally expensive and often impractical when the original training data are unavailable.
To address this, the authors build on the state‑of‑the‑art m‑RNN captioning architecture with two key modifications. First, they replace the original recurrent layer with an LSTM to better capture long‑range dependencies. Second, they propose a Transposed Weight Sharing (TWS) scheme that dramatically reduces the number of parameters. In the original m‑RNN, most parameters reside in two large matrices: U_D (the word embedding, mapping one‑hot vectors to dense vectors) and U_M (the decoder, mapping the multimodal space back to the vocabulary). TWS factors U_M as U_D^T · U_I, tying the decoder weights to the transpose of the embedding matrix. This sharing cuts the parameter count roughly in half while permitting a larger embedding dimension, which yields richer semantic representations without overfitting.
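The weight-tying idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the dimensions, the `embed`/`decode_logits` helper names, and the initialization are illustrative assumptions. The point is that the large decoder matrix U_M is never materialized: the softmax logits reuse the transpose of the embedding matrix U_D, so only the much smaller projection U_I is added.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_emb, d_mm = 1000, 512, 1024  # illustrative sizes

# U_D: word-embedding matrix (d_emb x vocab), shared with the decoder.
U_D = rng.normal(0.0, 0.01, (d_emb, vocab))
# U_I: projects the multimodal activation back into embedding space.
U_I = rng.normal(0.0, 0.01, (d_emb, d_mm))
b = np.zeros(vocab)

def embed(word_id):
    # Encoding: a column lookup in U_D (equivalent to U_D @ one_hot).
    return U_D[:, word_id]

def decode_logits(m):
    # Decoding with TWS: logits = U_D.T @ (U_I @ m) + b.
    # The untied decoder U_M (vocab x d_mm) is replaced by
    # U_I (d_emb x d_mm) plus the reused U_D.
    return U_D.T @ (U_I @ m) + b

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

m = rng.normal(size=d_mm)          # a multimodal-layer activation
p = softmax(decode_logits(m))      # distribution over the vocabulary

# Parameter savings: an untied decoder would cost vocab * d_mm weights;
# tying leaves only d_emb * d_mm new weights for the decoder side.
untied_params = vocab * d_mm
tied_params = d_emb * d_mm
```

Note how the savings grow with vocabulary size: the untied decoder scales with `vocab`, while the tied version's extra cost (U_I) is independent of it.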
When a new concept appears (e.g., the word “quidditch”), the model must learn an embedding for the new word and adapt the decoder to it without disturbing the weights already learned for existing words. The authors achieve this by (1) fixing the sub‑matrix U_D_o that corresponds to the original vocabulary and updating only U_D_n, the columns for the new words; and (2) fixing the baseline bias term b_n for the new words, since with only a few training examples the bias would otherwise be over‑estimated. They also centralize the activation of the intermediate multimodal layer, removing any residual influence of U_D on the baseline probability.
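The "update only the new columns" step amounts to masking the gradient of U_D before each optimizer step. The sketch below is a hypothetical NumPy illustration (plain SGD, toy dimensions), not the authors' training code: the columns for the original vocabulary receive a zero gradient and therefore never move.

```python
import numpy as np

rng = np.random.default_rng(1)
d_emb, n_old, n_new = 8, 20, 2  # toy sizes: 20 known words, 2 new words

# U_D holds old-word columns [:, :n_old] and new-word columns [:, n_old:].
U_D = rng.normal(0.0, 0.01, (d_emb, n_old + n_new))

def sgd_step(W, grad, lr=0.1):
    # Zero the gradient for the original vocabulary (U_D_o),
    # so only the new-word columns (U_D_n) are updated.
    masked = grad.copy()
    masked[:, :n_old] = 0.0
    return W - lr * masked

# Pretend `grad` came from backprop on a few novel-concept captions.
grad = rng.normal(size=U_D.shape)
U_D_new = sgd_step(U_D, grad)
```

After the step, the old-word embeddings are bit-for-bit identical to their pretrained values, which is what keeps previously learned concepts intact.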
Three datasets are constructed to evaluate the approach. Two are derived from MS‑COCO: in each, a concept (a novel object and a novel animal, respectively) is held out of the base vocabulary and represented by only a handful of images (5–10 per concept). The third, “Rare‑Concept,” introduces three entities entirely unseen in COCO: quidditch (a fictional sport), a T‑rex, and a samisen (a Japanese three‑stringed instrument). For each dataset, the base model is first trained on the COCO corpus with the novel concepts excluded, then fine‑tuned only on the few new examples using the proposed NVCS procedure.
Experimental results show that the TWS‑enhanced model achieves comparable or slightly better captioning scores (BLEU‑4, METEOR, CIDEr) on the original COCO test set, confirming that the weight sharing does not harm existing performance. More importantly, after learning from just a few examples, the model correctly generates captions that include the new word in an appropriate visual context (e.g., “A group of people playing quidditch with a red ball”). Compared with a baseline that retrains the entire network on the combined data, the proposed method reduces training time by over 70% and eliminates the need to retain the original large training set, while maintaining similar accuracy on both old and new concepts. Qualitative human evaluations further confirm that the generated sentences are natural and semantically coherent.
In summary, the paper contributes (1) a practical framework for incremental vocabulary expansion in multimodal neural networks, (2) a novel transposed weight sharing technique that halves model parameters and enables larger embeddings, (3) strategies to prevent over‑fitting when only a few labeled examples are available, and (4) three publicly released NVCS datasets to foster future research. The work bridges a gap between cognitive studies of child language acquisition and modern deep learning, demonstrating that image captioning systems can learn new visual concepts quickly and efficiently, much like a child does.