Deep Learning Applied to Image and Text Matching

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

The ability to describe images with natural language sentences is a hallmark of image and language understanding. Such a system has wide-ranging applications, such as annotating images and using natural sentences to search for images. In this project we focus on the task of bidirectional image retrieval: such a system is capable of retrieving an image based on a sentence (image search) and retrieving a sentence based on an image query (image annotation). We present a system based on a global ranking objective function which uses a combination of convolutional neural networks (CNNs) and multilayer perceptrons (MLPs). It takes an image-sentence pair, processes the two inputs in separate channels, and finally embeds them into a common multimodal vector space. These embeddings encode abstract semantic information about the two inputs and can be compared using traditional information retrieval approaches. For each such pair, the model returns a score which is interpreted as a similarity metric: if the score is high, the image and sentence are likely to convey similar meaning; if it is low, they likely do not. The visual input is modeled with a deep convolutional neural network, while for the textual module we explore three models. The first is bag-of-words with an MLP. The second uses n-grams (bigrams, trigrams, and a combination of trigrams and skip-grams) with an MLP. The third is a more specialized deep network designed for modeling variable-length sequences (SSE). We report performance comparable to recent work in the field, even though our overall model is simpler. We also show that the choice, at training time, of how negative samples are generated has a significant impact on performance and can be used to specialize the bidirectional system for one particular task.


💡 Research Summary

The paper presents a deep‑learning framework for bidirectional image‑text retrieval, i.e., retrieving images given a natural‑language query (image search) and retrieving textual descriptions given an image (image annotation). The core idea is to map both modalities into a shared multimodal vector space where similarity can be measured with a simple cosine score.
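Once both modalities live in the shared space, scoring a pair reduces to a cosine similarity between the two embeddings. A minimal sketch of that comparison (function name `cosine_score` is illustrative, not from the paper):

```python
import numpy as np

def cosine_score(img_vec, txt_vec):
    """Cosine similarity between an image embedding and a sentence
    embedding that live in the same multimodal space."""
    img = img_vec / np.linalg.norm(img_vec)
    txt = txt_vec / np.linalg.norm(txt_vec)
    return float(np.dot(img, txt))
```

A high score indicates the image and sentence likely convey similar meaning; a low score indicates a mismatch.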

Visual branch – A pre‑trained convolutional neural network (CNN) such as VGG‑16 extracts high‑level visual features. The final convolutional feature map is globally pooled and projected to a 1,024‑dimensional L2‑normalized vector.
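The pool-then-project step described above can be sketched as follows; this is a simplified stand-in for the real CNN pipeline, with hypothetical parameters `W` and `b` for the learned projection:

```python
import numpy as np

def visual_embed(feature_map, W, b):
    """Sketch of the visual branch: globally average-pool a CNN feature
    map of shape (C, H, W), project it into the joint space with a
    learned linear layer, and L2-normalize the result."""
    pooled = feature_map.mean(axis=(1, 2))   # (C,) global average pool
    proj = W @ pooled + b                    # project to embedding dim
    return proj / np.linalg.norm(proj)       # L2-normalize
```

With VGG-16-style features, `feature_map` would be the last convolutional activation (e.g. 512 channels) and `W` a 1024 × 512 matrix.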

Textual branch – Three alternative encoders are explored:

  1. Bag‑of‑Words (BOW) + MLP – A TF‑IDF weighted word‑frequency vector (up to 10 k vocabulary) is fed into a two‑layer multilayer perceptron (2048 → 1024) to obtain a dense representation.
  2. n‑gram + MLP – Bigrams, trigrams, and skip‑grams are concatenated (≈20 k dimensions) and processed with the same MLP architecture, thereby injecting limited word‑order information.
  3. Sequence‑Specific Encoder (SSE) – Word embeddings (300‑dim) are passed through several 1‑D convolutional filters of varying kernel sizes (3, 5, 7), followed by ReLU and max‑pooling, producing a fixed‑size vector that captures local n‑gram patterns without recurrent connections.
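The simplest of the three text encoders (BOW + MLP) can be sketched as a count vector over a fixed vocabulary followed by a two-layer MLP; the weights `W1`, `b1`, `W2`, `b2` are hypothetical placeholders for the learned parameters:

```python
import numpy as np

def bow_encode(tokens, vocab):
    """Bag-of-words count vector over a fixed vocabulary
    (out-of-vocabulary words are dropped)."""
    v = np.zeros(len(vocab))
    for t in tokens:
        if t in vocab:
            v[vocab[t]] += 1
    return v

def mlp_embed(x, W1, b1, W2, b2):
    """Two-layer MLP (ReLU hidden layer) mapping the sparse text
    vector into the joint space, followed by L2 normalization."""
    h = np.maximum(0.0, W1 @ x + b1)
    z = W2 @ h + b2
    return z / (np.linalg.norm(z) + 1e-12)
```

The n-gram variant uses the same MLP but concatenates bigram, trigram, and skip-gram count features into a wider input vector.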

Training objective – A global ranking (hinge) loss is employed. For each positive image‑sentence pair, multiple negative pairs are sampled within the mini‑batch. The loss forces the cosine similarity of the positive pair to exceed that of any negative pair by a margin. Three negative‑sampling strategies are examined: random sampling, hard‑negative mining (selecting the highest‑scoring negatives), and context‑aware sampling. Experiments demonstrate that hard‑negative mining yields the most substantial performance gains, especially for the text‑to‑image direction.
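The in-batch hinge objective above can be sketched over a batch similarity matrix, with positives on the diagonal and every off-diagonal entry treated as a negative (margin value is illustrative):

```python
import numpy as np

def ranking_loss(S, margin=0.2):
    """Bidirectional hinge ranking loss for a batch similarity matrix S,
    where S[i, j] scores image i against sentence j and the diagonal
    holds the ground-truth positive pairs."""
    n = S.shape[0]
    pos = np.diag(S)
    # image -> sentence: each off-diagonal sentence is a negative
    cost_i2t = np.maximum(0.0, margin + S - pos[:, None])
    # sentence -> image: each off-diagonal image is a negative
    cost_t2i = np.maximum(0.0, margin + S - pos[None, :])
    np.fill_diagonal(cost_i2t, 0.0)
    np.fill_diagonal(cost_t2i, 0.0)
    return float(cost_i2t.sum() + cost_t2i.sum()) / n
```

The loss is zero once every positive pair outscores all in-batch negatives by the margin; swapping the mean over all negatives for the single hardest one gives the hard-negative variant.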

Datasets and preprocessing – Experiments are conducted on Flickr8K, Flickr30K, and an internally curated OverFeat‑based dataset. Images are resized to 224 × 224, mean‑subtracted, and passed through the CNN. Text is lower‑cased, punctuation‑removed, and tokenized; a vocabulary of up to 10 k words is built.
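The text normalization steps (lower-casing, punctuation removal, tokenization) amount to something like this minimal sketch:

```python
import string

def preprocess(sentence):
    """Lower-case, strip ASCII punctuation, and whitespace-tokenize."""
    s = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return s.split()
```

The resulting tokens are then mapped against the 10k-word vocabulary, with out-of-vocabulary tokens discarded.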

Evaluation metrics – Retrieval quality is measured by Recall@K (R@1, R@5, R@10) and Median Rank for both Image‑to‑Text (I2T) and Text‑to‑Image (T2I) tasks.
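Both metrics can be computed directly from a test-set similarity matrix; a minimal sketch, assuming ground-truth matches sit on the diagonal:

```python
import numpy as np

def recall_at_k(S, k):
    """Recall@K for image->text retrieval: fraction of queries whose
    ground-truth match (diagonal) ranks in the top K."""
    n = S.shape[0]
    hits = sum(int(i in np.argsort(-S[i])[:k]) for i in range(n))
    return hits / n

def median_rank(S):
    """Median rank (1-based) of the ground-truth match per query."""
    ranks = [int(np.where(np.argsort(-S[i]) == i)[0][0]) + 1
             for i in range(S.shape[0])]
    return float(np.median(ranks))
```

Transposing `S` gives the same metrics for the text-to-image direction.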

Results – The simple BOW‑MLP baseline achieves R@1≈30 % for I2T. Adding n‑gram features improves R@1 by about 5 percentage points. The SSE model attains the highest scores (R@1≈42 % for I2T, 40 % for T2I) and the lowest median rank, matching or slightly surpassing the more complex CNN‑RNN models of Karpathy & Fei‑Fei (2014) while using roughly 40 % fewer parameters. Training converges within 30 epochs using Adam (lr = 1e‑4). Inference time is ~8 ms per pair on a modern GPU, making the system suitable for real‑time applications.

Analysis of negative sampling – Random negatives provide a baseline, but hard‑negative mining consistently adds 4–5 % absolute improvement in Recall@1. The authors argue that the choice of negative sampling effectively steers the model toward a particular task (search vs. annotation) and can be used to specialize the system without architectural changes.
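Hard-negative mining within a mini-batch reduces to picking, for each query, the highest-scoring non-matching item; a minimal sketch:

```python
import numpy as np

def hardest_negatives(S):
    """For each image (row of S), return the index of the highest-scoring
    non-matching sentence in the mini-batch (diagonal = positives)."""
    masked = S.copy().astype(float)
    np.fill_diagonal(masked, -np.inf)  # exclude the positive pair
    return masked.argmax(axis=1)
```

These indices then replace the random negatives inside the ranking loss; applying the same selection to `S.T` specializes the model toward the text-to-image direction instead.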

Strengths

  • Simplicity: The architecture avoids recurrent networks, reducing implementation complexity and computational load.
  • Systematic comparison of three textual encoders, highlighting the trade‑off between simplicity (BOW) and expressive power (SSE).
  • Empirical study of negative‑sampling strategies, showing their practical impact on retrieval performance.

Limitations

  • Absence of recurrent or attention mechanisms limits the ability to capture long‑range dependencies in sentences.
  • The model focuses solely on retrieval; it does not address caption generation, which would require a decoder.
  • Evaluation is limited to standard Flickr datasets; scalability to larger, noisier web‑scale corpora is not demonstrated.

Future directions – The authors suggest integrating deeper text encoders such as Transformers or bidirectional LSTMs, adding multi‑head attention to better align visual and textual regions, and extending the framework to a multitask setting that includes caption generation and visual question answering.

In summary, the paper delivers a concise yet effective deep‑learning solution for bidirectional image‑text retrieval, showing that a well‑designed global ranking loss combined with thoughtful negative sampling can achieve competitive results without resorting to heavyweight recurrent architectures.

