Automatic image captioning remains challenging despite recent impressive progress in neural image captioning. In the paper “Adversarial Semantic Alignment for Improved Image Captions,” appearing at the 2019 Conference on Computer Vision and Pattern Recognition (CVPR), we, together with several other IBM Research AI colleagues, address three main challenges in bridging the semantic gap between visual scenes and language in order to produce diverse, creative and human-like captions.
Compositionality and Naturalness
The first challenge stems from the compositional nature of natural language and visual scenes. While the training dataset contains co-occurrences of some objects in their context, a captioning system should be able to generalize by composing objects in other contexts.
Traditional captioning systems suffer from a lack of compositionality and naturalness because they often generate captions sequentially, i.e., the next generated word depends on both the previous word and the image feature. This can frequently lead to syntactically correct but semantically irrelevant language structures, as well as to a lack of diversity in the generated captions. We propose to address the compositionality issue with a context-aware attention captioning model, which allows the captioner to compose sentences based on fragments of the observed visual scene. Specifically, we used a recurrent language model with a gated recurrent visual attention that, at every generation step, gives the model the choice of attending to either visual cues or textual cues from the previous generation step.
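To make the gating idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the model from the paper: the scalar sigmoid gate, the dot-product attention, the function name `gated_context`, and the parameters `gate_weights` and `gate_bias` are all hypothetical, and region features are assumed to share the decoder state's dimensionality.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_context(hidden, regions, gate_weights, gate_bias=0.0):
    """Mix a visual context and a textual context with a scalar gate.

    hidden:       decoder state from the previous generation step (textual cue)
    regions:      one feature vector per image region (visual cues)
    gate_weights: hypothetical learned gate parameters (assumption)
    gate_bias:    hypothetical learned gate bias (assumption)
    """
    # Visual attention: weight each region by its similarity to the decoder state.
    attn = softmax([dot(hidden, r) for r in regions])
    visual = [
        sum(a * r[i] for a, r in zip(attn, regions))
        for i in range(len(hidden))
    ]
    # Gate in (0, 1): near 1 favours the image, near 0 favours the text state.
    gate = 1.0 / (1.0 + math.exp(-(dot(gate_weights, hidden) + gate_bias)))
    context = [gate * v + (1.0 - gate) * h for v, h in zip(visual, hidden)]
    return context, gate, attn


if __name__ == "__main__":
    # Toy inputs for illustration only.
    ctx, gate, attn = gated_context(
        hidden=[1.0, 0.0],
        regions=[[2.0, 0.0], [0.0, 2.0]],
        gate_weights=[0.0, 0.0],
    )
    print(gate, attn, ctx)
```

With zero gate weights the gate sits at 0.5, so the context is an even blend of the attended image regions and the previous textual state. In a trained model the gate would vary from step to step.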
To address the lack of naturalness, we introduce another innovation: using generative adversarial networks (GANs) to train the captioner. A co-attention discriminator scores the “naturalness” of a sentence and its fidelity to the image via a co-attention model that matches fragments of the visual scene to the generated language and vice versa. The co-attention discriminator judges the quality of a caption by scoring the likelihood of the generated words given the image features and vice versa. Note that this scoring is local (at the word and pixel level) rather than at a global representation level. This locality is important in capturing the compositional nature of language and visual scenes. The discriminator's role is not only to ensure that the generated language is human-like; by judging image and sentence pairs at a local level, it also enables the captioner to compose.
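The idea of local, two-directional scoring can be sketched as follows. This is a simplified stand-in, not the discriminator from the paper: it uses fixed dot-product attention and cosine similarity instead of a learned, adversarially trained network. It assumes word and region vectors already live in the same embedding space, and the function names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def attend(queries, keys):
    """For each query, return the attention-weighted average of the keys."""
    out = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])
        out.append([
            sum(w * k[d] for w, k in zip(weights, keys))
            for d in range(len(q))
        ])
    return out


def co_attention_score(words, regions):
    """Average local agreement in both directions, in [-1, 1]."""
    # Each word attends over image regions and is compared to what it finds.
    word_side = [cosine(w, c) for w, c in zip(words, attend(words, regions))]
    # Each region attends over words and is compared the same way.
    region_side = [cosine(r, c) for r, c in zip(regions, attend(regions, words))]
    return 0.5 * (
        sum(word_side) / len(word_side)
        + sum(region_side) / len(region_side)
    )


if __name__ == "__main__":
    # Toy vectors for illustration only.
    regions = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]
    mismatched = [[-1.0, 0.0], [0.0, -1.0]]
    print(co_attention_score(matched, regions))
    print(co_attention_score(mismatched, regions))
```

Because every word and every region contributes its own term, a caption whose fragments each find support in the image scores higher than one that matches only on average. That is the property the local scoring is meant to capture.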
The second challenge is the dataset bias that affects current captioning systems. Trained models overfit to common objects that co-occur in a common context (e.g., bed and bedroom), so they struggle to generalize to scenes where the same objects appear in unseen contexts (e.g., bed and forest). Although reducing dataset bias is itself a challenging, open research problem, we propose a diagnostic tool to quantify how biased a given captioning system is.
Specifically, we created a diagnostic test dataset of captioned images in which common objects occur in unusual scenes (the Out of Context, or OOC, dataset) in order to test the compositional and generalization properties of a captioner. Performance on OOC is a good indicator of a model's generalization: poor performance is a sign that the captioner has overfitted to the training context. We show that GAN-based models with a co-attention discriminator and a context-aware generator generalize better to unseen contexts than previous state-of-the-art methods (see Figure 1).
Evaluation and Turing Test
The third challenge is evaluating the quality of generated captions. Automated metrics, though partially helpful, are still unsatisfactory because they do not take the image into account. In many cases their scoring remains inadequate and sometimes even misleading, especially when scoring diverse and descriptive captions. Human evaluation remains the gold standard for scoring captioning systems. We used a Turing test in which human evaluators were asked whether a given caption is real or machine-generated. The evaluators judged many of the model-generated captions to be real, which indicates that the proposed captioner performs well and promises to be a valuable new approach for automatic image captioning.
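One simple way to summarize such a Turing test is the fraction of machine-generated captions that evaluators labeled as real. The sketch below assumes that summary; the function name `fooled_rate` and the label strings are hypothetical, and the labels in the example are toy values, not results from the paper.

```python
def fooled_rate(judgments):
    """Fraction of machine-generated captions that evaluators labeled 'real'."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(1 for j in judgments if j == "real") / len(judgments)


if __name__ == "__main__":
    # Toy labels for illustration only.
    toy_judgments = ["real", "machine", "real", "machine"]
    print(fooled_rate(toy_judgments))  # 0.5
```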
Progress on automatic image captioning and scene understanding will make computer vision systems more reliable as personal assistants for visually impaired people, helping to improve their day-to-day lives. The semantic gap between language and vision points to the need for incorporating common sense and reasoning into scene understanding.