Weinsz/Image-Captioning

Image Captioning with LSTMs and Attention

Authors: Vincenzo Siano and Kevin Garofalo

  • This project implements and compares two deep learning architectures for image captioning on the COCO 2014 dataset: a baseline Encoder-Decoder model using an LSTM and an advanced version incorporating an Attention mechanism. The entire project is built with TensorFlow and Keras and run on a local setup.

  • The final model is presented through an interactive web interface created with Gradio.

    [Image: model_gradio]
  • This work builds on the theoretical and practical findings of two well-known papers, "Show and Tell" and "Show, Attend and Tell". Computer Vision and NLP have advanced rapidly since those papers were published, as new models and techniques keep raising the achievable performance.

  • A natural next step beyond RNN-based models would be to adopt Transformer architectures. Their self-attention mechanism has redefined the state of the art across multiple domains, and applying them to image captioning would be expected to improve performance substantially.

Project Structure

The repository is organized as follows:

.
├── COCO_dataset.ipynb             # Notebook for exploring the COCO dataset.
├── image_captioning_LSTM.ipynb      # Notebook for training the baseline LSTM-only model.
├── image_captioning_attention.ipynb # Notebook for training the advanced Attention model.
├── image_captioning_inference.ipynb # Notebook for evaluating models and generating captions.
├── gradio_app.py                  # Python script to launch the Gradio web interface.
├── requirements.txt               # Required Python libraries.
├── vocabulary.txt                 # The vocabulary file generated during training.
├── .gitignore                     # Specifies files and folders to be ignored by Git.
├── src/                           # Source code for reuse.
│   ├── config.py                  # Project hyperparameters and constants.
│   └── model.py                   # Model class definitions (Encoder, Decoder, Attention).
├── training_checkpoints/          # Saved weights for the baseline LSTM-only model.
│   ├── best_decoder.weights.h5    # Best weights for the decoder in the LSTM-only model.
│   └── best_encoder.weights.h5    # Best weights for the encoder in the LSTM-only model.
├── training_checkpoints_attention/ # Saved weights for the Attention model, with Bahdanau and Luong variants.
│   ├── best_decoder.weights.h5    # Best weights for the decoder in the Bahdanau Attention model.
│   ├── best_encoder.weights.h5    # Best weights for the encoder in the Bahdanau Attention model.
│   ├── best_decoder_luong.weights.h5 # Best weights for the decoder in the Luong Attention model.
│   └── best_encoder_luong.weights.h5 # Best weights for the encoder in the Luong Attention model.
├── example_images/                # Example images for the Gradio app.
└── caption_results/               # JSON files with generated captions for evaluation.
    ├── results_lstm_greedy_captions_val2014.json # Captions from the LSTM-only model using Greedy Search.
    ├── results_lstm_beam_captions_val2014.json   # Captions from the LSTM-only model using Beam Search.
    ├── results_attention_greedy_captions_val2014.json # Captions from the Bahdanau Attention model using Greedy Search.
    └── results_attention_beam_captions_val2014.json   # Captions from the Bahdanau Attention model using Beam Search.

Project Phases

The project was executed in several distinct phases, moving from a foundational baseline to a sophisticated final model.

1. Baseline Model: LSTM Encoder-Decoder

The initial model was an Encoder-Decoder architecture: an encoder that compresses each image into a single feature vector, and an LSTM decoder that generates the caption from it. This model achieved a respectable baseline score but overfitted significantly during training, which highlighted the need for stronger regularization and a more powerful architecture.

2. Advanced Training & Regularization

To fight overfitting and stabilize training, a suite of advanced techniques was implemented:

  • Data Augmentation: On-the-fly random transformations (flips, brightness changes) were applied to the training images to increase dataset variety.
  • Scheduled Sampling: A technique to bridge the gap between training (teacher forcing) and inference by sometimes feeding the model its own predictions.
  • Adaptive Learning Rate: A custom ReduceLROnPlateau schedule was implemented to adjust the learning rate based on validation performance.
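Two of these techniques can be illustrated without any deep learning framework. The sketch below is not the code from the notebooks: `choose_decoder_input`, `ReduceOnPlateau`, and all default values (`factor`, `patience`, `min_lr`) are hypothetical names and settings chosen for illustration. It shows the core decision in scheduled sampling and the bookkeeping behind a reduce-on-plateau learning rate schedule.

```python
import random


def choose_decoder_input(ground_truth_token, predicted_token, sampling_prob, rng=random):
    """Scheduled sampling: with probability `sampling_prob`, feed the decoder
    its own previous prediction instead of the ground-truth token."""
    if rng.random() < sampling_prob:
        return predicted_token
    return ground_truth_token  # plain teacher forcing


class ReduceOnPlateau:
    """Multiply the learning rate by `factor` once the validation loss has
    failed to improve for `patience` consecutive epochs."""

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the learning rate to use next."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


# Illustrative validation losses: two epochs without improvement trigger one reduction.
scheduler = ReduceOnPlateau(lr=1e-3)
history = [scheduler.step(loss) for loss in [2.0, 1.8, 1.9, 1.85, 1.7]]
```

In practice the sampling probability is usually increased gradually over training, so the model starts with pure teacher forcing and is exposed to its own predictions more often as it improves.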

These techniques successfully controlled overfitting and led to a much healthier and more robust training process. However, they also revealed that the model's performance was limited by its architecture: compressing the whole image into a single vector creates an information bottleneck.

3. Advanced Model: Adding the Attention Mechanism

To overcome the limitations of the baseline, the architecture was upgraded with an Attention mechanism.

The encoder was modified to output a spatial feature map (an 8x8 grid) instead of a single vector. The decoder was equipped with an attention layer, allowing it to "look" at different parts of the image grid as it generated each word.
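The idea of "looking" at the grid can be sketched in plain Python. This is an illustration rather than the code in src/model.py: the function names, the 4-dimensional toy features, and the use of simple dot-product scoring (in the spirit of the Luong variant) are all assumptions. A Bahdanau-style layer would compute the scores with a small feed-forward network instead, but the softmax and weighted sum work the same way.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, hidden):
    """Dot-product attention over image locations.

    features: list of L feature vectors (one per grid cell), each of length D.
    hidden:   decoder hidden state of length D.
    Returns (context vector of length D, attention weights of length L).
    """
    # One relevance score per grid cell.
    scores = [sum(f_d * h_d for f_d, h_d in zip(f, hidden)) for f in features]
    weights = softmax(scores)
    # Context vector: attention-weighted average of the cell features.
    dim = len(hidden)
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return context, weights


# Toy 8x8 grid flattened to 64 locations. The feature depth of 4 is only for
# illustration; real encoders typically produce far wider features.
GRID, DIM = 8, 4
features = [[0.0] * DIM for _ in range(GRID * GRID)]
features[10] = [5.0, 0.0, 0.0, 0.0]  # one cell that matches the query strongly
hidden = [1.0, 0.0, 0.0, 0.0]

context, weights = attend(features, hidden)
```

The weights sum to 1 across the 64 cells, and the cell whose features align with the decoder state receives most of the mass. The decoder recomputes these weights at every word, which is what removes the single-vector bottleneck.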

4. Inference & Evaluation

For the final evaluation, both Greedy Search and Beam Search algorithms were implemented and compared.

  • Beam Search: Instead of committing to the single most likely word at each step, as Greedy Search does, this decoding algorithm keeps the k best partial captions at each step, which generally leads to more fluent and human-like results.
  • Qualitative Analysis: A crucial finding was that Beam Search (k=3) produced captions that were qualitatively more human-like and precise, even when its BLEU-4 score was slightly lower than that of Greedy Search. This highlighted the importance of human evaluation alongside automated metrics.
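The difference between the two decoding strategies can be shown on a toy example. The sketch below does not use the project's models: `TOY_MODEL` is a hypothetical hand-written table of next-word probabilities, and `greedy_search` and `beam_search` are illustrative names. It is built so that the locally best first word leads to a worse overall caption.

```python
import math

# Hypothetical toy "model": next-token probabilities given the tokens so far.
TOY_MODEL = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"dog": 0.5, "cat": 0.5},
    ("the",): {"dog": 0.9, "cat": 0.1},
}
END = "<end>"


def next_probs(prefix):
    """Return the toy distribution; prefixes not in the table end the caption."""
    return TOY_MODEL.get(tuple(prefix), {END: 1.0})


def greedy_search(max_len=5):
    """Pick the single most likely token at every step."""
    tokens, score = [], 0.0
    for _ in range(max_len):
        token, prob = max(next_probs(tokens).items(), key=lambda kv: kv[1])
        if token == END:
            break
        tokens.append(token)
        score += math.log(prob)
    return tokens, score


def beam_search(k=3, max_len=5):
    """Keep the k highest-scoring partial captions at every step."""
    beams = [([], 0.0, False)]  # (tokens, log-probability, finished?)
    for _ in range(max_len):
        candidates = []
        for tokens, score, done in beams:
            if done:
                candidates.append((tokens, score, True))
                continue
            for token, prob in next_probs(tokens).items():
                if token == END:
                    candidates.append((tokens, score + math.log(prob), True))
                else:
                    candidates.append((tokens + [token], score + math.log(prob), False))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
        if all(done for _, _, done in beams):
            break
    return beams[0][0], beams[0][1]


greedy_caption, greedy_score = greedy_search()
beam_caption, beam_score = beam_search(k=3)
```

Here Greedy Search commits to "a" (probability 0.6) and ends with "a dog" at total probability 0.30, while Beam Search keeps "the" alive and finds "the dog" at 0.36. Real implementations often also normalize scores by caption length, which this sketch omits.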

How to run the Project

1. Setup

First, set up your environment and download the dataset.

  • Install dependencies:
    pip install -r requirements.txt
  • Download the COCO 2014 Dataset: Download the train images, validation images, and annotations. Unzip them and place them in a coco directory at the project's root level, following the paths defined in src/config.py.

2. Training the Models

You can train the models by running the Jupyter Notebooks.

  • To train the Baseline LSTM-only Model: Open and run the cells in image_captioning_LSTM.ipynb. This will train the model in two phases (transfer learning and fine-tuning) and save the best weights to the training_checkpoints/ directory.
  • To train the Attention Model: Open and run the cells sequentially in image_captioning_attention.ipynb. This will train the model and save the best weights to the training_checkpoints_attention/ directory.

3. Generating Captions and Evaluating

To evaluate the trained models on the 5000-image validation subset:

  • Open and run the image_captioning_inference.ipynb notebook.
  • The notebook loads the trained weights, generates captions for the validation set using both Greedy and Beam Search, and calculates the final BLEU and METEOR scores. The generated captions are saved as .json files in the caption_results/ folder.
  • If the .json files you need already exist, you can skip the cells that regenerate the validation set captions.
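For readers unfamiliar with BLEU, the following sketch shows what the score measures. It is not the evaluation code from the notebook, which is not reproduced here and most likely relies on an existing library. The function names are hypothetical, and the version shown is unsmoothed and sentence-level, whereas reported BLEU scores are normally aggregated over the whole validation set.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # The brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((len(r) for r in references), key=lambda l: (abs(l - len(candidate)), l))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


# Made-up captions for illustration only.
references = [
    "a dog runs on the beach".split(),
    "a dog is running along the shore".split(),
]
exact_match = sentence_bleu("a dog runs on the beach".split(), references)
no_overlap = sentence_bleu("two cats sleeping indoors quietly".split(), references)
```

Because BLEU-4 only rewards exact n-gram overlap with the reference captions, a fluent caption that uses different wording can score lower than a blunter one, which is consistent with the Beam Search observation in the evaluation phase.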

4. Launching the Web Interface

To interact with your best-trained model, run the Gradio app.

  • You can choose whether to use the Baseline LSTM-only Model or the Attention Model.
  • Ensure your best model weights are saved in the expected directories (training_checkpoints/ and training_checkpoints_attention/) and that the vocabulary.txt file is present.
  • From your terminal, run the following command:
    python gradio_app.py
  • Open the local URL provided (e.g., http://127.0.0.1:7860) in your web browser to access the interface. You can upload your own images or use the provided examples from the example_images/ folder.

About

Image Captioning with LSTM and Attention Mechanism on COCO dataset. Project realized during the MSc in Data Science @ UniMiB
