Authors: Vincenzo Siano and Kevin Garofalo
This project implements and compares two deep learning architectures for image captioning on the COCO 2014 dataset: a baseline Encoder-Decoder model using an LSTM and an advanced version incorporating an Attention mechanism. The entire project is built with TensorFlow and Keras and run on a local setup.
The final model is presented through an interactive web interface created with Gradio.
This work builds on the theoretical and practical findings of the papers "Show and Tell" and "Show, Attend and Tell". Computer Vision and NLP have advanced rapidly since those papers were published, driven by new models and techniques that keep raising the achievable performance.
A natural progression beyond RNN-based models would be to adopt Transformer architectures. Their self-attention mechanism has redefined the state of the art across multiple domains, and applying them to image captioning is expected to deliver a significant improvement in performance.
The repository is organized as follows:
.
├── COCO_dataset.ipynb # Notebook for exploring the COCO dataset.
├── image_captioning_LSTM.ipynb # Notebook for training the baseline LSTM-only model.
├── image_captioning_attention.ipynb # Notebook for training the advanced Attention model.
├── image_captioning_inference.ipynb # Notebook for evaluating models and generating captions.
├── gradio_app.py # Python script to launch the Gradio web interface.
├── requirements.txt # Required Python libraries.
├── vocabulary.txt # The vocabulary file generated during training.
├── .gitignore # Specifies files and folders to be ignored by Git.
├── src/ # Source code for reuse.
│   ├── config.py # Project hyperparameters and constants.
│   └── model.py # Model class definitions (Encoder, Decoder, Attention).
├── training_checkpoints/ # Saved weights for the baseline LSTM-only model.
│   ├── best_decoder.weights.h5 # Best weights for the decoder in the LSTM-only model.
│   └── best_encoder.weights.h5 # Best weights for the encoder in the LSTM-only model.
├── training_checkpoints_attention/ # Saved weights for the Attention model, with Bahdanau and Luong variants.
│   ├── best_decoder.weights.h5 # Best weights for the decoder in the Bahdanau Attention model.
│   ├── best_encoder.weights.h5 # Best weights for the encoder in the Bahdanau Attention model.
│   ├── best_decoder_luong.weights.h5 # Best weights for the decoder in the Luong Attention model.
│   └── best_encoder_luong.weights.h5 # Best weights for the encoder in the Luong Attention model.
├── example_images/ # Example images for the Gradio app.
└── caption_results/ # JSON files with generated captions for evaluation.
    ├── results_lstm_greedy_captions_val2014.json # Captions from the LSTM-only model using Greedy Search.
    ├── results_lstm_beam_captions_val2014.json # Captions from the LSTM-only model using Beam Search.
    ├── results_attention_greedy_captions_val2014.json # Captions from the Bahdanau Attention model using Greedy Search.
    └── results_attention_beam_captions_val2014.json # Captions from the Bahdanau Attention model using Beam Search.
The project was executed in several distinct phases, moving from a foundational baseline to a sophisticated final model.
The initial model was an Encoder-Decoder architecture. This model achieved a respectable baseline score but showed significant overfitting during training. This highlighted the need for more advanced regularization and a more powerful architecture.
To fight overfitting and stabilize training, a suite of advanced techniques was implemented:
- Data Augmentation: On-the-fly random transformations (flips, brightness changes) were applied to the training images to increase dataset variety.
- Scheduled Sampling: A technique to bridge the gap between training (teacher forcing) and inference by sometimes feeding the model its own predictions.
- Adaptive Learning Rate: A custom ReduceLROnPlateau schedule was implemented to adjust the learning rate based on validation performance.
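Two of the techniques above can be illustrated with a small, framework-free sketch. It is not the code from the notebooks: the names `pick_decoder_input` and `PlateauLRSchedule`, and the default values for `factor`, `patience`, and `min_lr`, are assumptions chosen for illustration.

```python
import random


def pick_decoder_input(ground_truth_token, predicted_token, sampling_prob, rng=random):
    """Scheduled sampling: with probability `sampling_prob`, feed the model's
    own previous prediction instead of the ground-truth token."""
    if rng.random() < sampling_prob:
        return predicted_token
    return ground_truth_token


class PlateauLRSchedule:
    """Reduce the learning rate when the validation loss stops improving.

    Hypothetical helper; the defaults below are illustrative assumptions.
    """

    def __init__(self, initial_lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = initial_lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the current LR."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
            if self.epochs_without_improvement >= self.patience:
                # Plateau detected: shrink the LR, but never below min_lr.
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.epochs_without_improvement = 0
        return self.lr
```

In a training loop, `sampling_prob` would typically start near 0 (pure teacher forcing) and grow over the epochs, while `step` would be called once per epoch with the validation loss.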
These techniques successfully controlled overfitting and led to a much healthier and more robust training process. However, they also revealed that the model's performance was limited by its architecture: compressing the whole image into a single vector creates an information bottleneck.
To try to overcome the limitations of the baseline, the architecture was upgraded with an Attention mechanism.
The encoder was modified to output a spatial feature map (an 8x8 grid) instead of a single vector. The decoder was equipped with an attention layer, allowing it to "look" at different parts of the image grid as it generated each word.
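To make the idea concrete, here is a minimal pure-Python sketch of Bahdanau-style (additive) attention over the grid locations. It is an illustration under stated assumptions, not the layer defined in src/model.py: the function names and the weight arguments `w_feat`, `w_hid`, and `v` are hypothetical, and a real implementation would use trainable Keras layers and batched tensors.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(features, hidden, w_feat, w_hid, v):
    """Bahdanau-style attention over a grid of image features.

    features: list of L feature vectors (L = 64 for an 8x8 grid)
    hidden:   decoder hidden state
    Returns (context_vector, attention_weights).
    """
    projected_hidden = matvec(w_hid, hidden)
    scores = []
    for feat in features:
        projected_feat = matvec(w_feat, feat)
        energy = [math.tanh(a + b) for a, b in zip(projected_feat, projected_hidden)]
        scores.append(sum(vi * ei for vi, ei in zip(v, energy)))

    # Softmax over locations (shifted by the max for numerical stability).
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the feature vectors.
    dim = len(features[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)]
    return context, weights
```

The attention weights sum to 1 across the 64 locations, so the context vector is a weighted average of the grid features, recomputed at every decoding step from the current hidden state.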
For the final evaluation, both Greedy Search and Beam Search algorithms were implemented and compared.
- Beam Search: This more advanced decoding algorithm explores multiple candidate captions at each step, generally leading to more fluent, human-like, and accurate results.
- Qualitative Analysis: A crucial finding was that Beam Search (k=3) produced captions that were qualitatively more human-like and precise, even when the BLEU-4 score was slightly lower than Greedy Search. This highlighted the importance of human evaluation alongside automated metrics.
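The difference between the two decoding strategies can be sketched on a toy model. Everything below is hypothetical: `step_fn` stands in for one decoder step and returns next-token probabilities, `TOY_MODEL` is an invented distribution, and the sketch applies no length normalization, which the project's implementation may or may not use.

```python
import math


def greedy_search(step_fn, start, end, max_len=10):
    """Pick the single most probable next token at every step."""
    seq = [start]
    while len(seq) < max_len and seq[-1] != end:
        probs = step_fn(seq)
        seq.append(max(probs, key=probs.get))
    return seq


def beam_search(step_fn, start, end, beam_width=3, max_len=10):
    """Keep the `beam_width` best partial captions, ranked by total log-probability."""
    beams = [([start], 0.0)]
    for _ in range(max_len - 1):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:
                candidates.append((seq, score))  # finished captions carry over
                continue
            for token, p in step_fn(seq).items():
                if p > 0.0:
                    candidates.append((seq + [token], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end for seq, _ in beams):
            break
    return beams[0][0]


# Invented next-token probabilities, keyed by the previous token only.
TOY_MODEL = {
    "<start>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.55, "cat": 0.45},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"<end>": 1.0},
    "cat": {"<end>": 1.0},
}


def toy_step(seq):
    return TOY_MODEL[seq[-1]]


print(greedy_search(toy_step, "<start>", "<end>"))  # ['<start>', 'a', 'dog', '<end>']
print(beam_search(toy_step, "<start>", "<end>"))    # ['<start>', 'the', 'dog', '<end>']
```

Greedy Search commits to "a" because it is the best first word (0.6), ending with a caption of probability 0.33. Beam Search keeps "the" alive and finds the caption with the higher overall probability (0.36).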
First, set up your environment and download the dataset.
- Install dependencies:
pip install -r requirements.txt
- Download the COCO 2014 Dataset: Download the train images, validation images, and annotations. Unzip them and place them in a coco directory at the project's root level, following the paths defined in src/config.py.
You can train the models by running the Jupyter Notebooks.
- To train the Baseline LSTM-only Model: Open and run the cells in image_captioning_LSTM.ipynb. This will train the model in two phases (transfer learning and fine-tuning) and save the best weights to the training_checkpoints/ directory.
- To train the Attention Model: Open and run the cells sequentially in image_captioning_attention.ipynb. This will train the model and save the best weights to the training_checkpoints_attention/ directory.
To evaluate the trained models on the 5000-image validation subset:
- Open and run the image_captioning_inference.ipynb notebook.
- You can skip the cells that generate the validation set captions into new .json files if you already have the ones you need.
- This notebook will load the pre-trained weights, generate captions for the validation set using both Greedy and Beam Search, and calculate the final BLEU and METEOR scores. The results will be saved as .json files in the caption_results/ folder.
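For readers unfamiliar with BLEU, its core building block is the clipped (modified) n-gram precision, sketched below in plain Python. This is a simplified illustration and not the notebook's evaluation code, which may rely on a library. Full BLEU-4 additionally combines the precisions for n = 1 to 4 with a geometric mean and applies a brevity penalty. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most as
    many times as it appears in any single reference caption."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "a dog runs on the grass".split()
references = [
    "a dog is running on the grass".split(),
    "a dog runs through a field".split(),
]
print(modified_precision(candidate, references, 1))  # 1.0
print(modified_precision(candidate, references, 2))  # 0.8
```

The clipping is what stops a degenerate caption such as "the the the the" from scoring well: repeated words are only credited as often as they occur in a reference.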
To interact with your best-trained model, run the Gradio app.
- You can choose if you want to use the Baseline LSTM-only Model or the Attention Model.
- Ensure your best model weights are saved in the proper directories and the vocabulary.txt file is present.
- From your terminal, run the following command:
python gradio_app.py
- Open the local URL provided (e.g., http://127.0.0.1:7860) in your web browser to access the interface. You can upload your own images or use the provided examples from the example_images/ folder.