This repository contains the code for the audio-visual model presented in the paper:
- "Audio-Visual Approach For Multimodal Concurrent Speaker Detection"
- Link to Paper
Building upon our prior work on audio-only CSD, this repository presents a multimodal approach that incorporates visual information to enhance performance.
- Reference: "Concurrent Speaker Detection: A Multi-Microphone Transformer-Based Approach"
- Link to Audio-Only Paper
- Link to Audio-Only Code
This new model expands the capabilities of our earlier research by leveraging both audio and visual cues, providing a more robust solution for detecting concurrent speakers.
CSD-Audio-Visual/
├── doc/
│   └── Figures/
├── src/
│   ├── XXX.py
│   └── Inference.py
├── README.md
└── ...

- `doc/Figures/`: Contains the figures used in this repository.
- `XXX.py`: PyTorch implementation of our proposed CSD model.
- `Inference.py`: Demonstrates how to use the CSD model for inference. It generates a random input with the same shape as the model's expected input, runs the model on it to produce predictions, and prints the model's summary.
This model classifies audio-visual segments into three categories:
1️⃣ Noise only
2️⃣ Single-speaker activity
3️⃣ Concurrent-speaker activity
The method was evaluated on the AMI and EasyCom datasets.
Our primary focus is on Concurrent Speaker Detection (CSD), which classifies audio segments into the three classes above.
For comparison with existing methods, we also evaluate our model on the related binary classification tasks:
1️⃣ Voice Activity Detection (VAD): Distinguishes between speech (single/multiple speakers) and non-speech.
2️⃣ Overlapped Speech Detection (OSD): Identifies segments with overlapped versus non-overlapped speech.
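
The relationship between the three CSD classes and the two binary tasks can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper: the class index order (0 = noise, 1 = single speaker, 2 = concurrent speakers), the names `CSDClass`, `csd_to_binary`, and `csd_probs_to_binary_scores`, and the choice to sum class probabilities to obtain binary scores are all assumptions made for this example.

```python
from enum import IntEnum
from typing import Sequence, Tuple


class CSDClass(IntEnum):
    # Assumed index order for this sketch; the repository may use another.
    NOISE = 0
    SINGLE_SPEAKER = 1
    CONCURRENT_SPEAKERS = 2


def csd_to_binary(label: CSDClass) -> Tuple[bool, bool]:
    """Collapse a three-class CSD label into (vad, osd) decisions."""
    vad = label != CSDClass.NOISE                # speech from one or more speakers
    osd = label == CSDClass.CONCURRENT_SPEAKERS  # overlapped speech only
    return vad, osd


def csd_probs_to_binary_scores(probs: Sequence[float]) -> Tuple[float, float]:
    """Map [p_noise, p_single, p_concurrent] to (speech score, overlap score).

    Assumption: the speech score is the total probability of the two
    speech classes, and the overlap score is the concurrent-class probability.
    """
    p_noise, p_single, p_concurrent = probs
    return p_single + p_concurrent, p_concurrent
```

For example, `csd_to_binary(CSDClass.SINGLE_SPEAKER)` yields `(True, False)`: the segment contains speech but no overlap. A segment labelled `CONCURRENT_SPEAKERS` is positive for both VAD and OSD, and a `NOISE` segment is negative for both.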
The high-level architecture of our proposed model is presented in the following figure.
If you use this work, please cite:
@article{eliav2024audio,
title={Audio-Visual Approach For Multimodal Concurrent Speaker Detection},
author={Eliav, Amit and Gannot, Sharon},
journal={arXiv preprint arXiv:2407.01774},
year={2024}
}