amiteliav/CSD-Audio-Visual

CSD-Audio-Visual

Audio-Visual Approach For Multimodal Concurrent Speaker Detection

This repository contains the code for the audio-visual model presented in the paper:
📄 "Audio-Visual Approach For Multimodal Concurrent Speaker Detection"
🔗 Link to Paper

Building upon our prior work on audio-only CSD, this repository presents a multimodal approach that incorporates visual information to enhance performance.
📄 Reference: "Concurrent Speaker Detection: A Multi-Microphone Transformer-Based Approach"
🔗 Link to Audio-Only Paper
🔗 Link to Audio-Only Code

The new model extends our earlier research by combining audio and visual cues, which makes concurrent-speaker detection more robust than with audio alone.

📂 Repository Structure

CSD-Audio-Visual/
├── doc/
│   └── Figures/
├── src/
│   ├── XXX.py
│   └── Inference.py
├── README.md
└── ...
  • doc/Figures/: Contains the figures used in this repository
  • XXX.py: PyTorch implementation of our proposed CSD model.
  • Inference.py: Demonstrates how to run the CSD model for inference. It generates a random input with the shape the model expects, passes it through the model to produce predictions, and prints the model summary.

📌 Overview

This model classifies audio-visual segments into three categories:

1️⃣ Noise only
2️⃣ Single-speaker activity
3️⃣ Concurrent-speaker activity

The method was evaluated on the AMI and EasyCom datasets.

Our primary focus is Concurrent Speaker Detection (CSD), which assigns each segment to one of the three classes above. For comparison with existing methods, we also evaluate our model on two related binary classification tasks:
1️⃣ Voice Activity Detection (VAD): Distinguishes speech (single or multiple speakers) from non-speech.
2️⃣ Overlapped Speech Detection (OSD): Distinguishes overlapped speech from non-overlapped segments.
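The relationship between the three CSD classes and the two binary tasks can be illustrated with a minimal sketch. It assumes the binary labels are obtained by collapsing the three-class label, which this README does not confirm; the paper may derive them differently. The names `CSDClass`, `to_vad`, and `to_osd`, as well as the class index order, are hypothetical and not part of this repository.

```python
from enum import IntEnum


class CSDClass(IntEnum):
    # Assumption: index order follows the order in which the README lists the classes.
    NOISE = 0
    SINGLE_SPEAKER = 1
    CONCURRENT_SPEAKERS = 2


def to_vad(label: CSDClass) -> bool:
    """VAD view: True when any speech is present (single or concurrent speakers)."""
    return label != CSDClass.NOISE


def to_osd(label: CSDClass) -> bool:
    """OSD view: True only when speech overlaps (concurrent speakers)."""
    return label == CSDClass.CONCURRENT_SPEAKERS


# Hypothetical per-segment predictions, one of each class.
predictions = [
    CSDClass.NOISE,
    CSDClass.SINGLE_SPEAKER,
    CSDClass.CONCURRENT_SPEAKERS,
]

vad = [to_vad(p) for p in predictions]
osd = [to_osd(p) for p in predictions]

for p, v, o in zip(predictions, vad, osd):
    print(f"{p.name:<20} VAD={v!s:<5} OSD={o}")
```

Under this assumption, a single-speaker segment counts as speech for VAD but as non-overlapped for OSD, so the two binary tasks split the three classes at different boundaries.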

Model Architecture - High Level

The high-level architecture of our proposed model is presented in the following figure.

Model Overview

TBC

📄 Citation

If you use this work, please cite:

@article{eliav2024audio,
title={Audio-Visual Approach For Multimodal Concurrent Speaker Detection},
author={Eliav, Amit and Gannot, Sharon},
journal={arXiv preprint arXiv:2407.01774},
year={2024}
}
