BacBench

BacBench is a multi-scale and multi-task benchmark for evaluating ML models for bacterial genomics across the bacterial tree of life. Currently BacBench includes 5 active tasks collected and curated from public databases: (1) essential genes prediction, (2) operon identification, (3) protein-protein interaction (PPI), (4) antibiotic resistance prediction and (5) phenotypic traits prediction. The strain clustering task is deprecated but remains available in the repository for reproducibility.

BacBench allows for embedding and evaluating genomes using various models (see Benchmarked models section), as well as preprocessing bacterial genomes.

News

  • 2026-05-13: Updated the essential genes and PPI tasks with a phylogeny-aware split by genus. Added BacLM, a new masked language model trained on both DNA and protein sequences. Deprecated the strain clustering task.
  • 2026-02-26: Added more embedding models, including Evo2 (we recommend running it inside the Evo2 container), ProkBERT, ESMPlusPlus and gLM2.
  • 2025-05-15: BacBench datasets are now available on HuggingFace.

Setup

Requirements

BacBench uses PyTorch, HuggingFace Transformers and PyTorch Lightning, and was developed with Python 3.10.

To compute ESM-2, ESM-C, and Bacformer embeddings efficiently, BacBench can use the optional faesm/faplm extra, which requires flash-attention.

We recommend using BacBench on a machine with 1) considerable disk space (for downloading datasets), 2) GPU (for embedding genomes and running some evaluations).
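
Before starting a long download or embedding run, it can help to check both requirements up front. The snippet below is a minimal sketch, not part of BacBench: `has_free_space` and `gpu_available` are hypothetical helper names, and the 58 GB figure is the size this README lists for the large PPI dataset.

```python
import shutil


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1e9


def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # imported lazily so the disk check works without PyTorch
    except ImportError:
        return False
    return torch.cuda.is_available()


# Hypothetical pre-flight check in the current working directory.
print("58 GB free here:", has_free_space(".", 58))
print("CUDA GPU visible:", gpu_available())
```

If the disk check fails, streaming the datasets avoids a full local copy.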

Installation

Before installing BacBench, make sure to create a new python environment. We recommend using mamba, conda or venv to create a new environment.

You can install BacBench by cloning the repository and installing the dependencies:

git clone https://github.com/macwiatrak/BacBench.git
cd BacBench
# 1) install BacBench with its core dependencies
pip install .

We also recommend installing the faesm package, which provides fast inference for ESM-2 and ESM-C models.

Note: Only install faesm on a machine with a GPU and CUDA installed.

# 2) (optional but recommended) add the fast-attention extra ("faesm")
pip install ".[faesm]"

For development and tests, install the test extra:

pip install -e ".[test]"
pytest

Embedding and most full benchmark runs require GPU hardware and task-specific input files.

Usage

Below we describe how to access and use BacBench to:

  1. Access the datasets.
  2. Embed the genomes using various models.
  3. Evaluate the models on distinct tasks.
  4. Download and preprocess bacterial genomes.

Datasets

All of the datasets are available on HuggingFace.

The datasets for essential genes prediction, operon identification, PPI, antibiotic resistance prediction and phenotypic traits prediction are available in DNA and/or protein sequence modalities. Due to the size of the datasets, we recommend streaming the datasets unless you have a lot of disk space available. See examples below.

from datasets import load_dataset


# essential genes prediction task
# protein sequences, size=59.2MB
essential_genes_prot_seqs_ds = load_dataset("macwiatrak/bacbench-essential-genes-protein-sequences")
# DNA sequences, size=92.2MB
essential_genes_dna_seqs_ds = load_dataset("macwiatrak/bacbench-essential-genes-dna")


# operon identification task
# protein sequences, size=15.3MB
operon_identification_prot_seqs_ds = load_dataset("macwiatrak/operon-identification-long-read-rna-sequencing-protein-sequences")
# DNA sequences, size=24MB
operon_identification_dna_ds = load_dataset("macwiatrak/operon-identification-long-read-rna-sequencing-dna")

# protein-protein interaction (PPI) task, 261 genomes
# protein sequences, size=792MB
ppi_prot_seqs_ds = load_dataset("macwiatrak/bacbench-ppi-stringdb-protein-sequences-small", streaming=True)
# DNA sequences, size=985MB
ppi_dna_ds = load_dataset("macwiatrak/bacbench-ppi-stringdb-dna-small", streaming=True)
# for the large version of the PPI dataset (>10k genomes, available only in protein sequences modality), size=58GB, use the following dataset:
ppi_ds = load_dataset("macwiatrak/bacbench-ppi-stringdb-protein-sequences", streaming=True)


# antibiotic resistance prediction task
# protein sequences, size=38.8GB
ar_prot_seqs_ds = load_dataset("macwiatrak/bacbench-antibiotic-resistance-protein-sequences", streaming=True)
# DNA sequences, size=54.9GB
ar_dna_ds = load_dataset("macwiatrak/bacbench-antibiotic-resistance-dna", streaming=True)


# phenotypic traits prediction task
# protein sequences, size=36GB
pheno_traits_prot_seqs_ds = load_dataset("macwiatrak/bacbench-phenotypic-traits-protein-sequences", streaming=True)
# DNA sequences, size=51.1GB
pheno_traits_dna_ds = load_dataset("macwiatrak/bacbench-phenotypic-traits-dna", streaming=True)

Dataset details including the number of genomes and more are available in the dataset cards on HuggingFace.
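
When a dataset is loaded with `streaming=True`, rows are fetched lazily, so a quick way to look at the schema is to pull only the first few rows. The sketch below assumes nothing about split or column names: `peek` and `peek_streaming_dataset` are hypothetical helpers, and the first available split is used whatever it is called.

```python
from itertools import islice
from typing import Any, Iterable


def peek(rows: Iterable[dict[str, Any]], n: int = 2) -> list[dict[str, Any]]:
    """Return the first `n` rows of any iterable of row dicts."""
    return list(islice(rows, n))


def peek_streaming_dataset(repo_id: str, n: int = 2) -> list[dict[str, Any]]:
    """Stream a Hugging Face dataset and return the first `n` rows of its first split."""
    from datasets import load_dataset  # imported lazily so `peek` works without it

    # Without a `split` argument, streaming returns a dict-like object keyed by split name.
    splits = load_dataset(repo_id, streaming=True)
    first_split = next(iter(splits.values()))  # split names are not assumed here
    return peek(first_split, n)


# Example (needs network access):
# rows = peek_streaming_dataset("macwiatrak/bacbench-antibiotic-resistance-protein-sequences")
# print(rows[0].keys())
```

Only the requested rows are downloaded, so this works even for the datasets that are tens of gigabytes in size.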

Embedding genomes

We provide extendable scripts to embed genomes at the gene and whole-genome level using various models.

Embedding genomes is the first step in evaluating the models on the tasks. Details on how to embed genomes for each task are in the task-specific README files in the bacbench/tasks/ directory.

Below, we show examples of how to embed genomes using the supported models on a few tasks.

Note: Running embedding scripts requires GPU hardware for practical performance.

Essential genes prediction task

# embed and save the genomes using the ESM-C model
python bacbench/modeling/run_embed_prot_seqs.py \
    --dataset-name macwiatrak/bacbench-essential-genes-protein-sequences \
    --output-filepath <output-dir>/essential_genes_esmc_embeddings.parquet \
    --model-path Synthyra/ESMplusplus_small \
    --batch-size 64

# embed and save the genomes using the Bacformer model
python bacbench/modeling/run_embed_prot_seqs.py \
    --dataset-name macwiatrak/bacbench-essential-genes-protein-sequences \
    --output-filepath <output-dir>/essential_genes_bacformer_embeddings.parquet \
    --model-path macwiatrak/bacformer-large-masked-complete-genomes \
    --batch-size 64 \
    --max-n-proteins 9000  # max number of proteins in a genome

# embed and save the genomes using the Nucleotide Transformer model
python bacbench/modeling/run_embed_dna.py \
    --dataset-name macwiatrak/bacbench-essential-genes-dna \
    --output-filepath <output-dir>/essential_genes_nt_embeddings.parquet \
    --model-path InstaDeepAI/nucleotide-transformer-v2-250m-multi-species \
    --batch-size 128 \
    --max-seq-len 2048 \
    --dna-seq-overlap 32  # overlap between the sequences when the gene length is higher than --max-seq-len, default value
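
To make the role of --max-seq-len and --dna-seq-overlap concrete, the sketch below splits a long sequence into windows that share a fixed number of positions with their neighbour. It illustrates the general sliding-window idea only and is not BacBench's implementation: `chunk_with_overlap` is a hypothetical helper, and it counts characters, whereas the scripts may measure length in model tokens.

```python
def chunk_with_overlap(seq: str, max_len: int, overlap: int) -> list[str]:
    """Split `seq` into windows of at most `max_len` characters sharing `overlap` characters."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if len(seq) <= max_len:
        return [seq]  # short sequences need no chunking
    step = max_len - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(seq), step):
        chunks.append(seq[start:start + max_len])
        if start + max_len >= len(seq):
            break  # the last window already reaches the end of the sequence
    return chunks


print(chunk_with_overlap("ACGTACGTAC", max_len=4, overlap=2))
# ['ACGT', 'GTAC', 'ACGT', 'GTAC']
```

A larger overlap gives each position more surrounding context at the cost of more windows to embed.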

Operon identification task

# embed and save the genomes using the ProtBert model
python bacbench/modeling/run_embed_prot_seqs.py \
    --dataset-name macwiatrak/operon-identification-long-read-rna-sequencing-protein-sequences \
    --output-filepath <output-dir>/operon_identification_protbert_embeddings.parquet \
    --model-path Rostlab/prot_bert  \
    --batch-size 64

# embed and save the genomes using the Bacformer model
python bacbench/modeling/run_embed_prot_seqs.py \
    --dataset-name macwiatrak/operon-identification-long-read-rna-sequencing-protein-sequences \
    --output-filepath <output-dir>/operon_identification_bacformer_embeddings.parquet \
    --model-path macwiatrak/bacformer-masked-complete-genomes \
    --batch-size 64 \
    --max-n-proteins 9000  # max number of proteins in a genome, default value


# embed and save the genomes using the Mistral-DNA model
python bacbench/modeling/run_embed_dna.py \
    --dataset-name macwiatrak/operon-identification-long-read-rna-sequencing-dna \
    --output-filepath <output-dir>/operon_identification_mistral_embeddings.parquet \
    --model-path Raphaelmourad/Mistral-DNA-v1-138M-bacteria \
    --batch-size 256 \
    --max-seq-len 512 \
    --dna-seq-overlap 16

Antibiotic resistance prediction task

# embed and save the genomes using the ESM-2 model
python bacbench/modeling/run_embed_prot_seqs.py \
    --dataset-name macwiatrak/bacbench-antibiotic-resistance-protein-sequences \
    --output-filepath <output-dir>/amr_esm2_embeddings.parquet \
    --model-path facebook/esm2_t12_35M_UR50D \
    --batch-size 64 \
    --genome-pooling-method mean \
    --agg-whole-genome \
    --streaming

# embed and save the genomes using the Bacformer model
python bacbench/modeling/run_embed_prot_seqs.py \
    --dataset-name macwiatrak/bacbench-antibiotic-resistance-protein-sequences \
    --output-filepath <output-dir>/amr_bacformer_embeddings.parquet \
    --model-path macwiatrak/bacformer-large-masked-complete-genomes \
    --batch-size 64 \
    --genome-pooling-method mean \
    --agg-whole-genome \
    --streaming \
    --max-n-proteins 9000  # max number of proteins in a genome, default value


# embed and save the genomes using the Nucleotide Transformer model
python bacbench/modeling/run_embed_dna.py \
    --dataset-name macwiatrak/bacbench-antibiotic-resistance-dna \
    --output-filepath <output-dir>/amr_nucleotide_transformer_embeddings.parquet \
    --model-path InstaDeepAI/nucleotide-transformer-v2-250m-multi-species \
    --batch-size 128 \
    --max-seq-len 2048 \
    --dna-seq-overlap 32 \
    --agg-whole-genome \
    --genome-pooling-method mean \
    --streaming
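
The --agg-whole-genome and --genome-pooling-method mean flags above reduce many per-gene or per-protein embeddings to one vector per genome. As an illustration of what mean pooling computes, here is a pure-Python sketch; `mean_pool` is a hypothetical helper, and in practice this would be done on tensors or arrays rather than lists.

```python
def mean_pool(embeddings: list[list[float]]) -> list[float]:
    """Average equally sized embedding vectors element-wise into one vector."""
    if not embeddings:
        raise ValueError("need at least one embedding to pool")
    dim = len(embeddings[0])
    if any(len(vec) != dim for vec in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    n = len(embeddings)
    # zip(*embeddings) iterates over dimensions, so each `column` holds one
    # coordinate from every protein embedding.
    return [sum(column) / n for column in zip(*embeddings)]


protein_embeddings = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]  # two toy 3-dimensional vectors
genome_embedding = mean_pool(protein_embeddings)
print(genome_embedding)  # [2.0, 3.0, 4.0]
```

The pooled vector has the same dimension as the inputs regardless of how many proteins the genome contains, which is what lets a linear classifier consume genomes of different sizes.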

Protein-protein interaction task

# embed and save per-protein embeddings for PPI training/evaluation
python bacbench/modeling/run_embed_prot_seqs.py \
    --dataset-name macwiatrak/bacbench-ppi-stringdb-protein-sequences-small \
    --output-filepath <output-dir>/ppi_esm2_embeddings.parquet \
    --model-path facebook/esm2_t12_35M_UR50D \
    --batch-size 64 \
    --streaming

Note: DNABERT-2 has additional dependencies; to install them, please refer to the DNABERT-2 GitHub repository.

Embedding slices of the dataset: We also provide functionality to embed only a slice of the dataset, which is useful for testing and debugging. Pass the --start-idx and --end-idx arguments to specify the slice of the dataset you want to embed. Both the run_embed_dna.py and run_embed_prot_seqs.py scripts support these arguments.
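
The --start-idx and --end-idx arguments can also be used to split a large dataset into several independent jobs. The sketch below only plans the slices and builds the command lines. It is a minimal illustration under two assumptions that should be checked against the scripts: that --end-idx is exclusive, and that the number of rows is known in advance. `plan_slices` and `slice_command` are hypothetical helpers.

```python
def plan_slices(n_rows: int, slice_size: int) -> list[tuple[int, int]]:
    """Return (start_idx, end_idx) pairs covering range(n_rows), with end_idx exclusive."""
    if n_rows < 0 or slice_size <= 0:
        raise ValueError("n_rows must be >= 0 and slice_size must be > 0")
    return [
        (start, min(start + slice_size, n_rows))
        for start in range(0, n_rows, slice_size)
    ]


def slice_command(start: int, end: int, dataset_name: str, model_path: str, output_dir: str) -> list[str]:
    """Build the argument list for embedding one slice with run_embed_prot_seqs.py."""
    return [
        "python", "bacbench/modeling/run_embed_prot_seqs.py",
        "--dataset-name", dataset_name,
        "--output-filepath", f"{output_dir}/embeddings_{start}_{end}.parquet",
        "--model-path", model_path,
        "--start-idx", str(start),
        "--end-idx", str(end),
    ]


print(plan_slices(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

Each argument list can be passed to `subprocess.run` or written into a job script; giving every slice its own output file keeps the jobs from overwriting one another.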

See Benchmarked models section for the list of currently supported models.

Model evaluation

We provide scripts to evaluate the embedding models for each task in the bacbench/tasks/ directory. Details on how to evaluate models for each task are in the task-specific README files in the same directory.

Below, we show examples of how to evaluate the models using embedded data.

Note: to run evaluation scripts, you need to have the embeddings saved in a parquet file (see above examples for how to embed the genomes).

Essential genes prediction task

python bacbench/tasks/essential_genes/run_train_cls.py \
    --input-df-file-path <input-dir>/essential_genes_esmc_embeddings.parquet \
    --output-dir <output-dir> \
    --lr 0.005 \
    --max-epochs 100 \
    --model-name esmc

Operon identification task

python bacbench/tasks/operon/run_evaluation_long_read_rna_seq.py \
    --input-filepath <input-dir>/operon_identification_bacformer_embeddings.parquet \
    --output-filepath <output-filepath>

Protein-protein interaction task

# Train an MLP on PPI pairs
python bacbench/tasks/ppi/run_train_mlp.py \
    --input-filepath <input-dir>/ppi_esm2_embeddings.parquet \
    --output-dir <output-dir> \
    --max-epochs 10

# Run unsupervised evaluation directly from pair scores
python bacbench/tasks/ppi/run_unsupervised_eval.py \
    --input-filepath <input-dir>/ppi_esm2_embeddings.parquet \
    --output-dir <output-dir> \
    --model-name esm2
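
For intuition about scoring a protein pair without training, one common choice is the cosine similarity between the two protein embeddings. The sketch below shows that computation only. Whether run_unsupervised_eval.py uses this exact score is an assumption, not something stated in this README, and `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Toy embeddings: orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Ranking candidate pairs by such a score and comparing the ranking with known interactions gives a training-free baseline next to the supervised MLP.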

Antibiotic resistance prediction task

python bacbench/tasks/antibiotic_resistance/train_and_predict_linear.py \
    --input-genomes-df-filepath <input-dir>/amr_esm2_embeddings.parquet \
    --labels-df-filepath <input-dir>/binary_labels.csv \
    --output-dir <output-dir> \
    --model-name esm2 \
    --lr 0.005

Phenotypic traits prediction task

python bacbench/tasks/phenotypic_traits/train_and_predict_linear.py \
    --input-genomes-df-filepath <input-dir>/pheno_bacformer_embeddings.parquet \
    --labels-df-filepath <input-dir>/labels.csv \
    --output-dir <output-dir> \
    --model-name bacformer \
    --lr 0.01

For more details on how to run the evaluation scripts, please refer to the scripts in the bacbench/tasks/ directory.

Large dataset tips

  • Use --streaming when loading large Hugging Face datasets.
  • Use --start-idx and --end-idx for quick debugging runs on a small slice.
  • Use --save-every-n-rows with --output-dir for streaming embedding jobs that should checkpoint partial parquet chunks.
  • Use task-specific README files in bacbench/tasks/ for full dataset and label-file locations.
  • For PPI training, use --use-incremental-parquet-read if the embedding parquet is too large to read into memory while building train/validation/test splits.
  • Keep GPU-specific extras such as faesm, DNABERT-2 requirements, and Evo2 requirements in separate environments when possible.

Download and preprocess genomes

To make it easier to download and preprocess genomes, we provide functionality as part of the bacbench package to (i) download genomes from NCBI/GenBank and (ii) preprocess them to the required format for the models. See examples below.

from bacbench.pp import (
    extract_protein_info_from_gbff,
    extract_protein_info_from_gff,
    extract_dna_info_from_fna,
    download_and_process_genome_by_taxid,
    download_and_process_genome_by_assembly_id,
)

# given a GBFF file, extract the protein sequences and their annotations
# for example, we can use the Pseudomonas aeruginosa PAO1 genome = https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000006765.1/
genome_protein_seqs_df = extract_protein_info_from_gbff("<input-dir>/GCF_000006765.1.gbff")
# given a GFF file, extract the protein info
genome_protein_info_df = extract_protein_info_from_gff("<input-dir>/GCF_000006765.1.gff")
# given a FNA file, extract the DNA sequences
genome_dna_seqs_df = extract_dna_info_from_fna("<input-dir>/GCF_000006765.1_ASM676v1_genomic.fna")

# we also provide functionality to download and preprocess genomes from NCBI/GenBank
# download and preprocess a genome by its taxid
taxid_df = download_and_process_genome_by_taxid(
    taxid=208964,  # taxid for Pseudomonas aeruginosa PAO1
    file_type="gbff",
)
# download and preprocess a genome by its assembly id
assembly_id_df = download_and_process_genome_by_assembly_id(
    assembly_id="GCF_000006765.1",
    file_type="gbff",
)

Benchmarked models

We currently support the following models:

| Model | Input | Variant / Checkpoint | Objective | Params | Embedding dim | Max context |
|---|---|---|---|---|---|---|
| Mistral-DNA | DNA | Mistral-DNA-v1-138M-bacteria | Autoregressive | 138 M | 768 | 512 |
| DNABERT-2* | DNA | DNABERT-2-117M | Masked | 117 M | 768 | 512 |
| Nucleotide Transformer | DNA | nucleotide-transformer-v2-250m-multi-species | Masked | 250 M | 768 | 2,048 |
| ProkBERT | DNA | neuralbioinfo/prokbert-mini-long | Masked | 27 M | 384 | 4,096 |
| Evo | DNA | evo-1-8k-base (1.1_fix) | Autoregressive | 6.5 B | 4,096 | 8,192 |
| Evo2** | DNA | evo_1b_base | Autoregressive | 1 B | 1,920 | 8,192 |
| ESM-2 | Single protein seq. | esm2_t12_35M_UR50D | Masked | 35 M | 480 | 1,024 |
| ESM-C | Single protein seq. | esmc_300m | Masked | 300 M | 960 | 2,048 |
| ESMPlusPlus (reimplementation of ESM-C) | Single protein seq. | Synthyra/ESMplusplus_small | Masked | 300 M | 960 | 2,048 |
| ProtBert | Single protein seq. | prot_bert | Masked | 420 M | 1,024 | 1,024 |
| gLM2 | Mixed modality (DNA & protein) | tattabio/gLM2_650M | Masked | 650 M | 1,280 | 4,096 |
| BacLM | Mixed modality (DNA or protein) | macwiatrak/baclm-350m-masked | Masked | 350 M | 960 | 2,048 |
| Bacformer | Multiple protein seq. | bacformer-masked-complete-genomes | Masked | 27 M | 480 | 6,000 |
| Bacformer Large | Multiple protein seq. | bacformer-large-masked-complete-genomes | Masked | 27 M | 960 | 6,000 |

* DNABERT-2 has additional dependencies; to install them, please refer to the DNABERT-2 GitHub repository.

** Evo2 has additional dependencies; to install them, please refer to the Evo2 GitHub repository. We recommend running Evo2 in a container.

Historical strain clustering runs used the MAG version of the Bacformer model (bacformer-masked-MAG and bacformer-large-masked-MAG) because the inputs are metagenome-assembled genomes (MAGs), rather than complete genomes.

Note: the mixed-modality models (gLM2 and BacLM) accept both DNA and protein sequences as input. However, the current implementation in bacbench/modeling/embed_prot_seqs.py and bacbench/modeling/embed_dna.py supports only one modality at a time: either DNA or protein sequences, but not both together. We plan to add support for joint DNA and protein input for the mixed-modality models; the work-in-progress scripts are available in bacbench/modeling/utils/scripts.

Task overview

| Task | Status | Input modality | Embedding granularity | Main evaluation script |
|---|---|---|---|---|
| Essential genes prediction | Active | DNA or protein | Gene/protein embeddings | bacbench/tasks/essential_genes/run_train_cls.py |
| Operon identification from long read RNA-seq | Active | DNA or protein | Per-gene embeddings grouped by contig | bacbench/tasks/operon/run_evaluation_long_read_rna_seq.py |
| Protein-protein interaction | Active | Protein | Per-protein embeddings with STRING-derived PPI labels | bacbench/tasks/ppi/run_train_mlp.py, bacbench/tasks/ppi/run_unsupervised_eval.py |
| Antibiotic resistance prediction | Active | DNA or protein | Whole-genome embeddings | bacbench/tasks/antibiotic_resistance/train_and_predict_linear.py |
| Phenotypic traits prediction | Active | DNA or protein | Whole-genome embeddings | bacbench/tasks/phenotypic_traits/train_and_predict_linear.py |
| Strain clustering | Deprecated | DNA or protein | Whole-genome embeddings | bacbench/tasks/strain_clustering/run_evaluation.py |

Contributing

We welcome contributions to BacBench! If you would like to contribute, please follow these steps:

  1. Fork the repository.
  2. Install pre-commit and set up the pre-commit hooks (make sure to do it at the root of the repository):
     pip install pre-commit
     pre-commit install
  3. Create a new branch for your feature or bug fix.
  4. Make your changes and commit them.
  5. Push your changes to your forked repository.
  6. Create a pull request to the main repository.
  7. Make sure to add tests for your changes and run the tests to ensure everything is working correctly.

Citation

Citation details will be added when the manuscript/preprint is available.

To-do list

  • Publish to pypi
  • Create model leaderboard for each task
  • Add support for adding new models to the benchmark
  • Add dataset details to the repository
  • Add support for batch downloading genomes from NCBI/GenBank

Contact

For questions, bugs, and feature requests, please raise an issue in the repository.

Acknowledgements

We sincerely thank the authors of the following open-source projects:
