BacBench

BacBench is a multi-scale and multi-task benchmark for evaluating ML models for bacterial genomics across the bacterial tree of life. Currently BacBench includes 5 active tasks collected and curated from public databases: (1) essential genes prediction, (2) operon identification, (3) protein-protein interaction (PPI), (4) antibiotic resistance prediction and (5) phenotypic traits prediction. The strain clustering task is deprecated but remains available in the repository for reproducibility.

BacBench allows for embedding and evaluating genomes using various models (see Benchmarked models section), as well as preprocessing bacterial genomes.

News

2026-05-13: Updated the essential genes and PPI tasks with a phylogeny-aware split by genus. Added BacLM, a new masked language model trained on both DNA and protein sequences. Deprecated the strain clustering task.
2026-02-26: Added more embedding models, including Evo2 (we recommend running it inside the Evo2 container), ProkBERT, ESMPlusPlus and gLM2.
2025-05-15: BacBench datasets are now available on HuggingFace.

Setup

Requirements

BacBench uses PyTorch, HuggingFace Transformers and PyTorch Lightning, and was developed with Python 3.10.

To compute ESM-2, ESM-C, and Bacformer embeddings efficiently, BacBench can use the optional faesm/faplm extra, which requires flash-attention.

We recommend using BacBench on a machine with (1) ample disk space (for downloading datasets) and (2) a GPU (for embedding genomes and running some evaluations).

Installation

Before installing BacBench, make sure to create a new python environment. We recommend using mamba, conda or venv to create a new environment.

You can install BacBench by cloning the repository and installing the dependencies:

git clone https://github.com/macwiatrak/BacBench.git
cd BacBench
# 1) install BacBench with its core dependencies
pip install .

We also recommend installing the faesm package, which provides fast inference for ESM-2 and ESM-C models.

Note: Only install faesm on a machine with a GPU and CUDA installed.

# 2) (optional but recommended) add the fast-attention extra ("faesm")
pip install ".[faesm]"

For development and tests, install the test extra:

pip install -e ".[test]"
pytest

Embedding and most full benchmark runs require GPU hardware and task-specific input files.

Usage

Below we describe how to access and use BacBench to:

Access the datasets.
Embed the genomes using various models.
Evaluate the models on distinct tasks.
Download and preprocess bacterial genomes.

Datasets

All of the datasets are available on HuggingFace.

The datasets for essential genes prediction, operon identification, PPI, antibiotic resistance prediction and phenotypic traits prediction are available in DNA and/or protein sequence modalities. Due to the size of the datasets, we recommend streaming the datasets unless you have a lot of disk space available. See examples below.

from datasets import load_dataset


# essential genes prediction task
# protein sequences, size=59.2MB
essential_genes_prot_seqs_ds = load_dataset("macwiatrak/bacbench-essential-genes-protein-sequences")
# DNA sequences, size=92.2MB
essential_genes_dna_seqs_ds = load_dataset("macwiatrak/bacbench-essential-genes-dna")


# operon identification task
# protein sequences, size=15.3MB
operon_identification_prot_seqs_ds = load_dataset("macwiatrak/operon-identification-long-read-rna-sequencing-protein-sequences")
# DNA sequences, size=24MB
operon_identification_dna_ds = load_dataset("macwiatrak/operon-identification-long-read-rna-sequencing-dna")

# protein-protein interaction (PPI) task, 261 genomes
# protein sequences, size=792MB
ppi_prot_seqs_ds = load_dataset("macwiatrak/bacbench-ppi-stringdb-protein-sequences-small", streaming=True)
# DNA sequences, size=985MB
ppi_dna_ds = load_dataset("macwiatrak/bacbench-ppi-stringdb-dna-small", streaming=True)
# for the large version of the PPI dataset (>10k genomes, available only in protein sequences modality), size=58GB, use the following dataset:
ppi_ds = load_dataset("macwiatrak/bacbench-ppi-stringdb-protein-sequences", streaming=True)


# antibiotic resistance prediction task
# protein sequences, size=38.8GB
ar_prot_seqs_ds = load_dataset("macwiatrak/bacbench-antibiotic-resistance-protein-sequences", streaming=True)
# DNA sequences, size=54.9GB
ar_dna_ds = load_dataset("macwiatrak/bacbench-antibiotic-resistance-dna", streaming=True)


# phenotypic traits prediction task
# protein sequences, size=36GB
pheno_traits_prot_seqs_ds = load_dataset("macwiatrak/bacbench-phenotypic-traits-protein-sequences", streaming=True)
# DNA sequences, size=51.1GB
pheno_traits_dna_ds = load_dataset("macwiatrak/bacbench-phenotypic-traits-dna", streaming=True)

Dataset details including the number of genomes and more are available in the dataset cards on HuggingFace.
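
When streaming, it is often useful to inspect a handful of records before committing to a full pass over the data. The sketch below shows one way to do that with the standard library only. The `fake_stream` generator and its `genome_name` field are hypothetical stand-ins for a streamed split, so the snippet runs without network access.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def peek(stream: Iterable[dict[str, Any]], n: int = 3) -> list[dict[str, Any]]:
    """Return the first n records, pulling no more than n items from the stream."""
    return list(islice(stream, n))


def fake_stream() -> Iterator[dict[str, Any]]:
    """Hypothetical stand-in for a streamed split (field name is illustrative)."""
    for i in range(1_000_000):
        yield {"genome_name": f"genome_{i}"}


# Only two records are ever generated, even though the stream is very long.
first_rows = peek(fake_stream(), n=2)
print(first_rows)
```

With a real dataset, the object returned by load_dataset(..., streaming=True) is keyed by split, so peek would be applied to one split of it. The available split names are listed in the dataset cards and are not assumed here.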

Embedding genomes

We provide extendable scripts to embed genomes at the gene and whole-genome level using various models.

Embedding genomes is the first step to evaluating the models on the tasks. We include details on how to embed genomes for each task in the task-specific README files in the bacbench/tasks/ directory.

Below, we show examples of how to embed genomes using the supported models on a few tasks.

Note: Running embedding scripts requires GPU hardware for practical performance.

Essential genes prediction task

# embed and save the genomes using the ESM-C model
python bacbench/modeling/run_embed_prot_seqs.py \
    --dataset-name macwiatrak/bacbench-essential-genes-protein-sequences \
    --output-filepath <output-dir>/essential_genes_esmc_embeddings.parquet \
    --model-path Synthyra/ESMplusplus_small \
    --batch-size 64

# embed and save the genomes using the Bacformer model
python bacbench/modeling/run_embed_prot_seqs.py \
    --dataset-name macwiatrak/bacbench-essential-genes-protein-sequences \
    --output-filepath <output-dir>/essential_genes_bacformer_embeddings.parquet \
    --model-path macwiatrak/bacformer-large-masked-complete-genomes \
    --batch-size 64 \
    --max-n-proteins 9000  # max number of proteins in a genome

# embed and save the genomes using the Nucleotide Transformer model
python bacbench/modeling/run_embed_dna.py \
    --dataset-name macwiatrak/bacbench-essential-genes-dna \
    --output-filepath <output-dir>/essential_genes_nt_embeddings.parquet \
    --model-path InstaDeepAI/nucleotide-transformer-v2-250m-multi-species \
    --batch-size 128 \
    --max-seq-len 2048 \
    --dna-seq-overlap 32  # overlap between the sequences when the gene length is higher than --max-seq-len, default value
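
To illustrate what --max-seq-len and --dna-seq-overlap control, here is a minimal sketch of splitting a long gene into overlapping windows. It is an illustration only, not BacBench's implementation: it counts characters (nucleotides), whereas the scripts may count model tokens, and the function name `overlapping_windows` is hypothetical.

```python
def overlapping_windows(seq: str, max_len: int, overlap: int) -> list[str]:
    """Split seq into windows of at most max_len characters.

    Consecutive windows share `overlap` characters. Sequences that already
    fit within max_len are returned as a single window.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if len(seq) <= max_len:
        return [seq]

    step = max_len - overlap  # how far each new window advances
    windows = []
    for start in range(0, len(seq), step):
        windows.append(seq[start:start + max_len])
        if start + max_len >= len(seq):
            break  # the last window reached the end of the sequence
    return windows


windows = overlapping_windows("ACGTACGTAC", max_len=4, overlap=1)
print(windows)
```

The per-window embeddings then need to be combined into one vector per gene. How BacBench does this is not shown here.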

Operon identification task

# embed and save the genomes using the ProtBert model
python bacbench/modeling/run_embed_prot_seqs.py \
    --dataset-name macwiatrak/operon-identification-long-read-rna-sequencing-protein-sequences \
    --output-filepath <output-dir>/operon_identification_protbert_embeddings.parquet \
    --model-path Rostlab/prot_bert  \
    --batch-size 64

# embed and save the genomes using the Bacformer model
python bacbench/modeling/run_embed_prot_seqs.py \
    --dataset-name macwiatrak/operon-identification-long-read-rna-sequencing-protein-sequences \
    --output-filepath <output-dir>/operon_identification_bacformer_embeddings.parquet \
    --model-path macwiatrak/bacformer-masked-complete-genomes \
    --batch-size 64 \
    --max-n-proteins 9000  # max number of proteins in a genome, default value


# embed and save the genomes using the Mistral-DNA model
python bacbench/modeling/run_embed_dna.py \
    --dataset-name macwiatrak/operon-identification-long-read-rna-sequencing-dna \
    --output-filepath <output-dir>/operon_identification_mistral_embeddings.parquet \
    --model-path Raphaelmourad/Mistral-DNA-v1-138M-bacteria \
    --batch-size 256 \
    --max-seq-len 512 \
    --dna-seq-overlap 16

Antibiotic resistance prediction task

# embed and save the genomes using the ESM-2 model
python bacbench/modeling/run_embed_prot_seqs.py \
    --dataset-name macwiatrak/bacbench-antibiotic-resistance-protein-sequences \
    --output-filepath <output-dir>/amr_esm2_embeddings.parquet \
    --model-path facebook/esm2_t12_35M_UR50D \
    --batch-size 64 \
    --genome-pooling-method mean \
    --agg-whole-genome \
    --streaming

# embed and save the genomes using the Bacformer model
python bacbench/modeling/run_embed_prot_seqs.py \
    --dataset-name macwiatrak/bacbench-antibiotic-resistance-protein-sequences \
    --output-filepath <output-dir>/amr_bacformer_embeddings.parquet \
    --model-path macwiatrak/bacformer-large-masked-complete-genomes \
    --batch-size 64 \
    --genome-pooling-method mean \
    --agg-whole-genome \
    --streaming \
    --max-n-proteins 9000  # max number of proteins in a genome, default value


# embed and save the genomes using the Nucleotide Transformer model
python bacbench/modeling/run_embed_dna.py \
    --dataset-name macwiatrak/bacbench-antibiotic-resistance-dna \
    --output-filepath <output-dir>/amr_nucleotide_transformer_embeddings.parquet \
    --model-path InstaDeepAI/nucleotide-transformer-v2-250m-multi-species \
    --batch-size 128 \
    --max-seq-len 2048 \
    --dna-seq-overlap 32 \
    --agg-whole-genome \
    --genome-pooling-method mean \
    --streaming
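
For intuition, mean pooling (the value passed to --genome-pooling-method above) can be sketched as averaging the per-gene vectors dimension by dimension to obtain one vector per genome. This is a pure-Python illustration under that assumption, with a hypothetical function name. In practice the same operation would run on tensors or arrays.

```python
from statistics import fmean


def mean_pool(gene_embeddings: list[list[float]]) -> list[float]:
    """Average a list of equal-length vectors into a single vector."""
    if not gene_embeddings:
        raise ValueError("need at least one embedding to pool")
    dim = len(gene_embeddings[0])
    if any(len(vec) != dim for vec in gene_embeddings):
        raise ValueError("all embeddings must have the same dimension")
    # zip(*...) iterates over columns, i.e. one embedding dimension at a time.
    return [fmean(column) for column in zip(*gene_embeddings)]


# Three toy "gene" embeddings of dimension 2 pooled into one "genome" vector.
genome_vec = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(genome_vec)
```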

Protein-protein interaction task

# embed and save per-protein embeddings for PPI training/evaluation
python bacbench/modeling/run_embed_prot_seqs.py \
    --dataset-name macwiatrak/bacbench-ppi-stringdb-protein-sequences-small \
    --output-filepath <output-dir>/ppi_esm2_embeddings.parquet \
    --model-path facebook/esm2_t12_35M_UR50D \
    --batch-size 64 \
    --streaming

Note: DNABERT-2 has its own dependency requirements. To install them, please refer to the DNABERT-2 GitHub repository.

Embedding slices of the dataset: We also provide functionality to embed only a slice of the dataset, which is useful for testing and debugging. Pass the --start-idx and --end-idx arguments to specify the slice of the dataset you want to embed. Both the run_embed_dna.py and run_embed_prot_seqs.py scripts support this functionality.
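
The same two arguments can also be used to split a dataset across several jobs. The sketch below generates index ranges and the matching command lines. It rests on assumptions: the number of rows is known in advance, --end-idx is treated as exclusive, and the per-shard output file naming is hypothetical.

```python
import shlex


def index_shards(n_rows: int, shard_size: int) -> list[tuple[int, int]]:
    """Return (start, end) pairs covering range(n_rows); end is exclusive here."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    return [(s, min(s + shard_size, n_rows)) for s in range(0, n_rows, shard_size)]


def embed_command(start: int, end: int, dataset_name: str,
                  model_path: str, output_dir: str) -> str:
    """Build one embedding command for a slice (output naming is illustrative)."""
    args = [
        "python", "bacbench/modeling/run_embed_prot_seqs.py",
        "--dataset-name", dataset_name,
        "--model-path", model_path,
        "--output-filepath", f"{output_dir}/embeddings_{start}_{end}.parquet",
        "--start-idx", str(start),
        "--end-idx", str(end),
    ]
    return shlex.join(args)


shards = index_shards(n_rows=10, shard_size=4)
commands = [
    embed_command(s, e,
                  dataset_name="macwiatrak/bacbench-essential-genes-protein-sequences",
                  model_path="facebook/esm2_t12_35M_UR50D",
                  output_dir="out")
    for s, e in shards
]
print("\n".join(commands))
```

Check the script's help output to confirm whether --end-idx is inclusive or exclusive before relying on shard boundaries.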

See Benchmarked models section for the list of currently supported models.

Model evaluation

We provide scripts to evaluate the embedding models for each task in the bacbench/tasks/ directory. We include details on how to evaluate models for each task in the task-specific README files in the bacbench/tasks/ directory.

Below, we show examples of how to evaluate the models using embedded data.

Note: to run evaluation scripts, you need to have the embeddings saved in a parquet file (see above examples for how to embed the genomes).

Essential genes prediction task

python bacbench/tasks/essential_genes/run_train_cls.py \
    --input-df-file-path <input-dir>/essential_genes_esmc_embeddings.parquet \
    --output-dir <output-dir> \
    --lr 0.005 \
    --max-epochs 100 \
    --model-name esmc

Operon identification task

python bacbench/tasks/operon/run_evaluation_long_read_rna_seq.py \
    --input-filepath <input-dir>/operon_identification_bacformer_embeddings.parquet \
    --output-filepath <output-filepath>

Protein-protein interaction task

# Train an MLP on PPI pairs
python bacbench/tasks/ppi/run_train_mlp.py \
    --input-filepath <input-dir>/ppi_esm2_embeddings.parquet \
    --output-dir <output-dir> \
    --max-epochs 10

# Run unsupervised evaluation directly from pair scores
python bacbench/tasks/ppi/run_unsupervised_eval.py \
    --input-filepath <input-dir>/ppi_esm2_embeddings.parquet \
    --output-dir <output-dir> \
    --model-name esm2

Antibiotic resistance prediction task

python bacbench/tasks/antibiotic_resistance/train_and_predict_linear.py \
    --input-genomes-df-filepath <input-dir>/amr_esm2_embeddings.parquet \
    --labels-df-filepath <input-dir>/binary_labels.csv \
    --output-dir <output-dir> \
    --model-name esm2 \
    --lr 0.005

Phenotypic traits prediction task

python bacbench/tasks/phenotypic_traits/train_and_predict_linear.py \
    --input-genomes-df-filepath <input-dir>/pheno_bacformer_embeddings.parquet \
    --labels-df-filepath <input-dir>/labels.csv \
    --output-dir <output-dir> \
    --model-name bacformer \
    --lr 0.01

For more details on how to run the evaluation scripts, please refer to the scripts in the bacbench/tasks/ directory.

Large dataset tips

Use --streaming when loading large Hugging Face datasets.
Use --start-idx and --end-idx for quick debugging runs on a small slice.
Use --save-every-n-rows with --output-dir for streaming embedding jobs that should checkpoint partial parquet chunks.
Use task-specific README files in bacbench/tasks/ for full dataset and label-file locations.
For PPI training, use --use-incremental-parquet-read if the embedding parquet is too large to read into memory while building train/validation/test splits.
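
The idea behind periodic checkpointing (as with --save-every-n-rows) can be sketched as grouping a stream of rows into fixed-size chunks and writing each chunk as soon as it is full, so that memory use stays bounded. The example below shows only the grouping step. The `part-00000.parquet` style names are hypothetical and nothing is written to disk.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def chunked(rows: Iterable[T], n: int) -> Iterator[list[T]]:
    """Yield successive lists of up to n rows without materialising the stream."""
    if n <= 0:
        raise ValueError("n must be positive")
    iterator = iter(rows)
    while chunk := list(islice(iterator, n)):
        yield chunk


# Each chunk would be written to its own file in a real job.
chunks = list(chunked(range(7), 3))
for i, chunk in enumerate(chunks):
    print(f"part-{i:05d}.parquet", chunk)
```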
Keep GPU-specific extras such as faesm, DNABERT-2 requirements, and Evo2 requirements in separate environments when possible.

Download and preprocess genomes

To make it easier to download and preprocess genomes, we provide functionality as part of the bacbench package to (i) download genomes from NCBI/GenBank and (ii) preprocess them to the required format for the models. See examples below.

from bacbench.pp import (
    extract_protein_info_from_gbff,
    extract_protein_info_from_gff,
    extract_dna_info_from_fna,
    download_and_process_genome_by_taxid,
    download_and_process_genome_by_assembly_id,
)

# given a GBFF file, extract the protein sequences and their annotations
# for example, we can use the Pseudomonas aeruginosa PAO1 genome = https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000006765.1/
genome_protein_seqs_df = extract_protein_info_from_gbff("<input-dir>/GCF_000006765.1.gbff")
# given a GFF file, extract the protein info
genome_protein_info_df = extract_protein_info_from_gff("<input-dir>/GCF_000006765.1.gff")
# given a FNA file, extract the DNA sequences
genome_dna_seqs_df = extract_dna_info_from_fna("<input-dir>/GCF_000006765.1_ASM676v1_genomic.fna")

# we also provide functionality to download and preprocess genomes from NCBI/GenBank
# download and preprocess a genome by its taxid
taxid_df = download_and_process_genome_by_taxid(
    taxid=208964,  # taxid for Pseudomonas aeruginosa PAO1
    file_type="gbff",
)
# download and preprocess a genome by its assembly id
assembly_id_df = download_and_process_genome_by_assembly_id(
    assembly_id="GCF_000006765.1",
    file_type="gbff",
)
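
To process several assemblies in one go, a simple loop that records failures instead of stopping at the first error may be enough. The sketch below is an assumption about how one might wrap the single-genome helpers and is not part of the bacbench package. `process_assemblies` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real download so the snippet runs offline.

```python
from typing import Any, Callable, Iterable


def process_assemblies(
    assembly_ids: Iterable[str],
    fetch: Callable[[str], Any],
) -> tuple[dict[str, Any], dict[str, str]]:
    """Call fetch for each assembly id, collecting results and failures separately."""
    results: dict[str, Any] = {}
    failures: dict[str, str] = {}
    for assembly_id in assembly_ids:
        try:
            results[assembly_id] = fetch(assembly_id)
        except Exception as exc:  # record the failure and continue with the next id
            failures[assembly_id] = repr(exc)
    return results, failures


def fake_fetch(assembly_id: str) -> dict[str, str]:
    """Hypothetical stand-in for a download-and-process call."""
    if not assembly_id.startswith("GCF_"):
        raise ValueError(f"unrecognised assembly id: {assembly_id}")
    return {"assembly_id": assembly_id}


results, failures = process_assemblies(["GCF_000006765.1", "not-an-id"], fake_fetch)
print(sorted(results), sorted(failures))
```

With BacBench installed, `fetch` could be a small wrapper that calls download_and_process_genome_by_assembly_id with assembly_id and file_type="gbff", as in the example above.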

Benchmarked models

We currently support the following models:

Model	Input	Variant / Checkpoint	Objective	Params	dim	Max context
Mistral-DNA	DNA	Mistral-DNA-v1-138M-bacteria	Autoregressive	138 M	768	512
DNABERT-2*	DNA	DNABERT-2-117M	Masked	117 M	768	512
Nucleotide Transformer	DNA	nucleotide-transformer-v2-250m-multi-species	Masked	250 M	768	2 048
ProkBERT	DNA	neuralbioinfo/prokbert-mini-long	Masked	27 M	384	4 096
Evo	DNA	evo-1-8k-base (1.1_fix)	Autoregressive	6.5 B	4 096	8 192
Evo2**	DNA	evo2_1b_base	Autoregressive	1 B	1 920	8 192
ESM-2	Single protein seq.	esm2_t12_35M_UR50D	Masked	35 M	480	1 024
ESM-C	Single protein seq.	esmc_300m	Masked	300 M	960	2 048
ESMPlusPlus (reimplementation of ESMC)	Single protein seq.	Synthyra/ESMplusplus_small	Masked	300 M	960	2 048
ProtBert	Single protein seq.	prot_bert	Masked	420 M	1 024	1 024
gLM2	Mixed modality (DNA & protein)	tattabio/gLM2_650M	Masked	650 M	1 280	4 096
BacLM	Mixed modality (DNA or protein)	macwiatrak/baclm-350m-masked	Masked	350 M	960	2 048
Bacformer	Multiple protein seq.	bacformer-masked-complete-genomes†	Masked	27 M	480	6 000
Bacformer Large	Multiple protein seq.	bacformer-large-masked-complete-genomes†	Masked	27 M	960	6 000

* DNABERT-2 has its own dependency requirements. To install them, please refer to the DNABERT-2 GitHub repository.

** Evo2 has its own dependency requirements. To install them, please refer to the Evo2 GitHub repository. We recommend running Evo2 in a container.

† Historical strain clustering runs used the MAG version of the Bacformer model (bacformer-masked-MAG and bacformer-large-masked-MAG) because the inputs are metagenome-assembled genomes (MAGs), rather than complete genomes.

Note: the mixed modality models (gLM2 and BacLM) can take both DNA and protein sequences as input. However, the current implementation in bacbench/modeling/embed_prot_seqs.py and bacbench/modeling/embed_dna.py supports only one modality at a time (either DNA or protein sequences), not both together. We plan to add support for using both modalities as input for the mixed modality models, and work-in-progress scripts for this are available in bacbench/modeling/utils/scripts.

Task overview

Task	Status	Input modality	Embedding granularity	Main evaluation script
Essential genes prediction	Active	DNA or protein	Gene/protein embeddings	`bacbench/tasks/essential_genes/run_train_cls.py`
Operon identification from long read RNA-seq	Active	DNA or protein	Per-gene embeddings grouped by contig	`bacbench/tasks/operon/run_evaluation_long_read_rna_seq.py`
Protein-protein interaction	Active	Protein	Per-protein embeddings with STRING-derived PPI labels	`bacbench/tasks/ppi/run_train_mlp.py`, `bacbench/tasks/ppi/run_unsupervised_eval.py`
Antibiotic resistance prediction	Active	DNA or protein	Whole-genome embeddings	`bacbench/tasks/antibiotic_resistance/train_and_predict_linear.py`
Phenotypic traits prediction	Active	DNA or protein	Whole-genome embeddings	`bacbench/tasks/phenotypic_traits/train_and_predict_linear.py`
Strain clustering	Deprecated	DNA or protein	Whole-genome embeddings	`bacbench/tasks/strain_clustering/run_evaluation.py`

Contributing

We welcome contributions to BacBench! If you would like to contribute, please follow these steps:

Fork the repository.
Install pre-commit and set up the pre-commit hooks (make sure to do it at the root of the repository).

pip install pre-commit
pre-commit install

Create a new branch for your feature or bug fix.
Make your changes and commit them.
Push your changes to your forked repository.
Create a pull request to the main repository.
Make sure to add tests for your changes and run the tests to ensure everything is working correctly.

Citation

Citation details will be added when the manuscript/preprint is available.

To-do-list

Publish to pypi
Create model leaderboard for each task
Add support for adding new models to the benchmark
Add dataset details to the repository
Add support for batch downloading genomes from NCBI/GenBank

Contact

For questions, bugs, and feature requests, please raise an issue in the repository.

Acknowledgements

We sincerely thank the authors of the following open-source projects:
