BacBench is a multi-scale and multi-task benchmark for evaluating ML models for bacterial genomics across the bacterial tree of life. Currently BacBench includes 5 active tasks collected and curated from public databases: (1) essential genes prediction, (2) operon identification, (3) protein-protein interaction (PPI), (4) antibiotic resistance prediction and (5) phenotypic traits prediction. The strain clustering task is deprecated but remains available in the repository for reproducibility.
BacBench allows for embedding and evaluating genomes using various models (see Benchmarked models section), as well as preprocessing bacterial genomes.
- 2026-05-13: Updated the essential genes and PPI tasks with a phylogeny-aware split by genus. Added BacLM, a new masked language model trained on both DNA and protein sequences. Deprecated the strain clustering task.
- 2026-02-26: Added more embedding models, including Evo2 (recommended to run inside the Evo2 container), ProkBERT, ESMPlusPlus and gLM2.
- 2025-05-15: BacBench datasets are now available on HuggingFace.
- Setup
- Usage
- Benchmarked models
- Task overview
- Contributing
- Citation
- To-do-list
- Contact
- Acknowledgements
BacBench uses PyTorch, HuggingFace Transformers, and PyTorch Lightning,
and was developed with Python 3.10.
To compute ESM-2, ESM-C, and Bacformer embeddings efficiently, BacBench can use the optional faesm/faplm extra, which requires flash-attention.
We recommend using BacBench on a machine with (1) considerable disk space (for downloading datasets) and (2) a GPU (for embedding genomes and running some evaluations).
Before installing BacBench, make sure to create a new python environment. We recommend using mamba, conda or venv to create a new environment.
You can install BacBench by cloning the repository and installing the dependencies:
git clone https://github.com/macwiatrak/BacBench.git
cd BacBench
# 1) install BacBench **with its core dependencies**
pip install .
We also recommend installing the faesm package, which provides fast inference for ESM-2 and ESM-C models.
Note: Only install faesm on a machine with a GPU and CUDA installed.
# 2) (optional but recommended) add the fast‐attention extra (“faesm”)
pip install ".[faesm]"
For development and tests, install the test extra:
pip install -e ".[test]"
pytest
Embedding and most full benchmark runs require GPU hardware and task-specific input files.
Below we describe how to access and use BacBench to:
- Access the datasets.
- Embed the genomes using various models.
- Evaluate the models on distinct tasks.
- Download and preprocess bacterial genomes.
All of the datasets are available on HuggingFace.
The datasets for essential genes prediction, operon identification, PPI, antibiotic resistance prediction
and phenotypic traits prediction are available in DNA and/or protein sequence modalities. Due to the size of the
datasets, we recommend streaming the datasets unless you have a lot of disk space available. See examples below.
from datasets import load_dataset
# essential genes prediction task
# protein sequences, size=59.2MB
essential_genes_prot_seqs_ds = load_dataset("macwiatrak/bacbench-essential-genes-protein-sequences")
# DNA sequences, size=92.2MB
essential_genes_dna_seqs_ds = load_dataset("macwiatrak/bacbench-essential-genes-dna")
# operon identification task
# protein sequences, size=15.3MB
operon_identification_prot_seqs_ds = load_dataset("macwiatrak/operon-identification-long-read-rna-sequencing-protein-sequences")
# DNA sequences, size=24MB
operon_identification_dna_ds = load_dataset("macwiatrak/operon-identification-long-read-rna-sequencing-dna")
# protein-protein interaction (PPI) task, 261 genomes
# protein sequences, size=792MB
ppi_prot_seqs_ds = load_dataset("macwiatrak/bacbench-ppi-stringdb-protein-sequences-small", streaming=True)
# DNA sequences, size=985MB
ppi_dna_ds = load_dataset("macwiatrak/bacbench-ppi-stringdb-dna-small", streaming=True)
# for the large version of the PPI dataset (>10k genomes, available only in protein sequences modality), size=58GB, use the following dataset:
ppi_ds = load_dataset("macwiatrak/bacbench-ppi-stringdb-protein-sequences", streaming=True)
# antibiotic resistance prediction task
# protein sequences, size=38.8GB
ar_prot_seqs_ds = load_dataset("macwiatrak/bacbench-antibiotic-resistance-protein-sequences", streaming=True)
# DNA sequences, size=54.9GB
ar_dna_ds = load_dataset("macwiatrak/bacbench-antibiotic-resistance-dna", streaming=True)
# phenotypic traits prediction task
# protein sequences, size=36GB
pheno_traits_prot_seqs_ds = load_dataset("macwiatrak/bacbench-phenotypic-traits-protein-sequences", streaming=True)
# DNA sequences, size=51.1GB
pheno_traits_dna_ds = load_dataset("macwiatrak/bacbench-phenotypic-traits-dna", streaming=True)
Dataset details, including the number of genomes, are available in the dataset cards on HuggingFace.
We provide extendable scripts to embed genomes at the gene and whole-genome level using various models.
Embedding genomes is the first step to evaluating the models on the tasks. We include details on how to embed
genomes for each task in the task-specific README files in the bacbench/tasks/ directory.
Below, we show examples of how to embed genomes using the supported models on a few tasks.
Note: Running embedding scripts requires GPU hardware for practical performance.
# embed and save the genomes using the ESM-C model
python bacbench/modeling/run_embed_prot_seqs.py \
--dataset-name macwiatrak/bacbench-essential-genes-protein-sequences \
--output-filepath <output-dir>/essential_genes_esmc_embeddings.parquet \
--model-path Synthyra/ESMplusplus_small \
--batch-size 64
# embed and save the genomes using the Bacformer model
python bacbench/modeling/run_embed_prot_seqs.py \
--dataset-name macwiatrak/bacbench-essential-genes-protein-sequences \
--output-filepath <output-dir>/essential_genes_bacformer_embeddings.parquet \
--model-path macwiatrak/bacformer-large-masked-complete-genomes \
--batch-size 64 \
--max-n-proteins 9000 # max number of proteins in a genome
# embed and save the genomes using the Nucleotide Transformer model
python bacbench/modeling/run_embed_dna.py \
--dataset-name macwiatrak/bacbench-essential-genes-dna \
--output-filepath <output-dir>/essential_genes_nt_embeddings.parquet \
--model-path InstaDeepAI/nucleotide-transformer-v2-250m-multi-species \
--batch-size 128 \
--max-seq-len 2048 \
--dna-seq-overlap 32 # overlap between the sequences when the gene length is higher than --max-seq-len, default value
# embed and save the genomes using the ProtBert model
python bacbench/modeling/run_embed_prot_seqs.py \
--dataset-name macwiatrak/operon-identification-long-read-rna-sequencing-protein-sequences \
--output-filepath <output-dir>/operon_identification_protbert_embeddings.parquet \
--model-path Rostlab/prot_bert \
--batch-size 64
# embed and save the genomes using the Bacformer model
python bacbench/modeling/run_embed_prot_seqs.py \
--dataset-name macwiatrak/operon-identification-long-read-rna-sequencing-protein-sequences \
--output-filepath <output-dir>/operon_identification_bacformer_embeddings.parquet \
--model-path macwiatrak/bacformer-masked-complete-genomes \
--batch-size 64 \
--max-n-proteins 9000 # max number of proteins in a genome, default value
# embed and save the genomes using the Mistral-DNA model
python bacbench/modeling/run_embed_dna.py \
--dataset-name macwiatrak/operon-identification-long-read-rna-sequencing-dna \
--output-filepath <output-dir>/operon_identification_mistral_embeddings.parquet \
--model-path Raphaelmourad/Mistral-DNA-v1-138M-bacteria \
--batch-size 256 \
--max-seq-len 512 \
--dna-seq-overlap 16
# embed and save the genomes using the ESM-2 model
python bacbench/modeling/run_embed_prot_seqs.py \
--dataset-name macwiatrak/bacbench-antibiotic-resistance-protein-sequences \
--output-filepath <output-dir>/amr_esm2_embeddings.parquet \
--model-path facebook/esm2_t12_35M_UR50D \
--batch-size 64 \
--genome-pooling-method mean \
--agg-whole-genome \
--streaming
# embed and save the genomes using the Bacformer model
python bacbench/modeling/run_embed_prot_seqs.py \
--dataset-name macwiatrak/bacbench-antibiotic-resistance-protein-sequences \
--output-filepath <output-dir>/amr_bacformer_embeddings.parquet \
--model-path macwiatrak/bacformer-large-masked-complete-genomes \
--batch-size 64 \
--genome-pooling-method mean \
--agg-whole-genome \
--streaming \
--max-n-proteins 9000 # max number of proteins in a genome, default value
# embed and save the genomes using the Nucleotide Transformer model
python bacbench/modeling/run_embed_dna.py \
--dataset-name macwiatrak/bacbench-antibiotic-resistance-dna \
--output-filepath <output-dir>/amr_nucleotide_transformer_embeddings.parquet \
--model-path InstaDeepAI/nucleotide-transformer-v2-250m-multi-species \
--batch-size 128 \
--max-seq-len 2048 \
--dna-seq-overlap 32 \
--agg-whole-genome \
--genome-pooling-method mean \
--streaming
# embed and save per-protein embeddings for PPI training/evaluation
python bacbench/modeling/run_embed_prot_seqs.py \
--dataset-name macwiatrak/bacbench-ppi-stringdb-protein-sequences-small \
--output-filepath <output-dir>/ppi_esm2_embeddings.parquet \
--model-path facebook/esm2_t12_35M_UR50D \
--batch-size 64 \
--streaming
Note: DNABERT-2 has additional requirements; to install them, please refer to the DNABERT-2 GitHub repository.
Embedding slices of the dataset: We also provide functionality to embed only a slice of the dataset, which is useful for testing and debugging.
To use it, pass the --start-idx and --end-idx arguments to specify the slice of the dataset you want to embed.
Both run_embed_dna.py and run_embed_prot_seqs.py scripts support this functionality.
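The DNA examples above pass --max-seq-len together with --dna-seq-overlap, which applies when a gene is longer than the model's context. The sketch below illustrates the general idea of overlapping windows on a toy sequence. It is not the BacBench implementation: the function name split_with_overlap is hypothetical, lengths are counted in characters (the real scripts may count tokens), and how per-window embeddings are combined afterwards is not covered here.

```python
def split_with_overlap(seq: str, max_len: int, overlap: int) -> list[str]:
    """Split seq into windows of at most max_len characters.

    Consecutive windows share `overlap` characters. Hypothetical helper,
    written only to illustrate the idea behind overlapping windows.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if len(seq) <= max_len:
        return [seq]  # short sequences need no splitting

    step = max_len - overlap  # how far each new window advances
    windows = []
    for start in range(0, len(seq), step):
        windows.append(seq[start:start + max_len])
        if start + max_len >= len(seq):
            break  # the last window already reaches the end of the sequence
    return windows


gene = "ACGT" * 25  # toy 100-nt gene, made up for this example
chunks = split_with_overlap(gene, max_len=40, overlap=8)
print([len(c) for c in chunks])
```

Dropping the first `overlap` characters of every window after the first and concatenating the rest recovers the original sequence, which is a quick way to check that no nucleotides were lost.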
See Benchmarked models section for the list of currently supported models.
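The antibiotic resistance examples above combine --agg-whole-genome with --genome-pooling-method mean to obtain one embedding per genome. As a rough illustration of mean pooling, and assuming the genome embedding is simply the element-wise average of the per-gene (or per-protein) embeddings, a minimal standard-library sketch could look like this. The function name mean_pool and the toy vectors are made up for the example; the actual scripts operate on model outputs and may differ in detail.

```python
from statistics import fmean


def mean_pool(gene_embeddings: list[list[float]]) -> list[float]:
    """Average per-gene embeddings element-wise into one genome embedding.

    Hypothetical helper for illustration; not part of the bacbench package.
    """
    if not gene_embeddings:
        raise ValueError("need at least one gene embedding")
    dim = len(gene_embeddings[0])
    if any(len(e) != dim for e in gene_embeddings):
        raise ValueError("all embeddings must have the same dimension")
    # zip(*...) groups the i-th component of every gene embedding together
    return [fmean(component) for component in zip(*gene_embeddings)]


# Toy genome with three genes and 3-dimensional embeddings (made-up values).
toy_genome = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 2.0],
    [2.0, 2.0, 2.0],
]
genome_embedding = mean_pool(toy_genome)
print(genome_embedding)
```

The resulting vector has the same dimension as a single gene embedding, regardless of how many genes the genome contains.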
We provide scripts to evaluate the embeddings models for each task in the bacbench/tasks/ directory.
We include details on how to evaluate models for each task in the task-specific README files in the bacbench/tasks/ directory.
Below, we show examples of how to evaluate the models using embedded data.
Note: to run evaluation scripts, you need to have the embeddings saved in a parquet file (see above examples for how to embed the genomes).
python bacbench/tasks/essential_genes/run_train_cls.py \
--input-df-file-path <input-dir>/essential_genes_esmc_embeddings.parquet \
--output-dir <output-dir> \
--lr 0.005 \
--max-epochs 100 \
--model-name esmc
python bacbench/tasks/operon/run_evaluation_long_read_rna_seq.py \
--input-filepath <input-dir>/operon_identification_bacformer_embeddings.parquet \
--output-filepath <output-filepath>
# Train an MLP on PPI pairs
python bacbench/tasks/ppi/run_train_mlp.py \
--input-filepath <input-dir>/ppi_esm2_embeddings.parquet \
--output-dir <output-dir> \
--max-epochs 10
# Run unsupervised evaluation directly from pair scores
python bacbench/tasks/ppi/run_unsupervised_eval.py \
--input-filepath <input-dir>/ppi_esm2_embeddings.parquet \
--output-dir <output-dir> \
--model-name esm2
python bacbench/tasks/antibiotic_resistance/train_and_predict_linear.py \
--input-genomes-df-filepath <input-dir>/amr_esm2_embeddings.parquet \
--labels-df-filepath <input-dir>/binary_labels.csv \
--output-dir <output-dir> \
--model-name esm2 \
--lr 0.005
python bacbench/tasks/phenotypic_traits/train_and_predict_linear.py \
--input-genomes-df-filepath <input-dir>/pheno_bacformer_embeddings.parquet \
--labels-df-filepath <input-dir>/labels.csv \
--output-dir <output-dir> \
--model-name bacformer \
--lr 0.01
For more details on how to run the evaluation scripts, please refer to the scripts in the bacbench/tasks/ directory.
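The unsupervised PPI evaluation above works from pair scores rather than a trained classifier. How BacBench computes those scores is defined in bacbench/tasks/ppi/run_unsupervised_eval.py; the sketch below only illustrates one common choice, cosine similarity between two protein embeddings, and should be read as an assumption rather than a description of that script. The protein names and vectors are made up.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy per-protein embeddings (hypothetical identifiers and values).
protein_embeddings = {
    "protA": [1.0, 0.0],
    "protB": [2.0, 0.0],
    "protC": [0.0, 3.0],
}
candidate_pairs = [("protA", "protB"), ("protA", "protC")]

# Assumption: a higher score means the pair is more likely to interact.
scores = {
    (p, q): cosine_similarity(protein_embeddings[p], protein_embeddings[q])
    for p, q in candidate_pairs
}
print(scores)
```

Scores like these can be ranked against known interaction labels without any training step, which is what makes the evaluation unsupervised.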
- Use --streaming when loading large Hugging Face datasets.
- Use --start-idx and --end-idx for quick debugging runs on a small slice.
- Use --save-every-n-rows with --output-dir for streaming embedding jobs that should checkpoint partial parquet chunks.
- Use task-specific README files in bacbench/tasks/ for full dataset and label-file locations.
- For PPI training, use --use-incremental-parquet-read if the embedding parquet is too large to read into memory while building train/validation/test splits.
- Keep GPU-specific extras such as faesm, DNABERT-2 requirements, and Evo2 requirements in separate environments when possible.
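To make the slicing and checkpointing tips above more concrete, here is a small standard-library sketch of the underlying pattern: take a slice of a streamed sequence of rows, then process it in fixed-size chunks that could each be saved separately. The helper names take_slice and chunked are hypothetical, the end index is treated as exclusive (an assumption about --end-idx), and a plain generator stands in for a streamed dataset.

```python
from itertools import islice
from typing import Iterable, Iterator


def take_slice(rows: Iterable[dict], start_idx: int, end_idx: int) -> Iterator[dict]:
    """Yield rows start_idx..end_idx-1 from a stream without loading it all."""
    return islice(rows, start_idx, end_idx)


def chunked(rows: Iterable[dict], n: int) -> Iterator[list[dict]]:
    """Group rows into lists of at most n items, e.g. one list per saved chunk."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk


# Stand-in for a streamed dataset: rows are produced lazily, one at a time.
stream = ({"genome_id": i} for i in range(1_000))

subset = take_slice(stream, start_idx=10, end_idx=17)
ids = [[row["genome_id"] for row in chunk] for chunk in chunked(subset, n=3)]
print(ids)  # [[10, 11, 12], [13, 14, 15], [16]]
```

Because both helpers are lazy, only the rows inside the requested slice are held in memory, one chunk at a time.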
To make it easier to download and preprocess genomes, we provide functionality as part of the bacbench package to
(i) download genomes from NCBI/GenBank and (ii) preprocess them to the required format for the models. See examples below.
from bacbench.pp import (
extract_protein_info_from_gbff,
extract_protein_info_from_gff,
extract_dna_info_from_fna,
download_and_process_genome_by_taxid,
download_and_process_genome_by_assembly_id,
)
# given a GBFF file, extract the protein sequences and their annotations
# for example, we can use the Pseudomonas aeruginosa PAO1 genome = https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000006765.1/
genome_protein_seqs_df = extract_protein_info_from_gbff("<input-dir>/GCF_000006765.1.gbff")
# given a GFF file, extract the protein info
genome_protein_info_df = extract_protein_info_from_gff("<input-dir>/GCF_000006765.1.gff")
# given a FNA file, extract the DNA sequences
genome_dna_seqs_df = extract_dna_info_from_fna("<input-dir>/GCF_000006765.1_ASM676v1_genomic.fna")
# we also provide functionality to download and preprocess genomes from NCBI/GenBank
# download and preprocess a genome by its taxid
taxid_df = download_and_process_genome_by_taxid(
taxid=208964, # taxid for Pseudomonas aeruginosa PAO1
file_type="gbff",
)
# download and preprocess a genome by its assembly id
assembly_id_df = download_and_process_genome_by_assembly_id(
assembly_id="GCF_000006765.1",
file_type="gbff",
)
We currently support the following models:
| Model | Input | Variant / Checkpoint | Objective | Params | dim | Max context |
|---|---|---|---|---|---|---|
| Mistral-DNA | DNA | Mistral-DNA-v1-138M-bacteria | Autoregressive | 138 M | 768 | 512 |
| DNABERT-2* | DNA | DNABERT-2-117M | Masked | 117 M | 768 | 512 |
| Nucleotide Transformer | DNA | nucleotide-transformer-v2-250m-multi-species | Masked | 250 M | 768 | 2 048 |
| ProkBERT | DNA | neuralbioinfo/prokbert-mini-long | Masked | 27 M | 384 | 4 096 |
| Evo | DNA | evo-1-8k-base (1.1_fix) | Autoregressive | 6.5 B | 4 096 | 8 192 |
| Evo2** | DNA | evo_1b_base | Autoregressive | 1 B | 1920 | 8 192 |
| ESM-2 | Single protein seq. | esm2_t12_35M_UR50D | Masked | 35 M | 480 | 1 024 |
| ESM-C | Single protein seq. | esmc_300m | Masked | 300 M | 960 | 2 048 |
| ESMPlusPlus (reimplementation of ESMC) | Single protein seq. | Synthyra/ESMplusplus_small | Masked | 300 M | 960 | 2 048 |
| ProtBert | Single protein seq. | prot_bert | Masked | 420 M | 1 024 | 1 024 |
| gLM2 | Mixed modality (DNA & protein) | tattabio/gLM2_650M | Masked | 650 M | 1 280 | 4 096 |
| BacLM | Mixed modality (DNA or protein) | macwiatrak/baclm-350m-masked | Masked | 350 M | 960 | 2 048 |
| Bacformer | Multiple protein seq. | bacformer-masked-complete-genomes† | Masked | 27 M | 480 | 6 000 |
| Bacformer Large | Multiple protein seq. | bacformer-large-masked-complete-genomes† | Masked | 27 M | 960 | 6 000 |
* DNABERT-2 has additional requirements; to install them, please refer to the DNABERT-2 GitHub repository.
** Evo2 has additional requirements; to install them, please refer to the Evo2 GitHub repository. We recommend running Evo2 in a container.
† Historical strain clustering runs used the MAG version of the Bacformer model (bacformer-masked-MAG and bacformer-large-masked-MAG) because the inputs are metagenome-assembled genomes (MAGs), rather than complete genomes.
Note: the mixed modality models (gLM2 and BacLM) can take both DNA and protein sequences as input. The current implementation in bacbench/modeling/embed_prot_seqs.py and bacbench/modeling/embed_dna.py supports either DNA or protein sequences as input, but not both at the same time. We plan to add support for using both modalities together with the mixed modality models; work-in-progress scripts for this are available in bacbench/modeling/utils/scripts.
| Task | Status | Input modality | Embedding granularity | Main evaluation script |
|---|---|---|---|---|
| Essential genes prediction | Active | DNA or protein | Gene/protein embeddings | bacbench/tasks/essential_genes/run_train_cls.py |
| Operon identification from long read RNA-seq | Active | DNA or protein | Per-gene embeddings grouped by contig | bacbench/tasks/operon/run_evaluation_long_read_rna_seq.py |
| Protein-protein interaction | Active | Protein | Per-protein embeddings with STRING-derived PPI labels | bacbench/tasks/ppi/run_train_mlp.py, bacbench/tasks/ppi/run_unsupervised_eval.py |
| Antibiotic resistance prediction | Active | DNA or protein | Whole-genome embeddings | bacbench/tasks/antibiotic_resistance/train_and_predict_linear.py |
| Phenotypic traits prediction | Active | DNA or protein | Whole-genome embeddings | bacbench/tasks/phenotypic_traits/train_and_predict_linear.py |
| Strain clustering | Deprecated | DNA or protein | Whole-genome embeddings | bacbench/tasks/strain_clustering/run_evaluation.py |
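The essential genes and PPI tasks use a phylogeny-aware split by genus. The exact procedure and split ratios are defined by the BacBench datasets and scripts; the sketch below only illustrates the general principle of a group-wise split, in which every genome from a given genus ends up on the same side of the split. The function name split_by_genus, the test fraction, and the genome IDs are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_genus(
    genome_to_genus: dict[str, str], test_fraction: float = 0.2, seed: int = 0
) -> tuple[list[str], list[str]]:
    """Assign whole genera to train or test so that no genus appears in both."""
    genus_to_genomes = defaultdict(list)
    for genome_id, genus in genome_to_genus.items():
        genus_to_genomes[genus].append(genome_id)

    genera = sorted(genus_to_genomes)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(genera)
    n_test = max(1, round(len(genera) * test_fraction))

    test_ids = [g for genus in genera[:n_test] for g in genus_to_genomes[genus]]
    train_ids = [g for genus in genera[n_test:] for g in genus_to_genomes[genus]]
    return train_ids, test_ids


# Toy, made-up genome IDs mapped to their genus.
genome_to_genus = {
    "genome_1": "Escherichia",
    "genome_2": "Escherichia",
    "genome_3": "Pseudomonas",
    "genome_4": "Bacillus",
    "genome_5": "Bacillus",
    "genome_6": "Vibrio",
}
train_ids, test_ids = split_by_genus(genome_to_genus, test_fraction=0.25)
print("train:", train_ids)
print("test:", test_ids)
```

Splitting at the genus level rather than per genome keeps closely related genomes from appearing in both train and test, which would otherwise make held-out performance look better than it is.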
We welcome contributions to BacBench! If you would like to contribute, please follow these steps:
- Fork the repository.
- Install pre-commit and set up the pre-commit hooks (make sure to do it at the root of the repository).
pip install pre-commit
pre-commit install
- Create a new branch for your feature or bug fix.
- Make your changes and commit them.
- Push your changes to your forked repository.
- Create a pull request to the main repository.
- Make sure to add tests for your changes and run the tests to ensure everything is working correctly.
Citation details will be added when the manuscript/preprint is available.
- Publish to pypi
- Create model leaderboard for each task
- Add support for adding new models to the benchmark
- Add dataset details to the repository
- Add support for batch downloading genomes from NCBI/GenBank
For questions, bugs, and feature requests, please raise an issue in the repository.
We sincerely thank the authors of the following open-source projects:
