[[TOC]]
This workflow deconvolutes mixed read sets (e.g. plate sweep sequencing data, shotgun metagenomic data) and resolves them into strain-level bins. At its core it implements Themisto pseudoalignment of reads to a curated set of references, mSWEEP to estimate relative abundances and mGEMS to bin reads.
Indexing references with Themisto and clustering them can be automated by the pipeline. However, if you already have an index and a biologically meaningful grouping, you may provide these as inputs and skip some computation.
Additionally, for large reference datasets with significant redundancy we offer a reference refinement option. This subsets the references to a configurable number of maximally distant representatives from each cluster, conserving some within-cluster diversity whilst reducing compute demands.
Finally, we offer an experimental feature in which reference genomes are selected automatically by setting ref_mode to autoselect. Only reads need to be provided; these are queried using Sylph against GTDB to select appropriate genomes. This is still under development but feel free to try it out.
View the poster summary of this project presented at ISCB-UK 2026 here.
To run the pipeline from source (this repository):
- Clone the repository.
- To run with docker, use the `-profile docker` option:

      nextflow run <path/to/main.nf> \
        -profile docker \
        --manifest <path/to/manifest.csv> \
        --ref_mode full \
        --references <path/to/references.txt>

  Other profiles are also supported (docker, singularity).
⚠️ If no profile is specified the pipeline will run with a Sanger HPC-specific configuration. This pipeline's default settings are optimised for running on the Sanger HPC, including making use of temp storage. To run on other systems please configure the parameters appropriately.
See Parameters for available pipeline options.
First load modules for gemsweep, nextflow and ISG/singularity.
Instead of `nextflow run main.nf` you can now run the pipeline on the command line with `gemsweep <options>`. For instance, to see a help message:

    module load gemsweep
    gemsweep --help

- Paired-end reads per (mixed) sample (unless using `--ref_prep_only`).

  To provide locally stored reads, either use `--manifest` (or its alias `--manifest_of_reads`) to supply a CSV file with the mandatory header line `ID,R1,R2` and rows containing the read ID, the path to `<read 1>.fastq.gz` and the path to `<read 2>.fastq.gz`, or use `--manifest_from_dir` to supply a directory containing the reads (can be combined with `--max_depth`, an integer setting how many sub-directories deep to look for reads).

  Alternatively you can supply reads from ENA or, if you have access, Sanger's iRODS. See here for more detail: https://gitlab.internal.sanger.ac.uk/sanger-pathogens/pipelines/assorted-sub-workflows/-/blob/main/mixed_input/README.md?ref_type=heads
- One of the following options for supplying references:
  - a prebuilt Themisto index of references AND a reference grouping text file*
  - a text file of paths to each reference, 'references.txt' (indexing and clustering will happen within the pipeline)
  - none, when opting for autoselection of GTDB genomes based on Sylph queries of the supplied reads
Compatible parameters for each reference mode (`ref_mode`):

| Reference Mode | Required Params | Mode Description |
|---|---|---|
| `index` | `themisto_index`, `ref_groups` | The supplied index and reference groupings will be validated and directly used in the main workflow (Themisto pseudoalignment, mSWEEP relative abundance estimation and mGEMS read binning). When supplying a prebuilt index a) the k-mer size must be identical to the argument `themisto_k` (default: 31) and b) the reference grouping file must be in identical positional order to the references when indexed. |
| `full` | `references` | All references supplied will be indexed and clustered (with the workflow indicated by `cluster_dist`), and the produced index and groups files will be used in the main workflow. |
| `refine` | `references` | The references supplied will be clustered (NOTE: currently only compatible with `--cluster_dist core_acc`) and each cluster is dereplicated and capped to a maximum indicated by `representatives`. Index and groups files are produced for the selected representatives to use in the main workflow. |
| `autoselect` | N/A | The references are not supplied but rather derived from querying the reads against GTDB and using the hits as references for indexing and clustering before the main workflow. In the autoselect mode, reference genome clustering is always done with PopPUNK; the `cluster_dist` parameter has no effect. Also, the reference refinement process is always applied to subselect representative genomes from clustered references. |
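
The required parameters per mode can be checked before launching a run. The following is a minimal sketch that only transcribes the table above into a lookup; the function name `missing_params` and the plain-dict representation of parameters are illustrative assumptions, not part of the pipeline.

```python
# Required parameters per reference mode, transcribed from the table above.
REQUIRED_PARAMS = {
    "index": ("themisto_index", "ref_groups"),
    "full": ("references",),
    "refine": ("references",),
    "autoselect": (),
}


def missing_params(ref_mode, params):
    """Return the required parameter names that are unset for ref_mode.

    `params` is a hypothetical dict of parameter name -> value.
    """
    if ref_mode not in REQUIRED_PARAMS:
        raise ValueError(f"unknown ref_mode: {ref_mode!r}")
    return [name for name in REQUIRED_PARAMS[ref_mode] if not params.get(name)]
```

For example, calling `missing_params("index", {"themisto_index": "idx/index"})` reports that `ref_groups` still needs to be supplied.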
⚠ Experimental feature: autoselection (BETA)
The autoselection feature is still under development in this release. Please read the following before enabling it.
Resource allocation
A custom config may be needed to increase resource limits. The `POPPUNK` process peaks below 64 GB for all GTDB species except E. coli. For the `MSWEEP` process, memory should also be increased. As a rough guide, a Toy Human Gut (CAMI dataset) sample with 113 species detected required ~320 GB peak memory for `MSWEEP`. Less diverse read sets will run more easily.
Clustering limitations
PopPUNK is the only clustering strategy in `autoselect` mode and can fail when too few references are available, for example for species not well represented in GTDB. Ignored PopPUNK failures mean those species will be absent from the reference set passed to Themisto and mSWEEP/mGEMS, which may affect result robustness.
These limitations are targeted for improvement in a future release.
Main pipeline outputs are written under --outdir (./results by default).
- Binned reads per reference group: `results/<sample_id>/mGEMS/*`
- mSWEEP abundance/probability outputs:
  - `results/<sample_id>/<sample_id>_mSWEEP_abundances.txt`
  - `results/<sample_id>/<sample_id>_mSWEEP_probs.tsv`
- Read assignment table (optionally, with `--get_assignments`)
- The final reference genome paths, their groups and Themisto index, if generated within the pipeline run:
  - `results/ref_groups/references.txt`
  - `results/ref_groups/groups.txt`
  - `results/themisto/index.*`
  - `results/themisto/index_report.txt`
When using `--ref_mode autoselect`, Sylph outputs are written to:
- `results/sylph/combined_sylph_report.tsv`
- `results/<sample_id>/sylph/<sample_id>_sylph_profile.tsv`
- `results/<sample_id>/sylph/<sample_id>_sylphtax_profile.sylphmpa`

If `--save_sylph_sketches` or `--publish_poppunk` is true, the corresponding outputs will also be published in the results/ directory.
To generate a manifest of binned reads for downstream analysis, run generate_manifest.py from the assorted-sub-workflows submodule after your run has completed, as demonstrated below (path relative to repo root):

    mkdir mGEMs_bins_manifest
    ./assorted-sub-workflows/mixed_input/bin/generate_manifest.py \
      --input ./results \
      --output mGEMs_bins_manifest \
      --fastq_validation relaxed \
      --max_depth 2

- `--input`: path to your results directory (set by `--outdir`, default: `./results`)
- `--output`: output location for the CSV manifest of all discovered FASTQs
- `--max_depth 2`: searches 2 subdirectory levels deep, capturing all mGEMs bins across samples
Logging options
| Flag | Type | Default | Description |
|---|---|---|---|
| monochrome_logs | boolean | false | Output logs in plain ASCII (disable colored logging). |
General options
| Flag | Type | Default | Description |
|---|---|---|---|
| manifest | path | null | Input manifest CSV with required header ID,R1,R2, containing per-sample paths to .fastq.gz files. |
| outdir | path | "./results" | Path to top directory containing all results, by default results within the launch directory. |
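
A manifest with the required `ID,R1,R2` header can also be assembled by hand. The sketch below is one way to do this for a flat directory of paired reads; the function name `write_manifest` and the `_1.fastq.gz` / `_2.fastq.gz` file-naming convention are assumptions made for illustration, so adjust the suffixes to match your own files.

```python
import csv
from pathlib import Path


def write_manifest(read_dir, manifest_path,
                   r1_suffix="_1.fastq.gz", r2_suffix="_2.fastq.gz"):
    """Write an ID,R1,R2 manifest for every complete read pair in read_dir.

    The suffixes are an assumed naming convention, not a pipeline requirement.
    Returns the number of samples written.
    """
    read_dir = Path(read_dir)
    rows = []
    for r1 in sorted(read_dir.glob(f"*{r1_suffix}")):
        sample_id = r1.name[: -len(r1_suffix)]
        r2 = read_dir / f"{sample_id}{r2_suffix}"
        if r2.exists():  # skip samples whose mate file is missing
            rows.append((sample_id, str(r1.resolve()), str(r2.resolve())))
    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["ID", "R1", "R2"])  # mandatory header line
        writer.writerows(rows)
    return len(rows)
```

The resulting file can then be passed to `--manifest`.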
Workflow options
| Flag | Type | Default | Description |
|---|---|---|---|
| ref_prep_only | bool | false | Run only reference preparatory steps, skipping read pseudoalignment through to binning. |
| ref_mode | str | null | Required. Choose a reference input mode. Options: index, full, refine, autoselect. |
References options
| Flag | Type | Default | Description |
|---|---|---|---|
| references | path | null | Path to text file containing paths to references, one per line. |
| representatives | integer | 20 | Number of representatives at which to cap each reference cluster. Used when --ref_mode is refine or autoselect. |
| cluster_dist | str | core_acc | Genomic distance used in clustering references, options: core_acc or ani. Determines the clustering workflow, see below for more info. Only used when ref_mode is refine or full. |
The pipeline's idea of strain-level is defined by the clustering stage. When you supply groups in --ref_mode index the references are pre-clustered. In --ref_mode refine or full you have a choice of clustering workflows defined by the --cluster_dist param.
The default value --cluster_dist core_acc means that a poppunk workflow is applied; see PopPUNK Options below to configure. Be aware this is a non-deterministic mode of clustering, developed to cluster single-species genome datasets to the strain level. If you want to re-use the same clusters generated in a previous run you would need to use --ref_mode index. Note that --ref_mode autoselect currently only uses this poppunk-based clustering workflow.
Alternatively ANI-based community-finding algorithms are available; using --cluster_dist ani instead invokes sketchlib to estimate ANI similarities followed by a choice of community-finding algorithms from the package python-igraph, including some deterministic algorithms. See Sketchlib workflow options below to configure.
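
Once clusters exist, refinement caps each one at `representatives` genomes chosen to be maximally distant. The exact selection procedure is not described here, so the following is only an illustration of one common heuristic (greedy farthest-first selection), not the pipeline's implementation. The function name and the `dist` lookup keyed by `frozenset` pairs are assumptions.

```python
def pick_representatives(names, dist, cap):
    """Greedy farthest-first selection of at most `cap` genomes.

    `dist` maps frozenset({a, b}) to the pairwise distance between a and b
    (an assumed input format for this sketch).
    """
    if cap < 1:
        raise ValueError("cap must be at least 1")
    names = list(names)
    if len(names) <= cap:
        return names  # cluster is already within the cap
    chosen = [names[0]]  # arbitrary but deterministic seed
    while len(chosen) < cap:
        # Pick the genome whose nearest chosen representative is farthest away.
        best = max(
            (n for n in names if n not in chosen),
            key=lambda n: min(dist[frozenset((n, c))] for c in chosen),
        )
        chosen.append(best)
    return chosen
```

Seeding with the first genome keeps the result reproducible for a fixed input order; a different seed can give a different but similarly spread set.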
Reference Autoselection options
| Flag | Type | Default | Description |
|---|---|---|---|
| sylph_db | Path | "/data/pam/software/sylph/gtdb_full_r226.syldb" | Path to a pre-built Sylph database (.syldb) |
| sylph_tax_metadata | Path | "/data/pam/software/sylph-tax/v1/gtdb_r226_metadata.tsv" | Path to the sylph-tax metadata TSV to use for sylph-tax taxprof |
| sylph_k | int | 31 | K-mer size for sylph sketch. |
| sylph_min_ani | float | 95 | ANI threshold for Sylph filtering. |
| sylph_min_cov | float | 0.01 | Coverage threshold for Sylph filtering. |
| taxonomic_rank | str | species | Taxonomic rank by which to group references. Choices: domain, kingdom, phylum, class, order, family, genus, species. |
| pool_latin_taxa | bool | false | Advanced option. Ignores alphabet suffixes of GTDB divisions of latin-name taxa, thus pooling those subdivisions together. Not recommended to change unless the effects on output are understood; see below for more info. |
| save_sylph_sketches | bool | true | Keep Sylph sketches. |
| genome_id_to_file | Path | "/data/pam/collections/GTDB/release226/genomic_files_all_retrievable_2026_03_05/metadata/id_to_genome_path.tsv" | File from which to extract genome paths based on genome identifiers. |
Note on using pool_latin_taxa:
Certain genera/species in GTDB are further divided by appended alphabet suffixes; for example, in GTDB r226, Escherichia coli has 3 species-rank taxonomic groups: Escherichia_coli, Escherichia_coli_E and Escherichia_coli_F. Further explanation is available in the GTDB documentation. If you want to consider these as one group you can use this advanced option. Note that:
a) the generated groups are no longer compliant with GTDB taxonomic definitions; consider whether this affects downstream analyses
b) the produced group may be considerably larger; for example, at the genus level in GTDB release 232, g__Clostridium has 1607 genomes but all 34 GTDB genera matching g__Clostridium* total 2931 genomes.
Note that not all taxa belonging to a "traditional" species might be pooled this way, because certain GTDB species are named differently; for instance in GTDB r232, a new species called ECMA0423 sp047199055 has been created out of genomes previously classified as Escherichia_coli.
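
To make the pooling idea concrete, here is a minimal sketch of stripping alphabetic suffixes from underscore-delimited taxon names such as those shown above. The regular expression and the function name `pool_name` are assumptions for illustration; the pipeline's own implementation may treat edge cases differently.

```python
import re

# Matches a GTDB-style alphabetic suffix such as "_E" or "_AB" at the end of a
# name component; lowercase epithets such as "_coli" are left untouched.
_SUFFIX = re.compile(r"_[A-Z]+(?=_|$)")


def pool_name(taxon):
    """Strip alphabetic suffixes from an underscore-delimited taxon name."""
    return _SUFFIX.sub("", taxon)
```

With this rule, `Escherichia_coli_E` and `Escherichia_coli_F` both collapse to `Escherichia_coli`, while a differently named species such as `ECMA0423_sp047199055` is left unchanged and therefore not pooled.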
Cache Autoselection options
| Flag | Type | Default | Description |
|---|---|---|---|
| cache_dir | path | null | Path to a cache root or an existing config-specific cache directory for autoselect mode. The pipeline reuses a matching cache directory containing cache metadata and per-species reference/group entries. |
Cache Layout:

    <cache_root>/
      core_acc-bgmm-20_reps/
        metadata.json
        species/
          escherichia_coli/
            references.txt
            groups.txt
            metadata.json

Cache setup and lookup intermediates such as cache_config.json, cache_hits.tsv, and cache_miss.tsv are kept in the Nextflow work/ directory and are not published to results.
When --cache_dir is supplied, generated reference entries are written directly to the external cache directory, not to results.
The config-level metadata.json records the clustering settings used for that cache directory. Each species-level metadata.json records cache write/update details for that species, including update counts and added reference IDs.
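
The layout above can be navigated programmatically when inspecting a cache. The sketch below infers the directory naming (`<cluster_dist>-<model>-<N>_reps` and a lowercased, underscore-joined species name) purely from the single example shown, so treat both the naming scheme and the helper names as assumptions rather than the pipeline's actual lookup logic.

```python
from pathlib import Path

# File names taken from the per-species entry in the cache layout example.
ENTRY_FILES = ("references.txt", "groups.txt", "metadata.json")


def species_cache_dir(cache_root, cluster_dist, model, representatives, species):
    """Build the expected per-species cache path (naming scheme is inferred)."""
    config_dir = f"{cluster_dist}-{model}-{representatives}_reps"
    species_dir = species.strip().lower().replace(" ", "_")
    return Path(cache_root) / config_dir / "species" / species_dir


def looks_complete(entry_dir):
    """Return True if every expected file exists in a species entry."""
    entry = Path(entry_dir)
    return all((entry / name).is_file() for name in ENTRY_FILES)
```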
PopPUNK options
| Flag | Type | Default | Description |
|---|---|---|---|
| poppunk_model | str | dbscan | Clustering model for poppunk to use (either dbscan or bgmm) |
| publish_poppunk | bool | false | Optionally publish full poppunk output, group assignments are always published. |
Keep --publish_poppunk as false when using --ref_mode autoselect or --ref_mode refine. The PopPUNK outputs are generated on the full set of genomes supplied (in the case of --ref_mode autoselect, all genomes for the detected species) rather than on the representatives used downstream. Additionally, as outputs are generated per species, --ref_mode autoselect can produce a large number of files with significant storage overhead.
Sketchlib workflow options
| Flag | Type | Default | Description |
|---|---|---|---|
| ani_threshold | float | 0.02 | Max ANI distance threshold for clustering (default 0.02 clusters genomes sharing >98% ANI similarity). |
| sketchlib_kstep | str | "13,29,4" | K-mer sizes at which sketchlib will sketch the reference in the format start,stop,step |
| cluster_strict | bool | false | Fail early if all genomes form a single cluster, or each genome is a singleton. |
| cluster_algorithm | str | connected_components | Name of clustering/community-finding algorithm to be used in sketchlib clustering. Options: connected_components, leiden, louvain, walktrap, fastgreedy, label_propagation, infomap, eigenvector |
Deterministic methods include connected_components (default, also known as single-linkage clustering), walktrap, fastgreedy and eigenvector. Also available are the louvain, leiden, infomap and label_propagation methods.
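
To illustrate what the default `connected_components` method does with a distance threshold, here is a minimal pure-Python sketch of single-linkage clustering using union-find. It does not call sketchlib or python-igraph; the input format (a list of `(genome_a, genome_b, distance)` tuples) and the inclusive `<=` comparison are assumptions made for illustration.

```python
def connected_components(genomes, distances, threshold=0.02):
    """Single-linkage clustering: link genomes whose distance is <= threshold.

    `distances` is an iterable of (genome_a, genome_b, distance) tuples.
    Returns a sorted list of clusters, each a list of genome names.
    """
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b, d in distances:
        if d <= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(clusters.values())
```

Because any chain of sufficiently close pairs is merged, two genomes can end up in the same cluster even when their direct distance exceeds the threshold, which is the characteristic behaviour of single-linkage clustering.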
Themisto options
| Flag | Type | Default | Description |
|---|---|---|---|
| themisto_index | path | null | Path to a pre-built Themisto index including the index prefix (without exts). Skips indexing if provided. |
| themisto_k | integer | 31 | K-mer size for indexing and pseudoalignment. Allowed values: 21, 31, 51. K-mer sizes must match if an index is provided. |
| temp_dir | path | null | Custom temporary storage directory to be used during runtime. Otherwise local /tmp will be used. |
| temp_space | integer | 10000 | Amount of /tmp space (MB) that will be reserved for index creation and pseudoalignment, if /tmp is being used as the temporary storage directory. |
mSWEEP options
| Flag | Type | Default | Description |
|---|---|---|---|
| ref_groups | path | null | Grouped references text file, one line per reference. Mandatory only when a pre-built index is supplied to --themisto_index. |
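
Because the grouping file is matched to the indexed references purely by position, a quick sanity check before a run can catch length mismatches. This is a minimal sketch assuming you still have the list of reference paths that was used to build the index; the function name is hypothetical, and the check cannot detect two files of equal length in a different order.

```python
def pair_references_with_groups(references_path, groups_path):
    """Pair each reference with its group by line position.

    Raises ValueError if the two files have different numbers of entries.
    """
    def read_entries(path):
        with open(path) as handle:
            return [line.strip() for line in handle if line.strip()]

    refs = read_entries(references_path)
    groups = read_entries(groups_path)
    if len(refs) != len(groups):
        raise ValueError(
            f"{len(refs)} references but {len(groups)} group labels"
        )
    return dict(zip(refs, groups))
```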
mGEMS options
| Flag | Type | Default | Description |
|---|---|---|---|
| get_assignments | boolean | false | Output the read assignment table used by mGEMS for binning. |
| min_abundance | float | 0.0001 | Only bin reads for groups that have a relative abundance higher than this value. |
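
To preview which groups would pass `min_abundance`, the abundance output can be filtered in a few lines. This sketch assumes the abundances file is tab-separated with a group name and an abundance per row, and that header or comment lines start with `#`; check your own output before relying on that layout. The function name is hypothetical.

```python
def groups_above(abundance_lines, min_abundance=0.0001):
    """Return {group: abundance} for groups strictly above min_abundance.

    Assumes tab-separated 'group<TAB>abundance' rows; lines starting with '#'
    are treated as comments or headers and skipped.
    """
    kept = {}
    for line in abundance_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        group, value = line.split("\t")[:2]
        abundance = float(value)
        if abundance > min_abundance:
            kept[group] = abundance
    return kept
```

Passing an open file handle for `results/<sample_id>/<sample_id>_mSWEEP_abundances.txt` as `abundance_lines` works, since the function only iterates over lines.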
- Nextflow $\ge$ 22.03.0, $\lt$ 26.04.0
- sylph and sylph-tax databases for GTDB.
- All other dependencies are containerised in publicly available docker images.
The current version of the pipeline uses the following software dependencies:
| Software | Version | Image URL |
|---|---|---|
| themisto | 3.2.2 | quay.io/sangerpathogens/themisto:3.2.2 |
| mSWEEP | 2.2.1 | quay.io/biocontainers/msweep:2.2.1--h503566f_1 |
| mGEMS | 1.3.3 | quay.io/biocontainers/mgems:1.3.3--h13024bc_2 |
| PopPUNK | 2.7.8 | quay.io/biocontainers/poppunk:2.7.8--py310h4d0eb5b_0 |
| sylph | 0.8.1 | quay.io/biocontainers/sylph:0.9.0 |
| pp-sketchlib | 2.1.5 | quay.io/sangerpathogens/pp-sketchlib-python:2.1.5-c1 |
| python-igraph | 1.0.0 | quay.io/sangerpathogens/pp-sketchlib-python:2.1.5-c1 |
The --temp_dir option is available to customise the temporary storage location if necessary. Themisto pseudoalignment requires temporary storage, and that storage must be on the same filesystem as the one on which the process runs. By default this pipeline uses node-local /tmp, which is safe for both HPC and non-HPC systems as long as /tmp is available and writable (usually true).
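
Before pointing `--temp_dir` at a custom location, it can help to confirm there is enough free space. The following is a small sketch using the standard library; the function name is hypothetical, the default of 10000 MB mirrors the `temp_space` default, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
import shutil


def has_temp_space(temp_dir="/tmp", required_mb=10000, disk_usage=shutil.disk_usage):
    """Return True if temp_dir has at least required_mb MB of free space.

    `disk_usage` is injectable so the check can be exercised without touching
    a real filesystem.
    """
    free_mb = disk_usage(temp_dir).free // (1024 * 1024)
    return free_mb >= required_mb
```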
This current version is not yet GPU enabled. Watch this space!

