gemsweep

[[TOC]]

Pipeline summary

This workflow deconvolutes mixed read sets (e.g. plate sweep sequencing data, shotgun metagenomic data) and resolves them into strain-level bins. At its core it implements Themisto pseudoalignment of reads to a curated set of references, mSWEEP to estimate relative abundances and mGEMS to bin reads.

Indexing references with Themisto and clustering them are automated by default. However, if you already have an index and a biologically meaningful grouping, you may provide these as inputs and skip some computation.

Additionally, for large reference datasets with significant redundancy we offer a reference refinement option. This subsets the references to a configurable number of maximally distant representatives from each cluster, conserving some within-cluster diversity whilst reducing compute demands.
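To illustrate what "maximally distant representatives" can mean, here is a minimal sketch of greedy farthest-point selection over a pairwise distance matrix. It is not the pipeline's implementation; the function name `pick_representatives` and the toy distances are hypothetical.

```python
def pick_representatives(dist, cap):
    """Greedy farthest-point selection over a symmetric distance matrix.

    Illustrative only: gemsweep's own selection logic may differ.
    """
    if cap < 1:
        raise ValueError("cap must be at least 1")
    n = len(dist)
    if n <= cap:
        return list(range(n))  # cluster is already within the cap
    # Seed with the genome that has the largest total distance to all others.
    chosen = [max(range(n), key=lambda i: sum(dist[i]))]
    # Distance from each genome to its nearest chosen representative.
    nearest = list(dist[chosen[0]])
    while len(chosen) < cap:
        candidates = [i for i in range(n) if i not in chosen]
        nxt = max(candidates, key=lambda i: nearest[i])
        chosen.append(nxt)
        nearest = [min(nearest[i], dist[nxt][i]) for i in range(n)]
    return sorted(chosen)


# Toy cluster of four genomes forming two tight pairs.
cluster = [
    [0.00, 0.01, 0.20, 0.21],
    [0.01, 0.00, 0.19, 0.20],
    [0.20, 0.19, 0.00, 0.02],
    [0.21, 0.20, 0.02, 0.00],
]
reps = pick_representatives(cluster, cap=2)
print(reps)
```

With a cap of 2 the sketch keeps one genome from each tight pair, which is the kind of within-cluster diversity the refinement option aims to conserve.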

Finally, we offer an experimental feature in which reference genomes are selected automatically by setting ref_mode to autoselect. Only reads need to be provided; these are queried using Sylph against GTDB to select appropriate genomes. This is still under development but feel free to try it out.

View the poster summary of this project presented at ISCB-UK 2026 here.

Usage

Quickstart

From source code

To run the pipeline from source (this repository):

Clone the repository.
To run with Docker, pass the path to main.nf together with the -profile docker option:
```
nextflow run <path/to/main.nf> \
    -profile docker \
    --manifest <path/to/manifest.csv> \
    --ref_mode full \
    --references <path/to/references.txt>
```
Supported container profiles are docker and singularity.
⚠️ If no profile is specified the pipeline will run with a Sanger HPC-specific configuration.

This pipeline's default settings are optimised for running on the Sanger HPC, including making use of temp storage. To run on other systems please configure the parameters appropriately.

See Parameters for available pipeline options.

On the Sanger HPC

First load modules for gemsweep, nextflow and ISG/singularity.

Instead of nextflow run main.nf you can now run on the command line with gemsweep <options>. For instance, to see a help message:

module load gemsweep
gemsweep --help

Inputs

Paired-end reads per (mixed) sample (unless using --ref_prep_only)

To provide locally stored reads, either use --manifest (alias --manifest_of_reads) to supply a CSV file with the mandatory header line 'ID,R1,R2' and one row per sample containing the read ID, the path to <read 1>.fastq.gz and the path to <read 2>.fastq.gz, or use --manifest_from_dir to supply a directory containing the reads. With --manifest_from_dir, --max_depth takes an integer setting how many sub-directories deep to look for reads.
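If your reads follow a regular naming scheme, a manifest with the required header can be written with a few lines of Python. This is a sketch only: the helper `write_manifest` and the `_1.fastq.gz` / `_2.fastq.gz` suffixes are assumptions for illustration, so adjust them to your own file names.

```python
import csv
import tempfile
from pathlib import Path


def write_manifest(read_dir, manifest_path,
                   r1_suffix="_1.fastq.gz", r2_suffix="_2.fastq.gz"):
    """Write an 'ID,R1,R2' manifest for every complete read pair in read_dir."""
    rows = []
    for r1 in sorted(Path(read_dir).glob(f"*{r1_suffix}")):
        sample_id = r1.name[: -len(r1_suffix)]
        r2 = r1.with_name(sample_id + r2_suffix)
        if r2.exists():  # skip samples missing their second read file
            rows.append([sample_id, str(r1.resolve()), str(r2.resolve())])
    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["ID", "R1", "R2"])  # mandatory header line
        writer.writerows(rows)
    return rows


# Demonstration on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("s1_1.fastq.gz", "s1_2.fastq.gz",
                 "s2_1.fastq.gz", "s2_2.fastq.gz", "orphan_1.fastq.gz"):
        (Path(tmp) / name).touch()
    manifest = Path(tmp) / "manifest.csv"
    write_manifest(tmp, manifest)
    manifest_lines = manifest.read_text().splitlines()
```

The resulting file can then be passed to the pipeline with --manifest.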

Alternatively you can supply reads from ENA or, if you have access, Sanger's iRODS. See here for more detail: https://gitlab.internal.sanger.ac.uk/sanger-pathogens/pipelines/assorted-sub-workflows/-/blob/main/mixed_input/README.md?ref_type=heads

One of the following options for supplying references:

a prebuilt Themisto index of references AND a reference grouping text file
a text file of paths to each reference, 'references.txt' (indexing and clustering will happen within the pipeline)
none, when opting for autoselection of GTDB genomes based on Sylph queries of the supplied reads

Compatible parameters for each reference mode (ref_mode):

Reference Mode	Required Params	Mode Description
`index`	`themisto_index`, `ref_groups`	The supplied index and reference groupings will be validated and used directly in the main workflow (Themisto pseudoalignment, mSWEEP relative abundance estimation and mGEMS read binning). When supplying a prebuilt index a) the k-mer size must be identical to the argument `themisto_k` (default: 31) and b) the reference grouping file must be in identical positional order to the references when indexed.
`full`	`references`	All references supplied will be indexed, clustered (with the workflow indicated by `cluster_dist`) and the produced index and groups files will be used in the main workflow
`refine`	`references`	The references supplied will be clustered (NOTE: currently only compatible with `--cluster_dist core_acc`) and each cluster is dereplicated and capped to a maximum indicated by `representatives`. Index and groups file are produced for the representatives selected to use in the main workflow.
`autoselect`	N/A	The references are not supplied but rather derived from querying the reads against GTDB and using the hits as references for indexing and clustering before the main workflow. In the autoselect mode, reference genome clustering is always done with PopPUNK - the `cluster_dist` parameter has no effect. Also, the reference refinement process is always applied to subselect representative genomes from clustered references.
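Because the grouping file is matched to the indexed references purely by position, it can be worth checking the two files line up before a run in index mode. The sketch below pairs the two by line position and fails when the counts differ; `pair_references_with_groups` and the example paths and labels are hypothetical, and the check cannot detect a grouping file that has the right length but the wrong order.

```python
from collections import Counter


def pair_references_with_groups(reference_lines, group_lines):
    """Pair each reference with the group label on the same line position."""
    refs = [line.strip() for line in reference_lines if line.strip()]
    groups = [line.strip() for line in group_lines if line.strip()]
    if len(refs) != len(groups):
        raise ValueError(
            f"{len(refs)} references but {len(groups)} group labels"
        )
    return list(zip(refs, groups))


# Example contents of a references list and its grouping file.
pairs = pair_references_with_groups(
    ["/refs/a.fasta", "/refs/b.fasta", "/refs/c.fasta"],
    ["ST131", "ST131", "ST73"],
)
group_sizes = Counter(group for _, group in pairs)
print(group_sizes)
```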

⚠ Experimental feature — autoselection BETA

The autoselection feature is still under development in this release. Please read the following before enabling it.

Resource allocation

A custom config may be needed to increase resource limits. The POPPUNK process peaks below 64 GB for all GTDB species except E. coli. Memory for the MSWEEP process should also be increased. As a rough guide, a Toy Human Gut (CAMI dataset) sample with 113 species detected required ~320 GB peak memory for MSWEEP. Less diverse read sets will run more easily.

Clustering limitations

PopPUNK is the only clustering strategy in autoselect mode and can fail when too few references are available, for example for species not well represented in GTDB. When PopPUNK failures are ignored, those species are absent from the reference set passed to Themisto and mSWEEP/mGEMS, which may affect result robustness.

These limitations are targeted for improvement in a future release.

Outputs

Main pipeline outputs are written under --outdir (./results by default).

Binned reads per reference group:
- results/<sample_id>/mGEMS/*
mSWEEP abundance/probability outputs:
- results/<sample_id>/<sample_id>_mSWEEP_abundances.txt
- results/<sample_id>/<sample_id>_mSWEEP_probs.tsv
Read assignment table (optionally, with --get_assignments)
The final reference genome paths, their groups and Themisto index, if generated within the pipeline run:
- results/ref_groups/references.txt
- results/ref_groups/groups.txt
- results/themisto/index.*
- results/themisto/index_report.txt
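As a rough illustration of working with the abundance output, the sketch below lists the groups whose relative abundance exceeds a threshold, mirroring the `min_abundance` default of 0.0001. It assumes the abundance file is tab-separated with `#`-prefixed header lines followed by `group<TAB>abundance` rows; check this against your own output before relying on it. The function name and example values are hypothetical.

```python
def groups_above_threshold(abundance_lines, min_abundance=0.0001):
    """Return {group: abundance} for groups above min_abundance.

    Assumes '#' header lines followed by tab-separated group/abundance rows.
    """
    kept = {}
    for line in abundance_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        group, value = line.rstrip("\n").split("\t")[:2]
        if float(value) > min_abundance:
            kept[group] = float(value)
    return kept


# Made-up example rows, not real pipeline output.
example = [
    "#c_id\tmean_theta\n",
    "ST131\t0.82\n",
    "ST73\t0.17995\n",
    "ST10\t0.00005\n",
]
binned = groups_above_threshold(example)
print(binned)
```

To use it on a real run, pass an open file handle for results/<sample_id>/<sample_id>_mSWEEP_abundances.txt instead of the example list.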

When using --ref_mode autoselect, Sylph outputs are written to:

results/sylph/combined_sylph_report.tsv
results/<sample_id>/sylph/<sample_id>_sylph_profile.tsv
results/<sample_id>/sylph/<sample_id>_sylphtax_profile.sylphmpa

If --save_sylph_sketches or --publish_poppunk is true, the corresponding outputs are also published in the results/ directory.

Generate a manifest of binned reads

To generate a manifest of binned reads for downstream analysis, after your run has completed use generate_manifest.py from the assorted-sub-workflows submodule as demonstrated below (path relative to repo root):

mkdir mGEMs_bins_manifest
./assorted-sub-workflows/mixed_input/bin/generate_manifest.py \
  --input ./results \
  --output mGEMs_bins_manifest \
  --fastq_validation relaxed \
  --max_depth 2

--input: path to your results directory (set by --outdir, default: ./results)
--output: name of the CSV manifest of all discovered FASTQs
--max_depth 2: searches 2 subdirectory levels deep, capturing all mGEMS bins across samples

Parameters

Logging options

Flag	Type	Default	Description
`monochrome_logs`	`boolean`	`false`	Output logs in plain ASCII (disable colored logging).

General options

Flag	Type	Default	Description
`manifest`	`path`	`null`	Input manifest CSV with required header `ID,R1,R2`, containing per-sample paths to `.fastq.gz` files.
`outdir`	`path`	`"./results"`	Path to top directory containing all results, by default `results` within the launch directory.

Workflow options

Flag	Type	Default	Description
`ref_prep_only`	`bool`	`false`	Run only reference preparatory steps, skipping read pseudoalignment through to binning.
`ref_mode`	`str`	`null`	Required. Choose a reference input mode. Options: `index`, `full`, `refine`, `autoselect`.
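The required parameters for each reference mode, as listed under Inputs, can be summarised as a small lookup. The sketch below is illustrative rather than the pipeline's own validation code; `REQUIRED_PARAMS` and `missing_params` are hypothetical names.

```python
# Required parameters per reference mode, taken from the Inputs section.
REQUIRED_PARAMS = {
    "index": ("themisto_index", "ref_groups"),
    "full": ("references",),
    "refine": ("references",),
    "autoselect": (),
}


def missing_params(params):
    """Return the required parameters that are unset for the chosen ref_mode."""
    mode = params.get("ref_mode")
    if mode not in REQUIRED_PARAMS:
        raise ValueError(
            f"ref_mode must be one of {sorted(REQUIRED_PARAMS)}, got {mode!r}"
        )
    return [name for name in REQUIRED_PARAMS[mode] if params.get(name) is None]


print(missing_params({"ref_mode": "index", "themisto_index": "idx/prefix"}))
print(missing_params({"ref_mode": "autoselect"}))
```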

References options

Flag	Type	Default	Description
`references`	`path`	`null`	Path to text file containing paths to references, one per line.
`representatives`	`integer`	`20`	Number of representatives at which to cap each reference cluster. Used when `--ref_mode` is `refine` or `autoselect`.
`cluster_dist`	`str`	`core_acc`	Genomic distance used in clustering references. Options: `core_acc` or `ani`. Determines the clustering workflow, see below for more info. Only used when `ref_mode` is `refine` or `full`.

The pipeline's idea of strain-level is defined by the clustering stage. When you supply groups in --ref_mode index the references are pre-clustered. In --ref_mode refine or full you have a choice of clustering workflows defined by the --cluster_dist param.

The default value --cluster_dist core_acc means that a PopPUNK workflow is applied; see PopPUNK options below to configure. Be aware that this is a non-deterministic mode of clustering, developed to cluster single-species genome datasets to the strain level. To re-use the clusters generated in a previous run, use --ref_mode index. Note that --ref_mode autoselect currently only uses this PopPUNK-based clustering workflow.

Alternatively ANI-based community-finding algorithms are available; using --cluster_dist ani instead invokes sketchlib to estimate ANI similarities followed by a choice of community-finding algorithms from the package python-igraph, including some deterministic algorithms. See Sketchlib workflow options below to configure.
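To make the simplest of these algorithms concrete, here is a minimal pure-Python sketch of connected-components (single-linkage) clustering at a distance threshold. It does not call sketchlib or python-igraph and is not the pipeline's code; the function name, the inclusive `<=` comparison and the toy distances are assumptions for illustration.

```python
def single_linkage_clusters(names, distances, threshold=0.02):
    """Group genomes linked by any chain of pairwise distances <= threshold."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), dist in distances.items():
        if dist <= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Toy ANI distances (1 - ANI similarity) between four genomes.
toy_distances = {
    ("g1", "g2"): 0.010,
    ("g2", "g3"): 0.015,
    ("g1", "g3"): 0.030,
    ("g1", "g4"): 0.120,
    ("g2", "g4"): 0.110,
    ("g3", "g4"): 0.100,
}
clusters = single_linkage_clusters(["g1", "g2", "g3", "g4"], toy_distances)
print(clusters)
```

Note that g1 and g3 end up together even though their direct distance exceeds the threshold, because both are linked through g2. This chaining is characteristic of connected components and is one reason to consider the other community-finding algorithms.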

Reference Autoselection options

Flag	Type	Default	Description
`sylph_db`	`Path`	`"/data/pam/software/sylph/gtdb_full_r226.syldb"`	Path to a pre-built Sylph database (.syldb)
`sylph_tax_metadata`	`Path`	`"/data/pam/software/sylph-tax/v1/gtdb_r226_metadata.tsv"`	Path to the sylph-tax metadata TSV to use for `sylph-tax taxprof`
`sylph_k`	`int`	`31`	K-mer size for `sylph sketch`.
`sylph_min_ani`	`float`	`95`	ANI threshold for Sylph filtering.
`sylph_min_cov`	`float`	`0.01`	Coverage threshold for Sylph filtering.
`taxonomic_rank`	`str`	`species`	Taxonomic rank by which to group references. Choices: `domain`, `kingdom`, `phylum`, `class`, `order`, `family`, `genus`, `species`.
`pool_latin_taxa`	`bool`	`false`	Advanced option. Ignores alphabet suffixes of GTDB divisions of latin-name taxa, thus pooling those subdivisions together. Not recommended to change unless the effects on output are understood; see below for more info.
`save_sylph_sketches`	`bool`	`true`	Keep Sylph sketches.
`genome_id_to_file`	`Path`	`"/data/pam/collections/GTDB/release226/genomic_files_all_retrievable_2026_03_05/metadata/id_to_genome_path.tsv"`	File from which to extract genome paths based on genome identifiers.

Note on using pool_latin_taxa: Certain genera and species in GTDB are further divided by appended alphabet suffixes; for example, in GTDB r226, Escherichia coli has 3 species-rank taxonomic groups: Escherichia_coli, Escherichia_coli_E and Escherichia_coli_F. Further explanation is available in the GTDB documentation. If you want to treat these as one group you can use this advanced option. Note that:

a) generated groups are no longer compliant with GTDB taxonomic definitions; consider whether this affects downstream analysis

b) the size of the produced group may be considerably larger; for example, at the genus level in GTDB release 232, g__Clostridium has 1607 genomes but all 34 GTDB genera in g__Clostridium* total 2931 genomes.

Note that not all taxa belonging to a "traditional" species might be pooled this way due to certain GTDB species being named differently; for instance in GTDB r232, a new species called ECMA0423 sp047199055 has been created out of genomes previously classified as Escherichia_coli.
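The pooling idea can be sketched with a regular expression that drops a trailing capital-letter suffix from underscore-separated taxon names. This is an illustration of the concept only, assuming names written as in the examples above; `pooled_name` and `pool_taxa` are hypothetical helpers, not the pipeline's code.

```python
import re
from collections import defaultdict

# An underscore followed only by capital letters, at the end of a name part.
GTDB_SUFFIX = re.compile(r"_[A-Z]+(?=_|$)")


def pooled_name(taxon):
    """Strip GTDB alphabet suffixes, e.g. Escherichia_coli_E -> Escherichia_coli."""
    return GTDB_SUFFIX.sub("", taxon)


def pool_taxa(taxa):
    """Group taxon names by their suffix-free form."""
    pools = defaultdict(list)
    for taxon in taxa:
        pools[pooled_name(taxon)].append(taxon)
    return dict(pools)


pools = pool_taxa([
    "Escherichia_coli",
    "Escherichia_coli_E",
    "Escherichia_coli_F",
    "Escherichia_fergusonii",
])
print(pools)
```

A differently named taxon such as ECMA0423 sp047199055 carries no suffix to strip, so it stays in its own group, as described in the note above.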

Cache Autoselection options

Flag	Type	Default	Description
`cache_dir`	`path`	`null`	Path to a cache root or an existing config-specific cache directory for autoselect mode. The pipeline reuses a matching cache directory containing cache metadata and per-species reference/group entries.

Cache Layout:

<cache_root>/
  core_acc-bgmm-20_reps/
    metadata.json
    species/
      escherichia_coli/
        references.txt
        groups.txt
        metadata.json

Cache setup and lookup intermediates such as cache_config.json, cache_hits.tsv, and cache_miss.tsv are kept in the Nextflow work/ directory and are not published to results.

When --cache_dir is supplied, generated reference entries are written directly to the external cache directory, not to results.

The config-level metadata.json records the clustering settings used for that cache directory. Each species-level metadata.json records cache write/update details for that species, including update counts and added reference IDs.
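To inspect what a cache already holds, the layout above can be walked with a short script. This is a sketch under the assumption that a complete species entry contains exactly the three files shown in the layout; `complete_species_entries` is a hypothetical helper and the demonstration builds a throwaway directory rather than touching a real cache.

```python
import tempfile
from pathlib import Path

EXPECTED_FILES = ("references.txt", "groups.txt", "metadata.json")


def complete_species_entries(config_dir):
    """List species directories that contain every expected cache file."""
    species_root = Path(config_dir) / "species"
    if not species_root.is_dir():
        return []
    return sorted(
        entry.name
        for entry in species_root.iterdir()
        if entry.is_dir()
        and all((entry / name).is_file() for name in EXPECTED_FILES)
    )


# Build a small mock cache: one complete entry and one partial entry.
with tempfile.TemporaryDirectory() as cache_root:
    config_dir = Path(cache_root) / "core_acc-bgmm-20_reps"
    complete = config_dir / "species" / "escherichia_coli"
    partial = config_dir / "species" / "klebsiella_pneumoniae"
    complete.mkdir(parents=True)
    partial.mkdir(parents=True)
    for name in EXPECTED_FILES:
        (complete / name).touch()
    (partial / "references.txt").touch()
    entries = complete_species_entries(config_dir)
    empty = complete_species_entries(Path(cache_root) / "missing")
print(entries)
```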

PopPUNK options

Flag	Type	Default	Description
`poppunk_model`	`str`	`dbscan`	Clustering model for poppunk to use (either dbscan or bgmm)
`publish_poppunk`	`bool`	`false`	Optionally publish full poppunk output, group assignments are always published.

⚠️ It is strongly recommended to leave --publish_poppunk as false when using --ref_mode autoselect or --ref_mode refine. The PopPUNK outputs are generated on the full set of genomes supplied, in the case of --ref_mode autoselect all genomes for the detected species, rather than the representatives used downstream. Additionally, as outputs are generated per species, --ref_mode autoselect can produce a large number of files with significant storage overhead.

Sketchlib workflow options

Flag	Type	Default	Description
`ani_threshold`	`float`	`0.02`	Max ANI distance threshold for clustering (default 0.02 clusters genomes sharing >98% ANI similarity).
`sketchlib_kstep`	`str`	`"13,29,4"`	Kmer sizes at which sketchlib will sketch the reference in the format start,stop,step
`cluster_strict`	`bool`	`false`	Fail early if all genomes form a single cluster, or each genome is a singleton.
`cluster_algorithm`	`str`	`connected_components`	Name of the clustering/community-finding algorithm used in sketchlib clustering. Options: connected_components, leiden, louvain, walktrap, fastgreedy, label_propagation, infomap, eigenvector

Deterministic methods include connected_components (default, also known as single-linkage clustering), walktrap, fastgreedy and eigenvector. Also available are the louvain, leiden, infomap and label_propagation methods.
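For the `sketchlib_kstep` format in the table above, the following sketch expands a start,stop,step string into the list of k-mer sizes. It assumes the stop value is inclusive, which you should confirm against the sketchlib documentation; `expand_kstep` is a hypothetical helper.

```python
def expand_kstep(kstep):
    """Expand 'start,stop,step' into k-mer sizes, treating stop as inclusive."""
    start, stop, step = (int(part) for part in kstep.split(","))
    if step <= 0 or start > stop:
        raise ValueError(f"invalid k-mer range: {kstep!r}")
    return list(range(start, stop + 1, step))


kmers = expand_kstep("13,29,4")
print(kmers)
```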

Themisto options

Flag	Type	Default	Description
`themisto_index`	`path`	`null`	Path to a pre-built Themisto index including the index prefix (without file extensions). Skips indexing if provided.
`themisto_k`	`integer`	`31`	K-mer size for indexing and pseudoalignment. Allowed values: `21`, `31`, `51`. K-mer sizes must match if an index is provided.
`temp_dir`	`path`	`null`	Custom temporary storage directory to be used during runtime. Otherwise local `/tmp` will be used.
`temp_space`	`integer`	`10000`	Amount of /tmp space (MB) that will be reserved for index creation and pseudoalignment, if /tmp is being used as the temporary storage directory.

mSWEEP options

Flag	Type	Default	Description
`ref_groups`	`path`	`null`	Grouped references text file, one line per reference. Mandatory only when a pre-built index is supplied to `--themisto_index`.

mGEMS options

Flag	Type	Default	Description
`get_assignments`	`boolean`	`false`	Output the read assignment table used by mGEMS for binning.
`min_abundance`	`float`	`0.0001`	Only bin reads for groups that have a relative abundance higher than this value.

Dependencies

Nextflow $\ge$ 22.03.0, $\lt$ 26.04.0
sylph and sylph-tax databases for GTDB.
All other dependencies are containerised in publicly available docker images.
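The supported Nextflow range can be checked programmatically before launching. The sketch below compares a version string against the bounds listed above; `nextflow_version_supported` is a hypothetical helper and the version strings are examples only.

```python
def nextflow_version_supported(version, minimum=(22, 3, 0), ceiling=(26, 4, 0)):
    """Return True if minimum <= version < ceiling.

    Any suffix after a hyphen (for example '-edge') is ignored.
    """
    numeric = version.split("-")[0]
    parts = tuple(int(piece) for piece in numeric.split(".")[:3])
    return minimum <= parts < ceiling


print(nextflow_version_supported("24.10.0"))
print(nextflow_version_supported("26.04.0"))
```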

Software versions

The current version of the pipeline uses the following software dependencies:

Software	Version	Image URL
themisto	3.2.2	quay.io/sangerpathogens/themisto:3.2.2
mSWEEP	2.2.1	quay.io/biocontainers/msweep:2.2.1--h503566f_1
mGEMS	1.3.3	quay.io/biocontainers/mgems:1.3.3--h13024bc_2
PopPUNK	2.7.8	quay.io/biocontainers/poppunk:2.7.8--py310h4d0eb5b_0
sylph	0.8.1	quay.io/biocontainers/sylph:0.9.0
pp-sketchlib	2.1.5	quay.io/sangerpathogens/pp-sketchlib-python:2.1.5-c1
python-igraph	1.0.0	quay.io/sangerpathogens/pp-sketchlib-python:2.1.5-c1

Customise Temporary Storage

The --temp_dir option is available to customise the temporary storage location if necessary. Themisto pseudoalignment requires temporary storage, which must be on the same filesystem as the one on which the process runs. By default this pipeline uses node-local /tmp, which is safe for both HPC and non-HPC systems as long as /tmp is available and writable (usually true).
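Before choosing a temporary directory, you may want to confirm it has enough free space for the amount reserved by `temp_space` (10000 MB by default). The sketch below uses Python's standard `shutil.disk_usage`; the helper name is hypothetical and it assumes 1 MB means 1024 * 1024 bytes.

```python
import shutil
import tempfile


def has_temp_space(temp_dir, required_mb=10000):
    """Return True if temp_dir's filesystem has at least required_mb free."""
    free_mb = shutil.disk_usage(temp_dir).free // (1024 * 1024)
    return free_mb >= required_mb


candidate = tempfile.gettempdir()  # typically /tmp on Linux
print(candidate, has_temp_space(candidate))
```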

GPU Acceleration

This current version is not yet GPU enabled. Watch this space!
