gemsweep


Pipeline summary

This workflow deconvolutes mixed read sets (e.g. plate sweep sequencing data, shotgun metagenomic data) and resolves them into strain-level bins. At its core it implements Themisto pseudoalignment of reads to a curated set of references, mSWEEP to estimate relative abundances and mGEMS to bin reads.

Indexing the references with Themisto and clustering them can be automated by the pipeline; however, if you already have an index and a biologically meaningful grouping, you may provide these as inputs and skip some computation.

Additionally, for large reference datasets with significant redundancy we offer a reference refinement option. This involves subsetting references to a configurable number of maximally distant representatives from the clusters - to conserve some within-cluster diversity whilst reducing compute demands.
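To make "maximally distant representatives" concrete, here is a minimal Python sketch of greedy farthest-first selection within one cluster. This is one common way to pick spread-out representatives; the source does not state that gemsweep uses this exact procedure. The function name, the toy genomes and the distance function are all hypothetical.

```python
def pick_representatives(names, dist, k):
    """Greedy farthest-first selection of up to k members of one cluster.

    names: list of genome identifiers in the cluster
    dist:  callable returning a pairwise distance between two identifiers
    k:     cap on the number of representatives (cf. --representatives)
    """
    if k <= 0 or not names:
        return []
    chosen = [names[0]]  # arbitrary seed: the first member
    while len(chosen) < min(k, len(names)):
        remaining = [n for n in names if n not in chosen]
        # Take the member whose nearest already-chosen representative is farthest away.
        best = max(remaining, key=lambda n: min(dist(n, c) for c in chosen))
        chosen.append(best)
    return chosen


# Toy example: four hypothetical genomes placed on a line.
positions = {"a": 0.0, "b": 0.1, "c": 0.5, "d": 1.0}


def toy_dist(x, y):
    return abs(positions[x] - positions[y])


reps = pick_representatives(list(positions), toy_dist, 3)
print(reps)
```

In this toy case the near-duplicate "b" is the member left out, which is the intended effect of capping a redundant cluster.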

[Figure: workflow diagram]

Finally, we offer an experimental feature in which reference genomes are selected automatically by setting ref_mode to autoselect. Only reads need to be provided; these are queried using Sylph against GTDB to select appropriate genomes. This is still under development, but feel free to try it out.

[Figure: workflow diagram]

View the poster summary of this project presented at ISCB-UK 2026 here.

Usage

Quickstart

From source code

To run the pipeline from source (this repository):

  1. Clone the repository.

  2. To run with Docker, pass the path to main.nf together with the -profile docker option:

    nextflow run <path/to/main.nf> \
        -profile docker \
        --manifest <path/to/manifest.csv> \
        --ref_mode full \
        --references <path/to/references.txt>
    

    The supported profiles are docker and singularity.
    ⚠️ If no profile is specified the pipeline will run with a Sanger HPC-specific configuration.

    This pipeline's default settings are optimised for running on the Sanger HPC, including making use of temp storage. To run on other systems please configure the parameters appropriately.

    See Parameters for available pipeline options.

On the Sanger HPC

First load modules for gemsweep, nextflow and ISG/singularity.

Instead of nextflow run main.nf, you can then run the pipeline from the command line with gemsweep <options>. For instance, to see the help message:

module load gemsweep
gemsweep --help

Inputs

  • Paired-end reads per (mixed) sample (unless using --ref_prep_only)

    To provide locally stored reads either use --manifest (or alias --manifest_of_reads) to supply a CSV file with the header line 'ID,R1,R2' (mandatory) and rows containing the read ID, path to <read 1>.fastq.gz and path to <read 2>.fastq.gz, or use --manifest_from_dir to supply a directory containing the reads (can be used alongside --max_depth with an integer reflecting how many sub-directories deep to look for reads).

    Alternatively you can supply reads from ENA or, if you have access, Sanger's iRODS. See here for more detail: https://gitlab.internal.sanger.ac.uk/sanger-pathogens/pipelines/assorted-sub-workflows/-/blob/main/mixed_input/README.md?ref_type=heads

  • One of the following options for supplying references:

    • a prebuilt themisto index of references AND a reference grouping text file*
    • a text file of paths to each reference 'references.txt' (indexing and clustering will happen within the pipeline)
    • none when opting for autoselection of GTDB genomes based on sylph queries of the supplied reads

    Compatible parameters for each reference mode (ref_mode):

    | Reference Mode | Required Params | Mode Description |
    |---|---|---|
    | index | themisto_index, ref_groups | The supplied index and reference groupings will be validated and used directly in the main workflow (Themisto pseudoalignment, mSWEEP relative abundance estimation and mGEMS read binning). When supplying a prebuilt index a) the k-mer size must be identical to the argument themisto_k (default: 31) and b) the reference grouping file must be in identical positional order to the references when indexed. |
    | full | references | All references supplied will be indexed and clustered (with the workflow indicated by cluster_dist), and the resulting index and groups files will be used in the main workflow. |
    | refine | references | The references supplied will be clustered (NOTE: currently only compatible with --cluster_dist core_acc) and each cluster is dereplicated and capped to a maximum indicated by representatives. An index and groups file are produced for the selected representatives for use in the main workflow. |
    | autoselect | N/A | The references are not supplied but are derived by querying the reads against GTDB and using the hits as references for indexing and clustering before the main workflow. In autoselect mode, reference genome clustering is always done with PopPUNK; the cluster_dist parameter has no effect. The reference refinement process is also always applied to subselect representative genomes from the clustered references. |
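Because index mode relies on the grouping file matching the references line for line, a quick sanity check before launching can catch a mismatched pair of files. The sketch below is not part of gemsweep; the function names and example paths are hypothetical. It can only confirm that both files have the same number of non-empty lines, not that the order matches the order used at indexing time.

```python
import tempfile
from pathlib import Path


def read_nonempty_lines(path):
    """Return the stripped, non-empty lines of a text file."""
    return [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]


def pair_references_with_groups(references_txt, groups_txt):
    """Pair each reference path with the group on the same line number."""
    refs = read_nonempty_lines(references_txt)
    groups = read_nonempty_lines(groups_txt)
    if len(refs) != len(groups):
        raise ValueError(
            f"{len(refs)} references but {len(groups)} group labels; "
            "the grouping file needs one line per reference"
        )
    return list(zip(refs, groups))


# Demonstration with throwaway files (paths and labels are made up).
with tempfile.TemporaryDirectory() as tmp:
    refs_file = Path(tmp) / "references.txt"
    groups_file = Path(tmp) / "groups.txt"
    refs_file.write_text("/refs/genome_a.fasta\n/refs/genome_b.fasta\n/refs/genome_c.fasta\n")
    groups_file.write_text("cluster_1\ncluster_1\ncluster_2\n")
    pairs = pair_references_with_groups(refs_file, groups_file)

for ref, group in pairs:
    print(group, ref)
```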

⚠ Experimental feature — autoselection BETA

The autoselection feature is still under development in this release. Please read the following before enabling it.

Resource allocation

A custom config may be needed to increase resource limits. The POPPUNK process peaks below 64 GB for all GTDB species except E. coli. For the MSWEEP process, memory should also be increased. As a rough guide a Toy Human Gut (CAMI dataset) sample with 113 species detected required ~320GB peak memory for MSWEEP. Less diverse read sets will run more easily.

Clustering limitations

PopPUNK is the only clustering strategy in autoselect mode and can fail when too few references are available — for example, species not well represented in GTDB. Ignored PopPUNK failures mean those species will be absent from the reference set passed to Themisto and mSWEEP/mGEMS, which may affect result robustness.

These limitations are targeted for improvement in a future release.

Outputs

Main pipeline outputs are written under --outdir (./results by default).

  • Binned reads per reference group:

    • results/<sample_id>/mGEMS/*
  • mSWEEP abundance/probability outputs:

    • results/<sample_id>/<sample_id>_mSWEEP_abundances.txt
    • results/<sample_id>/<sample_id>_mSWEEP_probs.tsv
  • Read assignment table (optionally, with --get_assignments)

  • The final reference genome paths, their groups and Themisto index, if generated within the pipeline run:

    • results/ref_groups/references.txt
    • results/ref_groups/groups.txt
    • results/themisto/index.*
    • results/themisto/index_report.txt

When using --ref_mode autoselect, Sylph outputs are written to:

  • results/sylph/combined_sylph_report.tsv
  • results/<sample_id>/sylph/<sample_id>_sylph_profile.tsv
  • results/<sample_id>/sylph/<sample_id>_sylphtax_profile.sylphmpa

If --save_sylph_sketches or --publish_poppunk are true these will also be published in the results/ directory.

Generate a manifest of binned reads

To generate a manifest of binned reads for downstream analysis, after your run has completed use generate_manifest.py from the assorted-sub-workflows submodule as demonstrated below (path relative to repo root):

mkdir mGEMs_bins_manifest
./assorted-sub-workflows/mixed_input/bin/generate_manifest.py \
  --input ./results \
  --output mGEMs_bins_manifest \
  --fastq_validation relaxed \
  --max_depth 2
  • --input: path to your results directory (set by --outdir, default: ./results)
  • --output: destination for the CSV manifest of all discovered FASTQs
  • --max_depth 2: searches 2 subdirectory levels deep, capturing all mGEMs bins across samples

Parameters

Logging options

| Flag | Type | Default | Description |
|---|---|---|---|
| monochrome_logs | boolean | false | Output logs in plain ASCII (disable colored logging). |

General options

| Flag | Type | Default | Description |
|---|---|---|---|
| manifest | path | null | Input manifest CSV with required header ID,R1,R2, containing per-sample paths to .fastq.gz files. |
| outdir | path | "./results" | Path to the top-level directory containing all results; by default, results within the launch directory. |
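A manifest with the required ID,R1,R2 header can be assembled with a few lines of Python. The sketch below is not part of gemsweep (--manifest_from_dir is the built-in route); it assumes, hypothetically, that mates sit in one flat directory and are named <id>_1.fastq.gz and <id>_2.fastq.gz. Adjust the suffixes to your own naming scheme.

```python
import csv
import tempfile
from pathlib import Path


def write_manifest(reads_dir, manifest_path,
                   r1_suffix="_1.fastq.gz", r2_suffix="_2.fastq.gz"):
    """Write an ID,R1,R2 manifest for every complete read pair in reads_dir."""
    reads_dir = Path(reads_dir)
    rows = []
    for r1 in sorted(reads_dir.glob(f"*{r1_suffix}")):
        sample_id = r1.name[: -len(r1_suffix)]
        r2 = r1.with_name(sample_id + r2_suffix)
        if r2.exists():  # skip samples whose second mate is missing
            rows.append((sample_id, str(r1.resolve()), str(r2.resolve())))
    with open(manifest_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["ID", "R1", "R2"])  # mandatory header line
        writer.writerows(rows)
    return rows


# Demonstration with empty placeholder files (sample names are made up).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sampleA_1.fastq.gz", "sampleA_2.fastq.gz", "orphan_1.fastq.gz"):
        (Path(tmp) / name).touch()
    manifest = Path(tmp) / "manifest.csv"
    rows = write_manifest(tmp, manifest)
    with open(manifest, newline="") as fh:
        header = next(csv.reader(fh))

print(header, [row[0] for row in rows])
```

The unpaired "orphan" file is left out, so the resulting manifest only lists samples with both mates present.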

Workflow options

| Flag | Type | Default | Description |
|---|---|---|---|
| ref_prep_only | bool | false | Run only reference preparatory steps, skipping read pseudoalignment through to binning. |
| ref_mode | str | null | Required. Choose a reference input mode. Options: index, full, refine, autoselect. |

References options

| Flag | Type | Default | Description |
|---|---|---|---|
| references | path | null | Path to text file containing paths to references, one per line. |
| representatives | integer | 20 | Number of representatives at which to cap each reference cluster. Used when --ref_mode is refine or autoselect. |
| cluster_dist | str | core_acc | Genomic distance used in clustering references. Options: core_acc or ani. Determines the clustering workflow; see below for more info. Only used when ref_mode is refine or full. |

The pipeline's idea of strain-level is defined by the clustering stage. When you supply groups in --ref_mode index the references are pre-clustered. In --ref_mode refine or full you have a choice of clustering workflows defined by the --cluster_dist param.

The default value --cluster_dist core_acc means that a poppunk workflow is applied; see PopPUNK Options below to configure. Be aware this is a non-deterministic mode of clustering, developed to cluster single-species genome datasets to the strain level. If you want to re-use the same clusters generated in a previous run you would need to use --ref_mode index. Note that --ref_mode autoselect currently only uses this poppunk-based clustering workflow.

Alternatively ANI-based community-finding algorithms are available; using --cluster_dist ani instead invokes sketchlib to estimate ANI similarities followed by a choice of community-finding algorithms from the package python-igraph, including some deterministic algorithms. See Sketchlib workflow options below to configure.


Reference Autoselection options

| Flag | Type | Default | Description |
|---|---|---|---|
| sylph_db | Path | "/data/pam/software/sylph/gtdb_full_r226.syldb" | Path to a pre-built Sylph database (.syldb). |
| sylph_tax_metadata | Path | "/data/pam/software/sylph-tax/v1/gtdb_r226_metadata.tsv" | Path to the sylph-tax metadata TSV to use for sylph-tax taxprof. |
| sylph_k | int | 31 | K-mer size for sylph sketch. |
| sylph_min_ani | float | 95 | ANI threshold for Sylph filtering. |
| sylph_min_cov | float | 0.01 | Coverage threshold for Sylph filtering. |
| taxonomic_rank | str | species | Taxonomic rank by which to group references. Choices: domain, kingdom, phylum, class, order, family, genus, species. |
| pool_latin_taxa | bool | false | Advanced option. Ignores alphabet suffixes of GTDB divisions of Latin-name taxa, thus pooling those subdivisions together. Not recommended to change unless the effects on output are understood; see below for more info. |
| save_sylph_sketches | bool | true | Keep Sylph sketches. |
| genome_id_to_file | Path | "/data/pam/collections/GTDB/release226/genomic_files_all_retrievable_2026_03_05/metadata/id_to_genome_path.tsv" | File from which to extract genome paths based on genome identifiers. |

Note on using pool_latin_taxa: Certain genus/species in GTDB are further divided by appended alphabet suffixes; for example, in GTDB r226, Escherichia coli has 3 species-rank taxonomic groups: Escherichia_coli, Escherichia_coli_E and Escherichia_coli_F. Further explanation is available in the GTDB documentation. If you wanted to consider these as one group you can use this advanced option. Note that:

a) the generated groups are no longer compliant with GTDB taxonomic definitions; consider whether this affects downstream analyses

b) the size of the produced group may be considerably larger; for example, at the genus level in GTDB release 232, g__Clostridium has 1607 genomes but all 34 GTDB genera matching g__Clostridium* total 2931 genomes.

Note that not all taxa belonging to a "traditional" species might be pooled this way due to certain GTDB species being named differently; for instance in GTDB r232, a new species called ECMA0423 sp047199055 has been created out of genomes previously classified as Escherichia_coli.
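The effect of pooling can be illustrated with a small Python sketch that strips the alphabet suffixes from taxon names and regroups them. This is only an illustration of the idea, not the pipeline's implementation; the function name is hypothetical, and it assumes suffixes appear as underscore-separated tokens made only of capital letters, as in the names quoted above.

```python
from collections import defaultdict


def pool_latin_name(taxon):
    """Drop GTDB-style alphabet suffix tokens such as the trailing _E in Escherichia_coli_E."""
    head, *rest = taxon.split("_")
    # Keep the first token untouched; drop later tokens made only of capital letters.
    kept = [tok for tok in rest if not (tok.isalpha() and tok.isupper())]
    return "_".join([head, *kept])


taxa = [
    "Escherichia_coli",
    "Escherichia_coli_E",
    "Escherichia_coli_F",
    "ECMA0423_sp047199055",  # differently named, so it is not pooled with E. coli
]

pooled = defaultdict(list)
for taxon in taxa:
    pooled[pool_latin_name(taxon)].append(taxon)

for group, members in pooled.items():
    print(group, members)
```

The three Escherichia_coli divisions collapse into one group, while ECMA0423 sp047199055 stays separate, matching the caveat above.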


Cache Autoselection options

| Flag | Type | Default | Description |
|---|---|---|---|
| cache_dir | path | null | Path to a cache root or an existing config-specific cache directory for autoselect mode. The pipeline reuses a matching cache directory containing cache metadata and per-species reference/group entries. |

Cache Layout:

<cache_root>/
  core_acc-bgmm-20_reps/
    metadata.json
    species/
      escherichia_coli/
        references.txt
        groups.txt
        metadata.json

Cache setup and lookup intermediates such as cache_config.json, cache_hits.tsv, and cache_miss.tsv are kept in the Nextflow work/ directory and are not published to results.

When --cache_dir is supplied, generated reference entries are written directly to the external cache directory, not to results.

The config-level metadata.json records the clustering settings used for that cache directory. Each species-level metadata.json records cache write/update details for that species, including update counts and added reference IDs.
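To see which species a cache directory already holds, the layout above can be walked with a short script. This is a sketch under the assumption that a usable entry contains all three files shown in the layout; it does not read or interpret the metadata.json contents, whose schema is not described here. The function name and the incomplete example species are hypothetical.

```python
import tempfile
from pathlib import Path

REQUIRED = ("references.txt", "groups.txt", "metadata.json")


def list_cached_species(config_dir):
    """Return species directory names that contain every file in REQUIRED."""
    species_root = Path(config_dir) / "species"
    if not species_root.is_dir():
        return []
    return sorted(
        entry.name
        for entry in species_root.iterdir()
        if entry.is_dir() and all((entry / name).is_file() for name in REQUIRED)
    )


# Demonstration: rebuild the documented layout in a temporary directory.
with tempfile.TemporaryDirectory() as cache_root:
    config_dir = Path(cache_root) / "core_acc-bgmm-20_reps"
    complete = config_dir / "species" / "escherichia_coli"
    complete.mkdir(parents=True)
    for name in REQUIRED:
        (complete / name).touch()
    # A made-up, incomplete entry that should not be reported.
    partial = config_dir / "species" / "hypothetical_species"
    partial.mkdir()
    (partial / "references.txt").touch()
    cached = list_cached_species(config_dir)

print(cached)
```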


PopPUNK options

| Flag | Type | Default | Description |
|---|---|---|---|
| poppunk_model | str | dbscan | Clustering model for PopPUNK to use (either dbscan or bgmm). |
| publish_poppunk | bool | false | Optionally publish the full PopPUNK output; group assignments are always published. |

⚠️ It is strongly recommended to leave --publish_poppunk as false when using --ref_mode autoselect or --ref_mode refine. The PopPUNK outputs are generated on the full set of genomes supplied, in the case of --ref_mode autoselect all genomes for the detected species, rather than the representatives used downstream. Additionally, as outputs are generated per species, --ref_mode autoselect can produce a large number of files with significant storage overhead.


Sketchlib workflow options

| Flag | Type | Default | Description |
|---|---|---|---|
| ani_threshold | float | 0.02 | Max ANI distance threshold for clustering (the default 0.02 clusters genomes sharing >98% ANI similarity). |
| sketchlib_kstep | str | "13,29,4" | K-mer sizes at which sketchlib will sketch the references, in the format start,stop,step. |
| cluster_strict | bool | false | Fail early if all genomes form a single cluster, or each genome is a singleton. |
| cluster_algorithm | str | connected_components | Name of the clustering/community-finding algorithm to be used in sketchlib clustering. Options: connected_components, leiden, louvain, walktrap, fastgreedy, label_propagation, infomap, eigenvector. |

Deterministic methods include connected_components (default, also known as single-linkage clustering), walktrap, fastgreedy and eigenvector. Also available are the louvain, leiden, infomap and label_propagation methods.
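For intuition about the default connected_components method, the following Python sketch performs single-linkage clustering with a union-find structure: any pair of genomes at or below the distance threshold is linked, and clusters are the resulting connected groups. It is an illustration only; the pipeline itself uses python-igraph, the genome names and distances are made up, and treating the threshold as inclusive is an assumption.

```python
def connected_components(names, distances, threshold=0.02):
    """Single-linkage clustering: link pairs whose distance is <= threshold.

    names:     list of genome identifiers
    distances: dict mapping (name_a, name_b) pairs to an ANI distance
    """
    parent = {n: n for n in names}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        if d <= threshold:  # inclusive comparison is an assumption
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values())


genomes = ["g1", "g2", "g3", "g4"]
dists = {
    ("g1", "g2"): 0.01,
    ("g2", "g3"): 0.015,
    ("g1", "g3"): 0.03,  # above the threshold, yet g1 and g3 are chained via g2
    ("g1", "g4"): 0.25,
    ("g2", "g4"): 0.22,
    ("g3", "g4"): 0.2,
}
clusters = connected_components(genomes, dists)
print(clusters)
```

The example shows the chaining behaviour typical of single linkage: g1 and g3 end up together even though their direct distance exceeds the threshold.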


Themisto options

| Flag | Type | Default | Description |
|---|---|---|---|
| themisto_index | path | null | Path to a pre-built Themisto index including the index prefix (without file extensions). Skips indexing if provided. |
| themisto_k | integer | 31 | K-mer size for indexing and pseudoalignment. Allowed values: 21, 31, 51. K-mer sizes must match if an index is provided. |
| temp_dir | path | null | Custom temporary storage directory to be used during runtime. Otherwise local /tmp will be used. |
| temp_space | integer | 10000 | Amount of /tmp space (MB) that will be reserved for index creation and pseudoalignment, if /tmp is being used as the temporary storage directory. |

mSWEEP options

| Flag | Type | Default | Description |
|---|---|---|---|
| ref_groups | path | null | Grouped references text file, one line per reference. Mandatory only when a pre-built index is supplied to --themisto_index. |

mGEMS options

| Flag | Type | Default | Description |
|---|---|---|---|
| get_assignments | boolean | false | Output the read assignment table used by mGEMS for binning. |
| min_abundance | float | 0.0001 | Only bin reads for groups that have a relative abundance higher than this value. |
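To preview which groups would clear the min_abundance cutoff, the abundance table can be filtered with a few lines of Python. The sketch assumes the abundances file is tab-separated with a group identifier and an abundance per line, and that lines starting with # are comments; check this against your own <sample_id>_mSWEEP_abundances.txt before relying on it. The group names and values below are made up.

```python
def groups_to_bin(abundance_lines, min_abundance=0.0001):
    """Return {group: abundance} for groups strictly above min_abundance."""
    kept = {}
    for line in abundance_lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank and comment lines
            continue
        group, value = line.split("\t")[:2]
        abundance = float(value)
        if abundance > min_abundance:  # "higher than", so an exact tie is excluded
            kept[group] = abundance
    return kept


example_lines = [
    "#c_id\tmean_theta",
    "group_1\t0.75",
    "group_2\t0.2499",
    "group_3\t0.0001",
]
selected = groups_to_bin(example_lines)
print(selected)
```

With a real file, pass the open file handle in place of the example list, since the function only needs an iterable of lines.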

Dependencies

  • Nextflow ≥ 22.03.0, < 26.04.0
  • sylph and sylph-tax databases for GTDB.
  • All other dependencies are containerised in publicly available docker images.

Software versions

The current version of the pipeline uses the following software dependencies:

| Software | Version | Image URL |
|---|---|---|
| themisto | 3.2.2 | quay.io/sangerpathogens/themisto:3.2.2 |
| mSWEEP | 2.2.1 | quay.io/biocontainers/msweep:2.2.1--h503566f_1 |
| mGEMS | 1.3.3 | quay.io/biocontainers/mgems:1.3.3--h13024bc_2 |
| PopPUNK | 2.7.8 | quay.io/biocontainers/poppunk:2.7.8--py310h4d0eb5b_0 |
| sylph | 0.8.1 | quay.io/biocontainers/sylph:0.9.0 |
| pp-sketchlib | 2.1.5 | quay.io/sangerpathogens/pp-sketchlib-python:2.1.5-c1 |
| python-igraph | 1.0.0 | quay.io/sangerpathogens/pp-sketchlib-python:2.1.5-c1 |

Customise Temporary Storage

The --temp_dir option lets you customise the temporary storage location if necessary. Themisto pseudoalignment requires temporary storage, and that storage must be on the same filesystem as the one on which the process runs. By default this pipeline uses node-local /tmp, which is safe on both HPC and non-HPC systems as long as /tmp is available and writable (usually true).

GPU Acceleration

This current version is not yet GPU enabled. Watch this space!
