CytoBulk is a toolkit for bulk and spatial transcriptomics deconvolution and mapping.
- CytoBulk has been tested on WSL2 and Linux systems.
- On Windows and macOS, many packages listed in `environment.yml` may not have matching versions (or may be unavailable), so Docker is the recommended first-choice installation/runtime method.
- If local installation fails and the issue cannot be resolved, please use Docker.
Core functions:
`bulk_deconv`, `st_deconv`, `st_mapping`, `bulk_mapping`, `he_mapping`
For reproducing results from the paper, please refer to:
Run commands first:
conda env create -f environment.yml
conda activate cytobulk
pip install -e .
Most common dependencies are included in `environment.yml`, but installing Giotto may still require manually installing additional packages.
Then install Giotto in R (required for marker detection with Giotto):
library(devtools) # if not installed: install.packages('devtools')
library(remotes) # if not installed: install.packages('remotes')
remotes::install_github("RubD/Giotto")
Giotto reference:
Run commands first:
docker --version
docker info
docker pull kristawang/cytobulk:1.0.0
docker images | grep cytobulk
If Docker runs into the OOM killer or other out-of-memory issues, add a memory limit to the Docker command, for example:
docker run --memory=16g ...
Note on Docker logs: In some environments, runtime logs may appear in batches (or mostly at the end) due to output buffering. If you do not see real-time logs, the program may still be running normally; passing `-e PYTHONUNBUFFERED=1` to `docker run` (as in the demo command below) can also help.
In some runs, an R-related segmentation fault may appear at the very end (suspected in an additional spatial evaluation step). This does not affect generation of core output files.
If you encounter an error during `conda env create -f environment.yml` that looks like:
ERROR: Failed to build 'rpy2' when getting requirements to build wheel
...
FileNotFoundError: [Errno 2] No such file or directory: 'gcc'
...
distutils.compilers.C.errors.CompileError: command 'gcc' failed: No such file or directory
Cause: the system lacks the C compiler (gcc) needed to build rpy2 from source.
Solution: Install the compilers using conda:
conda install -c conda-forge compilers make pkg-config
After installation, verify that gcc is available:
which gcc && gcc --version
You should see the path to gcc and its version information. Then retry the environment creation:
conda env create -f environment.yml
- Most inputs are `.h5ad` (AnnData) files.
- `bulk_mapping` requires `bulk_adata.uns['deconv']` generated by `bulk_deconv`; `st_mapping` requires `st_adata.uns['deconv']` generated by `st_deconv`.
- For `he_mapping`, `lr_data` must contain `ligand` and `receptor` columns.
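These prerequisites can be verified up front, before starting a long run. A minimal sketch (the helper `check_mapping_inputs` is our illustration, not part of the CytoBulk API; `uns` stands in for `adata.uns`):

```python
import pandas as pd

def check_mapping_inputs(uns, lr_data=None):
    """Collect human-readable problems before running a mapping step."""
    problems = []
    if "deconv" not in uns:
        problems.append("missing uns['deconv'] -- run the deconvolution step first")
    if lr_data is not None:
        # he_mapping requires both a 'ligand' and a 'receptor' column.
        for col in ("ligand", "receptor"):
            if col not in lr_data.columns:
                problems.append("lr_data lacks required column '%s'" % col)
    return problems
```

Running it on, for example, an un-deconvolved `AnnData` and a ligand-only table would report both problems at once.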
import cytobulk as ct
import scanpy as sc
sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
bulk_adata = sc.read_h5ad("/path/to/bulk_adata.h5ad")
deconv_result, bulk_out = ct.tl.bulk_deconv(
bulk_data=bulk_adata,
sc_adata=sc_adata,
annotation_key="celltype_minor",
dataset_name="my_bulk",
out_dir="/path/to/output",
n_cell=500,
)
docker run --rm -it \
-v /path/to/input:/inputs:ro \
-v /path/to/output:/outputs \
kristawang/cytobulk:1.0.0 \
bulk_deconv \
--sc /inputs/sc_adata.h5ad \
--bulk /inputs/bulk_adata.h5ad \
--annotation_key celltype_minor \
--out_dir /outputs \
--dataset_name my_bulk \
--n_cell 500 \
--seed 64
- `bulk_data`: bulk `AnnData`.
- `sc_adata`: single-cell `AnnData` reference.
- `annotation_key`: cell type column in `sc_adata.obs`.
- `dataset_name` (default: `""`): output file prefix.
- `out_dir` (default: `"."`): output directory.
- `n_cell` (default: `2000`): pseudo-bulk cell number per synthetic sample group.
- `top_k` (default: `50`): number of eigen components used in graph deconvolution.
- `use_adversarial` (default: `True`): enable adversarial training in the deconvolution model.
- `specificity` (default: `False`): whether to generate additional cell-type-specific simulated bulk mixtures.
  - Recommendation: keep `False` for randomly simulated bulk data.
  - Recommendation: consider `True` for real cohorts where dominant-cell-type simulation is beneficial.
- `high_purity` (default: `False`): only meaningful when `specificity=True`; generates higher dominant-cell-type purity in simulation.
  - Recommendation: set `True` for high tumor-purity cohorts (for example, TCGA-like settings).
- `bulk_hvg` (default: `True`): whether to also keep highly variable genes (HVGs) in bulk data.
- `reproduce` (default: `False`): enable strict reproduction mode; requires pretrained files in `out_dir/model` and a batch-effect file under `out_dir/model/batch_effect`.
Additional preprocessing kwargs commonly used:
- `downsampling` (default in preprocessing: `False`): downsample per-cell-type reference cells before marker/HVG steps. Recommended to set `True` for large single-cell datasets.
- `giotto_gene_num` (default in preprocessing: `150`): marker-gene count for Giotto-based marker detection.
- `skip_find_markers` (default in preprocessing: `False`): skip marker discovery and use overlapping genes directly.
- `deconv_result` (`pandas.DataFrame`): predicted cell-type fractions; also stored in `bulk_out.uns['deconv']`.
- `bulk_out` (`anndata.AnnData`): original bulk `AnnData` with `uns['deconv']` added; saved to `out_dir/output/{dataset_name}_bulk_adata.h5ad`.
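Since each row of `deconv_result` holds predicted fractions for one bulk sample, a quick downstream sanity check is to confirm that each row sums to roughly 1 and to extract the dominant cell type. A sketch with made-up numbers (not actual CytoBulk output):

```python
import pandas as pd

# Hypothetical fraction table shaped like deconv_result: rows are bulk
# samples, columns are cell types (all values illustrative).
deconv = pd.DataFrame(
    {"T_cell": [0.6, 0.2], "B_cell": [0.3, 0.1], "Tumor": [0.1, 0.7]},
    index=["sample_1", "sample_2"],
)
# Fractions in each row should sum to ~1.
row_sums = deconv.sum(axis=1)
# Dominant predicted cell type per sample.
dominant = deconv.idxmax(axis=1)
```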
We provide one runnable demo input in demo/:
- `demo/NSCLC_GSE127471.h5ad` (single-cell reference)
- `demo/NSCLC_GSE127471_bulk.h5ad` (bulk input)
Use `annotation_key="Celltype_minor"` for this demo.
For randomly simulated data (for example, NSCLC_GSE127471), we recommend `specificity=False`; otherwise, set `specificity=True` or keep it unset (use the default behavior).
Docker version:
DATASET_DIR="/absolute/path/to/CytoBulk/demo"
DATASET_OUT="/absolute/path/to/output_dir"
DATASET_NAME="NSCLC_GSE127471"
docker run --rm -it \
-e PYTHONUNBUFFERED=1 \
-e HOST_UID="$(id -u)" \
-e HOST_GID="$(id -g)" \
-v "${DATASET_DIR}":/inputs:ro \
-v "${DATASET_OUT}":/outputs \
kristawang/cytobulk:1.0.0 \
bulk_deconv \
--sc "/inputs/${DATASET_NAME}.h5ad" \
--bulk "/inputs/${DATASET_NAME}_bulk.h5ad" \
--annotation_key "Celltype_minor" \
--out_dir "/outputs/" \
--dataset_name "${DATASET_NAME}" \
--n_cell 100 \
--seed 64 \
--specificity False
Path definition for Docker mounts:
- `DATASET_DIR`: local folder containing demo input files; mounted to container path `/inputs` as read-only.
- `DATASET_OUT`: local output folder; mounted to container path `/outputs` for writing results.
- `--sc` and `--bulk`: container-internal input paths under `/inputs`.
- `--out_dir`: container-internal output path (`/outputs/`).
Conda version:
import os
import cytobulk as ct
from scanpy import read_h5ad
import warnings
warnings.filterwarnings("ignore")
dataset_name = "NSCLC_GSE127471"
annotation_key = "Celltype_minor"
sc_adata_path = "demo/NSCLC_GSE127471.h5ad"
bulk_adata_path = "demo/NSCLC_GSE127471_bulk.h5ad"
out_dir = "demo_output"
sc_adata = read_h5ad(sc_adata_path)
bulk_adata = read_h5ad(bulk_adata_path)
os.makedirs(out_dir, exist_ok=True)
ct.tl.bulk_deconv(
bulk_data=bulk_adata,
sc_adata=sc_adata,
annotation_key=annotation_key,
out_dir=out_dir,
dataset_name=dataset_name,
n_cell=100,
specificity=False
)
Note: Due to repository storage constraints, only `bulk_deconv` demo data is provided in this repository. For more comprehensive demo cases and use cases for the other functions (`st_deconv`, `st_mapping`, `bulk_mapping`, `he_mapping`), please refer to the CytoBulk_paper repository.
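Assuming the saved-output naming pattern documented above (`out_dir/output/{dataset_name}_bulk_adata.h5ad`), the demo result can be located like this:

```python
# Build the expected demo output path from the documented naming pattern.
out_dir = "demo_output"
dataset_name = "NSCLC_GSE127471"
result_path = f"{out_dir}/output/{dataset_name}_bulk_adata.h5ad"
```

The resulting file can then be read back with `scanpy.read_h5ad` to inspect `uns['deconv']`.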
import cytobulk as ct
import scanpy as sc
sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
st_adata = sc.read_h5ad("/path/to/st_adata.h5ad")
deconv_result, st_out = ct.tl.st_deconv(
st_adata=st_adata,
sc_adata=sc_adata,
annotation_key="cell_type",
dataset_name="my_st",
out_dir="/path/to/output",
n_cell=8,
)
docker run --rm -it \
-v /path/to/input:/inputs:ro \
-v /path/to/output:/outputs \
kristawang/cytobulk:1.0.0 \
st_deconv \
--sc /inputs/sc_adata.h5ad \
--st /inputs/st_adata.h5ad \
--annotation_key cell_type \
--out_dir /outputs \
--dataset_name my_st \
--n_cell 8 \
--seed 64
- `st_adata`: spatial transcriptomics `AnnData`.
- `sc_adata`: single-cell `AnnData` reference.
- `annotation_key`: cell type column in `sc_adata.obs`.
- `dataset_name` (default: `""`): output file prefix.
- `out_dir` (default: `"."`): output directory.
- `n_cell` (default: `10`): base number of cells per simulated spot.
- `top_k` (default: `50`): graph deconvolution eigen components.
- `skip_find_markers` (default: `False`): skip marker detection (and use all overlapping genes).
- `use_adversarial` (default: `True`): adversarial model training toggle.
- `st_hvg` (default: `True`): whether to keep HVGs for ST data.
- `reproduce` (default: `False`): requires pretrained files in `out_dir/st_model` and a batch-effect file under `out_dir/st_model/batch_effect`.
- `deconv_result` (`pandas.DataFrame`): predicted cell-type fractions; also stored in `st_out.uns['deconv']`.
- `st_out` (`anndata.AnnData`): original ST `AnnData` with `uns['deconv']` added; saved to `out_dir/output/{dataset_name}_st_adata.h5ad`.
import cytobulk as ct
import scanpy as sc
sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
st_adata = sc.read_h5ad("/path/to/output/my_st_st_adata.h5ad")
reconstructed_sc, reconstructed_adata = ct.tl.st_mapping(
st_adata=st_adata,
sc_adata=sc_adata,
out_dir="/path/to/output",
project="my_st",
annotation_key="cell_type",
seed=64,
)
In the Docker reproduction scripts, this step is exposed as `st_reconstruction` (functionally corresponding to `ct.tl.st_mapping`).
docker run --rm -it \
-v /path/to/input:/inputs:ro \
-v /path/to/output:/outputs \
kristawang/cytobulk:1.0.0 \
st_reconstruction \
--sc /inputs/sc_adata.h5ad \
--st /outputs/output/my_st_st_adata.h5ad \
--annotation_key cell_type \
--out_dir /outputs \
--dataset_name my_st \
--seed 64
- `st_adata`: deconvolved ST `AnnData` with `uns['deconv']`.
- `sc_adata`: single-cell `AnnData` reference.
- `out_dir`: output directory.
- `project`: output prefix/tag.
- `annotation_key`: cell type column in `sc_adata.obs`.
- `seed` (default: `0`): random seed.
- `sc_downsample` (default: `False`): whether to downsample scRNA-seq counts before matching.
- `scRNA_max_transcripts_per_cell` (default: `1500`): transcript cap when `sc_downsample=True`.
- `mean_cell_numbers` (default: `8`): used to estimate cells per spot if `st_adata.obsm['cell_num']` is absent.
- `save_reconstructed_st` (default: `True`): save the reconstructed ST `AnnData`.
- `reconstructed_sc` (`pandas.DataFrame`): spot-to-cell mapping table with columns `spot_id` and `cell_id`.
- `reconstructed_adata` (`anndata.AnnData`): reconstructed ST expression `AnnData` (contains the reconstructed expression, with the original ST stored in layer `original_st`).
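The `spot_id`/`cell_id` table lends itself to simple per-spot summaries, such as counting mapped cells per spot. A sketch with toy values (not real output):

```python
import pandas as pd

# Hypothetical spot-to-cell table shaped like reconstructed_sc:
# one row per mapped cell (all values illustrative).
mapping = pd.DataFrame({
    "spot_id": ["spot_1", "spot_1", "spot_2"],
    "cell_id": ["cell_a", "cell_b", "cell_c"],
})
# Number of mapped cells assigned to each spot.
cells_per_spot = mapping.groupby("spot_id")["cell_id"].count()
```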
import cytobulk as ct
import scanpy as sc
sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
bulk_adata = sc.read_h5ad("/path/to/output/my_bulk_bulk_adata.h5ad")
reconstructed_cell, reconstructed_bulk = ct.tl.bulk_mapping(
bulk_adata=bulk_adata,
sc_adata=sc_adata,
annotation_key="celltype_minor",
out_dir="/path/to/output",
project="my_bulk",
n_cell=500,
multiprocessing=False,
)
docker run --rm -it \
-v /path/to/input:/inputs:ro \
-v /path/to/output:/outputs \
kristawang/cytobulk:1.0.0 \
bulk_mapping \
--sc /inputs/sc_adata.h5ad \
--bulk /outputs/output/my_bulk_bulk_adata.h5ad \
--annotation_key celltype_minor \
--out_dir /outputs \
--dataset_name my_bulk \
--n_cell 500 \
--seed 64
- `bulk_adata`: deconvolved bulk `AnnData` with `uns['deconv']`.
- `sc_adata`: single-cell `AnnData` reference.
- `n_cell` (default: `100`): number of mapped single cells per bulk sample.
- `annotation_key` (default: `"curated_cell_type"`): cell type column in `sc_adata.obs`.
- `bulk_layer` (default: `None`): layer key used as the bulk expression matrix.
- `sc_layer` (default: `None`): layer key used as the single-cell expression matrix.
- `reorder` (default: `True`): reorder genes to enforce a consistent gene order between bulk and sc.
- `multiprocessing` (default: `True`): parallel mapping.
- `cpu_num` (default: `cpu_count()-4`): worker count when multiprocessing is enabled.
- `normalization` (default: `True`): apply CPM + log normalization before mapping.
- `filter_gene` (default: `True`): filter genes by cosine similarity between the original and reconstructed bulk expression.
- `save` (default: `True`): write mapping outputs to disk.
- `reconstructed_cell` (`pandas.DataFrame`): mapping table with columns `sample_id` and `cell_id`.
- `reconstructed_bulk` (`anndata.AnnData`): bulk `AnnData` containing mapping-related layers/fields (for example `layers['mapping']`, `layers['mapping_ori']`, `obsm['cell_number']`).
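With `sample_id`/`cell_id` in hand, joining against cell-type labels gives a per-sample composition table. A sketch with toy values (the label series below stands in for `sc_adata.obs[annotation_key]`; none of this is real output):

```python
import pandas as pd

# Hypothetical mapping table shaped like reconstructed_cell.
mapping = pd.DataFrame({"sample_id": ["s1", "s1", "s2"],
                        "cell_id": ["c1", "c2", "c3"]})
# Stand-in for the cell-type labels in sc_adata.obs.
cell_types = pd.Series({"c1": "T_cell", "c2": "B_cell", "c3": "T_cell"},
                       name="cell_type")
# Per-sample cell-type counts: join labels, then pivot into a count table.
composition = (mapping.join(cell_types, on="cell_id")
                      .groupby(["sample_id", "cell_type"]).size()
                      .unstack(fill_value=0))
```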
import cytobulk as ct
import scanpy as sc
import pandas as pd
# Optional: create tiles from a .svs image
ct.pp.process_svs_image(
svs_path="/path/to/sample.svs",
output_dir="/path/to/tiles",
crop_size=224,
magnification=1,
center_x=21000,
center_y=11200,
fold_width=10,
fold_height=10,
enable_cropping=True,
)
sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
lr_data = pd.read_csv("/path/to/lrpairs.csv")
cell_coordinates, mapping_df = ct.tl.he_mapping(
image_dir="/path/to/tiles",
out_dir="/path/to/output",
project="my_he",
lr_data=lr_data,
sc_adata=sc_adata,
annotation_key="cell_type",
k_neighbor=30,
alpha="auto_compute",
batch_size=10000,
mapping_sc=True,
return_adata=False,
)
docker run --rm -it \
-v /path/to/input:/inputs:ro \
-v /path/to/output:/outputs \
kristawang/cytobulk:1.0.0 \
he_mapping \
--svs_path /inputs/sample.svs \
--image_out_dir /outputs/tiles \
--enable_cropping 1 \
--crop_size 224 \
--magnification 1 \
--center_x 21000 \
--center_y 11200 \
--fold_width 10 \
--fold_height 10 \
--sc /inputs/sc_adata.h5ad \
--lr_csv /inputs/lrpairs.csv \
--annotation_key cell_type \
--out_dir /outputs/he_result \
--project my_he \
--k_neighbor 30 \
--batch_size 10000 \
--mapping_sc 1 \
--return_adata 1 \
--seed 20230602
For full H&E-to-scRNA mapping (`mapping_sc=True`):
- `image_dir`: folder containing processed image tiles.
- `out_dir`: output directory.
- `project`: output prefix/tag.
- `sc_adata`: single-cell `AnnData` reference.
- `lr_data`: ligand-receptor table (`ligand`, `receptor`).
- `annotation_key`: cell type column in `sc_adata.obs`.
If you run SVS preprocessing (`ct.pp.process_svs_image` or the Docker flags):
- `enable_cropping=True` (or `--enable_cropping 1`): you must provide the crop region parameters `center_x`, `center_y`, `fold_width`, `fold_height` (and usually set `crop_size` and `magnification` explicitly for reproducible tiling).
- `enable_cropping=False` (or `--enable_cropping 0`): process the whole slide by default; no crop region parameters are required.
- `enable_cropping` (default: `False`): whether to crop a local region before tiling.
  - `True`: crop around the specified region (`center_x`, `center_y`, `fold_width`, `fold_height`).
  - `False`: process the whole image; region parameters are ignored/not required.
- `crop_size` (default: `224`): tile size in pixels.
- `magnification` (default: `1`): magnification factor for the cropped/read region.
- `center_x`, `center_y` (example: `21000`, `11200`): crop center coordinates used when `enable_cropping=True`.
- `fold_width`, `fold_height` (default: `10`, `10`): crop grid size used when `enable_cropping=True`.
- `annotation_key` (default: `"curated_celltype"`): cell type label column.
- `k_neighbor` (default: `30`): graph neighbor size for image-cell graph construction.
- `alpha` (default: `"auto_compute"`): FGW trade-off between structure and feature matching.
  - `"auto_compute"`: automatically estimate alpha from the image cell-type distribution.
  - float in `[0, 1]`: manually set alpha.
- `mapping_sc` (default: `True`): if `False`, only return the H&E cell type prediction without scRNA mapping.
- `batch_size` (default: `3000`): number of image cells processed per batch.
- `downsampling` (default: `False`): downsample the scRNA reference for mapping.
- `return_adata` (default: `False`): return/save the mapped, filtered `AnnData`.
- `sc_st` (default: `False`): use a looser filtering/normalization path for spatial-like sc input.
- `anchor_expression` (default: `None`): optional anchor expression `AnnData` aligned to image coordinates.
- `expression_weight` (default: `0`): expression term weight in the cost matrix when anchor expression is provided.
- `skip_filtering` (default: `False`): skip scRNA filtering in this function.
- When `mapping_sc=False`: returns only `cell_coordinates` (`pandas.DataFrame`, H&E-inferred cell coordinates and predicted cell types).
- When `mapping_sc=True` and `return_adata=False`: returns `(cell_coordinates, mapping_df)`.
- When `mapping_sc=True` and `return_adata=True`: returns `(cell_coordinates, mapping_df, matched_adata)`.
- `mapping_df` is the H&E-to-scRNA matching table; `matched_adata` is the matched/filtered single-cell `AnnData`.
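Calling code can normalize these three return shapes into one structure. A small convenience sketch (the `unpack_he_mapping` helper is our illustration, not part of the CytoBulk API):

```python
def unpack_he_mapping(result, mapping_sc=True, return_adata=False):
    """Normalize he_mapping's three possible return shapes into a dict
    (sketch; mirrors the documented return behavior)."""
    if not mapping_sc:
        # Only the coordinate/cell-type table is returned.
        return {"cell_coordinates": result}
    if return_adata:
        coords, mapping_df, matched_adata = result
        return {"cell_coordinates": coords, "mapping_df": mapping_df,
                "matched_adata": matched_adata}
    coords, mapping_df = result
    return {"cell_coordinates": coords, "mapping_df": mapping_df}
```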
If you encounter the following error while running `ct.tl.he_mapping`:
_pickle.UnpicklingError: invalid load key, '<'.
This error usually means the pretrained model file was not fully downloaded (corrupted/incomplete file). To resolve it, manually download the model file and place it in the package pretrained-model directory.
Download DeepCMorph_Datasets_Combined_41_classes_acc_8159.pth from:
Then place it at:
cytobulk/tools/model/pretrained_models/DeepCMorph_Datasets_Combined_41_classes_acc_8159.pth.
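A quick way to confirm this diagnosis is to inspect the first bytes of the downloaded file: an HTML error page starts with `<`, which is exactly the "invalid load key" the unpickler reports. A stdlib sketch (the helper name is ours):

```python
from pathlib import Path

def looks_like_html_download(path):
    """Heuristic check: a failed download often saves an HTML error page
    instead of the model, and its leading '<' triggers the
    UnpicklingError shown above."""
    head = Path(path).read_bytes()[:64].lstrip()
    return head.startswith(b"<")
```

If this returns `True`, delete the file and re-download it.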
Large model files are not committed by default. If needed, place `DeepCMorph_Datasets_Combined_41_classes_acc_8159.pth` into `cytobulk/tools/model/pretrained_models/`.