CytoBulk is a toolkit for bulk and spatial transcriptomics deconvolution and mapping.
- CytoBulk has been tested on WSL2 and Linux systems.
- On Windows and macOS, many packages listed in `environment.yml` may not have matching versions (or may be unavailable), so Docker is the recommended first-choice installation/runtime method.
- If local installation fails and the issue cannot be resolved, please use Docker.
Core functions:
`bulk_deconv`, `st_deconv`, `st_mapping`, `bulk_mapping`, `he_mapping`
For reproducing results from the paper, please refer to:
Run commands first:
conda env create -f environment.yml
conda activate cytobulk
pip install -e .
Most common dependencies are included in `environment.yml`, but installing Giotto may still require manually installing additional packages.
Then install Giotto in R (required for marker detection with Giotto):
library(devtools) # if not installed: install.packages('devtools')
library(remotes) # if not installed: install.packages('remotes')
remotes::install_github("RubD/Giotto")
Giotto reference:
Run commands first:
docker --version
docker info
docker pull kristawang/cytobulk:1.0.0
docker images | grep cytobulk
If Docker runs into the OOM killer or other out-of-memory issues, add a memory limit to the Docker command, for example:
docker run --memory=16g ...
Note on Docker logs: In some environments, runtime logs may appear in batches (or mostly at the end) due to output buffering. If you do not see real-time logs, the program may still be running normally; passing `-e PYTHONUNBUFFERED=1` to `docker run` (as in the demo command below) can also help.
In some runs, an R-related segmentation fault may appear at the very end (suspected in an additional spatial evaluation step). This does not affect generation of core output files.
If you encounter an error during `conda env create -f environment.yml` that looks like:
ERROR: Failed to build 'rpy2' when getting requirements to build wheel
...
FileNotFoundError: [Errno 2] No such file or directory: 'gcc'
...
distutils.compilers.C.errors.CompileError: command 'gcc' failed: No such file or directory
Cause: the system lacks the C compiler (gcc) needed to build rpy2 from source.
Solution: Install the compilers using conda:
conda install -c conda-forge compilers make pkg-config
After installation, verify that gcc is available:
which gcc && gcc --version
You should see the path to gcc and its version information. Then retry the environment creation:
conda env create -f environment.yml
- Most inputs are `.h5ad` (AnnData) files.
- `bulk_mapping` requires `bulk_adata.uns['deconv']` generated by `bulk_deconv`; `st_mapping` requires `st_adata.uns['deconv']` generated by `st_deconv`.
- For `he_mapping`, `lr_data` must contain `ligand` and `receptor` columns.
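These prerequisites can be verified up front, before starting a long run. A minimal sketch (the helper `check_mapping_inputs` is our illustration, not part of the CytoBulk API; `uns` stands in for `adata.uns`):

```python
import pandas as pd

def check_mapping_inputs(uns, lr_data=None):
    """Collect human-readable problems before running a mapping step."""
    problems = []
    if "deconv" not in uns:
        problems.append("missing uns['deconv'] -- run the deconvolution step first")
    if lr_data is not None:
        # he_mapping requires both a 'ligand' and a 'receptor' column.
        for col in ("ligand", "receptor"):
            if col not in lr_data.columns:
                problems.append("lr_data lacks required column '%s'" % col)
    return problems
```

Running it on, for example, an un-deconvolved `AnnData` and a ligand-only table would report both problems at once.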
import cytobulk as ct
import scanpy as sc
sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
bulk_adata = sc.read_h5ad("/path/to/bulk_adata.h5ad")
deconv_result, bulk_out = ct.tl.bulk_deconv(
bulk_data=bulk_adata,
sc_adata=sc_adata,
annotation_key="celltype_minor",
dataset_name="my_bulk",
out_dir="/path/to/output",
n_cell=500,
)
docker run --rm -it \
-v /path/to/input:/inputs:ro \
-v /path/to/output:/outputs \
kristawang/cytobulk:1.0.0 \
bulk_deconv \
--sc /inputs/sc_adata.h5ad \
--bulk /inputs/bulk_adata.h5ad \
--annotation_key celltype_minor \
--out_dir /outputs \
--dataset_name my_bulk \
--n_cell 500 \
--seed 64
- `bulk_data`: bulk `AnnData`.
- `sc_adata`: single-cell `AnnData` reference.
- `annotation_key`: cell type column in `sc_adata.obs`.
- `dataset_name` (default: `""`): output file prefix.
- `out_dir` (default: `"."`): output directory.
- `n_cell` (default: `2000`): pseudo-bulk cell number per synthetic sample group.
- `top_k` (default: `50`): number of eigen components used in graph deconvolution.
- `use_adversarial` (default: `True`): enable adversarial training in the deconvolution model.
- `specificity` (default: `False`): whether to generate additional cell-type-specific simulated bulk mixtures.
  - Recommendation: keep `False` for randomly simulated bulk data.
  - Recommendation: consider `True` for real cohorts where dominant-cell-type simulation is beneficial.
- `high_purity` (default: `False`): only meaningful when `specificity=True`; generates higher dominant-cell-type purity in simulation.
  - Recommendation: set `True` for high tumor-purity cohorts (for example, TCGA-like settings).
- `bulk_hvg` (default: `True`): whether to also keep highly variable genes (HVGs) in bulk data.
- `reproduce` (default: `False`): enable strict reproduction mode; requires pretrained files in `out_dir/model` and a batch-effect file under `out_dir/model/batch_effect`.
Additional preprocessing kwargs commonly used:
- `downsampling` (default in preprocessing: `False`): downsample per-cell-type reference cells before marker/HVG steps. Recommended to set `True` for large single-cell datasets.
- `giotto_gene_num` (default in preprocessing: `150`): marker-gene count for Giotto-based marker detection.
- `skip_find_markers` (default in preprocessing: `False`): skip marker discovery and use overlapping genes directly.
- `deconv_result` (`pandas.DataFrame`): predicted cell-type fractions; also stored in `bulk_out.uns['deconv']`.
- `bulk_out` (`anndata.AnnData`): original bulk `AnnData` with `uns['deconv']` added; saved to `out_dir/output/{dataset_name}_bulk_adata.h5ad`.
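Since each row of `deconv_result` holds predicted fractions for one bulk sample, a quick downstream sanity check is to confirm that each row sums to roughly 1 and to extract the dominant cell type. A sketch with made-up numbers (not actual CytoBulk output):

```python
import pandas as pd

# Hypothetical fraction table shaped like deconv_result: rows are bulk
# samples, columns are cell types (all values illustrative).
deconv = pd.DataFrame(
    {"T_cell": [0.6, 0.2], "B_cell": [0.3, 0.1], "Tumor": [0.1, 0.7]},
    index=["sample_1", "sample_2"],
)
# Fractions in each row should sum to ~1.
row_sums = deconv.sum(axis=1)
# Dominant predicted cell type per sample.
dominant = deconv.idxmax(axis=1)
```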
We provide one runnable demo input in demo/:
- `demo/NSCLC_GSE127471.h5ad` (single-cell reference)
- `demo/NSCLC_GSE127471_bulk.h5ad` (bulk input)
Use `annotation_key="Celltype_minor"` for this demo.
For randomly simulated data (for example, NSCLC_GSE127471), we recommend `specificity=False`; otherwise, set `specificity=True` or keep it unset (use the default behavior).
Docker version:
DATASET_DIR="/absolute/path/to/CytoBulk/demo"
DATASET_OUT="/absolute/path/to/output_dir"
DATASET_NAME="NSCLC_GSE127471"
docker run --rm -it \
-e PYTHONUNBUFFERED=1 \
-e HOST_UID="$(id -u)" \
-e HOST_GID="$(id -g)" \
-v "${DATASET_DIR}":/inputs:ro \
-v "${DATASET_OUT}":/outputs \
kristawang/cytobulk:1.0.0 \
bulk_deconv \
--sc "/inputs/${DATASET_NAME}.h5ad" \
--bulk "/inputs/${DATASET_NAME}_bulk.h5ad" \
--annotation_key "Celltype_minor" \
--out_dir "/outputs/" \
--dataset_name "${DATASET_NAME}" \
--n_cell 100 \
--seed 64 \
--specificity False
Path definition for Docker mounts:
- `DATASET_DIR`: local folder containing demo input files; mounted to container path `/inputs` as read-only.
- `DATASET_OUT`: local output folder; mounted to container path `/outputs` for writing results.
- `--sc` and `--bulk`: container-internal input paths under `/inputs`.
- `--out_dir`: container-internal output path (`/outputs/`).
Conda version:
import os
import cytobulk as ct
from scanpy import read_h5ad
import warnings
warnings.filterwarnings("ignore")
dataset_name = "NSCLC_GSE127471"
annotation_key = "Celltype_minor"
sc_adata_path = "demo/NSCLC_GSE127471.h5ad"
bulk_adata_path = "demo/NSCLC_GSE127471_bulk.h5ad"
out_dir = "demo_output"
sc_adata = read_h5ad(sc_adata_path)
bulk_adata = read_h5ad(bulk_adata_path)
os.makedirs(out_dir, exist_ok=True)
ct.tl.bulk_deconv(
bulk_data=bulk_adata,
sc_adata=sc_adata,
annotation_key=annotation_key,
out_dir=out_dir,
dataset_name=dataset_name,
n_cell=100,
specificity=False
)
Note: Due to repository storage constraints, only `bulk_deconv` demo data is provided in this repository. For more comprehensive demo cases and use cases for the other functions (`st_deconv`, `st_mapping`, `bulk_mapping`, `he_mapping`), please refer to the CytoBulk_paper repository.
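Assuming the saved-output naming pattern documented above (`out_dir/output/{dataset_name}_bulk_adata.h5ad`), the demo result can be located like this:

```python
# Build the expected demo output path from the documented naming pattern.
out_dir = "demo_output"
dataset_name = "NSCLC_GSE127471"
result_path = f"{out_dir}/output/{dataset_name}_bulk_adata.h5ad"
```

The resulting file can then be read back with `scanpy.read_h5ad` to inspect `uns['deconv']`.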
import cytobulk as ct
import scanpy as sc
sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
st_adata = sc.read_h5ad("/path/to/st_adata.h5ad")
deconv_result, st_out = ct.tl.st_deconv(
st_adata=st_adata,
sc_adata=sc_adata,
annotation_key="cell_type",
dataset_name="my_st",
out_dir="/path/to/output",
n_cell=8,
)
docker run --rm -it \
-v /path/to/input:/inputs:ro \
-v /path/to/output:/outputs \
kristawang/cytobulk:1.0.0 \
st_deconv \
--sc /inputs/sc_adata.h5ad \
--st /inputs/st_adata.h5ad \
--annotation_key cell_type \
--out_dir /outputs \
--dataset_name my_st \
--n_cell 8 \
--seed 64
- `st_adata`: spatial transcriptomics `AnnData`.
- `sc_adata`: single-cell `AnnData` reference.
- `annotation_key`: cell type column in `sc_adata.obs`.
- `dataset_name` (default: `""`): output file prefix.
- `out_dir` (default: `"."`): output directory.
- `n_cell` (default: `10`): base number of cells per simulated spot.
- `top_k` (default: `50`): graph deconvolution eigen components.
- `skip_find_markers` (default: `False`): skip marker detection (and use all overlapping genes).
- `use_adversarial` (default: `True`): adversarial model training toggle.
- `st_hvg` (default: `True`): whether to keep HVGs for ST data.
- `reproduce` (default: `False`): requires pretrained files in `out_dir/st_model` and a batch-effect file under `out_dir/st_model/batch_effect`.
- `deconv_result` (`pandas.DataFrame`): predicted cell-type fractions; also stored in `st_out.uns['deconv']`.
- `st_out` (`anndata.AnnData`): original ST `AnnData` with `uns['deconv']` added; saved to `out_dir/output/{dataset_name}_st_adata.h5ad`.
import cytobulk as ct
import scanpy as sc
sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
st_adata = sc.read_h5ad("/path/to/output/my_st_st_adata.h5ad")
reconstructed_sc, reconstructed_adata = ct.tl.st_mapping(
st_adata=st_adata,
sc_adata=sc_adata,
out_dir="/path/to/output",
project="my_st",
annotation_key="cell_type",
seed=64,
)
In the Docker reproduction scripts, this step is exposed as `st_reconstruction` (functionally corresponding to `ct.tl.st_mapping`).
docker run --rm -it \
-v /path/to/input:/inputs:ro \
-v /path/to/output:/outputs \
kristawang/cytobulk:1.0.0 \
st_reconstruction \
--sc /inputs/sc_adata.h5ad \
--st /outputs/output/my_st_st_adata.h5ad \
--annotation_key cell_type \
--out_dir /outputs \
--dataset_name my_st \
--seed 64
- `st_adata`: deconvolved ST `AnnData` with `uns['deconv']`.
- `sc_adata`: single-cell `AnnData` reference.
- `out_dir`: output directory.
- `project`: output prefix/tag.
- `annotation_key`: cell type column in `sc_adata.obs`.
- `seed` (default: `0`): random seed.
- `sc_downsample` (default: `False`): whether to downsample scRNA-seq counts before matching.
- `scRNA_max_transcripts_per_cell` (default: `1500`): transcript cap when `sc_downsample=True`.
- `mean_cell_numbers` (default: `8`): used to estimate cells per spot if `st_adata.obsm['cell_num']` is absent.
- `save_reconstructed_st` (default: `True`): save the reconstructed ST `AnnData`.
- `reconstructed_sc` (`pandas.DataFrame`): spot-to-cell mapping table with columns `spot_id` and `cell_id`.
- `reconstructed_adata` (`anndata.AnnData`): reconstructed ST expression `AnnData` (contains the reconstructed expression, with the original ST stored in layer `original_st`).
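The `spot_id`/`cell_id` table lends itself to simple per-spot summaries, such as counting mapped cells per spot. A sketch with toy values (not real output):

```python
import pandas as pd

# Hypothetical spot-to-cell table shaped like reconstructed_sc:
# one row per mapped cell (all values illustrative).
mapping = pd.DataFrame({
    "spot_id": ["spot_1", "spot_1", "spot_2"],
    "cell_id": ["cell_a", "cell_b", "cell_c"],
})
# Number of mapped cells assigned to each spot.
cells_per_spot = mapping.groupby("spot_id")["cell_id"].count()
```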
import cytobulk as ct
import scanpy as sc
sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
bulk_adata = sc.read_h5ad("/path/to/output/my_bulk_bulk_adata.h5ad")
reconstructed_cell, reconstructed_bulk = ct.tl.bulk_mapping(
bulk_adata=bulk_adata,
sc_adata=sc_adata,
annotation_key="celltype_minor",
out_dir="/path/to/output",
project="my_bulk",
n_cell=500,
multiprocessing=False,
)
docker run --rm -it \
-v /path/to/input:/inputs:ro \
-v /path/to/output:/outputs \
kristawang/cytobulk:1.0.0 \
bulk_mapping \
--sc /inputs/sc_adata.h5ad \
--bulk /outputs/output/my_bulk_bulk_adata.h5ad \
--annotation_key celltype_minor \
--out_dir /outputs \
--dataset_name my_bulk \
--n_cell 500 \
--seed 64
- `bulk_adata`: deconvolved bulk `AnnData` with `uns['deconv']`.
- `sc_adata`: single-cell `AnnData` reference.
- `n_cell` (default: `100`): number of mapped single cells per bulk sample.
- `annotation_key` (default: `"curated_cell_type"`): cell type column in `sc_adata.obs`.
- `bulk_layer` (default: `None`): layer key used as the bulk expression matrix.
- `sc_layer` (default: `None`): layer key used as the single-cell expression matrix.
- `reorder` (default: `True`): reorder genes to enforce a consistent gene order between bulk and sc.
- `multiprocessing` (default: `True`): parallel mapping.
- `cpu_num` (default: `cpu_count()-4`): worker count when multiprocessing is enabled.
- `normalization` (default: `True`): apply CPM + log normalization before mapping.
- `filter_gene` (default: `True`): filter genes by cosine similarity between the original and reconstructed bulk expression.
- `save` (default: `True`): write mapping outputs to disk.
- `reconstructed_cell` (`pandas.DataFrame`): mapping table with columns `sample_id` and `cell_id`.
- `reconstructed_bulk` (`anndata.AnnData`): bulk `AnnData` containing mapping-related layers/fields (for example `layers['mapping']`, `layers['mapping_ori']`, `obsm['cell_number']`).
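With `sample_id`/`cell_id` in hand, joining against cell-type labels gives a per-sample composition table. A sketch with toy values (the label series below stands in for `sc_adata.obs[annotation_key]`; none of this is real output):

```python
import pandas as pd

# Hypothetical mapping table shaped like reconstructed_cell.
mapping = pd.DataFrame({"sample_id": ["s1", "s1", "s2"],
                        "cell_id": ["c1", "c2", "c3"]})
# Stand-in for the cell-type labels in sc_adata.obs.
cell_types = pd.Series({"c1": "T_cell", "c2": "B_cell", "c3": "T_cell"},
                       name="cell_type")
# Per-sample cell-type counts: join labels, then pivot into a count table.
composition = (mapping.join(cell_types, on="cell_id")
                      .groupby(["sample_id", "cell_type"]).size()
                      .unstack(fill_value=0))
```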
import cytobulk as ct
import scanpy as sc
import pandas as pd
# Optional: create tiles from a .svs image
ct.pp.process_svs_image(
svs_path="/path/to/sample.svs",
output_dir="/path/to/tiles",
crop_size=224,
magnification=1,
center_x=21000,
center_y=11200,
fold_width=10,
fold_height=10,
enable_cropping=True,
)
sc_adata = sc.read_h5ad("/path/to/sc_adata.h5ad")
lr_data = pd.read_csv("/path/to/lrpairs.csv")
cell_coordinates, mapping_df = ct.tl.he_mapping(
image_dir="/path/to/tiles",
out_dir="/path/to/output",
project="my_he",
lr_data=lr_data,
sc_adata=sc_adata,
annotation_key="cell_type",
k_neighbor=30,
alpha="auto_compute",
batch_size=10000,
mapping_sc=True,
return_adata=False,
)
docker run --rm -it \
-v /path/to/input:/inputs:ro \
-v /path/to/output:/outputs \
kristawang/cytobulk:1.0.0 \
he_mapping \
--svs_path /inputs/sample.svs \
--image_out_dir /outputs/tiles \
--enable_cropping 1 \
--crop_size 224 \
--magnification 1 \
--center_x 21000 \
--center_y 11200 \
--fold_width 10 \
--fold_height 10 \
--sc /inputs/sc_adata.h5ad \
--lr_csv /inputs/lrpairs.csv \
--annotation_key cell_type \
--out_dir /outputs/he_result \
--project my_he \
--k_neighbor 30 \
--batch_size 10000 \
--mapping_sc 1 \
--return_adata 1 \
--seed 20230602
For full H&E-to-scRNA mapping (`mapping_sc=True`):
- `image_dir`: folder containing processed image tiles.
- `out_dir`: output directory.
- `project`: output prefix/tag.
- `sc_adata`: single-cell `AnnData` reference.
- `lr_data`: ligand-receptor table (`ligand`, `receptor`).
- `annotation_key`: cell type column in `sc_adata.obs`.
If you run SVS preprocessing (`ct.pp.process_svs_image` or the Docker flags):
- `enable_cropping=True` (or `--enable_cropping 1`): you must provide the crop region parameters `center_x`, `center_y`, `fold_width`, `fold_height` (and usually set `crop_size` and `magnification` explicitly for reproducible tiling).
- `enable_cropping=False` (or `--enable_cropping 0`): process the whole slide by default; no crop region parameters are required.
- `enable_cropping` (default: `False`): whether to crop a local region before tiling.
  - `True`: crop around the specified region (`center_x`, `center_y`, `fold_width`, `fold_height`).
  - `False`: process the whole image; region parameters are ignored/not required.
- `crop_size` (default: `224`): tile size in pixels.
- `magnification` (default: `1`): magnification factor for the cropped/read region.
- `center_x`, `center_y` (example: `21000`, `11200`): crop center coordinates used when `enable_cropping=True`.
- `fold_width`, `fold_height` (default: `10`, `10`): crop grid size used when `enable_cropping=True`.
- `annotation_key` (default: `"curated_celltype"`): cell type label column.
- `k_neighbor` (default: `30`): graph neighbor size for image-cell graph construction.
- `alpha` (default: `"auto_compute"`): FGW trade-off between structure and feature matching.
  - `"auto_compute"`: automatically estimate alpha from the image cell-type distribution.
  - float in `[0, 1]`: manually set alpha.
- `mapping_sc` (default: `True`): if `False`, only return the H&E cell type prediction without scRNA mapping.
- `batch_size` (default: `3000`): number of image cells processed per batch.
- `downsampling` (default: `False`): downsample the scRNA reference for mapping.
- `return_adata` (default: `False`): return/save the mapped, filtered `AnnData`.
- `sc_st` (default: `False`): use a looser filtering/normalization path for spatial-like sc input.
- `anchor_expression` (default: `None`): optional anchor expression `AnnData` aligned to image coordinates.
- `expression_weight` (default: `0`): expression term weight in the cost matrix when anchor expression is provided.
- `skip_filtering` (default: `False`): skip scRNA filtering in this function.
- When `mapping_sc=False`: returns only `cell_coordinates` (`pandas.DataFrame`, H&E-inferred cell coordinates and predicted cell types).
- When `mapping_sc=True` and `return_adata=False`: returns `(cell_coordinates, mapping_df)`.
- When `mapping_sc=True` and `return_adata=True`: returns `(cell_coordinates, mapping_df, matched_adata)`.
- `mapping_df` is the H&E-to-scRNA matching table; `matched_adata` is the matched/filtered single-cell `AnnData`.
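Calling code can normalize these three return shapes into one structure. A small convenience sketch (the `unpack_he_mapping` helper is our illustration, not part of the CytoBulk API):

```python
def unpack_he_mapping(result, mapping_sc=True, return_adata=False):
    """Normalize he_mapping's three possible return shapes into a dict
    (sketch; mirrors the documented return behavior)."""
    if not mapping_sc:
        # Only the coordinate/cell-type table is returned.
        return {"cell_coordinates": result}
    if return_adata:
        coords, mapping_df, matched_adata = result
        return {"cell_coordinates": coords, "mapping_df": mapping_df,
                "matched_adata": matched_adata}
    coords, mapping_df = result
    return {"cell_coordinates": coords, "mapping_df": mapping_df}
```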
If you encounter the following error while running `ct.tl.he_mapping`:
_pickle.UnpicklingError: invalid load key, '<'.
This error usually means the pretrained model file was not fully downloaded (corrupted/incomplete file). To resolve it, manually download the model file and place it in the package pretrained-model directory.
Download DeepCMorph_Datasets_Combined_41_classes_acc_8159.pth from:
Then place it at:
cytobulk/tools/model/pretrained_models/DeepCMorph_Datasets_Combined_41_classes_acc_8159.pth.
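A quick way to confirm this diagnosis is to inspect the first bytes of the downloaded file: an HTML error page starts with `<`, which is exactly the "invalid load key" the unpickler reports. A stdlib sketch (the helper name is ours):

```python
from pathlib import Path

def looks_like_html_download(path):
    """Heuristic check: a failed download often saves an HTML error page
    instead of the model, and its leading '<' triggers the
    UnpicklingError shown above."""
    head = Path(path).read_bytes()[:64].lstrip()
    return head.startswith(b"<")
```

If this returns `True`, delete the file and re-download it.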
Large model files are not committed by default. If needed, place `DeepCMorph_Datasets_Combined_41_classes_acc_8159.pth` into `cytobulk/tools/model/pretrained_models/`.