Source code and resources for our Findings of ACL 2026 paper: Efficient Training for Cross-lingual Speech Language Models.
This repository contains training and inference code for cross-lingual speech language models (CSLM), together with a Chinese S2S conversation benchmark (BELLE-eval-S2S) used in our experiments.
The current release includes:
- Pre-training and supervised fine-tuning (SFT) scripts
- Inference scripts for general and cross-lingual decoding
- A public Chinese test set for S2S conversation evaluation
Edit the paths in `cslm/train/pretrain.sh`:
- `DATA_ROOT` / `DATA_PATH`
- `MODEL_DIR`
- `CACHE_DIR`
- `OUT_DIR`
- distributed environment variables (`MASTER_ADDR`, `MASTER_PORT`, etc.) if using multi-node training
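Before launching, it can help to verify that these paths actually exist. The snippet below is an optional sketch, not part of the release; it assumes the path variables have been exported into the environment with the same values used in the script (the same check applies to the SFT stage below).

```python
# Optional preflight check (hypothetical helper, not part of the release).
# Assumes DATA_ROOT/DATA_PATH/MODEL_DIR/CACHE_DIR/OUT_DIR are exported as
# environment variables with the same values used in pretrain.sh / sft.sh.
import os
import sys

required = ["DATA_ROOT", "DATA_PATH", "MODEL_DIR", "CACHE_DIR", "OUT_DIR"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    sys.exit(f"Unset variables: {', '.join(missing)}")

for name in ["DATA_ROOT", "DATA_PATH", "MODEL_DIR", "CACHE_DIR"]:
    if not os.path.exists(os.environ[name]):
        sys.exit(f"{name} does not exist: {os.environ[name]}")

os.makedirs(os.environ["OUT_DIR"], exist_ok=True)  # output dir is created if missing
print("Path check passed.")
```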
Then run:

```bash
cd cslm/train
./pretrain.sh
```

Edit the paths in `cslm/train/sft.sh`:
- `DATA_ROOT` / `DATA_PATH`
- `MODEL_DIR` (pretrained checkpoint)
- `CACHE_DIR`
- `OUT_DIR`
Then run:

```bash
cd cslm/train
./sft.sh
```

The SFT training code expects JSON/JSONL data with the following fields:
- `prompt`: user input (string or multi-turn list)
- `response`: target response (string or multi-turn list)
Minimal single-turn example:
{"prompt": "你好,请介绍一下你自己。", "response": "你好,我是一个跨语言语音语言模型助手。"}Minimal multi-turn example:
{
"prompt": ["你好", "请用一句话解释机器学习"],
"response": ["你好!", "机器学习是让模型从数据中学习规律并用于预测或决策的方法。"]
}
```
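For reference, the snippet below sketches how a data file in this format could be written and minimally validated. It is only an illustration based on the field descriptions above; the output file name and the alignment check for multi-turn lists are assumptions, not requirements stated by the training code.

```python
# Hypothetical helper: write SFT examples in the prompt/response JSONL format
# described above and run a few basic consistency checks.
import json

examples = [
    # single-turn: prompt and response are plain strings
    {"prompt": "你好,请介绍一下你自己。",
     "response": "你好,我是一个跨语言语音语言模型助手。"},
    # multi-turn: prompt and response are lists of turns
    {"prompt": ["你好", "请用一句话解释机器学习"],
     "response": ["你好!", "机器学习是让模型从数据中学习规律并用于预测或决策的方法。"]},
]

with open("sft_data.jsonl", "w", encoding="utf-8") as f:  # file name is arbitrary
    for ex in examples:
        p, r = ex["prompt"], ex["response"]
        # both fields should be strings, or lists of the same length (assumed)
        assert type(p) is type(r)
        if isinstance(p, list):
            assert len(p) == len(r)
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```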
Inference scripts take speech unit sequences as input (one unit sequence per line in a text file).

For general decoding, run:

```bash
cd cslm/infer
python decode_general.py \
--lang zh \
--unit /path/to/unit_sequences.txt \
--model-name-or-path /path/to/checkpoint \
--output-dir /path/to/output
```
For cross-lingual decoding, run:

```bash
cd cslm/infer
python decode_general_cross.py \
--lang en \
--unit /path/to/unit_sequences.txt \
--model-name-or-path /path/to/checkpoint \
--output-dir /path/to/output
```

Outputs are appended to:
`/path/to/output/responses.json`
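To inspect the decoded outputs programmatically, a small reader like the one below may help. This is only a sketch: whether `responses.json` is a single JSON array or one appended JSON object per line, and which fields each record contains, is determined by the decode scripts and is assumed here.

```python
# Hypothetical helper for inspecting decode outputs.
# Assumption: responses.json is either one JSON value (list/dict) or
# line-delimited JSON; check the decode scripts for the actual schema.
import json

path = "/path/to/output/responses.json"
with open(path, encoding="utf-8") as f:
    text = f.read().strip()

try:
    records = json.loads(text)
    if isinstance(records, dict):
        records = [records]
except json.JSONDecodeError:
    records = [json.loads(line) for line in text.splitlines() if line.strip()]

print(f"Loaded {len(records)} records")
if records:
    print("first record:", records[0])
```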
BELLE-eval-S2S is our open Chinese speech-to-speech conversation test set used for evaluation.
- Manifest file: `BELLE-eval-S2S/test.tsv`
- Audio files: `BELLE-eval-S2S/wav`
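To iterate over the benchmark programmatically, something like the snippet below may be convenient. The column layout of `test.tsv` is not documented here, so the snippet only assumes a tab-separated manifest and `.wav` audio under `BELLE-eval-S2S/wav`; adapt the field handling to the released file.

```python
# Hypothetical loader for the BELLE-eval-S2S manifest.
# Assumptions: test.tsv is tab-separated and audio files are .wav files
# under BELLE-eval-S2S/wav; the manifest columns are not specified here.
import csv
from pathlib import Path

root = Path("BELLE-eval-S2S")
with open(root / "test.tsv", encoding="utf-8") as f:
    rows = list(csv.reader(f, delimiter="\t"))

wav_files = sorted((root / "wav").glob("*.wav"))
print(f"{len(rows)} manifest rows, {len(wav_files)} wav files")
if rows:
    print("first row:", rows[0])
```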
If you use this benchmark, please cite our paper.
This project is built on top of the following open resources:
- Base LLM: meta-llama/Llama-3.1-8B-Instruct
- Speech tokenizer: FunAudioLLM/CosyVoice-300M
Please follow their original licenses and usage policies.
```bibtex
@misc{zhou2026efficienttrainingcrosslingualspeech,
  title={Efficient Training for Cross-lingual Speech Language Models},
  author={Yan Zhou and Qingkai Fang and Yun Hong and Yang Feng},
  year={2026},
  eprint={2604.11096},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.11096},
}
```

If you have questions, please contact zhouyan23z@ict.ac.cn.
