Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

# Model Overview

<img width="523" height="543" alt="Image" src="https://github.com/user-attachments/assets/b08f7fa4-c493-4fd4-a24f-59d291fee601" />

# Abstract
- Nemotron 3 Ultra is a Mixture-of-Experts hybrid Mamba-Attention language model with 550 billion total and 55 billion active parameters.
- It was pre-trained on 20 trillion text tokens, extended to a context length of 1M tokens, and post-trained using Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD).
- It employs several key technologies: LatentMoE, Multi-Token Prediction (MTP), NVFP4 pre-training, multi-environment RLVR, MOPD, and reasoning budget control.
- It achieves up to ∼6× higher inference throughput than state-of-the-art publicly available LLMs while attaining on-par accuracy.

# Introduction
- While the Mixture-of-Experts design helps Nemotron 3 Ultra achieve better accuracy per active parameter, the hybrid Mamba-Attention architecture significantly improves inference throughput by reducing attention cost and KV cache footprint. Nemotron 3 Ultra achieves 5.9×, 4.8×, and 1.6× higher inference throughput than GLM-5.1-754B-A40B, Kimi-K2.6-1T-A32B, and Qwen-3.5-397B-17B, respectively, in an 8K-input / 64K-output token setting, while attaining on-par accuracy across a wide range of agentic and reasoning benchmarks.
- We pretrained our base model in NVFP4 with **20 trillion** text tokens using a **Warmup-Stable-Decay** learning rate schedule. 
  - Pretraining was divided into two phases with **15 trillion tokens of data in the first phase focusing on diversity and broad domain coverage** followed by **5 trillion tokens** of data in the second phase focusing on **high quality data** to refine model accuracy.
  - LatentMoE helped us achieve better accuracy per parameter than standard Granular MoEs (Dai et al., 2024) while Multi Token Prediction (MTP) leads to faster inference with speculative decoding.

<img width="860" height="652" alt="Image" src="https://github.com/user-attachments/assets/0c9d4101-b68e-4dc8-8ac7-4893bb21e677" />

# Pretraining
## Model Architecture
- native Multi-Token Prediction for inference acceleration with **two heads** during pre-training.
- Both MTP heads share the same parameters to enable robust autoregressive drafting as described in NVIDIA (2026) and consist of **a single attention layer followed by a single MoE layer**

<img width="884" height="231" alt="Image" src="https://github.com/user-attachments/assets/6b1dfade-4df1-4247-988e-ffaecdc7b770" />

## NVFP4 Pretraining
- We leverage Transformer Engine’s open-source cuBLAS NVFP4 GEMM kernels for **fprop, dgrad, and wgrad**. NVFP4 layers use the **E2M1** datatype with **two-dimensional block quantization** on **weights**, **Random Hadamard Transforms on inputs to wgrad**, and **stochastic rounding on gradients** (NVIDIA, 2025f).
- We kept the final 15% of the network (16 layers), Mamba output projections, latent projections, QKV and attention projections, MTP layers, and embedding layers in higher precision, following NVIDIA (2025c, 2026).
- To monitor training health, we branched ablations from checkpoints at 5T, 10T, and 16T tokens, switched all tensors to BF16, and continued pretraining for 74B tokens.
- We tracked the relative train loss difference between the BF16 segments and Nemotron 3 Ultra (NVFP4). As seen in prior work, switching all tensors to BF16 substantially recovers the high-precision loss, providing a proxy for high-precision training (NVIDIA, 2025f). Across the three ablations, the relative train loss gap of Nemotron 3 Ultra against the BF16 segments stayed below 0.4% on average (Figure 3, top), which is lower than the NVFP4 vs. BF16 train loss gaps observed on smaller model variants (NVIDIA, 2025c).
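
To make the E2M1 block format more concrete, here is a minimal sketch of fake-quantizing a single weight block. It is an illustration under stated assumptions, not the Transformer Engine implementation: the function name, the block layout (a list of rows), keeping the block scale as a plain Python float, and breaking ties toward the smaller magnitude are all choices made for this example. Random Hadamard Transforms and stochastic rounding are left out.

```python
import math

# Magnitudes representable by a 4-bit E2M1 value
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quantize_block(block):
    """Round each entry of a 2-D block to scale * (signed E2M1 value).

    One scale is shared by the whole block, chosen so that the largest
    absolute value in the block maps to the largest E2M1 magnitude (6.0).
    Returns the dequantized block and the scale.
    """
    amax = max(abs(v) for row in block for v in row)
    if amax == 0.0:
        return [[0.0 for _ in row] for row in block], 1.0
    scale = amax / E2M1_MAGNITUDES[-1]

    def round_one(v):
        target = abs(v) / scale
        # Nearest grid point; ties go to the smaller magnitude (assumption).
        magnitude = min(E2M1_MAGNITUDES, key=lambda m: abs(target - m))
        return math.copysign(magnitude * scale, v)

    return [[round_one(v) for v in row] for row in block], scale


# Hypothetical 2x2 block, kept tiny so the rounding is easy to follow.
block = [[6.0, 1.2], [-3.1, 0.2]]
dequantized, scale = fake_quantize_block(block)
print(dequantized, scale)
```

With only eight magnitudes available, small entries in a block with a large maximum collapse to zero, which is one reason the notes above keep sensitive layers in higher precision.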

<img width="646" height="655" alt="Image" src="https://github.com/user-attachments/assets/34df24e2-4055-4a7c-8d50-72fa3b747e8c" />


## Pretraining Data
### Code refresh
- We refreshed our raw source code data from GitHub, adding 173B new tokens with a cut-off date of September 30, 2025.

### Nemotron-Pretraining-Multiple-Choice and Nemotron-Pretraining-Generative
- We generated large-scale, task-seeded synthetic Q&A data from the training splits of many public datasets spanning a wide range of domains, including STEM, factual knowledge, commonsense reasoning, logical reasoning, math, code, reading comprehension, and multilingual QA. Held-out test splits were not used for data generation.
- To validate the quality of this data, we conducted a 100B-token phase-3 continued-pretraining ablation on a Nemotron-family base checkpoint. Adding the benchmark-oriented synthetic data improved MMLU-Pro from 64.8 to 66.6, average code from 73.2 to 75.1, commonsense understanding from 72.9 to 74.5, and GPQA from 30.8 to 41.9, while average math remained stable (87.6 to 87.9).

### Nemotron-Pretraining-Fact-Seeking
- Questions are generated in two stages: first extracting informative, factual statements from Finewiki articles, then prompting Qwen3-30B-A3B-Instruct-2507 with each statement and its original context to generate either a short-answer or a multiple-choice question.
- We injected the fact-seeking data during the final 100B tokens of training, improving accuracy on SimpleQA from 40.24 to 50.16. Because we converted SimpleQA questions into multiple-choice format for easier evaluation, these scores are not directly comparable to the original SimpleQA score.

### Nemotron-Pretraining-Moral-Scenarios
- We included multiple-choice questions about moral scenarios. These questions were constructed using situations and norms from Moral Stories (Emelin et al., 2021) and actions from **Social Chemistry** (Forbes et al., 2020). In this work, we sampled a subset of these examples and created a chain-of-thought version using Qwen3-235B-A22B-Thinking-2507.

### Nemotron-Pretraining-Legal
- Datasets extracted from HTML files
- LLM-cleaned datasets
- Reformatted datasets
- Synthetic datasets


### Data Mixture and Ordering
- The data mixtures used to train Nemotron 3 Ultra are an adaptation of the data mixtures used to train Nemotron 3 Super and Nano (NVIDIA, 2025b, 2026), and incorporate new and refreshed datasets.
- Following Feng et al. (2024), we design our data mixtures to balance diversity and quality. We adopt the proposed two-phase curriculum and transition from a data mixture that favors dataset diversity (phase 1) to one that favors dataset quality (phase 2). This transition occurs after ∼15 trillion tokens, corresponding to ∼75% of pretraining.

<img width="657" height="236" alt="Image" src="https://github.com/user-attachments/assets/4a179f93-2657-45f7-b22b-05e7f8b7123a" />

## Hyperparameters
- For Nemotron 3 Ultra we use a Warmup-Stable-Decay (WSD) learning rate schedule over a total horizon of 20 trillion tokens.
- We warm up the learning rate for 200 billion tokens to a peak value of $2.5 \times 10^{-4}$.
- For the final 5 trillion tokens, we then decay the learning rate according to a minus-sqrt decay schedule to a minimum of $2.5 \times 10^{-6}$.
- As in Nemotron 3 Super, we used offline checkpoint merging for evaluation analysis (Tian et al., 2025) throughout pretraining with a sliding merge window size of 500B tokens at a checkpointing interval of 25B tokens weighted to emulate our learning rate decay schedule.
- We use an MTP loss scaling factor of 0.1. All other hyperparameters remain the same as for Nemotron 3 Super
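
The schedule described above can be sketched as a small function of tokens seen. The token counts and learning rates come from the notes; the linear warmup shape and the exact decay formula, `min + (peak - min) * (1 - sqrt(progress))`, are assumptions about what "minus-sqrt" means here, and the function name is hypothetical.

```python
import math

PEAK_LR = 2.5e-4
MIN_LR = 2.5e-6
WARMUP_TOKENS = 200e9   # 200 billion tokens of warmup
TOTAL_TOKENS = 20e12    # 20 trillion token horizon
DECAY_TOKENS = 5e12     # final 5 trillion tokens are the decay phase


def wsd_lr(tokens_seen):
    """Warmup-Stable-Decay learning rate as a function of tokens seen."""
    decay_start = TOTAL_TOKENS - DECAY_TOKENS
    if tokens_seen < WARMUP_TOKENS:
        # Linear warmup from zero (assumed shape).
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    if tokens_seen <= decay_start:
        # Stable phase: hold the peak learning rate.
        return PEAK_LR
    # Decay phase: assumed "1 - sqrt(progress)" interpolation to MIN_LR.
    progress = min((tokens_seen - decay_start) / DECAY_TOKENS, 1.0)
    return MIN_LR + (PEAK_LR - MIN_LR) * (1.0 - math.sqrt(progress))


for tokens in (100e9, 200e9, 15e12, 17.5e12, 20e12):
    print(f"{tokens / 1e12:6.2f}T tokens -> lr = {wsd_lr(tokens):.3e}")
```

Under this assumed formula the learning rate falls quickly at the start of the decay phase and flattens toward the end, so half of the decay tokens already remove roughly 70% of the gap between peak and minimum.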


## Long-Context Extension
- Similar to Nemotron 3 Super and Nano, we added a long-context phase (LC-Phase) at the end of pretraining.
- We used a constant learning rate of $2.5 \times 10^{-6}$ and trained on GB200 GPUs with the following parallelism configuration:
  - Context Parallelism: 32-way
  - Tensor Parallelism: 8-way
  - Expert Parallelism: 128-way
  - Pipeline Parallelism: 2-way
- We performed continued pretraining (CPT) at a context length of 1,048,576 (1M) tokens for 92% of the iterations and at 4,096 (4K) tokens for the remaining 8%, in order to maintain accuracy on short-context benchmarks. Each iteration used either 1M or 4K sequences; we did not mix sequence lengths within an iteration. Every iteration trained on a constant 25,165,824 tokens.
- We only put math and code SFT-style data into the 4K iterations, since we found this worked best for maintaining short-context benchmark metrics while achieving strong long-context RULER scores. In total, the LC-Phase was trained for 33B tokens.
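
The iteration sizes above work out to whole numbers of sequences at both context lengths. The short calculation below only rearranges figures quoted in the notes; the 33B total is treated as approximate, so the iteration counts are estimates rather than reported values.

```python
TOKENS_PER_ITERATION = 25_165_824
LONG_SEQ_LEN = 1_048_576   # 1M-token sequences
SHORT_SEQ_LEN = 4_096      # 4K-token sequences
LC_PHASE_TOKENS = 33e9     # "33B tokens", treated as approximate
LONG_FRACTION = 0.92       # share of iterations run at 1M context

# Sequences per iteration at each context length.
long_seqs_per_iter = TOKENS_PER_ITERATION // LONG_SEQ_LEN
short_seqs_per_iter = TOKENS_PER_ITERATION // SHORT_SEQ_LEN

# Rough number of iterations in the long-context phase.
approx_iterations = int(LC_PHASE_TOKENS // TOKENS_PER_ITERATION)
approx_long_iters = round(LONG_FRACTION * approx_iterations)
approx_short_iters = approx_iterations - approx_long_iters

print(long_seqs_per_iter, short_seqs_per_iter)
print(approx_iterations, approx_long_iters, approx_short_iters)
```

So each iteration holds either 24 sequences of 1M tokens or 6,144 sequences of 4K tokens, and the whole phase is on the order of 1,300 iterations.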

## Base Model Evaluations
- All evaluation results reported for Nemotron 3 Ultra 550B-A55B Base were collected via the Nemo Evaluator SDK and NVIDIA’s open-source container of LM Evaluation Harness.

<img width="857" height="747" alt="Image" src="https://github.com/user-attachments/assets/1bec8d12-b735-45cb-98b0-ba510afc50cb" />

## Model Stability

<img width="655" height="574" alt="Image" src="https://github.com/user-attachments/assets/cd09889f-6993-473b-b732-e78cdc1226f8" />

<img width="656" height="440" alt="Image" src="https://github.com/user-attachments/assets/8d299042-f7fe-4045-a769-a82c95428dc7" />

- During pretraining, we observed two instances of training divergence characterized by simultaneous increases in both the training cross-entropy loss and wgrad L2 norm. These are shown in Figure 5.
- Divergence 1: Local Gradient Accumulation Precision for Output Layer
  - The first divergence, which occurred at around 8T tokens, was attributed to a reduction in local gradient accumulation precision for the output layer from FP32 to BF16 (in a bid to move data-parallel gradient reductions to BF16 over the wire as a throughput optimization).
  - Nemotron 3 Ultra uses 2 MTP blocks with an MTP loss scaling factor of 0.1 (0.05 for each MTP block); as a result, the MTP blocks’ wgrad contribution to the shared output layer is essentially lost when using BF16, which has only 7 mantissa bits. Figure 6 shows that the MTP-2 loss started spiking and diverging before the training (and validation) loss. Rolling back to an earlier checkpoint and returning to the full FP32 gradient reduction recipe re-stabilized training (as shown in Figure 5).
- Divergence 2: Undetermined
  - For the second training divergence, which occurred around 16T tokens, we found through ablations that starting learning rate annealing (both a 5T and a 10T decay) immediately after rolling back to the 15T-token checkpoint mitigates the divergence (Figure 7). We eventually made the practical decision to cut the total pretraining token horizon down to 20T tokens.
- While investigating, we found two interesting phenomena:
  - 1. Imbalanced and Dead Experts: As one possible proxy for pretraining health, the distribution of tokens across the available experts within the Mixture-of-Experts (MoE) layers can be continuously monitored. When a model begins to diverge or experience optimization difficulties, the routing mechanism often degrades, leading to severe token skew. 
    - For Ultra, routing started balanced, with the median (across layers) MaxVio being 1.2 and the maximum being 4.8 (first MoE layer). As training progressed, expert routing became increasingly unbalanced; the median layer’s MaxVio stayed around 1.2 but the maximum kept increasing to ≈12 by 12T tokens (again in the first layer). Although not causal by itself, MaxVio seems correlated with training instability.
  - 2. Imbalanced Residual Stream Activation Norms: For those models, residual norms in the early layers would increase, and then decrease and stabilize. Later layers had their residual norms slowly increase during training. For Ultra, residual norms initially followed this pattern, but norms in the early layers started rising around 7.5T pretraining tokens, with large residual norm spikes happening around 11T tokens, indicating poor signal propagation.
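
The precision argument behind Divergence 1 can be reproduced with a toy accumulator. BF16 has 7 explicit mantissa bits, so representable values near 1.0 are spaced 2^-7 ≈ 0.0078 apart, and an addend smaller than half that spacing is rounded away on every step. The sketch below emulates BF16 rounding in pure Python; the gradient magnitudes are hypothetical and chosen only to show the effect, and the "high precision" reference uses Python floats rather than FP32.

```python
import struct


def round_to_bf16(x):
    """Round a finite float to the nearest BF16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add half of the discarded range, biased by the lowest kept bit.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(start, increment, steps, bf16):
    """Repeatedly add `increment`, optionally rounding to BF16 each step."""
    acc = start
    for _ in range(steps):
        acc = acc + increment
        if bf16:
            acc = round_to_bf16(acc)
    return acc


MAIN_GRAD = 1.0            # hypothetical main-loss gradient already accumulated
MTP_CONTRIBUTION = 0.003   # hypothetical small, loss-scaled MTP gradient term

high_precision = accumulate(MAIN_GRAD, MTP_CONTRIBUTION, 100, bf16=False)
low_precision = accumulate(MAIN_GRAD, MTP_CONTRIBUTION, 100, bf16=True)
print(high_precision, low_precision)
```

In the BF16 run the accumulator never moves from 1.0, while the higher-precision run ends near 1.3: the small contributions are individually below the rounding threshold, so their sum is lost entirely.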

<img width="862" height="443" alt="Image" src="https://github.com/user-attachments/assets/952e87be-cb5e-4253-ae6b-6ea8c41f9ced" />
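
The notes on expert balance quote MaxVio values without defining the metric. A minimal sketch follows, assuming the definition commonly used in the load-balancing literature: the maximum expert load minus the mean load, divided by the mean load. The function names and the token counts are hypothetical.

```python
def max_violation(tokens_per_expert):
    """MaxVio = (max load - mean load) / mean load (assumed definition).

    0.0 means perfectly balanced routing; larger values mean that the
    busiest expert receives far more tokens than the average expert.
    """
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    return (max(tokens_per_expert) - mean_load) / mean_load


def dead_experts(tokens_per_expert):
    """Indices of experts that received no tokens in this window."""
    return [i for i, n in enumerate(tokens_per_expert) if n == 0]


# Hypothetical per-expert token counts for one MoE layer.
balanced = [100, 100, 100, 100]
skewed = [220, 100, 80, 0]

print(max_violation(balanced), dead_experts(balanced))
print(max_violation(skewed), dead_experts(skewed))
```

Tracking this value per layer over training, rather than a single global average, is what lets a problem confined to the first MoE layer show up, as in the observations above.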

# Post-Training

<img width="778" height="441" alt="Image" src="https://github.com/user-attachments/assets/3ebcc33f-5611-4128-bc77-8aee52f7a48f" />


## Supervised Fine Tuning
- In Stage 1, we train on packed sequences of length 294,912 tokens with global batch size 64 for 204,800 samples, using a cosine learning-rate schedule with peak learning rate $1.5 \times 10^{-5}$, minimum learning rate $1 \times 10^{-6}$, and 9,600 warmup samples.
- In Stage 2, we extend the packed sequence length to 515,000 tokens, augmenting the mixture with additional long-context data up to 512K tokens. We train with global batch size 64 for 19,200 samples, using the same learning-rate schedule with peak $1 \times 10^{-5}$, minimum $2 \times 10^{-6}$, and 6,400 warmup samples.
- We retain the shared-weight MTP objective during SFT, using two MTP layers with a per-token auxiliary-loss scaling factor of 0.1.
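
As a rough illustration of the two SFT stages, the sketch below converts the sample counts into optimizer steps and evaluates a cosine schedule. The sample counts, batch size, and learning rates are taken from the notes; the linear warmup from zero, the assumption that the cosine reaches the minimum exactly at the last sample, and all names are choices made for this example.

```python
import math

GLOBAL_BATCH = 64

STAGE1 = dict(total_samples=204_800, warmup_samples=9_600,
              peak_lr=1.5e-5, min_lr=1e-6)
STAGE2 = dict(total_samples=19_200, warmup_samples=6_400,
              peak_lr=1e-5, min_lr=2e-6)


def cosine_lr(samples_seen, total_samples, warmup_samples, peak_lr, min_lr):
    """Cosine schedule indexed by samples seen, with linear warmup (assumed)."""
    if samples_seen < warmup_samples:
        return peak_lr * samples_seen / warmup_samples
    progress = (samples_seen - warmup_samples) / (total_samples - warmup_samples)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def steps(samples):
    """Number of optimizer steps needed to consume `samples` samples."""
    return samples // GLOBAL_BATCH


stage1_steps = steps(STAGE1["total_samples"])
stage1_warmup_steps = steps(STAGE1["warmup_samples"])
stage2_steps = steps(STAGE2["total_samples"])
stage2_warmup_steps = steps(STAGE2["warmup_samples"])

print(stage1_steps, stage1_warmup_steps, stage2_steps, stage2_warmup_steps)
print(cosine_lr(STAGE1["total_samples"] // 2, **STAGE1))
```

Stage 1 therefore runs for 3,200 steps with 150 warmup steps, while Stage 2 is much shorter at 300 steps, a third of which are warmup.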

### Data
#### Efficiency and Control
- This data has two components. The first is training samples generated by GPT-OSS-120B in its medium-effort mode on prompts covering math reasoning, STEM question answering, and instruction following. These SFT data initiate Ultra’s medium-effort mode, which is later optimized during the RLVR stage. The second component is training samples where the reasoning traces are truncated to random reasoning budgets while the responses remain the same. This is similar to Nemotron 3 Nano and Super with one design change: the `</think>` tokens in the truncated samples are masked from the SFT training loss.
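
The budget-truncation recipe can be sketched as follows. This is a toy version over string tokens: only the truncation of the reasoning trace and the masking of the closing think token come from the notes. Masking the prompt, training on the kept reasoning tokens, sampling the budget uniformly, and every name in the code are assumptions for illustration.

```python
import random

THINK_END = "</think>"


def truncate_reasoning(prompt, reasoning, response, budget):
    """Build one truncated SFT sample and its loss mask.

    The reasoning trace is cut to `budget` tokens, the closing think token
    is appended, and the original response is kept unchanged. A mask value
    of 1 means the token contributes to the SFT loss.
    """
    kept = reasoning[:budget]
    tokens = prompt + kept + [THINK_END] + response
    loss_mask = (
        [0] * len(prompt)        # prompt tokens are not trained on (assumed)
        + [1] * len(kept)        # truncated reasoning is trained on (assumed)
        + [0]                    # closing think token is masked from the loss
        + [1] * len(response)    # response is trained on
    )
    return tokens, loss_mask


def sample_budget(reasoning, rng):
    """Pick a random reasoning budget (uniform sampling is an assumption)."""
    return rng.randint(0, len(reasoning))


prompt = ["Q:", "2+2?"]
reasoning = ["two", "plus", "two", "is", "four"]
response = ["4"]

rng = random.Random(0)
tokens, mask = truncate_reasoning(prompt, reasoning, response,
                                  sample_budget(reasoning, rng))
print(tokens)
print(mask)
```

One plausible reading of the design change is that masking the closing token avoids teaching the model to stop thinking at arbitrary points on its own, while still letting it learn to answer well when an external budget forces the cut.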

#### Search Capabilities
- Finally, we work with data vendors to curate particularly challenging samples that require 50–100 searches, and collect SFT trajectories in our **BrowseComp harness**, described in Appendix A. For these trajectories, we use **MiniMax 2.5 and GLM 5.1 as teacher models**.

#### Terminal-Use Capabilities
- Seed instructions were sourced from a combination of publicly available datasets: OpenCodeReasoning (NVIDIA, 2025d), OpenMathReasoning (NVIDIA, 2025e), SWE-bench (Jimenez et al., 2024), SWE-Fixer-Train-110K (InternLM, 2025), SWE-rebench (Badertdinov et al., 2025), and SWE-smith (Pan et al., 2025b).
- For trajectory generation, DeepSeek-V3.2 was used as the acting agent within the Terminus-2 agent provided by the Harbor framework (Harbor, 2025)
- The final dataset comprises approximately 370K multi-turn conversations, consisting of a mixture of reasoning and non-reasoning trajectories.

### Data Packing
- we adopt a length-aware best-fit packing strategy (Ding et al., 2024), which packs multiple conversations into sequences up to a maximum context length.
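
A minimal sketch of length-aware best-fit packing is shown below, written as classic best-fit decreasing bin packing: conversations are visited from longest to shortest and each goes into the open sequence with the least remaining room that still fits it. This is a simplified stand-in and may differ in detail from the method of Ding et al. (2024); the function name and example lengths are hypothetical.

```python
def best_fit_pack(lengths, max_len):
    """Pack items into bins of capacity `max_len` with best-fit decreasing.

    Returns a list of bins, each a list of indices into `lengths`.
    """
    bins = []  # each bin: {"free": remaining capacity, "items": [indices]}
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    for idx in order:
        length = lengths[idx]
        if length > max_len:
            raise ValueError(f"item {idx} is longer than max_len")
        # Best fit: the bin with the smallest remaining space that still fits.
        candidates = [b for b in bins if b["free"] >= length]
        if candidates:
            target = min(candidates, key=lambda b: b["free"])
        else:
            target = {"free": max_len, "items": []}
            bins.append(target)
        target["items"].append(idx)
        target["free"] -= length
    return [b["items"] for b in bins]


# Hypothetical conversation lengths (in tokens) and a toy context length.
lengths = [6, 5, 4, 3, 2]
packed = best_fit_pack(lengths, max_len=10)
print(packed)
```

In this toy case the five conversations fill two sequences completely, with no padding. Conversations are never split across sequences in this sketch.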

## Reinforcement Learning
- a unified RLVR (Reinforcement Learning with Verifiable Reward) training stage spanning all available environments, targeting terminal usage, office and productivity workflows, software engineering, search, general tool-calling, math, code, STEM, safety, chat, instruction following, long-context QA, inductive and transductive reasoning, structured outputs, and general model usability. For harness-based environments, we construct training data using a diverse collection of harness implementations and interaction formats, improving robustness to variations in execution settings and reducing overfitting to any particular harness design
- For data mixture and curriculum construction, we adopt the Gaussian-based approach introduced in NVIDIA (2025b). Our training procedure largely follows the **asynchronous GRPO algorithm** with the stability optimizations proposed in NVIDIA (2026)
- To support training across a large and diverse set of environments, we use a global batch size of 8192, with each sample generating 16 rollouts. Training begins with a maximum generation length of 48K tokens, which is later increased to 64K tokens.
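
For readers unfamiliar with GRPO, the core of the algorithm is a group-relative advantage: each prompt's rollouts are scored, and every rollout's reward is normalized against the mean and standard deviation of its own group. The sketch below shows only that standard step. It does not include the asynchronous machinery or the stability optimizations referenced above, and the reward values are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts.

    advantage_i = (reward_i - mean(rewards)) / (std(rewards) + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical verifiable rewards for 16 rollouts of a single prompt
# (1.0 = passed the verifier, 0.0 = failed).
ROLLOUTS_PER_PROMPT = 16
rewards = [1.0, 0.0, 0.0, 1.0] * (ROLLOUTS_PER_PROMPT // 4)
advantages = group_relative_advantages(rewards)
print(advantages[:4])
```

If every rollout in a group receives the same reward, all advantages are zero and the prompt contributes no gradient, which is one reason the difficulty of the data mixture matters for this kind of training.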

## MOPD
- Mixed-environment RLVR provides broad capability improvements across a wide range of domains. However, as the number of environments continues to grow, each domain contributes only a relatively small number of samples to any given training batch, diluting the per-domain learning signal and making it increasingly difficult to balance training across domains. To fully unlock performance and push the frontier in each capability area, we train more than ten specialized teacher models, each optimized through its own domain-specific training pipeline.

<img width="878" height="584" alt="Image" src="https://github.com/user-attachments/assets/700a6170-97b3-4b9f-8e58-78427644c22e" />

- In our implementation, MOPD is executed asynchronously. Rollout workers, teacher-scoring workers, and learner workers run in a pipeline
- PPO-style clipping is applied to $r_t(\theta)$ around the proximal policy $\pi_{\text{prox}}$. The learner maximizes the clipped asynchronous MOPD surrogate.
- MOPD training uses a maximum generation length of 192K tokens, matching the longest generation length used across all teacher training runs. Each training batch contains 1,024 prompts, with one rollout per prompt. In our ablation studies, using multiple rollouts did not yield additional benefits.
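
The notes do not reproduce the surrogate itself, so the following is a generic PPO-style clipped objective rather than the report's exact formula. It assumes the ratio is the current policy's token probability divided by the proximal policy's, takes per-token advantages as given inputs (how MOPD derives them from teacher scores is not specified here), and uses a hypothetical clip range of 0.2.

```python
import math


def clipped_surrogate(logp_new, logp_prox, advantages, clip_eps=0.2):
    """Mean PPO-style clipped objective over tokens (to be maximized).

    ratio_t = exp(logp_new_t - logp_prox_t)
    term_t  = min(ratio_t * A_t, clip(ratio_t, 1 - eps, 1 + eps) * A_t)
    """
    total = 0.0
    for lp_new, lp_prox, adv in zip(logp_new, logp_prox, advantages):
        ratio = math.exp(lp_new - lp_prox)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Hypothetical per-token log-probabilities and advantages.
logp_prox = [0.0, 0.0]
logp_new = [math.log(1.5), math.log(0.5)]
advantages = [1.0, -1.0]
print(clipped_surrogate(logp_new, logp_prox, advantages))
```

The clipping removes the incentive to push the ratio beyond the trust region: in the example, the first token's gain is capped at a ratio of 1.2 and the second token's penalty is evaluated at a ratio of 0.8.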

<img width="872" height="436" alt="Image" src="https://github.com/user-attachments/assets/56bf99ed-96cf-435c-a358-675e56800c52" />

### Specialized Teachers
- Software Engineering Teacher. The SWE teacher was trained through a three-stage pipeline. We first applied SFT to the Ultra base model on a blend of agentic data. Next, we ran PivotRL (Yi et al., 2026) on single-step agentic environments


### MOPD Warmup
- One key finding from our MOPD trials is that teacher models trained with substantially different training pipelines cannot be effectively combined through a straightforward MOPD merge; doing so results in suboptimal performance.
- We hypothesize that when the teacher and student are trained on different SFT data, they acquire different reasoning behaviors and induce different output distributions. **This distribution mismatch can cause student-generated trajectories to be out-of-distribution for the teacher.**
- To mitigate the distribution mismatch between teacher and student models, we introduce a brief warmup stage before MOPD. Specifically, the student undergoes a very light SFT on data drawn from the teacher’s training distribution.

<img width="662" height="163" alt="Image" src="https://github.com/user-attachments/assets/4374ae84-37bf-4a28-af38-2471a84b4567" />

- Recovery rate is defined as (MOPD2 − RLVR) / (Teacher − RLVR)
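
The recovery rate above is simple to compute from three benchmark scores. In the sketch below the function name and the example scores are hypothetical; only the formula comes from the notes.

```python
def recovery_rate(mopd_score, rlvr_score, teacher_score):
    """Fraction of the teacher-over-RLVR gain that the MOPD student recovers.

    1.0 means the student matches the teacher; 0.0 means no gain over the
    RLVR starting point; values above 1.0 mean the student beats the teacher.
    """
    gap = teacher_score - rlvr_score
    if gap == 0:
        raise ValueError("teacher and RLVR scores are equal; rate is undefined")
    return (mopd_score - rlvr_score) / gap


# Hypothetical benchmark scores, for illustration only.
print(recovery_rate(mopd_score=80.0, rlvr_score=70.0, teacher_score=90.0))
```
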
<img width="859" height="461" alt="Image" src="https://github.com/user-attachments/assets/095ffc6f-876d-4bfa-9abd-48e2dca65f7f" />


### Limitations and Open Problems
#### MOPD on long-horizon tasks
- Agentic workflows require many turns of tool calls and environment interactions, whereas reasoning tasks are typically single-turn. When mixing end-to-end agentic environments with reasoning environments in MOPD, we observed substantial training inefficiency because rollout times can differ dramatically. Balancing efficiency and accuracy requires sophisticated training infrastructure together with careful asynchronous algorithm design. In practice, we use single-turn rollouts, similar to PivotRL (Yi et al., 2026), for most of the agentic tasks. This approach performs relatively well, but whether end-to-end rollouts can yield further gains, and how to make them robust to potential distribution mismatch (Wang et al., 2026a), remain open areas for exploration.

## Infrastructure
### Accelerating Rollout Generation with Multi-Token Prediction
- During RL and MOPD, we train using a one-step off-policy asynchronous RL setup, so rollout generation is overlapped with the policy update, and the step time is bounded by whichever stage is slower.
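
Because generation and the policy update overlap, the step time is the maximum of the two stage times, not their sum. The toy model below makes the consequence explicit: speeding up rollouts (for example through MTP-based speculative decoding) only helps until generation stops being the slower stage. The timings and speedup factors are hypothetical.

```python
def async_step_time(rollout_seconds, train_seconds):
    """Step time when rollout generation overlaps with the policy update."""
    return max(rollout_seconds, train_seconds)


def step_time_with_speedup(rollout_seconds, train_seconds, rollout_speedup):
    """Step time after accelerating rollout generation by `rollout_speedup`."""
    return async_step_time(rollout_seconds / rollout_speedup, train_seconds)


# Hypothetical stage times in seconds.
ROLLOUT_SECONDS = 300.0
TRAIN_SECONDS = 120.0

for speedup in (1.0, 2.0, 3.0):
    step = step_time_with_speedup(ROLLOUT_SECONDS, TRAIN_SECONDS, speedup)
    print(f"rollout speedup {speedup:.0f}x -> step time {step:.0f}s")
```

With these numbers a 2x rollout speedup halves the step time, but a 3x speedup yields no further gain beyond 120 seconds because training has become the bottleneck.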

<img width="857" height="435" alt="Image" src="https://github.com/user-attachments/assets/c05f0693-07ae-4800-a1e5-59583fe66eda" />


### Scaling RL Infrastructure

<img width="885" height="267" alt="Image" src="https://github.com/user-attachments/assets/dbb9bda4-3962-4c4b-8308-56e364f58a70" />


<img width="712" height="224" alt="Image" src="https://github.com/user-attachments/assets/ca59013f-5aed-48da-9210-78aa96c6e707" />

- Additionally, the RL job creates a massive number of Ray actors simultaneously across policy, generation, and environment workers. At 3K+ GPU scale, Ray’s single-threaded Global Control Service (GCS) was overwhelmed by actor registrations, causing long startups (25-49 minutes)

<img width="885" height="267" alt="Image" src="https://github.com/user-attachments/assets/ef1ce1bb-d87c-4d8e-8a37-d0c50b145e82" />

## Post-Trained Model Evaluations

<img width="651" height="626" alt="Image" src="https://github.com/user-attachments/assets/75a5e188-af36-4aae-a5a5-c38187dac83e" />


# Quantization
- We apply post-training quantization (PTQ) using Model-Optimizer to quantize the Nemotron 3 Ultra checkpoint to NVFP4 (NVIDIA, 2025) for efficient inference on NVIDIA Blackwell GPUs.
- The quantization format per operator (GEMM, KVCache, Mamba Cache) is summarized in Table 12

<img width="639" height="273" alt="Image" src="https://github.com/user-attachments/assets/5e97722f-ade8-4332-9fe3-0df183caad17" />

## One NVFP4 checkpoint
- We release a single NVFP4 checkpoint for Nemotron 3 Ultra that targets both Blackwell, where it runs with native FP4 math, and Hopper, where it runs as W4A16 (weights NVFP4, activations BF16).

# Conclusion
We present our most capable model yet – Nemotron 3 Ultra with 550 billion total and 55 billion active parameters. Nemotron 3 Ultra uses a MoE Hybrid Mamba-Attention architecture along with LatentMoE and MTP for optimal inference and accuracy. Nemotron 3 Ultra was pre-trained on 20 trillion text tokens and then post-trained using SFT, RL, and MOPD. We show that our model attains 5x higher inference throughput than other state-of-the-art open LLMs while achieving on-par accuracy. We open-source the pre-trained, post-trained, and quantized checkpoints along with the training data for Nemotron 3 Ultra on HuggingFace.
