Integrate AutoRound into Diffusers#13552
Conversation
Signed-off-by: Xin He <xin3.he@intel.com>
I would like to integrate AutoRound into Diffusers to support diffusion models. Although the performance improvement is not significant, memory usage is notably reduced.

Thanks for this PR! Could you provide some example code using AutoRound and also some example outputs? Feel free to also report latency and memory consumption so that there's some signal into its effectiveness. Cc: @SunMarc
# AutoRound

[AutoRound](https://github.com/intel/auto-round) is a weight-only quantization algorithm that uses **S**ign **G**radient **D**escent to jointly optimize rounding values and min-max ranges for weights. It targets the W4A16 configuration (4-bit weights, 16-bit activations), reducing model memory footprint while preserving inference accuracy.
Suggested change:
- [AutoRound](https://github.com/intel/auto-round) is a weight-only quantization algorithm that uses **S**ign **G**radient **D**escent to jointly optimize rounding values and min-max ranges for weights. It targets the W4A16 configuration (4-bit weights, 16-bit activations), reducing model memory footprint while preserving inference accuracy.
+ [AutoRound](https://huggingface.co/papers/2309.05516) is a weight-only quantization algorithm. It uses sign gradient descent to jointly optimize weight rounding and min-max ranges. AutoRound targets W4A16 (4-bit weights, 16-bit activations), reducing memory usage without sacrificing inference accuracy.
## Quickstart

Load a pre-quantized AutoRound model by passing [`AutoRoundConfig`] to [`~ModelMixin.from_pretrained`]. This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.
Suggested change:
- Load a pre-quantized AutoRound model by passing [`AutoRoundConfig`] to [`~ModelMixin.from_pretrained`]. This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.
+ Load a pre-quantized AutoRound model by passing [`AutoRoundConfig`] to [`~ModelMixin.from_pretrained`]. The method works with any model that loads via [Accelerate](https://hf.co/docs/accelerate/index) and has `torch.nn.Linear` layers.
import torch
from diffusers import AutoModel, FluxPipeline, AutoRoundConfig

model_id = "your-org/flux-autoround-w4g128"
Suggested change:
- model_id = "your-org/flux-autoround-w4g128"
+ model_id = "INCModel/Z-Image-W4A16-AutoRound"
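For the reviewer's example-code request above, a fuller end-to-end sketch might look like the following. It is illustrative only: the repo id is the placeholder from the docs diff, `black-forest-labs/FLUX.1-dev` is assumed as the base pipeline, and the `AutoRoundConfig` arguments and loading flow follow the existing diffusers quantization pattern (bitsandbytes/torchao) rather than a confirmed final API.

```python
import torch
from diffusers import AutoModel, FluxPipeline, AutoRoundConfig

# Placeholder repo id (from the docs diff); substitute a real
# pre-quantized AutoRound checkpoint.
model_id = "your-org/flux-autoround-w4g128"

# Passing AutoRoundConfig routes loading through the AutoRound quantizer;
# bits/group_size are read from the checkpoint's quantization config.
transformer = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=AutoRoundConfig(),
    torch_dtype=torch.bfloat16,
)

# Drop the quantized transformer into an otherwise standard pipeline.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("a cat holding a sign that says hello", num_inference_steps=28).images[0]
image.save("autoround_flux.png")
```

Reporting `torch.cuda.max_memory_allocated()` before and after generation would give the memory signal the reviewer asked for.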
import torch
from diffusers import AutoModel, AutoRoundConfig

model_id = "your-org/flux-autoround-w4g128"
Suggested change:
- model_id = "your-org/flux-autoround-w4g128"
+ model_id = "INCModel/Z-Image-W4A16-AutoRound"
| Configuration | `bits` | `group_size` | `sym` | Description |
|--------------|--------|-------------|-------|-------------|
| W4G128 asymmetric | `4` | `128` | `False` | Default. Good balance of accuracy and compression. |
| W4G128 symmetric | `4` | `128` | `True` | Slightly faster dequantization, marginal accuracy loss. |
Suggested change:
- | W4G128 symmetric | `4` | `128` | `True` | Slightly faster dequantization, marginal accuracy loss. |
+ | W4G128 symmetric | `4` | `128` | `True` | Faster dequantization, small accuracy trade-off. |
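To make the `sym` column concrete, here is a hypothetical round-to-nearest sketch of the storage format these knobs control. This is not the AutoRound algorithm (which *learns* the rounding and ranges via sign gradient descent); it only shows why symmetric mode is cheaper (a scale with a fixed mid-range zero-point) while asymmetric mode adapts a zero-point per group:

```python
def quantize_group(weights, bits=4, sym=False):
    """Naive round-to-nearest quantization of one weight group.

    Illustration only -- AutoRound optimizes the rounding decisions
    instead of rounding to nearest.
    """
    qmax = (1 << bits) - 1  # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    if sym:
        # Symmetric: range centered on zero, zero-point fixed at mid-range.
        amax = max(abs(lo), abs(hi))
        scale = (2 * amax) / qmax if amax else 1.0
        zero = qmax // 2
    else:
        # Asymmetric: scale *and* zero-point chosen per group.
        scale = (hi - lo) / qmax if hi != lo else 1.0
        zero = round(-lo / scale)
    q = [max(0, min(qmax, round(w / scale) + zero)) for w in weights]
    return q, scale, zero

def dequantize_group(q, scale, zero):
    return [(qi - zero) * scale for qi in q]

group = [0.1, -0.45, 0.3, 0.02]
q, scale, zero = quantize_group(group, bits=4, sym=False)
recon = dequantize_group(q, scale, zero)
# Reconstruction error is bounded by one quantization step.
assert max(abs(a - b) for a, b in zip(group, recon)) <= scale
```

With `group_size=128`, one `(scale, zero)` pair is shared by 128 consecutive weights, which is where the accuracy/overhead trade-off in the table comes from.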
- `modules_to_not_convert` (`list[str]` or `None`, defaults to `None`): List of module name patterns to exclude from quantization. Use this to keep sensitive layers in full precision.
- `backend` (`str`, defaults to `"auto"`): The inference backend kernel. See the [Inference backends](#inference-backends) table above.
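A small name-matching helper illustrates how `modules_to_not_convert` is typically honored (hypothetical helper; the actual matching logic inside diffusers/auto-round may differ, e.g. substring vs. glob semantics):

```python
from fnmatch import fnmatch

def should_quantize(module_name, modules_to_not_convert=None):
    """Return False when a module name hits an exclusion pattern.

    Hypothetical sketch: checks both plain substring containment and
    glob-style patterns against the dotted module path.
    """
    for pattern in modules_to_not_convert or []:
        if pattern in module_name or fnmatch(module_name, pattern):
            return False  # keep this layer in full precision
    return True

excluded = ["proj_out", "*.norm*"]
assert should_quantize("transformer_blocks.0.attn.to_q", excluded)
assert not should_quantize("proj_out", excluded)
assert not should_quantize("transformer_blocks.0.norm1", excluded)
```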
## Supported quantization configurations
Suggested change:
- ## Supported quantization configurations
+ ## Quantization configurations
| W4G128 symmetric | `4` | `128` | `True` | Slightly faster dequantization, marginal accuracy loss. |
| W4G32 asymmetric | `4` | `32` | `False` | Finer granularity, better accuracy, slightly more metadata overhead. |
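The "metadata overhead" in the W4G32 row can be made quantitative with back-of-envelope arithmetic. The sketch below assumes two 4-bit values packed per byte plus an fp16 scale and fp16 zero-point per group (4 bytes of metadata per group); exact packing differs by backend:

```python
def w4_bytes(out_features, in_features, group_size):
    """Approximate storage for a Linear weight under 4-bit group quantization."""
    n = out_features * in_features
    groups = out_features * (in_features // group_size)  # grouped along input dim
    packed = n // 2        # two 4-bit values per byte
    metadata = groups * 4  # assumed fp16 scale + fp16 zero-point per group
    return packed + metadata

fp16 = 4096 * 4096 * 2  # unquantized fp16 baseline for a 4096x4096 Linear
ratio_g128 = fp16 / w4_bytes(4096, 4096, 128)  # ~3.76x compression
ratio_g32 = fp16 / w4_bytes(4096, 4096, 32)    # ~3.2x: finer groups cost metadata
```

Quartering `group_size` quadruples the per-group metadata, trading some compression ratio for finer-grained ranges, which matches the table's accuracy/overhead framing.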
## Serializing and deserializing quantized models
Suggested change:
- ## Serializing and deserializing quantized models
+ ## Save and load
AutoRound quantized models can be saved and reloaded using the standard [`~ModelMixin.save_pretrained`] and [`~ModelMixin.from_pretrained`] methods.
### Save
i'd remove the separate ### Save and ### Load sections and just combine the code snippets in:

<hfoptions id="save-and-load">
<hfoption id="save">
code here
</hfoption>
<hfoption id="load">
code here
</hfoption>
</hfoptions>

```py
model.save_pretrained("path/to/saved_model")
```
### Load
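In the spirit of the combined-snippet suggestion above, a save-and-load round trip might look like this (illustrative paths; the methods are the standard `ModelMixin` API, and reloading is assumed to pick up the serialized quantization config without passing `AutoRoundConfig` again):

```python
from diffusers import AutoModel

# `model` is an already-loaded AutoRound-quantized model (see Quickstart);
# save_pretrained serializes the quantization config next to the weights.
model.save_pretrained("path/to/saved_model")

# On reload, from_pretrained detects the saved quantization config.
reloaded = AutoModel.from_pretrained("path/to/saved_model")
```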
- [AutoRound GitHub repository](https://github.com/intel/auto-round)
- [AutoRound paper (arXiv:2309.05516)](https://arxiv.org/abs/2309.05516)
these two also feel redundant since the first linked mention of AutoRound also directs to the paper and repo

Suggested change (remove both lines):
- - [AutoRound GitHub repository](https://github.com/intel/auto-round)
- - [AutoRound paper (arXiv:2309.05516)](https://arxiv.org/abs/2309.05516)
What does this PR do?
This pull request introduces support for the AutoRound quantization algorithm in the Diffusers library. AutoRound is a weight-only quantization method that enables efficient inference by optimizing weight rounding and min-max ranges, primarily targeting the W4A16 configuration (4-bit weights, 16-bit activations). The changes add a new quantization config, quantizer class, backend integration, and comprehensive documentation, while ensuring proper handling of optional dependencies.
Key changes:
AutoRound quantization support
- Added an `AutoRoundConfig` class to `quantization_config.py` for configuring AutoRound quantization parameters, including bits, group size, symmetry, backend, and modules to exclude from quantization.
- Added an `AutoRoundQuantizer` class in `quantizers/autoround/autoround_quantizer.py`, implementing the logic for loading pre-quantized AutoRound models and integrating with the auto-round library.
- Registered `AutoRoundConfig` and `AutoRoundQuantizer` in the quantization auto-mapping logic, enabling selection via the `"auto-round"` key.

Dependency management and import handling
- Added an `is_auto_round_available` utility and integrated it into the main import structure and conditional imports, ensuring that AutoRound features are only available if the dependency is installed. Dummy objects are provided otherwise.
- Added `require_auto_round_version_greater_or_equal` for version-gated testing of AutoRound features.

Documentation
- Added `docs/source/en/quantization/autoround.md`, including usage examples, backend options, configuration details, and resource links.

These changes collectively enable seamless integration of AutoRound quantization into Diffusers, with robust configuration, backend selection, and user guidance.
Existing models
cc @wenhuach21 @thuang6 @hshen14
Before submitting
See the documentation guidelines; here are tips on formatting docstrings.
Who can review?
@yiyixuxu @asomoza @stevhliu @sayakpaul
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.