
Integrate AutoRound into Diffusers #13552

Open
xin3he wants to merge 4 commits into huggingface:main from xin3he:auto_round

Conversation


@xin3he xin3he commented Apr 23, 2026

What does this PR do?

This pull request introduces support for the AutoRound quantization algorithm in the Diffusers library. AutoRound is a weight-only quantization method that enables efficient inference by optimizing weight rounding and min-max ranges, primarily targeting the W4A16 configuration (4-bit weights, 16-bit activations). The changes add a new quantization config, quantizer class, backend integration, and comprehensive documentation, while ensuring proper handling of optional dependencies.
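The W4A16 scheme described above can be made concrete with a small, self-contained sketch of group-wise asymmetric 4-bit quantization. This is plain min-max rounding in NumPy, not AutoRound itself (AutoRound additionally optimizes the rounding values and min-max ranges with sign gradient descent); the group size of 128 mirrors the common W4G128 default, and all names here are illustrative.

```python
import numpy as np

def quantize_w4_groupwise(w, group_size=128):
    """Asymmetric 4-bit group-wise quantization of a flat weight vector.

    Each group of `group_size` weights shares one scale and zero point,
    as in a W4G128 layout. This uses naive min-max rounding; AutoRound
    would instead tune the rounding offsets and ranges.
    """
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0            # 4 bits -> 16 levels (0..15)
    zero = np.round(-w_min / scale)           # per-group zero point
    q = np.clip(np.round(w / scale + zero), 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize(q, scale, zero):
    # Activations stay in 16/32-bit float; only weights are quantized.
    return (q.astype(np.float32) - zero) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, scale, zero = quantize_w4_groupwise(w, group_size=128)
w_hat = dequantize(q, scale, zero).reshape(-1)
err = float(np.abs(w - w_hat).max())
print(q.shape, q.dtype, round(err, 4))
```

The reconstruction error is bounded by roughly one quantization step per group, which is why per-group (rather than per-tensor) scales matter for accuracy.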

Key changes:

AutoRound quantization support

  • Added a new AutoRoundConfig class to quantization_config.py for configuring AutoRound quantization parameters, including bits, group size, symmetry, backend, and modules to exclude from quantization.
  • Introduced the AutoRoundQuantizer class in quantizers/autoround/autoround_quantizer.py, implementing the logic for loading pre-quantized AutoRound models and integrating with the auto-round library.
  • Registered AutoRoundConfig and AutoRoundQuantizer in the quantization auto-mapping logic, enabling selection via the "auto-round" key.

Dependency management and import handling

  • Added is_auto_round_available utility and integrated it into the main import structure and conditional imports, ensuring that AutoRound features are only available if the dependency is installed. Dummy objects are provided otherwise.
  • Implemented a test utility require_auto_round_version_greater_or_equal for version-gated testing of AutoRound features.

Documentation

  • Added a comprehensive user guide at docs/source/en/quantization/autoround.md, including usage examples, backend options, configuration details, and resource links.

These changes collectively enable seamless integration of AutoRound quantization into Diffusers, with robust configuration, backend selection, and user guidance.

Existing models

cc @wenhuach21 @thuang6 @hshen14

Before submitting

Who can review?

@yiyixuxu @asomoza @stevhliu @sayakpaul

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

xin3he added 3 commits April 10, 2026 15:19
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
@github-actions github-actions Bot added documentation Improvements or additions to documentation quantization tests utils size/L PR with diff > 200 LOC labels Apr 23, 2026
@github-actions github-actions Bot added size/L PR with diff > 200 LOC and removed size/L PR with diff > 200 LOC labels Apr 23, 2026
@xin3he xin3he changed the title Auto round Integrate AutoRound into Diffusers Apr 23, 2026
@xin3he
Author

xin3he commented Apr 23, 2026

  • [2025/05] AutoRound has been integrated into vLLM: Usage, Medium blog, 小红书.

  • [2025/05] AutoRound has been integrated into Transformers: Blog.

I would like to integrate AutoRound into Diffusers to support diffusion models. Although the performance improvement is not significant, memory usage is notably reduced.

@sayakpaul
Member

Thanks for this PR! Could you provide some example code using AutoRound and also some example outputs? Feel free to also report latency and memory consumption so that there's some signal into its effectiveness.

Cc: @SunMarc


@stevhliu stevhliu left a comment


thanks for the integration!


# AutoRound

[AutoRound](https://github.com/intel/auto-round) is a weight-only quantization algorithm that uses **S**ign **G**radient **D**escent to jointly optimize rounding values and min-max ranges for weights. It targets the W4A16 configuration (4-bit weights, 16-bit activations), reducing model memory footprint while preserving inference accuracy.

Suggested change
[AutoRound](https://github.com/intel/auto-round) is a weight-only quantization algorithm that uses **S**ign **G**radient **D**escent to jointly optimize rounding values and min-max ranges for weights. It targets the W4A16 configuration (4-bit weights, 16-bit activations), reducing model memory footprint while preserving inference accuracy.
[AutoRound](https://huggingface.co/papers/2309.05516) is a weight-only quantization algorithm. It uses sign gradient descent to jointly optimize weight rounding and min-max ranges. AutoRound targets W4A16 (4-bit weights, 16-bit activations), reducing memory usage without sacrificing inference accuracy.


## Quickstart

Load a pre-quantized AutoRound model by passing [`AutoRoundConfig`] to [`~ModelMixin.from_pretrained`]. This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.

Suggested change
Load a pre-quantized AutoRound model by passing [`AutoRoundConfig`] to [`~ModelMixin.from_pretrained`]. This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.
Load a pre-quantized AutoRound model by passing [`AutoRoundConfig`] to [`~ModelMixin.from_pretrained`]. The method works with any model that loads via [Accelerate](https://hf.co/docs/accelerate/index) and has `torch.nn.Linear` layers.

```python
import torch
from diffusers import AutoModel, FluxPipeline, AutoRoundConfig

model_id = "your-org/flux-autoround-w4g128"
```

Suggested change
model_id = "your-org/flux-autoround-w4g128"
model_id = "INCModel/Z-Image-W4A16-AutoRound"

```python
import torch
from diffusers import AutoModel, AutoRoundConfig

model_id = "your-org/flux-autoround-w4g128"
```

Suggested change
model_id = "your-org/flux-autoround-w4g128"
model_id = "INCModel/Z-Image-W4A16-AutoRound"

| Configuration | `bits` | `group_size` | `sym` | Description |
|--------------|--------|-------------|-------|-------------|
| W4G128 asymmetric | `4` | `128` | `False` | Default. Good balance of accuracy and compression. |
| W4G128 symmetric | `4` | `128` | `True` | Slightly faster dequantization, marginal accuracy loss. |

Suggested change
| W4G128 symmetric | `4` | `128` | `True` | Slightly faster dequantization, marginal accuracy loss. |
| W4G128 symmetric | `4` | `128` | `True` | Faster dequantization, small accuracy trade-off. |

- `modules_to_not_convert` (`list[str]` or `None`, defaults to `None`): List of module name patterns to exclude from quantization. Use this to keep sensitive layers in full precision.
- `backend` (`str`, defaults to `"auto"`): The inference backend kernel. See the [Inference backends](#inference-backends) table above.
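How `modules_to_not_convert` filtering might behave can be sketched with simple name matching over a model's layer names. The exact matching semantics the quantizer uses (substring vs. glob) are an assumption here, as is the helper name.

```python
from fnmatch import fnmatch

def should_quantize(module_name, modules_to_not_convert=None):
    """Decide whether a Linear layer should be quantized.

    A module is skipped if its name matches any exclusion pattern,
    either as a substring or as a glob pattern. These semantics are
    an assumption for illustration, not the PR's exact rule.
    """
    for pattern in modules_to_not_convert or []:
        if pattern in module_name or fnmatch(module_name, pattern):
            return False
    return True

excluded = ["proj_out", "*.norm*"]
print(should_quantize("transformer_blocks.0.attn.to_q", excluded))  # True
print(should_quantize("proj_out", excluded))                        # False
```

Keeping sensitive layers (such as output projections or normalization-adjacent linears) in full precision is a common way to recover accuracy at a small memory cost.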

## Supported quantization configurations

Suggested change
## Supported quantization configurations
## Quantization configurations

| W4G128 symmetric | `4` | `128` | `True` | Slightly faster dequantization, marginal accuracy loss. |
| W4G32 asymmetric | `4` | `32` | `False` | Finer granularity, better accuracy, slightly more metadata overhead. |
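The metadata-overhead trade-off in the W4G32 row can be quantified: each group stores its own scale and zero point, so smaller groups raise the average storage per weight. The sketch below assumes 16-bit scales and zero points; actual packed formats differ, so treat the numbers as estimates.

```python
def effective_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_bits=16):
    """Average storage per weight including per-group scale/zero metadata.

    Assumes one scale and one zero point per group of `group_size`
    weights; real on-disk packing schemes vary.
    """
    return bits + (scale_bits + zero_bits) / group_size

for g in (32, 128):
    print(g, effective_bits_per_weight(group_size=g))
```

Under these assumptions, group size 128 costs about 4.25 bits per weight while group size 32 costs 5.0, which is the "slightly more metadata overhead" noted in the table.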

## Serializing and deserializing quantized models

Suggested change
## Serializing and deserializing quantized models
## Save and load


AutoRound quantized models can be saved and reloaded using the standard [`~ModelMixin.save_pretrained`] and [`~ModelMixin.from_pretrained`] methods.

### Save

i'd remove the separate ### Save and ### Load sections and just combine the code snippets in:

<hfoptions id="save-and-load">
<hfoption id="save">

code here

</hfoption>
<hfoption id="load">

code here

</hfoption>
</hfoptions>

```python
model.save_pretrained("path/to/saved_model")
```

### Load

Suggested change
### Load

Comment on lines +153 to +154
- [AutoRound GitHub repository](https://github.com/intel/auto-round)
- [AutoRound paper (arXiv:2309.05516)](https://arxiv.org/abs/2309.05516)

these two also feel redundant since the first linked mention of AutoRound also directs to the paper and repo

Suggested change
- [AutoRound GitHub repository](https://github.com/intel/auto-round)
- [AutoRound paper (arXiv:2309.05516)](https://arxiv.org/abs/2309.05516)

