Integrate AutoRound into Diffusers#13552
Conversation
Signed-off-by: Xin He <xin3.he@intel.com>
I would like to integrate AutoRound into Diffusers to support diffusion models. Although the performance improvement is not significant, memory usage is notably reduced.

Thanks for this PR! Could you provide some example code using AutoRound and also some example outputs? Feel free to also report latency and memory consumption so that there's some signal into its effectiveness. Cc: @SunMarc
# AutoRound

[AutoRound](https://github.com/intel/auto-round) is a weight-only quantization algorithm that uses **S**ign **G**radient **D**escent to jointly optimize rounding values and min-max ranges for weights. It targets the W4A16 configuration (4-bit weights, 16-bit activations), reducing model memory footprint while preserving inference accuracy.
Suggested change:
- [AutoRound](https://github.com/intel/auto-round) is a weight-only quantization algorithm that uses **S**ign **G**radient **D**escent to jointly optimize rounding values and min-max ranges for weights. It targets the W4A16 configuration (4-bit weights, 16-bit activations), reducing model memory footprint while preserving inference accuracy.
+ [AutoRound](https://huggingface.co/papers/2309.05516) is a weight-only quantization algorithm. It uses sign gradient descent to jointly optimize weight rounding and min-max ranges. AutoRound targets W4A16 (4-bit weights, 16-bit activations), reducing memory usage without sacrificing inference accuracy.
## Quickstart

Load a pre-quantized AutoRound model by passing [`AutoRoundConfig`] to [`~ModelMixin.from_pretrained`]. This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.
Suggested change:
- Load a pre-quantized AutoRound model by passing [`AutoRoundConfig`] to [`~ModelMixin.from_pretrained`]. This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.
+ Load a pre-quantized AutoRound model by passing [`AutoRoundConfig`] to [`~ModelMixin.from_pretrained`]. The method works with any model that loads via [Accelerate](https://hf.co/docs/accelerate/index) and has `torch.nn.Linear` layers.
import torch
from diffusers import AutoModel, FluxPipeline, AutoRoundConfig

model_id = "your-org/flux-autoround-w4g128"
Suggested change:
- model_id = "your-org/flux-autoround-w4g128"
+ model_id = "INCModel/Z-Image-W4A16-AutoRound"
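For the reviewer's example-code request above, a fuller end-to-end sketch might look like the following. It is illustrative only: the repo id is the placeholder from the docs diff, `black-forest-labs/FLUX.1-dev` is assumed as the base pipeline, and the `AutoRoundConfig` arguments and loading flow follow the existing diffusers quantization pattern (bitsandbytes/torchao) rather than a confirmed final API.

```python
import torch
from diffusers import AutoModel, FluxPipeline, AutoRoundConfig

# Placeholder repo id (from the docs diff); substitute a real
# pre-quantized AutoRound checkpoint.
model_id = "your-org/flux-autoround-w4g128"

# Passing AutoRoundConfig routes loading through the AutoRound quantizer;
# bits/group_size are read from the checkpoint's quantization config.
transformer = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=AutoRoundConfig(),
    torch_dtype=torch.bfloat16,
)

# Drop the quantized transformer into an otherwise standard pipeline.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("a cat holding a sign that says hello", num_inference_steps=28).images[0]
image.save("autoround_flux.png")
```

Reporting `torch.cuda.max_memory_allocated()` before and after generation would give the memory signal the reviewer asked for.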
import torch
from diffusers import AutoModel, AutoRoundConfig

model_id = "your-org/flux-autoround-w4g128"
Suggested change:
- model_id = "your-org/flux-autoround-w4g128"
+ model_id = "INCModel/Z-Image-W4A16-AutoRound"
| Configuration | `bits` | `group_size` | `sym` | Description |
|--------------|--------|-------------|-------|-------------|
| W4G128 asymmetric | `4` | `128` | `False` | Default. Good balance of accuracy and compression. |
| W4G128 symmetric | `4` | `128` | `True` | Slightly faster dequantization, marginal accuracy loss. |
Suggested change:
- | W4G128 symmetric | `4` | `128` | `True` | Slightly faster dequantization, marginal accuracy loss. |
+ | W4G128 symmetric | `4` | `128` | `True` | Faster dequantization, small accuracy trade-off. |
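To make the `sym` column concrete, here is a hypothetical round-to-nearest sketch of the storage format these knobs control. This is not the AutoRound algorithm (which *learns* the rounding and ranges via sign gradient descent); it only shows why symmetric mode is cheaper (a scale with a fixed mid-range zero-point) while asymmetric mode adapts a zero-point per group:

```python
def quantize_group(weights, bits=4, sym=False):
    """Naive round-to-nearest quantization of one weight group.

    Illustration only -- AutoRound optimizes the rounding decisions
    instead of rounding to nearest.
    """
    qmax = (1 << bits) - 1  # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    if sym:
        # Symmetric: range centered on zero, zero-point fixed at mid-range.
        amax = max(abs(lo), abs(hi))
        scale = (2 * amax) / qmax if amax else 1.0
        zero = qmax // 2
    else:
        # Asymmetric: scale *and* zero-point chosen per group.
        scale = (hi - lo) / qmax if hi != lo else 1.0
        zero = round(-lo / scale)
    q = [max(0, min(qmax, round(w / scale) + zero)) for w in weights]
    return q, scale, zero

def dequantize_group(q, scale, zero):
    return [(qi - zero) * scale for qi in q]

group = [0.1, -0.45, 0.3, 0.02]
q, scale, zero = quantize_group(group, bits=4, sym=False)
recon = dequantize_group(q, scale, zero)
# Reconstruction error is bounded by one quantization step.
assert max(abs(a - b) for a, b in zip(group, recon)) <= scale
```

With `group_size=128`, one `(scale, zero)` pair is shared by 128 consecutive weights, which is where the accuracy/overhead trade-off in the table comes from.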
- `modules_to_not_convert` (`list[str]` or `None`, defaults to `None`): List of module name patterns to exclude from quantization. Use this to keep sensitive layers in full precision.
- `backend` (`str`, defaults to `"auto"`): The inference backend kernel. See the [Inference backends](#inference-backends) table above.
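A small name-matching helper illustrates how `modules_to_not_convert` is typically honored (hypothetical helper; the actual matching logic inside diffusers/auto-round may differ, e.g. substring vs. glob semantics):

```python
from fnmatch import fnmatch

def should_quantize(module_name, modules_to_not_convert=None):
    """Return False when a module name hits an exclusion pattern.

    Hypothetical sketch: checks both plain substring containment and
    glob-style patterns against the dotted module path.
    """
    for pattern in modules_to_not_convert or []:
        if pattern in module_name or fnmatch(module_name, pattern):
            return False  # keep this layer in full precision
    return True

excluded = ["proj_out", "*.norm*"]
assert should_quantize("transformer_blocks.0.attn.to_q", excluded)
assert not should_quantize("proj_out", excluded)
assert not should_quantize("transformer_blocks.0.norm1", excluded)
```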
## Supported quantization configurations
Suggested change:
- ## Supported quantization configurations
+ ## Quantization configurations
| W4G128 symmetric | `4` | `128` | `True` | Slightly faster dequantization, marginal accuracy loss. |
| W4G32 asymmetric | `4` | `32` | `False` | Finer granularity, better accuracy, slightly more metadata overhead. |
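The "metadata overhead" in the W4G32 row can be made quantitative with back-of-envelope arithmetic. The sketch below assumes two 4-bit values packed per byte plus an fp16 scale and fp16 zero-point per group (4 bytes of metadata per group); exact packing differs by backend:

```python
def w4_bytes(out_features, in_features, group_size):
    """Approximate storage for a Linear weight under 4-bit group quantization."""
    n = out_features * in_features
    groups = out_features * (in_features // group_size)  # grouped along input dim
    packed = n // 2        # two 4-bit values per byte
    metadata = groups * 4  # assumed fp16 scale + fp16 zero-point per group
    return packed + metadata

fp16 = 4096 * 4096 * 2  # unquantized fp16 baseline for a 4096x4096 Linear
ratio_g128 = fp16 / w4_bytes(4096, 4096, 128)  # ~3.76x compression
ratio_g32 = fp16 / w4_bytes(4096, 4096, 32)    # ~3.2x: finer groups cost metadata
```

Quartering `group_size` quadruples the per-group metadata, trading some compression ratio for finer-grained ranges, which matches the table's accuracy/overhead framing.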
## Serializing and deserializing quantized models
Suggested change:
- ## Serializing and deserializing quantized models
+ ## Save and load
AutoRound quantized models can be saved and reloaded using the standard [`~ModelMixin.save_pretrained`] and [`~ModelMixin.from_pretrained`] methods.
### Save
i'd remove the separate ### Save and ### Load sections and just combine the code snippets in:

<hfoptions id="save-and-load">
<hfoption id="save">
code here
</hfoption>
<hfoption id="load">
code here
</hfoption>
</hfoptions>

```py
model.save_pretrained("path/to/saved_model")
```
### Load
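In the spirit of the combined-snippet suggestion above, a save-and-load round trip might look like this (illustrative paths; the methods are the standard `ModelMixin` API, and reloading is assumed to pick up the serialized quantization config without passing `AutoRoundConfig` again):

```python
from diffusers import AutoModel

# `model` is an already-loaded AutoRound-quantized model (see Quickstart);
# save_pretrained serializes the quantization config next to the weights.
model.save_pretrained("path/to/saved_model")

# On reload, from_pretrained detects the saved quantization config.
reloaded = AutoModel.from_pretrained("path/to/saved_model")
```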
- [AutoRound GitHub repository](https://github.com/intel/auto-round)
- [AutoRound paper (arXiv:2309.05516)](https://arxiv.org/abs/2309.05516)
these two also feel redundant since the first linked mention of AutoRound also directs to the paper and repo

Suggested change (remove both lines):
- - [AutoRound GitHub repository](https://github.com/intel/auto-round)
- - [AutoRound paper (arXiv:2309.05516)](https://arxiv.org/abs/2309.05516)
What does this PR do?
This pull request introduces support for the AutoRound quantization algorithm in the Diffusers library. AutoRound is a weight-only quantization method that enables efficient inference by optimizing weight rounding and min-max ranges, primarily targeting the W4A16 configuration (4-bit weights, 16-bit activations). The changes add a new quantization config, quantizer class, backend integration, and comprehensive documentation, while ensuring proper handling of optional dependencies.
Key changes:
AutoRound quantization support
- Added an `AutoRoundConfig` class to `quantization_config.py` for configuring AutoRound quantization parameters, including bits, group size, symmetry, backend, and modules to exclude from quantization.
- Added an `AutoRoundQuantizer` class in `quantizers/autoround/autoround_quantizer.py`, implementing the logic for loading pre-quantized AutoRound models and integrating with the auto-round library.
- Registered `AutoRoundConfig` and `AutoRoundQuantizer` in the quantization auto-mapping logic, enabling selection via the `"auto-round"` key.

Dependency management and import handling
- Added an `is_auto_round_available` utility and integrated it into the main import structure and conditional imports, ensuring that AutoRound features are only available if the dependency is installed. Dummy objects are provided otherwise.
- Added `require_auto_round_version_greater_or_equal` for version-gated testing of AutoRound features.

Documentation
- Added `docs/source/en/quantization/autoround.md`, including usage examples, backend options, configuration details, and resource links.

These changes collectively enable seamless integration of AutoRound quantization into Diffusers, with robust configuration, backend selection, and user guidance.
Existing models
cc @wenhuach21 @thuang6 @hshen14
Before submitting
See the documentation guidelines; here are tips on formatting docstrings.
Who can review?
@yiyixuxu @asomoza @stevhliu @sayakpaul
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.