Support loading sharded quantized checkpoints
Sharded quantized checkpoints can now be loaded with the from_quantized method.
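A sharded checkpoint is usually described by an index file that maps each tensor name to the shard file holding it. The sketch below assumes the Hugging Face "model.safetensors.index.json" layout (a "weight_map" dictionary) and shows how such an index can be grouped by shard. It illustrates the general format only. The function group_tensors_by_shard and the example tensor names are hypothetical and are not AutoGPTQ's internal loader.

```python
import json
from collections import defaultdict


def group_tensors_by_shard(index: dict) -> dict:
    """Group tensor names by the shard file that stores them.

    Assumes the Hugging Face index layout: {"metadata": {...}, "weight_map": {name: file}}.
    """
    shards = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    return dict(shards)


# Hypothetical index contents, made up for illustration.
index_json = """
{
  "metadata": {"total_size": 123},
  "weight_map": {
    "model.layers.0.self_attn.q_proj.qweight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_proj.scales": "model-00001-of-00002.safetensors",
    "lm_head.weight": "model-00002-of-00002.safetensors"
  }
}
"""

index = json.loads(index_json)
for shard, names in sorted(group_tensors_by_shard(index).items()):
    # Each shard file would then be opened once and its listed tensors loaded.
    print(shard, len(names))
```

Loading shard by shard in this way keeps only one file's tensors in flight at a time, which is the usual reason large checkpoints are split.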
Gemma GPTQ quantization
Gemma models can now be quantized with AutoGPTQ.
Other changes and fixes
- Add back missing import by @fxmarty in #553
- Fix bias materialization for Marlin by @fxmarty in #554
- Fix shape check for Marlin by @fxmarty in #557
- Explicitly check compute capability in Marlin's QLinear by @fxmarty in #567
- Compatibility with latest transformers by @fxmarty in #573
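The compute capability check for Marlin can be pictured as a simple version comparison. The sketch below assumes the minimum of compute capability 8.0 (Ampere or newer) that the Marlin kernel documents; the constant MARLIN_MIN_CAPABILITY and the function marlin_supported are hypothetical names, not AutoGPTQ's actual code. In practice the (major, minor) pair would come from torch.cuda.get_device_capability().

```python
# Assumption: Marlin kernels require compute capability >= 8.0 (Ampere or newer).
MARLIN_MIN_CAPABILITY = (8, 0)


def marlin_supported(capability: tuple) -> bool:
    """Return True if a (major, minor) compute capability can run Marlin.

    Tuples compare lexicographically, so (8, 6) >= (8, 0) and (7, 5) < (8, 0).
    """
    return tuple(capability) >= MARLIN_MIN_CAPABILITY


for cap in [(7, 5), (8, 0), (8, 6), (9, 0)]:
    print(cap, marlin_supported(cap))
```

Checking up front lets a loader fail with a clear message (or fall back to another kernel) instead of hitting an unsupported-architecture error inside the CUDA kernel.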
Full Changelog: v0.7.0...v0.7.1