release v1.1.0

This commit is contained in:
will.yang
2024-10-10 20:25:23 +08:00
parent 92fce84bdd
commit 71773f050a
28 changed files with 1717 additions and 650 deletions

@@ -1,4 +1,17 @@
 # CHANGELOG
+## v1.1.0
+- Support group-wise quantization (w4a16 group sizes of 32/64/128, w8a8 group sizes of 128/256/512).
+- Support joint inference with LoRA model loading.
+- Support storage and preloading of prompt cache.
+- Support gguf model conversion (currently supports only q4_0 and fp16).
+- Optimize initialization, prefill, and decode time.
+- Support four input types: prompt, embedding, token, and multimodal.
+- Add PC-based simulation accuracy testing and inference interface support for rkllm-toolkit.
+- Add gdq algorithm to improve 4-bit quantization accuracy.
+- Add mixed quantization algorithm, supporting a combination of grouped and non-grouped quantization based on specified ratios.
+- Add support for models such as Llama3, Gemma2, and MiniCPM3.
+- Resolve catastrophic forgetting issue when the number of tokens exceeds max_context.
 ## v1.0.1
 - Optimize model conversion memory occupation
 - Optimize inference memory occupation
@@ -11,7 +24,7 @@
 - Add logprob and token_id to the return value
 ## v1.0.0
-- Supports the conversion and deployment of LLM models on RK3588/RK3576 platforms
+- Support the conversion and deployment of LLM models on RK3588/RK3576 platforms
 - Compatible with Hugging Face model architectures
-- Currently supports the models Llama, Qwen, Qwen2, and Phi-2
-- Supports quantization with w8a8 and w4a16 precision
+- Currently support the models Llama, Qwen, Qwen2, and Phi-2
+- Support quantization with w8a8 and w4a16 precision
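
The group-wise quantization added in v1.1.0 assigns each fixed-size group of weights its own scale instead of one scale per tensor. The sketch below illustrates that idea with a plain absmax/symmetric-int4 toy in Python; it is not the rkllm-toolkit implementation, and the function names are hypothetical.

```python
# Toy illustration of group-wise quantization: one scale per group of weights.
# Assumes symmetric int4 absmax scaling; NOT the actual rkllm algorithm.

def quantize_groupwise(weights, group_size=128):
    """Quantize a flat list of floats to signed int4 (-7..7), one scale per group."""
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        absmax = max(abs(w) for w in group) or 1.0  # guard against an all-zero group
        scale = absmax / 7.0                        # map absmax onto the int4 range
        scales.append(scale)
        q.extend(max(-7, min(7, round(w / scale))) for w in group)
    return q, scales

def dequantize_groupwise(q, scales, group_size=128):
    """Reconstruct approximate floats using each group's stored scale."""
    return [v * scales[i // group_size] for i, v in enumerate(q)]
```

Smaller groups track local weight ranges more tightly at the cost of storing more scales, which is the trade-off behind offering w4a16 group sizes of 32/64/128 rather than a single per-tensor scale.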