Mirror of https://github.com/DrHo1y/ezrknn-llm.git, synced 2026-03-23 09:06:47 +07:00
release v1.1.0
CHANGELOG.md (19 lines changed)
@@ -1,4 +1,17 @@
 # CHANGELOG
+## v1.1.0
+- Support group-wise quantization (w4a16 group sizes of 32/64/128, w8a8 group sizes of 128/256/512).
+- Support joint inference with LoRA model loading
+- Support storage and preloading of prompt cache.
+- Support gguf model conversion (currently only support q4_0 and fp16).
+- Optimize initialization, prefill, and decode time.
+- Support four input types: prompt, embedding, token, and multimodal.
+- Add PC-based simulation accuracy testing and inference interface support for rkllm-toolkit.
+- Add gdq algorithm to improve 4-bit quantization accuracy.
+- Add mixed quantization algorithm, supporting a combination of grouped and non-grouped quantization based on specified ratios.
+- Add support for models such as Llama3, Gemma2, and MiniCPM3.
+- Resolve catastrophic forgetting issue when the number of tokens exceeds max_context.
+
 ## v1.0.1
 - Optimize model conversion memory occupation
 - Optimize inference memory occupation
@@ -11,7 +24,7 @@
 - Add logprob and token_id to the return value
 
 ## v1.0.0
-- Supports the conversion and deployment of LLM models on RK3588/RK3576 platforms
+- Support the conversion and deployment of LLM models on RK3588/RK3576 platforms
 - Compatible with Hugging Face model architectures
-- Currently supports the models Llama, Qwen, Qwen2, and Phi-2
-- Supports quantization with w8a8 and w4a16 precision
+- Currently support the models Llama, Qwen, Qwen2, and Phi-2
+- Support quantization with w8a8 and w4a16 precision