# ezrknn-llm

This repo tries to make RKNN LLM usage easier for people who don't want to read through Rockchip's docs.

## Requirements

Keep in mind this repo is focused on:

- High-end Rockchip SoCs, mainly the RK3588
- Linux, not Android
- Linux kernels from Rockchip (as of writing, 5.10 and 6.1 from Rockchip should work; if your board has one of these, it is very likely Rockchip's kernel)

## Quick Install

Run:

```bash
curl https://raw.githubusercontent.com/Pelochus/ezrknn-llm/main/install.sh | sudo bash
```

## Test

Run (cd is required):

```bash
# TODO
```

## Converting LLMs for Rockchip's NPUs

### Docker

To convert models, you need an x86 Linux PC (Intel or AMD). Rockchip currently does not provide ARM support for converting models, so this can't be done on an Orange Pi or similar board.

Run:

`docker run -it pelochus/ezrkllm-toolkit:1.0 bash`

Then, inside the Docker container:

```bash
apt install -y python3-tk # This needs some configuring on your part
cd ezrknn-llm/rkllm-toolkit/examples/huggingface/
```


Now edit `test.py` to use your preferred model. This container provides Qwen-1.8B and LLaMa2 Uncensored; by default, Qwen-1.8B is selected. To convert the model, run:

`python3 test.py`
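
For reference, the conversion script follows roughly this pattern. This is a minimal sketch based on the toolkit's Hugging Face example; the model path and build arguments here are illustrative, so check the `test.py` shipped in the container for the exact values:

```python
from rkllm.api import RKLLM

# Illustrative model path; replace with the model you want to convert
MODEL_PATH = 'Qwen/Qwen-1_8B-Chat'

llm = RKLLM()

# Load the Hugging Face model
ret = llm.load_huggingface(model=MODEL_PATH)
if ret != 0:
    raise SystemExit('load failed')

# Quantize and build for the target NPU (dtype and platform are examples)
ret = llm.build(do_quantization=True, quantized_dtype='w8a8',
                target_platform='rk3588')
if ret != 0:
    raise SystemExit('build failed')

# Export the converted .rkllm model for use on the board
ret = llm.export_rkllm('./model.rkllm')
if ret != 0:
    raise SystemExit('export failed')
```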

I currently cannot convert the models, so I don't know what the output will be. I believe this is Rockchip's fault. Let me know if it works for you, or what error it gives you.

# Original README starts below

<hr>
<hr>
<hr>

# Description

The RKLLM software stack helps users quickly deploy AI models to Rockchip chips. The overall framework is as follows:

<center class="half">
<div style="background-color:#ffffff;">
<img src="res/framework.jpg" title="RKLLM"/>
</div>
</center>

In order to use the RKNPU, users first need to run the RKLLM-Toolkit on a computer to convert the trained model into an RKLLM-format model, and then perform inference on the development board using the RKLLM C API.

- RKLLM-Toolkit is a software development kit for users to perform model conversion and quantization on a PC.

- RKLLM Runtime provides C/C++ programming interfaces for the Rockchip NPU platform, helping users deploy RKLLM models and accelerate the implementation of LLM applications.

- The RKNPU kernel driver is responsible for interacting with the NPU hardware. It is open source and can be found in the Rockchip kernel code.

# Supported Platforms

- RK3588 Series
- RK3576 Series

# Supported Models

- [x] [LLAMA models](https://huggingface.co/meta-llama)
- [x] [TinyLLAMA models](https://huggingface.co/TinyLlama)
- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
- [x] [ChatGLM3-6B](https://huggingface.co/THUDM/chatglm3-6b/tree/103caa40027ebfd8450289ca2f278eac4ff26405)
- [x] [Gemma models](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315)
- [x] [InternLM2 models](https://huggingface.co/collections/internlm/internlm2-65b0ce04970888799707893c)
- [x] [MiniCPM models](https://huggingface.co/collections/openbmb/minicpm-65d48bf958302b9fd25b698f)

# Model Performance Benchmark

| model          | dtype      | seqlen | max_context | new_tokens | TTFT (ms) | Tokens/s | memory (GB) | platform |
|:-------------- |:---------- |:------:|:-----------:|:----------:|:---------:|:--------:|:-----------:|:--------:|
| TinyLLAMA-1.1B | w4a16      | 64     | 320         | 256        | 345.00    | 21.10    | 0.77        | RK3576   |
|                | w4a16_g128 | 64     | 320         | 256        | 410.00    | 18.50    | 0.8         | RK3576   |
|                | w8a8       | 64     | 320         | 256        | 140.46    | 24.21    | 1.25        | RK3588   |
|                | w8a8_g512  | 64     | 320         | 256        | 195.00    | 20.08    | 1.29        | RK3588   |
| Qwen2-1.5B     | w4a16      | 64     | 320         | 256        | 512.00    | 14.40    | 1.75        | RK3576   |
|                | w4a16_g128 | 64     | 320         | 256        | 550.00    | 12.75    | 1.76        | RK3576   |
|                | w8a8       | 64     | 320         | 256        | 206.00    | 16.46    | 2.47        | RK3588   |
|                | w8a8_g128  | 64     | 320         | 256        | 725.00    | 7.00     | 2.65        | RK3588   |
| Phi-3-3.8B     | w4a16      | 64     | 320         | 256        | 975.00    | 6.60     | 2.16        | RK3576   |
|                | w4a16_g128 | 64     | 320         | 256        | 1180.00   | 5.85     | 2.23        | RK3576   |
|                | w8a8       | 64     | 320         | 256        | 516.00    | 7.44     | 3.88        | RK3588   |
|                | w8a8_g512  | 64     | 320         | 256        | 610.00    | 6.13     | 3.95        | RK3588   |
| ChatGLM3-6B    | w4a16      | 64     | 320         | 256        | 1168.00   | 4.62     | 3.86        | RK3576   |
|                | w4a16_g128 | 64     | 320         | 256        | 1582.56   | 3.82     | 3.96        | RK3576   |
|                | w8a8       | 64     | 320         | 256        | 800.00    | 4.95     | 6.69        | RK3588   |
|                | w8a8_g128  | 64     | 320         | 256        | 2190.00   | 2.70     | 7.18        | RK3588   |
| Gemma2-2B      | w4a16      | 64     | 320         | 256        | 628.00    | 8.00     | 3.63        | RK3576   |
|                | w4a16_g128 | 64     | 320         | 256        | 776.20    | 7.40     | 3.63        | RK3576   |
|                | w8a8       | 64     | 320         | 256        | 342.29    | 9.67     | 4.84        | RK3588   |
|                | w8a8_g128  | 64     | 320         | 256        | 1055.00   | 5.49     | 5.14        | RK3588   |
| InternLM2-1.8B | w4a16      | 64     | 320         | 256        | 475.00    | 13.30    | 1.59        | RK3576   |
|                | w4a16_g128 | 64     | 320         | 256        | 572.00    | 11.95    | 1.62        | RK3576   |
|                | w8a8       | 64     | 320         | 256        | 205.97    | 15.66    | 2.38        | RK3588   |
|                | w8a8_g512  | 64     | 320         | 256        | 298.00    | 12.66    | 2.45        | RK3588   |
| MiniCPM3-4B    | w4a16      | 64     | 320         | 256        | 1397.00   | 4.80     | 2.7         | RK3576   |
|                | w4a16_g128 | 64     | 320         | 256        | 1645.00   | 4.39     | 2.8         | RK3576   |
|                | w8a8       | 64     | 320         | 256        | 702.18    | 6.15     | 4.65        | RK3588   |
|                | w8a8_g128  | 64     | 320         | 256        | 1691.00   | 3.42     | 5.06        | RK3588   |
| llama3-8B      | w4a16      | 64     | 320         | 256        | 1607.98   | 3.60     | 5.63        | RK3576   |
|                | w4a16_g128 | 64     | 320         | 256        | 2010.00   | 3.00     | 5.76        | RK3576   |
|                | w8a8       | 64     | 320         | 256        | 1128.00   | 3.79     | 9.21        | RK3588   |
|                | w8a8_g512  | 64     | 320         | 256        | 1281.35   | 3.05     | 9.45        | RK3588   |

- This performance data was collected with version 1.1.0, at the maximum CPU and NPU frequencies of each platform.
- The script for setting the frequencies is located in the `scripts` directory.
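
As a quick sanity check on how to read the table, the approximate end-to-end generation time for a row is the time to first token plus the decode time for the remaining tokens:

```python
# Rough end-to-end time for one table row:
# total ≈ TTFT + new_tokens / decode throughput
def total_time_s(ttft_ms: float, new_tokens: int, tokens_per_s: float) -> float:
    return ttft_ms / 1000 + new_tokens / tokens_per_s

# TinyLLAMA-1.1B, w8a8 on RK3588: 140.46 ms TTFT, 256 tokens at 24.21 tokens/s
print(f"{total_time_s(140.46, 256, 24.21):.1f} s")  # ~10.7 s
```
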
# Download

You can download the latest package, Docker image, examples, documentation, and platform-tool from [RKLLM_SDK](https://console.zbox.filez.com/l/RJJDmB), fetch code: rkllm

# Note

- The modifications in version 1.1 are significant, making it incompatible with models converted by older versions. Please use the latest toolchain for model conversion and inference.

- The supported Python versions are:
  - Python 3.8
  - Python 3.10

- Latest version: [v1.1.1](https://github.com/airockchip/rknn-llm/releases/tag/release-v1.1.1)

# RKNN Toolkit2

If you want to deploy additional AI models, we have introduced an SDK called RKNN-Toolkit2. For details, please refer to:

https://github.com/airockchip/rknn-toolkit2

# CHANGELOG

## v1.1.0

- Support group-wise quantization (w4a16 group sizes of 32/64/128, w8a8 group sizes of 128/256/512).
- Support joint inference with LoRA model loading.
- Support storage and preloading of prompt cache.
- Support gguf model conversion (currently only q4_0 and fp16 are supported); see the sketch after this list.
- Optimize initialization, prefill, and decode time.
- Support four input types: prompt, embedding, token, and multimodal.
- Add PC-based simulation accuracy testing and inference interface support for rkllm-toolkit.
- Add gdq algorithm to improve 4-bit quantization accuracy.
- Add mixed quantization algorithm, supporting a combination of grouped and non-grouped quantization based on specified ratios.
- Add support for models such as Llama3, Gemma2, and MiniCPM3.
- Resolve catastrophic forgetting issue when the number of tokens exceeds max_context.
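
A minimal sketch of what gguf conversion might look like with the toolkit, assuming a `load_gguf` entry point analogous to `load_huggingface`; the method name, arguments, and file path here are assumptions, so consult the toolkit's own examples for the exact API:

```python
from rkllm.api import RKLLM

llm = RKLLM()

# Assumed gguf entry point; only q4_0 and fp16 gguf files are supported
ret = llm.load_gguf(model='./model-q4_0.gguf')
if ret != 0:
    raise SystemExit('load failed')

# q4_0 weights are already quantized, so no further quantization here
# (assumption; verify against the toolkit's gguf example)
ret = llm.build(do_quantization=False, target_platform='rk3588')
if ret != 0:
    raise SystemExit('build failed')

ret = llm.export_rkllm('./model.rkllm')
if ret != 0:
    raise SystemExit('export failed')
```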

For older versions, please refer to the [CHANGELOG](CHANGELOG.md).