ezrknn-llm
This repo aims to make RKNN LLM usage easier for people who don't want to read through Rockchip's docs.
Requirements
Keep in mind this repo is focused on:
- High-end Rockchip SoCs, mainly the RK3588
- Linux, not Android
- Linux kernels from Rockchip (as of writing, Rockchip's 5.10 and 6.1 kernels should work; if your board has one of these versions, it is very likely running Rockchip's kernel)
Quick Install
Run:
curl https://raw.githubusercontent.com/Pelochus/ezrknn-llm/main/install.sh | sudo bash
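If you prefer to review what the script does before it runs as root, download it first (same URL as above):

```sh
# Download the installer, read it, then run it manually
curl -o install.sh https://raw.githubusercontent.com/Pelochus/ezrknn-llm/main/install.sh
less install.sh
sudo bash install.sh
```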
Test
Run (cd is required):
# TODO
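Until the exact command is documented, a purely hypothetical invocation could look like the sketch below; the rkllm binary name, its argument, and the model location are assumptions, not confirmed by this README:

```sh
# Hypothetical example: binary name, argument, and paths are assumptions
cd /path/to/your/converted/model   # directory containing the .rkllm file
rkllm ./your-model.rkllm           # run the demo against the converted model
```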
Converting LLMs for Rockchip's NPUs
Docker
To do this you need an x86 Linux PC (Intel or AMD). Rockchip currently does not provide ARM support for converting models, so conversion can't be done on an Orange Pi or similar board. Run:
docker run -it pelochus/ezrkllm-toolkit:1.1 bash
Then, inside the Docker container:
apt install -y python3-tk # This needs some configuration on your part
cd ezrknn-llm/rkllm-toolkit/examples/huggingface/
Now edit test.py to use your preferred model. This container provides Qwen-1.8B and LLaMA 2 Uncensored; Qwen-1.8B is selected by default. To convert the model, run:
python3 test.py
I currently cannot convert the models, so I don't know what the output will be; I believe this is Rockchip's fault. Let me know if it works for you, or what error you get.
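For reference, the conversion scripts in Rockchip's huggingface examples follow roughly the shape below. Treat it as a sketch: the model path and quantization settings are placeholders, and the exact arguments accepted by build() depend on your rkllm-toolkit version.

```python
# Sketch of an rkllm-toolkit conversion script (paths and dtypes are placeholders)
from rkllm.api import RKLLM

llm = RKLLM()

# Load a Hugging Face model from a local directory
ret = llm.load_huggingface(model='./Qwen-1_8B-Chat')
assert ret == 0, 'model load failed'

# Quantize and build for the target NPU
ret = llm.build(do_quantization=True, quantized_dtype='w8a8',
                target_platform='rk3588')
assert ret == 0, 'build failed'

# Export the converted model to RKLLM format
ret = llm.export_rkllm('./qwen-1_8b.rkllm')
assert ret == 0, 'export failed'
```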
Original README starts below
Description
The RKLLM software stack helps users quickly deploy AI models to Rockchip chips. The overall framework is as follows:
To use the RKNPU, users first run the RKLLM-Toolkit on a computer to convert a trained model into an RKLLM-format model, and then run inference on the development board using the RKLLM C API.
- RKLLM-Toolkit is a software development kit for users to perform model conversion and quantization on a PC.
- RKLLM Runtime provides C/C++ programming interfaces for the Rockchip NPU platform to help users deploy RKLLM models and accelerate the implementation of LLM applications (see the sketch below).
- RKNPU kernel driver is responsible for interacting with the NPU hardware. It is open source and can be found in the Rockchip kernel code.
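To give a flavor of the deployment side, below is a minimal sketch of the RKLLM C API on the board. The struct fields and callback signature changed between runtime releases, so verify every name against the rkllm.h shipped with your runtime before using it:

```c
// Minimal RKLLM C API sketch; names follow the 1.0-era header and are
// assumptions here -- check them against your runtime's rkllm.h.
#include <stdio.h>
#include "rkllm.h"

// Streaming callback: invoked as tokens are produced
void callback(RKLLMResult *result, void *userdata, LLMCallState state) {
    if (state == LLM_RUN_NORMAL) {
        printf("%s", result->text);   // print each partial result
        fflush(stdout);
    } else if (state == LLM_RUN_FINISH) {
        printf("\n");                 // generation done
    }
}

int main(void) {
    LLMHandle handle;
    RKLLMParam param = rkllm_createDefaultParam();
    param.model_path = "./model.rkllm";  // placeholder path
    param.max_new_tokens = 256;
    param.max_context_len = 320;

    if (rkllm_init(&handle, param, callback) != 0) {
        fprintf(stderr, "rkllm_init failed\n");
        return 1;
    }
    rkllm_run(handle, "Hello, who are you?", NULL);  // blocking generation
    rkllm_destroy(handle);
    return 0;
}
```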
Support Platform
- RK3588 Series
- RK3576 Series
Support Models
- LLAMA models
- TinyLLAMA models
- Qwen models
- Phi models
- ChatGLM3-6B
- Gemma models
- InternLM2 models
- MiniCPM models
Model Performance Benchmark
| model | dtype | seqlen | max_context | new_tokens | TTFT (ms) | Tokens/s | memory (GB) | platform |
|---|---|---|---|---|---|---|---|---|
| TinyLLAMA-1.1B | w4a16 | 64 | 320 | 256 | 345.00 | 21.10 | 0.77 | RK3576 |
| | w4a16_g128 | 64 | 320 | 256 | 410.00 | 18.50 | 0.8 | RK3576 |
| | w8a8 | 64 | 320 | 256 | 140.46 | 24.21 | 1.25 | RK3588 |
| | w8a8_g512 | 64 | 320 | 256 | 195.00 | 20.08 | 1.29 | RK3588 |
| Qwen2-1.5B | w4a16 | 64 | 320 | 256 | 512.00 | 14.40 | 1.75 | RK3576 |
| | w4a16_g128 | 64 | 320 | 256 | 550.00 | 12.75 | 1.76 | RK3576 |
| | w8a8 | 64 | 320 | 256 | 206.00 | 16.46 | 2.47 | RK3588 |
| | w8a8_g128 | 64 | 320 | 256 | 725.00 | 7.00 | 2.65 | RK3588 |
| Phi-3-3.8B | w4a16 | 64 | 320 | 256 | 975.00 | 6.60 | 2.16 | RK3576 |
| | w4a16_g128 | 64 | 320 | 256 | 1180.00 | 5.85 | 2.23 | RK3576 |
| | w8a8 | 64 | 320 | 256 | 516.00 | 7.44 | 3.88 | RK3588 |
| | w8a8_g512 | 64 | 320 | 256 | 610.00 | 6.13 | 3.95 | RK3588 |
| ChatGLM3-6B | w4a16 | 64 | 320 | 256 | 1168.00 | 4.62 | 3.86 | RK3576 |
| | w4a16_g128 | 64 | 320 | 256 | 1582.56 | 3.82 | 3.96 | RK3576 |
| | w8a8 | 64 | 320 | 256 | 800.00 | 4.95 | 6.69 | RK3588 |
| | w8a8_g128 | 64 | 320 | 256 | 2190.00 | 2.70 | 7.18 | RK3588 |
| Gemma2-2B | w4a16 | 64 | 320 | 256 | 628.00 | 8.00 | 3.63 | RK3576 |
| | w4a16_g128 | 64 | 320 | 256 | 776.20 | 7.40 | 3.63 | RK3576 |
| | w8a8 | 64 | 320 | 256 | 342.29 | 9.67 | 4.84 | RK3588 |
| | w8a8_g128 | 64 | 320 | 256 | 1055.00 | 5.49 | 5.14 | RK3588 |
| InternLM2-1.8B | w4a16 | 64 | 320 | 256 | 475.00 | 13.30 | 1.59 | RK3576 |
| | w4a16_g128 | 64 | 320 | 256 | 572.00 | 11.95 | 1.62 | RK3576 |
| | w8a8 | 64 | 320 | 256 | 205.97 | 15.66 | 2.38 | RK3588 |
| | w8a8_g512 | 64 | 320 | 256 | 298.00 | 12.66 | 2.45 | RK3588 |
| MiniCPM3-4B | w4a16 | 64 | 320 | 256 | 1397.00 | 4.80 | 2.7 | RK3576 |
| | w4a16_g128 | 64 | 320 | 256 | 1645.00 | 4.39 | 2.8 | RK3576 |
| | w8a8 | 64 | 320 | 256 | 702.18 | 6.15 | 4.65 | RK3588 |
| | w8a8_g128 | 64 | 320 | 256 | 1691.00 | 3.42 | 5.06 | RK3588 |
| Llama3-8B | w4a16 | 64 | 320 | 256 | 1607.98 | 3.60 | 5.63 | RK3576 |
| | w4a16_g128 | 64 | 320 | 256 | 2010.00 | 3.00 | 5.76 | RK3576 |
| | w8a8 | 64 | 320 | 256 | 1128.00 | 3.79 | 9.21 | RK3588 |
| | w8a8_g512 | 64 | 320 | 256 | 1281.35 | 3.05 | 9.45 | RK3588 |
- This performance data was collected at the maximum CPU and NPU frequencies of each platform, with version 1.1.0.
- The script for setting the frequencies is located in the scripts directory (a rough illustration follows below).
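The script in scripts/ is authoritative; as a rough illustration only, pinning frequencies on Rockchip kernels is typically done through cpufreq and devfreq sysfs nodes like the ones below (the NPU devfreq node name is an assumption for RK3588 and varies by board and kernel):

```sh
# Illustration only; use the repository's scripts/ for the real procedure
# Pin all CPU clusters to their highest frequency
for gov in /sys/devices/system/cpu/cpufreq/policy*/scaling_governor; do
    echo performance | sudo tee "$gov"
done

# Pin the NPU via devfreq (node name assumed for RK3588)
echo performance | sudo tee /sys/class/devfreq/fdab0000.npu/governor
```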
Download
You can download the latest package, docker image, examples, documentation, and platform-tool from RKLLM_SDK, fetch code: rkllm
Note
- The modifications in version 1.1 are significant, making it incompatible with older-version models. Please use the latest toolchain for model conversion and inference.
- The supported Python versions are:
  - Python 3.8
  - Python 3.10
- Latest version: v1.1.1
RKNN Toolkit2
If you want to deploy additional AI models, we have introduced an SDK called RKNN-Toolkit2. For details, please refer to:
https://github.com/airockchip/rknn-toolkit2
CHANGELOG
v1.1.0
- Support group-wise quantization (w4a16 group sizes of 32/64/128, w8a8 group sizes of 128/256/512).
- Support joint inference with LoRA model loading.
- Support storage and preloading of prompt cache.
- Support GGUF model conversion (currently only q4_0 and fp16 are supported).
- Optimize initialization, prefill, and decode time.
- Support four input types: prompt, embedding, token, and multimodal.
- Add PC-based simulation accuracy testing and inference interface support for rkllm-toolkit.
- Add gdq algorithm to improve 4-bit quantization accuracy.
- Add mixed quantization algorithm, supporting a combination of grouped and non-grouped quantization based on specified ratios.
- Add support for models such as Llama3, Gemma2, and MiniCPM3.
- Resolve catastrophic forgetting issue when the number of tokens exceeds max_context.
For older versions, please refer to the CHANGELOG.
