GitHub - cbalint13/rvv-kernels: RISCV Vector Kernel C/LLVM-IR generator

High performance RVV kernel generator to C & LLVM-IR dialects

This is a C/LLVM-IR kernel generator that addresses RVV ISA versions unsupported by LLVM or other toolchains.

Benchmark

Usage

Prepare a docker image with rv64 cross compiler

$ git clone https://github.com/cbalint13/rvv-kernels
$ cd rvv-kernels
$ docker build --file Dockerfile.ML.fedora --tag th1520-rvv .

Generate a kernel

$ docker run -it --rm -v "$PWD":/opt/src th1520-rvv bash
[root@b8032fd28a75 src]# ./make.sh 32 4 int8 v0.7.1 cbalint@192.168.1.45

(x) Naive kernel:
  HEX = b0 28 00 00 b0 66 00 00 b0 a4 00 00 b0 e2 00 00
  O[] = 00010416 00026288 00042160 00058032

(x) MACC operations: elems[32] x lanes[4] = 256 Ops

(x) RVV kernel:
  HEX = b0 28 00 00 b0 66 00 00 b0 a4 00 00 b0 e2 00 00
  O[] = 00010416 00026288 00042160 00058032

RVV bench: 25.600 GOPS in 2.215818 secs
RVV speed: 11.553 GOPS/sec

[root@b8032fd28a75 src]# ls -l dot_int8_kernel.*
-rw-r--r-- 1 1000 1000 3867 Mar 13 18:03 dot_int8_kernel.c
-rw-r--r-- 1 1000 1000 5034 Mar 13 18:03 dot_int8_kernel.ir
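
The figures in the sample run can be cross-checked with a few lines of Python. This is only a sketch: it assumes the HEX line is a little-endian dump of four 32-bit outputs, and that each multiply-accumulate is counted as two operations. Neither assumption is stated in the output itself.

```python
import struct

# Byte dump copied from the "HEX =" line of the sample run.
hex_dump = "b0 28 00 00 b0 66 00 00 b0 a4 00 00 b0 e2 00 00"

# Assumption: four little-endian 32-bit integers, one per lane.
outputs = struct.unpack("<4i", bytes.fromhex(hex_dump))
print(outputs)  # (10416, 26288, 42160, 58032)

# Assumption: each multiply-accumulate counts as two operations.
elems, lanes = 32, 4
ops_per_call = elems * lanes * 2
print(ops_per_call)  # 256

# Throughput derived from the two "RVV bench" figures.
total_gops, seconds = 25.600, 2.215818
gops_per_sec = total_gops / seconds
print(f"{gops_per_sec:.3f}")  # 11.553
```

Under these assumptions the decoded values match the O[] line, and the ratio matches the reported "RVV speed".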

Optional benchmark logs & graph

[root@b8032fd28a75 src]# ./script/0-explore.sh
[root@b8032fd28a75 src]# ls -l benchmark-int8.log
-rw-r--r-- 1 1000 1000 5731 Mar 13 17:38 benchmark-int8.log

[root@b8032fd28a75 src]# ./script/1-plotgraph.py --logs benchmark-int8.log --title 'RVV v0.7.1 int8 kernels benchmark (TH1520)'
[root@b8032fd28a75 src]# ls -l benchmark-int8.log.png
-rw-r--r-- 1 1000 1000 58380 Mar 13 18:47 benchmark-int8.log.png

Notes

This generator emits C / LLVM-IR kernels with pre-encoded instructions (insn), which makes it RVV version agnostic
T-Head 1520 (C906, also others) implements the older RVV v0.7.1 vector ISA, which upstream LLVM does not support
The TH1520 vsetvli ASIC implementation is slow, see the comments on a dynamic kernel in trials/riscv-asm.c
Because vsetvli is slow, the kernels cannot rely on the SVE (scalable vector) concept and must avoid frequent vsetvli calls
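
To illustrate the idea of emitting encoded instructions, here is a minimal sketch of a generator helper that writes a raw 32-bit instruction word into C inline assembly, so the assembler never has to recognise the mnemonic. The helper name `emit_encoded_insn` is hypothetical, and the word used is the standard RISC-V NOP as a placeholder, not an RVV encoding. The repository's generator may structure this differently.

```python
def emit_encoded_insn(word: int, comment: str) -> str:
    """Return one C inline-asm statement carrying a pre-encoded 32-bit instruction."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("instruction word must fit in 32 bits")
    return f'__asm__ volatile(".word 0x{word:08x}"); /* {comment} */'


# Placeholder: 0x00000013 is the standard RISC-V NOP (addi x0, x0, 0).
# A real generator would place the encoded RVV instruction word here.
line = emit_encoded_insn(0x00000013, "nop (placeholder)")
print(line)
```

Because the compiler only sees a data word, the same approach works for any ISA revision, as long as the generator knows the encoding.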

The trials/riscv-asm.c sample kernel follows the SVE concept of runtime dynamism. However, for the reasons tested and noted there, on T-Head's C906 RVV ASIC implementation the context-switching vsetvli severely drags down overall performance, so vsetvli calls should be minimized for this particular target.
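
A back-of-envelope count shows why this matters. The sketch below compares a strip-mined loop, which reconfigures the vector length on every iteration, with a fixed-shape kernel that configures it once. The function names and sizes are illustrative only and do not come from the repository.

```python
import math


def vsetvli_calls_dynamic(n_elems: int, vlmax: int) -> int:
    """Strip-mined loop: one vsetvli per iteration to pick the next chunk length."""
    return math.ceil(n_elems / vlmax)


def vsetvli_calls_fixed(n_elems: int, vlmax: int) -> int:
    """Fixed-shape kernel: configure once, then run the same shape for every chunk."""
    if n_elems % vlmax != 0:
        raise ValueError("fixed kernel assumes a whole number of chunks")
    return 1


n, vlmax = 4096, 32  # illustrative sizes, not taken from the repository
print(vsetvli_calls_dynamic(n, vlmax))  # 128
print(vsetvli_calls_fixed(n, vlmax))    # 1
```

If each vsetvli is expensive on the target, the dynamic variant pays that cost once per chunk, while the fixed variant pays it once overall.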

For RVV 0.7.1 there is a limit on how and which vector registers can be used in the context of MUL (multiplier). The maximum vector fill width of 64 x int8, reduced into 2 lanes, is therefore not possible: it would require the e8/m4 MUL mode, which leaves room for only 4 vregs (v0, v8, v16, v24), an insufficient number of registers. The maximum usable width for RVV 0.7.1 is 32 int8 elements.
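
One possible reading of this register budget can be worked out numerically. The sketch below assumes a 128-bit vector register width and that the int8 products are widened to int16, which doubles the register-group size of the destination. Both are assumptions made for illustration, not statements from the repository.

```python
NUM_VREGS = 32   # architectural vector registers v0..v31
VLEN_BITS = 128  # assumption: vector register width of the target


def lmul_for(n_elems: int, elem_bits: int) -> int:
    """Registers per group needed to hold n_elems elements of elem_bits each."""
    total_bits = n_elems * elem_bits
    return max(1, total_bits // VLEN_BITS)


def group_bases(lmul: int) -> list[str]:
    """Register groups must start at a register number that is a multiple of LMUL."""
    return [f"v{i}" for i in range(0, NUM_VREGS, lmul)]


for n in (32, 64):
    src = lmul_for(n, 8)   # int8 sources
    dst = lmul_for(n, 16)  # assumption: products widen to int16
    print(n, f"m{src}", f"m{dst}", group_bases(dst))
```

Under these assumptions, 64 int8 elements need m4 sources and m8 widened results, leaving only the four groups v0, v8, v16 and v24, while 32 elements leave eight groups to work with.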

The generated kernel calls vsetvli once and unrolls the computation across the vector registers.
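
However the computation is unrolled, its result should agree with a plain scalar loop, as the matching "Naive kernel" and "RVV kernel" outputs in the sample run suggest. The following is a minimal reference model of a multi-lane int8 dot product. The operand layout (one shared input vector, one weight row per lane) and the test data are assumptions chosen for illustration, not the layout used by rvv-bench.c.

```python
def dot_int8_reference(a, b_lanes):
    """Return O[lane] = sum(a[i] * b_lanes[lane][i]) for int8 inputs."""
    for v in list(a) + [x for row in b_lanes for x in row]:
        if not -128 <= v <= 127:
            raise ValueError("inputs must fit in int8")
    for row in b_lanes:
        if len(row) != len(a):
            raise ValueError("every lane must have as many elements as a")
    # Python integers do not overflow, standing in for wide accumulators.
    return [sum(x * y for x, y in zip(a, row)) for row in b_lanes]


a = list(range(32))                         # 32 elements
b = [[lane + 1] * 32 for lane in range(4)]  # 4 lanes
print(dot_int8_reference(a, b))  # [496, 992, 1488, 1984]
```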

Changelog

13 Mar 2024 initial release, for now int8 with RVV 0.7.1 version
