GitHub - howard-hou/VisualRWKV: VisualRWKV is the visual-enhanced version of the RWKV language model, enabling RWKV to handle various visual tasks.

VisualRWKV

Description

VisualRWKV is the visual-enhanced version of the RWKV language model, enabling RWKV to handle various visual tasks.

By utilizing a loosely coupled adapter design, visual capabilities can be effortlessly enhanced while preserving the performance of the RWKV language model. This approach allows for easy integration and interchangeability without compromising the core functionality of RWKV.

Usage

Installation

# clone project
git clone https://github.com/howard-hou/VisualRWKV
cd VisualRWKV

# [OPTIONAL] create conda environment
conda create -n myenv python=3.10
conda activate myenv

# install requirements
pip install -r requirements.txt

Download Checkpoint

RWKV models are here, please download LLM and its adapter.

Model Name	Link
RWKV-4 169M	RWKV-4 169M
RWKV-1b5 raven	RWKV-1b5 raven
Visual Adapter	adapted to RWKV-4 169M
Visual Adapter	adapted to RWKV-1b5 raven

Zero-shot Performance

Zero-shot Visual Question Answering

model	dataset	split	overall	other	yes/no	number
RWKV-1b5 raven	vqav2	validation	43.7	34.35	60.67	30.25
RWKV-4 169M	vqav2	validation	15.41	23.59	0.14	28.42

Zero-shot Image Captioning

model	dataset	split	Bleu_1	Bleu_2	Bleu_3	Bleu_4	METEOR	ROUGE_L	CIDEr	SPICE
RWKV-1b5 raven	coco	test	0.6911	0.5143	0.3652	0.2542	0.2376	0.5018	0.8658	0.1728
RWKV-1b5 raven	nocaps	val	0.6779	0.5027	0.3521	0.242	0.2062	0.4639	0.5988	0.0915
RWKV-4 169M	coco	test	0.6762	0.4957	0.3446	0.2332	0.2202	0.488	0.768	0.1562
RWKV-4 169M	nocaps	val	0.6561	0.4783	0.3261	0.2142	0.1918	0.4538	0.5184	0.0792

note: model is trained on coo_caption, so it is not zero-shot; but it is zero-shot on nocaps_caption

Example

import torch
import visualrwkv
import os
from PIL import Image
from visualrwkv.utlis import postprocess_response
from tqdm import tqdm

os.environ["RWKV_JIT_ON"] = "1"
os.environ["RWKV_CUDA_ON"] = "0"

device = "cuda" if torch.cuda.is_available() else "cpu"

# download checkpoint to your dir
adapter_path = "your_dir/rwkv169_coco-vg-sbu-cc_t5small_vit224.ckpt"
rwkv_path = "your_dir/RWKV-4-Pile-169M-20220807-8023.pth"
# now only support VisualRWKV-small, will support more models in the future
model, preprocess = visualrwkv.load(
    model_name="VisualRWKV-small", adapter_path=adapter_path, rwkv_path=rwkv_path
)

instruction = ["describe the image."]
max_new_tokens = 20  # use to control the length of the generated text
image = preprocess(Image.open("demo.png")).unsqueeze(0).to(device)
image_embs = model.adapter.forward_task_embs(image)
outputs = model.generate(
    image_embs, instruction=instruction, max_new_tokens=max_new_tokens
)
decoded = model.tokenizer.batch_decode(outputs, skip_special_tokens=True)
decoded = postprocess_response(decoded)
print(decoded) 
# ['couple in the water with a dog on the beach']

# test in cpu, 400 ms per image

Jun	JUL	Aug
	17
2022	2023	2024

README.md

VisualRWKV

Description

Usage

Installation

Download Checkpoint

Zero-shot Performance

Zero-shot Visual Question Answering

Zero-shot Image Captioning

Example

About

Releases

Packages

Contributors 2

Languages

License

howard-hou/VisualRWKV

Sign In Required

Launching GitHub Desktop

Launching GitHub Desktop

Launching Xcode

Launching Visual Studio Code

Latest commit

Git stats

Files

README.md

VisualRWKV

Description

Usage

Installation

Download Checkpoint

Zero-shot Performance

Zero-shot Visual Question Answering

Zero-shot Image Captioning

Example

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages