- We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization.
- Across five coding benchmarks, CodeScaler improves Qwen3-8B-Base by an average of +11.72 points, outperforming binary execution-based RL by +1.82 points, and enables scalable reinforcement learning on synthetic datasets without any test cases.
- At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit-test-based approaches while providing a 10× reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain but also in the general and reasoning domains.
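Test-time scaling with a reward model amounts to best-of-N reranking: sample N candidate programs and keep the one the reward model scores highest. A minimal sketch of that selection step (the `score` callable here is a hypothetical stand-in for a real reward-model forward pass, not the paper's exact pipeline):

```python
# Best-of-N reranking sketch: score N candidate programs with a reward
# model and return the highest-scoring one. `score` is a hypothetical
# stand-in for a real reward-model call.

def best_of_n(candidates, score):
    """Return the candidate whose reward-model score is highest."""
    return max(candidates, key=score)


# Toy scorer for illustration only: prefer the shorter program.
toy_score = lambda program: -len(program)

candidates = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    result = a + b\n    return result\n",
]
print(best_of_n(candidates, toy_score))
```

In the real setting, `score` would run each (question, program) pair through CodeScaler and read off the scalar reward; selection then avoids executing any candidate against unit tests, which is where the latency savings come from.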
- [2026-02] 🎉 We have released the CodeScaler paper on arXiv!
- [2026-02] 🎉 We have released the code, dataset, and models for CodeScaler!
- CodeScalerPair-51K: We construct high-quality preference data from on-policy training trajectories.
- We release CodeScaler at three scales: 1.7B, 4B, and 8B.
  - CodeScaler-1.7B: A reward model trained on CodeScalerPair-51K from Skywork/Skywork-Reward-V2-Qwen3-1.7B.
  - CodeScaler-4B: A reward model trained on CodeScalerPair-51K from Skywork/Skywork-Reward-V2-Qwen3-4B.
  - CodeScaler-8B: A reward model trained on CodeScalerPair-51K from Skywork/Skywork-Reward-V2-Qwen3-8B.
Step 1: Clone the repository

```shell
git clone https://github.com/LARK-AI-Lab/CodeScaler.git
cd CodeScaler
```

Step 2: Create a conda environment

```shell
conda create -n CodeScaler python==3.10.19
conda activate CodeScaler
```

Step 3: Install dependencies

```shell
pip install -r requirements.txt
```

Step 4: Install FlashAttention

```shell
pip install --no-cache-dir \
  https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

💡 Tip: You can also install a FlashAttention wheel matching your specific PyTorch and CUDA versions for optimal performance.
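After installation, a quick sanity check (standard library only; package names taken from the steps above) confirms the key dependencies are importable in the active environment:

```python
# Post-install sanity check: report whether the key packages installed
# above can be found in the active environment, without importing them.
import importlib.util

for name in ("torch", "transformers", "flash_attn"):
    found = importlib.util.find_spec(name) is not None
    print(f"{name}: {'found' if found else 'MISSING'}")
```

`find_spec` only locates the package on the import path, so this is cheap and will not trigger CUDA initialization; a `MISSING` line means the corresponding install step needs to be rerun.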
Prepare the training and evaluation datasets:

```shell
# Prepare training dataset
python data/prepare_deepcoder.py

# Download and prepare evaluation dataset
python data/download_dataset.py
python data/prepare_evaluation.py
```

💡 Tip: The training dataset is based on the DeepCoder training datasets, and evaluation includes multiple coding benchmarks.
Train Qwen3-8B-Base on the DeepCoder dataset using CodeScaler as the reward model:

```shell
# Login to Weights & Biases for experiment tracking
wandb login

# Start training
bash scripts/train.sh
```

💡 Tip: Check `scripts/train.sh` to customize hyperparameters such as learning rate, batch size, and training epochs.
Evaluate your trained model:

```shell
# Run evaluation on benchmarks
bash scripts/eval.sh
```

You can also use CodeScaler directly as an off-the-shelf reward model:

````python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = 'LARK-Lab/CodeScaler-8B'

tokenizer = AutoTokenizer.from_pretrained(model_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
reward_model.eval()

question = """\
Given an integer array nums and an integer k, return the total number of continuous subarrays whose sum equals k.
A subarray is a contiguous part of the array.
For example:
```
Input:
nums = [1, 1, 1], k = 2
Output:
2
```
"""

# Correct solution using the prefix-sum approach
program_correct = """\
from collections import defaultdict

def subarraySum(nums, k):
    prefix = 0
    count = 0
    freq = defaultdict(int)
    freq[0] = 1  # Important: counts subarrays starting from index 0
    for num in nums:
        prefix += num
        if prefix - k in freq:
            count += freq[prefix - k]
        freq[prefix] += 1
    return count
"""

# Incorrect solution using a sliding window (fails on negative numbers)
program_wrong = """\
def subarraySum(nums, k):
    left = 0
    curr_sum = 0
    count = 0
    for right in range(len(nums)):
        curr_sum += nums[right]
        while curr_sum > k and left <= right:
            curr_sum -= nums[left]
            left += 1
        if curr_sum == k:
            count += 1
    return count
"""

convs = [
    [
        {"role": "user", "content": question},
        {"role": "assistant", "content": program},
    ]
    for program in [program_correct, program_wrong]
]

texts = [
    tokenizer.apply_chat_template(conv, tokenize=False)
    for conv in convs
]

toks = tokenizer(
    texts,
    truncation=True,
    padding=True,
    max_length=2048,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = reward_model(
        input_ids=toks["input_ids"].to(device),
        attention_mask=toks["attention_mask"].to(device),
    )

scores = outputs.logits.squeeze(-1).cpu().tolist()
print("RM Scores:", scores)
# RM Scores: [6.5424089431762695, -0.0312652587890625]
````

If you find our work helpful, please consider citing:
```bibtex
@misc{zhu2026codescalerscalingcodellm,
      title={CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models},
      author={Xiao Zhu and Xinyu Zhou and Boyu Zhu and Hanxu Hu and Mingzhe Du and Haotian Zhang and Huiming Wang and Zhijiang Guo},
      year={2026},
      eprint={2602.17684},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.17684},
}
```
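As a side note on the quickstart example: the sliding-window solution is labeled incorrect because the window-shrinking step assumes prefix sums grow monotonically, which fails once the array contains negative numbers. A plain-Python check (the two functions below restate the quickstart's `program_correct` and `program_wrong` logic; no reward model is needed) makes the failure concrete:

```python
from collections import defaultdict

def subarray_sum_prefix(nums, k):
    # Correct: prefix-sum frequencies handle negative numbers.
    prefix, count = 0, 0
    freq = defaultdict(int)
    freq[0] = 1  # counts subarrays starting at index 0
    for num in nums:
        prefix += num
        count += freq[prefix - k]
        freq[prefix] += 1
    return count

def subarray_sum_window(nums, k):
    # Incorrect: shrinking only while curr_sum > k assumes non-negative elements.
    left = curr_sum = count = 0
    for right in range(len(nums)):
        curr_sum += nums[right]
        while curr_sum > k and left <= right:
            curr_sum -= nums[left]
            left += 1
        if curr_sum == k:
            count += 1
    return count

# Both agree on the all-positive example from the quickstart...
print(subarray_sum_prefix([1, 1, 1], 2), subarray_sum_window([1, 1, 1], 2))
# ...but diverge once negatives appear: the true answer here is 4.
print(subarray_sum_prefix([1, -1, 1, -1], 0), subarray_sum_window([1, -1, 1, -1], 0))
```

This is exactly the kind of subtle wrong answer that passes simple test inputs, which is why the quickstart uses the pair as a correct/incorrect preference example for the reward model.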
