
CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models


📊 Overview

*(Figure: overview of the CodeScaler models.)*

  • We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization.

  • Across five coding benchmarks, CodeScaler improves Qwen3-8B-Base by an average of +11.72 points, outperforming binary execution-based RL by +1.82 points, and enables scalable reinforcement learning on synthetic datasets without any test cases.

  • At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit-test-based approaches while reducing latency by 10×. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain but also in the general and reasoning domains.
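The syntax-aware extraction step can be pictured with a minimal sketch (our illustration of the idea, not the repository's actual implementation): pull the fenced code block out of a model response and accept it only if it parses, so that syntactically invalid code never earns reward.

```python
import ast
import re

def extract_valid_python(response: str):
    """Illustrative sketch of syntax-aware code extraction: take the first
    fenced code block from a model response (falling back to the raw text)
    and keep it only if it parses as valid Python."""
    match = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    code = match.group(1) if match else response
    try:
        ast.parse(code)
    except SyntaxError:
        return None  # invalid code is rejected, preserving reward validity
    return code
```

The helper name `extract_valid_python` is hypothetical; see the repository code for the actual extraction and reward-shaping logic.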

News

  • [2026-02] 🎉 We have released the CodeScaler paper on arXiv!

  • [2026-02] 🎉 We have released the code, datasets, and models for CodeScaler!

📚 Datasets

  • CodeScalerPair-51K: We construct high-quality preference data from on-policy training trajectories.

🤖 Models

We release CodeScaler at three scales: 1.7B, 4B, and 8B.

  • CodeScaler-1.7B: A reward model trained on CodeScalerPair-51K from Skywork/Skywork-Reward-V2-Qwen3-1.7B.

  • CodeScaler-4B: A reward model trained on CodeScalerPair-51K from Skywork/Skywork-Reward-V2-Qwen3-4B.

  • CodeScaler-8B: A reward model trained on CodeScalerPair-51K from Skywork/Skywork-Reward-V2-Qwen3-8B.

🚀 Quick Start

⚙️ Environment Setup

Step 1: Clone the repository

git clone https://github.com/LARK-AI-Lab/CodeScaler.git
cd CodeScaler

Step 2: Create a conda environment

conda create -n CodeScaler python==3.10.19
conda activate CodeScaler

Step 3: Install dependencies

pip install -r requirements.txt

Step 4: Install FlashAttention

pip install --no-cache-dir \
  https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/\
flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

💡 Tip: You can also install a FlashAttention wheel that matches your specific PyTorch and CUDA versions for optimal performance.
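If the prebuilt wheel above does not match your environment, a quick stdlib check (a sketch; the tags follow the wheel's filename convention) shows what your interpreter reports, so you can pick the matching wheel from the flash-attention releases page:

```python
import sys
import platform

# The wheel in Step 4 targets cp310 (CPython 3.10), linux_x86_64,
# CUDA 12, and torch 2.6. Print this interpreter's Python tag and
# platform to compare against a candidate wheel's filename.
py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
print(py_tag, platform.system().lower(), platform.machine())
```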

📦 Data Preparation

Prepare the training and evaluation datasets:

# Prepare training dataset
python data/prepare_deepcoder.py

# Download and prepare evaluation dataset
python data/download_dataset.py
python data/prepare_evaluation.py

💡 Tip: The training dataset is based on DeepCoder training datasets, and evaluation includes multiple coding benchmarks.

🏋️ Training

Train Qwen3-8B-Base on DeepCoder dataset using CodeScaler as reward model:

# Login to Weights & Biases for experiment tracking
wandb login

# Start training
bash scripts/train.sh

💡 Tip: Check scripts/train.sh to customize hyperparameters such as learning rate, batch size, and training epochs.

📈 Evaluation

Evaluate your trained model:

# Run evaluation on benchmarks
bash scripts/eval.sh

💻 Use CodeScaler for RM Scoring

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = 'LARK-Lab/CodeScaler-8B'

tokenizer = AutoTokenizer.from_pretrained(model_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
reward_model.eval()

question = """\
Given an integer array nums and an integer k, return the total number of contiguous subarrays whose sum equals k.
A subarray is a contiguous part of the array.
For example:
```
Input:
nums = [1, 1, 1], k = 2

Output:
2
```
"""

# Correct solution using prefix sum approach
program_correct = """\
from collections import defaultdict

def subarraySum(nums, k):
    prefix = 0
    count = 0
    freq = defaultdict(int)
    freq[0] = 1  # Important: subarray starting from index 0

    for num in nums:
        prefix += num

        if prefix - k in freq:
            count += freq[prefix - k]

        freq[prefix] += 1

    return count
"""

# Incorrect solution using sliding window (fails on negative numbers)
program_wrong = """\
def subarraySum(nums, k):
    left = 0
    curr_sum = 0
    count = 0

    for right in range(len(nums)):
        curr_sum += nums[right]

        while curr_sum > k and left <= right:
            curr_sum -= nums[left]
            left += 1

        if curr_sum == k:
            count += 1

    return count
"""


convs = [
    [
        {
            "content": question,
            "role": "user",
        },
        {
            "role": "assistant",
            "content": program
        }
    ] for program in [program_correct, program_wrong]
]


texts = [
    tokenizer.apply_chat_template(conv, tokenize=False)
    for conv in convs
]

toks = tokenizer(
    texts,
    truncation=True,
    padding=True,
    max_length=2048,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = reward_model(
        input_ids=toks["input_ids"].to(device),
        attention_mask=toks["attention_mask"].to(device),
    )
    scores = outputs.logits.squeeze(-1).cpu().tolist()


print("RM Scores:", scores)
# RM Scores: [6.5424089431762695, -0.0312652587890625]
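The same scores can drive best-of-n test-time selection: sample several candidate programs, score each with the reward model, and keep the highest-scoring one. A minimal sketch (`best_of_n` is an illustrative helper, not part of the released API):

```python
def best_of_n(candidates, scores):
    """Return the candidate whose reward-model score is highest."""
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_idx]

# Example with the scores printed above: the correct prefix-sum program
# (score ~6.54) is selected over the sliding-window one (score ~-0.03).
best = best_of_n(["prefix_sum_solution", "sliding_window_solution"],
                 [6.54, -0.03])
print(best)  # prefix_sum_solution
```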

Citation

If you find our work helpful, please consider citing:

@misc{zhu2026codescalerscalingcodellm,
      title={CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models}, 
      author={Xiao Zhu and Xinyu Zhou and Boyu Zhu and Hanxu Hu and Mingzhe Du and Haotian Zhang and Huiming Wang and Zhijiang Guo},
      year={2026},
      eprint={2602.17684},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.17684}, 
}

About

The official repo for "CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models"
