Evaluating LLMs for Efficient Code Generation


🚀 LLM-oriented code efficiency evaluation requires:

  • Performance-exercising tasks & inputs -- "all complexities are equal when N is small"
  • A meaningful compound metric -- average speedup is ill-suited to multi-task evaluation
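The "small N" point can be made concrete with an illustrative (not EvalPerf-sourced) task: two correct duplicate-detection solutions are nearly indistinguishable on a 10-element input but differ by three orders of magnitude on a 10,000-element one. Deterministic operation counters stand in for hardware instruction counts here:

```python
# Illustrative only: two correct solutions with different complexities.
# `ops` is a deterministic stand-in for an instruction counter.

def contains_dup_quadratic(xs):
    """O(n^2): compare every pair."""
    ops = 0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            ops += 1
            if xs[i] == xs[j]:
                return True, ops
    return False, ops

def contains_dup_linear(xs):
    """O(n): hash-set membership."""
    ops = 0
    seen = set()
    for x in xs:
        ops += 1
        if x in seen:
            return True, ops
        seen.add(x)
    return False, ops

for n in (10, 10_000):
    xs = list(range(n))  # worst case: no duplicates
    print(n, contains_dup_quadratic(xs)[1], contains_dup_linear(xs)[1])
# n=10: 45 vs 10 ops -- hard to separate under timing noise;
# n=10,000: 49,995,000 vs 10,000 ops -- unambiguous.
```

This is why performance-exercising inputs must be large enough to separate complexity classes before efficiency can be measured at all.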

🛍️ Based on our methodology, the EvalPerf dataset (current version 20240328) includes:

  • 118 performance-exercising tasks
  • Each task is equipped with a computationally challenging test input generated by the Synthesizing-a-Synthesizer (SaS) generator
  • Differential Performance Score (DPS): "DPS=80" means the submission can outperform or match 80% of LLM solutions

🦾 The reliability of EvalPerf comes from:

  • Correctness ablation: Pairwise comparison of LLMs' code efficiency over common passing tasks
  • Anti-flakiness: (1) long computation -> low runtime variation (Paper Fig. 6); (2) #instructions as the primitive metric; & (3) DPS compares the given solution with reference solutions on the same testbed. -- Together, these lead to low cross-platform variation (Paper Tab. 2)
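As a sketch of the "#instructions as primitive metric" idea: on Linux, retired instruction counts can be read through the perf CLI, which is why the setup step below lowers perf_event_paranoid. The helper below is hypothetical (EvalPerf's actual profiling harness lives inside the evalplus package); it only illustrates the kind of command involved:

```python
# Hypothetical helper: build a Linux `perf` command that counts retired
# user-space instructions of a workload. Instruction counts vary far less
# across runs and machines than wall-clock time does.
def perf_instructions_cmd(workload: list[str]) -> list[str]:
    # Requires kernel.perf_event_paranoid <= 0 (see the setup command below).
    return ["perf", "stat", "-e", "instructions:u", "--"] + workload

print(" ".join(perf_instructions_cmd(["python3", "solution.py"])))
```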
Check out our COLM'24 poster and the latest experimental configurations for more details!
          
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" --backend vllm
          
Recommended comparison format:

Win-rate Leaderboard

📊 Ranking metrics: WR (Win-Rate; %) based on task- and model-wise competition (i.e., pairwise DPS).

📝 Notes: the default prompt does not emphasize efficiency requirements, as our work shows that such emphasis can degrade both efficiency and correctness for some weaker models. Yet, "(⏩)" marks models evaluated with performance-encouraging prompts, as they may be able to accurately understand such requirements.
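A rough sketch of how a win-rate can be derived from pairwise DPS is below. The scoring rule (a win counts 1, a tie 0.5) is an assumption for illustration, not necessarily EvalPerf's exact aggregation:

```python
# Hypothetical win-rate from pairwise DPS.
# `pairwise[a][b]` = (DPS of a, DPS of b) over tasks both models pass.
def win_rate(model: str, pairwise: dict) -> float:
    wins = ties = games = 0
    for _, (dps_self, dps_other) in pairwise[model].items():
        games += 1
        if dps_self > dps_other:
            wins += 1
        elif dps_self == dps_other:
            ties += 1
    # Assumed convention: a tie is worth half a win.
    return 100.0 * (wins + 0.5 * ties) / games

pairwise = {"A": {"B": (80.0, 70.0), "C": (60.0, 60.0)}}
print(win_rate("A", pairwise))  # → 75.0 (one win, one tie)
```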


🏪 The detailed model generation data and results are available in our GitHub Pages repository.

💸 We use 50 samples (half) for the o1 model series to save cost; strong models also need fewer tries to sample the desired number of correct solutions.


Heatmap of Pairwise DPS Comparison

What's DPS? Differential Performance Score (DPS) is a LeetCode-inspired metric that shows the overall code efficiency ranking percentile (0-100%) of LLM-generated code. For example, "DPS=80" means the LLM's submissions can outperform or match 80% of LLM solutions.
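The percentile idea can be sketched as follows. Note this is a simplification: the real DPS is computed against a curated set of reference solutions as described in the paper, whereas this sketch just takes a flat percentile over raw instruction counts:

```python
# Simplified sketch of DPS as a percentile.
# `cost` and `ref_costs` are #instruction counts (lower = faster),
# all measured on the same challenging test input.
def dps(cost: int, ref_costs: list[int]) -> float:
    # Fraction of reference solutions the submission outperforms or matches.
    matched = sum(1 for c in ref_costs if cost <= c)
    return 100.0 * matched / len(ref_costs)

print(dps(120, [100, 150, 200, 300, 500]))  # → 80.0: beats/matches 4 of 5
```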

💡 Tip: hover over the heatmap to see the detailed DPS of the two compared models.

Adding and visualizing new model results?


git clone [email protected]:evalplus/evalplus.github.io.git
cd evalplus.github.io && git pull
cp ${PATH_TO}/${MODEL}_temp_1.0_evalperf_results.brief.json results/evalperf
python results/evalperf/stats.py && python -m http.server 8000
# Open the displayed address in your browser

          
@inproceedings{evalperf,
  title = {Evaluating Language Models for Efficient Code Generation},
  author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=IBCBMeAhmC},
}
          

We thank the OpenAI Researcher Access Program for providing part of the compute.
