# LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces
Abstract | Results | Quick Start | Extensions | Metrics | Citation
Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, which makes it difficult to rigorously evaluate long-horizon planning and execution. To address these gaps, we introduce LongCLI-Bench, a benchmark designed to evaluate agentic capabilities across realistic long-horizon tasks. We curated 20 high-quality tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from scratch, feature addition, bug fixing, and refactoring. We propose a dual-set testing protocol that measures requirement fulfillment (fail->pass) and regression avoidance (pass->pass), with step-level scoring to pinpoint execution failures. Experiments show that even state-of-the-art agents achieve pass rates below 20% on LongCLI-Bench, and most tasks stall at less than 30% completion. While self-correction yields marginal gains, human-agent collaboration through plan injection and interactive guidance offers significantly larger improvements.
Requirements:

- Docker
- Python >= 3.12
- uv
```bash
cd longcli-bench
uv run pip install -e .
cd longcli_dockerImage
docker build -f Dockerfile.make-pytest-base -t tb/make-pytest:v0 .
docker build -f Dockerfile.c-env-base -t tb/c-env:v0 .
tb run --help
uv run tb tasks interact -t pytest_pytest_example --tasks-dir tasks_long_cli_example --include-all
```

If you can enter the interactive container successfully, the CLI, Docker Compose, and base-image workflow are set up correctly.
```bash
LLM_BASE_URL="<base_url>" LLM_API_KEY="<api_key>" \
tb run \
  --agent <agent_name> \
  --model <model_name> \
  --task-id <task_id> \
  --dataset-path <dataset_path> \
  --run-id <output_dir> \
  --n-attempts N \
  --give-test-output M
```

Common flags:

- `--agent`: e.g., `codex`, `cursor_cli`, `grok_cli`, `terminus_2`
- `--model`: e.g., `gpt-5`
- `--dataset-path`: task directory, e.g., `tasks_long_cli`
- `--run-id`: output run name (stored under `runs/<run-id>` by default)
- `--n-attempts`: number of independent attempts (used for pass@k)
- `--give-test-output`: number of self-correction turns within each attempt
```bash
LLM_BASE_URL="<base_url>" LLM_API_KEY="<api_key>" \
python scripts_python/longcli_run_batch.py \
  --agent-model-pair codex,gpt-5.1-codex-max \
  --task-id 61810_cow \
  --tasks-dir tasks_long_cli \
  --output-path runs_long_cli \
  --exp-setting 1,3 \
  --on-existing skip
```

Alternatively, modify the parameters in `scripts_python/longcli_run_batch.py`, then run:

```bash
python scripts_python/longcli_run_batch.py
```

After batch runs finish (for example under `runs_long_cli`), aggregate metrics with:
```bash
python scripts_python/longcli_aggregate_results.py \
  --input-dirs <input_runs_output_dir> \
  --tasks-dir tasks_long_cli \
  --output-json <summary_json_file> \
  --output-csv <summary_csv_file> \
  --tables-dir <table_output_dir>
```

Example:
```bash
python scripts_python/longcli_aggregate_results.py \
  --input-dirs runs_long_cli \
  --tasks-dir tasks_long_cli \
  --output-json long_cli_summary.json \
  --output-csv long_cli_summary.csv \
  --tables-dir .
```

This computes and exports key benchmark results, including:
- pass metrics for `F2P`, `P2P`, and combined `all_is_pass`
- step scores (`f2p_step_score`, `p2p_step_score`)
- self-correction and multi-attempt summaries (e.g., pass@k-related aggregates)

Main outputs:

- `long_cli_summary.json`
- `long_cli_summary.csv`
- `table1_overall.csv`
- `table2_finegrained.csv`
- `table3_selfcorr.csv`
- missing-coverage reports (`long_cli_missing_report.json` / `.csv`)
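How the dual-set pass metrics combine into `all_is_pass` can be sketched as follows. This is an illustrative assumption based on the benchmark's description, not the actual code from `scripts_python/longcli_aggregate_results.py`:

```python
def all_is_pass(f2p_pass: bool, p2p_pass: bool) -> bool:
    """Combined task-level success: the new requirements must pass (F2P)
    and the pre-existing tests must keep passing (P2P, regression check)."""
    return f2p_pass and p2p_pass
```

Under this reading, an agent that implements the new feature but breaks an existing test scores a pass on `F2P` only, and the task as a whole counts as failed.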
- Fine-grained scoring with task-level `is_pass` and step-level `step_score`.
- Dual test sets: `Fail->Pass (F2P)` and `Pass->Pass (P2P)`.
- Two parsing modes for test results: native `pytest` parsing and text-based parsing (custom score text).
- Environment variable `TB_SAVE_APP_RESULT`: whether to save `/app` from inside the container to the output directory (`1`/`0`).
- Environment variable `TB_SKIP_AGENT`: whether to skip agent execution (`1`/`0`, useful for task/script debugging).
- Multi-attempt evaluation via `--n-attempts N`, including `pass@k` statistics.
- Multi-turn self-correction within each attempt via `--give-test-output M`.
Task categories:

- `from_scratch` (0 -> 1): build a project from scratch.
- `feature_add` (N -> N+1): add new functionality to an existing repository.
- `bug_fix` (No -> Yes): locate and fix complex bugs.
- `project_refactor` (A -> A'): optimize/refactor code without changing external behavior.
Test sets:

- `F2P (Fail->Pass)`: whether new requirements are correctly implemented.
- `P2P (Pass->Pass)`: whether existing functionality remains intact (regression check).
Core metrics:

- `is_pass`: binary task-level success/failure.
- `step_score`: step-level completion percentage.
- `time`: execution time.
- `token cost`: token usage.
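`step_score` can be read as the fraction of graded steps an agent completes. A minimal sketch, assuming each step is graded pass/fail (the benchmark's actual step weighting may differ):

```python
def step_score(step_results: list[bool]) -> float:
    """Step-level completion percentage over graded checkpoint steps."""
    if not step_results:
        return 0.0
    return 100.0 * sum(step_results) / len(step_results)
```

This is the granularity behind the abstract's observation that most tasks stall below 30% completion even when `is_pass` is already 0.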
Common reporting strategies:

- `pass@1`, `pass@3`
- multi-turn self-correction using test feedback
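`pass@k` over `--n-attempts N` independent attempts is commonly computed with the unbiased estimator of Chen et al. (2021). A sketch follows; this is the standard formula, not necessarily the exact code in `longcli_aggregate_results.py`:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: of n total attempts, c passed.
    Returns the probability that at least one of k sampled attempts passes."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with all failures
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with `--n-attempts 3`, per-task `pass@1` is `pass_at_k(3, c, 1)` where `c` is the number of passing attempts, averaged across tasks.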
```bibtex
@misc{feng2026longclibenchpreliminarybenchmarkstudy,
  title={LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces},
  author={Yukang Feng and Jianwen Sun and Zelai Yang and Jiaxin Ai and Chuanhao Li and Zizhen Li and Fanrui Zhang and Kang He and Rui Ma and Jifan Lin and Jie Sun and Yang Xiao and Sizhuo Zhou and Wenxiao Wu and Yiming Liu and Pengfei Liu and Yu Qiao and Shenglin Zhang and Kaipeng Zhang},
  year={2026},
  eprint={2602.14337},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2602.14337},
}
```

This project is built on and extends terminal-bench. LongCLI-Bench is a derivative work with task, data, and evaluation extensions for long-horizon CLI agent benchmarking.
This repository is licensed under Apache-2.0.
Some task folders may include third-party components under their own licenses; see the corresponding subdirectory LICENSE files.


