# LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces
Abstract | Results | Quick Start | Extensions | Metrics | Citation
Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, which makes it difficult to rigorously evaluate long-horizon planning and execution. To address these gaps, we introduce LongCLI-Bench, a benchmark designed to evaluate agentic capabilities across realistic long-horizon tasks. We curated 20 high-quality tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from scratch, feature addition, bug fixing, and refactoring. We propose a dual-set testing protocol that measures requirement fulfillment (fail->pass) and regression avoidance (pass->pass), with step-level scoring to pinpoint execution failures. Experiments show that even state-of-the-art agents achieve pass rates below 20% on LongCLI-Bench, and most tasks stall at less than 30% completion. While self-correction yields marginal gains, human-agent collaboration through plan injection and interactive guidance offers significantly larger improvements.
Requirements:

- Docker
- Python >= 3.12
- uv
```bash
cd longcli-bench
uv run pip install -e .
cd longcli_dockerImage
docker build -f Dockerfile.make-pytest-base -t tb/make-pytest:v0 .
docker build -f Dockerfile.c-env-base -t tb/c-env:v0 .
tb run --help
uv run tb tasks interact -t pytest_pytest_example --tasks-dir tasks_long_cli_example --include-all
```

If you can enter the interactive container successfully, the CLI, Docker Compose, and base-image workflow are set up correctly.
```bash
LLM_BASE_URL="<base_url>" LLM_API_KEY="<api_key>" \
tb run \
  --agent <agent_name> \
  --model <model_name> \
  --task-id <task_id> \
  --dataset-path <dataset_path> \
  --run-id <output_dir> \
  --n-attempts N \
  --give-test-output M
```

Common flags:

- `--agent`: e.g., `codex`, `cursor_cli`, `grok_cli`, `terminus_2`
- `--model`: e.g., `gpt-5`
- `--dataset-path`: task directory, e.g., `tasks_long_cli`
- `--run-id`: output run name (stored under `runs/<run-id>` by default)
- `--n-attempts`: number of independent attempts (used for pass@k)
- `--give-test-output`: number of self-correction turns within each attempt
```bash
LLM_BASE_URL="<base_url>" LLM_API_KEY="<api_key>" \
python scripts_python/longcli_run_batch.py \
  --agent-model-pair codex,gpt-5.1-codex-max \
  --task-id 61810_cow \
  --tasks-dir tasks_long_cli \
  --output-path runs_long_cli \
  --exp-setting 1,3 \
  --on-existing skip
```

Alternatively, modify the parameters in `scripts_python/longcli_run_batch.py`, then run:

```bash
python scripts_python/longcli_run_batch.py
```

After batch runs finish (for example under `runs_long_cli`), aggregate metrics with:
```bash
python scripts_python/longcli_aggregate_results.py \
  --input-dirs <input_runs_output_dir> \
  --tasks-dir tasks_long_cli \
  --output-json <summary_json_file> \
  --output-csv <summary_csv_file> \
  --tables-dir <table_output_dir>
```

Example:
```bash
python scripts_python/longcli_aggregate_results.py \
  --input-dirs runs_long_cli \
  --tasks-dir tasks_long_cli \
  --output-json long_cli_summary.json \
  --output-csv long_cli_summary.csv \
  --tables-dir .
```

This computes and exports key benchmark results, including:
- pass metrics for `F2P`, `P2P`, and combined `all_is_pass`
- step scores (`f2p_step_score`, `p2p_step_score`)
- self-correction and multi-attempt summaries (e.g., pass@k-related aggregates)

Main outputs:

- `long_cli_summary.json`
- `long_cli_summary.csv`
- `table1_overall.csv`
- `table2_finegrained.csv`
- `table3_selfcorr.csv`
- missing-coverage reports (`long_cli_missing_report.json` / `.csv`)
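How the dual-set pass metrics combine into `all_is_pass` can be sketched as follows. This is an illustrative assumption based on the benchmark's description, not the actual code from `scripts_python/longcli_aggregate_results.py`:

```python
def all_is_pass(f2p_pass: bool, p2p_pass: bool) -> bool:
    """Combined task-level success: the new requirements must pass (F2P)
    and the pre-existing tests must keep passing (P2P, regression check)."""
    return f2p_pass and p2p_pass
```

Under this reading, an agent that implements the new feature but breaks an existing test scores a pass on `F2P` only, and the task as a whole counts as failed.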
- Fine-grained scoring with task-level `is_pass` and step-level `step_score`.
- Dual test sets: `Fail->Pass (F2P)` and `Pass->Pass (P2P)`.
- Two parsing modes for test results: native `pytest` parsing and text-based parsing (custom score text).
- Environment variable `TB_SAVE_APP_RESULT`: whether to save `/app` from inside the container to the output directory (`1`/`0`).
- Environment variable `TB_SKIP_AGENT`: whether to skip agent execution (`1`/`0`, useful for task/script debugging).
- Multi-attempt evaluation via `--n-attempts N`, including `pass@k` statistics.
- Multi-turn self-correction within each attempt via `--give-test-output M`.
Task categories:

- `from_scratch` (0 -> 1): build a project from scratch.
- `feature_add` (N -> N+1): add new functionality to an existing repository.
- `bug_fix` (No -> Yes): locate and fix complex bugs.
- `project_refactor` (A -> A'): optimize/refactor code without changing external behavior.
Test sets:

- `F2P (Fail->Pass)`: whether new requirements are correctly implemented.
- `P2P (Pass->Pass)`: whether existing functionality remains intact (regression check).
Core metrics:

- `is_pass`: binary task-level success/failure.
- `step_score`: step-level completion percentage.
- `time`: execution time.
- `token cost`: token usage.
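`step_score` can be read as the fraction of graded steps an agent completes. A minimal sketch, assuming each step is graded pass/fail (the benchmark's actual step weighting may differ):

```python
def step_score(step_results: list[bool]) -> float:
    """Step-level completion percentage over graded checkpoint steps."""
    if not step_results:
        return 0.0
    return 100.0 * sum(step_results) / len(step_results)
```

This is the granularity behind the abstract's observation that most tasks stall below 30% completion even when `is_pass` is already 0.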
Common reporting strategies:

- `pass@1`, `pass@3`
- multi-turn self-correction using test feedback
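`pass@k` over `--n-attempts N` independent attempts is commonly computed with the unbiased estimator of Chen et al. (2021). A sketch follows; this is the standard formula, not necessarily the exact code in `longcli_aggregate_results.py`:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: of n total attempts, c passed.
    Returns the probability that at least one of k sampled attempts passes."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with all failures
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with `--n-attempts 3`, per-task `pass@1` is `pass_at_k(3, c, 1)` where `c` is the number of passing attempts, averaged across tasks.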
```bibtex
@misc{feng2026longclibenchpreliminarybenchmarkstudy,
  title={LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces},
  author={Yukang Feng and Jianwen Sun and Zelai Yang and Jiaxin Ai and Chuanhao Li and Zizhen Li and Fanrui Zhang and Kang He and Rui Ma and Jifan Lin and Jie Sun and Yang Xiao and Sizhuo Zhou and Wenxiao Wu and Yiming Liu and Pengfei Liu and Yu Qiao and Shenglin Zhang and Kaipeng Zhang},
  year={2026},
  eprint={2602.14337},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2602.14337},
}
```

This project is built on and extends terminal-bench. LongCLI-Bench is a derivative work with task, data, and evaluation extensions for long-horizon CLI agent benchmarking.
This repository is licensed under Apache-2.0.
Some task folders may include third-party components under their own licenses; see the corresponding subdirectory LICENSE files.


