feat(eval): expand dataset to 37 tasks with JSON scenarios #185

chaliy · 2026-02-09T06:02:42Z

Summary

Add 12 new eval scenarios (25 → 37 tasks): 6 JSON processing tasks (config merge, NDJSON aggregation, schema migration, JSON→CSV, package.json update, group-by aggregation) and 6 coverage gap-fillers (dedup merge, multi-file replace, health check, column transform, release notes, CSV join)
Remove tool-steering from all scenario prompts — describe the task, not which tool to use
Rename tool-based IDs (e.g. jq_config_merge → json_config_merge, text_sed_config → text_hostname_replace) and category jq_mastery → json_processing
Run evals on expanded dataset: Haiku 4.5 passes 32/37 (95%), GPT-5.2 passes 23/37 (80%)

Test plan

JSONL validated (all 37 lines parse, unique IDs, required fields present)
cargo build passes
cargo test -p bashkit-eval passes
Eval run completed for Haiku 4.5 and GPT-5.2 with results saved
README updated with new results and per-scenario breakdown
Spec 012-eval category table updated

Add 6 jq_mastery scenarios:
- jq_config_merge: deep-merge two JSON config files
- jq_log_ndjson: aggregate errors from NDJSON logs by service
- jq_reshape_api: transform API records between schema versions
- jq_json_to_csv: convert JSON array to CSV with headers
- jq_package_update: programmatically update package.json fields
- jq_group_aggregate: group_by + sum aggregation (SQL-like)

Add 6 scenarios covering other gaps:
- pipe_dedup_merge: merge and deduplicate sorted lists
- text_multifile_replace: rename function across multiple files
- script_health_check: multi-condition validation script using jq
- data_column_transform: TSV-to-CSV column reorder with awk
- complex_release_notes: parse conventional commits into changelog
- data_csv_join: join two CSVs on shared key column

Total scenarios: 25 → 37

https://claude.ai/code/session_01UvoaXveMPrqy3BHSNgJQpG

Prompts should describe the task, not prescribe the tool. Removes "Use jq", "Use awk", "using sed", etc. from all scenario prompts (both pre-existing and new).

Renames tool-based IDs:
- text_grep_extract → text_log_error_count
- text_sed_config → text_hostname_replace
- text_awk_report → text_csv_revenue
- jq_nested_transform → json_nested_names
- jq_api_response → json_api_pagination
- jq_config_merge → json_config_merge
- jq_log_ndjson → json_ndjson_error_aggregate
- jq_reshape_api → json_api_schema_migration
- jq_json_to_csv → json_to_csv_export
- jq_package_update → json_package_update
- jq_group_aggregate → json_order_totals

Renames category jq_mastery → json_processing. Updates spec table.

https://claude.ai/code/session_01UvoaXveMPrqy3BHSNgJQpG

- Haiku 4.5: 32/37 passed (95%), 81% tool success
- GPT-5.2: 23/37 passed (80%), 71% tool success
- Opus 4.6: rate-limited, skipped

Updates README with new results and per-scenario breakdown.

https://claude.ai/code/session_01UvoaXveMPrqy3BHSNgJQpG

chatgpt-codex-connector · 2026-02-09T06:02:46Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

Root cause: the Opus eval failed on all 37 tasks with a hidden 404 error. The error was wrapped by anyhow .context(), so only "provider chat failed" was visible. The actual error was a wrong model ID (claude-opus-4-6-20250610 doesn't exist; the correct ID is claude-opus-4-6).

Changes:
- Add exponential backoff retry (2s, 4s, 8s, 16s) for 429 and 529/5xx in both Anthropic and OpenAI providers
- Use {:#} format in runner error output to show full error chain
- Update spec non-goals to reflect retry support

https://claude.ai/code/session_01UvoaXveMPrqy3BHSNgJQpG
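The retry policy described above can be sketched in plain Rust as follows. This is only an illustration under stated assumptions, not the code from this PR: the names `is_retryable`, `backoff_delay`, and `with_retry` are hypothetical, the provider call is modeled as a closure that returns an HTTP status code on failure, and sleeping is injected as a closure so the schedule can be checked without waiting.

```rust
use std::time::Duration;

/// Hypothetical helper: statuses worth retrying.
/// 429 is rate limiting; any 5xx (which includes 529) is a server-side failure.
fn is_retryable(status: u16) -> bool {
    status == 429 || (500..=599).contains(&status)
}

/// Delay before retry number `attempt` (0-based): 2s, 4s, 8s, 16s.
/// Returns `None` once the four retries are used up.
fn backoff_delay(attempt: u32) -> Option<Duration> {
    const MAX_RETRIES: u32 = 4; // assumption: one retry per listed delay
    if attempt < MAX_RETRIES {
        Some(Duration::from_secs(2u64 << attempt))
    } else {
        None
    }
}

/// Calls `send` until it succeeds, fails with a non-retryable status,
/// or the backoff schedule is exhausted. `sleep` is injected for testability.
fn with_retry<T>(
    mut send: impl FnMut() -> Result<T, u16>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, u16> {
    let mut attempt = 0;
    loop {
        match send() {
            Ok(value) => return Ok(value),
            Err(status) if is_retryable(status) => match backoff_delay(attempt) {
                Some(delay) => {
                    sleep(delay);
                    attempt += 1;
                }
                // Schedule exhausted: surface the last status to the caller.
                None => return Err(status),
            },
            // Non-retryable (e.g. 404 for an unknown model ID): fail immediately.
            Err(status) => return Err(status),
        }
    }
}
```

In a real run `sleep` would be something like `std::thread::sleep` (or an async timer in an async provider). Note that under this sketch a 404 is returned on the first attempt rather than retried, which is why a wrong model ID still needs the full error chain in the output to be diagnosable.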

Opus 4.6: 29/37 passed (87%), 82% tool success, 25.2 min

Full 3-model comparison now in README with per-category and per-new-scenario breakdown.

https://claude.ai/code/session_01UvoaXveMPrqy3BHSNgJQpG

claude added 3 commits February 9, 2026 05:14

claude added 2 commits February 9, 2026 14:32

chore(eval): add Opus 4.6 results, update README with all 3 models

3cbe925


chaliy merged commit e996872 into main Feb 9, 2026
9 checks passed

chaliy deleted the claude/add-json-evals-scenarios-nDs8S branch February 9, 2026 14:40
