chaliy commented Feb 9, 2026

Summary

  • Add 12 new eval scenarios (25 → 37 tasks): 6 JSON processing tasks (config merge, NDJSON aggregation, schema migration, JSON→CSV, package.json update, group-by aggregation) and 6 coverage gap-fillers (dedup merge, multi-file replace, health check, column transform, release notes, CSV join)
  • Remove tool-steering from all scenario prompts — describe the task, not which tool to use
  • Rename tool-based IDs (e.g. jq_config_merge → json_config_merge, text_sed_config → text_hostname_replace) and category jq_mastery → json_processing
  • Run evals on expanded dataset: Haiku 4.5 passes 32/37 (95%), GPT-5.2 passes 23/37 (80%)

Test plan

  • JSONL validated (all 37 lines parse, unique IDs, required fields present)
  • cargo build passes
  • cargo test -p bashkit-eval passes
  • Eval run completed for Haiku 4.5 and GPT-5.2 with results saved
  • README updated with new results and per-scenario breakdown
  • Spec 012-eval category table updated

Add 6 jq_mastery scenarios:
- jq_config_merge: deep-merge two JSON config files
- jq_log_ndjson: aggregate errors from NDJSON logs by service
- jq_reshape_api: transform API records between schema versions
- jq_json_to_csv: convert JSON array to CSV with headers
- jq_package_update: programmatically update package.json fields
- jq_group_aggregate: group_by + sum aggregation (SQL-like)

Add 6 scenarios covering other gaps:
- pipe_dedup_merge: merge and deduplicate sorted lists
- text_multifile_replace: rename function across multiple files
- script_health_check: multi-condition validation script using jq
- data_column_transform: TSV-to-CSV column reorder with awk
- complex_release_notes: parse conventional commits into changelog
- data_csv_join: join two CSVs on shared key column

Total scenarios: 25 → 37

https://claude.ai/code/session_01UvoaXveMPrqy3BHSNgJQpG
Prompts should describe the task, not prescribe the tool. Removes
"Use jq", "Use awk", "using sed", etc. from all scenario prompts
(both pre-existing and new). Renames tool-based IDs:

- text_grep_extract → text_log_error_count
- text_sed_config → text_hostname_replace
- text_awk_report → text_csv_revenue
- jq_nested_transform → json_nested_names
- jq_api_response → json_api_pagination
- jq_config_merge → json_config_merge
- jq_log_ndjson → json_ndjson_error_aggregate
- jq_reshape_api → json_api_schema_migration
- jq_json_to_csv → json_to_csv_export
- jq_package_update → json_package_update
- jq_group_aggregate → json_order_totals

Renames category jq_mastery → json_processing. Updates spec table.

https://claude.ai/code/session_01UvoaXveMPrqy3BHSNgJQpG
Haiku 4.5: 32/37 passed (95%), 81% tool success
GPT-5.2: 23/37 passed (80%), 71% tool success
Opus 4.6: rate-limited, skipped

Updates README with new results and per-scenario breakdown.

https://claude.ai/code/session_01UvoaXveMPrqy3BHSNgJQpG

Root cause: the Opus eval failed on all 37 tasks with a hidden 404
error. The error was wrapped by anyhow's .context(), so only "provider
chat failed" was visible. The actual error was a wrong model ID:
claude-opus-4-6-20250610 doesn't exist; the correct ID is claude-opus-4-6.
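
To illustrate why only "provider chat failed" was visible, here is a minimal std-only sketch of a context wrapper whose plain Display prints just the outer message, plus a helper that walks the `source()` chain and joins it with ": ", which is roughly what anyhow's `{:#}` alternate format prints. The types `ContextError` and `HttpError`, the helper `full_chain`, and the "model not found" detail text are hypothetical stand-ins, not code or logs from this PR.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical stand-in for an error wrapped with a context message.
#[derive(Debug)]
struct ContextError {
    context: String,
    cause: Box<dyn Error + 'static>,
}

impl fmt::Display for ContextError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Plain Display shows only the outermost context, hiding the cause.
        write!(f, "{}", self.context)
    }
}

impl Error for ContextError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.cause)
    }
}

/// Hypothetical provider error carrying the HTTP status.
#[derive(Debug)]
struct HttpError {
    status: u16,
    detail: String,
}

impl fmt::Display for HttpError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "HTTP {}: {}", self.status, self.detail)
    }
}

impl Error for HttpError {}

/// Joins an error and all of its causes with ": ".
fn full_chain(err: &dyn Error) -> String {
    let mut parts = vec![err.to_string()];
    let mut current = err.source();
    while let Some(cause) = current {
        parts.push(cause.to_string());
        current = cause.source();
    }
    parts.join(": ")
}
```

With a `ContextError` wrapping an `HttpError` with status 404, `to_string()` yields only the context message, while `full_chain` also surfaces the 404.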

Changes:
- Add exponential backoff retry (2s, 4s, 8s, 16s) for 429 and 529/5xx
  in both Anthropic and OpenAI providers
- Use {:#} format in runner error output to show full error chain
- Update spec non-goals to reflect retry support
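
The retry policy described above could be sketched as follows. This is not the providers' actual code: the names `is_retryable`, `backoff_delay`, `send_with_retry`, and `MAX_RETRIES` are hypothetical, and the request is modelled as a closure returning an HTTP status so the sketch stays synchronous and std-only. The sleep function is injected so the schedule can be checked without waiting.

```rust
use std::time::Duration;

/// Assumed retry budget: four retries give delays of 2s, 4s, 8s, 16s.
const MAX_RETRIES: u32 = 4;

/// Retryable statuses per the change list: 429 and any 5xx (which covers 529).
fn is_retryable(status: u16) -> bool {
    status == 429 || (500..=599).contains(&status)
}

/// Delay before retry number `attempt` (0-based), or None once retries are used up.
fn backoff_delay(attempt: u32) -> Option<Duration> {
    if attempt >= MAX_RETRIES {
        return None;
    }
    // 2 << 0 = 2, 2 << 1 = 4, 2 << 2 = 8, 2 << 3 = 16
    Some(Duration::from_secs(2u64 << attempt))
}

/// Calls `request` until it returns a non-retryable status or retries run out.
/// Returns the last status seen.
fn send_with_retry<F, S>(mut request: F, mut sleep: S) -> u16
where
    F: FnMut() -> u16,
    S: FnMut(Duration),
{
    let mut attempt = 0;
    loop {
        let status = request();
        if !is_retryable(status) {
            return status;
        }
        match backoff_delay(attempt) {
            Some(delay) => sleep(delay),
            None => return status,
        }
        attempt += 1;
    }
}
```

In real use the `sleep` argument would be something like `std::thread::sleep` (or an async timer in an async provider). A 404 such as the wrong-model-ID failure is not retried, so it surfaces immediately.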

https://claude.ai/code/session_01UvoaXveMPrqy3BHSNgJQpG
Opus 4.6: 29/37 passed (87%), 82% tool success, 25.2 min
Full 3-model comparison now in README with per-category and
per-new-scenario breakdown.

https://claude.ai/code/session_01UvoaXveMPrqy3BHSNgJQpG
chaliy merged commit e996872 into main on Feb 9, 2026
9 checks passed
chaliy deleted the claude/add-json-evals-scenarios-nDs8S branch on February 9, 2026 14:40