
# ToolACE

Fine-tuning LLMs for function calling, evaluated on BFCL v4.

Stack: Unsloth (LoRA training) | vLLM (serving) | bfcl-eval (benchmarks)

## Results

Best setup: `qwen3-32b-raw-toolace-nofc` with 84.59% overall accuracy (3,491 samples)

| Model | NON_LIVE | LIVE | Overall |
|---|---|---|---|
| qwen3-32b-raw-toolace-nofc | 92.50% | 80.23% | 84.59% |
| Qwen3-4B-Thinking-2507-fc | 92.18% | 79.25% | 83.84% |
| Qwen3-4B-Thinking-2507-toolace | 92.10% | 79.25% | 83.82% |
| Qwen3-4B-no-Thinking-2507-fc | 91.61% | 79.30% | 83.67% |
| ToolACE-2.5-Llama-3.1-8B-nofc | 91.05% | 78.63% | 83.04% |

See `analysis.ipynb` for the full breakdown.

## Repository Structure

| Path | Description |
|---|---|
| `toolace_converter.py` | Converts the ToolACE dataset to OpenAI chat format |
| `qwen/tune-*.ipynb` | Fine-tuning notebooks |
| `*/serve-*.yaml` | vLLM server configs |
| `eval.py` | Evaluation script |
| `outputs/` | Cached BFCL results |
| `analysis.ipynb` | Results analysis and visualizations |

## Quick Start

```bash
# Serve
vllm serve --config qwen/serve-qwen3-32b-lora-raw-nofc.yaml

# Evaluate
python eval.py --model qwen3-32b-raw-toolace-nofc --subset python
```

## Training Methodology

### Approaches Tested

1. **Vanilla**: direct training on ToolACE using Unsloth chat templates
2. **Parsed**: ToolACE converted to the OpenAI `tool_call` format
3. **Raw + No-FC** (best): train on raw ToolACE, serve without a function-calling parser
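As an illustration of what the "Parsed" step involves, here is a minimal sketch of converting ShareGPT-style turns (the `from`/`value` field names are an assumption, not necessarily the exact Team-ACE/ToolACE schema) into OpenAI chat messages:

```python
# Hypothetical conversion sketch: map ShareGPT-style turns to OpenAI chat
# messages. Field names ("from", "value") are assumed, not the exact schema.
def to_openai_messages(conversations):
    role_map = {"system": "system", "user": "user",
                "assistant": "assistant", "tool": "tool"}
    return [{"role": role_map[t["from"]], "content": t["value"]}
            for t in conversations]

example = [
    {"from": "user", "value": "What's the weather in Paris?"},
    {"from": "assistant", "value": '[get_weather(city="Paris")]'},
]
print(to_openai_messages(example))
```

The real `toolace_converter.py` additionally has to handle tool definitions and tool-response turns; this only shows the role/content mapping.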

### Training Details

From `qwen/tune-qwen3-32b-toolace-raw.ipynb`:

- Base model: `unsloth/Qwen3-32B` (4-bit quantized)
- Dataset: `Team-ACE/ToolACE` (95/5 train/test split)
- LoRA: r=16, alpha=32, trainable params: 134M (0.41%), following Unsloth's recommended defaults
- Batch size: 16 (the largest that fit in memory), epochs: 3, steps: 2,013
- Optimizer: AdamW 8-bit, LR: 2e-4
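A quick sanity check on the reported trainable fraction (the total parameter count of Qwen3-32B is assumed to be roughly 32.8B):

```python
# Verify that 134M LoRA parameters on an assumed ~32.8B-parameter base
# works out to the reported ~0.41% trainable fraction.
trainable = 134e6   # reported trainable LoRA parameters
total = 32.8e9      # assumed total parameters of Qwen3-32B
pct = 100 * trainable / total
print(f"{pct:.2f}%")  # 0.41%
```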

## Thinking vs Non-Thinking

Experiments comparing thinking-enabled models:

| Model | Config | Accuracy |
|---|---|---|
| Qwen3-4B-Thinking-2507 | `serve-4b-thinking-fc.yaml` | 83.84% |
| Qwen3-4B-no-Thinking-2507 | `serve-4b-no-thinking-fc.yaml` | 83.67% |

The thinking-enabled model comes out slightly ahead. That gain cannot come from fine-tuning on ToolACE alone, since the dataset contains no reasoning traces.

## Why Unsloth

Unsloth made batch size 16 feasible on a 32B model:

- 4-bit quantized weights: 64 GB → 16 GB
- LoRA: only ~0.5% of parameters trained
- 8-bit optimizer states: <1 GB (negligible, since only the LoRA parameters carry optimizer state)
- KV cache and activations (8K context length): ≈20-40 GB
- Gradient offloading to CPU
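The bullet list above can be turned into a back-of-the-envelope VRAM budget (rough figures, assuming ~32B parameters and the component estimates listed):

```python
# Rough VRAM arithmetic behind the list above. All figures are estimates.
params = 32e9
bf16_weights_gb = params * 2 / 1e9    # 2 bytes/param in bf16 -> 64 GB
int4_weights_gb = params * 0.5 / 1e9  # 0.5 bytes/param in 4-bit -> 16 GB
optimizer_gb = 1                      # 8-bit AdamW over LoRA params only
kv_and_activations_gb = 30            # midpoint of the 20-40 GB estimate
total_gb = int4_weights_gb + optimizer_gb + kv_and_activations_gb
print(bf16_weights_gb, int4_weights_gb, total_gb)  # 64.0 16.0 47.0
```

So 4-bit quantization alone is what moves a 32B model from "does not fit" to roughly 47 GB, within reach of a single large GPU.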

## Inference Performance

| Model | Output tok/s | TTFT (ms) | Notes |
|---|---|---|---|
| Qwen3-4B-Thinking | 2947 | 358 | Best speed/accuracy trade-off |
| ToolACE-8B | 2526 | 562 | |
| Qwen3-32B | 390 | 2279 | Highest accuracy, but >2 s to first token |

(32 concurrent requests, 1024-2048 output tokens)
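Assuming the tok/s figures are aggregate throughput across the 32 concurrent requests (an assumption; the benchmark harness may report it differently), the implied per-request decode speeds are:

```python
# Per-request decode speed implied by the table, assuming the tok/s numbers
# are aggregate across all 32 concurrent requests.
concurrency = 32
for name, agg_tok_s in [("Qwen3-4B-Thinking", 2947),
                        ("ToolACE-8B", 2526),
                        ("Qwen3-32B", 390)]:
    print(name, round(agg_tok_s / concurrency, 1), "tok/s per request")
```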

## Config Naming

- `-fc` = function calling enabled (vLLM tool-call parser)
- `-nofc` = function calling disabled
- `-vanilla` = no LoRA adapter
- `-parsed` = converted to OpenAI tool format
- `-raw` = original ToolACE format

Note: `qwen3/` is the older experiment folder (kept for WandB consistency). Recent experiments are in `qwen/`.

## Future Work

ToolACE-2.5-Llama used to lead BFCL v3 but falls behind smaller thinking models such as Qwen3-4B-Thinking-2507 and Nanbeige-4B on BFCL v4 (on my Python subset). Config: `llama-toolace/serve-nofc.yaml`

Next steps:

  • Transform ToolACE for GRPO training (supported by Unsloth with vLLM weight sharing)
  • Write reward functions to tune the reasoning process

## Setup

```bash
uv pip install unsloth vllm bfcl-eval "evalscope[perf]" wandb jupyterlab
```
