
# ToolACE

Fine-tuning LLMs for function calling, evaluated on BFCL v4.

Stack: Unsloth (LoRA training) | vLLM (serving) | bfcl-eval (benchmarks)

## Results

Best setup: `qwen3-32b-raw-toolace-nofc` with 84.59% overall accuracy (3,491 samples)

| Model | NON_LIVE | LIVE | Overall |
|---|---|---|---|
| qwen3-32b-raw-toolace-nofc | 92.50% | 80.23% | 84.59% |
| Qwen3-4B-Thinking-2507-fc | 92.18% | 79.25% | 83.84% |
| Qwen3-4B-Thinking-2507-toolace | 92.10% | 79.25% | 83.82% |
| Qwen3-4B-no-Thinking-2507-fc | 91.61% | 79.30% | 83.67% |
| ToolACE-2.5-Llama-3.1-8B-nofc | 91.05% | 78.63% | 83.04% |

See `analysis.ipynb` for the full breakdown.

## Repository Structure

| Path | Description |
|---|---|
| `toolace_converter.py` | Converts the ToolACE dataset to OpenAI chat format |
| `qwen/tune-*.ipynb` | Fine-tuning notebooks |
| `*/serve-*.yaml` | vLLM server configs |
| `eval.py` | Evaluation script |
| `outputs/` | Cached BFCL results |
| `analysis.ipynb` | Results analysis and visualizations |

## Quick Start

```bash
# Serve
vllm serve --config qwen/serve-qwen3-32b-lora-raw-nofc.yaml

# Evaluate
python eval.py --model qwen3-32b-raw-toolace-nofc --subset python
```

## Training Methodology

### Approaches Tested

1. **Vanilla**: direct training on ToolACE using Unsloth chat templates
2. **Parsed**: ToolACE converted to the OpenAI `tool_call` format
3. **Raw + No-FC** (best): train on raw ToolACE, serve without a function-calling parser
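As an illustration of what the "Parsed" step involves, here is a minimal sketch of converting ShareGPT-style turns (the `from`/`value` field names are an assumption, not necessarily the exact Team-ACE/ToolACE schema) into OpenAI chat messages:

```python
# Hypothetical conversion sketch: map ShareGPT-style turns to OpenAI chat
# messages. Field names ("from", "value") are assumed, not the exact schema.
def to_openai_messages(conversations):
    role_map = {"system": "system", "user": "user",
                "assistant": "assistant", "tool": "tool"}
    return [{"role": role_map[t["from"]], "content": t["value"]}
            for t in conversations]

example = [
    {"from": "user", "value": "What's the weather in Paris?"},
    {"from": "assistant", "value": '[get_weather(city="Paris")]'},
]
print(to_openai_messages(example))
```

The real `toolace_converter.py` additionally has to handle tool definitions and tool-response turns; this only shows the role/content mapping.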

### Training Details

From `qwen/tune-qwen3-32b-toolace-raw.ipynb`:

- Base model: `unsloth/Qwen3-32B` (4-bit quantized)
- Dataset: `Team-ACE/ToolACE` (95/5 train/test split)
- LoRA: r=16, alpha=32, trainable params: 134M (0.41%), following Unsloth's recommended defaults
- Batch size: 16 (the largest that fit in memory), epochs: 3, steps: 2,013
- Optimizer: AdamW 8-bit, LR: 2e-4
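A quick sanity check on the reported trainable fraction (the total parameter count of Qwen3-32B is assumed to be roughly 32.8B):

```python
# Verify that 134M LoRA parameters on an assumed ~32.8B-parameter base
# works out to the reported ~0.41% trainable fraction.
trainable = 134e6   # reported trainable LoRA parameters
total = 32.8e9      # assumed total parameters of Qwen3-32B
pct = 100 * trainable / total
print(f"{pct:.2f}%")  # 0.41%
```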

## Thinking vs Non-Thinking

Experiments comparing thinking-enabled models:

| Model | Config | Accuracy |
|---|---|---|
| Qwen3-4B-Thinking-2507 | `serve-4b-thinking-fc.yaml` | 83.84% |
| Qwen3-4B-no-Thinking-2507 | `serve-4b-no-thinking-fc.yaml` | 83.67% |

The thinking-enabled model comes out slightly ahead. That gain cannot come from fine-tuning on ToolACE alone, since the dataset contains no reasoning traces.

## Why Unsloth

Unsloth made batch size 16 feasible on a 32B model:

- 4-bit quantized weights: 64 GB → 16 GB
- LoRA: only ~0.5% of parameters trained
- 8-bit optimizer states: <1 GB (negligible, since only the LoRA parameters carry optimizer state)
- KV cache and activations (8K context length): ≈20-40 GB
- Gradient offloading to CPU
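The bullet list above can be turned into a back-of-the-envelope VRAM budget (rough figures, assuming ~32B parameters and the component estimates listed):

```python
# Rough VRAM arithmetic behind the list above. All figures are estimates.
params = 32e9
bf16_weights_gb = params * 2 / 1e9    # 2 bytes/param in bf16 -> 64 GB
int4_weights_gb = params * 0.5 / 1e9  # 0.5 bytes/param in 4-bit -> 16 GB
optimizer_gb = 1                      # 8-bit AdamW over LoRA params only
kv_and_activations_gb = 30            # midpoint of the 20-40 GB estimate
total_gb = int4_weights_gb + optimizer_gb + kv_and_activations_gb
print(bf16_weights_gb, int4_weights_gb, total_gb)  # 64.0 16.0 47.0
```

So 4-bit quantization alone is what moves a 32B model from "does not fit" to roughly 47 GB, within reach of a single large GPU.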

## Inference Performance

| Model | Output tok/s | TTFT (ms) | Notes |
|---|---|---|---|
| Qwen3-4B-Thinking | 2947 | 358 | Best speed/accuracy trade-off |
| ToolACE-8B | 2526 | 562 | |
| Qwen3-32B | 390 | 2279 | Highest accuracy, but >2 s to first token |

(32 concurrent requests, 1024-2048 output tokens)
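Assuming the tok/s figures are aggregate throughput across the 32 concurrent requests (an assumption; the benchmark harness may report it differently), the implied per-request decode speeds are:

```python
# Per-request decode speed implied by the table, assuming the tok/s numbers
# are aggregate across all 32 concurrent requests.
concurrency = 32
for name, agg_tok_s in [("Qwen3-4B-Thinking", 2947),
                        ("ToolACE-8B", 2526),
                        ("Qwen3-32B", 390)]:
    print(name, round(agg_tok_s / concurrency, 1), "tok/s per request")
```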

## Config Naming

- `-fc` = function calling enabled (vLLM tool-call parser)
- `-nofc` = function calling disabled
- `-vanilla` = no LoRA adapter
- `-parsed` = converted to OpenAI tool format
- `-raw` = original ToolACE format

Note: `qwen3/` is the older experiment folder (kept for WandB consistency). Recent experiments are in `qwen/`.

## Future Work

ToolACE-2.5-Llama used to lead BFCL v3 but falls behind smaller thinking models such as Qwen3-4B-Thinking-2507 and Nanbeige-4B on BFCL v4 (on my Python subset). Config: `llama-toolace/serve-nofc.yaml`

Next steps:

  • Transform ToolACE for GRPO training (supported by Unsloth with vLLM weight sharing)
  • Write reward functions to tune the reasoning process

## Setup

```bash
uv pip install unsloth vllm bfcl-eval "evalscope[perf]" wandb jupyterlab
```
