Fine-tuning LLMs for function calling using BFCL v4.
Stack: Unsloth (LoRA training) | vLLM (serving) | bfcl-eval (benchmarks)
Best setup: qwen3-32b-raw-toolace-nofc — 84.59% overall (3,491 samples)
| Model | NON_LIVE | LIVE | Overall |
|---|---|---|---|
| qwen3-32b-raw-toolace-nofc | 92.50% | 80.23% | 84.59% |
| Qwen3-4B-Thinking-2507-fc | 92.18% | 79.25% | 83.84% |
| Qwen3-4B-Thinking-2507-toolace | 92.10% | 79.25% | 83.82% |
| Qwen3-4B-no-Thinking-2507-fc | 91.61% | 79.30% | 83.67% |
| ToolACE-2.5-Llama-3.1-8B-nofc | 91.05% | 78.63% | 83.04% |
→ See analysis.ipynb for full breakdown
| Path | Description |
|---|---|
| `toolace_converter.py` | Converts the ToolACE dataset to OpenAI chat format |
| `qwen/tune-*.ipynb` | Fine-tuning notebooks |
| `*/serve-*.yaml` | vLLM server configs |
| `eval.py` | Evaluation script |
| `outputs/` | Cached BFCL results |
| `analysis.ipynb` | Results analysis and visualizations |
```bash
# Serve
vllm serve --config qwen/serve-qwen3-32b-lora-raw-nofc.yaml

# Evaluate
python eval.py --model qwen3-32b-raw-toolace-nofc --subset python-
```
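A quick sanity check of the served endpoint before kicking off the full benchmark (a minimal sketch: it assumes vLLM's default port 8000 and that the config registers the model under the eval name; the weather tool is purely illustrative, BFCL supplies its own schemas):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Illustrative tool schema; BFCL provides its own during evaluation.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3-32b-raw-toolace-nofc",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# With a -nofc config there is no tool-call parser, so the call arrives as
# plain text in message.content (ToolACE-style "[get_weather(city='Berlin')]")
# rather than as structured tool_calls.
print(resp.choices[0].message)
```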
Vanilla — Direct training on ToolACE using Unsloth chat templates
- Notebook example: qwen3/02_xLAM-2-32b-fc-r_vanilla.ipynb

Parsed — ToolACE converted to the OpenAI tool_call format
- Converter: toolace_converter.py (core idea sketched below)
- Goal: use the models' native tool-call format instead of retraining them on the unfamiliar ToolACE format
- Notebook example: qwen3/03_xLAM-2-32b-fc-r_parsed.ipynb
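The gist of the conversion, as a hedged sketch (the real logic lives in toolace_converter.py and handles whole conversations and edge cases; this assumes only ToolACE's bracketed [func(arg=value)] call syntax):

```python
import ast
import json

def toolace_to_openai(assistant_text: str) -> dict:
    """Turn a ToolACE-style turn like "[get_weather(city='Berlin')]"
    into an OpenAI-format assistant message with tool_calls."""
    # "[f(a=1), g(b=2)]" parses as a Python list of Call nodes,
    # so the ast module can do the heavy lifting.
    tree = ast.parse(assistant_text.strip(), mode="eval")
    calls = []
    for node in tree.body.elts:
        args = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append({
            "type": "function",
            "function": {"name": node.func.id, "arguments": json.dumps(args)},
        })
    return {"role": "assistant", "tool_calls": calls}

print(toolace_to_openai("[get_weather(city='Berlin', unit='C')]"))
```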
Raw + No-FC (best) — Train on raw ToolACE, serve without an FC parser
- Notebook: qwen/tune-qwen3-32b-toolace-raw.ipynb
- vLLM serve config: qwen/serve-qwen3-32b-lora-raw-nofc.yaml
- Rationale: the model generates well-formed ToolACE calls, but without <tool_call></tool_call> tags; disabling the FC parameter in BFCL and skipping vLLM's tool-call parser plugin prevents the parsers from interfering (see the illustration below)
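A toy illustration of that interference argument (assumed parser behavior, reduced to a regex; vLLM's actual Hermes-style parser is more involved):

```python
import re

# A Hermes-style parser extracts JSON found between <tool_call> tags.
HERMES_TAG = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

raw_completion = "[get_weather(city='Berlin')]"  # what a raw-ToolACE model emits

# With the parser enabled, nothing matches, so an FC-mode harness would see an
# empty tool_calls list and score a miss. With -nofc, the text passes through
# untouched and BFCL's prompt-mode decoder parses the bracketed call itself.
print(HERMES_TAG.findall(raw_completion))  # []
```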
From qwen/tune-qwen3-32b-toolace-raw.ipynb (condensed into a sketch after this list):
- Base model: unsloth/Qwen3-32B (4-bit quantized)
- Dataset: Team-ACE/ToolACE (95/5 train/test split)
- LoRA: r=16, alpha=32, 134M trainable params (0.41%), Unsloth's recommended defaults
- Batch size: 16 (maxed out where possible), epochs: 3, steps: 2,013
- Optimizer: AdamW 8-bit, LR: 2e-4
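A hedged condensation of the recipe (the authoritative cells are in the notebook; max_seq_length, target_modules, and the exact trainer wiring here are assumptions based on common Unsloth usage):

```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-32B",
    load_in_4bit=True,       # 4-bit weights: ~64 GB bf16 -> ~16 GB
    max_seq_length=8192,     # assumed; matches the 8K context noted below
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,           # ~134M trainable params (0.41%)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # offloads activations to CPU
)

# 95/5 train/test split of ToolACE, as listed above.
ds = load_dataset("Team-ACE/ToolACE", split="train").train_test_split(test_size=0.05)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds["train"],
    args=SFTConfig(
        per_device_train_batch_size=16,
        num_train_epochs=3,
        learning_rate=2e-4,
        optim="adamw_8bit",
    ),
)
trainer.train()
```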
Experiments comparing thinking-enabled models:
| Model | Config | Accuracy |
|---|---|---|
| Qwen3-4B-Thinking-2507 | serve-4b-thinking-fc.yaml | 83.84% |
| Qwen3-4B-no-Thinking-2507 | serve-4b-no-thinking-fc.yaml | 83.67% |
Results slightly favor the thinking-enabled model. That is a gain I couldn't get from the ToolACE dataset itself, since it contains no reasoning traces.
What enabled batch size 16 on the 32B model (back-of-envelope check below):
- 4-bit weights: 64 GB → 16 GB
- LoRA: only ~0.5% of params are trained
- 8-bit optimizer: <1 GB (negligible thanks to LoRA)
- KV cache & activations (8K context): ≈20-40 GB
- Gradient offloading to CPU
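The arithmetic behind those numbers, as a quick script:

```python
params = 32e9        # Qwen3-32B
lora_params = 134e6  # trainable LoRA params (0.41%)

weights_4bit = params * 0.5 / 1e9      # 4-bit = 0.5 bytes/param -> 16 GB (64 GB at bf16)
lora_bf16 = lora_params * 2 * 2 / 1e9  # bf16 weights + grads -> ~0.5 GB
optim_8bit = lora_params * 2 * 1 / 1e9 # two 8-bit AdamW moments -> ~0.27 GB, i.e. <1 GB

print(f"weights {weights_4bit:.0f} GB, LoRA {lora_bf16:.2f} GB, optimizer {optim_8bit:.2f} GB")
```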
| Model | Output tok/s | TTFT (ms) | Notes |
|---|---|---|---|
| Qwen3-4B-Thinking | 2947 | 358 | Best speed/accuracy |
| ToolACE-8B | 2526 | 562 | |
| Qwen3-32B | 390 | 2279 | Highest accuracy, but ~2.3 s TTFT |
(32 concurrent requests, 1024-2048 tokens)
Suffix legend:
- `-fc` = function calling enabled (vLLM tool-call parser)
- `-nofc` = function calling disabled
- `-vanilla` = no LoRA adapter
- `-parsed` = converted to OpenAI tool format
- `-raw` = original ToolACE format
Note: qwen3/ is the older experiment folder (kept for WandB consistency). Recent experiments are in qwen/.
ToolACE-2.5-Llama used to lead BFCL v3 but falls behind smaller thinking models like Qwen3-4B-Thinking-2507 and Nanbeige-4B on BFCL v4 (on my Python subset). Config: llama-toolace/serve-nofc.yaml
Next steps:
- Transform ToolACE for GRPO training (supported by Unsloth with vLLM weight sharing)
- Write reward functions to tune the reasoning process (a first sketch below)
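A hypothetical starting point for such a reward function, in the style TRL's GRPOTrainer expects (nothing like this exists in the repo yet; the scoring rule is an assumption):

```python
import ast

def tool_call_parses(completions, **kwargs):
    """Reward 1.0 if the completion, after any reasoning block, ends in a
    well-formed bracketed tool call; 0.0 otherwise."""
    rewards = []
    for completion in completions:
        # Score only the final answer, not the <think> block.
        answer = completion.split("</think>")[-1].strip()
        try:
            node = ast.parse(answer, mode="eval").body
            ok = isinstance(node, ast.List) and all(
                isinstance(elt, ast.Call) for elt in node.elts
            )
        except SyntaxError:
            ok = False
        rewards.append(1.0 if ok else 0.0)
    return rewards

print(tool_call_parses(["<think>...</think>\n[get_weather(city='Berlin')]"]))  # [1.0]
```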
Setup:

```bash
uv pip install unsloth vllm bfcl-eval "evalscope[perf]" wandb jupyterlab
```

Artifacts:
- Adapter: lozhnikov/toolace-qwen3-32b-raw (loading sketch below)
- Dataset: Team-ACE/ToolACE
- Base model: unsloth/Qwen3-32B
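One way to pull the published adapter onto the base model (a sketch assuming a plain transformers + peft setup and a GPU with enough memory):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("unsloth/Qwen3-32B", device_map="auto")
model = PeftModel.from_pretrained(base, "lozhnikov/toolace-qwen3-32b-raw")
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-32B")
```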