Preserve tool-call JSON for deterministic local inference #22
Open
latent-variable wants to merge 1 commit into MiniMax-AI:main from
Conversation
2eba424 to 91d7951
Collaborator
Thanks for this excellent PR! This is a well-designed optimization for local LLM inference with KV cache. The solution is elegant.
The performance improvement (33% → 99.88% cache hit rate) is impressive! 🚀 However, there are merge conflicts with the current main branch. Looking forward to merging this! 🙏
Summary
Add arguments_json to FunctionCall so we persist the exact tool-call payload returned by the model.
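For illustration, here is a rough sketch of the idea, assuming a dataclass-style FunctionCall; the field and helper names below are placeholders, not the actual Mini-Agent code:

```python
from dataclasses import dataclass
from typing import Any, Optional
import json


@dataclass
class FunctionCall:
    name: str
    arguments: dict[str, Any]
    # Raw JSON string exactly as the model emitted it (hypothetical field).
    arguments_json: Optional[str] = None


def serialize_tool_call_arguments(call: FunctionCall) -> str:
    """Return the argument payload to replay in later requests."""
    # Prefer the preserved payload so the rebuilt transcript stays
    # byte-identical to what the model produced in the earlier turn.
    if call.arguments_json is not None:
        return call.arguments_json
    # Fall back to deterministic re-serialization when no raw payload exists.
    return json.dumps(call.arguments, sort_keys=True)
```

The key point of the sketch is that re-serialization only happens as a fallback; the normal path replays the model's original bytes unchanged.
Problem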
When Mini-Agent is configured to send requests to a local LM Studio endpoint (or any local serving stack with a KV cache), each subsequent request must be byte-identical for the cached portion of the context. Today the request builder re-serializes every tool call in the transcript using json.dumps(..., sort_keys=True). That changes key ordering, whitespace, or float formatting compared to what the model actually emitted, meaning the tool call prepended to Request #2 is different from the one the model saw during Request #1. LM Studio therefore treats the assistant history as a cache miss, reprocessing all prior tokens (~12k tokens per turn in our setup) and wasting latency and compute.
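A tiny, self-contained illustration of the mismatch (the payload values are made up):

```python
import json

# Arguments exactly as the model emitted them in Request #1.
raw = '{"path": "notes.md", "create": true}'

# What a sort_keys=True re-serialization sends back in Request #2.
rebuilt = json.dumps(json.loads(raw), sort_keys=True)

print(rebuilt)         # {"create": true, "path": "notes.md"}
print(raw == rebuilt)  # False -> cached prefix no longer matches byte-for-byte
```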
Testing