# Introduction
Agentic AI systems rely on a model's ability to call tools reliably: selecting the appropriate functions, formatting arguments correctly, and integrating results into multi-step workflows. Large frontier models such as ChatGPT, Claude, and Gemini handle this well, but their cost, latency, and hardware requirements make them impractical for many real-world deployments. Small language models have stepped in to fill that gap, and several compact, open-weight options now offer first-class tool-calling support without requiring data-center hardware.
Here, in no particular order, are five small language models for agent tool calling. Note that for convenience and consistency, all model links point to models hosted on Hugging Face.
# 1. SmolLM3-3B
| Technical aspect | Detail |
|---|---|
| Parameters | 3B |
| Architecture | Decoder-only transformer (GQA + NoPE, 3:1 ratio) |
| Context length | 64K native; up to 128K with YaRN extrapolation |
| Training tokens | 11.2T |
| Multilingual support | 6 languages (English, French, Spanish, German, Italian, Portuguese) |
| Reasoning mode | Dual mode (switchable thinking/non-thinking) |
| Tool calling | Yes: JSON/XML (`xml_tools`) and Python (`python_tools`) |
| License | Apache 2.0 |
SmolLM3 is a 3B-parameter language model designed to push the boundaries of small-scale models, supporting dual-mode reasoning, six languages, and long contexts. It is a decoder-only transformer that uses Grouped Query Attention (GQA) and No Positional Embedding (NoPE) in a 3:1 ratio, pre-trained on 11.2T tokens with a staged curriculum of web, code, math, and reasoning data. Post-training included a mid-training phase on 140B reasoning tokens, followed by supervised fine-tuning and alignment with Anchored Preference Optimization (APO), Hugging Face's off-policy approach to preference tuning. The model supports two distinct tool-calling interfaces: JSON/XML blobs (`xml_tools`) and Python-style function calls (`python_tools`), which makes it extremely flexible for agent pipelines and RAG systems. A fully open release including weights, datasets, and training code, SmolLM3 is ideal for chatbots, RAG systems, and code assistants on constrained hardware such as edge devices and low-VRAM machines.
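To make the interface concrete, here is a minimal sketch of tool calling with SmolLM3 through the transformers chat template, following the pattern documented on the model card; the `get_weather` schema is a made-up example.

```python
# Sketch: SmolLM3 tool calling via the transformers chat template.
# The xml_tools kwarg follows the SmolLM3 model card; the get_weather
# tool schema below is a hypothetical example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

tools = [{
    "name": "get_weather",
    "description": "Get the current weather in a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

messages = [{"role": "user", "content": "What's the weather in Copenhagen?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    xml_tools=tools,        # JSON tool specs rendered into the XML prompt format
    enable_thinking=False,  # skip the reasoning trace for a direct tool call
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# The model should emit a structured tool-call blob for you to parse and execute.
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```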
# 2. Qwen3-4B-Instruct-2507
| Technical aspect | Detail |
|---|---|
| Parameters | 4.0B (3.6B non-embedding) |
| Architecture | Causal LM, 36 layers, GQA (32 query heads / 8 KV heads) |
| Context length | 262,144 tokens (native) |
| Reasoning mode | Non-thinking only (no `<think>` blocks) |
| Multilingual support | 100+ languages |
| Tool calling | Yes: native, via Qwen-Agent / MCP |
| License | Apache 2.0 |
Qwen3-4B-Instruct-2507 is an updated version of Qwen3-4B's non-thinking mode, with significant improvements in general capabilities such as instruction following, logical reasoning, text comprehension, math, science, coding, and tool usage. It also delivers substantial gains in long-tail knowledge coverage across many languages. Both the Instruct and Thinking variants share the same 4 billion total parameters (3.6B excluding embeddings) across 36 transformer layers, and use GQA with 32 query heads and 8 key/value heads for memory-efficient handling of very long contexts. This non-thinking variant is optimized for direct, fast-response use cases, providing concise answers without explicit chain-of-thought traces, and is ideal for chatbots, customer support, and tool-calling agents where low latency matters. Qwen3 has excellent tool-calling capabilities, and Alibaba recommends using the Qwen-Agent framework, which encapsulates tool-call templates and parsers internally, reduces coding complexity, and supports MCP server configuration files.
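As a rough sketch of what that looks like in practice, the following assumes an OpenAI-compatible server (e.g., vLLM) already serving the model locally; the endpoint URL and the MCP time server are illustrative stand-ins, not prescribed configuration.

```python
# Sketch: a tool-calling agent with Qwen-Agent. Assumes an OpenAI-compatible
# server (e.g., vLLM) serving the model at the URL below; the MCP time server
# and endpoint are illustrative stand-ins.
from qwen_agent.agents import Assistant

llm_cfg = {
    "model": "Qwen/Qwen3-4B-Instruct-2507",
    "model_server": "http://localhost:8000/v1",  # hypothetical local endpoint
    "api_key": "EMPTY",
}

tools = [
    # Inline MCP server configuration, as supported by Qwen-Agent
    {"mcpServers": {"time": {"command": "uvx", "args": ["mcp-server-time"]}}},
    "code_interpreter",  # one of Qwen-Agent's built-in tools
]

bot = Assistant(llm=llm_cfg, function_list=tools)

messages = [{"role": "user", "content": "What is the current time in UTC?"}]
responses = []
for responses in bot.run(messages=messages):  # run() streams growing response lists
    pass
print(responses[-1]["content"])  # final assistant message after tool use
```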
# 3. Phi-3-mini-4k-instruct
| Technical aspect | Detail |
|---|---|
| Parameters | 3.8B |
| Architecture | Decoder-only transformer |
| Context length | 4K tokens |
| Vocabulary size | 32,064 tokens |
| Training data | Synthetic and filtered public web data |
| Post-training | SFT + DPO |
| Tool calling | Yes: via chat template (requires HF Transformers ≥ 4.41.2) |
| License | MIT |
Phi-3-Mini-4K-Instruct is a lightweight, state-of-the-art open model with 3.8B parameters, trained on the Phi-3 datasets, which contain both synthetic data and filtered public web data with an emphasis on high-quality, reasoning-dense properties. The model underwent a post-training process incorporating both supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) to improve instruction following and safety. Microsoft's flagship "small but smart" model, Phi-3-mini was notable at launch for its ability to run on devices as small as smartphones while rivaling GPT-3.5 on capability benchmarks. It is primarily targeted at memory- and compute-constrained environments, latency-sensitive scenarios, and tasks that require strong reasoning, especially math and logic. Although it is older than the other models on this list and limited to a 4K context window, its MIT license is the most permissive of the options here, and its strong general reasoning makes it a popular base for fine-tuning in commercial applications.
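For reference, here is a minimal sketch of running the model with transformers; the prompt is an arbitrary example, and the `trust_remote_code` flag follows the model card's original instructions.

```python
# Minimal sketch: running Phi-3-mini-4k-instruct with transformers
# (the model card calls for transformers >= 4.41.2; the prompt is arbitrary).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # per the model card's original instructions
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
messages = [{"role": "user", "content": "If 3x + 5 = 20, what is x? Answer step by step."}]
out = pipe(messages, max_new_tokens=128, do_sample=False)
print(out[0]["generated_text"][-1]["content"])  # the assistant's reply
```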
# 4. Gemma-3n-E2B-it
| Technical aspect | Detail |
|---|---|
| Effective parameters | 2.3B (5.1B total, including embeddings) |
| Architecture | Dense, hybrid attention (sliding window + global) + PLE |
| Layers | 35 |
| Sliding window size | 512 tokens |
| Context length | 128,000 tokens |
| Vocabulary size | 262K |
| Modalities | Text, images, audio (≤30 seconds), video (as frames) |
| Multilingual support | Native support for 35+ languages; trained on data in 140+ languages |
| Tool calling | Yes: native function calling |
| License | Gemma Terms of Use |
Gemma-3n-E2B-it is part of Google DeepMind's Gemma 3n family and features a hybrid attention mechanism that interleaves local sliding-window attention with full global attention. This design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the deep understanding required for complex, long-context tasks. The "E" in E2B stands for "effective" parameters, made possible by a key architectural innovation called Per-Layer Embeddings (PLE), which gives each decoder layer its own dedicated embedding parameters that can be offloaded from the accelerator; combined with quantization, this allows E2B to run in around 1.5 GB of memory while still producing useful output. The model supports native function calling, enabling agent workflows, is optimized for on-device deployment on mobile and IoT hardware, and can handle text, image, audio, and video input. Released under Google's custom Gemma Terms of Use rather than Apache 2.0, Gemma 3n E2B is an attractive option for developers building multimodal agent applications that run entirely at the edge.
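As a quick illustration of the multimodal side, here is a sketch of inference through the transformers pipeline API, assuming a transformers build with Gemma 3n support; the image URL is a placeholder.

```python
# Sketch: multimodal inference with Gemma 3n via the transformers pipeline API
# (assumes a transformers version with Gemma 3n support; the image URL is a
# placeholder).
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-E2B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/receipt.jpg"},  # placeholder
            {"type": "text", "text": "Extract the total amount from this receipt."},
        ],
    }
]
out = pipe(text=messages, max_new_tokens=64)
print(out[0]["generated_text"][-1]["content"])  # the assistant's reply
```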
# 5. Mistral-7B-Instruct-v0.3
| Technical aspect | Detail |
|---|---|
| Parameters | 7.25B |
| Architecture | Transformer, GQA |
| Context length | 32,768 tokens |
| Vocabulary size | 32,768 tokens (expanded from 32,000 in v0.2) |
| Tokenizer | v3 Mistral tokenizer |
| Tool calling | Yes: via dedicated TOOL_CALLS / AVAILABLE_TOOLS / TOOL_RESULTS tokens |
| License | Apache 2.0 |
Mistral-7B-Instruct-v0.3 is the instruction-tuned version of Mistral-7B-v0.3, which introduced three important changes over v0.2: vocabulary expansion to 32,768 tokens, support for the v3 tokenizer, and function calling support. The model employs grouped-query attention (GQA) to speed up inference, and its function calling is enabled through the extended vocabulary, which includes the dedicated control tokens TOOL_CALLS, AVAILABLE_TOOLS, and TOOL_RESULTS. As the largest model in this roundup at 7B parameters, Mistral-7B-Instruct-v0.3 offers the strongest general instruction-following performance of the group, making it an industry-standard workhorse widely available through Ollama, vLLM, and most inference platforms.
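Here is a hedged sketch of that flow through the transformers chat template, which renders the control tokens for you; `get_current_weather` is a hypothetical tool, and this assumes a recent transformers version with tool-schema support plus enough VRAM for a 7B model.

```python
# Hedged sketch: function calling with Mistral-7B-Instruct-v0.3 via the
# transformers chat template, which inserts the [AVAILABLE_TOOLS] /
# [TOOL_CALLS] control tokens for you. get_current_weather is a
# hypothetical tool.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def get_current_weather(location: str, unit: str = "celsius") -> dict:
    """Get the current weather in a given location (hypothetical tool).

    Args:
        location: The city to look up, e.g. "Paris".
        unit: Temperature unit, "celsius" or "fahrenheit".
    """
    return {"location": location, "temperature": 22, "unit": unit}

messages = [{"role": "user", "content": "What's the weather in Paris right now?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_current_weather],  # schema derived from the signature + docstring
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(inputs, max_new_tokens=128)
# Expect a [TOOL_CALLS] blob naming the function and its JSON arguments.
print(tokenizer.decode(out[0][inputs.shape[1]:]))
```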
# Summary
The five models discussed here (SmolLM3-3B, Qwen3-4B-Instruct-2507, Phi-3-mini-4k-instruct, Gemma-3n-E2B-it, and Mistral-7B-Instruct-v0.3) vary in architecture, parameter count, context window, and release date, but they share one important characteristic: all of them support structured tool calling in compact, open-weight packages.
From Hugging Face’s fully transparent SmolLM3 to Google DeepMind’s multimodal, edge-optimized Gemma 3n E2B, this selection shows that deploying competent agent models doesn’t require massive infrastructure or frontier models. Whether your priorities are on-device inference, long-context processing, multilingual support, or the most permissive license possible, this list includes models worth considering.
Note that these are not the only small language models with tool-calling capabilities, but they are a good representation of the models I have first-hand experience with and feel comfortable including based on my results.
Matthew Mayo (@mattmayo13) holds a Master’s degree in Computer Science and a Postgraduate Diploma in Data Mining. As Editor-in-Chief of KDnuggets & Statology and Contributing Editor of Machine Learning Mastery, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, language models, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.
