# Introduction
Agentic AI systems rely on a model's ability to call tools reliably: selecting the appropriate functions, formatting arguments correctly, and integrating results into multi-step workflows. Large frontier models such as ChatGPT, Claude, and Gemini handle this well, but their cost, latency, and hardware requirements make them impractical for many real-world deployments. Small language models have stepped in to fill that gap, and several compact, open-weight options now offer first-class tool-calling support without requiring data-center hardware.
Here, in no particular order, are five small language models for agent tool calling. Note that for convenience and consistency, all model links point to models hosted on Hugging Face.
# 1. SmolLM3-3B
| Technical aspect | Detail |
|---|---|
| Parameters | 3B |
| Architecture | Decoder-only transformer (GQA + NoPE, 3:1 ratio) |
| Context length | 64K native; up to 128K with YaRN extrapolation |
| Training tokens | 11.2T |
| Multilingual support | 6 languages (English, French, Spanish, German, Italian, Portuguese) |
| Reasoning mode | Dual mode (switchable thinking/non-thinking) |
| Tool calling | Yes: JSON/XML (`xml_tools`) and Python (`python_tools`) |
| License | Apache 2.0 |
SmolLM3 is a 3B-parameter language model designed to push the boundaries of small-scale models, supporting dual-mode reasoning, six languages, and long contexts. It is a decoder-only transformer that uses Grouped Query Attention (GQA) and No Positional Embedding (NoPE) in a 3:1 ratio, pre-trained on 11.2T tokens with a staged curriculum of web, code, math, and reasoning data. Post-training included a mid-training phase on 140B reasoning tokens, followed by supervised fine-tuning and alignment with Anchored Preference Optimization (APO), Hugging Face's off-policy approach to preference tuning. The model supports two distinct tool-calling interfaces: JSON/XML blobs (`xml_tools`) and Python-style function calls (`python_tools`), which makes it extremely flexible for agent pipelines and RAG systems. A fully open release including weights, datasets, and training code, SmolLM3 is ideal for chatbots, RAG systems, and code assistants on constrained hardware such as edge devices and low-VRAM machines.
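To make the interface concrete, here is a minimal sketch of tool calling with SmolLM3 through the transformers chat template, following the pattern documented on the model card; the `get_weather` schema is a made-up example.

```python
# Sketch: SmolLM3 tool calling via the transformers chat template.
# The xml_tools kwarg follows the SmolLM3 model card; the get_weather
# tool schema below is a hypothetical example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

tools = [{
    "name": "get_weather",
    "description": "Get the current weather in a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

messages = [{"role": "user", "content": "What's the weather in Copenhagen?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    xml_tools=tools,        # JSON tool specs rendered into the XML prompt format
    enable_thinking=False,  # skip the reasoning trace for a direct tool call
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# The model should emit a structured tool-call blob for you to parse and execute.
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```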
# 2. Qwen3-4B-Instruct-2507
| Technical aspect | Detail |
|---|---|
| Parameters | 4.0B (3.6B non-embedding) |
| Architecture | Causal LM, 36 layers, GQA (32 query heads / 8 KV heads) |
| Context length | 262,144 tokens (native) |
| Reasoning mode | Non-thinking only (no `<think>` blocks) |
| Multilingual support | 100+ languages |
| Tool calling | Yes: native, via Qwen-Agent / MCP |
| License | Apache 2.0 |
Qwen3-4B-Instruct-2507 is an updated version of Qwen3-4B's non-thinking mode, with significant improvements in general capabilities such as instruction following, logical reasoning, text comprehension, math, science, coding, and tool usage. It also delivers substantial gains in long-tail knowledge coverage across many languages. Both the Instruct and Thinking variants share the same 4 billion total parameters (3.6B excluding embeddings) across 36 transformer layers, and use GQA with 32 query heads and 8 key/value heads for memory-efficient handling of very long contexts. This non-thinking variant is optimized for direct, fast-response use cases, providing concise answers without explicit chain-of-thought traces, and is ideal for chatbots, customer support, and tool-calling agents where low latency matters. Qwen3 has excellent tool-calling capabilities, and Alibaba recommends using the Qwen-Agent framework, which encapsulates tool-call templates and parsers internally, reduces coding complexity, and supports MCP server configuration files.
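As a rough sketch of what that looks like in practice, the following assumes an OpenAI-compatible server (e.g., vLLM) already serving the model locally; the endpoint URL and the MCP time server are illustrative stand-ins, not prescribed configuration.

```python
# Sketch: a tool-calling agent with Qwen-Agent. Assumes an OpenAI-compatible
# server (e.g., vLLM) serving the model at the URL below; the MCP time server
# and endpoint are illustrative stand-ins.
from qwen_agent.agents import Assistant

llm_cfg = {
    "model": "Qwen/Qwen3-4B-Instruct-2507",
    "model_server": "http://localhost:8000/v1",  # hypothetical local endpoint
    "api_key": "EMPTY",
}

tools = [
    # Inline MCP server configuration, as supported by Qwen-Agent
    {"mcpServers": {"time": {"command": "uvx", "args": ["mcp-server-time"]}}},
    "code_interpreter",  # one of Qwen-Agent's built-in tools
]

bot = Assistant(llm=llm_cfg, function_list=tools)

messages = [{"role": "user", "content": "What is the current time in UTC?"}]
responses = []
for responses in bot.run(messages=messages):  # run() streams growing response lists
    pass
print(responses[-1]["content"])  # final assistant message after tool use
```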
# 3. Phi-3-mini-4k-instruct
| Technical aspect | Detail |
|---|---|
| Parameters | 3.8B |
| Architecture | Decoder-only transformer |
| Context length | 4K tokens |
| Vocabulary size | 32,064 tokens |
| Training data | Synthetic and filtered public web data |
| Post-training | SFT + DPO |
| Tool calling | Yes: via chat template (requires HF Transformers ≥ 4.41.2) |
| License | MIT |
Phi-3-Mini-4K-Instruct is a lightweight, state-of-the-art open model with 3.8B parameters, trained on the Phi-3 datasets, which contain both synthetic data and filtered public web data with an emphasis on high-quality, reasoning-dense properties. The model underwent a post-training process incorporating both supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) to improve instruction following and safety. Microsoft's flagship "small but smart" model, Phi-3-mini was notable at launch for its ability to run on devices as small as smartphones while rivaling GPT-3.5 on capability benchmarks. It is primarily targeted at memory- and compute-constrained environments, latency-sensitive scenarios, and tasks that require strong reasoning, especially math and logic. Although it is older than the other models on this list and limited to a 4K context window, its MIT license is the most permissive of the options here, and its strong general reasoning makes it a popular base for fine-tuning in commercial applications.
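For reference, here is a minimal sketch of running the model with transformers; the prompt is an arbitrary example, and the `trust_remote_code` flag follows the model card's original instructions.

```python
# Minimal sketch: running Phi-3-mini-4k-instruct with transformers
# (the model card calls for transformers >= 4.41.2; the prompt is arbitrary).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # per the model card's original instructions
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
messages = [{"role": "user", "content": "If 3x + 5 = 20, what is x? Answer step by step."}]
out = pipe(messages, max_new_tokens=128, do_sample=False)
print(out[0]["generated_text"][-1]["content"])  # the assistant's reply
```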
# 4. Gemma-3n-E2B-it
| Technical aspect | Detail |
|---|---|
| Effective parameters | 2.3B (5.1B total, including embeddings) |
| Architecture | Dense, hybrid attention (sliding window + global) + PLE |
| Layers | 35 |
| Sliding window size | 512 tokens |
| Context length | 128,000 tokens |
| Vocabulary size | 262K |
| Modalities | Text, images, audio (≤30 seconds), video (as frames) |
| Multilingual support | Native support for 35+ languages; trained on data in 140+ languages |
| Tool calling | Yes: native function calling |
| License | Gemma Terms of Use |
Gemma-3n-E2B-it is part of Google DeepMind's Gemma 3n family and features a hybrid attention mechanism that interleaves local sliding-window attention with full global attention. This design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the deep understanding required for complex, long-context tasks. The "E" in E2B stands for "effective" parameters, made possible by a key architectural innovation called Per-Layer Embeddings (PLE), which gives each decoder layer its own dedicated embedding parameters that can be offloaded from the accelerator; combined with quantization, this allows E2B to run in around 1.5 GB of memory while still producing useful output. The model supports native function calling, enabling agent workflows, is optimized for on-device deployment on mobile and IoT hardware, and can handle text, image, audio, and video input. Released under Google's custom Gemma Terms of Use rather than Apache 2.0, Gemma 3n E2B is an attractive option for developers building multimodal agent applications that run entirely at the edge.
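As a quick illustration of the multimodal side, here is a sketch of inference through the transformers pipeline API, assuming a transformers build with Gemma 3n support; the image URL is a placeholder.

```python
# Sketch: multimodal inference with Gemma 3n via the transformers pipeline API
# (assumes a transformers version with Gemma 3n support; the image URL is a
# placeholder).
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-E2B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/receipt.jpg"},  # placeholder
            {"type": "text", "text": "Extract the total amount from this receipt."},
        ],
    }
]
out = pipe(text=messages, max_new_tokens=64)
print(out[0]["generated_text"][-1]["content"])  # the assistant's reply
```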
# 5. Mistral-7B-Instruct-v0.3
| Technical aspect | Detail |
|---|---|
| Parameters | 7.25B |
| Architecture | Transformer, GQA |
| Context length | 32,768 tokens |
| Vocabulary size | 32,768 tokens (expanded from 32,000 in v0.2) |
| Tokenizer | v3 Mistral tokenizer |
| Tool calling | Yes: via dedicated TOOL_CALLS / AVAILABLE_TOOLS / TOOL_RESULTS tokens |
| License | Apache 2.0 |
Mistral-7B-Instruct-v0.3 is the instruction-tuned version of Mistral-7B-v0.3, which introduced three important changes over v0.2: vocabulary expansion to 32,768 tokens, support for the v3 tokenizer, and function calling support. The model employs grouped-query attention (GQA) to speed up inference, and its function calling is enabled through the extended vocabulary, which includes the dedicated control tokens TOOL_CALLS, AVAILABLE_TOOLS, and TOOL_RESULTS. As the largest model in this roundup at 7B parameters, Mistral-7B-Instruct-v0.3 offers the strongest general instruction-following performance of the group, making it an industry-standard workhorse widely available through Ollama, vLLM, and most inference platforms.
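Here is a hedged sketch of that flow through the transformers chat template, which renders the control tokens for you; `get_current_weather` is a hypothetical tool, and this assumes a recent transformers version with tool-schema support plus enough VRAM for a 7B model.

```python
# Hedged sketch: function calling with Mistral-7B-Instruct-v0.3 via the
# transformers chat template, which inserts the [AVAILABLE_TOOLS] /
# [TOOL_CALLS] control tokens for you. get_current_weather is a
# hypothetical tool.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def get_current_weather(location: str, unit: str = "celsius") -> dict:
    """Get the current weather in a given location (hypothetical tool).

    Args:
        location: The city to look up, e.g. "Paris".
        unit: Temperature unit, "celsius" or "fahrenheit".
    """
    return {"location": location, "temperature": 22, "unit": unit}

messages = [{"role": "user", "content": "What's the weather in Paris right now?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_current_weather],  # schema derived from the signature + docstring
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(inputs, max_new_tokens=128)
# Expect a [TOOL_CALLS] blob naming the function and its JSON arguments.
print(tokenizer.decode(out[0][inputs.shape[1]:]))
```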
# Summary
The five models discussed here (SmolLM3-3B, Qwen3-4B-Instruct-2507, Phi-3-mini-4k-instruct, Gemma-3n-E2B-it, and Mistral-7B-Instruct-v0.3) vary in architecture, parameter count, context window, and release date, but they share one important characteristic: all of them support structured tool calling in compact, open-weight packages.
From Hugging Face’s fully transparent SmolLM3 to Google DeepMind’s multimodal, edge-optimized Gemma 3n E2B, this selection shows that deploying competent agent models doesn’t require massive infrastructure or frontier models. Whether your priorities are on-device inference, long-context processing, multilingual support, or the most permissive license possible, this list includes models worth considering.
Note that these are not the only small language models with tool-calling capabilities, but they are a good representation of the models I have first-hand experience with and feel comfortable including based on my results.
Matthew Mayo (@mattmayo13) holds a Master’s degree in Computer Science and a Postgraduate Diploma in Data Mining. As Editor-in-Chief of KDnuggets & Statology and Contributing Editor of Machine Learning Mastery, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, language models, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.
