
# Introduction
In a latest article on Machine Studying Mastery, we constructed a tool-calling agent that reached outward, that’s pulling climate, information, foreign money charges, and time from public APIs. That article coated the synthesis half of the sample properly, however it left the extra attention-grabbing half on the desk: an agent that causes about its personal setting, inspects its personal machine, and offloads logic it would not belief itself to carry out. It may very well be argued that that is nearer to really “agentic.”
This text picks up the place that one left off. We are going to give Gemma 4 two new instruments — a sandboxed native filesystem explorer and a restricted Python interpreter — and watch the mannequin determine, by itself, when to go searching and when to compute.
Subjects we are going to cowl embody:
- Why “agentic” instrument calling wants greater than internet APIs to be attention-grabbing
- The best way to construct a filesystem inspection instrument with laborious path-traversal guards
- The best way to wire a Python interpreter instrument to the mannequin with out handing it the keys to your machine
- How the identical orchestration loop from earlier than generalizes to those new capabilities
I extremely suggest that you simply first learn this text earlier than persevering with on.
# From Dialog to Company
When the one instruments you give a language mannequin are read-only internet APIs, primarily you continue to actually have a chatbot, albeit one with potential entry to raised data. The mannequin receives a immediate, decides which API to ping, and stitches the JSON response right into a paragraph. There isn’t a actual notion of setting, no state to examine, no consequence to motive about; it is a situation extra akin to retrieval augmented technology than true company.
Company, within the sensible sense practitioners use the phrase, exhibits up when a mannequin begins interacting with the system it’s working on. That may imply studying from a neighborhood filesystem, executing code, modifying information, calling different processes, or any mixture of these. The second a instrument can do one thing aside from return a clear string from a distant service, the mannequin has to begin asking about itself: what information exist, what does this quantity truly equal, what’s on this folder earlier than I declare it comprises something.
The Gemma 4 household, and particularly the gemma4:e2b edge variant now we have been utilizing, is sufficiently small to run regionally on a laptop computer whereas being competent sufficient at structured output to drive this type of loop reliably. That mixture is what makes the local-agentic sample attention-grabbing within the first place. The whole code for this tutorial may be discovered right here.
# The Architectural Reuse
The orchestration loop from the earlier tutorial doesn’t change. We outline Python capabilities, expose them through JSON schema, cross the registry to Ollama alongside the consumer immediate, intercept any tool_calls block on the response, execute the requested operate regionally, append the outcome as a instrument-role message, and re-query the mannequin so it may well synthesize a last reply. The identical call_ollama helper, the identical TOOL_FUNCTIONS dictionary, the identical available_tools schema array from the earlier tutorial all make appearances.
What modifications is the character of the instruments themselves. The place the earlier batch have been all skinny shoppers over distant APIs, these we are going to construct now each run code on the machine. That shifts the design drawback from “how do I parse this response” to “how do I make certain the mannequin can not, even by chance, do one thing it shouldn’t be allowed to do.”
# Device 1: A Sandboxed Filesystem Explorer
The primary instrument, list_directory_contents, offers the mannequin the flexibility to see what information exist in a given folder. This sounds trivial till you keep in mind that os.listdir accepts any string, together with /, ~, and ../../and so on. A naive implementation might fortunately stroll the mannequin’s “curiosity” straight to your API keys.
The design selection right here is to pin a secure base listing at script begin and reject any request that resolves outdoors of it:
# Safety: confine list_directory_contents to this base listing and its descendants
# Set to the present working listing when the script begins
SAFE_BASE_DIR = os.path.abspath(os.getcwd())
def list_directory_contents(path: str = ".") -> str:
"""Lists information and directories inside a path, constrained to the secure base listing."""
strive:
# Resolve to an absolute path and confirm it sits inside SAFE_BASE_DIR
# This blocks traversal makes an attempt like '../../and so on' or absolute paths like "https://www.kdnuggets.com/"
requested = os.path.abspath(os.path.be part of(SAFE_BASE_DIR, path))
if not (requested == SAFE_BASE_DIR or requested.startswith(SAFE_BASE_DIR + os.sep)):
return (
f"Error: Entry denied. The trail '{path}' resolves outdoors the "
f"permitted workspace ({SAFE_BASE_DIR})."
)
...
The sample is easy however value contemplating additional. We by no means belief the string the mannequin produced. We be part of it onto the bottom listing, resolve it completely (so .. will get normalized away), after which confirm the resolved path nonetheless begins with the bottom. Each /and so on/passwd and ../../someplace collapse into paths that fail that prefix verify and are rejected earlier than os.listdir is ever known as.
The remainder of the operate is housekeeping: affirm the trail exists and is a listing, record its contents, and format every entry as both [DIR] or [FILE] with a byte measurement. The returned string is apparent English with construction the mannequin can parse on the second cross:
entries = sorted(os.listdir(requested))
if not entries:
return f"The listing '{path}' is empty."
traces = [f"Contents of '{path}' ({len(entries)} item(s)):"]
for title in entries:
full = os.path.be part of(requested, title)
if os.path.isdir(full):
traces.append(f" [DIR] {title}/")
else:
strive:
measurement = os.path.getsize(full)
traces.append(f" [FILE] {title} ({measurement} bytes)")
besides OSError:
traces.append(f" [FILE] {title}")
return "n".be part of(traces)
The JSON schema we hand to the mannequin is intentionally permissive on the parameter aspect — path is elective, defaulting to the workspace root, as a result of most helpful first questions are in regards to the present folder:
{
"sort": "operate",
"operate": {
"title": "list_directory_contents",
"description": (
"Lists information and subdirectories inside a path inside the consumer's workspace. "
"Use this to examine the setting earlier than answering questions on native information."
),
"parameters": {
"sort": "object",
"properties": {
"path": {
"sort": "string",
"description": (
"A relative path contained in the workspace, e.g. '.', 'knowledge', or 'src/utils'. "
"Defaults to the workspace root."
)
}
},
"required": []
}
}
}
Be aware the outline does a small quantity of immediate engineering: “Use this to examine the setting earlier than answering questions on native information.” That sentence pushes Gemma 4 towards calling the instrument when the consumer asks a obscure query about “my information” slightly than guessing at what is likely to be there.
# Device 2: A Restricted Python Interpreter
The second instrument, execute_python_code, is the extra harmful and the extra pedagogically attention-grabbing of the 2. The premise is that language fashions, particularly small ones, are unreliable at exact arithmetic, precise string manipulation, and something involving greater than a few steps of branching logic. A instrument that lets the mannequin write and run a deterministic snippet is a significantly better reply to these issues than asking it to motive by them in pure language.
The implementation makes use of exec() with a intentionally stripped-down builtins namespace:
def execute_python_code(code: str) -> str:
"""Executes a snippet of Python code and returns no matter was printed to stdout.
This can be a learning-only sandbox. exec() is basically unsafe; don't expose this instrument
to untrusted customers or networks. The restrictions under cease the informal instances, not a
decided attacker.
"""
strive:
# A minimal restricted setting. We strip __builtins__ right down to a small
# whitelist in order that, e.g., open(), eval(), and __import__ will not be instantly
# out there from the snippet's world scope.
safe_builtins = {
"abs": abs, "all": all, "any": any, "bool": bool, "dict": dict,
"divmod": divmod, "enumerate": enumerate, "filter": filter, "float": float,
"int": int, "len": len, "record": record, "map": map, "max": max, "min": min,
"pow": pow, "print": print, "vary": vary, "repr": repr, "reversed": reversed,
"spherical": spherical, "set": set, "sorted": sorted, "str": str, "sum": sum,
"tuple": tuple, "zip": zip,
}
# Pre-import a few secure, helpful modules so the mannequin would not should.
import math, statistics
restricted_globals = {
"__builtins__": safe_builtins,
"math": math,
"statistics": statistics,
}
A couple of choices value calling out. We change __builtins__ fully slightly than blacklisting particular person capabilities, which implies open, eval, exec, compile, __import__, enter, and anything not in our whitelist merely doesn’t exist contained in the snippet. We pre-import math and statistics into the snippet’s globals as a result of the mannequin will attain for them always and we might slightly not power it to battle __import__ restrictions. We seize stdout with contextlib.redirect_stdout so the mannequin will get again precisely what its snippet printed:
# Seize stdout so we will hand the printed output again to the mannequin
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
exec(code, restricted_globals, {})
output = buffer.getvalue().strip()
if not output:
return "Code executed efficiently however produced no output. Use print() to return a price."
return f"Output:n{output}"
The empty-output department issues greater than it seems. Small fashions will routinely write expressions like x = sum(vary(101)) and neglect the print(x). Returning a selected error telling them to make use of print() offers the orchestration loop the choice to retry; with out it, the mannequin would synthesize a last reply primarily based on an empty string and confidently invent a price.
A last phrase on security, because the script’s docstring is blunt about it: this can be a studying sandbox, not a hardened one. A decided adversary can get away of a Python exec sandbox in a dozen methods, most of them involving object introspection by ().__class__.__mro__. For a single-user agent working by yourself laptop computer by yourself prompts, the whitelist is loads. For anything, you’d need an actual isolation layer — a subprocess with seccomp, a container, or RestrictedPython.
# The Orchestration Loop
The principle loop is unchanged in construction from the earlier tutorial. The mannequin is queried with the consumer immediate and the instrument registry, and if it responds with tool_calls, every name is dispatched towards TOOL_FUNCTIONS:
if "tool_calls" in message and message["tool_calls"]:
print("[TOOL EXECUTION]")
messages.append(message)
num_tools = len(message["tool_calls"])
for i, tool_call in enumerate(message["tool_calls"]):
function_name = tool_call["function"]["name"]
arguments = tool_call["function"]["arguments"]
...
if function_name in TOOL_FUNCTIONS:
func = TOOL_FUNCTIONS[function_name]
strive:
outcome = func(**arguments)
...
messages.append({
"function": "instrument",
"content material": str(outcome),
"title": function_name
})
The CLI formatting is value a small tweak for this script. The execute_python_code instrument’s code argument could be a multi-line string with newlines in it, which can wreck an ASCII tree if printed naively. We flatten and truncate string arguments for the show solely; the mannequin nonetheless receives the total string when the operate runs:
def _short(v):
if isinstance(v, str):
flat = v.change("n", "n")
if len(flat) > 60:
flat = flat[:57] + "..."
return f"'{flat}'"
return str(v)
args_str = ", ".be part of(f"{okay}={_short(v)}" for okay, v in arguments.gadgets())
As soon as every instrument result’s appended again into the message historical past as a "function": "instrument" entry, we re-call Ollama with the enriched payload and the mannequin produces its grounded last reply. Similar two-pass sample, identical logic.
# Testing the Instruments
And now we check our instrument calling. Pull gemma4:e2b with ollama pull gemma4:e2b you probably have not already, then run the script from a folder you don’t thoughts the mannequin peeking at.
Let’s begin with the filesystem instrument. From the mission listing:
What scripts are in my present folder, and which one seems prefer it must be used to course of CSVs?
Outcome:
[SYSTEM]
○ Device: execute_python_code......................[LOADED]
○ Device: list_directory_contents..................[LOADED]
○ Workspace: /Customers/matt/initiatives/gemma_agent.....[SANDBOXED]
[PROMPT]
What scripts are in my present folder, and which one seems prefer it must be used to course of CSVs?
[EXECUTION]
● Querying mannequin...
[TOOL EXECUTION]
└── Calling: list_directory_contents
├─ Args: path="."
└─ Outcome: Contents of '.' (5 merchandise(s)):
[FILE] README.md (412 bytes)
[FILE] csv_cleaner.py (1834 bytes)
[FILE] foremost.py (10786 bytes)
[FILE] notes.txt (88 bytes)
[FILE] sales_report.py (2210 bytes)
[EXECUTION]
● Synthesizing outcomes...
[RESPONSE]
Your present folder comprises 5 information. The one that appears meant for CSV
processing is csv_cleaner.py — its title strongly suggests it handles CSV enter.
sales_report.py can also contact CSV knowledge, however its title is extra about output than
ingestion.
The mannequin known as the instrument, appeared on the precise filenames, and made an affordable inference grounded within the itemizing slightly than in its weights. That’s the distinction between hallucination and commentary.
Subsequent, the Python interpreter. A small activity that small fashions reliably get improper if requested to do it of their head:
What’s the normal deviation of the numbers 12, 18, 23, 24, 29, 31, 35, 41, 44, 47, rounded to 4 decimal locations?
Outcome:
[SYSTEM]
○ Device: execute_python_code......................[LOADED]
○ Device: list_directory_contents..................[LOADED]
○ Workspace: /Customers/matt/initiatives/gemma_agent.....[SANDBOXED]
[PROMPT]
What's the normal deviation of the numbers 12, 18, 23, 24, 29, 31, 35, 41, 44, 47, rounded to 4 decimal locations?
[EXECUTION]
● Querying mannequin...
[TOOL EXECUTION]
└── Calling: execute_python_code
├─ Args: code="nums = [12, 18, 23, 24, 29, 31, 35, 41, 44, 47]nprint(rou..."
└─ Outcome: Output:
11.4659
[EXECUTION]
● Synthesizing outcomes...
[RESPONSE]
The usual deviation of these numbers, rounded to 4 decimal locations, is 11.4659.
The mannequin offloaded the calculation fully; it wrote a snippet, known as statistics.stdev, rounded the outcome, and reported what the interpreter mentioned. No psychological arithmetic, no approximation, no fabricated important digits.
Lastly, the extra attention-grabbing case: a immediate that requires each instruments in sequence. The mannequin has to examine the folder and compute one thing about what it finds:
Take a look at the information within the present folder and inform me the overall measurement in kilobytes, rounded to 2 decimal locations.
Output:
[SYSTEM]
○ Device: execute_python_code......................[LOADED]
○ Device: list_directory_contents..................[LOADED]
○ Workspace: /Customers/matt/initiatives/gemma_agent.....[SANDBOXED]
[PROMPT]
Take a look at the information within the present folder and inform me the overall measurement in kilobytes, rounded to 2 decimal locations.
[EXECUTION]
● Querying mannequin...
[TOOL EXECUTION]
┌── Calling: list_directory_contents
│ ├─ Args: path="."
│ └─ Outcome: Contents of '.' (5 merchandise(s)):
│ [FILE] README.md (412 bytes)
│ [FILE] csv_cleaner.py (1834 bytes)
│ [FILE] foremost.py (10786 bytes)
│ [FILE] notes.txt (88 bytes)
│ [FILE] sales_report.py (2210 bytes)
│
└── Calling: execute_python_code
├─ Args: code="sizes = [412, 1834, 10786, 88, 2210]nprint(spherical(sum(siz..."
└─ Outcome: Output:
15.33
[EXECUTION]
● Synthesizing outcomes...
[RESPONSE]
The 5 information within the present folder whole 15.33 KB.
Two instruments, in the appropriate order, with the output of 1 feeding the argument of the opposite — produced by a 2-billion-parameter mannequin working on a laptop computer with no GPU. The filesystem instrument grounds the mannequin in what is definitely there; the interpreter instrument grounds the reply in what is definitely true. The mannequin contributes the half it’s genuinely good at, which is deciding which query to ask of which instrument.
It’s value poking on the security guards too, simply to substantiate they maintain. Asking the mannequin “record the contents of /and so on” produces the anticipated denial message within the instrument outcome, which the mannequin then stories again gracefully slightly than fabricating a listing itemizing. Asking it to run open('/and so on/passwd').learn() contained in the interpreter produces a NameError, since open just isn’t within the whitelisted builtins. Each failures degrade into helpful error strings as an alternative of silent compromises, which is precisely what you need at this layer.
# Conclusion
The sooner tutorial confirmed that Gemma 4 can attain throughout the web in your behalf. This one exhibits it may well attain into the machine you might be sitting at, rigorously, when you could have constructed the carefulness in. Upon getting a working tool-calling loop, the attention-grabbing query stops being “can the mannequin name a operate” and begins being “what ought to I let it contact.”
A filesystem-aware instrument and a code-execution instrument collectively get you a lot of the solution to one thing that genuinely earns the time period agent: it may well observe its setting, determine what calculation issues, and run that calculation deterministically slightly than guessing. The sample generalizes from there. Database queries, shell instructions, git operations, doc parsing; every one in all these is similar JSON schema, the identical dispatch desk, the identical two-pass synthesis, with no matter security perimeter is suitable for the blast radius of the underlying name.
Construct the perimeter first. Then hand the mannequin the keys to no matter sits inside it.
Matthew Mayo (@mattmayo13) holds a grasp’s diploma in laptop science and a graduate diploma in knowledge mining. As managing editor of KDnuggets & Statology, and contributing editor at Machine Studying Mastery, Matthew goals to make complicated knowledge science ideas accessible. His skilled pursuits embody pure language processing, language fashions, machine studying algorithms, and exploring rising AI. He’s pushed by a mission to democratize information within the knowledge science neighborhood. Matthew has been coding since he was 6 years previous.
