The most powerful BoxLite pattern: let an LLM generate code and execute it in a sandbox. The LLM reasons about the problem, writes Python, and BoxLite runs it safely. This tutorial shows how to wire up tool calling with three popular providers.

How it works

Every LLM provider supports tool calling (also called function calling). You define a tool called execute_python that accepts a code parameter. When the LLM decides it needs to compute something, it returns a tool call instead of a text response. Your code executes that tool call inside a BoxLite CodeBox and feeds the output back to the LLM, which then formulates the final answer. The pattern is the same regardless of provider:
  1. Define the tool schema (what “execute code” means)
  2. Send messages to the LLM with the tool available
  3. When the LLM returns a tool call, run the code in BoxLite
  4. Send the result back to the LLM
  5. Repeat until the LLM responds with text

Prerequisites

pip install boxlite openai anthropic
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

For the Vercel AI SDK example (Node.js):

npm install ai @ai-sdk/openai zod @boxlite-ai/boxlite

OpenAI

openai_sandbox.py
import asyncio
import json
import boxlite
from openai import AsyncOpenAI

EXECUTE_CODE_TOOL = {
    "type": "function",
    "function": {
        "name": "execute_python",
        "description": "Execute Python code in a secure sandbox. Use this to run calculations, data analysis, or any Python code. The code's stdout is returned.",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {
                    "type": "string",
                    "description": "Python code to execute. Use print() to output results.",
                }
            },
            "required": ["code"],
        },
    },
}


async def run_agent(question: str):
    client = AsyncOpenAI()

    async with boxlite.CodeBox() as codebox:
        messages = [
            {
                "role": "system",
                "content": "You are a helpful assistant. When you need to compute something, write Python code using the execute_python tool. Always print() your results.",
            },
            {"role": "user", "content": question},
        ]

        while True:
            response = await client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                tools=[EXECUTE_CODE_TOOL],
            )

            choice = response.choices[0]

            if choice.finish_reason == "tool_calls":
                messages.append(choice.message)
                for tool_call in choice.message.tool_calls:
                    args = json.loads(tool_call.function.arguments)
                    code = args["code"]

                    print(f"--- Executing code ---\n{code}\n---")
                    result = await codebox.run(code)
                    print(f"Output: {result}")

                    messages.append({
                        "role": "tool",
                        "tool_call_id": tool_call.id,
                        "content": result,
                    })
            else:
                print(f"\nAnswer: {choice.message.content}")
                break


if __name__ == "__main__":
    asyncio.run(run_agent("What's the 50th Fibonacci number?"))
Expected output:
--- Executing code ---
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib(50))
---
Output: 12586269025

Answer: The 50th Fibonacci number is 12,586,269,025.
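For reference, after one tool round the messages list in the loop above has the following shape, shown here as plain dicts with a hypothetical tool-call id (the real loop appends the SDK's message object directly, which serializes to the same structure):

```python
import json

# Hypothetical transcript after one execute_python round.
messages = [
    {"role": "system", "content": "You are a helpful assistant. ..."},
    {"role": "user", "content": "What's the 50th Fibonacci number?"},
    {   # The assistant's turn: no text, just a tool call.
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_abc123",  # hypothetical id assigned by the API
            "type": "function",
            "function": {
                "name": "execute_python",
                # Note: arguments is a JSON *string*, not a dict.
                "arguments": json.dumps({"code": "print(12586269025)"}),
            },
        }],
    },
    {   # The tool result, matched to the call by tool_call_id.
        "role": "tool",
        "tool_call_id": "call_abc123",
        "content": "12586269025\n",
    },
]

# Because arguments arrives as a JSON string, it must be parsed before use,
# which is why the loop calls json.loads() on it:
args = json.loads(messages[2]["tool_calls"][0]["function"]["arguments"])
print(args["code"])  # the code the model asked to run
```

The tool message must reference the exact id from the assistant's tool call, or the API will reject the request.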

Anthropic

Anthropic uses a different tool schema format (input_schema instead of parameters) and a different response structure (stop_reason instead of finish_reason, content blocks instead of tool_calls).
anthropic_sandbox.py
import asyncio
import boxlite
from anthropic import AsyncAnthropic

EXECUTE_CODE_TOOL = {
    "name": "execute_python",
    "description": "Execute Python code in a secure sandbox. Use this to run calculations, data analysis, or any Python code. The code's stdout is returned.",
    "input_schema": {
        "type": "object",
        "properties": {
            "code": {
                "type": "string",
                "description": "Python code to execute. Use print() to output results.",
            }
        },
        "required": ["code"],
    },
}


async def run_agent(question: str):
    client = AsyncAnthropic()

    async with boxlite.CodeBox() as codebox:
        messages = [{"role": "user", "content": question}]

        while True:
            response = await client.messages.create(
                model="claude-sonnet-4-5-20250929",
                max_tokens=4096,
                system="You are a helpful assistant. When you need to compute something, write Python code using the execute_python tool. Always print() your results.",
                tools=[EXECUTE_CODE_TOOL],
                messages=messages,
            )

            # Serialize content blocks to plain dicts for the messages array
            assistant_content = []
            for block in response.content:
                if block.type == "text":
                    assistant_content.append({"type": "text", "text": block.text})
                elif block.type == "tool_use":
                    assistant_content.append({
                        "type": "tool_use",
                        "id": block.id,
                        "name": block.name,
                        "input": block.input,
                    })
            messages.append({"role": "assistant", "content": assistant_content})

            if response.stop_reason == "tool_use":
                tool_results = []
                for block in response.content:
                    if block.type == "tool_use":
                        code = block.input["code"]

                        print(f"--- Executing code ---\n{code}\n---")
                        result = await codebox.run(code)
                        print(f"Output: {result}")

                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": result,
                        })

                messages.append({"role": "user", "content": tool_results})
            else:
                # Extract final text response
                for block in response.content:
                    if hasattr(block, "text"):
                        print(f"\nAnswer: {block.text}")
                break


if __name__ == "__main__":
    asyncio.run(run_agent("What's the 50th Fibonacci number?"))
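The two schema formats carry the same JSON Schema payload, so a tool defined once can be converted mechanically rather than maintained twice. A small sketch (a helper of our own, not part of either SDK) that turns the OpenAI-style definition into the Anthropic shape:

```python
def openai_tool_to_anthropic(tool: dict) -> dict:
    """Convert an OpenAI function-tool definition to Anthropic's format.

    OpenAI nests everything under "function" and calls the JSON Schema
    "parameters"; Anthropic flattens the dict and calls it "input_schema".
    """
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn["description"],
        "input_schema": fn["parameters"],
    }


# Example: the execute_python tool from the OpenAI section.
OPENAI_TOOL = {
    "type": "function",
    "function": {
        "name": "execute_python",
        "description": "Execute Python code in a secure sandbox.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}

anthropic_tool = openai_tool_to_anthropic(OPENAI_TOOL)
print(anthropic_tool["input_schema"]["required"])  # ['code']
```

This keeps one source of truth for the tool definition if you support both providers in the same codebase.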

Vercel AI SDK

The Vercel AI SDK provides a unified tool() helper that handles schema validation and execution in one place. The maxSteps parameter replaces the manual while loop — the SDK automatically re-calls the model when a tool result is returned.
vercel-ai-sandbox.ts
import { CodeBox } from '@boxlite-ai/boxlite';
import { generateText, tool } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

async function runAgent(question: string) {
  const codebox = new CodeBox();

  try {
    const { text } = await generateText({
      model: openai('gpt-4o'),
      system: 'You are a helpful assistant. When you need to compute something, write Python code using the execute_python tool. Always print() your results.',
      prompt: question,
      tools: {
        execute_python: tool({
          description: 'Execute Python code in a secure sandbox. Use print() to output results.',
          parameters: z.object({
            code: z.string().describe('Python code to execute.'),
          }),
          execute: async ({ code }) => {
            console.log(`--- Executing code ---\n${code}\n---`);
            const result = await codebox.run(code);
            console.log(`Output: ${result}`);
            return result;
          },
        }),
      },
      maxSteps: 5,
    });

    console.log(`\nAnswer: ${text}`);
  } finally {
    await codebox.stop();
  }
}

runAgent("What's the 50th Fibonacci number?");
The Vercel AI SDK supports many model providers — swap openai('gpt-4o') for anthropic('claude-sonnet-4-5-20250929'), google('gemini-2.0-flash'), or any other supported model. The tool definitions stay the same.

Multiple questions

The CodeBox persists across calls, so installed packages and files on disk carry over between tool invocations within the same conversation. Note that in-memory state (variables, functions) does not persist — each run() is a separate Python process.
multi_question.py
async def main():
    questions = [
        "What's the standard deviation of [23, 45, 67, 89, 12, 34, 56]?",
        "Generate a random 8-character password with letters and digits.",
        "What day of the week was January 1, 2000?",
    ]

    for question in questions:
        print(f"\nQ: {question}")
        await run_agent(question)


if __name__ == "__main__":
    asyncio.run(main())
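The persistence rule above (files survive between run() calls, in-memory state does not) follows from each run() being a fresh interpreter process. The same behavior can be demonstrated with plain subprocesses, no sandbox required:

```python
import os
import subprocess
import sys
import tempfile


def run(code: str) -> str:
    """Run code in a fresh Python process, analogous to each CodeBox.run()."""
    proc = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True
    )
    return proc.stdout + proc.stderr


workdir = tempfile.mkdtemp()
state_file = os.path.join(workdir, "state.txt")

# First "call": define a variable and also write it to disk.
run(f"x = 42\nopen({state_file!r}, 'w').write(str(x))")

# Second "call": the variable is gone, but the file survives.
out = run(f"print(open({state_file!r}).read())")
print(out.strip())  # 42

err = run("print(x)")  # NameError: x lived in a different process
print("NameError" in err)  # True
```

The practical upshot: to carry state between tool invocations, have the model write intermediate results to files (or reinstall-free packages), not rely on variables from a previous run.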
For production deployments — concurrency models, timeout handling, security presets, and defensive execution patterns — see the AI agent integration guide.

What’s next?