Agent AI for modern deep learning experiments



Read metrics, detect anomalies, apply predefined tuning rules, restart jobs if necessary, and log every decision without staring at a loss curve at 2 a.m.

This article presents a lightweight agent designed for deep learning researchers and ML engineers. It can:

• Automatically detect failures
• Visually assess performance metrics
• Apply predefined hyperparameter strategies
• Restart jobs
• Document all actions and results

No architecture search. No AutoML. No invasive rewrites of your codebase.

The implementation is intentionally minimal. Containerize your training script, add a small LangChain-based agent, define hyperparameters in YAML, and express your preferences in markdown. You’ve probably already done 50% of this.

Add this agent to your manual train.py workflow and go from 0️⃣ to 💯 in one day.

Problems with existing experiments

🤔 You ponder hyperparameters endlessly.

▶️ You run train.py.

🐛 You fix a bug in train.py.

🔁 You rerun train.py.

👀 You stare at TensorBoard.

🫠 You question reality.

🔄 You repeat.

Every deep learning/machine learning engineer in the field does this. Don’t be shy. (Original photo via Pexels by MART PRODUCTION; GIF imagined by Grok.)

Stop staring at the numbers that models spit out.

You are not a Jedi. No matter how hard you stare, you cannot magically will your [validation loss | classification accuracy | perplexity | any other metric you can name] to move in the direction you want.

Babysitting a model in the middle of the night because of vanishing or exploding gradients, or a NaN buried deep in a transformer network that is hard to trace and may never even show up? It’s exhausting.

How can you solve real research problems if most of your time goes to work that is technically necessary but contributes little to real insight?

If 70% of your day is spent on operational drag, when does the next idea get to form?

Moving to agent-driven experimentation

Most of the deep learning engineers and researchers I work with still conduct experiments manually. Most of the day is spent scanning last night’s runs in Weights & Biases or TensorBoard, comparing runs, exporting metrics, tuning hyperparameters, logging notes, and restarting jobs. Then the cycle repeats.

It’s monotonous, boring, repetitive work.

Reduce these repetitive tasks so you can focus on high-value work.

To be clear: this is not AutoML.

Your new agent does not decide how to change the network topology or add complex functionality. That’s your job. It takes over the repetitive glue work that adds little value and wastes valuable time.

Agent-driven experimentation (ADE)

Switching from manual experiments to an agent-driven workflow is easier than you might think. No stack rewrites, heavy systems, or technical debt required.

Image by author

The core of ADE requires three steps:

  1. Containerize existing training scripts
    • Wrap your current train.py inside a Docker container. No refactoring of model logic. No architectural changes. Just reproducible execution boundaries.
  2. Add a lightweight agent
    • Deploy a small LangChain-based script that reads metrics from a dashboard, applies settings, decides when to restart, stop, or document, and runs on cron or your favorite job scheduler.
  3. Define behavior and preferences using natural language
    • Use YAML files for configuration and hyperparameters
    • Communicate with the agent using markdown documents

That’s the whole system. Now let’s review each step.

Containerize your training scripts

Some may argue that you should do this anyway: it makes restarting and scheduling much easier and significantly reduces disruption to existing processes if you later move training to a Kubernetes cluster.

If you have already done this, skip to the next section. If not, here is some starter code you can use.

First, let’s define the project structure that will work with Docker.

your experiment/
├── scripts/
│   ├── train.py                 # Main training script
│   └── health_server.py         # Health check server
├── requirements.txt             # Python dependencies
├── Dockerfile                   # Container definition
└── run.sh                       # Script to start training + health check
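
The requirements.txt in the tree above might look like the following. This is a plausible sketch based on the libraries used later in this article (PyTorch itself is installed in the Dockerfile); pin versions to taste.

```
fastapi
uvicorn
requests
pyyaml
python-box
selenium
langchain
langchain-openai
```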

You need to make sure train.py can load its configuration file from the cloud, so the agent can edit it as needed.

We recommend using GitHub for this. Below is an example of reading a remote configuration file; the agent has corresponding tools to read and modify it.

import os
import requests
import yaml
from box import Box  # pip install python-box

# add this to `train.py`
GITHUB_RAW = (
    "https://raw.githubusercontent.com/"
    "{owner}/{repo}/{ref}/{path}"
)

def load_config_from_github(owner, repo, path, ref="main", token=None):
    """Fetch a YAML config from GitHub and expose keys as attributes."""
    url = GITHUB_RAW.format(owner=owner, repo=repo, ref=ref, path=path)

    headers = {}
    if token:
        # needed for private repos, e.g. token=os.getenv("GITHUB_TOKEN")
        headers["Authorization"] = f"Bearer {token}"

    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()

    return Box(yaml.safe_load(r.text))


config = load_config_from_github(...)

# use params throughout your `train.py` script
# (assumes e.g. `from torch.optim import Adam` elsewhere)
optimizer = Adam(lr=config.lr)
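
For illustration, the remote file fetched above might look like this. This is a hypothetical config.yaml; the parameter names besides lr are examples drawn from the VQ-VAE preferences shown later.

```yaml
# config.yaml, stored in the GitHub repo the agent can edit
lr: 0.0003
batch_size: 64
codebook_size: 512
perplexity_weight: 0.1
```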

It also includes a health check server that runs alongside the main process. This lets container managers such as Kubernetes, and the agent itself, monitor job status without inspecting logs.

If a container’s state changes unexpectedly, it can be restarted automatically. This also simplifies agent inspection, since reading and summarizing log files costs far more tokens than simply checking the health of a container.

# health_server.py
import time
from pathlib import Path
from fastapi import FastAPI, Response

app = FastAPI()

HEARTBEAT = Path("/tmp/heartbeat")
STATUS = Path("/tmp/status.json")  # optional richer state
MAX_AGE = 300  # seconds

def last_heartbeat_age():
    if not HEARTBEAT.exists():
        return float("inf")
    return time.time() - float(HEARTBEAT.read_text())

@app.get("/health")
def health():
    age = last_heartbeat_age()

    # stale -> training likely hung
    if age > MAX_AGE:
        return Response("stalled", status_code=500)

    # optional: detect NaNs or failure flags written by trainer
    if STATUS.exists() and "failed" in STATUS.read_text():
        return Response("failed", status_code=500)

    return {"status": "ok", "heartbeat_age": age}
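
The health server above only reads /tmp/heartbeat; the training loop has to write it. Here is a minimal sketch of the trainer side (the write_heartbeat helper and the status format are assumptions that simply mirror the server above, not a standard API):

```python
import json
import time
from pathlib import Path

HEARTBEAT = Path("/tmp/heartbeat")
STATUS = Path("/tmp/status.json")

def write_heartbeat(status: str = "ok") -> None:
    """Record the current wall-clock time so /health can compute staleness,
    plus an optional richer state for the failure-flag check."""
    HEARTBEAT.write_text(str(time.time()))
    STATUS.write_text(json.dumps({"state": status}))

# inside the training loop, once per step or epoch:
# for epoch in range(num_epochs):
#     loss = train_one_epoch(...)
#     write_heartbeat("failed" if loss != loss else "ok")  # NaN != NaN
```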

A small shell script, run.sh, starts the health_server process in parallel with train.py.

#!/bin/bash

# Start the health check server (a FastAPI app) in the background
python3 -m uvicorn scripts.health_server:app --host 0.0.0.0 --port 8000 &
# Capture its PID if you want to terminate it later
HEALTH_PID=$!
# Start the main training script
python3 scripts/train.py

And of course, because the Dockerfile builds on NVIDIA’s base image, the container can use the host’s GPUs without friction. This example targets PyTorch, but you can easily adapt it for JAX or TensorFlow if needed.

FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu20.04

RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
    python3 python3-pip git

RUN python3 -m pip install --upgrade pip

# Install PyTorch with CUDA support
RUN pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121

WORKDIR /app

COPY . /app

# Install the remaining dependencies (fastapi, uvicorn, requests, etc.)
RUN pip3 install -r requirements.txt

CMD ["bash", "run.sh"]

✅ You are containerized. Simple and minimal.

Add a lightweight agent

There are many agent frameworks to choose from. For this agent, I like LangChain.

LangChain is a framework for building LLM-driven systems that combine inference and execution. It simplifies chaining model calls, managing memory, and integrating external functionality, allowing the LLM to do more than just generate text.

In LangChain, tools are explicitly defined, schema-bound functions that models can call. Each tool is an idempotent skill or task (reading a file, querying an API, changing state, etc.).

For an agent to work, you must first define the tools it can use to accomplish its goals.

Tool definitions

  1. read_preferences
    • Loads user settings and experiment notes from a markdown document
  2. check_tensorboard
    • Screenshots metrics using Selenium with the Chrome web driver
  3. analyze_metric
    • Uses multimodal LLM inference to understand what is happening in the screenshots
  4. check_container_health
    • Checks the containerized experiment via its health endpoint
  5. restart_container
    • Restarts the experiment on anomalies or when hyperparameter changes are required
  6. modify_config
    • Modifies the remote configuration file and commits it to GitHub
  7. write_memory
    • Writes a record of actions to persistent memory (markdown)

This set of tools defines the operational boundaries of the agent. All interaction with the experiment through these tools makes the behavior controllable and, hopefully, predictable.

Instead of providing these tools inline, here is a GitHub gist that includes all the tools mentioned above. You can include them in your agent or modify them as needed.
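
To give a feel for what these tools look like, here is a hedged sketch of the core of read_preferences: pulling the yaml blocks out of preferences.md so they can be parsed. The function name and file layout are assumptions for illustration; the real tools live in the gist, and in practice you would wrap this with LangChain’s @tool decorator.

```python
import re
from typing import List

def extract_yaml_blocks(markdown_text: str) -> List[str]:
    """Pull the raw contents of every yaml fenced block out of a
    markdown document, ready for yaml.safe_load."""
    # `{3} matches the triple-backtick fence without writing it literally
    pattern = re.compile(r"`{3}yaml\s*\n(.*?)`{3}", re.DOTALL)
    return [m.strip() for m in pattern.findall(markdown_text)]

# Example: a metrics section like the one in preferences.md
fence = "`" * 3
doc = f"""
## Metrics
{fence}yaml
metrics:
  - name: perplexity
{fence}
"""
blocks = extract_yaml_blocks(doc)
```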

The agent

To be honest, when I first tried to work through LangChain’s official documentation, I immediately lost interest in the whole idea.

It is verbose and more complicated than it needs to be. If you are new to agents or don’t want to navigate the labyrinth of LangChain documentation, keep reading.

LangSmith? Random asides? The little tooltips here and there? Keep fighting through this worthy enemy. (Image imagined by Grok.)

Briefly, a LangChain agent works as follows.

The agent uses a prompt to decide what to do at each step.

Each step is created dynamically by feeding the current context and the previous output into the prompt. Every LLM call [+ optional tool invocation] is a step, and its output feeds the next step: a chain.

With this conceptually recursive loop, the agent can reason its way through all the steps needed to take the intended action. The number of steps depends on the agent’s reasoning ability and how well-defined the termination conditions are.

It’s a chain of steps. A LangChain. Get it? 🤗

The prompt

As mentioned earlier, the prompt is the recursive glue that maintains context across LLM and tool invocations. It exposes placeholders (defined below) that are filled when the agent is first initialized.

It uses some of LangChain’s built-in memory abstractions, which are included in each tool call. Beyond that, the agent fills in the gaps, deciding the next steps and which tools to invoke.

The main prompt is shown below for ease of reading. You can plug it directly into your agent script or load it from the file system before execution.

You are an experiment automation agent responsible for monitoring 
and maintaining ML experiments.

Current context:
{chat_history}

Your workflow:
1. First, read preferences from preferences.md to understand thresholds and settings
2. Check TensorBoard at the specified URL and capture a screenshot
3. Analyze key metrics (validation loss, training loss, accuracy) from the screenshot
4. Check Docker container health for the training container
5. Take corrective actions based on analysis:
   - Restart unhealthy containers
   - Adjust hyperparameters according to user preferences 
     and anomalous patterns, restarting the experiment if necessary
6. Log all observations and actions to memory

Important guidelines:
- Always read preferences first to get current configuration
- Use visual analysis to understand metric trends
- Be conservative with config changes (only adjust if clearly needed)
- Write detailed memory entries for future reference
- Check container health before and after any restart
- When modifying config, use appropriate values from preferences

Available tools: {tool_names}
Tool descriptions: {tools}

Current task: {input}

Think step by step and use tools to complete the workflow.

And that’s it: in about 100 lines, your agent is complete. Once the agent is initialized, it executes a series of steps. At each step, the current_task directive is fed into the prompt, and each tool updates the shared ConversationSummaryBufferMemory instance.

We use OpenAI for this agent, but LangChain supports alternatives, including self-hosted models. If cost is an issue, open-source models work as well.

import os
from datetime import datetime
from pathlib import Path
from langchain.agents import AgentExecutor, create_react_agent
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationSummaryBufferMemory

# Import tools from tools.py
from tools import (
    read_preferences,
    check_tensorboard,
    analyze_metric,
    check_container_health,
    restart_container,
    modify_config,
    write_memory
)

PROMPT = open("prompt.txt").read()


class ExperimentAutomation:
    def __init__(self, openai_key=None):
        """Initialize the agent"""
        self.llm = ChatOpenAI(
            temperature=0.8,
            model="gpt-4-turbo-preview",
            api_key=openai_key or os.getenv('OPENAI_API_KEY')
        )

        # Initialize memory for conversation context
        self.memory = ConversationSummaryBufferMemory(
            llm=self.llm,
            max_token_limit=32000,
            memory_key="chat_history",
            return_messages=True
        )

    def create_agent(self):
        """Create LangChain agent with imported tools"""
        from functools import partial
        from langchain.tools import Tool

        # The ReAct agent needs a name and description per tool,
        # which bare lambdas lack, so wrap each function as a Tool
        tools = [
            Tool(
                name=fn.__name__,
                description=fn.__doc__ or fn.__name__,
                func=partial(fn, memory=self.memory),
            )
            for fn in (
                read_preferences,
                check_tensorboard,
                analyze_metric,
                check_container_health,
                restart_container,
                modify_config,
                write_memory,
            )
        ]

        # Create the prompt template
        prompt = PromptTemplate.from_template(PROMPT)

        agent = create_react_agent(
            llm=self.llm,
            tools=tools,
            prompt=prompt
        )

        # Create agent executor with memory
        return AgentExecutor(
            agent=agent,
            tools=tools,
            memory=self.memory,
            verbose=True,
            max_iterations=15,
            handle_parsing_errors=True,
            return_intermediate_steps=True
        )

    def run_automation_cycle(self):
        """Execute the full automation cycle step by step"""
        write_memory(
            entry="Automation cycle started",
            category="SYSTEM",
            memory=self.memory
        )

        try:
            agent = self.create_agent()

            # Define the workflow as individual steps
            workflow_steps = [
                "Read preferences from preferences.md to capture thresholds and settings",
                "Check TensorBoard at the specified URL and capture a screenshot",
                "Analyze validation loss, training loss, and accuracy from the screenshot",
                "Check Docker container health for the training container",
                "Restart unhealthy containers if needed",
                "Adjust hyperparameters according to preferences and restart container if necessary",
                "Write all observations and actions to memory"
            ]

            # Execute each step individually
            for step in workflow_steps:
                result = agent.invoke({"input": step})

                # Write step output to memory
                if result.get("output"):
                    memory_summary = f"Step: {step}\nOutput: {result['output']}"
                    write_memory(entry=memory_summary, category="STEP", memory=self.memory)

            write_memory(
                entry="Automation cycle completed successfully",
                category="SYSTEM",
                memory=self.memory
            )

            return result

        except Exception as e:
            error_msg = f"Automation cycle failed: {str(e)}"
            write_memory(entry=error_msg, category="ERROR", memory=self.memory)
            raise


def main():
    try:
        automation = ExperimentAutomation(openai_key=os.environ["OPENAI_API_KEY"])
        result = automation.run_automation_cycle()

        if result.get('output'):
            print(f"\nFinal Output:\n{result['output']}")

        if result.get('intermediate_steps'):
            print(f"\nSteps Executed: {len(result['intermediate_steps'])}")

        print("\n✓ Automation cycle completed successfully")

    except Exception as e:
        print(f"\n✗ Automation failed: {e}")
        write_memory(entry=f"Critical failure: {str(e)}", category="ERROR")
        import sys
        sys.exit(1)


if __name__ == "__main__":
    main()

Now that we have the agent and the tools, let’s cover how to actually express your intent as a researcher. This is the most important part.

Define behavior and preferences using natural language

As explained, it’s important to define what you’re looking for when starting an experiment in order to get the correct behavior from your agent.

Image inference models have come a long way and carry quite a bit of context, but they are still a long way from knowing what a good policy loss curve looks like in hierarchical policy optimization, or what a healthy codebook perplexity is in the vector-quantized variational autoencoder I’ve been optimizing for the past week.

For this, we seed the agent’s automated reasoning with a preferences.md file.

Let’s start with the general settings:

# Experiment Preferences

This file defines my preferences for this experiment.
The agent should always read this first before taking any action.

---

## General Settings

- experiment_name: vqvae
- container_name: vqvae-train
- tensorboard_url: http://localhost:6006
- memory_file: memory.md
- maximum_adjustments_per_run: 4

---
## More details
You can always add more sections here. The read_preferences task will parse
and reason over each section. 

Next, let’s define the metrics of interest. This is especially important for visual reasoning.

Define them in your markdown document as yaml blocks, which the agent parses using the read_preferences tool. Adding this structure is useful when passing settings as arguments to other tools.

```yaml
metrics:
  - name: perplexity
    pattern: should remain high through the course of training
    restart_condition: premature collapse to zero
    hyperparameters: |
        if collapse, increase `perplexity_weight` from current value to 0.2
  - name: prediction_loss
    pattern: should decrease over the course of training
    restart_condition: increases or stalls
    hyperparameters: |
        if increases, increase the `prediction_weight` value from current to 0.4
  - name: codebook_usage
    pattern: should remain fixed at > 90%
    restart_condition: drops below 90% for many epochs
    hyperparameters: |
        decrease the `codebook_size` param from 512 to 256. 

```

The important idea is that preferences.md must provide well-structured, descriptive detail. This allows the agent to:

• Compare its analysis to your intent. Example: the agent observes validation loss = 0.6, but the configuration says val_loss_threshold should be 0.5, so it knows what the corrective action should be.

• Read thresholds and constraints (YAML or key-value) for metrics, hyperparameters, and container management.

• Understand intent, or intent patterns, described in human-readable sections, such as “Adjust learning rate only when validation loss exceeds a threshold and accuracy plateaus.”
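
As a concrete illustration of the first point, the comparison the agent performs reduces to something like the sketch below. The decide_action helper, the threshold name, and the return values are hypothetical; in practice the decision is made by the LLM reasoning over the screenshot and preferences.

```python
def decide_action(observed_val_loss: float, val_loss_threshold: float) -> str:
    """Compare an observed metric against the preference threshold and
    name the corrective action, mirroring what the agent reasons out."""
    if observed_val_loss > val_loss_threshold:
        return "adjust_hyperparameters_and_restart"
    return "no_action"

# Agent observes validation loss = 0.6; preferences say the threshold is 0.5
action = decide_action(0.6, 0.5)
```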

Wire everything together

Now that you have a containerized experiment and an agent, you need to schedule the agent. This is as simple as running the agent process via a cron task. The entry below runs the agent every hour, a reasonable trade-off between cost (in tokens) and operational responsiveness.

0 * * * * /usr/bin/python3 /path/to/agent.py >> /var/log/agent.log 2>&1

We found that this agent does not require the latest reasoning models; it works well with previous-generation models from Anthropic and OpenAI.

Summary

If research time is limited, it should be spent on research, not babysitting experiments.

The agent should handle monitoring, restarts, and parameter adjustments without ongoing supervision. Once that friction is gone, what remains is the actual work: forming hypotheses, designing better models, and testing important ideas.

I hope this agent frees you up a little to come up with your next big idea. Enjoy.

References

Chase, H. (2022). LangChain: A framework for developing applications using large language models. GitHub repository. https://github.com/hwchase17/langchain

OpenAI. (2023). OpenAI API documentation. https://platform.openai.com/docs


