How to build an autonomous machine learning research loop in Google Colab using Andrej Karpathy’s AutoResearch framework for hyperparameter discovery and experiment tracking



In this tutorial, we work with a Colab-enabled version of the AutoResearch framework originally proposed by Andrej Karpathy. We build an automated experiment pipeline that clones the AutoResearch repository, prepares a lightweight training environment, and runs a baseline experiment to establish initial performance metrics. We then create an automated research loop that programmatically edits the hyperparameters in train.py, runs new training iterations, evaluates the resulting models using the validation bits-per-byte (val_bpb) metric, and records every experiment in a structured results table. By running this workflow on Google Colab, we show how to replicate the core idea of autonomous machine learning research: iteratively change training configurations, evaluate performance, and keep the best configuration, all without specialized hardware or complex infrastructure.
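Before diving into the real pipeline, here is a minimal, self-contained sketch of the propose-evaluate-keep-best cycle that the rest of the tutorial applies to train.py. It uses a toy objective function and made-up hyperparameter names, so it is purely illustrative and not part of the AutoResearch repository:

import random

def evaluate(lr, batch_size):
    # Stand-in for a real training run; lower is better, like val_bpb.
    return (lr - 0.02) ** 2 + 0.001 * abs(batch_size - 64) + random.uniform(0, 1e-4)

search_space = {"lr": [0.005, 0.01, 0.02, 0.04], "batch_size": [16, 32, 64, 128]}
best_cfg, best_score = None, float("inf")

for step in range(10):
    # Propose a random configuration, evaluate it, and keep it only if it improves the metric.
    cfg = {k: random.choice(v) for k, v in search_space.items()}
    score = evaluate(**cfg)
    if score < best_score:
        best_cfg, best_score = cfg, score
    print(f"step {step}: {cfg} -> {score:.5f} (best {best_score:.5f})")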

import os, sys, subprocess, json, re, random, shutil, time
from pathlib import Path


def pip_install(pkg):
   subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])


# Install any dependency that is not already importable in the runtime.
for pkg in [
   "numpy","pandas","pyarrow","requests",
   "rustbpe","tiktoken","openai"
]:
   try:
       __import__(pkg)
   except ImportError:
       pip_install(pkg)


import pandas as pd


if not Path("autoresearch").exists():
   subprocess.run(["git","clone","https://github.com/karpathy/autoresearch.git"])


os.chdir("autoresearch")


# Read the OpenAI key from Colab secrets if available, otherwise from the environment.
OPENAI_API_KEY=None
try:
   from google.colab import userdata
   OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")
except Exception:
   OPENAI_API_KEY=os.environ.get("OPENAI_API_KEY")


if OPENAI_API_KEY:
   os.environ["OPENAI_API_KEY"]=OPENAI_API_KEY

First, we import the core Python libraries needed for the automated research workflow, install the required dependencies, and clone the AutoResearch repository directly from GitHub so the environment contains the original training framework. We also configure the OpenAI API key (if one is available), which lets the system optionally support LLM-assisted experiments later in the pipeline.
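If you want to confirm that the Colab runtime actually exposes a GPU before continuing, a quick optional check looks like the snippet below. It assumes torch is available, which is the case on standard Colab GPU runtimes; this check is not part of the AutoResearch code.

import torch

if torch.cuda.is_available():
   print("GPU:", torch.cuda.get_device_name(0))
else:
   print("No GPU detected; training will be very slow on CPU.")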

prepare_path=Path("prepare.py")
train_path=Path("train.py")
program_path=Path("program.md")


prepare_text=prepare_path.read_text()
train_text=train_path.read_text()


prepare_text=re.sub(r"MAX_SEQ_LEN = \d+","MAX_SEQ_LEN = 512",prepare_text)
prepare_text=re.sub(r"TIME_BUDGET = \d+","TIME_BUDGET = 120",prepare_text)
prepare_text=re.sub(r"EVAL_TOKENS = .*","EVAL_TOKENS = 4 * 65536",prepare_text)


train_text=re.sub(r"DEPTH = \d+","DEPTH = 4",train_text)
train_text=re.sub(r"DEVICE_BATCH_SIZE = \d+","DEVICE_BATCH_SIZE = 16",train_text)
train_text=re.sub(r"TOTAL_BATCH_SIZE = .*","TOTAL_BATCH_SIZE = 2**17",train_text)
train_text=re.sub(r'WINDOW_PATTERN = "SSSL"','WINDOW_PATTERN = "L"',train_text)


prepare_path.write_text(prepare_text)
train_path.write_text(train_text)


program_path.write_text("""
Goal:
Run autonomous research loop on Google Colab.


Rules:
Only modify train.py hyperparameters.


Metric:
Lower val_bpb is better.
""")


subprocess.run(["python","prepare.py","--num-shards","4","--download-workers","2"])

We modify key configuration parameters in the repository to make the training workflow compatible with Google Colab hardware. Reducing the context length, the training time budget, and the number of evaluation tokens lets experiments finish within limited GPU resources. After applying these patches, we prepare the dataset shards required for training so that experiments can start immediately.
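As an optional sanity check (not part of the original workflow), we can re-read the patched train.py and confirm that the substitutions took effect before spending any GPU time:

# Verify that the regex patches above actually landed in train.py.
patched=Path("train.py").read_text()
for key in ["DEPTH","DEVICE_BATCH_SIZE","TOTAL_BATCH_SIZE","WINDOW_PATTERN"]:
   m=re.search(rf"^{key}\s*=\s*(.+)$",patched,re.MULTILINE)
   print(key,"=",m.group(1) if m else "NOT FOUND")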

subprocess.run("python train.py > baseline.log 2>&1",shell=True)


def parse_run_log(log_path):
   text=Path(log_path).read_text(errors="ignore")
   def find(p):
       m=re.search(p,text,re.MULTILINE)
       return float(m.group(1)) if m else None
   return {
       "val_bpb":find(r"^val_bpb:\s*([0-9.]+)"),
       "training_seconds":find(r"^training_seconds:\s*([0-9.]+)"),
       "peak_vram_mb":find(r"^peak_vram_mb:\s*([0-9.]+)"),
       "num_steps":find(r"^num_steps:\s*([0-9.]+)")
   }


baseline=parse_run_log("baseline.log")


results_path=Path("results.tsv")


rows=[{
   "commit":"baseline",
   "val_bpb":baseline["val_bpb"] if baseline["val_bpb"] else 0,
   "memory_gb":round((baseline["peak_vram_mb"] or 0)/1024,1),
   "status":"keep",
   "description":"baseline"
}]


pd.DataFrame(rows).to_csv(results_path,sep="\t",index=False)


print("Baseline:",baseline)

We run baseline training to establish the initial performance reference for the model. The log-parsing function extracts key training metrics such as validation bits per byte (val_bpb), training time, peak GPU memory usage, and the number of optimization steps. We then save these baseline results in a structured experiment table so that all future experiments can be compared against this starting configuration.
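To see how the parser behaves, we can feed it a small synthetic log. The numbers below are made up and purely illustrative of the line format the regexes above expect:

sample="num_steps: 120\ntraining_seconds: 98.4\npeak_vram_mb: 5120\nval_bpb: 1.234\n"
Path("sample.log").write_text(sample)
print(parse_run_log("sample.log"))
# {'val_bpb': 1.234, 'training_seconds': 98.4, 'peak_vram_mb': 5120.0, 'num_steps': 120.0}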

TRAIN_FILE=Path("train.py")
BACKUP_FILE=Path("train.base.py")


if not BACKUP_FILE.exists():
   shutil.copy2(TRAIN_FILE,BACKUP_FILE)


HP_KEYS=[
"WINDOW_PATTERN",
"TOTAL_BATCH_SIZE",
"EMBEDDING_LR",
"UNEMBEDDING_LR",
"MATRIX_LR",
"SCALAR_LR",
"WEIGHT_DECAY",
"ADAM_BETAS",
"WARMUP_RATIO",
"WARMDOWN_RATIO",
"FINAL_LR_FRAC",
"DEPTH",
"DEVICE_BATCH_SIZE"
]


def read_text(path):
   return Path(path).read_text()


def write_text(path,text):
   Path(path).write_text(text)


def extract_hparams(text):
   vals={}
   for k in HP_KEYS:
       m=re.search(rf"^{k}\s*=\s*(.+?)$",text,re.MULTILINE)
       if m:
           vals[k]=m.group(1).strip()
   return vals


def set_hparam(text,key,value):
   return re.sub(rf"^{key}\s*=.*$",f"{key} = {value}",text,flags=re.MULTILINE)


base_text=read_text(BACKUP_FILE)
base_hparams=extract_hparams(base_text)


SEARCH_SPACE={
"WINDOW_PATTERN":['"L"','"SSSL"'],
"TOTAL_BATCH_SIZE":["2**16","2**17","2**18"],
"EMBEDDING_LR":["0.2","0.4","0.6"],
"MATRIX_LR":["0.01","0.02","0.04"],
"SCALAR_LR":["0.3","0.5","0.7"],
"WEIGHT_DECAY":["0.05","0.1","0.2"],
"ADAM_BETAS":["(0.8,0.95)","(0.9,0.95)"],
"WARMUP_RATIO":["0.0","0.05","0.1"],
"WARMDOWN_RATIO":["0.3","0.5","0.7"],
"FINAL_LR_FRAC":["0.0","0.05"],
"DEPTH":["3","4","5","6"],
"DEVICE_BATCH_SIZE":["8","12","16","24"]
}


def sample_candidate():
   keys=random.sample(list(SEARCH_SPACE.keys()),random.choice([2,3,4]))
   cand=dict(base_hparams)
   changes={}
   for k in keys:
       cand[k]=random.choice(SEARCH_SPACE[k])
       changes[k]=cand[k]
   return cand,changes


def apply_hparams(candidate):
   text=read_text(BACKUP_FILE)
   for k,v in candidate.items():
       text=set_hparam(text,k,v)
   write_text(TRAIN_FILE,text)


def run_experiment(tag):
   log=f"{tag}.log"
   subprocess.run(f"python train.py > {log} 2>&1",shell=True)
   metrics=parse_run_log(log)
   metrics["log"]=log
   return metrics

Next, we build the core utilities that enable automated hyperparameter experimentation: functions that extract hyperparameters from train.py, a searchable parameter space, and helpers to edit these values programmatically. We also create the mechanism that generates candidate configurations, applies them to the training script, and runs experiments while capturing their output.
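A quick dry run (no training involved) can confirm that sampled changes are actually written into train.py, and then roll the script back to its baseline state:

# Sample a candidate, apply it, inspect the result, then restore the backup.
candidate,changes=sample_candidate()
apply_hparams(candidate)
patched=extract_hparams(read_text(TRAIN_FILE))
print("Proposed changes:",changes)
print("Now in train.py:",{k:patched.get(k) for k in changes})
shutil.copy2(BACKUP_FILE,TRAIN_FILE)  # restore the baseline script after the dry run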

N_EXPERIMENTS=3


df=pd.read_csv(results_path,sep="\t")
best=df["val_bpb"].replace(0,999).min()


for i in range(N_EXPERIMENTS):
    tag=f"exp_{i+1}"

    # Propose a new configuration, write it into train.py, and run training.
    candidate,changes=sample_candidate()
    apply_hparams(candidate)
    metrics=run_experiment(tag)

    # Keep the configuration only if it improves the best val_bpb so far;
    # otherwise roll train.py back to the baseline script.
    if metrics["val_bpb"] and metrics["val_bpb"]<best:
        best=metrics["val_bpb"]
        status="keep"
        shutil.copy2(TRAIN_FILE,"best_train.py")
    else:
        status="revert"
        shutil.copy2(BACKUP_FILE,TRAIN_FILE)

    # Record the experiment in the results table.
    row={
        "commit":tag,
        "val_bpb":metrics["val_bpb"] or 0,
        "memory_gb":round((metrics["peak_vram_mb"] or 0)/1024,1),
        "status":status,
        "description":json.dumps(changes)
    }
    df=pd.concat([df,pd.DataFrame([row])],ignore_index=True)
    df.to_csv(results_path,sep="\t",index=False)
    print(tag,changes,metrics["val_bpb"],status)

We then run the automated research loop, which iteratively proposes new hyperparameter configurations and evaluates their performance. For each experiment, it modifies the training script, runs training, and compares the resulting validation score against the best configuration found so far. All experiment results are logged, improved configurations are kept, and the best training script is saved along with the experiment history for further analysis.
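Once the loop finishes, the experiment history in results.tsv can be loaded and sorted by the metric to see which configurations helped (rows with a missing val_bpb are masked out first):

history=pd.read_csv(results_path,sep="\t")
history["val_bpb"]=history["val_bpb"].replace(0,float("nan"))
print(history.sort_values("val_bpb").to_string(index=False))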

In conclusion, we built a fully automated research workflow that demonstrates how a machine can iteratively explore model configurations and improve training performance with minimal manual intervention. Throughout the tutorial, we prepared a dataset, established a baseline experiment, proposed new hyperparameter configurations, ran experiments, and implemented a search loop that tracks results across multiple trials. By maintaining experiment logs and automatically saving improved configurations, we created a reproducible and scalable research process that mirrors the workflows used in modern machine learning experimentation. The approach shows how automation, experiment tracking, and lightweight infrastructure can be combined to accelerate model development directly from a cloud notebook environment.

