Science is the most reliable method discovered by humanity to generate knowledge. And for most of history, they were expensive to operate.
Andrei Karpathy A few weeks ago we released 600 lines of Python code and that started to change. his automatic investigation (look EV#565) runs an autonomous experimental loop where humans set strategic direction, define what good is, and agents iterate towards success within guardrails. Andrej’s first experiment trained a GPT-2 level model over two days and showed an 11% speedup and a true improvement of 20.
Shortly after its release, Shopify CEO Toby Lütke used an internal model of qmd automated research. Having run 37 experiments overnight, Toby woke up with a 0.8 billion parameter model, a 19% increase over the previous 1.6 billion parameter version. Toby is not a machine learning engineer.
Automated research is powerful because it solves two problems at once. One is to automate parts of the knowledge production process. And second, solving agent control problems and keeping agents engaged. If you give the AI a free explanation or optimize in the wrong direction, it will often deviate. Fortunately, AutoResearch prevents this by design. Humans decide where cars go. autoresearch remains at the wheel.
I’ve spent the last month adapting automated research to knowledge work beyond machine learning, with the goal of standing up a system that can run structured, low-cost experiments on the types of decisions most teams make every week. calling this version auto betaand I am making the complete playbook/skill available to paid members below.
Let’s go!
When I first saw automated research, my immediate impression was that it doesn’t have to be just machine learning. This loop is common: hypothesize, test, score, iterate. So I cloned it and started running it in other parts of my work.
Things didn’t go as I expected. The output looked good, but I couldn’t tell if it was an improvement. Unlike ML, where the agent has built-in feedback signals from each training run, knowledge work didn’t have that. Pricing decisions are not validated in 5 minutes. And in the paragraphs I write, most of the time I can’t tell if the argument is getting better or just changing.
This is why automated research is really difficult to apply to knowledge. The loop needs something to take into account and optimize for something that doesn’t exist naturally. Knowledge work requires optimizing such things.
So I built a version of automated research called AutoBeta that can address a wide range of business problems. Although not as technically robust as Karpathy, it has the same design principles. Objectives and constraints are set and the experiment is done in a loop.
What I changed was the score. I created an “oracle”. This is a comprehensive panel of judges that collapses the loop into a single number that can be optimized and scores each output against predefined criteria.
