AI agents build better AI

Artificial intelligence is no longer just a tool for end-user products. It is now a key component in building more sophisticated AI itself. This shift is evident in how AI is optimizing infrastructure, training workflows, and the very systems used to develop AI. LinkedIn Engineering began exploring this idea in August 2025, using agent loops to improve LLM post-training execution. The team's initial success came not only from automating tasks but also from creating a structured loop of propose, test, measure, and improve.

Visual TL;DR. AI agents refine AI through an iterative refinement loop. The loop is measured against a definition of success and driven by structured feedback. AI powering AI requires a unified platform. The platform enables parallel model trials, and parallel trials lead to better AI systems.

AI agents refine AI: Agents automate the iterative refinement loop for LLM post-training execution.
Iterative refinement loop: Systematically propose, test, measure, and improve AI models.
Defining success: Use scoreboards to measure and track AI model performance.
Structured feedback: Give agents targeted, structured feedback to drive improvement.
AI powering AI: Internal projects that leverage AI to improve AI system development.
Unified platform: Agents, evaluation systems, and GPU microscheduling for large-scale experiments.
Parallel model trials: Agents run model trials in parallel with minimal human supervision.
Better AI systems: Create more sophisticated AI through automated development processes.


This realization sparked an internal project in January 2026 with a clear goal: use AI to power AI systems, on a platform designed with agents in a central role. The resulting strategy integrates three pillars for large-scale experimentation: agents for distributed training code, a comprehensive evaluation system, and efficient GPU microscheduling. This framework lets agents trial models in parallel with minimal human supervision.
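As a rough illustration of parallel trials, the sketch below fans several candidate configurations out to a worker pool and keeps the best-scoring one. It is a minimal sketch, not LinkedIn's platform: `run_trial`, its toy scoring formula, and the `lr` parameter are hypothetical stand-ins for a real training and evaluation job, and GPU microscheduling is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List


def run_trial(config: Dict[str, float]) -> Dict[str, Any]:
    """Hypothetical stand-in for one training-plus-evaluation job."""
    lr = config["lr"]
    # Toy score that peaks at lr = 0.01; a real trial would train and evaluate.
    score = 1.0 - abs(lr - 0.01) * 10
    return {"config": config, "score": score}


def run_trials(configs: List[Dict[str, float]], max_workers: int = 4) -> Dict[str, Any]:
    """Run all trials concurrently and return the best-scoring result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_trial, configs))
    return max(results, key=lambda r: r["score"])


best = run_trials([{"lr": 0.001}, {"lr": 0.01}, {"lr": 0.1}])
print(best["config"])
```

Threads are enough here because a real trial would mostly wait on a remote job; a CPU-bound trial would call for a process pool instead.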

Within this setup, the agent optimizes both model quality and training efficiency in an inner loop. Once the best architecture is found, it is scaled up through distributed training in the outer loop. This approach was first applied to migrate LinkedIn’s large TensorFlow model to PyTorch, resulting in Autopilot for Torac. This specialized agent does not just convert code. It iteratively refines what it generates based on LLM inference and verifier feedback.

This pattern was quickly extended to other use cases such as kernel generation and automatic tuning, where agents autonomously search, evaluate, and enhance system performance. The core loop is a cycle of generation → validation → refinement.

Iterative improvement loop

The Autopilot runs a continuous generate → score → hint → regenerate loop until the target metric is met. Each iteration passes through strict quality gates, and verifiers return specific, actionable fixes rather than bare pass/fail signals. Once the goal is reached, the PyTorch implementation is validated on GPU pods and deployed via a Flyte workflow.
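The loop described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: `refine`, `LoopResult`, the toy generator, and the toy scorer are all hypothetical names invented for illustration, and the real generator would be an LLM call rather than a function that appends strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class LoopResult:
    candidate: str
    score: float
    iterations: int
    reached_target: bool


def refine(
    generate: Callable[[List[str]], str],
    score: Callable[[str], Tuple[float, List[str]]],
    target: float,
    max_iterations: int = 5,
) -> LoopResult:
    """Generate, score, collect hints, and regenerate until the target is met."""
    hints: List[str] = []
    best, best_score = "", float("-inf")
    for i in range(1, max_iterations + 1):  # bounded, so the loop cannot run forever
        candidate = generate(hints)
        value, hints = score(candidate)
        if value > best_score:
            best, best_score = candidate, value
        if value >= target:
            return LoopResult(candidate, value, i, True)
    return LoopResult(best, best_score, max_iterations, False)


# Toy stand-ins: the "model code" is complete once it names all required parts.
REQUIRED = ["forward", "loss", "optimizer"]


def make_toy_generator() -> Callable[[List[str]], str]:
    parts: List[str] = []

    def generate(hints: List[str]) -> str:
        if hints:
            parts.append(hints[0])  # apply only the first hint per iteration
        return ",".join(parts)

    return generate


def toy_score(candidate: str) -> Tuple[float, List[str]]:
    present = set(filter(None, candidate.split(",")))
    missing = [p for p in REQUIRED if p not in present]
    return 1 - len(missing) / len(REQUIRED), missing


result = refine(make_toy_generator(), toy_score, target=1.0)
print(result.reached_target, result.iterations)
```

The scorer returns hints alongside the score, so each regeneration receives the specific gaps to fix rather than a bare pass/fail signal.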

The Autopilot system is now being applied to engineering problems whose output is AI infrastructure, models, or performance-critical code. This includes framework migration, model code generation directly from datasets, automated research for architectural optimization, and kernel generation for low-level GPU optimization. The common thread is building verifiable AI systems rather than just generating code, with explicit checks that ensure the iterative agent loop actually improves the output.

Definition of Success: Scoreboard

This system works on the “trust but verify” principle. The scoreboard is not an afterthought: it defines what “good” means for the agent loop. Reward design matters because the agent optimizes for whatever the reward measures. Shallow rewards lead to shallow fixes.

The evaluation hierarchy prioritizes functional correctness. If the generated system cannot run, train, or remain stable, the other scores are irrelevant. Functional validity is not a single check. Behavioral parity confirms the expected output for representative inputs. A structure check verifies that the components are intact. Quality checks enforce the target stack’s conventions for maintainability. Finally, task-level metrics measure real-world performance.

Validation begins with low-cost structure and style checks, then progresses through trainability, IO parity, numerical stability, and task-level metrics, with each stage more demanding than the last. This staged approach keeps the loop efficient and increases reliability.
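A cheap-first, stop-at-first-failure validation pipeline along these lines might look like the sketch below. The stage order follows the text; everything else is an assumption made for illustration, including the candidate dictionary fields (`layers_found`, `losses`, `max_abs_diff`, `auc`) and the tolerance values.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Candidate = Dict[str, object]  # hypothetical bag of measurements for one candidate


@dataclass
class Stage:
    name: str
    check: Callable[[Candidate], Optional[str]]  # None on pass, else a failure message


def check_structure(c: Candidate) -> Optional[str]:
    if c["layers_found"] == c["layers_expected"]:
        return None
    return f"found {c['layers_found']} of {c['layers_expected']} layers"


def check_trainability(c: Candidate) -> Optional[str]:
    losses = c["losses"]
    return None if losses[-1] < losses[0] else "loss did not decrease"


def check_io_parity(c: Candidate) -> Optional[str]:
    # 1e-4 is an illustrative tolerance, not a value from the article.
    return None if c["max_abs_diff"] <= 1e-4 else f"max abs diff {c['max_abs_diff']}"


def check_stability(c: Candidate) -> Optional[str]:
    return None if all(math.isfinite(x) for x in c["losses"]) else "non-finite loss"


def check_task_metric(c: Candidate) -> Optional[str]:
    return None if c["auc"] >= c["baseline_auc"] - 0.001 else "task metric regressed"


# Ordered from cheapest to most expensive, as in the text.
STAGES = [
    Stage("structure", check_structure),
    Stage("trainability", check_trainability),
    Stage("io_parity", check_io_parity),
    Stage("numerical_stability", check_stability),
    Stage("task_metric", check_task_metric),
]


def run_stages(c: Candidate, stages: List[Stage] = STAGES) -> Tuple[List[str], Optional[str]]:
    """Run stages in order; stop at the first failure to avoid costly later checks."""
    passed: List[str] = []
    for stage in stages:
        message = stage.check(c)
        if message is not None:
            return passed, f"{stage.name}: {message}"
        passed.append(stage.name)
    return passed, None


good = {"layers_found": 12, "layers_expected": 12, "losses": [2.0, 1.2, 0.7],
        "max_abs_diff": 1e-6, "auc": 0.81, "baseline_auc": 0.81}
print(run_stages(good))
```

Because an expensive stage never runs after a cheap one fails, a structurally broken candidate costs almost nothing to reject.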

The model code evaluation rubric includes trainability (stability, convergence), IO parity (behavioral consistency), structural fidelity (maintaining the architecture), code style, and task metric parity (downstream quality). This loop leverages both failures and successes to provide structured feedback, prevent redundant work, and accelerate improvement.
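One way to combine these rubric dimensions into a single scoreboard value is a weighted sum with a functional gate, so that a candidate that cannot train or match behavior scores zero regardless of style. This is a sketch under assumptions: the weights, the gate threshold, and the choice of gated dimensions are hypothetical, since the article gives only the dimension names.

```python
from typing import Dict

# Hypothetical integer weights that sum to 100; not taken from the article.
WEIGHTS = {
    "trainability": 40,
    "io_parity": 25,
    "structural_fidelity": 15,
    "task_metric_parity": 15,
    "code_style": 5,
}

# Assumption: these dimensions act as functional gates.
GATES = ("trainability", "io_parity")


def scoreboard(scores: Dict[str, float], gate_threshold: float = 0.5) -> float:
    """Return a total in [0, 1] for per-dimension scores that are each in [0, 1]."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")
    # Functional gate: if a gated dimension fails, the other scores are irrelevant.
    if any(scores[name] < gate_threshold for name in GATES):
        return 0.0
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


perfect = {name: 1.0 for name in WEIGHTS}
print(scoreboard(perfect))
print(scoreboard(dict(perfect, trainability=0.0)))
```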

Reinforcement with structured feedback

Reinforcement in the loop comes from the verifier, which provides structured natural-language feedback. This feedback acts as a coach, detailing weaknesses and suggesting modifications. Each piece of feedback is typed (NO_GRADIENT, NUMERICAL_INSTABILITY, etc.), prioritized (from P1 for critical to P4 for minor), and actionable, with metrics, targets, and a recommended direction.
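A feedback record with those three properties could be modeled as follows. Only the type names NO_GRADIENT and NUMERICAL_INSTABILITY and the P1 to P4 scale come from the text; the `Feedback` class, its field names, the `STYLE` type, and the rendered message format are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class FeedbackType(Enum):
    NO_GRADIENT = "NO_GRADIENT"
    NUMERICAL_INSTABILITY = "NUMERICAL_INSTABILITY"
    STYLE = "STYLE"  # hypothetical extra type, for illustration only


@dataclass(frozen=True)
class Feedback:
    type: FeedbackType
    priority: int    # 1 (critical) through 4 (minor)
    metric: str      # what was measured
    observed: float  # measured value
    target: float    # value the verifier wants
    direction: str   # recommended fix, in natural language

    def render(self) -> str:
        return (f"[P{self.priority}] {self.type.value}: {self.metric}="
                f"{self.observed} (target {self.target}). {self.direction}")


def build_hints(items: List[Feedback], limit: int = 2) -> List[str]:
    """Order feedback by priority and keep the most critical items for the next prompt."""
    ordered = sorted(items, key=lambda f: f.priority)
    return [f.render() for f in ordered[:limit]]


feedback = [
    Feedback(FeedbackType.STYLE, 4, "lint_errors", 3.0, 0.0,
             "Rename variables to match target stack conventions."),
    Feedback(FeedbackType.NO_GRADIENT, 1, "grad_norm", 0.0, 0.001,
             "Check for detached tensors in the forward pass."),
    Feedback(FeedbackType.NUMERICAL_INSTABILITY, 2, "nan_steps", 5.0, 0.0,
             "Clip gradients or lower the learning rate."),
]
for hint in build_hints(feedback):
    print(hint)
```

Truncating to the top few items keeps the agent focused on training blockers before style issues.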

Precise feedback drives systematic improvement by focusing on high-value changes first, such as fixing training blockers before refining code style. Verifiers turn assessments into guidance and rubric failures into targeted actions.

Complementing this client-side strategy is a server-side tracking tool, the Autopilot Tracking Console. It provides a central view of active and completed transformations, training jobs, and Flyte runs, including status, metrics, and artifact links. This is essential for monitoring long-running jobs and checking their execution history.

This approach increases productivity and enables agent-driven migration and auto-tuning for more models with significantly less manual effort. Initial results show strong performance across benchmarks and parity with offline metrics on internal workloads. The success of this system relies on core design decisions, including a scoring-based iterative loop, natural language reasoning for feedback, rapid fault detection, modular and deterministic scoring, and bounded iterations to prevent infinite loops.

Comprehensive evaluation, including N-day replays on production traffic, increases confidence. GPU microscheduling keeps compute consumption cost-efficient for large-scale experiments. This is an exciting area of development, with more details to come.

© 2026 StartupHub.ai. Unauthorized reproduction is prohibited. Please do not type, scrape, copy, reproduce or republish this article in whole or in part. Use for AI training, fine-tuning, search enhancement generation, or as input to any machine learning system is prohibited without a written license. Substantially similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer abuse laws. See our Clause.
