Meta AI and KAUST researchers propose a neural computer that combines computation, memory, and I/O into a single learned model

Researchers at Meta AI and King Abdullah University of Science and Technology (KAUST) have introduced neural computers (NCs), a proposed machine form in which the neural network itself acts as the running computer rather than as a layer on top of one. The research team presents both a theoretical framework and two working video-based prototypes that demonstrate early runtime primitives in command-line interface (CLI) and graphical user interface (GUI) settings.

How neural computers differ from agents and world models

To understand the proposal, it helps to compare it with existing system types. A conventional computer executes explicit programs. An AI agent receives a task and carries it out through the existing software stack (operating system, APIs, terminal). A world model learns to predict how an environment evolves over time. Neural computers fill none of these roles exactly. The researchers also explicitly distinguish NCs from the Neural Turing Machine and Differentiable Neural Computer line of work, which focuses on differentiable external memory. The question NCs pose is different: can a learned machine take on the role of the running computer itself?

Formally, a neural computer is defined by an update function F_θ and a decoder G_θ that operate over a latent runtime state h_t. At each step, the NC updates h_t from the current observation x_t and the user action u_t, then samples the next frame x_{t+1}. The latent state holds inside the model what the operating system stack would normally maintain outside it: executable context, working memory, and interface state.
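The update-and-decode loop can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's model: the state is a plain list of floats, `toy_update` and `toy_decode` are hypothetical stand-ins for F_θ and G_θ, the recurrence is assumed to take the previous state h_{t-1} as an input, and the decoder is deterministic where the real model samples a frame.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

State = List[float]
Frame = List[float]
Action = List[float]


@dataclass
class ToyNeuralComputer:
    """Hypothetical stand-in for an NC: one function updates state, one emits a frame."""
    update_fn: Callable[[State, Frame, Action], State]  # plays the role of F_theta
    decode_fn: Callable[[State], Frame]                 # plays the role of G_theta

    def rollout(self, h0: State, x0: Frame, actions: Sequence[Action]) -> List[Frame]:
        h, x, frames = list(h0), list(x0), []
        for u in actions:
            h = self.update_fn(h, x, u)  # h_t from h_{t-1}, x_t, u_t (assumed form)
            x = self.decode_fn(h)        # x_{t+1} produced from h_t
            frames.append(x)
        return frames


def toy_update(h: State, x: Frame, u: Action) -> State:
    # Arbitrary linear mix, chosen only so the numbers are easy to follow.
    return [0.5 * hi + 0.25 * xi + ui for hi, xi, ui in zip(h, x, u)]


def toy_decode(h: State) -> Frame:
    # The "frame" is just a copy of the state in this toy.
    return list(h)


nc = ToyNeuralComputer(toy_update, toy_decode)
frames = nc.rollout(h0=[0.0, 0.0], x0=[1.0, 0.0], actions=[[1.0, 0.0], [0.0, 1.0]])
```

The point of the sketch is structural: nothing outside `h` carries information from one step to the next, which is what it means for the runtime state to live inside the model.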

The long-term goal is a completely neural computer (CNC): a mature, general-purpose realization that satisfies four conditions at once. It is Turing complete, universally programmable, behavior-consistent unless explicitly reprogrammed, and it exhibits machine-native semantics in its architecture and programming language. A key operational requirement tied to behavior consistency is the run/update contract. Ordinary inputs should execute installed capabilities without silently changing them, whereas updates that change behavior should happen explicitly through a programming interface, leaving traces that can be inspected and rolled back.
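In conventional software terms, the run/update contract resembles a runtime with two separate entry points. The sketch below is an analogy only: `ContractRuntime` and its methods are hypothetical names, and a real NC would hold its capabilities in learned state rather than in a Python dictionary.

```python
from typing import Callable, Dict, List, Optional


class ContractRuntime:
    """Hypothetical runtime: run() never alters behavior; update() is explicit and logged."""

    def __init__(self) -> None:
        self._installed: Dict[str, Callable[[int], int]] = {}
        self._trace: List[dict] = []  # inspectable history of explicit updates

    def run(self, name: str, value: int) -> int:
        # Ordinary input: execute an installed capability and change nothing.
        return self._installed[name](value)

    def update(self, name: str, fn: Callable[[int], int], note: str) -> None:
        # Behavior-changing path: record what was replaced so it can be undone.
        previous: Optional[Callable[[int], int]] = self._installed.get(name)
        self._trace.append({"name": name, "note": note, "previous": previous})
        self._installed[name] = fn

    def rollback(self) -> None:
        # Undo the most recent update using the recorded trace entry.
        entry = self._trace.pop()
        if entry["previous"] is None:
            del self._installed[entry["name"]]
        else:
            self._installed[entry["name"]] = entry["previous"]

    def trace(self) -> List[str]:
        return [f'{e["name"]}: {e["note"]}' for e in self._trace]


rt = ContractRuntime()
rt.update("step", lambda v: v + 1, "install increment")
before = rt.run("step", 1)
rt.update("step", lambda v: v * 10, "switch to scaling")
changed = rt.run("step", 1)
rt.rollback()
after = rt.run("step", 1)
```

Calling `run` any number of times leaves `trace()` untouched; only `update` and `rollback` change what the runtime does, and both leave an auditable record.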

Two prototypes built on Wan2.1

Both prototypes, NC_CLIGen and NC_GUIWorld, are built on Wan2.1, the state-of-the-art video generation model at the time of the experiments, with NC-specific conditioning and action modules added on top. The two models were trained separately and share no parameters. Both are evaluated in open-loop mode: rollouts are driven by recorded prompts and logged action streams rather than by interaction with a live environment.

NC_CLIGen models terminal interaction from a text prompt and an initial screen frame, treating CLI generation as a text-and-image-to-video task. A CLIP image encoder processes the first frame, a T5 text encoder embeds the caption, and these conditioning features are concatenated with the diffusion noise and processed by a DiT (Diffusion Transformer) stack. Two datasets were collected. CLIGen (General) contains 823,989 video streams (approximately 1,100 hours) derived from publicly available asciinema .cast recordings. CLIGen (Clean) is split into approximately 78,000 regular traces and approximately 50,000 Python math validation traces, generated with the VHS toolkit inside a Dockerized environment. Training NC_CLIGen on CLIGen (General) required approximately 15,000 H100 GPU hours; training on CLIGen (Clean) required approximately 7,000.

Reconstruction quality on CLIGen (General) reached an average PSNR of 40.77 dB and an SSIM of 0.989 at a 13-pixel font size. Character-level accuracy, measured with Tesseract OCR, rose from 0.03 at initialization to 0.54 after 60,000 training steps, and exact line-match accuracy reached 0.31. Caption specificity had a large effect: detailed captions (76 words on average) raised PSNR to 26.89 dB, compared with 21.90 dB for semantic descriptions, a gain of almost 5 dB. The explanation offered is that terminal frames are governed primarily by text placement, so literal captions act as scaffolding for precise text-to-pixel alignment. One result on training dynamics is worth noting: PSNR and SSIM plateau at around 25,000 steps on CLIGen (Clean), and training up to 460,000 steps yields no further meaningful improvement.
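To make these numbers concrete, here is a minimal sketch of the two kinds of metric involved. The PSNR formula is the standard one; the `exact_line_match` definition (a line counts only if it matches the reference at the same position) is an assumption, since the article does not spell out how the paper computes it.

```python
import math
from typing import List, Sequence


def psnr(reference: Sequence[float], generated: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB over flattened pixel values."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def exact_line_match(reference_lines: List[str], ocr_lines: List[str]) -> float:
    """Fraction of reference lines reproduced exactly at the same index (assumed definition)."""
    if not reference_lines:
        return 0.0
    hits = sum(1 for i, ref in enumerate(reference_lines)
               if i < len(ocr_lines) and ocr_lines[i] == ref)
    return hits / len(reference_lines)


# The caption result from the article, expressed as a change in mean squared error.
caption_gain_db = 26.89 - 21.90
mse_reduction_factor = 10 ** (caption_gain_db / 10)

toy_psnr = psnr([0.0, 0.0, 0.0, 0.0], [255.0, 0.0, 0.0, 0.0])
toy_match = exact_line_match(["$ ls", "a.txt", "b.txt"], ["$ ls", "a.txt", "b.tx"])
```

Because PSNR is logarithmic, a gain of roughly 5 dB corresponds to cutting the mean squared pixel error by a factor of a little over three, which is why the caption effect is considered large.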

For symbolic computation, arithmetic probe accuracy on a pool of 1,000 held-out math problems was 4% for NC_CLIGen, compared with 0% for the base Wan2.1, 71% for Sora-2, and 2% for Veo3.1. Re-prompting alone, that is, explicitly supplying the correct answer in the prompt at inference time, raised NC_CLIGen accuracy from 4% to 83% without changing the backbone or adding reinforcement learning. The researchers interpret this as evidence of steerability and faithful rendering of conditioned content rather than native arithmetic inside the model.
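The distinction between rendering and computing can be illustrated with a toy probe. Everything here is hypothetical: `copy_only_renderer` stands in for a model that can reproduce text it is conditioned on but cannot do arithmetic, and the `reprompt` wording is invented for the example rather than taken from the paper.

```python
import re
from typing import Callable, List, Tuple

Problem = Tuple[str, str]  # (prompt text, expected answer string)


def probe_accuracy(problems: List[Problem], render_answer: Callable[[str], str]) -> float:
    """Share of problems whose rendered answer exactly matches the expected string."""
    correct = sum(1 for prompt, expected in problems
                  if render_answer(prompt).strip() == expected)
    return correct / len(problems)


def reprompt(expression: str, answer: str) -> str:
    """Hypothetical re-prompt that states the correct answer in the conditioning text."""
    return f"{expression} (the terminal prints {answer})"


def copy_only_renderer(prompt: str) -> str:
    """Toy model: renders an answer only if the prompt already contains it."""
    match = re.search(r"prints (-?\d+)", prompt)
    return match.group(1) if match else "?"


problems = [("2 + 3", "5"), ("7 * 6", "42")]
plain_accuracy = probe_accuracy(problems, copy_only_renderer)
hinted_accuracy = probe_accuracy(
    [(reprompt(expr, ans), ans) for expr, ans in problems], copy_only_renderer
)
```

A renderer of this kind scores zero unaided and perfectly when hinted, which is the pattern that leads the researchers to read the 4% to 83% jump as faithful rendering rather than arithmetic ability.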

NC_GUIWorld covers full desktop interaction, modeling each session as a synchronized sequence of RGB frames and input events, captured at 1024×768 resolution on Ubuntu 22.04 with XFCE4 at 15 FPS. The dataset totals approximately 1,510 hours: Random Slow (approximately 1,000 hours), Random Fast (approximately 400 hours), and 110 hours of goal-directed trajectories collected with Claude CUA. Training used 64 GPUs for approximately 15 days per run, or approximately 23,000 GPU hours per full pass.
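A synchronized frame-and-event stream can be pictured as events bucketed into frame intervals. The sketch below assumes a simple timestamp-based alignment at the reported 15 FPS; the `InputEvent` fields and the bucketing rule are hypothetical, since the article does not describe the actual recording format. The last line checks the reported compute budget.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List

FPS = 15  # capture rate reported for the GUI data


@dataclass
class InputEvent:
    timestamp_s: float  # seconds since recording start (hypothetical field)
    kind: str           # e.g. "mouse_move", "click", "key"
    payload: Dict[str, int] = field(default_factory=dict)


def align_events_to_frames(events: List[InputEvent], num_frames: int) -> List[List[InputEvent]]:
    """Bucket each event into the frame interval in which it occurred."""
    buckets: List[List[InputEvent]] = [[] for _ in range(num_frames)]
    for event in events:
        index = math.floor(event.timestamp_s * FPS)
        if 0 <= index < num_frames:  # drop events outside the recorded clip
            buckets[index].append(event)
    return buckets


events = [
    InputEvent(0.00, "mouse_move", {"x": 10, "y": 20}),
    InputEvent(0.05, "mouse_move", {"x": 12, "y": 21}),
    InputEvent(0.10, "click", {"x": 12, "y": 21, "button": 1}),
    InputEvent(1.00, "key", {"code": 36}),
]
per_frame = align_events_to_frames(events, num_frames=16)

# Compute budget check: 64 GPUs x 15 days x 24 hours.
gpu_hours_per_run = 64 * 15 * 24
```

The arithmetic gives 23,040 GPU hours, consistent with the "approximately 23,000" figure quoted for a full pass.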

The research team evaluated four action-injection schemes that differ in how deeply the action embeddings interact with the diffusion backbone: external, contextual, residual, and internal. Internal conditioning, which inserts action cross-attention directly inside each transformer block, achieved the highest structural consistency (SSIM₊₁₅ 0.863, FVD₊₁₅ 14.5). Residual conditioning achieved the best perceptual distance (LPIPS₊₁₅ 0.138). For cursor control, SVG mask/reference conditioning raised cursor accuracy to 98.7%, compared with 8.7% for coordinate-only supervision, which indicates that the cursor must be treated as an explicit, supervised visual object. Data quality proved as important as architecture. The 110-hour Claude CUA dataset outperformed approximately 1,400 hours of random exploration on every metric (FVD: 14.72 versus 20.37 and 48.17), indicating that curated, goal-directed data is far more sample efficient than passive collection.

What remains unresolved

The research team is candid about the gap between the current prototypes and the CNC definition. Stable reuse of learned routines, reliable symbolic computation, long-horizon execution consistency, and explicit runtime governance all remain open problems. The roadmap they outline is organized around three acceptance lenses: install and reuse, execution consistency, and update governance. The researchers argue that progress on all three would move neural computers from isolated demonstrations toward a candidate machine form for next-generation computing.

Key takeaways

Neural computers propose to make the model itself a running computer. Unlike AI agents that operate through existing software stacks, NC aims to collapse computation, memory, and I/O into a single learned runtime state, eliminating the separation between the model and the machine on which it runs.
Early prototypes show measurable interface primitives. NC_CLIGen, built on Wan2.1, reaches 40.77 dB PSNR and 0.989 SSIM in terminal rendering, and NC_GUIWorld achieves 98.7% cursor accuracy with SVG mask/reference conditioning. This indicates that I/O alignment and short-horizon control can be learned from collected interface traces.
Data quality matters more than data size. In the GUI experiments, 110 hours of goal-directed trajectories collected with Claude CUA outperformed approximately 1,400 hours of random exploration on every metric, showing that curated interaction data is significantly more sample efficient than passive collection.
The current models are strong renderers, not native reasoners. NC_CLIGen scored only 4% on the unaided arithmetic probe, but re-prompting raised accuracy to 83% without changing the backbone. This is evidence of steerability rather than internal computation. Symbolic reasoning remains a major open challenge.
Reaching a completely neural computer requires closing three practical gaps. The research team frames near-term progress around install and reuse (learned capabilities persist and remain callable), execution consistency (behavior is reproducible across runs), and update governance (changes in behavior are traceable to explicit reprogramming rather than silent drift).

Michal Sutter is a data science expert with a master’s degree in data science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.