AI learns complex sequences by mastering the building blocks of computations


Scientists are increasingly interested in understanding how neural networks perform complex, structured computations on sequences. Giovanni Luca Marchetti of the KTH Royal Institute of Technology, Daniel Kunin of the University of California, Berkeley, and Adele Myers of the University of California, Santa Barbara, along with Acosta, Miolane, and colleagues, introduced a new task, sequential group composition, to investigate this phenomenon. The researchers tasked networks with predicting the cumulative product of a sequence of elements drawn from a finite group and showed how different architectures learn this order-dependent, nonlinear relationship. The work matters because it demonstrates that deeper models can exploit the task's inherent mathematical structure, particularly its associativity, to scale far better than shallow networks, providing a tractable setting in which to investigate how deep learning works.

Learning arithmetic with sequential data reveals the benefits of network depth for generalization and systematic composability

New research on sequential data processing reveals how neural networks learn to perform structured operations such as arithmetic and algorithmic calculations. The study introduces the sequential group composition task, which examines how networks learn to map a sequence of group elements to its cumulative product.
The study shows that networks can learn this task, but that the efficiency of learning depends strongly on both the network's architecture and the structure of the group itself. Importantly, deeper networks exploit the associativity of the task to scale far more efficiently than shallow networks.

The sequential group composition task requires a network to predict the cumulative product of a sequence of elements drawn from a finite group, with each element encoded as a real vector. The researchers demonstrated that a two-layer network learns this task by progressively identifying the irreducible representations (irreps) of the group, in an order governed by the Fourier statistics of the encoding.
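To make the task concrete, here is a minimal sketch of how such a dataset could be built. It assumes the cyclic group Z_n and a one-hot encoding purely for illustration; the study allows more general real-valued encodings, and the helper names (`one_hot`, `cumulative_product`, `make_dataset`) are hypothetical.

```python
from itertools import product

def cyclic_mul(a, b, n):
    """Group operation of the cyclic group Z_n (addition mod n)."""
    return (a + b) % n

def one_hot(g, n):
    """Encode element g of Z_n as a length-n real vector (illustrative choice)."""
    return [1.0 if i == g else 0.0 for i in range(n)]

def cumulative_product(seq, n):
    """Fold the group operation over the sequence, left to right."""
    result = 0  # identity element of Z_n
    for g in seq:
        result = cyclic_mul(result, g, n)
    return result

def make_dataset(n, k):
    """Enumerate all |G|^k sequences with concatenated inputs and encoded targets."""
    data = []
    for seq in product(range(n), repeat=k):
        x = [v for g in seq for v in one_hot(g, n)]  # input of dimension k * |G|
        y = one_hot(cumulative_product(seq, n), n)   # encoding of the product
        data.append((seq, x, y))
    return data

dataset = make_dataset(n=5, k=3)
```

Even for this small group the number of distinct inputs is |G|^k, so it grows exponentially with the sequence length.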

Although these networks can achieve perfect performance, they require a hidden width that increases exponentially with the length of the sequence. This limitation highlights a fundamental challenge in scaling neural networks to handle longer sequences and more complex computations. In contrast, deeper architectures provide a solution to this scaling problem.

Recurrent neural networks compose elements one at a time and complete the task in a number of steps proportional to the length of the sequence. Multilayer networks are more efficient still: they compose pairs of adjacent elements in parallel, so the number of required layers grows only logarithmically with the sequence length.
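The difference between the two strategies can be sketched without any neural network at all. The example below is an illustration under stated assumptions, not the authors' architecture: it uses permutations of three items as the group, and the function names are hypothetical. Each pass of the pairwise reduction stands in for one layer.

```python
def compose(p, q):
    """Compose two permutations given as tuples: (p * q)(i) = p[q[i]]."""
    return tuple(p[i] for i in q)

def sequential_compose(seq):
    """Recurrent-style composition: one element per step, k - 1 steps in total."""
    acc, steps = seq[0], 0
    for g in seq[1:]:
        acc = compose(acc, g)
        steps += 1
    return acc, steps

def tree_compose(seq):
    """Compose adjacent pairs in parallel; each pass stands in for one layer."""
    level, layers = list(seq), 0
    while len(level) > 1:
        nxt = [compose(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an unpaired last element passes through
        level, layers = nxt, layers + 1
    return level[0], layers

# Eight elements of the symmetric group S_3, written as permutations of (0, 1, 2).
seq = [(1, 0, 2), (0, 2, 1), (1, 2, 0), (2, 0, 1),
       (1, 0, 2), (0, 2, 1), (2, 1, 0), (1, 2, 0)]
seq_result, seq_steps = sequential_compose(seq)
tree_result, tree_layers = tree_compose(seq)
```

Both routes give the same product because the group operation is associative, which is exactly the property the pairwise scheme relies on. For eight elements the sequential route takes seven steps and the pairwise route three passes.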

This capacity for parallel processing shows how architectural design can dramatically affect computational efficiency. The study demonstrated that the sequential group composition task is order-sensitive and nonlinear, and therefore requires nonlinear architectures to be learned successfully. Analyzing the task through a group-specific Fourier decomposition makes it possible to see exactly how features are learned during training. These findings position sequential group composition as a valuable tool for developing mathematical theories of deep learning, providing insight into how networks learn from sequential data and pointing toward more efficient architectures.

Network architecture and learning dynamics for sequential group composition are important for robust generalization

A two-layer multilayer perceptron (MLP) serves as the primary tool for investigating how neural networks learn sequential group composition. The network receives a sequence of $k$ elements from a finite group $G$, encoded as a real vector $x_g$ of dimension $k|G|$, and predicts their cumulative product. The output of the network is $f(x_g; \Theta) = W_{\mathrm{out}}\, \sigma(W_{\mathrm{in}} x_g)$. Here $W_{\mathrm{in}}$, of size $H \times k|G|$, embeds the input sequence in a hidden representation, $\sigma$ is an element-wise monic polynomial of degree $k$, and $W_{\mathrm{out}}$, of size $|G| \times H$, unembeds the hidden representation.

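This computation can also be expressed as a sum over $H$ hidden neurons, each contributing $f_i(x_g; \theta_i) = w_i\, \sigma\!\left(\sum_{j=1}^{k} \langle u_i^j, x_{g_j} \rangle\right)$. Here $u_i$ and $w_i$ are the input and output weights of the $i$-th neuron, respectively, and $u_i^j$ is the block of $u_i$ that acts on the $j$-th input. The parameters are initialized from a normal distribution $\mathcal{N}(0, \alpha^2)$ and evolve under a time-rescaled gradient flow $\dot{\theta}_i = -\eta_{\theta_i} \nabla_{\theta_i} L(\Theta)$ with a neuron-dependent learning rate $\eta_{\theta_i} = \|\theta_i\|^{1-k} \log(1/\alpha)$.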
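The equivalence between the matrix form and the per-neuron sum can be checked with a short sketch. Everything concrete here is an assumption made for illustration: the activation is taken to be the monomial z**k (one example of a monic polynomial of degree k), the sizes and the value of alpha are arbitrary, and the function names are hypothetical.

```python
import random

def sigma(z, k):
    """Element-wise activation; z**k is one monic polynomial of degree k (assumed)."""
    return z ** k

def forward_matrix(W_in, W_out, x, k):
    """Matrix form: f(x; Theta) = W_out sigma(W_in x), written with plain lists."""
    hidden = [sigma(sum(w * v for w, v in zip(row, x)), k) for row in W_in]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W_out]

def forward_neurons(W_in, W_out, x, k):
    """Same output written as a sum of H hidden-neuron contributions."""
    out = [0.0] * len(W_out)
    for i, u_i in enumerate(W_in):           # u_i is row i of W_in
        act = sigma(sum(w * v for w, v in zip(u_i, x)), k)
        for o in range(len(W_out)):
            out[o] += W_out[o][i] * act      # w_i is column i of W_out
    return out

random.seed(0)
n, k, H, alpha = 3, 2, 4, 0.1                # |G|, sequence length, width, init scale
W_in = [[random.gauss(0.0, alpha) for _ in range(k * n)] for _ in range(H)]
W_out = [[random.gauss(0.0, alpha) for _ in range(H)] for _ in range(n)]
x = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]           # two one-hot elements, concatenated
out_a = forward_matrix(W_in, W_out, x, k)
out_b = forward_neurons(W_in, W_out, x, k)
```

Because the input is a concatenation of k blocks, the inner product of a row of W_in with x is the same as the sum over j of the block-wise inner products in the per-neuron formula.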

This vanishing initialization scheme enables the application of alternating gradient flow (AGF), a framework for describing gradient dynamics in two-layer networks. AGF assumes that hidden neurons exist either in a resting state, where their influence on the output is negligible, or in an active state, where they directly shape the output.

The learning process unfolds in two alternating phases. In utility maximization, resting neurons compete to align with the signal that remains in the data, maximizing $U(\theta_i) = \frac{1}{|G|^k} \sum_{g \in G^k} \langle f(x_g; \theta_i),\, x_{g_{1:k}} - f(x_g; \Theta_A) \rangle$ subject to the constraint $\|\theta_i\| = 1$. In cost minimization, active neurons cooperate to minimize the loss $L(\Theta_A)$ subject to $\|\Theta_A\| \geq 0$.
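As a rough illustration of the utility objective, the sketch below evaluates it by brute force for a single candidate neuron. It rests on several assumptions that are not taken from the study: the group is cyclic, the encoding is one-hot (and therefore not mean-centered), the target is read as the encoding of the cumulative product, the activation is the monomial z**k, and the function names are hypothetical. It only evaluates the objective and does not enforce the unit-norm constraint or perform the maximization.

```python
from itertools import product

def one_hot(g, n):
    """Illustrative encoding of element g of Z_n."""
    return [1.0 if i == g else 0.0 for i in range(n)]

def neuron_output(u, w, x, k):
    """f(x; theta_i) = w * sigma(<u, x>) with sigma(z) = z**k (assumed monomial)."""
    pre = sum(a * b for a, b in zip(u, x))
    return [wo * pre ** k for wo in w]

def utility(u, w, active, n, k):
    """Average inner product between one neuron's output and the current residual."""
    total = 0.0
    for seq in product(range(n), repeat=k):
        x = [v for g in seq for v in one_hot(g, n)]
        target = one_hot(sum(seq) % n, n)    # encoding of the cumulative product
        pred = [0.0] * n
        for ua, wa in active:                # output of the active neurons Theta_A
            pred = [p + f for p, f in zip(pred, neuron_output(ua, wa, x, k))]
        residual = [t - p for t, p in zip(target, pred)]
        f_i = neuron_output(u, w, x, k)
        total += sum(a * b for a, b in zip(f_i, residual))
    return total / n ** k

# Smallest possible case: Z_2, sequences of length 1, no active neurons yet.
u_example = utility(u=[1.0, 0.0], w=[1.0, 0.0], active=[], n=2, k=1)
```

When no neurons are active, the residual is simply the target, so the utility measures how well a single neuron correlates with the task on its own.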

This alternation produces a characteristic staircase loss curve, with plateaus corresponding to utility maximization and drops corresponding to cost minimization. The analysis shows that the irreps of the group are learned one after another, in an order determined by the Fourier statistics of the input encoding vector $x$, as shown in Figure 3.

The study assumes that the input is mean-centered, $\hat{x}[\rho_{\mathrm{triv}}] = \langle x, \mathbf{1} \rangle = 0$, and that for every irrep $\rho$ the coefficient $\hat{x}[\rho]$ is either invertible or zero, ensuring non-degeneracy and separation of the Fourier coefficients. The function computed by each neuron is decomposed into two terms, $f(x_g; \theta_i)^{(\times)}$ and $f(x_g; \theta_i)^{(+)}$, to facilitate the analysis of interactions between inputs.
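For a cyclic group the irreps are one-dimensional, and the group Fourier transform reduces to the ordinary discrete Fourier transform, which makes the mean-centering assumption easy to check by hand. The sketch below is restricted to that special case. The encoding values are made up, and ranking frequencies by coefficient magnitude is only an illustrative proxy for the "Fourier statistics" mentioned above, not the study's exact criterion.

```python
import cmath

def fourier_coefficients(x):
    """DFT of an encoding on Z_n: x_hat[r] = sum_g x[g] * exp(-2*pi*i*r*g/n)."""
    n = len(x)
    return [sum(x[g] * cmath.exp(-2j * cmath.pi * r * g / n) for g in range(n))
            for r in range(n)]

def mean_center(x):
    """Subtract the mean so that the trivial Fourier coefficient vanishes."""
    m = sum(x) / len(x)
    return [v - m for v in x]

x = mean_center([3.0, 1.0, 0.0, 2.0, 0.5])   # arbitrary illustrative encoding
x_hat = fourier_coefficients(x)

# Rank the nontrivial frequencies by magnitude (illustrative proxy only).
order = sorted(range(1, len(x)), key=lambda r: -abs(x_hat[r]))
```

The coefficient at frequency 0 is the sum of the entries of x, so mean-centering sets it to zero. For a real-valued encoding, frequencies r and n - r have equal magnitude, reflecting that they pair up into a single real representation.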

Exponential scaling of hidden width limits compositional capacity of two-layer networks despite sufficient overall parameters

A two-layer network that learns the sequential group composition task requires a hidden width of O(exp k), which grows exponentially with the sequence length k. In contrast, recurrent neural networks compose elements sequentially in just O(k) steps, and multilayer networks combine elements in parallel to achieve composition in O(log k) layers.

The study demonstrates that the dynamics of feature learning are shared across architectures, while architectures differ in how efficiently they exploit the associativity of the task. It also shows that the sequential group composition task is order-dependent and requires nonlinear interactions between inputs, which rules out a solution by deep linear networks.
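Order dependence is easy to verify directly for a non-commutative group. As a small illustration (the choice of the symmetric group S_3 and the names used are assumptions, not details from the study), the sketch below counts the ordered pairs of permutations whose product changes when the two elements are swapped.

```python
from itertools import permutations

def compose(p, q):
    """Compose two permutations given as tuples: (p * q)(i) = p[q[i]]."""
    return tuple(p[i] for i in q)

S3 = list(permutations(range(3)))            # the six elements of S_3

# Ordered pairs whose product depends on the order of the two factors.
non_commuting = [(p, q) for p in S3 for q in S3
                 if compose(p, q) != compose(q, p)]
```

Half of the 36 ordered pairs in S_3 fail to commute, so any model that ignores the order of its inputs cannot solve the task for such a group.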

A group-specific Fourier decomposition enables precise analysis of learning within a two-layer network, revealing how the group Fourier statistics of the encoding vector determine which features are learned and the order in which they are acquired. This decomposition provides a tractable framework for understanding how neural networks learn from sequences.

Experiments reveal that the network gradually decomposes the task into the irreducible representations of the group and learns these components in a greedy order set by the encoding vector. Different architectures realize this process in different ways, with two-layer networks attempting to compose all k elements simultaneously.

The work shows that deep networks find efficient solutions by exploiting associativity to build intermediate representations, even though the number of possible inputs grows exponentially with the length of the sequence. These results position sequential group composition as a principled lens for developing mathematical theories of learning from sequential data.

The work extends insights from earlier studies of Fourier features in modular addition and binary group composition to sequential group composition, and it characterizes how features are acquired during training rather than only observing them empirically. The analysis builds on the alternating gradient flow framework and shows that the network acquires group Fourier features in a greedy order determined by their importance.

Learning structured computation with sequential group composition provides a powerful path to general intelligence

The researchers investigated how neural networks learn to perform structured computations, such as arithmetic and algorithmic operations, by introducing the sequential group composition task. The task requires the network to predict the cumulative product of a sequence of elements drawn from a finite group, which demands order-dependent, nonlinear processing.

The analysis revealed that learning is shaped by group structure, input encoding statistics, and sequence length. Specifically, a two-layer network learns the task by acquiring the irreducible representations of the group one at a time, in an order set by the Fourier statistics of the encoding. Although these networks can solve the task perfectly, the hidden width they need grows exponentially with the length of the sequence.

Deeper networks overcome this limitation by exploiting the associativity of the task, composing elements either sequentially in a recurrent architecture or in parallel in a multilayer design. The study thus provides a simplified model for understanding how learning works in neural networks.

The findings show a clear relationship between network depth and the ability to learn compositional operations efficiently. The sequential group composition task provides a tractable framework for isolating and studying the factors that influence learning. Limitations acknowledged by the authors include certain assumptions about the input data and a focus on specific task structures. Future research may test how well these findings generalize to more complex tasks and examine how different architectural choices promote efficient learning of sequential operations.


