Training Transformers With 1970s Technology

Machine Learning


Although generative language models have found little widespread and profitable adoption beyond costing artists jobs or giving technology companies an easy scapegoat for layoffs, the underlying technology remains an attractive area of research. One way to examine these early models is to go back to the more innocent days of the late 2010s, before the cultural backlash. Another is to see how older technology handles this type of machine learning algorithm, which can deepen your understanding of its fundamentals. [Damien] used a 1960s IBM machine and a PDP-11 to explore the transformer algorithm through training.

On hardware this old, the problem has to stay small: [Damien] trains the transformer to reverse lists of numbers. This is a trivial task for an ordinary Python program, but much more difficult for a transformer. The model relies solely on self-attention and residual connections. To stay within the PDP-11's 32 KB memory limit, it uses fixed-point arithmetic and lookup tables in place of computationally expensive functions. Training uses stochastic gradient descent with a manually tuned learning rate, reaching 100% accuracy in 350 steps. In practice, that means a training run that might otherwise take hours or days finishes in about five minutes.
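To give a flavor of the memory-saving tricks mentioned above, here is a minimal sketch of fixed-point arithmetic with a lookup table standing in for `exp()` in a softmax, the kind of expensive function an attention layer needs. This is an illustration of the general technique, not [Damien]'s actual code; the 8-bit fraction width, table range, and function names are all assumptions chosen to fit 16-bit-era integer math.

```python
import math

# Hypothetical fixed-point format: integers scaled by 2**FRAC.
FRAC = 8            # 8 fractional bits keeps products within small integer ranges
ONE = 1 << FRAC     # the value 1.0 in fixed point

def to_fix(x: float) -> int:
    """Convert a float to fixed point (illustrative helper)."""
    return int(round(x * ONE))

def fix_mul(a: int, b: int) -> int:
    """Fixed-point multiply: widen, multiply, then shift back down."""
    return (a * b) >> FRAC

# Precomputed table replacing exp() on the domain [-4, 0],
# one entry per representable fixed-point input.
EXP_TABLE = [to_fix(math.exp(i / ONE)) for i in range(-4 * ONE, 1)]

def fix_exp(x: int) -> int:
    """exp() via table lookup; inputs are clamped to the table's domain."""
    x = max(-4 * ONE, min(0, x))
    return EXP_TABLE[x + 4 * ONE]

def fix_softmax(scores: list[int]) -> list[int]:
    """Softmax over attention scores using only integer operations."""
    m = max(scores)
    exps = [fix_exp(s - m) for s in scores]   # shift so every input is <= 0
    total = sum(exps)
    return [(e << FRAC) // total for e in exps]
```

With `FRAC = 8`, `fix_softmax([to_fix(1.0), to_fix(1.0)])` yields two probabilities of `ONE // 2` each, and the outputs always sum to `ONE` within a few counts of rounding error. Clamping the table domain and subtracting the max score are what keep the integers from overflowing, which matters far more on a 16-bit minicomputer than the small precision loss does.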

Projects like this not only help you understand these tools, but they also go a long way toward demonstrating that not every task requires a gigawatt data center to be useful. In fact, we've seen large language models and other generative AI running on hardware as modest as an ESP32. And if you need a little more computing power, they run on consumer PCs with or without GPUs.


