1 **Machines can learn!**

A few years ago, I decided I needed to learn how to code simple machine learning algorithms. I had written about machine learning as a journalist and wanted to understand the basics. (My background as a software engineer helped.) One of my first projects was to build a basic neural network to try to do what astronomer and mathematician Johannes Kepler had done in the early 1600s: analyze data about the position of Mars collected by Danish astronomer Tycho Brahe to derive the laws of planetary motion.

I quickly discovered that artificial neural networks (a type of machine learning algorithm that uses a network of computing units called artificial neurons) require much more data than Kepler had available. To satisfy the algorithm's demands, I used a simple simulation of the solar system to generate 10 years' worth of data on the daily alignment of the planets.

After many false starts and dead ends, I coded a neural network that could predict the future positions of planets from the simulated data. It was a lot of fun to watch: the network actually learned patterns in the data and could predict, for example, the position of Mars five years from now.

I was immediately hooked. Kepler, of course, had done much more with much less: he had devised comprehensive laws that could be codified in the symbolic language of mathematics. My neural network simply took in data about the previous positions of the planets and spit out data about their future positions. It was a black box, its inner workings beyond my nascent skills to decipher. Still, it was a visceral experience to witness the ghost of Kepler in the machine.

This project inspired me to learn more about the mathematics behind machine learning, and the desire to share how beautiful that math is led to *Why Machines Learn*.

2 **(Almost) everything is a vector.**

One of the most amazing things I learned about machine learning is that it can turn anything into a vector: the positions of planets, pictures of cats, recordings of bird calls.

Machine learning models use vectors to represent both input and output data. A vector is simply a sequence of numbers. Each number can be thought of as a distance from the origin along one axis of a coordinate system. Take the sequence 5, 8, 13: 5 is five steps along the x-axis, 8 is eight steps along the y-axis, and 13 is thirteen steps along the z-axis. Taking these steps brings you to a point in 3D space, and that point is the vector, written as a sequence of numbers in brackets: [5 8 13].
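To make this concrete, here is a minimal sketch in Python that treats the example vector as a plain list of numbers. The helper name `length` is my own choice for illustration; it simply measures how far the point sits from the origin.

```python
import math

# The example vector from the text: 5 steps along x, 8 along y, 13 along z.
vector = [5, 8, 13]

def length(v):
    """Euclidean distance from the origin to the point described by v."""
    return math.sqrt(sum(component ** 2 for component in v))

dimension = len(vector)    # one number per axis, so 3 dimensions
distance = length(vector)  # sqrt(25 + 64 + 169) = sqrt(258)

print(dimension, distance)
```

Nothing here is specific to three dimensions: the same list-of-numbers idea works for a vector with any number of entries.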

Now, say you want your algorithm to represent a grayscale image of a cat. Each pixel in that image is a number encoded using one byte, or 8 bits, of information, so it must be a number between 0 and 255, with 0 representing black, 255 representing white, and the numbers in between representing different shades of gray.

If you have a 100×100 pixel image, there are a total of 10,000 pixels in the image. So if you line up the numerical values for each pixel, you get a vector that represents the cat in a 10,000-dimensional space. Each element of that vector represents a distance along one of the 10,000 axes. Machine learning algorithms encode a 100×100 image as a 10,000-dimensional vector. As far as the algorithm is concerned, the cat becomes a point in this high-dimensional space.
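The "lining up" step can be sketched in a few lines of Python. I don't have a cat photo to hand here, so the image below is an assumption: random gray levels standing in for real pixels. The `flatten` helper is a name I've made up for illustration.

```python
import random

WIDTH, HEIGHT = 100, 100

# Stand-in for a real photo (an assumption for this sketch):
# random gray levels from 0 (black) to 255 (white).
random.seed(0)
image = [[random.randint(0, 255) for _ in range(WIDTH)] for _ in range(HEIGHT)]

def flatten(img):
    """Line up the pixel rows one after another into a single vector."""
    return [pixel for row in img for pixel in row]

vector = flatten(image)
print(len(vector))  # 10000
```

The pixel at row `r` and column `c` ends up at position `r * WIDTH + c` in the vector, so no information is lost; only the layout changes.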

By converting images into vectors and treating them as points in some mathematical space, machine learning algorithms can learn patterns that exist in the data and use what they learn to make predictions about new, unknown data. Given a new, unlabeled image, the algorithm looks at where the associated vector, or point formed by that image, is located in high-dimensional space and classifies it accordingly. What we have here is a very simple type of image recognition algorithm: given a set of images that humans have annotated as images of cats and dogs, the algorithm learns how to map those images into a high-dimensional space and uses that map to make decisions about new images.
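One of the simplest ways to classify a new point by where it falls is the nearest-neighbor rule: find the labeled point closest to it and copy that label. The sketch below is only an illustration of that idea, not a claim about how any particular image recognition system works, and the tiny four-pixel "images" are invented for the example.

```python
import math

def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(training, new_vector):
    """Return the label of the training vector closest to new_vector."""
    closest = min(training, key=lambda pair: distance(pair[0], new_vector))
    return closest[1]

# Invented 4-pixel "images" with human-supplied labels (toy data).
training = [
    ([10, 20, 15, 12], "cat"),
    ([12, 18, 14, 10], "cat"),
    ([200, 210, 190, 205], "dog"),
    ([195, 220, 185, 200], "dog"),
]

print(nearest_neighbor(training, [11, 19, 16, 11]))  # cat
```

With real 10,000-dimensional image vectors the code would be the same; only the length of each list changes.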

3 **Some machine learning algorithms can be “universal function approximators”.**

One way to think of a machine learning algorithm is as something that transforms an input, *x*, into an output, *y*. Inputs and outputs can be single numbers or vectors. Consider *y* = *f*(*x*). Here, *x* could be a 10,000-dimensional vector representing a cat or a dog, and *y* could be 0 for a cat and 1 for a dog. Given enough annotated training data, the machine learning algorithm is tasked with finding the best possible function, *f*, that converts *x* to *y*.
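Here is a deliberately tiny sketch of what "finding the best function" can mean. It assumes a very restricted family of candidate functions, each of which outputs 1 when the average pixel brightness exceeds some threshold and 0 otherwise, and it searches for the threshold that makes the fewest mistakes on the labeled data. The family, the toy data, and the names `make_f` and `count_errors` are all my own assumptions; real algorithms search far richer families of functions.

```python
def make_f(threshold):
    """Build one candidate function f(x): 1 if x is bright on average, else 0."""
    def f(x):
        return 1 if sum(x) / len(x) > threshold else 0
    return f

def count_errors(f, data):
    """Number of training pairs (x, y) for which f(x) differs from y."""
    return sum(1 for x, y in data if f(x) != y)

# Invented toy data: 4-pixel inputs x with labels y (0 for cat, 1 for dog).
data = [
    ([10, 20, 15, 12], 0),
    ([12, 18, 14, 10], 0),
    ([200, 210, 190, 205], 1),
    ([195, 220, 185, 200], 1),
]

# "Training": try every threshold and keep the first one with the fewest errors.
best_threshold = min(range(256), key=lambda t: count_errors(make_f(t), data))
f = make_f(best_threshold)

print(best_threshold, count_errors(f, data))
```

The learned `f` can then be applied to inputs it has never seen, which is the whole point of learning the function rather than memorizing the labels.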

Certain machine learning algorithms, such as deep neural networks, are “universal function approximators”, meaning that there is a mathematical proof showing that they can, in principle, approximate any function, no matter how complex.

Deep neural networks have layers of artificial neurons, including an input layer, an output layer, and one or more so-called hidden layers sandwiched between the input and output layers. There is a mathematical result called the universal approximation theorem, which states that, given enough neurons, a network with only one hidden layer can approximate any function. This means that if there is a correlation in your data between the inputs and the desired output, then the neural network can, in principle, find a very good approximation of a function that implements this correlation.
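To get a feel for why more hidden neurons buy a better approximation, here is a small sketch. It is an illustration, not the theorem's proof, and it makes several assumptions: the target is the sine function on one period, the hidden units use the ReLU activation, and the weights are set by hand rather than learned from data. Each hidden unit adds one "kink" to the output, so the network traces a piecewise-linear curve through evenly spaced points on the target.

```python
import math

def relu(z):
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(0.0, z)

def build_network(target, lo, hi, n_hidden):
    """Hand-pick weights so a one-hidden-layer ReLU network passes through
    `target` at n_hidden + 1 evenly spaced points between lo and hi."""
    step = (hi - lo) / n_hidden
    knots = [lo + i * step for i in range(n_hidden + 1)]
    slopes = [(target(knots[i + 1]) - target(knots[i])) / step
              for i in range(n_hidden)]
    # The output weight of hidden unit i is the change in slope at knot i.
    out_weights = [slopes[0]] + [slopes[i] - slopes[i - 1]
                                 for i in range(1, n_hidden)]
    bias = target(lo)

    def network(x):
        hidden = [relu(x - knots[i]) for i in range(n_hidden)]  # hidden layer
        return bias + sum(w * h for w, h in zip(out_weights, hidden))

    return network

def max_error(target, network, lo, hi, samples=1000):
    """Largest gap between target and network on an evenly spaced grid."""
    xs = [lo + (hi - lo) * k / samples for k in range(samples + 1)]
    return max(abs(target(x) - network(x)) for x in xs)

errors_by_size = {}
for n in (5, 20, 100):
    net = build_network(math.sin, 0.0, 2 * math.pi, n)
    errors_by_size[n] = max_error(math.sin, net, 0.0, 2 * math.pi)
    print(n, errors_by_size[n])
```

The worst-case error shrinks as the hidden layer grows. In practice, of course, the weights are found by training on data rather than computed from a known target function.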

This is a profound result, and one of the reasons why deep neural networks are being trained to perform increasingly complex tasks, as long as we provide them with enough input-output data and can make the networks large enough.

So whether it's a function that takes an image and converts it to 0 (cat) or 1 (dog), or a function that takes a string of words and converts them into an image for which those words could serve as a caption, or even a function that takes a snapshot of the road ahead and instructs the car to change lanes, stop, and so on, a universal function approximator could in principle learn and implement any such function given enough training data. The possibilities are endless, but keep in mind that correlation is not the same as causation.

*Lead image: Aree_S / Shutterstock*