Image by editor
Machine learning and artificial intelligence are growing so fast that some of us can barely keep up, and that growth requires specialized infrastructure and hardware support. Advances in machine learning translate directly into demands for scaled-up computing performance. Now let’s learn more about TPU v4.
TPU stands for Tensor Processing Unit and is designed for machine learning and deep learning applications. TPUs were invented by Google and built to handle the advanced computational needs of machine learning and artificial intelligence.
When Google designed the TPU, it created a domain-specific architecture. In other words, rather than being a general-purpose processor, the TPU was designed as a matrix processor dedicated to neural network workloads. This addresses the memory-access bottlenecks that cause CPUs and GPUs to slow down and burn extra processing power on these workloads.
First came TPU v2 and v3, and now v4. A TPU v4 chip contains two TensorCores; each TensorCore has four MXUs (matrix-multiply units), a vector unit, and a scalar unit. See image below.
Image by Google
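To see why a matrix unit matters, note that the bulk of a dense neural-network layer's work is a single matrix multiply, which is exactly the operation an MXU accelerates as a systolic array. The numpy sketch below is purely illustrative (the shapes are arbitrary); it is not TPU code.

```python
import numpy as np

# A dense layer's forward pass is dominated by one matrix multiply --
# the operation a TPU MXU executes in hardware. Illustrative shapes only.
rng = np.random.default_rng(0)
batch, d_in, d_out = 128, 512, 256

x = rng.standard_normal((batch, d_in)).astype(np.float32)   # activations
w = rng.standard_normal((d_in, d_out)).astype(np.float32)   # layer weights

y = x @ w  # the matmul an MXU would stream through its systolic array

print(y.shape)  # (128, 256)
```

On a TPU, frameworks such as JAX or TensorFlow compile exactly this kind of operation down to the MXUs, while the vector and scalar units handle element-wise math and control flow.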
Optical Circuit Switch (OCS)
TPU v4 is the first supercomputer to deploy reconfigurable optical circuit switches. Optical circuit switches (OCSes) can dynamically reconfigure the interconnect topology, so traffic is routed around failures and congestion rather than piling up on the network. OCS improves scalability, availability, modularity, deployment, security, power, performance, and more.
The TPU v4’s OCS and other optical components account for less than 5% of the TPU v4 system cost and less than 5% of the system power.
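The core idea of circuit switching can be sketched as a programmable port mapping that is reconfigured as a whole, instead of switching individual packets. The toy class below is a simplified illustration of that idea only; it does not reflect Google's actual OCS design.

```python
# Toy model of an optical circuit switch: a programmable mapping from
# input ports to output ports. Reprogramming the mapping lets traffic
# bypass a failed or congested node without per-packet switching.
# Simplified illustration only -- not Google's actual OCS design.

class CircuitSwitch:
    def __init__(self):
        self.circuits = {}  # input port -> output port

    def connect(self, src, dst):
        self.circuits[src] = dst  # establish (or re-point) a light path

    def route(self, src):
        return self.circuits.get(src)

switch = CircuitSwitch()
switch.connect("chip_A", "chip_B")        # initial circuit
assert switch.route("chip_A") == "chip_B"

# chip_B becomes unavailable: reconfigure the circuit to a spare chip.
switch.connect("chip_A", "chip_C")
assert switch.route("chip_A") == "chip_C"
print("rerouted around failure")
```

This reconfigurability is what lets a TPU v4 pod keep running, and even change its interconnect topology, when individual hosts drop out.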
TPU v4 is also the first supercomputer with hardware support for embeddings. Neural networks train well on dense vectors, and embeddings are the most efficient way to transform categorical feature values into dense vectors. TPU v4 includes third-generation SparseCores, dataflow processors that accelerate machine learning models that rely on embeddings.
For example, an embedding can represent English words, taking a large categorical space and transforming it into a smaller dense space of 100-dimensional vector representations of each word. Embeddings are part of our daily lives and a key component of the Deep Learning Recommendation Models (DLRMs) used in advertising, search ranking, YouTube, and more.
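The word example above boils down to a sparse gather from a large table, which is the access pattern SparseCores accelerate. Here is a minimal numpy sketch; the vocabulary size and word IDs are made up for illustration.

```python
import numpy as np

# Minimal embedding lookup: map sparse categorical word IDs into small
# dense vectors (here 100-dimensional). The sparse gather from a large
# table is the memory-bound pattern that SparseCores accelerate.
# Vocabulary size and IDs below are illustrative assumptions.

rng = np.random.default_rng(42)
vocab_size, embed_dim = 50_000, 100
embedding_table = rng.standard_normal((vocab_size, embed_dim)).astype(np.float32)

word_ids = np.array([17, 4021, 33907])     # sparse categorical features
dense_vectors = embedding_table[word_ids]  # the embedding lookup (gather)

print(dense_vectors.shape)  # (3, 100)
```

In a DLRM, lookups like this happen across many large tables per example, which is why dedicated embedding hardware pays off.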
The images below show the performance of a recommendation model on CPUs, TPU v3, TPU v4 (with SparseCore), and TPU v4 with embeddings held in CPU memory (not using SparseCore). As you can see, TPU v4’s SparseCore is 3x faster than TPU v3 on recommendation models, and 5–30x faster than CPU-based systems.
Image by Google
TPU v4 outperforms TPU v3 by 2.1x, with 2.7x better performance per watt. A TPU v4 supercomputer spans 4,096 chips, making it 4x larger and roughly 10x faster overall. The OCS implementation and its flexibility are also a big help for large language models.
The performance and availability of TPU v4 supercomputers make them well suited to improving large language models such as LaMDA, MUM, and PaLM. PaLM, a 540-billion-parameter model, was trained on TPU v4 for over 50 days and sustained an impressive 57.8% of peak hardware floating-point performance.
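To put that 57.8% utilization figure in absolute terms, we can multiply it by an assumed aggregate peak throughput. The per-chip peak (275 TFLOP/s in BF16) and chip count (6,144) below are assumptions for illustration, not figures from this article; only the 57.8% comes from the text above.

```python
# What 57.8% hardware floating-point utilization means in absolute
# terms. Per-chip peak and chip count are illustrative assumptions;
# the utilization figure comes from the article.

peak_per_chip = 275e12   # assumed peak BF16 FLOP/s per TPU v4 chip
chips = 6144             # assumed chip count for the training run
utilization = 0.578      # sustained fraction of peak, from the article

sustained = utilization * peak_per_chip * chips
print(f"sustained throughput = {sustained / 1e18:.2f} exaFLOP/s")
```

Under these assumptions the run sustains on the order of an exaFLOP per second, which is why utilization percentages in the high fifties are considered impressive at this scale.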
TPU v4 also has multi-dimensional model partitioning technology that enables low-latency, high-throughput inference of large language models.
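One simple form of model partitioning can be sketched in a few lines: split a layer's weight matrix column-wise across "devices", let each compute its shard of the output, then concatenate. Real TPU v4 partitioning is multi-dimensional and hardware-aware; this numpy sketch only shows the basic idea.

```python
import numpy as np

# Sketch of tensor (model) partitioning: shard a weight matrix
# column-wise across devices, compute each shard independently, then
# gather. Multi-dimensional partitioning on TPU v4 generalizes this
# across several axes at once; shapes here are illustrative.

rng = np.random.default_rng(7)
x = rng.standard_normal((8, 64)).astype(np.float32)   # activations
w = rng.standard_normal((64, 32)).astype(np.float32)  # full weight matrix

num_devices = 4
shards = np.split(w, num_devices, axis=1)      # each "device" holds 64x8

partials = [x @ shard for shard in shards]     # computed in parallel
y_sharded = np.concatenate(partials, axis=1)   # gather the shard outputs

assert np.allclose(y_sharded, x @ w, atol=1e-5)  # matches unpartitioned
print("sharded result matches full matmul")
```

Because each device only holds and multiplies a slice of the weights, a model too large for one chip's memory can still serve low-latency inference across many chips.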
With more laws and regulations being introduced to help businesses around the world improve their overall energy efficiency, TPU v4 is doing a decent job: it uses 2–6x less energy and produces up to 20x less carbon footprint than contemporary DSAs in typical data centers.
Now that you know a bit more about TPU v4, you may be wondering how quickly machine learning workloads actually change on TPUs.
The table below shows production workloads by deep neural network model type and the percentage of TPU use for each. Over 90% of training at Google runs on TPUs. The table reflects how rapidly production workloads at Google change.
Transformers, familiar from natural language translation and text summarization, process their inputs in parallel rather than one element at a time as recurrent neural networks (RNNs) do, which helps explain the declining share of RNN workloads.
To learn more about TPU v4’s capabilities, read the research paper TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings.
Last year, Google Cloud’s ML cluster in Oklahoma made TPU v4 supercomputers available to AI researchers and developers. The paper’s authors claim that TPU v4 is faster and consumes less power than the Nvidia A100. However, TPU v4 could not be compared with the newer Nvidia H100 GPU because of the H100’s limited availability and because it is built on a 4nm process, whereas TPU v4 uses a 7nm process.
What do you think of TPU v4’s features and limitations, and do you think it’s better than the Nvidia A100 GPU?
Nisha Arya is a Data Scientist, freelance technical writer, and Community Manager at KDnuggets. She is particularly interested in providing data science career advice and tutorials, along with theory-based knowledge of the field. She also wants to explore the different ways artificial intelligence can help extend human lifespan. An avid learner, she seeks to broaden her technical knowledge and writing skills while helping guide others.