Calculating AI’s environmental footprint at Google
Detailed measurement lets us compare different AI models, along with the hardware and energy they run on, while optimizing the efficiency of the entire system, from the hardware and data centers to the models themselves. By sharing our methodology, we hope to bring more consistency to how the industry calculates AI's resource consumption and efficiency.
Measuring the footprint of AI processing workloads is not easy. Google has developed a comprehensive approach that considers the realities of delivering AI at Google's scale:
- Dynamic power consumption of the entire system: This includes not only the energy and water used by the main AI model during active computation, but also the actual chip utilization achieved at production scale, which is much lower than the theoretical maximum.
- Idle machines: To ensure high availability and reliability, production systems require provisioned capacity that is ready to handle traffic spikes or failover at any moment, even when idle. The energy consumed by these idle chips must be factored into total energy usage.
- CPU and RAM: Executing AI models doesn't just happen on ML accelerators like TPUs and GPUs. Host CPU and RAM also play a critical role in delivering AI and consume energy.
- Data center overhead: The energy consumed by the IT equipment running AI workloads is only part of the picture. The infrastructure that supports these computations (cooling systems, power distribution, and other data center overhead) also consumes energy. Overhead energy efficiency is measured by a metric called Power Usage Effectiveness (PUE); a short PUE sketch follows this list.
- Data center water consumption: To reduce energy consumption and associated emissions, data centers often consume water for cooling. Optimizing AI systems to be more energy efficient naturally reduces their overall water consumption as well.
Many current AI energy consumption calculations include only active machine consumption, ignoring several of the important factors discussed above. As a result, they represent theoretical efficiencies rather than actual operational efficiency at scale. Applying this non-comprehensive methodology, which considers only active TPU and GPU consumption, we estimate that a median Gemini text prompt uses 0.10 Wh of energy, emits 0.02 gCO2e, and consumes 0.12 mL of water. This is an optimistic scenario at best and significantly underestimates the actual operational footprint of AI.
Our comprehensive methodology estimates 0.24 Wh of energy, 0.03 gCO2e, and 0.26 mL of water per median prompt, accounting for all the critical components of delivering AI worldwide. We believe this is the most complete view of AI's overall footprint.
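The methodology notes at the end of this post describe how per-prompt emissions and water are derived from per-prompt energy using fleet-wide average grid carbon intensity and water efficiency. The sketch below shows that arithmetic; the factor values passed in the example are illustrative assumptions, not Google's published fleet averages.

```python
# Minimal sketch of the estimation steps described in the methodology notes:
# emissions and water per prompt are derived from energy per prompt using
# fleet-wide average carbon intensity and water efficiency. The factor values
# below are illustrative assumptions, not Google's published fleet averages.

def per_prompt_footprint(energy_wh: float,
                         carbon_g_per_wh: float,
                         water_ml_per_wh: float) -> dict:
    """Apply average grid carbon intensity and water efficiency to prompt energy."""
    return {
        "energy_wh": energy_wh,
        "emissions_gco2e": energy_wh * carbon_g_per_wh,
        "water_ml": energy_wh * water_ml_per_wh,
    }

# Example with the comprehensive 0.24 Wh estimate and assumed factor values.
print(per_prompt_footprint(0.24, carbon_g_per_wh=0.125, water_ml_per_wh=1.08))
```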
A full-stack approach to AI – and AI efficiency
Gemini's dramatic efficiency gains stem from Google's full-stack approach to AI development, from custom hardware and highly efficient models to the robust serving systems that deliver them. We've built efficiency into every layer of AI, including:
- More efficient model architectures: Gemini is built on the Transformer model architecture developed by researchers at Google, which delivers 10-100x efficiency gains over the previous state-of-the-art architectures for language modeling. We also design models with inherently efficient structures such as Mixture-of-Experts (MoE) and hybrid reasoning. MoE models, for example, activate only the small subset of a larger model that is needed to answer a query, reducing computation and data transfer by a factor of 10 to 100 (a routing sketch follows this list).
- Efficient algorithms and quantization: We continually refine the algorithms that power our models, using methods such as Accurate Quantized Training (AQT) to maximize efficiency and reduce serving energy consumption without compromising response quality (a generic quantization sketch follows this list).
- Optimized inference and serving: We constantly improve how our AI models are served to make them more responsive and efficient. Technologies like speculative decoding let a small model make draft predictions that a larger model quickly verifies, delivering more responses with fewer chips than having the larger model make many sequential predictions on its own (a toy sketch follows this list). Techniques like distillation create smaller, more efficient serving models (Gemini Flash and Flash-Lite) that use our larger, more capable models as teachers. Faster ML hardware and models also let us use larger, more efficient batch sizes when processing requests while still meeting latency targets.
- Custom-built hardware: For over a decade, we've designed TPUs from the ground up to maximize performance per watt. We also co-design our AI models and TPUs, so the software takes full advantage of the hardware and the hardware can efficiently run future AI software when both are ready. Our latest-generation TPU, Ironwood, is 30x more energy efficient than our first TPU and significantly more energy efficient than general-purpose CPUs for inference.
- Optimized idling: Our serving stack uses CPUs efficiently and minimizes TPU idle time by dynamically moving models in near real time based on demand, rather than taking a "set it and forget it" approach.
- ML software stack: Our XLA ML compiler, Pallas kernels, and Pathways system enable model computations expressed in higher-level frameworks like JAX to run efficiently on our TPU serving hardware (a JAX sketch follows this list).
- Ultra-efficient data centers: Google's data centers are among the most efficient in the industry, operating at a fleet-wide average PUE of 1.09.
- Responsible data center operations: We continue to add clean energy generation in pursuit of our 24/7 carbon-free energy goal, while advancing our aim to replenish 120% of the freshwater we consume, on average, across our offices and data centers. We also optimize our cooling systems to navigate the local tradeoffs between energy, water, and emissions, using science-backed watershed health assessments to guide the selection of cooling type and limit water use in high-stress locations.
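To make the Mixture-of-Experts point concrete, here is a small routing sketch. The sizes, the top-k softmax gate, and the expert matrices are illustrative assumptions about how MoE routing works in general, not a description of Gemini's architecture; the point is that each token only pays for top_k of n_experts expert computations.

```python
import numpy as np

# Minimal MoE routing sketch (illustrative sizes; not Gemini's architecture).
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 16, 2

experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
gate_w = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ gate_w
    top = np.argsort(logits)[-top_k:]                       # indices of the k best experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the chosen experts
    # Only top_k of n_experts matmuls run, so compute scales with k, not n.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (64,) -- same output shape, ~k/n of the dense compute
```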
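For the quantization item, the following is a generic sketch of int8 weight quantization applied to a single matmul. It is not the AQT library's API (AQT integrates quantization into training itself); it only illustrates why storing weights in fewer bits cuts memory traffic and energy at a small accuracy cost.

```python
import numpy as np

# Generic int8 weight quantization sketch; not the AQT API.
def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: store int8 weights plus one float scale."""
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8), scale

def quantized_matmul(x: np.ndarray, w_q: np.ndarray, scale: float) -> np.ndarray:
    # Weights are held in 8 bits (4x smaller than float32); dequantize on the fly.
    return (x @ w_q.astype(np.float32)) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
x = rng.standard_normal((1, 256)).astype(np.float32)

w_q, s = quantize_int8(w)
err = np.abs(x @ w - quantized_matmul(x, w_q, s)).max()
print(f"int8 storage, max abs error vs float32: {err:.4f}")
```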
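The speculative decoding item can be sketched with toy stand-in models: a cheap draft model proposes several tokens, and the larger target model verifies them, keeping the longest agreeing prefix. The greedy accept rule and the toy integer-token "models" below are simplifications for illustration; production systems verify the whole draft in a single large-model pass and use a rejection-sampling acceptance rule.

```python
# Toy speculative decoding sketch (greedy accept rule; illustrative only).
def speculative_step(draft_next, target_next, prefix: list, k: int = 4) -> list:
    """One speculative step: draft k tokens cheaply, then verify with the target."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2) The large model checks each position (one parallel pass in a real system;
    #    queried per position here for clarity) and keeps the agreeing prefix.
    accepted, ctx = [], list(prefix)
    for t in proposed:
        if target_next(ctx) == t:
            accepted.append(t)                    # draft guess confirmed, kept "for free"
            ctx.append(t)
        else:
            accepted.append(target_next(ctx))     # correct the first mismatch, then stop
            break
    return accepted

# Toy models over integer tokens: the draft agrees with the target for short contexts.
target = lambda ctx: (sum(ctx) + 1) % 7
draft = lambda ctx: target(ctx) if len(ctx) < 6 else (target(ctx) + 1) % 7

print(speculative_step(draft, target, prefix=[1, 2, 3]))  # several tokens per step
```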
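Finally, a minimal sketch of the ML software stack item: a model step written in a high-level framework like JAX is traced and compiled by the XLA compiler into a fused program for the underlying accelerator (TPUs in production; CPU or GPU also work). The two-layer MLP here is illustrative, not a Gemini component.

```python
import jax
import jax.numpy as jnp

@jax.jit  # XLA traces, fuses, and compiles the whole function on first call
def mlp_step(params, x):
    w1, w2 = params
    h = jax.nn.relu(x @ w1)
    return h @ w2

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
params = (jax.random.normal(k1, (128, 256)) * 0.02,
          jax.random.normal(k2, (256, 16)) * 0.02)
x = jax.random.normal(k3, (8, 128))

print(mlp_step(params, x).shape)  # (8, 16), executed via the XLA-compiled program
```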
Efficient AI efforts
Gemini's efficiency gains are the result of years of work, and this is just the beginning. Recognizing that demand for AI will keep growing, we are investing heavily in reducing the amount of power and water required per prompt. By sharing our findings and methodology, we aim to drive industry-wide progress toward more efficient AI, which is essential for responsible AI development.
1. A point-in-time analysis quantified the energy consumed per median text generation prompt in the Gemini app, using data from May 2025. Emissions per prompt were estimated from energy per prompt, applying Google's 2024 fleet-wide average grid carbon intensity. Water consumption per prompt was estimated from energy per prompt, applying Google's 2024 fleet-wide average water efficiency. These findings are not representative of the environmental impact of every text generation prompt in the Gemini app, nor are they indicative of future performance.
2. The May 2025 analysis above was compared against baseline data for the median Gemini app text generation prompt from May 2024. Energy per median prompt is subject to change as new models are added, AI model architectures evolve, and AI chatbot user behavior changes. Data and claims have not been verified by an independent third party.
