AI Data Centres: The Machine Beneath The Cloud
A plain-English technical guide to the physical system behind AI infrastructure: power, cooling, network fabrics, accelerators, suppliers, operations, staffing, security, and the limits that appear when cloud capacity becomes real machinery.
- Published: May 14, 2026
- Reading: 30 min
- Author: Christopher Lyon
- Filed: Research

You can know the cloud well and still have the wrong picture of a data centre.
The cloud makes infrastructure look like a menu. Pick a region. Pick a zone. Pick an instance. Attach storage. Start a job. Watch a model train.
Underneath that menu is a physical machine. It takes electricity from the grid, pushes it through switchgear and power shelves, turns most of it into heat inside silicon, removes that heat with air or liquid, moves data through fibres and switches, and keeps the whole thing working with operators, controls, spares, guards, procedures, and alarms.
AI has made that machine harder to hide. A web service can often survive with more ordinary servers, normal rack power, and a network built for mixed traffic. Large AI training and inference systems concentrate power, heat, memory bandwidth, network traffic, storage pressure, supplier risk, and capital cost in one place. At that point the data centre stops being background infrastructure. It becomes part of the product.
The short version:
An AI data centre is not a room full of GPUs. It is a factory for useful computation. The raw inputs are power, chips, memory, data, network bandwidth, cooling capacity, and human operating discipline. The output is trained models, tokens, embeddings, search results, recommendations, simulations, and cloud services.
If any input is weak, the output gets expensive or unreliable.
Abstract
This article explains how AI data centres work for readers who already understand AI systems and cloud services. It does not start with "what is a server?" It starts with the next layer down: what has to be built and operated before a GPU instance, model endpoint, training cluster, or cloud region can exist.
The main claim is simple. AI infrastructure is constrained by five linked systems:
- Power: how much electricity can reach the site, the building, the row, the rack, and the chip.
- Cooling: how quickly heat can leave the chip, rack, room, and campus.
- Data movement: how fast data, gradients, model weights, checkpoints, and requests can move without stalling expensive accelerators.
- Control: how the facility and compute fleet are monitored, scheduled, patched, drained, repaired, and recovered.
- Trust: how physical security, personnel controls, supply chain, audit evidence, and customer isolation are maintained.
The hard work is not buying the most powerful accelerator. The hard work is building a powered, cooled, networked, secure, observable, maintainable failure domain that can run the workload at the required cost and schedule.
The Useful Mental Model
Think of the data centre as a machine with four flows.
| Flow | Plain-English job | What can break |
|---|---|---|
| Electricity in | Bring power from the grid to chips without unsafe faults or unacceptable interruptions | Utility delay, transformer shortage, UPS fault, breaker trip, rack power limit |
| Heat out | Move heat away from chips before equipment throttles or fails | Airflow problem, pump fault, bad water chemistry, cooling-tower limit, leak |
| Data through | Move bits between users, storage, CPUs, GPUs, and other data centres | Congestion, bad optics, weak fabric, high tail latency, slow storage |
| Control over | Decide what runs, where, when to change it, when to stop it, and who may touch it | Bad scheduler policy, weak telemetry, poor change control, unclear authority, security gap |
Most bad explanations of data centres describe the equipment but miss the flows. The equipment is there to protect the flows.
A GPU rack is not valuable because it looks dense. It is valuable only if enough clean power reaches it, enough heat leaves it, enough data reaches it, the network lets it cooperate with other racks, the scheduler keeps it busy, and operators can repair it without causing a wider outage.
That is the frame for the rest of the article.
Cloud Words, Physical Meaning
Cloud providers turn facilities into abstractions. That is the point of cloud computing. The abstraction is useful, but it can hide the physical commitments behind it.
AWS says an Availability Zone contains one or more discrete data centres with redundant power, networking, and connectivity. Google Cloud, Azure, and Oracle use similar region and zone language because customers need to reason about latency, data residency, service availability, and regulatory placement.[1][2][3][4]
The translation looks like this:
| Cloud word | What exists underneath |
|---|---|
| Region | A commercial geography backed by campuses, fibre, utilities, operating teams, contracts, and legal commitments |
| Availability Zone | A failure-domain promise backed by separated buildings, power paths, network paths, and operating assumptions |
| GPU instance | A slice of accelerator hardware that had to be bought, shipped, mounted, cabled, powered, cooled, monitored, patched, and amortized |
| Capacity | A forecast that turned into land, power reservations, equipment orders, construction schedules, and operational risk |
| Latency | Geography plus fibre path plus switch hops plus congestion plus software fan-out |
| Reliability | Design, commissioning, monitoring, spares, maintenance, change control, incident response, and luck managed down |
| Sustainability | Energy source, PUE, water strategy, carbon intensity, utilization, hardware lifetime, and local grid impact |
The cloud interface is a promise. The data centre is how the promise is kept.
What Makes AI Different
AI did not invent the data centre. Banks, telecom networks, search engines, cloud providers, laboratories, and governments have run serious compute facilities for decades. The change is density and synchronization.
Earlier cloud growth was often scale-out web infrastructure: many servers, many services, much traffic, but not always one tightly coupled job requiring thousands of accelerators to behave like one machine. Large AI training changes that. The accelerators exchange data frequently. A slow link, a congested switch, a storage pause, or one failing worker can slow the job. Google's "tail at scale" point applies brutally: rare per-node delays become ordinary when a request or training step depends on many nodes.[5]
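To make that concrete, here is a back-of-the-envelope sketch (the per-worker slowness probability is an invented illustration, not a measurement): if each worker independently has a small chance of being slow in a given synchronous step, the chance that the step waits on at least one straggler rises quickly with cluster size.

```python
# Probability that a synchronous step is delayed by at least one slow worker,
# assuming independent per-worker slowness. Illustrative numbers only.
def p_step_delayed(p_slow_per_worker: float, num_workers: int) -> float:
    return 1.0 - (1.0 - p_slow_per_worker) ** num_workers

for n in (8, 256, 4096, 16384):
    print(f"{n:6d} workers, 0.1% slow each -> "
          f"{p_step_delayed(0.001, n):6.1%} of steps see a straggler")
```

At a few thousand workers, nearly every step pays the tail, which is why congestion control, failure detection, and straggler mitigation are cluster-level design work rather than tuning.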
Inference changes the problem again. It may need lower latency, higher availability, model-weight distribution, retrieval systems, safety systems, logging, and burst capacity close to users. A training cluster can sometimes run far from the customer. An inference service may need to sit inside a region where latency, sovereignty, and product reliability matter.
AI also changes the rack. The important unit used to be a server. Then it became a rack. With systems like NVIDIA GB200 NVL72 and DGX SuperPOD reference architectures, the unit starts to look like a rack-scale or pod-scale computer: CPUs, accelerators, NVLink or other scale-up interconnect, networking, storage, cooling, power, and management designed together.[6][7]
This is why the phrase "AI factory" is useful when it is used carefully. The facility is not just hosting computation. It is arranged to turn capital equipment and electricity into model outputs.
The Build Sequence
A serious data-centre project does not start by picking a favourite GPU. It starts with a workload and a constraint model.
The first questions are plain:
| Question | Why it matters |
|---|---|
| Is the site for training, inference, storage, general cloud, or a mix? | Each load shape stresses a different part of the system |
| How many megawatts are needed now and later? | Power is often the schedule gate |
| What rack density must the building support? | Cooling and electrical design follow rack density |
| What network behaviour does the workload require? | AI training punishes weak east-west bandwidth and poor congestion control |
| What data must stay local? | Residency, latency, and data gravity shape region choice |
| How long may a job or service be interrupted? | Redundancy, checkpointing, maintenance, and customer promises depend on it |
| Who will operate it? | A design that the team cannot maintain is not a design |
Then comes site selection. Cheap land is not enough. A workable AI site needs power, fibre, civil access, permits, water or a waterless cooling strategy, construction labour, political acceptance, security, and expansion room. Utility interconnection can decide the schedule before the first rack arrives. NREL's interconnection work and NERC's reliability assessments are useful reminders that new load is a power-system problem, not just a customer procurement problem.[8][9]
After the site comes the design basis. This is the document that says what the building is being designed to handle: IT load, rack density, redundancy, cooling medium, power topology, network architecture, physical security zones, maintainability, and commissioning tests. A bad design basis poisons the project. If the design assumes 20 kW racks and procurement later buys 120 kW liquid-cooled racks, everyone is now negotiating with physics.
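The 20 kW versus 120 kW mismatch is easy to put numbers on. A minimal sketch, with a hypothetical 10 MW hall (no real project figures implied):

```python
# Same IT power budget, very different hall layout when rack density changes.
# All numbers are hypothetical planning figures.
hall_it_load_mw = 10.0
designed_rack_kw = 20.0    # what the design basis assumed (air-cooled)
procured_rack_kw = 120.0   # what procurement actually bought (liquid-cooled)

racks_designed = hall_it_load_mw * 1000 / designed_rack_kw
racks_procured = hall_it_load_mw * 1000 / procured_rack_kw

print(f"Racks at 20 kW each:  {racks_designed:.0f}")
print(f"Racks at 120 kW each: {racks_procured:.0f}")
# Same megawatts, roughly a sixth of the racks, and a cooling, floor-loading,
# and busway design that assumed the other layout.
```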
Procurement follows. For AI, procurement is not a back-office function. It is part of the architecture:
| Supplier layer | What it decides |
|---|---|
| Accelerator vendor | Compute density, software ecosystem, rack shape, power draw, cooling path |
| Foundry and packaging | Whether the chips and HBM packages can be made in volume |
| Memory supplier | HBM capacity, bandwidth, yield, and delivery schedule |
| Server OEM or ODM | How chips become serviceable systems and racks |
| Network vendor | Fabric bandwidth, telemetry, congestion behaviour, optics, support model |
| Power vendor | Switchgear, UPS, busway, transformers, breakers, generators, protection |
| Cooling vendor | Chillers, dry coolers, CDUs, cold plates, pumps, valves, monitoring |
| Construction trades | Whether the design becomes tested capacity on a real schedule |
Commissioning is the first honest exam. Electrical teams test switchgear, protection, UPS behaviour, generator starts, transfer sequences, grounding, and load banks. Mechanical teams test pumps, valves, airflow, heat rejection, water chemistry, leak detection, and controls. IT teams test cabling, optics, firmware, network paths, storage throughput, scheduler integration, and workload burn-in.
Commissioning is not a ceremony. It is where the building is encouraged to fail while the consequences are still contained.
Power: The First Gate
Every watt that enters a data centre becomes heat. That one sentence explains why power and cooling are inseparable.
Power starts outside the fence. A utility must be able to serve the site. That can require transmission work, distribution upgrades, substations, transformers, protection studies, metering, contracts, and time. AI campuses can be large enough that a local utility, regulator, or community cannot treat them like ordinary commercial load. IEA, LBNL, EPRI, and Uptime Institute all frame AI data-centre growth as an energy-system issue, not merely an IT issue.[10][11][12][13][14]
Inside the site, power moves through a chain:
| Stage | Plain-English role |
|---|---|
| Utility interconnect | The site gets power from the grid |
| Substation | Voltage is stepped, switched, protected, and metered |
| Medium-voltage distribution | Large power blocks move around the campus |
| Transformers | Voltage is stepped down for buildings and equipment |
| Switchgear and breakers | Faults are isolated and maintenance becomes possible |
| UPS | Short interruptions are bridged and equipment gets time to ride through or shut down |
| Generators or alternate supply | Longer outages are covered if the design requires it |
| Busway and rack distribution | Power reaches rows, racks, and power shelves |
| Server power supplies | Electricity becomes usable DC power for chips, memory, fans, pumps, and controllers |
Redundancy means spare capacity in the right place. It does not mean safety by slogan. N+1 means one extra component beyond the needed number. 2N means two independent systems sized for the load. Distributed redundant designs spread the spare capacity differently. Each choice adds cost, complexity, test work, and failure modes.
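In module counts, the difference looks like this. A minimal sketch with a hypothetical 4 MW IT load and 1 MW UPS modules; real designs also weigh maintenance bypass, fault behaviour during maintenance, and cost:

```python
import math

# UPS module counts implied by common redundancy schemes (hypothetical sizing).
it_load_mw = 4.0
module_mw = 1.0

n = math.ceil(it_load_mw / module_mw)  # modules needed just to carry the load
print(f"N   = {n} modules")
print(f"N+1 = {n + 1} modules (one spare)")
print(f"2N  = {2 * n} modules (two independent {n}-module systems)")
```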
PUE is the common energy-efficiency ratio. It means total facility energy divided by IT equipment energy. If a data centre has a PUE of 1.2, then each 1 MW of IT equipment needs another 0.2 MW for cooling, power losses, pumps, lights, controls, and other overhead. ISO/IEC 30134-2 standardizes PUE, and The Green Grid helped popularize it.[15][16]
PUE is useful, but it can mislead. It does not tell you whether the GPUs are doing useful work. A low-PUE site with badly scheduled accelerators can still waste money. A slightly higher-PUE site in a cleaner grid or water-constrained region may be the better decision. For AI, the energy question is not only "how efficient is the building?" It is also "how much useful model work comes out per watt, per dollar, and per litre of water?"
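The arithmetic behind that PUE sentence, as a small sketch (illustrative loads only):

```python
# PUE = total facility energy / IT equipment energy.
# For a steady load, the same ratio applies to power draw.
it_load_mw = 1.0

for pue in (1.1, 1.2, 1.5):
    total_mw = it_load_mw * pue
    overhead_mw = total_mw - it_load_mw
    print(f"PUE {pue}: {total_mw:.2f} MW at the meter, "
          f"{overhead_mw:.2f} MW of cooling, conversion losses, and other overhead")
```

None of which says whether the IT megawatt produced useful tokens.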
Cooling: The Heat Has To Leave
A chip is a heater that happens to compute.
Air cooling works by moving cold air to the front of servers and hot air away from the back. Good air-cooled data halls manage hot aisles, cold aisles, blanking panels, floor tiles, containment, fan energy, pressure, humidity, filters, and equipment inlet temperature. ASHRAE TC 9.9 matters because IT equipment has a thermal operating envelope; the target is not human comfort.[17]
Dense AI racks make this harder. The rack may draw so much power that air alone becomes inefficient or impractical. Direct-to-chip liquid cooling puts a cold plate on the hot components, usually CPUs and accelerators. Liquid carries heat to a coolant distribution unit, then into a facility water loop, then to chillers, dry coolers, cooling towers, or another heat-rejection system. Open Compute Project cold-plate work exists because these liquid-cooled interfaces need common expectations.[18]
Plainly:
| Cooling part | What it does |
|---|---|
| Cold plate | Touches the hot chip package and collects heat |
| Rack manifold | Distributes coolant to many cold plates |
| CDU | Controls flow, pressure, temperature, and separation between technology coolant and facility water |
| Facility loop | Moves heat away from the room |
| Chiller, dry cooler, or tower | Rejects heat outside the building |
| Sensors and controls | Detect temperature, pressure, leaks, flow, and abnormal states |
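To make the CDU's job in the table above concrete, here is a rough flow-sizing sketch using the basic heat-transport relation Q = ṁ · c_p · ΔT. The rack power and allowed temperature rise are hypothetical; real loops add margin, coolant mixtures, pressure limits, and vendor-specified flow rates:

```python
# Rough coolant flow needed to carry a rack's heat away.
# Q = m_dot * c_p * delta_T  ->  m_dot = Q / (c_p * delta_T)
rack_heat_w = 120_000.0   # hypothetical 120 kW rack; nearly every watt becomes heat
cp_water = 4186.0         # J/(kg*K), specific heat of water
delta_t_k = 10.0          # allowed coolant temperature rise across the rack

mass_flow_kg_s = rack_heat_w / (cp_water * delta_t_k)
litres_per_min = mass_flow_kg_s * 60.0       # ~1 litre per kg for water

print(f"~{mass_flow_kg_s:.1f} kg/s, roughly {litres_per_min:.0f} litres per minute for one rack")
```

Multiply by a row of racks and the pumps, pipe diameters, and CDU capacity stop being details.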
Liquid cooling does not remove the operations problem. It changes it. The team now needs leak procedures, coolant chemistry, pump maintenance, valve checks, quick-disconnect discipline, spare parts, and technicians who can work near expensive powered equipment.
Water is the uncomfortable tradeoff. Evaporative cooling can be energy efficient, but it consumes water. Dry or closed-loop designs can reduce water use, but may use more electricity depending on climate and design. Microsoft has said its next-generation zero-water cooling designs reduce water consumed for cooling but cause a nominal increase in annual energy use compared with evaporative systems. Google frames cooling choices as a local balance between energy, water, carbon-free supply, climate, and workload.[19][20][21]
There is no universal best cooling system. There is only a cooling system that fits the rack density, climate, water politics, energy source, operating team, and failure tolerance.
Networking: North-South, East-West, And The Fabric
Two terms are worth getting right.
North-south traffic enters or leaves the data centre or cluster. User requests, API calls, internet traffic, and region-to-region traffic often fit here.
East-west traffic moves inside the data centre. Service-to-service calls, storage reads, model-shard communication, gradient exchange, checkpoint traffic, and management traffic fit here.[22]
Traditional web serving cares a lot about north-south traffic because users are outside the facility. AI training cares heavily about east-west traffic because accelerators must cooperate inside the cluster.
Modern data-centre networks use Clos or spine-leaf ideas. A leaf switch connects servers. Spine switches connect leaves. The goal is predictable paths between racks without one central choke point. Clos switching theory is old, but hyperscalers turned it into modern data-centre fabrics. Google's Jupiter and Meta's F16 work show how much engineering goes into making these fabrics cheap, fast, observable, and controllable at scale.[23][24][25]
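The port arithmetic behind a non-blocking two-tier spine-leaf fabric is short enough to sketch. The switch radix below is hypothetical; real fabrics also choose oversubscription ratios, rail layouts, and failure domains:

```python
# Port arithmetic for a non-blocking two-tier spine-leaf fabric built from
# identical switches. Radix is hypothetical.
radix = 64                   # ports per switch

leaf_down = radix // 2       # server-facing ports per leaf
leaf_up = radix // 2         # spine-facing ports per leaf (1:1, non-blocking)
num_spines = leaf_up         # each leaf has one uplink to every spine
num_leaves = radix           # each spine port serves a different leaf
max_host_ports = num_leaves * leaf_down

print(f"{num_spines} spines, {num_leaves} leaves, up to {max_host_ports} host ports")
```

Bigger clusters add a third tier or larger-radix switches; the same arithmetic, repeated.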
AI adds a back-end network problem. A training job may need thousands of accelerators to exchange data at the same time. If one path is slow, the job can wait. The network needs high bandwidth, low latency, good congestion control, fast failure detection, and useful telemetry.
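A rough feel for why per-accelerator bandwidth matters: in a textbook flat ring all-reduce, each worker sends and receives about 2 × (N−1)/N times the gradient size every synchronous step. The model size, precision, and link speed below are hypothetical, and real jobs hide much of this behind compute with overlapped and hierarchical reductions; the point is the order of magnitude:

```python
# Textbook ring all-reduce traffic and time per step (illustrative numbers).
params = 70e9            # hypothetical parameter count
bytes_per_grad = 2       # gradients exchanged in 16-bit precision
workers = 1024
link_gbps = 400          # per-accelerator network bandwidth, hypothetical

grad_bytes = params * bytes_per_grad
bytes_per_worker = 2 * (workers - 1) / workers * grad_bytes
seconds = bytes_per_worker / (link_gbps * 1e9 / 8)

print(f"~{bytes_per_worker / 1e9:.0f} GB moved per worker per step, "
      f"~{seconds:.1f} s of pure communication at {link_gbps} Gb/s")
```

Halve the effective bandwidth with congestion or a sick optic and the exposed communication time roughly doubles, for every step, for every worker.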
This is where InfiniBand, Ethernet/RoCE, and newer AI-focused Ethernet work appear.
| Fabric choice | Why teams choose it | What they inherit |
|---|---|---|
| InfiniBand | Mature low-latency HPC and AI training fabric | More specialized operations and supplier concentration |
| Ethernet with RoCE | Huge ecosystem, cloud familiarity, supplier diversity | Careful congestion tuning and operational discipline |
| Proprietary scale-up links | Very high bandwidth inside a rack or pod | Platform lock-in and shorter reach |
InfiniBand remains important in AI and HPC. Ethernet is being adapted for AI and HPC at scale, including through the Ultra Ethernet Consortium.[26][27]
The physical layer matters too. Optics, copper, fibre trunks, patch panels, cable trays, labels, bend radius, cleaning, transceivers, and switch thermals are not accessories. A dirty connector can become a training delay. A late optics shipment can become a cluster delay. A hot switch can become a network incident.
Compute: The Chip Is Only One Layer
The accelerator gets the attention because it is expensive and visible. It is still only one part of the compute path.
An AI server or rack brings together:
| Component | Why it matters |
|---|---|
| Accelerator | Performs the dense numerical work |
| HBM | Feeds the accelerator with very high memory bandwidth |
| CPU | Handles orchestration, preprocessing, host work, and parts of the application |
| NIC or DPU | Moves data onto the network and may offload work from the CPU |
| Local storage | Handles cache, scratch, logs, or fast staging |
| Firmware and drivers | Decide whether the hardware behaves consistently |
| Libraries | Make the hardware usable by model code |
| Rack interconnect | Lets accelerators behave like a larger system |
HBM is worth calling out. It is stacked memory placed close to the accelerator package. It gives much higher bandwidth than ordinary server memory, but it also ties AI performance to advanced packaging, memory yield, thermal design, and supplier capacity. JEDEC standards, SK hynix production announcements, Micron product material, Samsung HBM material, and TSMC annual reporting all point to the same fact: AI infrastructure depends on memory and packaging, not just logic chips.[28][29][30][31][32]
This is why supplier conversations quickly become architectural conversations. NVIDIA, AMD, Intel, Broadcom, TSMC, memory suppliers, server OEMs, network vendors, and power/cooling suppliers are all part of the same machine.[33][34][35][36]
The practical rule is simple: an accelerator is idle whenever the rest of the system cannot feed it, cool it, schedule it, or recover it.
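One way to see the "feed it" half of that rule: in simple low-batch LLM decoding, generating a token requires streaming the model weights out of memory, so HBM bandwidth caps tokens per second before arithmetic does. A hedged sketch with hypothetical figures:

```python
# Memory-bandwidth ceiling on single-stream decode speed.
# tokens/s <= HBM bandwidth / bytes of weights read per token.
params = 70e9            # hypothetical parameter count
bytes_per_param = 2      # 16-bit weights
hbm_bytes_per_s = 8e12   # hypothetical accelerator HBM bandwidth (8 TB/s)

weight_bytes = params * bytes_per_param
ceiling_tokens_per_s = hbm_bytes_per_s / weight_bytes

print(f"<= {ceiling_tokens_per_s:.0f} tokens/s per stream before batching, "
      f"KV-cache traffic, or network limits are even considered")
```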
Storage: The Slow Part Can Be Somewhere Else
AI storage is easy to underestimate because cloud object storage feels bottomless. Inside the machine, storage is hardware, network paths, filesystems, metadata services, rebuild behaviour, durability policy, capacity planning, and power draw.
Training needs datasets, transformed shards, checkpoints, logs, model weights, evaluation outputs, and experiment records. Inference needs model weights, cache, retrieval indexes, logs, safety records, and sometimes vector databases. Governance needs lineage, retention, deletion, access control, and audit evidence.
A training path may look like this:
- Raw data lands in object storage or a data lake.
- Preprocessing turns it into training-ready shards.
- High-throughput storage feeds the cluster.
- Accelerators consume batches.
- Checkpoints write back on a schedule.
- Failed jobs restart from checkpoints.
- Model artifacts move to evaluation, fine-tuning, serving, or archive.
Each stage can slow the job. NVMe, NVMe over Fabrics, parallel filesystems, burst buffers, object stores, metadata services, and storage efficiency work all matter because GPUs are too expensive to wait politely for data.[37][38]
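Checkpoints are a good place to do the arithmetic before the cluster exists. A minimal sketch; the model size, optimizer-state overhead, and storage bandwidths are hypothetical:

```python
# Checkpoint size and write time for a large training job (illustrative only).
params = 70e9              # hypothetical parameter count
bytes_per_param = 12       # weights + optimizer state, rough mixed-precision assumption

checkpoint_bytes = params * bytes_per_param
print(f"Checkpoint size: ~{checkpoint_bytes / 1e12:.1f} TB")

for write_gb_s in (5, 50):  # aggregate storage write bandwidth
    seconds = checkpoint_bytes / (write_gb_s * 1e9)
    print(f"At {write_gb_s} GB/s aggregate write bandwidth: ~{seconds:.0f} s per checkpoint")
```

If a checkpoint stalls the job, that cost repeats every interval; if recovery replays everything since the last one, the interval itself becomes lost accelerator time after a failure.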
Storage is not the glamorous part of an AI data centre. It is one of the easiest places to waste the glamorous part.
Control: The Building And The Fleet Both Need Operators
There are two control systems in an AI data centre.
The facility control system watches the building: power, switchgear, UPS, generators, pumps, valves, chillers, CDUs, cooling towers, leak sensors, fire systems, cameras, doors, and environmental sensors.
The compute control system watches the fleet: servers, accelerators, NICs, storage, images, firmware, drivers, schedulers, jobs, quotas, logs, metrics, traces, alerts, and customer services.
In a low-density facility those worlds can be loosely connected. In an AI facility they need to talk. A scheduler may need to avoid racks under thermal constraint. A liquid-cooling alarm may need a workload drain. A firmware update may change power behaviour. A network maintenance window may kill a training job that assumed the fabric would stay stable for days.
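What "the facility and the fleet need to talk" can look like in scheduler terms is simple to sketch: filter racks out of placement when their facility telemetry is abnormal. Everything here is invented for illustration; a real site would feed this from its BMS/DCIM and use the actual scheduler's drain, cordon, or taint mechanism:

```python
from dataclasses import dataclass

@dataclass
class RackTelemetry:
    rack_id: str
    coolant_supply_c: float   # CDU supply temperature to the rack
    leak_detected: bool
    power_headroom_kw: float  # remaining budget on the rack's feed

def schedulable(rack: RackTelemetry, job_power_kw: float) -> bool:
    """Hypothetical placement filter driven by facility telemetry."""
    if rack.leak_detected:
        return False                   # drain, dispatch a technician, place nothing
    if rack.coolant_supply_c > 32.0:   # invented thermal threshold
        return False                   # avoid the rack while the loop is investigated
    return rack.power_headroom_kw >= job_power_kw

racks = [
    RackTelemetry("r01", 27.5, False, 30.0),
    RackTelemetry("r02", 33.1, False, 90.0),   # too warm
    RackTelemetry("r03", 28.0, True, 120.0),   # leak alarm
]
print([r.rack_id for r in racks if schedulable(r, job_power_kw=25.0)])  # -> ['r01']
```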
Google's Borg paper is still useful because it shows the data centre as a managed resource pool. Kubernetes is the common cloud-native control-plane model. Slurm remains common in HPC and AI clusters. OpenTelemetry gives a modern observability vocabulary for traces, metrics, and logs.[39][40][41][42]
The question is not only which scheduler to use. The question is who has authority.
Who may reserve a thousand GPUs? Who may preempt a job? Who may drain a rack? Who may roll firmware? Who may ignore utilization to protect reliability? Who owns the incident when a facility alarm and a model-training failure are the same event?
If that authority model is vague, the cluster will be run by escalation.
Hyperscalers
Hyperscalers are not just large customers. They are infrastructure manufacturers.
Amazon, Microsoft, Google, Meta, and Oracle have enough scale to shape server designs, accelerator plans, switch fabrics, data-centre locations, cooling choices, energy contracts, and internal operating systems. Their advantage is not only money. It is repetition. They build enough sites to learn, standardize, and push suppliers.
Their public filings and investor updates show the capital weight of AI infrastructure. Microsoft, Alphabet, Amazon, and Meta all frame cloud and AI infrastructure as a major investment area.[43][44][45][46]
Hyperscalers also hide a hard truth from customers: every simple cloud SKU is backed by a complex capacity bet. A GPU instance offered in a region means someone has already made decisions about land, power, cooling, hardware supply, network topology, customer demand, depreciation, and failure risk.
The cloud customer buys abstraction. The hyperscaler sells abstraction by owning more of the mess.
Suppliers
The supplier base is wider than the AI conversation usually admits.
| Supplier group | What they control |
|---|---|
| NVIDIA, AMD, Intel, custom ASIC teams | Accelerator roadmaps, software ecosystems, rack design pressure |
| Broadcom, Arista, Cisco, NVIDIA Networking, others | Switching silicon, NICs, fabrics, telemetry, congestion behaviour |
| TSMC and advanced packaging suppliers | Whether the most advanced silicon and package designs can ship |
| SK hynix, Micron, Samsung | HBM supply, yield, bandwidth, capacity |
| Dell, HPE, Supermicro, Lenovo, ODMs | Server integration, rack serviceability, firmware, spares |
| Schneider Electric, Vertiv, Eaton, ABB, Siemens | Electrical infrastructure, cooling infrastructure, controls, monitoring |
| Cummins, Caterpillar, fuel and backup suppliers | Backup generation, fuel logistics, emergency power |
| Construction and commissioning firms | Whether drawings become usable capacity |
NVIDIA is central now because it sells more than GPUs: systems, networking, interconnects, software, reference architectures, and management tooling. Dell, HPE, Supermicro, Lenovo, and others turn platforms into products that can be ordered, installed, serviced, and supported. Schneider Electric, Vertiv, Eaton, ABB, Siemens, Cummins, and Caterpillar show why power and cooling vendors are now AI infrastructure vendors too.[47][48][49][50][51][52][53][54][55][56][57]
The decision lesson: do not treat the GPU purchase order as the project. The project is the whole supplier chain arriving in the right order.
Operations: The Part That Keeps Happening
Construction ends. Operations does not.
A running AI data centre has several daily loops:
| Operating loop | What it does |
|---|---|
| Facilities operations | Watches power, cooling, fire, water, leaks, fuel, alarms, and maintenance state |
| IT operations | Watches servers, accelerators, storage, firmware, operating systems, and workloads |
| Network operations | Watches switch health, optics, congestion, routing, maintenance, and capacity |
| Security operations | Watches access, cameras, badges, deliveries, contractors, rack doors, media, and incidents |
| Capacity operations | Forecasts demand, manages quotas, reserves clusters, and plans expansion |
| Maintenance | Replaces parts, tests generators, services UPS systems, checks cooling loops, and calibrates sensors |
| Change management | Controls what changes, when it changes, who approves, and how to roll back |
| Incident management | Detects, triages, communicates, mitigates, learns, and updates procedures |
The procedures have unromantic names: MOPs, SOPs, EOPs, permits to work, lockout/tagout, rounds, shift handovers, spares, runbooks, maintenance windows, post-incident reviews, access logs, and change freezes. They exist because a small mistake can affect many customers or many millions of dollars of equipment.
Uptime Institute's outage and survey work keeps returning to the same general lesson: failures are not only equipment failures. They are also process, staffing, maintenance, power, networking, and change-control failures.[58][59]
AI adds pressure because utilization is financially painful. Idle accelerators look like waste. But chasing utilization too aggressively can create fragile operations: overloaded cooling zones, rushed updates, weak drain procedures, ignored alarms, and maintenance debt. The operating model needs a rule for when to protect the fleet instead of chasing one more percentage point of use.
Staffing
AI data centres are marketed through software, but they are run by people who understand electrical rooms, cooling loops, controls, racks, fibres, safety, and security.
A serious site needs electrical engineers, mechanical engineers, controls engineers, network engineers, hardware technicians, cluster operators, security staff, safety staff, supply-chain planners, and commissioning specialists.
The awkward part is overlap. A technician working on a liquid-cooled accelerator rack needs IT hardware skill, mechanical awareness, safety training, vendor procedure, and the judgement to stop when a leak or alarm looks wrong. A cluster operator needs to understand that a facility constraint can be a scheduler constraint. A security officer needs to know whether a contractor's badge request matches a real work order and a permitted area.
Staffing is therefore a scaling limit. You can order hardware faster than you can train people to run it well.
Security
Cloud security begins below the cloud.
AWS, Microsoft, and Google publish physical security controls because customers need to trust the provider's physical-to-logical chain. Those controls include perimeter security, guards, cameras, badges, multi-factor access, cages, locked racks, visitor controls, equipment movement controls, media handling, incident response, and audit logs.[60][61][62]
AI raises the stakes:
| Security issue | Why it matters |
|---|---|
| Model weights | A small number of files can represent enormous training cost and strategic value |
| Accelerator scarcity | Hardware theft, diversion, or tampering has direct business impact |
| Firmware and supply chain | NICs, baseboard management controllers, drives, accelerators, optics, and firmware expand the attack surface |
| Remote hands | Contractors may touch high-value systems during urgent work |
| Multi-tenancy | Customers need isolation across expensive shared hardware |
| Physical-to-logical events | A badge event, rack-door alarm, drive movement, or camera alert may matter to cyber investigation |
| Sovereignty | Some customers care exactly where data, staff, and hardware operations sit |
NIST SP 800-53 is useful because it refuses to separate cyber security from physical, personnel, maintenance, contingency, incident, and supply-chain controls. ISO/IEC 27001 frames security as a management system. FedRAMP shows how cloud customers inherit provider controls rather than operating every physical control themselves.[63][64][65]
A data-centre security failure does not have to be dramatic. It can be an unrevoked badge, a contractor in the wrong room, a mislabeled drive, an unmanaged maintenance laptop, a bad firmware chain, or a rack-door alarm no one correlated with a logical event.
Scaling Failures
AI data-centre plans usually fail at the joins between disciplines.
| Failure mode | What it really means |
|---|---|
| "We have GPUs but cannot use them" | The network, storage, scheduler, power, cooling, or software stack is limiting useful work |
| "The building is ready but not live" | Utility power, commissioning, permits, fibre, switchgear, or operating readiness is late |
| "The cluster is hot" | Rack density, containment, flow, workload placement, or liquid loop behaviour is wrong |
| "The network is flaky" | Optics, congestion control, firmware, cabling, routing, or job traffic is outside the expected envelope |
| "Storage is slow" | Data layout, metadata, checkpointing, rebuilds, or network paths are starving accelerators |
| "Utilization is poor" | Resources exist, but not in the shape jobs need, or the scheduler cannot pack them cleanly |
| "Maintenance causes incidents" | Procedures, authority, drain logic, spares, or rollback paths are weak |
| "Security slows everything down" | Access, audit, contractor, and equipment movement processes were added after the operating model |
| "The community is angry" | Power, water, tax, noise, land use, or emissions questions were treated as externalities |
The common cause is simple: the team optimized one layer and assumed the others would comply.
They do not comply. They negotiate.
The Century Behind The Machine
AI data centres feel new because the demand shock is new. The ingredients have a long history.
| Period | Breakthrough | What it changed |
|---|---|---|
| 1900s-1930s | Electrified industry and tabulating machines | Computation became tied to rooms, power, operators, and business process |
| 1940s | Electronic computers | Heat, reliability, maintenance, and power became computing problems |
| 1950s | Transistors and switching theory | Smaller electronics and scalable network ideas entered the story |
| 1960s | Integrated circuits, mainframes, time sharing | Central compute became shared institutional infrastructure |
| 1970s | DRAM, Ethernet, minicomputers | Memory density and local networking moved toward clustered systems |
| 1980s | TCP/IP, fibre, client-server computing | Networked computing became normal |
| 1990s | Web scale and commodity servers | Software had to assume hardware failure |
| 2000s | Warehouse-scale computing and virtualization | The data centre became the computer |
| 2010s | Deep learning, GPUs, TPUs, Kubernetes, NVMe, Clos fabrics | Accelerated cloud infrastructure became mainstream |
| 2020s | Transformers, HBM, chiplets, advanced packaging, liquid cooling, rack-scale AI systems | Power, cooling, memory, network, and supplier capacity became first-order AI constraints |
The modern AI branch runs through GPU-accelerated deep learning, transformers, large language-model scaling, in-data-centre accelerators, high-bandwidth memory, and rack-scale systems.[66][67][68][69][70][28][71][72]
The pattern repeats. More useful compute exposes the next bottleneck. Faster chips expose heat. Bigger models expose network and memory. Better cloud products create more demand. More demand exposes grid limits. The data centre absorbs the consequence.
Useful Terms
| Term | Plain meaning |
|---|---|
| AI factory | A data centre arranged to produce AI outputs, not just generic compute |
| Availability Zone | A cloud failure-domain promise backed by physical facilities and networks |
| Back-end network | The cluster network used for accelerator, storage, and training traffic |
| BMS | Building management system |
| Busway | Electrical distribution that carries power along rows or overhead paths |
| CDU | Coolant distribution unit for liquid-cooled systems |
| Clos fabric | A scalable multi-stage switching design used in data-centre networks |
| DCIM | Data-centre infrastructure management tooling |
| Direct-to-chip cooling | Liquid cooling that removes heat through cold plates on hot components |
| East-west traffic | Traffic between systems inside the data centre |
| EPMS | Electrical power monitoring system |
| Front-end network | Network used for service, user, storage, or management traffic depending on context |
| HBM | High bandwidth memory placed close to accelerators |
| Hot aisle/cold aisle | Airflow layout separating server intake and exhaust |
| InfiniBand | Low-latency fabric used in HPC and many AI clusters |
| IT load | Power used by servers, storage, and network equipment |
| MOP | Method of procedure for controlled maintenance work |
| North-south traffic | Traffic entering or leaving the data centre or cluster boundary |
| N+1 | Enough capacity plus one spare component |
| PDU | Power distribution unit |
| PUE | Facility energy divided by IT equipment energy |
| RDMA | Network data movement with low CPU involvement |
| RoCE | RDMA over Converged Ethernet |
| Spine-leaf | Common two-layer data-centre network fabric |
| Straggler | Slow worker, path, or component that delays the whole distributed job |
| UPS | Uninterruptible power supply |
| White space | The data hall area where racks live |
| WUE | Water-use metric for data-centre operation |
What To Inspect Before Building Or Buying
The first useful artifact is not a hardware shopping list. It is a constraint map.
Ask:
| Gate | Question |
|---|---|
| Workload | What mix of training, fine-tuning, batch inference, online inference, retrieval, and storage must run? |
| Power | What power is actually available, by date, and with what expansion path? |
| Cooling | What rack densities must be supported over the next five years? |
| Network | What east-west bandwidth, latency, and congestion behaviour do jobs need? |
| Storage | What feed rate, checkpoint rate, metadata rate, and recovery time are required? |
| Scheduler | How will scarce accelerators be allocated, reserved, preempted, and charged back? |
| Operations | Who owns maintenance, changes, incidents, capacity, and security? |
| Supplier risk | Which components have long lead times or single-supplier exposure? |
| Security | Which physical, logical, personnel, supply-chain, and compliance controls must be inherited or operated? |
| Exit | What would make the build worse than cloud, colocation, or a managed AI cluster? |
The last question is important. Most organizations should not build a serious AI data centre. They should rent, colocate, reserve cloud capacity, or buy a managed cluster until the workload is stable enough and large enough to justify owning the risk.
The builder owns power risk, construction risk, cooling risk, supplier risk, staffing risk, utilization risk, security risk, and technology-refresh risk. The cloud buyer pays a premium to avoid much of that risk. Neither path is automatically superior.
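One way to keep the constraint map honest is to write it down as data and let a script name the binding gate. A minimal sketch; every gate, requirement, and committed value below is hypothetical:

```python
# A constraint map as data: required vs committed for each gate.
# The binding constraint is the gate with the largest relative shortfall.
constraints = {
    "power_mw_available_by_need_date": {"required": 40.0, "committed": 24.0},
    "rack_density_kw_supported":       {"required": 120.0, "committed": 80.0},
    "east_west_tbps_per_pod":          {"required": 51.2, "committed": 51.2},
    "storage_feed_gb_s":               {"required": 400.0, "committed": 250.0},
    "trained_operators_on_shift":      {"required": 12.0, "committed": 6.0},
}

def shortfall(gate: dict) -> float:
    return (gate["required"] - gate["committed"]) / gate["required"]

for name, gate in sorted(constraints.items(), key=lambda kv: -shortfall(kv[1])):
    print(f"{name:34s} shortfall {shortfall(gate):5.0%}")

binding = max(constraints, key=lambda name: shortfall(constraints[name]))
print(f"\nBinding constraint: {binding}")
```

If the binding gate turns out to be people or power rather than GPUs, the constraint map has already answered the build-versus-buy question for the near term.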
Honest Summary
The cloud is not fake. It is an interface over real machinery.
AI makes the machinery visible because it concentrates demand. The data centre has to deliver power, remove heat, move data, schedule work, protect assets, and recover from failure while the equipment gets denser and more expensive.
The serious question is not "can we get GPUs?"
It is:
Can we get enough power to the right place, remove the heat, feed the accelerators, keep the network stable, operate the fleet safely, and prove to customers that the system is secure?
If the answer is no, the correct decision may be to stay out of the data-centre business. If the answer is yes, the work starts with megawatts, cooling loops, network topology, controls, people, and failure drills.
The model comes later.
References
1. Amazon Web Services Documentation. AWS Regions and Availability Zones. 2026. https://docs.aws.amazon.com/global-infrastructure/latest/regions/aws-regions-availability-zones.html
2. Google Cloud Documentation. Geography and regions. 2026. https://cloud.google.com/docs/geography-and-regions
3. Microsoft Azure. Global Infrastructure. 2026. https://azure.microsoft.com/en-us/explore/global-infrastructure/
4. Oracle. Public Cloud Regions and Data Centers. 2026. https://www.oracle.com/cloud/architecture-and-regions.html
5. Dean, J. and Barroso, L.A. "The Tail at Scale." Communications of the ACM, 2013. https://research.google/pubs/the-tail-at-scale/
6. NVIDIA. GB200 NVL72. 2024-2026 product page. https://www.nvidia.com/en-us/data-center/gb200-nvl72/
7. NVIDIA Documentation. NVIDIA DGX SuperPOD Reference Architecture Featuring DGX GB200. 2025. https://docs.nvidia.com/dgx-superpod/reference-architecture-scalable-infrastructure-gb200/latest/dgx-superpod-architecture.html
8. National Renewable Energy Laboratory. Transmission Interconnection Roadmap. 2024. https://www.nrel.gov/grid/transmission-interconnection-roadmap
9. North American Electric Reliability Corporation. 2025 Long-Term Reliability Assessment. 2025. https://www.nerc.com/pa/RAPA/ra/Pages/default.aspx
10. International Energy Agency. Energy and AI. 2025. https://www.iea.org/reports/energy-and-ai
11. International Energy Agency. "Data centre electricity use surged in 2025, even with tightening bottlenecks driving a scramble for solutions." 2026. https://www.iea.org/news/data-centre-electricity-use-surged-in-2025-even-with-tightening-bottlenecks-driving-a-scramble-for-solutions
12. Lawrence Berkeley National Laboratory / U.S. Department of Energy. 2024 United States Data Center Energy Usage Report. 2024. https://buildings.lbl.gov/publications/2024-lbnl-data-center-energy-usage-report
13. Electric Power Research Institute. Powering Intelligence: Analyzing Artificial Intelligence and Data Center Energy Consumption. 2024. https://restservice.epri.com/publicdownload/000000003002028905/0/Product
14. Uptime Institute. Giant data center power plans reach extreme levels. 2026. https://intelligence.uptimeinstitute.com/sites/default/files/2026-01/UI%20Field%20report%20194_Giant%20data%20center%20power%20plans%20reach%20extreme%20levels.pdf
15. International Organization for Standardization. ISO/IEC 30134-2:2026, Data centres key performance indicators - Power usage effectiveness. 2026. https://www.iso.org/standard/30134-2
16. The Green Grid. Data Center Power Efficiency Metrics: PUE and DCiE. 2007. https://www.thegreengrid.org/en/resources/library-and-tools/20-Data-Center-Power-Efficiency-Metrics-PUE-and-DCiE
17. ASHRAE TC 9.9. Thermal Guidelines for Data Processing Environments, 5th edition reference card. 2021 / 2024. https://www.ashrae.org/file%20library/technical%20resources/bookstore/supplemental%20files/therm-gdlns-5th-r-e-refcard.pdf
18. Open Compute Project. Cooling Environments - Cold Plate. 2025. https://www.opencompute.org/wiki/Cooling_Environments/Cold_Plate
19. Microsoft Cloud Blog. "Sustainable by design: Next-generation datacenters consume zero water for cooling." 2024-12-09. https://www.microsoft.com/en-us/microsoft-cloud/blog/2024/12/09/sustainable-by-design-next-generation-datacenters-consume-zero-water-for-cooling/
20. Microsoft Local. "Understanding water use at Microsoft datacenters." 2026. https://local.microsoft.com/blog/understanding-water-use-at-microsoft-datacenters/
21. Google. Operating sustainably - Google Data Centers. 2025-2026. https://www.datacenters.google/operating-sustainably
22. Cisco. What Is Data Center Networking? 2025-2026. https://www.cisco.com/site/us/en/learn/topics/computing/what-is-data-center-networking.html
23. Clos, C. "A Study of Non-Blocking Switching Networks." Bell System Technical Journal, 1953. https://onlinelibrary.wiley.com/doi/10.1002/j.1538-7305.1953.tb01433.x
24. Google / SIGCOMM. Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network. 2015. https://research.google/pubs/jupiter-rising-a-decade-of-clos-topologies-and-centralized-control-in-googles-datacenter-network-2/
25. Meta Engineering. "Reinventing our data center network with F16, Minipack." 2019. https://engineering.fb.com/2019/03/14/data-center-engineering/f16-minipack/
26. InfiniBand Trade Association. InfiniBand - A low-latency, high-bandwidth interconnect. 2025-2026. https://www.infinibandta.org/about-infiniband/
27. Ultra Ethernet Consortium. "Ultra Ethernet Consortium launches Specification 1.0." 2025-06-11. https://ultraethernet.org/ultra-ethernet-consortium-uec-launches-specification-1-0-transforming-ethernet-for-ai-and-hpc-at-scale/
28. JEDEC / Business Wire. "JEDEC Publishes HBM3 Update to High Bandwidth Memory Standard." 2022-01-27. https://www.businesswire.com/news/home/20220127005320/en/JEDEC-Publishes-HBM3-Update-to-High-Bandwidth-Memory-HBM-Standard
29. SK hynix Newsroom. "SK hynix Begins Volume Production of Industry's First HBM3E." 2024-03-19. https://news.skhynix.com/sk-hynix-begins-volume-production-of-industry-first-hbm3e/
30. Micron. HBM3E. 2025-2026 product page. https://www.micron.com/products/memory/hbm/hbm3e
31. Samsung Semiconductor. High Bandwidth Memory. 2025-2026 product page. https://semiconductor.samsung.com/dram/hbm/
32. TSMC Investor Relations. TSMC 2025 Annual Report. 2026. https://investor.tsmc.com/sites/ir/annual-report/2025/2025%20Annual%20Report_E.pdf
33. NVIDIA. NVIDIA 2025 Annual Report. 2025. https://s201.q4cdn.com/141608511/files/doc_financials/2025/annual/NVIDIA-2025-Annual-Report.pdf
34. AMD. AMD Instinct MI300X Accelerators. 2024-2026 product page. https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html
35. Intel. Intel Gaudi 3 AI Accelerator. 2024. https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi.html
36. Broadcom Investor Relations. Broadcom Annual Reports. 2025. https://investors.broadcom.com/financial-information/annual-reports
37. NVM Express. NVMe Specifications. 2026. https://nvmexpress.org/developers/nvme-specification/
38. SNIA. SNIA Emerald Program. 2026. https://www.snia.org/forums/cmsi/programs/emerald
39. Google / EuroSys. Large-scale cluster management at Google with Borg. 2015. https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/
40. Kubernetes Documentation. Kubernetes Overview. 2026. https://kubernetes.io/docs/concepts/overview/
41. SchedMD. Slurm Workload Manager Overview. 2026. https://slurm.schedmd.com/overview.html
42. OpenTelemetry. What is OpenTelemetry? 2026. https://opentelemetry.io/docs/what-is-opentelemetry/
43. Microsoft Investor Relations. Microsoft 2025 Annual Report. 2025. https://www.microsoft.com/investor/reports/ar25/index.html
44. Alphabet Investor Relations. Alphabet 2025 Annual Report. 2026. https://s206.q4cdn.com/479360582/files/doc_financials/2025/q4/GOOG-10-K-2025.pdf
45. Amazon Investor Relations. Amazon annual reports, proxies and shareholder letters. 2026. https://ir.aboutamazon.com/annual-reports-proxies-and-shareholder-letters/default.aspx
46. Meta Investor Relations. Meta Reports Second Quarter 2025 Results. 2025. https://investor.fb.com/investor-news/press-release-details/2025/Meta-Reports-Second-Quarter-2025-Results/default.aspx
47. Dell Technologies. Dell AI Factory with NVIDIA. 2025-2026. https://www.dell.com/en-us/lp/dt/dell-ai-factory-with-nvidia
48. Hewlett Packard Enterprise. HPE Cray Supercomputing. 2025-2026. https://www.hpe.com/us/en/compute/hpc/supercomputing/cray.html
49. Supermicro. Supermicro NVIDIA GB200 NVL72. 2025-2026. https://www.supermicro.com/en/products/system/gpu/48u/srs-gb200-nvl72
50. Lenovo. Lenovo Neptune Liquid Cooling. 2025-2026. https://www.lenovo.com/us/en/servers-storage/solutions/neptune/
51. Schneider Electric. AI-ready data center solutions. 2025-2026. https://www.se.com/ww/en/work/solutions/for-business/data-centers-and-networks/ai-ready-data-center/
52. Vertiv Investor Relations. Vertiv 2025 Annual Report. 2026. https://s205.q4cdn.com/554782763/files/doc_financials/2025/ar/Vertiv-2025-Annual-Report.pdf
53. Eaton. Data centers. 2025-2026. https://www.eaton.com/us/en-us/markets/data-centers.html
54. ABB. Data centers. 2025-2026. https://new.abb.com/data-centers
55. Siemens. Data centers. 2025-2026. https://www.siemens.com/global/en/markets/data-centers.html
56. Cummins. Data Centers. 2025-2026. https://www.cummins.com/generators/data-centers
57. Caterpillar. Data Center Power Solutions. 2025-2026. https://www.cat.com/en_US/by-industry/electric-power/data-centers.html
58. Uptime Institute. 2025 Global Data Center Survey. 2025. https://intelligence.uptimeinstitute.com/resource/2025-global-data-center-survey-results-and-crosstabs
59. Uptime Institute. Annual Outage Analysis 2024. 2024. https://uptimeinstitute.com/resources/research-and-reports/annual-outage-analysis-2024
60. AWS Trust Center. Data Center - Our Controls. 2026. https://aws.amazon.com/trust-center/data-center/our-controls/
61. Microsoft Service Assurance. Datacenter physical access security. 2025. https://learn.microsoft.com/en-us/compliance/assurance/assurance-datacenter-physical-access-security
62. Google Cloud Security. "How Google protects the physical-to-logical space in a data center." 2025. https://cloud.google.com/docs/security/physical-to-logical-space
63. NIST. SP 800-53 Rev. 5, Security and Privacy Controls for Information Systems and Organizations. 2020, updates through 2025. https://csrc.nist.gov/Pubs/sp/800/53/r5/upd1/Final
64. International Organization for Standardization. ISO/IEC 27001:2022 Information security management systems. 2022. https://www.iso.org/standard/27001
65. FedRAMP. FedRAMP baselines. 2026. https://www.fedramp.gov/baselines/
66. Krizhevsky, A., Sutskever, I., and Hinton, G. "ImageNet Classification with Deep Convolutional Neural Networks." NeurIPS, 2012. https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
67. Vaswani, A. et al. "Attention Is All You Need." 2017. https://arxiv.org/abs/1706.03762
68. Brown, T. et al. "Language Models are Few-Shot Learners." 2020. https://arxiv.org/abs/2005.14165
69. Kaplan, J. et al. "Scaling Laws for Neural Language Models." 2020. https://arxiv.org/abs/2001.08361
70. Jouppi, N.P. et al. "In-Datacenter Performance Analysis of a Tensor Processing Unit." 2017. https://arxiv.org/abs/1704.04760
71. PCI-SIG. PCI Express 6.0 Specification. 2022. https://pcisig.com/pci-express-60-specification
72. CXL Consortium. Compute Express Link. 2026. https://www.computeexpresslink.org/