AI Data Centres: The Machine Beneath The Cloud
A plain-English technical guide to the physical system behind AI infrastructure: power, cooling, network fabrics, accelerators, suppliers, operations, staffing, security, and the limits that appear when cloud capacity becomes real machinery.
- Published: May 14, 2026
- Reading: 30 min
- Author: Christopher Lyon
- Filed: Research

You can know the cloud well and still have the wrong picture of a data centre.
The cloud makes infrastructure look like a menu. Pick a region. Pick a zone. Pick an instance. Attach storage. Start a job. Watch a model train.
Underneath that menu is a physical machine. It takes electricity from the grid, pushes it through switchgear and power shelves, turns most of it into heat inside silicon, removes that heat with air or liquid, moves data through fibres and switches, and keeps the whole thing working with operators, controls, spares, guards, procedures, and alarms.
AI has made that machine harder to hide. A web service can often survive with more ordinary servers, normal rack power, and a network built for mixed traffic. Large AI training and inference systems concentrate power, heat, memory bandwidth, network traffic, storage pressure, supplier risk, and capital cost in one place. At that point the data centre stops being background infrastructure. It becomes part of the product.
The short version:
An AI data centre is not a room full of GPUs. It is a factory for useful computation. The raw inputs are power, chips, memory, data, network bandwidth, cooling capacity, and human operating discipline. The output is trained models, tokens, embeddings, search results, recommendations, simulations, and cloud services.
If any input is weak, the output gets expensive or unreliable.
Abstract
This article explains how AI data centres work for readers who already understand AI systems and cloud services. It does not start with "what is a server?" It starts with the next layer down: what has to be built and operated before a GPU instance, model endpoint, training cluster, or cloud region can exist.
The main claim is simple. AI infrastructure is constrained by five linked systems:
- Power: how much electricity can reach the site, the building, the row, the rack, and the chip.
- Cooling: how quickly heat can leave the chip, rack, room, and campus.
- Data movement: how fast data, gradients, model weights, checkpoints, and requests can move without stalling expensive accelerators.
- Control: how the facility and compute fleet are monitored, scheduled, patched, drained, repaired, and recovered.
- Trust: how physical security, personnel controls, supply chain, audit evidence, and customer isolation are maintained.
The hard work is not buying the most powerful accelerator. The hard work is building a powered, cooled, networked, secure, observable, maintainable failure domain that can run the workload at the required cost and schedule.
The Useful Mental Model
Think of the data centre as a machine with four flows.
| Flow | Plain-English job | What can break |
|---|---|---|
| Electricity in | Bring power from the grid to chips without unsafe faults or unacceptable interruptions | Utility delay, transformer shortage, UPS fault, breaker trip, rack power limit |
| Heat out | Move heat away from chips before equipment throttles or fails | Airflow problem, pump fault, bad water chemistry, cooling-tower limit, leak |
| Data through | Move bits between users, storage, CPUs, GPUs, and other data centres | Congestion, bad optics, weak fabric, high tail latency, slow storage |
| Control over | Decide what runs, where, when to change it, when to stop it, and who may touch it | Bad scheduler policy, weak telemetry, poor change control, unclear authority, security gap |
Most bad explanations of data centres describe the equipment but miss the flows. The equipment is there to protect the flows.
A GPU rack is not valuable because it looks dense. It is valuable only if enough clean power reaches it, enough heat leaves it, enough data reaches it, the network lets it cooperate with other racks, the scheduler keeps it busy, and operators can repair it without causing a wider outage.
That is the frame for the rest of the article.
Cloud Words, Physical Meaning
Cloud providers turn facilities into abstractions. That is the point of cloud computing. The abstraction is useful, but it can hide the physical commitments behind it.
AWS says an Availability Zone contains one or more discrete data centres with redundant power, networking, and connectivity. Google Cloud, Azure, and Oracle use similar region and zone language because customers need to reason about latency, data residency, service availability, and regulatory placement.[1][2][3][4]
The translation looks like this:
| Cloud word | What exists underneath |
|---|---|
| Region | A commercial geography backed by campuses, fibre, utilities, operating teams, contracts, and legal commitments |
| Availability Zone | A failure-domain promise backed by separated buildings, power paths, network paths, and operating assumptions |
| GPU instance | A slice of accelerator hardware that had to be bought, shipped, mounted, cabled, powered, cooled, monitored, patched, and amortized |
| Capacity | A forecast that turned into land, power reservations, equipment orders, construction schedules, and operational risk |
| Latency | Geography plus fibre path plus switch hops plus congestion plus software fan-out |
| Reliability | Design, commissioning, monitoring, spares, maintenance, change control, incident response, and luck managed down |
| Sustainability | Energy source, PUE, water strategy, carbon intensity, utilization, hardware lifetime, and local grid impact |
The cloud interface is a promise. The data centre is how the promise is kept.
What Makes AI Different
AI did not invent the data centre. Banks, telecom networks, search engines, cloud providers, laboratories, and governments have run serious compute facilities for decades. The change is density and synchronization.
Earlier cloud growth was often scale-out web infrastructure: many servers, many services, much traffic, but not always one tightly coupled job requiring thousands of accelerators to behave like one machine. Large AI training changes that. The accelerators exchange data frequently. A slow link, a congested switch, a storage pause, or one failing worker can slow the job. Google's "tail at scale" point applies brutally: rare per-node delays become ordinary when a request or training step depends on many nodes.[5]
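To make that concrete, here is a back-of-the-envelope sketch (the per-worker slowness probability is an invented illustration, not a measurement): if each worker independently has a small chance of being slow in a given synchronous step, the chance that the step waits on at least one straggler rises quickly with cluster size.

```python
# Probability that a synchronous step is delayed by at least one slow worker,
# assuming independent per-worker slowness. Illustrative numbers only.
def p_step_delayed(p_slow_per_worker: float, num_workers: int) -> float:
    return 1.0 - (1.0 - p_slow_per_worker) ** num_workers

for n in (8, 256, 4096, 16384):
    print(f"{n:6d} workers, 0.1% slow each -> "
          f"{p_step_delayed(0.001, n):6.1%} of steps see a straggler")
```

At a few thousand workers, nearly every step pays the tail, which is why congestion control, failure detection, and straggler mitigation are cluster-level design work rather than tuning.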
Inference changes the problem again. It may need lower latency, higher availability, model-weight distribution, retrieval systems, safety systems, logging, and burst capacity close to users. A training cluster can sometimes run far from the customer. An inference service may need to sit inside a region where latency, sovereignty, and product reliability matter.
AI also changes the rack. The important unit used to be a server. Then it became a rack. With systems like NVIDIA GB200 NVL72 and DGX SuperPOD reference architectures, the unit starts to look like a rack-scale or pod-scale computer: CPUs, accelerators, NVLink or other scale-up interconnect, networking, storage, cooling, power, and management designed together.[6][7]
This is why the phrase "AI factory" is useful when it is used carefully. The facility is not just hosting computation. It is arranged to turn capital equipment and electricity into model outputs.
The Build Sequence
A serious data-centre project does not start by picking a favourite GPU. It starts with a workload and a constraint model.
The first questions are plain:
| Question | Why it matters |
|---|---|
| Is the site for training, inference, storage, general cloud, or a mix? | Each load shape stresses a different part of the system |
| How many megawatts are needed now and later? | Power is often the schedule gate |
| What rack density must the building support? | Cooling and electrical design follow rack density |
| What network behaviour does the workload require? | AI training punishes weak east-west bandwidth and poor congestion control |
| What data must stay local? | Residency, latency, and data gravity shape region choice |
| How long may a job or service be interrupted? | Redundancy, checkpointing, maintenance, and customer promises depend on it |
| Who will operate it? | A design that the team cannot maintain is not a design |
Then comes site selection. Cheap land is not enough. A workable AI site needs power, fibre, civil access, permits, water or a waterless cooling strategy, construction labour, political acceptance, security, and expansion room. Utility interconnection can decide the schedule before the first rack arrives. NREL's interconnection work and NERC's reliability assessments are useful reminders that new load is a power-system problem, not just a customer procurement problem.[8][9]
After the site comes the design basis. This is the document that says what the building is being designed to handle: IT load, rack density, redundancy, cooling medium, power topology, network architecture, physical security zones, maintainability, and commissioning tests. A bad design basis poisons the project. If the design assumes 20 kW racks and procurement later buys 120 kW liquid-cooled racks, everyone is now negotiating with physics.
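The 20 kW versus 120 kW mismatch is easy to put numbers on. A minimal sketch, with a hypothetical 10 MW hall (no real project figures implied):

```python
# Same IT power budget, very different hall layout when rack density changes.
# All numbers are hypothetical planning figures.
hall_it_load_mw = 10.0
designed_rack_kw = 20.0    # what the design basis assumed (air-cooled)
procured_rack_kw = 120.0   # what procurement actually bought (liquid-cooled)

racks_designed = hall_it_load_mw * 1000 / designed_rack_kw
racks_procured = hall_it_load_mw * 1000 / procured_rack_kw

print(f"Racks at 20 kW each:  {racks_designed:.0f}")
print(f"Racks at 120 kW each: {racks_procured:.0f}")
# Same megawatts, roughly a sixth of the racks, and a cooling, floor-loading,
# and busway design that assumed the other layout.
```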
Procurement follows. For AI, procurement is not a back-office function. It is part of the architecture:
| Supplier layer | What it decides |
|---|---|
| Accelerator vendor | Compute density, software ecosystem, rack shape, power draw, cooling path |
| Foundry and packaging | Whether the chips and HBM packages can be made in volume |
| Memory supplier | HBM capacity, bandwidth, yield, and delivery schedule |
| Server OEM or ODM | How chips become serviceable systems and racks |
| Network vendor | Fabric bandwidth, telemetry, congestion behaviour, optics, support model |
| Power vendor | Switchgear, UPS, busway, transformers, breakers, generators, protection |
| Cooling vendor | Chillers, dry coolers, CDUs, cold plates, pumps, valves, monitoring |
| Construction trades | Whether the design becomes tested capacity on a real schedule |
Commissioning is the first honest exam. Electrical teams test switchgear, protection, UPS behaviour, generator starts, transfer sequences, grounding, and load banks. Mechanical teams test pumps, valves, airflow, heat rejection, water chemistry, leak detection, and controls. IT teams test cabling, optics, firmware, network paths, storage throughput, scheduler integration, and workload burn-in.
Commissioning is not a ceremony. It is where the building is encouraged to fail while the consequences are still contained.
Power: The First Gate
Every watt that enters a data centre becomes heat. That one sentence explains why power and cooling are inseparable.
Power starts outside the fence. A utility must be able to serve the site. That can require transmission work, distribution upgrades, substations, transformers, protection studies, metering, contracts, and time. AI campuses can be large enough that a local utility, regulator, or community cannot treat them like ordinary commercial load. IEA, LBNL, EPRI, and Uptime Institute all frame AI data-centre growth as an energy-system issue, not merely an IT issue.[10][11][12][13][14]
Inside the site, power moves through a chain:
| Stage | Plain-English role |
|---|---|
| Utility interconnect | The site gets power from the grid |
| Substation | Voltage is stepped, switched, protected, and metered |
| Medium-voltage distribution | Large power blocks move around the campus |
| Transformers | Voltage is stepped down for buildings and equipment |
| Switchgear and breakers | Faults are isolated and maintenance becomes possible |
| UPS | Short interruptions are bridged and equipment gets time to ride through or shut down |
| Generators or alternate supply | Longer outages are covered if the design requires it |
| Busway and rack distribution | Power reaches rows, racks, and power shelves |
| Server power supplies | Electricity becomes usable DC power for chips, memory, fans, pumps, and controllers |
Redundancy means spare capacity in the right place. It does not mean safety by slogan. N+1 means one extra component beyond the needed number. 2N means two independent systems sized for the load. Distributed redundant designs spread the spare capacity differently. Each choice adds cost, complexity, test work, and failure modes.
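In module counts, the difference looks like this. A minimal sketch with a hypothetical 4 MW IT load and 1 MW UPS modules; real designs also weigh maintenance bypass, fault behaviour during maintenance, and cost:

```python
import math

# UPS module counts implied by common redundancy schemes (hypothetical sizing).
it_load_mw = 4.0
module_mw = 1.0

n = math.ceil(it_load_mw / module_mw)  # modules needed just to carry the load
print(f"N   = {n} modules")
print(f"N+1 = {n + 1} modules (one spare)")
print(f"2N  = {2 * n} modules (two independent {n}-module systems)")
```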
PUE is the common energy-efficiency ratio. It means total facility energy divided by IT equipment energy. If a data centre has a PUE of 1.2, then each 1 MW of IT equipment needs another 0.2 MW for cooling, power losses, pumps, lights, controls, and other overhead. ISO/IEC 30134-2 standardizes PUE, and The Green Grid helped popularize it.[15][16]
PUE is useful, but it can mislead. It does not tell you whether the GPUs are doing useful work. A low-PUE site with badly scheduled accelerators can still waste money. A slightly higher-PUE site in a cleaner grid or water-constrained region may be the better decision. For AI, the energy question is not only "how efficient is the building?" It is also "how much useful model work comes out per watt, per dollar, and per litre of water?"
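The arithmetic behind that PUE sentence, as a small sketch (illustrative loads only):

```python
# PUE = total facility energy / IT equipment energy.
# For a steady load, the same ratio applies to power draw.
it_load_mw = 1.0

for pue in (1.1, 1.2, 1.5):
    total_mw = it_load_mw * pue
    overhead_mw = total_mw - it_load_mw
    print(f"PUE {pue}: {total_mw:.2f} MW at the meter, "
          f"{overhead_mw:.2f} MW of cooling, conversion losses, and other overhead")
```

None of which says whether the IT megawatt produced useful tokens.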
Cooling: The Heat Has To Leave
A chip is a heater that happens to compute.
Air cooling works by moving cold air to the front of servers and hot air away from the back. Good air-cooled data halls manage hot aisles, cold aisles, blanking panels, floor tiles, containment, fan energy, pressure, humidity, filters, and equipment inlet temperature. ASHRAE TC 9.9 matters because IT equipment has a thermal operating envelope; the target is not human comfort.[17]
Dense AI racks make this harder. The rack may draw so much power that air alone becomes inefficient or impractical. Direct-to-chip liquid cooling puts a cold plate on the hot components, usually CPUs and accelerators. Liquid carries heat to a coolant distribution unit, then into a facility water loop, then to chillers, dry coolers, cooling towers, or another heat-rejection system. Open Compute Project cold-plate work exists because these liquid-cooled interfaces need common expectations.[18]
Plainly:
| Cooling part | What it does |
|---|---|
| Cold plate | Touches the hot chip package and collects heat |
| Rack manifold | Distributes coolant to many cold plates |
| CDU | Controls flow, pressure, temperature, and separation between technology coolant and facility water |
| Facility loop | Moves heat away from the room |
| Chiller, dry cooler, or tower | Rejects heat outside the building |
| Sensors and controls | Detect temperature, pressure, leaks, flow, and abnormal states |
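To make the CDU's job in the table above concrete, here is a rough flow-sizing sketch using the basic heat-transport relation Q = ṁ · c_p · ΔT. The rack power and allowed temperature rise are hypothetical; real loops add margin, coolant mixtures, pressure limits, and vendor-specified flow rates:

```python
# Rough coolant flow needed to carry a rack's heat away.
# Q = m_dot * c_p * delta_T  ->  m_dot = Q / (c_p * delta_T)
rack_heat_w = 120_000.0   # hypothetical 120 kW rack; nearly every watt becomes heat
cp_water = 4186.0         # J/(kg*K), specific heat of water
delta_t_k = 10.0          # allowed coolant temperature rise across the rack

mass_flow_kg_s = rack_heat_w / (cp_water * delta_t_k)
litres_per_min = mass_flow_kg_s * 60.0       # ~1 litre per kg for water

print(f"~{mass_flow_kg_s:.1f} kg/s, roughly {litres_per_min:.0f} litres per minute for one rack")
```

Multiply by a row of racks and the pumps, pipe diameters, and CDU capacity stop being details.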
Liquid cooling does not remove the operations problem. It changes it. The team now needs leak procedures, coolant chemistry, pump maintenance, valve checks, quick-disconnect discipline, spare parts, and technicians who can work near expensive powered equipment.
Water is the uncomfortable tradeoff. Evaporative cooling can be energy efficient, but it consumes water. Dry or closed-loop designs can reduce water use, but may use more electricity depending on climate and design. Microsoft has said its next-generation zero-water cooling designs reduce water consumed for cooling but cause a nominal increase in annual energy use compared with evaporative systems. Google frames cooling choices as a local balance between energy, water, carbon-free supply, climate, and workload.[19][20][21]
There is no universal best cooling system. There is only a cooling system that fits the rack density, climate, water politics, energy source, operating team, and failure tolerance.
Networking: North-South, East-West, And The Fabric
Two terms are worth getting right.
North-south traffic enters or leaves the data centre or cluster. User requests, API calls, internet traffic, and region-to-region traffic often fit here.
East-west traffic moves inside the data centre. Service-to-service calls, storage reads, model-shard communication, gradient exchange, checkpoint traffic, and management traffic fit here.[22]
Traditional web serving cares a lot about north-south traffic because users are outside the facility. AI training cares heavily about east-west traffic because accelerators must cooperate inside the cluster.
Modern data-centre networks use Clos or spine-leaf ideas. A leaf switch connects servers. Spine switches connect leaves. The goal is predictable paths between racks without one central choke point. Clos switching theory is old, but hyperscalers turned it into modern data-centre fabrics. Google's Jupiter and Meta's F16 work show how much engineering goes into making these fabrics cheap, fast, observable, and controllable at scale.[23][24][25]
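The port arithmetic behind a non-blocking two-tier spine-leaf fabric is short enough to sketch. The switch radix below is hypothetical; real fabrics also choose oversubscription ratios, rail layouts, and failure domains:

```python
# Port arithmetic for a non-blocking two-tier spine-leaf fabric built from
# identical switches. Radix is hypothetical.
radix = 64                   # ports per switch

leaf_down = radix // 2       # server-facing ports per leaf
leaf_up = radix // 2         # spine-facing ports per leaf (1:1, non-blocking)
num_spines = leaf_up         # each leaf has one uplink to every spine
num_leaves = radix           # each spine port serves a different leaf
max_host_ports = num_leaves * leaf_down

print(f"{num_spines} spines, {num_leaves} leaves, up to {max_host_ports} host ports")
```

Bigger clusters add a third tier or larger-radix switches; the same arithmetic, repeated.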
AI adds a back-end network problem. A training job may need thousands of accelerators to exchange data at the same time. If one path is slow, the job can wait. The network needs high bandwidth, low latency, good congestion control, fast failure detection, and useful telemetry.
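A rough feel for why per-accelerator bandwidth matters: in a textbook flat ring all-reduce, each worker sends and receives about 2 × (N−1)/N times the gradient size every synchronous step. The model size, precision, and link speed below are hypothetical, and real jobs hide much of this behind compute with overlapped and hierarchical reductions; the point is the order of magnitude:

```python
# Textbook ring all-reduce traffic and time per step (illustrative numbers).
params = 70e9            # hypothetical parameter count
bytes_per_grad = 2       # gradients exchanged in 16-bit precision
workers = 1024
link_gbps = 400          # per-accelerator network bandwidth, hypothetical

grad_bytes = params * bytes_per_grad
bytes_per_worker = 2 * (workers - 1) / workers * grad_bytes
seconds = bytes_per_worker / (link_gbps * 1e9 / 8)

print(f"~{bytes_per_worker / 1e9:.0f} GB moved per worker per step, "
      f"~{seconds:.1f} s of pure communication at {link_gbps} Gb/s")
```

Halve the effective bandwidth with congestion or a sick optic and the exposed communication time roughly doubles, for every step, for every worker.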
This is where InfiniBand, Ethernet/RoCE, and newer AI-focused Ethernet work appear.
| Fabric choice | Why teams choose it | What they inherit |
|---|---|---|
| InfiniBand | Mature low-latency HPC and AI training fabric | More specialized operations and supplier concentration |
| Ethernet with RoCE | Huge ecosystem, cloud familiarity, supplier diversity | Careful congestion tuning and operational discipline |
| Proprietary scale-up links | Very high bandwidth inside a rack or pod | Platform lock-in and shorter reach |
InfiniBand remains important in AI and HPC. Ethernet is being adapted for AI and HPC at scale, including through the Ultra Ethernet Consortium.[26][27]
The physical layer matters too. Optics, copper, fibre trunks, patch panels, cable trays, labels, bend radius, cleaning, transceivers, and switch thermals are not accessories. A dirty connector can become a training delay. A late optics shipment can become a cluster delay. A hot switch can become a network incident.
Compute: The Chip Is Only One Layer
The accelerator gets the attention because it is expensive and visible. It is still only one part of the compute path.
An AI server or rack brings together:
| Component | Why it matters |
|---|---|
| Accelerator | Performs the dense numerical work |
| HBM | Feeds the accelerator with very high memory bandwidth |
| CPU | Handles orchestration, preprocessing, host work, and parts of the application |
| NIC or DPU | Moves data onto the network and may offload work from the CPU |
| Local storage | Handles cache, scratch, logs, or fast staging |
| Firmware and drivers | Decide whether the hardware behaves consistently |
| Libraries | Make the hardware usable by model code |
| Rack interconnect | Lets accelerators behave like a larger system |
HBM is worth calling out. It is stacked memory placed close to the accelerator package. It gives much higher bandwidth than ordinary server memory, but it also ties AI performance to advanced packaging, memory yield, thermal design, and supplier capacity. JEDEC standards, SK hynix production announcements, Micron product material, Samsung HBM material, and TSMC annual reporting all point to the same fact: AI infrastructure depends on memory and packaging, not just logic chips.[28][29][30][31][32]
This is why supplier conversations quickly become architectural conversations. NVIDIA, AMD, Intel, Broadcom, TSMC, memory suppliers, server OEMs, network vendors, and power/cooling suppliers are all part of the same machine.[33][34][35][36]
The practical rule is simple: an accelerator is idle whenever the rest of the system cannot feed it, cool it, schedule it, or recover it.
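One way to see the "feed it" half of that rule: in simple low-batch LLM decoding, generating a token requires streaming the model weights out of memory, so HBM bandwidth caps tokens per second before arithmetic does. A hedged sketch with hypothetical figures:

```python
# Memory-bandwidth ceiling on single-stream decode speed.
# tokens/s <= HBM bandwidth / bytes of weights read per token.
params = 70e9            # hypothetical parameter count
bytes_per_param = 2      # 16-bit weights
hbm_bytes_per_s = 8e12   # hypothetical accelerator HBM bandwidth (8 TB/s)

weight_bytes = params * bytes_per_param
ceiling_tokens_per_s = hbm_bytes_per_s / weight_bytes

print(f"<= {ceiling_tokens_per_s:.0f} tokens/s per stream before batching, "
      f"KV-cache traffic, or network limits are even considered")
```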
Storage: The Slow Part Can Be Somewhere Else
AI storage is easy to underestimate because cloud object storage feels bottomless. Inside the machine, storage is hardware, network paths, filesystems, metadata services, rebuild behaviour, durability policy, capacity planning, and power draw.
Training needs datasets, transformed shards, checkpoints, logs, model weights, evaluation outputs, and experiment records. Inference needs model weights, cache, retrieval indexes, logs, safety records, and sometimes vector databases. Governance needs lineage, retention, deletion, access control, and audit evidence.
A training path may look like this:
- Raw data lands in object storage or a data lake.
- Preprocessing turns it into training-ready shards.
- High-throughput storage feeds the cluster.
- Accelerators consume batches.
- Checkpoints write back on a schedule.
- Failed jobs restart from checkpoints.
- Model artifacts move to evaluation, fine-tuning, serving, or archive.
Each stage can slow the job. NVMe, NVMe over Fabrics, parallel filesystems, burst buffers, object stores, metadata services, and storage efficiency work all matter because GPUs are too expensive to wait politely for data.[37][38]
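Checkpoints are a good place to do the arithmetic before the cluster exists. A minimal sketch; the model size, optimizer-state overhead, and storage bandwidths are hypothetical:

```python
# Checkpoint size and write time for a large training job (illustrative only).
params = 70e9              # hypothetical parameter count
bytes_per_param = 12       # weights + optimizer state, rough mixed-precision assumption

checkpoint_bytes = params * bytes_per_param
print(f"Checkpoint size: ~{checkpoint_bytes / 1e12:.1f} TB")

for write_gb_s in (5, 50):  # aggregate storage write bandwidth
    seconds = checkpoint_bytes / (write_gb_s * 1e9)
    print(f"At {write_gb_s} GB/s aggregate write bandwidth: ~{seconds:.0f} s per checkpoint")
```

If a checkpoint stalls the job, that cost repeats every interval; if recovery replays everything since the last one, the interval itself becomes lost accelerator time after a failure.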
Storage is not the glamorous part of an AI data centre. It is one of the easiest places to waste the glamorous part.
Control: The Building And The Fleet Both Need Operators
There are two control systems in an AI data centre.
The facility control system watches the building: power, switchgear, UPS, generators, pumps, valves, chillers, CDUs, cooling towers, leak sensors, fire systems, cameras, doors, and environmental sensors.
The compute control system watches the fleet: servers, accelerators, NICs, storage, images, firmware, drivers, schedulers, jobs, quotas, logs, metrics, traces, alerts, and customer services.
In a low-density facility those worlds can be loosely connected. In an AI facility they need to talk. A scheduler may need to avoid racks under thermal constraint. A liquid-cooling alarm may need a workload drain. A firmware update may change power behaviour. A network maintenance window may kill a training job that assumed the fabric would stay stable for days.
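What "the facility and the fleet need to talk" can look like in scheduler terms is simple to sketch: filter racks out of placement when their facility telemetry is abnormal. Everything here is invented for illustration; a real site would feed this from its BMS/DCIM and use the actual scheduler's drain, cordon, or taint mechanism:

```python
from dataclasses import dataclass

@dataclass
class RackTelemetry:
    rack_id: str
    coolant_supply_c: float   # CDU supply temperature to the rack
    leak_detected: bool
    power_headroom_kw: float  # remaining budget on the rack's feed

def schedulable(rack: RackTelemetry, job_power_kw: float) -> bool:
    """Hypothetical placement filter driven by facility telemetry."""
    if rack.leak_detected:
        return False                   # drain, dispatch a technician, place nothing
    if rack.coolant_supply_c > 32.0:   # invented thermal threshold
        return False                   # avoid the rack while the loop is investigated
    return rack.power_headroom_kw >= job_power_kw

racks = [
    RackTelemetry("r01", 27.5, False, 30.0),
    RackTelemetry("r02", 33.1, False, 90.0),   # too warm
    RackTelemetry("r03", 28.0, True, 120.0),   # leak alarm
]
print([r.rack_id for r in racks if schedulable(r, job_power_kw=25.0)])  # -> ['r01']
```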
Google's Borg paper is still useful because it shows the data centre as a managed resource pool. Kubernetes is the common cloud-native control-plane model. Slurm remains common in HPC and AI clusters. OpenTelemetry gives a modern observability vocabulary for traces, metrics, and logs.[39][40][41][42]
The question is not only which scheduler to use. The question is who has authority.
Who may reserve a thousand GPUs? Who may preempt a job? Who may drain a rack? Who may roll firmware? Who may ignore utilization to protect reliability? Who owns the incident when a facility alarm and a model-training failure are the same event?
If that authority model is vague, the cluster will be run by escalation.
Hyperscalers
Hyperscalers are not just large customers. They are infrastructure manufacturers.
Amazon, Microsoft, Google, Meta, and Oracle have enough scale to shape server designs, accelerator plans, switch fabrics, data-centre locations, cooling choices, energy contracts, and internal operating systems. Their advantage is not only money. It is repetition. They build enough sites to learn, standardize, and push suppliers.
Their public filings and investor updates show the capital weight of AI infrastructure. Microsoft, Alphabet, Amazon, and Meta all frame cloud and AI infrastructure as a major investment area.[43][44][45][46]
Hyperscalers also hide a hard truth from customers: every simple cloud SKU is backed by a complex capacity bet. A GPU instance offered in a region means someone has already made decisions about land, power, cooling, hardware supply, network topology, customer demand, depreciation, and failure risk.
The cloud customer buys abstraction. The hyperscaler sells abstraction by owning more of the mess.
Suppliers
The supplier base is wider than the AI conversation usually admits.
| Supplier group | What they control |
|---|---|
| NVIDIA, AMD, Intel, custom ASIC teams | Accelerator roadmaps, software ecosystems, rack design pressure |
| Broadcom, Arista, Cisco, NVIDIA Networking, others | Switching silicon, NICs, fabrics, telemetry, congestion behaviour |
| TSMC and advanced packaging suppliers | Whether the most advanced silicon and package designs can ship |
| SK hynix, Micron, Samsung | HBM supply, yield, bandwidth, capacity |
| Dell, HPE, Supermicro, Lenovo, ODMs | Server integration, rack serviceability, firmware, spares |
| Schneider Electric, Vertiv, Eaton, ABB, Siemens | Electrical infrastructure, cooling infrastructure, controls, monitoring |
| Cummins, Caterpillar, fuel and backup suppliers | Backup generation, fuel logistics, emergency power |
| Construction and commissioning firms | Whether drawings become usable capacity |
NVIDIA is central now because it sells more than GPUs: systems, networking, interconnects, software, reference architectures, and management tooling. Dell, HPE, Supermicro, Lenovo, and others turn platforms into products that can be ordered, installed, serviced, and supported. Schneider Electric, Vertiv, Eaton, ABB, Siemens, Cummins, and Caterpillar show why power and cooling vendors are now AI infrastructure vendors too.[47][48][49][50][51][52][53][54][55][56][57]
The decision lesson: do not treat the GPU purchase order as the project. The project is the whole supplier chain arriving in the right order.
Operations: The Part That Keeps Happening
Construction ends. Operations does not.
A running AI data centre has several daily loops:
| Operating loop | What it does |
|---|---|
| Facilities operations | Watches power, cooling, fire, water, leaks, fuel, alarms, and maintenance state |
| IT operations | Watches servers, accelerators, storage, firmware, operating systems, and workloads |
| Network operations | Watches switch health, optics, congestion, routing, maintenance, and capacity |
| Security operations | Watches access, cameras, badges, deliveries, contractors, rack doors, media, and incidents |
| Capacity operations | Forecasts demand, manages quotas, reserves clusters, and plans expansion |
| Maintenance | Replaces parts, tests generators, services UPS systems, checks cooling loops, and calibrates sensors |
| Change management | Controls what changes, when it changes, who approves, and how to roll back |
| Incident management | Detects, triages, communicates, mitigates, learns, and updates procedures |
The procedures have unromantic names: MOPs, SOPs, EOPs, permits to work, lockout/tagout, rounds, shift handovers, spares, runbooks, maintenance windows, post-incident reviews, access logs, and change freezes. They exist because a small mistake can affect many customers or many millions of dollars of equipment.
Uptime Institute's outage and survey work keeps returning to the same general lesson: failures are not only equipment failures. They are also process, staffing, maintenance, power, networking, and change-control failures.[58][59]
AI adds pressure because utilization is financially painful. Idle accelerators look like waste. But chasing utilization too aggressively can create fragile operations: overloaded cooling zones, rushed updates, weak drain procedures, ignored alarms, and maintenance debt. The operating model needs a rule for when to protect the fleet instead of chasing one more percentage point of use.
Staffing
AI data centres are marketed through software, but they are run by people who understand electrical rooms, cooling loops, controls, racks, fibres, safety, and security.
A serious site needs electrical engineers, mechanical engineers, controls engineers, network engineers, hardware technicians, cluster operators, security staff, safety staff, supply-chain planners, and commissioning specialists.
The awkward part is overlap. A technician working on a liquid-cooled accelerator rack needs IT hardware skill, mechanical awareness, safety training, vendor procedure, and the judgement to stop when a leak or alarm looks wrong. A cluster operator needs to understand that a facility constraint can be a scheduler constraint. A security officer needs to know whether a contractor's badge request matches a real work order and a permitted area.
Staffing is therefore a scaling limit. You can order hardware faster than you can train people to run it well.
Security
Cloud security begins below the cloud.
AWS, Microsoft, and Google publish physical security controls because customers need to trust the provider's physical-to-logical chain. Those controls include perimeter security, guards, cameras, badges, multi-factor access, cages, locked racks, visitor controls, equipment movement controls, media handling, incident response, and audit logs.[60][61][62]
AI raises the stakes:
| Security issue | Why it matters |
|---|---|
| Model weights | A small number of files can represent enormous training cost and strategic value |
| Accelerator scarcity | Hardware theft, diversion, or tampering has direct business impact |
| Firmware and supply chain | NICs, baseboard management controllers, drives, accelerators, optics, and firmware expand the attack surface |
| Remote hands | Contractors may touch high-value systems during urgent work |
| Multi-tenancy | Customers need isolation across expensive shared hardware |
| Physical-to-logical events | A badge event, rack-door alarm, drive movement, or camera alert may matter to cyber investigation |
| Sovereignty | Some customers care exactly where data, staff, and hardware operations sit |
NIST SP 800-53 is useful because it refuses to separate cyber security from physical, personnel, maintenance, contingency, incident, and supply-chain controls. ISO/IEC 27001 frames security as a management system. FedRAMP shows how cloud customers inherit provider controls rather than operating every physical control themselves.[63][64][65]
A data-centre security failure does not have to be dramatic. It can be an unrevoked badge, a contractor in the wrong room, a mislabeled drive, an unmanaged maintenance laptop, a bad firmware chain, or a rack-door alarm no one correlated with a logical event.
Scaling Failures
AI data-centre plans usually fail at the joins between disciplines.
| Failure mode | What it really means |
|---|---|
| "We have GPUs but cannot use them" | The network, storage, scheduler, power, cooling, or software stack is limiting useful work |
| "The building is ready but not live" | Utility power, commissioning, permits, fibre, switchgear, or operating readiness is late |
| "The cluster is hot" | Rack density, containment, flow, workload placement, or liquid loop behaviour is wrong |
| "The network is flaky" | Optics, congestion control, firmware, cabling, routing, or job traffic is outside the expected envelope |
| "Storage is slow" | Data layout, metadata, checkpointing, rebuilds, or network paths are starving accelerators |
| "Utilization is poor" | Resources exist, but not in the shape jobs need, or the scheduler cannot pack them cleanly |
| "Maintenance causes incidents" | Procedures, authority, drain logic, spares, or rollback paths are weak |
| "Security slows everything down" | Access, audit, contractor, and equipment movement processes were added after the operating model |
| "The community is angry" | Power, water, tax, noise, land use, or emissions questions were treated as externalities |
The common cause is simple: the team optimized one layer and assumed the others would comply.
They do not comply. They negotiate.
The Century Behind The Machine
AI data centres feel new because the demand shock is new. The ingredients have a long history.
| Period | Breakthrough | What it changed |
|---|---|---|
| 1900s-1930s | Electrified industry and tabulating machines | Computation became tied to rooms, power, operators, and business process |
| 1940s | Electronic computers | Heat, reliability, maintenance, and power became computing problems |
| 1950s | Transistors and switching theory | Smaller electronics and scalable network ideas entered the story |
| 1960s | Integrated circuits, mainframes, time sharing | Central compute became shared institutional infrastructure |
| 1970s | DRAM, Ethernet, minicomputers | Memory density and local networking moved toward clustered systems |
| 1980s | TCP/IP, fibre, client-server computing | Networked computing became normal |
| 1990s | Web scale and commodity servers | Software had to assume hardware failure |
| 2000s | Warehouse-scale computing and virtualization | The data centre became the computer |
| 2010s | Deep learning, GPUs, TPUs, Kubernetes, NVMe, Clos fabrics | Accelerated cloud infrastructure became mainstream |
| 2020s | Transformers, HBM, chiplets, advanced packaging, liquid cooling, rack-scale AI systems | Power, cooling, memory, network, and supplier capacity became first-order AI constraints |
The modern AI branch runs through GPU-accelerated deep learning, transformers, large language-model scaling, in-data-centre accelerators, high-bandwidth memory, and rack-scale systems.[66][67][68][69][70][28][71][72]
The pattern repeats. More useful compute exposes the next bottleneck. Faster chips expose heat. Bigger models expose network and memory. Better cloud products create more demand. More demand exposes grid limits. The data centre absorbs the consequence.
Useful Terms
| Term | Plain meaning |
|---|---|
| AI factory | A data centre arranged to produce AI outputs, not just generic compute |
| Availability Zone | A cloud failure-domain promise backed by physical facilities and networks |
| Back-end network | The cluster network used for accelerator, storage, and training traffic |
| BMS | Building management system |
| Busway | Electrical distribution that carries power along rows or overhead paths |
| CDU | Coolant distribution unit for liquid-cooled systems |
| Clos fabric | A scalable multi-stage switching design used in data-centre networks |
| DCIM | Data-centre infrastructure management tooling |
| Direct-to-chip cooling | Liquid cooling that removes heat through cold plates on hot components |
| East-west traffic | Traffic between systems inside the data centre |
| EPMS | Electrical power monitoring system |
| Front-end network | Network used for service, user, storage, or management traffic depending on context |
| HBM | High bandwidth memory placed close to accelerators |
| Hot aisle/cold aisle | Airflow layout separating server intake and exhaust |
| InfiniBand | Low-latency fabric used in HPC and many AI clusters |
| IT load | Power used by servers, storage, and network equipment |
| MOP | Method of procedure for controlled maintenance work |
| North-south traffic | Traffic entering or leaving the data centre or cluster boundary |
| N+1 | Enough capacity plus one spare component |
| PDU | Power distribution unit |
| PUE | Facility energy divided by IT equipment energy |
| RDMA | Network data movement with low CPU involvement |
| RoCE | RDMA over Converged Ethernet |
| Spine-leaf | Common two-layer data-centre network fabric |
| Straggler | Slow worker, path, or component that delays the whole distributed job |
| UPS | Uninterruptible power supply |
| White space | The data hall area where racks live |
| WUE | Water-use metric for data-centre operation |
What To Inspect Before Building Or Buying
The first useful artifact is not a hardware shopping list. It is a constraint map.
Ask:
| Gate | Question |
|---|---|
| Workload | What mix of training, fine-tuning, batch inference, online inference, retrieval, and storage must run? |
| Power | What power is actually available, by date, and with what expansion path? |
| Cooling | What rack densities must be supported over the next five years? |
| Network | What east-west bandwidth, latency, and congestion behaviour do jobs need? |
| Storage | What feed rate, checkpoint rate, metadata rate, and recovery time are required? |
| Scheduler | How will scarce accelerators be allocated, reserved, preempted, and charged back? |
| Operations | Who owns maintenance, changes, incidents, capacity, and security? |
| Supplier risk | Which components have long lead times or single-supplier exposure? |
| Security | Which physical, logical, personnel, supply-chain, and compliance controls must be inherited or operated? |
| Exit | What would make the build worse than cloud, colocation, or a managed AI cluster? |
The last question is important. Most organizations should not build a serious AI data centre. They should rent, colocate, reserve cloud capacity, or buy a managed cluster until the workload is stable enough and large enough to justify owning the risk.
The builder owns power risk, construction risk, cooling risk, supplier risk, staffing risk, utilization risk, security risk, and technology-refresh risk. The cloud buyer pays a premium to avoid much of that risk. Neither path is automatically superior.
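One way to keep the constraint map honest is to write it down as data and let a script name the binding gate. A minimal sketch; every gate, requirement, and committed value below is hypothetical:

```python
# A constraint map as data: required vs committed for each gate.
# The binding constraint is the gate with the largest relative shortfall.
constraints = {
    "power_mw_available_by_need_date": {"required": 40.0, "committed": 24.0},
    "rack_density_kw_supported":       {"required": 120.0, "committed": 80.0},
    "east_west_tbps_per_pod":          {"required": 51.2, "committed": 51.2},
    "storage_feed_gb_s":               {"required": 400.0, "committed": 250.0},
    "trained_operators_on_shift":      {"required": 12.0, "committed": 6.0},
}

def shortfall(gate: dict) -> float:
    return (gate["required"] - gate["committed"]) / gate["required"]

for name, gate in sorted(constraints.items(), key=lambda kv: -shortfall(kv[1])):
    print(f"{name:34s} shortfall {shortfall(gate):5.0%}")

binding = max(constraints, key=lambda name: shortfall(constraints[name]))
print(f"\nBinding constraint: {binding}")
```

If the binding gate turns out to be people or power rather than GPUs, the constraint map has already answered the build-versus-buy question for the near term.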
Honest Summary
The cloud is not fake. It is an interface over real machinery.
AI makes the machinery visible because it concentrates demand. The data centre has to deliver power, remove heat, move data, schedule work, protect assets, and recover from failure while the equipment gets denser and more expensive.
The serious question is not "can we get GPUs?"
It is:
Can we get enough power to the right place, remove the heat, feed the accelerators, keep the network stable, operate the fleet safely, and prove to customers that the system is secure?
If the answer is no, the correct decision may be to stay out of the data-centre business. If the answer is yes, the work starts with megawatts, cooling loops, network topology, controls, people, and failure drills.
The model comes later.
References
1. Amazon Web Services Documentation. AWS Regions and Availability Zones. 2026. https://docs.aws.amazon.com/global-infrastructure/latest/regions/aws-regions-availability-zones.html
2. Google Cloud Documentation. Geography and regions. 2026. https://cloud.google.com/docs/geography-and-regions
3. Microsoft Azure. Global Infrastructure. 2026. https://azure.microsoft.com/en-us/explore/global-infrastructure/
4. Oracle. Public Cloud Regions and Data Centers. 2026. https://www.oracle.com/cloud/architecture-and-regions.html
5. Dean, J. and Barroso, L.A. "The Tail at Scale." Communications of the ACM, 2013. https://research.google/pubs/the-tail-at-scale/
6. NVIDIA. GB200 NVL72. 2024-2026 product page. https://www.nvidia.com/en-us/data-center/gb200-nvl72/
7. NVIDIA Documentation. NVIDIA DGX SuperPOD Reference Architecture Featuring DGX GB200. 2025. https://docs.nvidia.com/dgx-superpod/reference-architecture-scalable-infrastructure-gb200/latest/dgx-superpod-architecture.html
8. National Renewable Energy Laboratory. Transmission Interconnection Roadmap. 2024. https://www.nrel.gov/grid/transmission-interconnection-roadmap
9. North American Electric Reliability Corporation. 2025 Long-Term Reliability Assessment. 2025. https://www.nerc.com/pa/RAPA/ra/Pages/default.aspx
10. International Energy Agency. Energy and AI. 2025. https://www.iea.org/reports/energy-and-ai
11. International Energy Agency. "Data centre electricity use surged in 2025, even with tightening bottlenecks driving a scramble for solutions." 2026. https://www.iea.org/news/data-centre-electricity-use-surged-in-2025-even-with-tightening-bottlenecks-driving-a-scramble-for-solutions
12. Lawrence Berkeley National Laboratory / U.S. Department of Energy. 2024 United States Data Center Energy Usage Report. 2024. https://buildings.lbl.gov/publications/2024-lbnl-data-center-energy-usage-report
13. Electric Power Research Institute. Powering Intelligence: Analyzing Artificial Intelligence and Data Center Energy Consumption. 2024. https://restservice.epri.com/publicdownload/000000003002028905/0/Product
14. Uptime Institute. Giant data center power plans reach extreme levels. 2026. https://intelligence.uptimeinstitute.com/sites/default/files/2026-01/UI%20Field%20report%20194_Giant%20data%20center%20power%20plans%20reach%20extreme%20levels.pdf
15. International Organization for Standardization. ISO/IEC 30134-2:2026, Data centres key performance indicators - Power usage effectiveness. 2026. https://www.iso.org/standard/30134-2
16. The Green Grid. Data Center Power Efficiency Metrics: PUE and DCiE. 2007. https://www.thegreengrid.org/en/resources/library-and-tools/20-Data-Center-Power-Efficiency-Metrics-PUE-and-DCiE
17. ASHRAE TC 9.9. Thermal Guidelines for Data Processing Environments, 5th edition reference card. 2021 / 2024. https://www.ashrae.org/file%20library/technical%20resources/bookstore/supplemental%20files/therm-gdlns-5th-r-e-refcard.pdf
18. Open Compute Project. Cooling Environments - Cold Plate. 2025. https://www.opencompute.org/wiki/Cooling_Environments/Cold_Plate
19. Microsoft Cloud Blog. "Sustainable by design: Next-generation datacenters consume zero water for cooling." 2024-12-09. https://www.microsoft.com/en-us/microsoft-cloud/blog/2024/12/09/sustainable-by-design-next-generation-datacenters-consume-zero-water-for-cooling/
20. Microsoft Local. "Understanding water use at Microsoft datacenters." 2026. https://local.microsoft.com/blog/understanding-water-use-at-microsoft-datacenters/
21. Google. Operating sustainably - Google Data Centers. 2025-2026. https://www.datacenters.google/operating-sustainably
22. Cisco. What Is Data Center Networking? 2025-2026. https://www.cisco.com/site/us/en/learn/topics/computing/what-is-data-center-networking.html
23. Clos, C. "A Study of Non-Blocking Switching Networks." Bell System Technical Journal, 1953. https://onlinelibrary.wiley.com/doi/10.1002/j.1538-7305.1953.tb01433.x
24. Google / SIGCOMM. Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network. 2015. https://research.google/pubs/jupiter-rising-a-decade-of-clos-topologies-and-centralized-control-in-googles-datacenter-network-2/
25. Meta Engineering. "Reinventing our data center network with F16, Minipack." 2019. https://engineering.fb.com/2019/03/14/data-center-engineering/f16-minipack/
26. InfiniBand Trade Association. InfiniBand - A low-latency, high-bandwidth interconnect. 2025-2026. https://www.infinibandta.org/about-infiniband/
27. Ultra Ethernet Consortium. "Ultra Ethernet Consortium launches Specification 1.0." 2025-06-11. https://ultraethernet.org/ultra-ethernet-consortium-uec-launches-specification-1-0-transforming-ethernet-for-ai-and-hpc-at-scale/
28. JEDEC / Business Wire. "JEDEC Publishes HBM3 Update to High Bandwidth Memory Standard." 2022-01-27. https://www.businesswire.com/news/home/20220127005320/en/JEDEC-Publishes-HBM3-Update-to-High-Bandwidth-Memory-HBM-Standard
29. SK hynix Newsroom. "SK hynix Begins Volume Production of Industry's First HBM3E." 2024-03-19. https://news.skhynix.com/sk-hynix-begins-volume-production-of-industry-first-hbm3e/
30. Micron. HBM3E. 2025-2026 product page. https://www.micron.com/products/memory/hbm/hbm3e
31. Samsung Semiconductor. High Bandwidth Memory. 2025-2026 product page. https://semiconductor.samsung.com/dram/hbm/
32. TSMC Investor Relations. TSMC 2025 Annual Report. 2026. https://investor.tsmc.com/sites/ir/annual-report/2025/2025%20Annual%20Report_E.pdf
33. NVIDIA. NVIDIA 2025 Annual Report. 2025. https://s201.q4cdn.com/141608511/files/doc_financials/2025/annual/NVIDIA-2025-Annual-Report.pdf
34. AMD. AMD Instinct MI300X Accelerators. 2024-2026 product page. https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html
35. Intel. Intel Gaudi 3 AI Accelerator. 2024. https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi.html
36. Broadcom Investor Relations. Broadcom Annual Reports. 2025. https://investors.broadcom.com/financial-information/annual-reports
37. NVM Express. NVMe Specifications. 2026. https://nvmexpress.org/developers/nvme-specification/
38. SNIA. SNIA Emerald Program. 2026. https://www.snia.org/forums/cmsi/programs/emerald
39. Google / EuroSys. Large-scale cluster management at Google with Borg. 2015. https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/
40. Kubernetes Documentation. Kubernetes Overview. 2026. https://kubernetes.io/docs/concepts/overview/
41. SchedMD. Slurm Workload Manager Overview. 2026. https://slurm.schedmd.com/overview.html
42. OpenTelemetry. What is OpenTelemetry? 2026. https://opentelemetry.io/docs/what-is-opentelemetry/
43. Microsoft Investor Relations. Microsoft 2025 Annual Report. 2025. https://www.microsoft.com/investor/reports/ar25/index.html
44. Alphabet Investor Relations. Alphabet 2025 Annual Report. 2026. https://s206.q4cdn.com/479360582/files/doc_financials/2025/q4/GOOG-10-K-2025.pdf
45. Amazon Investor Relations. Amazon annual reports, proxies and shareholder letters. 2026. https://ir.aboutamazon.com/annual-reports-proxies-and-shareholder-letters/default.aspx
46. Meta Investor Relations. Meta Reports Second Quarter 2025 Results. 2025. https://investor.fb.com/investor-news/press-release-details/2025/Meta-Reports-Second-Quarter-2025-Results/default.aspx
47. Dell Technologies. Dell AI Factory with NVIDIA. 2025-2026. https://www.dell.com/en-us/lp/dt/dell-ai-factory-with-nvidia
48. Hewlett Packard Enterprise. HPE Cray Supercomputing. 2025-2026. https://www.hpe.com/us/en/compute/hpc/supercomputing/cray.html
49. Supermicro. Supermicro NVIDIA GB200 NVL72. 2025-2026. https://www.supermicro.com/en/products/system/gpu/48u/srs-gb200-nvl72
50. Lenovo. Lenovo Neptune Liquid Cooling. 2025-2026. https://www.lenovo.com/us/en/servers-storage/solutions/neptune/
51. Schneider Electric. AI-ready data center solutions. 2025-2026. https://www.se.com/ww/en/work/solutions/for-business/data-centers-and-networks/ai-ready-data-center/
52. Vertiv Investor Relations. Vertiv 2025 Annual Report. 2026. https://s205.q4cdn.com/554782763/files/doc_financials/2025/ar/Vertiv-2025-Annual-Report.pdf
53. Eaton. Data centers. 2025-2026. https://www.eaton.com/us/en-us/markets/data-centers.html
54. ABB. Data centers. 2025-2026. https://new.abb.com/data-centers
55. Siemens. Data centers. 2025-2026. https://www.siemens.com/global/en/markets/data-centers.html
56. Cummins. Data Centers. 2025-2026. https://www.cummins.com/generators/data-centers
57. Caterpillar. Data Center Power Solutions. 2025-2026. https://www.cat.com/en_US/by-industry/electric-power/data-centers.html
58. Uptime Institute. 2025 Global Data Center Survey. 2025. https://intelligence.uptimeinstitute.com/resource/2025-global-data-center-survey-results-and-crosstabs
59. Uptime Institute. Annual Outage Analysis 2024. 2024. https://uptimeinstitute.com/resources/research-and-reports/annual-outage-analysis-2024
60. AWS Trust Center. Data Center - Our Controls. 2026. https://aws.amazon.com/trust-center/data-center/our-controls/
61. Microsoft Service Assurance. Datacenter physical access security. 2025. https://learn.microsoft.com/en-us/compliance/assurance/assurance-datacenter-physical-access-security
62. Google Cloud Security. "How Google protects the physical-to-logical space in a data center." 2025. https://cloud.google.com/docs/security/physical-to-logical-space
63. NIST. SP 800-53 Rev. 5, Security and Privacy Controls for Information Systems and Organizations. 2020, updates through 2025. https://csrc.nist.gov/Pubs/sp/800/53/r5/upd1/Final
64. International Organization for Standardization. ISO/IEC 27001:2022 Information security management systems. 2022. https://www.iso.org/standard/27001
65. FedRAMP. FedRAMP baselines. 2026. https://www.fedramp.gov/baselines/
66. Krizhevsky, A., Sutskever, I., and Hinton, G. "ImageNet Classification with Deep Convolutional Neural Networks." NeurIPS, 2012. https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
67. Vaswani, A. et al. "Attention Is All You Need." 2017. https://arxiv.org/abs/1706.03762
68. Brown, T. et al. "Language Models are Few-Shot Learners." 2020. https://arxiv.org/abs/2005.14165
69. Kaplan, J. et al. "Scaling Laws for Neural Language Models." 2020. https://arxiv.org/abs/2001.08361
70. Jouppi, N.P. et al. "In-Datacenter Performance Analysis of a Tensor Processing Unit." 2017. https://arxiv.org/abs/1704.04760
71. PCI-SIG. PCI Express 6.0 Specification. 2022. https://pcisig.com/pci-express-60-specification
72. CXL Consortium. Compute Express Link. 2026. https://www.computeexpresslink.org/