
AI Data Centres: The Machine Beneath The Cloud

A plain-English technical guide to the physical system behind AI infrastructure: power, cooling, network fabrics, accelerators, suppliers, operations, staffing, security, and the limits that appear when cloud capacity becomes real machinery.

Published: May 14, 2026
Reading time: 30 min
Author: Christopher Lyon
Filed under: Research
[Image: Inside an AI data centre row with liquid-cooled accelerator racks, overhead busway, fibre trunks, and a single blue status signal.]

You can know the cloud well and still have the wrong picture of a data centre.

The cloud makes infrastructure look like a menu. Pick a region. Pick a zone. Pick an instance. Attach storage. Start a job. Watch a model train.

Underneath that menu is a physical machine. It takes electricity from the grid, pushes it through switchgear and power shelves, turns most of it into heat inside silicon, removes that heat with air or liquid, moves data through fibres and switches, and keeps the whole thing working with operators, controls, spares, guards, procedures, and alarms.

AI has made that machine harder to hide. A web service can often survive with more ordinary servers, normal rack power, and a network built for mixed traffic. Large AI training and inference systems concentrate power, heat, memory bandwidth, network traffic, storage pressure, supplier risk, and capital cost in one place. At that point the data centre stops being background infrastructure. It becomes part of the product.

The short version:

An AI data centre is not a room full of GPUs. It is a factory for useful computation. The raw inputs are power, chips, memory, data, network bandwidth, cooling capacity, and human operating discipline. The output is trained models, tokens, embeddings, search results, recommendations, simulations, and cloud services.

If any input is weak, the output gets expensive or unreliable.

Abstract

This article explains how AI data centres work for readers who already understand AI systems and cloud services. It does not start with "what is a server?" It starts with the next layer down: what has to be built and operated before a GPU instance, model endpoint, training cluster, or cloud region can exist.

The main claim is simple. AI infrastructure is constrained by five linked systems:

  1. Power: how much electricity can reach the site, the building, the row, the rack, and the chip.
  2. Cooling: how quickly heat can leave the chip, rack, room, and campus.
  3. Data movement: how fast data, gradients, model weights, checkpoints, and requests can move without stalling expensive accelerators.
  4. Control: how the facility and compute fleet are monitored, scheduled, patched, drained, repaired, and recovered.
  5. Trust: how physical security, personnel controls, supply chain, audit evidence, and customer isolation are maintained.

The hard work is not buying the most powerful accelerator. The hard work is building a powered, cooled, networked, secure, observable, maintainable failure domain that can run the workload at the required cost and schedule.

The Useful Mental Model

Think of the data centre as a machine with four flows.

Flow | Plain-English job | What can break
Electricity in | Bring power from the grid to chips without unsafe faults or unacceptable interruptions | Utility delay, transformer shortage, UPS fault, breaker trip, rack power limit
Heat out | Move heat away from chips before equipment throttles or fails | Airflow problem, pump fault, bad water chemistry, cooling-tower limit, leak
Data through | Move bits between users, storage, CPUs, GPUs, and other data centres | Congestion, bad optics, weak fabric, high tail latency, slow storage
Control over | Decide what runs, where, when to change it, when to stop it, and who may touch it | Bad scheduler policy, weak telemetry, poor change control, unclear authority, security gap

Most bad explanations of data centres describe the equipment but miss the flows. The equipment is there to protect the flows.

A GPU rack is not valuable because it looks dense. It is valuable only if enough clean power reaches it, enough heat leaves it, enough data reaches it, the network lets it cooperate with other racks, the scheduler keeps it busy, and operators can repair it without causing a wider outage.

That is the frame for the rest of the article.

Cloud Words, Physical Meaning

Cloud providers turn facilities into abstractions. That is the point of cloud computing. The abstraction is useful, but it can hide the physical commitments behind it.

AWS says an Availability Zone contains one or more discrete data centres with redundant power, networking, and connectivity. Google Cloud, Azure, and Oracle use similar region and zone language because customers need to reason about latency, data residency, service availability, and regulatory placement.1Amazon Web Services Documentation. AWS Regions and Availability Zones. 2026. https://docs.aws.amazon.com/global-infrastructure/latest/regions/aws-regions-availability-zones.html2Google Cloud Documentation. Geography and regions. 2026. https://cloud.google.com/docs/geography-and-regions3Microsoft Azure. Global Infrastructure. 2026. https://azure.microsoft.com/en-us/explore/global-infrastructure/4Oracle. Public Cloud Regions and Data Centers. 2026. https://www.oracle.com/cloud/architecture-and-regions.html

The translation looks like this:

Cloud word | What exists underneath
Region | A commercial geography backed by campuses, fibre, utilities, operating teams, contracts, and legal commitments
Availability Zone | A failure-domain promise backed by separated buildings, power paths, network paths, and operating assumptions
GPU instance | A slice of accelerator hardware that had to be bought, shipped, mounted, cabled, powered, cooled, monitored, patched, and amortized
Capacity | A forecast that turned into land, power reservations, equipment orders, construction schedules, and operational risk
Latency | Geography plus fibre path plus switch hops plus congestion plus software fan-out
Reliability | Design, commissioning, monitoring, spares, maintenance, change control, incident response, and luck managed down
Sustainability | Energy source, PUE, water strategy, carbon intensity, utilization, hardware lifetime, and local grid impact

The cloud interface is a promise. The data centre is how the promise is kept.

What Makes AI Different

AI did not invent the data centre. Banks, telecom networks, search engines, cloud providers, laboratories, and governments have run serious compute facilities for decades. The change is density and synchronization.

Earlier cloud growth was often scale-out web infrastructure: many servers, many services, much traffic, but not always one tightly coupled job requiring thousands of accelerators to behave like one machine. Large AI training changes that. The accelerators exchange data frequently. A slow link, a congested switch, a storage pause, or one failing worker can slow the job. Google's "tail at scale" point applies brutally: rare per-node delays become ordinary when a request or training step depends on many nodes.5Dean, J. and Barroso, L.A. "The Tail at Scale." Communications of the ACM, 2013. https://research.google/pubs/the-tail-at-scale/
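
The arithmetic behind the tail-at-scale point is short. Assuming, hypothetically, that each worker is independently slow with some small probability p, a synchronous training step across N workers stalls whenever any one of them is:

```python
def p_step_delayed(p_slow_worker: float, n_workers: int) -> float:
    """Probability that a synchronous step is delayed, assuming each
    worker is independently slow with probability p_slow_worker."""
    return 1.0 - (1.0 - p_slow_worker) ** n_workers

# A 1-in-10,000 hiccup is negligible on one machine...
single = p_step_delayed(1e-4, 1)

# ...but across 8,192 workers in lockstep, more than half of all
# steps see at least one slow worker (roughly 0.56).
cluster = p_step_delayed(1e-4, 8192)
```

This is a sketch, not a model of any real cluster: real delays are correlated and have heavy tails, which generally makes the picture worse, not better.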

Inference changes the problem again. It may need lower latency, higher availability, model-weight distribution, retrieval systems, safety systems, logging, and burst capacity close to users. A training cluster can sometimes run far from the customer. An inference service may need to sit inside a region where latency, sovereignty, and product reliability matter.

AI also changes the rack. The important unit used to be a server. Then it became a rack. With systems like NVIDIA GB200 NVL72 and DGX SuperPOD reference architectures, the unit starts to look like a rack-scale or pod-scale computer: CPUs, accelerators, NVLink or other scale-up interconnect, networking, storage, cooling, power, and management designed together.6NVIDIA. GB200 NVL72. 2024-2026 product page. https://www.nvidia.com/en-us/data-center/gb200-nvl72/7NVIDIA Documentation. NVIDIA DGX SuperPOD Reference Architecture Featuring DGX GB200. 2025. https://docs.nvidia.com/dgx-superpod/reference-architecture-scalable-infrastructure-gb200/latest/dgx-superpod-architecture.html

This is why the phrase "AI factory" is useful when it is used carefully. The facility is not just hosting computation. It is arranged to turn capital equipment and electricity into model outputs.

The Build Sequence

A serious data-centre project does not start by picking a favourite GPU. It starts with a workload and a constraint model.

The first questions are plain:

Question | Why it matters
Is the site for training, inference, storage, general cloud, or a mix? | Each load shape stresses a different part of the system
How many megawatts are needed now and later? | Power is often the schedule gate
What rack density must the building support? | Cooling and electrical design follow rack density
What network behaviour does the workload require? | AI training punishes weak east-west bandwidth and poor congestion control
What data must stay local? | Residency, latency, and data gravity shape region choice
How long may a job or service be interrupted? | Redundancy, checkpointing, maintenance, and customer promises depend on it
Who will operate it? | A design that the team cannot maintain is not a design

Then comes site selection. Cheap land is not enough. A workable AI site needs power, fibre, civil access, permits, water or a waterless cooling strategy, construction labour, political acceptance, security, and expansion room. Utility interconnection can decide the schedule before the first rack arrives. NREL's interconnection work and NERC's reliability assessments are useful reminders that new load is a power-system problem, not just a customer procurement problem.8National Renewable Energy Laboratory. Transmission Interconnection Roadmap. 2024. https://www.nrel.gov/grid/transmission-interconnection-roadmap9North American Electric Reliability Corporation. 2025 Long-Term Reliability Assessment. 2025. https://www.nerc.com/pa/RAPA/ra/Pages/default.aspx

After the site comes the design basis. This is the document that says what the building is being designed to handle: IT load, rack density, redundancy, cooling medium, power topology, network architecture, physical security zones, maintainability, and commissioning tests. A bad design basis poisons the project. If the design assumes 20 kW racks and procurement later buys 120 kW liquid-cooled racks, everyone is now negotiating with physics.
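
That negotiation with physics can be made concrete at the row level. A minimal sketch of the power check, using hypothetical busway figures (415 V, 1,000 A, three-phase) and the standard relation P = √3 × V × I × PF:

```python
import math

def busway_capacity_kw(volts: float, amps: float, power_factor: float = 0.95) -> float:
    """Usable three-phase busway capacity: P = sqrt(3) * V_line * I * PF."""
    return math.sqrt(3) * volts * amps * power_factor / 1000.0

def racks_per_busway(rack_kw: float, volts: float, amps: float) -> int:
    """Whole racks a single busway run can feed at a given rack density."""
    return int(busway_capacity_kw(volts, amps) // rack_kw)

# Hypothetical row: 415 V, 1,000 A busway (~683 kW usable).
print(racks_per_busway(20.0, 415.0, 1000.0))   # 20 kW design-basis racks -> 34
print(racks_per_busway(120.0, 415.0, 1000.0))  # 120 kW liquid-cooled racks -> 5
```

Same building, same busway: the swap from 20 kW to 120 kW racks cuts the row from 34 racks to 5, which is why the design basis has to be settled before procurement.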

Procurement follows. For AI, procurement is not a back-office function. It is part of the architecture:

Supplier layer | What it decides
Accelerator vendor | Compute density, software ecosystem, rack shape, power draw, cooling path
Foundry and packaging | Whether the chips and HBM packages can be made in volume
Memory supplier | HBM capacity, bandwidth, yield, and delivery schedule
Server OEM or ODM | How chips become serviceable systems and racks
Network vendor | Fabric bandwidth, telemetry, congestion behaviour, optics, support model
Power vendor | Switchgear, UPS, busway, transformers, breakers, generators, protection
Cooling vendor | Chillers, dry coolers, CDUs, cold plates, pumps, valves, monitoring
Construction trades | Whether the design becomes tested capacity on a real schedule

Commissioning is the first honest exam. Electrical teams test switchgear, protection, UPS behaviour, generator starts, transfer sequences, grounding, and load banks. Mechanical teams test pumps, valves, airflow, heat rejection, water chemistry, leak detection, and controls. IT teams test cabling, optics, firmware, network paths, storage throughput, scheduler integration, and workload burn-in.

Commissioning is not a ceremony. It is where the building is encouraged to fail while the consequences are still contained.

Power: The First Gate

Every watt that enters a data centre becomes heat. That one sentence explains why power and cooling are inseparable.

Power starts outside the fence. A utility must be able to serve the site. That can require transmission work, distribution upgrades, substations, transformers, protection studies, metering, contracts, and time. AI campuses can be large enough that a local utility, regulator, or community cannot treat them like ordinary commercial load. IEA, LBNL, EPRI, and Uptime Institute all frame AI data-centre growth as an energy-system issue, not merely an IT issue.10International Energy Agency. Energy and AI. 2025. https://www.iea.org/reports/energy-and-ai11International Energy Agency. "Data centre electricity use surged in 2025, even with tightening bottlenecks driving a scramble for solutions." 2026. https://www.iea.org/news/data-centre-electricity-use-surged-in-2025-even-with-tightening-bottlenecks-driving-a-scramble-for-solutions12Lawrence Berkeley National Laboratory / U.S. Department of Energy. 2024 United States Data Center Energy Usage Report. 2024. https://buildings.lbl.gov/publications/2024-lbnl-data-center-energy-usage-report13Electric Power Research Institute. Powering Intelligence: Analyzing Artificial Intelligence and Data Center Energy Consumption. 2024. https://restservice.epri.com/publicdownload/000000003002028905/0/Product14Uptime Institute. Giant data center power plans reach extreme levels. 2026. https://intelligence.uptimeinstitute.com/sites/default/files/2026-01/UI%20Field%20report%20194Giant%20data%20center%20power%20plans%20reach%20extreme%20levels.pdf

Inside the site, power moves through a chain:

Stage | Plain-English role
Utility interconnect | The site gets power from the grid
Substation | Voltage is stepped, switched, protected, and metered
Medium-voltage distribution | Large power blocks move around the campus
Transformers | Voltage is stepped down for buildings and equipment
Switchgear and breakers | Faults are isolated and maintenance becomes possible
UPS | Short interruptions are bridged and equipment gets time to ride through or shut down
Generators or alternate supply | Longer outages are covered if the design requires it
Busway and rack distribution | Power reaches rows, racks, and power shelves
Server power supplies | Electricity becomes usable DC power for chips, memory, fans, pumps, and controllers

Redundancy means spare capacity in the right place. It does not mean safety by slogan. N+1 means one extra component beyond the needed number. 2N means two independent systems sized for the load. Distributed redundant designs spread the spare capacity differently. Each choice adds cost, complexity, test work, and failure modes.
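
Those redundancy schemes reduce to simple module counts. A minimal sketch, assuming a hypothetical 10 MW block served by 2.5 MW UPS modules:

```python
import math

def ups_modules(load_mw: float, module_mw: float, scheme: str) -> int:
    """Number of UPS modules under a given redundancy scheme.
    N   : just enough modules for the load
    N+1 : one spare module beyond N
    2N  : two independent N-sized systems
    """
    n = math.ceil(load_mw / module_mw)
    if scheme == "N":
        return n
    if scheme == "N+1":
        return n + 1
    if scheme == "2N":
        return 2 * n
    raise ValueError(f"unknown scheme: {scheme}")

# Hypothetical block: 10 MW of IT load, 2.5 MW modules.
for scheme in ("N", "N+1", "2N"):
    print(scheme, ups_modules(10.0, 2.5, scheme))  # 4, 5, 8 modules
```

The jump from 5 to 8 modules between N+1 and 2N is the cost-versus-independence tradeoff in one number; it buys two fully separate power paths, not just a spare.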

PUE is the common energy-efficiency ratio. It means total facility energy divided by IT equipment energy. If a data centre has a PUE of 1.2, then each 1 MW of IT equipment needs another 0.2 MW for cooling, power losses, pumps, lights, controls, and other overhead. ISO/IEC 30134-2 standardizes PUE, and The Green Grid helped popularize it.15International Organization for Standardization. ISO/IEC 30134-2:2026, Data centres key performance indicators - Power usage effectiveness. 2026. https://www.iso.org/standard/30134-216The Green Grid. Data Center Power Efficiency Metrics: PUE and DCiE. 2007. https://www.thegreengrid.org/en/resources/library-and-tools/20-Data-Center-Power-Efficiency-Metrics-PUE-and-DCiE

PUE is useful, but it can mislead. It does not tell you whether the GPUs are doing useful work. A low-PUE site with badly scheduled accelerators can still waste money. A slightly higher-PUE site in a cleaner grid or water-constrained region may be the better decision. For AI, the energy question is not only "how efficient is the building?" It is also "how much useful model work comes out per watt, per dollar, and per litre of water?"
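
The PUE definition above is simple enough to compute directly, which also makes its blind spot visible: nothing in the ratio says whether the IT load is doing useful work.

```python
def pue(total_facility_kw: float, it_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT equipment power
    (standardized in ISO/IEC 30134-2)."""
    return total_facility_kw / it_kw

def overhead_kw(it_kw: float, pue_value: float) -> float:
    """Non-IT overhead (cooling, power losses, pumps, lights) implied by a PUE."""
    return it_kw * (pue_value - 1.0)

# The article's example: at PUE 1.2, each 1,000 kW of IT load
# carries about 200 kW of facility overhead.
print(overhead_kw(1000.0, 1.2))
```

Note that `it_kw` counts idle accelerators the same as busy ones, which is exactly why a low-PUE site with a badly scheduled fleet can still waste money.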

Cooling: The Heat Has To Leave

A chip is a heater that happens to compute.

Air cooling works by moving cold air to the front of servers and hot air away from the back. Good air-cooled data halls manage hot aisles, cold aisles, blanking panels, floor tiles, containment, fan energy, pressure, humidity, filters, and equipment inlet temperature. ASHRAE TC 9.9 matters because IT equipment has a thermal operating envelope; the target is not human comfort.17ASHRAE TC 9.9. Thermal Guidelines for Data Processing Environments, 5th edition reference card. 2021 / 2024. https://www.ashrae.org/file%20library/technical%20resources/bookstore/supplemental%20files/therm-gdlns-5th-r-e-refcard.pdf

Dense AI racks make this harder. The rack may draw so much power that air alone becomes inefficient or impractical. Direct-to-chip liquid cooling puts a cold plate on the hot components, usually CPUs and accelerators. Liquid carries heat to a coolant distribution unit, then into a facility water loop, then to chillers, dry coolers, cooling towers, or another heat-rejection system. Open Compute Project cold-plate work exists because these liquid-cooled interfaces need common expectations.18Open Compute Project. Cooling Environments - Cold Plate. 2025. https://www.opencompute.org/wiki/CoolingEnvironments/ColdPlate

Plainly:

Cooling part | What it does
Cold plate | Touches the hot chip package and collects heat
Rack manifold | Distributes coolant to many cold plates
CDU | Controls flow, pressure, temperature, and separation between technology coolant and facility water
Facility loop | Moves heat away from the room
Chiller, dry cooler, or tower | Rejects heat outside the building
Sensors and controls | Detect temperature, pressure, leaks, flow, and abnormal states
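
The loop above is sized by one relation: heat removed equals mass flow times specific heat times temperature rise (Q = ṁ · cp · ΔT). A minimal sketch with hypothetical numbers, assuming a water-like coolant:

```python
def coolant_flow_lpm(heat_kw: float, delta_t_c: float,
                     cp_kj_per_kg_c: float = 4.18,
                     density_kg_per_l: float = 1.0) -> float:
    """Coolant volume flow (litres/minute) needed to carry heat_kw away
    at a given loop temperature rise, from Q = m_dot * cp * dT.
    Defaults assume a water-like coolant."""
    kg_per_s = heat_kw / (cp_kj_per_kg_c * delta_t_c)
    return kg_per_s / density_kg_per_l * 60.0

# Hypothetical 120 kW rack with a 10 C rise across the cold plates:
# roughly 172 litres of coolant per minute, continuously.
print(round(coolant_flow_lpm(120.0, 10.0), 1))
```

A smaller allowed ΔT means proportionally more flow, bigger pumps, and bigger pipes, which is why the temperature rise is a design decision, not an afterthought.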

Liquid cooling does not remove the operations problem. It changes it. The team now needs leak procedures, coolant chemistry, pump maintenance, valve checks, quick-disconnect discipline, spare parts, and technicians who can work near expensive powered equipment.

Water is the uncomfortable tradeoff. Evaporative cooling can be energy efficient, but it consumes water. Dry or closed-loop designs can reduce water use, but may use more electricity depending on climate and design. Microsoft has said its next-generation zero-water cooling designs reduce water consumed for cooling but cause a nominal increase in annual energy use compared with evaporative systems. Google frames cooling choices as a local balance between energy, water, carbon-free supply, climate, and workload.19Microsoft Cloud Blog. "Sustainable by design: Next-generation datacenters consume zero water for cooling." 2024-12-09. https://www.microsoft.com/en-us/microsoft-cloud/blog/2024/12/09/sustainable-by-design-next-generation-datacenters-consume-zero-water-for-cooling/20Microsoft Local. "Understanding water use at Microsoft datacenters." 2026. https://local.microsoft.com/blog/understanding-water-use-at-microsoft-datacenters/21Google. Operating sustainably - Google Data Centers. 2025-2026. https://www.datacenters.google/operating-sustainably

There is no universal best cooling system. There is only a cooling system that fits the rack density, climate, water politics, energy source, operating team, and failure tolerance.

Networking: North-South, East-West, And The Fabric

Two terms are worth getting right.

North-south traffic enters or leaves the data centre or cluster. User requests, API calls, internet traffic, and region-to-region traffic often fit here.

East-west traffic moves inside the data centre. Service-to-service calls, storage reads, model-shard communication, gradient exchange, checkpoint traffic, and management traffic fit here.22Cisco. What Is Data Center Networking? 2025-2026. https://www.cisco.com/site/us/en/learn/topics/computing/what-is-data-center-networking.html

Traditional web serving cares a lot about north-south traffic because users are outside the facility. AI training cares heavily about east-west traffic because accelerators must cooperate inside the cluster.

Modern data-centre networks use Clos or spine-leaf ideas. A leaf switch connects servers. Spine switches connect leaves. The goal is predictable paths between racks without one central choke point. Clos switching theory is old, but hyperscalers turned it into modern data-centre fabrics. Google's Jupiter and Meta's F16 work show how much engineering goes into making these fabrics cheap, fast, observable, and controllable at scale.23Clos, C. "A Study of Non-Blocking Switching Networks." Bell System Technical Journal, 1953. https://onlinelibrary.wiley.com/doi/10.1002/j.1538-7305.1953.tb01433.x24Google / SIGCOMM. Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network. 2015. https://research.google/pubs/jupiter-rising-a-decade-of-clos-topologies-and-centralized-control-in-googles-datacenter-network-2/25Meta Engineering. "Reinventing our data center network with F16, Minipack." 2019. https://engineering.fb.com/2019/03/14/data-center-engineering/f16-minipack/

AI adds a back-end network problem. A training job may need thousands of accelerators to exchange data at the same time. If one path is slow, the job can wait. The network needs high bandwidth, low latency, good congestion control, fast failure detection, and useful telemetry.
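
One way to see why fabric design matters is the leaf oversubscription ratio: server-facing bandwidth divided by spine-facing bandwidth. The port counts below are hypothetical:

```python
def oversubscription(servers_per_leaf: int, server_gbps: float,
                     uplinks_per_leaf: int, uplink_gbps: float) -> float:
    """Leaf oversubscription ratio: downlink bandwidth / uplink bandwidth.
    1.0 means non-blocking; higher means east-west traffic can queue."""
    downlink = servers_per_leaf * server_gbps
    uplink = uplinks_per_leaf * uplink_gbps
    return downlink / uplink

# Hypothetical general-purpose leaf: 48 x 100G down, 8 x 400G up.
print(oversubscription(48, 100.0, 8, 400.0))   # 1.5:1, fine for mixed traffic

# Hypothetical AI back-end leaf: 16 x 400G down, 16 x 400G up.
print(oversubscription(16, 400.0, 16, 400.0))  # 1.0, non-blocking
```

A 1.5:1 ratio is a reasonable economy for mixed web traffic; for synchronized gradient exchange it is a queue waiting to form, which is why AI back-end fabrics tend toward non-blocking designs.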

This is where InfiniBand, Ethernet/RoCE, and newer AI-focused Ethernet work appear.

Fabric choice | Why teams choose it | What they inherit
InfiniBand | Mature low-latency HPC and AI training fabric | More specialized operations and supplier concentration
Ethernet with RoCE | Huge ecosystem, cloud familiarity, supplier diversity | Careful congestion tuning and operational discipline
Proprietary scale-up links | Very high bandwidth inside a rack or pod | Platform lock-in and shorter reach

InfiniBand remains important in AI and HPC. Ethernet is being adapted for AI and HPC at scale, including through the Ultra Ethernet Consortium.26InfiniBand Trade Association. InfiniBand - A low-latency, high-bandwidth interconnect. 2025-2026. https://www.infinibandta.org/about-infiniband/27Ultra Ethernet Consortium. "Ultra Ethernet Consortium launches Specification 1.0." 2025-06-11. https://ultraethernet.org/ultra-ethernet-consortium-uec-launches-specification-1-0-transforming-ethernet-for-ai-and-hpc-at-scale/

The physical layer matters too. Optics, copper, fibre trunks, patch panels, cable trays, labels, bend radius, cleaning, transceivers, and switch thermals are not accessories. A dirty connector can become a training delay. A late optics shipment can become a cluster delay. A hot switch can become a network incident.

Compute: The Chip Is Only One Layer

The accelerator gets the attention because it is expensive and visible. It is still only one part of the compute path.

An AI server or rack brings together:

Component | Why it matters
Accelerator | Performs the dense numerical work
HBM | Feeds the accelerator with very high memory bandwidth
CPU | Handles orchestration, preprocessing, host work, and parts of the application
NIC or DPU | Moves data onto the network and may offload work from the CPU
Local storage | Handles cache, scratch, logs, or fast staging
Firmware and drivers | Decide whether the hardware behaves consistently
Libraries | Make the hardware usable by model code
Rack interconnect | Lets accelerators behave like a larger system

HBM is worth calling out. It is stacked memory placed close to the accelerator package. It gives much higher bandwidth than ordinary server memory, but it also ties AI performance to advanced packaging, memory yield, thermal design, and supplier capacity. JEDEC standards, SK hynix production announcements, Micron product material, Samsung HBM material, and TSMC annual reporting all point to the same fact: AI infrastructure depends on memory and packaging, not just logic chips.28JEDEC / Business Wire. "JEDEC Publishes HBM3 Update to High Bandwidth Memory Standard." 2022-01-27. https://www.businesswire.com/news/home/20220127005320/en/JEDEC-Publishes-HBM3-Update-to-High-Bandwidth-Memory-HBM-Standard29SK hynix Newsroom. "SK hynix Begins Volume Production of Industry's First HBM3E." 2024-03-19. https://news.skhynix.com/sk-hynix-begins-volume-production-of-industry-first-hbm3e/30Micron. HBM3E. 2025-2026 product page. https://www.micron.com/products/memory/hbm/hbm3e31Samsung Semiconductor. High Bandwidth Memory. 2025-2026 product page. https://semiconductor.samsung.com/dram/hbm/32TSMC Investor Relations. TSMC 2025 Annual Report. 2026. https://investor.tsmc.com/sites/ir/annual-report/2025/2025%20Annual%20ReportE.pdf
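
A rough roofline sketch shows why HBM bandwidth, not peak FLOPS, often bounds LLM serving: during batch-1 decoding, every generated token must stream the full model weights out of HBM at least once. The model size and bandwidth figures below are hypothetical:

```python
def decode_tokens_per_s(param_count_b: float, bytes_per_param: float,
                        hbm_bw_tb_s: float) -> float:
    """Rough upper bound on batch-1 decode speed for a memory-bandwidth-bound
    model: tokens/s <= HBM bandwidth / model size in bytes."""
    model_bytes = param_count_b * 1e9 * bytes_per_param
    return hbm_bw_tb_s * 1e12 / model_bytes

# Hypothetical: a 70B-parameter model at 2 bytes/param (140 GB of weights)
# on an accelerator with 3 TB/s of HBM bandwidth -> about 21 tokens/s ceiling,
# regardless of how many FLOPS the chip advertises.
print(round(decode_tokens_per_s(70.0, 2.0, 3.0)))
```

Batching and quantization raise the effective ceiling, but the basic coupling stands: more HBM bandwidth means more tokens, which is why memory suppliers sit inside the architecture conversation.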

This is why supplier conversations quickly become architectural conversations. NVIDIA, AMD, Intel, Broadcom, TSMC, memory suppliers, server OEMs, network vendors, and power/cooling suppliers are all part of the same machine.33NVIDIA. NVIDIA 2025 Annual Report. 2025. https://s201.q4cdn.com/141608511/files/docfinancials/2025/annual/NVIDIA-2025-Annual-Report.pdf34AMD. AMD Instinct MI300X Accelerators. 2024-2026 product page. https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html35Intel. Intel Gaudi 3 AI Accelerator. 2024. https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi.html36Broadcom Investor Relations. Broadcom Annual Reports. 2025. https://investors.broadcom.com/financial-information/annual-reports

The practical rule is simple: an accelerator is idle whenever the rest of the system cannot feed it, cool it, schedule it, or recover it.

Storage: The Slow Part Can Be Somewhere Else

AI storage is easy to underestimate because cloud object storage feels bottomless. Inside the machine, storage is hardware, network paths, filesystems, metadata services, rebuild behaviour, durability policy, capacity planning, and power draw.

Training needs datasets, transformed shards, checkpoints, logs, model weights, evaluation outputs, and experiment records. Inference needs model weights, cache, retrieval indexes, logs, safety records, and sometimes vector databases. Governance needs lineage, retention, deletion, access control, and audit evidence.

A training path may look like this:

  1. Raw data lands in object storage or a data lake.
  2. Preprocessing turns it into training-ready shards.
  3. High-throughput storage feeds the cluster.
  4. Accelerators consume batches.
  5. Checkpoints write back on a schedule.
  6. Failed jobs restart from checkpoints.
  7. Model artifacts move to evaluation, fine-tuning, serving, or archive.

Each stage can slow the job. NVMe, NVMe over Fabrics, parallel filesystems, burst buffers, object stores, metadata services, and storage efficiency work all matter because GPUs are too expensive to wait politely for data.37NVM Express. NVMe Specifications. 2026. https://nvmexpress.org/developers/nvme-specification/38SNIA. SNIA Emerald Program. 2026. https://www.snia.org/forums/cmsi/programs/emerald
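
Checkpoint cadence is a concrete version of this pressure: checkpoint too often and storage overhead eats GPU time, too rarely and failures erase hours of work. The Young/Daly approximation gives a first-order optimal interval from checkpoint write time and job-level mean time between failures; the numbers below are hypothetical:

```python
import math

def optimal_checkpoint_interval_s(checkpoint_write_s: float, mtbf_s: float) -> float:
    """Young/Daly first-order approximation for the checkpoint interval that
    balances lost work against checkpoint overhead:
    T_opt ~= sqrt(2 * write_time * MTBF)."""
    return math.sqrt(2.0 * checkpoint_write_s * mtbf_s)

# Hypothetical job: checkpoints take 5 minutes to write, and the job sees
# a failure about once a day across all its workers.
t_opt = optimal_checkpoint_interval_s(300.0, 24 * 3600.0)
print(round(t_opt / 60.0))  # -> 120 minutes between checkpoints
```

Note both inputs are infrastructure numbers: faster checkpoint storage shortens the optimal interval, and a bigger cluster shortens the job-level MTBF, so both push checkpoint traffic up.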

Storage is not the glamorous part of an AI data centre. It is one of the easiest places to waste the glamorous part.

Control: The Building And The Fleet Both Need Operators

There are two control systems in an AI data centre.

The facility control system watches the building: power, switchgear, UPS, generators, pumps, valves, chillers, CDUs, cooling towers, leak sensors, fire systems, cameras, doors, and environmental sensors.

The compute control system watches the fleet: servers, accelerators, NICs, storage, images, firmware, drivers, schedulers, jobs, quotas, logs, metrics, traces, alerts, and customer services.

In a low-density facility those worlds can be loosely connected. In an AI facility they need to talk. A scheduler may need to avoid racks under thermal constraint. A liquid-cooling alarm may need a workload drain. A firmware update may change power behaviour. A network maintenance window may kill a training job that assumed the fabric would stay stable for days.
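
A minimal sketch of what "they need to talk" means in scheduler code. All rack states, field names, and thresholds here are hypothetical:

```python
# Facility-aware placement: the compute control plane filters out racks
# the facility control plane has flagged before placing new work.

def schedulable_racks(racks: dict, max_coolant_c: float = 45.0) -> list:
    """Return racks that are healthy, not draining, and within thermal limits."""
    ok = []
    for rack_id, state in racks.items():
        if state["leak_alarm"]:
            continue  # liquid-cooling alarm: drain, never place new work
        if state["draining"]:
            continue  # maintenance window in progress
        if state["coolant_out_c"] > max_coolant_c:
            continue  # thermally constrained: let the loop recover
        ok.append(rack_id)
    return sorted(ok)

fleet = {
    "rack-a01": {"leak_alarm": False, "draining": False, "coolant_out_c": 38.0},
    "rack-a02": {"leak_alarm": True,  "draining": False, "coolant_out_c": 39.0},
    "rack-a03": {"leak_alarm": False, "draining": True,  "coolant_out_c": 37.0},
    "rack-a04": {"leak_alarm": False, "draining": False, "coolant_out_c": 47.5},
}
print(schedulable_racks(fleet))  # only rack-a01 passes every filter
```

Real systems express this through scheduler mechanisms such as node taints, drain states, or health labels; the point is that facility telemetry has to reach the scheduler as machine-readable state, not as an email to the operations channel.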

Google's Borg paper is still useful because it shows the data centre as a managed resource pool. Kubernetes is the common cloud-native control-plane model. Slurm remains common in HPC and AI clusters. OpenTelemetry gives a modern observability vocabulary for traces, metrics, and logs.39Google / EuroSys. Large-scale cluster management at Google with Borg. 2015. https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/40Kubernetes Documentation. Kubernetes Overview. 2026. https://kubernetes.io/docs/concepts/overview/41SchedMD. Slurm Workload Manager Overview. 2026. https://slurm.schedmd.com/overview.html42OpenTelemetry. What is OpenTelemetry? 2026. https://opentelemetry.io/docs/what-is-opentelemetry/

The question is not only which scheduler to use. The question is who has authority.

Who may reserve a thousand GPUs? Who may preempt a job? Who may drain a rack? Who may roll firmware? Who may ignore utilization to protect reliability? Who owns the incident when a facility alarm and a model-training failure are the same event?

If that authority model is vague, the cluster will be run by escalation.

Hyperscalers

Hyperscalers are not just large customers. They are infrastructure manufacturers.

Amazon, Microsoft, Google, Meta, and Oracle have enough scale to shape server designs, accelerator plans, switch fabrics, data-centre locations, cooling choices, energy contracts, and internal operating systems. Their advantage is not only money. It is repetition. They build enough sites to learn, standardize, and push suppliers.

Their public filings and investor updates show the capital weight of AI infrastructure. Microsoft, Alphabet, Amazon, and Meta all frame cloud and AI infrastructure as a major investment area.43Microsoft Investor Relations. Microsoft 2025 Annual Report. 2025. https://www.microsoft.com/investor/reports/ar25/index.html44Alphabet Investor Relations. Alphabet 2025 Annual Report. 2026. https://s206.q4cdn.com/479360582/files/docfinancials/2025/q4/GOOG-10-K-2025.pdf45Amazon Investor Relations. Amazon annual reports, proxies and shareholder letters. 2026. https://ir.aboutamazon.com/annual-reports-proxies-and-shareholder-letters/default.aspx46Meta Investor Relations. Meta Reports Second Quarter 2025 Results. 2025. https://investor.fb.com/investor-news/press-release-details/2025/Meta-Reports-Second-Quarter-2025-Results/default.aspx

Hyperscalers also hide a hard truth from customers: every simple cloud SKU is backed by a complex capacity bet. A GPU instance offered in a region means someone has already made decisions about land, power, cooling, hardware supply, network topology, customer demand, depreciation, and failure risk.

The cloud customer buys abstraction. The hyperscaler sells abstraction by owning more of the mess.

Suppliers

The supplier base is wider than the AI conversation usually admits.

Supplier group | What they control
NVIDIA, AMD, Intel, custom ASIC teams | Accelerator roadmaps, software ecosystems, rack design pressure
Broadcom, Arista, Cisco, NVIDIA Networking, others | Switching silicon, NICs, fabrics, telemetry, congestion behaviour
TSMC and advanced packaging suppliers | Whether the most advanced silicon and package designs can ship
SK hynix, Micron, Samsung | HBM supply, yield, bandwidth, capacity
Dell, HPE, Supermicro, Lenovo, ODMs | Server integration, rack serviceability, firmware, spares
Schneider Electric, Vertiv, Eaton, ABB, Siemens | Electrical infrastructure, cooling infrastructure, controls, monitoring
Cummins, Caterpillar, fuel and backup suppliers | Backup generation, fuel logistics, emergency power
Construction and commissioning firms | Whether drawings become usable capacity

NVIDIA is central now because it sells more than GPUs: systems, networking, interconnects, software, reference architectures, and management tooling. Dell, HPE, Supermicro, Lenovo, and others turn platforms into products that can be ordered, installed, serviced, and supported. Schneider Electric, Vertiv, Eaton, ABB, Siemens, Cummins, and Caterpillar show why power and cooling vendors are now AI infrastructure vendors too.47Dell Technologies. Dell AI Factory with NVIDIA. 2025-2026. https://www.dell.com/en-us/lp/dt/dell-ai-factory-with-nvidia48Hewlett Packard Enterprise. HPE Cray Supercomputing. 2025-2026. https://www.hpe.com/us/en/compute/hpc/supercomputing/cray.html49Supermicro. Supermicro NVIDIA GB200 NVL72. 2025-2026. https://www.supermicro.com/en/products/system/gpu/48u/srs-gb200-nvl7250Lenovo. Lenovo Neptune Liquid Cooling. 2025-2026. https://www.lenovo.com/us/en/servers-storage/solutions/neptune/51Schneider Electric. AI-ready data center solutions. 2025-2026. https://www.se.com/ww/en/work/solutions/for-business/data-centers-and-networks/ai-ready-data-center/52Vertiv Investor Relations. Vertiv 2025 Annual Report. 2026. https://s205.q4cdn.com/554782763/files/docfinancials/2025/ar/Vertiv-2025-Annual-Report.pdf53Eaton. Data centers. 2025-2026. https://www.eaton.com/us/en-us/markets/data-centers.html54ABB. Data centers. 2025-2026. https://new.abb.com/data-centers55Siemens. Data centers. 2025-2026. https://www.siemens.com/global/en/markets/data-centers.html56Cummins. Data Centers. 2025-2026. https://www.cummins.com/generators/data-centers57Caterpillar. Data Center Power Solutions. 2025-2026. https://www.cat.com/enUS/by-industry/electric-power/data-centers.html

The decision lesson: do not treat the GPU purchase order as the project. The project is the whole supplier chain arriving in the right order.

Operations: The Part That Keeps Happening

Construction ends. Operations does not.

A running AI data centre has several daily loops:

Operating loop | What it does
Facilities operations | Watches power, cooling, fire, water, leaks, fuel, alarms, and maintenance state
IT operations | Watches servers, accelerators, storage, firmware, operating systems, and workloads
Network operations | Watches switch health, optics, congestion, routing, maintenance, and capacity
Security operations | Watches access, cameras, badges, deliveries, contractors, rack doors, media, and incidents
Capacity operations | Forecasts demand, manages quotas, reserves clusters, and plans expansion
Maintenance | Replaces parts, tests generators, services UPS systems, checks cooling loops, and calibrates sensors
Change management | Controls what changes, when it changes, who approves, and how to roll back
Incident management | Detects, triages, communicates, mitigates, learns, and updates procedures
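The maintenance and change-management loops share one hard rule: no invasive work on a node that still carries live jobs. A minimal sketch of that pre-maintenance gate, using a hypothetical node record (the field names and the half-hour settle time are illustrative assumptions, not any vendor's procedure):

```python
# Sketch of a "safe to service" check before opening a rack. The node
# dict and its fields are invented for illustration.

def safe_to_service(node: dict) -> bool:
    """Return True only when a node is drained and maintenance may begin."""
    if node["running_jobs"]:              # never pull a node with live work
        return False
    if not node["marked_unschedulable"]:  # the drain must be explicit, not assumed
        return False
    if node["hours_since_drain"] < 0.5:   # let in-flight work and caches settle
        return False
    return bool(node["spares_on_site"])   # no spares on site, no invasive work

node = {"running_jobs": [], "marked_unschedulable": True,
        "hours_since_drain": 2.0, "spares_on_site": True}
print(safe_to_service(node))  # True: drained, settled, and spares are on hand
```

Real sites encode the same logic in MOPs and permits to work rather than code; the point is that each condition is checked, recorded, and blocking.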

The procedures have unromantic names: MOPs, SOPs, EOPs, permits to work, lockout/tagout, rounds, shift handovers, spares, runbooks, maintenance windows, post-incident reviews, access logs, and change freezes. They exist because a small mistake can affect many customers or many millions of dollars of equipment.

Uptime Institute's outage and survey work keeps returning to the same general lesson: failures are not only equipment failures. They are also process, staffing, maintenance, power, networking, and change-control failures.[58, 59]

AI adds pressure because low utilization is financially painful. Idle accelerators look like waste. But chasing utilization too aggressively can create fragile operations: overloaded cooling zones, rushed updates, weak drain procedures, ignored alarms, and maintenance debt. The operating model needs a rule for when to protect the fleet instead of chasing one more percentage point of use.
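One way to make such a rule concrete is an admission guardrail: refuse new load when facility headroom is thin, even though that lowers reported utilization. A toy sketch; every threshold here is an illustrative assumption, not vendor guidance:

```python
# Toy admission guardrail: fleet protection outranks utilization.
# All inputs and thresholds are invented for illustration.

def admit_job(gpu_util: float, cooling_headroom_pct: float,
              open_alarms: int, maintenance_debt_days: int) -> bool:
    if cooling_headroom_pct < 10.0:    # protect the cooling envelope first
        return False
    if open_alarms > 0:                # unexplained alarms trump utilization
        return False
    if maintenance_debt_days > 30:     # overdue maintenance means no new load
        return False
    return gpu_util < 0.95             # keep slack for failures and drains

print(admit_job(gpu_util=0.80, cooling_headroom_pct=25.0,
                open_alarms=0, maintenance_debt_days=5))   # True
print(admit_job(gpu_util=0.80, cooling_headroom_pct=6.0,
                open_alarms=0, maintenance_debt_days=5))   # False: thin cooling
```

The design choice worth copying is the ordering: facility conditions are checked before the utilization target is even consulted.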

Staffing

AI data centres are marketed through software, but they are run by people who understand electrical rooms, cooling loops, controls, racks, fibres, safety, and security.

A serious site needs electrical engineers, mechanical engineers, controls engineers, network engineers, hardware technicians, cluster operators, security staff, safety staff, supply-chain planners, and commissioning specialists.

The awkward part is overlap. A technician working on a liquid-cooled accelerator rack needs IT hardware skill, mechanical awareness, safety training, knowledge of vendor procedures, and the judgement to stop when a leak or alarm looks wrong. A cluster operator needs to understand that a facility constraint can be a scheduler constraint. A security officer needs to know whether a contractor's badge request matches a real work order and a permitted area.

Staffing is therefore a scaling limit. You can order hardware faster than you can train people to run it well.

Security

Cloud security begins below the cloud.

AWS, Microsoft, and Google publish physical security controls because customers need to trust the provider's physical-to-logical chain. Those controls include perimeter security, guards, cameras, badges, multi-factor access, cages, locked racks, visitor controls, equipment movement controls, media handling, incident response, and audit logs.[60, 61, 62]

AI raises the stakes:

Security issue | Why it matters
Model weights | A small number of files can represent enormous training cost and strategic value
Accelerator scarcity | Hardware theft, diversion, or tampering has direct business impact
Firmware and supply chain | NICs, baseboard controllers, drives, accelerators, optics, and firmware expand the attack surface
Remote hands | Contractors may touch high-value systems during urgent work
Multi-tenancy | Customers need isolation across expensive shared hardware
Physical-to-logical events | A badge event, rack-door alarm, drive movement, or camera alert may matter to a cyber investigation
Sovereignty | Some customers care exactly where data, staff, and hardware operations sit

NIST SP 800-53 is useful because it refuses to separate cyber security from physical, personnel, maintenance, contingency, incident, and supply-chain controls. ISO/IEC 27001 frames security as a management system. FedRAMP shows how cloud customers inherit provider controls rather than operating every physical control themselves.[63, 64, 65]

A data-centre security failure does not have to be dramatic. It can be an unrevoked badge, a contractor in the wrong room, a mislabeled drive, an unmanaged maintenance laptop, a bad firmware chain, or a rack-door alarm no one correlated with a logical event.
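Correlating those quiet signals is mostly a join over time and place. A minimal sketch that pairs physical events (badge, rack door) with logical events (controller logins) on the same rack within a time window; the event shapes and five-minute window are illustrative assumptions:

```python
# Sketch: pair physical and logical events by rack within a time window.
# Event dicts and field names are invented for illustration.
from datetime import datetime, timedelta

def correlate(physical, logical, window=timedelta(minutes=5)):
    """Return (physical, logical) pairs on the same rack within `window`.
    Physical events with no pair deserve a second look, not silence."""
    pairs = []
    for p in physical:
        for q in logical:
            if p["rack"] == q["rack"] and abs(p["t"] - q["t"]) <= window:
                pairs.append((p, q))
    return pairs

t0 = datetime(2026, 5, 14, 3, 0)
physical = [{"rack": "R12", "t": t0, "kind": "door_open"}]
logical = [{"rack": "R12", "t": t0 + timedelta(minutes=2), "kind": "bmc_login"}]
print(len(correlate(physical, logical)))  # 1: the door open has a logical match
```

A door-open event with no matching logical activity, or a login with no matching door, is exactly the kind of gap the paragraph above describes.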

Scaling Failures

AI data-centre plans usually fail at the joins between disciplines.

Failure mode | What it really means
"We have GPUs but cannot use them" | The network, storage, scheduler, power, cooling, or software stack is limiting useful work
"The building is ready but not live" | Utility power, commissioning, permits, fibre, switchgear, or operating readiness is late
"The cluster is hot" | Rack density, containment, flow, workload placement, or liquid loop behaviour is wrong
"The network is flaky" | Optics, congestion control, firmware, cabling, routing, or job traffic is outside the expected envelope
"Storage is slow" | Data layout, metadata, checkpointing, rebuilds, or network paths are starving accelerators
"Utilization is poor" | Resources exist, but not in the shape jobs need, or the scheduler cannot pack them cleanly
"Maintenance causes incidents" | Procedures, authority, drain logic, spares, or rollback paths are weak
"Security slows everything down" | Access, audit, contractor, and equipment movement processes were added after the operating model
"The community is angry" | Power, water, tax, noise, land use, or emissions questions were treated as externalities

The common cause is simple: the team optimized one layer and assumed the others would comply.

They do not comply. They negotiate.

The Century Behind The Machine

AI data centres feel new because the demand shock is new. The ingredients have a long history.

Period | Breakthrough | What it changed
1900s-1930s | Electrified industry and tabulating machines | Computation became tied to rooms, power, operators, and business process
1940s | Electronic computers | Heat, reliability, maintenance, and power became computing problems
1950s | Transistors and switching theory | Smaller electronics and scalable network ideas entered the story
1960s | Integrated circuits, mainframes, time sharing | Central compute became shared institutional infrastructure
1970s | DRAM, Ethernet, minicomputers | Memory density and local networking moved toward clustered systems
1980s | TCP/IP, fibre, client-server computing | Networked computing became normal
1990s | Web scale and commodity servers | Software had to assume hardware failure
2000s | Warehouse-scale computing and virtualization | The data centre became the computer
2010s | Deep learning, GPUs, TPUs, Kubernetes, NVMe, Clos fabrics | Accelerated cloud infrastructure became mainstream
2020s | Transformers, HBM, chiplets, advanced packaging, liquid cooling, rack-scale AI systems | Power, cooling, memory, network, and supplier capacity became first-order AI constraints

The modern AI branch runs through GPU-accelerated deep learning, transformers, large language-model scaling, in-data-centre accelerators, high-bandwidth memory, and rack-scale systems.[66, 67, 68, 69, 70, 28, 71, 72]

The pattern repeats. More useful compute exposes the next bottleneck. Faster chips expose heat. Bigger models expose network and memory. Better cloud products create more demand. More demand exposes grid limits. The data centre absorbs the consequence.

Useful Terms

Term | Plain meaning
AI factory | A data centre arranged to produce AI outputs, not just generic compute
Availability Zone | A cloud failure-domain promise backed by physical facilities and networks
Back-end network | The cluster network used for accelerator, storage, and training traffic
BMS | Building management system
Busway | Electrical distribution that carries power along rows or overhead paths
CDU | Coolant distribution unit for liquid-cooled systems
Clos fabric | A scalable multi-stage switching design used in data-centre networks
DCIM | Data-centre infrastructure management tooling
Direct-to-chip cooling | Liquid cooling that removes heat through cold plates on hot components
East-west traffic | Traffic between systems inside the data centre
EPMS | Electrical power monitoring system
Front-end network | Network used for service, user, storage, or management traffic depending on context
HBM | High bandwidth memory placed close to accelerators
Hot aisle/cold aisle | Airflow layout separating server intake and exhaust
InfiniBand | Low-latency fabric used in HPC and many AI clusters
IT load | Power used by servers, storage, and network equipment
MOP | Method of procedure for controlled maintenance work
North-south traffic | Traffic entering or leaving the data centre or cluster boundary
N+1 | Enough capacity plus one spare component
PDU | Power distribution unit
PUE | Facility energy divided by IT equipment energy
RDMA | Network data movement with low CPU involvement
RoCE | RDMA over Converged Ethernet
Spine-leaf | Common two-layer data-centre network fabric
Straggler | Slow worker, path, or component that delays the whole distributed job
UPS | Uninterruptible power supply
White space | The data hall area where racks live
WUE | Water-use metric for data-centre operation
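PUE, as defined above, is just a ratio of two meter readings. A minimal worked example with invented numbers; real values come from facility and IT metering:

```python
# PUE: total facility energy divided by IT equipment energy.
# The meter readings below are invented for illustration.

def pue(facility_kwh: float, it_kwh: float) -> float:
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return facility_kwh / it_kwh

# A facility drawing 60 MWh while IT equipment consumes 50 MWh of it:
print(round(pue(60_000, 50_000), 2))  # 1.2
```

A PUE of 1.2 means 20% of facility energy went to cooling, power conversion losses, and everything else that is not the IT load itself; a PUE of 1.0 is the unreachable ideal.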

What To Inspect Before Building Or Buying

The first useful artifact is not a hardware shopping list. It is a constraint map.

Ask:

Gate | Question
Workload | What mix of training, fine-tuning, batch inference, online inference, retrieval, and storage must run?
Power | What power is actually available, by date, and with what expansion path?
Cooling | What rack densities must be supported over the next five years?
Network | What east-west bandwidth, latency, and congestion behaviour do jobs need?
Storage | What feed rate, checkpoint rate, metadata rate, and recovery time are required?
Scheduler | How will scarce accelerators be allocated, reserved, preempted, and charged back?
Operations | Who owns maintenance, changes, incidents, capacity, and security?
Supplier risk | Which components have long lead times or single-supplier exposure?
Security | Which physical, logical, personnel, supply-chain, and compliance controls must be inherited or operated?
Exit | What would make the build worse than cloud, colocation, or a managed AI cluster?

The last question is important. Most organizations should not build a serious AI data centre. They should rent, colocate, reserve cloud capacity, or buy a managed cluster until the workload is stable enough and large enough to justify owning the risk.

The builder owns power risk, construction risk, cooling risk, supplier risk, staffing risk, utilization risk, security risk, and technology-refresh risk. The cloud buyer pays a premium to avoid much of that risk. Neither path is automatically superior.
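The constraint map itself can be tracked as data. A minimal sketch of the gate list as a completeness check; the gate names follow the table above, everything else is an illustrative assumption:

```python
# Sketch: the constraint map as a checklist. An empty result means every
# gate has an answer, not that the answers are good. Names are illustrative.

GATES = ["workload", "power", "cooling", "network", "storage",
         "scheduler", "operations", "supplier_risk", "security", "exit"]

def unresolved_gates(answers: dict) -> list:
    """Return the gates that still lack a concrete, dated answer."""
    return [g for g in GATES if not answers.get(g)]

answers = {g: "documented" for g in GATES}
answers["power"] = ""            # no dated power commitment yet
print(unresolved_gates(answers))  # ['power']: do not start the build yet
```

The useful discipline is that an unresolved gate blocks spending, no matter how attractive the hardware quote looks.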

Honest Summary

The cloud is not fake. It is an interface over real machinery.

AI makes the machinery visible because it concentrates demand. The data centre has to deliver power, remove heat, move data, schedule work, protect assets, and recover from failure while the equipment gets denser and more expensive.

The serious question is not "can we get GPUs?"

It is:

Can we get enough power to the right place, remove the heat, feed the accelerators, keep the network stable, operate the fleet safely, and prove to customers that the system is secure?

If the answer is no, the correct decision may be to stay out of the data-centre business. If the answer is yes, the work starts with megawatts, cooling loops, network topology, controls, people, and failure drills.

The model comes later.

References

Footnotes

  1. Amazon Web Services Documentation. AWS Regions and Availability Zones. 2026. https://docs.aws.amazon.com/global-infrastructure/latest/regions/aws-regions-availability-zones.html

  2. Google Cloud Documentation. Geography and regions. 2026. https://cloud.google.com/docs/geography-and-regions

  3. Microsoft Azure. Global Infrastructure. 2026. https://azure.microsoft.com/en-us/explore/global-infrastructure/

  4. Oracle. Public Cloud Regions and Data Centers. 2026. https://www.oracle.com/cloud/architecture-and-regions.html

  5. Dean, J. and Barroso, L.A. "The Tail at Scale." Communications of the ACM, 2013. https://research.google/pubs/the-tail-at-scale/

  6. NVIDIA. GB200 NVL72. 2024-2026 product page. https://www.nvidia.com/en-us/data-center/gb200-nvl72/

  7. NVIDIA Documentation. NVIDIA DGX SuperPOD Reference Architecture Featuring DGX GB200. 2025. https://docs.nvidia.com/dgx-superpod/reference-architecture-scalable-infrastructure-gb200/latest/dgx-superpod-architecture.html

  8. National Renewable Energy Laboratory. Transmission Interconnection Roadmap. 2024. https://www.nrel.gov/grid/transmission-interconnection-roadmap

  9. North American Electric Reliability Corporation. 2025 Long-Term Reliability Assessment. 2025. https://www.nerc.com/pa/RAPA/ra/Pages/default.aspx

  10. International Energy Agency. Energy and AI. 2025. https://www.iea.org/reports/energy-and-ai

  11. International Energy Agency. "Data centre electricity use surged in 2025, even with tightening bottlenecks driving a scramble for solutions." 2026. https://www.iea.org/news/data-centre-electricity-use-surged-in-2025-even-with-tightening-bottlenecks-driving-a-scramble-for-solutions

  12. Lawrence Berkeley National Laboratory / U.S. Department of Energy. 2024 United States Data Center Energy Usage Report. 2024. https://buildings.lbl.gov/publications/2024-lbnl-data-center-energy-usage-report

  13. Electric Power Research Institute. Powering Intelligence: Analyzing Artificial Intelligence and Data Center Energy Consumption. 2024. https://restservice.epri.com/publicdownload/000000003002028905/0/Product

  14. Uptime Institute. Giant data center power plans reach extreme levels. 2026. https://intelligence.uptimeinstitute.com/sites/default/files/2026-01/UI%20Field%20report%20194_Giant%20data%20center%20power%20plans%20reach%20extreme%20levels.pdf

  15. International Organization for Standardization. ISO/IEC 30134-2:2026, Data centres key performance indicators - Power usage effectiveness. 2026. https://www.iso.org/standard/30134-2

  16. The Green Grid. Data Center Power Efficiency Metrics: PUE and DCiE. 2007. https://www.thegreengrid.org/en/resources/library-and-tools/20-Data-Center-Power-Efficiency-Metrics-PUE-and-DCiE

  17. ASHRAE TC 9.9. Thermal Guidelines for Data Processing Environments, 5th edition reference card. 2021 / 2024. https://www.ashrae.org/file%20library/technical%20resources/bookstore/supplemental%20files/therm-gdlns-5th-r-e-refcard.pdf

  18. Open Compute Project. Cooling Environments - Cold Plate. 2025. https://www.opencompute.org/wiki/Cooling_Environments/Cold_Plate

  19. Microsoft Cloud Blog. "Sustainable by design: Next-generation datacenters consume zero water for cooling." 2024-12-09. https://www.microsoft.com/en-us/microsoft-cloud/blog/2024/12/09/sustainable-by-design-next-generation-datacenters-consume-zero-water-for-cooling/

  20. Microsoft Local. "Understanding water use at Microsoft datacenters." 2026. https://local.microsoft.com/blog/understanding-water-use-at-microsoft-datacenters/

  21. Google. Operating sustainably - Google Data Centers. 2025-2026. https://www.datacenters.google/operating-sustainably

  22. Cisco. What Is Data Center Networking? 2025-2026. https://www.cisco.com/site/us/en/learn/topics/computing/what-is-data-center-networking.html

  23. Clos, C. "A Study of Non-Blocking Switching Networks." Bell System Technical Journal, 1953. https://onlinelibrary.wiley.com/doi/10.1002/j.1538-7305.1953.tb01433.x

  24. Google / SIGCOMM. Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network. 2015. https://research.google/pubs/jupiter-rising-a-decade-of-clos-topologies-and-centralized-control-in-googles-datacenter-network-2/

  25. Meta Engineering. "Reinventing our data center network with F16, Minipack." 2019. https://engineering.fb.com/2019/03/14/data-center-engineering/f16-minipack/

  26. InfiniBand Trade Association. InfiniBand - A low-latency, high-bandwidth interconnect. 2025-2026. https://www.infinibandta.org/about-infiniband/

  27. Ultra Ethernet Consortium. "Ultra Ethernet Consortium launches Specification 1.0." 2025-06-11. https://ultraethernet.org/ultra-ethernet-consortium-uec-launches-specification-1-0-transforming-ethernet-for-ai-and-hpc-at-scale/

  28. JEDEC / Business Wire. "JEDEC Publishes HBM3 Update to High Bandwidth Memory Standard." 2022-01-27. https://www.businesswire.com/news/home/20220127005320/en/JEDEC-Publishes-HBM3-Update-to-High-Bandwidth-Memory-HBM-Standard

  29. SK hynix Newsroom. "SK hynix Begins Volume Production of Industry's First HBM3E." 2024-03-19. https://news.skhynix.com/sk-hynix-begins-volume-production-of-industry-first-hbm3e/

  30. Micron. HBM3E. 2025-2026 product page. https://www.micron.com/products/memory/hbm/hbm3e

  31. Samsung Semiconductor. High Bandwidth Memory. 2025-2026 product page. https://semiconductor.samsung.com/dram/hbm/

  32. TSMC Investor Relations. TSMC 2025 Annual Report. 2026. https://investor.tsmc.com/sites/ir/annual-report/2025/2025%20Annual%20Report_E.pdf

  33. NVIDIA. NVIDIA 2025 Annual Report. 2025. https://s201.q4cdn.com/141608511/files/doc_financials/2025/annual/NVIDIA-2025-Annual-Report.pdf

  34. AMD. AMD Instinct MI300X Accelerators. 2024-2026 product page. https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html

  35. Intel. Intel Gaudi 3 AI Accelerator. 2024. https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi.html

  36. Broadcom Investor Relations. Broadcom Annual Reports. 2025. https://investors.broadcom.com/financial-information/annual-reports

  37. NVM Express. NVMe Specifications. 2026. https://nvmexpress.org/developers/nvme-specification/

  38. SNIA. SNIA Emerald Program. 2026. https://www.snia.org/forums/cmsi/programs/emerald

  39. Google / EuroSys. Large-scale cluster management at Google with Borg. 2015. https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/

  40. Kubernetes Documentation. Kubernetes Overview. 2026. https://kubernetes.io/docs/concepts/overview/

  41. SchedMD. Slurm Workload Manager Overview. 2026. https://slurm.schedmd.com/overview.html

  42. OpenTelemetry. What is OpenTelemetry? 2026. https://opentelemetry.io/docs/what-is-opentelemetry/

  43. Microsoft Investor Relations. Microsoft 2025 Annual Report. 2025. https://www.microsoft.com/investor/reports/ar25/index.html

  44. Alphabet Investor Relations. Alphabet 2025 Annual Report. 2026. https://s206.q4cdn.com/479360582/files/doc_financials/2025/q4/GOOG-10-K-2025.pdf

  45. Amazon Investor Relations. Amazon annual reports, proxies and shareholder letters. 2026. https://ir.aboutamazon.com/annual-reports-proxies-and-shareholder-letters/default.aspx

  46. Meta Investor Relations. Meta Reports Second Quarter 2025 Results. 2025. https://investor.fb.com/investor-news/press-release-details/2025/Meta-Reports-Second-Quarter-2025-Results/default.aspx

  47. Dell Technologies. Dell AI Factory with NVIDIA. 2025-2026. https://www.dell.com/en-us/lp/dt/dell-ai-factory-with-nvidia

  48. Hewlett Packard Enterprise. HPE Cray Supercomputing. 2025-2026. https://www.hpe.com/us/en/compute/hpc/supercomputing/cray.html

  49. Supermicro. Supermicro NVIDIA GB200 NVL72. 2025-2026. https://www.supermicro.com/en/products/system/gpu/48u/srs-gb200-nvl72

  50. Lenovo. Lenovo Neptune Liquid Cooling. 2025-2026. https://www.lenovo.com/us/en/servers-storage/solutions/neptune/

  51. Schneider Electric. AI-ready data center solutions. 2025-2026. https://www.se.com/ww/en/work/solutions/for-business/data-centers-and-networks/ai-ready-data-center/

  52. Vertiv Investor Relations. Vertiv 2025 Annual Report. 2026. https://s205.q4cdn.com/554782763/files/doc_financials/2025/ar/Vertiv-2025-Annual-Report.pdf

  53. Eaton. Data centers. 2025-2026. https://www.eaton.com/us/en-us/markets/data-centers.html

  54. ABB. Data centers. 2025-2026. https://new.abb.com/data-centers

  55. Siemens. Data centers. 2025-2026. https://www.siemens.com/global/en/markets/data-centers.html

  56. Cummins. Data Centers. 2025-2026. https://www.cummins.com/generators/data-centers

  57. Caterpillar. Data Center Power Solutions. 2025-2026. https://www.cat.com/en_US/by-industry/electric-power/data-centers.html

  58. Uptime Institute. 2025 Global Data Center Survey. 2025. https://intelligence.uptimeinstitute.com/resource/2025-global-data-center-survey-results-and-crosstabs

  59. Uptime Institute. Annual Outage Analysis 2024. 2024. https://uptimeinstitute.com/resources/research-and-reports/annual-outage-analysis-2024

  60. AWS Trust Center. Data Center - Our Controls. 2026. https://aws.amazon.com/trust-center/data-center/our-controls/

  61. Microsoft Service Assurance. Datacenter physical access security. 2025. https://learn.microsoft.com/en-us/compliance/assurance/assurance-datacenter-physical-access-security

  62. Google Cloud Security. "How Google protects the physical-to-logical space in a data center." 2025. https://cloud.google.com/docs/security/physical-to-logical-space

  63. NIST. SP 800-53 Rev. 5, Security and Privacy Controls for Information Systems and Organizations. 2020, updates through 2025. https://csrc.nist.gov/Pubs/sp/800/53/r5/upd1/Final

  64. International Organization for Standardization. ISO/IEC 27001:2022 Information security management systems. 2022. https://www.iso.org/standard/27001

  65. FedRAMP. FedRAMP baselines. 2026. https://www.fedramp.gov/baselines/

  66. Krizhevsky, A., Sutskever, I., and Hinton, G. "ImageNet Classification with Deep Convolutional Neural Networks." NeurIPS, 2012. https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html

  67. Vaswani, A. et al. "Attention Is All You Need." 2017. https://arxiv.org/abs/1706.03762

  68. Brown, T. et al. "Language Models are Few-Shot Learners." 2020. https://arxiv.org/abs/2005.14165

  69. Kaplan, J. et al. "Scaling Laws for Neural Language Models." 2020. https://arxiv.org/abs/2001.08361

  70. Jouppi, N.P. et al. "In-Datacenter Performance Analysis of a Tensor Processing Unit." 2017. https://arxiv.org/abs/1704.04760

  71. PCI-SIG. PCI Express 6.0 Specification. 2022. https://pcisig.com/pci-express-60-specification

  72. CXL Consortium. Compute Express Link. 2026. https://www.computeexpresslink.org/