AI Factories – Designing for Trillion-Parameter, Real-Time Workloads

While most modern data centers are designed with a standard power capacity of approximately 3-5 kilowatts (kW) per rack to host applications or store data, an "AI Factory" tasked with processing large language models (LLMs) requires a vastly higher power density. The baseline starts at 10 kW per rack, and top-tier systems like the NVIDIA GB200 NVL72 can demand up to 125 kW, roughly 25 to 40 times the draw of a traditional cloud rack.

The critical question for enterprises today is one of readiness: Can your existing power and cooling systems handle this new breed of workload? And when the time comes to expand, what is the most cost-effective and sustainable path forward?

In this article, we summarize key insights from the session "AI Factories – Designing for Trillion-Parameter, Real-Time Workloads." This deep dive features Terry Yin, Senior Data Scientist & Deep Learning System Architect at NVIDIA, and is moderated by Yabodee Chittikuladilok, Chief Data Officer at DataX. Together they decode the architecture behind the "AI Factory" that leading organizations need to understand.

Data Center vs. AI Factory: When Old Standards Become Limitations

Terry Yin highlights a critical turning point: In the past, we built Data Centers to host applications. Today, we need facilities that function as "Production Lines" for intelligence. These two concepts are engineered completely differently, particularly regarding "Power Density," which legacy systems were simply not prepared to support.

To illustrate this clearly, Terry offers a comparison: In a Traditional Data Center, the standard power capacity per rack is around 3-5 kW, which is sufficient for basic IT tasks. However, for an AI Factory running LLMs, the baseline starts at 10 kW per rack. If you move to cutting-edge systems, the requirement skyrockets to 120-125 kW per rack. This figure confirms that running modern AI in legacy environments is virtually impossible without a major retrofit of power and cooling systems.
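As a quick sanity check on those figures, the gap can be worked out directly. The sketch below uses only the per-rack numbers quoted above; the function name and constants are illustrative, not something presented in the session.

```python
# Rack power figures quoted in the session (kW per rack).
TRADITIONAL_KW = (3.0, 5.0)   # typical legacy rack range
AI_BASELINE_KW = 10.0         # entry point for LLM workloads
AI_TOP_KW = (120.0, 125.0)    # cutting-edge systems

def density_ratio(ai_kw: float, legacy_kw: float) -> float:
    """How many legacy racks' worth of power one AI rack draws."""
    return ai_kw / legacy_kw

# Narrowest gap: smallest top-tier figure against the largest legacy figure.
low = density_ratio(AI_TOP_KW[0], TRADITIONAL_KW[1])    # 120 / 5
# Widest gap: largest top-tier figure against the smallest legacy figure.
high = density_ratio(AI_TOP_KW[1], TRADITIONAL_KW[0])   # 125 / 3

print(f"Baseline AI rack vs. legacy rack: {density_ratio(AI_BASELINE_KW, TRADITIONAL_KW[1]):.0f}x or more")
print(f"Top-tier AI rack vs. legacy rack: {low:.0f}x to {high:.0f}x")
```

Even the 10 kW baseline is at least double what a legacy rack was provisioned for, which is why the retrofit question comes up before any hardware is ordered.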

The Cloud vs. Build Crossroads: Deciding with Data Pipelines and Interconnects

When choosing between building your own or using the Cloud, Terry suggests looking at your organization’s "Data Pipeline" first. If your data is largely Cloud-native, migrating it out takes time. However, the true indicators of success aren't just about data location, but "Compute Supply" and "Interconnect."

The crucial test for any Cloud provider is this: Can they deliver GPU Clusters connected via cutting-edge networking? The goal is to combine the compute power of individual servers so that massive numbers of instances act as "One System." This is the heart of large-scale model training, a requirement that general-purpose clouds often fail to meet.

The Utility Mindset Trap and the Cost of Waiting

Many executives believe they should wait for AI technology to become as stable and standardized as electricity before investing. Terry argues that this "Utility Mindset" puts organizations at a disadvantage. AI is not a plug-and-play commodity; it is a technology that relies on "learning" your organization's specific data to achieve maximum efficiency.

Therefore, the real risk lies in The Cost of Waiting. Waiting for technological perfection causes organizations to miss the opportunity to build "In-house Capability." The later you start, the longer your team goes without the critical understanding of how to tune models to your proprietary data, a skill that takes time to accumulate.

Behind Trillion-Parameter Success: Stability is Key

When training trillion-parameter models, the ultimate challenge isn't just speed, but Stability and Continuity. Terry explains that at this scale, we are talking about loading a 1 TB model across 1,000 GPU servers simultaneously. A minor error here can lead to major losses.
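To see where a figure like 1 TB comes from, it helps to multiply the parameter count by the bytes stored per parameter. The sketch below is a rough illustration that counts weights only; it assumes standard byte widths for each numeric format and ignores optimizer state, activations, and other training overhead, which the session did not quantify.

```python
# Bytes per parameter for common numeric formats.
BYTES_PER_PARAM = {"FP8": 1, "BF16": 2, "FP32": 4}

def weights_size_tb(num_params: float, precision: str) -> float:
    """Raw weight storage in terabytes (1 TB = 1e12 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e12

# One trillion parameters at each precision.
for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weights_size_tb(1e12, precision):.0f} TB")
```

Under these assumptions, a trillion-parameter model is about 1 TB at one byte per parameter and grows proportionally at higher precision.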

This is where Network Vitality plays a role. If the network isn't robust, restarting the system after a failure can take a long time, severely impacting budgets and timelines. This is why NVIDIA prioritizes technologies like Blackwell Ultra (optimized by stripping away whatever is not essential for AI) and the NVLink network system. These connect massive numbers of GPUs to act as a single unit, providing high Resilience, enabling fast Recovery, and keeping the work running with minimal interruption.
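Why small error rates matter at this scale can be shown with a back-of-the-envelope model. The sketch below is illustrative only: it assumes servers fail independently with the same probability, and the failure rate, restart time, and checkpoint interval are hypothetical placeholders rather than numbers from the session.

```python
def p_any_failure(num_servers: int, p_single: float) -> float:
    """Probability that at least one server fails in a time window,
    assuming independent failures with equal probability p_single."""
    return 1.0 - (1.0 - p_single) ** num_servers

def lost_hours_per_failure(restart_hours: float, checkpoint_interval_hours: float) -> float:
    """Rough wall-clock loss per failure: the restart itself plus, on average,
    half a checkpoint interval of work that has to be redone."""
    return restart_hours + checkpoint_interval_hours / 2.0

# Hypothetical inputs: 1,000 servers, each with a 0.1% chance of failing per day.
p_job = p_any_failure(num_servers=1000, p_single=0.001)
lost = lost_hours_per_failure(restart_hours=1.0, checkpoint_interval_hours=2.0)

print(f"Chance the job is interrupted on a given day: {p_job:.0%}")
print(f"Cluster-wide hours lost per interruption: {lost:.1f}")
```

With these placeholder numbers, a failure rate that looks negligible for one server becomes a likely daily event for the whole job, which is why fast recovery weighs as heavily as raw speed.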

Real-Time Inference: Speed Exchanged for Understanding

In the battleground of real-time inference, the challenge shifts from model intelligence to Latency and Throughput.

Terry emphasizes that building a system that is both fast and stable is "not easy." Enterprises must "map the Workflow" of their applications to find two often-overlooked points:

Hot Spots: Bottlenecks that consume the most resources.
Fragile Points: Vulnerable spots where a single failure can break the entire workflow.
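One way to picture this mapping exercise is to list each stage of the workflow with its cost and its redundancy, then flag the outliers. The sketch below is a minimal illustration: the stage names, latencies, and replica counts are hypothetical, and treating "highest latency" as the hot spot and "single replica" as the fragile point is a simplifying assumption.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    latency_ms: float   # average time spent in this stage per request
    replicas: int       # independent instances able to serve this stage

def hot_spots(stages, top_n=1):
    """Stages that consume the most time per request."""
    return sorted(stages, key=lambda s: s.latency_ms, reverse=True)[:top_n]

def fragile_points(stages):
    """Stages with no redundancy: one failure here breaks the whole workflow."""
    return [s for s in stages if s.replicas <= 1]

# Hypothetical inference workflow.
workflow = [
    Stage("tokenize", 2.0, 4),
    Stage("retrieve_context", 35.0, 1),
    Stage("llm_generate", 420.0, 8),
    Stage("post_process", 5.0, 2),
]

print([s.name for s in hot_spots(workflow)])        # ['llm_generate']
print([s.name for s in fragile_points(workflow)])   # ['retrieve_context']
```

Note that the two lists need not overlap: in this example the slowest stage is well replicated, while a cheap stage turns out to be the single point of failure.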

Furthermore, because AI technology moves so fast, last month's blueprint might already be obsolete. NVIDIA shares Reference Architectures that gather global best practices and are updated on a nearly "Monthly Basis." Studying these blueprints helps organizations save trial-and-error time and close risk gaps.

How to Start? 3 Steps to Turn Vision into a Tangible AI Factory

For organizations wanting to start building their own AI Factory but unsure where to invest, Terry advises: "Don't rush to buy Hardware; start with the Model first" through 3 key steps:

Select the Model: Starting isn't just about subscribing to a generic Chatbot. It's about selecting the "Model Family" that best fits your business problem to serve as the project's Anchor Point.
Stick to a Roadmap: Make a strategic decision: will you "follow" the roadmap of that model's family, or will you "build your own"?
Map Sizing to Infra: Once the model is selected, infrastructure size shouldn't be a guess. It is automatically derived from the Model Size, Number of Users, and Desired Experience.
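The third step can be sketched as a small calculation that takes Model Size, Number of Users, and Desired Experience as inputs. This is a simplified illustration, not NVIDIA's sizing method: the memory overhead factor, the per-replica throughput, and every number in the example are assumptions that would need to be measured for a real deployment.

```python
import math

def gpus_per_replica(model_gb: float, gpu_mem_gb: float, overhead: float = 1.2) -> int:
    """GPUs needed to hold one copy of the model, with headroom for runtime state.
    The 1.2 overhead factor is an assumed placeholder."""
    return math.ceil(model_gb * overhead / gpu_mem_gb)

def replicas_needed(users: int, tokens_per_s_per_user: float, replica_tokens_per_s: float) -> int:
    """Model copies needed to serve all concurrent users at the desired speed."""
    return math.ceil(users * tokens_per_s_per_user / replica_tokens_per_s)

def total_gpus(model_gb, gpu_mem_gb, users, tokens_per_s_per_user, replica_tokens_per_s):
    """Total GPU count: GPUs per model copy times the number of copies."""
    return gpus_per_replica(model_gb, gpu_mem_gb) * replicas_needed(
        users, tokens_per_s_per_user, replica_tokens_per_s
    )

# Hypothetical inputs: 140 GB model, 80 GB GPUs, 200 concurrent users,
# 20 tokens/s per user, and a measured 1,500 tokens/s per model replica.
print(total_gpus(140, 80, 200, 20, 1500))   # 3 GPUs per replica x 3 replicas = 9
```

Changing any one input, such as doubling the user count or asking for faster responses, changes the GPU count in a traceable way, which is what keeps the hardware budget tied to the model decision rather than to guesswork.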

Following this sequence allows organizations to turn anxiety about massive spending into Precision Investment. It guards against both Over-spec (wasted budget) and Under-spec (system crashes), and most importantly, it allows the organization to take a secure first step toward becoming an AI Factory without waiting for the technology to be perfect.

Watch the full session here: https://youtu.be/0myfCFwdeQw?si=48bEaU67RuG7AkVo

#AIVOLUTION #SCB10X #NVIDIA #AIFactory #AIInfrastructure #EnterpriseAI #DataCenter #Blackwell #GenerativeAI #TrillionParameter #DigitalTransformation