The infrastructure powering IBM's Gen AI model development
AI infrastructure plays a key role in the speed and cost-competitiveness of
developing and deploying advanced AI models. The current demand for powerful AI
infrastructure for model training is driven by the emergence of generative AI
and foundation models, where thousands of GPUs must sometimes cooperate on
a single training job for the model to be trained in a reasonable time.
Delivering efficient and high-performing AI training requires an end-to-end
solution that combines hardware, software and holistic telemetry to cater for
multiple types of AI workloads. In this report, we describe IBM's hybrid cloud
infrastructure that powers our generative AI model development. This
infrastructure includes (1) Vela: an AI-optimized supercomputing capability
directly integrated into the IBM Cloud, delivering scalable, dynamic,
multi-tenant and geographically distributed infrastructure for large-scale
model training and other AI workflow steps and (2) Blue Vela: a large-scale,
purpose-built, on-premises hosting environment that is optimized to support our
largest and most ambitious AI model training tasks. Vela provides IBM with the
dual benefit of high performance for internal use along with the flexibility to
adapt to an evolving commercial landscape. Blue Vela enables rapid
development of our largest and most ambitious models and future-proofs us
against the evolving model landscape in the industry. Taken
together, they provide IBM with the ability to rapidly innovate in the
development of both AI models and commercial offerings.
Comment: Corresponding Authors: Talia Gershon, Seetharami Seelam, Brian
Belgodere, Milton Bonill
