Executive Summary

Amazon SageMaker HyperPod supports large-scale artificial intelligence (AI) model development with resilient, cost-effective infrastructure, as well as access to the latest hardware. With Amazon SageMaker HyperPod, customers save time across the model-training lifecycle from pre-training to training, fine-tuning, and inference. Faster AI compute infrastructure accelerates time to market for new products, driving incremental revenue.

Amazon SageMaker HyperPod offers purpose-built infrastructure to support large-scale AI model training.

Amazon commissioned Forrester Consulting to conduct a Total Economic Impact™ (TEI) study and examine the potential return on investment (ROI) enterprises may realize by deploying SageMaker HyperPod.¹ The purpose of this study is to provide readers with a framework to evaluate the potential financial impact of SageMaker HyperPod on their organizations.

178%

Return on investment (ROI)

$20.6M

Net present value (NPV)

To better understand the benefits, costs, and risks associated with this investment, Forrester interviewed four decision-makers with experience using SageMaker HyperPod. For the purposes of this study, Forrester aggregated the experiences of the interviewees and combined the results into a single composite organization, which is a rapidly growing global organization that develops AI models as a key driver of its business strategy.

Interviewees said that prior to using SageMaker HyperPod, their organizations struggled to train large AI models quickly and cost-effectively. Their prior AI model-training infrastructure was expensive. Each model-training setup could take several months, and AI model-training runs were disrupted by node failures. The AI model development teams had to spend a significant amount of time debugging issues and replacing failed nodes. These issues increased the time required to train AI models and created a potential opportunity cost by delaying new product launches.

After the investment in SageMaker HyperPod, the interviewees had a cost-effective infrastructure for AI model training. AI training runs ran smoothly with minimal disruption. Training development teams quickly set up the infrastructure for new models. They spent less time replacing nodes that failed during training runs because Amazon SageMaker HyperPod automated that work. With faster model training, new AI models and products reached the market faster, accelerating revenue growth.

Key Findings

Quantified benefits. Three-year, risk-adjusted present value (PV) quantified benefits for the composite organization include:

Technical team time savings of 88% for AI model-training infrastructure setup. The composite organization’s infrastructure team sets up the infrastructure for new model training more easily and quickly with Amazon SageMaker HyperPod. Amazon SageMaker HyperPod integrations with Slurm and Amazon EKS help orchestrate cluster creation. The time to set up infrastructure and software to train a new AI model drops from eight weeks to seven days. This benefit is worth $1.0 million PV to the composite organization over three years.
Technical team time savings of 98% for AI model training. When a node fails during a training run, Amazon SageMaker HyperPod identifies the underlying issue, replaces the node, and restarts the training job automatically. Before Amazon SageMaker HyperPod, manually debugging and replacing failed nodes took the composite organization’s model development team 24 hours per failure, with training paused until repairs were completed. After implementing Amazon SageMaker HyperPod, recovery averages just 30 minutes. This benefit is worth $433,000 PV to the composite organization over three years.
Optimized AI model-training infrastructure cost savings of 50%. The composite organization halves the cost of AI model training with Amazon SageMaker HyperPod. The underlying Amazon compute infrastructure is competitively priced compared to alternatives. This benefit is worth $18.9 million PV to the composite organization over three years.
Faster AI model training with improved infrastructure utilization for a 42.6% improvement in infrastructure availability. Amazon SageMaker HyperPod offers advanced resiliency so that even when a node fails, it can be isolated, allowing the model-training run to continue. In addition, SageMaker HyperPod offers features such as task governance and observability to maximize compute resource utilization. The composite organization can complete more model training with the same infrastructure since downtime and delays are minimized. This benefit is worth $4.2 million PV to the composite organization over three years.
Faster time to market, generating $9.3 million incremental profit over three years. The composite organization uses Amazon SageMaker HyperPod primarily for model development, and a number of those models are put into production, generating incremental revenue worth $7.6 million PV to the composite organization over three years.

Unquantified benefits. Benefits that provide value for the composite organization but are not quantified for this study include:

AWS partnership and support. AWS’s partnership and timely customer support help the composite organization navigate any issues.
Access to the latest hardware. The composite organization benefits from access to the latest and fastest GPUs with Amazon SageMaker HyperPod, including AWS Trainium chips, a family of AI chips purpose-built by AWS for AI training and inference.
Security. Amazon SageMaker HyperPod provides robust enterprise-grade security to protect the composite organization’s AI workloads and data. Amazon SageMaker HyperPod leverages AWS Identity and Access Management (IAM) for authentication and authorization, allowing the composite organization to define permissions to control who can access HyperPod resources.

Costs. Three-year, risk-adjusted PV costs for the composite organization include:

Amazon SageMaker HyperPod cost of $11.2 million PV over three years. The composite organization pays Amazon an annual fee for SageMaker HyperPod and for data storage.
Installation and maintenance cost of $381,000 PV over three years. The composite organization transitions to Amazon SageMaker HyperPod very quickly. The composite organization runs a pilot for three weeks and then spends one week transitioning to Amazon SageMaker HyperPod. One-half (50%) of a full-time equivalent (FTE) technical team member’s time is devoted to maintaining the platform.

The financial analysis based on the interviews found that a composite organization experiences benefits of $32.2 million over three years versus costs of $11.6 million, adding up to a net present value (NPV) of $20.6 million and an ROI of 178%.
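The headline figures follow from the two present-value totals. The sketch below is a minimal illustration, assuming the standard definitions (NPV as benefits PV minus costs PV, and ROI as NPV divided by costs PV). The function names are illustrative, and the inputs are the unrounded totals from the Total Benefits and Total Costs tables in this study.

```python
# Unrounded three-year, risk-adjusted present values from the study's tables.
BENEFITS_PV = 32_176_804  # Total benefits (risk-adjusted), Present Value column
COSTS_PV = 11_570_624     # Total costs (risk-adjusted), Present Value column


def net_present_value(benefits_pv: float, costs_pv: float) -> float:
    """NPV, assumed here to be benefits PV minus costs PV."""
    return benefits_pv - costs_pv


def return_on_investment(benefits_pv: float, costs_pv: float) -> float:
    """ROI as a fraction, assumed here to be net benefits divided by costs."""
    return net_present_value(benefits_pv, costs_pv) / costs_pv


npv = net_present_value(BENEFITS_PV, COSTS_PV)
roi = return_on_investment(BENEFITS_PV, COSTS_PV)
print(f"NPV: ${npv:,.0f}")
print(f"ROI: {roi:.0%}")
```

Rounded to one decimal in millions, the inputs are $32.2 million and $11.6 million, and the outputs match the reported $20.6 million NPV and 178% ROI.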

“HyperPod has been rock solid for large-scale model training. AWS lived up to the hype.”

CTO, healthcare

Key Statistics

178%

Return on investment (ROI)

$32.2M

Benefits PV

$20.6M

Net present value (NPV)

<6 months

Payback

Benefits (Three-Year)

[Chart: three-year benefits by category. Categories: technical team time savings on AI model-training infrastructure setup; technical team time savings on AI model training; optimized AI model-training infrastructure cost savings; faster AI model training with improved infrastructure utilization; faster time to market.]

The Amazon Customer Journey

Drivers leading to the SageMaker HyperPod investment

Interviews

Role	Industry	Region	Revenue (USD)
Technical staff	Biotech	Europe	$2 million
Research engineer	Collaboration platform	North America	$50 million
CTO	Healthcare	Global	Not applicable
Chief scientist	Media	North America	$8 million

Key Challenges

The interviewees used a variety of AI model-training approaches before adopting Amazon SageMaker HyperPod, including running virtual machines (VMs) on bare metal or using clusters on alternative cloud infrastructure. One of the interviewees’ organizations was a startup and selected Amazon SageMaker HyperPod as its initial platform after exploring alternatives and running proof-of-concept (POC) tests with several vendors.

Interviewees noted how their organizations struggled with common challenges, including:

Challenges executing large AI model-training runs. The interviewees regularly trained large, multi-node AI models. Their prior infrastructure solutions did not support efficient model training since training would often be disrupted if a cluster node failed. Node replacement often required manual effort, and it was time-consuming and inefficient to debug and resolve issues.
Time-intensive AI model setup. It could take several months to set up clusters on the prior infrastructure. This delayed model training and resulted in delays in getting new products to market.
Expensive infrastructure. Interviewees shared that their prior AI model-training infrastructure was costly and did not scale.

Solution Requirements

The interviewees searched for a solution that could:

Support efficient, fault-tolerant training of large AI models.
Offer access to the latest hardware.
Ensure enterprise-grade security.
Provide excellent customer support.

“The more nodes you have available, the higher the likelihood of various types of failures and the more important it is to be able to detect those and triage them properly.”

Technical staff, biotech

“During our POC, we looked at other platforms, but things didn’t work — there was a lot of overpromising and underdelivery. But that was not the case with AWS. In a good way, there were really no surprises.”

CTO, healthcare

Composite Organization

Based on the interviews, Forrester constructed a TEI framework, a composite company, and an ROI analysis that illustrates the areas financially affected. The composite organization is representative of the interviewees’ organizations, and it is used to present the aggregate financial analysis in the next section. The composite organization has the following characteristics:

Description of composite. The composite organization is a rapidly growing global organization. AI model development is a key component of its business strategy. Before deploying Amazon SageMaker HyperPod, the composite organization ran its own AI model-training infrastructure, manually managing on-premises bare-metal GPUs and cloud-hosted VM instances.
Deployment characteristics. Twenty model developers use Amazon SageMaker HyperPod to create AI models at the composite organization. Amazon SageMaker HyperPod is used primarily for AI model training but can also be used for AI model inference. The model developers perform 50 AI model-training runs annually. On average, each training run takes one week. The model developers draw on 100TB of data to train the AI models. Data is typically stored in lower-cost cold storage (such as Amazon Glacier) and then moved into high-performance storage (such as Amazon FSx for Lustre) when it is required for AI model training. On average, each AI model has 20 nodes. Nvidia H100/H200 Tensor Core GPUs provide compute power for AI model training.

KEY ASSUMPTIONS

20 model developers
50 AI model-training runs annually
20 nodes per AI model

Analysis Of Benefits

Quantified benefit data as applied to the composite

Total Benefits

Ref.	Benefit	Year 1	Year 2	Year 3	Total	Present Value
Atr	Technical team time savings on AI model-training infrastructure setup	$347,776	$417,331	$500,797	$1,265,904	$1,037,318
Btr	Technical team time savings on AI model training	$145,236	$174,283	$209,140	$528,659	$433,198
Ctr	Optimized AI model-training infrastructure cost savings	$6,307,200	$7,568,640	$9,145,440	$23,021,280	$18,859,997
Dtr	Faster AI model training with improved infrastructure utilization	$1,418,069	$1,701,683	$2,056,200	$5,175,952	$4,240,356
Etr	Faster time to market	$2,550,000	$3,060,000	$3,672,000	$9,282,000	$7,605,935
	Total benefits (risk-adjusted)	$10,768,281	$12,921,937	$15,583,577	$39,273,795	$32,176,804
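The Present Value column can be approximated from the yearly columns. The sketch below is a hedged illustration: it assumes each yearly amount is discounted as an end-of-year cash flow at the 10% rate quoted in the Results paragraphs, and the helper name `present_value` is illustrative rather than part of the TEI methodology.

```python
DISCOUNT_RATE = 0.10  # rate quoted in the Results paragraphs of this study


def present_value(yearly_amounts, rate=DISCOUNT_RATE):
    """Discount amounts for Years 1..N, assuming end-of-year cash flows."""
    return sum(
        amount / (1 + rate) ** year
        for year, amount in enumerate(yearly_amounts, start=1)
    )


# Risk-adjusted yearly benefits (Year 1, Year 2, Year 3) and the published PV.
BENEFITS = {
    "Atr": ((347_776, 417_331, 500_797), 1_037_318),
    "Btr": ((145_236, 174_283, 209_140), 433_198),
    "Ctr": ((6_307_200, 7_568_640, 9_145_440), 18_859_997),
    "Dtr": ((1_418_069, 1_701_683, 2_056_200), 4_240_356),
    "Etr": ((2_550_000, 3_060_000, 3_672_000), 7_605_935),
}

for ref, (years, published_pv) in BENEFITS.items():
    print(f"{ref}: total ${sum(years):,}, "
          f"computed PV ${present_value(years):,.0f}, "
          f"published PV ${published_pv:,}")

total_pv = sum(present_value(years) for years, _ in BENEFITS.values())
print(f"Total computed PV: ${total_pv:,.0f}")
```

Computed values can differ from the published ones by about a dollar, which is consistent with the yearly figures in the table having been rounded before publication.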

Technical Team Time Savings On AI Model-Training Infrastructure Setup

Evidence and data. The interviewees found it easy to get started with Amazon SageMaker HyperPod, and it was faster to set up the infrastructure for AI model training compared to prior solutions. The technical teams spent less time on model setup and could devote that time to higher-value tasks.

Amazon SageMaker HyperPod offers Slurm and Amazon Elastic Kubernetes Service (EKS) integrations to orchestrate cluster creation.
Slurm support in Amazon SageMaker HyperPod helps users provision resilient clusters for running workloads to develop AI models. The technical staff member at a biotech organization explained: “Amazon SageMaker HyperPod is out of the box with a Slurm cluster. If we just had bare-metal nodes, it would have taken us a lot longer to go and build up the infrastructure.” He estimated that it took several months of work to set up a Slurm cluster before Amazon SageMaker HyperPod.
He added: “Setting up a Slurm cluster can be a bit painful, but HyperPod has a lot of tooling and automated scripts that help us set up Slurm in a way that just works for us. It has saved us a lot of time. When it comes to setting up the cluster, we’ve completely relied on the automated script.” Without Amazon SageMaker HyperPod, his organization would have had to hire a full-time cluster admin to manage the process.

Modeling and assumptions. Based on the interviews, Forrester assumes the following about the composite organization:

The composite organization conducts 50 AI model-training runs per year in Year 1.
Twenty percent of these model-training runs require unique setup.
Before Amazon SageMaker HyperPod, it took eight weeks to set up the infrastructure for an AI model-training run.
Time savings with Amazon SageMaker HyperPod is 88%.
The fully burdened hourly rate for a technical team member is $130 per hour.

Risks. The expected financial impact is subject to risks and variation based on several factors:

The percentage of model-training runs that require setup.
The fully burdened salary for model developers.

Results. To account for these risks, Forrester adjusted this benefit downward by 5%, yielding a three-year, risk-adjusted total PV (discounted at 10%) of $1.0 million.

88%

Technical team time savings for AI model setup

“HyperPod gave us an easy way to get started, and it just worked out of the box. We were able to bootstrap open source, customize it, add our value, and get along with our training.”

CTO, healthcare

Technical Team Time Savings On AI Model-Training Infrastructure Setup

Ref.	Metric	Source		Year 1	Year 2	Year 3
A1	AI model-training runs annually	Composite		50	60	72
A2	Percentage of AI model-training runs that require unique setup	Composite		20%	20%	20%
A3	Time to set up an AI model before Amazon SageMaker HyperPod (weeks)	Interviews		8	8	8
A4	Technical team time savings with Amazon SageMaker HyperPod	Interviews		88%	88%	88%
A5	Fully burdened hourly rate for a technical team staff member	Composite		$130	$130	$130
At	Technical team time savings on AI model-training infrastructure setup	A1*A2*A3*40*A4*A5		$366,080	$439,296	$527,155
	Risk adjustment	↓5%
Atr	Technical team time savings on AI model-training infrastructure setup (risk-adjusted)			$347,776	$417,331	$500,797
Three-year total: $1,265,904			Three-year present value: $1,037,318
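The arithmetic in this table can be reproduced in a few lines. The sketch below is illustrative only: the function and constant names are hypothetical, and the constant 40 in the formula is read here as working hours per week, which the table itself does not state.

```python
HOURS_PER_WEEK = 40      # assumption: interpretation of the constant 40 in row At
RISK_ADJUSTMENT = 0.05   # downward adjustment stated in the Results paragraph


def setup_savings(runs, unique_share=0.20, weeks_before=8,
                  time_savings=0.88, hourly_rate=130):
    """Row At: A1 * A2 * A3 * 40 * A4 * A5, before risk adjustment."""
    return (runs * unique_share * weeks_before * HOURS_PER_WEEK
            * time_savings * hourly_rate)


RUNS_PER_YEAR = (50, 60, 72)  # row A1, Years 1 to 3

gross = [round(setup_savings(runs)) for runs in RUNS_PER_YEAR]
adjusted = [round(setup_savings(runs) * (1 - RISK_ADJUSTMENT))
            for runs in RUNS_PER_YEAR]

for year, (g, a) in enumerate(zip(gross, adjusted), start=1):
    print(f"Year {year}: At = ${g:,}, Atr = ${a:,}")
```

With the table's inputs, the results match rows At and Atr for all three years.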

Technical Team Time Savings On AI Model Training

Evidence and data. The interviewees found it easier and faster to debug issues and replace instances with Amazon SageMaker HyperPod. This meant that their technical teams could use the time they saved on higher-value tasks.

The interviewees shared that before Amazon SageMaker HyperPod, when a node failed, it would take a long time to manually replace the node and identify what caused the problem. With Amazon SageMaker HyperPod, failed nodes were replaced almost instantaneously. The interviewees estimated that before Amazon SageMaker HyperPod, debugging and replacing nodes could take several days or up to two weeks. Now with Amazon SageMaker HyperPod, it takes between 30 minutes and 3 hours.
The chief scientist at a media organization explained: “Sometimes there are issues with the hardware where we need to replace the instance. That doesn’t happen constantly but pretty regularly, and Amazon gives us a reasonable interface to handle the replacements. This is where SageMaker HyperPod is interesting.” He estimated that before Amazon SageMaker HyperPod, it could take a week or two to replace an instance; with Amazon SageMaker HyperPod, it is a few hours, or at most, one day. Prior to implementing Amazon SageMaker HyperPod, they would have had to hire an additional site reliability engineer (SRE) to provide support.
The technical staff member at a biotech organization shared, “The spare nodes were very helpful for us to swap in and swap out.” He added: “To replace a node, it’s a matter of running a single command line. So once we know that we would like to replace a node, fixing it takes a few seconds to issue the command, but then HyperPod will take care of it. In total, to replace a node for us takes half an hour, so that’s great.”
The research engineer at a collaboration platform noted: “Having provisioned clusters with HyperPod really helps enable us to debug quicker. Before if we need to debug a log stream from a serverless architecture training job, it would take a day or two to just sift through the logs.”

Modeling and assumptions. Based on the interviews, Forrester assumes the following about the composite organization:

The composite organization conducts 50 AI model-training runs per year.
The composite organization experiences one disruption per model per week.
Prior to Amazon SageMaker HyperPod, debugging and replacing failed instances required 24 hours of technical team time.
The reduction in time spent debugging and replacing failed instances with Amazon SageMaker HyperPod is 98%.
The fully burdened hourly rate for a model developer is $130.

Risks. The expected financial impact is subject to risks and variation based on several factors:

The number of AI model-training disruptions.
The skill of the technical team in resolving disruptions.
Technical team salary.

Results. To account for these risks, Forrester adjusted this benefit downward by 5%, yielding a three-year, risk-adjusted total PV (discounted at 10%) of $433,000.

98%

Technical team time savings on model training

“The main benefit of HyperPod is the reliability for training. When I run training, I don’t want my hardware to fail intermittently and I have to debug because these are multiday trainings. I don’t have time for that.”

Research engineer, collaboration platform

Technical Team Time Savings On AI Model Training

Ref.	Metric	Source		Year 1	Year 2	Year 3
B1	AI model-training runs annually	Composite		50	60	72
B2	Disruptions per model per week before Amazon SageMaker HyperPod	Interviews		1	1	1
B3	Time debugging and/or replacing nodes before Amazon SageMaker HyperPod (hours per model)	Interviews		24	24	24
B4	Technical team time savings with Amazon SageMaker HyperPod	Interviews		98%	98%	98%
B5	Fully burdened hourly rate for a technical team staff member	Composite		$130	$130	$130
Bt	Technical team time savings on AI model training	B1*B2*B3*B4*B5		$152,880	$183,456	$220,147
	Risk adjustment	↓5%
Btr	Technical team time savings on AI model training (risk-adjusted)			$145,236	$174,283	$209,140
Three-year total: $528,659			Three-year present value: $433,198
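As a worked check of this table, the sketch below recomputes rows Bt and Btr and also shows how a 98% reduction on 24 hours relates to the roughly 30-minute recovery time cited in the study. The names are illustrative, and treating B2 as one disruption per one-week run is an assumption drawn from the composite description.

```python
RISK_ADJUSTMENT = 0.05  # downward adjustment stated in the Results paragraph


def recovery_savings(runs, disruptions_per_run=1, hours_before=24,
                     time_savings=0.98, hourly_rate=130):
    """Row Bt: B1 * B2 * B3 * B4 * B5, before risk adjustment."""
    return runs * disruptions_per_run * hours_before * time_savings * hourly_rate


RUNS_PER_YEAR = (50, 60, 72)  # row B1, Years 1 to 3

gross = [round(recovery_savings(runs)) for runs in RUNS_PER_YEAR]
adjusted = [round(recovery_savings(runs) * (1 - RISK_ADJUSTMENT))
            for runs in RUNS_PER_YEAR]

# Time that remains per disruption after a 98% reduction on 24 hours.
minutes_after = 24 * (1 - 0.98) * 60

for year, (g, a) in enumerate(zip(gross, adjusted), start=1):
    print(f"Year {year}: Bt = ${g:,}, Btr = ${a:,}")
print(f"Remaining effort per disruption: about {minutes_after:.0f} minutes")
```

The remaining effort works out to just under half an hour, in line with the 30-minute figure the interviewees reported.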

Optimized AI Model-Training Infrastructure Cost Savings

Evidence and data. The interviewees found that the overall cost of AI model training was significantly lower with Amazon SageMaker HyperPod than with their prior solutions, even though the prior states varied by company. Interviewees estimated that they experienced infrastructure cost savings ranging from 20% to 60% with Amazon SageMaker HyperPod.

The interviewees used a variety of AI model-training approaches before adopting Amazon SageMaker HyperPod, including running VMs on bare metal or using clusters on alternative cloud infrastructure. One of the interviewees’ organizations was a startup and selected Amazon SageMaker HyperPod as its initial platform after exploring alternatives and running POC tests with several vendors.
The technical staff member at a biotech organization shared: “Model training is for sure cheaper now [with Amazon SageMaker HyperPod]. Before, we would run on-demand instances in various cloud providers. Those are extremely expensive, so even though we would only spin up the nodes when we needed them, they would be really expensive. Now we have a set of dedicated nodes with AWS, and the pricing is good, so we have significant savings versus what we would have paid before.”
He added, “There’s no point in training models outside of HyperPod because it’s going to be more expensive.”
The research engineer at a collaboration platform said that the cost savings with Amazon SageMaker HyperPod were “substantial.” His company used clusters on an alternative cloud platform before moving to Amazon SageMaker HyperPod.

Modeling and assumptions. Based on the interviews, Forrester assumes the following about the composite organization:

The prior AI model-training infrastructure cost was $5 per GPU hour. Before deploying Amazon SageMaker HyperPod, the composite organization managed its own AI model-training infrastructure and manually managed on-premises bare-metal GPUs and cloud-hosted VM instances.
AI model infrastructure costs were reduced by 50% with Amazon SageMaker HyperPod.
The infrastructure (GPUs) required for AI model training increases by 20% each year as the composite organization grows and expands AI model training.

Risks. The expected financial impact is subject to risks and variation based on several factors:

Infrastructure pricing.
Compute requirements and selections.

Results. To account for these risks, Forrester adjusted this benefit downward by 10%, yielding a three-year, risk-adjusted total PV (discounted at 10%) of $18.9 million.

50%

Infrastructure cost savings

“As a startup, it’s important to have an affordable contract for compute.”

Technical staff, biotech

Optimized AI Model-Training Infrastructure Cost Savings

Ref.	Metric	Source		Year 1	Year 2	Year 3
C1	Prior model-training infrastructure cost (dollars per GPU hour)	Interviews		$5	$5	$5
C2	Percent cost savings with Amazon SageMaker HyperPod	Interviews		50%	50%	50%
C3	GPUs	Composite		160	192	232
Ct	Optimized AI model-training infrastructure cost savings	C1*C3*24*365		$7,008,000	$8,409,600	$10,161,600
	Risk adjustment	↓10%
Ctr	Optimized AI model-training infrastructure cost savings (risk-adjusted)			$6,307,200	$7,568,640	$9,145,440
Three-year total: $23,021,280			Three-year present value: $18,859,997
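The figures in this table can be reproduced from the formula as printed in row Ct. The sketch below is a hedged illustration with hypothetical names. It follows the printed formula (C1 times C3 times 24 hours times 365 days), which reproduces the published values; row C2 is listed in the table but does not appear in that formula, and the sketch does not attempt to reconcile that.

```python
HOURS_PER_YEAR = 24 * 365  # GPUs are assumed to be billed around the clock
RISK_ADJUSTMENT = 0.10     # downward adjustment stated in the Results paragraph

PRIOR_COST_PER_GPU_HOUR = 5     # row C1, dollars
GPUS_PER_YEAR = (160, 192, 232)  # row C3, Years 1 to 3, as published


def infrastructure_savings(gpus, cost_per_gpu_hour=PRIOR_COST_PER_GPU_HOUR):
    """Row Ct as printed: C1 * C3 * 24 * 365, before risk adjustment."""
    return cost_per_gpu_hour * gpus * HOURS_PER_YEAR


gross = [infrastructure_savings(gpus) for gpus in GPUS_PER_YEAR]
adjusted = [round(value * (1 - RISK_ADJUSTMENT)) for value in gross]

for year, (g, a) in enumerate(zip(gross, adjusted), start=1):
    print(f"Year {year}: Ct = ${g:,}, Ctr = ${a:,}")
```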

Faster AI Model Training With Improved Infrastructure Utilization

Evidence and data. Amazon SageMaker HyperPod offered resiliency so that even if a node failed, it could be isolated, allowing the AI model-training run to continue. This allowed the interviewees’ organizations to optimize their AI model infrastructure. They could complete more model training with the same infrastructure since downtime and delays were minimized.

The technical staff member at a biotech organization shared, “[With Amazon SageMaker HyperPod], our training runs are fine — they’re set and forget most of the time.”
The research engineer at a collaboration platform noted: “[With Amazon SageMaker HyperPod], the clusters work seamlessly. It saves everybody’s time.”
This resiliency allowed Amazon SageMaker HyperPod to improve mean time between failures (MTBF). The chief scientist at a media organization shared, “Amazon SageMaker HyperPod improved MTBF by about 10%.”

Modeling and assumptions. Based on the interviews, Forrester assumes the following about the composite organization:

AI model-training runs were stalled or disrupted 72 hours each week before using Amazon SageMaker HyperPod.
AI model-training run disruptions are reduced to 30 minutes per week with Amazon SageMaker HyperPod.
The decrease in disruptions allows the composite organization to optimize its infrastructure. For the same infrastructure cost, the composite organization can spend more time training models with less time wasted.

Risks. The expected financial impact is subject to risks and variation based on several factors:

Model-training hours lost to disruptions before Amazon SageMaker HyperPod.
Disruptions with Amazon SageMaker HyperPod.
Cost of model-training infrastructure.

Results. To account for these risks, Forrester adjusted this benefit downward by 5%, yielding a three-year, risk-adjusted total PV (discounted at 10%) of $4.2 million.

42.6%

Improvement in infrastructure availability

“The resiliency features of HyperPod have enabled us to isolate when the failures occur. HyperPod has been resilient enough to give us continuing utilization with hardware faults, so we can just keep running long data-generation jobs within our clusters.”

Research engineer, collaboration platform

Faster AI Model Training With Improved Infrastructure Utilization

Ref.	Metric	Source		Year 1	Year 2	Year 3
D1	Model-training time per week (hours)	7*24		168	168	168
D2	Model-training time lost to disruptions per week before Amazon SageMaker HyperPod	Interviews		72	72	72
D3	Percentage of hours lost before Amazon SageMaker HyperPod	D2/D1		42.9%	42.9%	42.9%
D4	Model time lost to bad nodes per week with Amazon SageMaker HyperPod (hours)	Interviews		0.5	0.5	0.5
D5	Hours lost now as percentage	D4/D1		0.3%	0.3%	0.3%
D6	Reduction in infrastructure cost with Amazon SageMaker HyperPod	D3-D5		42.6%	42.6%	42.6%
Dt	Faster AI model training with improved infrastructure utilization	D6*F3		$1,492,704	$1,791,245	$2,164,421
	Risk adjustment	↓5%
Dtr	Faster AI model training with improved infrastructure utilization (risk-adjusted)			$1,418,069	$1,701,683	$2,056,200
Three-year total: $5,175,952			Three-year present value: $4,240,356
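The percentage rows in this table are simple ratios, sketched below with illustrative names. Row F3 belongs to the cost tables rather than this one, so the sketch does not recompute Dt; it takes the published Dt values as given and applies only the 5% risk adjustment.

```python
HOURS_PER_WEEK = 7 * 24        # row D1
HOURS_LOST_BEFORE = 72         # row D2
HOURS_LOST_WITH_HYPERPOD = 0.5  # row D4
RISK_ADJUSTMENT = 0.05         # downward adjustment stated in the Results paragraph

pct_lost_before = HOURS_LOST_BEFORE / HOURS_PER_WEEK * 100       # row D3
pct_lost_after = HOURS_LOST_WITH_HYPERPOD / HOURS_PER_WEEK * 100  # row D5
availability_gain = pct_lost_before - pct_lost_after             # row D6

# Published, pre-adjustment values of row Dt for Years 1 to 3.
DT_PUBLISHED = (1_492_704, 1_791_245, 2_164_421)
adjusted = [round(value * (1 - RISK_ADJUSTMENT)) for value in DT_PUBLISHED]

print(f"D3 = {pct_lost_before:.1f}%, D5 = {pct_lost_after:.1f}%, "
      f"D6 = {availability_gain:.1f}%")
for year, value in enumerate(adjusted, start=1):
    print(f"Year {year}: Dtr = ${value:,}")
```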

Faster Time To Market

Evidence and data. The interviewees used Amazon SageMaker HyperPod for AI model development, and a number of those models evolved into production models, generating incremental revenue for the interviewees’ organizations.

By enabling faster AI model training, Amazon SageMaker HyperPod helped the interviewees’ organizations get new products to market faster, accelerating incremental revenue. Interviewees estimated that Amazon SageMaker HyperPod accelerated AI model training and new product development by several months. With Amazon SageMaker HyperPod, the interviewees could focus on growing their business rather than building up infrastructure.
The research engineer at a collaboration platform noted: “Amazon SageMaker HyperPod has helped our business grow because we are training more and better models every day. That translates into us being more efficient and pushing the best performance to our customers. HyperPod has been really helpful in providing reliable hardware.”
The CTO at a healthcare organization explained: “Without Amazon SageMaker HyperPod, it would have taken us longer to build up our infrastructure. It would have been a big distraction and a big opportunity cost. We didn’t want to focus on building complex infrastructure; we want to get ahead with our business.”

Modeling and assumptions. Based on the interviews, Forrester assumes the following about the composite organization:

Incremental revenue from faster time to market is $25 million in Year 1 and grows 20% annually to $36 million by Year 3.
The operating margin is 12% to reflect the costs associated with the incremental revenue.

Risks. The expected financial impact is subject to risks and variation based on several factors:

A company’s size, revenue, and growth.
A company’s operating margin.
A company’s ability to successfully launch new products.

Results. To account for these risks, Forrester adjusted this benefit downward by 15%, yielding a three-year, risk-adjusted total PV (discounted at 10%) of $7.6 million.

$9.3 million

Incremental profit over three years

“Amazon SageMaker HyperPod has definitely helped us because our key product is AI models. Without them, we wouldn’t be able to have any type of revenue.”

Technical staff, biotech

Faster Time To Market

Ref.	Metric	Source		Year 1	Year 2	Year 3
E1	Incremental revenue from faster time to market	Composite		$25,000,000	$30,000,000	$36,000,000
E2	Operating margin	NYU Stern School of Business		12%	12%	12%
Et	Faster time to market	E1*E2		$3,000,000	$3,600,000	$4,320,000
	Risk adjustment	↓15%
Etr	Faster time to market (risk-adjusted)			$2,550,000	$3,060,000	$3,672,000
Three-year total: $9,282,000			Three-year present value: $7,605,935
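The profit figures in this table follow from the revenue assumption, the operating margin, and the risk adjustment. The sketch below is a minimal illustration with hypothetical names, using only the inputs stated in this section.

```python
YEAR_1_REVENUE = 25_000_000  # row E1, Year 1
ANNUAL_GROWTH = 0.20         # growth rate stated in the modeling assumptions
OPERATING_MARGIN = 0.12      # row E2
RISK_ADJUSTMENT = 0.15       # downward adjustment stated in the Results paragraph

# Incremental revenue for Years 1 to 3 (row E1).
revenue = [round(YEAR_1_REVENUE * (1 + ANNUAL_GROWTH) ** i) for i in range(3)]

# Incremental profit before (row Et) and after (row Etr) risk adjustment.
profit = [round(r * OPERATING_MARGIN) for r in revenue]
adjusted = [round(r * OPERATING_MARGIN * (1 - RISK_ADJUSTMENT)) for r in revenue]

for year, (r, p, a) in enumerate(zip(revenue, profit, adjusted), start=1):
    print(f"Year {year}: revenue ${r:,}, Et ${p:,}, Etr ${a:,}")
print(f"Three-year risk-adjusted profit: ${sum(adjusted):,}")
```

The three-year sum of the risk-adjusted row is the $9.3 million incremental profit cited in the summary, before discounting.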

Unquantified Benefits

Interviewees mentioned the following additional benefits that their organizations experienced but were not able to quantify:

AWS partnership and support. The interviewees called out and praised AWS’s partnership and support. The research engineer at a collaboration platform shared: “AWS has been great. Whenever we run into issues, they come back with good customer support.”
Access to the latest hardware. The interviewees valued access to the latest and fastest GPUs with Amazon SageMaker HyperPod, including AWS Trainium chips, a family of AI chips purpose-built by AWS for AI training and inference. The research engineer at a collaboration platform noted, “We need the latest hardware, and Amazon has always been up to date.” The CTO for a healthcare organization added: “Having the latest GPUs has been very important. The fact that we were able to upgrade to the H200 has been just really phenomenal.”
Security. Confidence in the security of Amazon’s platform was a key reason the interviewees chose Amazon SageMaker HyperPod. Amazon SageMaker HyperPod leverages IAM for authentication and authorization. Organizations can define permissions to control who can access HyperPod resources. The CTO at a healthcare organization shared, “Security on Amazon SageMaker HyperPod is great, and it’s one of the reasons we wanted to be on a tier one cloud provider.” He added, “With HyperPod, everything is integrated with IAM to manage all the user identities.”

“AWS coming in with its engineering expertise and helping us debug network issues and hardware issues in a timely manner was definitely very helpful and a big selling point of SageMaker HyperPod.”

CTO, healthcare

Flexibility

The value of flexibility is unique to each customer. There are multiple scenarios in which a customer might implement SageMaker HyperPod and later realize additional uses and business opportunities, including:

Ability to quickly leverage new open-source AI models. New open-source AI models are released constantly. With Amazon SageMaker HyperPod, the testing infrastructure is already in place, so model developers can easily set up model-testing runs to explore ways to leverage the new models. The CTO at a healthcare organization explained: “A new model comes out, and it completely changes the whole game. We need training capacity, so when new models come out, we can immediately train on them and figure out what this means for our business.”
Scalability. The interviewees found that Amazon SageMaker HyperPod provided the scalability to meet business requirements. The research engineer at a collaboration platform noted: “We can seamlessly scale up and bring it down because no model builder needs to do a pretraining from scratch all the time. That is something we really like about HyperPod.”

Flexibility would also be quantified when evaluated as part of a specific project (described in more detail in Total Economic Impact Approach).

“With Amazon SageMaker HyperPod, we were quite confident that we could scale our clusters, if needed, for large training runs. Scaling has been quite seamless for us.”

Research engineer, collaboration platform

Analysis Of Costs

Quantified cost data as applied to the composite

Total Costs

Ref.	Cost	Initial	Year 1	Year 2	Year 3	Total	Present Value
Ftr	Amazon SageMaker HyperPod cost	$0	$3,742,200	$4,490,640	$5,425,560	$13,658,400	$11,189,576
Gtr	Installation and maintenance cost	$63,788	$127,575	$127,575	$127,575	$446,513	$381,048
	Total costs (risk-adjusted)	$63,788	$3,869,775	$4,618,215	$5,553,135	$14,104,913	$11,570,624

Amazon SageMaker HyperPod Cost

Evidence and data. The interviewees’ organizations paid Amazon a fee for SageMaker HyperPod and for data storage.

The interviewees’ organizations typically used Nvidia H100 and H200 Tensor Core GPUs for AI model training, but in some cases also used Nvidia A100 Tensor Core GPUs.
The interviewees’ organizations used anywhere from 10TB to several PB of data to train their AI models. They typically stored data in lower-cost cold storage (such as Amazon Glacier) and then moved the data into high-performance storage (such as Amazon FSx for Lustre) when the data was required for AI model training.
The interviewees’ organizations typically entered into a three-year contract with Amazon for SageMaker HyperPod. Pricing may vary for Amazon SageMaker HyperPod. Contact Amazon for additional details.

Modeling and assumptions. Based on the interviews, Forrester assumes the following about the composite organization:

The cost to use Amazon SageMaker HyperPod increases over time as the composite organization grows its AI model training and requires more infrastructure capacity. The cost is based on usage of Nvidia H100 and H200 Tensor Core GPUs for AI model training.
Storage cost is based on 100TB of data in Year 1 to train the AI models. Storage cost increases over time as additional data is stored.

Risks. The expected financial impact is subject to risks and variation based on several factors:

The type and number of GPUs used for AI model training.
The volume of data stored for AI model training and the type of storage.

Results. To account for these risks, Forrester adjusted this cost upward by 5%, yielding a three-year, risk-adjusted total PV (discounted at 10%) of $11.2 million.

Amazon SageMaker HyperPod Cost

Ref.	Metric	Source	Initial	Year 1	Year 2	Year 3
F1	GPUs	Composite		160	192	232
F2	Cost per GPU per hour	Interviews		$2.50	$2.50	$2.50
F3	Amazon SageMaker HyperPod cost	F1*F2*24*365		$3,504,000	$4,204,800	$5,080,800
F4	Data storage (TB)	Composite		100	120	144
F5	Storage cost per TB per month	Interviews		$50	$50	$50
F6	Storage cost	F4*F5*12		$60,000	$72,000	$86,400
Ft	Amazon SageMaker HyperPod cost	F3+F6	$0	$3,564,000	$4,276,800	$5,167,200
	Risk adjustment	↑5%
Ftr	Amazon SageMaker HyperPod cost (risk-adjusted)		$0	$3,742,200	$4,490,640	$5,425,560
Three-year total: $13,658,400			Three-year present value: $11,189,576
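The arithmetic behind rows F3 through Ftr can be reproduced in a few lines. The Python sketch below is illustrative only and is not part of Forrester's model. The function and variable names are hypothetical, and the inputs are copied from rows F1, F2, F4, and F5 of the table above.

```python
HOURS_PER_YEAR = 24 * 365
MONTHS_PER_YEAR = 12
RISK_ADJUSTMENT = 0.05  # costs are adjusted upward by 5%

# Year 1 to Year 3 inputs copied from rows F1, F2, F4, and F5.
GPUS = [160, 192, 232]
COST_PER_GPU_HOUR = [2.50, 2.50, 2.50]
STORAGE_TB = [100, 120, 144]
STORAGE_COST_PER_TB_MONTH = [50, 50, 50]


def hyperpod_cost_rows():
    """Return one dict per year with F3, F6, Ft, and Ftr in whole dollars."""
    rows = []
    for gpus, gpu_rate, tb, tb_rate in zip(
        GPUS, COST_PER_GPU_HOUR, STORAGE_TB, STORAGE_COST_PER_TB_MONTH
    ):
        compute = gpus * gpu_rate * HOURS_PER_YEAR   # F3 = F1*F2*24*365
        storage = tb * tb_rate * MONTHS_PER_YEAR     # F6 = F4*F5*12
        total = compute + storage                    # Ft = F3+F6
        adjusted = total * (1 + RISK_ADJUSTMENT)     # Ftr = Ft raised by 5%
        rows.append(
            {
                "F3": round(compute),
                "F6": round(storage),
                "Ft": round(total),
                "Ftr": round(adjusted),
            }
        )
    return rows


for year, row in enumerate(hyperpod_cost_rows(), start=1):
    print(f"Year {year}: {row}")
```

Readers applying the framework to their own organization could substitute their own GPU counts, hourly rates, and storage volumes for the composite's values.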

Installation And Maintenance Cost

Evidence and data. The interviewees’ organizations were able to transition to Amazon SageMaker HyperPod very quickly.

Most of the interviewees’ organizations transitioned to Amazon SageMaker HyperPod in about a week. One organization ran a two- to three-week pilot and then spent a week on setup when they decided to move forward with Amazon SageMaker HyperPod.
It was easy for the model developers to begin using Amazon SageMaker HyperPod. The chief scientist at a media organization explained, “The learning curve is not very high.”

Modeling and assumptions. Based on the interviews, Forrester assumes the following about the composite organization:

Four technical team members pilot Amazon SageMaker HyperPod for three weeks.
One-half of an FTE’s time is dedicated to maintaining Amazon SageMaker HyperPod.
Technical team salary including benefits is $243,000 annually.

Risks. The expected financial impact is subject to risks and variation based on several factors:

The complexity of the required AI model-training infrastructure.
Technical team salaries.

Results. To account for these risks, Forrester adjusted this cost upward by 5%, yielding a three-year, risk-adjusted total PV (discounted at 10%) of $381,000.

1 week

Transition to Amazon SageMaker HyperPod

“We were fully transitioned to Amazon SageMaker HyperPod in a week.”

Research engineer, collaboration platform

Installation And Maintenance Cost

Ref.	Metric	Source	Initial	Year 1	Year 2	Year 3
G1	Installation and setup (FTE)	Interviews	0.25
G2	Ongoing maintenance (FTE)	Interviews		0.5	0.5	0.5
G3	Fully burdened annual salary for FTE maintaining Amazon SageMaker HyperPod	Interviews	$243,000	$243,000	$243,000	$243,000
Gt	Installation and maintenance cost	(G1+G2)*G3	$60,750	$121,500	$121,500	$121,500
	Risk adjustment	↑5%
Gtr	Installation and maintenance cost (risk-adjusted)		$63,788	$127,575	$127,575	$127,575
Three-year total: $446,513			Three-year present value: $381,048
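As a rough check on the table above, the following Python sketch recomputes rows Gt and Gtr from the FTE and salary inputs. It is an illustration rather than Forrester's own model. The names are hypothetical, and rounding half up to whole dollars is an assumption chosen because it reproduces the published $63,788 initial figure.

```python
from decimal import Decimal, ROUND_HALF_UP

RISK_ADJUSTMENT = Decimal("0.05")   # costs are adjusted upward by 5%
SALARY = Decimal("243000")          # G3, fully burdened annual salary

# Columns are Initial, Year 1, Year 2, Year 3.
SETUP_FTE = [Decimal("0.25"), Decimal("0"), Decimal("0"), Decimal("0")]         # G1
MAINTENANCE_FTE = [Decimal("0"), Decimal("0.5"), Decimal("0.5"), Decimal("0.5")]  # G2


def to_dollars(amount):
    """Round a Decimal amount half up to whole dollars (an assumed convention)."""
    return int(amount.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def installation_and_maintenance():
    """Return (Gt, Gtr) as lists of unrounded Decimal amounts per period."""
    unadjusted = [(g1 + g2) * SALARY for g1, g2 in zip(SETUP_FTE, MAINTENANCE_FTE)]
    adjusted = [gt * (1 + RISK_ADJUSTMENT) for gt in unadjusted]
    return unadjusted, adjusted


gt, gtr = installation_and_maintenance()
print("Gt: ", [to_dollars(x) for x in gt])
print("Gtr:", [to_dollars(x) for x in gtr])
print("Three-year total:", to_dollars(sum(gtr)))
```

Decimal arithmetic is used here so that the half-dollar in the initial period rounds predictably instead of depending on binary floating-point representation.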

Financial Summary

Consolidated Three-Year, Risk-Adjusted Metrics

Cash Flow Chart (Risk-Adjusted)

[Chart: total costs, total benefits, and cumulative net benefits, shown for Initial, Year 1, Year 2, and Year 3]

Cash Flow Analysis (Risk-Adjusted)

	Initial	Year 1	Year 2	Year 3	Total	Present Value
Total costs	($63,788)	($3,869,775)	($4,618,215)	($5,553,135)	($14,104,913)	($11,570,624)
Total benefits	$0	$10,768,281	$12,921,937	$15,583,577	$39,273,795	$32,176,804
Net benefits	($63,788)	$6,898,506	$8,303,722	$10,030,442	$25,168,882	$20,606,180
ROI						178%
Payback						<6 months

Please Note

The financial results calculated in the Benefits and Costs sections can be used to determine the ROI, NPV, and payback period for the composite organization’s investment. Forrester assumes a yearly discount rate of 10% for this analysis.

These risk-adjusted ROI, NPV, and payback period values are determined by applying risk-adjustment factors to the unadjusted results in each Benefit and Cost section.

The initial investment column contains costs incurred at “time 0” or at the beginning of Year 1 that are not discounted. All other cash flows are discounted using the discount rate at the end of the year. PV calculations are calculated for each total cost and benefit estimate. NPV calculations in the summary tables are the sum of the initial investment and the discounted cash flows in each year. Sums and present value calculations of the Total Benefits, Total Costs, and Cash Flow tables may not exactly add up, as some rounding may occur.
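The discounting described above can be sketched in Python as follows. This is a minimal illustration under the stated 10% discount rate, not Forrester's model. The function and variable names are hypothetical, and the cash flows are the rounded figures from the Cash Flow Analysis table, so results may differ from the published present values by a dollar or two.

```python
DISCOUNT_RATE = 0.10  # yearly discount rate assumed in the study


def present_value(initial, yearly_flows, rate=DISCOUNT_RATE):
    """Initial amount is undiscounted; each later flow is discounted at year end."""
    discounted = sum(
        flow / (1 + rate) ** year
        for year, flow in enumerate(yearly_flows, start=1)
    )
    return initial + discounted


# Risk-adjusted totals from the Cash Flow Analysis table.
pv_costs = present_value(63_788, [3_869_775, 4_618_215, 5_553_135])
pv_benefits = present_value(0, [10_768_281, 12_921_937, 15_583_577])

npv = pv_benefits - pv_costs   # net present value
roi = npv / pv_costs           # net benefits divided by costs

print(f"PV of costs:    ${pv_costs:,.0f}")
print(f"PV of benefits: ${pv_benefits:,.0f}")
print(f"NPV:            ${npv:,.0f}")
print(f"ROI:            {roi:.0%}")
```

Changing `DISCOUNT_RATE` shows how sensitive the NPV and ROI are to an organization's own cost of capital.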

TEI Framework And Methodology

From the information provided in the interviews, Forrester constructed a Total Economic Impact™ framework for those organizations considering an investment in SageMaker HyperPod.

The objective of the framework is to identify the cost, benefit, flexibility, and risk factors that affect the investment decision. Forrester took a multistep approach to evaluate the impact that Amazon SageMaker HyperPod can have on an organization.

Due Diligence

Interviewed Amazon stakeholders and Forrester analysts to gather data relative to SageMaker HyperPod.

Interviews

Interviewed four decision-makers at organizations using SageMaker HyperPod to obtain data about costs, benefits, and risks.

Composite Organization

Designed a composite organization based on characteristics of the interviewees’ organizations.

Financial Model Framework

Constructed a financial model representative of the interviews using the TEI methodology and risk-adjusted the financial model based on issues and concerns of the interviewees.

Case Study

Employed four fundamental elements of TEI in modeling the investment impact: benefits, costs, flexibility, and risks. Given the increasing sophistication of ROI analyses related to IT investments, Forrester’s TEI methodology provides a complete picture of the total economic impact of purchase decisions. Please see Appendix A for additional information on the TEI methodology.

Glossary

Total Economic Impact Approach

Benefits

Benefits represent the value the solution delivers to the business. The TEI methodology places equal weight on the measure of benefits and costs, allowing for a full examination of the solution’s effect on the entire organization.

Costs

Costs comprise all expenses necessary to deliver the proposed value, or benefits, of the solution. The methodology captures implementation and ongoing costs associated with the solution.

Flexibility

Flexibility represents the strategic value that can be obtained for some future additional investment building on top of the initial investment already made. The ability to capture that benefit has a PV that can be estimated.

Risks

Risks measure the uncertainty of benefit and cost estimates given: 1) the likelihood that estimates will meet original projections and 2) the likelihood that estimates will be tracked over time. TEI risk factors are based on “triangular distribution.”

Financial Terminology

Present value (PV)

The present or current value of (discounted) cost and benefit estimates given at the cost of capital (the discount rate). The PV of costs and benefits feeds into the total NPV of cash flows.

Net present value (NPV)

The present or current value of (discounted) future net cash flows given the cost of capital (the discount rate). A positive project NPV normally indicates that the investment should be made unless other projects have higher NPVs.

Return on investment (ROI)

A project’s expected return in percentage terms. ROI is calculated by dividing net benefits (benefits less costs) by costs.

Discount rate

The weighted average cost of capital used in cash flow analysis to take into account the time value of money. Organizations typically use discount rates between 8% and 16%.

Payback

The breakeven point for an investment. This is the point in time at which net benefits (benefits minus costs) equal initial investment or cost.
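The study does not spell out how its payback figure was computed. As one hedged illustration of the definition above, the Python sketch below assumes that net benefits accrue evenly within each year and finds the point at which cumulative net benefits offset the initial cost. The function name and the even-accrual assumption are hypothetical. The inputs are the initial cost and yearly net benefits from the Cash Flow Analysis table.

```python
def payback_months(initial_cost, yearly_net_benefits):
    """Months until cumulative net benefits offset the initial cost.

    Assumes each year's net benefit accrues evenly across that year.
    Returns None if the initial cost is never recovered.
    """
    remaining = initial_cost
    for year_index, net in enumerate(yearly_net_benefits):
        if net > 0 and net >= remaining:
            # Fraction of this year needed to cover what is still outstanding.
            return 12 * (year_index + remaining / net)
        remaining -= net
    return None


# Risk-adjusted figures from the Cash Flow Analysis table.
months = payback_months(63_788, [6_898_506, 8_303_722, 10_030_442])
print(f"Payback under even accrual: {months:.2f} months")
```

Under this assumption the composite recovers its initial cost within a small fraction of the first month, which is consistent with the reported payback of less than six months.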

Appendixes

Appendix A

Total Economic Impact

Total Economic Impact is a methodology developed by Forrester Research that enhances a company’s technology decision-making processes and assists solution providers in communicating their value proposition to clients. The TEI methodology helps companies demonstrate, justify, and realize the tangible value of business and technology initiatives to both senior management and other key stakeholders.

Appendix B

Endnotes

¹ Total Economic Impact is a methodology developed by Forrester Research that enhances a company’s technology decision-making processes and assists solution providers in communicating their value proposition to clients. The TEI methodology helps companies demonstrate, justify, and realize the tangible value of business and technology initiatives to both senior management and other key stakeholders.

Disclosures

Readers should be aware of the following:

This study is commissioned by Amazon and delivered by Forrester Consulting. It is not meant to be used as a competitive analysis.

Forrester makes no assumptions as to the potential ROI that other organizations will receive. Forrester strongly advises that readers use their own estimates within the framework provided in the study to determine the appropriateness of an investment in SageMaker HyperPod. For any interactive functionality, the intent is for the questions to solicit inputs specific to a prospect's business. Forrester believes that this analysis is representative of what companies may achieve with SageMaker HyperPod based on the inputs provided and any assumptions made. Forrester does not endorse Amazon or its offerings. Although great care has been taken to ensure the accuracy and completeness of this model, Amazon and Forrester Research are unable to accept any legal responsibility for any actions taken on the basis of the information contained herein. The interactive tool is provided ‘AS IS,’ and Forrester and Amazon make no warranties of any kind.

Amazon reviewed and provided feedback to Forrester, but Forrester maintains editorial control over the study and its findings and does not accept changes to the study that contradict Forrester’s findings or obscure the meaning of the study.

Amazon provided the customer names for the interviews but did not participate in the interviews.

Consulting Team:

Jennifer Adams

Published

December 2025

The Total Economic Impact™ Of Amazon SageMaker HyperPod

Cost Savings And Business Benefits Enabled By SageMaker HyperPod
