Best Practices for Running HPC on AWS

12.2018

Use a placement group

A cluster placement group is a logical grouping of instances within a single Availability Zone.

Cluster placement groups are recommended for applications that benefit from low network latency, high network throughput, or both, and if the majority of the network traffic is between the instances in the group.
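
As a rough sketch of how this might look with the AWS SDK for Python (boto3), the helpers below build the request parameters for creating a cluster placement group and for launching instances into it. The group name, AMI ID, instance type, and instance count are hypothetical placeholders, not values from this article. The actual AWS calls are kept in a separate function so the parameter builders can be inspected without credentials.

```python
def placement_group_params(group_name):
    """Parameters for EC2 CreatePlacementGroup using the cluster strategy."""
    return {"GroupName": group_name, "Strategy": "cluster"}


def launch_params(group_name, image_id, instance_type, count):
    """Parameters for EC2 RunInstances that place every instance in the group."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
        "Placement": {"GroupName": group_name},
    }


def create_and_launch(group_name, image_id, instance_type, count):
    """Create the group, then launch into it (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the builders work without it

    ec2 = boto3.client("ec2")
    ec2.create_placement_group(**placement_group_params(group_name))
    return ec2.run_instances(
        **launch_params(group_name, image_id, instance_type, count)
    )


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(placement_group_params("hpc-cluster-pg"))
    print(launch_params("hpc-cluster-pg", "ami-EXAMPLE", "c5n.18xlarge", 4))
```

Launching all nodes in a single request (MinCount equal to MaxCount) is a common choice here, since it asks EC2 to find capacity for the whole group at once rather than node by node.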

Reference:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html

Disable hyper-threading

Amazon EC2 instances support Intel Hyper-Threading Technology (HT Technology), which enables multiple threads to run concurrently on a single Intel Xeon CPU core. Each thread is represented as a virtual CPU (vCPU) on the instance.

Hyper-threading can slow down workloads in which the two threads on a core contend for the same resources. A good example is an HPC job that relies heavily on floating-point calculations: the two threads in each core share a single floating-point unit (FPU) and often block one another.
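
One way to apply this at launch time is the EC2 CPU options feature described in the first reference below. The sketch that follows derives the `CpuOptions` parameter for RunInstances from an instance type's default vCPU count, assuming two threads per core. The function name is hypothetical, and the vCPU count must be supplied by the caller (72 is the default for c5.18xlarge).

```python
def cpu_options_without_hyperthreading(default_vcpus, threads_per_core=2):
    """Build RunInstances CpuOptions that expose one thread per physical core.

    default_vcpus is the instance type's default vCPU count; with two
    threads per core, the number of physical cores is half of that.
    """
    if default_vcpus <= 0 or default_vcpus % threads_per_core != 0:
        raise ValueError("vCPU count must be a positive multiple of threads per core")
    return {
        "CoreCount": default_vcpus // threads_per_core,
        "ThreadsPerCore": 1,
    }


if __name__ == "__main__":
    # c5.18xlarge: 72 vCPUs by default, so 36 cores with one thread each.
    print(cpu_options_without_hyperthreading(72))
```

The resulting dictionary would be passed as the `CpuOptions` argument of a boto3 `run_instances` call. The second reference covers the alternative of taking the sibling threads offline from inside a running Amazon Linux instance.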

Reference:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-optimize-cpu.html

https://aws.amazon.com/blogs/compute/disabling-intel-hyper-threading-technology-on-amazon-linux/

Use a fast connection between the nodes

Amazon EC2 C5n and P3dn instances provide up to 100 Gbps of network throughput, along with a higher ceiling on packets per second, for simulations, in-memory caches, data lakes, and other communication-intensive applications.

Reference:

https://aws.amazon.com/about-aws/whats-new/2018/11/introducing-amazon-ec2-c5n-instances/

https://aws.amazon.com/blogs/aws/new-c5n-instances-with-100-gbps-networking/

https://aws.amazon.com/blogs/aws/new-ec2-p3dn-gpu-instances-with-100-gbps-networking-local-nvme-storage-for-faster-machine-learning-p3-price-reduction/

Use GPU instance types (G3 or P3) for graphics-related tasks

Amazon EC2 G3 instances provide access to NVIDIA Tesla M60 GPUs, each with up to 2,048 parallel processing cores and 8 GiB of GPU memory.

Amazon EC2 P3 instances deliver high-performance computing in the cloud with up to 8 NVIDIA V100 Tensor Core GPUs and up to 100 Gbps of networking throughput for machine learning and HPC applications.

Reference:

https://aws.amazon.com/ec2/instance-types/g3/

https://aws.amazon.com/ec2/instance-types/p3/

Use AWS ParallelCluster

AWS ParallelCluster is a fully supported and maintained open-source cluster management tool that makes it easy for scientists, researchers, and IT administrators to deploy and manage High Performance Computing (HPC) clusters in the AWS cloud.

Reference:

https://aws.amazon.com/blogs/opensource/aws-parallelcluster/

https://aws.amazon.com/about-aws/whats-new/2018/11/AWSParallelCluster/

Use a parallel file system

Amazon FSx for Lustre is a fully managed file system that is optimized for compute-intensive workloads, such as high-performance computing and machine learning.

With Amazon FSx for Lustre, you can launch and run a Lustre file system that can process massive data sets at up to hundreds of gigabytes per second of throughput, millions of IOPS, and sub-millisecond latencies.
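
A minimal sketch of creating such a file system with boto3 is shown below. It only builds the parameters for the FSx CreateFileSystem call and wraps the call itself in a separate function. The subnet ID, security group ID, and capacity are hypothetical placeholders, and the storage sizes that FSx accepts should be checked against the current documentation.

```python
def lustre_file_system_params(subnet_id, storage_capacity_gib, security_group_ids=None):
    """Parameters for FSx CreateFileSystem requesting a Lustre file system."""
    params = {
        "FileSystemType": "LUSTRE",
        # Capacity in GiB; FSx only accepts certain sizes (see its documentation).
        "StorageCapacity": storage_capacity_gib,
        "SubnetIds": [subnet_id],
    }
    if security_group_ids:
        params["SecurityGroupIds"] = list(security_group_ids)
    return params


def create_lustre_file_system(subnet_id, storage_capacity_gib, security_group_ids=None):
    """Issue the request (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the builder works without it

    fsx = boto3.client("fsx")
    return fsx.create_file_system(
        **lustre_file_system_params(subnet_id, storage_capacity_gib, security_group_ids)
    )


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(lustre_file_system_params("subnet-EXAMPLE", 3600, ["sg-EXAMPLE"]))
```

Placing the file system in the same subnet (and therefore the same Availability Zone) as the compute nodes is assumed here, which keeps storage traffic close to the cluster.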

Reference:

https://aws.amazon.com/about-aws/whats-new/2018/11/amazon-fsx-lustre/

https://aws.amazon.com/fsx/lustre/

https://cloudacademy.com/blog/amazon-fsx-for-lustre-makes-high-performance-computing-more-accessible/

Use the largest possible instance type

Use the largest available compute optimized instance (c5n.18xlarge) or the largest latest-generation general purpose instance (m5.24xlarge).

Compute optimized instances are ideal for compute-bound applications that benefit from high-performance processors.

Reference:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/compute-optimized-instances.html

https://aws.amazon.com/ec2/instance-types/c5/

https://aws.amazon.com/ec2/instance-types/m5/

Use Elastic Fabric Adapter (EFA)

Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2 instances that enables customers to run HPC applications requiring high levels of inter-instance communication. It is currently in preview.
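
For illustration, the sketch below shows how an EFA might be requested when launching instances with boto3, assuming the `efa` value of the `InterfaceType` field that RunInstances accepts for EFA-capable instance types. The function name and all IDs are hypothetical placeholders, and availability of EFA for a given instance type and Region is an assumption to verify.

```python
def efa_launch_params(image_id, instance_type, subnet_id, security_group_id, count=1):
    """Parameters for EC2 RunInstances that attach an EFA as the primary interface."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
        # Subnet and security group go on the interface itself when
        # NetworkInterfaces is specified.
        "NetworkInterfaces": [
            {
                "DeviceIndex": 0,
                "InterfaceType": "efa",
                "SubnetId": subnet_id,
                "Groups": [security_group_id],
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(efa_launch_params("ami-EXAMPLE", "c5n.18xlarge", "subnet-EXAMPLE", "sg-EXAMPLE", 2))
```

The dictionary would be passed to `boto3.client("ec2").run_instances(**params)`. In practice this is usually combined with a cluster placement group, since EFA targets tightly coupled workloads.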

Reference:

https://aws.amazon.com/about-aws/whats-new/2018/11/introducing-elastic-fabric-adapter/

https://aws.amazon.com/ec2/faqs/#efa

https://www.slideshare.net/AmazonWebServices/new-launch-scaling-tightlycoupled-hpc-workloads-on-hpc-with-elastic-fabric-adapter-and-high-bandwidth-network-optimized-ec2-instances-ent360-aws-reinvent-2018

Eyal Estrin
Eyal Estrin is a Cloud Architect. He joined IUCC in December 2017, and his main focus is promoting and supporting cloud services in universities in Israel. He brings with him more than 20 years of experience in the IT and information security field.
Follow him at @eyalestrin