AWS has released Data on EKS (DoEKS), an open source project that provides templates, guidance, and best practices for deploying data workloads on Amazon Elastic Kubernetes Service (EKS). The main focus is running Apache Spark on Amazon EKS, but blueprints also exist for other data workloads such as Ray, Apache Airflow, Argo Workflows, and Kubeflow.
Built on the Amazon EKS Blueprints project, DoEKS provides Infrastructure as Code (IaC) templates (both Terraform and AWS CDK), sample jobs, references to AWS resources, and performance benchmark reports. Solutions within DoEKS fall into five areas: Data Analytics, AI/ML, Distributed Databases, Streaming Platforms, and Scheduler Workflow Patterns.
DoEKS provides guidance and patterns for configuring observability and logging, handling multi-tenancy, and choosing a cluster autoscaler. In addition to integrating with AWS managed services, it covers several open source tools, Kubernetes operators, and frameworks.
One of the provided patterns covers deploying EMR on EKS with Karpenter. The pattern creates an EKS cluster control plane and one managed node group spanning multiple Availability Zones, with three instances dedicated to system-critical pods such as Cluster Autoscaler, CoreDNS, and the observability and logging stack. It also enables Amazon EMR on EKS and configures its own defaults.
The pattern can be deployed using the provided Terraform template:
git clone https://github.com/awslabs/data-on-eks.git
cd data-on-eks/analytics/terraform/emr-eks-karpenter
terraform init
export AWS_REGION="us-west-2"
terraform plan
terraform apply
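Once the cluster is running, Karpenter provisions additional compute for Spark workloads based on Provisioner resources. The following is a minimal sketch of such a Provisioner for illustration only; the name, labels, instance requirements, and limits are assumptions and are not taken verbatim from the blueprint.
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spark-compute                    # hypothetical name
spec:
  labels:
    type: karpenter                      # label Spark pods can target via nodeSelector
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]      # allow both Spot and On-Demand capacity
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
  limits:
    resources:
      cpu: "1000"                        # cap the total vCPUs Karpenter may provision
  providerRef:
    name: spark-compute                  # references a matching AWSNodeTemplate
  ttlSecondsAfterEmpty: 30               # remove nodes shortly after they become empty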
Argo Workflows is an open-source, container-native engine for coordinating parallel jobs in Kubernetes. The Argo Workflows on EKS pattern describes how to use Argo Workflows on Amazon EKS, including how to create Spark jobs through the Spark operator and Amazon SQS messages.
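As a rough illustration of that flow, the sketch below shows an Argo Workflow whose resource template creates a SparkApplication handled by the Spark operator; the image, jar path, and resource sizes are placeholders rather than values from the pattern.
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: spark-pi-                          # hypothetical name
spec:
  entrypoint: run-spark-job
  templates:
    - name: run-spark-job
      resource:
        action: create                             # create the SparkApplication and wait on its status
        successCondition: status.applicationState.state == COMPLETED
        failureCondition: status.applicationState.state == FAILED
        manifest: |
          apiVersion: sparkoperator.k8s.io/v1beta2
          kind: SparkApplication
          metadata:
            generateName: spark-pi-
          spec:
            type: Scala
            mode: cluster
            image: apache/spark:3.3.1              # placeholder image
            mainClass: org.apache.spark.examples.SparkPi
            mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.3.1.jar
            sparkVersion: "3.3.1"
            driver:
              cores: 1
              serviceAccount: spark                # placeholder service account
            executor:
              cores: 1
              instances: 2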
Blueprints for streaming platforms and distributed databases are still in development. The streaming platform blueprints will provide details for Apache Kafka, Apache Flink, and Apache Pulsar, while the distributed database blueprint is expected to cover Apache Cassandra, Amazon DynamoDB, and Apache Presto.
DoEKS also includes details on how to use CloudNativePG to manage PostgreSQL workloads on Kubernetes, with recommendations covering storage selection, monitoring settings, and backup and restore operations. For storage, DoEKS recommends using Amazon Elastic Block Store (EBS) volumes as it “provides high performance and fault tolerance.” Specifically, the guidance recommends either Provisioned IOPS SSDs (io2 or io1) or General Purpose SSDs (gp3 or gp2). Examples for both cases are included as YAML files, as shown in the io2 example below.
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  name: storageclass-io2
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
parameters:
  csi.storage.k8s.io/fstype: xfs
  encrypted: "true"
  type: io2
  iopsPerGB: "50"
This can be provisioned using kubectl create -f examples/storageclass.yaml.
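To connect the storage class to a database, a CloudNativePG Cluster can reference it in its storage section. The example below is a minimal, assumed sketch; the cluster name, instance count, and volume size are illustrative rather than taken from the DoEKS guidance.
---
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-cluster                 # hypothetical name
spec:
  instances: 3                     # one primary and two replicas
  storage:
    storageClass: storageclass-io2 # the io2 StorageClass defined above
    size: 100Gi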
The DoEKS library is available under the Apache 2.0 license. It is not a supported AWS service; instead, it is maintained by AWS Solutions Architects and the DoEKS Blueprints community.