Terraforming a Spark cluster on Amazon

Kristina Georgieva
Nov 19, 2017

This post is about setting up the infrastructure to run your Spark jobs on a cluster hosted on Amazon.

Before we start, here is some terminology that you will need to know:

  • Amazon EMR — The Amazon service that provides a managed Hadoop framework
  • Terraform — A tool for setting up infrastructure using code

At the end of this post you should have an EMR 5.9.0 cluster set up in the Frankfurt (eu-central-1) region with the following tools:

  • Hadoop 2.7.3
  • Spark 2.2.0
  • Zeppelin 0.7.2
  • Ganglia 3.7.2
  • Hive 2.3.0
  • Hue 4.0.1
  • Oozie 4.3.0

By default, EMR Spark clusters come with Apache YARN installed as the resource manager.

We will need to set up an S3 bucket, a network, some roles, a key pair and the cluster itself. Let’s get started.

VPC setup

A VPC (Virtual Private Cloud) is a virtual network to which the cluster can be assigned. All nodes in the cluster will become part of a subnet within this network.

To set up a VPC in Terraform, first create a VPC resource:

resource "aws_vpc" "main" { 
cidr_block = "172.19.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
tags { Name = "VPC_name" }
}

Then we can create a public subnet. The availability zone is generally optional, but for this exercise you should set it, as some of the settings we choose (such as the instance types we use) are only available in eu-central-1a.

resource "aws_subnet" "public-subnet" { 
vpc_id = "${aws_vpc.main.id}"
cidr_block = "172.19.0.0/21"
availability_zone = "eu-central-1a"
tags { Name = "example_public_subnet" }
}

We then create an internet gateway for the public subnet.

resource "aws_internet_gateway" "gateway" { 
vpc_id = "${aws_vpc.main.id}"
tags { Name = "gateway_name" }
}

A routing table is then needed to allow traffic to go through the gateway.

resource "aws_route_table" "public-routing-table" { 
vpc_id = "${aws_vpc.main.id}"
route {
cidr_block = "0.0.0.0/0"
gateway_id = "${aws_internet_gateway.gateway.id}"
}
tags { Name = "gateway_name" }
}

Lastly, the routing table must be associated with the subnet to allow traffic in and out of it.

resource "aws_route_table_association" "public-route-association" {
subnet_id = "${aws_subnet.public-subnet.id}"
route_table_id = "${aws_route_table.public-routing-table.id}"
}

Roles

Next we need to set up some roles for the EMR cluster.

First a service role is necessary. This role defines what the cluster is allowed to do within the EMR environment.

Note that EOF heredoc blocks contain structured content. The closing EOF marker has to sit at the start of its own line, with no leading whitespace, which leads to some strange-looking indentation.

resource "aws_iam_role" "spark_cluster_iam_emr_service_role" { 
name = "spark_cluster_emr_service_role"
assume_role_policy = <<EOF
{
"Version": "2008-10-17",
"Statement": [ {
"Sid": "",
"Effect": "Allow",
"Principal": {
"Service": "elasticmapreduce.amazonaws.com"
},
"Action": "sts:AssumeRole"
} ]
} EOF
}

This service role needs a policy attached. In this example we will simply use the default EMR managed policy.

resource "aws_iam_role_policy_attachment" "emr-service-policy-attach" { 
role = "${aws_iam_role.spark_cluster_iam_emr_service_role.id}"
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole"
}

Next we need a role for the EMR profile.

resource "aws_iam_role" "spark_cluster_iam_emr_profile_role" { 
name = "spark_cluster_emr_profile_role"
assume_role_policy = <<EOF
{
"Version": "2008-10-17",
"Statement": [ {
"Sid": "",
"Effect": "Allow",
"Principal": { "Service": "ec2.amazonaws.com" },
"Action": "sts:AssumeRole"
} ]
} EOF
}

This role is assigned the default EMR for EC2 policy, which defines what the cluster's instances are allowed to do in the EC2 environment.

resource "aws_iam_role_policy_attachment" "profile-policy-attach" {
role = "${aws_iam_role.spark_cluster_iam_emr_profile_role.id}"
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role"
}

Lastly, we create the instance profile, which is used to pass the role to the EC2 instances.

resource "aws_iam_instance_profile" "emr_profile" { 
name = "spark_cluster_emr_profile"
role = "${aws_iam_role.spark_cluster_iam_emr_profile_role.name}" }

Key setup

Next you will need an SSH key that will allow you to SSH into the master node.

To create the SSH key and .pem file, run the following command:

ssh-keygen -t rsa

Enter a key name, such as ~/.ssh/cluster-key, and leave the passphrase empty. Then export the key to a .pem file:

ssh-keygen -f ~/.ssh/cluster-key -e -m pem > ~/.ssh/cluster-key.pem

Lastly, create a key pair resource in Terraform, linking to the public key that you have created:

resource "aws_key_pair" "emr_key_pair" { 
key_name = "emr-key"
public_key = "${file("/.ssh/cluster-key.pub")}"
}
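
Once the cluster is up, you can use this key to SSH into the master node as the hadoop user (EMR's default login user). Assuming the key was saved under ~/.ssh/cluster-key, that looks something like this, with the master's public DNS name taken from the EMR console or from the Terraform output shown later:

ssh -i ~/.ssh/cluster-key hadoop@<master-public-dns>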

S3

Next we need an S3 bucket. You may need more than one depending on your project requirements. In this example we will simply create one for the cluster logs.

resource "aws_s3_bucket" "logging_bucket" { 
bucket = "emr-logging-bucket"
region = "eu-central-1" versioning { enabled = "enabled" }
}

Security groups

Next we need a security group for the master node. This security group should allow the other nodes in the cluster to communicate with the master node, but also allow the master node to be reached on certain ports from your own network or VPN.

You can find your public IP address with any "what is my IP" service, or by running curl ifconfig.me from a terminal.

Let's assume that your public address is 123.123.123.123 and that you want to allow the whole /16 subnet.

resource "aws_security_group" "master_security_group" { 
name = "master_security_group"
description = "Allow inbound traffic from VPN"
vpc_id = "${aws_vpc.main.id}" # Avoid circular dependencies stopping the destruction of the cluster
revoke_rules_on_delete = true # Allow communication between nodes in the VPC
ingress {
from_port = "0"
to_port = "0"
protocol = "-1"
self = true
}
ingress {
from_port = "8443"
to_port = "8443"
protocol = "TCP"
}
egress {
from_port = "0"
to_port = "0"
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
} # Allow SSH traffic from VPN
ingress {
from_port = 22
to_port = 22
protocol = "TCP"
cidr_blocks = ["123.123.0.0/16"]
} #### Expose web interfaces to VPN # Yarn
ingress {
from_port = 8088
to_port = 8088
protocol = "TCP"
cidr_blocks = ["123.123.0.0/16"]
} # Spark History
ingress {
from_port = 18080
to_port = 18080
protocol = "TCP"
cidr_blocks = ["123.123.0.0/16"]
} # Zeppelin
ingress {
from_port = 8890
to_port = 8890
protocol = "TCP"
cidr_blocks = ["123.123.0.0/16"]
} # Spark UI
ingress {
from_port = 4040
to_port = 4040
protocol = "TCP"
cidr_blocks = ["123.123.0.0/16"]
} # Ganglia
ingress {
from_port = 80
to_port = 80
protocol = "TCP"
cidr_blocks = ["123.123.0.0/16"]
} # Hue
ingress {
from_port = 8888
to_port = 8888
protocol = "TCP"
cidr_blocks = ["123.123.0.0/16"]
}
lifecycle {
ignore_changes = ["ingress", "egress"]
}
tags { name = "emr_test" }
}

We also need a security group for the rest of the nodes. These nodes should only communicate internally.

resource "aws_security_group" "slave_security_group" { 
name = "slave_security_group"
description = "Allow all internal traffic"
vpc_id = "${aws_vpc.main.id}"
revoke_rules_on_delete = true # Allow communication between nodes in the VPC
ingress {
from_port = "0"
to_port = "0"
protocol = "-1"
self = true
}
ingress {
from_port = "8443"
to_port = "8443"
protocol = "TCP"
}
egress {
from_port = "0"
to_port = "0"
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
} # Allow SSH traffic from VPN
ingress {
from_port = 22
to_port = 22
protocol = "TCP"
cidr_blocks = ["123.123.0.0/16"]
}
lifecycle {
ignore_changes = ["ingress", "egress"]
}
tags { name = "emr_test" }
}

Note that when you create two security groups for EMR like this, circular dependencies arise, because EMR adds rules to each group that reference the other. When destroying the terraformed infrastructure, you would then have to delete the security group rules before deleting the groups themselves. The revoke_rules_on_delete option takes care of this automatically.

Cluster

Finally, now that we have all the components, we can set up the cluster.

First, add the provider:

provider "aws" { region = "eu-central-1" }

Then we add the cluster itself:

resource "aws_emr_cluster" "emr-spark-cluster" { 
name = "EMR-cluster-example"
release_label = "emr-5.9.0"
applications = ["Ganglia", "Spark", "Zeppelin", "Hive", "Hue"]
ec2_attributes {
instance_profile = "${aws_iam_instance_profile.emr_profile.arn}"
key_name = "${aws_key_pair.emr_key_pair.key_name}"
subnet_id = "${aws_vpc.main.id}"
emr_managed_master_security_group = "${aws_security_group.master_security_group.id}"
emr_managed_slave_security_group = "${aws_security_group.slave_security_group.id}" }
master_instance_type = "m3.xlarge"
core_instance_type = "m2.xlarge"
core_instance_count = 2
log_uri = "${aws_s3_bucket.logging_bucket.uri}"
tags { name = "EMR-cluster" role = "EMR_DefaultRole" }
service_role = "${aws_iam_role.spark_cluster_iam_emr_service_role.arn}" }

You can add task nodes as follows:

resource "aws_emr_instance_group" "task_group" { 
cluster_id = "${aws_emr_cluster.emr-spark-cluster.id}"
instance_count = 4
instance_type = "m3.xlarge"
name = "instance_group"
}
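
Optionally, you can add an output so that Terraform prints the master node's public DNS name once the cluster is up, which makes it easier to SSH in and to reach the web interfaces exposed by the master security group:

output "master_public_dns" {
  value = "${aws_emr_cluster.emr-spark-cluster.master_public_dns}"
}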

Saving the file

Save the file under your preferred name with the extension `.tf`.

Creating the cluster

To run the Terraform script, make sure that Terraform is installed and that your AWS credentials are configured (for example via environment variables or the ~/.aws/credentials file).
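
If this is the first time you are running Terraform in this directory, initialise it first so that the AWS provider plugin is downloaded:

terraform init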

Run the following to make sure that your setup is valid:

terraform plan

If there are no errors, you can run the following to create the cluster (provisioning an EMR cluster typically takes several minutes):

terraform apply

Destroying the cluster

To take down all the terraformed infrastructure, run the following:

terraform destroy

You can add the following to your file if you want the Terraform state file to be saved to an S3 bucket. This file lets Terraform know the last state of your infrastructure (what has been created or destroyed).

terraform {
  backend "s3" {
    bucket = "terraform-bucket-name"
    key    = "terraform.tfstate"
    region = "eu-central-1"
  }
}
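
The key argument sets the path under which the state file will be stored in the bucket; terraform.tfstate here is just an example, so choose a path that suits your project. After adding or changing the backend configuration, run terraform init again so that Terraform can configure the backend and offer to copy any existing local state to S3.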

Originally published at intothedepthsofdataengineering.wordpress.com on November 19, 2017.
