Terraforming a Spark cluster on Amazon

This post is about setting up the infrastructure to run yor spark jobs on a cluster hosted on Amazon.

Before we start, here is some terminology that you will need to know:

  • Amazon EMR — The Amazon service that provides a managed Hadoop framework
  • Terraform — A tool for setting up infrastructure using code

At the end of this post you should have an EMR 5.9.0 cluster that is set up in the Frankfurt region with the following tools:

  • Hadoop 2.7.3
  • Spark 2.2.0
  • Zeppelin 0.7.2
  • Ganglia 3.7.2
  • Hive 2.3.0
  • Hue 4.0.1
  • Oozie 4.3.0

By default EMR Spark clusters come with Apache Yarn installed as the resource manager.

We will need to set up an S3 bucket, a network, some roles , a key pair and the cluster itself. Let’s get started.

VPC setup

A VPC (Virtual private cloud) is a virtual network to which the cluster can be assigned. All nodes in the cluster will become part of a subnet within this network.

To set up a VPC in terraform fist create a VPC resource:

Then we can create a public subnet. The availability zone is generally optional, but for this exercise you should have it as some of the settings that we choose are only compatible with eu-central-1a (such as the types of instances that we use).

We then create a gateway for the public subnet.

A routing table is then needed to allow traffic to go through the gateway.

Lastly, the routing table must be assigned to the to the subnet to allow the traffic in and out from it.


Next we need to set up some roles for the EMR cluster.

First a service role is necessary. This role defines what the cluster is allowed to do within the EMR environment.

Note that EOF tags imply content with a structure. These need to have no trailing spaces, which leads to strange indentation.

This service role needs a policy attached. In this example we will simply used the default EMR role.

Next we need a role for the EMR profile.

This role is assigned the EC2 default role, which defines what the cluster is allowed to do in the EC2 environment.

Lastly the instance profile, which is used to pass the role’s details to the EC2 instances.

Key setup

Next you will need ssh keys that will allow you to ssh into the master node.

To create the ssh key and .pem file run the following command:

Enter a key name, such as cluster-key, and enter no password. Then create a pem file from the private key.

Lastly create a key pair in terraform, linking to the key that you have created


Next we need an s3 bucket. You may need more that one depending on your project requirements. In this example we will simply create one for the cluster logs.

Security groups

Next we need a security group for the master node. This security group should allow the nodes to communicate with the master node, but also to be accessed via certain ports from your personal VPN.

You can find your public IP address by simply going to this site.

Let’s assume that your public address is with subnet /16.

We also need a security group for the rest of the nodes. These nodes should only communicate internally.

Note that when you create 2 security groups ircular dependencies are created. When destroying the terraformed infrastructure in such a case, you need to delete the associations of the security groups before deleting the groups themselves. The revoke_rules_on_delete option takes care of this automatically.


Finally, now that we have all the components, we can set up the cluster.

First add the provider

Then we add the cluster itself

You can add task nodes as follows

Saving the file

Save the file as your prefered name with the extention `.tf`

Creating the cluster

To run the terraform script ensure the following:

Run the following to make sure that your setup is valid:

If there are no errors, you can run the following to create the cluster:

Destroying the cluster

To take down all the terraformed infrastructure run the following:

You can add the following to you file if you want the terraform state file to be saved to an S3 bucket. This file allows terraform to know the last state of terraforming your infrastructure (what has been created or destroyed)

Originally published at intothedepthsofdataengineering.wordpress.com on November 19, 2017.

Data scientist, software engineer, poet, writer, blogger, ammature painter

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store