Distributed computing frameworks often require the creation of a set of machines that work with one another to perform some sort of large-scale data processing.
This cluster of computing resources can be provided through a variety of services. For instance, Amazon Web Services offers a service called EMR, which allows the user to quickly spin up a Spark cluster and configure how many machines will make up the cluster. In other cases, programmers might use tools, such as Terraform, to programmatically create and configure a set of machines that can communicate with one another.
For the beginner, it might be helpful to learn how to create a cluster of instances using the user-friendly web console utilities provided by AWS, then access those machines from your laptop and, finally, configure them to communicate with one another.
This blog will take you through the steps for how to create a cluster of four machines on AWS and then ensure that, from one of those machines (designated the master), you’ll be able to connect via
ssh to the other (worker) machines.
To complete the steps, you must have a Unix laptop with
ssh access and be familiar with ssh (and ssh-keygen), public and private IP addresses, DNS, VPCs, subnets, AWS EC2 key-pairs (or .pem files) and other concepts. You don’t need to have a deep understanding of each and every concept to complete the steps, but it would be helpful.
Also note that, because this blog relies heavily on the AWS web console as it exists now in May 2020, there’s a chance AWS could make modifications to its web console that would render some of the details listed here obsolete.
Configure the machines
Log in to AWS, navigate to the “EC2 Dashboard” and click on “Launch Instance.” You’ll then be prompted to choose an AMI. In our case, we’ll scroll down until we get to an Ubuntu AMI and select that.
Next, you’ll have to choose the instance type. You can peruse the specs and the costs here and choose what’s best for you. For this exercise, I’m assuming this cluster would be used to do a fairly large amount of data processing, so I chose m5a.large instances, which have 2 virtual CPUs and 8GB of memory.
Next, because I’m planning on creating a 4-node Hadoop cluster with one master and three workers, I set the number of (ec2) instances to 4 and designate a VPC and subnet.
Next, you’ll have to add storage, and you can take a rough guess of how much you might need per machine at one time. If, as an example, you think you’ll be storing around 100GB across the cluster at any one time, it might be enough to provision around 30GB of storage for each of the four instances. Keep in mind that you’ll be paying per GB-month of storage you provision; as of this writing that was about $0.10 for an SSD volume and about half that for an older-generation magnetic volume.
The next step involves adding key-value tags if you want a way to quickly identify the four machines you are about to spin up (e.g., Hadoop cluster, Spark cluster).
Associate security group
After that, you’ll be asked to associate a security group with your cluster of machines. For the purposes of this section, let’s set up a brand new security group and associate that with our instances. It’s important for you to correctly set up your security group so that you can connect to the new machines you are about to spin up while still not making your cluster publicly accessible to the hackers of the world.
When you choose ‘My IP’ under the ‘Source’ column, AWS will automatically identify the IP address that you’re coming from and only allow traffic coming from that IP address to access machines associated with that security group (note if you move from work to home or a new location, you may need to update that IP address or add a new line).
Later we’ll update this newly created security group so that the machines in your cluster can communicate with one another.
Once you’ve configured your security group, you’re ready to launch your instances. Note the message at the top, which is letting you know that what you are about to do is going to cost you money.
Associate .pem file
After pressing Launch, the next screen should allow you to associate an existing key pair or create a new one. The private half of a key pair is a file ending in .pem that holds the credentials allowing you to ssh into the instances that you just spun up. If you created a key pair previously, you can go ahead and use that; otherwise, create a new one. Just be sure to download the .pem file to your laptop and make sure that it has the right file permissions so that you’ll be able to connect to that instance from your laptop.
In the case illustrated above, the key pair that I associated with my new instances is named
hoa-nguyen.pem. If that .pem file was new, I’d save the file and move it to the
~/.ssh directory on my laptop and ensure it has the right read-only permissions (e.g.,
chmod 400 ~/.ssh/hoa-nguyen.pem). If I don’t change the permissions, the next time I try to access my new instances via
ssh with that pem file, I’d most likely receive permission errors.
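The permission step can be wrapped in a small helper that also confirms the result. This is only a sketch: secure_key is a hypothetical function name, and the fallback between two stat invocations is there because the GNU (Linux) and BSD (macOS) versions of stat take different flags.

```shell
# secure_key is a hypothetical helper, not part of ssh or AWS.
# It makes the key file read-only for its owner and prints the resulting mode.
secure_key() {
  key_file="$1"
  chmod 400 "$key_file" || return 1
  # GNU stat (Linux) uses -c; BSD stat (macOS) uses -f.
  stat -c '%a' "$key_file" 2>/dev/null || stat -f '%Lp' "$key_file"
}

# Usage (path taken from the example above):
# secure_key ~/.ssh/hoa-nguyen.pem
```

If the helper prints 400, the file is readable only by you, which is what ssh expects of a private key.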
Returning to the AWS web console, once you press “Launch instances”, navigate to your EC2 Dashboard and click on ‘Instances’; you should see the four instances that you just created.
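For reference, the console choices made so far (Ubuntu AMI, m5a.large, four instances, about 30GB of storage each, a tag, a security group and a key pair) map roughly onto a single AWS CLI call. The sketch below assumes the AWS CLI is installed and configured. The function name launch_cluster, the AWS_CMD override, the tag key and value, and the /dev/sda1 root device name are all assumptions for illustration, and the IDs are placeholders you would supply yourself.

```shell
# launch_cluster is a hypothetical wrapper around the AWS CLI's run-instances command.
# Arguments: AMI ID, key pair name, security group ID, subnet ID (placeholders you supply).
# AWS_CMD is an assumed override so the call can be dry-run, e.g. AWS_CMD=echo.
launch_cluster() {
  ami_id="$1"; key_name="$2"; sg_id="$3"; subnet_id="$4"
  # /dev/sda1 is assumed to be the AMI's root device; check your AMI's details.
  "${AWS_CMD:-aws}" ec2 run-instances \
    --image-id "$ami_id" \
    --instance-type m5a.large \
    --count 4 \
    --key-name "$key_name" \
    --security-group-ids "$sg_id" \
    --subnet-id "$subnet_id" \
    --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=30}' \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=cluster,Value=hadoop}]'
}

# Dry run that only prints the arguments (IDs below are made-up placeholders):
# AWS_CMD=echo launch_cluster ami-EXAMPLE hoa-nguyen sg-EXAMPLE subnet-EXAMPLE
```

The web console remains the friendlier route for a first cluster; a scripted call like this mainly helps when you need to recreate the same setup later.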
To make it easier for you to identify which machine will be your master and which ones are the workers for the later stages, go ahead and name each instance as a ‘master’ or ‘worker’ as seen below.
Update security group settings
Finally, we want to go back and update the security group we created in the previous step to allow machines in the same security group to communicate with one another. See below for where you can find the security group you associated with your instances (circled in red).
Click on that security group and edit your inbound rules. Add a new rule that allows ‘All traffic’ from the same security group: note the security group ID listed at the top (it starts with ‘sg’), start typing it in the source text box and AWS should autocomplete it for you. Remember to press ‘Save rules’ for your changes to take effect.
Be advised that limiting traffic to machines in the same security group (as well as your laptop) should be adequate for the purposes of this walk-through but would not be secure enough for production.
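The two inbound rules described in this walk-through can also be expressed with the AWS CLI. This is a hedged sketch rather than the method used above: the function names and the AWS_CMD override are hypothetical, and it assumes the rule restricted to ‘My IP’ is an ssh rule on TCP port 22.

```shell
# Hypothetical helpers wrapping the AWS CLI; AWS_CMD is an assumed override for dry runs.

# Allow ssh (assumed to be TCP port 22) from a single address, e.g. your laptop's public IP.
allow_ssh_from() {
  sg_id="$1"; my_ip="$2"
  "${AWS_CMD:-aws}" ec2 authorize-security-group-ingress \
    --group-id "$sg_id" --protocol tcp --port 22 --cidr "${my_ip}/32"
}

# Allow all traffic between instances that share the security group.
allow_intra_group() {
  sg_id="$1"
  "${AWS_CMD:-aws}" ec2 authorize-security-group-ingress \
    --group-id "$sg_id" \
    --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=${sg_id}}]"
}

# Usage (the group ID and address are placeholders):
# allow_ssh_from sg-EXAMPLE 203.0.113.7
# allow_intra_group sg-EXAMPLE
```

Referencing the group's own ID as the source is what keeps the ‘All traffic’ rule limited to cluster members instead of opening it to the internet.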
Setting up passwordless ssh
Now that the machines have been provisioned, we’re going to want to
ssh into all four machines and set up passwordless ssh, which essentially is a mechanism for the master instance to communicate with the workers without having to pass around the key pair (or .pem file). Setting up passwordless ssh is a requirement for using distributed computing frameworks, such as Hadoop and Spark.
1. Connect to your master instance
Below is an example of connecting to your ‘master’ machine using
ssh and the key pair (.pem) file you associated with those machines that were just created.
$ ssh -i ~/.ssh/PEM_FILE ubuntu@MASTER_DNS
Substitute your own pem file and public DNS of the instance you marked as the master. You should be able to find the public DNS of your master instance from your Amazon web console details.
So if my .pem file was named hoa-nguyen.pem (make sure that it is indeed saved in your ~/.ssh directory) and my MASTER_DNS was ec2-35-170-172-230.compute-1.amazonaws.com, this is the command I’d execute:
$ ssh -i ~/.ssh/hoa-nguyen.pem ubuntu@ec2-35-170-172-230.compute-1.amazonaws.com
If you are unable to ssh into your machine, some things to double-check are that you have the right MASTER_DNS, that your pem file has the right file permissions (e.g., chmod 400 pem-file) and that your security group allows access from the laptop on which you are running ssh. Note that the first time you log into a machine from your laptop, ssh will warn you that the authenticity of the machine can’t be established and ask if you want to continue connecting; you’ll want to answer ‘yes’.
If you are unsure where to find the public DNS for your machine, refer to the picture below to locate that information on the AWS web console:
2. Generate an authorization key
To generate an authorization key on the master instance, you can use
$ ssh-keygen -t rsa -P ""
You’ll be prompted to enter the file in which you want to save the key; just press Enter and it’ll automatically choose the default location.
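If you would rather skip the prompt, or want to avoid overwriting a key that already exists, the generation step can be guarded. The sketch below is an assumption-laden variant of the command above, not a replacement for it: generate_key is a hypothetical helper, and it uses ssh-keygen’s -N flag (new passphrase) and -f flag (output file) so that nothing is asked interactively.

```shell
# generate_key is a hypothetical helper: it creates an RSA key pair with an empty
# passphrase at the given path, but only if no key exists there yet.
generate_key() {
  key_path="${1:-$HOME/.ssh/id_rsa}"
  if [ -f "$key_path" ]; then
    echo "Key already exists at $key_path; leaving it untouched."
    return 0
  fi
  mkdir -p "$(dirname "$key_path")"
  # -q: quiet, -N "": empty passphrase, -f: output file (skips the prompt)
  ssh-keygen -q -t rsa -N "" -f "$key_path"
}

# Usage on the master instance (writes id_rsa and id_rsa.pub):
# generate_key ~/.ssh/id_rsa
```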
Once you’ve generated your keys, append the public key to the authorized_keys file on your master instance:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Once you do this, you should be able to
ssh into your current (localhost) machine with the following command:
$ ssh localhost
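Running the append command twice leaves a duplicate line in authorized_keys, and overly loose permissions on that file are a common reason the localhost test fails. The following sketch addresses both. The name append_key is hypothetical, and the permission values (700 for the directory, 600 for the file) are the conventional ones rather than anything specific to AWS.

```shell
# append_key is a hypothetical helper, not an ssh built-in.
# Arguments: a public key file and an authorized_keys file.
append_key() {
  pub_file="$1"
  auth_file="$2"
  ssh_dir=$(dirname "$auth_file")
  mkdir -p "$ssh_dir"
  touch "$auth_file"
  # Append only if the exact key line is not already present
  # (-q quiet, -x whole-line match, -F fixed string, -f patterns from file).
  grep -q -x -F -f "$pub_file" "$auth_file" || cat "$pub_file" >> "$auth_file"
  # With default settings, sshd may ignore the file if others can write to it.
  chmod 700 "$ssh_dir"
  chmod 600 "$auth_file"
}

# Usage on the master instance:
# append_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```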
Now, you’re going to want to copy the credentials stored in
id_rsa.pub on your master instance to the other three machines in your cluster.
The easiest way is to go back to your laptop and use the pem file there to copy the master instance’s public key to the workers by issuing the following commands, making sure that you replace PEM_FILE, MASTER_PUBLIC_DNS and each WORKER_n_PUBLIC_DNS with your own values:
$ ssh -i ~/.ssh/PEM_FILE ubuntu@MASTER_PUBLIC_DNS 'cat ~/.ssh/id_rsa.pub' | ssh -i ~/.ssh/PEM_FILE ubuntu@WORKER_1_PUBLIC_DNS 'cat >> ~/.ssh/authorized_keys'
$ ssh -i ~/.ssh/PEM_FILE ubuntu@MASTER_PUBLIC_DNS 'cat ~/.ssh/id_rsa.pub' | ssh -i ~/.ssh/PEM_FILE ubuntu@WORKER_2_PUBLIC_DNS 'cat >> ~/.ssh/authorized_keys'
$ ssh -i ~/.ssh/PEM_FILE ubuntu@MASTER_PUBLIC_DNS 'cat ~/.ssh/id_rsa.pub' | ssh -i ~/.ssh/PEM_FILE ubuntu@WORKER_3_PUBLIC_DNS 'cat >> ~/.ssh/authorized_keys'
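Since the same command is repeated once per worker, it can be written as a loop. The sketch below is one way to do that from your laptop. The function name copy_master_key and the SSH_CMD override (which lets you substitute a stub while testing) are hypothetical, and the ubuntu login matches the Ubuntu AMI chosen earlier.

```shell
# copy_master_key is a hypothetical helper meant to be run from your laptop.
# Arguments: path to the .pem file, the master's public DNS, then one or more worker DNS names.
# SSH_CMD is an assumed override so a stub can stand in for ssh during testing.
copy_master_key() {
  pem_file="$1"; master="$2"; shift 2
  # Read the master's public key once.
  pub_key=$("${SSH_CMD:-ssh}" -i "$pem_file" "ubuntu@$master" 'cat ~/.ssh/id_rsa.pub') || return 1
  # Append it to authorized_keys on every worker.
  for worker in "$@"; do
    printf '%s\n' "$pub_key" |
      "${SSH_CMD:-ssh}" -i "$pem_file" "ubuntu@$worker" 'cat >> ~/.ssh/authorized_keys' || return 1
  done
}

# Usage (substitute your own values):
# copy_master_key ~/.ssh/PEM_FILE MASTER_PUBLIC_DNS WORKER_1_PUBLIC_DNS WORKER_2_PUBLIC_DNS WORKER_3_PUBLIC_DNS
```

Fetching the key once and reusing it also means the master is contacted a single time rather than once per worker.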
Once you’ve copied your public keys to your workers, ssh back into your master and from there try to ssh into each of your workers (e.g.,
ssh ubuntu@WORKER_PUBLIC_DNS) and you should be able to connect. Make sure you can reach all of your workers before you go on to the next steps.
If you encounter errors and various permission issues with trying to copy around your public keys, you may spend a lot of time trying to debug the issue. Often, the easiest workaround would be to copy-and-paste the contents of the
~/.ssh/id_rsa.pub file on the master instance and append it to the bottom of the
~/.ssh/authorized_keys file on your worker instance. This method is prone to copy-and-paste mishaps so it’s not foolproof but it might be a simpler way than debugging permission issues.
At this point, you can move to installing distributed computing frameworks and skip the rest of this section. Proceed through the rest of this section if you want to reserve Elastic IP addresses that stay associated with your four machines regardless of whether they are running or stopped.
Optional: Associate Elastic IP addresses
Remember that if you stop and start your instances via the AWS web console, the public IP addresses (and public DNS names) that were assigned to your instances when you created them will generally be released. When you restart your instances, new public IPs will be assigned, and any configuration files in which you wrote the previous public DNS must be updated to reflect those new assignments. (This will be an important lesson to remember if you proceed to installing Hadoop and Spark.) Note that private IPs will all remain unchanged and, luckily, you won’t have to redo passwordless ssh.
If you anticipate wanting to stop and start your instances frequently, you can allocate your own IP addresses and persistently associate them with your instances, or you can use a tool such as pegasus, or build your own Terraform scripts, to bring your machines up and down.
Below is a manual way to allocate and associate an Elastic IP. Start by navigating in your web console to the section for Elastic IP addresses. Press the button that says “Allocate Elastic IP address” and then “Allocate”.
Once you’ve allocated, you should now ‘Associate this Elastic IP address’:
On the next page, click the ‘Instance’ search bar and pick one of the instances you just created. Once you’ve picked the instance, click on the Private IP address search bar and it should auto-populate; select that and also choose “Allow this Elastic IP address to be reassociated” in case you want to reuse it, then click ‘Associate’.
Now repeat the same action for your three other instances.
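Repeating the allocate-and-associate clicks four times is also a natural candidate for scripting. The sketch below shows roughly how the same two steps look with the AWS CLI, assuming it is installed and configured and that your instances live in a VPC. The function name attach_elastic_ip, the AWS_CMD override and the instance IDs in the usage line are hypothetical.

```shell
# attach_elastic_ip is a hypothetical helper wrapping the AWS CLI.
# Argument: the ID of the instance that should receive a new Elastic IP.
# AWS_CMD is an assumed override so a stub can stand in for aws during testing.
attach_elastic_ip() {
  instance_id="$1"
  # Allocate an address for use in a VPC and capture its allocation ID.
  alloc_id=$("${AWS_CMD:-aws}" ec2 allocate-address --domain vpc \
    --query AllocationId --output text) || return 1
  # Associate it with the instance, allowing later reassociation.
  "${AWS_CMD:-aws}" ec2 associate-address \
    --instance-id "$instance_id" \
    --allocation-id "$alloc_id" \
    --allow-reassociation
}

# Usage (instance IDs are placeholders):
# for id in i-MASTER i-WORKER1 i-WORKER2 i-WORKER3; do attach_elastic_ip "$id"; done
```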
Congratulations, you now have four Elastic IP addresses that will persist even when your instances are down. Keep in mind that AWS charges a small hourly fee for any allocated Elastic IP address that is not associated with a running instance.
If you are unable to allocate enough elastic IP addresses, you can contact AWS support to increase the limit available to you.