Thursday 21 March 2013

Setting up EC2 instances for an Apache Ambari install of HDP 1.2.2





This document is an informal guide to setting up a test cluster on Amazon AWS, specifically the EC2 service. This is not a best practice guide nor is it suitable for a full PoC or production install of HDP.

Please refer to Hortonworks documentation online to get a complete set of documentation:
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.2.2/bk_using_Ambari_book/content/ambari-chap1.html

  • Created the following RHEL 6.3 64-bit instances:
m1.medium ambarimaster
m1.large hdpmaster1
m1.large hdpmaster2
m1.medium hdpslave1
m1.medium hdpslave2
m1.medium hdpslave3
  • Note: when instantiating instances, I increased the root partition to 100 GB on each of them. For long-term use, you may want to create separate volumes for each of the datanodes to store larger amounts of data; typical raw storage is 12-24 TB per slave node (see the sketch below for formatting and mounting an added volume).
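  • Note: if you do add separate EBS volumes for datanode storage, a minimal sketch for formatting and mounting one is below. It assumes the new volume shows up as /dev/xvdf (confirm with fdisk -l) and uses /grid/0 as a hypothetical mount point; adjust both for your setup.
      # run as root on the slave node after attaching the EBS volume in the AWS console
      mkfs.ext4 /dev/xvdf                                          # format the new volume (assumed device name)
      mkdir -p /grid/0                                             # hypothetical mount point for datanode storage
      echo "/dev/xvdf  /grid/0  ext4  defaults,noatime  0 0" >> /etc/fstab
      mount /grid/0                                                # mount now; the fstab entry covers reboots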
  • Note: I edited the Name column in the EC2 Instances screen to the names mentioned above so I always know which box I'm dealing with.
  • Used the following security group rules:
    Protocol    Port (Service)    Source
    ICMP        ALL               sg-79c54511 (hadoop)
    TCP         0 - 65535         sg-79c54511 (hadoop)
    TCP         22 (SSH)          0.0.0.0/0
    TCP         80 (HTTP)         0.0.0.0/0
    TCP         7180              0.0.0.0/0
    TCP         8080 (HTTP*)      0.0.0.0/0
    TCP         50000 - 50100     0.0.0.0/0
    UDP         0 - 65535         sg-79c54511 (hadoop)
  • On each and every node (using root):

    1. vim /etc/sysconfig/selinux (set SELINUX=disabled)
    2. vim /etc/sysconfig/network (set HOSTNAME=<chosen_name>.hdp.hadoop where <chosen_name> is one of the following: ambarimaster, hdpmaster1, hdpmaster2, hdpslave1, hdpslave2, hdpslave3 – depending on what EC2 instance you are on)
    3. chkconfig iptables off
    4. chkconfig ip6tables off
    5. shutdown -r now (only after the commands above are completed)
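    If you prefer to script these five steps, a rough per-node sketch follows (run as root; only the hostname value changes per node):
      # rough per-node prep sketch; set NODE_NAME to this box's chosen name
      NODE_NAME=hdpslave1.hdp.hadoop
      sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config       # /etc/sysconfig/selinux is a symlink to this file
      sed -i "s/^HOSTNAME=.*/HOSTNAME=${NODE_NAME}/" /etc/sysconfig/network
      chkconfig iptables off
      chkconfig ip6tables off
      shutdown -r now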

  • Note: when I restart a node in this manner, my external EC2 names did NOT change. They will change if you actually stop the instance. This is a separate concern from the internal IP addresses, which we will get to further on in these instructions.
  • Note: SSH on the RHEL instances has a timeout. If your session hangs, just give it a few seconds and you will get a “Write failed: Broken pipe” message; reconnect to the box and everything will be fine. Change the SSH timeout if you desire, or use the keepalive sketch below.
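  • Note: rather than changing anything server-side, one option is to have your local ssh client send keepalives. A minimal sketch, run once on your local machine (the host pattern is only an example; match it to your EC2 hostnames):
      # append a keepalive rule for EC2 hosts to your local ssh client config
      {
        echo 'Host *.compute-1.amazonaws.com'
        echo '    ServerAliveInterval 60'
        echo '    ServerAliveCountMax 3'
      } >> ~/.ssh/config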
  • Logged onto the ambarimaster ONLY: ssh-keygen -t rsa
  • On your local box (assuming a Linux/Mac laptop/workstation; if not, use Cygwin, WinSCP, FileZilla, etc. to accomplish the equivalent secure copy):
    • scp -i amuisekey.pem root@ec2-54-234-94-128.compute-1.amazonaws.com:/root/.ssh/id_rsa.pub ./
    • scp -i amuisekey.pem root@ec2-54-234-94-128.compute-1.amazonaws.com:/root/.ssh/id_rsa ./
  • Once you have your public and private key on your local box, you can distribute the public key to each node. Do this for every host except for the ambarimaster:
    • scp -i amuisekey.pem ./id_rsa.pub root@ec2-174-129-186-149.compute-1.amazonaws.com:/root/.ssh/
  • Log on to each host and copy the public key for ambarimaster into that server's authorized_keys file (or use the loop sketch below):
    • cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
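  • Note: with several nodes, a small loop from your local box can push the key and append it in one pass. A hedged sketch (the .pem path and external EC2 hostnames below are examples; substitute your own, listing every node except the ambarimaster):
      KEY=amuisekey.pem
      HOSTS="ec2-174-129-186-149.compute-1.amazonaws.com ec2-50-16-141-250.compute-1.amazonaws.com"
      for h in $HOSTS; do
          scp -i "$KEY" ./id_rsa.pub root@"$h":/root/.ssh/
          ssh -i "$KEY" root@"$h" 'cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys'
      done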
  • To confirm the passwordless ssh is working:
    1. Pick a host other than ambarimaster and determine the internal IP and keep it handy: ifconfig -a
    2. Log on to your ambarimaster and test passwordless ssh using the IP of the host you had just looked up: ssh root@10.110.35.23
    3. Confirm that you actually landed on the right host by checking the name: hostname
    4. Make sure you exit out of your remote session to the child node before continuing on the ambarimaster, or things could get confusing very fast
  • Log on to the ambarimaster and edit the hosts file:
    • On each host, check the internal ip with: ifconfig -a
    • Edit the hosts file on your ambarimaster: vim /etc/hosts
    • Edit the hosts file to look like the one below, taking into account your own internal IP addresses for each host:
127.0.0.1        localhost.localdomain localhost
::1              localhost6.localdomain6 localhost6
10.110.35.23     hdpmaster1.hdp.hadoop hdpmaster1
10.191.45.41     hdpmaster2.hdp.hadoop hdpmaster2
10.151.94.30     hdpslave1.hdp.hadoop hdpslave1
10.151.87.239    hdpslave2.hdp.hadoop hdpslave2
10.70.78.233     hdpslave3.hdp.hadoop hdpslave3
10.151.22.30     ambarimaster.hdp.hadoop ambarimaster
    • Finally, copy the /etc/hosts file from the ambarimaster to every other node:
      • scp /etc/hosts root@hdpmaster1:/etc/hosts
      • scp /etc/hosts root@hdpmaster2:/etc/hosts
      • scp /etc/hosts root@hdpslave1:/etc/hosts
      • scp /etc/hosts root@hdpslave2:/etc/hosts
      • scp /etc/hosts root@hdpslave3:/etc/hosts
  • Note: /etc/hosts is the file you will need to update if you shut down your EC2 instances and get new internal IPs. Whenever you update it, make sure every node ends up with the same copy (see the loop sketch below).
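  • Note: since every node already trusts the ambarimaster key, a small loop makes re-pushing the file (and confirming each node still answers) painless. A sketch, run on the ambarimaster after editing /etc/hosts:
      # redistribute /etc/hosts and check that each node responds under its short name
      for h in hdpmaster1 hdpmaster2 hdpslave1 hdpslave2 hdpslave3; do
          scp /etc/hosts root@"$h":/etc/hosts
          ssh root@"$h" hostname
      done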
  • On the ambarimaster only, install the Ambari yum repository:
    • cd
    • wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/GA/ambari.repo
    • cp ambari.repo /etc/yum.repos.d
    • yum install epel-release
    • yum repolist
  • Install and initialize the ambari server:
    • yum install ambari-server
    • ambari-server setup
    • ambari-server start
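    • If the server does not come up cleanly, check its status and tail the server log (the log path below is the usual default; it may differ on your install):
      ambari-server status
      tail -n 50 /var/log/ambari-server/ambari-server.log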
  • Now you can log on to Ambari. Make a note of the external hostname of your ambarimaster EC2 instance in the AWS console and go to http://<yourExternalEC2HostName>:8080 in your favorite web browser on your local host.
  • Log on to Ambari with admin/admin
  • Going through the Ambari cluster install process:
    • Name your cluster whatever you want
    • Install Options::Target Hosts – on each line, enter the fully qualified hostnames as below (do not add your ambarimaster to the list):
hdpmaster1.hdp.hadoop
hdpmaster2.hdp.hadoop
hdpslave1.hdp.hadoop
hdpslave2.hdp.hadoop
hdpslave3.hdp.hadoop
    • Install Options::Host Registration Information - Find the id_rsa (private key) file you downloaded from ambarimaster when you were setting up. Click on choose file and select this file.
    • Install Options::Advanced Options – leave these as default
    • Click Register and Confirm
    • Confirm Hosts - Wait for the ambari agents to be installed and registered on each of your nodes and click next when all have been marked success. Note that you can always add nodes at a later time, but make sure you have your two masters and at least 1 slave.
    • Choose Services – By default all services are selected. Note that you cannot go back and add services later in this version of Ambari, so choose what you want now.
    • Assign Masters – Likely the default is fine, but see below for a good setup. Note that one of the slaves will need to be a ZooKeeper instance to have an odd number for quorum.
      • hdpmaster1: NameNode, NagiosServer, GangliaCollector, HBaseMaster, ZooKeeper
      • hdpmaster2: SNameNode, JobTracker, HiveServer2, HiveMetastore, WebHCatServer, OozieServer, ZooKeeper
      • hdpslave1: ZooKeeper
    • Assign Slaves and Clients – For a demo cluster it is fine to have all of the boxes run the datanode, tasktracker, regionserver, and client libraries. If you want to expand this cluster with many more slave nodes, I would suggest running only the datanode, tasktracker, and regionserver roles on the hdpslave nodes. The clients can be installed where you like, but be sure at least one or two boxes have a client role. Click Next after you are done.
    • Customize Services – You will note that two services have red markers next to their name: Hive/HCat and Nagios.
      • Select Hive/HCat and choose a password for the hive user on the MySQL database (this stores metadata only); remember the password.
      • Select Nagios and choose your admin password. Setup your Hadoop admin email to your email (or the email of someone you don't like very much) and you can experience Hadoop alerts from your cluster! Wow.
    • Review – Take note of the Print command in the top corner. I usually save this to a pdf. Then click Deploy.
    • Get a coffee.
    • Note: you may need to refresh the web page if the installer appears stuck (this happens very occasionally depending on various browser/network situations)
    • Verify 100% installed and click Next
    • Summary – You should see that all of the Master services were installed successfully and none should have failed. Click Complete.
    • At this point the Ambari installation and the HDP cluster are complete, so you should see the Ambari Dashboard.
  • Explore Ambari:
    • Services is the main place to go for status, starting/stopping services, and configuration.
    • Heatmaps visualize per-host metrics (such as CPU, memory, and disk usage) across the cluster at a glance.
    • Hosts gives you a list of the hosts and their current status. You can also add additional hosts here.
    • Jobs is actually one of Ambari's best features. It has an excellent job inspector that lets you trace the sequence and statistics of MapReduce tasks within a larger job. The more you use Hadoop, the more you will appreciate this.
    • Admin is where you manage users for Ambari specifically (not Hadoop). Note that the current version of Ambari can be configured to use an existing LDAP to manage Ambari users.
  • Now we need to test a job out on the cluster. We will get some sample data, upload it to one of the EC2 Linux nodes, uncompress it, upload it to HDFS, and then perform a word count MapReduce job on it.

  1. Go to the following website on your local host and download a list of Wikipedia articles. Note that you can avoid registration if you like; the link is in smaller font at the bottom. The file is about 80 MB compressed. http://www.infochimps.com/datasets/wikipedia-articles
  2. Choose a host that you installed a Client service on (I chose hdpslave3) and upload the wikipedia sample data:
      scp -i amuisekey.pem ./encyclopedic.wikipedia.dbpedia.wikipedia_articles.gz root@ec2-50-16-141-250.compute-1.amazonaws.com:/tmp/
  3. Log on to hdpslave3 as root and uncompress the file (gzip is not splittable by MapReduce):
      gunzip /tmp/encyclopedic.wikipedia.dbpedia.wikipedia_articles.gz
  4. Change permissions on the uploaded file:
      chmod 777 /tmp/encyclopedic.wikipedia.dbpedia.wikipedia_articles
  5. Change users to hdfs:
      su - hdfs
  6. Upload to HDFS (moves from the hdpslave3 to the distributed file system):
      hadoop fs -put /tmp/encyclopedic.wikipedia.dbpedia.wikipedia_articles /tmp/
  7. Confirm it is in HDFS:
      hadoop fs -ls /tmp
  8. Execute the wordcount example (ensure you are the hdfs user). Details here: http://wiki.apache.org/hadoop/WordCount
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /tmp/encyclopedic.wikipedia.dbpedia.wikipedia_articles /tmp/wordcountoutput_run1
  9. After the job completes, check the Ambari Jobs tab for details on your wordcount job.
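  • To eyeball the results from the command line as well (still as the hdfs user on your client node), something like the following should work:
      hadoop fs -ls /tmp/wordcountoutput_run1                          # part file(s) plus a _SUCCESS marker
      hadoop fs -cat /tmp/wordcountoutput_run1/part-r-* | head -n 20   # sample the word counts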

  • You can leave your cluster running as long as you want, but be warned that the instances and volumes will cost you on AWS. To ensure that you will not be charged, terminate (not just stop) your instances and delete your volumes in AWS. I encourage you to keep them for a week or so as you decide how to set up your actual Hadoop PoC cluster (be it on actual hardware, virtual machines, or another cloud solution). The instances you created will be handy for reference as you install your next cluster and are generally low cost. Consult AWS documentation for details on management and pricing. Please look into Rackspace as well.


Relevant Links:

HDP Documentation:

AWS Instructions for using Putty with Linux EC2 instances:

AWS Discussion of Static IPs:


Adding a new drive to RHEL 6 (Relevant for adding volumes for Datanode storage):


Toronto Hadoop User Group:
http://www.meetup.com/TorontoHUG/

Some more details on this process with Amazon EC2 and Cygwin:
http://pthakkar.com/2013/03/installing-hadoop-apache-ambari-amazon-ec2/

