Hadoop Cluster Setup Using Ansible

Hello Fellow Programmers !!!

Today I would like to share what I learned while setting up a Hadoop cluster using Ansible. The project provided a base for applying multiple Ansible concepts, so it was indeed a fruitful learning experience.

Contents of this Blog

  • About Ansible
  • About Hadoop
  • Ansible Playbook for Hadoop Cluster Setup

About Ansible

Ansible is a radically simple IT automation engine that automates cloud provisioning, configuration management, application deployment, intra-service orchestration, and many other IT needs.

Why use Ansible?

1. Simple

  • Human readable automation
  • No special coding skills needed
  • Tasks executed in order
  • Get productive quickly

2. Powerful

  • App deployment
  • Configuration management
  • Workflow orchestration
  • Orchestrate the app lifecycle

3. Agentless

  • Agentless architecture
  • Uses OpenSSH and WinRM
  • No agents to exploit or update
  • Predictable, reliable and secure

About Hadoop

Hadoop is an open-source framework for distributed storage and processing of large datasets. In a Hadoop cluster, the Master Node is known as the NameNode, while the Slave Nodes are known as DataNodes. A cluster consisting of a single node is called a Single Node Cluster; a cluster spanning several machines is called a Multi Node Cluster.
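
Before diving into the playbooks, it helps to see how the managed hosts map to these Hadoop roles. Below is a minimal sketch of an Ansible inventory, assuming the group names namenode and datanode used by the playbooks later in this post; the IP addresses and user are placeholders, not from the original setup:

# inventory.ini -- hypothetical addresses; replace with your own hosts
[namenode]
192.168.1.10 ansible_user=root

[datanode]
192.168.1.11 ansible_user=root
192.168.1.12 ansible_user=root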

(Figure: Hadoop Cluster Configuration)

Ansible Playbook for Hadoop Cluster Setup

NameNode Setup

- hosts: namenode
  vars_files:
    - vars.yml
  tasks:
    - name: Registering Namenode Facts in namenode_ip
      setup:
      register: namenode_ip
    - name: Downloading Java JDK from URL
      get_url:
        url: "http://35.244.242.82/yum/java/el7/x86_64/jdk-8u171-linux-x64.rpm"
        dest: "/root/jdk-8u171-linux-x64.rpm"
    - name: Downloading Hadoop file from URL
      get_url:
        url: "https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1-1.x86_64.rpm"
        dest: "/root/hadoop-1.2.1-1.x86_64.rpm"
    - name: Checking if Java JDK is installed or not
      command: "rpm -q jdk1.8"
      register: check_java
      ignore_errors: yes
    - debug:
        var: check_java
    - name: Installing Java JDK
      command: "rpm -i /root/jdk-8u171-linux-x64.rpm"
      when: '"is not installed" in check_java.stdout'
    - name: Checking if Hadoop is installed or not
      command: "rpm -q hadoop"
      register: check_hadoop
      ignore_errors: yes
    - debug:
        var: check_hadoop
    - name: Installing Hadoop
      command: "rpm -i /root/hadoop-1.2.1-1.x86_64.rpm --force"
      when: '"is not installed" in check_hadoop.stdout'
    - name: Namenode Directory creation
      file:
        state: directory
        path: "{{ namenode_dir }}"
      notify: Update the changes
    - name: Setting up core-site.xml file
      template:
        dest: "/etc/hadoop/core-site.xml"
        src: "/t11_1/namenode/core-site.xml"
    - name: Setting up hdfs-site.xml file
      template:
        dest: "/etc/hadoop/hdfs-site.xml"
        src: "/t11_1/namenode/hdfs-site.xml"
    - name: Registering dummy host to pass on the namenode IP to the datanode by registering it to namenode_ip_new
      add_host:
        name: "Dummy_Host"
        namenode_ip_new: "{{ namenode_ip }}"
  handlers:
    - name: Stop the Namenode process if it's already running
      command: "hadoop-daemon.sh stop namenode"
      ignore_errors: yes
      listen: Update the changes
    - name: Namenode Formatting
      shell: "echo Y | hadoop namenode -format"
      listen: Update the changes
    - name: Starting Namenode
      command: "hadoop-daemon.sh start namenode"
      listen: Update the changes

Though the code above is largely self-explanatory, a few points are important to understand:

  • A change in the “Namenode Directory creation” task notifies the tasks specified under handlers, i.e., stopping any existing NameNode process, formatting the NameNode, and starting the NameNode process.
  • A dummy host is created using the “add_host” module, and the NameNode facts are stored in a variable on it. This is done to pass the facts to the DataNode play, so that the NameNode IP can be extracted and written into the core-site.xml file on the DataNode, forming the cluster (see the sketch below).
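
To see this fact-passing pattern in isolation, here is a minimal sketch of handing a value from one play to another via an in-memory dummy host. The host name and variable are illustrative, not from the original playbook:

- hosts: namenode
  tasks:
    - name: Stash a value on an in-memory dummy host
      add_host:
        name: "Dummy_Host"
        some_var: "hello from the namenode play"

- hosts: datanode
  gather_facts: false
  tasks:
    - name: Read the value back from the dummy host's hostvars
      debug:
        msg: "{{ hostvars['Dummy_Host']['some_var'] }}"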

DataNode Setup

- hosts: datanode
  gather_facts: false
  vars_files:
    - vars.yml
  tasks:
    - name: Storing the IP obtained from namenode_ip_new to namenode_ip_updated
      shell: echo "{{ hostvars['Dummy_Host']['namenode_ip_new']['ansible_facts']['ansible_all_ipv4_addresses'][0] }}" | tail -1
      register: namenode_ip_updated
    - debug:
        var: namenode_ip_updated.stdout
    - name: Downloading Java JDK from URL
      get_url:
        url: "http://35.244.242.82/yum/java/el7/x86_64/jdk-8u171-linux-x64.rpm"
        dest: "/root/jdk-8u171-linux-x64.rpm"
    - name: Downloading Hadoop file from URL
      get_url:
        url: "https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1-1.x86_64.rpm"
        dest: "/root/hadoop-1.2.1-1.x86_64.rpm"
    - name: Checking if Java JDK is installed or not
      command: "rpm -q jdk1.8"
      register: check_java
      ignore_errors: yes
    - debug:
        var: check_java
    - name: Installing Java JDK
      command: "rpm -i /root/jdk-8u171-linux-x64.rpm"
      when: '"is not installed" in check_java.stdout'
    - name: Checking if Hadoop is installed or not
      command: "rpm -q hadoop"
      register: check_hadoop
      ignore_errors: yes
    - debug:
        var: check_hadoop
    - name: Installing Hadoop
      command: "rpm -i /root/hadoop-1.2.1-1.x86_64.rpm --force"
      when: '"is not installed" in check_hadoop.stdout'
    - name: Datanode Directory creation
      file:
        state: directory
        path: "{{ datanode_dir }}"
      notify: Update the changes
    - name: Setting up core-site.xml file
      template:
        dest: "/etc/hadoop/core-site.xml"
        src: "/t11_1/datanode/core-site.xml"
    - name: Setting up hdfs-site.xml file
      template:
        dest: "/etc/hadoop/hdfs-site.xml"
        src: "/t11_1/datanode/hdfs-site.xml"
  handlers:
    - name: Stop the Datanode process if it's already running
      command: "hadoop-daemon.sh stop datanode"
      ignore_errors: yes
      listen: Update the changes
    - name: Starting Datanode
      command: "hadoop-daemon.sh start datanode"
      listen: Update the changes

Though the code above is largely self-explanatory, a few points are important to understand:

  • Similar to the NameNode, a change in the “Datanode Directory creation” task notifies the tasks specified under handlers, i.e., stopping any existing DataNode process and starting the DataNode process.
  • The NameNode IP is extracted from the facts received from the NameNode play and written into the core-site.xml file (a small alternative sketch follows below).
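
As a side note, the shell/echo indirection in the first DataNode task could likely be replaced by a plain variable assignment. Here is a hedged sketch using set_fact; this is my own variant, not part of the original playbook:

- name: Resolve the NameNode IP without shelling out
  set_fact:
    namenode_ip_updated_alt: "{{ hostvars['Dummy_Host']['namenode_ip_new']['ansible_facts']['ansible_all_ipv4_addresses'][0] }}"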

Variable File: vars.yml (used in both the NameNode and DataNode plays)

namenode_dir: "/nnode"
datanode_dir: "/dnode"

This file supplies the values of the variables used in the playbooks and template files, keeping the directory paths configurable in one place.

Template Files: NameNode

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>{{ namenode_dir }}</value>
  </property>
</configuration>

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ ansible_facts['all_ipv4_addresses'][0] }}:9001</value>
  </property>
</configuration>
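
Note that this template embeds the first IPv4 address from the gathered facts; on a host with several network interfaces this may not be the address you expect. A quick check can be done with a debug task like the one below (my own addition, not part of the original playbook):

- hosts: namenode
  tasks:
    - name: Show the address that core-site.xml would advertise
      debug:
        msg: "fs.default.name would be hdfs://{{ ansible_facts['all_ipv4_addresses'][0] }}:9001"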

Template Files: DataNode

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>{{ datanode_dir }}</value>
  </property>
</configuration>

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ namenode_ip_updated.stdout }}:9001</value>
  </property>
</configuration>

Output

(Screenshot: output of the “hadoop dfsadmin -report” command after the cluster has been set up)

(Screenshot: Hadoop Web Interface)

  • The Hadoop Web Interface can be accessed at “<NameNode’s IP Address>:50070”.
  • The “hadoop dfsadmin -report” command can be used to check whether the DataNodes have successfully connected to the NameNode (see the sketch below).
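
These checks can themselves be run through Ansible. Here is a minimal sketch of a verification play; the jps and report commands come straight from this post, while the play structure and variable names are my own:

- hosts: namenode
  tasks:
    - name: Check that the NameNode JVM process is up
      command: "jps"
      register: jps_out
    - debug:
        var: jps_out.stdout_lines
    - name: Check that the DataNodes have registered with the NameNode
      command: "hadoop dfsadmin -report"
      register: report_out
    - debug:
        var: report_out.stdout_lines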

Important Points

  • The “jps” command is used to check whether the NameNode or DataNode process has started.
  • The NameNode should be formatted only when the filesystem layout has changed; otherwise formatting is not recommended, since it discards the existing HDFS metadata.
  • Template files have been created for the configuration of both the NameNode and the DataNode, namely hdfs-site.xml and core-site.xml.
  • To keep the playbooks dynamic, a variable file, vars.yml, has been created containing the variables namenode_dir and datanode_dir.
  • The firewall and SELinux can hinder the cluster setup performed by the above code, so they may need to be configured accordingly (a sketch follows this list).
  • The above code sets up a cluster with a single NameNode and multiple DataNodes.
  • More DataNodes can be added easily by extending the datanode group in the inventory.
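
For completeness, here is a minimal sketch of opening the ports and relaxing SELinux. It assumes the target hosts run firewalld and that the firewalld and selinux modules are available (built in on older Ansible releases; on newer installs they live in the ansible.posix collection). The zone defaults and the choice of permissive mode are my assumptions; adjust them to your environment:

- hosts: all
  tasks:
    # Port 9001 is the fs.default.name port from core-site.xml;
    # 50070 is the NameNode web UI port (both taken from this post).
    - name: Open the Hadoop ports in firewalld
      firewalld:
        port: "{{ item }}"
        permanent: yes
        immediate: yes
        state: enabled
      loop:
        - "9001/tcp"
        - "50070/tcp"
    - name: Put SELinux into permissive mode for the demo setup
      selinux:
        policy: targeted
        state: permissive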

Thank You !!!