Hadoop Cluster Setup Using Ansible

Hello Fellow Programmers !!!

Today I would like to share what I learned while setting up a Hadoop cluster using Ansible. The project provided a base for applying multiple Ansible concepts, so it was indeed a fruitful learning experience.

Contents of this Blog

  • About Ansible
  • About Hadoop
  • Ansible Playbook for Hadoop Cluster Setup

About Ansible

What is Ansible?

Ansible is a radically simple IT automation engine that automates cloud provisioning, configuration management, application deployment, intra-service orchestration, and many other IT needs.

Why use Ansible?

1. Simple

  • No special coding skills needed
  • Tasks executed in order
  • Get productive quickly

2. Powerful

  • Configuration management
  • Workflow orchestration
  • Orchestrate the app lifecycle

3. Agentless

  • Uses OpenSSH and WinRM
  • No agents to exploit or update
  • Predictable, reliable and secure

About Hadoop

Hadoop is one of the software frameworks used to implement distributed storage. The topology used here is one Master and multiple Slaves, and the protocol used between Master and Slave to distribute files across multiple file systems is HDFS (Hadoop Distributed File System).

In Hadoop, the Master Node is also known as the NameNode, while a Slave Node is known as a DataNode. A cluster consisting of a single node is known as a Single Node Cluster, whereas one consisting of multiple nodes is a Multi-Node Cluster.

Hadoop Cluster Configuration

Ansible Playbook for Hadoop Cluster Setup
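
Before running the plays below, Ansible needs an inventory that maps the namenode and datanode groups to actual machines. A minimal sketch is shown here; the file name and IP addresses are placeholders, to be replaced with your own:

```yaml
# inventory.yml — hypothetical hosts; substitute the IPs of your nodes
all:
  children:
    namenode:
      hosts:
        192.168.1.10:
    datanode:
      hosts:
        192.168.1.11:
        192.168.1.12:
```

The playbook would then be run with something like ansible-playbook -i inventory.yml hadoop.yml.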

NameNode Setup

- hosts: namenode
  vars_files:
    - vars.yml
  tasks:
    - name: Registering Namenode Facts in namenode_ip
      setup:
      register: namenode_ip

    - name: Downloading Java JDK from URL
      get_url:
        url: "http://35.244.242.82/yum/java/el7/x86_64/jdk-8u171-linux-x64.rpm"
        dest: "/root/jdk-8u171-linux-x64.rpm"

    - name: Downloading Hadoop file from URL
      get_url:
        url: "https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1-1.x86_64.rpm"
        dest: "/root/hadoop-1.2.1-1.x86_64.rpm"

    - name: Checking if Java JDK is installed or not
      command: "rpm -q jdk1.8"
      register: check_java
      ignore_errors: yes

    - debug:
        var: check_java

    - name: Installing Java JDK
      command: "rpm -i /root/jdk-8u171-linux-x64.rpm"
      when: '"is not installed" in check_java.stdout'

    - name: Checking if Hadoop is installed or not
      command: "rpm -q hadoop"
      register: check_hadoop
      ignore_errors: yes

    - debug:
        var: check_hadoop

    - name: Installing Hadoop
      command: "rpm -i /root/hadoop-1.2.1-1.x86_64.rpm --force"
      when: '"is not installed" in check_hadoop.stdout'

    - name: Namenode Directory creation
      file:
        state: directory
        path: "{{ namenode_dir }}"
      notify: Update the changes

    - name: Setting up core-site.xml file
      template:
        dest: "/etc/hadoop/core-site.xml"
        src: "/t11_1/namenode/core-site.xml"

    - name: Setting up hdfs-site.xml file
      template:
        dest: "/etc/hadoop/hdfs-site.xml"
        src: "/t11_1/namenode/hdfs-site.xml"

    - name: Registering dummy host to pass on the namenode IP to the datanode by registering it to namenode_ip_new
      add_host:
        name: "Dummy_Host"
        namenode_ip_new: "{{ namenode_ip }}"

  handlers:
    - name: Stop the Namenode process if it's already running
      command: "hadoop-daemon.sh stop namenode"
      ignore_errors: yes
      listen: Update the changes

    - name: Namenode Formatting
      shell: "echo Y | hadoop namenode -format"
      listen: Update the changes

    - name: Starting Namenode
      command: "hadoop-daemon.sh start namenode"
      listen: Update the changes

Though the code above is largely self-explanatory, a few points deserve attention:

  • A Dummy Host is created using the "add_host" module, and the NameNode facts are stored in a variable on it. This is done so that the facts can be passed to the DataNode play, where the NameNode IP is extracted and written into the DataNode's core-site.xml file to set up the cluster.
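
Stripped of the Hadoop details, the pattern looks like this: a play on the namenode stores a value on an in-memory dummy host, and a later play on the datanodes reads it back through hostvars. This is a minimal sketch; the host and variable names simply mirror the playbook above:

```yaml
- hosts: namenode
  tasks:
    # add_host creates an in-memory host that survives across plays
    - add_host:
        name: "Dummy_Host"
        namenode_ip_new: "{{ ansible_all_ipv4_addresses[0] }}"

- hosts: datanode
  tasks:
    # any later play can read the dummy host's variables via hostvars
    - debug:
        msg: "NameNode IP is {{ hostvars['Dummy_Host']['namenode_ip_new'] }}"
```

add_host is one of the standard ways to pass data between plays targeting different host groups within a single playbook run.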

DataNode Setup

- hosts: datanode
  gather_facts: false
  vars_files:
    - vars.yml
  tasks:
    - name: Storing the IP obtained from namenode_ip_new to namenode_ip_updated
      shell: echo "{{ hostvars['Dummy_Host']['namenode_ip_new']['ansible_facts']['ansible_all_ipv4_addresses'][0] }}" | tail -1
      register: namenode_ip_updated

    - debug:
        var: namenode_ip_updated.stdout

    - name: Downloading Java JDK from URL
      get_url:
        url: "http://35.244.242.82/yum/java/el7/x86_64/jdk-8u171-linux-x64.rpm"
        dest: "/root/jdk-8u171-linux-x64.rpm"

    - name: Downloading Hadoop file from URL
      get_url:
        url: "https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1-1.x86_64.rpm"
        dest: "/root/hadoop-1.2.1-1.x86_64.rpm"

    - name: Checking if Java JDK is installed or not
      command: "rpm -q jdk1.8"
      register: check_java
      ignore_errors: yes

    - debug:
        var: check_java

    - name: Installing Java JDK
      command: "rpm -i /root/jdk-8u171-linux-x64.rpm"
      when: '"is not installed" in check_java.stdout'

    - name: Checking if Hadoop is installed or not
      command: "rpm -q hadoop"
      register: check_hadoop
      ignore_errors: yes

    - debug:
        var: check_hadoop

    - name: Installing Hadoop
      command: "rpm -i /root/hadoop-1.2.1-1.x86_64.rpm --force"
      when: '"is not installed" in check_hadoop.stdout'

    - name: Datanode Directory creation
      file:
        state: directory
        path: "{{ datanode_dir }}"
      notify: Update the changes

    - name: Setting up core-site.xml file
      template:
        dest: "/etc/hadoop/core-site.xml"
        src: "/t11_1/datanode/core-site.xml"

    - name: Setting up hdfs-site.xml file
      template:
        dest: "/etc/hadoop/hdfs-site.xml"
        src: "/t11_1/datanode/hdfs-site.xml"

  handlers:
    - name: Stop the Datanode process if it's already running
      command: "hadoop-daemon.sh stop datanode"
      ignore_errors: yes
      listen: Update the changes

    - name: Starting Datanode
      command: "hadoop-daemon.sh start datanode"
      listen: Update the changes

Again, though the code above is largely self-explanatory, one point deserves attention:

  • The NameNode IP is extracted from the facts received from the NameNode play and specified in the DataNode's core-site.xml file.
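
As a side note, the shell/echo step used above to pull the IP out of the registered facts could likely be replaced with set_fact, which indexes into the data structure directly without spawning a shell. A sketch, reusing the variable names from the playbook:

```yaml
# Hypothetical alternative to the shell + tail approach above
- name: Extract the NameNode IP without spawning a shell
  set_fact:
    namenode_ip_updated: "{{ hostvars['Dummy_Host']['namenode_ip_new']['ansible_facts']['ansible_all_ipv4_addresses'][0] }}"
```

With this variant, the core-site.xml template would reference {{ namenode_ip_updated }} directly instead of {{ namenode_ip_updated.stdout }}.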

Variable File: vars.yml (used in both the NameNode and DataNode plays)

namenode_dir: "/nnode"
datanode_dir: "/dnode"

This file specifies the values of the variables used in the Ansible playbook and the template files.

Template Files: NameNode

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>{{ namenode_dir }}</value>
  </property>
</configuration>

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ ansible_facts['all_ipv4_addresses'][0] }}:9001</value>
  </property>
</configuration>

Template Files: DataNode

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>{{ datanode_dir }}</value>
  </property>
</configuration>

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ namenode_ip_updated.stdout }}:9001</value>
  </property>
</configuration>

Output

Output of the “hadoop dfsadmin -report” command after the cluster has been set up
Hadoop Web Interface
  • The “hadoop dfsadmin -report” command can be used to check whether the DataNodes have successfully connected to the NameNode.
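
The report check can itself be automated at the end of the NameNode play, so that the playbook run finishes by printing the cluster state. A sketch:

```yaml
# Run after the cluster is up; prints the DataNode summary to the console
- name: Verify DataNodes have registered with the NameNode
  command: hadoop dfsadmin -report
  register: dfs_report

- debug:
    var: dfs_report.stdout_lines
```

In the report output, the line beginning with "Datanodes available:" summarizes how many DataNodes are live.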

Important Points

  • The NameNode should be formatted only when the filesystem has changed; otherwise formatting is not recommended.
  • Template files have been created for the configuration files of both the NameNode and the DataNode, namely hdfs-site.xml and core-site.xml.
  • To keep the playbook dynamic, a variable file (vars.yml) has been created, consisting of the variables namenode_dir and datanode_dir.
  • The firewall and SELinux can hinder cluster setup with the above code, so they should be configured accordingly.
  • The above code sets up a cluster consisting of a single NameNode and multiple DataNodes.
  • More DataNodes can be added easily using the above code.
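
For the firewall and SELinux point, tasks along the following lines could be added to both plays. This is a hedged sketch: it assumes firewalld is the active firewall on the managed nodes and uses the port 9001 configured in core-site.xml above.

```yaml
# Assumes firewalld is running on the managed nodes; port matches core-site.xml
- name: Open the HDFS port in firewalld
  firewalld:
    port: 9001/tcp
    permanent: yes
    immediate: yes
    state: enabled

- name: Put SELinux in permissive mode
  selinux:
    policy: targeted
    state: permissive
```

Alternatively, SELinux policies could be tailored for Hadoop rather than set to permissive, but that is beyond the scope of this setup.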

Thank You !!!