How to create elastic storage for the Hadoop cluster using LVM & How to automate LVM!!

Shubhambhalala
7 min read · Mar 8, 2021


As technology advances, the amount of digital data generated keeps growing, and big data tools have emerged to handle it. One of them is Apache Hadoop, a fantastic tool that is widely accepted by the industry. Even with Hadoop, some practical use-cases still need to be solved; one of them is how to scale the storage shared by a DataNode up or down on the fly.

Let's say company XYZ runs a critical business use case on a web application whose data is stored in a Hadoop cluster. One fine day the website goes viral, user data starts flooding into the cluster, and the admin suddenly gets an alert that storage in the cluster is running out. What should the admin do in this situation, keeping in mind that no customer data can be lost and the cluster must stay up for the data scientists to do their analysis?

The obvious solution is to add physical hardware, or spin up a storage instance from the cloud, and attach it to the Hadoop DataNode. But it is not as simple as it looks: done naively, we would have to shut down the cluster, format the drive, and then mount and use it. As admins we need a smarter way to handle this, and that is where Logical Volume Management (LVM) comes in. On a Linux system we can combine two or more physical disks into one logical device and have the OS treat that single resource as the hard drive from which storage is used. We can also extend or reduce this storage on the fly, without worrying about data loss or reformatting. Let's see a demo of it!

Step 1: Attaching hard drives to the DataNode

In this demo I am using a local Hadoop cluster with a single NameNode and a single DataNode, running on a VirtualBox VM of RHEL 8. I have attached two extra hard drives to the VM from the Storage section of its settings in VirtualBox.

SATA Port 1 & Port 2 show the two hard drives attached externally.

You can also verify this with the lsblk command on Linux.

sdb and sdc are the two hard drives attached
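For reference, checking from the shell looks roughly like this. The device names sdb and sdc match my VM; the sizes shown below are placeholders, not the exact values from my screenshot:

lsblk   # the two newly attached disks appear with no partitions and no mount point
# NAME  MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
# sda     8:0    0  40G  0 disk    <- existing OS disk (placeholder size)
# sdb     8:16   0  20G  0 disk    <- new disk 1
# sdc     8:32   0  20G  0 disk    <- new disk 2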

Step 2: Create PV (Physical Volume)

To create a PV we use the pvcreate command, but let's first understand how LVM works.

https://linuxhowtoguide.blogspot.com/2017/07/how-to-create-lvm-logical-volume-in.html

We start with the physical hardware, in this case the two hard disks attached to the system. From these we create PVs; this is the stage that makes the physical disks capable of being joined together. The next step is to create a VG (Volume Group), which is the single pool to which all the PVs are attached, and PVs can be added to it dynamically. Once the VG is created we can use it like a normal hard drive; the only difference is that we use the lvcreate command to create a partition (a Logical Volume) inside the VG.

Creating PV from two hard disks

We can also list them and see some more details using the pvdisplay command.
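In my case the commands were along these lines, with the device names taken from the lsblk output above:

pvcreate /dev/sdb /dev/sdc   # initialize both disks as LVM physical volumes
pvdisplay                    # show each PV's size, UUID and (not yet assigned) volume group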

Step 3: Creating VG (Volume Group)

The syntax for this is: vgcreate <vg name> <devices/drives>

We can see the details of VG by the vgdisplay command.
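For example, with a hypothetical volume group name of hadoopvg (my actual VG name may differ):

vgcreate hadoopvg /dev/sdb /dev/sdc   # pool the two PVs into a single volume group
vgdisplay hadoopvg                    # check the total and free size of the pool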

Step 4: Creating LV (Logical Volume)

We are creating an LV of size 10 GB.

The syntax for creating an LV is: lvcreate --size <size> --name <name of LV> <name of VG>
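For the 10 GB LV used in this demo, that translates to something like this (the LV and VG names are placeholders):

lvcreate --size 10G --name hadooplv hadoopvg   # carve a 10 GB logical volume out of the VG
lvdisplay /dev/hadoopvg/hadooplv               # note the device path; we format and mount it next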

Step 5: Formatting and mounting the LV

We will format it with the mkfs.ext4 command, which creates an ext4 filesystem on the volume.

Finally, we will mount it onto the folder that the DataNode shares with the cluster.

We can get the drive or the device name from the lvdisplay command.
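Continuing with the placeholder names from above, and assuming the DataNode directory is /dn (substitute your own path):

mkfs.ext4 /dev/hadoopvg/hadooplv   # create an ext4 filesystem on the LV
mkdir -p /dn                       # DataNode storage directory (hypothetical path)
mount /dev/hadoopvg/hadooplv /dn   # mount the LV on that directory
df -h /dn                          # should report roughly 10 GB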

Step 6: Starting the NameNode and DataNode

This is the configuration of the DataNode from which we are going to share the storage, and which is backed by LVM to make that storage elastic.

datanode configuration

This is the configuration of the NameNode, i.e. the master node.

namenode configuration
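For reference, a minimal Hadoop 1.x style configuration for such a setup looks roughly like this; the conf directory, port and paths are placeholders and not the exact values from my screenshots:

# On the DataNode: point dfs.data.dir at the LVM-backed mount
cat > /etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>
EOF
# On both nodes: tell Hadoop where the NameNode listens
cat > /etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<namenode-ip>:9001</value>
  </property>
</configuration>
EOF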

Starting the NameNode using the hadoop-daemon script.

We use the jps command to check whether the node is up or not.

Similarly, we will also start the datanode.
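On the respective nodes, the daemons are started and verified like this (assuming the Hadoop binaries are already on the PATH):

hadoop-daemon.sh start namenode   # run on the master node
hadoop-daemon.sh start datanode   # run on the data node
jps                               # should list NameNode / DataNode among the running Java processes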

After starting the nodes everything works fine, but there is one mistake here: we have not added an fstab entry for the LVM mount, so it would not survive a reboot. Let's create that entry now.

Here, we have used the UUID of the LV instead of its name, because there is still a small chance that the device name might change later.
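A sketch of that entry (the UUID below is a placeholder; read the real one with blkid):

blkid /dev/hadoopvg/hadooplv   # prints the filesystem UUID of the LV
# then append a line like this to /etc/fstab:
# UUID=0f3c2a1b-xxxx-xxxx-xxxx-placeholder  /dn  ext4  defaults  0  0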

Now, from the NameNode, we can check whether our LVM setup worked. To do so we use the command hadoop dfsadmin -report, which gives all the details about the cluster, and in particular the storage shared by the DataNode.
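Run on the NameNode, the check is simply:

hadoop dfsadmin -report   # the configured/present capacity should reflect the ~10 GB mount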

Right now we can see that the storage shared is about 10 GB, matching the LV we created and mounted; it is slightly less because of filesystem metadata. So we are assured that everything is working. Next, we will increase the size of the shared drive on the fly and check whether the change is reflected in this report.

Using the lvextend command we can extend the size of the LV on the fly, without interrupting the cluster or any process running on it. The syntax is: lvextend --size +<size> <device or drive or LV>

After that we also have to resize the filesystem to use the new space, and this too is done on the fly; the amazing part of this concept is that we effectively extend a hard drive without unmounting it. For this we use the resize2fs command. The syntax is: resize2fs <device or drive or LV name/path>
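For example, to grow the volume by 5 GB using the placeholder names from earlier (the amount here is my choice for illustration):

lvextend --size +5G /dev/hadoopvg/hadooplv   # grow the LV while it stays mounted
resize2fs /dev/hadoopvg/hadooplv             # grow the ext4 filesystem to fill the extended LV
df -h /dn                                    # the mount should now show roughly 15 GB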

Now, let’s check whether the update is reflected on the namenode or not.

Hurray!!! It is perfectly reflected.

So, we have finally created a scalable Hadoop cluster.

Now, I have also developed Python scripts for automating LVM, so you don't have to worry about learning the commands; I have got you covered. There are three scripts: one to create an LVM setup, a second to delete it, and a third to extend it.

autolvm.py

As you can see, we just have to provide some basic information to build it, and the rest is handled by the code. You don't have to memorize the commands.
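A typical run, assuming Python 3 and root privileges (the exact prompts come from the script itself):

sudo python3 autolvm.py   # interactively asks for the basic information described above, then runs the LVM commands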

autolvmextend.py

Similarly, we just have to give the device path and the size to increase by, and it will grow the LVM volume on the fly.

autolvmdelete.py

As we can see in the image below, our devices/drives are currently used by LVM. After running this script we will check whether they are still associated with LVM or not.

As we can see, after running the Python script the association of the drives with LVM is removed.

Here is the GitHub link for these Python files: https://github.com/AnonMrNone/lvm-automation
