How to create elastic storage for a Hadoop cluster using LVM, and how to automate LVM!!
As technology advances, the amount of digital data generated grows with it, and big data is the field that covers this need. Apache gives us a fantastic tool for it called Hadoop. Hadoop is widely adopted in industry, and there are some use cases that still need solving in Hadoop. One such use case is: how can we scale the storage shared by a data-node up or down on the fly?
Let’s say company XYZ runs a critical business use case on a web application whose data is stored in a Hadoop cluster. One fine day the website goes viral, users’ data starts flooding into the cluster, and suddenly the admin gets an alert that storage in the cluster is running out. What should the admin do in this situation, keeping in mind that they must not lose customer data and must keep the cluster running so the data scientists can continue their analysis?
The solution for this use case is to add physical hardware, or spin up a storage instance from the cloud, and link it with the Hadoop data-node. It’s not as simple as it looks: done naively, we would need to shut down the cluster, format the drive, and then mount and use it. So, as admins, we need a smarter way to handle this, and we will use the concept of Logical Volume Management (LVM). In Linux, we can combine two or more storage devices into one and tell the OS to treat this single resource as the hard drive to draw storage from. We can also extend or shrink this storage on the fly, without worrying about data loss or reformatting. Let’s see a demo of it!
Step 1: Attaching hard drives to the data-node
In this demo, I am using a local Hadoop cluster with a single name-node and data-node running on a VirtualBox VM of RHEL 8. I have attached two hard drives via the Storage option in the VM’s settings in VirtualBox.
You can verify this with the lsblk command on Linux.
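If you are scripting this check, the lsblk output can be parsed to spot the attached disks. A minimal sketch, assuming the plain-text output of `lsblk -o NAME,SIZE,TYPE`; the sample text and sizes below are illustrative, not taken from this VM:

```python
# Sketch: parse `lsblk -o NAME,SIZE,TYPE` output to list attached disks.
def list_disks(lsblk_output):
    """Return (name, size) pairs for rows whose TYPE column is 'disk'."""
    disks = []
    for line in lsblk_output.strip().splitlines()[1:]:  # skip header row
        parts = line.split()
        if len(parts) >= 3 and parts[2] == "disk":
            disks.append((parts[0], parts[1]))
    return disks

# Illustrative sample output, not the article's actual VM:
sample = """NAME SIZE TYPE
sda  20G  disk
sda1 20G  part
sdb  8G   disk
sdc  8G   disk
"""
print(list_disks(sample))  # the two extra drives show up as sdb and sdc
```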
Step 2: Creating a PV (Physical Volume)
To create a PV we have the pvcreate command but, first, let’s understand how LVM works.
We have our physical hardware; here, two hard disks attached to the system. From these we create PVs, the stage that makes the physical hardware capable of being combined. The next step is to create a VG (Volume Group): a single pool to which all the PVs are attached, and to which we can add PVs dynamically. Once the VG is created, we can use it like a normal hard drive; the only difference is that we use the lvcreate command to create a partition (a logical volume) inside the VG.
We can also list the PVs and see more details using the pvdisplay command.
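The whole PV → VG → LV pipeline described above can be sketched as the command each stage runs. The device paths and the `hadoop_vg`/`hadoop_lv` names here are example values for illustration:

```python
# Sketch: the PV -> VG -> LV pipeline, one command per stage.
# Device paths and volume names are illustrative examples.
def lvm_plan(devices, vg="hadoop_vg", lv="hadoop_lv", size="10G"):
    """Return the command for each LVM stage, in order."""
    return [
        ["pvcreate"] + list(devices),                    # stage 1: mark drives as PVs
        ["vgcreate", vg] + list(devices),                # stage 2: pool the PVs into a VG
        ["lvcreate", "--size", size, "--name", lv, vg],  # stage 3: carve an LV out of the VG
    ]

for cmd in lvm_plan(["/dev/sdb", "/dev/sdc"]):
    print(" ".join(cmd))
```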
Step 3: Creating a VG (Volume Group)
The syntax for this is: vgcreate <vg name> <devices/drives>
We can see the details of the VG with the vgdisplay command.
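For scripting, the vgdisplay text output can be scraped into a dictionary. A minimal sketch: the sample below mimics vgdisplay’s usual two-column layout, and its names and sizes are illustrative, not this cluster’s actual values:

```python
import re

# Sketch: pull fields out of `vgdisplay` text output (label, 2+ spaces, value).
def parse_vgdisplay(text):
    fields = {}
    for line in text.splitlines():
        m = re.match(r"\s{2}(\S.*?)\s{2,}(\S.*)", line)
        if m:
            fields[m.group(1)] = m.group(2)
    return fields

# Illustrative sample, mimicking vgdisplay's layout:
sample = """  --- Volume group ---
  VG Name               hadoop_vg
  VG Size               15.99 GiB
"""
info = parse_vgdisplay(sample)
print(info["VG Name"], "|", info["VG Size"])
```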
Step 4: Creating an LV (Logical Volume)
The syntax for creating an LV is: lvcreate --size <size> --name <LV name> <VG name>
Step 5: Formatting and mounting the LV
We will format it with the mkfs.ext4 command, which creates an ext4 filesystem on the drive.
Finally, we mount it on the folder that is shared by the datanode.
We can get the drive or the device name from the lvdisplay command.
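The format-and-mount step can also be sketched as commands. The LV path follows the usual /dev/<vg>/<lv> convention with the example names from earlier, and `/dn` is a stand-in for whatever directory your datanode is configured to share:

```python
# Sketch: format the LV and mount it on the datanode directory.
# Paths and names are illustrative examples, not this cluster's values.
def format_and_mount_cmds(lv_path, mountpoint):
    return [
        ["mkfs.ext4", lv_path],          # put an ext4 filesystem on the LV
        ["mkdir", "-p", mountpoint],     # make sure the target folder exists
        ["mount", lv_path, mountpoint],  # attach the LV to that folder
    ]

for cmd in format_and_mount_cmds("/dev/hadoop_vg/hadoop_lv", "/dn"):
    print(" ".join(cmd))
```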
Step 6: Starting the name-node and data-node
This is the configuration of the datanode that shares the storage, which is backed by LVM for elastic storage.
This is the configuration for the namenode, or the master node.
Starting the namenode using the hadoop-daemon script.
We use the jps command to check whether the node is up or not.
Similarly, we will also start the datanode.
After starting the nodes, everything will work fine, but we have made one mistake: we have not added an fstab entry for the LV, so the mount would be lost on reboot. Let’s create the fstab entry.
Here we used the UUID of the LV instead of its name, because there is a small chance the device name changes later, while the UUID stays stable.
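For illustration, the fstab line has this shape (the UUID is a placeholder, take the real value from the blkid command, and `/dn` stands for your datanode directory):

```
UUID=<your-lv-uuid>  /dn  ext4  defaults  0  0
```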
Now, from the namenode, we can check whether our LVM setup worked. To do so we use the command hadoop dfsadmin -report, which gives all the details about the cluster, in particular the storage shared by each datanode.
Right now the storage shared is about 10GB, matching the LV we created and mounted; it is somewhat less because of filesystem metadata. So we are assured that everything is working. Next, we will increase the size of the shared drive on the fly and check whether the update is reflected in this command’s output.
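If you want a script to watch this value rather than eyeballing the report, the capacity line can be extracted from the report text. A sketch, assuming the report’s usual `Configured Capacity: <bytes> (<human size>)` line format; the numbers below are illustrative, not the article’s exact values:

```python
import re

# Sketch: pull "Configured Capacity" (in bytes) out of
# `hadoop dfsadmin -report` output.
def configured_capacity_bytes(report):
    m = re.search(r"Configured Capacity:\s*(\d+)", report)
    return int(m.group(1)) if m else None

# Illustrative sample lines mimicking the report format:
sample = "Configured Capacity: 10568916992 (9.84 GB)\nDFS Remaining: 10200547328 (9.5 GB)\n"
print(configured_capacity_bytes(sample))
```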
Using lvextend we can extend the size of the LV on the fly, without interrupting the cluster or the processes running in it. The syntax is: lvextend --size +<size> <LV path>
After that we also have to resize the filesystem so it can use the new space, and this too happens on the fly. Note that we resize rather than reformat, so the existing data is untouched; this concept is so amazing that we can actually grow a mounted drive live. For this we use the resize2fs command. The syntax is: resize2fs <LV path>
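The two on-the-fly steps can be sketched together. `+5G` grows the LV by 5 GiB, and resize2fs with no size argument grows the ext4 filesystem to fill the LV; the LV path is the example one used earlier:

```python
# Sketch: grow the LV, then grow the filesystem into the new space.
# The LV path and the "+5G" increment are illustrative examples.
def extend_cmds(lv_path, grow_by="+5G"):
    return [
        ["lvextend", "--size", grow_by, lv_path],  # grow the logical volume
        ["resize2fs", lv_path],                    # grow ext4 to fill the LV
    ]

for cmd in extend_cmds("/dev/hadoop_vg/hadoop_lv"):
    print(" ".join(cmd))
```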
Now, let’s check whether the update is reflected on the namenode or not.
Hurray!!! It is reflected perfectly.
So, we have finally created a scalable Hadoop cluster.
Now, I have developed Python scripts for automating LVM, so you don’t have to worry about learning the commands; I have got you covered. There are three scripts: one to create an LVM setup, a second to delete it, and a third to extend it.
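To give a feel for the approach, here is one way such a create script could be structured (a sketch, not the actual scripts from the repo linked at the end): each step shells out via subprocess, and a dry_run flag prints the commands instead of running them, which is how the example call below behaves. All names and paths are illustrative:

```python
import subprocess

# Sketch of a create-LVM automation script. dry_run=True only prints
# the commands; with dry_run=False each step would need root access.
def create_lvm(devices, vg, lv, size, mountpoint, dry_run=False):
    steps = [
        ["pvcreate"] + devices,
        ["vgcreate", vg] + devices,
        ["lvcreate", "--size", size, "--name", lv, vg],
        ["mkfs.ext4", f"/dev/{vg}/{lv}"],
        ["mount", f"/dev/{vg}/{lv}", mountpoint],
    ]
    for cmd in steps:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop on the first failure
    return steps

steps = create_lvm(["/dev/sdb", "/dev/sdc"], "hadoop_vg", "hadoop_lv",
                   "10G", "/dn", dry_run=True)
```

The delete and extend scripts follow the same pattern with their respective commands (lvremove/vgremove/pvremove, and lvextend/resize2fs).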
As you can see, we just have to provide some basic information; the rest is handled by the code. You don’t have to memorize the commands.
Similarly, for extending, we give the device path and the size to increase by, and the script grows the LVM volume on the fly.
As we can see in the image below, our devices are currently in use by LVM. After running the delete script, we will check whether they are still associated.
As we can see, after running the Python script the association of the drives with LVM is removed.
Here is the GitHub link for these python files: https://github.com/AnonMrNone/lvm-automation