Big Data! Ever thought how much data we generate? How is it stored and managed?

The goal is to turn data into information, and information into insight. ~ Carly Fiorina, former CEO of Hewlett-Packard

Over 2.3 quintillion bytes of data are created every single day, and it’s only going to grow from there. In 2020, it’s estimated that more than 1.7MB of data will be created every second for every person on Earth!

Ever wondered how and where this huge amount of data gets stored?

The obvious answer is data centers, which have hundreds of servers with great storage capacity and computing power. But it isn’t that simple. Behind the scenes, storing this much data raises many problems. Big data has become a huge problem! As the saying goes,

The world is one big data problem. ~ Andrew McAfee

What is Big Data?

In simple words, big data is the field concerned with systematically extracting information from, or otherwise dealing with, data sets that are too large to be handled by traditional data storage and processing applications. Let’s understand this with an example: just imagine how much data the Gmail servers have to store and manage. It’s not only about how much data there is, but also about what type it is and where it is stored.

Let’s say Gmail builds one single big device that can store all of this data. Here we hit a problem: increasing the size of a device hurts its I/O performance, which is basically its read and write speed. A bigger device does not read or write proportionally faster, so the more data it holds, the longer it takes to get that data in and out.
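To get a feel for why one giant device is a problem, here is a rough back-of-the-envelope sketch in Python. The numbers (10 TB of data, 200 MB/s per device, 100 devices) are made up for illustration and are not Gmail’s real figures, and read_time_seconds is a hypothetical helper, not part of any library. It also ignores network and coordination overhead, so treat the result as a best case.

```python
def read_time_seconds(data_bytes, throughput_bytes_per_s, devices=1):
    """Estimate the time to read data split evenly across devices read in parallel."""
    if devices < 1:
        raise ValueError("devices must be at least 1")
    per_device_bytes = data_bytes / devices
    return per_device_bytes / throughput_bytes_per_s


TB = 10**12  # decimal terabyte
MB = 10**6   # decimal megabyte

# Assumed, illustrative numbers: 10 TB of data, 200 MB/s per device.
single = read_time_seconds(10 * TB, 200 * MB)               # one big device
parallel = read_time_seconds(10 * TB, 200 * MB, devices=100)  # 100 small devices

print(f"one device:  {single / 3600:.1f} hours")    # one device:  13.9 hours
print(f"100 devices: {parallel / 60:.1f} minutes")  # 100 devices: 8.3 minutes
```

The per-device speed is the same in both cases. The only thing that changes is how much data each device has to move, which is the intuition behind spreading data over many machines.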

Similarly, we encounter many such issues in Big Data. Initially, they were summed up as the 3 V’s: Volume, Velocity, and Variety. As time passed, organizations ran into other issues and a fourth V, Veracity, was added. Which of these matters most depends on the use case an organization is facing and trying to solve.

Volume

The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not.

Velocity

The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real-time. Compared to small data, big data are produced more continually. Two kinds of velocity related to big data are the frequency of generation and the frequency of handling, recording, and publishing.

Variety

The type and nature of the data. Earlier technologies like RDBMSs were capable of handling structured data efficiently and effectively. However, the shift in type and nature from structured to semi-structured or unstructured data challenged the existing tools and technologies. Big Data technologies evolved with the prime intention of capturing, storing, and processing semi-structured and unstructured (variety) data that is generated at high speed (velocity) and is huge in size (volume). Later, these tools and technologies were also explored and used for handling structured data, though mainly for storage. Processing structured data remained optional, using either big data tools or traditional RDBMSs. This helps in analyzing data and making effective use of the hidden insights in data collected via social media, log files, sensors, etc. Big data draws from text, images, audio, and video, and it completes missing pieces through data fusion.
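To make the three kinds of data concrete, here is a small Python sketch using only the standard library. The sample records (names, fields, and the message text) are invented for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same shape (what an RDBMS expects).
structured = "id,name,age\n1,Asha,31\n"

# Semi-structured: self-describing, but fields can be nested or missing.
semi_structured = '{"id": 2, "name": "Ben", "tags": ["vip"], "address": {"city": "Pune"}}'

# Unstructured: free text with no schema at all.
unstructured = "Hi team, the order arrived late again and the box was damaged."

rows = list(csv.DictReader(io.StringIO(structured)))
record = json.loads(semi_structured)
words = unstructured.split()

print(rows[0]["name"])            # column lookup; note csv returns every value as a string
print(record["address"]["city"])  # nested lookup; this key may not exist in other records
print(len(words))                 # for free text, even "how many words" needs custom parsing
```

The structured row fits neatly into a table, while the other two need tools that do not assume a fixed schema.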

Veracity

Veracity extends the original definition of big data and refers to data quality and data value. The quality of captured data can vary greatly, which affects how accurate the analysis is.

Which sectors face Big Data problems?

How are Big Data problems solved?

The answer is a distributed storage system, which handles the Volume and Velocity problems very efficiently. It works on a master-slave architecture. The idea is that if we divide a chunk of data into many fragments and store them on separate machines, we get lower cost, redundancy, and more storage with better speed. So we have one master node, which helps us store the data on the other worker nodes. The main question now is what type of machine to use for these nodes. We use commodity devices, such as ordinary laptops or desktop computers, because they are comparatively cheap and, working in parallel, can deliver high I/O throughput.
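The idea of a master node keeping track of fragments stored on worker nodes can be sketched in a few lines of Python. This is a toy model under stated assumptions: the block size, the worker names, the replication factor of 2, and the simple round-robin placement are all chosen for illustration, and real systems such as Hadoop use their own placement policies.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut the data into fixed-size fragments; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, workers: List[str], replication: int) -> Dict[int, List[str]]:
    """Master-side bookkeeping: which workers hold a copy of each block.

    Copies go to consecutive workers (round-robin), so each block lives on
    `replication` different machines and survives the loss of one of them.
    """
    if replication > len(workers):
        raise ValueError("replication factor cannot exceed the number of workers")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            workers[(block_id + r) % len(workers)] for r in range(replication)
        ]
    return placement


data = b"x" * 1000                      # stand-in for a large file
blocks = split_into_blocks(data, 256)   # 4 blocks: 256, 256, 256, 232 bytes
workers = ["worker-1", "worker-2", "worker-3"]  # hypothetical node names
placement = place_blocks(len(blocks), workers, replication=2)

for block_id, holders in placement.items():
    print(block_id, len(blocks[block_id]), holders)
```

The master only stores the small placement table, while the workers store the actual bytes. Because every block has two holders, any single worker can fail without losing data.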

There are many Big Data products on the market, but among them Hadoop is widely adopted.

What is Hadoop?

Apache Hadoop is a software framework used for clustered file systems and the handling of big data. It processes big data sets by means of the MapReduce programming model.

Hadoop is an open-source framework that is written in Java and it provides cross-platform support.
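To show the shape of the MapReduce programming model, here is the classic word count written as a minimal single-process sketch in plain Python. It is not Hadoop’s Java API, and the function names (map_phase, shuffle, reduce_phase) and the two sample inputs are assumptions made for this example. In a real cluster, the map and reduce steps would run in parallel on different worker nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Two input splits, as if they were stored on two different worker nodes.
splits = ["big data is big", "data is everywhere"]

mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Each map call only looks at its own split and each reduce call only looks at one key, which is what lets the work be spread across many machines.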

Without a doubt, this is one of the top big data tools. In fact, over half of the Fortune 50 companies are reported to use Hadoop. Some of the big names include Amazon Web Services, Hortonworks, IBM, Intel, Microsoft, and Facebook.

Pros:

Cons:

Thank you for reading this article with all your sincerity 😃✌ Feel free to connect with me on LinkedIn and ask any queries.