Big Data! Have you ever thought about how much data we generate, and how it is stored and managed?
The goal is to turn data into information, and information into insight. ~ Carly Fiorina, former CEO of Hewlett-Packard
Over 2.3 quintillion bytes of data are created every single day, and it's only going to grow from there. In 2020, it's estimated that more than 1.7 MB of data will be created every second for every person on Earth!
Have you ever imagined how and where this huge amount of data gets stored?
Yes, the answer seems simple: in data centers, which have hundreds of servers with great storage capacity and computing power. But it isn't that simple. Behind the scenes, there are many problems in storing such huge amounts of data. Big data has become a huge problem! It's very well said,
The world is one big data problem. ~ Andrew McAfee
What is Big Data?
In simple words, it is a field that systematically extracts information from, or otherwise deals with, data sets that are too large to be handled by traditional data storage and processing applications. Let's understand this with an example: just imagine how much data the Gmail servers have to store and manage. It's not only about how much, it's also about what type of data, and where it is kept.
Let's say Gmail used one single big device to store all this data. Here we have a problem: increasing the size of a single device hurts I/O performance, i.e., its read and write speed.
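To get a feel for the bottleneck, here is a back-of-the-envelope sketch. The throughput figure and disk count are illustrative assumptions, not measurements from any real system:

```python
# Rough, illustrative numbers: how long does it take to read 1 PB
# from a single storage device at a typical disk throughput,
# versus reading it in parallel from 1,000 disks?

PETABYTE = 10**15                    # bytes
disk_throughput = 100 * 10**6        # 100 MB/s per disk (assumed)

single_disk_seconds = PETABYTE / disk_throughput
parallel_seconds = single_disk_seconds / 1000   # 1,000 disks reading at once

print(f"one disk:    {single_disk_seconds / 86400:.1f} days")   # ~115.7 days
print(f"1,000 disks: {parallel_seconds / 3600:.1f} hours")      # ~2.8 hours
```

The absolute numbers don't matter; the point is that a single device, however large, is limited by one read/write head, while many small devices can read in parallel.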
Similarly, we encounter many such issues in Big Data. Initially, they were summarized as the 3 V's: Volume, Velocity, and Variety. As time passed, organizations faced other issues and introduced a fourth V, Veracity. This definition is actually relative to the use case an organization is facing and trying to solve.
Volume: The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not.
Velocity: The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time. Compared to small data, big data is produced more continually. Two kinds of velocity relate to big data: the frequency of generation and the frequency of handling, recording, and publishing.
Variety: The type and nature of the data. Earlier technologies like RDBMSs were capable of handling structured data efficiently and effectively. However, the shift from structured to semi-structured and unstructured data challenged the existing tools and technologies. Big Data technologies evolved with the prime intention of capturing, storing, and processing semi-structured and unstructured data (variety) generated at high speed (velocity) and huge in size (volume). Later, these tools were also applied to structured data, though mostly for storage; processing structured data remained optional, done either with big data tools or with traditional RDBMSs. This helps in analyzing data and making effective use of the hidden insights in data collected via social media, log files, sensors, etc. Big data draws from text, images, audio, and video, and it completes missing pieces through data fusion.
Veracity: An extension of the original definition, referring to data quality and data value. The quality of captured data can vary greatly, which affects accurate analysis.
Which sectors face Big Data problems?
- India: Big data analysis was tried out for the BJP to win the Indian General Election 2014.
- Israel: Personalized diabetic treatments can be created through GlucoMe's big data solution.
- The United Kingdom: By connecting the origin, location, and time of each prescription, a research unit was able to demonstrate the considerable delay between the release of any given drug and a UK-wide adoption of the National Institute for Health and Care Excellence guidelines. This suggests that new or up-to-date drugs take some time to filter through to the general patient.
- Channel 4, the British public-service television broadcaster, is a leader in the field of big data and data analysis.
- The Utah Data Center has been constructed by the United States National Security Agency. When finished, the facility will be able to handle a large amount of information collected by the NSA over the Internet. The exact amount of storage space is unknown, but more recent sources claim it will be on the order of a few exabytes. This has posed security concerns regarding the anonymity of the data collected.
- eBay.com uses two data warehouses, at 7.5 petabytes and 40 PB, as well as a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising.
- Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 it had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB. Amazon uses Hadoop in its Web Services.
- Facebook handles 50 billion photos from its user base. As of June 2017, Facebook had reached 2 billion monthly active users.
- Google was handling roughly 100 billion searches per month as of August 2012.
- Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2,560 terabytes) of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.
- Windermere Real Estate uses location information from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work at various times of the day.
- The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster.
- The Square Kilometre Array is a radio telescope built of thousands of antennas, expected to be operational by 2024. Collectively, these antennas are expected to gather 14 exabytes and store one petabyte per day. It is considered one of the most ambitious scientific projects ever undertaken.
- Sports: Big data can be used to improve training and to understand competitors, using sport sensors. It is also possible to predict the winners of a match using big data analytics, as well as the future performance of players. Thus, players' value and salary can be determined by data collected throughout the season.
How are Big Data problems solved?
The answer to this is a distributed storage system. It solves the Volume and Velocity problems very efficiently. It is a concept that works on a master-slave architecture: if we divide the data into many chunks and store them across multiple machines, we get redundancy, more storage, and faster parallel I/O at lower cost. So we have one master node that coordinates storing the data on the other worker nodes. The main question now is what type of machines to use for this. We use commodity hardware, like ordinary laptops or desktop CPUs, because it is comparatively cheap and, taken together, can deliver high I/O throughput.
In the market there are many Big Data products available, but among them, Hadoop is the most widely accepted.
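The master/worker idea above can be sketched in a few lines. This is a toy model, loosely inspired by how a distributed file system splits a file into fixed-size blocks and replicates each block on several worker nodes; the block size, worker names, and round-robin placement policy are all illustrative assumptions, not how any real system is implemented:

```python
from itertools import cycle

BLOCK_SIZE = 4      # bytes per block (tiny, just for the demo)
REPLICATION = 3     # copies kept of every block

def place_blocks(data: bytes, workers: list) -> dict:
    """Return a block-id -> worker-list mapping, as a master node might."""
    ring = cycle(workers)                 # round-robin over workers (assumed policy)
    n_blocks = (len(data) + BLOCK_SIZE - 1) // BLOCK_SIZE
    placement = {}
    for block_id in range(n_blocks):
        # Assign REPLICATION workers to hold copies of this block.
        placement[block_id] = [next(ring) for _ in range(REPLICATION)]
    return placement

workers = ["worker-1", "worker-2", "worker-3", "worker-4"]
plan = place_blocks(b"hello big data world", workers)
for block_id, nodes in plan.items():
    print(block_id, nodes)
```

Because each block lives on several workers, losing one machine loses no data, and several blocks can be read from different machines at the same time.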
What is Hadoop?
Apache Hadoop is a software framework for clustered file systems and the handling of big data. It processes big data sets by means of the MapReduce programming model.
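The MapReduce model itself is easy to see in miniature. Below is a toy word count in that style: a map phase emits (word, 1) pairs, a shuffle groups the pairs by key, and a reduce phase sums each group. Real Hadoop runs these phases in parallel across a cluster (with mappers and reducers usually written in Java); this single-process Python sketch only mirrors the model:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped values for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "hadoop handles big data"]
pairs = (pair for line in lines for pair in map_phase(line))
counts = reduce_phase(shuffle(pairs))
print(counts)   # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'handles': 1}
```

The strength of the model is that the map and reduce steps are independent per key, so Hadoop can scatter them across thousands of machines without changing the program's logic.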
Hadoop is an open-source framework that is written in Java and it provides cross-platform support.
No doubt, this is the topmost big data tool. In fact, over half of the Fortune 50 companies use Hadoop. Some of the big names include Amazon Web Services, Hortonworks, IBM, Intel, Microsoft, Facebook, etc.
- The core strength of Hadoop is its HDFS (Hadoop Distributed File System) which has the ability to hold all types of data — video, images, JSON, XML, and plain text over the same file system.
- Highly useful for R&D purposes.
- Provides quick access to data.
- Highly scalable
- Highly-available service resting on a cluster of computers
- Sometimes disk space issues can arise due to its default 3x data replication.
- I/O operations could be optimized for better performance.
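The 3x replication trade-off mentioned above is easy to quantify. A quick back-of-the-envelope sketch (the cluster size is an assumed figure, and HDFS's default replication factor of 3 is configurable):

```python
# With a replication factor of 3, every block is stored three times,
# so raw cluster capacity is not the same as usable capacity.

raw_capacity_tb = 300    # total disk across the cluster (assumed)
replication = 3          # HDFS default replication factor

usable_tb = raw_capacity_tb / replication
print(f"usable capacity: {usable_tb:.0f} TB of {raw_capacity_tb} TB raw")
```

In other words, you buy fault tolerance and parallel reads at the price of roughly two-thirds of your raw disk space.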
Thank you for reading this article with all your sincerity 😃✌ Feel free to connect with me on LinkedIn and ask any queries.