Big Data Processing Frameworks

What is Big Data Processing?

Big Data is a term used to describe data sets that are extremely large, varied in kind, and generated continuously or at high frequency. Big Data is often described in terms of the 3 Vs: Volume, Velocity, and Variety.

Processing Big Data gives us the ability to extract meaningful information and insight from the data. Traditional data processing tools are not suitable for Big Data because of its scale and nature.

Big Data processing techniques involve analysing data sets at a very large scale, typically measured in terabytes or petabytes.

There are two kinds of processing:

  • Offline
  • Real-Time

Typically, offline batch processing runs at full scale and full power over the complete data set and is an efficient way of processing high volumes of data; Hadoop is an example of an offline batch processing framework. In real-time processing, by contrast, there is a continuous input, processing, and output of data, and each item is processed within a very short period. The processing of transactions at bank ATMs is an example of real-time data processing.

The toughest challenge is to run fast or real-time analytics over a complete Big Data set to gain insight from it. In practice, this means scanning terabytes or petabytes of data within seconds, which can only be achieved by processing the data in parallel.


Big Data Processing Frameworks

Some of the popular Big Data processing frameworks are as follows:

Hadoop

Hadoop was the very first open-source Big Data processing framework. It provides both distributed storage and parallel processing.

It consists mainly of two components:

  • HDFS
  • MapReduce

HDFS stands for Hadoop Distributed File System and is used for Big Data storage. This file system is highly fault-tolerant, can store very large amounts of data, and makes the stored data easy to access. MapReduce is a parallel processing technique for Big Data, implemented in Java. A MapReduce job performs two tasks, Map and Reduce; the Reduce task always runs after the Map stage, as the word-count sketch below illustrates.
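
To make the two stages concrete, here is the classic word-count job written against Hadoop's Java MapReduce API. This is a minimal sketch: the input and output HDFS paths are assumed to arrive as command-line arguments, and the class and job names are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map stage: emit a (word, 1) pair for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce stage: runs after all maps finish; sums the counts per word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The Map stage emits an intermediate (word, 1) pair per word; Hadoop then groups the pairs by key and hands each group to the Reduce stage, which sums the counts. Registering the reducer as a combiner is an optional optimisation that pre-aggregates counts on each mapper node before the shuffle.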

Apart from HDFS and MapReduce, Hadoop has an ecosystem of tools and technologies such as Pig, Hive, ZooKeeper, Flume, YARN, etc., for various purposes. Hadoop is still the most widely used Big Data processing framework.

Spark

Spark is another Big Data processing framework. Spark is not a replacement for Hadoop, but is used in conjunction with it: it is designed as an in-memory processing engine to replace MapReduce, and it can be accommodated within the Hadoop ecosystem. This leads to deployment environments that mix tools and technologies from both ecosystems. Spark does not have a file storage system of its own; instead, it typically uses Hadoop's HDFS for file storage.

As Spark uses in-memory processing, it speeds up processing many times over, so Spark is faster than core Hadoop MapReduce as far as Big Data processing is concerned. It should be noted that Spark and Hadoop are not mutually exclusive; a sketch of the same word count in Spark follows.
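
For comparison with the MapReduce example above, here is the same word count expressed with Spark's Java RDD API. Again a minimal sketch: the HDFS paths are illustrative assumptions, and in practice the master URL and resources would be supplied via spark-submit.

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // SparkSession is the entry point to a Spark application.
    SparkSession spark = SparkSession.builder()
        .appName("SparkWordCount")
        .getOrCreate();

    // Read lines from HDFS; the path is an assumption for illustration.
    JavaRDD<String> lines = spark.read()
        .textFile("hdfs:///data/input.txt")
        .javaRDD();

    // The same map/reduce logic as the Hadoop job, but intermediate
    // results stay in memory instead of being written to disk.
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);

    counts.saveAsTextFile("hdfs:///data/output");
    spark.stop();
  }
}
```

Unlike a MapReduce job, the intermediate (word, 1) pairs are not spilled to disk between stages; Spark keeps them in memory across the flatMap, mapToPair, and reduceByKey steps, which is where much of its speed advantage comes from.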

Flink

Flink is an engine for streaming dataflow that facilitates distributed computation over streams of data. Flink is both a real-time and a batch processing framework; batch processing is treated as a special case of streaming in Flink.

Flink provides APIs for several programming languages, such as Java, Scala, and Python. For machine learning and graph processing, Flink has its own libraries.
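
As a small illustration of the streaming model, the following sketch uses Flink's Java DataStream API to maintain a running word count over an unbounded source. The socket host and port are assumptions for illustration; any unbounded source (e.g. a Kafka topic) could take its place.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkWordCount {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    // An unbounded text source; host and port are illustrative.
    DataStream<String> lines = env.socketTextStream("localhost", 9999);

    DataStream<Tuple2<String, Integer>> counts = lines
        .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
          for (String word : line.split("\\s+")) {
            out.collect(Tuple2.of(word, 1));
          }
        })
        .returns(Types.TUPLE(Types.STRING, Types.INT)) // needed for lambdas
        .keyBy(t -> t.f0)  // partition the stream by word
        .sum(1);           // stateful running count per word

    counts.print();
    env.execute("Streaming Word Count");
  }
}
```

The keyBy/sum pair is a stateful computation: Flink keeps a running count per word and updates it as each new element arrives, emitting an updated total every time.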

Some of the other features of Flink are:

  • Low latency and high performance
  • Fault tolerance
  • Support for event-time processing
  • Streaming APIs with backpressure
  • Stateful computations

Storm

Storm is a distributed, real-time computation system. Any programming language can be used with Storm. Unbounded streams can be processed easily using Storm; it is highly scalable and guarantees that every piece of data will be processed. Storm has a benchmark of processing over a million tuples per second per node. Storm itself is written in the functional programming language Clojure.

Storm can be used in applications where the data velocity is high, such as real-time analytics and distributed machine learning. It can be integrated into the Hadoop ecosystem and can run on top of YARN, adding real-time stream processing to existing systems; a topology sketch follows.
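
A Storm program is structured as a topology of spouts (stream sources) and bolts (processing steps). The sketch below wires a toy spout to a sentence-splitting bolt using the Java API (Storm 2.x package names assumed); the component names and parallelism hints are illustrative.

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class SentenceSplitTopology {

  // A toy spout that emits a fixed sentence; a real topology would read
  // from an external source such as Kafka.
  public static class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
      this.collector = collector;
    }

    @Override
    public void nextTuple() {
      Utils.sleep(100); // throttle the toy source
      collector.emit(new Values("big data processing with storm"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("sentence"));
    }
  }

  // A bolt that splits each sentence tuple into individual word tuples.
  public static class SplitBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
      for (String word : input.getStringByField("sentence").split("\\s+")) {
        collector.emit(new Values(word));
      }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sentences", new SentenceSpout(), 2); // 2 spout tasks
    builder.setBolt("split", new SplitBolt(), 4)           // 4 bolt tasks
           .shuffleGrouping("sentences"); // distribute tuples randomly

    StormSubmitter.submitTopology("sentence-split", new Config(),
                                  builder.createTopology());
  }
}
```

shuffleGrouping distributes tuples randomly across the four SplitBolt tasks; other groupings (for example, a fields grouping by word) control how the stream is partitioned between tasks.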

Some other features of Storm are:

  • Fast processing benchmark
  • Scalable
  • Easy to operate
  • Fault-tolerant
  • Reliable

Samza

Samza is another distributed stream processing framework; it has the following features:

  • Simple API
  • Managed State
  • Fault Tolerant
  • Scalability
  • Durability
  • Pluggable API
  • Processor Isolation

Samza uses Apache Kafka and Hadoop YARN to provide the preceding features. Apache Kafka is a distributed streaming platform, whereas YARN is used to allocate cluster resources and manage the execution of Samza jobs.
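
As an illustration, here is a minimal sketch of a Samza job using the high-level Java API with descriptors (Samza 1.x assumed). The system name, topic names, and filter condition are all assumptions for illustration; runtime configuration such as Kafka brokers and YARN settings would be supplied separately.

```java
import org.apache.samza.application.StreamApplication;
import org.apache.samza.application.descriptors.StreamApplicationDescriptor;
import org.apache.samza.operators.MessageStream;
import org.apache.samza.operators.OutputStream;
import org.apache.samza.serializers.StringSerde;
import org.apache.samza.system.kafka.descriptors.KafkaInputDescriptor;
import org.apache.samza.system.kafka.descriptors.KafkaOutputDescriptor;
import org.apache.samza.system.kafka.descriptors.KafkaSystemDescriptor;

public class FilterApp implements StreamApplication {
  @Override
  public void describe(StreamApplicationDescriptor app) {
    // Kafka provides the durable, partitioned input and output streams.
    KafkaSystemDescriptor kafka = new KafkaSystemDescriptor("kafka");

    KafkaInputDescriptor<String> pageViews =
        kafka.getInputDescriptor("page-views", new StringSerde());
    KafkaOutputDescriptor<String> filteredViews =
        kafka.getOutputDescriptor("filtered-page-views", new StringSerde());

    MessageStream<String> input = app.getInputStream(pageViews);
    OutputStream<String> output = app.getOutputStream(filteredViews);

    // Keep only messages containing a (hypothetical) marker string;
    // YARN schedules and isolates the containers that run this logic.
    input.filter(msg -> msg.contains("\"valid\":true"))
         .sendTo(output);
  }
}
```

Kafka supplies the durable input and output streams here, while YARN provides the resource allocation, fault tolerance, and processor isolation listed among Samza's features.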


