Big Data Processing Frameworks

What is Big Data Processing?

Big Data is a term used to describe very large amounts of data of many different kinds that are generated frequently or continuously. It is often described in terms of the 3 Vs: Volume, Velocity, and Variety.

Processing Big Data gives us the ability to extract meaningful information and insights from the data. Traditional data processing tools are not suitable for this task because of the scale and nature of Big Data.

Big Data processing techniques involve analysing data sets at very large scale, often measured in terabytes or petabytes.

There are two kinds of processing:

  • Offline
  • Real-Time

Typically, offline batch processing works over the complete data set at full scale, which makes it an efficient way to process high volumes of data; Hadoop is an example of an offline batch processing system. In real-time processing, by contrast, there is a continuous input, processing, and output of data, and each item is processed within a very short period. The processing of transactions at bank ATMs is an example of real-time data processing.
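To make the distinction concrete, here is a minimal plain-Python sketch (the names and data are illustrative, not tied to any framework): the batch function sees the whole data set before producing a result, while the streaming function updates a small running state as each record arrives.

```python
records = [4, 8, 15, 16, 23, 42]

# Offline/batch: the complete data set is available before processing starts.
def batch_average(data):
    return sum(data) / len(data)

# Real-time/stream: records are processed one at a time as they arrive,
# keeping only a small running state instead of the whole input.
def stream_average(stream):
    total, count = 0.0, 0
    for record in stream:
        total, count = total + record, count + 1
        yield total / count  # an up-to-date answer after every event

print(batch_average(records))         # one result, after all data is seen
print(list(stream_average(records)))  # a result that evolves per record
```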

The toughest challenge is to perform fast or real-time analytics on a complete Big Data set. In practice, this means scanning terabytes or petabytes of data within seconds, which can only be achieved by processing the data in parallel.
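As a toy illustration of that idea, the following sketch (hypothetical data and chunk sizes) splits a data set into chunks and scans them in parallel using Python's multiprocessing module, then combines the partial results. This is, in miniature, what the frameworks below do across many machines.

```python
from multiprocessing import Pool

def scan(chunk):
    # Each worker scans its own slice of the data independently.
    return sum(x for x in chunk if x % 7 == 0)

if __name__ == "__main__":
    # Illustrative only: real frameworks split files into blocks
    # spread across many machines, not an in-memory list.
    data = list(range(1_000_000))
    chunks = [data[i:i + 250_000] for i in range(0, len(data), 250_000)]

    with Pool(processes=4) as pool:
        partials = pool.map(scan, chunks)  # the four scans run in parallel
    print(sum(partials))                   # combine the partial results
```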


Big Data Processing Frameworks

Some of the popular Big Data processing frameworks are as follows:

Hadoop

Hadoop was one of the first open-source Big Data processing frameworks. It provides distributed storage and parallel processing.

It consists of mainly two components:

  • HDFS
  • MapReduce

HDFS stands for Hadoop Distributed File System and is used for Big Data storage. The file system is highly fault-tolerant, can store very large amounts of data, and provides easy access to the stored data. MapReduce is a parallel processing technique for Big Data, implemented in Java. It performs two tasks, Map and Reduce, with the Reduce task always running after the Map stage.
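The classic illustration of the model is word counting. The sketch below is a minimal single-process imitation of the two phases in Python; real Hadoop jobs distribute the Map and Reduce phases across a cluster and handle the grouping between them.

```python
from itertools import groupby

# Map: emit a (word, 1) pair for every word in the input.
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield word, 1

# Reduce: sum the counts for each word. It runs only after the map
# output has been grouped by key (Hadoop's shuffle-and-sort step,
# imitated here with sorted() and groupby()).
def reducer(pairs):
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

lines = ["big data is big", "data needs processing"]
for word, total in reducer(mapper(lines)):
    print(word, total)
```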

Apart from HDFS and MapReduce, Hadoop has an ecosystem of tools and technologies for various purposes, such as Pig, Hive, ZooKeeper, Flume, and YARN. Hadoop is still the most widely used Big Data processing framework.

Spark

Spark is another Big Data processing framework. It is not a replacement for Hadoop but is typically used in conjunction with it: Spark is designed as an in-memory processing engine that replaces MapReduce, and it fits into the Hadoop ecosystem, leading to deployments that mix tools and technologies from both ecosystems. Spark does not include its own file storage system; instead, it commonly uses Hadoop's HDFS for storage.

Because Spark processes data in memory, it can speed up processing many times over, and it is therefore faster than core Hadoop for Big Data processing. It should be noted that Spark and Hadoop are not mutually exclusive.
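As a small illustration, here is a PySpark sketch of the same word-count task (a minimal sketch assuming the pyspark package is installed; "local[*]" runs Spark on the local machine rather than a real cluster). The intermediate results of each step stay in memory rather than being written to disk between stages, which is the source of Spark's speed advantage over MapReduce.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")   # run on local cores, not a cluster
         .appName("wordcount")
         .getOrCreate())
sc = spark.sparkContext

counts = (sc.parallelize(["big data is big", "data needs processing"])
            .flatMap(lambda line: line.split())  # split lines into words
            .map(lambda word: (word, 1))         # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b))    # sum the counts per word

print(counts.collect())
spark.stop()
```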

Flink

Flink is a streaming dataflow engine that facilitates distributed computation over streams of data. It is both a real-time and a batch processing framework: in Flink, batch processing is treated as a special case of streaming.

Flink provides APIs for several programming languages, such as Java, Scala, and Python. For machine learning and graph processing, Flink has its own libraries.
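Here is a minimal sketch using Flink's Python API (PyFlink, installed via the apache-flink package; the data and job name are illustrative):

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A bounded collection stands in for a real source here; in Flink,
# batch input is simply a stream that happens to end.
env.from_collection(["sensor-1,21.5", "sensor-2,19.8"]) \
   .map(lambda line: (line.split(",")[0], float(line.split(",")[1]))) \
   .print()

env.execute("temperature_stream")
```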

Some of the other features of Flink are:

  • Low latency and high performance
  • Fault tolerance
  • Support for event-time semantics
  • Streaming APIs with backpressure
  • Stateful computations

Storm

Storm is a distributed real-time computation system that can be used with any programming language. Unbounded streams of data can be processed easily using Storm. It is highly scalable and guarantees that every message will be processed, and it has been benchmarked at over a million tuples processed per second per node. Storm is written in Clojure, a functional programming language.

Storm can be used in applications where the data velocity is high, including real-time analytics and distributed machine learning. It can be integrated into the Hadoop ecosystem and can run on top of YARN, adding real-time stream processing to existing systems.
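As an illustration, the sketch below defines a Storm bolt in Python using the third-party streamparse library (class and field names are illustrative). A bolt receives tuples from upstream spouts or bolts, processes them, and emits new tuples downstream.

```python
from streamparse import Bolt

class WordCountBolt(Bolt):
    outputs = ["word", "count"]      # fields this bolt emits downstream

    def initialize(self, conf, ctx):
        self.counts = {}             # per-process running state

    def process(self, tup):
        word = tup.values[0]         # read the incoming tuple
        self.counts[word] = self.counts.get(word, 0) + 1
        self.emit([word, self.counts[word]])  # send the updated count on
```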

Some other features of Storm are:

  • Fast processing benchmark
  • Scalable
  • Easy to operate
  • Fault-tolerant
  • Reliable

Samza

Samza is another distributed stream processing framework, with the following features:

  • Simple API
  • Managed State
  • Fault Tolerance
  • Scalability
  • Durability
  • Pluggable API
  • Processor Isolation

Samza uses Apache Kafka and Hadoop YARN to provide the preceding features: Kafka is a distributed streaming platform that supplies the message streams, while YARN allocates cluster resources for running the processing jobs.
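Samza jobs themselves run on the JVM, but the Kafka streams they consume can be fed from any language. Here is a minimal sketch using the kafka-python package to publish events to a topic that a Samza job could then process (the broker address and topic name are assumptions):

```python
from kafka import KafkaProducer  # pip install kafka-python

# Broker address and topic name are assumptions for this sketch.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for event in [b'{"user": "alice", "action": "click"}',
              b'{"user": "bob", "action": "view"}']:
    producer.send("page-events", event)  # each message lands on the topic
                                         # that a Samza job reads as a stream

producer.flush()  # make sure the messages are sent before exiting
```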
