What is Big Data Processing?
Big Data is a term for very large volumes of data, of many different kinds, that are generated frequently or continuously. It is often described in terms of three Vs: Volume, Velocity, and Variety.
Processing Big Data lets us extract meaningful information from the data and gain the insight we need. Traditional data processing tools are not suitable for Big Data because they were not designed for its volume, velocity, and variety.
Big Data processing techniques analyse data sets at a very large scale, typically terabytes or petabytes.
There are two kinds of processing:
- Offline
- Real-Time
Offline batch processing typically runs over the full data set with all available computing power, which makes it an efficient way to process high volumes of data. Hadoop is an example of a framework for offline batch processing. Real-time processing, on the other hand, involves continuous input, processing, and output of data, and each piece of data is processed within a very short period. The processing of transactions at bank ATMs is an example of real-time data processing.
The toughest challenge is to run fast or real-time analytics on a complete Big Data set to gain insight into it. In practice, this means scanning terabytes or petabytes of data within seconds, which can only be achieved by processing the data in parallel.
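The idea of parallel processing can be sketched on a single machine: split the data into chunks, scan the chunks concurrently, and combine the partial results. This is a minimal illustration only. The function names are hypothetical, and a thread pool stands in for the cluster of machines that a real Big Data framework would use.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_chunks):
    """Split a list into roughly equal, contiguous chunks."""
    size = max(1, -(-len(records) // num_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_matching(chunk, predicate):
    """Scan one chunk and count the records that satisfy the predicate."""
    return sum(1 for record in chunk if predicate(record))


def parallel_count(records, predicate, num_workers=4):
    """Scan all chunks concurrently, then combine the partial counts."""
    chunks = split_into_chunks(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = pool.map(lambda c: count_matching(c, predicate), chunks)
        return sum(partial_counts)


if __name__ == "__main__":
    amounts = list(range(1, 1001))  # small stand-in for a large data set
    print(parallel_count(amounts, lambda x: x % 7 == 0))
```

The same split, scan, and combine pattern is what distributed frameworks apply across many nodes, with each node scanning only the part of the data it holds.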
Big Data Processing Frameworks
Some of the popular Big Data processing frameworks are as follows:
Hadoop
Hadoop was one of the first open-source Big Data processing frameworks. It is a distributed storage and parallel processing framework.
It consists mainly of two components:
- HDFS
- MapReduce
HDFS stands for Hadoop Distributed File System and is used for Big Data storage. This file system is highly fault-tolerant, can store large amounts of data, and makes the stored data easy to access. MapReduce is a parallel processing technique for Big Data; Hadoop’s implementation is written in Java. It performs two tasks, Map and Reduce, and the reduce task always runs after the map stage.
Apart from HDFS and MapReduce, Hadoop has an ecosystem of tools and technologies for various purposes, such as Pig, Hive, ZooKeeper, Flume, and YARN. Hadoop remains a widely used Big Data processing framework.
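The Map and Reduce tasks can be illustrated with a word count, sketched here in plain Python rather than with Hadoop's Java API. The function names are hypothetical, and the explicit `shuffle` step stands in for the grouping by key that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map over every document, group by word, then reduce each group."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big data processing"]))
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread across many machines.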
Spark
Spark is another Big Data processing framework. It is not a replacement for Hadoop as a whole but is used in conjunction with it: Spark is designed as an in-memory processing engine that replaces MapReduce and fits into the Hadoop ecosystem. The resulting environment may mix tools and technologies from both ecosystems. Spark does not have its own file storage system; it commonly uses Hadoop’s HDFS for storage.
Because Spark processes data in memory, it can speed up processing many times over. Spark is therefore faster than core Hadoop MapReduce for many Big Data workloads. Note that Spark and Hadoop are not mutually exclusive.
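The difference between disk-backed and in-memory processing can be sketched in plain Python. This is not Spark or Hadoop code: the two hypothetical functions below run the same multi-stage job, one writing every intermediate result to a file and reading it back, the other keeping intermediate results in memory.

```python
import json
import os
import tempfile


def run_disk_backed(records, stages):
    """Write each stage's output to a file and read it back for the next stage."""
    with tempfile.TemporaryDirectory() as workdir:
        current = list(records)
        for index, stage in enumerate(stages):
            path = os.path.join(workdir, f"stage_{index}.json")
            with open(path, "w") as handle:
                json.dump([stage(r) for r in current], handle)
            with open(path) as handle:
                current = json.load(handle)
        return current


def run_in_memory(records, stages):
    """Keep every intermediate result in memory between stages."""
    current = list(records)
    for stage in stages:
        current = [stage(r) for r in current]
    return current


if __name__ == "__main__":
    stages = [lambda x: x * 2, lambda x: x + 1]
    print(run_disk_backed([0, 1, 2], stages) == run_in_memory([0, 1, 2], stages))
```

Both versions produce the same result; the in-memory version simply avoids the repeated writes and reads, which is where the speed-up comes from on multi-stage or iterative jobs.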
Flink
Flink is an engine for streaming dataflow that supports distributed computation over streams of data. It is both a real-time and a batch processing framework: batch jobs are treated as a special case of stream processing.
Flink provides APIs for several programming languages, such as Java, Scala, and Python. It also has its own libraries for machine learning and graph processing.
Some of the other features of Flink are:
- Low latency and high performance
- Fault-tolerant
- Support for events
- Streaming APIs with backpressure
- Stateful computations
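Two of these ideas, stateful computation and batch as a special case of streaming, can be sketched with Python generators. This is not Flink's API. The operator and the source below are hypothetical: the same stateful operator is applied first to a bounded input (a batch) and then to an unbounded one (a stream).

```python
from collections import defaultdict
from itertools import count, islice


def running_counts(events):
    """Stateful operator: yield (key, count so far) for each incoming event."""
    state = defaultdict(int)  # per-key state kept across events
    for key in events:
        state[key] += 1
        yield key, state[key]


def sensor_stream():
    """Hypothetical unbounded source that alternates between two sensor ids."""
    for i in count():
        yield "sensor-a" if i % 2 == 0 else "sensor-b"


# Batch: a bounded stream; the last count seen per key is the batch result.
batch_result = dict(running_counts(["a", "b", "a", "a"]))

# Streaming: the source never ends, so take only the first five results.
first_five = list(islice(running_counts(sensor_stream()), 5))

if __name__ == "__main__":
    print(batch_result)
    print(first_five)
```

The operator itself does not know whether its input ends, which is the sense in which a batch is just a stream that happens to be finite.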
Storm
Storm is a distributed, real-time computation system. It can be used with any programming language and makes it easy to process unbounded streams. It is highly scalable and guarantees that data will be processed. Storm has been benchmarked at over a million tuples processed per second per node. It was originally written largely in Clojure, a functional programming language.
Storm suits applications where data velocity is high, and it can also be used for real-time analytics and distributed machine learning. It can be integrated into the Hadoop ecosystem and can run on top of YARN, adding real-time stream processing to existing systems.
Some other features of Storm are:
- Fast processing benchmark
- Scalable
- Easy to operate
- Fault-tolerant
- Reliable
Samza
Samza is another distributed stream processing framework. It has the following features:
- Simple API
- Managed State
- Fault Tolerant
- Scalability
- Durability
- Pluggable API
- Processor Isolation
Samza uses Apache Kafka and Hadoop YARN to provide these features. Apache Kafka is a distributed streaming platform, whereas YARN is Hadoop’s resource manager, which allocates cluster resources to running jobs.