Blog entry by Deep Das
Hadoop and Spark: Popular Big Data Processing Frameworks
Hadoop and Spark are two of the most popular big data processing frameworks used in the industry. Hadoop is most effective for processing large data sets in environments where the data size exceeds available memory; it reads and writes files to HDFS (the Hadoop Distributed File System). Spark is a fast, general-purpose processing engine that is compatible with Hadoop data. It can run in Hadoop clusters through YARN or in Spark’s standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat.
Here are some key differences between the two:
- Spark covers use cases beyond Hadoop's batch model: analysis of real-time stream data, iterative algorithms that run repeated parallel operations over the same data set, and graph-parallel processing for modelling the data.
- Hadoop reads and writes files to HDFS, while Spark keeps working data in RAM where possible, using an abstraction known as an RDD (Resilient Distributed Dataset).
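To make the RDD idea more concrete, here is a minimal single-process sketch in plain Python. It is not Spark's API: the `ToyRDD` class and its `drop_partition` method are hypothetical names invented for illustration. It assumes only the general idea that a data set is split into partitions, that transformations are recorded and applied lazily, that computed partitions are kept in memory, and that a lost partition can be rebuilt from the recorded steps.

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self, partitions: List[List[int]],
                 lineage: Optional[List[Callable]] = None):
        self._source = partitions        # original input partitions
        self._lineage = lineage or []    # recorded transformations, applied lazily
        self._cache: Dict[int, List[int]] = {}  # partition index -> rows held in memory

    def map(self, fn: Callable) -> "ToyRDD":
        # A transformation only extends the lineage; nothing is computed yet.
        return ToyRDD(self._source,
                      self._lineage + [lambda rows: [fn(r) for r in rows]])

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(self._source,
                      self._lineage + [lambda rows: [r for r in rows if pred(r)]])

    def _compute(self, index: int) -> List[int]:
        # Compute a partition on first use, then serve it from memory.
        if index not in self._cache:
            rows = self._source[index]
            for step in self._lineage:
                rows = step(rows)
            self._cache[index] = rows
        return self._cache[index]

    def collect(self) -> List[int]:
        # An action: forces every partition to be computed (or read from cache).
        return [row for i in range(len(self._source)) for row in self._compute(i)]

    def drop_partition(self, index: int) -> None:
        # Simulates losing an in-memory partition, e.g. when a worker fails.
        self._cache.pop(index, None)


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
first_result = even_squares.collect()
even_squares.drop_partition(1)          # lose one cached partition
second_result = even_squares.collect()  # rebuilt from the recorded lineage
```

Both calls to `collect()` return the same rows: the second one recomputes only the dropped partition from its source data and recorded steps, which is the "resilient" part of the name in this simplified picture.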
There are several big data processing frameworks available in the industry. Here are some of the most popular ones:
- Hadoop: This open-source batch-processing framework can be used for the distributed storage and processing of big data sets.
- Apache Spark: This is a batch-processing framework that can also process streams, which makes it a hybrid framework.
- Apache Storm: This is another open-source framework that provides distributed, real-time stream processing.
- Samza: This is a distributed stream processing framework that provides fault tolerance, local state management, and scalability.
- Flink: This is an open-source stream processing framework that provides low-latency and high-throughput data processing.
What is Batch Processing?
Batch processing is a technique used in big data processing where a large amount of data is collected over a period and then processed together in a single run. It suits large volumes of data that do not require real-time processing. Batch processing is typically used for tasks such as generating reports, aggregating data, and preparing data for further analysis.
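As a small illustration of the batch pattern, the sketch below aggregates a fully collected set of records into a report. The record layout `(region, amount)` and the function name `batch_report` are assumptions made for this example, not part of any framework.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def batch_report(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Aggregate a complete batch of (region, amount) records into totals per region."""
    totals: Dict[str, float] = defaultdict(float)
    for region, amount in records:
        totals[region] += amount
    return dict(totals)


# The whole batch is collected first, then processed in one run.
collected_batch = [("north", 120.0), ("south", 80.0), ("north", 30.5)]
report = batch_report(collected_batch)
```

Nothing is produced until the full batch is available, which is acceptable for reports and aggregates but not for workloads that need an answer while the data is still arriving.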
What is Real-time Processing?
Real-time processing handles data almost as soon as it arrives, which requires a constant flow of data intake and output to keep insights current. It deals with streams of data that are captured in real time and processed with minimal latency to generate real-time (or near-real-time) reports or automated responses. For example, a real-time traffic monitoring solution might use sensor data to detect high traffic volumes.
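The traffic monitoring example can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: each reading is the timestamp (in seconds) of one vehicle passing a sensor, and "high traffic" means more than `threshold` vehicles within the last `window_seconds`. The function name and both parameters are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def detect_high_traffic(readings: Iterable[float],
                        window_seconds: float = 60.0,
                        threshold: int = 3) -> Iterator[float]:
    """Yield the timestamp of each reading that pushes the recent count above the threshold."""
    window: Deque[float] = deque()
    for timestamp in readings:
        window.append(timestamp)
        # Evict readings that have fallen out of the sliding time window.
        while window and timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) > threshold:
            yield timestamp  # react immediately, without waiting for more data


# Four vehicles within 25 seconds, then sparse traffic much later.
alerts = list(detect_high_traffic([0, 10, 20, 25, 200, 210]))
```

Each reading is handled as it arrives and an alert is emitted right away, in contrast to collecting everything first and analysing it later.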
What is a Data Stream?
A data stream is a sequence of data elements made available over time. It is a continuous flow of data that is generated from various sources and can be processed in real time. For example, a data stream can be generated from sensors that are monitoring temperature or humidity levels.
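A Python generator is a convenient way to picture a data stream, since it hands over one element at a time instead of a finished collection. The sketch below is illustrative only: `temperature_stream` stands in for a real sensor feed, and `running_average` is a hypothetical consumer that updates its result with every new element.

```python
from typing import Iterable, Iterator, Tuple


def temperature_stream(samples: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Yield (sequence_number, celsius) one element at a time, like a sensor feed."""
    for sequence_number, celsius in enumerate(samples):
        yield sequence_number, celsius


def running_average(stream: Iterable[Tuple[int, float]]) -> Iterator[float]:
    """Yield the average of all temperatures seen so far, updated per element."""
    total = 0.0
    for count, (_, celsius) in enumerate(stream, start=1):
        total += celsius
        yield total / count


averages = list(running_average(temperature_stream([20.0, 22.0, 24.0])))
```

The consumer never needs the whole data set in hand; it keeps a small amount of state (a running total and a count) and produces an updated result after each element.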