OneON: Deep Das: Hadoop Spark and Big Data Framework

Blog entry by Deep Das

Hadoop and Spark: Popular Big Data Processing Frameworks

Hadoop and Spark are two of the most popular big data processing frameworks in the industry. Hadoop is most effective when the data set is larger than the available memory. It reads files from and writes files to HDFS ¹. Spark is a fast, general-purpose processing engine that is compatible with Hadoop data. It can run in Hadoop clusters through YARN or in Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat ².

Hadoop and Spark are only two of several big data processing frameworks available in the industry. Here are some of the most popular ones:

Hadoop: This open-source batch-processing framework can be used for the distributed storage and processing of big data sets ¹.
Apache Spark: This is a batch-processing framework that can also process streams, which makes it a hybrid framework ¹.
Apache Storm: This is another open-source framework that provides distributed, real-time stream processing ².
Samza: This is a distributed stream processing framework that provides fault tolerance, local state management, and scalability ².
Flink: This is an open-source stream processing framework that provides low-latency and high-throughput data processing ¹.
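Hadoop's batch model is commonly expressed as MapReduce: a map step emits key-value pairs, the framework groups them by key, and a reduce step combines each group. The following is a minimal single-machine sketch of that idea in plain Python, not Hadoop's actual API. The function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample lines are assumptions made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input; a real job would read its lines from files in HDFS.
sample_lines = ["big data big cluster", "data in the cluster"]
counts = word_count(sample_lines)
```

In a real cluster the map and reduce steps run in parallel on many machines and the grouping happens over the network. This sketch only shows the shape of the computation.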

What is Batch Processing?

Batch processing is a technique in which a large amount of data is collected first and then processed together. It suits large volumes of data that do not require real-time processing. Batch processing is typically used for tasks such as generating reports, aggregating data, and preparing data for further analysis.
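As a rough illustration of the batch idea, the sketch below aggregates a complete, already-collected set of records in a single pass and produces a small report. The function name run_batch_report, the record layout (region, amount), and the sample values are all hypothetical.

```python
from collections import defaultdict


def run_batch_report(records):
    """Aggregate a complete, already-collected set of records in one pass."""
    totals = defaultdict(float)
    for region, amount in records:
        totals[region] += amount
    return dict(totals)


# Hypothetical sales records gathered over a day, then processed together.
collected_sales = [
    ("north", 120.0),
    ("south", 80.0),
    ("north", 30.0),
    ("east", 50.0),
]
daily_report = run_batch_report(collected_sales)
```

The key property is that nothing is computed until the whole input is available, so the result arrives once per batch and not once per record.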

What is Real-time Processing?

Real-time processing handles data at a near-instant rate, which requires a constant flow of data intake and output ¹. It deals with streams of data that are captured as they occur and processed with minimal latency to generate real-time (or near-real-time) reports or automated responses ². For example, a real-time traffic monitoring solution might use sensor data to detect high traffic volumes ².
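The traffic monitoring example can be sketched as a handler that reacts to each sensor reading as soon as it arrives, so it never waits for a full data set. This is a simplified illustration only. The class name TrafficMonitor, the window size, the threshold of 100 vehicles per minute, and the sample readings are assumptions, not values from any real system.

```python
from collections import deque


class TrafficMonitor:
    """Flag high traffic using the average of the most recent readings."""

    def __init__(self, window_size=3, threshold=100):
        # deque with maxlen drops the oldest reading automatically.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def on_reading(self, vehicles_per_minute):
        """Handle one reading as it arrives; return True when traffic is high."""
        self.window.append(vehicles_per_minute)
        average = sum(self.window) / len(self.window)
        return average > self.threshold


monitor = TrafficMonitor(window_size=3, threshold=100)
# Hypothetical readings, fed in one at a time as a live sensor would deliver them.
alerts = [monitor.on_reading(reading) for reading in [60, 90, 130, 140, 150, 70]]
```

Each call does a small, bounded amount of work, which is what keeps latency low. A production system would receive readings from a message queue or socket instead of a list.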

What is a Data Stream?

A data stream is a sequence of data elements made available over time ¹. It is a continuous flow of data that is generated from various sources and can be processed in real time ². For example, sensors that monitor temperature or humidity levels generate a data stream ².
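In code, a data stream is often modeled as an iterator that yields one element at a time. The sketch below uses Python generators to stand in for a temperature sensor feed and computes a running average without ever holding the whole stream in memory. The names temperature_stream and running_average and the sample readings are hypothetical.

```python
def temperature_stream(readings):
    """Yield readings one at a time, standing in for a live sensor feed."""
    for reading in readings:
        yield reading


def running_average(stream):
    """Yield the average of all elements seen so far after each new element."""
    total = 0.0
    count = 0
    for value in stream:
        total += value
        count += 1
        yield total / count


# Hypothetical temperature readings in degrees Celsius.
averages = list(running_average(temperature_stream([20.0, 22.0, 24.0])))
```

Because running_average keeps only a total and a count, it would work the same way on a stream that never ends.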

Learn More:

1. ibm.com 2. spark.apache.org 3. codingninjas.com 4. geeksforgeeks.org 5. ibm.com 6. integrate.io 7. hadoop.apache.org

1. knowledgehut.com 2. papers.ssrn.com 3. techreviewer.co

1. hpe.com 2. learn.microsoft.com 3. techopedia.com 4. hevodata.com

1. geeksforgeeks.org 2. confluent.io 3. cloud.google.com 4. cloud.google.com 5. snowflake.com

Modified: Monday, 3 July 2023, 8:10 AM