@backstreetbrogrammer
--------------------------------------------------------------------------------
Chapter 16 - Spark RDD - Closures and Shared Variables
--------------------------------------------------------------------------------
- Understanding closure concept in Spark
We need to understand the scope and life cycle of variables and methods when executing code across a cluster. RDD operations that modify variables outside their scope can be a frequent source of confusion.
Example:
The following code will NOT compile because the local variable sum is not final or effectively final, yet the lambda passed to foreach() tries to mutate it:
final var data = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
final var myRdd = sparkContext.parallelize(data);
int sum = 0;
myRdd.foreach(x -> sum += x); // WILL NOT COMPILE
System.out.printf("Total sum: %d%n", sum);
To execute jobs, Spark breaks up the processing of RDD operations into tasks, each of which is executed by an executor.
Prior to execution, Spark computes the task's closure. The closure consists of those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case, foreach()).
This closure is serialized and sent to each executor.
The variables within the closure sent to each executor are now copies and thus, when sum is referenced within the foreach() function, it's no longer the sum on the driver node.
There is still a sum in the memory of the driver node but this is no longer visible to the executors! The executors only see the copy from the serialized closure.
Thus, the final value of sum will still be zero since all operations on sum were referencing the value within the serialized closure.
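Even without shared variables, a sum like this is better expressed as a cluster-side aggregation. As a minimal sketch (assuming sparkContext is a JavaSparkContext running against a local master, names chosen here for illustration), reduce() runs on the executors and returns a single value to the driver, so no driver-side variable is ever mutated inside a closure:

```java
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ClosureSumFix {
    public static void main(String[] args) {
        final var conf = new SparkConf()
                .setAppName("ClosureSumFix")
                .setMaster("local[*]"); // assumption: local mode for the demo
        try (var sparkContext = new JavaSparkContext(conf)) {
            final var data = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
            final var myRdd = sparkContext.parallelize(data);

            // reduce() aggregates on the executors and ships one result
            // back to the driver - no shared mutable state is needed.
            final int sum = myRdd.reduce(Integer::sum);
            System.out.printf("Total sum: %d%n", sum); // prints 55
        }
    }
}
```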
However, Spark does provide two limited types of shared variables for two common usage patterns:
- broadcast variables
- accumulators
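As a hedged sketch of both patterns (again assuming a JavaSparkContext named sparkContext in local mode; the variable names are illustrative), an accumulator gives the executors a write-only counter the driver can read afterwards, while a broadcast variable ships one read-only copy of a value to each executor instead of re-serializing it with every task's closure:

```java
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SharedVariablesDemo {
    public static void main(String[] args) {
        final var conf = new SparkConf()
                .setAppName("SharedVariablesDemo")
                .setMaster("local[*]"); // assumption: local mode for the demo
        try (var sparkContext = new JavaSparkContext(conf)) {
            final var myRdd = sparkContext.parallelize(
                    List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10));

            // Accumulator: executors only add() to it; only the driver
            // reads value(). The reference itself is effectively final,
            // so the foreach() lambda now compiles.
            final var sumAcc = sparkContext.sc().longAccumulator("sum");
            myRdd.foreach(sumAcc::add);
            System.out.println("Total sum: " + sumAcc.value()); // 55

            // Broadcast: a read-only lookup table shared with all executors.
            final var labels = sparkContext.broadcast(Map.of(1, "one", 2, "two"));
            final var named = myRdd
                    .map(x -> labels.value().getOrDefault(x, "other"))
                    .collect();
            System.out.println(named);
        }
    }
}
```

Accumulator updates performed inside an action such as foreach() are applied exactly once per task; inside transformations they may be re-applied if a task is retried, so accumulators are safest in actions.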
Github: github.com/backstreetbrogramm...
- Apache Spark for Java Developers Playlist: • Apache Spark for Java ...
- Upgrade to Java 21 Playlist: • Upgrade to Java 21
- Top Java Coding Interview Problems Playlist: • Top Java Coding Interv...
- Java Serialization Playlist: • Java Serialization
- Dynamic Programming Playlist: • Dynamic Programming
#java #javadevelopers #javaprogramming #apachespark #spark