How Salting Can Reduce Data Skew By 99%

Spark Performance Tuning
Master the art of Spark Performance Tuning and Data Engineering in this comprehensive Apache Spark tutorial! Data skew is a common issue in big data processing, leading to performance bottlenecks by overloading some nodes while underutilizing others. This video dives deep into a practical example of data skew and demonstrates how to optimize Spark performance by using a technique called 'Salting'. Salting involves adding some randomness to the values before computing the hash for partitioning, thus distributing the data more evenly across partitions and reducing skew. With clear step-by-step explanations, you'll learn how to apply salting in practice, understand the concept behind it, and ultimately improve your data engineering skills.
📄 Complete Code on GitHub: github.com/afaqueahmad7117/sp...
🎥 Full Spark Performance Tuning Playlist: • Apache Spark Performan...
🔗 LinkedIn: / afaque-ahmad-5a5847129
Chapters:
00:00 Salting Concept
07:06 Applying Salting In Joins
12:53 Code Examples For Salting In Joins
16:56 Applying Salting In Aggregations
27:57 Code Examples For Salting In Aggregations
#dataengineering #apachespark #outofmemoryerror #bigdata #salting #dataskew #sparkperformancetuning #sparkoptimization
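The salting idea from the description can be sketched in plain Python, without a Spark cluster. This is a minimal illustration under stated assumptions, not the code from the video: `NUM_PARTITIONS`, `SALT_NUMBER`, the sample keys and the toy `partition_of` hash are all made up for the example, and Spark's real hash function is different.

```python
import random
from collections import Counter

NUM_PARTITIONS = 4  # assumed number of shuffle partitions
SALT_NUMBER = 4     # assumed salt range: salts are 0 .. SALT_NUMBER - 1


def partition_of(key, salt=0):
    """Toy stand-in for hash partitioning (NOT Spark's hash function)."""
    return (key * 31 + salt) % NUM_PARTITIONS


# Skewed left side of a join: 970 of 1,000 rows share key 1.
left_keys = [1] * 970 + [2] * 10 + [3] * 10 + [4] * 10

# Without salting, every row with key 1 lands in the same partition.
unsalted = Counter(partition_of(k) for k in left_keys)

# With salting, each row gets a random salt and is partitioned on (key, salt).
rng = random.Random(0)
salted_left = [(k, rng.randrange(SALT_NUMBER)) for k in left_keys]
salted = Counter(partition_of(k, s) for k, s in salted_left)

# The other side of the join must carry every salt for every key,
# otherwise some salted left rows would find no match.
right_keys = [1, 2, 3, 4]
salted_right = {(k, s) for k in right_keys for s in range(SALT_NUMBER)}

print("largest partition without salting:", max(unsalted.values()))
print("largest partition with salting:", max(salted.values()))
print("right-side rows after replication:", len(salted_right))
```

The replicated side grows by a factor of `SALT_NUMBER`, so the salt range is a trade-off between evening out the skewed side and enlarging the other side.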

Comments: 23

@dhavaldalasaniya
8 days ago
These are excellent Spark videos. It is a perfect explanation of Spark performance concepts.
@afaqueahmad7117
2 days ago
Many thanks @dhavaldalasaniya, this means a lot, appreciate it :)
@Wonderscope1
6 months ago
Thanks for the great content. You should have used the Salt Bae gesture when you said salting :) Is salting still a good approach if the join is between two large datasets with hundreds of millions of rows? Explode will increase the number of rows for one dataset. Let's say 100,000,000 * 200 (Salt_Number) = 20,000,000,000 rows.
@sasadsasadsad
1 month ago
Precious 30 minutes, quality content
@afaqueahmad7117
1 month ago
Thank you @sasadsasadsad, appreciate it :)
@Sandeep-bl9ji
4 months ago
Nice explanation
@gabriells9074
7 months ago
Hi Afaque, thank you for another great explanation. I have a question: since AQE splits skewed partitions into smaller ones, is salting still useful when AQE is enabled?
@user-nz7uh1qo5o
9 months ago
I have read and watched many things related to salting, but this visual explanation makes it really easy to comprehend, and it is really well articulated. Waiting for more videos to learn from :) Also, could you recommend some books or other resources that have enabled you to attain this level of knowledge? Thanks!
@afaqueahmad7117
9 months ago
Hey @user-nz7uh1qo5o, many thanks for the kind words, it means a lot to me, and I'm glad to know that the video was helpful. Most of the content is based on my work experience plus good ad-hoc content on Medium to which I could relate. My only humble suggestion is to be ruthless, get your hands dirty, question everything that's happening and search the internet if anything doesn't make sense :)
@SHUBHAM_707
1 month ago
What if the values are unique in the join, i.e. a 1-to-1 join? Will it create skew?
@MuhammadAhmad-do1sk
1 month ago
Thanks for this. Love from 🇵🇰
@afaqueahmad7117
1 month ago
Appreciate it @MuhammadAhmad-do1sk, Love from India :)
@anubhavrastogi7463
3 months ago
Hi, can you please help me understand why we are considering a salt number of 3 or 4? Should this be equal to the number of shuffle partitions that we have, or to the number of distinct values in our dataset? Please explain.
@9figurelifestyle790
9 months ago
@afaqueahmad7117 - Great topic and amazing explanation. Looking forward to learning more from you. One suggestion is to create more videos on designing idempotent data pipelines, backfilling missed window data, and simulating different production failures and how to approach them, because I see most people are doing interview-focused videos. These topics will help both entry-level and mid-level data engineers gain confidence in the data engineering field.
@afaqueahmad7117
9 months ago
Glad you liked the video and the explanation! Really appreciate your feedback. Yes, all of that is in the roadmap, but for the upcoming year. The initial plan is to cover all aspects related to Performance Tuning + Foundations.
@akshaybaura
10 months ago
Can you show us whether salting in aggregations was really worth it? I suspect that the extra shuffles introduced by salting will hurt performance.
@afaqueahmad7117
10 months ago
Hey @akshaybaura, there will indeed be a performance dip due to shuffles when using Salting, but, without Salting you're at the risk of either: a. Getting OOM (out of memory) errors. b. Your jobs running 5-10x slower because fewer resources (cores and memory) are being used while the others remain underutilised. However, even when using Salting, the performance largely depends on factors like the size of dataset and the correct use of Salt Number.
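The extra shuffle discussed above comes from salted aggregation running in two stages: first aggregate on (key, salt), then aggregate the partial results on the key alone. A minimal plain-Python sketch of that idea follows. The sample rows, `SALT_NUMBER` and the variable names are assumptions made for illustration; this is not the code from the video.

```python
import random
from collections import defaultdict

SALT_NUMBER = 3  # assumed salt range: salts are 0 .. SALT_NUMBER - 1

# Sample (key, value) rows; key "a" is the skewed key.
rows = [("a", 1)] * 9 + [("b", 5), ("c", 7)]

rng = random.Random(42)

# Stage 1: aggregate on (key, salt). The hot key's rows are split across
# up to SALT_NUMBER groups, so no single group has to hold all of them.
partial = defaultdict(int)
for key, value in rows:
    salt = rng.randrange(SALT_NUMBER)
    partial[(key, salt)] += value

# Stage 2: drop the salt and aggregate the partial sums on the key alone.
final = defaultdict(int)
for (key, _salt), subtotal in partial.items():
    final[key] += subtotal

print(dict(final))
```

This works directly for sums and counts because partial results can simply be added. An average would need the sum and the count carried through both stages and divided at the end.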
@alokranjan7323
6 months ago
How do you calculate hash(1, 0) % 3?
@vinothvk2711
5 months ago
0%3
@afaqueahmad7117
5 months ago
@vinothvk2711 is right. As outlined in the video, we're assuming h(1, 0) = 0, so it's equal to 0 % 3 = 0
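The arithmetic in this exchange can be written out as a short Python sketch. Only h(1, 0) = 0 comes from the thread; the other two hash values are hypothetical, chosen just to show that different salts can send the same key to different partitions.

```python
NUM_PARTITIONS = 3  # the "% 3" in the question

# Hash values taken as given rather than computed.
ASSUMED_HASH = {
    (1, 0): 0,   # from the thread: h(1, 0) = 0
    (1, 1): 7,   # hypothetical value
    (1, 2): 11,  # hypothetical value
}


def partition_for(key, salt):
    """Partition index = assumed hash of (key, salt) modulo partition count."""
    return ASSUMED_HASH[(key, salt)] % NUM_PARTITIONS


for key, salt in ASSUMED_HASH:
    print(f"h({key}, {salt}) % {NUM_PARTITIONS} = {partition_for(key, salt)}")
```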
@gudiatoka
1 month ago
After Spark 3.0, salting is not useful
@afaqueahmad7117
1 month ago
Hey @gudiatoka, I wish it was so, but just in case you're referring to AQE as the solution, it isn't always very helpful, so you still need to resort to salting.
@gudiatoka
1 month ago
@afaqueahmad7117 Yes, AQE and partitioning are useful. In the case of a larger dataframe, when the salting key is applied to the lower df it duplicates records, making it more skewed, so the concept of salting is not valid, at least for me... maybe it serves a different purpose.
