If you’ve ever had to delete a set of records for regulatory compliance, update a set of records to fix an issue in the ingestion pipeline, or apply changes from a transaction log to a fact table, you know that row-level operations are becoming critical for modern data lake workflows. Even though the industry has seen a tremendous amount of innovation in this area, row-level operations can still be fairly expensive if the underlying data has to be shuffled.
This session will explain how Apache Spark™ can completely avoid shuffles during row-level operations by leveraging storage-partitioned joins, a key technique for efficiently modifying data at petabyte scale.
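As a rough illustration of the idea, the sketch below shows how a MERGE might run shuffle-free when the target and source tables share the same storage partitioning. The `spark.sql.sources.v2.bucketing.enabled` setting is the real Spark 3.3+ switch for storage-partitioned joins over V2 data sources (such as Apache Iceberg); the table and column names are hypothetical, and exact behavior depends on the table format and Spark version.

```sql
-- Enable storage-partitioned joins for V2 data sources (Spark 3.3+).
SET spark.sql.sources.v2.bucketing.enabled = true;

-- Assumption: both tables are partitioned identically (e.g. by days(event_ts)),
-- so Spark can pair up matching storage partitions directly and skip the
-- shuffle that a regular sort-merge join would require before the MERGE.
MERGE INTO warehouse.events AS t
USING warehouse.event_updates AS s
  ON t.event_id = s.event_id
WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

The design point is that partitioning metadata from the storage layer is surfaced to the query planner, which can then prove co-partitioning and drop the exchange nodes from the plan.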
Talk by: Anton Okolnychyi and Chao Sun
Here’s more to explore:
Rise of the Data Lakehouse: dbricks.co/3NHT7CD
Lakehouse Fundamentals Training: dbricks.co/44ancQs
Connect with us: Website: databricks.com
Twitter: / databricks
LinkedIn: / databricks
Instagram: / databricksinc
Facebook: / databricksinc
Talk title: Eliminating Shuffles in Delete, Update, and Merge