Dataset interoperability between data platform components remains a difficult hurdle, and this shortcoming often results in siloed data and frustrated users. Open table formats like Apache Iceberg aim to break down these silos by providing a consistent, scalable table abstraction, but migrating a pre-existing data archive to a new format can still be daunting. This talk outlines the challenges we faced while rewriting petabytes of Shopify’s data into the Iceberg table format using the Trino engine. In this rapidly evolving landscape, I will highlight recent contributions to Trino’s Iceberg integration that made our work possible, and illustrate how we designed our system to scale. Topics include: what to consider when designing your migration strategy, how we optimized Trino’s write performance, and how to recover from corrupt table states. Finally, I will compare the query performance of the old and migrated datasets, using Shopify’s datasets as benchmarks.
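The rewrite described above can be driven by CREATE TABLE AS SELECT statements against Trino's Iceberg connector. A minimal sketch of building such a statement is shown below; the catalog, schema, table names, and partitioning column are hypothetical placeholders, not Shopify's actual configuration.

```python
def build_ctas(src, dest, partitioning=None):
    """Build a Trino CTAS statement that rewrites table `src` into an
    Iceberg table `dest`, optionally with a partitioning spec."""
    with_clause = ""
    if partitioning:
        # Trino's Iceberg connector accepts a `partitioning` table property
        # as an ARRAY of column names or partition transforms.
        cols = ", ".join(f"'{c}'" for c in partitioning)
        with_clause = f" WITH (partitioning = ARRAY[{cols}])"
    return f"CREATE TABLE {dest}{with_clause} AS SELECT * FROM {src}"

# Hypothetical source (Hive) and destination (Iceberg) tables:
sql = build_ctas("hive.analytics.orders", "iceberg.analytics.orders",
                 partitioning=["day(created_at)"])

# The statement could then be submitted with a Trino client, e.g.:
# import trino
# conn = trino.dbapi.connect(host="trino.example.com", port=443, user="etl")
# conn.cursor().execute(sql)
```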
Marc Laforet, Senior Data Engineer at Shopify
Read more: trino.io/blog/2022/12/09/trin...
Rewriting History: Migrating petabytes of data to Apache Iceberg using Trino