Dataset interoperability between data platform components remains a difficult hurdle, and this shortcoming often results in siloed data and frustrated users. Open table formats like Apache Iceberg aim to break down these silos by providing a consistent, scalable table abstraction, but migrating a pre-existing data archive to a new format can still be daunting. This talk outlines the challenges we faced while rewriting petabytes of Shopify’s data into the Iceberg table format using the Trino engine. In this rapidly evolving landscape, I will highlight recent contributions to Trino’s Iceberg integration that made our work possible, and illustrate how we designed our system to scale. Topics include: what to consider when designing your migration strategy, how we optimized Trino’s write performance, and how to recover from corrupt table states. Finally, I will compare the query performance of the old and migrated datasets, using Shopify’s datasets as benchmarks.
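The rewrite described above can be driven by CREATE TABLE AS SELECT statements against Trino's Iceberg connector. A minimal sketch of building such a statement is shown below; the catalog, schema, table names, and partitioning column are hypothetical placeholders, not Shopify's actual configuration.

```python
def build_ctas(src, dest, partitioning=None):
    """Build a Trino CTAS statement that rewrites table `src` into an
    Iceberg table `dest`, optionally with a partitioning spec."""
    with_clause = ""
    if partitioning:
        # Trino's Iceberg connector accepts a `partitioning` table property
        # as an ARRAY of column names or partition transforms.
        cols = ", ".join(f"'{c}'" for c in partitioning)
        with_clause = f" WITH (partitioning = ARRAY[{cols}])"
    return f"CREATE TABLE {dest}{with_clause} AS SELECT * FROM {src}"

# Hypothetical source (Hive) and destination (Iceberg) tables:
sql = build_ctas("hive.analytics.orders", "iceberg.analytics.orders",
                 partitioning=["day(created_at)"])

# The statement could then be submitted with a Trino client, e.g.:
# import trino
# conn = trino.dbapi.connect(host="trino.example.com", port=443, user="etl")
# conn.cursor().execute(sql)
```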
Marc Laforet, Senior Data Engineer at Shopify
Read more: trino.io/blog/2022/12/09/trin...
Rewriting History: Migrating petabytes of data to Apache Iceberg using Trino