AWS Glue has pioneered ETL automation by providing a fully managed, serverless data integration service. The service gives customers a simple and cost-effective way to categorize their data, clean it, enrich it, and move it quickly and reliably between data stores. AWS Glue consists of a Data Catalog (i.e., a metadata store), ETL engines with automated code generation, and visual interfaces that let every persona perform ETL tasks. AWS Glue customers use Apache Spark and Python engines for data integration.
Our Python customers asked for a way to scale their Python workloads to large datasets. To enable these use cases, AWS Glue added support for Ray (ray.io/) and launched AWS Glue for Ray. AWS Glue for Ray gives data engineers a distributed, Pythonic data analytics platform for performing data integration at scale with the Ray Core APIs. Using Ray's task and actor abstractions, we were able to scale Python workloads horizontally. The simple distributed collection APIs provided by Ray Datasets helped our Python customers perform ETL operations efficiently on very large datasets. Notably, we launch Ray clusters on ARM-based platforms with workers that use IPv6 addressing. Data engineers are comfortable with Pandas, and given that popularity, we integrated Modin at scale with Ray. We will cover our experiences with Ray Datasets and distributed Pandas at scale. We will also talk about the innovations we made while integrating with Ray's cluster manager and demand-based autoscaler to offer an instant-on, interactive, easy-to-use serverless Ray platform for our customers.
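To make the task abstraction mentioned above concrete, here is a minimal sketch of fanning a plain Python ETL step out as Ray tasks. It is an illustration only, not how AWS Glue for Ray is implemented. The names `clean_partition`, `partition`, and `run_on_ray`, and the `id`/`name` record fields, are hypothetical; it assumes Ray is installed (`pip install ray`) and runs on a local cluster.

```python
from typing import Dict, List

Row = Dict[str, str]


def clean_partition(rows: List[Row]) -> List[Row]:
    """Plain-Python ETL step: drop rows without an id, normalize names."""
    cleaned = []
    for row in rows:
        if not row.get("id"):
            continue  # skip records that are missing the key field
        name = row.get("name", "").strip().lower()
        cleaned.append({"id": row["id"], "name": name})
    return cleaned


def partition(rows: List[Row], n: int) -> List[List[Row]]:
    """Split rows into at most n contiguous chunks of near-equal size."""
    size = max(1, -(-len(rows) // n))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def run_on_ray(rows: List[Row], num_partitions: int = 4) -> List[Row]:
    """Run clean_partition on each chunk as a Ray task and gather results."""
    import ray  # imported lazily so the helpers above work without Ray

    ray.init(ignore_reinit_error=True)  # starts a local cluster if none exists
    remote_clean = ray.remote(clean_partition)  # wrap the function as a task
    futures = [remote_clean.remote(chunk)
               for chunk in partition(rows, num_partitions)]
    results = ray.get(futures)  # blocks until all tasks finish, keeps order
    return [row for chunk in results for row in chunk]
```

Calling `run_on_ray(rows)` on a list of records returns the same rows that `clean_partition(rows)` would, but with each chunk processed by a separate Ray task. Keeping the transformation as an ordinary function means it can be tested without a cluster and wrapped with `ray.remote` only at the point where work is distributed.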
About Anyscale
---
Anyscale is the AI Application Platform for developing, running, and scaling AI.
www.anyscale.com/
If you're interested in a managed Ray service, check out:
www.anyscale.com/signup/
About Ray
---
Ray is the most popular open source framework for scaling and productionizing AI workloads. From Generative AI and LLMs to computer vision, Ray powers the world’s most ambitious AI workloads.
docs.ray.io/en/latest/
#llm #machinelearning #ray #deeplearning #distributedsystems #python #genai