The Pre-trainer's Toolkit: From Dataset Construction to Model Scaling

Abstract: Recent breakthroughs in machine learning rely heavily on pre-training: harnessing larger datasets, models, and computational resources to create base models for subsequent fine-tuning. In this talk, we develop a pre-trainer's toolkit. Drawing on empirical findings, we present methodologies for dataset construction and for de-risking large-scale model training. Our discussion spans both the multimodal and language modeling domains. By addressing the entire pre-training pipeline, from dataset creation to downstream evaluation, we aim to build better, more reliable models.
Bio: Samir Yitzhak Gadre (Samir) is an NSF Graduate Research Fellow and PhD student at Columbia University working with Shuran Song and Ludwig Schmidt. He studies the empirical foundations of pre-training. Samir also serves as a core maintainer of OpenLM, a minimal but performant open-source language modeling library.