The Pre-trainer's Toolkit: From Dataset Construction to Model Scaling

Abstract: Recent breakthroughs in machine learning rely heavily on pre-training: harnessing larger datasets, models, and computational resources to create base models for subsequent fine-tuning. In this talk, we develop a pre-trainer's toolkit. Drawing on empirical findings, we present methodologies for dataset construction and for de-risking large-scale model training. Our discussion spans both the multimodal and language modeling domains. By addressing the entire pre-training pipeline, from dataset creation to downstream evaluation, we aim to build better, more reliable models.
Bio: Samir Yitzhak Gadre (Samir) is an NSF Graduate Research Fellow and PhD student at Columbia University working with Shuran Song and Ludwig Schmidt. He studies the empirical foundations of pre-training. Samir also serves as a core maintainer of OpenLM, a minimal but performant open-source language modeling library.