SRE for ML: The First 10 Years and the Next 10
Todd Underwood, Google
Over 10 years ago, we started building SRE for a large multi-model ML service at Google. We faced many interesting challenges, including:
Defining scope: Why do these services need ML anyway?
Unclear SLOs: What are we measuring and how can we actually be responsible for those things?
Fuzzy demarcation with our modeling teams: When is a model quality problem caused by infrastructure, and when is it caused by the model or the data?
With the explosion of ML training and serving platforms, the choices we faced are now confronting many SRE teams across the industry. I will review the history, focusing on the decisions we made, why they made sense to us at the time, and why they might make sense for others. I'll also try to answer the question of whether there is a real need for SRE for ML at all.
View the full SREcon21 program at www.usenix.org...