So informative!!! Thanks, Professor, for making ML so fun and intuitive!!!
@saikumartadi8494
4 years ago
Thanks a lot for the video. Please upload other courses offered by you if they are recorded. Would love to learn from teachers like you.
@PhilMcCrack1
4 years ago
Starts at 1:22
@vasileiosmpletsos178
4 years ago
Nice Lecture! Helped me a lot!
@giulianobianco6752
3 months ago
Great lecture, thanks Professor! It complements the online certificate with deeper math and concepts.
@yukiy.4201
5 years ago
Magnificent!
@benw4361
4 years ago
Great lecture!
@sooriya233
4 years ago
Brilliant
@andresfeliperamirez7869
3 years ago
Amazing!
@yanalishkova1873
5 years ago
Very well and intuitively explained, and very entertaining as well! Thank you!
@fuweili5320
3 years ago
Learned a lot from it!!!
@rameezusmani1294
3 years ago
Sir, I can't explain in words how helpful your videos are and how easy they are to understand. People like you are true heroes of humanity and science. Thank you so much. I have one question: these days machine learning seems to be limited to deep learning using neural networks. I want to know, do people still use MLE or MAP or Naive Bayes?
@kilianweinberger698
3 years ago
Yes, other (non-deep) methods are still very much in use. You are right that one can get the impression that nowadays people only use deep networks, however I would say that's also a problem of our time. I often find that in some companies people train big multi-layer deep nets for problems where a simple gradient boosted tree or Naive Bayes classifier would have been far more efficient. Btw, MLE and MAP are really concepts to optimize model parameters, so you can use them for deep networks, too.
@rameezusmani1294
3 years ago
@@kilianweinberger698 Thank you so much for clarifying. I will take it as homework to figure out how I can use MLE or MAP to optimize my neural network parameters.
@rameezusmani1294
3 years ago
I realized that the least-squares approach is a specific case of MLE :). Thank you for making me think in that direction, sir.
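That observation can be checked numerically. Below is a small sketch with synthetic data (my own example, not from the lecture): for y_i = w·x_i plus Gaussian noise, the negative log-likelihood is, up to constants, proportional to the sum of squared residuals, so the MLE of w coincides with the least-squares solution.

```python
import numpy as np

# Synthetic regression data: y = 3*x + Gaussian noise.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.5, size=100)

def neg_log_likelihood(w, sigma=0.5):
    # Gaussian NLL: squared residuals scaled by the noise variance, plus a constant.
    residuals = y - w * x
    return 0.5 * np.sum(residuals**2) / sigma**2 \
        + len(x) * np.log(sigma * np.sqrt(2 * np.pi))

# Brute-force minimize the NLL over a grid, then compare with least squares.
ws = np.linspace(2.0, 4.0, 2001)
w_mle = ws[np.argmin([neg_log_likelihood(w) for w in ws])]
w_ls = x @ y / (x @ x)              # closed-form least-squares slope
print(w_mle, w_ls)                  # the two estimates agree up to grid resolution
```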
@amitotc
3 years ago
First of all, thanks a lot for sharing the video; it really helped me understand some of the salient differences between gradient descent and Newton's method. I have a few questions:
1) In the AdaGrad method, the denominator might become too big, which can make the updates vanish pretty quickly in some situations. Isn't it better to take the average of the `s` values rather than just summing them?
2) Towards the end of the lecture you talk about Newton's method, where we might have to compute the inverse of a large matrix. In the case of very high-dimensional data, shouldn't the cost of inverting the matrix be much higher than taking computationally cheaper gradient descent steps, which might then overshadow any benefit we get from using Newton's method? Matrix inversion is usually O(n^3) if we use Gaussian elimination. (As far as I know, matrix inversion is as hard as matrix multiplication, so at least O(n^2.8) if we use something like Strassen's algorithm.)
3) If we want to use gradient descent to find the roots of a function, instead of finding minima and maxima, how effective will it be compared to Newton's method? Can gradient descent be used interchangeably in all applications where Newton's method is used, even though it might be less effective in some cases?
@mehmetaliozer2403
2 years ago
Thanks for this amazing lecture 👍👍
@maddai1764
5 years ago
Nice explanation
@coolarun3150
A year ago
awesome!!
@analyticstoolsbyhooman6963
2 years ago
Heard a lot about Cornell, and now I see the reason for it.
@maharshiroy8471
3 years ago
I believe that in practical implementations, a poorly conditioned Hessian can cause huge numerical errors while converging, ultimately rendering second-order methods like Newton's very unreliable.
@gabrielabecerrilprado2142
3 years ago
❤
@meghnashankr9340
2 years ago
just saved me hours of digging into books for understanding these concepts...
@doyourealise
3 years ago
Sir, did you find out where the handouts went? 10:30
@pnachtwey
A month ago
Everyone seems to have a different version. AdaGrad doesn't always work. The running sum of squared gradients gets too big unless one scales it down. Also, AdaGrad works best with a line search. All variations work best with a line search.
@sudhanshuvashisht8960
4 years ago
In AdaGrad, when we're updating the s vector as s = s + g .^ 2 (elementwise square, not g dot g, as you described), what if the gradient at a point is < 1 for every feature (i.e., every element of the g vector is less than one)? In that case, how does adding the square of each gradient element to the s vector serve our purpose?
@subhasdh2446
2 years ago
That was my question too, but I think I've found a good explanation for it. Please correct me if I'm wrong. So when it is
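For concreteness, here is a minimal AdaGrad sketch (my own illustration, not the lecture's code). Note that even when every gradient entry is below one, s still grows monotonically across iterations, so each coordinate's effective step size keeps shrinking, just more slowly.

```python
import numpy as np

def adagrad(grad, w0, lr=0.5, eps=1e-8, steps=200):
    """Run AdaGrad from w0 on a function given by its gradient."""
    w = np.asarray(w0, dtype=float)
    s = np.zeros_like(w)                  # running sum of squared gradients
    for _ in range(steps):
        g = grad(w)
        s += g**2                         # elementwise square, not a dot product
        w -= lr * g / np.sqrt(s + eps)    # per-coordinate adaptive step size
    return w

# Example: a poorly scaled quadratic f(w) = 0.5*(w1^2 + 100*w2^2).
grad = lambda w: np.array([w[0], 100.0 * w[1]])
w_opt = adagrad(grad, [1.0, 1.0])
print(w_opt)  # both coordinates shrink toward 0
```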
@xwcao1991
3 years ago
What would be the effect of doing gradient descent (or even conjugate gradient descent) on the Taylor expansion to find the optimum of the parabola, instead of computing its optimum directly? Would that avoid overshooting in a narrow valley? Thanks for any explanation.
@kilianweinberger698
3 years ago
Newton's Method is essentially doing that for the second moment. However, Taylor's expansion is only a good approximation locally. The moment you use it to take large steps, you may be very far off - leading to divergence.
@xwcao1991
3 years ago
@@kilianweinberger698 thanks for the explanation. Love your teaching style.
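The point in the exchange above, that Newton's method jumps to the optimum of the local quadratic Taylor approximation, can be made concrete with a one-dimensional sketch (my own example, not the lecture's): the update is w ← w − f′(w)/f″(w), the minimizer of the second-order expansion around w.

```python
# Sketch: Newton's method on f(w) = w^4 - 1.5*w^2, whose minimizers are
# w = ±sqrt(3)/2. Each step jumps to the vertex of the tangent parabola;
# started inside the convex region it converges in a handful of iterations,
# while far from the optimum the same large jump can diverge.
def f_prime(w):
    return 4 * w**3 - 3 * w

def f_double_prime(w):
    return 12 * w**2 - 3

w = 1.5                        # start where f'' > 0 (convex region)
for _ in range(10):
    w -= f_prime(w) / f_double_prime(w)
print(w)  # approaches the minimizer sqrt(3)/2 ≈ 0.866
```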
@mikejason3822
3 years ago
Why was an approximation of l(w) (using a Taylor series) used at 13:20?
@gauravsinghtanwar4415
4 years ago
I didn't understand how gradient descent works in the case of local minima. Please help. Danke schön!
@kilianweinberger698
4 years ago
Well, yes we will get there in the later lectures (neural networks). With GD you will get trapped in the local minimum. To fix this you can use Stochastic Gradient Descent in those cases. Here, the gradient is so noisy, that it is not precise enough to get trapped (unless the minimum is really wide). Hope this helps.
@gauravsinghtanwar4415
4 years ago
@@kilianweinberger698 Thank you very much! This is a wonderful course.
@cuysaurus
4 years ago
Is my log-likelihood function a loss function?
@kilianweinberger698
4 years ago
The negative log-likelihood is a loss function (in fact a very common one). The likelihood or log-likelihood of your data is something you want to maximize - so if you negate it you obtain something you want to minimize (i.e. a loss).
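A tiny numerical illustration of that reply (my own made-up labels and probabilities): negating the Bernoulli log-likelihood of the data yields exactly the log loss (cross-entropy) that logistic regression minimizes.

```python
import numpy as np

# Labels y in {0,1} and predicted probabilities p = P(y=1).
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])

# Log-likelihood of the data: something we want to maximize.
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Negating it gives the log loss: something we want to minimize.
loss = -log_likelihood
print(loss)  # about 1.196
```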
@massimoc7494
3 years ago
Is it possible for Newton's method for optimization to get stuck at an inflection point of a function? I ask this because if I'm in the concave part of the function I'm minimizing, the parabola tangent to my point x_{0} is concave, so I'm finding the maximum of that parabola instead of a minimum.
@kilianweinberger698
3 years ago
hmm, unlikely ... sounds more like a (sign) error?
@vatsan16
4 years ago
So if I understand you correctly, Naive Bayes is useful when we don't have as much data. But in a real-life problem, how would we know if the amount of data we have is enough?
@kilianweinberger698
4 years ago
TBH, often it is just the easiest to try both, Naive Bayes and logistic regression, and see which one works better.
@Carlosrv19
A year ago
@@kilianweinberger698 The form that logistic regression takes for P(y|X) is derived from the NB assumption when P(X|y) is Gaussian. Then isn't it also the same as saying that logistic regression inherits the same assumptions as NB?
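The "try both and see which works better" advice above can be sketched with scikit-learn (assuming it is installed; the dataset here is synthetic, and in practice X, y would be your own features and labels).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data; replace with your own X, y.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Cross-validate both models and keep whichever scores better.
for model in (GaussianNB(), LogisticRegression(max_iter=1000)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())
```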
@KaushalKishoreTiwari
3 years ago
But how do we decide the value of the parabola?
@gregmakov2680
2 years ago
I think Naive Bayes has worse performance than Gaussian Naive Bayes or logistic regression because of its generalization capability.
@JoaoVitorBRgomes
3 years ago
At around 23:00, how do we check if the function is convex?
@kilianweinberger698
3 years ago
It is convex if and only if its second derivative is non-negative. Or for high dimensional functions, the Hessian is positive semi-definite.
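That criterion can be checked numerically with NumPy (my own example matrix, not from the lecture): for a quadratic f(w) = 0.5·wᵀAw, the Hessian is the constant matrix (A + Aᵀ)/2, so convexity reduces to all of its eigenvalues being non-negative.

```python
import numpy as np

# Example quadratic f(w) = 0.5 * w^T A w.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

hessian = 0.5 * (A + A.T)             # symmetrize before the eigenvalue test
eigvals = np.linalg.eigvalsh(hessian) # eigenvalues of a symmetric matrix
is_convex = bool(np.all(eigvals >= 0))
print(is_convex)  # True: eigenvalues ~0.79 and ~2.21 are both non-negative
```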
@rakinbaten7305
15 days ago
I'm curious if someone was actually stealing all the notes
@user-me2bw6ir2i
A year ago
Has anyone found the matrix cookbook that the professor mentioned?
@kilianweinberger698
A year ago
Here you go: www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
@sandeepraj7157
3 years ago
Imagine there's a tiger outside your house and you want to drive it away. Now since we have no idea to do that, we can safely assume it is a cat as it is much easier to scare away a cat. Problem solved.
Comments: 51