The only part I was able to understand was when he said "Gimme a moment to sip my tea".
@TsodingDaily
A year ago
That was the most important part!
@vladlu6362
A year ago
@TsodingDaily I apologize for reposting, Tsoding, but I believe that contacting you with a reply comment is better than just posting and hoping you'll see it. If you actually did this somewhere along the stream and I didn't notice, I apologize. But here we go: the derivative of q, when q = sin x, is sqrt(1 - q*q). Math: (sin x)' = cos x; sin^2 x + cos^2 x = 1, so cos^2 x = 1 - sin^2 x and cos x = sqrt(1 - sin^2 x); substituting q gives cos x = sqrt(1 - q*q). // This is why sin is kinda costly. Sqrt is extremely costly.
@Shan224
A year ago
Thank you for all your videos. You’ve inspired and helped me grow as a programmer. Wishing you the best!
@sp0_od597
A year ago
5:47 I'd say it worked perfectly. You asked it to interpolate between 6 and 8 and it came up with 7.
@vladlu6362
A year ago
The derivative of q, when q = sin x, is sqrt(1 - q*q). Math: (sin x)' = cos x; sin^2 x + cos^2 x = 1, so cos^2 x = 1 - sin^2 x and cos x = sqrt(1 - sin^2 x); substituting q gives cos x = sqrt(1 - q*q). // This is why sin is kinda costly. Sqrt is extremely costly.
@thomaspeck4537
A year ago
This does not quite work. There is a plus or minus on the square root, and we have no way to tell which one from the derivative. For example, if q=0, then x could be either 0 or pi, giving a derivative of plus or minus one. This can lead to training pushing it in the wrong direction.
@ratchet1freak
A year ago
sqrt is super cheap on modern hardware; you spend more cycles computing an integer divide than a sqrt. The real solution is to keep both sin and cos in the activation result.
@vladlu6362
A year ago
@thomaspeck4537 Yes, that problem does arise, which is unfortunate, as computing arcsine and cosine is quite expensive. However, negative zero exists in floating point. You could just assume that, when q = -0, the function takes the form -sqrt(1 - q*q), and when q = +0, the form sqrt(1 - q*q). This is, again, not correct, but I guess reducing the tangent to the first quadrant would be nice. Edit: But then again, the same problem occurs with using arcsine. Just choose one quadrant for the tangent and apply the respective formula. It's definitely faster, while it still has the same problems as the other solutions.
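A rough sketch of the trick this thread is discussing, in C. It assumes the stored activation value q is sin(x); as noted above, the sign of cos(x) cannot be recovered from q alone, so this only gives the magnitude (the function name is hypothetical, not from the stream's code):

```c
#include <math.h>

// Derivative of sin expressed through its own output q = sin(x).
// Only the magnitude is recoverable: |cos(x)| = sqrt(1 - sin^2(x)).
float sinf_diff_from_output(float q)
{
    return sqrtf(1.0f - q * q);
}
```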
@dcraftbgdev
A year ago
Excited about the Machine learning. Keep up the quality videos!
@oussamawahbi4976
A year ago
Something that might help the ReLU activation function learn better is the weight initialization method. I actually had this same problem when I was building a Keras-inspired lib for Java, so I went to Python and noticed that the weights of Dense layers were slightly different from my randomly generated ones. When I looked further into it I found that there are different weight init methods for each activation function: "Xavier initialization" for sigmoid and tanh, and "He initialization" for ReLU. He initialization basically makes weights not too small and not too large by multiplying the random numbers by sqrt(2/size_of_previous_layer). This did make a difference for me.
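A minimal sketch of what that could look like in C, following the comment's description of scaling uniform random weights by sqrt(2/size_of_previous_layer); the helper names are hypothetical, and strict He initialization actually samples from a Gaussian with that standard deviation:

```c
#include <math.h>
#include <stdlib.h>

static float rand_float(void)              // uniform in [-1, 1]
{
    return (float)rand() / (float)RAND_MAX * 2.0f - 1.0f;
}

// He-style initialization: scale by sqrt(2 / fan_in),
// where fan_in is the size of the previous layer.
static void he_init(float *weights, size_t fan_in, size_t fan_out)
{
    float scale = sqrtf(2.0f / (float)fan_in);
    for (size_t i = 0; i < fan_in * fan_out; ++i)
        weights[i] = rand_float() * scale;
}
```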
@giacomo.delazzari
A year ago
How the single sin() is able to learn XOR can be seen very beautifully from the fact that 1.57 is (roughly) half of pi :)
@blimolhm2790
A year ago
Yes, that's what I noticed too. Thankfully all the trig I've used over the years made that really intuitive.
@nero0magistr
A year ago
Probably somebody has already noticed this, but you essentially get sin(x*pi/2 + y*pi/2), so if x=y=0 you get sin(0) = 0, if x=y=1 you get sin(pi) = 0, and if x != y you get sin(pi/2) = 1.
@ratchet1freak
A year ago
Because XOR has a bump in the result: all the other activation functions are monotonic, so you need another layer to make the output bend back down, as it were, while sin is not monotonic, which lets the NN put the peak of a sine bump right where it needs to be active.
@josephcbs6510
A year ago
Great video! (as usual) This approach of having an enum to define the activation function probably works pretty well for the compiler, since the activation functions are known at compile time, avoiding branching. Another approach that could work is passing function pointers (`float (*activation)(float)` and `float (*activation_diff)(float)`) around. I don't know if this has performance implications, but I would guess so. I think that using these function pointers would make it easier to build the Layer struct, letting each layer have its own activation function. All that I said here is just hand-waving; there is probably a better way of doing things that I'm not aware of.
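A quick sketch of the function-pointer variant described above; the Layer struct and its field names are just illustrative, not the stream's actual code, and the derivative is expressed through the activation's output to match the convention used elsewhere in the comments:

```c
#include <math.h>

typedef struct {
    float (*activation)(float);        // forward activation
    float (*activation_diff)(float);   // its derivative, for backprop
    // ... weights, biases, etc.
} Layer;

static float sigmoid(float x)      { return 1.0f / (1.0f + expf(-x)); }
static float sigmoid_diff(float y) { return y * (1.0f - y); }   // y = sigmoid(x)

// Each layer can then carry its own activation:
// Layer l = { .activation = sigmoid, .activation_diff = sigmoid_diff };
```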
@bobbobson6867
A year ago
Seems like there's a mistake at 24:11: (a >= 0) calculates relu(x) >= 0, which is always true! So this will always return 1.
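A minimal sketch of the fix being suggested, assuming `a` already holds the activation output relu(x) (the function name is hypothetical):

```c
float relu_diff(float a)
{
    // a >= 0 is always true for a ReLU output, so that check always returned 1.
    // Checking a > 0 gives 0 on the flat part and 1 on the linear part.
    return a > 0.0f ? 1.0f : 0.0f;
}
```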
@Aziqfajar
A year ago
The way you describe the screenshots is like an artist, but for machine learning, narrating how the shapes form and take shape. Gets me every time 😂
@coAdjointTom
A year ago
Great videos. I built an autodiff NN in Jai last year and did a video on my channel. Something I didn't have a chance to explore, but thought was very interesting, was to use compile-time execution to build the network's gradient descent code, so it benefits from optimization instead of doing it at runtime. Would love to watch you try it!
@sinan_islam
A year ago
People don't know that if you are interested in AI you have to go back to school and study Algebra, Linear Algebra, Calculus, and Statistics.
@ratchet1freak
A year ago
I'm kinda interested to see the non-traditional backprop factor changed as the NN learns: pull it down once it has tweaked the starting layers, to make it affect the middle layers more.
@ЕгорКолов-ч5с
A year ago
Another interesting thing about the sine function as activation is that in the paper "Implicit Neural Representations with Periodic Activation Functions", where researchers tried to encode images as a transformation (coordinates) => (pixel value) (just like in your videos), the sine activation function was the best at reconstructing an image and its "derivative" (basically a Sobel filter). I would recommend looking up this paper, at least to see the picture on page 5 of a bear swimming near an Egyptian pyramid, made by mixing two networks together.
@Shan224
A year ago
Thank you for all your videos! Your content is great and you’ve inspired me to be a better programmer. Wishing you well from Southern California
@CaridorcTergilti
A year ago
It would be very simple and interesting to implement residual connections: given layer_i with input x_i, you make its output equal to f(x_i) + x_i rather than just f(x_i) (where f is the matrix multiplication plus ReLU or any other activation). It is a really simple but extremely powerful idea: this way each layer only has to learn some way to improve the current knowledge a little bit rather than rebuilding everything. Mathematically this allows the gradient to flow, removing the vanishing gradient problem; ResNet50, with 50 layers, was trained with no problem using this method. Remember to keep the addition outside the activation function, otherwise you will still have the vanishing gradient.
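A rough sketch of that idea in C; the transform f stands in for whatever a layer does (matmul plus activation), and all names here are just illustrative:

```c
#include <stddef.h>

// Residual forward pass: out = f(x) + x, with the addition kept
// outside the activation so the identity path stays untouched.
void residual_forward(void (*f)(const float *x, float *out, size_t n),
                      const float *x, float *out, size_t n)
{
    f(x, out, n);                  // out = f(x)
    for (size_t i = 0; i < n; ++i)
        out[i] += x[i];            // skip connection
}
```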
@akkudakkupl
A year ago
A nice thing would be to start with the fast learning algo and then switch to the traditional one after the cost is below some value, e.g. 0.01.
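Something like this, perhaps; nn_learn_fast and nn_learn_traditional are placeholder names standing in for the two schemes, not functions from the stream:

```c
#include <stddef.h>

typedef struct NN NN;
typedef struct Data Data;
float nn_learn_fast(NN *nn, const Data *d);         // the "crazy" fast scheme
float nn_learn_traditional(NN *nn, const Data *d);  // plain backprop

float train(NN *nn, const Data *d, size_t epochs)
{
    float cost = 1.0f;
    for (size_t i = 0; i < epochs; ++i)
        cost = (cost > 0.01f) ? nn_learn_fast(nn, d)          // while the cost is still high
                              : nn_learn_traditional(nn, d);  // once it drops below 0.01
    return cost;
}
```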
@Jonathan-di1pb
A year ago
You would normally also use a different output activation on the final layer to make the results more interpretable
@bukitoo8302
A year ago
1.57 is approximately pi/2, so: xor(0, 0) = sin(0), xor(1, 0) = xor(0, 1) = sin(pi/2), xor(1, 1) = sin(pi).
@DarkStar666
A year ago
Came here to say this also :)
@Jonathan-di1pb
A year ago
This basically just makes the early layers have a higher learning rate, which helps with the vanishing gradient problem deep in the network with sigmoid, because the tiny gradients far to the left or right of the sigmoid now just get boosted like crazy. I guess it's just fortunate in this specific simple problem, but I doubt that it will be stable enough for more complex things.
@bobby9568
A year ago
I was wrong about Tsoding Daily, he is even more of a beast programmer than I thought!
@hendrikd2113
A year ago
Wut?! It scrolls between the two pictures? That's so funny to me!
@Sunrise7463
A year ago
Funny video! The reason you don't use "sin" as a loss function is that it is not monotonically increasing. It's not a strict rule, but you want to decrease the loss, and with "sin" in some parts of a function, backpropagation would actually increase the loss. Also, after reaching a certain depth in the neural network, you wouldn't use sigmoid due to the vanishing gradient problem. The blockiness of "relu" is mitigated by the number of them and the depth of the network. By the way, in "sin(x * 1.57 + y * 1.57)", 1.57 is an approximation of pi/2.
@MainChelaru-uf8yh
A year ago
I love this series
@jonasync
A year ago
The craziness might be good for getting the NN out of local minima. For example, it could "bounce" it out of that scrolling behavior into an interpolation behavior.
@deffuls
9 months ago
You are really cool, thanks for publishing your videos on YouTube!
@gameofpj3286
A year ago
This was really fun to watch
@davidaloysparrow219
A year ago
Top video as usual!
@CalinMartinconi
A year ago
How did you learn all of this? Did you read some papers? Books? Videos?
@forayer
A year ago
Really enjoyed the video, thx!
@ЕгорЧилиевич-ю9р
A year ago
It would be nice to see the difference in convergence if you implemented batch normalization
@AntonioNoack
A year ago
How about statistical measures in the next video to automatically control "craziness"? E.g. if the relative variance (maybe smoothed) is > 0.01 per step, scale down the learning rate by multiplying it by 0.999, and if the relative variance is < 0.001 per step, increase the learning rate.
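A tiny sketch of that control loop; relative_variance is whatever (possibly smoothed) per-step statistic you track, and the increase factor is just a guess mirroring the decrease:

```c
float adjust_rate(float rate, float relative_variance)
{
    if (relative_variance > 0.01f)  return rate * 0.999f;  // too jumpy: cool down
    if (relative_variance < 0.001f) return rate * 1.001f;  // too flat: heat back up
    return rate;
}
```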
@mindasb
A year ago
Why? The slider is so much cooler for experiments and intuition building than learning rate decay (LR decay) or adaptive learning rate.
@AntonioNoack
A year ago
@mindasb Then he should make it logarithmic. He was struggling a few times in this video to set it close to zero or above 1.
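For example, something like this, assuming the slider value s is in [0, 1] (the endpoints here are arbitrary illustrative choices):

```c
#include <math.h>

// Maps s = 0 to a rate of 1e-3 and s = 1 to a rate of 10,
// with even resolution per decade instead of per linear step.
float slider_to_rate(float s)
{
    return powf(10.0f, -3.0f + 4.0f * s);
}
```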
@jebarchives
A year ago
gotta love that thumbnail
@mra-f3x
A year ago
Don’t you want to implement adaptive learning rate?
@caio_c
A year ago
The non-traditional approach is like AI but with anxiety.
@GillesLouisReneDeleuze
7 months ago
I wonder if there is any merit to using non-leaky ReLU alongside the derivative of leaky ReLU.
@valovanonym
A year ago
I work in the field of ML. Your analogy about biology and fertilizers is pretty interesting and quite poetic, but in practice that's not how it works. We do computer science, motivated by maths :)
@valovanonym
A year ago
Btw you're doing very well and seeing you progress so fast is impressive. I love hearing your thoughts as someone who isn't primarily coming from this field, it's refreshing!
@coc1841
A year ago
Probably I don't know what I'm talking about, but what if you try switching activation functions during training? By key press or anything just to see it visually.
@jjopl
3 months ago
Is this just L1 vs L2 regularisation?
@lorenzogentile9289
A year ago
1.57~pi/2
@f_cil
A year ago
Finally you've realised that ML includes just if and else, nothing more... (!)
@SiiKiiN
A year ago
It would be so epic if there were a way to do ML over Bézier curves.
@DFPercush
A year ago
You mean like encoding a vectorized image, instead of pixels?
@ErikBongers
A year ago
So the non-traditional approach is like a bug? In any case, "going crazy" is not a bad way of learning, since it's equivalent to the idea that you only learn from making errors. And indeed it goes really fast. Have you seen a kid learning to ride a bike? Goes crazy in the beginning. Did you just discover something new by accident? I hope so. In any case, keep tinkering without really knowing what you are doing. A lot of things have been discovered and invented that way.
@rngQ
A year ago
Now I can just drop sin() into all my networks
@lolix7002
A year ago
Fun fact: 1.57 ≈ pi/2
@RRKS_TF
A year ago
Third
@Forbiiden
A year ago
i really like to play amongus with my sussy ahh uncle at 3 am! its such a great activity to burn time when you cant go to the goofy ahh yearly griddy competition in ohio!
@aftalavera
A year ago
This BS about AI sounds exactly like Bitcoin did!
@__gadonk__
A year ago
I think it's overhyped in some capacity, but if people apply it correctly it's going to be the most powerful and revolutionary thing since ARPANET.
@ЕгорКолов-ч5с
A year ago
@andrewdunbar828 I don't think he is going to get to transformers since a lot of the interesting stuff with them is compute-heavy. But maybe he is going to create a CUDA alternative for his laptop's integrated GPU and I will eat my words.
@kevinscales
A year ago
What are you referring to by "this BS"? By "exactly the same", do you mean they both involve math and programming?
Comments: 66