Mel-Frequency Cepstral Coefficients Explained Easily

Views: 126,455

Valerio Velardo - The Sound of AI

Comments: 221

@aberone_library
4 months ago
I cannot express how much I'm thankful to you for making this video! This is my favorite style of explanation that I myself have adopted over the years. You took an hour to explain a concept that could, in principle, have been explained in 15 mins or so, but you did it so clearly and thoroughly that by the end of the video I had a spotless, complete understanding not only of the process of extracting the MFCCs but also of the intuition and the meaning of it. Which is something that a lot of other explanatory videos lack these days. So thank you again for your effort!
@ValerioVelardoTheSoundofAI
4 months ago
Thanks a lot :)
@thebigVLOG
3 years ago
This is one of the best lectures I've ever effin watched, thank you so much for making this series!
@emrecan9271
5 months ago
You are a perfect man. These videos are literally worth gold. I will watch them from the start. Thank you very much.
@naveenfrancis444
2 years ago
at 14:38, doesn't the IDFT map a signal onto the time domain? If so, shouldn't the axis be pseudo time instead of pseudo frequency?
@st0a
1 year ago
That's exactly what I was wondering...
@ayishanayyar1283
3 years ago
Description in a pleasant manner, untiring, relaxing effect on nerves. Thank you Valerio Velardo
@klausjurgenfolz4323
3 years ago
I've learned more watching this video than in a whole semester at my university. Thank you!!!!
@BlackHermit
2 years ago
This was really, exceptionally good. A rather lengthy video, but worth every second. Thank you so much!
@ValerioVelardoTheSoundofAI
2 years ago
Glad you liked it!
@Bluephoton
3 years ago
Better than Speech Signal Processing Lecture in terms of explanation and ease of understanding !! Highly recommend to watch for speech related projects!
@ValerioVelardoTheSoundofAI
3 years ago
Thank you!
@johnmin3821
3 years ago
This video is doing explanations that I couldn't find or understand from hundreds of websites. You're a legend
@nataliakalashnikova1269
3 years ago
I did my master's thesis in NLP on automatic emotion recognition comparing CNN and SVM performance using MFCC. I didn't really "get" the meaning of MFCC, how it works, why it is so popular, etc. Now I'm doing my PhD thesis, also on emotion classification in speech, and I was really struggling to understand these basic concepts. Thank you so much for your work and your clear and vivid explanations! You helped me a lot to move forward in my project. P.S. Sorry for my English, if there are a lot of errors. P.P.S. I am a linguist "I believe it's called" :)
@zahraamuhsen1310
2 years ago
Please, I am studying, and my thesis is also about speech emotion recognition using CNN and MFCC, based on GA using entropy. Can you send me your thesis, or can you help me understand?
@DiogoSanti
3 years ago
But is it OK to call the inverse Fourier transform a spectrum? I ask because the inverse Fourier transform brings the frequency domain back to the time domain, and in my head a spectrum is represented by slices of the frequency domain. Or am I missing the point?
@jdavibedoya
1 month ago
I'm not an expert, but I believe the conventional way to calculate the cepstrum uses the IDFT because of its scaling factor. Both the DFT and IDFT are quite similar and indeed produce results with the same shape.
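A minimal sketch of the point made in this reply, assuming a naive O(N²) transform written with Python's standard library. The helper names `dft` and `idft` and the toy decaying signal are illustrative assumptions, not taken from the video. It suggests two things: applying the IDFT directly to the DFT of a signal does return the original signal, but once the log magnitude is taken in between, the result is a real, even-symmetric sequence, and for such a sequence the DFT and the IDFT differ only by the 1/N scaling factor.

```python
import cmath
import math

def dft(x):
    """Naive forward DFT (no scaling)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse DFT (scaled by 1/N)."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

n = 16
signal = [0.8 ** t for t in range(n)]   # toy real signal with no empty frequency bins

spectrum = dft(signal)
roundtrip = idft(spectrum)              # DFT followed by IDFT recovers the signal

# Taking the log magnitude discards the phase and leaves a real, even sequence.
log_mag = [math.log(abs(c)) for c in spectrum]

forward = dft(log_mag)                  # "spectrum of a spectrum"
inverse = idft(log_mag)                 # cepstrum as usually defined

# Largest difference between DFT / N and IDFT over all quefrency bins.
max_gap = max(abs(f / n - i) for f, i in zip(forward, inverse))
print(max_gap)
```

Under these assumptions the printed gap is at floating-point noise level, which is consistent with the claim that both transforms give the same shape here.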
@seathru1232
2 years ago
Dear Valerio, there is a point I don't get. Shouldn't you get time on the x-axis if you apply an IDFT to a signal represented in the frequency domain? If I take a signal x(t), take the FFT, and then the IDFT, don't I get back a reconstructed x(t)? Is the log of the FFT the reason behind what you explained?
@pedrobotsaris2036
1 year ago
That is right. I think he is misusing the term inverse Fourier transform here. If you apply an IDFT you get back to the time domain.
@Walsh2571
1 year ago
@@pedrobotsaris2036 not if you change the scale before performing an IFFT
@user-gb4oo2to4w
6 months ago
@@Walsh2571 Why? If you change the scale before performing IFFT, you just get back to the time domain with a different scale, right?
@tutatis96
5 months ago
@@user-gb4oo2to4w I think the point is that we got rid of the phase with the log, but I'm not sure
@Underscore_1234
3 months ago
That's AWESOME STUFF. I did expect good stuff, but didn't expect stuff that good. You did really well explaining cepstrums and the way to separate glottal pulses from the vocal tract. It really made sense.
@chacmool2581
2 years ago
What I don't understand is why one takes an inverse FT instead of an FT to get to the quefrency domain. If it's indeed a spectrum of a spectrum, shouldn't one take an FT of an FT?
@satyajeetprabhu
9 months ago
Same thought
@booky6149
9 months ago
We are taking the inverse Fourier transform to represent the log spectrum in the same way as the human ear hears (i.e., frequency domain to quefrency domain). The FT only takes a time-domain signal as input. An FT of an FT violates the rules.
@ashwinalinkil7328
1 year ago
Straight up dude, you are an absolute beast! Every other sentence just blows my mind. You made it so easy to understand and gain an intuition on such abstruse concepts. Thank you so much!
@jeevanreji7290
2 years ago
I absolutely love the way you explain these concepts! Thank you !
@Jamazon
3 months ago
your channel is a gold mine, thank you so much for what you do!
@zhenxinghu4889
3 years ago
My question is: why not apply the DFT rather than the IDFT again on Log(F(x(t)))?
@sasankkottapalli6822
5 months ago
Same question here
@tutatis96
5 months ago
@@sasankkottapalli6822 I think it works because we're not considering the phase after the log
@yuefenggao7483
12 days ago
@@sasankkottapalli6822 Both IDFT and DFT are basically equivalent here because they have the same distribution. kzitem.info/news/bejne/tIClnaqGoISddYYsi=I664FcrQVgml_Amf&t=77
@zeyuyang2053
3 years ago
Best MFCC explanation I've ever seen! Thank you!
@ValerioVelardoTheSoundofAI
3 years ago
Thank you!
@MegaCadr
3 years ago
20 minutes in, my mind started melting. Amazing video!
@drillsargentadog
2 years ago
Nice explanation and great course! One comment: I'm pretty sure big X, E, and H at 27:19 should be functions of frequency, not time, and should be multiplied, not convolved.
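The correction suggested in this comment can be written out as a short derivation. This is the standard source-filter formulation, with lowercase symbols for time-domain signals and uppercase symbols for their Fourier transforms. The notation is an assumption and may not match the slide at 27:19 exactly.

```latex
\begin{align}
x(t) &= e(t) * h(t)
  && \text{speech = glottal excitation convolved with vocal tract response} \\
X(f) &= E(f)\, H(f)
  && \text{convolution in time becomes multiplication in frequency} \\
\log\lvert X(f)\rvert &= \log\lvert E(f)\rvert + \log\lvert H(f)\rvert
  && \text{the log turns the product into a sum}
\end{align}
```

Because the last line is a sum, a further linear transform (the IDFT or the DCT) keeps the two terms additive, which is what makes it possible to separate them in the quefrency domain.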
@LauraSpinu
2 years ago
This was so helpful, can't thank you enough for your time and effort. Simply amazing - and your enthusiasm makes it so easy to watch and enjoy through the end!
@Goriuable
2 years ago
Thank you so much. I searched a lot about the topic of MFCC and did not find very good explanations. Your video is really a masterpiece and I now have a good knowledge of the concepts :) For sure I will have a look at some of your other videos. Keep up the amazing work!
@ValerioVelardoTheSoundofAI
2 years ago
Thank you - glad I could help!
@quincydelp9586
2 years ago
This is an incredibly helpful video that taught me how to implement an MFCC algorithm and intuition for why it is useful information. I can't recommend it enough.
@ValerioVelardoTheSoundofAI
2 years ago
Thank you Quincy!
@user-ni2fo1uh2l
1 year ago
I was watching the video and at some point I stopped and started talking to ChatGPT to understand those concepts. I found myself learning about convolutions and cepstral coefficients and their intuition. Once I got back to the lecture, the first thing Valerio started talking about was convolutions and the intuition behind cepstral coefficients. The moral of this story is that he is an amazing teacher, so just finish the lecture first and then search for the stuff that you did not get in the lecture :)
@4abdoulaye
3 years ago
VERYVERYVERY CLEAR, Best video I've ever seen.
@ValerioVelardoTheSoundofAI
3 years ago
Thanks!
@dver89
11 hours ago
Incredible video. Thank you!!
@abhi88mcet
3 years ago
I am more of a Reinforcement Learning guy with a bad squeaky voice trying to start a YouTube channel. I was researching the use of RL to create a realistic vocoder to substitute my voice, and stumbled upon this gem... awesome work... keep up the good work.
@ValerioVelardoTheSoundofAI
3 years ago
Thanks a lot and good luck with the YT channel -- you're on the verge of starting an amazing journey :)
@vaitom6078
2 years ago
you're a genius of vulgarization, thank you for the effort
@beincheekym8
3 years ago
awesome course, so complete, and very clear visualization. really amazing. thank you!
@ValerioVelardoTheSoundofAI
3 years ago
Thanks!
@JigarRajpopatOfficial
3 years ago
Very informative. Thank you!
@ValerioVelardoTheSoundofAI
3 years ago
Thank you Jigar :)
@akanshmaurya1568
3 years ago
I am confused about why it is a spectrum of a spectrum. When we take the Fourier transform, we go from time to spectrum, so given the last step in calculating the cepstrum, should we not call it the inverse of a spectrum?
@Erosis
3 years ago
Yeah, the inverse is kinda confusing me. I thought we'd use another Fourier Transform to get quefrency, not the inverse (which puts it back into the time domain). I read a post about this ( dsp.stackexchange.com/questions/5940/mfcc-process-confusion ) where they say that both are going to produce relatively the same thing, so it doesn't matter in the end.
@bijan8705
2 years ago
He clearly doesn't know that the inverse FT is not the same as the FT at 14:00
@RudraSingh-pb5ls
1 year ago
@@bijan8705 who doesn't know, Valerio or Akansh, the guy who asked this question here ?
@user-gb4oo2to4w
6 months ago
@@Erosis Thank you for this information. I read the post but I'm still confused... why are both going to produce the same thing? One is the inverse of the other
@ashokdhingra4
3 years ago
Hi, the Fourier transform of a time-domain signal is a series of terms, not a single number. What then is the meaning of the log of the Fourier transform? Or is it the log of each term in the Fourier transform? Further, when we take the inverse Fourier transform, we should go back to the time domain. So it is not really a 'spectrum of a spectrum'.
@subrahmanyamkunapuli1860
3 years ago
👏Excellent way to explain intricate details!! Thanks for the video series.
@ValerioVelardoTheSoundofAI
3 years ago
Thank you!
@AlBeebe
3 years ago
Excellent video. 42:08 ended up making me wonder what happened to the Slack message I thought I got. :)
@IngridKnoch
3 years ago
That was so clearly explained!! Thank you for this, Valerio
@katsiarynaruksha9381
1 year ago
Extremely useful series of lectures. Thanks a ton!
@6tyelement979
3 years ago
4:32 When u cannot answer a question u got asked in front of whole class btw great vid
@ValerioVelardoTheSoundofAI
3 years ago
LOL
@thierrydesot1164
3 years ago
Thanks a lot for this brilliant explanation. I have read several papers to grasp the concepts of MFCC, mel scaling, delta derivatives, etc. But after watching this YouTube tutorial, it is the first time I have the feeling I 'got' it. So I am on my way to watch your other tutorials.
@ritwickjha3954
2 years ago
Maybe you should be a bit clearer: taking the IFFT of the frequency domain will give us the time domain. Quefrency is in the time domain. I was a bit confused because you kept saying the IFFT will give something like a frequency domain. Also, I am not sure if taking the log of the signal in the time domain is correct: since it is a convolution of E and H, the log should be in the frequency domain, where it is a multiplication of E and H. Please correct me if I am wrong. Great video.
@dataista7717
2 years ago
Thanks for the series, man. You accelerated my speed jumping into this field a lot. Like A LOT. Really, u rock 🙌
@user-yo4kd7zy9j
1 year ago
So well explained! Thanks a lot for your amazing work.
@shaidhasan6895
3 years ago
Thanks a lot. Was waiting for this.
@ValerioVelardoTheSoundofAI
3 years ago
Glad you liked the video!
@AsEnIxX-wtf
1 year ago
Excellent presentation & explanation
@harutyunyansaten
3 years ago
I want to learn more deeply. Can you please provide the references where you took this info from?
@parismitasarma1572
3 years ago
Amazing, you are explaining the underlying concept in much easier way. Thank you so much Sir.
@RahulSharma2501
2 years ago
This is absolutely amazing.
@shereenelmetwally522
2 years ago
Thank you very much. It was really wonderful!
@Jononor
3 years ago
In "Computing Mel-Frequency Cepstral Coefficients" (approx time 38:00) you put Waveform->DFT->Log-Amp->Mel-filterbank->DCT. Is it not more conventional to apply the Mel filterbank to linear magnitude spectrogram, and then do the log transform? But maybe the order is not so important between those two steps?
@ValerioVelardoTheSoundofAI
3 years ago
It's really a matter of "preference". Both approaches work.
@pratyushsaha8482
3 years ago
Very well explained. You are awesome man !
@rakhshandamujib2793
10 months ago
Absolutely loved it!
@advaithpillai
2 years ago
Mate you are a life-saver!
@adrijachakraborty2316
2 years ago
Mind = Blown!
@avidreader100
3 years ago
I am still stalled at this video. I feel the founders of the concept have confused us by naming these unique parameters the way they did. Quefrency as a metric measured in seconds was quite a big factor confusing me. I am gradually coming to terms with it. Let me share my thoughts so that others can correct me if I am off. In the Fourier transform that gave us the spectrum, we say we convert a signal from the time domain to the frequency domain. We look at the time-domain signal as an additive value of multiple uniform/steadier frequency components (all taken within a short time frame). The amplitude on the vertical axis is expressed in different units (dB etc.), but is conceptually the same: magnitude. The Fourier transform inverted the x axis. From time it went to the inverse of time, which is frequency. The cepstrum is basically looking at the up and down shifts of the spectrum as we scan along with respect to frequency. These are the formants in speech. The amplitude is again not tampered with beyond expressing it as a log etc. The x axis is now flipped once again, from cycles per unit time to time. In both the spectrum and the cepstrum we flipped the x axis. The first time around it analyzed the signal and gave all the frequency components. The second time it gave all the formants. The amplitude of the spike in the cepstrum gave us the significant components, and the quefrency or time value at which the spikes occurred, when inverted, gives us the formant frequency corresponding to this spectrum. Does this sound right?
@Waffano
1 year ago
The IDFT part is a typo if you ask me. For me it only makes sense that the cepstrum is a spectrum of a spectrum, meaning DFT applied to a spectrum. This is the only way we can collect the frequencies of the formants. If it was IDFT it would just result in a complex waveform with no information of frequencies. In the end Valerio also specifically uses discrete cosine transform and NOT inverse discrete cosine transform, to get the final MFCCs, which makes sense. So I strongly believe the IDFT in the beginning is just a mistake and should be DFT.
@davidkooi4349
1 year ago
Wonderful video, thank you!
@yatosaurio
2 years ago
fantastic explanation, very didactic, thank you very much
@pohjanakka1
3 years ago
Thank you so much. This was clearly explained.
@sagarparmar6715
3 years ago
greatly admire this video. it's quite detailed. thanks a lot
@keem.studios
3 years ago
this video just saved my engineering final project
@ValerioVelardoTheSoundofAI
3 years ago
Nice :)
@brandonlincolnsnyder
2 years ago
this video is blowing my mind!
@anupambhattarai8765
3 years ago
Great explanation.👍
@MrOpossumx3
3 years ago
Another great vid! I would have appreciated a bit more intuition about the meaning of the MFCC coefficients/time matrix presented around 48:37. While a spectrogram is intuitive, I found a matrix of MFCC coefficients over time harder to interpret. Do you have some intuition for MFCC coefficients over time from a psychoacoustic perspective? In a spectrogram, the intensity of a given frequency at a frame links nicely to our perception of a sound's high or low pitch. What would be a perceptual equivalent for MFCC coefficients over time?
@kxiong4021
3 years ago
Thank you for sharing this amazing content. Very informative and specific. Came for copper and found gold!
@desikharkara9407
3 years ago
Very good and great explained thanks 👍
@annazaitseva6213
3 years ago
If cepstrum is a spectrum of a spectrum why inverse Fourier transform is applied to a log spectrum of a signal not forward?
@amitrege502
3 years ago
Around 50:53 you tell that MFCC ignores fine spectral structures like "pitch" which we don't care about, generally. Then you also say that MFCC works well in speech and music. I think in music, pitch is the most relevant information, people are interested in, because the musical notes themselves are defined around pitch frequencies. I think there is a contradiction in the statements. Will you please clarify. Thank you for such a nice video.
@ValerioVelardoTheSoundofAI
3 years ago
It's not a contradiction. For tasks like music genre, mood, and instrument classification, timbre is more important than pitch.
@amitrege502
3 years ago
@@ValerioVelardoTheSoundofAI Thank you
@Waffano
1 year ago
Say we wanted to identify a specific individual from their speech, for example to unlock a device by voice. In this case spectral detail would be more important than the spectral envelope, right? Because spectral detail tells us something about the unique pitch of the speaker's voice? In contrast, in ASR, where we only care about the words that are spoken and not about who speaks them, it makes sense to use the spectral envelope?
@selinm7775
2 years ago
Thank you so much
@RickNance
2 years ago
Sorry... to start with, I might have gotten confused. When you say the MEL spectral analysis shows _perceptually relevant scale for "pitch"_ you mean frequency, right? If not, I've misunderstood something at the start.
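For context on this question, here is a minimal sketch of how a frequency in Hz maps to the mel scale, which is meant to approximate perceived pitch. It assumes the widely used HTK-style formula; other conventions exist (librosa, for example, defaults to the Slaney variant), and the function names are illustrative.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (HTK-style formula, one common convention)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The same 500 Hz step covers far fewer mels at high frequencies than at low ones,
# reflecting that we perceive pitch differences less finely higher up.
low_step = hz_to_mel(1000.0) - hz_to_mel(500.0)
high_step = hz_to_mel(8000.0) - hz_to_mel(7500.0)
print(low_step, high_step)
```

So frequency is the physical quantity on the input side, and the mel value is a perceptually motivated rescaling of it.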
@TanupatBoon
3 years ago
is DCT just another Fourier transform? Why is it the inverse one?
@Jononor
2 years ago
You should put (MFCC) in the title, I think. It should help people discover the video. Not everyone knows what the abbreviation stands for :)
@tarekziedan9989
2 years ago
Very helpful
@fabricejumel4630
2 years ago
Thanks a lot . Just perfect
@MrHowdai
3 years ago
I never understood it this clearly until watching your videos!! Really appreciate it. :)) After watching this I have 2 little questions. 1. According to the Nyquist theorem, when extracting the MFCCs, do we need more Mel filter banks when processing audio signals at higher sampling rates? I ask because I found that the MFCCs of audio sampled at 44.1 kHz are NOT the same as those of the down-sampled version at 16 kHz. 2. Is it right to say that MFCCs are volume-independent audio features? Thanks for the great videos again! I hope someone can help with my questions. Thanks in advance!!
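On the second question, a hedged sketch under idealised assumptions: if a pure gain is applied to the signal, every mel-band energy is multiplied by the same factor, so every log mel-band energy shifts by the same constant. The unnormalised DCT-II below (an illustrative helper, not the video's code) shows that such a constant shift only changes coefficient 0. The log mel energies are made-up numbers, and real pipelines with flooring, normalisation, or dB clipping may behave differently.

```python
import math

def dct2(values):
    """Unnormalised DCT-II: c[k] = sum_i values[i] * cos(pi * k * (i + 0.5) / N)."""
    n = len(values)
    return [sum(values[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
            for k in range(n)]

# Hypothetical log mel-band energies for one frame.
log_mel = [2.0, 3.5, 1.0, 0.5, 2.5, 4.0, 3.0, 1.5]

# Doubling the amplitude multiplies the power by 4, which adds log(4) to every band.
shift = math.log(4.0)
log_mel_louder = [e + shift for e in log_mel]

quiet = dct2(log_mel)
loud = dct2(log_mel_louder)

# Only coefficient 0 moves; the others are unchanged up to rounding.
c0_change = loud[0] - quiet[0]
other_change = max(abs(a - b) for a, b in zip(loud[1:], quiet[1:]))
print(c0_change, other_change)
```

In this idealised setting the higher coefficients are gain-invariant while coefficient 0 tracks overall level, which is one reason some pipelines drop or replace it.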
@HorsesandCo657
2 years ago
thank you very much so much!
@ValerioVelardoTheSoundofAI
2 years ago
You're welcome!
@AlexTuduran
2 years ago
Why not just call it a *meta-spectrum*, which is literally a spectrum of a spectrum? Also, this is one of the best explanations I came across. Well done.
@ValerioVelardoTheSoundofAI
2 years ago
Meta-spectrum sounds really cool!
@user-ih4ml7he1x
8 months ago
I am wondering whether the 1st rhamonic represents the envelope (formants) or the glottal pulse in the latter part of this video. I am a little confused here at 16:12.
@mitsoskavelos
3 years ago
Awesome explanation and pleasant presentation. Well done and thank you !
@amoghshekharhiremath6627
2 years ago
Very Astounding!!!!!!!!!!!!!!!!
@harisbournas6600
3 years ago
Great explanation
@Waffano
1 year ago
@37:44 You mention that we get a mel spectrum. However, most of the resources I found don't mention any mel spectrum at that step; instead they mention a 1D mel vector of length M, where M is the number of mel bands and m is the band number. The m'th element of the mel vector then contains the sum of the products between the m'th mel filter bank and the power spectrum. Is this mel vector the same as a mel spectrum? And what are the pros and cons of using either, if they are different?
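The operation described in this comment can be sketched as follows. This is a toy illustration with hand-written triangular weights and made-up power values, not a real mel filterbank; the function name is an assumption. It suggests that the length-M mel vector is the mel spectrum of a single frame, and that stacking one such vector per frame gives a mel spectrogram.

```python
def apply_filterbank(filterbank, power_spectrum):
    """For each band, sum the products of filter weights and power-spectrum bins."""
    return [sum(w * p for w, p in zip(band, power_spectrum)) for band in filterbank]

# Three toy triangular filters over six frequency bins (M = 3).
filterbank = [
    [0.5, 1.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 1.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 1.0],
]

# Power spectra of two hypothetical frames.
frames = [
    [4.0, 1.0, 0.0, 2.0, 8.0, 1.0],
    [0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
]

mel_vector = apply_filterbank(filterbank, frames[0])                       # one frame
mel_spectrogram = [apply_filterbank(filterbank, frame) for frame in frames]  # all frames
print(mel_vector)        # [3.0, 6.0, 5.0]
print(mel_spectrogram)   # [[3.0, 6.0, 5.0], [1.0, 1.0, 0.0]]
```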
@ruanjiayang
1 year ago
Do we apply the Fourier transform or the inverse Fourier transform to the log power spectrum? Completely different things!
@kaziasifahmed2443
3 years ago
Nice video, sir. I have understood lots of things about MFCC. So if I want to make a speech recognizer with an RNN, what should I do? Feed it only spectrums, or MFCCs? I am not much of an expert in this sector, just asking. Again, great job providing valuable information.
@ValerioVelardoTheSoundofAI
3 years ago
These days we tend to use (Mel) spectrograms more than MFCCs for speech recognition tasks.
@muntazirmehdi503
3 years ago
@@ValerioVelardoTheSoundofAI can we use mel spectrograms with an RNN instead of a CNN?
@MarkEdwardsGreenside
5 months ago
Autocorrelation? This feels like achieving autocorrelation using FFT and IFFT. Is there a relationship between cepstrum and autocorrelation? I'm a newbie to this and doing my best to self-learn, so I would appreciate knowing whether this observation is correct!
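The observation in this comment can be made concrete by writing the two quantities side by side. The first line is the Wiener-Khinchin relation for the autocorrelation; the second is one common convention for the cepstrum (definitions vary in whether the magnitude or the power spectrum is used, so treat the exact form as an assumption).

```latex
\begin{align}
r_x(\tau) &= \mathcal{F}^{-1}\!\left\{ \lvert X(f)\rvert^{2} \right\}
  && \text{autocorrelation} \\
c_x(\tau) &= \mathcal{F}^{-1}\!\left\{ \log \lvert X(f)\rvert^{2} \right\}
  && \text{cepstrum (one convention)}
\end{align}
```

The only difference is the logarithm, which is what turns the product of excitation and vocal tract spectra into a sum before the inverse transform.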
@srikantachaitanya6561
3 years ago
Thanks you soo much
@AlexTuduran
2 years ago
I watched the suggested video on how to compute the envelope, but I find it unfit for this problem, or I'm missing something. Basically, to compute the envelope, you take the max of a frame. This works well in general with audio, but when constructing the envelope of a spectrum, the data is rather short/scarce (e.g. FFT 1024 => 512 points) and breaking it down into frames increases the chances of computing a rather "false" envelope. How do you manage to avoid the local minima and account only for the actual peaks? And since we're talking about speech, we'll have a lot of local minima. Applying a low-pass filter kind of does it, but it obviously has the disadvantage of potentially shaving off important peaks. So how to do it properly?
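One alternative to a frame-wise maximum is cepstral smoothing (low-quefrency liftering): transform the log spectrum to the quefrency domain, keep only the low quefrencies, and transform back. The sketch below is a toy illustration under strong assumptions: the "log spectrum" is synthetic (a slow cosine standing in for the envelope plus a fast ripple standing in for harmonics), the transforms are naive standard-library implementations, and the cutoff of 4 is an arbitrary choice for this example.

```python
import cmath
import math

def dft(x):
    """Naive forward DFT (no scaling)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse DFT (scaled by 1/N)."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

n = 32
slow = [math.cos(2 * math.pi * k / n) for k in range(n)]              # stand-in envelope
ripple = [0.3 * math.cos(2 * math.pi * 8 * k / n) for k in range(n)]  # stand-in fine detail
log_spectrum = [s + r for s, r in zip(slow, ripple)]

cepstrum = idft(log_spectrum)

# Low-pass lifter: keep low quefrencies (and their mirrored counterparts), zero the rest.
cutoff = 4
liftered = [c if (q <= cutoff or q >= n - cutoff) else 0.0
            for q, c in enumerate(cepstrum)]

envelope = [v.real for v in dft(liftered)]
detail = [total - env for total, env in zip(log_spectrum, envelope)]
print(max(abs(e - s) for e, s in zip(envelope, slow)))
```

In this synthetic case the recovered envelope matches the slow component and the remainder is the ripple; on real speech the separation is only approximate and depends on the chosen cutoff.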
@lingarajmishra8981
1 year ago
After applying the Fourier transform, how were the vocal tract response and glottal pulse still in the time domain? Please explain.
@ahmadkhadra992
3 years ago
Hi Valerio, I have a little question: when we apply the DFT to the signal, why do we get a power spectrum? Why not just a spectrum?
@bigpenguin8457
2 years ago
Thank you for the video. I wanted to ask if you have any documents or code related to extracting the "spectral detail", or to the entire procedure that you described in the video (spectrum --> log amplitude spectrum --> spectral envelope --> spectral detail). I have applied an amplitude envelope to the log power spectrum, which is a spectral envelope in theory, but it gives me smaller values, so I cannot do element-wise subtraction from the log power spectrum to get the spectral detail. Please tell me if I am wrong somewhere. Thank you.
@amitrege502
3 years ago
This is a good video. However the question is, in the section on 'Formalizing Speech' why are you using the (t) variable in the transform domain also. The domain should be frequency.
@yannickpezeu3419
3 years ago
Thanks, very interesting !
@sharonm1261
1 year ago
this is really interesting, great explanation, thanks! now I just have to work out how to relate this to blossom bat squeaks 🤔 (their frequencies are a lot higher)
@mohamadhanifomarsaifuddin4578
3 years ago
Good Explanation 5star
@ourissueanniversary
1 year ago
Hello! May I ask a question about MFCCs? You said that MFCCs are not so great for synthesis. Does that mean that ordinary mel spectrograms are mostly used in speech synthesis tasks?
@ValerioVelardoTheSoundofAI
1 year ago
Correct. (Mel) spectrograms, and, lately, directly raw audio.
@DOMINIK32110
3 years ago
Great video as always! Could you recommend books or other sources (it'd be great if they could be found on the Internet) to read more about MFCCs? Especially in the context of speech.
@sharonm1261
1 year ago
could anyone perhaps tell me which is the next video to watch for how to use MFCCs from different speakers to tell the speakers apart....no worries if there's not one, I will also search and google, thank you :)
@user-tj4ut8ox9r
3 years ago
32:48 how do you choose the sine wave frequency?? I thought we use cepstrums to do that for us automatically?
@Leooel7054
3 years ago
Thanks for the very informative video. Just some points of confusion: why are you using H(t) when you're in Freq domain? I think you have confused frequency domain and time domain in the video multiple times.
@zweiteid3340
1 year ago
Hello, we are currently doing a project on verification using the human voice (speaker recognition). Would MFCCs be useful here at all, when it is actually about filtering out phonemes?
@parasharparikh9352
3 years ago
Can I use MFCCs for extracting features from the current signal?