You are a legend! I've spent hours trying to find a proper tutorial on missing data imputation. They were all about mice, and they were applying it to 20-row, 5-column datasets :D Since my dataset is relatively big, the mice package was struggling to compute missing values. But with the help of your small script I was able to get a result in approximately 45 minutes. Thank you again
@yuzaR-Data-Science
2 years ago
Great to hear, Mustafa! I am glad it's useful not only for me :)
@chacmool2581
1 year ago
Using 'ggplotly' to make a missing value heatmap interactive is too computationally expensive and slow for anything but very small datasets. Instead, you could try making an interactive heatmap directly using 'd3heatmap'. Much faster. Plus you can control the aesthetics of 'd3heatmap' to a greater degree than the 'vis_miss()' function.
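(A rough sketch of the idea, for anyone curious: this assumes the {d3heatmap} package, which has since been archived from CRAN and may need installing from GitHub, and uses the built-in airquality dataset for illustration.)

```r
# remotes::install_github("talgalili/d3heatmap")  # if not on CRAN anymore
library(d3heatmap)

# Encode missingness as a 0/1 matrix: 1 = missing, 0 = observed
na_matrix <- 1 * is.na(airquality)

# Interactive heatmap of the missing-value pattern; dendrograms are
# switched off so rows keep their original order
d3heatmap(na_matrix, dendrogram = "none", colors = c("grey90", "firebrick"))
```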
@yuzaR-Data-Science
1 year ago
good idea, thanks! I'll try d3heatmap out
@auliakhoirunnisa9447
2 months ago
Thank you for your explanation. It really helps me a lot! Your voice is indeed calming and soothing 😃 Will definitely subscribe, Sir!
@yuzaR-Data-Science
2 months ago
Thanks for subbing! Hope you like the rest too ;)
@45tanviirahmed82
4 months ago
This video ends abruptly 🤣 I was so into it that I thought there was a problem. Great video! Your playlist on R is becoming an addiction
@yuzaR-Data-Science
4 months ago
Awesome! Really happy you like it. I think I did cut it at the end, because the ending turned out to be useless, which killed retention. So in more recent videos I try to provide value every second... it doesn't always work, but I think the videos have gotten a bit better since then :) Thank you so much for the feedback!
@angelajulienne3122
1 year ago
AMAZIIIINGGGGG !! You're incredible thanks :D
@yuzaR-Data-Science
1 year ago
Thanks for your feedback! And thanks for watching!
@haythemboucharif7750
2 years ago
Mister, I am French, so I can tell you that we have problems with English, but let me say that you speak really smoothly, and very, very well. Thank you very much
@yuzaR-Data-Science
2 years ago
Thanks, Haythem! Glad you liked it. I can recommend the Deep Exploratory Analysis video. It's long but very useful. Cheers
@vyshnavisanagapalli4314
7 months ago
Hi sir, I have gone through this video, but I'm not able to get the plot_na_pareto function in RStudio. It's throwing an error: "builtinfunction not found". Can you help me overcome this issue?
@yuzaR-Data-Science
7 months ago
Hi, it works perfectly on my PC. Have you installed and loaded the {dlookr} library?
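(A minimal check, assuming the {dlookr} package and the built-in airquality dataset, which contains missing values:)

```r
# install.packages("dlookr")  # once, if not yet installed
library(dlookr)

# Pareto chart of missing values per variable
plot_na_pareto(airquality)
```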
@vyshnavisanagapalli4314
7 months ago
Thank you for replying, Sir. I am getting the plot now; there was actually some problem with my RStudio, which I rectified. And I must say your videos are so informative and easy to understand.
@yuzaR-Data-Science
6 months ago
that's so nice of you! huge thanks! I am happy my content is useful!
@warmtaylor
1 year ago
Thank you very much for your informative, succinct video! Is the {missRanger} package considered the best package for multivariate imputation? Is {missRanger} better than the {mice} or {miceRanger} packages? How did you discover {missRanger}? I'm sorry if I asked too many questions; I'm new to data imputation and would like to select the best package to impute my data.
@yuzaR-Data-Science
1 year ago
Sorry for the late reply. I was traveling a lot lately. I believe missRanger is the best, but that's just a belief, not a fact. I think so because it combines predictive mean matching and multiple imputation, and it iterates until the OOB error stops improving. I did not, however, directly compare the results or usability of the packages. Usability is also important, because there are tons of packages that don't run without some special setup. missRanger does, and does it quickly. Having said this, if you do compare the results of different imputations, I would be grateful to know how it went. Kind regards and thanks for your nice feedback!
@warmtaylor
1 year ago
@@yuzaR-Data-Science No worries. Thank you very much for your answer. 😀 I think I will probably stick with the {missRanger} package for now, since it is easy to implement and has the great features you discussed. 😄 // I was wondering if you could provide me with quantitative method(s) to assess the accuracy of the imputation, rather than visualisation? // In your demonstration of the {missRanger} package (5:28), I think it is essential to include the argument `pmm.k` (e.g. pmm.k = 3) to conserve the data structure/format. When I first ran the code without `pmm.k`, it gave different rounding to my values. I have checked the package's vignette, and it confirms that setting `pmm.k` to a positive number combines all imputations done during the process with a predictive mean matching (PMM) step, leading to more natural imputations and improved distributional properties of the resulting values. 🤔 Best regards, Poss.
@yuzaR-Data-Science
1 year ago
Oh, cool, thanks for pointing out the "pmm.k" option! I actually often forget it. What I usually use to follow up on predictions is "num.trees = 10000, verbose = 2, seed = 1, returnOOB = T", which displays the out-of-bag error for each variable at each iteration. After some iterations, when the OOB stops improving, it stops imputing and you have a final dataset. I try not to accept any OOB above 10% ... but yeah, it depends on the situation. I also usually plot the data, just to see whether some very strange values were predicted ... it has never been the case till now. Of course, the more data you have, the better the predictions. Cheers ;)
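(Putting both replies together, a hedged sketch of such a call on the built-in airquality data; the parameter values are illustrative, not a recommendation:)

```r
library(missRanger)

imputed <- missRanger(
  airquality,
  formula   = . ~ .,   # impute every variable using all the others
  pmm.k     = 3,       # predictive mean matching keeps imputed values "natural"
  num.trees = 1000,    # passed through to ranger
  verbose   = 2,       # print OOB error per variable at each iteration
  seed      = 1,
  returnOOB = TRUE     # attach final OOB errors to the result
)

# Out-of-bag prediction error of the final iteration, per variable
attr(imputed, "oob")
```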
@warmtaylor
1 year ago
@@yuzaR-Data-Science Oh, I see. Since my data contains over 3,000 rows and 30 variables, I reduced "num.trees" to 100 to shorten the processing time. Consequently, this led to different rounding of the imputed values, so I added the argument "pmm.k" to retain their data structure. Thank you very much for your clarification! :)
@ntran04299
10 months ago
Thank you for this great video. May I ask the assumptions that should/must be met before using missRanger to impute data?
@yuzaR-Data-Science
10 months ago
Yes, you may :) but I am afraid they are just common sense. For example, I never impute the response variable; I don't impute when there are a lot (>20%) of missing values; and I always check the imputation results and accept them or not depending on the result. For instance, when I impute categories and after imputation only one category was filled up while the others were not (in cases where something like 10% or more needed to be imputed), then I don't accept that. So, no formal assumptions, but your own sanity checks are important here. Hope that helps! Cheers
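(One of those common-sense checks, sketched in base R on the built-in airquality data; the 20% threshold is just the rule of thumb mentioned above:)

```r
# Proportion of missing values per column
miss_share <- colMeans(is.na(airquality))
round(miss_share, 2)

# Variables a cautious analyst might exclude from imputation
names(miss_share)[miss_share > 0.20]
```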
@ntran04299
10 months ago
@@yuzaR-Data-Science I see thank you sir!
@yuzaR-Data-Science
9 months ago
you are very welcome!
@achual1909
2 years ago
Can I use this for time-series data?
@yuzaR-Data-Science
2 years ago
If you want to impute the date format itself (day/month/year, sec/min/hour), I don't think so. But if your timepoints are columns, and you just have some measurements that are sometimes missing, then for sure.
@mkklindhardt
2 months ago
Amazing 👏🏽 thank you
@yuzaR-Data-Science
2 months ago
Glad you liked it! Thanks for watching!
@TheBaudoing2007
1 year ago
thank you ! this is great
@yuzaR-Data-Science
1 year ago
glad you enjoyed! Thanks for watching!
@chrisdietrich6400
1 year ago
Thanks a lot! Super helpful video! I am just wondering at which point in the data management process it would be the best to apply the imputation - I have some categorical items that I use for multiple factor analysis, which I then use for multilevel modelling. I am currently applying the imputation after I created the factors - however my intuition says it might be wiser to impute as a very first step. Do you have an opinion on that? (or some literature in regard to this?)
@yuzaR-Data-Science
1 year ago
Glad it was helpful! Well, I am not familiar with any hard rules, or even rules of thumb. For me it depends on the imputation goals and common sense. I once did 3 imputations because the dataset needed lots of operations, so, in order not to lose a few points here and a few there, I did 3 rounds. Another thing is that categories or factors are supposed to be recognised automatically. So, factorising before imputation makes sense to me. If you have 3 categories, 1, 2 and 3, and ask for imputation of such a "numeric" variable, you might get some odd continuous numbers. However, if you want exactly that - go for it.
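(A sketch of that last point, with a small made-up data frame: declaring the categorical variable as a factor *before* imputing lets missRanger impute categories rather than odd decimals.)

```r
library(missRanger)

# Toy data: one categorical and one numeric variable, both with NAs
df <- data.frame(
  group = factor(c("a", "b", NA, "a", "c", NA, "b", "a")),
  x     = c(1.2, NA, 3.4, 2.2, NA, 5.0, 4.1, 3.3)
)

imputed <- missRanger(df, pmm.k = 3, seed = 1)
str(imputed)  # group stays a factor with levels a/b/c; x stays numeric
```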
@jameswhitaker4357
1 year ago
I'm just a mere junior analyst, but I am enamored by cool statistical methods. I have a lot of questions that you answered. While I have a minor mistrust in algos and ML, I have a major intrigue in how accurate and precise imputations can be.
@yuzaR-Data-Science
1 year ago
Glad it was helpful! The missRanger command can even tell you, for every variable, how good the imputation was via the OOB (out-of-bag) error rate. I don't remember whether I talk about it in the video.
@syhusada1130
2 years ago
Thank you
@yuzaR-Data-Science
2 years ago
You're welcome!
@syhusada1130
2 years ago
Been coming back to this video. For a dataset with 165,040 rows, missRanger crashed my RStudio. I ended up using imputate_na with mode as the method, since I'm not sure what yvar I should use in the dataset. It produced an "imputation" class object, and I'm not sure what to do with it; can I just insert the result into the dataset?
@yuzaR-Data-Science
2 years ago
I would still recommend missRanger; I've had the best experience with it. Are all 165k rows and all variables important? If not, reduce the dataset or split it into smaller sets. By the way, it has never crashed my RStudio; it only took a little longer if the dataset was huge. Then, ask yourself: do all variables (rows) contribute to a meaningful imputation? E.g. IDs or overly diverse categorical variables don't, but they make missRanger think longer for no return. And if some variables have too many missing values, like 30%, do you actually want them to be imputed? I suggest missRanger over "imputate_na" because you can track the OOB error (which is amazing) and because you create a new dataset, which you can immediately use if the OOB is low: d_imputed <- data %>% missRanger(., formula = . ~ ., num.trees = 1000, verbose = 2, seed = 999, returnOOB = T)
@syhusada1130
2 years ago
@@yuzaR-Data-Science thank you for the extra tips, amazing channel by the way, love it!
@yaoliao3517
2 years ago
Really helpful to me. I saw your recommendation on Twitter.
@yuzaR-Data-Science
2 years ago
Glad it was helpful! And thanks for the nice comments! They help :)
@rayray0313
3 years ago
Excellent stuff. Thanks for making this video.
@yuzaR-Data-Science
3 years ago
My pleasure, Ray! Thanks for the nice feedback!
@dle3528
2 years ago
This video is awesome. Congrats! Can I use this method before estimating ML models? Should I impute data before or after partitioning the data?
@yuzaR-Data-Science
2 years ago
Thanks a lot! Imputation before partitioning is for sure better, because you have more data for the model to learn from, so the imputation quality will be better. Cheers!
@dle3528
2 years ago
@@yuzaR-Data-Science thank you so much ! 😃😃
@muhammedhadedy4570
1 year ago
You are a true legend. I enjoy every single video of your tutorials.
@yuzaR-Data-Science
1 year ago
Glad you enjoy them! Thank you for watching!
@robertasampong
2 years ago
Absolutely excellent! Thank you!
@yuzaR-Data-Science
2 years ago
Glad you enjoyed it! Check out the later videos, you might like those too. Thanks for the feedback!
Comments: 55