Fantastic. Thanks, Julia, I was looking for something like this for a long time. Also hoping you will do a video on what-if analysis soon, maybe?
@olexiypukhov-KT
1 year ago
Thanks again for all your videos! You are an amazing teacher.
@Pablo-ln9nd
1 year ago
Hey @JuliaSilge! Great video as always :) I have a question: what do we do when we have a high-cardinality target variable (let's say 17 categories)? Can we still use tree-based models (random forest, XGBoost, etc.)? Thank you! :)
@JuliaSilge
1 year ago
If your outcome has that many levels, you have two main options that I have seen work well in practice. You can a) train 17 separate models, each one level vs. all others, and at prediction time pick the level with the highest probability, or b) train one model to predict all levels at once. You can see an example of the second here: smltar.com/mlclassification.html#mlmulticlass
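The second option can be sketched in tidymodels roughly like this (a minimal sketch, assuming a hypothetical data frame `train_df` whose factor outcome `category` has 17 levels; the column names and hyperparameters are illustrative, not from the video):

```r
library(tidymodels)

## boost_tree() handles a multiclass outcome natively when the
## outcome is a factor, so one model covers all 17 levels at once.
spec <- boost_tree(trees = 500) |>
  set_engine("xgboost") |>
  set_mode("classification")

multi_fit <- workflow() |>
  add_formula(category ~ .) |>
  add_model(spec) |>
  fit(data = train_df)

## Class probabilities for each of the 17 levels:
predict(multi_fit, new_data = train_df, type = "prob")
```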
@Pablo-ln9nd
1 year ago
@@JuliaSilge thank you so much for the info Julia. Keep up the great content!
@mcmahonandrewjonatha
1 year ago
I love this! Thank you, Julia. I have two quick questions. 1) The embedding techniques are impressive, but should I take any precautions when using them to avoid harmful data leakage from the outcome variable? 2) Once the high-cardinality predictor is transformed to a numeric variable, can I treat its correlations with other numeric variables as informative and meaningful on their own (assuming I made an ordinary correlation matrix)? Or would that be misguided?
@JuliaSilge
1 year ago
Be sure to estimate embedding techniques using resampling (inside of resampling, not one time before resampling) to avoid data leakage, yes! If you use them in a workflow like I have shown here, that is the behavior you get. If you want to use something like `step_corr()` after you have created an embedding, I think that would work great. For something more like a correlation matrix, as in this section: www.tmwr.org/dimensionality.html#beans you just need to remember that the outcome was used to estimate that value.
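The "inside of resampling" point can be sketched like this (a hedged sketch, assuming a hypothetical data frame `train_df` with a binary outcome `accredited` and a high-cardinality predictor `subject_matter`; names are illustrative, not from the video):

```r
library(tidymodels)
library(embed)

## Because the embedding step lives inside the recipe, and the recipe
## lives inside the workflow, the encoding is re-estimated on each
## analysis set during resampling -- the outcome never leaks from the
## assessment set into the encoding.
rec <- recipe(accredited ~ subject_matter, data = train_df) |>
  step_lencode_glm(subject_matter, outcome = vars(accredited))

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(logistic_reg())

folds <- vfold_cv(train_df, v = 5, strata = accredited)
res <- fit_resamples(wf, resamples = folds)
collect_metrics(res)
```

Estimating the encoding once on the full training set and then resampling would leak the outcome and make performance estimates optimistic.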
@mcmahonandrewjonatha
1 year ago
@@JuliaSilge Very helpful. Thanks 🙏
@AnkeetSingh-gt9fm
6 months ago
Hey Julia, great tutorial. I had a question. Here you used Subject_Matter as the only high-cardinality variable. If we have a dataset with multiple high-cardinality columns, can the recipe method be used for all of them?
@JuliaSilge
6 months ago
Yes, you sure can! You will need to keep in mind how much data you have vs. how many predictors you are trying to encode in this way, and definitely keep in mind that you are using the **outcome** in your feature engineering. You can read more here: www.tmwr.org/categorical
@AnkeetSingh-gt9fm
6 months ago
@@JuliaSilge Great I’ll keep that in mind. Thank you!
@AnkeetSingh-gt9fm
6 months ago
@@JuliaSilge Hi, I had a follow-up to my previous question. Would we have to define a separate recipe for each column? And while creating the workflow, how would you add the recipes for multiple columns (since a workflow only allows one recipe)? I was unable to find resources for this online. Any help would be appreciated!
@JuliaSilge
6 months ago
@@AnkeetSingh-gt9fm Oh, you don't need a separate recipe for different columns, just separate steps. So you could do `step_lencode_glm()` then pipe to another `step_lencode_glm()`, etc.
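Chaining the steps as described might look like this (a minimal sketch, assuming hypothetical high-cardinality columns `city` and `vendor` and a binary outcome `accredited`; all names are illustrative):

```r
library(recipes)
library(embed)

## One recipe, multiple encoding steps: each step_lencode_glm() fits
## its own generalized linear model for that predictor, using the
## outcome to estimate the encoding.
rec <- recipe(accredited ~ city + vendor + budget, data = train_df) |>
  step_lencode_glm(city, outcome = vars(accredited)) |>
  step_lencode_glm(vendor, outcome = vars(accredited))
```

This single recipe then goes into the workflow as usual with `add_recipe()`.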
@AnkeetSingh-gt9fm
6 months ago
@@JuliaSilge Thank you, that’s what I figured, and I ran the code. I received an error: `Error in - dsy2dpoC.Msymfrom) : not a positive definite matrix (and positive semidefiniteness is not checked)`. Looks like I need to assess some variables in my model. You are very helpful with your prompt replies, I really appreciate it. Thank you!
@mrsnakesss
1 year ago
Thanks for this video! I have never used embeddings for categorical predictors before. Why is the default value for a new level -0.909 and not 0? What does it represent? Thanks!
@JuliaSilge
1 year ago
This would be like the mean value of the outcome, or like the intercept or bias (if this predictor were the only one being used in the model). It's not zero because there are more unaccredited than accredited museums.
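A quick back-of-envelope check of that idea (the proportion below is illustrative, chosen only to show how a log-odds value near -0.909 arises for a binary outcome encoded with a logistic GLM; it is not the exact figure from the video's data):

```r
## With a binary outcome, the "mean value of the outcome" on the
## model's link scale is the overall log-odds. For example, if about
## 28.7% of museums were accredited:
p <- 0.287
log(p / (1 - p))  # about -0.91, not 0, because the classes are imbalanced
```

A default of 0 would correspond to a 50/50 split between accredited and unaccredited museums, which is not what the data show.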
Comments: 18