Updated the encoding in baked_pumpkins and pumpkins_recipe
Parent: a3cd387195
Commit: 11c780ca9c
@@ -150,6 +150,12 @@ The goal of data exploration is to try to understand the `relationships` between

Given the data types of our columns, we can `encode` them and be on our way to making some visualizations. This simply involves `translating` a column with `categorical values`, for example our columns of type *char*, into one or more `numeric columns` that take the place of the original - something we did in our [last lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/3-Linear/solution/lesson_3.html).

For feature encoding there are two main types of encoders:

1. Ordinal encoder: it suits ordinal variables well, which are categorical variables whose data follows a logical ordering, like the `item_size` column in our dataset. It creates a mapping such that each category is represented by a number corresponding to its order in the column.
2. Categorical encoder: it suits nominal variables well, which are categorical variables whose data does not follow a logical ordering, like all the features other than `item_size` in our dataset. It applies one-hot encoding, which means that each category is represented by a binary column: the encoded variable is equal to 1 if the pumpkin belongs to that variety and 0 otherwise.
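As a quick illustration of the difference, here is a minimal sketch using made-up toy vectors (not the pumpkins data) of what the two ideas look like in base R:

```{r encoding_sketch}
# Hypothetical toy vectors, for illustration only
sizes <- c("sml", "med", "lge", "med")
varieties <- c("PIE TYPE", "FAIRYTALE", "PIE TYPE", "MINIATURE")

# Ordinal encoding: declare the logical order, then map each level to its rank
as.integer(ordered(sizes, levels = c("sml", "med", "lge")))
#> [1] 1 2 3 2

# One-hot (categorical) encoding: one 0/1 indicator column per category
model.matrix(~ 0 + factor(varieties))
```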
Tidymodels provides yet another neat package: [recipes](https://recipes.tidymodels.org/) - a package for preprocessing data. We'll define a `recipe` that specifies that all predictor columns should be encoded into a set of integers, `prep` it to estimate the required quantities and statistics needed by any operations, and finally `bake` to apply the computations to new data.

> Normally, recipes is used as a preprocessor for modelling, where it defines what steps should be applied to a data set in order to get it ready for modelling. In that case it is **highly recommended** that you use a `workflow()` instead of manually estimating a recipe using prep and bake. We'll see all this in just a moment.

@@ -158,17 +164,19 @@ Tidymodels provides yet another neat package: [recipes](https://recipes.tidymode
```{r recipe_prep_bake}
# Preprocess and extract data to allow some data analysis
baked_pumpkins <- recipe(color ~ ., data = pumpkins_select) %>%
  # Define ordering for item_size column
  step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%
  # Convert factors to numbers using the order defined above (Ordinal encoding)
  step_integer(item_size, zero_based = FALSE) %>%
  # Encode all other predictors using one-hot encoding
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  # Estimate the preprocessing parameters, then apply them to the data itself
  prep() %>%
  bake(new_data = NULL)

# Display the first few rows of preprocessed data
baked_pumpkins %>%
  slice_head(n = 5)
```
Now let's compare the feature distributions for each label value using box plots. We'll begin by reshaping the data into a *long* format, which makes it somewhat easier to create multiple `facets`.
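As a rough sketch of that reshaping (assuming the `baked_pumpkins` tibble created above, with the Tidymodels packages already loaded), the long format and the facetted box plots could be produced along these lines:

```{r box_plot_sketch}
# Pivot every predictor into a (feature, value) pair, keeping the label column,
# then draw one box plot panel per feature
baked_pumpkins %>%
  pivot_longer(cols = -color, names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = color, y = value, fill = color)) +
  geom_boxplot() +
  facet_wrap(~ feature, scales = "free_y")
```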
@@ -255,22 +263,22 @@ pumpkins_train %>%

🙌 We are now ready to train a model by fitting the training features to the training label (color).

We'll begin by creating a recipe that specifies the preprocessing steps that should be carried out on our data to get it ready for modelling, i.e. encoding categorical variables into a set of integers. Just like `baked_pumpkins`, we create a `pumpkins_recipe`, but we do not `prep` and `bake` it, since it will be bundled into a workflow, which you will see in just a few steps from now.

There are quite a number of ways to specify a logistic regression model in Tidymodels. See `?logistic_reg()`. For now, we'll specify a logistic regression model via the default `stats::glm()` engine.
```{r log_reg}
# Create a recipe that specifies preprocessing steps for modelling
pumpkins_recipe <- recipe(color ~ ., data = pumpkins_train) %>%
  # Define ordering for item_size, then encode it as integers (ordinal encoding)
  step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%
  step_integer(item_size, zero_based = FALSE) %>%
  # Encode all other predictors using one-hot encoding
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)

# Create a logistic model specification
log_reg <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
```
Now that we have a recipe and a model specification, we need to find a way of bundling them together into an object that will first preprocess the data (prep+bake behind the scenes), fit the model on the preprocessed data and also allow for potential post-processing activities.
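A `workflow()` is exactly that kind of bundling object. As a minimal sketch, assuming the `pumpkins_recipe`, `log_reg` and `pumpkins_train` objects created above (the names `log_reg_wf` and `wf_fit` are illustrative placeholders), the bundling and fitting could look like this:

```{r workflow_sketch}
# Bundle the recipe and the model specification into a single workflow object
log_reg_wf <- workflow() %>%
  add_recipe(pumpkins_recipe) %>%
  add_model(log_reg)

# Fitting the workflow preps the recipe, bakes the training data and fits the model
wf_fit <- log_reg_wf %>%
  fit(data = pumpkins_train)
```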