Exploratory Analysis II: Identifying effort-related features contributing to misunderstanding resolution via XGBoost

Overview

This script is an adaptation of the pipeline by Ćwiek et al. (n.d.).

In this script, we are going to use XGBoost (eXtreme Gradient Boosting; Chen and Guestrin 2016) to identify effort-related features beyond those investigated in our confirmatory analysis, i.e., torque change, amplitude envelope, and change in center of pressure.

XGBoost is a machine learning algorithm that builds an ensemble of decision trees to predict an outcome based on input features. In our case, the outcome is the communicative attempt (baseline, first correction, second correction), and the input features are all effort features collected in the Feature extraction script.
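To make the mechanics concrete, here is a minimal, self-contained sketch of a multi-class XGBoost fit on hypothetical toy data (feat_a, feat_b, and feat_c are made up, not our effort features); the actual models below are trained through caret with cross-validation and a tuning grid.

library(xgboost)

set.seed(1)
X <- matrix(rnorm(40 * 3), ncol = 3,
            dimnames = list(NULL, c("feat_a", "feat_b", "feat_c"))) # toy features
y <- sample(0:2, 40, replace = TRUE) # toy outcome; classes are 0-indexed

dtrain <- xgb.DMatrix(data = X, label = y)
bst <- xgb.train(params = list(objective = "multi:softmax", num_class = 3),
                 data = dtrain, nrounds = 10)
predict(bst, X)              # predicted class (0, 1, or 2) per observation
xgb.importance(model = bst)  # gain-based feature importance of the ensemble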

We are going to use separate models for each modality, as we expect differences in how (strongly) features contribute to resolving misunderstanding in the first and second correction. Finally, once we have a list of features ordered by their importance in predicting communicative attempt, we will combine the XGBoost-computed importance with the PCA we performed in this script. This is to ensure we pick features from uncorrelated dimensions and thus cover more explanatory planes of effort.

Note that the current version of the script uses data from dyad 0 only. Since this is not a sufficient amount of data for any meaningful conclusions, this script serves to build the workflow. We will use an identical pipeline with the full dataset, and any deviations from this script will be reported.

Code to prepare the environment
parentfolder <- dirname(getwd())

datasets      <- paste0(parentfolder, '/08_Analysis_XGBoost/datasets/')
models        <- paste0(parentfolder, '/08_Analysis_XGBoost/models/')
plots         <- paste0(parentfolder, '/08_Analysis_XGBoost/plots/')

# Packages 
library(tibble) # Data Manipulation
library(stringr)
library(tidyverse) # includes readr, tidyr, dplyr, ggplot2
library(data.table)
 
library(ggforce) # Plotting
library(ggpubr)
library(gridExtra)

library(rpart) # Random Forests and XGBoost
library(rpart.plot)
library(ranger)
library(tuneRanger)
library(caret)
library(xgboost)
library(parallel)
library(mice)
library(doParallel)

# Use all available cores for parallel computing
options(mc.cores = parallel::detectCores())

colorBlindBlack8  <- c("#000000", "#E69F00", "#56B4E9", "#009E73", 
                       "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

Gesture

In the previous script, we already separated trials into modalities and cleaned the resulting dataframes of superfluous columns, so now we can directly load them and continue with modelling.

# Load in
data_ges <- read_csv(paste0(datasets, "ges_clean_df.csv"))

# Make predictor a factor variable
data_ges$correction_info <- as.factor(data_ges$correction_info)

# Display 
head(data_ges, n=10)
# A tibble: 10 × 325
   arm_duration arm_inter_Kin arm_inter_IK arm_bbmv lowerbody_duration
          <dbl>         <dbl>        <dbl>    <dbl>              <dbl>
 1         3816          28.0         29.5    10.3                3244
 2         5280          28.1         30.9    13.4                5862
 3         4302          28.4         31.2    10.6                4206
 4         4474          27.3         30.4    10.2                4398
 5         4388          27.9         31.2     9.75               3782
 6         5852          29.4         32.0    10.7                5404
 7         2736          26.3         27.7     6.55                  0
 8         4256          28.7         29.8     8.15                  0
 9         2516          26.2         27.8     7.77                884
10         4504          28.6         30.9     9.94               4266
# ℹ 320 more variables: lowerbody_inter_Kin <dbl>, lowerbody_inter_IK <dbl>,
#   lowerbody_bbmv <dbl>, leg_duration <dbl>, leg_inter_Kin <dbl>,
#   leg_inter_IK <dbl>, leg_bbmv <dbl>, head_duration <dbl>,
#   head_inter_Kin <dbl>, head_inter_IK <dbl>, head_bbmv <dbl>,
#   arm_moment_sum_Gmean <dbl>, arm_moment_sum_Gstd <dbl>,
#   arm_moment_sum_pospeak_n <dbl>, arm_moment_sum_integral <dbl>,
#   arm_moment_sum_range <dbl>, arm_moment_sum_change_Gmean <dbl>, …

Random forests

We first grow a single classification tree to get a quick view of the data, then build a random forest.

# prepare predictors
predictors <- setdiff(names(data_ges), "correction_info")

formula_str <- paste("correction_info ~", paste(predictors, collapse = " + "))

# Convert the formula string to a formula object
gesTree_formula <- as.formula(formula_str)

# Now use the formula in rpart
gesTree <- rpart(formula = gesTree_formula, data = data_ges, 
                method='class', # Specify that it's a classification tree
                control = rpart.control(maxdepth = 5)  # Control parameters for the 'rpart' function
)

prp(
  gesTree,        # The decision tree object to be visualized
  extra = 1,      # Show the number of observations per class at each node
  varlen = 0,     # 0 = display full variable names
  faclen = 0      # 0 = display full factor-level names
)

Split the data

# Stratified split: ensure that all levels of the outcome are present in both sets
train_data <- data_ges %>%
  group_by(correction_info) %>%
  sample_frac(0.8, replace = FALSE) %>%
  ungroup()

# Assign the remaining samples to the test set
test_data <- anti_join(data_ges, train_data)
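As a quick sanity check (not part of the original pipeline), we can tabulate the outcome in both sets to confirm that every level is represented:

# Outcome levels per split
table(train_data$correction_info)
table(test_data$correction_info)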

Building the untuned model.

# Untuned model with permutation importance
gesUntuned <- ranger(
  y = train_data$correction_info,
  x = train_data[, 1:324], # predictor columns only (the outcome is column 325)
  num.trees = 500,
  importance = "permutation"
)

predictions <- predict(gesUntuned, data = test_data)$predictions

# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_data$correction_info)

# Print the confusion matrix
print(confusion_matrix)
Confusion Matrix and Statistics

          Reference
Prediction c0 c0_only c1 c2
   c0       0       2  1  0
   c0_only  1       0  0  0
   c1       0       0  0  1
   c2       0       0  0  0

Overall Statistics
                                     
               Accuracy : 0          
                 95% CI : (0, 0.5218)
    No Information Rate : 0.4        
    P-Value [Acc > NIR] : 1          
                                     
                  Kappa : -0.3158    
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: c0 Class: c0_only Class: c1 Class: c2
Sensitivity              0.000         0.0000     0.000       0.0
Specificity              0.250         0.6667     0.750       1.0
Pos Pred Value           0.000         0.0000     0.000       NaN
Neg Pred Value           0.500         0.5000     0.750       0.8
Prevalence               0.200         0.4000     0.200       0.2
Detection Rate           0.000         0.0000     0.000       0.0
Detection Prevalence     0.600         0.2000     0.200       0.0
Balanced Accuracy        0.125         0.3333     0.375       0.5
# Calculate feature importance
feature_importance <- importance(gesUntuned, num.threads = 1, type = 1) 

# Convert to data frame
feature_importance <- as.data.frame(feature_importance, stringsAsFactors = FALSE)
feature_importance$Feature <- rownames(feature_importance)
colnames(feature_importance) <- c("Importance", "Feature")

# Sort by importance
sorted_feature_importance <- feature_importance[order(-feature_importance$Importance), ]

# Print sorted feature importance
head(sorted_feature_importance, n=10)
                              Importance                      Feature
body_slope_95                0.005932540                body_slope_95
head_power_range             0.005242857             head_power_range
arm_speedKin_sum_pospeak_std 0.003863492 arm_speedKin_sum_pospeak_std
arm_speedKin_sum_pospeak_n   0.003602381   arm_speedKin_sum_pospeak_n
arm_inter_Kin                0.003116667                arm_inter_Kin
arm_accKin_sum_integral      0.002766667      arm_accKin_sum_integral
duration_mov                 0.002723810                 duration_mov
body_nComp_95                0.002709524                body_nComp_95
head_angSpeed_sum_range      0.002538095      head_angSpeed_sum_range
arm_accKin_sum_pospeak_std   0.002257143   arm_accKin_sum_pospeak_std

Set up parallel processing for tuning the random forest.

# Define the number of CPU cores to use
num_cores <- detectCores()

# Create a cluster with specified number of cores
cl <- makeCluster(num_cores)

Tuning the random forest.

tuneGes <- makeClassifTask(data = data_ges[, 1:325], # all columns, incl. the outcome
                           target = "correction_info")

tuneGes <- tuneRanger(tuneGes,
                      measure = list(multiclass.brier),
                      num.trees = 500)

# Return hyperparameter values
#tuneGes

# Recommended parameter settings: 
#   mtry min.node.size sample.fraction
# 1   57             4       0.2279307
# Results: 
#   multiclass.brier exec.time
# 1         0.790973     0.164

gesTuned <- ranger(
  y = train_data$correction_info,
  x = train_data[, 1:324], # predictor columns only
  num.trees = 5000, 
  mtry = 57, # Recommended mtry value (number of features tried at each split).
  min.node.size = 4, # Recommended min.node.size value (minimal number of samples in a terminal node).
  sample.fraction = 0.2279307, # Recommended sample fraction (share of data used for bagging).
  importance = "permutation" # Permutation importance is a computationally intensive test.
)

predictions <- predict(gesTuned, data = test_data)$predictions

# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_data$correction_info)

# Print the confusion matrix
print(confusion_matrix)
Confusion Matrix and Statistics

          Reference
Prediction c0 c0_only c1 c2
   c0       0       0  0  0
   c0_only  1       2  1  1
   c1       0       0  0  0
   c2       0       0  0  0

Overall Statistics
                                          
               Accuracy : 0.4             
                 95% CI : (0.0527, 0.8534)
    No Information Rate : 0.4             
    P-Value [Acc > NIR] : 0.663           
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: c0 Class: c0_only Class: c1 Class: c2
Sensitivity                0.0            1.0       0.0       0.0
Specificity                1.0            0.0       1.0       1.0
Pos Pred Value             NaN            0.4       NaN       NaN
Neg Pred Value             0.8            NaN       0.8       0.8
Prevalence                 0.2            0.4       0.2       0.2
Detection Rate             0.0            0.4       0.0       0.0
Detection Prevalence       0.0            1.0       0.0       0.0
Balanced Accuracy          0.5            0.5       0.5       0.5
# Calculate feature importance
feature_importance <- importance(gesTuned, num.threads = 1, type = 1) 

# Convert to data frame
feature_importance <- as.data.frame(feature_importance, stringsAsFactors = FALSE)
feature_importance$Feature <- rownames(feature_importance)
colnames(feature_importance) <- c("Importance", "Feature")

# Sort by importance
sorted_feature_importance <- feature_importance[order(-feature_importance$Importance), ]

# Print sorted feature importance
head(sorted_feature_importance, n=10)
                    Importance             Feature
arm_duration                 0        arm_duration
arm_inter_Kin                0       arm_inter_Kin
arm_inter_IK                 0        arm_inter_IK
arm_bbmv                     0            arm_bbmv
lowerbody_duration           0  lowerbody_duration
lowerbody_inter_Kin          0 lowerbody_inter_Kin
lowerbody_inter_IK           0  lowerbody_inter_IK
lowerbody_bbmv               0      lowerbody_bbmv
leg_duration                 0        leg_duration
leg_inter_Kin                0       leg_inter_Kin
# Close the cluster when you're done with your parallel tasks
#stopCluster(cl)

Create a model tuned on the training data only.

# Create a classification task for tuning
tuneGes <- makeClassifTask(data = train_data[, 1:325], target = "correction_info")

# Tune the model
tuneGes <- tuneRanger(tuneGes, measure = list(multiclass.brier), num.trees = 500)

# Return hyperparameter values
#tuneGes

# Recommended parameter settings: 
#   mtry min.node.size sample.fraction
# 1  221             3       0.2056429
# Results: 
#   multiclass.brier exec.time
# 1        0.8124712     0.168

# Fit the tuned model on the training data
gesTuned <- ranger(
  y = train_data$correction_info,
  x = train_data[, 1:324],
  num.trees = 5000,
  mtry = 221,
  min.node.size = 3,
  sample.fraction = 0.2056429,
  importance = "permutation"
)

# Predict on the test data
predictions <- predict(gesTuned, data = test_data)$predictions

# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_data$correction_info)

# Print the confusion matrix
print(confusion_matrix)
Confusion Matrix and Statistics

          Reference
Prediction c0 c0_only c1 c2
   c0       0       0  0  0
   c0_only  1       2  1  1
   c1       0       0  0  0
   c2       0       0  0  0

Overall Statistics
                                          
               Accuracy : 0.4             
                 95% CI : (0.0527, 0.8534)
    No Information Rate : 0.4             
    P-Value [Acc > NIR] : 0.663           
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: c0 Class: c0_only Class: c1 Class: c2
Sensitivity                0.0            1.0       0.0       0.0
Specificity                1.0            0.0       1.0       1.0
Pos Pred Value             NaN            0.4       NaN       NaN
Neg Pred Value             0.8            NaN       0.8       0.8
Prevalence                 0.2            0.4       0.2       0.2
Detection Rate             0.0            0.4       0.0       0.0
Detection Prevalence       0.0            1.0       0.0       0.0
Balanced Accuracy          0.5            0.5       0.5       0.5
# Calculate feature importance
feature_importance <- importance(gesTuned, num.threads = 1, type = 1)

# Convert to data frame
feature_importance_df <- as.data.frame(feature_importance, stringsAsFactors = FALSE)
feature_importance_df$Feature <- rownames(feature_importance_df)
colnames(feature_importance_df) <- c("Importance", "Feature")

# Sort by importance
sorted_feature_importance <- feature_importance_df[order(-feature_importance_df$Importance), ]

# Print sorted feature importance
head(sorted_feature_importance, n=10)
                    Importance             Feature
arm_duration                 0        arm_duration
arm_inter_Kin                0       arm_inter_Kin
arm_inter_IK                 0        arm_inter_IK
arm_bbmv                     0            arm_bbmv
lowerbody_duration           0  lowerbody_duration
lowerbody_inter_Kin          0 lowerbody_inter_Kin
lowerbody_inter_IK           0  lowerbody_inter_IK
lowerbody_bbmv               0      lowerbody_bbmv
leg_duration                 0        leg_duration
leg_inter_Kin                0       leg_inter_Kin
# Close the cluster when you're done with your parallel tasks
#stopCluster(cl)

Save data frame.

write.csv(data_ges, file = paste0(datasets, "gesDataXGB.csv"), row.names = FALSE)

XGBoost

Ensure parallel processing.

# Detect the number of available cores
cores <- detectCores() #- 1  # Leave one core free

# Create a cluster with the detected number of cores
cl <- makeCluster(cores)

# Register the parallel backend
registerDoParallel(cl)
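When the grid search is done, the workers should be released; a one-line sketch, assuming the cluster object cl created above:

# stopCluster(cl)          # shut down the workers
# foreach::registerDoSEQ() # restore the sequential backend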

Define the grid and estimate runtime.

grid_tune <- expand.grid(
  nrounds = c(5000, 10000), 
  max_depth = c(3, 6), 
  eta = c(0.05, 0.1), 
  gamma = c(0.1), 
  colsample_bytree = c(0.6, 0.8), 
  min_child_weight = c(1), 
  subsample = c(0.75, 1.0)
)

# Calculate total combinations
total_combinations <- nrow(grid_tune)

# Estimate single model run time (assume 10 minutes per run)
single_model_time <- 10 # minutes

# Total runs for cross-validation
folds <- 5
total_runs <- total_combinations * folds

# Total time estimation without parallel processing
total_time <- total_runs * single_model_time # in minutes

# Convert to hours
total_time_hours <- total_time / 60

# Output estimated time without parallel processing
print(paste("Estimated time for grid search without parallel processing:", total_time_hours, "hours"))
[1] "Estimated time for grid search without parallel processing: 26.6666666666667 hours"
# Parallel processing with 24 cores
cores <- 24
total_time_parallel <- total_time / cores # in minutes

# Convert to hours
total_time_parallel_hours <- total_time_parallel / 60

# Output estimated time with parallel processing
print(paste("Estimated time for grid search with", cores, "cores:", total_time_parallel_hours, "hours"))
[1] "Estimated time for grid search with 24 cores: 1.11111111111111 hours"
rm(total_combinations,single_model_time,folds,total_runs,total_time,total_time_hours,total_time_parallel,total_time_parallel_hours,cores)

K-fold cross-validation

Create subsets for training and testing (80/20).

# Set seed for reproducibility
set.seed(998)

# Set up train control
train_control <- trainControl(
  method = "cv",        # Cross-validation
  number = 5,           # 5-fold cross-validation
  allowParallel = TRUE  # Enable parallel processing
)

# Define the number of subsets
numSubsets <- 5

# Load data if needed
gesDataXGB <- read_csv(paste0(datasets, "gesDataXGB.csv"))

# Ensure 'correction_info' is a factor
gesDataXGB$correction_info <- as.factor(gesDataXGB$correction_info)

# Remove rows with only NA values
gesDataXGB <- gesDataXGB[rowSums(is.na(gesDataXGB)) < ncol(gesDataXGB), ]

# Split data by levels of 'correction_info'
correction_levels <- levels(gesDataXGB$correction_info)
split_data <- split(gesDataXGB, gesDataXGB$correction_info)

# Initialize a list to store subsets
gesSubsets <- vector("list", length = numSubsets)

# Distribute rows for each level equally across subsets
for (level in correction_levels) {
  level_data <- split_data[[level]]
  subset_sizes <- rep(floor(nrow(level_data) / numSubsets), numSubsets)
  remainder <- nrow(level_data) %% numSubsets
  
  # Distribute remainder rows randomly
  if (remainder > 0) {
    subset_sizes[seq_len(remainder)] <- subset_sizes[seq_len(remainder)] + 1
  }
  
  # Shuffle rows of the level and assign to subsets
  shuffled_data <- level_data[sample(nrow(level_data)), ]
  indices <- cumsum(c(0, subset_sizes))
  
  for (i in 1:numSubsets) {
    if (is.null(gesSubsets[[i]])) {
      gesSubsets[[i]] <- shuffled_data[(indices[i] + 1):indices[i + 1], ]
    } else {
      gesSubsets[[i]] <- rbind(gesSubsets[[i]], shuffled_data[(indices[i] + 1):indices[i + 1], ])
    }
  }
}

# Naming the subsets
names(gesSubsets) <- paste0("gesData", 1:numSubsets)

# Verify balance in subsets
for (i in 1:numSubsets) {
  cat("Subset", i, "contains rows:", nrow(gesSubsets[[i]]), "and levels:\n")
  print(table(gesSubsets[[i]]$correction_info))
}
Subset 1 contains rows: 5 and levels:

     c0 c0_only      c1      c2 
      1       2       1       1 
Subset 2 contains rows: 5 and levels:

     c0 c0_only      c1      c2 
      1       2       1       1 
Subset 3 contains rows: 5 and levels:

     c0 c0_only      c1      c2 
      1       2       1       1 
Subset 4 contains rows: 5 and levels:

     c0 c0_only      c1      c2 
      1       1       1       1 
Subset 5 contains rows: 5 and levels:

     c0 c0_only      c1      c2 
      1       1       1       1 
# Remove any rows with only NAs from subsets just to ensure cleanliness
gesSubsets <- lapply(gesSubsets, function(subset) {
  subset[rowSums(is.na(subset)) < ncol(subset), ]
})

# Access the subsets
gesData1 <- gesSubsets$gesData1
gesData2 <- gesSubsets$gesData2
gesData3 <- gesSubsets$gesData3
gesData4 <- gesSubsets$gesData4
gesData5 <- gesSubsets$gesData5

# Combine subsets into 80% groups
gesData1234 <- rbind(gesData1, gesData2, gesData3, gesData4)
gesData1235 <- rbind(gesData1, gesData2, gesData3, gesData5)
gesData1245 <- rbind(gesData1, gesData2, gesData4, gesData5)
gesData1345 <- rbind(gesData1, gesData3, gesData4, gesData5)
gesData2345 <- rbind(gesData2, gesData3, gesData4, gesData5)

# Final verification of all levels in the combined datasets
combined_sets <- list(gesData1234, gesData1235, gesData1245, gesData1345, gesData2345)
names(combined_sets) <- c("gesData1234", "gesData1235", "gesData1245", "gesData1345", "gesData2345")

for (set_name in names(combined_sets)) {
  cat("Dataset", set_name, "contains rows:", nrow(combined_sets[[set_name]]), "and levels:\n")
  print(table(combined_sets[[set_name]]$correction_info))
}
Dataset gesData1234 contains rows: 19 and levels:

     c0 c0_only      c1      c2 
      4       7       4       4 
Dataset gesData1235 contains rows: 19 and levels:

     c0 c0_only      c1      c2 
      4       7       4       4 
Dataset gesData1245 contains rows: 18 and levels:

     c0 c0_only      c1      c2 
      4       6       4       4 
Dataset gesData1345 contains rows: 18 and levels:

     c0 c0_only      c1      c2 
      4       6       4       4 
Dataset gesData2345 contains rows: 18 and levels:

     c0 c0_only      c1      c2 
      4       6       4       4 

Models

Only run the models once; afterwards, load them with readRDS().
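A small sketch (using the paths and objects defined above) of how one might guard against accidental re-fitting:

# Fit once, then reuse the serialized model on later runs
model_path <- paste0(models, "gesModel1.rds")
if (file.exists(model_path)) {
  gesModel1 <- readRDS(model_path)
} else {
  gesModel1 <- caret::train(correction_info ~ ., data = gesData1234,
                            method = "xgbTree", trControl = train_control,
                            tuneGrid = grid_tune)
  saveRDS(gesModel1, file = model_path, compress = TRUE)
}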

Model 1

gesModel1 <- caret::train(
  correction_info ~ .,              
  data = gesData1234,
  method = "xgbTree",     
  trControl = train_control,
  tuneGrid = grid_tune    
)

saveRDS(gesModel1, file = paste0(models, "gesModel1.rds"), compress = TRUE)

Model 2

gesModel2 <- caret::train(
  correction_info ~ .,              
  data = gesData1235,
  method = "xgbTree",     
  trControl = train_control,
  tuneGrid = grid_tune    
)

saveRDS(gesModel2, file = paste0(models, "gesModel2.rds"), compress = TRUE)

Model 3

gesModel3 <- caret::train(
  correction_info ~ .,              
  data = gesData1245,
  method = "xgbTree",     
  trControl = train_control,
  tuneGrid = grid_tune    
)

saveRDS(gesModel3, file = paste0(models, "gesModel3.rds"), compress = TRUE)

Model 4

gesModel4 <- caret::train(
  correction_info ~ .,              
  data = gesData1345,
  method = "xgbTree",     
  trControl = train_control,
  tuneGrid = grid_tune    
)
saveRDS(gesModel4, file = paste0(models, "gesModel4.rds"), compress = TRUE)

Model 5

gesModel5 <- caret::train(
  correction_info ~ .,              
  data = gesData2345,
  method = "xgbTree",     
  trControl = train_control,
  tuneGrid = grid_tune    
)

saveRDS(gesModel5, file = paste0(models, "gesModel5.rds"), compress = TRUE)

Load models

Load all models after running, if necessary.

gesModel1 <- readRDS(paste0(models, "gesModel1.rds"))
gesModel2 <- readRDS(paste0(models, "gesModel2.rds"))
gesModel3 <- readRDS(paste0(models, "gesModel3.rds"))
gesModel4 <- readRDS(paste0(models, "gesModel4.rds"))
gesModel5 <- readRDS(paste0(models, "gesModel5.rds"))

Test models

Generate predictions and confusion matrices

# Generate predictions
gesPredictions1 <- predict(gesModel1, newdata = gesData5)
[13:45:07] WARNING: src/learner.cc:553: 
  If you are loading a serialized model (like pickle in Python, RDS in R) generated by
  older XGBoost, please export the model by calling `Booster.save_model` from that version
  first, then load it back in current version. See:

    https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html

  for more details about differences between saving model and serializing.
gesPredictions2 <- predict(gesModel2, newdata = gesData4)
gesPredictions3 <- predict(gesModel3, newdata = gesData3)
gesPredictions4 <- predict(gesModel4, newdata = gesData2)
gesPredictions5 <- predict(gesModel5, newdata = gesData1)
(The same serialization warning is printed after each of the remaining predict() calls and is omitted here.)
# Compute confusion matrices
gesCm1 <- confusionMatrix(gesPredictions1, gesData5$correction_info)
gesCm2 <- confusionMatrix(gesPredictions2, gesData4$correction_info)
gesCm3 <- confusionMatrix(gesPredictions3, gesData3$correction_info)
gesCm4 <- confusionMatrix(gesPredictions4, gesData2$correction_info)
gesCm5 <- confusionMatrix(gesPredictions5, gesData1$correction_info)

# Extract the accuracy p-values (P-Value [Acc > NIR]) from each confusion matrix
gesPValues <- c(gesCm1$overall['AccuracyPValue'], 
              gesCm2$overall['AccuracyPValue'], 
              gesCm3$overall['AccuracyPValue'], 
              gesCm4$overall['AccuracyPValue'], 
              gesCm5$overall['AccuracyPValue'])

Combine p-values using Fisher’s and Stouffer’s methods

# Fisher's method
gesFisher_combined <- -2 * sum(log(gesPValues))
df <- 2 * length(gesPValues)
gesPCcombined_fisher <- 1 - pchisq(gesFisher_combined, df)
print(gesPCcombined_fisher)
[1] 0.630725
# Stouffer's method
gesZ_scores <- qnorm(1 - gesPValues/2)
gesCombined_z <- sum(gesZ_scores) / sqrt(length(gesPValues))
gesP_combined_stouffer <- 2 * (1 - pnorm(abs(gesCombined_z)))
print(gesP_combined_stouffer)
[1] 0.1239873

With sufficient data and above-chance classification, the combined p-values should approach 0. Currently we do not have enough data.

Feature importance

Model 1
XGBgesModel1 <- gesModel1$finalModel
importanceXGBgesModel1 <- xgb.importance(model = XGBgesModel1)
head(importanceXGBgesModel1, n=10)
                        Feature       Gain      Cover  Frequency
                         <char>      <num>      <num>      <num>
 1: arm_moment_sum_change_range 0.10621775 0.06140352 0.05448718
 2:            head_power_range 0.08514769 0.07967481 0.08012821
 3:             head_power_Gstd 0.06856142 0.05235004 0.04807692
 4:  arm_moment_sum_change_Gstd 0.05890059 0.05588046 0.05769231
 5:               body_slope_95 0.04080824 0.04771831 0.04487179
 6:        lowerbody_power_Gstd 0.03443709 0.02881254 0.02884615
 7:  spine_moment_sum_pospeak_n 0.03367215 0.02497264 0.01923077
 8:                arm_duration 0.03270931 0.02571379 0.02564103
 9:        arm_moment_sum_Gmean 0.02745091 0.02582984 0.02564103
10:  arm_angSpeed_sum_pospeak_n 0.02553955 0.01598632 0.01282051
xgb.plot.importance(importanceXGBgesModel1)

Model 2
XGBgesModel2 <- gesModel2$finalModel
importanceXGBgesModel2 <- xgb.importance(model = XGBgesModel2)
head(importanceXGBgesModel2, n=10)
                              Feature       Gain      Cover Frequency
                               <char>      <num>      <num>     <num>
 1:        arm_moment_sum_change_Gstd 0.15656328 0.10177312  0.093750
 2:                  head_power_range 0.08191146 0.09313640  0.087500
 3:             head_moment_sum_Gmean 0.05615654 0.04697675  0.037500
 4: lowerbody_jerkKin_sum_pospeak_std 0.04383749 0.02855344  0.025000
 5:                     body_slope_95 0.04140191 0.04229001  0.037500
 6:       arm_moment_sum_change_range 0.04046465 0.03485711  0.034375
 7:              head_angAcc_sum_Gstd 0.02922143 0.02022442  0.018750
 8:                    head_inter_Kin 0.02911760 0.02318121  0.025000
 9:     pelvis_moment_sum_pospeak_std 0.02902350 0.02526397  0.025000
10:                      arm_nComp_80 0.02862824 0.02214754  0.018750
xgb.plot.importance(importanceXGBgesModel2)

Model 3
XGBgesModel3 <- gesModel3$finalModel
importanceXGBgesModel3 <- xgb.importance(model = XGBgesModel3)
head(importanceXGBgesModel3, n=10)
                               Feature       Gain      Cover  Frequency
                                <char>      <num>      <num>      <num>
 1:               arm_moment_sum_Gmean 0.08511288 0.07536878 0.07468124
 2:        arm_moment_sum_change_range 0.06728393 0.05329511 0.04918033
 3:                   head_power_range 0.06628795 0.04044630 0.03278689
 4:               head_angAcc_sum_Gstd 0.06543792 0.08684720 0.09836066
 5:         arm_moment_sum_change_Gstd 0.06096995 0.05092504 0.04918033
 6: head_moment_sum_change_pospeak_std 0.06096008 0.04739633 0.03825137
 7: leg_moment_sum_change_pospeak_mean 0.05280459 0.04304653 0.03825137
 8:                       arm_duration 0.04014164 0.03501963 0.03460838
 9:               head_moment_sum_Gstd 0.03976587 0.03494228 0.03096539
10:           leg_moment_sum_pospeak_n 0.03297082 0.02711725 0.02367942
xgb.plot.importance(importanceXGBgesModel3)

Model 4
XGBgesModel4 <- gesModel4$finalModel
importanceXGBgesModel4 <- xgb.importance(model = XGBgesModel4)
head(importanceXGBgesModel4, n=10)
                        Feature       Gain      Cover  Frequency
                         <char>      <num>      <num>      <num>
 1:                arm_inter_IK 0.16009713 0.10406014 0.09335727
 2:              head_inter_Kin 0.11882778 0.09862493 0.10053860
 3:        arm_moment_sum_Gmean 0.06599796 0.06139296 0.05206463
 4:                arm_duration 0.05548316 0.04347284 0.03949731
 5:               arm_inter_Kin 0.05482641 0.04860660 0.04308797
 6:       head_moment_sum_Gmean 0.05183538 0.05242433 0.04667864
 7:      leg_power_pospeak_mean 0.05120850 0.04967961 0.05026930
 8: arm_moment_sum_pospeak_mean 0.02549236 0.02261922 0.02333932
 9:            head_power_Gmean 0.02352650 0.01920458 0.01436266
10:    head_moment_sum_integral 0.02241479 0.02772374 0.02692998
xgb.plot.importance(importanceXGBgesModel4)

Model 5
XGBgesModel5 <- gesModel5$finalModel
importanceXGBgesModel5 <- xgb.importance(model = XGBgesModel5)
head(importanceXGBgesModel5, n=10)
                          Feature       Gain      Cover  Frequency
                           <char>      <num>      <num>      <num>
 1:           arm_moment_sum_Gstd 0.13850676 0.10096313 0.08521739
 2:   arm_moment_sum_change_range 0.06521056 0.05179674 0.05217391
 3:          arm_moment_sum_range 0.05274530 0.04044015 0.03478261
 4:                head_inter_Kin 0.04433975 0.05737692 0.06434783
 5:                  arm_coupling 0.04294857 0.03466822 0.03130435
 6:              head_power_range 0.03782498 0.04249504 0.04347826
 7: pelvis_moment_sum_pospeak_std 0.03779493 0.03732779 0.03826087
 8:                  arm_slope_95 0.03164522 0.02732536 0.02782609
 9:                leg_power_Gstd 0.02771742 0.02463017 0.02434783
10:           head_power_integral 0.02668085 0.02211122 0.01739130
xgb.plot.importance(importanceXGBgesModel5)

Cumulative feature importance

# Function to extract and normalize importance
get_normalized_importance <- function(model) {
  importance <- xgb.importance(model = model)
  importance$Gain <- importance$Gain / sum(importance$Gain)
  return(importance)
}

# Extract normalized importance for each model
gesImportance1 <- get_normalized_importance(gesModel1$finalModel)
gesImportance2 <- get_normalized_importance(gesModel2$finalModel)
gesImportance3 <- get_normalized_importance(gesModel3$finalModel)
gesImportance4 <- get_normalized_importance(gesModel4$finalModel)
gesImportance5 <- get_normalized_importance(gesModel5$finalModel)

# Combine importances
gesAllImportances <- list(gesImportance1, gesImportance2, gesImportance3, gesImportance4, gesImportance5)

# Function to merge importances
merge_importances <- function(importances) {
  for (i in 2:length(importances)) {
    names(importances[[i]])[2:4] <- paste0(names(importances[[i]])[2:4], "_", i)
  }
  merged <- Reduce(function(x, y) merge(x, y, by = "Feature", all = TRUE), importances)
  merged[is.na(merged)] <- 0  # Replace NAs with 0
  gain_cols <- grep("Gain", colnames(merged), value = TRUE)
  merged$Cumulative <- rowSums(merged[, ..gain_cols])
  return(merged[, .(Feature, Cumulative)])
}

# Merge and sort importances
gesCumulativeImportance <- merge_importances(gesAllImportances)
gesCumulativeImportance <- gesCumulativeImportance[order(-gesCumulativeImportance$Cumulative), ]

# Print cumulative feature importance
head(gesCumulativeImportance, n=10)
                        Feature Cumulative
                         <char>      <num>
 1:  arm_moment_sum_change_Gstd  0.3030472
 2:            head_power_range  0.2885670
 3: arm_moment_sum_change_range  0.2857463
 4:              head_inter_Kin  0.2163460
 5:        arm_moment_sum_Gmean  0.2162907
 6:                arm_inter_IK  0.2116572
 7:         arm_moment_sum_Gstd  0.1714339
 8:                arm_duration  0.1674509
 9:       head_moment_sum_Gmean  0.1308098
10:        head_angAcc_sum_Gstd  0.1161849

PCA

Now, to select features along different (uncorrelated) dimensions, we want to connect these results with the results of the PCA we performed in this script.

Custom function
# Function to select top 3 features per component
select_top_features <- function(pc_column, xgb_importance, top_n = 3) {

  # Find common features ranked by XGBoost importance
  common_features <- intersect(pc_column, xgb_importance$Feature)
  common_features <- xgb_importance %>%
    filter(Feature %in% common_features) %>%
    arrange(Rank) %>%
    pull(Feature)
  return(head(common_features, top_n))
}
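The function expects an importance table with Feature and Rank columns; a hypothetical call (once gesCumulativeImportance has been given an XGB_Rank column below) might look like:

# select_top_features(ges_pca$PC1,
#                     dplyr::rename(gesCumulativeImportance, Rank = XGB_Rank))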

First, we collect the 10 features per component with the highest combined ranking from PCA and XGBoost. That is, for each feature we sum the rank it obtained in the cumulative importance (XGBoost) and the rank of its loading on a principal component (PCA); lower combined ranks are better, so a feature ranked 2nd by XGBoost and 5th by PCA loading gets a combined rank of 7.

# Rank the features based on XGBoost importance (cumulative)
gesCumulativeImportance$XGB_Rank <- rank(-gesCumulativeImportance$Cumulative)

# Load in PCA for gesture
ges_pca <- read_csv(paste0(datasets, "PCA_top_contributors_ges.csv"))

# For each PC (PC1, PC2, PC3), rank the features based on their loadings
combined_ranks_per_pc <- list()

for (pc in c("PC1", "PC2", "PC3")) {
  # Extract the features and loadings for the current PC
  pca_pc_loadings <- ges_pca[, c(pc, paste0(pc, "_Loading"))]
  colnames(pca_pc_loadings) <- c("Feature", "Loading")
  
  # Rank the features based on the absolute loading values (higher loadings should get lower rank)
  pca_pc_loadings$PCA_Rank <- rank(-abs(pca_pc_loadings$Loading))
  
  # Merge PCA loadings with XGBoost importance ranks
  merged_data <- merge(pca_pc_loadings, gesCumulativeImportance[, c("Feature", "XGB_Rank")], by = "Feature")
  
  # Calculate combined rank by summing XGBoost rank and PCA rank
  merged_data$Combined_Rank <- merged_data$XGB_Rank + merged_data$PCA_Rank
  
  # Sort by the combined rank (lower rank is better)
  sorted_data <- merged_data[order(merged_data$Combined_Rank), ]
  
  # Select the top n features based on the combined rank for the current PC
  top_n_features <- 10  # Adjust the number of top features as needed
  combined_ranks_per_pc[[pc]] <- head(sorted_data, top_n_features)
}

# Output the top features per PC based on combined ranking
head(combined_ranks_per_pc, n=10)
$PC1
                      Feature    Loading PCA_Rank XGB_Rank Combined_Rank
28               arm_inter_IK 0.06696878       97        6           103
25                   arm_bbmv 0.08606171        5       99           104
68       head_accKin_sum_Gstd 0.08200432       18       86           104
136           leg_power_Gmean 0.07496099       54       56           110
138        leg_power_integral 0.07328730       62       52           114
88                  head_bbmv 0.08376976       11      107           118
15  arm_angJerk_sum_pospeak_n 0.06625538      105       28           133
51     arm_power_pospeak_mean 0.07210599       66       76           142
92     head_jerkKin_sum_Gmean 0.07629644       42      105           147
139    leg_power_pospeak_mean 0.06311577      126       22           148

$PC2
                               Feature     Loading PCA_Rank XGB_Rank
113                   head_power_range  0.08564644       41        2
41         arm_moment_sum_change_range -0.08534023       43        3
109                head_power_integral  0.08958606       33       14
37          arm_moment_sum_change_Gstd -0.08359439       48        1
128 leg_moment_sum_change_pospeak_mean  0.08776522       37       17
112             head_power_pospeak_std  0.09732699       23       36
205             spine_moment_sum_Gmean -0.10247717       14       49
108                    head_power_Gstd  0.07966316       59       12
172          lowerbody_moment_sum_Gstd  0.10128711       16       58
102               head_moment_sum_Gstd  0.08005577       58       19
    Combined_Rank
113            43
41             46
109            47
37             49
128            54
112            59
205            63
108            71
172            74
102            77

$PC3
                          Feature     Loading PCA_Rank XGB_Rank Combined_Rank
177          lowerbody_power_Gstd -0.10452139        7       30            37
45    arm_moment_sum_pospeak_mean  0.09070628       34       20            54
208    spine_moment_sum_pospeak_n -0.09428843       28       26            54
182         lowerbody_power_range -0.10525663        5       51            56
205        spine_moment_sum_Gmean -0.10301385       11       49            60
42           arm_moment_sum_Gmean  0.07878132       60        5            65
133      leg_moment_sum_pospeak_n -0.08279836       52       23            75
154   lowerbody_angSpeed_sum_Gstd -0.10225587       12       70            82
207     spine_moment_sum_integral -0.10149747       14       71            85
186 pelvis_moment_sum_change_Gstd -0.09860998       19       73            92

For modelling, we want to pick only three features per component. Which would they be in this case?

# Number of top features to display
top_n_features <- 3

# Print the top 3 features per component
for (pc in c("PC1", "PC2", "PC3")) {
  cat("\nTop 3 Features for", pc, ":\n")
  
  # Get the top 3 features based on combined rank for the current PC
  top_features <- head(combined_ranks_per_pc[[pc]], top_n_features)
  
  # Print the results
  print(top_features[, c("Feature", "XGB_Rank", "PCA_Rank", "Combined_Rank")])
}

Top 3 Features for PC1 :
                Feature XGB_Rank PCA_Rank Combined_Rank
28         arm_inter_IK        6       97           103
25             arm_bbmv       99        5           104
68 head_accKin_sum_Gstd       86       18           104

Top 3 Features for PC2 :
                        Feature XGB_Rank PCA_Rank Combined_Rank
113            head_power_range        2       41            43
41  arm_moment_sum_change_range        3       43            46
109         head_power_integral       14       33            47

Top 3 Features for PC3 :
                        Feature XGB_Rank PCA_Rank Combined_Rank
177        lowerbody_power_Gstd       30        7            37
45  arm_moment_sum_pospeak_mean       20       34            54
208  spine_moment_sum_pospeak_n       26       28            54

Vocalization

Now we repeat the whole workflow using the data from vocalization-only trials.

# Load in
data_voc <- read_csv(paste0(datasets, "voc_clean_df.csv"))

# Make predictor a factor variable
data_voc$correction_info <- as.factor(data_voc$correction_info)

# Display
head(data_voc, n=10)
# A tibble: 10 × 71
   envelope_Gmean envelope_Gstd envelope_pospeak_mean envelope_pospeak_std
            <dbl>         <dbl>                 <dbl>                <dbl>
 1          0.288          1.44                 0.297                2.39 
 2          0.805          2.11                -0.586                0.231
 3          0.281          1.84                -0.584                0.196
 4          0.446          1.91                -0.641                0.118
 5          0.804          1.93                 0.003                0.728
 6          0.779          2.05                 1.79                 4.27 
 7          0.317          1.75                -0.5                  0.001
 8          2.47           2.34                -0.592                0    
 9          1.17           2.10                 3.06                 5.17 
10          0.263          1.59                -0.472                0.43 
# ℹ 67 more variables: envelope_pospeak_n <dbl>, envelope_integral <dbl>,
#   envelope_range <dbl>, envelope_change_Gmean <dbl>,
#   envelope_change_Gstd <dbl>, envelope_change_pospeak_mean <dbl>,
#   envelope_change_pospeak_std <dbl>, envelope_change_pospeak_n <dbl>,
#   envelope_change_integral <dbl>, envelope_change_range <dbl>,
#   f0_Gmean <dbl>, f0_Gstd <dbl>, f0_pospeak_n <dbl>, f0_integral <dbl>,
#   f0_range <dbl>, f1_clean_Gmean <dbl>, f1_clean_Gstd <dbl>, …

Random forests

As before, we first grow a single classification tree, then build random forests.

# prepare predictors
predictors <- setdiff(names(data_voc), "correction_info")

formula_str <- paste("correction_info ~", paste(predictors, collapse = " + "))

# Convert the formula string to a formula object
vocTree_formula <- as.formula(formula_str)

# Now use the formula in rpart
vocTree <- rpart(formula = vocTree_formula, data = data_voc, 
                method='class', # Specify that it's a classification tree
                control = rpart.control(maxdepth = 5)  # Control parameters for the 'rpart' function
)

prp(
  vocTree,        # The decision tree object to be visualized
  extra = 1,      # Show the number of observations per class at each node
  varlen = 0,     # 0 = display full variable names
  faclen = 0      # 0 = display full factor-level names
)

Split the data
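The split code is not repeated in the script; a sketch mirroring the gesture split above (assuming the same stratified 80/20 approach, now on data_voc):

# Ensure each level of the outcome is present in both sets
train_data <- data_voc %>%
  group_by(correction_info) %>%
  sample_frac(0.8, replace = FALSE) %>%
  ungroup()

# Assign the remaining samples to the test set
test_data <- anti_join(data_voc, train_data)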

Building the untuned model.

# Untuned model with permutation importance
vocUntuned <- ranger(
  y = train_data$correction_info,
  x = train_data[, 1:70], # predictor columns only (the outcome is column 71)
  num.trees = 500,
  importance = "permutation"
)

predictions <- predict(vocUntuned, data = test_data)$predictions

# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_data$correction_info)

# Print the confusion matrix
print(confusion_matrix)
Confusion Matrix and Statistics

          Reference
Prediction c0 c0_only c1 c2
   c0       0       0  0  1
   c0_only  0       0  0  0
   c1       1       0  1  0
   c2       0       0  0  0

Overall Statistics
                                          
               Accuracy : 0.3333          
                 95% CI : (0.0084, 0.9057)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 0.7037          
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: c0 Class: c0_only Class: c1 Class: c2
Sensitivity             0.0000             NA    1.0000    0.0000
Specificity             0.5000              1    0.5000    1.0000
Pos Pred Value          0.0000             NA    0.5000       NaN
Neg Pred Value          0.5000             NA    1.0000    0.6667
Prevalence              0.3333              0    0.3333    0.3333
Detection Rate          0.0000              0    0.3333    0.0000
Detection Prevalence    0.3333              0    0.6667    0.0000
Balanced Accuracy       0.2500             NA    0.7500    0.5000
# Calculate feature importance
feature_importance <- importance(vocUntuned, num.threads = 1, type = 1) 

# Convert to data frame
feature_importance <- as.data.frame(feature_importance, stringsAsFactors = FALSE)
feature_importance$Feature <- rownames(feature_importance)
colnames(feature_importance) <- c("Importance", "Feature")

# Sort by importance
sorted_feature_importance <- feature_importance[order(-feature_importance$Importance), ]

# Print sorted feature importance
head(sorted_feature_importance, n=10)
                           Importance                   Feature
VSA_f1f2                  0.009076190                  VSA_f1f2
f3_clean_Gstd             0.008980952             f3_clean_Gstd
f2_clean_vel_range        0.008032540        f2_clean_vel_range
f1_clean_vel_range        0.006071429        f1_clean_vel_range
envelope_integral         0.005361905         envelope_integral
f3_clean_vel_Gmean        0.004771429        f3_clean_vel_Gmean
f1_clean_range            0.004598413            f1_clean_range
f1_clean_Gstd             0.004238095             f1_clean_Gstd
envelope_Gmean            0.003252381            envelope_Gmean
f1_clean_vel_pospeak_mean 0.003186508 f1_clean_vel_pospeak_mean

Set up parallel processing for tuning the random forest.

# Define the number of CPU cores to use
num_cores <- detectCores()

# Create a cluster with specified number of cores
cl <- makeCluster(num_cores)

Tuning the random forest.

tuneVoc <- makeClassifTask(data = data_voc[, 1:71], # all columns, incl. the outcome
                           target = "correction_info")

tuneVoc <- tuneRanger(tuneVoc,
                      measure = list(multiclass.brier),
                      num.trees = 500)

# Return hyperparameter values
#tuneVoc

# Recommended parameter settings: 
# mtry min.node.size sample.fraction
# 1    6             4        0.522745
# Results: 
#   multiclass.brier exec.time
# 1        0.7256575     0.166

vocTuned <- ranger(
  y = train_data$correction_info,
  x = train_data[, 1:70], # predictor columns only
  num.trees = 5000, 
  mtry = 6, # Recommended mtry value (number of features tried at each split).
  min.node.size = 4, # Recommended min.node.size value (minimal number of samples in a terminal node).
  sample.fraction = 0.522745, # Recommended sample fraction (share of data used for bagging).
  importance = "permutation" # Permutation importance is a computationally intensive test.
)

predictions <- predict(vocTuned, data = test_data)$predictions

# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_data$correction_info)

# Print the confusion matrix
print(confusion_matrix)
Confusion Matrix and Statistics

          Reference
Prediction c0 c0_only c1 c2
   c0       0       0  0  1
   c0_only  0       0  0  0
   c1       1       0  1  0
   c2       0       0  0  0

Overall Statistics
                                          
               Accuracy : 0.3333          
                 95% CI : (0.0084, 0.9057)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 0.7037          
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: c0 Class: c0_only Class: c1 Class: c2
Sensitivity             0.0000             NA    1.0000    0.0000
Specificity             0.5000              1    0.5000    1.0000
Pos Pred Value          0.0000             NA    0.5000       NaN
Neg Pred Value          0.5000             NA    1.0000    0.6667
Prevalence              0.3333              0    0.3333    0.3333
Detection Rate          0.0000              0    0.3333    0.0000
Detection Prevalence    0.3333              0    0.6667    0.0000
Balanced Accuracy       0.2500             NA    0.7500    0.5000
# Calculate feature importance
feature_importance <- importance(vocTuned, num.threads = 1, type = 1) 

# Convert to data frame
feature_importance <- as.data.frame(feature_importance, stringsAsFactors = FALSE)
feature_importance$Feature <- rownames(feature_importance)
colnames(feature_importance) <- c("Importance", "Feature")

# Sort by importance
sorted_feature_importance <- feature_importance[order(-feature_importance$Importance), ]

# Print sorted feature importance
head(sorted_feature_importance, n=10)
                           Importance                   Feature
VSA_f1f2                  0.005848687                  VSA_f1f2
f1_clean_range            0.004574312            f1_clean_range
f3_clean_Gstd             0.003591010             f3_clean_Gstd
f3_clean_vel_Gmean        0.003268267        f3_clean_vel_Gmean
f2_clean_vel_range        0.002858143        f2_clean_vel_range
envelope_integral         0.002575750         envelope_integral
f1_clean_Gstd             0.002248725             f1_clean_Gstd
f2_clean_vel_Gmean        0.001997623        f2_clean_vel_Gmean
f1_clean_vel_pospeak_mean 0.001879969 f1_clean_vel_pospeak_mean
f3_clean_pospeak_mean     0.001799207     f3_clean_pospeak_mean
# Close the cluster when you're done with your parallel tasks
#stopCluster(cl)

Create a model tuned on the training data only.

# Create a classification task for tuning
tuneVoc <- makeClassifTask(data = train_data[, 1:71], target = "correction_info") # all columns, incl. the outcome

# Tune the model
tuneVoc <- tuneRanger(tuneVoc, measure = list(multiclass.brier), num.trees = 500)

# Return hyperparameter values
#tuneVoc
# Recommended parameter settings: 
#   mtry min.node.size sample.fraction
# 1    3             3       0.5835222
# Results: 
#   multiclass.brier exec.time
# 1        0.7457527      0.16

# Fit the tuned model on the training data
vocTuned <- ranger(
  y = train_data$correction_info,
  x = train_data[, 1:70], # predictor columns only
  num.trees = 5000,
  mtry = 3,
  min.node.size = 3,
  sample.fraction = 0.5835222,
  importance = "permutation"
)

# Predict on the test data
predictions <- predict(vocTuned, data = test_data)$predictions

# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_data$correction_info)

# Print the confusion matrix
print(confusion_matrix)
Confusion Matrix and Statistics

          Reference
Prediction c0 c0_only c1 c2
   c0       0       0  0  1
   c0_only  0       0  0  0
   c1       1       0  1  0
   c2       0       0  0  0

Overall Statistics
                                          
               Accuracy : 0.3333          
                 95% CI : (0.0084, 0.9057)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 0.7037          
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: c0 Class: c0_only Class: c1 Class: c2
Sensitivity             0.0000             NA    1.0000    0.0000
Specificity             0.5000              1    0.5000    1.0000
Pos Pred Value          0.0000             NA    0.5000       NaN
Neg Pred Value          0.5000             NA    1.0000    0.6667
Prevalence              0.3333              0    0.3333    0.3333
Detection Rate          0.0000              0    0.3333    0.0000
Detection Prevalence    0.3333              0    0.6667    0.0000
Balanced Accuracy       0.2500             NA    0.7500    0.5000
# Calculate feature importance
feature_importance <- importance(vocTuned, num.threads = 1, type = 1)

# Convert to data frame
feature_importance_df <- as.data.frame(feature_importance, stringsAsFactors = FALSE)
feature_importance_df$Feature <- rownames(feature_importance_df)
colnames(feature_importance_df) <- c("Importance", "Feature")

# Sort by importance
sorted_feature_importance <- feature_importance_df[order(-feature_importance_df$Importance), ]

# Print sorted feature importance
head(sorted_feature_importance, n=10)
                       Importance               Feature
VSA_f1f2              0.005612179              VSA_f1f2
f1_clean_range        0.004155960        f1_clean_range
f3_clean_Gstd         0.003867630         f3_clean_Gstd
f2_clean_vel_range    0.003017071    f2_clean_vel_range
f3_clean_vel_Gmean    0.002908297    f3_clean_vel_Gmean
f3_clean_pospeak_mean 0.002557821 f3_clean_pospeak_mean
envelope_integral     0.002401515     envelope_integral
envelope_Gmean        0.002082677        envelope_Gmean
f1_clean_vel_range    0.002057444    f1_clean_vel_range
f1_clean_Gstd         0.001727735         f1_clean_Gstd
# Close the cluster when you're done with your parallel tasks
#stopCluster(cl)

Save data frame.

write.csv(data_voc, file = paste0(datasets, "vocDataXGB.csv"), row.names = FALSE)

XGBoost

Ensure parallel processing.

# Detect the number of available cores
cores <- detectCores() #- 1  # Leave one core free

# Create a cluster with the detected number of cores
cl <- makeCluster(cores)

# Register the parallel backend
registerDoParallel(cl)

Define the grid and estimate runtime.

grid_tune <- expand.grid(
  nrounds = c(5000, 10000), 
  max_depth = c(3, 6), 
  eta = c(0.05, 0.1), 
  gamma = c(0.1), 
  colsample_bytree = c(0.6, 0.8), 
  min_child_weight = c(1), 
  subsample = c(0.75, 1.0)
)

# Calculate total combinations
total_combinations <- nrow(grid_tune)

# Estimate single model run time (assume 10 minutes per run)
single_model_time <- 10 # minutes

# Total runs for cross-validation
folds <- 5
total_runs <- total_combinations * folds

# Total time estimation without parallel processing
total_time <- total_runs * single_model_time # in minutes

# Convert to hours
total_time_hours <- total_time / 60

# Output estimated time without parallel processing
print(paste("Estimated time for grid search without parallel processing:", total_time_hours, "hours"))
[1] "Estimated time for grid search without parallel processing: 26.6666666666667 hours"
# Parallel processing with 24 cores
cores <- 24
total_time_parallel <- total_time / cores # in minutes

# Convert to hours
total_time_parallel_hours <- total_time_parallel / 60

# Output estimated time with parallel processing
print(paste("Estimated time for grid search with", cores, "cores:", total_time_parallel_hours, "hours"))
[1] "Estimated time for grid search with 24 cores: 1.11111111111111 hours"
rm(total_combinations,single_model_time,folds,total_runs,total_time,total_time_hours,total_time_parallel,total_time_parallel_hours,cores)

K-fold cross-validation

Create subsets for training and testing (80/20).

# Set seed for reproducibility
set.seed(998)

# Set up train control
train_control <- trainControl(
  method = "cv",        # Cross-validation
  number = 5,           # 5-fold cross-validation
  allowParallel = TRUE  # Enable parallel processing
)


# Define the number of subsets
numSubsets <- 5

# Load data if needed
vocDataXGB <- read_csv(paste0(datasets, "vocDataXGB.csv"))

# Ensure 'correction_info' is a factor
vocDataXGB$correction_info <- as.factor(vocDataXGB$correction_info)

# Remove rows with only NA values
vocDataXGB <- vocDataXGB[rowSums(is.na(vocDataXGB)) < ncol(vocDataXGB), ]

# Split data by levels of 'correction_info'
correction_levels <- levels(vocDataXGB$correction_info)
split_data <- split(vocDataXGB, vocDataXGB$correction_info)

# Initialize a list to store subsets
vocSubsets <- vector("list", length = numSubsets)

# Distribute rows for each level equally across subsets
for (level in correction_levels) {
  level_data <- split_data[[level]]
  subset_sizes <- rep(floor(nrow(level_data) / numSubsets), numSubsets)
  remainder <- nrow(level_data) %% numSubsets
  
  # Distribute remainder rows randomly
  if (remainder > 0) {
    subset_sizes[seq_len(remainder)] <- subset_sizes[seq_len(remainder)] + 1
  }
  
  # Shuffle rows of the level and assign to subsets
  shuffled_data <- level_data[sample(nrow(level_data)), ]
  indices <- cumsum(c(0, subset_sizes))
  
  for (i in 1:numSubsets) {
    if (is.null(vocSubsets[[i]])) {
      vocSubsets[[i]] <- shuffled_data[(indices[i] + 1):indices[i + 1], ]
    } else {
      vocSubsets[[i]] <- rbind(vocSubsets[[i]], shuffled_data[(indices[i] + 1):indices[i + 1], ])
    }
  }
}

# Naming the subsets
names(vocSubsets) <- paste0("vocData", 1:numSubsets)

# Verify balance in subsets
for (i in 1:numSubsets) {
  cat("Subset", i, "contains rows:", nrow(vocSubsets[[i]]), "and levels:\n")
  print(table(vocSubsets[[i]]$correction_info))
}
Subset 1 contains rows: 7 and levels:

     c0 c0_only      c1      c2 
      2       1       2       2 
Subset 2 contains rows: 7 and levels:

     c0 c0_only      c1      c2 
      2       1       2       1 
Subset 3 contains rows: 5 and levels:

     c0 c0_only      c1      c2 
      1       1       1       1 
Subset 4 contains rows: 5 and levels:

     c0 c0_only      c1      c2 
      1       1       1       1 
Subset 5 contains rows: 5 and levels:

     c0 c0_only      c1      c2 
      1       1       1       1 
# Remove any all-NA rows from the subsets as a safety check
vocSubsets <- lapply(vocSubsets, function(subset) {
  subset[rowSums(is.na(subset)) < ncol(subset), ]
})

# Access the subsets
vocData1 <- vocSubsets$vocData1
vocData2 <- vocSubsets$vocData2
vocData3 <- vocSubsets$vocData3
vocData4 <- vocSubsets$vocData4
vocData5 <- vocSubsets$vocData5

# Combine subsets into 80% groups
vocData1234 <- rbind(vocData1, vocData2, vocData3, vocData4)
vocData1235 <- rbind(vocData1, vocData2, vocData3, vocData5)
vocData1245 <- rbind(vocData1, vocData2, vocData4, vocData5)
vocData1345 <- rbind(vocData1, vocData3, vocData4, vocData5)
vocData2345 <- rbind(vocData2, vocData3, vocData4, vocData5)

# Final verification of all levels in the combined datasets
combined_sets <- list(vocData1234, vocData1235, vocData1245, vocData1345, vocData2345)
names(combined_sets) <- c("vocData1234", "vocData1235", "vocData1245", "vocData1345", "vocData2345")

for (set_name in names(combined_sets)) {
  cat("Dataset", set_name, "contains rows:", nrow(combined_sets[[set_name]]), "and levels:\n")
  print(table(combined_sets[[set_name]]$correction_info))
}
Dataset vocData1234 contains rows: 21 and levels:

     c0 c0_only      c1      c2 
      6       4       6       5 
Dataset vocData1235 contains rows: 21 and levels:

     c0 c0_only      c1      c2 
      6       4       6       5 
Dataset vocData1245 contains rows: 21 and levels:

     c0 c0_only      c1      c2 
      6       4       6       5 
Dataset vocData1345 contains rows: 19 and levels:

     c0 c0_only      c1      c2 
      5       4       5       5 
Dataset vocData2345 contains rows: 18 and levels:

     c0 c0_only      c1      c2 
      5       4       5       4 

Models

Model 1

vocModel1 <- caret::train(
  correction_info ~ .,              
  data = vocData1234,
  method = "xgbTree",     
  trControl = train_control,
  tuneGrid = grid_tune    
)

saveRDS(vocModel1, file = paste0(models, "vocModel1.rds"), compress = TRUE)

Model 2

vocModel2 <- caret::train(
  correction_info ~ .,              
  data = vocData1235,
  method = "xgbTree",     
  trControl = train_control,
  tuneGrid = grid_tune    
)

saveRDS(vocModel2, file = paste0(models, "vocModel2.rds"), compress = TRUE)

Model 3

vocModel3 <- caret::train(
  correction_info ~ .,              
  data = vocData1245,
  method = "xgbTree",     
  trControl = train_control,
  tuneGrid = grid_tune    
)

saveRDS(vocModel3, file = paste0(models, "vocModel3.rds"), compress = TRUE)

Model 4

vocModel4 <- caret::train(
  correction_info ~ .,              
  data = vocData1345,
  method = "xgbTree",     
  trControl = train_control,
  tuneGrid = grid_tune    
)
saveRDS(vocModel4, file = paste0(models, "vocModel4.rds"), compress = TRUE)

Model 5

vocModel5 <- caret::train(
  correction_info ~ .,              
  data = vocData2345,
  method = "xgbTree",     
  trControl = train_control,
  tuneGrid = grid_tune    
)

saveRDS(vocModel5, file = paste0(models, "vocModel5.rds"), compress = TRUE)

Load models

Load all models after running, if necessary.

vocModel1 <- readRDS(paste0(models, "vocModel1.rds"))
vocModel2 <- readRDS(paste0(models, "vocModel2.rds"))
vocModel3 <- readRDS(paste0(models, "vocModel3.rds"))
vocModel4 <- readRDS(paste0(models, "vocModel4.rds"))
vocModel5 <- readRDS(paste0(models, "vocModel5.rds"))

Test models

Generate predictions and confusion matrices

# Generate predictions
vocPredictions1 <- predict(vocModel1, newdata = vocData5)
vocPredictions2 <- predict(vocModel2, newdata = vocData4)
vocPredictions3 <- predict(vocModel3, newdata = vocData3)
vocPredictions4 <- predict(vocModel4, newdata = vocData2)
vocPredictions5 <- predict(vocModel5, newdata = vocData1)
# Compute confusion matrices
vocCm1 <- confusionMatrix(vocPredictions1, vocData5$correction_info)
vocCm2 <- confusionMatrix(vocPredictions2, vocData4$correction_info)
vocCm3 <- confusionMatrix(vocPredictions3, vocData3$correction_info)
vocCm4 <- confusionMatrix(vocPredictions4, vocData2$correction_info)
vocCm5 <- confusionMatrix(vocPredictions5, vocData1$correction_info)

# Extract accuracy p-values (caret's AccuracyPValue: a binomial test of
# accuracy against the no-information rate)
vocPValues <- c(vocCm1$overall['AccuracyPValue'], 
              vocCm2$overall['AccuracyPValue'], 
              vocCm3$overall['AccuracyPValue'], 
              vocCm4$overall['AccuracyPValue'], 
              vocCm5$overall['AccuracyPValue'])
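
Note that calling predict() on boosters reloaded with readRDS() can emit an XGBoost warning about models serialized by older package versions. A minimal, warning-free alternative (a sketch only; the file name vocBooster1.model is illustrative) is to export the raw booster with XGBoost's own serialization:

# Illustrative sketch: save/load the underlying booster with xgb.save()/xgb.load()
xgb.save(vocModel1$finalModel, paste0(models, "vocBooster1.model"))
vocBooster1 <- xgb.load(paste0(models, "vocBooster1.model"))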

Combine p-values using Fisher’s and Stouffer’s methods

# Fisher's method
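# Test statistic -2 * sum(ln p_i) follows a chi-squared distribution with 2k df under the null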
vocFisher_combined <- -2 * sum(log(vocPValues))
df <- 2 * length(vocPValues)
vocPCcombined_fisher <- 1 - pchisq(vocFisher_combined, df)
print(vocPCcombined_fisher)
[1] 0.6893521
# Stouffer's method
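# Convert two-sided p-values to |z| scores, then combine: Z = sum(z_i) / sqrt(k)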
vocZ_scores <- qnorm(1 - vocPValues/2)
vocCombined_z <- sum(vocZ_scores) / sqrt(length(vocPValues))
vocP_combined_stouffer <- 2 * (1 - pnorm(abs(vocCombined_z)))
print(vocP_combined_stouffer)
[1] 0.1282563

For the models to be considered predictive, the combined p-value should approach 0; here, neither method comes close to significance.

Feature importance

Model 1
XGBvocModel1 <- vocModel1$finalModel
importanceXGBvocModel1 <- xgb.importance(model = XGBvocModel1)
head(importanceXGBvocModel1, n=10)
                      Feature       Gain      Cover  Frequency
                       <char>      <num>      <num>      <num>
 1:            envelope_Gmean 0.13602726 0.08502319 0.07502800
 2:        f3_clean_vel_Gmean 0.09818061 0.09322070 0.09070549
 3:             envelope_Gstd 0.07400464 0.05561961 0.04927212
 4:            f1_clean_range 0.07164337 0.07482928 0.07502800
 5:                  VSA_f1f2 0.05515123 0.04918245 0.04143337
 6:         envelope_integral 0.04682987 0.06292669 0.06830907
 7: f1_clean_vel_pospeak_mean 0.04011472 0.03522299 0.03135498
 8:            f3_clean_Gmean 0.03869450 0.03362272 0.03695409
 9:      envelope_change_Gstd 0.02847876 0.03867110 0.03919373
10:            envelope_range 0.02722598 0.02976886 0.03247480
xgb.plot.importance(importanceXGBvocModel1)

Model 2
XGBvocModel2 <- vocModel2$finalModel
importanceXGBvocModel2 <- xgb.importance(model = XGBvocModel2)
head(importanceXGBvocModel2, n=10)
                      Feature       Gain      Cover  Frequency
                       <char>      <num>      <num>      <num>
 1:        f2_clean_pospeak_n 0.13483087 0.13544889 0.12921348
 2:            f1_clean_range 0.12070709 0.16690803 0.14325843
 3:            envelope_Gmean 0.09668145 0.07235325 0.07584270
 4:             f3_clean_Gstd 0.08237970 0.05602321 0.04494382
 5:             envelope_Gstd 0.07577008 0.03880060 0.03370787
 6: envelope_change_pospeak_n 0.06873527 0.03725576 0.04213483
 7:         envelope_integral 0.04674692 0.06279656 0.07865169
 8:            envelope_range 0.04544508 0.01653634 0.01685393
 9:         f2_clean_vel_Gstd 0.03037540 0.01712546 0.01966292
10:                  VSA_f1f2 0.02954035 0.02412326 0.01404494
xgb.plot.importance(importanceXGBvocModel2)

Model 3
XGBvocModel3 <- vocModel3$finalModel
importanceXGBvocModel3 <- xgb.importance(model = XGBvocModel3)
head(importanceXGBvocModel3, n=10)
                  Feature       Gain      Cover  Frequency
                   <char>      <num>      <num>      <num>
 1:        envelope_Gmean 0.19678849 0.13101717 0.11990950
 2:        f1_clean_Gmean 0.06518270 0.06577408 0.05656109
 3:         f1_clean_Gstd 0.06139091 0.05900369 0.04638009
 4:    f2_clean_vel_range 0.05294853 0.04460489 0.04411765
 5:  envelope_change_Gstd 0.05211902 0.05584529 0.04977376
 6:        f1_clean_range 0.04669633 0.05152589 0.05656109
 7:        envelope_range 0.04665922 0.03300070 0.03393665
 8:         envelope_Gstd 0.04528333 0.04547200 0.04751131
 9: envelope_pospeak_mean 0.03633826 0.04397990 0.04751131
10:     f3_clean_vel_Gstd 0.03632154 0.03074968 0.03054299
xgb.plot.importance(importanceXGBvocModel3)

Model 4
XGBvocModel4 <- vocModel4$finalModel
importanceXGBvocModel4 <- xgb.importance(model = XGBvocModel4)
head(importanceXGBvocModel4, n=10)
                         Feature       Gain      Cover  Frequency
                          <char>      <num>      <num>      <num>
 1:               envelope_Gmean 0.16819530 0.09570472 0.08753316
 2:  envelope_change_pospeak_std 0.08807969 0.07223289 0.06896552
 3:         envelope_change_Gstd 0.08052363 0.06335792 0.05835544
 4:                     VSA_f1f2 0.07818772 0.08241809 0.07161804
 5: envelope_change_pospeak_mean 0.06608976 0.05146865 0.04509284
 6:    f1_clean_vel_pospeak_mean 0.05408322 0.05589466 0.04509284
 7:                envelope_Gstd 0.05269826 0.03871238 0.03183024
 8:               f1_clean_Gmean 0.04965680 0.04615765 0.03713528
 9:                f1_clean_Gstd 0.03815808 0.05655439 0.06631300
10:               f1_clean_range 0.03635432 0.04276747 0.04244032
xgb.plot.importance(importanceXGBvocModel4)

Model 5
XGBvocModel5 <- vocModel5$finalModel
importanceXGBvocModel5 <- xgb.importance(model = XGBvocModel5)
head(importanceXGBvocModel5, n=10)
                        Feature       Gain      Cover  Frequency
                         <char>      <num>      <num>      <num>
 1:              envelope_Gmean 0.15496456 0.13085882 0.13613445
 2:              f1_clean_range 0.12172902 0.09508698 0.08235294
 3:           f1_clean_vel_Gstd 0.09589133 0.09944857 0.08739496
 4:               envelope_Gstd 0.07687082 0.05204549 0.05042017
 5:       f3_clean_pospeak_mean 0.06338331 0.04400205 0.03865546
 6:          envelope_pospeak_n 0.04257925 0.03277477 0.03361345
 7:              f2_clean_range 0.03612444 0.01860086 0.01680672
 8:               f1_clean_Gstd 0.03560007 0.03146792 0.03025210
 9: envelope_change_pospeak_std 0.03102815 0.02610947 0.01848739
10:           envelope_integral 0.02690926 0.03638461 0.04369748
xgb.plot.importance(importanceXGBvocModel5)

Cumulative feature importance
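Since each model’s Gain values are normalized to sum to 1, a feature’s cumulative importance is the sum of its importance shares across the five models and is therefore bounded by 5.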
# Function to extract and normalize importance
get_normalized_importance <- function(model) {
  importance <- xgb.importance(model = model)
  importance$Gain <- importance$Gain / sum(importance$Gain)
  return(importance)
}

# Extract normalized importance for each model
vocImportance1 <- get_normalized_importance(vocModel1$finalModel)
vocImportance2 <- get_normalized_importance(vocModel2$finalModel)
vocImportance3 <- get_normalized_importance(vocModel3$finalModel)
vocImportance4 <- get_normalized_importance(vocModel4$finalModel)
vocImportance5 <- get_normalized_importance(vocModel5$finalModel)

# Combine importances
vocAllImportances <- list(vocImportance1, vocImportance2, vocImportance3, vocImportance4, vocImportance5)

# Function to merge importances
merge_importances <- function(importances) {
  for (i in 2:length(importances)) {
    names(importances[[i]])[2:4] <- paste0(names(importances[[i]])[2:4], "_", i)
  }
  merged <- Reduce(function(x, y) merge(x, y, by = "Feature", all = TRUE), importances)
  merged[is.na(merged)] <- 0  # Replace NAs with 0
  gain_cols <- grep("Gain", colnames(merged), value = TRUE)
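  # data.table syntax: '..gain_cols' looks the vector up in the calling
  # environment, and '.(Feature, Cumulative)' below is an alias for list()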
  merged$Cumulative <- rowSums(merged[, ..gain_cols])
  return(merged[, .(Feature, Cumulative)])
}

# Merge and sort importances
vocCumulativeImportance <- merge_importances(vocAllImportances)
vocCumulativeImportance <- vocCumulativeImportance[order(-vocCumulativeImportance$Cumulative), ]

# Print cumulative feature importance
head(vocCumulativeImportance, n=10)
                 Feature Cumulative
                  <char>      <num>
 1:       envelope_Gmean  0.7526571
 2:       f1_clean_range  0.3971301
 3:        envelope_Gstd  0.3246271
 4: envelope_change_Gstd  0.2136246
 5:   f2_clean_pospeak_n  0.1929605
 6:             VSA_f1f2  0.1870122
 7:       f1_clean_Gmean  0.1636278
 8:        f1_clean_Gstd  0.1584567
 9:    envelope_integral  0.1578377
10:   f3_clean_vel_Gmean  0.1536120

PCA

Now let’s collect the 10 features per component that have the highest combined ranking from PCA and XGBoost. For each feature, we sum the rank it obtained in the cumulative importance (XGBoost) and the rank of its absolute loading on a principal component (PCA); lower combined ranks are better. For instance, a feature ranked 2nd by XGBoost and 7th by loading on PC1 receives a combined rank of 9.

# Rank the features based on XGBoost importance (cumulative)
vocCumulativeImportance$XGB_Rank <- rank(-vocCumulativeImportance$Cumulative)

# Load in PCA results for vocalization
voc_pca <- read_csv(paste0(datasets, "PCA_top_contributors_voc.csv"))

# For each PC (PC1, PC2, PC3), rank the features based on their loadings
combined_ranks_per_pc <- list()

for (pc in c("PC1", "PC2", "PC3")) {
  # Extract the features and loadings for the current PC
  pca_pc_loadings <- voc_pca[, c(pc, paste0(pc, "_Loading"))]
  colnames(pca_pc_loadings) <- c("Feature", "Loading")
  
  # Rank the features based on the absolute loading values (higher loadings should get lower rank)
  pca_pc_loadings$PCA_Rank <- rank(-abs(pca_pc_loadings$Loading))
  
  # Merge PCA loadings with XGBoost importance ranks
  merged_data <- merge(pca_pc_loadings, vocCumulativeImportance[, c("Feature", "XGB_Rank")], by = "Feature")
  
  # Calculate combined rank by summing XGBoost rank and PCA rank
  merged_data$Combined_Rank <- merged_data$XGB_Rank + merged_data$PCA_Rank
  
  # Sort by the combined rank (lower rank is better)
  sorted_data <- merged_data[order(merged_data$Combined_Rank), ]
  
  # Select the top n features based on the combined rank for the current PC
  top_n_features <- 10  # Adjust the number of top features as needed
  combined_ranks_per_pc[[pc]] <- head(sorted_data, top_n_features)
}

# Output the top features per PC based on combined ranking
head(combined_ranks_per_pc, n=10)
$PC1
              Feature    Loading PCA_Rank XGB_Rank Combined_Rank
21     f1_clean_range -0.2038667        7        2             9
31 f2_clean_pospeak_n -0.1992125        9        5            14
50           VSA_f1f2 -0.1981162       10        6            16
19      f1_clean_Gstd -0.1609309       16        8            24
38 f2_clean_vel_range -0.2093378        4       20            24
27 f1_clean_vel_range -0.2044053        6       24            30
32     f2_clean_range -0.2012178        8       23            31
13  envelope_integral -0.1301649       23        9            32
40      f3_clean_Gstd -0.1385837       19       13            32
29      f2_clean_Gstd -0.2075814        5       29            34

$PC2
                     Feature     Loading PCA_Rank XGB_Rank Combined_Rank
13         envelope_integral -0.17231302        7        9            16
11            envelope_Gmean -0.14800731       20        1            21
12             envelope_Gstd -0.14806488       19        3            22
41     f3_clean_pospeak_mean  0.20209426        6       22            28
44        f3_clean_vel_Gmean -0.14362905       21       10            31
5       envelope_change_Gstd  0.11631278       29        4            33
17            envelope_range -0.14099569       22       12            34
18            f1_clean_Gmean  0.11723539       28        7            35
8  envelope_change_pospeak_n  0.13615851       25       17            42
19             f1_clean_Gstd -0.09502265       35        8            43

$PC3
                     Feature     Loading PCA_Rank XGB_Rank Combined_Rank
5       envelope_change_Gstd  0.20397194       11        4            15
12             envelope_Gstd  0.17081447       12        3            15
11            envelope_Gmean  0.14034453       15        1            16
19             f1_clean_Gstd -0.12711091       18        8            26
13         envelope_integral  0.11466983       21        9            30
18            f1_clean_Gmean -0.10469280       23        7            30
8  envelope_change_pospeak_n  0.13146985       16       17            33
4      envelope_change_Gmean  0.25484373        4       30            34
17            envelope_range  0.06834337       27       12            39
41     f3_clean_pospeak_mean -0.12623294       19       22            41

For modelling, we want to pick three features per component. Which would they be in this case?

# Number of top features to display
top_n_features <- 3

# Print the top 3 features per component
for (pc in c("PC1", "PC2", "PC3")) {
  cat("\nTop 3 Features for", pc, ":\n")
  
  # Get the top 3 features based on combined rank for the current PC
  top_features <- head(combined_ranks_per_pc[[pc]], top_n_features)
  
  # Print the results
  print(top_features[, c("Feature", "XGB_Rank", "PCA_Rank", "Combined_Rank")])
}

Top 3 Features for PC1 :
              Feature XGB_Rank PCA_Rank Combined_Rank
21     f1_clean_range        2        7             9
31 f2_clean_pospeak_n        5        9            14
50           VSA_f1f2        6       10            16

Top 3 Features for PC2 :
             Feature XGB_Rank PCA_Rank Combined_Rank
13 envelope_integral        9        7            16
11    envelope_Gmean        1       20            21
12     envelope_Gstd        3       19            22

Top 3 Features for PC3 :
                Feature XGB_Rank PCA_Rank Combined_Rank
5  envelope_change_Gstd        4       11            15
12        envelope_Gstd        3       12            15
11       envelope_Gmean        1       15            16

Multimodal

Now we take the trials from the combined modality and follow an identical workflow.

# Load
data_mult <- read_csv(paste0(datasets, "multi_clean_df.csv"))

# Make predictor a factor variable
data_mult$correction_info <- as.factor(data_mult$correction_info)

# Display
head(data_mult, n=10)
# A tibble: 10 × 395
   arm_duration arm_inter_Kin arm_inter_IK arm_bbmv lowerbody_duration
          <dbl>         <dbl>        <dbl>    <dbl>              <dbl>
 1         3988          26.2         26.7     8.47               2020
 2         3868          26.7         28.1     8.15               2870
 3         4014          26.4         27.6     8.46                874
 4         4046          26.2         28.6     8.48               3750
 5         4708          26.5         28.7     8.31               4508
 6         4004          27.0         28.6     8.70               3598
 7            0           0            0       0                   784
 8         5248          29.0         31.7    10.9                4930
 9         1784          25.0         24.2     6.60                  0
10         1284          24.3         24.5     5.05               1334
# ℹ 390 more variables: lowerbody_inter_Kin <dbl>, lowerbody_inter_IK <dbl>,
#   lowerbody_bbmv <dbl>, leg_duration <dbl>, leg_inter_Kin <dbl>,
#   leg_inter_IK <dbl>, leg_bbmv <dbl>, head_duration <dbl>,
#   head_inter_Kin <dbl>, head_inter_IK <dbl>, head_bbmv <dbl>,
#   arm_moment_sum_Gmean <dbl>, arm_moment_sum_Gstd <dbl>,
#   arm_moment_sum_pospeak_n <dbl>, arm_moment_sum_integral <dbl>,
#   arm_moment_sum_range <dbl>, arm_moment_sum_change_Gmean <dbl>, …

Random forests

We will build a random forest first.

# prepare predictors
predictors <- setdiff(names(data_mult), "correction_info")

formula_str <- paste("correction_info ~", paste(predictors, collapse = " + "))

# Convert the formula string to a formula object
multTree_formula <- as.formula(formula_str)

# Now use the formula in rpart
multTree <- rpart(formula = multTree_formula, data = data_mult, 
                method='class', # Specify that it's a classification tree
                control = rpart.control(maxdepth = 5)  # Control parameters for the 'rpart' function
)

prp(
  multTree,         # The decision tree object to be visualized
  extra = 1,      # Show extra information (like node statistics) in the plot
  varlen = 0,     # Length of variable names (0 means display full names)
  faclen = 0     # Length of factor levels displayed in the plot (0 means display full levels)
)

Split the data
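
The split chunk itself is not shown in this version of the script. Below is a minimal sketch, assuming a stratified 80/20 split with caret's createDataPartition(); the object names train_data and test_data are the ones used in the chunks that follow.

# Hypothetical sketch of the 80/20 split assumed by the models below
set.seed(998)
train_idx  <- createDataPartition(data_mult$correction_info, p = 0.8, list = FALSE)
train_data <- data_mult[train_idx, ]
test_data  <- data_mult[-train_idx, ]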

Building the untuned model.

# Untuned Model with importance (permutation) option set
multUntuned <- ranger(
  y = train_data$correction_info,
  x = train_data[, 1:394], # predictor columns only, i.e., without the outcome variable (OV)
  num.trees = 500,
  importance = "permutation"
)

predictions <- predict(multUntuned, data = test_data)$predictions

# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_data$correction_info)

# Print the confusion matrix
print(confusion_matrix)
Confusion Matrix and Statistics

          Reference
Prediction c0 c0_only c1 c2
   c0       0       0  0  0
   c0_only  1       0  0  0
   c1       0       1  0  1
   c2       0       0  1  0

Overall Statistics
                                     
               Accuracy : 0          
                 95% CI : (0, 0.6024)
    No Information Rate : 0.25       
    P-Value [Acc > NIR] : 1          
                                     
                  Kappa : -0.3333    
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: c0 Class: c0_only Class: c1 Class: c2
Sensitivity               0.00         0.0000    0.0000    0.0000
Specificity               1.00         0.6667    0.3333    0.6667
Pos Pred Value             NaN         0.0000    0.0000    0.0000
Neg Pred Value            0.75         0.6667    0.5000    0.6667
Prevalence                0.25         0.2500    0.2500    0.2500
Detection Rate            0.00         0.0000    0.0000    0.0000
Detection Prevalence      0.00         0.2500    0.5000    0.2500
Balanced Accuracy         0.50         0.3333    0.1667    0.3333
# Calculate feature importance
feature_importance <- importance(multUntuned, num.threads = 1, type = 1) 

# Convert to data frame
feature_importance <- as.data.frame(feature_importance, stringsAsFactors = FALSE)
feature_importance$Feature <- rownames(feature_importance)
colnames(feature_importance) <- c("Importance", "Feature")

# Sort by importance
sorted_feature_importance <- feature_importance[order(-feature_importance$Importance), ]

# Print sorted feature importance
head(sorted_feature_importance, n=10)
                                   Importance                           Feature
envelope_change_pospeak_mean      0.004966667      envelope_change_pospeak_mean
arm_power_pospeak_std             0.003269048             arm_power_pospeak_std
leg_angJerk_sum_pospeak_mean      0.002706349      leg_angJerk_sum_pospeak_mean
arm_accKin_sum_range              0.002615079              arm_accKin_sum_range
lowerbody_jerkKin_sum_Gmean       0.002487302       lowerbody_jerkKin_sum_Gmean
head_speedKin_sum_pospeak_n       0.002206349       head_speedKin_sum_pospeak_n
lowerbody_jerkKin_sum_pospeak_std 0.002028571 lowerbody_jerkKin_sum_pospeak_std
head_angSpeed_sum_range           0.001966667           head_angSpeed_sum_range
arm_accKin_sum_Gmean              0.001816667              arm_accKin_sum_Gmean
head_moment_sum_change_integral   0.001815079   head_moment_sum_change_integral

Set up parallel processing for tuning the random forest.

# Define the number of CPU cores to use
num_cores <- detectCores()

# Create a cluster with specified number of cores
cl <- makeCluster(num_cores)

Tuning the random forest.

tuneMult <- makeClassifTask(data = data_mult[, 1:395], # all columns, including the outcome variable (OV)
                           target = "correction_info")

tuneMult <- tuneRanger(tuneMult,
                      measure = list(multiclass.brier),
                      num.trees = 500)

#Return hyperparameter values
#tuneMult

# Recommended parameter settings: 
#   mtry min.node.size sample.fraction
# 1  232             3         0.70919
# Results: 
#   multiclass.brier exec.time
# 1        0.7385408       0.2

multTuned <- ranger(
  y = train_data$correction_info,
  x = train_data[, 1:394],  # predictor columns only, i.e., without the OV
  num.trees = 5000, 
  mtry = 232, # Set the recommended mtry value (number of variables tried at each split).
  min.node.size = 3, # Set the recommended min.node.size value (minimum number of samples in a terminal node).
  sample.fraction = 0.70919, # Set the recommended sample.fraction value (fraction of data used for bagging).
  importance = "permutation" # Permutation importance is computationally intensive.
)

predictions <- predict(multTuned, data = test_data)$predictions

# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_data$correction_info)

# Print the confusion matrix
print(confusion_matrix)
Confusion Matrix and Statistics

          Reference
Prediction c0 c0_only c1 c2
   c0       0       0  0  0
   c0_only  0       0  0  0
   c1       1       1  0  1
   c2       0       0  1  0

Overall Statistics
                                     
               Accuracy : 0          
                 95% CI : (0, 0.6024)
    No Information Rate : 0.25       
    P-Value [Acc > NIR] : 1          
                                     
                  Kappa : -0.3333    
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: c0 Class: c0_only Class: c1 Class: c2
Sensitivity               0.00           0.00      0.00    0.0000
Specificity               1.00           1.00      0.00    0.6667
Pos Pred Value             NaN            NaN      0.00    0.0000
Neg Pred Value            0.75           0.75      0.00    0.6667
Prevalence                0.25           0.25      0.25    0.2500
Detection Rate            0.00           0.00      0.00    0.0000
Detection Prevalence      0.00           0.00      0.75    0.2500
Balanced Accuracy         0.50           0.50      0.00    0.3333
# Calculate feature importance
feature_importance <- importance(multTuned, num.threads = 1, type = 1) 

# Convert to data frame
feature_importance <- as.data.frame(feature_importance, stringsAsFactors = FALSE)
feature_importance$Feature <- rownames(feature_importance)
colnames(feature_importance) <- c("Importance", "Feature")

# Sort by importance
sorted_feature_importance <- feature_importance[order(-feature_importance$Importance), ]

# Print sorted feature importance
head(sorted_feature_importance, n=10)
                                   Importance                           Feature
arm_power_pospeak_std             0.006726943             arm_power_pospeak_std
arm_accKin_sum_range              0.003974206              arm_accKin_sum_range
leg_speedKin_sum_pospeak_mean     0.003076378     leg_speedKin_sum_pospeak_mean
leg_accKin_sum_pospeak_mean       0.002952482       leg_accKin_sum_pospeak_mean
envelope_change_pospeak_mean      0.001955000      envelope_change_pospeak_mean
lowerbody_jerkKin_sum_Gmean       0.001952301       lowerbody_jerkKin_sum_Gmean
head_angSpeed_sum_range           0.001771530           head_angSpeed_sum_range
leg_angJerk_sum_pospeak_mean      0.001765200      leg_angJerk_sum_pospeak_mean
lowerbody_jerkKin_sum_pospeak_std 0.001732114 lowerbody_jerkKin_sum_pospeak_std
arm_accKin_sum_Gstd               0.001284538               arm_accKin_sum_Gstd
# Close the cluster when you're done with your parallel tasks
#stopCluster(cl)

Now tune the model on the training data only.

# Create a classification task for tuning
tuneMult <- makeClassifTask(data = train_data[, 1:395], target = "correction_info") # all columns, including the OV

# Tune the model
tuneMult <- tuneRanger(tuneMult, measure = list(multiclass.brier), num.trees = 500)

# Return hyperparameter values
#tuneMult
# Recommended parameter settings: 
# mtry min.node.size sample.fraction
# 1  268             3       0.3295434
# Results: 
#   multiclass.brier exec.time
# 1        0.8275102     0.172

# Fit the tuned model on the training data
multTuned <- ranger(
  y = train_data$correction_info,
  x = train_data[, 1:394],  # predictor columns only, i.e., without the OV
  num.trees = 5000,
  mtry = 268,
  min.node.size = 3,
  sample.fraction = 0.3295434,
  importance = "permutation"
)

# Predict on the test data
predictions <- predict(multTuned, data = test_data)$predictions

# Create a confusion matrix
confusion_matrix <- confusionMatrix(predictions, test_data$correction_info)

# Print the confusion matrix
print(confusion_matrix)
Confusion Matrix and Statistics

          Reference
Prediction c0 c0_only c1 c2
   c0       0       0  0  0
   c0_only  0       0  0  0
   c1       1       1  0  1
   c2       0       0  1  0

Overall Statistics
                                     
               Accuracy : 0          
                 95% CI : (0, 0.6024)
    No Information Rate : 0.25       
    P-Value [Acc > NIR] : 1          
                                     
                  Kappa : -0.3333    
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: c0 Class: c0_only Class: c1 Class: c2
Sensitivity               0.00           0.00      0.00    0.0000
Specificity               1.00           1.00      0.00    0.6667
Pos Pred Value             NaN            NaN      0.00    0.0000
Neg Pred Value            0.75           0.75      0.00    0.6667
Prevalence                0.25           0.25      0.25    0.2500
Detection Rate            0.00           0.00      0.00    0.0000
Detection Prevalence      0.00           0.00      0.75    0.2500
Balanced Accuracy         0.50           0.50      0.00    0.3333
# Calculate feature importance
feature_importance <- importance(multTuned, num.threads = 1, type = 1)

# Convert to data frame
feature_importance_df <- as.data.frame(feature_importance, stringsAsFactors = FALSE)
feature_importance_df$Feature <- rownames(feature_importance_df)
colnames(feature_importance_df) <- c("Importance", "Feature")

# Sort by importance
sorted_feature_importance <- feature_importance_df[order(-feature_importance_df$Importance), ]

# Print sorted feature importance
head(sorted_feature_importance, n=10)
                                    Importance                           Feature
arm_power_pospeak_std             0.0009376557             arm_power_pospeak_std
leg_accKin_sum_pospeak_mean       0.0008753114       leg_accKin_sum_pospeak_mean
arm_accKin_sum_range              0.0007167766              arm_accKin_sum_range
leg_speedKin_sum_pospeak_mean     0.0006603297     leg_speedKin_sum_pospeak_mean
head_moment_sum_range             0.0005926007             head_moment_sum_range
leg_angJerk_sum_pospeak_mean      0.0005246154      leg_angJerk_sum_pospeak_mean
head_power_pospeak_std            0.0004887912            head_power_pospeak_std
envelope_change_pospeak_mean      0.0004717216      envelope_change_pospeak_mean
lowerbody_jerkKin_sum_pospeak_std 0.0004468864 lowerbody_jerkKin_sum_pospeak_std
head_angSpeed_sum_range           0.0004372161           head_angSpeed_sum_range
# Close the cluster when you're done with your parallel tasks
#stopCluster(cl)

Save data frame.

write.csv(data_mult, file = paste0(datasets, "multDataXGB.csv"), row.names = FALSE)

XGBoost

Ensure parallel processing.

# Detect the number of available cores
cores <- detectCores()  # optionally subtract 1 to leave a core free

# Create a cluster with the detected number of cores
cl <- makeCluster(cores)

# Register the parallel backend
registerDoParallel(cl)

Define the grid and estimate runtime.

grid_tune <- expand.grid(
  nrounds = c(5000, 10000), 
  max_depth = c(3, 6), 
  eta = c(0.05, 0.1), 
  gamma = c(0.1), 
  colsample_bytree = c(0.6, 0.8), 
  min_child_weight = c(1), 
  subsample = c(0.75, 1.0)
)

# Calculate total combinations
total_combinations <- nrow(grid_tune)

# Estimate single model run time (assume 10 minutes per run)
single_model_time <- 10 # minutes

# Total runs for cross-validation
folds <- 5
total_runs <- total_combinations * folds

# Total time estimation without parallel processing
total_time <- total_runs * single_model_time # in minutes

# Convert to hours
total_time_hours <- total_time / 60

# Output estimated time without parallel processing
print(paste("Estimated time for grid search without parallel processing:", total_time_hours, "hours"))
[1] "Estimated time for grid search without parallel processing: 26.6666666666667 hours"
# Parallel processing with 24 cores
cores <- 24
total_time_parallel <- total_time / cores # in minutes

# Convert to hours
total_time_parallel_hours <- total_time_parallel / 60

# Output estimated time with parallel processing
print(paste("Estimated time for grid search with", cores, "cores:", total_time_parallel_hours, "hours"))
[1] "Estimated time for grid search with 24 cores: 1.11111111111111 hours"
rm(total_combinations,single_model_time,folds,total_runs,total_time,total_time_hours,total_time_parallel,total_time_parallel_hours,cores)

K-fold cross-validation

Create subsets for training and testing the data (80/20).

# Set seed for reproducibility
set.seed(998)

# Set up train control
train_control <- trainControl(
  method = "cv",        # Cross-validation
  number = 5,           # 5-fold cross-validation
  allowParallel = TRUE  # Enable parallel processing
)

# Define the number of subsets
numSubsets <- 5

# Use the multimodal data frame prepared above
multDataXGB <- data_mult

# Ensure 'correction_info' is a factor
multDataXGB$correction_info <- as.factor(multDataXGB$correction_info)

# Remove rows with only NA values
multDataXGB <- multDataXGB[rowSums(is.na(multDataXGB)) < ncol(multDataXGB), ]

# Split data by levels of 'correction_info'
correction_levels <- levels(multDataXGB$correction_info)
split_data <- split(multDataXGB, multDataXGB$correction_info)

# Initialize a list to store subsets
multSubsets <- vector("list", length = numSubsets)

# Distribute rows for each level equally across subsets
for (level in correction_levels) {
  level_data <- split_data[[level]]
  subset_sizes <- rep(floor(nrow(level_data) / numSubsets), numSubsets)
  remainder <- nrow(level_data) %% numSubsets
  
  # Assign the remainder rows to the first subsets
  if (remainder > 0) {
    subset_sizes[seq_len(remainder)] <- subset_sizes[seq_len(remainder)] + 1
  }
  
  # Shuffle rows of the level and assign to subsets
  shuffled_data <- level_data[sample(nrow(level_data)), ]
  indices <- cumsum(c(0, subset_sizes))
  
  for (i in 1:numSubsets) {
    if (is.null(multSubsets[[i]])) {
      multSubsets[[i]] <- shuffled_data[(indices[i] + 1):indices[i + 1], ]
    } else {
      multSubsets[[i]] <- rbind(multSubsets[[i]], shuffled_data[(indices[i] + 1):indices[i + 1], ])
    }
  }
}

# Naming the subsets
names(multSubsets) <- paste0("multData", 1:numSubsets)

# Verify balance in subsets
for (i in 1:numSubsets) {
  cat("Subset", i, "contains rows:", nrow(multSubsets[[i]]), "and levels:\n")
  print(table(multSubsets[[i]]$correction_info))
}
Subset 1 contains rows: 7 and levels:

     c0 c0_only      c1      c2 
      2       1       2       2 
Subset 2 contains rows: 4 and levels:

     c0 c0_only      c1      c2 
      1       1       1       1 
Subset 3 contains rows: 4 and levels:

     c0 c0_only      c1      c2 
      1       1       1       1 
Subset 4 contains rows: 4 and levels:

     c0 c0_only      c1      c2 
      1       1       1       1 
Subset 5 contains rows: 4 and levels:

     c0 c0_only      c1      c2 
      1       1       1       1 
# Remove any all-NA rows from the subsets as a safety check
multSubsets <- lapply(multSubsets, function(subset) {
  subset[rowSums(is.na(subset)) < ncol(subset), ]
})

# Access the subsets
multData1 <- multSubsets$multData1
multData2 <- multSubsets$multData2
multData3 <- multSubsets$multData3
multData4 <- multSubsets$multData4
multData5 <- multSubsets$multData5

# Combine subsets into 80% groups
multData1234 <- rbind(multData1, multData2, multData3, multData4)
multData1235 <- rbind(multData1, multData2, multData3, multData5)
multData1245 <- rbind(multData1, multData2, multData4, multData5)
multData1345 <- rbind(multData1, multData3, multData4, multData5)
multData2345 <- rbind(multData2, multData3, multData4, multData5)

# Final verification of all levels in the combined datasets
combined_sets <- list(multData1234, multData1235, multData1245, multData1345, multData2345)
names(combined_sets) <- c("multData1234", "multData1235", "multData1245", "multData1345", "multData2345")

for (set_name in names(combined_sets)) {
  cat("Dataset", set_name, "contains rows:", nrow(combined_sets[[set_name]]), "and levels:\n")
  print(table(combined_sets[[set_name]]$correction_info))
}
Dataset multData1234 contains rows: 19 and levels:

     c0 c0_only      c1      c2 
      5       4       5       5 
Dataset multData1235 contains rows: 19 and levels:

     c0 c0_only      c1      c2 
      5       4       5       5 
Dataset multData1245 contains rows: 19 and levels:

     c0 c0_only      c1      c2 
      5       4       5       5 
Dataset multData1345 contains rows: 19 and levels:

     c0 c0_only      c1      c2 
      5       4       5       5 
Dataset multData2345 contains rows: 16 and levels:

     c0 c0_only      c1      c2 
      4       4       4       4 

Models

Run the models only once; afterwards, reload them with readRDS().

Model 1

multModel1 <- caret::train(
  correction_info ~ .,              
  data = multData1234,
  method = "xgbTree",     
  trControl = train_control,
  tuneGrid = grid_tune    
)

saveRDS(multModel1, file = paste0(models, "multModel1.rds"), compress = TRUE)

Model 2

multModel2 <- caret::train(
  correction_info ~ .,              
  data = multData1235,
  method = "xgbTree",     
  trControl = train_control,
  tuneGrid = grid_tune    
)

saveRDS(multModel2, file = paste0(models, "multModel2.rds"), compress = TRUE)

Model 3

multModel3 <- caret::train(
  correction_info ~ .,              
  data = multData1245,
  method = "xgbTree",     
  trControl = train_control,
  tuneGrid = grid_tune    
)

saveRDS(multModel3, file = paste0(models, "multModel3.rds"), compress = TRUE)

Model 4

multModel4 <- caret::train(
  correction_info ~ .,              
  data = multData1345,
  method = "xgbTree",     
  trControl = train_control,
  tuneGrid = grid_tune    
)
saveRDS(multModel4, file = paste0(models, "multModel4.rds"), compress = TRUE)

Model 5

multModel5 <- caret::train(
  correction_info ~ .,              
  data = multData2345,
  method = "xgbTree",     
  trControl = train_control,
  tuneGrid = grid_tune    
)

saveRDS(multModel5, file = paste0(models, "multModel5.rds"), compress = TRUE)

Load models

Load all models after running, if necessary.

multModel1 <- readRDS(paste0(models, "multModel1.rds"))
multModel2 <- readRDS(paste0(models, "multModel2.rds"))
multModel3 <- readRDS(paste0(models, "multModel3.rds"))
multModel4 <- readRDS(paste0(models, "multModel4.rds"))
multModel5 <- readRDS(paste0(models, "multModel5.rds"))

Test models

Generate predictions and confusion matrices

# Generate predictions
multPredictions1 <- predict(multModel1, newdata = multData5)
multPredictions2 <- predict(multModel2, newdata = multData4)
multPredictions3 <- predict(multModel3, newdata = multData3)
multPredictions4 <- predict(multModel4, newdata = multData2)
multPredictions5 <- predict(multModel5, newdata = multData1)
# Compute confusion matrices
multCm1 <- confusionMatrix(multPredictions1, multData5$correction_info)
multCm2 <- confusionMatrix(multPredictions2, multData4$correction_info)
multCm3 <- confusionMatrix(multPredictions3, multData3$correction_info)
multCm4 <- confusionMatrix(multPredictions4, multData2$correction_info)
multCm5 <- confusionMatrix(multPredictions5, multData1$correction_info)

# Extract accuracy p-values (caret's AccuracyPValue: a binomial test of
# accuracy against the no-information rate)
multPValues <- c(multCm1$overall['AccuracyPValue'], 
              multCm2$overall['AccuracyPValue'], 
              multCm3$overall['AccuracyPValue'], 
              multCm4$overall['AccuracyPValue'], 
              multCm5$overall['AccuracyPValue'])

Combine p-values using Fisher’s and Stouffer’s methods

# Fisher's method
multFisher_combined <- -2 * sum(log(multPValues))
df <- 2 * length(multPValues)
multPCcombined_fisher <- 1 - pchisq(multFisher_combined, df)
print(multPCcombined_fisher)
[1] 0.3920742
# Stouffer's method
multZ_scores <- qnorm(1 - multPValues/2)
multCombined_z <- sum(multZ_scores) / sqrt(length(multPValues))
multP_combined_stouffer <- 2 * (1 - pnorm(abs(multCombined_z)))
print(multP_combined_stouffer)
[1] 0.05686547

For the models to be considered predictive, the combined p-value should approach 0; here, both methods remain above conventional significance thresholds.

Feature importance

Model 1
XGBmultModel1 <- multModel1$finalModel
importanceXGBmultModel1 <- xgb.importance(model = XGBmultModel1)
head(importanceXGBmultModel1, n=10)
                                Feature       Gain      Cover  Frequency
                                 <char>      <num>      <num>      <num>
 1:               arm_power_pospeak_std 0.16370920 0.10839576 0.08992806
 2: head_moment_sum_change_pospeak_mean 0.08237537 0.09064362 0.09352518
 3:                       head_duration 0.08032469 0.06534066 0.05755396
 4:         lowerbody_jerkKin_sum_Gmean 0.06176702 0.07529737 0.06115108
 5:                      head_inter_Kin 0.05656996 0.03174240 0.02517986
 6:                envelope_change_Gstd 0.05258590 0.04236234 0.05755396
 7:                arm_accKin_sum_range 0.05149798 0.05082393 0.04676259
 8:       leg_speedKin_sum_pospeak_mean 0.05088554 0.06236849 0.05395683
 9:                arm_moment_sum_range 0.04731327 0.05174319 0.05035971
10:   lowerbody_jerkKin_sum_pospeak_std 0.04304124 0.03763756 0.03237410
xgb.plot.importance(importanceXGBmultModel1)

Model 2
XGBmultModel2 <- multModel2$finalModel
importanceXGBmultModel2 <- xgb.importance(model = XGBmultModel2)
head(importanceXGBmultModel2, n=10)
                                Feature       Gain      Cover   Frequency
                                 <char>      <num>      <num>       <num>
 1:         lowerbody_jerkKin_sum_Gmean 0.05671646 0.05851969 0.046052632
 2:                 arm_accKin_sum_Gstd 0.04611643 0.02468970 0.019736842
 3:                arm_accKin_sum_range 0.04296956 0.02853323 0.019736842
 4:               head_angAcc_sum_range 0.04274041 0.03817195 0.032894737
 5:                       head_duration 0.03394080 0.01916268 0.016447368
 6:                      head_inter_Kin 0.02912452 0.03218462 0.032894737
 7:               arm_power_pospeak_std 0.02882428 0.02028096 0.019736842
 8:                        arm_inter_IK 0.02682163 0.01527494 0.016447368
 9: head_moment_sum_change_pospeak_mean 0.02476986 0.01973325 0.016447368
10:       lowerbody_accKin_sum_integral 0.02368903 0.01247853 0.009868421
xgb.plot.importance(importanceXGBmultModel2)

Model 3
XGBmultModel3 <- multModel3$finalModel
importanceXGBmultModel3 <- xgb.importance(model = XGBmultModel3)
head(importanceXGBmultModel3, n=10)
                                Feature       Gain      Cover  Frequency
                                 <char>      <num>      <num>      <num>
 1:               arm_power_pospeak_std 0.11936301 0.06412536 0.04587156
 2:         leg_moment_sum_pospeak_mean 0.06558456 0.06864607 0.05810398
 3:        envelope_change_pospeak_mean 0.03526312 0.03716335 0.03363914
 4:         lowerbody_jerkKin_sum_Gmean 0.03033355 0.03258257 0.02752294
 5:              spine_moment_sum_Gmean 0.02960073 0.02155368 0.01529052
 6:                       head_inter_IK 0.02905922 0.01726820 0.01529052
 7:                       body_slope_80 0.02434499 0.04392720 0.05504587
 8: head_moment_sum_change_pospeak_mean 0.02274487 0.03483516 0.03669725
 9:                arm_accKin_sum_range 0.02268550 0.02720615 0.02752294
10:               arm_speedKin_sum_Gstd 0.02068770 0.01205195 0.01223242
xgb.plot.importance(importanceXGBmultModel3)

Model 4
XGBmultModel4 <- multModel4$finalModel
importanceXGBmultModel4 <- xgb.importance(model = XGBmultModel4)
head(importanceXGBmultModel4, n=10)
                                Feature       Gain      Cover  Frequency
                                 <char>      <num>      <num>      <num>
 1: head_moment_sum_change_pospeak_mean 0.13433303 0.10993043 0.08828829
 2:               arm_power_pospeak_std 0.11704856 0.08864239 0.07567568
 3:                arm_moment_sum_range 0.07681624 0.06679121 0.05585586
 4:             head_angSpeed_sum_range 0.07387166 0.06683661 0.05945946
 5:                arm_accKin_sum_range 0.07129803 0.05788595 0.05045045
 6:              head_angJerk_sum_range 0.06440727 0.05423596 0.03783784
 7:        lowerbody_speedKin_sum_Gmean 0.05471338 0.04881843 0.03783784
 8:        lowerbody_speedKin_sum_range 0.05336213 0.04938517 0.03783784
 9:              arm_speedKin_sum_range 0.04297612 0.02185336 0.01441441
10:               head_angAcc_sum_Gmean 0.03311430 0.03207859 0.02522523
xgb.plot.importance(importanceXGBmultModel4)

Model 5
XGBmultModel5 <- multModel5$finalModel
importanceXGBmultModel5 <- xgb.importance(model = XGBmultModel5)
head(importanceXGBmultModel5, n=10)
                        Feature       Gain      Cover  Frequency
                         <char>      <num>      <num>      <num>
 1:       arm_speedKin_sum_Gstd 0.25372488 0.18051833 0.14107884
 2:              head_inter_Kin 0.10834497 0.08322603 0.07883817
 3:     head_angSpeed_sum_range 0.08083857 0.05868575 0.04979253
 4:      arm_speedKin_sum_Gmean 0.06179184 0.06637690 0.06224066
 5:  head_speedKin_sum_integral 0.06069948 0.06488408 0.05394191
 6: leg_moment_sum_pospeak_mean 0.03819193 0.02769087 0.02074689
 7:     arm_accKin_sum_integral 0.03553192 0.03942087 0.03734440
 8: lowerbody_jerkKin_sum_Gmean 0.03357866 0.03170172 0.02489627
 9:        arm_accKin_sum_range 0.02801619 0.02222806 0.01659751
10:       arm_power_pospeak_std 0.02737382 0.03537906 0.03734440
xgb.plot.importance(importanceXGBmultModel5)

Cumulative feature importance
# Function to extract and normalize importance
get_normalized_importance <- function(model) {
  importance <- xgb.importance(model = model)
  importance$Gain <- importance$Gain / sum(importance$Gain)
  return(importance)
}

# Extract normalized importance for each model
multImportance1 <- get_normalized_importance(multModel1$finalModel)
multImportance2 <- get_normalized_importance(multModel2$finalModel)
multImportance3 <- get_normalized_importance(multModel3$finalModel)
multImportance4 <- get_normalized_importance(multModel4$finalModel)
multImportance5 <- get_normalized_importance(multModel5$finalModel)

# Combine importances
multAllImportances <- list(multImportance1, multImportance2, multImportance3, multImportance4, multImportance5)

# Function to merge importances
merge_importances <- function(importances) {
  for (i in 2:length(importances)) {
    names(importances[[i]])[2:4] <- paste0(names(importances[[i]])[2:4], "_", i)
  }
  merged <- Reduce(function(x, y) merge(x, y, by = "Feature", all = TRUE), importances)
  merged[is.na(merged)] <- 0  # Replace NAs with 0
  gain_cols <- grep("Gain", colnames(merged), value = TRUE)
  merged$Cumulative <- rowSums(merged[, ..gain_cols])
  return(merged[, .(Feature, Cumulative)])
}

# Merge and sort importances
multCumulativeImportance <- merge_importances(multAllImportances)
multCumulativeImportance <- multCumulativeImportance[order(-multCumulativeImportance$Cumulative), ]

# Print cumulative feature importance
head(multCumulativeImportance, n=10)
                                Feature Cumulative
                                 <char>      <num>
 1:               arm_power_pospeak_std  0.4563189
 2:               arm_speedKin_sum_Gstd  0.2938622
 3: head_moment_sum_change_pospeak_mean  0.2651477
 4:                arm_accKin_sum_range  0.2164673
 5:                      head_inter_Kin  0.2109458
 6:         lowerbody_jerkKin_sum_Gmean  0.1823957
 7:             head_angSpeed_sum_range  0.1567240
 8:                arm_moment_sum_range  0.1448105
 9:                       head_duration  0.1419503
10:         leg_moment_sum_pospeak_mean  0.1221006

PCA

As before, we collect the 10 features per component with the highest combined ranking: for each feature, we sum its XGBoost cumulative-importance rank and the rank of its absolute loading on the principal component.

# Rank the features based on XGBoost importance (cumulative)
multCumulativeImportance$XGB_Rank <- rank(-multCumulativeImportance$Cumulative)

# Load in PCA results for the multimodal trials
mult_pca <- read_csv(paste0(datasets, "PCA_top_contributors_multi.csv"))

# For each PC (PC1, PC2, PC3), rank the features based on their loadings
combined_ranks_per_pc <- list()

for (pc in c("PC1", "PC2", "PC3")) {
  # Extract the features and loadings for the current PC
  pca_pc_loadings <- mult_pca[, c(pc, paste0(pc, "_Loading"))]
  colnames(pca_pc_loadings) <- c("Feature", "Loading")
  
  # Rank the features based on the absolute loading values (higher loadings should get lower rank)
  pca_pc_loadings$PCA_Rank <- rank(-abs(pca_pc_loadings$Loading))
  
  # Merge PCA loadings with XGBoost importance ranks
  merged_data <- merge(pca_pc_loadings, multCumulativeImportance[, c("Feature", "XGB_Rank")], by = "Feature")
  
  # Calculate combined rank by summing XGBoost rank and PCA rank
  merged_data$Combined_Rank <- merged_data$XGB_Rank + merged_data$PCA_Rank
  
  # Sort by the combined rank (lower rank is better)
  sorted_data <- merged_data[order(merged_data$Combined_Rank), ]
  
  # Select the top n features based on the combined rank for the current PC
  top_n_features <- 10  # Adjust the number of top features as needed
  combined_ranks_per_pc[[pc]] <- head(sorted_data, top_n_features)
}

# Output the top features per PC based on combined ranking
head(combined_ranks_per_pc, n=10)
$PC1
                            Feature    Loading PCA_Rank XGB_Rank Combined_Rank
196    lowerbody_speedKin_sum_range 0.08494744        1       17            18
98           head_angJerk_sum_range 0.08114843        9       14            23
94            head_angJerk_sum_Gstd 0.08057174       10       29            39
93           head_angJerk_sum_Gmean 0.07951331       14       28            42
193 lowerbody_speedKin_sum_integral 0.07981523       12       39            51
128      head_speedKin_sum_integral 0.07633696       37       18            55
161   lowerbody_accKin_sum_integral 0.07643321       36       21            57
191    lowerbody_speedKin_sum_Gmean 0.07389508       49       13            62
89            head_angAcc_sum_Gmean 0.07122083       57       11            68
91         head_angAcc_sum_integral 0.07661845       34       34            68

$PC2
                               Feature     Loading PCA_Rank XGB_Rank Combined_Rank
64                envelope_change_Gstd -0.08227254     27.0       25          52.0
157      leg_speedKin_sum_pospeak_mean  0.07903907     35.5       19          54.5
132        leg_accKin_sum_pospeak_mean  0.07903907     35.5       24          59.5
102            head_angSpeed_sum_range -0.07046288     70.0        7          77.0
109      head_jerkKin_sum_pospeak_mean  0.08860597     14.0       68          82.0
53        arm_speedKin_sum_pospeak_std -0.07747848     41.5       52          93.5
58                           COPc_Gstd  0.08674708     22.0       76          98.0
117 head_moment_sum_change_pospeak_std -0.07674698     45.0       58         103.0
6                  arm_angAcc_sum_Gstd  0.07178235     67.0       43         110.0
61                    COPc_pospeak_std  0.07857052     38.0       77         115.0

$PC3
                               Feature     Loading PCA_Rank XGB_Rank Combined_Rank
47               arm_power_pospeak_std -0.09134019       24        1            25
175        lowerbody_jerkKin_sum_Gmean  0.09121169       25        6            31
104                      head_duration -0.08806084       28        9            37
179  lowerbody_jerkKin_sum_pospeak_std  0.10241640       17       20            37
42                arm_moment_sum_range -0.08711107       30        8            38
178 lowerbody_jerkKin_sum_pospeak_mean  0.12010976        6       40            46
201         pelvis_moment_sum_integral -0.12537799        4       56            60
102            head_angSpeed_sum_range -0.07039109       59        7            66
49              arm_speedKin_sum_Gmean  0.07121858       57       12            69
69               envelope_pospeak_mean  0.08530688       33       49            82
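
A side note on the fractional ranks visible for PC2 (e.g., PCA_Rank 35.5 for both leg_speedKin_sum_pospeak_mean and leg_accKin_sum_pospeak_mean, which share the loading 0.07903907): rank() resolves ties by averaging by default (ties.method = "average"), so tied loadings split the positions they jointly occupy. A minimal illustration:

# Tied values receive the mean of the positions they would occupy
rank(-abs(c(0.9, 0.5, 0.5, 0.1)))
[1] 1.0 2.5 2.5 4.0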

For modelling, we want to pick three features per component. Which would they be in this case?

# Number of top features to display
top_n_features <- 3

# Print the top 3 features per component
for (pc in c("PC1", "PC2", "PC3")) {
  cat("\nTop 3 Features for", pc, ":\n")
  
  # Get the top 3 features based on combined rank for the current PC
  top_features <- head(combined_ranks_per_pc[[pc]], top_n_features)
  
  # Print the results
  print(top_features[, c("Feature", "XGB_Rank", "PCA_Rank", "Combined_Rank")])
}

Top 3 Features for PC1 :
                         Feature XGB_Rank PCA_Rank Combined_Rank
196 lowerbody_speedKin_sum_range       17        1            18
98        head_angJerk_sum_range       14        9            23
94         head_angJerk_sum_Gstd       29       10            39

Top 3 Features for PC2 :
                          Feature XGB_Rank PCA_Rank Combined_Rank
64           envelope_change_Gstd       25     27.0          52.0
157 leg_speedKin_sum_pospeak_mean       19     35.5          54.5
132   leg_accKin_sum_pospeak_mean       24     35.5          59.5

Top 3 Features for PC3 :
                        Feature XGB_Rank PCA_Rank Combined_Rank
47        arm_power_pospeak_std        1       24            25
175 lowerbody_jerkKin_sum_Gmean        6       25            31
104               head_duration        9       28            37


Now we are ready to model the most predictive features to gain a deeper understanding of their relationship with our outcome variable, i.e., communicative attempt.
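
To carry these features forward, we can collect the nine selected names programmatically rather than copying them by hand. Below is a minimal sketch based on the combined_ranks_per_pc list built above; the vector name selected_features is our own choice:

# Gather the top 3 features per principal component into one character vector
selected_features <- unlist(lapply(
  combined_ranks_per_pc[c("PC1", "PC2", "PC3")],
  function(df) as.character(head(df$Feature, top_n_features))
))
unname(selected_features)

The resulting vector can then be used to subset the modality dataframe, together with correction_info, when specifying the follow-up models.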

Session info

Here is the session information for reproducibility:

R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8   
[3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.utf8    

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] doParallel_1.0.17   iterators_1.0.14    foreach_1.5.2      
 [4] mice_3.17.0         xgboost_1.7.9.1     caret_7.0-1        
 [7] lattice_0.22-6      tuneRanger_0.7      lhs_1.2.0          
[10] mlrMBO_1.1.5.1      smoof_1.6.0.3       checkmate_2.3.2    
[13] mlr_2.19.2          ParamHelpers_1.14.2 ranger_0.17.0      
[16] rpart.plot_3.1.2    rpart_4.1.24        gridExtra_2.3      
[19] ggpubr_0.6.0        ggforce_0.4.2       data.table_1.17.0  
[22] lubridate_1.9.4     forcats_1.0.0       dplyr_1.1.4        
[25] purrr_1.0.4         readr_2.1.5         tidyr_1.3.1        
[28] ggplot2_3.5.1       tidyverse_2.0.0     stringr_1.5.1      
[31] tibble_3.2.1       

loaded via a namespace (and not attached):
 [1] Rdpack_2.6.3         pROC_1.18.5          rlang_1.1.5         
 [4] magrittr_2.0.3       e1071_1.7-16         compiler_4.4.1      
 [7] vctrs_0.6.5          reshape2_1.4.4       crayon_1.5.3        
[10] shape_1.4.6.1        pkgconfig_2.0.3      fastmap_1.2.0       
[13] backports_1.5.0      rmarkdown_2.29       prodlim_2024.06.25  
[16] tzdb_0.5.0           nloptr_2.2.1         bit_4.6.0           
[19] jomo_2.7-6           glmnet_4.1-8         xfun_0.52           
[22] jsonlite_2.0.0       recipes_1.2.1        pan_1.9             
[25] tweenr_2.0.3         broom_1.0.8          R6_2.6.1            
[28] stringi_1.8.7        boot_1.3-30          parallelly_1.43.0   
[31] car_3.1-3            Rcpp_1.0.14          knitr_1.50          
[34] future.apply_1.11.3  Matrix_1.7-0         splines_4.4.1       
[37] nnet_7.3-19          timechange_0.3.0     tidyselect_1.2.1    
[40] abind_1.4-8          yaml_2.3.10          timeDate_4041.110   
[43] codetools_0.2-20     listenv_0.9.1        plyr_1.8.9          
[46] withr_3.0.2          evaluate_1.0.3       future_1.34.0       
[49] survival_3.6-4       proxy_0.4-27         polyclip_1.10-7     
[52] pillar_1.10.2        carData_3.0-5        stats4_4.4.1        
[55] reformulas_0.4.0     generics_0.1.3       vroom_1.6.5         
[58] hms_1.1.3            munsell_0.5.1        scales_1.3.0        
[61] minqa_1.2.8          globals_0.16.3       class_7.3-22        
[64] glue_1.8.0           tools_4.4.1          parallelMap_1.5.1   
[67] lme4_1.1-37          ModelMetrics_1.2.2.2 gower_1.0.2         
[70] ggsignif_0.6.4       XML_3.99-0.18        fastmatch_1.1-6     
[73] grid_4.4.1           rbibutils_2.3        ipred_0.9-15        
[76] colorspace_2.1-1     nlme_3.1-164         BBmisc_1.13         
[79] Formula_1.2-5        cli_3.6.4            DiceKriging_1.6.0   
[82] lava_1.8.1           gtable_0.3.6         rstatix_0.7.2       
[85] digest_0.6.37        htmlwidgets_1.6.4    farver_2.1.2        
[88] htmltools_0.5.8.1    lifecycle_1.0.4      hardhat_1.4.1       
[91] mitml_0.4-5          bit64_4.6.0-1        MASS_7.3-60.2       

References

Chen, Tianqi, and Carlos Guestrin. 2016. "XGBoost: A Scalable Tree Boosting System." In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94. KDD '16. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2939672.2939785.
Ćwiek, Aleksandra, Alina Gregori, Paula Ginesa Sánchez-Ramón, Pilar Prieto, and Frank Kügler. n.d. "Acoustic Correlates of Perceived Phrase-Level Prosodic Prominence in Catalan and German."