Exploratory Analysis I: Using PCA to identify effort dimensions

Overview

In this notebook, we will use Principal Component Analysis (PCA) to identify the most relevant dimensions (components) of effort among the features in the dataset(s) we created in the previous script. We do this in parallel with eXtreme Gradient Boosting (XGBoost, see script here) because, unlike PCA, XGBoost does not prevent the most relevant features from accumulating among correlated ones, i.e., features that likely capture a similar dimension of effort. To increase the interpretative power of our analysis, we will combine the two methods to identify the most relevant features within the most relevant dimensions (i.e., components) of effort.

Note that the current version of the script uses data from dyad 0 only. Since this is not a sufficient amount of data for any meaningful conclusions, this script serves to build the workflow. We will use an identical pipeline with the full dataset, and any deviations from this script will be reported.

Code to prepare the environment
import os
import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder

curfolder = os.getcwd()

# This is where our features live
features = curfolder + '\\..\\07_TS_featureExtraction\\Datasets\\'
dfs = glob.glob(features + '*.csv') 

Because a different set of components could be decisive for characterizing effort within each of the three modalities - gesture, vocalization, combined - we will perform PCA on each modality separately.

# This is gesture data
ges = [x for x in dfs if 'gesture' in x]
data_ges = pd.read_csv(ges[0])

# This is vocalization data
voc = [x for x in dfs if 'vocal' in x]
data_voc = pd.read_csv(voc[0])

# This is multimodal data
multi = [x for x in dfs if 'combination' in x]
data_multi = pd.read_csv(multi[0])

PCA: Gesture

Let's start by cleaning the dataframe. In the gesture modality, some features are not relevant to the current analysis, mainly those related to acoustics and concept-related information. We will remove them from the dataframe before performing PCA.

Custom functions
# Function to clean the data
def clean_df(df, colstodel):

    # Delete all desired columns
    df = df.loc[:,~df.columns.str.contains('|'.join(colstodel))]

    # Fill NaNs with 0
    df = df.fillna(0)   # FLAGGED: this we might change, maybe not the best method (alternative: MICE)

    # Save values from correction_info
    correction_info = df['correction_info']

    # Leave only numerical cols, except correction_info
    df = df.select_dtypes(include=['float64','int64'])

    # Add back correction_info
    df['correction_info'] = correction_info
    
    return df
# These are answer related columns
conceptcols = ['answer', 'expressibility', 'response']

# These are vocalization related columns
voccols = ['envelope', 'audio', 'f0', 'f1', 'f2', 'f3', 'env_', 'duration_voc', 'CoG']

# Concatenate both lists
colstodel = conceptcols + voccols

# Clean the df
ges_clean = clean_df(data_ges, colstodel)

ges_clean.head(15)
arm_duration arm_inter_Kin arm_inter_IK arm_bbmv lowerbody_duration lowerbody_inter_Kin lowerbody_inter_IK lowerbody_bbmv leg_duration leg_inter_Kin ... arm_asymmetry arm_moment_sum_pospeak_mean arm_power_pospeak_std pelvis_moment_sum_change_pospeak_std pelvis_moment_sum_pospeak_std arm_moment_sum_pospeak_std lowerbody_moment_sum_pospeak_std lowerbody_power_pospeak_std leg_moment_sum_change_pospeak_std correction_info
0 3816.0 28.001 29.471 10.348 3244.0 27.815 30.057 9.607 3244.0 27.788 ... -2481.688 -8.489 1.030 0.000 0.646 4.579 0.303 0.000 0.000 c0_only
1 5280.0 28.098 30.942 13.350 5862.0 28.648 31.235 17.882 5862.0 29.740 ... -2803.359 -0.430 1.682 0.213 0.907 1.201 0.409 0.000 0.513 c0_only
2 4302.0 28.401 31.164 10.572 4206.0 28.373 31.096 8.402 4206.0 27.451 ... -6703.803 -6.312 0.348 0.000 1.403 3.175 1.227 0.000 0.000 c0
3 4474.0 27.316 30.376 10.229 4398.0 28.390 32.196 9.064 4398.0 27.665 ... -2827.356 -5.724 1.266 0.000 1.717 5.321 0.124 0.000 0.000 c1
4 4388.0 27.872 31.225 9.748 3782.0 27.458 30.479 8.428 3782.0 27.529 ... -3711.104 -1.685 0.000 0.012 1.045 3.724 0.270 0.000 0.000 c2
5 5852.0 29.417 32.048 10.675 5404.0 29.153 30.545 10.334 5404.0 29.056 ... -8208.505 -8.680 2.553 0.018 0.958 2.077 0.145 0.000 0.000 c0_only
6 2736.0 26.324 27.656 6.547 0.0 0.000 0.000 0.000 0.0 0.000 ... -3005.988 0.645 0.000 0.000 0.000 0.000 0.000 0.000 0.000 c0
7 4256.0 28.675 29.776 8.148 0.0 0.000 0.000 0.000 0.0 0.000 ... -3988.156 -5.168 0.000 0.000 1.373 8.906 0.000 0.000 0.000 c1
8 2516.0 26.234 27.815 7.774 884.0 23.300 25.890 6.598 884.0 23.423 ... -6191.330 -2.486 0.000 0.000 0.909 4.152 0.000 0.000 0.000 c0_only
9 4504.0 28.604 30.892 9.939 4266.0 27.764 29.475 11.045 4266.0 28.404 ... -9892.382 -5.280 1.533 0.000 0.688 1.350 0.390 0.356 0.000 c0
10 5946.0 28.473 30.925 9.779 5136.0 28.712 30.358 9.084 5136.0 28.117 ... -8094.630 -14.464 0.000 0.142 0.512 0.984 0.865 0.000 0.039 c1
11 6100.0 29.300 31.243 10.521 6030.0 29.592 30.960 9.672 6030.0 30.037 ... -6914.443 -4.261 0.850 0.000 0.552 0.636 0.283 0.000 0.000 c2
12 3004.0 26.079 26.368 7.253 2686.0 28.219 27.971 4.870 2686.0 27.750 ... -1630.817 0.000 0.000 0.000 0.000 0.000 0.085 0.000 0.000 c0_only
13 3918.0 28.358 28.936 8.394 3366.0 28.669 28.833 5.086 3366.0 28.617 ... -2136.578 -4.907 0.184 0.460 0.851 1.980 0.855 0.406 1.368 c0
14 3918.0 28.120 29.766 9.343 1692.0 26.370 26.244 4.733 1692.0 24.816 ... -1165.345 -11.533 0.299 0.202 1.278 2.726 0.000 0.140 0.000 c1

15 rows × 325 columns
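The NaN-to-zero fill in clean_df is flagged above as provisional, with MICE mentioned as an alternative. Below is a minimal sketch of what a MICE-style replacement could look like, assuming scikit-learn's experimental IterativeImputer; the helper name impute_mice is illustrative and not part of the pipeline.

# Minimal sketch of a MICE-style alternative to df.fillna(0) (illustrative only)
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the next import)
from sklearn.impute import IterativeImputer

def impute_mice(df):
    # Impute only the numeric columns; non-numeric columns (e.g., correction_info) stay untouched
    num_cols = df.select_dtypes(include=['float64', 'int64']).columns
    imputer = IterativeImputer(max_iter=10, random_state=0)
    df[num_cols] = imputer.fit_transform(df[num_cols])
    return df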


Now, we first standardize the data and apply PCA to extract the principal components. We use the custom function PCA_biplot (adapted from here) to visualize the first two principal components. Data points are color-coded by the target variable (correction info), and red arrows represent the contributions of selected variables to the PCs.

Custom functions
# Function to plot PCA results
def PCA_biplot(score, coeff, y, labels=None, selected_vars=None):
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())

    # Ensure the scores and the target variable have the same length
    min_length = min(len(xs), len(ys), len(y))

    # Trim all arrays to the smallest length
    xs_trimmed = xs[:min_length]
    ys_trimmed = ys[:min_length]
    y_trimmed = y[:min_length]  # Adjust color values to match

    plt.figure(figsize=(12, 8))  # Increase figure size
    # Now plot safely
    plt.scatter(xs_trimmed * scalex, ys_trimmed * scaley, c=y_trimmed, cmap="viridis", alpha=0.7)

    # If selected_vars is provided, only plot these variables; otherwise plot all of them
    indices = selected_vars if selected_vars is not None else range(n)
    for i in indices:
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        var_label = "Var" + str(i + 1) if labels is None else labels[i]
        plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, var_label,
                 color='g', ha='center', va='center', fontsize=9)

    # Zoom into the plot by narrowing the axis limits
    plt.xlim(-0.5, 0.5)  # Adjust the range as needed
    plt.ylim(-0.5, 0.5)  # Adjust the range as needed

    plt.xlabel("PC1", fontsize=14)
    plt.ylabel("PC2", fontsize=14)
    plt.grid()
    plt.title("PCA Biplot", fontsize=16)
    plt.show()
# Prepare data
X = ges_clean.iloc[:, :-1].values  # All columns except the last as features
y = ges_clean.iloc[:, -1].values   # Last column as target variable

# Convert categorical target to numeric if necessary
if y.dtype == 'object' or y.dtype.name == 'category':
    le = LabelEncoder()
    y = le.fit_transform(y)  # Converts categorical labels into numeric labels

# Scale the data
scaler = StandardScaler()
X = scaler.fit_transform(X)    

# PCA transformation
pca = PCA()
x_new = pca.fit_transform(X)

# For intelligibility, let's only plot some of the variables
selected_vars = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  

# Call the function on the first 2 PCs, passing y to color-code the data points
PCA_biplot(x_new[:, 0:2], np.transpose(pca.components_[0:2, :]), y, selected_vars=selected_vars)


How much variance does each PC explain?

array([3.84511094e-01, 1.33517247e-01, 1.13255439e-01, 7.31860616e-02,
       4.06776991e-02, 3.60648304e-02, 3.37781522e-02, 2.58262475e-02,
       2.25049502e-02, 2.11170841e-02, 1.88455453e-02, 1.63285979e-02,
       1.42102152e-02, 1.37325090e-02, 1.14525593e-02, 1.05874031e-02,
       9.39941422e-03, 8.83676344e-03, 6.69106898e-03, 5.47711790e-03,
       8.22273514e-33])


Since we have very little data, even the first principal component explains only 38% of the variance, the second 13%, and the third 11%. We will focus on the first three principal components, as together they explain at least 50% of the variance.
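As a quick check, the cumulative explained variance of the first three components can be read directly from the fitted pca object (a minimal sketch):

# Explained variance ratio per component and its cumulative sum
explained = pca.explained_variance_ratio_
print(explained[:3])             # variance explained by PC1-PC3 individually
print(np.cumsum(explained)[:3])  # cumulative variance explained by PC1-PC3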

Now we can check the most important features. The larger the absolute value of a feature's loading on a component, the more important that feature is for that principal component.

[[5.38681564e-02 5.19802363e-02 6.69687848e-02 ... 3.65783055e-02
  7.72739752e-03 1.33552992e-02]
 [4.86443290e-02 6.05250242e-02 6.33776285e-02 ... 3.84745526e-02
  3.50366562e-02 9.39677427e-02]
 [1.18950177e-02 3.18850132e-02 1.07001005e-02 ... 8.76759978e-02
  6.23252471e-02 3.70224151e-02]
 ...
 [6.91904948e-02 1.05956755e-01 5.99632890e-02 ... 2.86900649e-02
  2.45013548e-02 1.30355450e-02]
 [8.55970837e-03 1.99497041e-02 6.52238928e-02 ... 7.38606597e-02
  1.22245602e-01 2.97327756e-02]
 [2.14046104e-01 7.13652577e-01 1.35550979e-01 ... 5.87310753e-04
  8.63642051e-03 2.48636333e-03]]
# Number of principal components
n_pcs = 3

# Feature names (excluding target column)
feature_names = ges_clean.columns[:-1]  

# Create storage for the ordered feature names and loadings
results_dict_ges = {}

for i in range(n_pcs):
    # Get all features sorted by absolute loading values
    sorted_indices = np.abs(pca.components_[i]).argsort()[::-1]
    sorted_features = feature_names[sorted_indices]  # Feature names
    sorted_loadings = pca.components_[i, sorted_indices]  # Loadings

    # Store in dictionary
    results_dict_ges[f'PC{i+1}'] = sorted_features.values
    results_dict_ges[f'PC{i+1}_Loading'] = sorted_loadings

# Convert dictionary to DataFrame
results_df_ges = pd.DataFrame(results_dict_ges)
results_df_ges.head(20)
PC1 PC1_Loading PC2 PC2_Loading PC3 PC3_Loading
0 bbmv_total 0.088552 leg_moment_sum_change_integral 0.122350 lowerbody_angAcc_sum_Gstd -0.113851
1 lowerbody_bbmv 0.088311 lowerbody_moment_sum_change_integral 0.119478 pelvis_moment_sum_integral 0.110988
2 leg_bbmv 0.088311 leg_moment_sum_range 0.118334 lowerbody_power_pospeak_n -0.110353
3 leg_angSpeed_sum_range 0.086581 leg_moment_sum_Gstd 0.117526 numofArt 0.106303
4 arm_bbmv 0.086062 leg_moment_sum_pospeak_std 0.117384 lowerbody_power_range -0.105257
5 lowerbody_accKin_sum_pospeak_n 0.085862 leg_moment_sum_change_Gmean 0.111512 leg_angAcc_sum_integral 0.104674
6 lowerbody_speedKin_sum_pospeak_n 0.085862 leg_moment_sum_change_pospeak_n 0.110876 lowerbody_power_Gstd -0.104521
7 leg_angSpeed_sum_Gstd 0.084579 arm_angSpeed_sum_Gstd -0.106664 lowerbody_angAcc_sum_range -0.104089
8 head_speedKin_sum_pospeak_std 0.083995 spine_moment_sum_integral -0.106053 spine_moment_sum_change_integral -0.103468
9 head_accKin_sum_pospeak_std 0.083995 lowerbody_moment_sum_change_range 0.105053 leg_angJerk_sum_integral 0.103358
10 head_bbmv 0.083770 arm_angSpeed_sum_range -0.104518 spine_moment_sum_Gmean -0.103014
11 leg_angJerk_sum_range 0.083486 head_angSpeed_sum_pospeak_std 0.102664 lowerbody_angSpeed_sum_Gstd -0.102256
12 leg_speedKin_sum_range 0.083428 head_angAcc_sum_pospeak_std 0.102664 leg_angSpeed_sum_integral 0.101966
13 head_accKin_sum_range 0.082507 spine_moment_sum_Gmean -0.102477 spine_moment_sum_integral -0.101497
14 head_jerkKin_sum_integral 0.082302 spine_moment_sum_change_Gstd 0.101442 pelvis_moment_sum_Gmean 0.100055
15 leg_duration 0.082105 lowerbody_moment_sum_Gstd 0.101287 leg_angJerk_sum_pospeak_mean 0.099828
16 lowerbody_duration 0.082105 lowerbody_moment_sum_change_Gstd 0.101006 lowerbody_angSpeed_sum_range -0.099430
17 head_accKin_sum_Gstd 0.082004 lowerbody_moment_sum_range 0.100636 spine_moment_sum_change_Gmean -0.099217
18 lowerbody_accKin_sum_range 0.081567 lowerbody_moment_sum_change_Gmean 0.100220 pelvis_moment_sum_change_Gstd -0.098610
19 leg_angAcc_sum_range 0.081196 head_moment_sum_change_Gmean 0.099953 leg_angJerk_sum_Gmean 0.097263


We now have a dataframe for the gesture modality in which each principal component (PC1-PC3) gets a pair of columns: one with the feature names and one with the loadings of those features on that component. The loadings here are the component weights (eigenvector coefficients) of each feature; the higher the absolute value of the loading, the more important the feature is for the principal component. Within each component, the features are sorted by the absolute value of their loadings in descending order.
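If loadings in the stricter sense of feature-component correlations are preferred (which is meaningful here because the data were standardized), the eigenvector weights in pca.components_ can be rescaled by the square root of the explained variance. A minimal sketch using the fitted gesture pca object:

# Rescale eigenvector weights to correlation-style loadings (data were standardized above)
corr_loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
# corr_loadings[j, i] approximates the correlation between feature j and PC(i+1)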

We will save the top contributors to a file so that we can load them into the XGBoost analysis. We will also save the cleaned data, which we can use for XGBoost modeling too.

# Save top contributors
results_df_ges.to_csv(curfolder + '\\datasets\\PCA_top_contributors_ges.csv', index=False)

# Save clean data
ges_clean.to_csv(curfolder + '\\datasets\\ges_clean_df.csv', index=False)

PCA: Vocalizations

For the remaining modalities, we will use the custom function pca_analysis, which performs all the steps we carried out for the gesture modality in one go.

Custom PCA function
def pca_analysis(df_clean):
    # Prepare data
    X = df_clean.iloc[:, :-1].values  # All columns except the last as features
    y = df_clean.iloc[:, -1].values   # Last column as target variable

    # Convert categorical target to numeric if necessary
    if y.dtype == 'object' or y.dtype.name == 'category':
        le = LabelEncoder()
        y = le.fit_transform(y)  # Converts categorical labels into numeric labels

    # Scale the data
    scaler = StandardScaler()
    X = scaler.fit_transform(X)    

    # PCA transformation
    pca = PCA()
    x_new = pca.fit_transform(X)

    # Select a few variables to display
    selected_vars = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  

    print('Biplot for the first 2 PCs:')
    PCA_biplot(x_new[:, 0:2], np.transpose(pca.components_[0:2, :]), y, selected_vars=selected_vars)

    PC1explained = pca.explained_variance_ratio_[0]*100
    PC2explained = pca.explained_variance_ratio_[1]*100
    PC3explained = pca.explained_variance_ratio_[2]*100

    print('PCs explained variance:')
    print(f'PC1: {PC1explained:.2f}%')
    print(f'PC2: {PC2explained:.2f}%')
    print(f'PC3: {PC3explained:.2f}%')

    # Getting most contributing features
    n_pcs = 3

    # Feature names (excluding target column)
    feature_names = df_clean.columns[:-1]  

    # Create storage for the ordered feature names and loadings
    results_dict = {}

    for i in range(n_pcs):
        # Get all features sorted by absolute loading values
        sorted_indices = np.abs(pca.components_[i]).argsort()[::-1]
        sorted_features = feature_names[sorted_indices]  # Feature names
        sorted_loadings = pca.components_[i, sorted_indices]  # Loadings

        # Store in dictionary
        results_dict[f'PC{i+1}'] = sorted_features.values
        results_dict[f'PC{i+1}_Loading'] = sorted_loadings

    # Convert dictionary to DataFrame
    results_df = pd.DataFrame(results_dict)

    return results_df

Before PCA, we need to clean the data such that only vocalization-relevant features are kept.

# These are answer related columns
conceptcols = ['answer', 'expressibility', 'response']

# These are vocalization related columns (plus the target), which we will keep
voccols = ['envelope', 'audio', 'f0', 'f1', 'f2', 'f3', 'env_', 'duration_voc', 'CoG', 'correction_info']

# Only the answer-related columns are deleted
colstodel = conceptcols 

# Clean the df
voc_clean = clean_df(data_voc, colstodel)

# Keep only columns whose names contain (at least partially) any of the words in voccols
colstokeep = [col for col in voc_clean.columns if any(word in col for word in voccols)]

# Keep only those columns
voc_clean = voc_clean[colstokeep]

voc_clean.head(15)
envelope_Gmean envelope_Gstd envelope_pospeak_mean envelope_pospeak_std envelope_pospeak_n envelope_integral envelope_range envelope_change_Gmean envelope_change_Gstd envelope_change_pospeak_mean ... f2_clean_pospeak_std f1_clean_pospeak_std f1_clean_vel_pospeak_std CoG_Gmean CoG_Gstd CoG_pospeak_mean CoG_pospeak_std CoG_integral CoG_range correction_info
0 0.288 1.438 0.297 2.393 6.0 337.541 7.888 0.016 0.788 0.042 ... 0.000 0.00 0.000 0.000 0.000 0.000 0.000 0.000 0.000 c0
1 0.805 2.106 -0.586 0.231 5.0 1417.739 7.888 -0.258 0.367 -0.336 ... 0.685 0.11 0.000 0.000 0.000 0.000 0.000 0.000 0.000 c1
2 0.281 1.843 -0.584 0.196 3.0 357.523 7.889 -0.243 0.505 -0.542 ... 0.000 0.00 0.000 0.000 0.000 0.000 0.000 0.000 0.000 c2
3 0.446 1.907 -0.641 0.118 3.0 359.095 7.858 -0.122 0.714 -0.572 ... 0.000 0.00 0.000 0.000 0.000 0.000 0.000 0.000 0.000 c0
4 0.804 1.927 0.003 0.728 7.0 916.490 7.888 0.037 0.890 -0.266 ... 1.624 0.00 0.000 0.000 0.000 0.000 0.000 0.000 0.000 c1
5 0.779 2.046 1.788 4.266 3.0 1116.787 7.853 -0.149 0.674 -0.237 ... 0.000 0.00 1.777 0.000 0.000 0.000 0.000 0.000 0.000 c2
6 0.317 1.751 -0.500 0.001 5.0 421.491 7.713 -0.144 0.857 -0.571 ... 0.000 0.00 0.000 0.000 0.000 0.000 0.000 0.000 0.000 c0
7 2.471 2.336 -0.592 0.000 2.0 781.091 7.686 0.463 0.957 -0.572 ... 0.000 0.00 0.000 0.000 0.000 0.000 0.000 0.000 0.000 c1
8 1.174 2.096 3.063 5.167 2.0 766.878 7.815 -0.077 0.772 -0.572 ... 0.000 0.00 0.000 0.000 0.000 0.000 0.000 0.000 0.000 c2
9 0.263 1.589 -0.472 0.430 8.0 447.570 7.889 0.098 0.896 -0.191 ... 0.000 0.00 0.000 0.000 0.000 0.000 0.000 0.000 0.000 c0
10 0.994 1.856 -0.492 0.000 8.0 1208.357 7.887 0.481 1.182 -0.572 ... 0.463 0.00 0.000 0.000 0.000 0.000 0.000 0.000 0.000 c1
11 1.233 2.069 -0.435 0.000 6.0 1144.241 7.884 0.564 1.231 -0.572 ... 0.000 0.00 0.000 -0.561 1.464 -0.270 0.232 -518.310 5.346 c2
12 0.000 0.000 0.000 0.000 0.0 0.000 0.000 0.000 0.000 0.000 ... 0.000 0.00 0.000 0.000 0.000 0.000 0.000 0.000 0.000 c0_only
13 2.322 2.804 0.395 1.654 6.0 2303.726 7.888 0.751 1.304 -0.238 ... 1.347 0.00 0.005 -3.272 0.375 -0.528 2.491 -3242.007 2.236 c0
14 3.156 2.213 2.712 3.700 4.0 3206.970 7.889 -0.137 0.681 -0.466 ... 0.372 0.00 0.000 0.000 0.000 0.000 0.000 0.000 0.000 c1

15 rows × 71 columns


Now we can use the function to perform the same PCA analysis, this time on the vocal features.

# Perform PCA analysis
results_df_voc = pca_analysis(voc_clean)

# Display
results_df_voc.head(20)
Biplot for the first 2 PCs:
PCs explained variance:
PC1: 23.02%
PC2: 15.73%
PC3: 12.03%

PC1 PC1_Loading PC2 PC2_Loading PC3 PC3_Loading
0 f3_clean_vel_pospeak_n -0.227090 f3_clean_vel_pospeak_mean -0.249509 CoG_integral -0.294560
1 f3_clean_pospeak_n -0.225100 f0_Gstd 0.245291 CoG_pospeak_std 0.279256
2 f2_clean_vel_pospeak_n -0.223553 f0_pospeak_n 0.244768 CoG_pospeak_n 0.274855
3 f2_clean_vel_range -0.209338 f0_range 0.240540 envelope_change_Gmean 0.254844
4 f2_clean_Gstd -0.207581 f3_clean_vel_pospeak_std 0.205998 CoG_Gmean -0.250294
5 f1_clean_vel_range -0.204405 f3_clean_pospeak_mean 0.202094 CoG_range 0.248384
6 f1_clean_range -0.203867 envelope_integral -0.172313 f1_clean_pospeak_mean -0.226109
7 f2_clean_range -0.201218 f2_clean_vel_pospeak_mean -0.169850 envelope_change_integral 0.219129
8 f2_clean_pospeak_n -0.199212 f1_clean_vel_integral -0.165138 CoG_Gstd 0.218615
9 VSA_f1f2 -0.198116 f2_clean_integral 0.165138 f2_clean_vel_pospeak_std 0.212810
10 f3_clean_range -0.173889 f3_clean_vel_integral -0.165138 envelope_change_Gstd 0.203972
11 f1_clean_vel_pospeak_n -0.171451 f2_clean_vel_integral -0.165138 envelope_Gstd 0.170814
12 f2_clean_vel_Gstd -0.170596 f1_clean_integral 0.165138 f2_clean_pospeak_mean -0.147427
13 f3_clean_vel_range -0.164741 f3_clean_integral 0.165138 f2_clean_pospeak_std 0.142625
14 f1_clean_pospeak_n -0.163461 f0_Gmean -0.164620 envelope_Gmean 0.140345
15 f1_clean_Gstd -0.160931 envelope_change_integral 0.163419 envelope_change_pospeak_n 0.131470
16 f1_clean_vel_Gstd -0.143258 f1_clean_vel_pospeak_n 0.163373 f1_clean_pospeak_n -0.128745
17 envelope_change_range -0.141037 f3_clean_Gmean 0.149983 f1_clean_Gstd -0.127111
18 f3_clean_Gstd -0.138584 envelope_Gstd -0.148065 f3_clean_pospeak_mean -0.126233
19 f3_clean_vel_pospeak_std -0.135056 envelope_Gmean -0.148007 envelope_change_range 0.124729


Now we save the top contributors and the cleaned data to files.

# Save top contributors
results_df_voc.to_csv(curfolder + '\\datasets\\PCA_top_contributors_voc.csv', index=False)

# Save clean data
voc_clean.to_csv(curfolder + '\\datasets\\voc_clean_df.csv', index=False)

PCA: Combined

Now we do the same for the combined (multimodal) condition.

# These are answer related columns
conceptcols = ['answer', 'expressibility', 'response']

# Only the answer-related columns are deleted
colstodel = conceptcols 

# Clean the df
multi_clean = clean_df(data_multi, colstodel)

multi_clean.head(15)
arm_duration arm_inter_Kin arm_inter_IK arm_bbmv lowerbody_duration lowerbody_inter_Kin lowerbody_inter_IK lowerbody_bbmv leg_duration leg_inter_Kin ... f2_clean_pospeak_std f1_clean_pospeak_std f1_clean_vel_pospeak_std CoG_Gmean CoG_Gstd CoG_pospeak_mean CoG_pospeak_std CoG_integral CoG_range correction_info
0 3988.0 26.197 26.728 8.472 2020.0 26.580 26.181 5.400 2020.0 26.759 ... 0.000 0.000 0.000 0.0 0.0 0.0 0.0 0.0 0.0 c0
1 3868.0 26.674 28.122 8.148 2870.0 28.319 28.291 6.487 2870.0 27.096 ... 0.000 0.000 0.000 0.0 0.0 0.0 0.0 0.0 0.0 c1
2 4014.0 26.449 27.565 8.465 874.0 23.434 23.040 2.861 874.0 22.945 ... 0.000 0.000 0.000 0.0 0.0 0.0 0.0 0.0 0.0 c2
3 4046.0 26.207 28.558 8.482 3750.0 28.421 28.796 6.124 3750.0 27.973 ... 0.000 0.000 0.000 0.0 0.0 0.0 0.0 0.0 0.0 c0
4 4708.0 26.502 28.717 8.309 4508.0 29.760 28.762 6.005 4508.0 28.679 ... 0.000 0.000 0.000 0.0 0.0 0.0 0.0 0.0 0.0 c1
5 4004.0 26.978 28.554 8.699 3598.0 28.801 28.616 5.723 3598.0 28.179 ... 0.000 0.000 0.000 0.0 0.0 0.0 0.0 0.0 0.0 c2
6 0.0 0.000 0.000 0.000 784.0 23.222 22.753 2.313 784.0 22.135 ... 0.867 0.000 0.000 0.0 0.0 0.0 0.0 0.0 0.0 c0
7 5248.0 28.973 31.658 10.919 4930.0 28.905 29.800 14.893 4930.0 28.250 ... 0.000 0.000 0.000 0.0 0.0 0.0 0.0 0.0 0.0 c0
8 1784.0 25.011 24.190 6.604 0.0 0.000 0.000 0.000 0.0 0.000 ... 0.000 0.000 0.000 0.0 0.0 0.0 0.0 0.0 0.0 c1
9 1284.0 24.278 24.516 5.054 1334.0 25.381 25.483 4.317 1334.0 24.478 ... 0.000 0.000 0.000 0.0 0.0 0.0 0.0 0.0 0.0 c2
10 1494.0 24.248 24.502 6.483 2944.0 29.247 28.630 5.378 2944.0 26.524 ... 0.000 0.018 0.812 0.0 0.0 0.0 0.0 0.0 0.0 c0_only
11 2486.0 25.718 26.035 6.420 780.0 25.181 21.982 2.271 780.0 22.482 ... 0.000 0.000 0.000 0.0 0.0 0.0 0.0 0.0 0.0 c0_only
12 2968.0 25.494 28.758 8.040 2750.0 28.377 27.107 4.885 2750.0 26.672 ... 0.618 0.000 0.552 0.0 0.0 0.0 0.0 0.0 0.0 c0_only
13 3850.0 27.666 27.348 7.419 3366.0 28.208 28.519 5.033 3366.0 28.230 ... 0.223 0.000 0.502 0.0 0.0 0.0 0.0 0.0 0.0 c0
14 3236.0 26.551 27.446 6.411 0.0 0.000 0.000 0.000 0.0 0.000 ... 0.197 0.429 0.347 0.0 0.0 0.0 0.0 0.0 0.0 c1

15 rows × 395 columns


Now we can use the function to perform the same PCA analysis, this time on all (i.e., both vocal and gestural) features.

# Perform PCA analysis
results_df_multi = pca_analysis(multi_clean)

# Display
results_df_multi.head(20)
Biplot for the first 2 PCs:
PCs explained variance:
PC1: 32.33%
PC2: 12.86%
PC3: 8.72%

PC1 PC1_Loading PC2 PC2_Loading PC3 PC3_Loading
0 lowerbody_speedKin_sum_range 0.084947 pelvis_moment_sum_change_pospeak_n -0.108598 pelvis_moment_sum_Gmean -0.131078
1 lowerbody_speedKin_sum_Gstd 0.084128 lowerbody_moment_sum_change_pospeak_n -0.102210 leg_moment_sum_change_pospeak_std 0.126453
2 lowerbody_accKin_sum_range 0.083025 envelope_change_range -0.095835 pelvis_moment_sum_change_Gmean 0.125537
3 leg_angAcc_sum_range 0.082578 leg_moment_sum_pospeak_n -0.095587 pelvis_moment_sum_integral -0.125378
4 bbmv_total 0.082215 leg_angAcc_sum_pospeak_mean 0.095287 leg_moment_sum_change_pospeak_mean 0.123384
5 leg_angJerk_sum_range 0.082004 leg_angSpeed_sum_pospeak_mean 0.095287 lowerbody_jerkKin_sum_pospeak_mean 0.120110
6 lowerbody_accKin_sum_Gstd 0.081860 lowerbody_moment_sum_change_pospeak_std -0.090847 leg_moment_sum_change_pospeak_n 0.114781
7 leg_angSpeed_sum_range 0.081268 leg_moment_sum_pospeak_std -0.090727 lowerbody_moment_sum_change_Gmean 0.114695
8 head_angJerk_sum_range 0.081148 envelope_change_pospeak_n -0.090655 leg_moment_sum_change_Gmean 0.114310
9 head_angJerk_sum_Gstd 0.080572 f0_range -0.090308 pelvis_moment_sum_pospeak_mean -0.113300
10 pelvis_moment_sum_range 0.080076 f1_clean_vel_pospeak_n -0.090298 pelvis_moment_sum_change_integral 0.111752
11 lowerbody_speedKin_sum_integral 0.079815 f3_clean_vel_pospeak_n -0.089649 f1_clean_vel_pospeak_std 0.107322
12 leg_angAcc_sum_Gstd 0.079559 leg_moment_sum_Gstd -0.089473 lowerbody_moment_sum_integral -0.105503
13 head_angJerk_sum_Gmean 0.079513 head_jerkKin_sum_pospeak_mean 0.088606 lowerbody_moment_sum_change_integral 0.105260
14 arm_angJerk_sum_integral 0.079455 f0_Gstd -0.088475 lowerbody_angAcc_sum_Gstd 0.104757
15 leg_power_integral 0.079208 spine_moment_sum_change_pospeak_n -0.087931 lowerbody_moment_sum_Gmean -0.103744
16 leg_angJerk_sum_Gstd 0.079082 lowerbody_moment_sum_change_Gstd -0.087358 lowerbody_jerkKin_sum_pospeak_std 0.102416
17 leg_angSpeed_sum_Gstd 0.078558 leg_jerkKin_sum_pospeak_mean 0.087297 leg_moment_sum_change_integral 0.102272
18 head_accKin_sum_integral 0.078307 lowerbody_power_pospeak_n -0.087118 lowerbody_angAcc_sum_range 0.098288
19 arm_angSpeed_sum_pospeak_std 0.078266 f2_clean_vel_pospeak_n -0.086946 lowerbody_moment_sum_change_range 0.095950
# Save top contributors
results_df_multi.to_csv(curfolder + '\\datasets\\PCA_top_contributors_multi.csv', index=False)

# Save clean data
multi_clean.to_csv(curfolder + '\\datasets\\multi_clean_df.csv', index=False)


In the XGBoost script, we will combine the ranking from Principal Component Analysis with the ranking of cumulative feature importance to obtain the most predictive features of effort from uncorrelated dimensions.
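As a rough, hypothetical illustration of what such a combination could look like: the placeholder xgb_importance dataframe and the mean-rank scheme below are assumptions for the sketch only; the actual procedure is defined in the XGBoost script.

# Hypothetical sketch: combine PC1 loading ranks with XGBoost importance ranks
pca_rank = (results_df_ges[['PC1', 'PC1_Loading']]
            .rename(columns={'PC1': 'feature'})
            .assign(pca_rank=lambda d: d['PC1_Loading'].abs().rank(ascending=False)))

# Placeholder for the XGBoost importances computed in the other script
xgb_importance = pd.DataFrame({'feature': pca_rank['feature'], 'gain': 0.0})
xgb_importance['xgb_rank'] = xgb_importance['gain'].rank(ascending=False)

# Average the two ranks; a lower mean rank means the feature is relevant in both analyses
combined = pca_rank.merge(xgb_importance, on='feature')
combined['mean_rank'] = combined[['pca_rank', 'xgb_rank']].mean(axis=1)
combined.sort_values('mean_rank').head(10)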