Extraction of effort-related features

Overview

In this script, we will extract effort-related features from the merged multimodal data we created for each trial in merging script.

Because we are dealing with time-varying data, we will extract number of statistics that characterize both instantaneous as well as cumulative nature of each (acoustic, motion, postural) timeseries in terms of effort.

Before we collect all relevant features, we normalize all time-varying features in the trial by minimum and maximum observed values for that feature for a participant. This is because we want to account for individual differences/strategies in using the whole range of motion and/or acoustic features. We therefore treat effort as a relative measure - but because we cannot access information about the minimum/maximum possible by a participant, we take the maximum and minimum that were observed within the whole experiment.

Code to prepare the environment

# Import packages
import os
import glob
import pandas as pd
from scipy.signal import find_peaks
import numpy as np
import antropy as ent
from scipy.spatial import ConvexHull
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import umap
import seaborn as sns
import builtins


curfolder = os.getcwd()

# Here we store our merged final timeseries data
mergedfolder = curfolder + "\\..\\05_finalMerge\\TS_final\\"
filestotrack = glob.glob(mergedfolder + "merged_anno*.csv")

# Here we store concept similarity data
conceptsimfolder = curfolder + "\\..\\06_ConceptSimilarity\\data\\"

# Here we store voice quality measures
voicefolder = curfolder + "\\..\\03_TS_processing\\TS_acoustics\\"

# Here we store other potentially useful data
metadatafolder = curfolder + "\\..\\00_RAWDATA\\"

# Here we store the final data
datafolder = curfolder + "\\Datasets\\"

# These are practice trials
practicetrials = ['0_2_0', '0_2_1', '0_2_19', '0_2_20', '0_2_21', '0_2_22', '0_2_38', '0_2_39', '0_2_40', '0_2_53', '0_2_54', '0_2_55', '0_2_56', '0_2_67', '0_2_68', '0_2_69', '0_2_70', '0_2_71', '0_2_72', '0_2_92', '0_2_93', '0_2_94', '0_2_95', '0_2_96', '0_2_97']

# We are interested only in data from part 2, and only those that are not practice trials
files_part2 = []  

# Collect to list only files that are in part 2 and are not practice trials
for file in filestotrack:
    exp_part = file.split("\\")[-1].split("_")[3]
    if exp_part == "2":
        if not any([x in file for x in practicetrials]):
            files_part2.append(file)

Preparing additional metadata

To get additional metadata about the trials, we also need to load in and pre-process some more data, such as:

Similarity between answer and a target concept (computed with ConceptNet). This dataframe has information about the similarity between an answer and performed target concept, computed as cosine similarity of embeddings retrieved from ConceptNet.
Expressibility of a concept. This dataframe has information about expressibility of each concept in a stimuli list. The ratings were acquired from a separate group of people that participated in an online study (Ćwiek et al. n.d.; Kadavá et al. 2024).
Response time. This dataframe has information about how long it took a guesser to provide an answer to a performance, measured from the end of the performance until pressing an Enter to confirm the answer.

This is the similarity dataframe

	word	answer	English	answer_en	cosine_similarity	mean_similarity	index
2	vrouw	zwanger	female	pregnant	0.265	4.929	2
3	vrouw	haar	female	hair	0.236	3.714	3
4	vrouw	douchen	female	to shower	0.025	0.571	4
5	verbranden	pijn	to burn	pain	0.119	5.643	5
6	verbranden	verbrand	to burn	burnt	0.883	8.714	6
7	verbranden	heet	to burn	hot	0.412	6.786	7
8	ik	ik	I	I	1.000	10.000	8
9	kauwen	kauwen	to chew	to chew	1.000	10.000	9
10	vliegen	vogel	to fly	bird	0.433	7.143	10
11	vliegen	vliegtuig	to fly	airplane	0.745	8.071	11
12	vliegen	vliegen	to fly	to fly	1.000	10.000	12
13	misschien	denken	maybe	to think	0.567	3.143	13
14	misschien	kiezen	maybe	to choose	0.159	3.643	14
15	misschien	twijfelen	maybe	to doubt	0.284	6.929	15
16	bliksem	graffiti	lightning	graffiti	0.012	0.571	16

This is the expressibility dataframe

	word	modality	fit
0	aanraken	gebaren	0.671
1	aanraken	combinatie	0.695
2	aanraken	geluiden	0.207
3	aarde	gebaren	0.539
4	aarde	combinatie	0.524
5	aarde	geluiden	0.195
6	ademen	gebaren	0.782
7	ademen	combinatie	0.808
8	ademen	geluiden	0.785
9	alles	gebaren	0.351
10	alles	combinatie	0.381
11	alles	geluiden	0.118
12	arm (lichaam)	gebaren	0.537
13	arm (lichaam)	combinatie	0.575
14	arm (lichaam)	geluiden	0.127

This is the response time dataframe

	TrialID	response_time_sec
2	0_2_2_p0	5.112
3	0_2_3_p0	14.576
4	0_2_4_p0	4.008
5	0_2_5_p0	3.793
6	0_2_6_p0	6.091
7	0_2_7_p0	3.719
8	0_2_8_p0	6.466
9	0_2_9_p0	5.328
10	0_2_10_p0	2.727
11	0_2_11_p0	2.912
12	0_2_12_p0	4.849
13	0_2_13_p0	7.510
14	0_2_14_p0	3.823
15	0_2_15_p0	4.358
16	0_2_16_p0	5.861

Z-scoring to get to relative effort

People can be expected to have a different range of effort, depending on factors including their bodily but also cognitive predispositions. To account for this, we will normalize all data by the means of z-scoring to transform all timeseries such that they represent a deviation from a person’s grand mean for that particulat signal (we will include also data from preceseding experiment that the same participants were performing before this experiment, see preregistration of data collection). We will now collect the means and standard deviations of each timeseries for each participant into a dictionary.

# Loop through all files and get maximum and minimum of all timeseries
zscore_dict = {}

sessions = ['0']
participants = ['p0', 'p1']

for session in sessions:
    for file in filestotrack:
        filesession = file.split("\\")[-1].split("_")[2]
        if filesession == session:
            filepart = file.split("\\")[-1].split("_")[5].split(".")[0]
            df_mega_p0 = pd.DataFrame()
            df_mega_p1 = pd.DataFrame()
            
            if filepart == 'p0':
                df_p0 = pd.read_csv(file)
                df_mega_p0 = pd.concat([df_mega_p0, df_p0])
            
            elif filepart == 'p1':
                df_p1 = pd.read_csv(file)
                df_mega_p1 = pd.concat([df_mega_p1, df_p1])
    
            dfs = [df_mega_p0, df_mega_p1]
            
            for df_mega in dfs:
                numcols = df_mega.select_dtypes(include=[np.number]).columns
                numcols = [x for x in numcols if 'time' not in x]

                for col in numcols:
                    mean = df_mega[col].mean()
                    std = df_mega[col].std()
                    pcnID = session + "_" + filepart
                    zscore_dict[pcnID + "_" + col] = (mean, std)

This is how the dictionary looks like:

{'0_p0_left_back': (1.1608994628793343, 0.01302082973841597),
 '0_p0_right_forward': (0.7593088217207401, 0.014045559688118315),
 '0_p0_right_back': (1.6113820217751105, 0.018753518395118775),
 '0_p0_left_forward': (1.2809749822476206, 0.01822508747590821),
 '0_p0_COPXc': (-1.132122668912945e-05, 0.00024849914094133066),
 '0_p0_COPYc': (-3.407442111079415e-05, 0.0003677341561461088),
 '0_p0_COPc': (0.0003731958284368783, 0.00024274134842270764),
 '0_p0_audio': (-6.589913964010284e-07, 2.195982615198923e-05),
 '0_p0_envelope': (0.09080125248013149, 0.12676158005356647),
 '0_p0_pelvis_tilt_moment': (-1.568033508388749, 5.282649177986357)}

Now we can check that the normalization of the sample timeseries worked as expected by plotting it against the timeseries before normalization

# Function to z-score
def zscore(x, col, pcnID, zscore_dict):
    mean, std = zscore_dict[pcnID + "_" + col]
    return (x - mean) / std

# Z-score sample
sample = files_part2[40]
df_sample = pd.read_csv(sample)
numcols = df_sample.select_dtypes(include=[np.number]).columns  # Normalize only numerical columns (except time)
numcols = [x for x in numcols if x != 'time']

trialID = df_sample['TrialID'].iloc[0]
pcnID = trialID.split('_')[0] + '_' + trialID.split('_')[-1]

# Compute all z-scored columns at once
zscore_columns = pd.DataFrame({
    col + '_z': df_sample[col].apply(lambda x: zscore(x, col, pcnID, zscore_dict))
    for col in numcols
})

# Concatenate the z-scored columns all at once to avoid fragmentation
df_sample = pd.concat([df_sample, zscore_columns], axis=1).copy()

# Check the resulting dataframe
df_sample.head(15)

	time	left_back	right_forward	right_back	left_forward	COPXc	COPYc	COPc	TrialID	FileInfo	...	f1_clean_z	f2_clean_z	f3_clean_z	f1_clean_vel_z	f2_clean_vel_z	f3_clean_vel_z	lowerbody_power_z	leg_power_z	head_power_z	arm_power_z
0	0.0	1.125486	0.804025	1.585540	1.305944	0.000518	0.000133	0.000535	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.454043	0.445892	-0.996332	4.290422
1	2.0	1.125361	0.804328	1.585462	1.305529	0.000438	0.000021	0.000438	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.436372	0.439448	-0.994162	4.274307
2	4.0	1.125285	0.804542	1.585431	1.305204	0.000359	-0.000064	0.000365	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.418702	0.433004	-0.991991	4.258192
3	6.0	1.125250	0.804683	1.585438	1.304959	0.000284	-0.000126	0.000311	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.401031	0.426560	-0.989821	4.242077
4	8.0	1.125250	0.804762	1.585474	1.304782	0.000213	-0.000169	0.000272	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.383361	0.420116	-0.987651	4.225962
5	10.0	1.125279	0.804791	1.585530	1.304662	0.000146	-0.000196	0.000245	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.365690	0.413672	-0.985480	4.209848
6	12.0	1.125331	0.804780	1.585600	1.304591	0.000086	-0.000210	0.000227	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.348020	0.407228	-0.983310	4.193733
7	14.0	1.125401	0.804738	1.585678	1.304559	0.000031	-0.000215	0.000217	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.330349	0.400784	-0.981139	4.177618
8	16.0	1.125485	0.804673	1.585759	1.304560	-0.000018	-0.000212	0.000213	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.312679	0.394340	-0.978969	4.161503
9	18.0	1.125579	0.804591	1.585838	1.304585	-0.000059	-0.000203	0.000212	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.295008	0.387896	-0.976799	4.145389
10	20.0	1.125679	0.804499	1.585913	1.304628	-0.000095	-0.000191	0.000213	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.277338	0.381452	-0.974628	4.129274
11	22.0	1.125781	0.804401	1.585981	1.304683	-0.000123	-0.000177	0.000215	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.259667	0.375008	-0.972458	4.113159
12	24.0	1.125884	0.804302	1.586039	1.304745	-0.000145	-0.000162	0.000217	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.241997	0.368564	-0.970288	4.097044
13	26.0	1.125984	0.804204	1.586087	1.304809	-0.000160	-0.000147	0.000218	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.224326	0.362120	-0.968117	4.080929
14	28.0	1.126080	0.804111	1.586124	1.304872	-0.000169	-0.000134	0.000216	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.206656	0.355676	-0.965947	4.064815

15 rows × 1059 columns

Extracting features: rationale

In the following part, we will be extracting summaries from the existing timeseries that we pre-processed in the previous scripts. The goal is to capture the effort-related features that might be relevant in a repair, i.e., when a participant has to correct themselves to regain understanding with their partner who previously did not understand the performed concept.

global mean and standard deviation
number, mean and standard deviation of the peaks (using function find_peaks from scipy.signal). For acceleration and moment change, we also calculate the measurements relating to negative peaks.
range of the values
integral

Additionally, we will compute sample entropy to measure the complexity of a timeseries.

Furthermore, we will utilize measurements that capture characteristics beyond the statistics sketched above. These include:

intermittency (as dimensionless jerk); used in Pouw et al. (2021);
bounding box of movement volume (i.e., gesture space); used in Żywiczyński et al. (2024);
vowel space area (VSA); used in Berisha et al. (2014);
motor complexity; computed as the slope of PCA; similar to Yan et al. (2020);
number of submovements; used in Pouw et al. (2021); Trujillo et al. (2018);
number of moving articulators.

Before proceeding to the feature extraction for all trials, we will first demonstrate the features on a single trial.

This is a dataframe associated with this trial, for now with both zscored and original columns

	time	left_back	right_forward	right_back	left_forward	COPXc	COPYc	COPc	TrialID	FileInfo	...	f1_clean_z	f2_clean_z	f3_clean_z	f1_clean_vel_z	f2_clean_vel_z	f3_clean_vel_z	lowerbody_power_z	leg_power_z	head_power_z	arm_power_z
0	0.0	1.125486	0.804025	1.585540	1.305944	0.000518	0.000133	0.000535	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.454043	0.445892	-0.996332	4.290422
1	2.0	1.125361	0.804328	1.585462	1.305529	0.000438	0.000021	0.000438	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.436372	0.439448	-0.994162	4.274307
2	4.0	1.125285	0.804542	1.585431	1.305204	0.000359	-0.000064	0.000365	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.418702	0.433004	-0.991991	4.258192
3	6.0	1.125250	0.804683	1.585438	1.304959	0.000284	-0.000126	0.000311	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.401031	0.426560	-0.989821	4.242077
4	8.0	1.125250	0.804762	1.585474	1.304782	0.000213	-0.000169	0.000272	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.383361	0.420116	-0.987651	4.225962
5	10.0	1.125279	0.804791	1.585530	1.304662	0.000146	-0.000196	0.000245	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.365690	0.413672	-0.985480	4.209848
6	12.0	1.125331	0.804780	1.585600	1.304591	0.000086	-0.000210	0.000227	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.348020	0.407228	-0.983310	4.193733
7	14.0	1.125401	0.804738	1.585678	1.304559	0.000031	-0.000215	0.000217	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.330349	0.400784	-0.981139	4.177618
8	16.0	1.125485	0.804673	1.585759	1.304560	-0.000018	-0.000212	0.000213	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.312679	0.394340	-0.978969	4.161503
9	18.0	1.125579	0.804591	1.585838	1.304585	-0.000059	-0.000203	0.000212	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.295008	0.387896	-0.976799	4.145389
10	20.0	1.125679	0.804499	1.585913	1.304628	-0.000095	-0.000191	0.000213	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.277338	0.381452	-0.974628	4.129274
11	22.0	1.125781	0.804401	1.585981	1.304683	-0.000123	-0.000177	0.000215	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.259667	0.375008	-0.972458	4.113159
12	24.0	1.125884	0.804302	1.586039	1.304745	-0.000145	-0.000162	0.000217	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.241997	0.368564	-0.970288	4.097044
13	26.0	1.125984	0.804204	1.586087	1.304809	-0.000160	-0.000147	0.000218	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.224326	0.362120	-0.968117	4.080929
14	28.0	1.126080	0.804111	1.586124	1.304872	-0.000169	-0.000134	0.000216	0_2_6_p0	p0_verbranden_combinatie_c1	...	NaN	NaN	NaN	NaN	NaN	NaN	5.206656	0.355676	-0.965947	4.064815

15 rows × 1059 columns

Basic statistics

Basic statistics capture properties of the timeseries in several dimensions: - globally (mean, standard deviation of the timeseries, range of the values, rate of the feature) - locally (number, mean and standard deviation of the peaks) - cummulatively (integral of the timeseries)

We also collect the time stamps of the peaks, as we will use them to compute the synchronization between the modalities.

First, to be able to apply the find_peaks function properly, we will need for each column a mean value of the timeseries across all trials. This mean value is then set as the minimum height of a peak for a timeseries

df_list = []  # List to store individual DataFrames before concatenation

# First, collect all part2 trials into one z-score
for file in files_part2:   
    df = pd.read_csv(file)

    # Get all numerical cols
    numcols = df.select_dtypes(include=[np.number]).columns
    numcols = [x for x in numcols if x != 'time']  # Exclude 'time'

    # Normalize by observed mean and sd for this participant
    TrialID = df['TrialID'].iloc[0]  
    pcnID = TrialID.split('_')[0] + '_' + TrialID.split('_')[-1] 

    normalized_cols = {
        col + '_z': (df[col] - zscore_dict[pcnID + "_" + col][0]) / zscore_dict[pcnID + "_" + col][1]
        for col in numcols
    }

    df_normed = pd.DataFrame(normalized_cols)
    
    # Add necessary columns
    df_normed['TrialID'] = df['TrialID']
    df_normed['time'] = df['time']
    
    df_list.append(df_normed)  # Store in list instead of concatenating immediately

# Concatenate all at once to avoid fragmentation
df_mega = pd.concat(df_list, ignore_index=True)

# Compute general means
df_means = {col + '_general_mean': df_mega[col].mean() for col in df_mega.select_dtypes(include=[np.number]).columns if col != 'time'}
# Compute general std
df_means.update({col + '_general_std': df_mega[col].std() for col in df_mega.select_dtypes(include=[np.number]).columns if col != 'time'})
df_means = pd.DataFrame([df_means]) 


# Check the resulting df_means
df_means.head(15)

	left_back_z_general_mean	right_forward_z_general_mean	right_back_z_general_mean	left_forward_z_general_mean	COPXc_z_general_mean	COPYc_z_general_mean	COPc_z_general_mean	audio_z_general_mean	envelope_z_general_mean	envelope_change_z_general_mean	...	f1_clean_z_general_std	f2_clean_z_general_std	f3_clean_z_general_std	f1_clean_vel_z_general_std	f2_clean_vel_z_general_std	f3_clean_vel_z_general_std	lowerbody_power_z_general_std	leg_power_z_general_std	head_power_z_general_std	arm_power_z_general_std
0	-1.8352	1.496059	-1.401405	1.723082	0.039729	0.062733	1.667578	3.723616	0.09084	-0.15375	...	1.596439	1.299107	1.054862	1.94965	1.3957	0.974311	0.857923	1.595838	1.799934	1.763681

1 rows × 1050 columns

Now we can collect basic statistics.

Note that when we look for peaks, we take the mean value for that signal over all trials. Additionally, in timeseries that have both positive and negative values (e.g., acceleration), we will look for both positive and negative peaks.

Custom functions

# Function to calculate stats for a feature
def get_statistics(cols, df, subdf, dictionary, means_df):
    for col in cols:
        Gmean = subdf[col].mean()   # General mean
        Gstd = subdf[col].std()     # General std

        # If the col is 2nd derivative of position data, joint kinematics, or moments, we will calculate both positive and negative peaks (by reversing the sign of the data)
        if 'acc' in col or 'Acc' in col or 'moment_change' in col:
            ts_mean = means_df[col + '_z_general_mean'].values[0] # get the overall mean of this signal
            height = ts_mean - 2*df_means[col + '_z_general_std'].values[0] # height is mean - 1*std
            pospeaks, _ = find_peaks(subdf[col], height=height, distance=100)
            pospeaks_values = subdf.loc[pospeaks, col]
            pospeaks_times = subdf.loc[pospeaks, "time"].tolist()

            negpeaks, _ = find_peaks(-subdf[col], height=-height, distance=50, prominence=0.4)
            negpeaks_values = subdf.loc[negpeaks, col]
            negpeaks_times = subdf.loc[negpeaks, "time"].tolist()

            negpeak_mean = negpeaks_values.mean()
            negpeak_std = negpeaks_values.std()
            negpeak_n = len(negpeaks_values)

        else:    
            ts_mean = means_df[col + '_z_general_mean'].values[0] # get the overall mean of this signal
            height = ts_mean - 2*df_means[col + '_z_general_std'].values[0] # height is mean - 1*std
            pospeaks, _ = find_peaks(subdf[col], height=height, distance=50, prominence=0.4)

            pospeaks_values = df.loc[pospeaks, col] # Peak values
            pospeaks_times = df.loc[pospeaks, "time"].tolist() # Peak time

            # Plot if needed
            # peak_times = subdf.iloc[pospeaks]['time']
            # peak_values = subdf.iloc[pospeaks][col]
            # plt.plot(subdf['time'], subdf[col], label="Signal")
            # plt.scatter(peak_times, peak_values, color='r', label="Pos Peaks", marker="x")
            # plt.title(col)
            # plt.legend()
            # plt.show()

            pospeak_mean = pospeaks_values.mean() # Peak mean
            pospeak_std = pospeaks_values.std()   # Peak std
            pospeak_n = len(pospeaks_values)      # Number of peaks

            negpeak_mean = None
            negpeak_std = None
            negpeak_n = None
            negpeaks_times = None

        integral = np.trapz(subdf[col]) # Integral
        range = subdf[col].max() - subdf[col].min() # Range
        
        # Save all in dictionary
        dictionary[col] = [Gmean, Gstd, pospeak_mean, pospeak_std, pospeak_n, pospeaks_times, negpeak_mean, negpeak_std, negpeak_n, negpeaks_times, integral, range]

    return dictionary

# Function to adapt row
def adapt_row(row_to_process):
    for col in row_to_process.columns:
        # Calculate for all expcet some already calculated measures
        if 'inter' not in col and 'sampEn' not in col and 'bbmv' not in col and 'duration' not in col:
            row_to_process[col + '_Gmean'] = row_to_process[col].apply(lambda x: x[0])
            row_to_process[col + '_Gstd'] = row_to_process[col].apply(lambda x: x[1])
            row_to_process[col + '_pospeak_mean'] = row_to_process[col].apply(lambda x: x[2])
            row_to_process[col + '_pospeak_std'] = row_to_process[col].apply(lambda x: x[3])
            row_to_process[col + '_pospeak_n'] = row_to_process[col].apply(lambda x: x[4])
            row_to_process[col + '_pospeak_times'] = row_to_process[col].apply(lambda x: x[5])
            row_to_process[col + '_negpeak_mean'] = row_to_process[col].apply(lambda x: x[6])
            row_to_process[col + '_negpeak_std'] = row_to_process[col].apply(lambda x: x[7])
            row_to_process[col + '_negpeak_n'] = row_to_process[col].apply(lambda x: x[8])
            row_to_process[col + '_negpeak_times'] = row_to_process[col].apply(lambda x: x[9])
            row_to_process[col + '_integral'] = row_to_process[col].apply(lambda x: x[10])
            row_to_process[col + '_range'] = row_to_process[col].apply(lambda x: x[11])

    # Now keep only this newly created cols
    row_final = row_to_process[[col for col in row_to_process.columns if any(x in col for x in ['Gmean', 'Gstd', 'pospeak_mean', 'pospeak_std', 'pospeak_n', 'sampen', 'inter', 'integral', 'pospeak_times', 'bbmv', 'range', 'duration', 'negpeak_mean', 'negpeak_std', 'negpeak_n', 'negpeak_times'])]]

    # Get rid of cols with NaNs
    row_final = row_final.dropna(axis=1, how='all')
    
    return row_final

This is a summary of features for this single trial

	RWrist_speed_Gmean	RWrist_speed_Gstd	RWrist_speed_pospeak_mean	RWrist_speed_pospeak_std	RWrist_speed_pospeak_n	RWrist_speed_pospeak_times	RWrist_speed_integral	RWrist_speed_range
0	47.434854	34.936351	77.29624	46.952825	9	[316.0, 1100.0, 1732.0, 2232.0, 2850.0, 3216.0...	125349.559904	135.68402

Bounding box of movement volume

Bounding box of movement volume (BBMV) is a measure of the space covered by the participant’s gestures (Żywiczyński et al. 2024). It is computed as the difference between the maximum and minimum values of the timeseries.

Custom functions

# Function to calcuate bounding box of movement volume     
def get_bbmv(df, group, kp_dict):
    coordinates = [col for col in df.columns if any(x in col for x in ['_x', '_y', '_z'])]

    # Prepare columns that belong to a group (e.g., arm)
    kp = kp_dict[group]
    colstoBBMV = [col for col in coordinates if any(x in col for x in kp)]

    # Keep only unique names without coordinates
    kincols = list(set([col.split('_')[0] for col in colstoBBMV]))

    bbmvs = {}
    for col in kincols:
        # Span of x, y, z
        x_span = df[col + '_x'].max() - df[col + '_x'].min()
        y_span = df[col + '_y'].max() - df[col + '_y'].min()
        z_span = df[col + '_z'].max() - df[col + '_z'].min()

        # Calculate BBMV
        bbmv = x_span * y_span * z_span
        bbmvs[col] = bbmv
        
    # Get the sum for the whole group
    bbmv_sum = sum(bbmvs.values())

    # Natural logarithm
    bbmv_sum = np.log(bbmv_sum) 

    return bbmv_sum

Using custom function compute_bounding_box, we will compute BBMV for the sample timeseries. (Note that the values are log-transformed.)

BBMV of the current trial:  8.089072657442488

Symmetry of arm movement / multi-movement asymmetry of submovements

Symmetry of arm movement is a measure of how much the participant uses both arms. Symmetrical movement is expected to be less effortful than asymmetrical movement, as asymmetrical movement might require more cognitive resources to coordinate the two arms. We will compute symmetry as the correlation between left and right arm trajectories, similar to Xiong, Quek, and Mcneill (2002). If it’s close to 1, it means that the participant uses both arms in a symmetrical way.


def compute_ArmCoupling(df, keypoints=['Wrist'], dimensions=['x', 'y', 'z'], absolute=True, z=False):
    correlations = []
    
    for kp in keypoints:
        for dim in dimensions:
            if z:
                left_col = f"L{kp}_{dim}_z"
                right_col = f"R{kp}_{dim}_z"
            else:
                left_col = f"L{kp}_{dim}"
                right_col = f"R{kp}_{dim}"
                
            if left_col in df.columns and right_col in df.columns:
                corr = np.corrcoef(df[left_col], df[right_col])[0, 1]  # Pearson correlation
                if absolute:
                    corr = np.abs(corr)
                correlations.append(corr)
    
    return np.nanmean(correlations)  # Average correlation across keypoints and dimensions

def compute_multimovementAssymetry(df, z=False):
    joints = ['elbow_flex', 'wrist_flex', 'wrist_dev']
    
    integral_diff_left = 0
    integral_diff_right = 0
    joint_integral_differences = []

    for joint in joints:
        if z:
            joint_r_speed = f"{joint}_r_speed_z"
            joint_l_speed = f"{joint}_l_speed_z"
        else:
            joint_r_speed = f"{joint}_r_speed"
            joint_l_speed = f"{joint}_l_speed"
            
        if joint_r_speed in df.columns and joint_l_speed in df.columns:
            # Compute the integral of velocity (sum of speed over time) for left and right joints
            integral_velocity_left = np.sum(np.abs(df[joint_l_speed]))  # Sum of absolute speed
            integral_velocity_right = np.sum(np.abs(df[joint_r_speed]))  # Sum of absolute speed

            # Compute the difference in integral of angular velocity between left and right 
            # the more +, the more distance is left is travelling), 
            # but note that because od z-scoring, 0 does not reflect perfect symmetry in distance travelled by both hands
            joint_integral_diff = integral_velocity_left - integral_velocity_right
            joint_integral_differences.append(joint_integral_diff)

            # Accumulate for total integral difference calculation
            integral_diff_left += integral_velocity_left
            integral_diff_right += integral_velocity_right

    # Compute the overall difference in integrated velocities
    total_integral_difference = np.mean(joint_integral_differences)  # Mean of differences across joints

    return {
        'mean_integral_difference': total_integral_difference,
        'joint_integral_differences': joint_integral_differences,
        'total_left_integral_velocity': integral_diff_left,
        'total_right_integral_velocity': integral_diff_right
    }

# Example usage
coupling_score = compute_ArmCoupling(df_sample, keypoints=['Wrist', 'Elbow'], dimensions=['x', 'y', 'z'], z=True)
arm_asymmetry = compute_multimovementAssymetry(df_sample, z=True)

# Plot the trajectories
plt.figure(figsize=(10, 8))
ax = plt.axes(projection='3d')

# Plot the left arm
ax.plot3D(df_sample['LWrist_x'], df_sample['LWrist_y'], df_sample['LWrist_z'], 'r')
ax.plot3D(df_sample['LElbow_x'], df_sample['LElbow_y'], df_sample['LElbow_z'], 'r', linestyle='--')

# Plot the right arm
ax.plot3D(df_sample['RWrist_x'], df_sample['RWrist_y'], df_sample['RWrist_z'], 'b')
ax.plot3D(df_sample['RElbow_x'], df_sample['RElbow_y'], df_sample['RElbow_z'], 'b', linestyle='--')

plt.show()

# Print the symmetry score
print("Coupling Score:", coupling_score)
print("Arm Asymmetry:", arm_asymmetry['mean_integral_difference'])

Coupling Score: 0.42867744643596634
Arm Asymmetry: -7651.093615100171

Intermittency

Intermittency characterizes the ‘unsmoothess’ of the movement. We follow Pouw et al. (2021)’s adaptation of Hogan and Sternad (2009) and compute dimensionless jerk measure. This measure is computed as the integral of the second derivative of the speed (i.e., jerk) squared multiplied by the duration cubed over the maximum squared velocity.

Custom functions

# Function to calculate intermittency 
def get_intermittency(jerk_values, speed_values):
    """Calculate the dimensionless smoothness measure using precomputed smoothed jerk and speed."""
    smoothed_jerk = jerk_values
    speed = speed_values
    
    if not np.all(speed == 0):
        integrated_squared_jerk = np.sum(smoothed_jerk ** 2)
        max_squared_speed = np.max(speed ** 2)
        D3 = len(speed) ** 3
        jerk_dimensionless = integrated_squared_jerk * (D3 / max_squared_speed)
        smoothness = jerk_dimensionless
    else:
        smoothness = np.nan

    return smoothness

We can calculate intermittency for kinematics as well as for the joint angles.

Intermittency, kinematics:  29.512725544086184
Intermittency, joint angles:  31.333161498897482

Vowel space area

Vowel space area (VSA) is defined as the are of the quadrilateral formed by the four corner vowels /i/, /a/, /u/ and /ɑ/ in the F1-F2 space. It is a measure of the articulatory ‘working’ space and researchers often use it to characterize speech motor control (Berisha et al. 2014).

Despite the fact that we are not dealing with speech data, we can still compute VSA for the first two formants. Following Daniel R. McCloy’s tutorial for R, we will compute VSA for the sample timeseries as the area of the convex hull encompassing all tokens.

# Function to calculate vocal space area (as a convex hull=
def getVSA(f1, f2, plot=False):
   
    # 2d convex hull
    points = np.array([f1, f2]).T
    hull_2d = ConvexHull(points)
    volume_2d = hull_2d.volume
    volume_2d = np.log(volume_2d) # natural log

    if plot:
        plt.figure()
        plt.plot(points[:, 0], points[:, 1], 'o', markersize=3, label='Formant Points')
        
        for simplex in hull_2d.simplices:
            plt.plot(points[simplex, 0], points[simplex, 1], 'r-', linewidth=1)
        
        plt.xlabel('F1 (Hz)')
        plt.ylabel('F2 (Hz)')
        plt.gca().invert_yaxis()
        plt.legend()
        plt.title('2D Convex Hull of Vowel Space Area')
        plt.show()
    
    return volume_2d

# Calculate on sample
f1_clean = df_sample['f1_clean_z'].dropna()
f2_clean = df_sample['f2_clean_z'].dropna()
vsa = getVSA(f1_clean, f2_clean, plot=True)

Motor complexity

As a part of exploratory analysis, we also want to assess the motor complexity of the movements as a proxy of coordinative - therefore cognitive - effort relating to the signal.

Using OpenSim (Seth et al. 2018), we have been able to extract joint angle measurements for our data. We can use them to assess a motor complexity or alternatively, motor performance of a participant’s movements. We will use Principal Component Analysis (PCA) to assess the motor complexity. We will look at the number of principal components that explain a certain amount of variance in the data and the slope of the explained variance. This will give us a measure of how many (uncorrelated) dimensions/modes cover the most (80 or 95%) of the features of a signal, therefore how complex the movement is in terms of its coordination patterns (Daffertshofer et al. 2004; Yan et al. 2020).

# Function to get the PCA
def get_PCA(df, plot=False):
    # Step 1: Standardize the Data
    scaler = StandardScaler()
    standardized_data = scaler.fit_transform(df)
    
    # Step 2: Apply PCA
    pca = PCA()
    pca.fit(standardized_data)
    
    # Step 3: Explained Variance
    explained_variance = pca.explained_variance_ratio_
    cumulative_explained_variance = explained_variance.cumsum()

    # Step 4: Find the number of components that explain 95% variance
    n_components_for_80_variance = np.argmax(cumulative_explained_variance >= 0.8) + 1
    
    # Step 5: Compute the slope for the first n_components_for_95_variance
    if n_components_for_80_variance > 1:
        slope_80 = (cumulative_explained_variance[n_components_for_80_variance-1] - cumulative_explained_variance[0]) / (n_components_for_80_variance - 1)
    else:
        slope_80 = cumulative_explained_variance[0]  # Only one component case

    # 95% variance
    n_components_for_95_variance = np.argmax(cumulative_explained_variance >= 0.95) + 1
    
    # Step 5: Compute the slope for the first n_components_for_95_variance
    if n_components_for_95_variance > 1:
        slope_95 = (cumulative_explained_variance[n_components_for_95_variance-1] - cumulative_explained_variance[0]) / (n_components_for_95_variance - 1)
    else:
        slope_95 = cumulative_explained_variance[0]  # Only one component case


    # Set Seaborn style for consistency
    sns.set_style("whitegrid")  

    main_color = "black"   # Black for main cumulative variance line
    red_color = "red"      # Basic red for 80% variance threshold
    blue_color = "blue"    # Basic blue for 95% variance threshold

    if plot:

        # get rid of the grid
        plt.grid(False)

        # gwt rid of the lines in the backroug
        plt.gca().spines['top'].set_visible(False)
        
        # Adjust x-axis indexing: Components start from 1, not 0
        x_values = range(1, len(cumulative_explained_variance) + 1)
        
        # Plot cumulative explained variance (black line with markers)
        plt.plot(x_values, cumulative_explained_variance, marker='o', linestyle='-', markersize=4, linewidth=2, color=main_color, alpha=0.9, label='Cumulative Explained Variance')
        
        # Add vertical dashed lines for 80% and 95% variance thresholds
        plt.axvline(x=n_components_for_80_variance, color=red_color, linestyle='--', 
                    linewidth=2, alpha=0.9)
        
        plt.axvline(x=n_components_for_95_variance, color=blue_color, linestyle='--', 
                    linewidth=2, alpha=0.9)

        # Labels for variance thresholds on the graph
        plt.text(n_components_for_80_variance + 0.3, 0.8, "80% Variance", 
                color=red_color, fontsize=12, fontweight='bold')
        
        plt.text(n_components_for_95_variance + 0.3, 0.95, "95% Variance", 
                color=blue_color, fontsize=12, fontweight='bold')

        # Labels and title
        plt.xlabel('Number of Principal Components', fontsize=16, color='black')
        plt.ylabel('Cumulative Explained Variance', fontsize=16, color='black')
        
        # Ensure x-axis starts at 1 and labels are correctly spaced
        plt.xticks(x_values)  # Explicitly setting x-ticks
        
        # Subtle grid
        plt.grid(True, alpha=0.3)  
        
        # Make figure smaller
        plt.gcf().set_size_inches(6, 4)

        # Tight layout and saving for consistency
        plt.tight_layout()

        #xlim
        plt.xlim(0, len(cumulative_explained_variance) +1)  # Adjust x-axis limits
    
        plt.show()
        #plt.close()

    return n_components_for_80_variance, slope_80, n_components_for_95_variance, slope_95

# Calculate on sample
PCAcolls_all = ['pelvis_tilt_z', 'pelvis_list_z', 'pelvis_rotation_z', 'pelvis_tx_z',
       'pelvis_ty_z', 'pelvis_tz_z', 'hip_flexion_r_z', 'hip_adduction_r_z',
       'hip_rotation_r_z', 'knee_angle_r_z', 'knee_angle_r_beta_z', 'ankle_angle_r_z',
       'subtalar_angle_r_z', 'hip_flexion_l_z', 'hip_adduction_l_z',
       'hip_rotation_l_z', 'knee_angle_l_z', 'knee_angle_l_beta_z', 'ankle_angle_l_z',
       'subtalar_angle_l_z', 'L5_S1_Flex_Ext_z', 'L5_S1_Lat_Bending_z',
       'L5_S1_axial_rotation_z', 'L4_L5_Flex_Ext_z', 'L4_L5_Lat_Bending_z',
       'L4_L5_axial_rotation_z', 'L3_L4_Flex_Ext_z', 'L3_L4_Lat_Bending_z',
       'L3_L4_axial_rotation_z', 'L2_L3_Flex_Ext_z', 'L2_L3_Lat_Bending_z',
       'L2_L3_axial_rotation_z', 'L1_L2_Flex_Ext_z', 'L1_L2_Lat_Bending_z',
       'L1_L2_axial_rotation_z', 'L1_T12_Flex_Ext_z', 'L1_T12_Lat_Bending_z',
       'L1_T12_axial_rotation_z', 'neck_flexion_z', 'neck_bending_z',
       'neck_rotation_z', 'arm_flex_r_z', 'arm_add_r_z', 'arm_rot_r_z', 'elbow_flex_r_z',
       'pro_sup_r_z', 'wrist_flex_r_z', 'wrist_dev_r_z', 'arm_flex_l_z', 'arm_add_l_z',
       'arm_rot_l_z', 'elbow_flex_l_z', 'pro_sup_l_z', 'wrist_flex_l_z',
       'wrist_dev_l_z']

PCAcolls_arm = ['arm_flex_r_z', 'arm_add_r_z', 'arm_rot_r_z', 'elbow_flex_r_z', 'pro_sup_r_z',
       'wrist_flex_r_z', 'wrist_dev_r_z', 'arm_flex_l_z', 'arm_add_l_z', 'arm_rot_l_z',
       'elbow_flex_l_z', 'pro_sup_l_z', 'wrist_flex_l_z', 'wrist_dev_l_z']

# Calculate PCA for all
subdf = df_sample
PCA_all = get_PCA(subdf[PCAcolls_all], plot=True)
print('PCA for whole-body movement complexity', PCA_all)

# Calculate PCA for arms
subdf = df_sample
PCA_arms = get_PCA(subdf[PCAcolls_arm], plot=True)
print('PCA for arm movement complexity', PCA_arms)

PCA for whole-body movement complexity (3, 0.18143457258058265, 6, 0.1009103456272205)
PCA for arm movement complexity (3, 0.11732507841208828, 5, 0.09053436514831179)

Extracting features for all trials

Now we will loop through each file (representing a trial), extract the features, add all necessary metadata and save each trial in a overall dataframe. Before any feature is collected, we min-max normalize all data to get to relative feature of effort per participant

import warnings
warnings.filterwarnings("ignore")

# Initialize the dataframe
features_df = pd.DataFrame()

##########################
##### preparations #######
##########################

# These are the names of the columns with movement annotations
movcols = ['upper_mov', 'arms_mov', 'lower_mov', 'head_mov']

# These are our four dimensions for which we have aggregated movement data
groups = ['arm', 'lowerbody', 'leg', 'head']

# Mapping between dimension and annotation column
movcol_keys = {'arm': 'arms_mov', 'lowerbody': 'lower_mov', 'leg': 'lower_mov', 'head': 'head_mov', 'upperbody': 'upper_mov'}

# These are groups for BBMV
kp_arms = ['RWrist', 'RElbow', 'RShoulder', 'LWrist', 'LElbow', 'LShoulder']
kp_lower = ['RAnkle', 'RKnee', 'LAnkle', 'LKnee']
kp_legs = ['RAnkle', 'RKnee', 'LAnkle', 'LKnee']
kp_head = ['Head']
kp_keys = {'arm': kp_arms, 'lowerbody': kp_lower, 'leg': kp_legs, 'head': kp_head}

# These are cols associated with balance
balancecols = ['COPc', 'spine_moment_sum', 'pelvis_moment_sum', 'spine_moment_sum_change', 'pelvis_moment_sum_change']

# These are cols associated with acoustics
voccols = ['envelope', 'envelope_change', 'f0', 'f1_clean', 'f1_clean_vel', 'f2_clean', 'f2_clean_vel', 'f3_clean', 'f3_clean_vel', 'CoG']

PCAcolls_all = ['pelvis_tilt', 'pelvis_list', 'pelvis_rotation', 'pelvis_tx',
       'pelvis_ty', 'pelvis_tz', 'hip_flexion_r', 'hip_adduction_r',
       'hip_rotation_r', 'knee_angle_r', 'knee_angle_r_beta', 'ankle_angle_r',
       'subtalar_angle_r', 'hip_flexion_l', 'hip_adduction_l',
       'hip_rotation_l', 'knee_angle_l', 'knee_angle_l_beta', 'ankle_angle_l',
       'subtalar_angle_l', 'L5_S1_Flex_Ext', 'L5_S1_Lat_Bending',
       'L5_S1_axial_rotation', 'L4_L5_Flex_Ext', 'L4_L5_Lat_Bending',
       'L4_L5_axial_rotation', 'L3_L4_Flex_Ext', 'L3_L4_Lat_Bending',
       'L3_L4_axial_rotation', 'L2_L3_Flex_Ext', 'L2_L3_Lat_Bending',
       'L2_L3_axial_rotation', 'L1_L2_Flex_Ext', 'L1_L2_Lat_Bending',
       'L1_L2_axial_rotation', 'L1_T12_Flex_Ext', 'L1_T12_Lat_Bending',
       'L1_T12_axial_rotation', 'neck_flexion', 'neck_bending',
       'neck_rotation', 'arm_flex_r', 'arm_add_r', 'arm_rot_r', 'elbow_flex_r',
       'pro_sup_r', 'wrist_flex_r', 'wrist_dev_r', 'arm_flex_l', 'arm_add_l',
       'arm_rot_l', 'elbow_flex_l', 'pro_sup_l', 'wrist_flex_l',
       'wrist_dev_l']

PCAcolls_arm = ['arm_flex_r', 'arm_add_r', 'arm_rot_r', 'elbow_flex_r', 'pro_sup_r',
       'wrist_flex_r', 'wrist_dev_r', 'arm_flex_l', 'arm_add_l', 'arm_rot_l',
       'elbow_flex_l', 'pro_sup_l', 'wrist_flex_l', 'wrist_dev_l']


##########################
####### main loop ########
##########################

for file in files_part2:
    print('working on file: ', file)
    df = pd.read_csv(file)

    # Get metadata
    df["concept"] = df["FileInfo"].apply(lambda x: x.split("_")[1])
    df["modality"] = df["FileInfo"].apply(lambda x: x.split("_")[2])
    df["correction"] = df["FileInfo"].apply(lambda x: x.split("_")[3])
    df.drop(columns=["FileInfo"], inplace=True)

    ######################################
    ############## zscoring ##############
    ######################################
    
    # Get all numerical cols
    numcols = df.select_dtypes(include=[np.number]).columns
    numcols = [x for x in numcols if x != 'time']           # not time

    trialID = df['TrialID'][0]
    pcnID = trialID.split('_')[0] + '_' + trialID.split('_')[-1]

    normalized_cols = {}

    for col in numcols:
        mean, std = zscore_dict[pcnID + "_" + col]
        normalized_cols[col] = df[col].apply(lambda x: (x - mean) / std)

    df_normed = pd.DataFrame(normalized_cols)


    # get all cols that are in df but not numcols
    nonnumcols = [x for x in df.columns if x not in numcols]

    # add them to the normalized df
    for col in nonnumcols:
        df_normed[col] = df[col]

    ##########################
    ##### movement feat ######
    ##########################

    # Look only into portion of the data where is movement
    subdf = pd.DataFrame()
    subdf = df_normed[df_normed["movement_in_trial"] == "movement"]
    subdf = subdf.reset_index(drop=True)

    if subdf.shape[0] > 0:
        duration_mov = subdf["time"].iloc[-1] - subdf["time"].iloc[0] # duration
        submovcols = [col for col in movcols if subdf[col].values[0] == "movement"]
        numofArt = len(submovcols)  # number of articulators

        # Prepare dictionary to store the values
        body_sumfeat = {}

        # Loop over our four dimensions-groups and collect features
        for group in groups:

            subsubdf = pd.DataFrame()
            subsubdf = subdf[subdf[movcol_keys[group]] == "movement"]
            subsubdf = subsubdf.reset_index(drop=True)

            # If there is no movement, we set lal values to 0 and continue
            if subsubdf.shape[0] == 0:
                body_sumfeat_row = pd.DataFrame()
                body_sumfeat_row["duration_group"] = 0
                continue
            else:
                # Get the columns that contain group and sum/power (these are cols with aggregated measures)
                cols = [col for col in subsubdf.columns if group in col and 'sum' in col or 'power' in col]
                duration_group = subsubdf["time"].iloc[-1] - subsubdf["time"].iloc[0] # duration
            
                # Get statistics
                body_sumfeat = get_statistics(cols, df_normed, subsubdf, body_sumfeat, df_means)
                body_sumfeat[group + "_duration"] = duration_group
    
                # Calculate intermittency for kinematic measures (from Pose2sim)
                speed = subsubdf[group + "_speedKin_sum"].values
                jerk = subsubdf[group + "_jerkKin_sum"].values
                intermittency = get_intermittency(jerk, speed)
                intermittency = np.log(intermittency) # natural log
                body_sumfeat[group + "_inter_Kin"] = intermittency

                # Calculate intermittency for inverse kinematic measures (from OpenSim)
                speed = subsubdf[group + "_angSpeed_sum"].values
                jerk = subsubdf[group + "_angJerk_sum"].values
                intermittency = get_intermittency(jerk, speed)
                intermittency = np.log(intermittency) # natural log
                body_sumfeat[group + "_inter_IK"] = intermittency
                
                # Get bounding box of movement volume (BBMV)
                bbmv_sum = get_bbmv(subsubdf, group, kp_keys)          
                body_sumfeat[group + "_bbmv"] = bbmv_sum
            
    else:
        duration_mov = 0
        numofArt = 0
        body_sumfeat = {}

    # Convert dictionary to dataframe
    body_sumfeat_row = pd.DataFrame({key: [value] for key, value in body_sumfeat.items()})

    # Adapt
    body_sumfeat_row = adapt_row(body_sumfeat_row)

    # Add duration and numofArt
    body_sumfeat_row['duration_mov'] = duration_mov
    body_sumfeat_row['numofArt'] = numofArt

    # Get BBMV for the whole body (so concatenate all the groups)
    bbmv_total = body_sumfeat_row[[col for col in body_sumfeat_row.columns if 'bbmv' in col and 'rate' not in col]].sum(axis=1)
    body_sumfeat_row['bbmv_total'] = bbmv_total

    ####################
    ###### balance #####
    ####################

    subdf = pd.DataFrame()
    subdf = df_normed[df_normed["movement_in_trial"] == "movement"]
    subdf = subdf.reset_index(drop=True)

    # Collect features (note that here we collect features for the segment of the general movement, i.e., movement of any body part)
    balance_sumfeat = {}
    balance_sumfeat = get_statistics(balancecols, df_normed, subdf, balance_sumfeat, df_means)

    # Convert dictionary to dataframe
    balance_sumfeat_row = pd.DataFrame({key: [value] for key, value in balance_sumfeat.items()})

    # Adapt
    balance_sumfeat_row = adapt_row(balance_sumfeat_row)

    ####################
    #### acoustics #####
    ####################

    # Check if vocalization exists
    if "vocalization" not in df_normed.columns:
        print("No vocalization column in the dataframe")
        vocfeat_row = pd.DataFrame() # empty row
    else:
        # Collect features only for the time of vocalizing
        subdf = pd.DataFrame()
        subdf = df_normed[df_normed["vocalization"] == "sounding"] 
        subdf = subdf.reset_index(drop=True)

        if subdf.shape[0] > 0:
            duration_voc = subdf["time"].iloc[-1] - subdf["time"].iloc[0] # duration

            # Collect features
            vocfeat_dict = {}
            vocfeat_dict = get_statistics(voccols, df_normed, subdf, vocfeat_dict, df_means)


        else:
            print("No vocalization in the trial")
            vocfeat_dict = {}
            duration_voc = 0

        # Convert dictionary to dataframe
        vocfeat_row = pd.DataFrame({key: [value] for key, value in vocfeat_dict.items()})

        # Adapt
        vocfeat_row = adapt_row(vocfeat_row)

        # Append duration
        vocfeat_row['duration_voc'] = duration_voc

        # If f1-f3 are not only nan, calculate the volume of the convex hull (i.e., vowel space area)
        if subdf['f1_clean'].isnull().all() and subdf['f2_clean'].isnull().all():
            print("No f1, f2 in the dataframe")
            vocfeat_row['VSA_f1f2'] = 0
        else:
            f1_clean = subdf['f1_clean'].dropna()
            f2_clean = subdf['f2_clean'].dropna()
            VSA2d = getVSA(f1_clean, f2_clean) # get VSA
            vocfeat_row['VSA_f1f2'] = VSA2d

    # Concatenate the rows for movement, balance, and acoustics
    newrow_concat = pd.concat([body_sumfeat_row, balance_sumfeat_row, vocfeat_row], axis=1)

    ###################################
    ###### arms submovements ##########
    ###################################
   
    # Take only the chunk where arms are moving
    subdf = pd.DataFrame()
    subdf = df_normed[df_normed["arms_mov"] == "movement"]
    subdf = subdf.reset_index(drop=True)

    # Prepare an empty row to store the features
    newrow_arms = pd.DataFrame()

    if subdf.shape[0] == 0:
        print("No movement in arms")
        arms_submov = 0 # number of submovements   
        wrists_tempVar = 0 # temporal variability
    else:
        # Count submovements in wrist
        submovcols = [col for col in subdf.columns if "Wrist" in col and 'speed' in col]
        arms_submov = []
        wrists_tempVar = []
        for col in submovcols:
            # Calculate the number of peaks that delimit unique submovements
            ts_mean = df_means[col + '_z_general_mean'].values[0] # get the overall mean of this signal
            height = ts_mean - df_means[col + '_z_general_std'].values[0] # height is mean - 1*std
            peaks, _ = find_peaks(subdf[col], height=height, distance=150)
            arms_submov.append(len(peaks))
            
            # Calculate the temporal variability of the submovements
            peak_times = df_normed.loc[peaks, "time"]
            intervals = np.diff(peak_times)
            tempVar = np.std(intervals)
            wrists_tempVar.append(tempVar)

    # Append to the row
    newrow_arms['arms_submovements'] = [arms_submov]
    newrow_arms['Wrist_tempVar'] = [wrists_tempVar]

    ###################################
    ######### arm symmetry ############
    ###################################

    # Calculate symmetry for arms
    symmetry_score = compute_ArmCoupling(subdf, keypoints=['Wrist', 'Elbow'], dimensions=['x', 'y', 'z'])
    arm_asymmetry = compute_multimovementAssymetry(subdf)
    newrow_arms['arm_coupling'] = symmetry_score
    newrow_arms['arm_asymmetry'] = arm_asymmetry['mean_integral_difference']


    ###################################
    ###### Motor complexity (PCA) #####
    ###################################

    # Whole-body complexity    
    subdf = pd.DataFrame()          
    subdf = df_normed.loc[df_normed['movement_in_trial'] == 'movement']
    subdf = subdf.reset_index(drop=True)

    if subdf.shape[0] == 0:
        body_nComp_80 = 0
        body_slope_80 = 0
        body_nComp_95 = 0
        body_slope_95 = 0
    else:
        bodydf = subdf.loc[:, PCAcolls_all]
        body_nComp_80, body_slope_80, body_nComp_95, body_slope_95 = get_PCA(bodydf)

    # Append to the row
    newrow_concat['body_nComp_80'] = body_nComp_80
    newrow_concat['body_slope_80'] = body_slope_80
    newrow_concat['body_nComp_95'] = body_nComp_95
    newrow_concat['body_slope_95'] = body_slope_95

    # Arm complexity
    # Empty subdf
    subdf = pd.DataFrame()
    subdf = df_normed.loc[df_normed['arms_mov'] == 'movement']
    if subdf.shape[0] == 0:
        arm_nComp_80 = 0
        arm_slope_80 = 0
        arm_nComp_95 = 0
        arm_slope_95 = 0
    else:
        armdf = subdf.loc[:, PCAcolls_arm]
        arm_nComp_80, arm_slope_80, arm_nComp_95, arm_slope_95 = get_PCA(armdf)

    # Append to the row
    newrow_concat['arm_nComp_80'] = arm_nComp_80
    newrow_concat['arm_slope_80'] = arm_slope_80
    newrow_concat['arm_nComp_95'] = arm_nComp_95
    newrow_concat['arm_slope_95'] = arm_slope_95


    # Concatenate newrow and newrow_arms
    newrow_final = pd.concat([newrow_concat, newrow_arms], axis=1)

    # Add metadata
    newrow_final["TrialID"] = df_normed["TrialID"].values[0]
    newrow_final["concept"] = df_normed["concept"].values[0]
    newrow_final["modality"] = df_normed["modality"].values[0]
    newrow_final["correction"] = df_normed["correction"].values[0]

    # Add the row the the overal df
    features_df = pd.concat([features_df, newrow_final], ignore_index=True)

    ###################################
    ########### Answer info ###########
    ###################################

    index = int(df_normed["TrialID"].values[0].split("_")[2])
    concept = df_normed["concept"].values[0]

    # Check what was the answer and cosine similarity for the index
    answer_fol = similaritydata[similaritydata["index"] == index]["answer"].values[0]
    answer_fol_sim = similaritydata[similaritydata["index"] == index]["cosine_similarity"].values[0]

    # Check what was the answer and cosine similarity for index before - but only if the previous index has the same concept as the current one
    if index - 1 in similaritydata["index"].values and similaritydata[similaritydata["index"] == index - 1]["word"].values[0] == concept:
        answer_prev = similaritydata[similaritydata["index"] == index - 1]["answer"].values[0]
        answer_prev_sim = similaritydata[similaritydata["index"] == index - 1]["cosine_similarity"].values[0]
    else:
        #print("previous index doesn't have the same concept")
        answer_prev = None
        answer_prev_sim = None

    # Append to the features_df
    features_df.loc[features_df["TrialID"] == df_normed["TrialID"].values[0], "answer_fol"] = answer_fol
    features_df.loc[features_df["TrialID"] == df_normed["TrialID"].values[0], "answer_fol_dist"] = answer_fol_sim
    features_df.loc[features_df["TrialID"] == df_normed["TrialID"].values[0], "answer_prev"] = answer_prev
    features_df.loc[features_df["TrialID"] == df_normed["TrialID"].values[0], "answer_prev_dist"] = answer_prev_sim

    # Check what is the expressibility for the concept & modality
    express = express_df[(express_df["word"] == concept) & (express_df["modality"] == df_normed["modality"].values[0])]["fit"].values[0]
    features_df.loc[features_df["TrialID"] == df_normed["TrialID"].values[0], "expressibility"] = express

    # Add response time from answerdata
    response_time = answerdata[answerdata["TrialID"] == df_normed["TrialID"].values[0]]["response_time_sec"].values[0]
    features_df.loc[features_df["TrialID"] == df_normed["TrialID"].values[0], "response_time_sec"] = response_time

# Create a concept id to group together corrections of one concept
for row in features_df.iterrows():
    row = row[1]
    features_df.loc[row.name, "concept_id"] = row["concept"] + "_" + row["modality"] + "_" + row["TrialID"].split("_")[-1]

# Get rid of columns that have only NA values
features_df = features_df.dropna(axis=1, how='all')

# Round all num cols to 3
numcols = features_df.select_dtypes(include=[np.number]).columns
features_df[numcols] = features_df[numcols].round(3)

This is how the dataframe looks like:

	arm_duration	arm_inter_Kin	arm_inter_IK	arm_bbmv	lowerbody_duration	lowerbody_inter_Kin	lowerbody_inter_IK	lowerbody_bbmv	leg_duration	leg_inter_Kin	...	f2_clean_pospeak_std	f1_clean_pospeak_std	f1_clean_vel_pospeak_std	CoG_Gmean	CoG_Gstd	CoG_pospeak_mean	CoG_pospeak_std	CoG_integral	CoG_range	concept_id
0	3988.0	26.197	26.728	8.472	2020.0	26.580	26.181	5.400	2020.0	26.759	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	groot_combinatie_p1
1	3868.0	26.674	28.122	8.148	2870.0	28.319	28.291	6.487	2870.0	27.096	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	groot_combinatie_p1
2	4014.0	26.449	27.565	8.465	874.0	23.434	23.040	2.861	874.0	22.945	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	groot_combinatie_p1
3	4046.0	26.207	28.558	8.482	3750.0	28.421	28.796	6.124	3750.0	27.973	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	hoog_combinatie_p1
4	4708.0	26.502	28.717	8.309	4508.0	29.760	28.762	6.005	4508.0	28.679	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	hoog_combinatie_p1
5	4004.0	26.978	28.554	8.699	3598.0	28.801	28.616	5.723	3598.0	28.179	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	hoog_combinatie_p1
6	NaN	NaN	NaN	NaN	784.0	23.222	22.753	2.313	784.0	22.135	...	0.867	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	blij_combinatie_p1
7	5248.0	28.973	31.658	10.919	4930.0	28.905	29.800	14.893	4930.0	28.250	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	vrouw_combinatie_p0
8	1784.0	25.011	24.190	6.604	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	blij_combinatie_p1
9	1284.0	24.278	24.516	5.054	1334.0	25.381	25.483	4.317	1334.0	24.478	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	blij_combinatie_p1
10	1494.0	24.248	24.502	6.483	2944.0	29.247	28.630	5.378	2944.0	26.524	...	NaN	0.018	0.812	NaN	NaN	NaN	NaN	NaN	NaN	vis_combinatie_p1
11	2486.0	25.718	26.035	6.420	780.0	25.181	21.982	2.271	780.0	22.482	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	ruiken_combinatie_p1
12	2968.0	25.494	28.758	8.040	2750.0	28.377	27.107	4.885	2750.0	26.672	...	0.618	NaN	0.552	NaN	NaN	NaN	NaN	NaN	NaN	wind_combinatie_p1
13	3850.0	27.666	27.348	7.419	3366.0	28.208	28.519	5.033	3366.0	28.230	...	0.223	NaN	0.502	NaN	NaN	NaN	NaN	NaN	NaN	heet_combinatie_p1
14	3236.0	26.551	27.446	6.411	NaN	NaN	NaN	NaN	NaN	NaN	...	0.197	0.429	0.347	NaN	NaN	NaN	NaN	NaN	NaN	heet_combinatie_p1
15	5572.0	28.818	28.100	9.814	5158.0	29.833	29.109	6.470	5158.0	29.409	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	heet_combinatie_p1
16	1144.0	23.839	26.291	8.156	324.0	19.470	20.447	0.414	324.0	19.033	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	vrouw_combinatie_p0
17	3816.0	28.001	29.471	10.348	3244.0	27.815	30.057	9.607	3244.0	27.788	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	vogel_gebaren_p0
18	5280.0	28.098	30.942	13.350	5862.0	28.648	31.235	17.882	5862.0	29.740	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	kruipen_gebaren_p0
19	4302.0	28.401	31.164	10.572	4206.0	28.373	31.096	8.402	4206.0	27.451	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	water_gebaren_p0
20	4474.0	27.316	30.376	10.229	4398.0	28.390	32.196	9.064	4398.0	27.665	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	water_gebaren_p0
21	4388.0	27.872	31.225	9.748	3782.0	27.458	30.479	8.428	3782.0	27.529	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	water_gebaren_p0
22	5852.0	29.417	32.048	10.675	5404.0	29.153	30.545	10.334	5404.0	29.056	...	NaN	NaN	NaN	-0.784	1.429	-3.658	0.171	-210.357	5.017	vuur_gebaren_p0
23	2736.0	26.324	27.656	6.547	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	goed_gebaren_p0
24	4256.0	28.675	29.776	8.148	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	goed_gebaren_p0
25	2516.0	26.234	27.815	7.774	884.0	23.300	25.890	6.598	884.0	23.423	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	horen_gebaren_p0
26	5832.0	29.027	31.043	11.094	3936.0	27.904	29.471	11.080	3936.0	27.885	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	vrouw_combinatie_p0
27	4504.0	28.604	30.892	9.939	4266.0	27.764	29.475	11.045	4266.0	28.404	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	ver_gebaren_p0
28	5946.0	28.473	30.925	9.779	5136.0	28.712	30.358	9.084	5136.0	28.117	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	ver_gebaren_p0
29	6100.0	29.300	31.243	10.521	6030.0	29.592	30.960	9.672	6030.0	30.037	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	ver_gebaren_p0
30	3004.0	26.079	26.368	7.253	2686.0	28.219	27.971	4.870	2686.0	27.750	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	blazen_gebaren_p1
31	3918.0	28.358	28.936	8.394	3366.0	28.669	28.833	5.086	3366.0	28.617	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	geven_gebaren_p1
32	3918.0	28.120	29.766	9.343	1692.0	26.370	26.244	4.733	1692.0	24.816	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	geven_gebaren_p1
33	1202.0	23.839	27.274	8.980	1590.0	25.566	27.624	11.326	1590.0	25.540	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	verbranden_combinatie_p0
34	2352.0	25.612	25.352	6.061	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	kat_gebaren_p1
35	1954.0	25.071	25.036	6.560	1992.0	27.080	26.978	4.535	1992.0	25.756	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	lachen_gebaren_p1
36	4400.0	27.535	28.492	5.513	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	zoet_gebaren_p1
37	5726.0	28.953	29.493	7.501	3038.0	28.781	28.089	3.640	3038.0	27.977	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	zoet_gebaren_p1
38	4310.0	27.565	28.136	6.982	626.0	23.135	22.579	0.392	626.0	22.211	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	zoet_gebaren_p1
39	2234.0	25.175	26.736	6.535	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	slapen_gebaren_p1
40	4428.0	28.941	30.742	7.973	2740.0	27.334	28.586	7.431	2740.0	27.113	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	verbranden_combinatie_p0
41	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	geur_geluiden_p0
42	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	0.685	0.110	NaN	NaN	NaN	NaN	NaN	NaN	NaN	geur_geluiden_p0
43	4032.0	26.813	29.551	7.986	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	geur_geluiden_p0
44	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	vlieg_geluiden_p0
45	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	1.624	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	vlieg_geluiden_p0
46	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	1.777	NaN	NaN	NaN	NaN	NaN	NaN	vlieg_geluiden_p0
47	3306.0	27.175	28.826	6.781	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	scherp_geluiden_p0
48	5244.0	28.767	30.137	9.125	4234.0	27.891	30.650	7.553	4234.0	27.853	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	verbranden_combinatie_p0
49	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	scherp_geluiden_p0

50 rows × 466 columns

Post-extraction data wrangling

Before saving it, we also want to see which trials underwent which corrections. Some trials were guessed on first try (only_c0), some after first correction (c0_c1) and some only after second correction or not at all (all_c).

# Identify how are which trials corrected
c0 = features_df[features_df["correction"] == "c0"]["concept_id"].unique()
c1 = features_df[features_df["correction"] == "c1"]["concept_id"].unique()
c2 = features_df[features_df["correction"] == "c2"]["concept_id"].unique()

only_c0 = [x for x in c0 if x not in c1 and x not in c2] # Guessed on first try
c0_c1 = [x for x in c0 if x in c1 and x not in c2] # Guessed on second try
all_c = [x for x in c0 if x in c1 and x in c2] # Guessed on third try or never

# Check whether we have some data that are outside of these three categories (could be that we lost some trials in the processing pipeline)
all = features_df["concept_id"].unique()
missing = [x for x in all if x not in only_c0 and x not in c0_c1 and x not in all_c]
if len(missing) > 0:
    print("Missing: ", missing) # Currently we miss no trials
else:   
    print("No missing trials")

# Add column correction_info to features_df
features_df["correction_info"] = features_df["correction"]

# If the concept_id is in only_c0, then correction_info is c0_only
features_df.loc[features_df["concept_id"].isin(only_c0), "correction_info"] = "c0_only"

# Save it
features_df.to_csv(datafolder + "features_df_final.csv", index=False)

No missing trials

We will be also dealing with some loss of the data because some features was not possible to extract. Let’s see for which features it’s the most of the data

lowerbody_power_pospeak_mean                0.523077
lowerbody_moment_sum_change_pospeak_mean    0.584615
lowerbody_moment_sum_change_pospeak_std     0.846154
lowerbody_speedKin_sum_pospeak_std          0.507692
lowerbody_accKin_sum_pospeak_std            0.507692
lowerbody_angSpeed_sum_pospeak_mean         0.507692
lowerbody_angSpeed_sum_pospeak_std          0.661538
lowerbody_angAcc_sum_pospeak_mean           0.507692
lowerbody_angAcc_sum_pospeak_std            0.661538
lowerbody_angJerk_sum_pospeak_std           0.523077
leg_moment_sum_pospeak_std                  0.507692
leg_moment_sum_change_pospeak_mean          0.600000
leg_angSpeed_sum_pospeak_std                0.569231
leg_angAcc_sum_pospeak_std                  0.569231
head_moment_sum_change_pospeak_std          0.523077
f0_Gmean                                    0.769231
f0_Gstd                                     0.769231
f0_integral                                 0.969231
f0_range                                    0.769231
f1_clean_integral                           0.953846
f1_clean_vel_pospeak_mean                   0.692308
f2_clean_pospeak_mean                       0.738462
f2_clean_integral                           0.953846
f2_clean_vel_pospeak_mean                   0.661538
f2_clean_vel_pospeak_std                    0.800000
f3_clean_pospeak_mean                       0.723077
f3_clean_pospeak_std                        0.815385
f3_clean_integral                           0.953846
f3_clean_vel_pospeak_mean                   0.630769
f3_clean_vel_pospeak_std                    0.769231
arm_power_pospeak_std                       0.646154
pelvis_moment_sum_change_pospeak_std        0.723077
f1_clean_vel_integral                       0.969231
f2_clean_vel_integral                       0.969231
f3_clean_vel_integral                       0.969231
lowerbody_moment_sum_pospeak_std            0.646154
f1_clean_pospeak_mean                       0.830769
lowerbody_power_pospeak_std                 0.830769
leg_moment_sum_change_pospeak_std           0.815385
f2_clean_pospeak_std                        0.846154
f1_clean_pospeak_std                        0.938462
f1_clean_vel_pospeak_std                    0.846154
CoG_Gmean                                   0.923077
CoG_Gstd                                    0.923077
CoG_pospeak_mean                            0.923077
CoG_pospeak_std                             0.938462
CoG_integral                                0.923077
CoG_range                                   0.923077
dtype: float64

Let’s also have a first look on the distribution of our three measures of interest - arm moment (torque), COPc, and envelope.

These are for the features in cummulative form (integral)

These are for the features in instantaneous form (peak mean)

Lastly, let’s also look how are these three measures correlated with each other

Finally, because we will proceed with eXtreme Gradient Boosting per modality, we will also save the data separately for each modality.

# Separate data into three dfs based on modality
features_df_gesture = features_df[features_df["modality"] == "gebaren"]
features_df_combination = features_df[features_df["modality"] == "combinatie"]
features_df_vocal = features_df[features_df["modality"] == "geluiden"]

# Save them
features_df_gesture.to_csv(datafolder + "features_df_gesture.csv", index=False)
features_df_combination.to_csv(datafolder + "features_df_combination.csv", index=False)
features_df_vocal.to_csv(datafolder + "features_df_vocal.csv", index=False)

References

Berisha, Visar, Steven Sandoval, Rene Utianski, Julie Liss, and Andreas Spanias. 2014. “Characterizing the Distribution of the Quadrilateral Vowel Space Area.” The Journal of the Acoustical Society of America 135 (1): 421–27. https://doi.org/10.1121/1.4829528.

Ćwiek, Aleksandra, Susanne Fuchs, Wim Pouw, and Šárka Kadavá. n.d. “The Relationship Between Expectation and Performance: Methodological Evaluation.” Accessed.

Daffertshofer, Andreas, Claudine J. C. Lamoth, Onno G. Meijer, and Peter J. Beek. 2004. “PCA in Studying Coordination and Variability: A Tutorial.” Clinical Biomechanics (Bristol, Avon) 19 (4): 415–28. https://doi.org/10.1016/j.clinbiomech.2004.01.005.

Hogan, Neville, and Dagmar Sternad. 2009. “Sensitivity of Smoothness Measures to Movement Duration, Amplitude, and Arrests.” Journal of Motor Behavior 41 (6): 529–34. https://doi.org/10.3200/35-09-004-RC.

Kadavá, Šárka, Aleksandra Ćwiek, Susanne Fuchs, and Wim Pouw. 2024. “What Do We Mean When We Say Gestures Are More Expressive Than Vocalizations? An Experimental and Simulation Study.” Proceedings of the Annual Meeting of the Cognitive Science Society 46 (0). https://escholarship.org/uc/item/2mp1v3v5.

Pouw, Wim, Mark Dingemanse, Yasamin Motamedi, and Aslı Özyürek. 2021. “A Systematic Investigation of Gesture Kinematics in Evolving Manual Languages in the Lab.” Cognitive Science 45 (7): e13014. https://doi.org/10.1111/cogs.13014.

Seth, Ajay, Jennifer L. Hicks, Thomas K. Uchida, Ayman Habib, Christopher L. Dembia, James J. Dunne, Carmichael F. Ong, et al. 2018. “OpenSim: Simulating Musculoskeletal Dynamics and Neuromuscular Control to Study Human and Animal Movement.” PLOS Computational Biology 14 (7): e1006223. https://doi.org/10.1371/journal.pcbi.1006223.

Trujillo, James, Irina Simanova, Harold Bekkering, and Asli Özyürek. 2018. “Communicative Intent Modulates Production and Comprehension of Actions and Gestures: A Kinect Study.” Cognition 180: 38–51. https://doi.org/10.1016/j.cognition.2018.04.003.

Xiong, Yingen, Francis Quek, and David Mcneill. 2002. “Hand Gesture Symmetric Behavior Detection and Analysis in Natural Conversation.” In, 179–84. https://doi.org/10.1109/ICMI.2002.1166989.

Yan, Yuke, James M. Goodman, Dalton D. Moore, Sara A. Solla, and Sliman J. Bensmaia. 2020. “Unexpected Complexity of Everyday Manual Behaviors.” Nature Communications 11 (1): 3564. https://doi.org/10.1038/s41467-020-17404-0.

Żywiczyński, Przemysław, Marek Placiński, Marta Sibierska, Monika Boruta-Żywiczyńska, Sławomir Wacewicz, Michał Meina, and Peter Gärdenfors. 2024. “Praxis, Demonstration and Pantomime: A Motion Capture Investigation of Differences in Action Performances.” Language and Cognition, 1–28. https://doi.org/10.1017/langcog.2024.8.