Processing III: Merging multimodal data

Overview

In the previous scripts, we preprocessed various motion and acoustic data. In this script, we merge these data into a single file per trial. The data include:

  • Balance Board data
  • Kinematics
  • Joint angles
  • Joint moments
  • Amplitude envelope
  • f0
  • Formants
  • Spectral centroid
Code to prepare environment
# packages
import os
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.signal
import seaborn as sns

curfolder = os.getcwd()

# folders with processed data
MTfolder_processed = curfolder + '\\TS_motiontracking\\'
ACfolder_processed = curfolder + '\\TS_acoustics\\'
# folder to save merged data
TSmerged = curfolder + '\\TS_merged\\'

# prepare all files
bbfiles = glob.glob(MTfolder_processed + '/bb*.csv')
idfiles = glob.glob(MTfolder_processed + '/id*.csv')
ikfiles = glob.glob(MTfolder_processed + '/ik*.csv')
mtfiles = glob.glob(MTfolder_processed + '/mt*.csv')
envfiles = glob.glob(ACfolder_processed + '/env*.csv')
f0files = glob.glob(ACfolder_processed + '/f0*.csv')
formants = glob.glob(ACfolder_processed + '/praat_formants*.csv')
scfiles = glob.glob(ACfolder_processed + '/cog*.csv')

When extracting and processing the acoustic and motion signals, we kept the sampling rates untouched. This means that we now have a variety of time series, each sampled at a different frequency. Inspecting one trial per signal, we see the following sampling rates (in Hz):

balance board: 499.986924434759
inverse dynamics: 60.00535247744097
inverse kinematics: 59.999988000002396
kinematics: 60.00000000000024
envelope: 48000.00000000008
f0: 487.16350507461976
formants: 38415.36614645814
spectral centroid: 472.65436024146993
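
These rates can be read off each file's time stamps. A minimal sketch of such an estimate (a hypothetical helper, assuming time stamps in milliseconds):

# estimate a sampling rate from time stamps given in milliseconds
def estimate_sr(times_ms):
    dt = np.median(np.diff(times_ms))   # median step in ms
    return 1000.0 / dt                  # samples per second (Hz)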

We opt for 500 Hz as the final sampling rate to merge on. That means we will interpolate all signals (using linear interpolation) onto this common time grid.

Additionally, we will adapt the formants such that we only keep values that fall within the width of an amplitude-envelope peak (see the acoustics processing script for details), where f0 is present, or both. These two situations can be considered to yield the most reliable formant values.
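
In pandas terms, the rule amounts to masking formant samples outside these regions; a minimal sketch, using the column names constructed later in this script:

# keep a formant sample only inside an envelope-peak interval OR where f0 is defined
keep = (df['env_peak_width'] == 1) | df['f0'].notna()
df['f1_clean'] = df['f1'].where(keep)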

Finally, we will also use the inverse kinematics and dynamics to calculate power (as joint moment × joint angular velocity) and smooth it with a 1st-order polynomial Savitzky-Golay filter with a span of about 560 ms.
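
For a single joint group, the computation boils down to the following sketch (moment_col and ang_speed_col are placeholders for the column names selected further below; at 500 Hz, ~560 ms is 280 samples, plus one to make the window length odd):

# sketch: power = joint moment x joint angular velocity, smoothed with a
# 1st-order Savitzky-Golay filter (281 samples ~ 562 ms at 500 Hz)
power = df[moment_col] * df[ang_speed_col]
power_smooth = scipy.signal.savgol_filter(power, window_length=281, polyorder=1)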

Custom functions
# Function to annotate chunks of contiguous non-NaN values in a dataframe
def create_chunks(df, var):

    df['chunk'] = None

    # annotate chunks of non-NaN values
    chunk = 0
    for index, row in df.iterrows():
        if np.isnan(row[var]):
            continue
        else:
            df.loc[index, 'chunk'] = chunk
            # if this is the last row, we are done; if the next value is NaN, start a new chunk
            if index == len(df)-1:
                continue
            elif np.isnan(df.loc[index+1, var]):
                chunk += 1

    # rows where the signal is NaN keep chunk = None; drop None from the chunk list
    # (an empty list then means there is no trace at all, e.g. no f0)
    chunks = [c for c in df['chunk'].unique() if c is not None]

    return df, chunks
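
# Toy illustration of create_chunks (hypothetical values): two voiced
# stretches separated by NaNs yield two separate chunks.
demo = pd.DataFrame({'time': np.arange(0, 12, 2.0),
                     'f0':   [np.nan, 110, 112, np.nan, 98, 99]})
demo, chunks = create_chunks(demo, 'f0')
print(chunks)  # [0, 1]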

# Function to interpolate chunks of non-NaN values in a dataframe, maintaining discontinuities in the signal
def interpolate_chunks(df, chunks, var):
    for chunk in chunks:
        # get the first and last row of the chunk
        firstrow = df[df['chunk'] == chunk].index[0]
        lastrow = df[df['chunk'] == chunk].index[-1]
        # fill all rows in between with the chunk number
        df.loc[firstrow:lastrow, 'chunk'] = chunk
        # get the rows of the chunk
        chunkrows = df[df['chunk'] == chunk].copy()
        # interpolate linearly against the time axis (pandas interpolates over the index, so we index by time)
        chunkrows[var] = chunkrows.set_index('time')[var].interpolate(method='index').values
        # put the interpolated chunk back into the df
        df.loc[df['chunk'] == chunk, var] = chunkrows[var]

    # get rid of the chunk column
    df.drop('chunk', axis=1, inplace=True)

    return df
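
# Continuing the toy example: after merging onto a denser (1 ms) grid, rows
# inside a chunk's time span get NaN for f0; interpolate_chunks fills those,
# while the gap between the chunks stays NaN, preserving the discontinuity.
grid = pd.DataFrame({'time': np.arange(0, 12, 1.0)})
merged = pd.merge(grid, demo, on='time', how='outer').sort_values('time').reset_index(drop=True)
merged = interpolate_chunks(merged, chunks, 'f0')
# f0 at t=3 is now 111 (inside chunk 0); t=5..7 remain NaN (between chunks)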

Merging signals on a common sampling rate

desired_sr = 0.5    # sampling rate to merge on, in samples per millisecond (time stamps are in ms, so 0.5/ms = 500 Hz, i.e. a 2 ms step)

error_log = []

for file in bbfiles:

    bb_df = pd.read_csv(file)
    # get trial id
    trialid = bb_df['TrialID'][0]

    print('working on ' + trialid)
    
    # find the same trialid in idfiles
    id_file = [x for x in idfiles if trialid in x]
    try:
        id_df = pd.read_csv(id_file[0])
    except IndexError:
        print('IndexError: ' + trialid + ' not found')
        errormessage = 'IndexError: ' + trialid + ' not found for ID'
        error_log.append(errormessage)
        continue
    
    # find the same trialid in mtfiles
    mt_file = [x for x in mtfiles if trialid in x]
    try:
        mt_df = pd.read_csv(mt_file[0])
    except IndexError:
        print('IndexError: ' + trialid + ' not found')
        errormessage = 'IndexError: ' + trialid + ' not found for MT'
        error_log.append(errormessage)
        continue
    # rename Time to time
    mt_df.rename(columns={'Time': 'time'}, inplace=True)

    # find the same trialid in envfiles
    env_file = [x for x in envfiles if trialid in x]
    try:
        env_df = pd.read_csv(env_file[0])
    except IndexError:
        print('IndexError: ' + trialid + ' not found')
        errormessage = 'IndexError: ' + trialid + ' not found for ENV'
        error_log.append(errormessage)
        continue
    
    # rename trialID to TrialID
    env_df.rename(columns={'trialID': 'TrialID'}, inplace=True)

    # find the same trialid in f0files
    f0_file = [x for x in f0files if trialid in x]
    try:
        f0_df = pd.read_csv(f0_file[0])
    except IndexError:
        print('IndexError: ' + trialid + ' not found')
        errormessage = 'IndexError: ' + trialid + ' not found for F0'
        error_log.append(errormessage)
        continue

    # rename time_ms to time
    f0_df.rename(columns={'time_ms': 'time'}, inplace=True)
    # rename ID to TrialID
    f0_df.rename(columns={'ID': 'TrialID'}, inplace=True)

    # find the same trialid in ikfiles
    ik_file = [x for x in ikfiles if trialid in x]
    try:
        ik_df = pd.read_csv(ik_file[0])
    except IndexError:
        print('IndexError: ' + trialid + ' not found')
        errormessage = 'IndexError: ' + trialid + ' not found for IK'
        error_log.append(errormessage)
        continue

    # find the same trialid in formants
    formants_file = [x for x in formants if trialid in x]
    try:
        formants_df = pd.read_csv(formants_file[0])
    except IndexError:
        print('IndexError: ' + trialid + ' not found')
        errormessage = 'IndexError: ' + trialid + ' not found for formants'
        error_log.append(errormessage)
        continue

    # rename trialid to TrialID
    formants_df.rename(columns={'trialid': 'TrialID'}, inplace=True)
    formants_df = formants_df[['time', 'f1', 'f2', 'f3', 'TrialID']]
    formants_df['time'] = formants_df['time'] * 1000

    # find the same trialid in CoG
    sc_file = [x for x in scfiles if trialid in x]
    try:
        sc_df = pd.read_csv(sc_file[0])
    except IndexError:
        print('IndexError: ' + trialid + ' not found')
        errormessage = 'IndexError: ' + trialid + ' not found for CoG'
        error_log.append(errormessage)
        continue


    ############## MERGING ########################

    #### regularize sr in bb
    time_new = np.arange(0, max(bb_df['time']), 1/desired_sr)
    bb_interp = pd.DataFrame({'time': time_new})
    
    # interpolate all data columns of bb_df onto the new time axis
    colstoint = bb_df.columns
    colstoint = [x for x in colstoint if 'time' not in x]
    colstoint = [x for x in colstoint if 'TrialID' not in x]
    colstoint = [x for x in colstoint if 'FileInfo' not in x]

    for col in colstoint:
        # np.interp resamples the original samples onto the regular 2 ms grid
        bb_interp[col] = np.interp(time_new, bb_df['time'], bb_df[col])

    # add trial ID and file info
    bb_interp['TrialID'] = trialid
    bb_interp['FileInfo'] = bb_df['FileInfo'][0]
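    # note: np.interp evaluates the piecewise-linear signal at the new grid points;
    # e.g. samples at t = [0.0, 1.9, 4.1] ms with values x = [0.0, 1.9, 4.1]
    # give np.interp([0.0, 2.0], t, x) -> [0.0, 2.0]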
    
    ########### merge the bb_interp with env
    # merge the two dataframes
    merge1 = pd.merge(bb_interp, env_df, on=['time', 'TrialID'], how='outer')

    # sort by time and interpolate missing values of envelope and audio against the time axis
    merge1 = merge1.sort_values('time').set_index('time')
    colstoint = [x for x in merge1.columns if 'audio' in x or 'envelope' in x]

    for col in colstoint:
        merge1[col] = merge1[col].interpolate(method='index')
    merge1 = merge1.reset_index()

    # now we can kick out all values where COPc is NaN
    merge1 = merge1[~np.isnan(merge1['COPc'])]
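    # this outer-merge -> interpolate -> filter pattern recurs below: the outer
    # merge interleaves both time grids, interpolation estimates the incoming
    # signal at the 2 ms grid points, and dropping rows where the reference
    # signal (COPc) is NaN restores the clean 500 Hz grid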

    ########### merge with ID
    # merge the two dataframes
    merge2 = pd.merge(merge1, id_df, on=['time', 'TrialID'], how='outer')

    # get cols of sampleid
    colstoint = id_df.columns
    colstoint = [x for x in colstoint if 'time' not in x]
    colstoint = [x for x in colstoint if 'TrialID' not in x]

    # sort by time and interpolate against the time axis
    merge2 = merge2.sort_values('time').set_index('time')
    for col in colstoint:
        merge2[col] = merge2[col].interpolate(method='index')
    merge2 = merge2.reset_index()

    # now we can kick out all values where COPc is NaN to get sampling rate back to 500 
    merge2 = merge2[~np.isnan(merge2['COPc'])]

    ########### merge with MT
    # merge the two dataframes
    merge3 = pd.merge(merge2, mt_df, on=['time', 'TrialID'], how='outer')

    # get cols of samplemt
    colstoint = mt_df.columns
    colstoint = [x for x in colstoint if 'time' not in x]
    colstoint = [x for x in colstoint if 'TrialID' not in x]

    # sort by time and interpolate missing values from mt against the time axis
    merge3 = merge3.sort_values('time').set_index('time')
    for col in colstoint:
        merge3[col] = merge3[col].interpolate(method='index')
    merge3 = merge3.reset_index()

    # now we can kick out all values where COPc is NaN
    merge3 = merge3[~np.isnan(merge3['COPc'])]

    ########### merge with F0
    # for interpolation, we need to again parse f0 into chunks of non-NaN values
    f0_df, chunks = create_chunks(f0_df, 'f0')
    
    # now we can merge, sorting by time so that chunk spans are contiguous
    merge4 = pd.merge(merge3, f0_df, on=['time', 'TrialID'], how='outer')
    merge4 = merge4.sort_values('time').reset_index(drop=True)

    # interpolate the f0 signal, while maintaining discontinuities
    merge4 = interpolate_chunks(merge4, chunks, 'f0')

    # now we can drop all rows where COPc is NaN
    merge4 = merge4[~np.isnan(merge4['COPc'])]

    ########### merge with IK
    merge5 = pd.merge(merge4, ik_df, on=['time', 'TrialID'], how='outer')

    # get cols of sampleik
    colstoint = ik_df.columns
    colstoint = [x for x in colstoint if 'time' not in x]
    colstoint = [x for x in colstoint if 'TrialID' not in x]

    # sort by time and interpolate missing values from ik against the time axis
    merge5 = merge5.sort_values('time').set_index('time')
    for col in colstoint:
        merge5[col] = merge5[col].interpolate(method='index')
    merge5 = merge5.reset_index()

    # now we can kick out all values where COPc is NaN
    merge5 = merge5[~np.isnan(merge5['COPc'])]

    ########### merge with formants
    merge6 = pd.merge(merge5, formants_df, on=['time', 'TrialID'], how='outer')

    # get cols of sampleformants
    colstoint = formants_df.columns
    colstoint = [x for x in colstoint if 'time' not in x]
    colstoint = [x for x in colstoint if 'TrialID' not in x]

    # sort by time and interpolate missing formant values against the time axis - the formants
    # currently have no NaNs of their own, so we can interpolate the whole signal instead of only non-NaN chunks
    merge6 = merge6.sort_values('time').set_index('time')
    for col in colstoint:
        merge6[col] = merge6[col].interpolate(method='index')
    merge6 = merge6.reset_index()

    # now we can kick out all values where COPc is NaN
    merge6 = merge6[~np.isnan(merge6['COPc'])]

    ########### merge with CoG
    # for interpolation, we again need to parse the spectral centroid (CoG) into chunks of non-NaN values
    sc_df, chunks = create_chunks(sc_df, 'CoG')

    # now we merge, sorting by time so that chunk spans are contiguous
    merge7 = pd.merge(merge6, sc_df, on=['time', 'TrialID'], how='outer')
    merge7 = merge7.sort_values('time').reset_index(drop=True)

    # interpolate the CoG signal, while maintaining discontinuities
    merge7 = interpolate_chunks(merge7, chunks, 'CoG')

    # now we can kick out all values where COPc is NaN
    merge7 = merge7[~np.isnan(merge7['COPc'])]

    # this is the final df; reset the index so that positional peak indices below align with row labels
    merge_final = merge7.reset_index(drop=True)

    ############## FORMANT ADAPTATION ########################

    # find peaks in envelope, with min=mean
    peaks, _ = scipy.signal.find_peaks(merge_final['envelope'], height=np.mean(merge_final['envelope']))
    # get widths of the peaks; peak_widths returns (widths, width_heights, left_ips, right_ips)
    widths = scipy.signal.peak_widths(merge_final['envelope'], peaks, rel_height=0.95)
    # peak width df with start and end positions (left_ips/right_ips are fractional sample indices)
    peak_widths = pd.DataFrame({'start': widths[2], 'end': widths[3]})

    # now create a new column env_peak_width, put 0s everywhere, and 1s in the intervals of the widths
    merge_final['env_peak_width'] = 0
    for index, row in peak_widths.iterrows():
        merge_final.loc[int(row['start']):int(row['end']), 'env_peak_width'] = 1

    # now we will create formant columns, keeping only formants in the intervals of env_peak_width OR where f0 is not NaN
    merge_final['f1_clean_f0'] = merge_final['f1']
    merge_final['f2_clean_f0'] = merge_final['f2']
    merge_final['f3_clean_f0'] = merge_final['f3']

    # where f0 is NaN, we put NaN - what remains are formants where f0 is present
    merge_final.loc[np.isnan(merge_final['f0']), 'f1_clean_f0'] = np.nan
    merge_final.loc[np.isnan(merge_final['f0']), 'f2_clean_f0'] = np.nan
    merge_final.loc[np.isnan(merge_final['f0']), 'f3_clean_f0'] = np.nan

    # we will also create formant columns keeping only values in the intervals of env_peak_width
    merge_final['f1_clean_env'] = merge_final['f1']
    merge_final['f2_clean_env'] = merge_final['f2']
    merge_final['f3_clean_env'] = merge_final['f3']

    # where env_peak_width is 0, we will put NaN - these are formants during envelope peaks only
    merge_final.loc[merge_final['env_peak_width'] == 0, 'f1_clean_env'] = np.nan
    merge_final.loc[merge_final['env_peak_width'] == 0, 'f2_clean_env'] = np.nan
    merge_final.loc[merge_final['env_peak_width'] == 0, 'f3_clean_env'] = np.nan

    ## now we create formants where we copy values from clean_env and clean_f0
    merge_final['f1_clean'] = merge_final['f1_clean_env']
    merge_final['f2_clean'] = merge_final['f2_clean_env']
    merge_final['f3_clean'] = merge_final['f3_clean_env']

    # where formant is now NaN, copy values from f_clean_f0 in case there is a value
    merge_final.loc[np.isnan(merge_final['f1_clean']), 'f1_clean'] = merge_final['f1_clean_f0']
    merge_final.loc[np.isnan(merge_final['f2_clean']), 'f2_clean'] = merge_final['f2_clean_f0']
    merge_final.loc[np.isnan(merge_final['f3_clean']), 'f3_clean'] = merge_final['f3_clean_f0']

    # now calculate formant velocities (but only for the f_clean)
    merge_final['f1_clean_vel'] = np.insert(np.diff(merge_final['f1_clean']), 0, 0)
    merge_final['f2_clean_vel'] = np.insert(np.diff(merge_final['f2_clean']), 0, 0)
    merge_final['f3_clean_vel'] = np.insert(np.diff(merge_final['f3_clean']), 0, 0)

    # smooth (note: NaNs in the cleaned formants propagate through the filter window)
    merge_final['f1_clean_vel'] = scipy.signal.savgol_filter(merge_final['f1_clean_vel'], 5, 3)
    merge_final['f2_clean_vel'] = scipy.signal.savgol_filter(merge_final['f2_clean_vel'], 5, 3)
    merge_final['f3_clean_vel'] = scipy.signal.savgol_filter(merge_final['f3_clean_vel'], 5, 3)
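    # note: np.diff above gives change per sample; with a 2 ms step, these
    # velocities can be multiplied by 500 to express them in Hz per second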

    ########## POWER ####################
    groups = ['lowerbody', 'leg', 'head', 'arm']

    for group in groups:
        # get all columns that contain group
        cols = [x for x in merge_final.columns if group in x]

        # get all columns that contain 'moment_sum'
        torque = [x for x in cols if 'moment_sum' in x]
        # but not change
        torque = [x for x in torque if 'change' not in x][0]

        # get all columns that contain 'angSpeed_sum'
        angSpeed = [x for x in cols if 'angSpeed_sum' in x][0]

        # get power which is moment * angSpeed
        merge_final[group + '_power'] = merge_final[torque] * merge_final[angSpeed]
        # smooth
        merge_final[group + '_power'] = scipy.signal.savgol_filter(merge_final[group + '_power'], 281, 1) # window 281 corresponds to 562 ms (we add +1 to have odd length of window)
    
    # write to csv
    merge_final.to_csv(TSmerged + '/merged_' + trialid + '.csv', index=False)

# write the error log once, after all trials have been processed
with open(TSmerged + '/error_log.txt', 'w') as f:
    for item in error_log:
        f.write("%s\n" % item)

Here is an example of a merged file containing all the data:

time left_back right_forward right_back left_forward COPXc COPYc COPc TrialID FileInfo ... f1_clean f2_clean f3_clean f1_clean_vel f2_clean_vel f3_clean_vel lowerbody_power leg_power head_power arm_power
0 0.0 1.040234 0.922185 1.447832 1.439272 0.000153 -1.427206e-05 0.000154 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 22.337284 1.403824 8.267574 6.538422
1 2.0 1.040128 0.922236 1.448219 1.439506 0.000218 -4.098744e-06 0.000218 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 22.285626 1.400045 8.244890 6.524016
2 4.0 1.040021 0.922294 1.448528 1.439647 0.000270 5.396631e-08 0.000270 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 22.233968 1.396267 8.222205 6.509609
3 6.0 1.039911 0.922355 1.448768 1.439706 0.000311 -5.992720e-07 0.000311 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 22.182311 1.392488 8.199520 6.495203
4 8.0 1.039796 0.922414 1.448945 1.439691 0.000342 -4.980859e-06 0.000342 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 22.130653 1.388709 8.176836 6.480797
5 10.0 1.039676 0.922470 1.449066 1.439612 0.000365 -1.214347e-05 0.000365 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 22.078995 1.384930 8.154151 6.466391
6 12.0 1.039550 0.922519 1.449138 1.439477 0.000381 -2.126330e-05 0.000382 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 22.027338 1.381152 8.131466 6.451984
7 14.0 1.039417 0.922559 1.449166 1.439293 0.000391 -3.163325e-05 0.000392 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 21.975680 1.377373 8.108782 6.437578
8 16.0 1.039277 0.922589 1.449157 1.439068 0.000396 -4.265618e-05 0.000398 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 21.924022 1.373594 8.086097 6.423172
9 18.0 1.039130 0.922607 1.449115 1.438808 0.000397 -5.383815e-05 0.000401 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 21.872364 1.369816 8.063413 6.408766
10 20.0 1.038976 0.922613 1.449046 1.438519 0.000395 -6.478160e-05 0.000400 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 21.820707 1.366037 8.040728 6.394359
11 22.0 1.038816 0.922605 1.448954 1.438206 0.000390 -7.517861e-05 0.000397 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 21.769049 1.362258 8.018043 6.379953
12 24.0 1.038648 0.922583 1.448844 1.437876 0.000383 -8.480413e-05 0.000392 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 21.717391 1.358479 7.995359 6.365547
13 26.0 1.038475 0.922548 1.448720 1.437532 0.000374 -9.350916e-05 0.000386 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 21.665734 1.354701 7.972674 6.351140
14 28.0 1.038295 0.922499 1.448586 1.437179 0.000365 -1.012140e-04 0.000378 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 21.614076 1.350922 7.949989 6.336734
15 30.0 1.038111 0.922437 1.448445 1.436821 0.000354 -1.079016e-04 0.000370 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 21.562418 1.347143 7.927305 6.322328
16 32.0 1.037923 0.922362 1.448301 1.436461 0.000343 -1.136105e-04 0.000361 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 21.510761 1.343364 7.904620 6.307922
17 34.0 1.037731 0.922276 1.448155 1.436101 0.000332 -1.184283e-04 0.000352 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 21.459103 1.339586 7.881936 6.293515
18 36.0 1.037537 0.922179 1.448012 1.435745 0.000321 -1.224848e-04 0.000343 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 21.407445 1.335807 7.859251 6.279109
19 38.0 1.037341 0.922072 1.447872 1.435395 0.000309 -1.259452e-04 0.000334 0_1_10_p1 p1_auto_geluiden_corrected ... NaN NaN NaN NaN NaN NaN 21.355788 1.332028 7.836566 6.264703

20 rows × 528 columns


And here we visualize some of the time series. We can also see that our interpolation maintained the original discontinuities in the f0 signal.
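
A sketch of such a visualization (column names as constructed above; the trial ID is taken from the example file):

# plot a few of the merged time series for one trial
sample = pd.read_csv(TSmerged + '/merged_0_1_10_p1.csv')

fig, axes = plt.subplots(3, 1, figsize=(10, 6), sharex=True)
axes[0].plot(sample['time'], sample['envelope'])
axes[0].set_ylabel('envelope')
axes[1].plot(sample['time'], sample['f0'])
axes[1].set_ylabel('f0 (Hz)')
axes[2].plot(sample['time'], sample['COPc'])
axes[2].set_ylabel('COPc')
axes[2].set_xlabel('time (ms)')
plt.tight_layout()
plt.show()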