Movement annotation III: Computing interrater agreement between manual and automatic annotation
Overview
In this script, we prepare data to test the interrater agreement (IA) on movement annotation. To test the robustness, we compute interrater agreement between two human annotators (AC, GR) and between each human annotator and the automatic annotations created in the previous script. We compute IA for each tier separately.
We use EasyDIAG (Holle and Rein 2015) to compute the IA, and document the results here.
Code to prepare the environment
import os
import glob
import numpy as np
import pandas as pd
import xml.etree.ElementTree as ET

curfolder = os.getcwd()

# Here we store our merged processed files
processedfolder = os.path.join(curfolder + '\\..\\03_TS_processing\\TS_merged\\')
processedfiles = glob.glob(processedfolder + '*.csv')

# Here we store annotations from the logreg model
annotatedfolder = os.path.join(curfolder + '\\TS_annotated_logreg\\')
folders = glob.glob(annotatedfolder + '*\\')
folders60 = [x for x in folders if '0_6' in x]  # 60 percent confidence
folders80 = [x for x in folders if '0_8' in x]  # 80 percent confidence

# Here we store manual annotations from R1 (AC)
manualfolder1 = os.path.join(curfolder + '\\ManualAnno\\R1\\')
manualfiles1 = glob.glob(manualfolder1 + '*.eaf')
manualfiles1 = [x for x in manualfiles1 if 'ELAN_tiers' in x]

# Here we store manual annotations from R2 (GR)
manualfolder2 = os.path.join(curfolder + '\\ManualAnno\\R3\\')
manualfiles2 = glob.glob(manualfolder2 + '*.eaf')
manualfiles2 = [x for x in manualfiles2 if 'ELAN_tiers' in x]

# Here we store the txt files we need for EasyDIAG
interfolder = curfolder + '\\InterAg\\'
Preprocessing annotations
Now we need to get both the manual and the automatic annotations into the format that EasyDIAG requires: simple .txt files with timestamps and annotation values. For annotations created by the human annotators, we extract the timestamps and values from the .eaf files.
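Concretely, each line in these txt files describes one annotation as five tab-separated fields: rater label, start time (ms), end time (ms), annotation value, and source file name. A minimal sketch of one such line (the values correspond to the first row of the example output shown further below):

# five tab-separated fields: rater label, start (ms), end (ms), value, source .eaf file
example_line = "Anno_R1\t0\t3116\tnomovement\t0_1_11_p1_ELAN_tiers.eaf\n"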
Custom functions
# Function to parse ELAN file
def parse_eaf_file(eaf_file, rel_tiers):
    tree = ET.parse(eaf_file)
    root = tree.getroot()
    time_order = root.find('TIME_ORDER')
    time_slots = {time_slot.attrib['TIME_SLOT_ID']: time_slot.attrib['TIME_VALUE']
                  for time_slot in time_order}
    annotations = []
    relevant_tiers = {rel_tiers}
    for tier in root.findall('TIER'):
        tier_id = tier.attrib['TIER_ID']
        if tier_id in relevant_tiers:
            for annotation in tier.findall('ANNOTATION/ALIGNABLE_ANNOTATION'):
                # Ensure required attributes are present
                if 'TIME_SLOT_REF1' in annotation.attrib and 'TIME_SLOT_REF2' in annotation.attrib:
                    ts_ref1 = annotation.attrib['TIME_SLOT_REF1']
                    ts_ref2 = annotation.attrib['TIME_SLOT_REF2']
                    # Get annotation ID if it exists, otherwise set to None
                    ann_id = annotation.attrib.get('ANNOTATION_ID', None)
                    annotation_value = annotation.find('ANNOTATION_VALUE').text.strip()
                    annotations.append({
                        'tier_id': tier_id,
                        'annotation_id': ann_id,
                        'start_time': time_slots[ts_ref1],
                        'end_time': time_slots[ts_ref2],
                        'annotation_value': annotation_value
                    })
    return annotations

# Function to write ELAN annotations into a txt file
def ELAN_into_txt(txtfile, raterID, foi, tier):
    with open(txtfile, 'w') as f:
        for file in foi:
            print('working on ' + file)
            # Filename
            filename = file.split('\\')[-1]
            # Parse ELAN file
            annotations = parse_eaf_file(file, tier)
            # Write annotations into txt file
            for annotation in annotations:
                f.write(f"Anno_{raterID}\t{annotation['start_time']}\t{annotation['end_time']}\t{annotation['annotation_value']}\t{filename}\n")
foi = manualfiles2  # here we store manual annotations that we want to convert into txt files
raterIDfile = 'R3'  # this is the rater as we name it in the txt files
raterID = 'R2'  # this is the ID we need for EasyDIAG (the software always needs R1 and R2)

# These are the files we want to create
txtfile_head = interfolder + raterIDfile + '_Manual_head.txt'
txtfile_upper = interfolder + raterIDfile + '_Manual_upper.txt'
# we add _2 for files where manual annotator 1 is R1, because we also want to compare with manual annotator 2 (R3)
txtfile_lower = interfolder + raterIDfile + '_Manual_lower.txt'
txtfile_arms = interfolder + raterIDfile + '_Manual_arms.txt'

# For each tier, extract the annotations from the ELAN file and save them in a txt file
ELAN_into_txt(txtfile_head, raterID, foi, 'head_mov')
ELAN_into_txt(txtfile_upper, raterID, foi, 'upper_body')
ELAN_into_txt(txtfile_lower, raterID, foi, 'lower_body')
ELAN_into_txt(txtfile_arms, raterID, foi, 'arms')
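The cell above is re-run for each rater/labelling combination, changing only the configuration at the top. A sketch of the remaining runs, inferred from the file names that are read in the merging step further below (the '_2' suffix marks the version in which AC carries the EasyDIAG label R1):

# Re-run for AC (R1), labelled R2, for the comparisons against the automatic annotations
foi = manualfiles1
raterIDfile = 'R1'
raterID = 'R2'
txtfile_head = interfolder + raterIDfile + '_Manual_head.txt'
txtfile_upper = interfolder + raterIDfile + '_Manual_upper.txt'
txtfile_lower = interfolder + raterIDfile + '_Manual_lower.txt'
txtfile_arms = interfolder + raterIDfile + '_Manual_arms.txt'
ELAN_into_txt(txtfile_head, raterID, foi, 'head_mov')
ELAN_into_txt(txtfile_upper, raterID, foi, 'upper_body')
ELAN_into_txt(txtfile_lower, raterID, foi, 'lower_body')
ELAN_into_txt(txtfile_arms, raterID, foi, 'arms')

# And once more for AC labelled R1, with a '_2' suffix, for the human-vs-human comparison (AC as R1, GR as R2)
raterID = 'R1'
ELAN_into_txt(interfolder + 'R1_Manual_head_2.txt', raterID, foi, 'head_mov')
ELAN_into_txt(interfolder + 'R1_Manual_upper_2.txt', raterID, foi, 'upper_body')
ELAN_into_txt(interfolder + 'R1_Manual_lower_2.txt', raterID, foi, 'lower_body')
ELAN_into_txt(interfolder + 'R1_Manual_arms_2.txt', raterID, foi, 'arms')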
This is what the resulting files look like:
|    | rater   | start (ms) | end (ms) | value      | file                     |
|---:|---------|-----------:|---------:|------------|--------------------------|
|  0 | Anno_R1 |          0 |     3116 | nomovement | 0_1_11_p1_ELAN_tiers.eaf |
|  1 | Anno_R1 |          0 |     3629 | nomovement | 0_1_12_p1_ELAN_tiers.eaf |
|  2 | Anno_R1 |          0 |     3388 | nomovement | 0_1_13_p1_ELAN_tiers.eaf |
|  3 | Anno_R1 |          0 |     5120 | nomovement | 0_1_14_p1_ELAN_tiers.eaf |
|  4 | Anno_R1 |          0 |     3978 | nomovement | 0_1_15_p1_ELAN_tiers.eaf |
|  5 | Anno_R1 |       1620 |     1730 | movement   | 0_1_16_p1_ELAN_tiers.eaf |
|  6 | Anno_R1 |          0 |     1620 | nomovement | 0_1_16_p1_ELAN_tiers.eaf |
|  7 | Anno_R1 |       1730 |     3524 | nomovement | 0_1_16_p1_ELAN_tiers.eaf |
|  8 | Anno_R1 |       1650 |     3610 | movement   | 0_1_17_p1_ELAN_tiers.eaf |
|  9 | Anno_R1 |          0 |     1650 | nomovement | 0_1_17_p1_ELAN_tiers.eaf |
| 10 | Anno_R1 |       3610 |     4263 | nomovement | 0_1_17_p1_ELAN_tiers.eaf |
| 11 | Anno_R1 |        930 |     3450 | movement   | 0_1_20_p0_ELAN_tiers.eaf |
| 12 | Anno_R1 |          0 |      930 | nomovement | 0_1_20_p0_ELAN_tiers.eaf |
| 13 | Anno_R1 |       3450 |     3881 | nomovement | 0_1_20_p0_ELAN_tiers.eaf |
| 14 | Anno_R1 |          0 |     3595 | nomovement | 0_1_21_p0_ELAN_tiers.eaf |
For the automatic annotations, we need to extract the timestamps and values from the .csv files. Before doing that, we need to handle two issues that stem from the fact that the classifier can create flickering annotations, as the confidence values continuously vary throughout each trial.
Similarly to Pouw et al. (2021), we apply two rules to handle this flickering:
- Rule 1: If a nomovement event between two movement events is shorter than 200 ms, it is considered part of the movement event.
- Rule 2: If a movement event between two nomovement events is shorter than 200 ms, it is considered part of the nomovement event.
Afterwards, we take the first movement event and the very last movement event, and consider everything in between as a movement.
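To make these rules concrete, here is a minimal standalone sketch on toy data (the 50 ms sampling step is assumed for illustration only; the actual processing below operates on the logreg output files and uses the get_chunks() helper defined next). Rule 2 works the same way with the two labels swapped.

import pandas as pd

# toy annotation track sampled every 50 ms: a short 'no movement' stretch flanked by 'movement'
toy = pd.DataFrame({
    'time_ms':     [0, 50, 100, 150, 200, 250, 300, 350, 400],
    'anno_values': ['movement'] * 4 + ['no movement'] * 3 + ['movement'] * 2,
})

# label runs of identical values and summarise their start/end times
toy['chunk'] = (toy['anno_values'] != toy['anno_values'].shift()).cumsum()
runs = toy.groupby('chunk').agg(value=('anno_values', 'first'),
                                start=('time_ms', 'first'),
                                end=('time_ms', 'last')).reset_index()

# Rule 1: a 'no movement' run shorter than 200 ms between two 'movement' runs becomes movement
for i in range(1, len(runs) - 1):
    too_short = (runs.loc[i, 'end'] - runs.loc[i, 'start']) < 200
    if (runs.loc[i, 'value'] == 'no movement' and too_short
            and runs.loc[i - 1, 'value'] == 'movement'
            and runs.loc[i + 1, 'value'] == 'movement'):
        toy.loc[toy['chunk'] == runs.loc[i, 'chunk'], 'anno_values'] = 'movement'

print(toy['anno_values'].unique())  # ['movement']: the short gap has been absorbed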
Custom functions
# Function to get chunks of annotations
def get_chunks(anno_df):
    anno_df['chunk'] = (anno_df['anno_values'] != anno_df['anno_values'].shift()).cumsum()
    anno_df['idx'] = anno_df.index
    # Calculate start and end of each chunk, grouped by anno_values, save also the first and last index
    chunks = anno_df.groupby(['anno_values', 'chunk']).agg(
        time_ms_min=('time_ms', 'first'),
        time_ms_max=('time_ms', 'last'),
        idx_min=('idx', 'first'),
        idx_max=('idx', 'last')
    ).reset_index()
    # Order the chunks
    chunks = chunks.sort_values('idx_min').reset_index(drop=True)
    return chunks
foi = folders80  # set which folder (threshold) you want to process
threshold = '80'  # set the threshold

for folder in foi:
    # get tierID
    tier = folder.split('\\')[-2].split('_')[0]
    if tier == 'head':
        tier = 'head'
    elif tier == 'upperBody':
        tier = 'upper'
    elif tier == 'lowerBody':
        tier = 'lower'

    # This is the file we want to create
    txtfile = interfolder + 'AutoAnno_' + tier + '_' + threshold + '.txt'

    # List all files in the folder
    files = glob.glob(folder + '*.csv')

    for file in files:
        print('processing: ' + file)

        # Filename
        filename = file.split('\\')[-1].split('.')[0]
        filename = filename.split('_')[2:6]
        filename = '_'.join(filename)

        # Check if we have a manual file matching this file, otherwise skip
        manualfile = [x for x in manualfiles1 if filename in x]
        if len(manualfile) == 0:
            continue

        # Now we process the annotations made by the logreg model
        anno_df = pd.read_csv(file)

        # Chunk the df to see unique annotated chunks
        chunks = get_chunks(anno_df)

        # Check for fake pauses (i.e., nomovement annotations that last for less than 200 ms)
        for i in range(1, len(chunks) - 1):
            if chunks.loc[i, 'anno_values'] == 'no movement' and chunks.loc[i-1, 'anno_values'] == 'movement' and chunks.loc[i+1, 'anno_values'] == 'movement':
                if chunks.loc[i, 'time_ms_max'] - chunks.loc[i, 'time_ms_min'] < 200:
                    print('found a chunk of no movement between two movement chunks that is shorter than 200 ms')
                    # Change the chunk into movement
                    anno_df.loc[chunks.loc[i, 'idx_min']:chunks.loc[i, 'idx_max'], 'anno_values'] = 'movement'

        # Calculate new chunks
        chunks = get_chunks(anno_df)

        # Now check for fake movement (i.e., movement chunks that are shorter than 200 ms)
        for i in range(1, len(chunks) - 1):
            if chunks.loc[i, 'anno_values'] == 'movement' and chunks.loc[i-1, 'anno_values'] == 'no movement' and chunks.loc[i+1, 'anno_values'] == 'no movement':
                if chunks.loc[i, 'time_ms_max'] - chunks.loc[i, 'time_ms_min'] < 200:
                    print('found a chunk of movement between two no movement chunks that is shorter than 200 ms')
                    # Change the chunk to no movement in the original df
                    anno_df.loc[chunks.loc[i, 'idx_min']:chunks.loc[i, 'idx_max'], 'anno_values'] = 'no movement'

        # Now, similarly to our human annotators, we consider movement anything from the very first movement to the very last movement
        if 'movement' in anno_df['anno_values'].unique():
            # Get the first and last index of movement
            first_idx = anno_df[anno_df['anno_values'] == 'movement'].index[0]
            last_idx = anno_df[anno_df['anno_values'] == 'movement'].index[-1]
            # Change everything in between to movement
            anno_df.loc[first_idx:last_idx, 'anno_values'] = 'movement'

        # Calculate new chunks
        chunks = get_chunks(anno_df)

        # Rewrite "no movement" in anno_values to "nomovement" (to match the manual annotations)
        chunks['anno_values'] = chunks['anno_values'].apply(
            lambda x: 'nomovement' if x == 'no movement' else x
        )

        # Add elanID to chunks (to match the manual annotations in EasyDIAG)
        chunks['elanID'] = str(filename + '_ELAN_tiers.eaf')

        # Write to the text file
        with open(txtfile, 'a') as f:
            for _, row in chunks.iterrows():
                f.write(
                    f"Anno_R1\t{row['time_ms_min']}\t{row['time_ms_max']}\t{row['anno_values']}\t{row['elanID']}\n"
                )
Creating txt files for EasyDIAG
EasyDIAG requires a txt file that contains all annotations of a tier from both annotators we wish to compare. We therefore need to merge the files we have created above into one file for each tier.
(Note that it is better to delete old txt files before re-running than to rely on overwriting them: the automatic-annotation files are written in append mode, so leftover content from a previous run can end up in the merged files and produce messy agreement values.)
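A minimal clean-up sketch, assuming that every txt file in InterAg is an intermediate file that the cells in this script can regenerate:

# Optional clean-up before re-running: remove previously generated txt files
for old_file in glob.glob(interfolder + '*.txt'):
    os.remove(old_file)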
# eval: false

# These are the tiers we want to compare
toi = ['arms', 'head', 'upper', 'lower']

# We want to compare
## auto60 with R1
## auto80 with R1
## auto60 with R3
## auto80 with R3
## r1_2 with r3
# For us, R1 is the manual annotator, R3 is the second manual annotator, R2 is the automatic annotator
# But note that in the txt files the manual annotator is always R2, and the automatic annotator is always R1

comp1 = 'R3'  # change here who you want to compare
comp2 = 'R1'  # with whom

# Add a suffix if necessary
adding = 'manual'

for tier in toi:
    print('working on ' + tier)
    txtfile_auto60 = interfolder + 'AutoAnno_' + tier + '_60.txt'  # this is the automatic annotator with threshold 60
    txtfile_auto80 = interfolder + 'AutoAnno_' + tier + '_80.txt'  # this is the automatic annotator with threshold 80
    txtfile_manual_r1 = interfolder + 'R1_Manual_' + tier + '.txt'  # this is manual annotator (AC) as R2
    txtfile_manual_r3 = interfolder + 'R3_Manual_' + tier + '.txt'  # this is manual annotator (GR) as R2
    txtfile_manual_r1_2 = interfolder + 'R1_Manual_' + tier + '_2.txt'  # this is manual annotator (AC) as R1

    # Read in the files we want to compare
    r1_anno = pd.read_csv(txtfile_manual_r3, sep='\t', header=None)  # change here who you want to compare
    r2_anno = pd.read_csv(txtfile_manual_r1_2, sep='\t', header=None)  # with whom

    # Check that both files cover the same recordings (EasyDIAG will silently lower the agreement for any mismatch)
    files_to_check_r1 = r1_anno[4].unique()
    files_to_check_r2 = r2_anno[4].unique()
    files_to_check = list(set(files_to_check_r1) & set(files_to_check_r2))

    # Adapt both: keep only recordings annotated by both raters
    rows_auto = r1_anno[r1_anno[4].isin(files_to_check)]
    rows_manual = r2_anno[r2_anno[4].isin(files_to_check)]

    # And concatenate the filtered rows
    concat_rows = pd.concat([rows_auto, rows_manual])

    # Save as new file
    txtfile_IA = interfolder + 'IA_' + comp1 + '_' + comp2 + '_' + tier + '_' + adding + '.txt'  # adapt the threshold based on what you work with
    with open(txtfile_IA, 'w') as f:
        for index, row in concat_rows.iterrows():
            f.write(f"{row[0]}\t{row[1]}\t{row[2]}\t{row[3]}\t{row[4]}\n")
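As an optional sanity check (not part of the original pipeline), one can verify that each merged file contains both rater labels and that both raters cover the same set of recordings:

# Optional check on the merged IA files
for tier in toi:
    txtfile_IA = interfolder + 'IA_' + comp1 + '_' + comp2 + '_' + tier + '_' + adding + '.txt'
    merged = pd.read_csv(txtfile_IA, sep='\t', header=None)
    raters = sorted(merged[0].unique())
    files_r1 = set(merged.loc[merged[0] == 'Anno_R1', 4])
    files_r2 = set(merged.loc[merged[0] == 'Anno_R2', 4])
    print(tier, raters, 'same recordings for both raters:', files_r1 == files_r2)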
Interrater agreement: results
Here we report the raw agreements together with kappa coefficients for the interrater agreement between the manual annotators (R1, R3) and the automatic annotations (with thresholds 60 and 80). Interrater agreement was computed using EasyDIAG (Holle and Rein 2015), and the results were saved as txt files, from which we now extract the relevant information to report in the table. The overlap criterion was kept at its default value of 60% for all tiers.
def extract_ia(lines):
    # Extracting values
    linked_units = None
    raw_agreement = None
    kappa = None
    inside_section_2 = False  # Flag to track section 2

    for line in lines:
        if "Percentage of linked units:" in line:
            inside_section_2 = False  # Ensure we don't mistakenly extract from other parts
        elif "linked" in line and "=" in line:
            linked_units = float(line.split("=")[-1].strip().replace("%", ""))  # Extract linked %
        elif "2) Overall agreement indicies (including no match):" in line:
            inside_section_2 = True  # Activate flag when entering section 2
        elif inside_section_2:
            if "Raw agreement" in line and "=" in line:
                raw_agreement = float(line.split("=")[-1].strip())  # Extract correct raw agreement
            elif "kappa " in line and "=" in line:
                kappa = float(line.split("=")[-1].strip())  # Extract correct kappa
            elif "3)" in line:
                inside_section_2 = False  # Stop when reaching section 3

    return linked_units, raw_agreement, kappa
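A sketch of how extract_ia() would be used to build the summary table; the results file name pattern used here is hypothetical and needs to match however the EasyDIAG output was actually saved:

# Hypothetical usage: read the saved EasyDIAG results txt files and collect the values per tier
# (the pattern 'Results_R1_R2_<tier>.txt' is an assumption; adapt as needed)
summary = []
for tier in ['arms', 'head', 'upper', 'lower']:
    results_file = interfolder + 'Results_R1_R2_' + tier + '.txt'  # hypothetical name
    if not os.path.exists(results_file):
        continue
    with open(results_file) as f:
        lines = f.readlines()
    linked_units, raw_agreement, kappa = extract_ia(lines)
    summary.append({'tier': tier, 'linked_%': linked_units,
                    'raw_agreement': raw_agreement, 'kappa': kappa})

summary_df = pd.DataFrame(summary)
print(summary_df)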
- The 60% and 80% thresholds for the automatic annotations yield similar results, both in terms of raw agreement and kappa coefficient.
- For arms, the interrater agreement between the automatic annotation and manual annotator R1 reaches a kappa coefficient of 0.65, which is considered substantial agreement (0.61–0.80; Landis and Koch 1977).
- For upper body and lower body, the kappa signifies moderate agreement, but the same drop is visible in the interrater agreement between the manual annotators.
- For head, we see only fair agreement, both between the manual annotators and between the automatic annotation and the manual annotators.
Generally, interrater agreement between manual annotator R1 and automatic annotation is comparable to the agreement between the two human annotators across all tiers. This suggests that the automatic annotation is a reliable tool for movement annotation, especially for arms. It seems like head is the most difficult tier to annotate, which is also reflected in the interrater agreement between manual annotators.
To improve its predictions for head, upper body, and lower body, and to avoid the risk of overfitting the model to a specific type of behaviour generated by the individuals in dyad 0, we will extend the training data by annotating 10% of behaviour per participant per dyad before the final analysis. If kappa does not reach at least substantial agreement (k = 0.61), we will annotate a larger portion of the data.
In the next script, we will work with the 60% threshold to annotate all the data.
References
Holle, Henning, and Robert Rein. 2015. “EasyDIAg: A Tool for Easy Determination of Interrater Agreement.” Behavior Research Methods 47 (3): 837–47. https://doi.org/10.3758/s13428-014-0506-7.
Landis, J. R., and G. G. Koch. 1977. “The Measurement of Observer Agreement for Categorical Data.” Biometrics 33 (1): 159–74. https://pubmed.ncbi.nlm.nih.gov/843571/.
Pouw, Wim, Jan de Wit, Sara Bögels, Marlou Rasenberg, Branka Milivojevic, and Asli Ozyurek. 2021. “Semantically Related Gestures Move Alike: Towards a Distributional Semantics of Gesture Kinematics.” In Digital Human Modeling and Applications in Health, Safety, Ergonomics and Risk Management. Human Body, Motion and Behavior, edited by Vincent G. Duffy, 269–87. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-77817-0_20.