Movement annotation III: Computing interrater agreement between manual and automatic annotation

Overview

In this script, we prepare data to test the interrater agreement (IA) on movement annotation. To test the robustness, we compute interrater agreement between two human annotators (AC, GR) and between each human annotator and the automatic annotations created in the previous script. We compute IA for each tier separately.

We use EasyDIAG (Holle and Rein 2015) to compute the IA and report the results in the table at the end of this script.

Code to prepare the environment

import os
import glob
import numpy as np
import pandas as pd
import xml.etree.ElementTree as ET

curfolder = os.getcwd()

# Here we store our merged processed files
processedfolder = os.path.join(curfolder + '\\..\\03_TS_processing\\TS_merged\\')
processedfiles = glob.glob(processedfolder + '*.csv')

# Here we store annotations from the logreg model
annotatedfolder = os.path.join(curfolder + '\\TS_annotated_logreg\\')
folders = glob.glob(annotatedfolder + '*\\')

folders60 = [x for x in folders if '0_6' in x] #60percent confidence
folders80 = [x for x in folders if '0_8' in x] #80percent confidence

# Here we store manual annotations from R1 (AC)
manualfolder1 = os.path.join(curfolder + '\\ManualAnno\\R1\\')
manualfiles1 = glob.glob(manualfolder1 + '*.eaf')
manualfiles1 = [x for x in manualfiles1 if 'ELAN_tiers' in x]

# Here we store manual annotations from R3 (GR)
manualfolder2 = os.path.join(curfolder + '\\ManualAnno\\R3\\')
manualfiles2 = glob.glob(manualfolder2 + '*.eaf')
manualfiles2 = [x for x in manualfiles2 if 'ELAN_tiers' in x]

# Here we store the txt files we need for EasyDIAG
interfolder = curfolder + '\\InterAg\\'

Preprocessing annotations

Now we need to get both manual and automatic annotations into the format that EasyDIAG requires: simple .txt files with timestamps and annotation values. For annotations created by human annotators, we need to extract the timestamps and values from the .eaf files.
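As a rough sketch of what this extraction involves, the snippet below parses a tiny hand-written EAF-like XML string and builds lines with five tab-separated fields (rater, start, end, value, file name). The XML content, the rater label, and the file name `toy_ELAN_tiers.eaf` are made up for illustration; real .eaf files contain many more elements.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written EAF-like document (hypothetical content, for illustration only)
toy_eaf = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1620"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="1730"/>
  </TIME_ORDER>
  <TIER TIER_ID="arms">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>nomovement</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>movement</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

root = ET.fromstring(toy_eaf)

# Map each time slot ID to its time value (in ms)
slots = {ts.attrib['TIME_SLOT_ID']: ts.attrib['TIME_VALUE'] for ts in root.find('TIME_ORDER')}

toy_lines = []
for tier in root.findall('TIER'):
    for ann in tier.findall('ANNOTATION/ALIGNABLE_ANNOTATION'):
        start = slots[ann.attrib['TIME_SLOT_REF1']]
        end = slots[ann.attrib['TIME_SLOT_REF2']]
        # An empty annotation has text None, so fall back to an empty string
        value = (ann.find('ANNOTATION_VALUE').text or '').strip()
        toy_lines.append(f"Anno_R1\t{start}\t{end}\t{value}\ttoy_ELAN_tiers.eaf")

for line in toy_lines:
    print(line)
```

The time values live in the `TIME_ORDER` block and annotations only reference them by ID, which is why the slot lookup has to be built first.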

Custom functions

# Function to parse ELAN file
def parse_eaf_file(eaf_file, rel_tiers):
    tree = ET.parse(eaf_file)
    root = tree.getroot()

    time_order = root.find('TIME_ORDER')
    time_slots = {time_slot.attrib['TIME_SLOT_ID']: time_slot.attrib['TIME_VALUE'] for time_slot in time_order}

    annotations = []
    relevant_tiers = {rel_tiers}
    for tier in root.findall('TIER'):
        tier_id = tier.attrib['TIER_ID']
        if tier_id in relevant_tiers:
            for annotation in tier.findall('ANNOTATION/ALIGNABLE_ANNOTATION'):
                # Ensure required attributes are present
                if 'TIME_SLOT_REF1' in annotation.attrib and 'TIME_SLOT_REF2' in annotation.attrib:
                    ts_ref1 = annotation.attrib['TIME_SLOT_REF1']
                    ts_ref2 = annotation.attrib['TIME_SLOT_REF2']
                    # Get annotation ID if it exists, otherwise set to None
                    ann_id = annotation.attrib.get('ANNOTATION_ID', None)
                    # Empty annotations have no text (None), so fall back to an empty string
                    annotation_value = (annotation.find('ANNOTATION_VALUE').text or '').strip()
                    annotations.append({
                        'tier_id': tier_id,
                        'annotation_id': ann_id,
                        'start_time': time_slots[ts_ref1],
                        'end_time': time_slots[ts_ref2],
                        'annotation_value': annotation_value
                    })

    return annotations

# Function to write ELAN into txt file
def ELAN_into_txt(txtfile, raterID, foi, tier):
    with open(txtfile, 'w') as f:
        for file in foi:
            print('working on ' + file)
            # Filename
            filename = file.split('\\')[-1]
            # Parse ELAN file
            annotations = parse_eaf_file(file, tier)
            # Write annotations into txt file
            for annotation in annotations:
                f.write(f"Anno_{raterID}\t{annotation['start_time']}\t{annotation['end_time']}\t{annotation['annotation_value']}\t{filename}\n")

foi = manualfiles2  # here we store manual annotations that we want to convert into txt files
raterIDfile = 'R3'  # this is the rater as we name it in the txt files
raterID = 'R2'      # this is the ID we need for EasyDIAG (the software always needs R1 and R2)

# These are the files we want to create
# (when manual annotator 1 is exported as R1, add '_2' to the file names, e.g. 'R1_Manual_head_2.txt',
# because we also want to compare that annotator with manual annotator 2, R3)
txtfile_head = interfolder + raterIDfile + '_Manual_head.txt'
txtfile_upper = interfolder + raterIDfile + '_Manual_upper.txt'
txtfile_lower = interfolder + raterIDfile + '_Manual_lower.txt'
txtfile_arms = interfolder + raterIDfile + '_Manual_arms.txt'

# For each tier, extract the annotations from ELAN file and save them in a txt file
ELAN_into_txt(txtfile_head, raterID, foi, 'head_mov')
ELAN_into_txt(txtfile_upper, raterID, foi, 'upper_body')
ELAN_into_txt(txtfile_lower, raterID, foi, 'lower_body')
ELAN_into_txt(txtfile_arms, raterID, foi, 'arms')

This is what the files look like:

	0	1	2	3	4
0	Anno_R1	0	3116	nomovement	0_1_11_p1_ELAN_tiers.eaf
1	Anno_R1	0	3629	nomovement	0_1_12_p1_ELAN_tiers.eaf
2	Anno_R1	0	3388	nomovement	0_1_13_p1_ELAN_tiers.eaf
3	Anno_R1	0	5120	nomovement	0_1_14_p1_ELAN_tiers.eaf
4	Anno_R1	0	3978	nomovement	0_1_15_p1_ELAN_tiers.eaf
5	Anno_R1	1620	1730	movement	0_1_16_p1_ELAN_tiers.eaf
6	Anno_R1	0	1620	nomovement	0_1_16_p1_ELAN_tiers.eaf
7	Anno_R1	1730	3524	nomovement	0_1_16_p1_ELAN_tiers.eaf
8	Anno_R1	1650	3610	movement	0_1_17_p1_ELAN_tiers.eaf
9	Anno_R1	0	1650	nomovement	0_1_17_p1_ELAN_tiers.eaf
10	Anno_R1	3610	4263	nomovement	0_1_17_p1_ELAN_tiers.eaf
11	Anno_R1	930	3450	movement	0_1_20_p0_ELAN_tiers.eaf
12	Anno_R1	0	930	nomovement	0_1_20_p0_ELAN_tiers.eaf
13	Anno_R1	3450	3881	nomovement	0_1_20_p0_ELAN_tiers.eaf
14	Anno_R1	0	3595	nomovement	0_1_21_p0_ELAN_tiers.eaf

For automatic annotations, we need to extract the timestamps and values from the .csv files. Before doing that, we need to handle two issues that stem from the fact that the classifier can create flickering annotations, because the confidence values vary continuously throughout each trial.

Similarly to Pouw et al. (2021), we apply two rules to handle this flickering:

- Rule 1: If there is a nomovement event between two movement events that is shorter than 200 ms, this is considered as part of the movement event.
- Rule 2: If there is a movement event between two nomovement events that is shorter than 200 ms, this is considered as part of the nomovement event.

Afterwards, we take the first movement event and the very last movement event, and consider everything in between as a movement.
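The two rules and the final fill step can be sketched on toy data as follows. This is a pandas-free illustration, not the code used in this script: the helper names (`runs`, `absorb_short`, `fill_first_to_last`), the 50 ms sampling interval, and the label sequence are assumptions made for the example. As in the script, the duration of an event is taken as the difference between its last and first timestamp.

```python
from itertools import groupby

def runs(labels, times):
    """Collapse per-sample labels into [label, start_ms, end_ms, first_idx, last_idx] runs."""
    out, i = [], 0
    for label, grp in groupby(labels):
        n = len(list(grp))
        out.append([label, times[i], times[i + n - 1], i, i + n - 1])
        i += n
    return out

def absorb_short(labels, times, target, min_ms=200):
    """Relabel interior runs of `target` shorter than min_ms with the neighbouring label."""
    labels = list(labels)
    r = runs(labels, times)
    for k in range(1, len(r) - 1):        # first and last run have only one neighbour
        label, start, end, i0, i1 = r[k]
        if label == target and end - start < min_ms:
            other = r[k - 1][0]           # with two labels, both neighbours carry the other one
            labels[i0:i1 + 1] = [other] * (i1 - i0 + 1)
    return labels

def fill_first_to_last(labels, target='movement'):
    """Label everything between the first and the last `target` sample as `target`."""
    labels = list(labels)
    idx = [i for i, lab in enumerate(labels) if lab == target]
    if idx:
        labels[idx[0]:idx[-1] + 1] = [target] * (idx[-1] - idx[0] + 1)
    return labels

# Toy trial: 20 samples, one every 50 ms (hypothetical values)
times = list(range(0, 1000, 50))
labels = (['no movement'] * 2 + ['movement'] * 1 + ['no movement'] * 5 +
          ['movement'] * 5 + ['no movement'] * 2 + ['movement'] * 3 + ['no movement'] * 2)

smoothed = absorb_short(labels, times, target='no movement')   # Rule 1: fake pauses
smoothed = absorb_short(smoothed, times, target='movement')    # Rule 2: fake movements
smoothed = fill_first_to_last(smoothed)

events = [(lab, start, end) for lab, start, end, _, _ in runs(smoothed, times)]
print(events)
# → [('no movement', 0, 350), ('movement', 400, 850), ('no movement', 900, 950)]
```

In this toy trial the 50 ms pause between the two longer movement stretches is absorbed into the movement, and the isolated single-sample movement at 100 ms is removed, leaving one movement event.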

Custom functions

# Function to get chunks of annotations
def get_chunks(anno_df):
    anno_df['chunk'] = (anno_df['anno_values'] != anno_df['anno_values'].shift()).cumsum()
    anno_df['idx'] = anno_df.index

    # Calculate start and end of each chunk, grouped by anno_values, save also the first and last index
    chunks = anno_df.groupby(['anno_values', 'chunk']).agg(
        time_ms_min=('time_ms', 'first'),
        time_ms_max=('time_ms', 'last'),
        idx_min=('idx', 'first'),
        idx_max=('idx', 'last')
    ).reset_index()

    # Order the chunks
    chunks = chunks.sort_values('idx_min').reset_index(drop=True)

    return chunks

foi = folders80 # set which folder (threshold) you want to process
threshold = '80' # set the threshold

for folder in foi:
    # get tierID
    tier = folder.split('\\')[-2].split('_')[0]

    if tier == 'head':
        tier = 'head'
    elif tier == 'upperBody':
        tier = 'upper'
    elif tier == 'lowerBody':
        tier = 'lower'

    # This is the file we want to create
    txtfile = interfolder + 'AutoAnno_' + tier + '_' + threshold + '.txt'

    # List all files in the folder
    files = glob.glob(folder + '*.csv')

    for file in files:
        print('processing: ' + file)

        # Filename
        filename = file.split('\\')[-1].split('.')[0]
        filename = filename.split('_')[2:6]
        filename = '_'.join(filename)

        # Check if we have manual file matching to this file, otherwise skip
        manualfile = [x for x in manualfiles1 if filename in x]
        if len(manualfile) == 0:
            continue

        # Now we process the annotations made by the logreg model
        anno_df = pd.read_csv(file)

        # Chunk the df to see unique annotated chunks
        chunks = get_chunks(anno_df)

        # Check for fake pauses (i.e., no-movement chunks between two movement chunks that last less than 200 ms)
        for i in range(1, len(chunks)-1):
            if chunks.loc[i, 'anno_values'] == 'no movement' and chunks.loc[i-1, 'anno_values'] == 'movement' and chunks.loc[i+1, 'anno_values'] == 'movement':
                if chunks.loc[i, 'time_ms_max'] - chunks.loc[i, 'time_ms_min'] < 200:
                    print('found a chunk of no movement between two movement chunks that is shorter than 200 ms')
                    # Change the chunk into movement
                    anno_df.loc[chunks.loc[i, 'idx_min']:chunks.loc[i, 'idx_max'], 'anno_values'] = 'movement'

        # Calculate new chunks
        chunks = get_chunks(anno_df)

        # Now check for fake movement (i.e., movement chunk that is shorter than 200ms)
        for i in range(1, len(chunks)-1):
            if chunks.loc[i, 'anno_values'] == 'movement' and chunks.loc[i-1, 'anno_values'] == 'no movement' and chunks.loc[i+1, 'anno_values'] == 'no movement':
                if chunks.loc[i, 'time_ms_max'] - chunks.loc[i, 'time_ms_min'] < 200:
                    print('found a chunk of movement between two no movement chunks that is shorter than 200 ms')
                    # change the chunk to no movement in the original df
                    anno_df.loc[chunks.loc[i, 'idx_min']:chunks.loc[i, 'idx_max'], 'anno_values'] = 'no movement'

        
        # Now, similarly to our human annotators, we consider movement anything from the very first movement to the very last movement
        if 'movement' in anno_df['anno_values'].unique():
            # Get the first and last index of movement
            first_idx = anno_df[anno_df['anno_values'] == 'movement'].index[0]
            last_idx = anno_df[anno_df['anno_values'] == 'movement'].index[-1]
            # Change all between to movement
            anno_df.loc[first_idx:last_idx, 'anno_values'] = 'movement'

        # Calculate new chunks
        chunks = get_chunks(anno_df)

        # Rewrite "no movement" in anno_values to "nomovement" (to match the manual annotations)
        chunks['anno_values'] = chunks['anno_values'].apply(
            lambda x: 'nomovement' if x == 'no movement' else x
        )

        # Add elanID to chunks (to match the manual annotations in EasyDIAG)
        chunks['elanID']  = str(filename + '_ELAN_tiers.eaf')

        # Write to the text file
        with open(txtfile, 'a') as f:
            for _, row in chunks.iterrows():
                f.write(
                    f"Anno_R1\t{row['time_ms_min']}\t{row['time_ms_max']}\t{row['anno_values']}\t{row['elanID']}\n"
                )

Creating txt files for EasyDIAG

EasyDIAG requires a txt file that contains all annotations of a tier from both annotators we wish to compare. We therefore need to merge the files we have created above into one file for each tier.

(Note that it is better to delete old txt files before re-running the code than to rely on them being overwritten. The automatic annotation files above are opened in append mode, so re-running the loop adds duplicate rows, which distorts the agreement.)
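The merging logic can be illustrated without any files. The sketch below keeps only rows from files that both raters annotated and then stacks them; the helper name `keep_shared` and the toy rows are hypothetical, and each tuple mirrors the five tab-separated columns (rater, start, end, value, file name).

```python
def keep_shared(rows_a, rows_b):
    """Return rows of both raters, restricted to files present for both."""
    shared = {row[4] for row in rows_a} & {row[4] for row in rows_b}
    return [row for row in rows_a + rows_b if row[4] in shared]

# Toy rows (hypothetical values)
rater_a = [
    ("Anno_R1", 0, 3116, "nomovement", "0_1_11_p1_ELAN_tiers.eaf"),
    ("Anno_R1", 0, 3629, "nomovement", "0_1_12_p1_ELAN_tiers.eaf"),
]
rater_b = [
    ("Anno_R2", 0, 3100, "nomovement", "0_1_11_p1_ELAN_tiers.eaf"),
    ("Anno_R2", 0, 3388, "nomovement", "0_1_13_p1_ELAN_tiers.eaf"),
]

merged = keep_shared(rater_a, rater_b)

# Format the rows the same way as the txt files
merged_lines = ["\t".join(str(field) for field in row) for row in merged]
for line in merged_lines:
    print(line)
```

Only the file annotated by both raters survives, so a file missing for one rater cannot pull the agreement down.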

# eval: false

# These tiers we want to compare
toi = ['arms', 'head', 'upper', 'lower'] 

# We want to compare
## auto60 with R1  
## auto80 with R1 
## auto60 with R3  
## auto80 with R3 
# r1_2 with r3

# For us R1 is the manual annotator, R3 is second manual annotator, R2 is the automatic annotator
# But note that manual annotator is in the txt files always as R2, and automatic annotator is always R1 

comp1 = 'R3'    # change here who you want to compare
comp2 = 'R1'    # with whom

# Add a suffix to the output file name if necessary
adding = 'manual'

for tier in toi:
    print('working on ' + tier)

    txtfile_auto60 = interfolder + 'AutoAnno_' + tier + '_60.txt'        # this is the automatic annotator with threshold 60
    txtfile_auto80 = interfolder + 'AutoAnno_' + tier + '_80.txt'        # this is the automatic annotator with threshold 80
    txtfile_manual_r1 = interfolder + 'R1_Manual_' + tier + '.txt'  # this is manual annotator (AC) as R2
    txtfile_manual_r3 = interfolder + 'R3_Manual_' + tier + '.txt'  # this is manual annotator (GR) as R2
    txtfile_manual_r1_2 = interfolder + 'R1_Manual_' + tier + '_2.txt'  # this is manual annotator (AC) as R1

    # Read in the files we want to compare
    r1_anno = pd.read_csv(txtfile_manual_r3, sep='\t', header=None)         # change here who you want to compare
    r2_anno = pd.read_csv(txtfile_manual_r1_2, sep='\t', header=None)    # with whom
 
    # Keep only the files annotated by both raters (EasyDIAG does not flag a mismatch, it just reports lower agreement)
    files_to_check_r1 = r1_anno[4].unique()
    files_to_check_r2 = r2_anno[4].unique()
    files_to_check = list(set(files_to_check_r1) & set(files_to_check_r2))

    # Restrict both raters to the shared files
    rows_r1 = r1_anno[r1_anno[4].isin(files_to_check)]
    rows_r2 = r2_anno[r2_anno[4].isin(files_to_check)]

    # And concatenate the filtered rows (not the unfiltered data frames)
    concat_rows = pd.concat([rows_r1, rows_r2])

    # Save as new file
    txtfile_IA = interfolder + 'IA_' + comp1 + '_' + comp2 + '_' + tier + '_' + adding + '.txt' # adapt the threshold based on what you work with

    with open(txtfile_IA, 'w') as f:
        for index, row in concat_rows.iterrows():
            f.write(f"{row[0]}\t{row[1]}\t{row[2]}\t{row[3]}\t{row[4]}\n")

Interrater agreement: results

Here we report the raw agreement together with kappa coefficients for interrater agreement between the manual annotators (R1, R3) and the automatic annotations (with thresholds 60 and 80). Interrater agreement was computed using EasyDIAG (Holle and Rein 2015), and the results were saved in txt files. We now extract the relevant information to report in the table. The overlap criterion was kept at the default value of 60% for all tiers.
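For orientation, the generic definition of Cohen's kappa relates the raw (observed) agreement to the agreement expected by chance. This is the textbook formula, not necessarily the exact estimation procedure that EasyDIAG implements (see Holle and Rein 2015 for that); the symbols $p_o$ (observed agreement) and $p_e$ (chance agreement) are introduced here only for illustration.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

Kappa therefore equals 1 under perfect agreement and 0 when the observed agreement is no better than chance, which is why it is always lower than or equal to the raw agreement reported next to it.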

def extract_ia(lines):
    # Extracting values
    linked_units = None
    raw_agreement = None
    kappa = None

    inside_section_2 = False  # Flag to track section 2

    for line in lines:
        if "Percentage of linked units:" in line:
            inside_section_2 = False  # Ensure we don't mistakenly extract from other parts
        elif "linked" in line and "=" in line:
            linked_units = float(line.split("=")[-1].strip().replace("%", ""))  # Extract linked %

        elif "2) Overall agreement indicies (including no match):" in line:
            inside_section_2 = True  # Activate flag when entering section 2

        elif inside_section_2:
            if "Raw agreement" in line and "=" in line:
                raw_agreement = float(line.split("=")[-1].strip())  # Extract correct raw agreement
            elif "kappa " in line and "=" in line:
                kappa = float(line.split("=")[-1].strip())  # Extract correct kappa
            elif "3)" in line:
                inside_section_2 = False  # Stop when reaching section 3

    return linked_units, raw_agreement, kappa

files = glob.glob(interfolder + '*EasyDIAG*')

IA_df = pd.DataFrame()

for txtfile in files:
    with open(txtfile, 'r') as f:
        lines = f.readlines()
    comparison = txtfile.split('\\')[-1].split('.')[0].split('_')[1]+'_'+txtfile.split('\\')[-1].split('_')[2]
    if 'Auto' in txtfile:
        tier = txtfile.split('\\')[-1].split('.')[0].split('_')[-2]+'_'+txtfile.split('\\')[-1].split('.')[0].split('_')[-1]
    else:
        tier = txtfile.split('\\')[-1].split('.')[0].split('_')[-1]


    linked_units, raw_agreement, kappa = extract_ia(lines)

    IA_row = pd.DataFrame({
        'comparison': comparison,
        'tier': tier,
        'linked_units': linked_units,
        'raw_agreement': raw_agreement,
        'kappa': kappa
    }, index=[0])

    IA_df = pd.concat([IA_df, IA_row])

comparison	tier	linked_units	raw_agreement	kappa
R1_Auto	arms_60	0.81	0.81	0.65
R1_Auto	arms_80	0.81	0.81	0.64
R1_Auto	head_60	0.60	0.56	0.32
R1_Auto	head_80	0.63	0.58	0.34
R1_Auto	lower_60	0.80	0.79	0.59
R1_Auto	lower_80	0.75	0.73	0.50
R1_Auto	upper_60	0.67	0.66	0.44
R1_Auto	upper_80	0.67	0.67	0.43
R3_Auto	arms_60	0.75	0.75	0.58
R3_Auto	arms_80	0.73	0.72	0.53
R3_Auto	head_60	0.62	0.60	0.39
R3_Auto	head_80	0.64	0.61	0.39
R3_Auto	lower_60	0.66	0.64	0.38
R3_Auto	lower_80	0.70	0.68	0.41
R3_Auto	upper_60	0.67	0.66	0.44
R3_Auto	upper_80	0.63	0.61	0.37
R1_R3	arms	0.85	0.84	0.70
R1_R3	head	0.67	0.62	0.38
R1_R3	lower	0.74	0.71	0.46
R1_R3	upper	0.70	0.68	0.44

From the table, we can observe a few things:

- The 60% and 80% thresholds for automatic annotation yield similar results, both in terms of raw agreement and kappa coefficient.
- For arms, interrater agreement between automatic annotation and manual annotator R1 results in a kappa coefficient of 0.65, which is considered substantial agreement (Landis and Koch 1977).
- For upper body and lower body, kappa between automatic annotation and R1 indicates moderate agreement, but we see the same drop in the agreement between the two manual annotators.
- For head, we see only fair agreement, both between the manual annotators and between automatic annotation and the manual annotators.
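The verbal labels used above follow the benchmark bands of Landis and Koch (1977). A small helper can make the mapping explicit; the function name `landis_koch_label` is made up for this sketch, and the bands are the commonly cited ones (below 0 poor, up to 0.20 slight, up to 0.40 fair, up to 0.60 moderate, up to 0.80 substantial, up to 1.00 almost perfect).

```python
def landis_koch_label(kappa):
    """Map a kappa coefficient to the Landis and Koch (1977) verbal label."""
    if kappa < 0:
        return 'poor'
    bands = [
        (0.20, 'slight'),
        (0.40, 'fair'),
        (0.60, 'moderate'),
        (0.80, 'substantial'),
        (1.00, 'almost perfect'),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError('kappa cannot exceed 1')

# Kappa values taken from the table above (R1_Auto at threshold 60, and R1_R3 for head)
for name, kappa in [('arms', 0.65), ('lower', 0.59), ('upper', 0.44), ('head', 0.32), ('head R1_R3', 0.38)]:
    print(name, kappa, landis_koch_label(kappa))
```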

Generally, interrater agreement between manual annotator R1 and automatic annotation is comparable to the agreement between the two human annotators across all tiers. This suggests that the automatic annotation is a reliable tool for movement annotation, especially for arms. It seems like head is the most difficult tier to annotate, which is also reflected in the interrater agreement between manual annotators.

To improve the model's predictions for head, upper body, and lower body, and to avoid the risk of overfitting the model to a specific type of behaviour generated by individuals in dyad 0, we will extend the training data by annotating 10% of behaviour per participant per dyad before the final analysis. If kappa does not reach at least substantial agreement (k = 0.61), we will annotate a larger portion of the data.

In the next script, we will use the 60% threshold to annotate all the data.

References

Holle, Henning, and Robert Rein. 2015. “EasyDIAg: A Tool for Easy Determination of Interrater Agreement.” Behavior Research Methods 47 (3): 837–47. https://doi.org/10.3758/s13428-014-0506-7.

Landis, J. R., and G. G. Koch. 1977. “The Measurement of Observer Agreement for Categorical Data.” Biometrics 33 (1). https://pubmed.ncbi.nlm.nih.gov/843571/.

Pouw, Wim, Jan de Wit, Sara Bögels, Marlou Rasenberg, Branka Milivojevic, and Asli Ozyurek. 2021. “Semantically Related Gestures Move Alike: Towards a Distributional Semantics of Gesture Kinematics.” In Digital Human Modeling and Applications in Health, Safety, Ergonomics and Risk Management. Human Body, Motion and Behavior, edited by Vincent G. Duffy, 269–87. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-77817-0_20.