Computing concept similarity using ConceptNet word embeddings

Overview

Our second hypothesis tests the effect of degree of misunderstanding on the magnitude of effort.

We operationalize degree of misunderstanding as a conceptual similarity between target concept and answer offered by a guesser.

To have a reproducible measure of conceptual similarity, we use the ConceptNet (Speer, Chin, and Havasi 2018) to extract embeddings for concepts used in our study, and calculate cosine similarity between the target concept and guessed answer.

To verify the utility of the cosine similarity, we have collected data from 14 Dutch-native people who were asked to rate the similarity between each pair of words in online anonymous rating study. We then compare the ‘perceived similarity’ with cosine similarity computed from ConceptNet embeddings, to validate the use of ConceptNet embeddings as a measure of conceptual similarity.

Code to load packages and prepare environment

import numpy as np
import os
import glob
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pingouin
import openpyxl

curfolder = os.getcwd()
rawdata = curfolder + '\\..\\00_RAWDATA\\'
answerfiles = glob.glob(rawdata + '*\\*.csv', recursive=True)
datafolder = curfolder + '\\data\\'

# Load all files that have '_1_results' in the name 
answerfiles_1 = [f for f in answerfiles if '_1_results' in f]
# Loop over list and add it into one big df
df_all1 = pd.DataFrame()
for file in answerfiles_1:
    df = pd.read_csv(file)
    df_all1 = pd.concat([df_all1, df], ignore_index=True)

df_all1['exp'] = 1 

# Load all files that have '_2_results' in the name
answerfiles_2 = [f for f in answerfiles if '_2_results' in f]
# Loop over list and add it into one big df
df_all2 = pd.DataFrame()
for file in answerfiles_2:
    df = pd.read_csv(file)
    df_all2 = pd.concat([df_all2, df], ignore_index=True)

df_all2['exp'] = 2

# Merge
df_all = pd.concat([df_all1, df_all2], ignore_index=True)

# Keep only columns word and answer
df = df_all[['word', 'answer', 'exp']]

First we need to do some data-wrangling to get all in the right format for the embedding extraction and comparison

# concept list
df_concepts = pd.read_excel(rawdata + '/conceptlist_info.xlsx')

# in df_concepts, keep only English and Dutch
df_concepts = df_concepts[['English', 'Dutch']]

# rename Dutch to word
df_concepts = df_concepts.rename(columns={'Dutch': 'word'})

# merge df and df_concepts on word
df = pd.merge(df, df_concepts, on='word', how='left')

# show rows where English is NaN
df[df['English'].isnull()]

# add translations manually for each (these are practice trials)
df.loc[df['word'] == 'bloem', 'English'] = 'flower'
df.loc[df['word'] == 'dansen', 'English'] = 'to dance'
df.loc[df['word'] == 'auto', 'English'] = 'car'
df.loc[df['word'] == 'olifant', 'English'] = 'elephant'
df.loc[df['word'] == 'comfortabel', 'English'] = 'comfortable'
df.loc[df['word'] == 'bal', 'English'] = 'ball'
df.loc[df['word'] == 'haasten', 'English'] = 'to hurry'
df.loc[df['word'] == 'gek', 'English'] = 'crazy'
df.loc[df['word'] == 'snijden', 'English'] = 'to cut'
df.loc[df['word'] == 'koken', 'English'] = 'to cook'
df.loc[df['word'] == 'juichen', 'English'] = 'to cheer'
df.loc[df['word'] == 'zingen', 'English'] = 'to sing'
df.loc[df['word'] == 'glimlach', 'English'] = 'smile'
df.loc[df['word'] == 'klok', 'English'] = 'clock'
df.loc[df['word'] == 'fiets', 'English'] = 'bicycle'
df.loc[df['word'] == 'vliegtuig', 'English'] = 'airplane'
df.loc[df['word'] == 'geheim', 'English'] = 'secret'
df.loc[df['word'] == 'telefoon', 'English'] = 'telephone'
df.loc[df['word'] == 'zwaaien', 'English'] = 'to wave'
df.loc[df['word'] == 'sneeuw', 'English'] = 'snow'

# make a list of English answers
answers_en = ['party', 'to cheer', 'tasty', 'to shoot', 'to breathe', 'zombie', 'bee', 'sea', 'dirty', 'tasty', 'car', 'to eat', 'to eat', 'to blow', 'hose', 'hose', 'to annoy', 'to make noise', 'to make noise', 'to run away', 'elephant', 'to cry', 'cold', 'outfit', 'silence', 'to ski', 'wrong', 'to play basketball', 'to search', 'disturbed', 'to run', 'to lick', 'to lift', 'lightning', 'to think', 'to jump', 'to fall', 'to write', 'to dance', 'shoulder height', 'horn', 'dirty', 'boring', 'to drink', 'strong', 'elderly', 'to mix', 'fish', 'fish', 'dirty', 'wrong', 'smart', 'to box', 'to box', 'dog', 'to catch', 'to cheer', 'to sing', 'pregnant', 'hair', 'to shower', 'pain', 'burnt', 'hot', 'I', 'to chew', 'bird', 'airplane', 'to fly', 'to think', 'to choose', 'to doubt', 'graffiti', 'fireworks', 'bomb', 'to smile', 'to laugh', 'smile', 'clock', 'to wonder', 'height', 'big', 'height', 'space', 'to misjudge', 'to wait', 'satisfied', 'happy', 'fish', 'to smell', 'wind', 'pain', 'to burn', 'hot', 'to cycle', 'to fly', 'airplane', 'bird', 'to crawl', 'to drink', 'waterfall', 'water', 'fire', 'top', 'good', 'to hear', 'to point', 'distance', 'there', 'to whisper', 'quiet', 'to be silent', 'telephone', 'to blow', 'to distribute', 'to give', 'cat', 'to laugh', 'tasty', 'to eat', 'yummy', 'to sleep', 'mountain', 'dirty', 'to vomit', 'to be disgusted', 'to greet', 'hello', 'goodbye', 'to smell', 'nose', 'odor', 'to fly', 'fireworks', 'to blow', 'to cut', 'pain', 'hot', 'to slurp', 'to throw', 'to fall', 'to fall', 'whistle', 'heartbeat', 'mouse', 'to hit', 'to catch', 'to grab', 'to throw', 'to fall', 'to shoot', 'circus', 'trunk', 'to fall', 'to fight', 'pain', 'to push open', 'to growl', 'to cut', 'to eat', 'knife', 'to slurp', 'to drink', 'drink', 'to eat', 'delicious', 'tasty', 'to cough', 'sick', 'to cry', 'to cry']

# replace skien with skiën in the df
df['answer'] = df['answer'].str.replace('skien', 'skiën')

# get rid of English 'to beat'
df_final = df[df['English'] != 'to beat']
# and to weep
df_final = df[df['English'] != 'to weep']
# and noisy
df_final = df[df['English'] != 'noisy']

# add those to df as answers_en
df['answer_en'] = answers_en

# make a list of English targets
meanings_en = list(df['English'])
# Dutch targets
meanings_nl = list(df['word'])
# Dutch answers
answers_nl = list(df['answer'])

# Save it
df.to_csv(datafolder + 'concept_answer_withoutcossim.csv', index=False)

This is how the dataframe looks like

	word	answer	exp	English	answer_en
0	bloem	feest	1	flower	party
1	dansen	juichen	1	to dance	to cheer
2	bitter	lekker	1	bitter	tasty
3	vechten	schieten	1	to fight	to shoot
4	ademen	ademen	1	to breathe	to breathe
5	bijten	zombie	1	to bite	zombie
6	zoemen	bij	1	buzz	bee
7	fluisteren	zee	1	to whisper	sea
8	walgen	vies	1	disgusted	dirty
9	langzaam	lekker	1	slow	tasty
10	auto	auto	1	car	car
11	eten	eten	1	to eat	to eat
12	ei	eten	1	egg	to eat
13	zwemmen	waaien	1	to swim	to blow
14	snel	waterslang	1	fast	hose

Calculating cosine similarity

Now we will load in ConceptNet numberbatch (version 19.08) and compute cosine similarity for each pair

Custom functions

# Load embeddings from a file
def load_embeddings(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

# Cosine similarity
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

We will use multilingual numberbatch to extract words in the original language of experiment - Dutch. While English has better representation in ConceptNet, the English numberbatch does not make distinction between nouns and verbs (so ‘a drink’ and ‘to drink’ have common representation - drink). Because this is important distinction for us, we opt for Dutch embeddings to avoid this problem

# load embeddings
embeddings = load_embeddings('numberbatch\\numberbatch.txt')

This is how a single concept is represented (here skiën, engl. skiing)

[ 3.410e-02 -4.640e-02  5.490e-02  1.544e-01  1.800e-02 -5.050e-02
 -6.660e-02 -2.300e-02  5.320e-02  1.104e-01  2.770e-02  5.040e-02
 -2.010e-02  5.900e-03 -1.133e-01 -9.370e-02 -7.890e-02  3.540e-02
  3.780e-02  8.400e-02 -3.880e-02  7.680e-02 -8.010e-02  6.540e-02
 -1.493e-01 -1.036e-01  8.490e-02  1.040e-02 -6.890e-02  6.890e-02
  1.226e-01 -1.850e-02  1.520e-02  2.810e-02 -5.660e-02 -2.670e-02
 -5.700e-02 -4.480e-02  1.924e-01  5.800e-02 -7.800e-02 -7.700e-03
  1.132e-01  6.350e-02 -4.310e-02  1.900e-03 -4.820e-02  1.047e-01
  6.900e-02  7.150e-02  1.660e-02  2.730e-02  4.340e-02  1.130e-02
 -1.427e-01 -9.200e-03 -8.000e-04  2.310e-02  1.234e-01 -1.452e-01
 -1.710e-02 -1.094e-01 -1.518e-01  4.820e-02  1.400e-02 -1.460e-02
  1.023e-01  5.220e-02  1.362e-01  3.190e-02 -2.590e-02  1.220e-01
  1.750e-02  8.810e-02 -9.200e-02 -1.226e-01 -5.560e-02 -6.600e-03
  3.180e-02 -1.113e-01  6.130e-02 -1.202e-01 -2.480e-02 -8.300e-03
 -1.710e-02  3.410e-02  1.550e-02 -8.000e-02 -6.390e-02  1.170e-01
 -1.578e-01  2.660e-02 -1.360e-02 -6.790e-02 -1.030e-02 -3.320e-02
  1.040e-02  1.600e-03 -1.300e-03 -4.420e-02  4.060e-02 -7.470e-02
 -2.800e-03  1.900e-02  1.098e-01 -6.180e-02 -5.690e-02  7.700e-02
 -2.000e-03 -1.050e-02 -9.610e-02  7.000e-03  5.310e-02  2.720e-02
  5.850e-02  5.000e-02  4.400e-03 -3.210e-02  3.000e-02 -6.020e-02
  1.190e-02  3.760e-02  3.650e-02  1.452e-01  1.900e-03  3.870e-02
  5.210e-02  4.470e-02  5.570e-02  1.460e-02 -8.330e-02 -3.660e-02
 -8.020e-02 -6.520e-02  3.660e-02  2.260e-02 -7.050e-02 -9.100e-03
  2.190e-02 -2.630e-02 -1.410e-02  6.200e-03 -2.130e-02 -3.750e-02
  1.960e-02 -1.250e-02  2.740e-02  8.450e-02 -8.750e-02 -2.700e-02
  8.500e-03  1.260e-02 -2.550e-02  1.640e-02  2.850e-02 -5.570e-02
 -1.110e-02  9.400e-02  3.280e-02 -4.830e-02  1.080e-02  1.170e-02
  8.600e-03  6.620e-02  1.151e-01  2.960e-02  1.670e-02  1.180e-02
 -1.101e-01 -5.720e-02 -2.910e-02 -2.450e-02 -3.700e-03 -4.650e-02
  7.500e-03 -4.350e-02 -3.150e-02 -4.980e-02 -3.750e-02  6.250e-02
  6.600e-03 -9.910e-02  3.180e-02 -4.350e-02 -2.830e-02 -9.260e-02
 -6.110e-02 -1.280e-02 -7.960e-02  3.280e-02  8.670e-02 -1.080e-02
 -1.430e-02  1.350e-02 -6.210e-02 -2.220e-02 -6.010e-02 -4.440e-02
  5.820e-02  9.090e-02  9.300e-03 -4.130e-02 -3.040e-02 -6.410e-02
  4.760e-02 -6.620e-02 -1.000e-04  5.300e-02 -3.650e-02  7.500e-03
  3.460e-02 -5.350e-02  8.900e-03 -9.140e-02 -2.830e-02 -4.290e-02
  9.000e-03 -7.400e-02  1.014e-01 -8.760e-02  3.200e-03 -4.420e-02
 -2.990e-02  1.000e-04  3.900e-02 -6.200e-03  7.740e-02  2.840e-02
  1.170e-02 -3.680e-02  2.790e-02  6.890e-02  1.800e-03  3.000e-03
 -3.070e-02 -5.000e-04  1.000e-03 -2.200e-03 -2.480e-02  1.117e-01
 -2.340e-02  3.780e-02  3.350e-02 -4.950e-02  9.500e-03  1.720e-02
  4.720e-02 -6.040e-02 -3.220e-02 -1.570e-02 -3.860e-02  1.140e-02
 -4.240e-02 -1.710e-02 -8.000e-04  3.900e-02  1.410e-02 -8.580e-02
 -3.250e-02 -1.650e-02 -2.520e-02  1.140e-02  7.050e-02  6.000e-04
  3.570e-02  4.750e-02 -7.050e-02  2.500e-02 -2.940e-02  1.830e-02
  9.600e-03  1.600e-02  3.910e-02  3.970e-02 -1.540e-02  8.070e-02
 -1.000e-02 -6.400e-03 -4.050e-02  3.030e-02 -6.060e-02  1.690e-02
  4.370e-02 -6.500e-03  2.920e-02 -2.900e-03 -1.017e-01  7.060e-02
 -1.053e-01 -1.500e-03  9.000e-03  9.500e-03  4.020e-02  2.420e-02
  1.800e-03  4.270e-02  3.140e-02 -2.540e-02  2.780e-02 -6.000e-03]

Now we take the list of target-answer pairs, transform them into embedding format and perform cosine similarity.

# get the embeddings for the words in the list meanings_en
word_embeddings_t = {}
for word in meanings_nl:
    word_embed = '/c/nl/' + word
    if word_embed in embeddings:
        word_embeddings_t[word] = embeddings[word_embed]

# get the embeddings for the words in the list answers_en
word_embeddings_ans = {}
for word in answers_nl:
    word_embed = '/c/nl/' + word
    if word_embed in embeddings:
        word_embeddings_ans[word] = embeddings[word_embed]

# calculate the similarity between the first word in the list meanings_en and first word in answers_en, second word in meanings_en and second word in answers_en, etc.
cosine_similarities = []

for i in range(len(meanings_nl)):
    word1 = meanings_nl[i]
    word2 = answers_nl[i]
    vec1 = word_embeddings_t.get(word1)
    vec2 = word_embeddings_ans.get(word2)
    if vec1 is not None and vec2 is not None:
        cosine_sim = cosine_similarity(vec1, vec2)
        cosine_similarities.append(cosine_sim)
    else:
        # print which concepts could not be found
        if vec1 is None:
            print(f"Concept not found: {word1}")
        if vec2 is None:
            print(f"Concept not found: {word2}")
        cosine_similarities.append(None)

df['cosine_similarity'] = cosine_similarities

# Save it
df.to_csv(datafolder + 'conceptnet_clean.csv', index=False)

Concept not found: lawaai maken
Concept not found: lawaai maken
Concept not found: schouderhoogte
Concept not found: openduwen

When running the code, we will see that some target or answered concepts are not represented in numberbatch (e.g., if the answer has more than one word).

Because we verified that cosine similarity and perceived similarity are highly correlated (see below), we will collect the missing data through new online rating study.

Comparing cosine similarity against perceived similarity

To validate the use of ConceptNet embeddings as a measure of conceptual similarity, we compare the cosine similarity computed from ConceptNet embeddings with the ‘perceived similarity’ ratings collected in the online anonymous rating study.

The rating study has been introduced to the participants in a way that closely relates to the experiment. The instructions go as follows:

Below is a list of 171 pairs of words. Your task is to go through them and rate on the scale from 0 to 10 how similar they are/feel for you. You can for example imagine that you are playing a game where you need to explain the first word from the pair (e.g., to dance), and someone answers the second word in the pair. In such a situation, how close is the guesser from the intended word? If they answer ‘to dance’, then the two words are completely identical. But if they answer ‘a car’ it is not similar at all. Rate it according to your intuition, there is no incorrect answer. Note that the survey is completely anonymous and we are not collecting any of your personal data, only the ratings.

This is how the survey results look like:

	bloem - feest	dansen - juichen	bitter - lekker	vechten - schieten	ademen - ademen	bijten - zombie	zoemen - bij	fluisteren - zee	walgen - vies	langzaam - lekker	...	zout - mes	zuigen - slurpen	zuigen - drinken	zuigen - drink	dik - eten	dik - heerlijk	dik - lekker	ziek - hoesten	ziek - ziek	huilen - huilen
0	1	3	1	7	10	5	8	1	8	0	...	0	8	6	5	6	5	5	8	10	10
1	3	6	1	6	10	4	8	0	8	1	...	2	8	8	8	7	4	5	7	10	10
2	5	8	5	8	10	7	7	2	8	5	...	2	7	7	7	7	5	6	8	10	10
3	4	6	4	8	10	7	9	7	7	0	...	2	7	7	8	6	2	2	8	10	10
4	6	0	5	6	10	7	8	0	6	0	...	0	7	8	7	4	0	0	8	10	10
5	0	4	2	3	10	6	5	2	8	0	...	0	6	6	4	0	0	0	3	10	10
6	0	1	1	6	10	4	8	0	9	0	...	0	5	6	8	2	0	0	7	10	10
7	1	5	1	2	10	3	6	0	4	0	...	1	6	5	4	6	3	4	6	10	10
8	2	3	3	4	10	0	7	0	8	1	...	0	7	0	2	6	4	2	8	10	10
9	0	2	0	2	10	5	0	0	8	0	...	0	0	3	2	0	0	2	5	10	10
10	3	4	4	7	10	4	6	3	8	3	...	4	7	4	6	6	5	6	7	10	10
11	0	2	0	5	10	8	7	3	9	0	...	0	5	2	2	5	2	1	8	10	10
12	0	0	4	0	10	0	2	0	2	0	...	0	1	1	3	3	5	0	2	10	10
13	0	1	2	1	10	0	0	0	0	1	...	0	2	1	0	0	0	0	0	10	10

14 rows × 166 columns

Now we have to calculate mean rating for each pair

# for each column, calculate the mean and save it to a df
df_survey_means = pd.DataFrame(df_survey.mean()).reset_index()

# separate the index, the first part is English, the second part is the answer_en
df_survey_means['word'] = df_survey_means['index'].str.split(' - ').str[0]
df_survey_means['answer'] = df_survey_means['index'].str.split(' - ').str[1]

# get rid of the index column
df_survey_means = df_survey_means.drop(columns='index')

# rename the column 0 to mean_similarity
df_survey_means = df_survey_means.rename(columns={0: 'mean_similarity'})

##### some corrections ####
# get rid of all invisible spaces in answer
df_survey_means['answer'] = df_survey_means['answer'].str.strip()
# where word is vangen and answer vagen, change answer to vangen, and add similarity to 10
df_survey_means.loc[(df_survey_means['word'] == 'vagen') & (df_survey_means['answer'] == 'vangen'), 'word'] = 'vangen'
df_survey_means.loc[(df_survey_means['word'] == 'vangen') & (df_survey_means['answer'] == 'vangen'), 'mean_similarity'] = 10
# where word is lopen and answer skien, change answer to skiën
df_survey_means.loc[(df_survey_means['word'] == 'lopen') & (df_survey_means['answer'] == 'skien'), 'answer'] = 'skiën'
# add one missing pair vallen-vallen with mean_similarity 10
missing_row = pd.DataFrame({'word': ['vallen'], 'answer': ['vallen'], 'mean_similarity': [10]})
df_survey_means = pd.concat([df_survey_means, missing_row], ignore_index=True)

# display
df_survey_means.head(15)

	mean_similarity	word	answer
0	1.785714	bloem	feest
1	3.214286	dansen	juichen
2	2.357143	bitter	lekker
3	4.642857	vechten	schieten
4	10.000000	ademen	ademen
5	4.285714	bijten	zombie
6	5.785714	zoemen	bij
7	1.285714	fluisteren	zee
8	6.642857	walgen	vies
9	0.785714	langzaam	lekker
10	10.000000	auto	auto
11	10.000000	eten	eten
12	5.285714	ei	eten
13	1.142857	zwemmen	waaien
14	0.500000	snel	waterslang

Now we can merge it with the cosine similarity dataframe

# load in similarity
df_similarity = pd.read_csv(datafolder + 'conceptnet_clean.csv')

# merge df_survey_means with df on English and answer_en
df_final = pd.merge(df_similarity, df_survey_means, on=['word', 'answer'], how='left')

# get rid of English 'to beat'
df_final = df_final[df_final['English'] != 'beat']
# and to weep
df_final = df_final[df_final['English'] != 'weep']

# save it
df_final.to_csv(datafolder + '/df_final_conceptnet.csv', index=False)

# Display
df_final.head(15)

	word	answer	exp	English	answer_en	cosine_similarity	mean_similarity
0	bloem	feest	1	flower	party	0.135571	1.785714
1	dansen	juichen	1	to dance	to cheer	0.177888	3.214286
2	bitter	lekker	1	bitter	tasty	0.257505	2.357143
3	vechten	schieten	1	to fight	to shoot	0.205791	4.642857
4	ademen	ademen	1	to breathe	to breathe	1.000000	10.000000
5	bijten	zombie	1	to bite	zombie	0.068596	4.285714
6	zoemen	bij	1	buzz	bee	0.164508	5.785714
7	fluisteren	zee	1	to whisper	sea	0.072605	1.285714
8	walgen	vies	1	disgusted	dirty	0.353700	6.642857
9	langzaam	lekker	1	slow	tasty	0.077073	0.785714
10	auto	auto	1	car	car	1.000000	10.000000
11	eten	eten	1	to eat	to eat	1.000000	10.000000
12	ei	eten	1	egg	to eat	0.233187	5.285714
13	zwemmen	waaien	1	to swim	to blow	0.065727	1.142857
14	snel	waterslang	1	fast	hose	0.081750	0.500000

Now we can finally run correlation

# get rid of all lines where mean_similarity is 10.0 - otherwise we will drag the correlation up
df_corr = df_final[df_final['mean_similarity'] != 10.0]

feature1 = "cosine_similarity"
feature2 = "mean_similarity"

# create a sub-dataframe with the selected features, dropping missing values
subdf = df_corr[[feature1, feature2]].dropna()

# compute the correlation coefficient, with Bayes factor
corr_with_bf = pingouin.pairwise_corr(subdf, columns=['cosine_similarity', 'mean_similarity'], method='pearson', alternative='two-sided')

# display
print(corr_with_bf)

                   X                Y   method alternative    n         r  \
0  cosine_similarity  mean_similarity  pearson   two-sided  122  0.730051   

         CI95%         p-unc       BF10  power  
0  [0.63, 0.8]  1.430306e-21  3.728e+18    1.0

And here we see the relationship visually

The strong correlation (r=0.73) validates the use of ConceptNet embeddings as a measure of conceptual similarity. In the next script, we will load it in together with our effort features.

References

Speer, Robyn, Joshua Chin, and Catherine Havasi. 2018. “ConceptNet 5.5: An Open Multilingual Graph of General Knowledge.” 2018. https://doi.org/10.48550/arXiv.1612.03975.