Provide complementary sense mapping #15

Open · wants to merge 2 commits into base: master
5 changes: 4 additions & 1 deletion README.md
Expand Up @@ -27,6 +27,9 @@ The following files are in this repository:
* `ili-map-pwn31.tab`, `ili-map-wn31.ttl`: The mappings from Princeton WordNet 3.1
to the ILI
* `ili-map-odwn13.ttl`: The mapping from Open Dutch WordNet 1.3 to the ILI.
* `older-wn-mappings`: Automatically constructed mappings from previous versions
* `older-wn-mappings/`: Automatically constructed mappings from previous versions
of WordNet to the ILI

Complementary data:

* `sense-mappings/`: Mappings between WordNet synsets and sense IDs for selected WordNets. Provided to facilitate linking *sense*-annotated resources (such as the [SemCor corpus](https://web.eecs.umich.edu/~mihalcea/downloads.html#semcor) or the [Princeton WordNet Gloss Corpus](https://wordnetcode.princeton.edu/glosstag.shtml)) with ILI-based WordNets.
37 changes: 37 additions & 0 deletions sense-mappings/Makefile
@@ -0,0 +1,37 @@
all:
for file in http://wordnetcode.princeton.edu/1.5/wn15.zip \
http://wordnetcode.princeton.edu/1.6/wn16.unix.tar.gz \
http://wordnetcode.princeton.edu/1.7/wn17.unix.tar.gz \
http://wordnetcode.princeton.edu/1.7.1/WordNet-1.7.1.tar.gz \
http://wordnetcode.princeton.edu/2.0/WordNet-2.0.tar.gz \
http://wordnetcode.princeton.edu/2.1/WordNet-2.1.tar.gz \
http://wordnetcode.princeton.edu/3.0/WordNet-3.0.tar.gz \
http://wordnetcode.princeton.edu/wn3.1.dict.tar.gz; do \
version=pwn`basename $$file | sed s/'[^0-9]'//g`; \
if [ ! -e $$version ]; then mkdir $$version; fi; \
if [ ! -e $$version/`basename $$file` ]; then \
cd $$version;\
wget $$file;\
cd ..;\
fi; \
done;
for dir in */; do \
cd $$dir;\
for file in *zip; do \
unzip -u $$file;\
done;\
for file in *t*gz; do \
tar -xvf $$file; \
done; \
cd ..; \
done;
for dir in */; do \
for file in `find $$dir | grep -i dict | egrep -i '\.DAT|data\.'`; do \
python3 dict2map.py $$file; \
done > pwn`echo $$dir | sed s/'[^0-9]'//g`.tab;\
done;





22 changes: 22 additions & 0 deletions sense-mappings/Readme.md
@@ -0,0 +1,22 @@
# Sense mappings

WordNets provide two types of IDs:
- synset IDs (in Princeton WordNet, a number followed by `-n` for nouns, `-v` for verbs, etc.)
- sense IDs (in Princeton WordNet, a lemma followed by several `:`-separated numerical and lexical identifiers)

Synsets represent sets of synonyms (i.e., abstract concepts). Senses represent possible meanings that a lexeme can take. A word sense thus corresponds to a tuple of synset and lemma (a lexicalization for a specific synset, a meaning of a specific lexeme).
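As an illustrative sketch, the two ID types can be pulled apart as follows. The values below are plausible examples of the Princeton WordNet formats (`offset-pos` for synsets, `lemma%ss_type:lex_filenum:lex_id:head:head_id` for sense keys), not entries taken from this repository:

```python
# Hypothetical example IDs, following the Princeton WordNet formats.
synset_id = "02084071-n"     # 8-digit synset offset plus part-of-speech suffix
sense_key = "dog%1:05:00::"  # lemma "dog", noun (ss_type 1), lexicographer file 05

offset, pos = synset_id.split("-")
lemma, rest = sense_key.split("%")
ss_type, lex_filenum, lex_id, head, head_id = rest.split(":")

print(lemma, ss_type, pos)  # -> dog 1 n
```

Note that the last two sense-key fields (head word and head ID) are empty for everything except `s` (adjective satellite) senses, which is why ordinary sense keys end in `::`.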

The ILI provides mappings between synset IDs only, not sense IDs. For convenience, we provide a mapping of synset IDs to sense IDs extracted from a number of WordNets. Note that the synset IDs are not ILI IDs, but in combination with the other mapping files in this repository, they can be used to map ILIs to resource-specific sense IDs.
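The composition of mapping files described above can be sketched as follows. This is a minimal illustration assuming two-column tab-separated rows (`ili TAB synset` and `synset TAB sense`); the ILI identifier `i12345` is a placeholder, not a real entry:

```python
# Minimal sketch: compose ILI -> synset with synset -> sense to get ILI -> senses.
from collections import defaultdict

ili2synset = {"i12345": "02084071-n"}  # placeholder row from an ili-map-*.tab file

synset2senses = defaultdict(list)
for row in ["02084071-n\tdog%1:05:00::"]:  # stand-in for a sense-mappings/*.tab row
    synset, sense = row.rstrip("\n").split("\t")
    synset2senses[synset].append(sense)

ili2senses = {ili: synset2senses[syn] for ili, syn in ili2synset.items()}
print(ili2senses["i12345"])  # -> ['dog%1:05:00::']
```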

These mappings allow corpora with WordNet *sense* (but not *synset*) annotations, e.g., [SemCor](https://web.eecs.umich.edu/~mihalcea/downloads.html#semcor), to be processed directly along with other ILI-linked resources, without having to use the original WordNet software.

These mappings are automatically extracted from WordNet `dict` files. To re-build them, run

    $> make

We do not provide an RDF version. However, note that the TAB files can be queried directly with SPARQL using [TARQL](https://github.com/tarql/tarql) and combined with other RDF data using [Fintan](https://github.com/Pret-a-LLOD/Fintan).

# Known issues

- currently limited to Princeton WordNet
- support for `-s` (adjective satellite) synsets is incomplete: we generate sense-ID prefixes rather than full sense keys for these, so they must be matched by prefix (e.g., with `startswith`) rather than by equality
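The suggested prefix match for `-s` senses can be sketched as follows; the head-word suffix `hurried:00` in the full key is a hypothetical example, since it is exactly the part the extraction cannot reconstruct:

```python
# Prefix match for adjective-satellite senses: the generated ID stops before
# the head-word/head-id fields of the full sense key.
generated_prefix = "fast%5:00:00:"          # as emitted for an `-s` sense
full_sense_key = "fast%5:00:00:hurried:00"  # hypothetical full sense key

assert full_sense_key.startswith(generated_prefix)
```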
52 changes: 52 additions & 0 deletions sense-mappings/dict2map.py
@@ -0,0 +1,52 @@
import sys

"""When called on WordNet data.* files, extract pairs of synset ID and sense ID.
Note that we do not produce the full sense key, but a sense prefix; optionally,
this can be followed either by ":" (i.e., ending in "::") or by ":"-separated
addenda (for which it is not clear where they come from).
"""

# map the part-of-speech tag of the data file to the ss_type digit and the
# synset-ID suffix used in sense keys
pos2nr_sfx = {
    "n": ("1", "n"),
    "v": ("2", "v"),
    "adj": ("3", "a"),
    "a": ("3", "a"),
    "adv": ("4", "r"),
    "r": ("4", "r"),
    "s": ("5", "s"),
}

for file in sys.argv[1:]:
    with open(file, "rt", errors="ignore") as infile:
        for line in infile:
            # data lines start with the 8-digit synset offset;
            # skip the license header
            if line[0] not in "0123456789":
                continue
            fields = line.strip().split()
            synset = fields[0]       # synset offset
            lex_filenum = fields[1]  # lexicographer file number
            pos = fields[2]
            if pos not in pos2nr_sfx:
                sys.stderr.write('unsupported pos "' + pos + '" in "' + line.strip() + '"\n')
                sys.stderr.flush()
                continue
            posnr, pos = pos2nr_sfx[pos]
            synset = synset + "-" + pos
            # fields[3] is the lemma count, but it can be malformed ("0d") in
            # WN 3.1, so we instead consume lemma/lex_id pairs until we hit the
            # numeric pointer count (note: this heuristic would stop early for
            # lemmas beginning with a digit)
            fields = fields[4:]
            while fields[0][0] not in "0123456789":
                lemma, nr = fields[0], fields[1]
                if len(nr) == 1:
                    nr = "0" + nr
                sense = lemma + "%" + posnr + ":" + lex_filenum + ":" + nr + ":"
                # `s` senses have disambiguating head-word suffixes that we
                # cannot reconstruct, so they stay prefixes; all others get the
                # empty head fields, i.e., a "::" ending
                if pos != "s":
                    sense = sense + ":"
                print(synset + "\t" + sense)
                fields = fields[2:]
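The core extraction step of the script can be checked against a single data line. The line below imitates the WordNet `data.noun` layout (offset, lexicographer file, pos, lemma count, lemma/lex_id pairs, pointer count, …); it is a hand-made illustration, not copied from a real dict file:

```python
# Hand-made line in data.noun layout (illustrative, not from a real file).
line = "02084071 05 n 01 dog 0 001 @ 02083346 n 0000 | a domesticated canine"

fields = line.split()
synset = fields[0] + "-n"                                   # offset + pos suffix
sense = fields[4] + "%1:" + fields[1] + ":0" + fields[5] + "::"  # zero-padded lex_id

print(synset + "\t" + sense)  # -> 02084071-n	dog%1:05:00::
```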


