Provide complementary sense mapping #15

Open · wants to merge 2 commits into base: master
5 changes: 4 additions & 1 deletion README.md
Expand Up @@ -27,6 +27,9 @@ The following files are in this repository:
* `ili-map-pwn31.tab`, `ili-map-wn31.ttl`: The mappings from Princeton WordNet 3.1
to the ILI
* `ili-map-odwn13.ttl`: The mapping from Open Dutch WordNet 1.3 to the ILI.
* `older-wn-mappings`: Automatically constructed mappings from previous versions
* `older-wn-mappings/`: Automatically constructed mappings from previous versions
of WordNet to the ILI

Complementary data:

* `sense-mappings/`: Mappings between WordNet synsets and sense IDs for selected WordNets. Provided to facilitate linking *sense*-annotated resources (such as the [SemCor corpus](https://web.eecs.umich.edu/~mihalcea/downloads.html#semcor) or the [Princeton WordNet Gloss Corpus](https://wordnetcode.princeton.edu/glosstag.shtml)) with ILI-based WordNets.
37 changes: 37 additions & 0 deletions sense-mappings/Makefile
@@ -0,0 +1,37 @@
all:
for file in http://wordnetcode.princeton.edu/1.5/wn15.zip \
http://wordnetcode.princeton.edu/1.6/wn16.unix.tar.gz \
http://wordnetcode.princeton.edu/1.7/wn17.unix.tar.gz \
http://wordnetcode.princeton.edu/1.7.1/WordNet-1.7.1.tar.gz \
http://wordnetcode.princeton.edu/2.0/WordNet-2.0.tar.gz \
http://wordnetcode.princeton.edu/2.1/WordNet-2.1.tar.gz \
http://wordnetcode.princeton.edu/3.0/WordNet-3.0.tar.gz \
http://wordnetcode.princeton.edu/wn3.1.dict.tar.gz; do \
version=pwn`basename $$file | sed s/'[^0-9]'//g`; \
if [ ! -e $$version ]; then mkdir $$version; fi; \
if [ ! -e $$version/`basename $$file` ]; then \
cd $$version;\
wget $$file;\
cd ..;\
fi; \
done;
for dir in */; do \
cd $$dir;\
for file in *zip; do \
unzip -u $$file;\
done;\
for file in *t*gz; do \
tar -xvf $$file; \
done; \
cd ..; \
done;
for dir in */; do \
for file in `find $$dir | grep -i dict | egrep -i '\.DAT|data\.'`; do \
python3 dict2map.py $$file; \
done > pwn`echo $$dir | sed s/'[^0-9]'//g`.tab;\
done;





22 changes: 22 additions & 0 deletions sense-mappings/Readme.md
@@ -0,0 +1,22 @@
# Sense mappings

WordNets provide two types of IDs:
- synset IDs (in Princeton WordNet, a number followed by `-n` for nouns, `-v` for verbs, etc.)
- sense IDs (in Princeton WordNet, a lemma followed by several `:`-separated numerical and lexical identifiers)

Synsets represent sets of synonyms (i.e., abstract concepts). Senses represent possible meanings that a lexeme can take. A word sense thus corresponds to a tuple of synset and lemma (a lexicalization for a specific synset, a meaning of a specific lexeme).
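As an illustrative sketch, the two ID types can be pulled apart as follows. The values below are plausible examples of the Princeton WordNet formats (`offset-pos` for synsets, `lemma%ss_type:lex_filenum:lex_id:head:head_id` for sense keys), not entries taken from this repository:

```python
# Hypothetical example IDs, following the Princeton WordNet formats.
synset_id = "02084071-n"     # 8-digit synset offset plus part-of-speech suffix
sense_key = "dog%1:05:00::"  # lemma "dog", noun (ss_type 1), lexicographer file 05

offset, pos = synset_id.split("-")
lemma, rest = sense_key.split("%")
ss_type, lex_filenum, lex_id, head, head_id = rest.split(":")

print(lemma, ss_type, pos)  # -> dog 1 n
```

Note that the last two sense-key fields (head word and head ID) are empty for everything except `s` (adjective satellite) senses, which is why ordinary sense keys end in `::`.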

The ILI provides mappings between synset IDs only, not sense IDs. For convenience, we provide a mapping of synset IDs to sense IDs extracted from a number of WordNets. Note that the synset IDs are not ILI IDs, but in combination with the other mapping files in this repository, they can be used to map ILIs to resource-specific sense IDs.
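The composition of mapping files described above can be sketched as follows. This is a minimal illustration assuming two-column tab-separated rows (`ili TAB synset` and `synset TAB sense`); the ILI identifier `i12345` is a placeholder, not a real entry:

```python
# Minimal sketch: compose ILI -> synset with synset -> sense to get ILI -> senses.
from collections import defaultdict

ili2synset = {"i12345": "02084071-n"}  # placeholder row from an ili-map-*.tab file

synset2senses = defaultdict(list)
for row in ["02084071-n\tdog%1:05:00::"]:  # stand-in for a sense-mappings/*.tab row
    synset, sense = row.rstrip("\n").split("\t")
    synset2senses[synset].append(sense)

ili2senses = {ili: synset2senses[syn] for ili, syn in ili2synset.items()}
print(ili2senses["i12345"])  # -> ['dog%1:05:00::']
```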

These mappings allow corpora with WordNet *sense* (but not *synset*) annotations, e.g., [SemCor](https://web.eecs.umich.edu/~mihalcea/downloads.html#semcor), to be processed directly along with other ILI-linked resources, without having to use the original WordNet software.

These mappings are automatically extracted from WordNet `dict` files. To re-build them, run

    $> make

We do not provide an RDF version. However, note that the TAB files can be queried directly with SPARQL using [TARQL](https://github.com/tarql/tarql) and combined with other RDF data using [Fintan](https://github.com/Pret-a-LLOD/Fintan).

# Known issues

- currently limited to Princeton WordNet
- support for `-s` (adjective satellite) synsets is incomplete: we generate sense-ID prefixes rather than full sense keys for these, so they must be matched by prefix (e.g., with `startswith`) rather than by equality
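The suggested prefix match for `-s` senses can be sketched as follows; the head-word suffix `hurried:00` in the full key is a hypothetical example, since it is exactly the part the extraction cannot reconstruct:

```python
# Prefix match for adjective-satellite senses: the generated ID stops before
# the head-word/head-id fields of the full sense key.
generated_prefix = "fast%5:00:00:"          # as emitted for an `-s` sense
full_sense_key = "fast%5:00:00:hurried:00"  # hypothetical full sense key

assert full_sense_key.startswith(generated_prefix)
```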
52 changes: 52 additions & 0 deletions sense-mappings/dict2map.py
@@ -0,0 +1,52 @@
import sys

"""When called on WordNet data.* files, extract pairs of synset ID and sense ID.
Note that we do not produce the full sense key, but a sense prefix; optionally,
this can be followed either by ":" (i.e., ending in "::") or by ":"-separated
addenda (for which it is not clear where they come from).
"""

# map the part-of-speech tag of the data file to the ss_type digit and the
# synset-ID suffix used in sense keys
pos2nr_sfx = {
    "n": ("1", "n"),
    "v": ("2", "v"),
    "adj": ("3", "a"),
    "a": ("3", "a"),
    "adv": ("4", "r"),
    "r": ("4", "r"),
    "s": ("5", "s"),
}

for file in sys.argv[1:]:
    with open(file, "rt", errors="ignore") as infile:
        for line in infile:
            # data lines start with the 8-digit synset offset;
            # skip the license header
            if line[0] not in "0123456789":
                continue
            fields = line.strip().split()
            synset = fields[0]       # synset offset
            lex_filenum = fields[1]  # lexicographer file number
            pos = fields[2]
            if pos not in pos2nr_sfx:
                sys.stderr.write('unsupported pos "' + pos + '" in "' + line.strip() + '"\n')
                sys.stderr.flush()
                continue
            posnr, pos = pos2nr_sfx[pos]
            synset = synset + "-" + pos
            # fields[3] is the lemma count, but it can be malformed ("0d") in
            # WN 3.1, so we instead consume lemma/lex_id pairs until we hit the
            # numeric pointer count (note: this heuristic would stop early for
            # lemmas beginning with a digit)
            fields = fields[4:]
            while fields[0][0] not in "0123456789":
                lemma, nr = fields[0], fields[1]
                if len(nr) == 1:
                    nr = "0" + nr
                sense = lemma + "%" + posnr + ":" + lex_filenum + ":" + nr + ":"
                # `s` senses have disambiguating head-word suffixes that we
                # cannot reconstruct, so they stay prefixes; all others get the
                # empty head fields, i.e., a "::" ending
                if pos != "s":
                    sense = sense + ":"
                print(synset + "\t" + sense)
                fields = fields[2:]
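The core extraction step of the script can be checked against a single data line. The line below imitates the WordNet `data.noun` layout (offset, lexicographer file, pos, lemma count, lemma/lex_id pairs, pointer count, …); it is a hand-made illustration, not copied from a real dict file:

```python
# Hand-made line in data.noun layout (illustrative, not from a real file).
line = "02084071 05 n 01 dog 0 001 @ 02083346 n 0000 | a domesticated canine"

fields = line.split()
synset = fields[0] + "-n"                                   # offset + pos suffix
sense = fields[4] + "%1:" + fields[1] + ":0" + fields[5] + "::"  # zero-padded lex_id

print(synset + "\t" + sense)  # -> 02084071-n	dog%1:05:00::
```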


