PurandareAndPedersen

minhlab edited this page Dec 8, 2012 · 3 revisions

Purandare and Pedersen's context-clustering semantic space

Introduction

The Purandare and Pedersen (P&P) model builds a semantic space that induces different [senses for a word](http://en.wikipedia.org/wiki/Word_sense) based on its different usages in a corpus. This is a form of [word sense induction](http://en.wikipedia.org/wiki/Word_sense_disambiguation#Unsupervised_methods), in which the different meanings of a word are extracted automatically. For details on this approach, see:

  • Amruta Purandare and Ted Pedersen. Word Sense Discrimination by Clustering Contexts in Vector and Similarity Spaces. Proceedings of the Conference on Computational Natural Language Learning (CoNLL), pp. 41-48, May 6-7, 2004, Boston, MA. Available [here](http://www.d.umn.edu/~tpederse/Pubs/conll04-purandarep.pdf).

Algorithm Overview

The P&P model operates in two passes. In the first pass, the corpus is processed to identify features that are likely to be correlated with a word. In this model, features are either co-occurring words or co-occurring bigrams. For example, the presence of "lawmaker" might be considered a feature for "congress." This pass computes the [contingency table](http://en.wikipedia.org/wiki/Contingency_table) between a word and every feature that has co-occurred with it, and keeps only those features whose association with the word is statistically significant. This filters out words such as "the" or "good," which may co-occur frequently with a word but are not necessarily semantically related to it.
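The first pass can be sketched as follows. This is a minimal illustration, not the S-Space implementation: it treats single words as features, uses the log-likelihood ratio (G²) as one possible significance test, and keeps features scoring at or above 3.841 (the chi-square critical value for p = 0.05 with one degree of freedom). The function names, the choice of test, and the threshold are all assumptions made for the example.

```python
import math


def contingency_table(contexts, word, feature):
    """Count contexts by presence/absence of `word` and `feature`.

    Returns (n11, n12, n21, n22): both present, only the word,
    only the feature, neither.
    """
    n11 = n12 = n21 = n22 = 0
    for tokens in contexts:
        has_word, has_feature = word in tokens, feature in tokens
        if has_word and has_feature:
            n11 += 1
        elif has_word:
            n12 += 1
        elif has_feature:
            n21 += 1
        else:
            n22 += 1
    return n11, n12, n21, n22


def log_likelihood_ratio(n11, n12, n21, n22):
    """G^2 for a 2x2 table: 2 * sum(observed * ln(observed / expected))."""
    total = n11 + n12 + n21 + n22
    rows = (n11 + n12, n21 + n22)
    cols = (n11 + n21, n12 + n22)
    observed = ((n11, n12), (n21, n22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            if observed[i][j] > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / total
                g2 += observed[i][j] * math.log(observed[i][j] / expected)
    return 2.0 * g2


def significant_features(contexts, word, threshold=3.841):
    """Map each retained feature of `word` to its G^2 score.

    The threshold is an assumed cutoff, not a value taken from the model.
    """
    context_sets = [set(tokens) for tokens in contexts]
    candidates = set()
    for tokens in context_sets:
        if word in tokens:
            candidates |= tokens
    candidates.discard(word)

    kept = {}
    for feature in candidates:
        n11, n12, n21, n22 = contingency_table(context_sets, word, feature)
        total = n11 + n12 + n21 + n22
        # Keep only positive associations: the observed joint count must
        # exceed the count expected if word and feature were independent.
        attracted = n11 * total > (n11 + n12) * (n11 + n21)
        score = log_likelihood_ratio(n11, n12, n21, n22)
        if attracted and score >= threshold:
            kept[feature] = score
    return kept


# Toy corpus: "the" occurs everywhere, "lawmaker" only alongside "congress".
toy_contexts = [s.split() for s in [
    "the congress lawmaker voted",
    "the congress lawmaker debated",
    "the congress lawmaker spoke",
    "the congress lawmaker met",
    "the river was cold",
    "the weather was good",
    "the dog ran home",
    "the cat sat down",
]]
print(sorted(significant_features(toy_contexts, "congress")))  # prints ['lawmaker']
```

In this toy corpus "the" scores exactly zero because it appears in every context, so its presence carries no information about "congress", while words seen only once with "congress" fall below the cutoff.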

The second pass of the algorithm then revisits all of the contexts in which a word w occurs. Each context is a large region of text around the occurrence, and only those words that are features of w are counted in it. (Words that are not features are not counted; likewise, words that are features only of other words are not counted.) All of these contexts are then clustered. The resulting clusters group together similar contexts in which the word appears. Each cluster is taken to represent a distinct sense of the word, i.e. a set of contexts whose words indicate a specific meaning of w. In the final semantic space, a word is given up to n meanings, depending on the number of discovered clusters, with each meaning receiving its own semantic vector.
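A rough sketch of the second pass is shown below. The text above does not say which clustering algorithm is used, so this example substitutes a simple single-pass "leader" clustering over cosine similarity as a stand-in. The function names, the similarity threshold, and the `max_senses` cap (playing the role of n) are hypothetical and chosen only for illustration.

```python
import math


def context_vector(tokens, features):
    """Count only the tokens that are features of the target word."""
    return [tokens.count(f) for f in features]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_contexts(vectors, threshold=0.5, max_senses=3):
    """Single-pass leader clustering (a stand-in, not the model's method).

    Each vector joins its most similar cluster if the similarity reaches
    `threshold`, or if `max_senses` clusters already exist; otherwise it
    starts a new cluster. Returns (members, sense_vectors).
    """
    sums = []     # per-cluster sum of member vectors
    members = []  # per-cluster list of context indices
    for idx, vec in enumerate(vectors):
        sims = [cosine(vec, s) for s in sums]
        best = max(range(len(sums)), key=sims.__getitem__, default=None)
        if best is not None and (sims[best] >= threshold or len(sums) >= max_senses):
            sums[best] = [a + b for a, b in zip(sums[best], vec)]
            members[best].append(idx)
        else:
            sums.append(list(vec))
            members.append([idx])
    # One semantic vector per sense: the mean of that cluster's contexts.
    sense_vectors = [[x / len(m) for x in s] for s, m in zip(sums, members)]
    return members, sense_vectors


# Assumed output of the first pass for the word "bank".
bank_features = ["money", "loan", "river", "water"]
bank_contexts = [s.split() for s in [
    "she took a loan from the bank to get money",
    "the river bank was covered in water",
    "the bank approved the loan",
    "water rose over the bank of the river",
]]
vectors = [context_vector(c, bank_features) for c in bank_contexts]
members, sense_vectors = cluster_contexts(vectors)
print(members)  # prints [[0, 2], [1, 3]]
```

Here the financial contexts (0 and 2) and the riverside contexts (1 and 3) end up in separate clusters, so "bank" receives two sense vectors. A single-pass scheme like this depends on the order of the contexts; it is used here only because it is short and deterministic.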

Implementation

The P&P model may be run using the edu.ucla.sspace.mains.PurandareMain class or the purandare-pedersen.jar executable archive. See [Mains](/fozziethebeat/S-Space/wiki/Mains) for further details on program options.