PurandareAndPedersen

Purandare and Pedersen's context-clustering semantic space

Introduction

The Purandare and Pedersen (P&P) model builds a semantic space that induces different [senses for a word](http://en.wikipedia.org/wiki/Word_sense) based on the word's different usages in the corpus. This is a form of [word sense induction](http://en.wikipedia.org/wiki/Word_sense_disambiguation#Unsupervised_methods), where the different meanings of a word are automatically extracted. For details on this approach, see:

Amruta Purandare and Ted Pedersen. Word Sense Discrimination by Clustering Contexts in Vector and Similarity Spaces. Proceedings of the Conference on Computational Natural Language Learning (CoNLL), pp. 41-48, May 6-7, 2004, Boston, MA. Available [here](http://www.d.umn.edu/~tpederse/Pubs/conll04-purandarep.pdf).

Algorithm Overview

The P&P model operates in two passes. In the first pass, the corpus is processed to identify features that are likely correlated with a word. In this model, features are either co-occurring words or co-occurring bigrams. For example, the presence of "lawmaker" might be considered a feature for "congress." This pass computes the [contingency table](http://en.wikipedia.org/wiki/Contingency_table) between a word and every feature that has co-occurred with it. Only those features deemed statistically significant are kept. This filters out words such as "the" or "good," which may co-occur frequently but are not necessarily semantically related to the word.
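The feature-selection pass can be sketched in a few lines of Python. This is an illustration only, not the S-Space implementation: the function names, the toy corpus, and the choice of the log-likelihood ratio (G²) with the 3.841 cutoff (the chi-squared critical value for p = 0.05 with one degree of freedom) are assumptions made for the example, since the text above says only that features must be statistically significant.

```python
import math


def contingency_table(contexts, word, feature):
    """Build the 2x2 table of context counts for a word and a candidate feature.

    Returns (n11, n12, n21, n22): both present, word only, feature only, neither.
    """
    n11 = n12 = n21 = n22 = 0
    for tokens in contexts:
        has_word = word in tokens
        has_feature = feature in tokens
        if has_word and has_feature:
            n11 += 1
        elif has_word:
            n12 += 1
        elif has_feature:
            n21 += 1
        else:
            n22 += 1
    return n11, n12, n21, n22


def g_squared(n11, n12, n21, n22):
    """Log-likelihood ratio (G^2) statistic for a 2x2 contingency table."""
    total = n11 + n12 + n21 + n22
    rows = (n11 + n12, n21 + n22)
    cols = (n11 + n21, n12 + n22)
    observed = ((n11, n12), (n21, n22))
    score = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / total
                score += o * math.log(o / expected)
    return 2.0 * score


def is_feature(contexts, word, feature, threshold=3.841):
    """Keep a feature only if it co-occurs with the word more often than chance.

    The threshold is an assumed cutoff (chi-squared, p = 0.05, 1 d.o.f.).
    """
    n11, n12, n21, n22 = contingency_table(contexts, word, feature)
    total = n11 + n12 + n21 + n22
    # Require a positive association: observed joint count above the expected one.
    positive = n11 * total > (n11 + n12) * (n11 + n21)
    return positive and g_squared(n11, n12, n21, n22) > threshold


# Hypothetical toy corpus: each context is a list of tokens.
corpus = [s.split() for s in [
    "the lawmaker addressed congress",
    "congress and the lawmaker agreed",
    "the lawmaker left congress",
    "congress heard the lawmaker",
    "the cat sat",
    "the dog ran",
    "the bird sang",
    "the fish swam",
]]
```

On this toy corpus, "lawmaker" passes the test for "congress", while "the" does not: it appears in every context, so its observed counts match the expected counts exactly and its score is zero.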

The second pass of the algorithm then reconsiders all of the contexts in which a word w occurs. Each context consists of a large region around the occurrence, and only those words that are features of w are counted in the context. (Words that are not features are not counted; similarly, words that are features only of other words are not counted.) All of these contexts are then clustered. The resulting clusters group together the similar contexts in which the word appears. Each cluster is said to represent a distinct sense of the word, i.e., a cluster of contexts whose words indicate a specific meaning of w. In the final semantic space, a word is given up to n meanings depending on the number of discovered clusters, with each meaning receiving its own semantic vector.
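The second pass can be sketched as follows. This is a minimal illustration under stated assumptions, not the S-Space code: the clustering method here is a simple greedy agglomerative merge over cosine similarity, chosen for brevity, and the names `context_vector`, `cluster_contexts`, `max_senses`, and `min_similarity` are hypothetical. Each resulting cluster's summed counts stand in for the semantic vector of one sense.

```python
import math
from collections import Counter


def context_vector(tokens, features):
    """Count only the tokens that are features of the target word."""
    return Counter(t for t in tokens if t in features)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_contexts(vectors, max_senses, min_similarity=0.2):
    """Merge the two most similar clusters until at most max_senses remain
    and no remaining pair is at least min_similarity alike.

    Returns a list of (sense_vector, member_indices) pairs.
    """
    clusters = [(Counter(v), [i]) for i, v in enumerate(vectors)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][0], clusters[j][0])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        if sim < min_similarity and len(clusters) <= max_senses:
            break  # remaining clusters are distinct enough to be separate senses
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical features of "bank", as if selected by the first pass.
features = {"money", "loan", "river", "water"}
contexts = [s.split() for s in [
    "the bank approved the loan and money",
    "money from the bank loan",
    "the river bank near the water",
    "water rose over the river bank",
    "the bank gave money",
]]
vectors = [context_vector(tokens, features) for tokens in contexts]
senses = cluster_contexts(vectors, max_senses=3)
```

Here the five contexts of "bank" collapse into two clusters, one built from the financial contexts and one from the river contexts, even though up to three senses were allowed. Words such as "the" never enter the vectors because they are not in the feature set.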

Implementation

The P&P model may be run using the edu.ucla.sspace.mains.PurandareMain class or by using the purandare-pedersen.jar executable archive. See [Mains](/fozziethebeat/S-Space/wiki/Mains) for further details on program options.
