GSoCIdeas

Google Summer of Code project ideas

Introduction

This is the ideas page for Google Summer of Code. We have listed a handful of interesting project ideas that would benefit not only the S-Space Package, but also the many researchers worldwide who use the package for their projects. Our key goals are to make the package more useful, more flexible, and more reliable.
Our projects are designed to give you a taste of both high-quality development and exposure to interesting research questions. We want to feed your passion for development and research with interesting, challenging, and meaningful projects!

This list is an overview of some major projects that the team has been considering. If any of them interest you, or you would like more information on any of them, please send an email to our development list, [email protected]. The list is not comprehensive: if you have ideas beyond what we've listed, we're very interested to hear them. Please share them via the mailing list, and we can find a way to turn an idea into a full GSoC project.

New things you will learn as a part of working with us:

Industrial-quality Java development with a focus on scalability, memory efficiency and concurrency
All about the [Distributional Hypothesis] (http://en.wikipedia.org/wiki/Distributional_hypothesis) and [distributional semantics] (http://en.wikipedia.org/wiki/Statistical_semantics)
New and interesting ideas in [Natural Language Processing] (http://en.wikipedia.org/wiki/Natural_language_processing) and [Computational Linguistics] (http://en.wikipedia.org/wiki/Computational_linguistics)

Tools you will learn in your projects (if you didn't know them already):

[Maven] (http://maven.apache.org/)
[Git] (http://git-scm.com/)
[Concurrent] (http://java.sun.com/developer/technicalArticles/J2SE/concurrency/) Java programming
Writing unit tests with [JUnit] (http://www.junit.org/)

Things you can expect from us:

Guidance to help you select your project and future directions. We want you to have a clear vision of what you're getting into and, hopefully, a lot of excitement as well.
Full support via email, IM, and IRC for all your questions to ensure you are able to keep making progress. Getting stuck or not knowing what to do next is frustrating; we want your development experience to be both fun and challenging!
Constant encouragement. Your work really matters to researchers around the world and we want you to know it.
Respect for your passion and development skills.

How to get started:

Read up on [distributional semantics] (http://en.wikipedia.org/wiki/Statistical_semantics) to get an idea of what we do and what kinds of problems people are solving with the S-Space Package.
Download or check out the latest source code
Join our mailing list, [email protected]
Download a corpus to play around with some of the algorithms. A plain-text version of one of Project Gutenberg's [top 100 books] (http://www.gutenberg.org/browse/scores/top) makes a great start. (If you have some disk space available, the [Open American National Corpus] (http://americannationalcorpus.org/OANC/index.html#download) or a [Wikipedia snapshot] (http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2) makes for an interesting corpus.)
Run any of the algorithms over your corpus to build a semantic space. Load it into [SemanticSpaceExplorer] (http://code.google.com/p/airhead-research/wiki/SemanticSpaceExplorer) and see what you can discover.
- Do the neighbors of words seem to capture some related properties?
- Are neighbors synonyms or are they related in a different way?
- How does word frequency affect your results?
- How do the neighbors change between different algorithms?
If you're feeling ambitious, consider reading Turney and Pantel's (2010) [survey] (http://www.patrickpantel.com/cgi-bin/web/tools/getfile.pl?type=paper&id=2010/jair10.pdf) of the different distributional approaches. This will provide more context for what you might be working on, but don't feel obligated to read all of it. :)
Email us! We love to hear from our users.
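
To make the neighbor questions in the steps above concrete, here is a minimal sketch of one common way to find a word's neighbors in a vector space: rank every other word by its cosine similarity to the query word. The class `NeighborSketch`, its methods, and the toy three-dimensional vectors are all hypothetical and are not part of the S-Space Package API; a real semantic space would come from the output of whichever algorithm you ran.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not an S-Space Package class. */
public class NeighborSketch {

    /** Cosine similarity of two equal-length vectors; 0 if either is all zeros. */
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns up to k words most similar to the query word, most similar first. */
    public static List<String> nearest(Map<String, double[]> space, String query, int k) {
        double[] q = space.get(query);
        List<String> candidates = new ArrayList<>();
        for (String word : space.keySet()) {
            if (!word.equals(query)) {
                candidates.add(word);
            }
        }
        // Negate the similarity so that the sort is in descending order.
        candidates.sort(Comparator.comparingDouble((String w) -> -cosine(q, space.get(w))));
        return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
    }

    public static void main(String[] args) {
        // Toy vectors standing in for a real semantic space (assumed values).
        Map<String, double[]> space = new LinkedHashMap<>();
        space.put("cat", new double[] {1.0, 0.9, 0.0});
        space.put("dog", new double[] {0.9, 1.0, 0.1});
        space.put("car", new double[] {0.0, 0.1, 1.0});
        System.out.println(nearest(space, "cat", 2)); // prints [dog, car]
    }
}
```

Swapping in a different similarity measure, or comparing the neighbor lists produced by two different algorithms for the same query word, is an easy way to start exploring the questions listed above.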

GSoC Project List

Task	Difficulty	Description	Rationale	Ideal Deliverables	Skill Requirements
Implement an interactive GUI for the S-Space Package	Easy	Currently, the S-Space Package can only be accessed via the command line. For many people, this is an unfamiliar model. An S-Space Package GUI should give users the flexibility to run a variety of semantic space models over a variety of corpora. Ideally, a user would be able to select parameters for each semantic space model, such as the number of dimensions, a matrix transform, the form of dimensionality reduction, word filters, and so on, from a series of simple menus. The user could then select their corpus of choice, possibly have a chance to clean the corpus, and then decide where to save the final semantic space.	A GUI significantly lowers the bar for playing around with a variety of semantic space algorithms. With a GUI, users would be able to easily select one or more algorithms, run them, and then plug them into their application. For a good example of an easy-to-use interface for a complex set of algorithms, see Weka's GUI.	A GUI that exposes the ability to: (1) load and filter input corpora (e.g. remove stop words) using features provided by the S-Space Package; (2) run all of the semantic-space building algorithms, with algorithm-specific options for customizing the build; (3) analyze the resulting semantic space and interact with it, much like how the SemanticSpaceExplorer tool operates on the command line; and (4) plug in new algorithms easily to let researchers rapidly prototype.	Java; familiarity with GUI design; Swing / AWT; a strong desire to make software "Just Work"
Integrate S-Spaces with Graph Visualization and Community Detection	Medium to Hard	One of the nice properties of a semantic space is that it is a space. You can think about the connections between words in terms of distances, angles, vectors, etc. This project seeks to extend this idea by visualizing a semantic space model as a graph. The project will start with the idea of visualizing the space by representing words as vertices and connecting nearby neighbors with edges. We then would like to add support for [community detection] (http://en.wikipedia.org/wiki/Community_structure) to help group related words into semantic categories. Our target graph visualization platform is [Gephi] (http://gephi.org/).	The complete structure of semantic space models is currently hard to visualize using a command-line interface. By visualizing these spaces as graphs, researchers will be better able to distinguish the features of each algorithm's semantic space. They can then visually determine which type of space best captures the semantics for solving their particular type of problem. Furthermore, community structure helps researchers assess global properties of the space, such as its conceptual organization.	Integration with Gephi to display a semantic space as a graph; one or more community detection algorithms implemented for the graph (this can be as easy (e.g. [min-cut] (http://en.wikipedia.org/wiki/Minimum_cut)) or as difficult (e.g. [clique percolation] (http://en.wikipedia.org/wiki/Clique_percolation_method)) as you would like, and you can implement more than one if you choose); integration with Gephi to display community structure	Java; familiarity with Gephi or other graphing software; familiarity with graph data structures and algorithms
Create a web service around Semantic Space models	Medium	Semantic spaces for millions of words can often grow into the tens of gigabytes and take significant resources to compute. While the model only needs to be created once, sharing its data is prohibitively intensive on network bandwidth. This project focuses on exposing semantic space data as a network application. Users can query the different semantic spaces with remote method calls to access information much as they would if the data were local. Our target platform for this is [Google App Engine] (https://appengine.google.com/start).	This project aims at increasing the accessibility of semantic space data. As new semantic space models are built, their contents can be rapidly disseminated via web service without having to download the entire data set. Furthermore, the web service allows applications that use semantic spaces to access the data in a lightweight manner, which opens the possibility of using the data in other web apps.	A simple web service that can expose the contents of a semantic space via method call; a client-side API in both Java and JavaScript for clients to access the data; additional functionality for requesting the semantic space data in pre-processed forms, e.g. select 100 neighbors and cluster them	Java; familiarity with networking or remote API calls; JavaScript experience is a plus
Implement new clustering algorithms	Medium-High	[Clustering] (http://en.wikipedia.org/wiki/Cluster_analysis) is fundamental to many data applications and is often essential in analyzing data. The S-Space Package is currently integrating a variety of innovative and efficient clustering algorithms such as [hierarchical] (http://en.wikipedia.org/wiki/Hierarchical_clustering) clustering and [spectral] (http://en.wikipedia.org/wiki/Cluster_analysis#Spectral_clustering) clustering. Our ultimate goal is to provide a robust, diverse library of algorithms for researchers to use in analyzing data. Your task would be to select one or more clustering algorithms and implement them according to our clustering API. This ensures that other researchers can easily adopt your work and use it in new applications. Ideally, we would like you to select an algorithm with two key properties: efficient computation time on sparse data sets and an ability to infer the number of clusters via parameters. (We can help guide you on different algorithms.)	A number of clustering algorithms have been proposed in different research fields, often outside the machine learning literature. We would like the S-Space Package to be a good resource of effective clustering algorithms for word spaces. These algorithms will let researchers discover more relations both within the word spaces and in broader research areas.	One or more clustering algorithms implemented to match the Clustering interface; an analysis of each algorithm's performance and scalability on co-occurrence data; a multi-threaded and/or Hadoop-based implementation of the same algorithm (stretch goal)	Java; computer science theory; familiarity with machine learning; linear algebra is a plus; concurrent programming is a big plus
Implement an S-Space algorithm in Hadoop	High	All but one of the S-Space algorithms are implemented with the default Java threading framework. This form of parallelism has a number of limitations as the number of cores increases. The Hadoop framework is an effective method of utilizing a large number of machines for highly parallel tasks. This task would require the student to find similar processing patterns in the S-Space algorithms, and create a simplified Hadoop processing system that covers as many of those shared parallel patterns as possible.	Hadoop's parallelism can scale to a massive number of nodes, which is becoming increasingly necessary as the amount of text data available increases. This project would let researchers already using Hadoop leverage their setup to fully utilize the S-Space Package.	Extend the current Hadoop infrastructure in the S-Space Package to easily work with new algorithms; implement HAL as an easy first start; implement BEAGLE as a second case; extend the Hadoop infrastructure to distribute other computationally intensive operations, such as finding the nearest neighbor in a large semantic space or computing an affinity matrix	Java; familiarity with concurrent and/or distributed programming (Hadoop is a plus); access to a Hadoop server is a definite plus
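
As a rough illustration of the graph visualization idea in the table above (words as vertices, nearby words joined by edges), the sketch below turns a toy set of word vectors into a CSV edge list. It is only a starting point under several assumptions: the class `GraphSketch` and its methods are hypothetical and not part of the S-Space Package, edges are chosen with a simple cosine-similarity threshold rather than a fixed number of neighbors, and the `Source,Target,Weight` header is assumed to match what your graph tool's spreadsheet import expects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration only; not an S-Space Package class. */
public class GraphSketch {

    /** Cosine similarity of two equal-length vectors; 0 if either is all zeros. */
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Returns CSV lines: a header, then one undirected edge for every pair of
     * words whose cosine similarity is at least the threshold.
     */
    public static List<String> edgeList(Map<String, double[]> space, double threshold) {
        List<String> words = new ArrayList<>(space.keySet());
        List<String> lines = new ArrayList<>();
        lines.add("Source,Target,Weight");
        // Compare every pair once; this is quadratic in the vocabulary size,
        // so a real implementation would need something smarter.
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                double sim = cosine(space.get(words.get(i)), space.get(words.get(j)));
                if (sim >= threshold) {
                    lines.add(words.get(i) + "," + words.get(j) + ","
                            + String.format(Locale.ROOT, "%.3f", sim));
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Toy vectors standing in for a real semantic space (assumed values).
        Map<String, double[]> space = new LinkedHashMap<>();
        space.put("cat", new double[] {1.0, 0.9, 0.0});
        space.put("dog", new double[] {0.9, 1.0, 0.1});
        space.put("car", new double[] {0.0, 0.1, 1.0});
        for (String line : edgeList(space, 0.5)) {
            System.out.println(line);
        }
    }
}
```

With the toy vectors, only the cat/dog pair clears the 0.5 threshold, so the output is the header plus a single edge. Community detection would then operate on the resulting graph rather than on the raw vectors.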
