Skip to content

Database Options

Amir Naghibi edited this page Oct 13, 2018 · 5 revisions

Database Options

There are many possible database technologies and combinations that we could use for our project. This is a wiki page to flesh out thinking on the options available to us for our database. Especially because of our limited time frame, we should make our database design choice carefully because we likely won't have time to migrate data and make the required changes to our code later if we run into a roadblock issue.

Requirements

It is of paramount importance for us to think about what characteristics our database should have based on the requirements for our application. This should ideally also include what future requirements we may need to handle. Here are some of our requirements for the foreseeable future that the database will have an impact on (please add more if I have missed any):

  • Data store for internships, users, and related data
  • Fast search through lots of internship-related text (e.g. titles, descriptions)
  • Search that can rank results based on relevance, location, and other properties
  • Search that doesn't require exact matches for keywords using techniques such as stemming and fuzzy search

Since any database has the capability of just storing our data, the main technical issues for our project are related to search which is a core functionality. If users can't search the data and just have to look through a list of internships, the app will be of little use.

Other considerations

Beyond technical requirements, there are other things we should take into account before choosing a database. Below are some of the variables that we should examine before locking ourselves into a database:

  • Our level of experience with the technology
  • Completeness and quality of documentation
  • Monetary cost (Bruno has informed me there will be an announcement soon regarding what costs will be covered)
  • Compatibility with other technologies

Solutions

PostgreSQL

PostgreSQL is a very viable option for what we want to do. It has the speed of similar to a SQL database plus some extra features on top of it. It supports ranking (ordering results by relevance), stemming (search using the root of the word rather than exact matches) and fuzzy search (search that corrects for misspellings). All three of these features would be very important for a good search engine that gives you the results you are looking for, and doesn't feel clunky and restrictive.

These features are all supported natively in the query language! It has SQL functions such as to_tsvector that convert a block of text to a list of the word roots and their locations in the text. To rank, you can simply order your query by the ts_rank of the results. It even allows you to weight different pieces of the content as more important than others (e.g. the title over the content)!

I think that Postgres is a strong option because it gets the job done without any unnecessary complication. Most if not all of us have experience with SQL, and Postgres is just an extension of the language. Postgres has a lot of documentation, and is fairly widely used so it should be easy to find answers to questions that come up.

Linked here is an interesting tutorial outlining how all of the mentioned features work in Postgres.

Elasticsearch or Solr

Elasticsearch is another viable option for our project. It is a REST based NoSQL database where search is the focus. It specialises in sifting through massive amounts of data very quickly. It supports features such as ranking, stemming and fuzzy search along with many other search related features. It is also extremely scalable because it is natively distributed and runs as a cluster of nodes that are all interacted with in the same way. It shards data between nodes and keeps copies of shards in different nodes for backup. It can also automatically detect failures to make sure data is saved and available.

For our project, the search functionality of Elastic is attractive as we can create complex queries for search problems that come in the future (e.g. advanced search features such as results not including a keyword). The fact that Elastic is distributed and very efficient at searching large datasets means it will remain fast when our dataset grows more, mitigating possible future performance issues. The backup and resilience functionality would be very useful for handling failures in our live system.

Solr is another database that is based on search. Much like Elastic, it will exceed our requirements for search functionality. It is not natively distributed like Elastic which is good if we only want a single node because it will reduce complexity. If we want to have a distributed database, it is not a huge task to make it distributed with much of the same features as Elastic using ZooKeeper. Solr has some additional features such as supporting more data formats and 3rd party integration.

Both Elasticsearch and Solr are newer technologies than PostgreSQL. This means that it will likely be harder to find documentation and help on the topics. That said, the Solr documentation is quite complete and from what I've seen it is fairly easy to navigate. Elasticsearch's documentation seems to be less expansive, but still more than adequate for what we need.

Both Solr and Elasticsearch would be extremely performant for searching through our data and have many other features that would be useful in the future. Choosing these technologies could also have a downside. Both of these technologies are more complex than other databases and have many features that probably would not be relevant for us in the timeline that we have. The features that Elasticsearch and Solr bring, especially Elastic's distributed nature, come with added complexity. These are both unfamiliar technology to probably all of us and have unique ways of accessing data. However, this is also a positive in my view because we will be learning how to use a new technology in a very practical manner.

Article outlining some of the key search features in Elasticsearch

High-level overview of Solr's key features

Comparison between Elasticsearch and Solr (a lot of this is over my head, but useful if you want to find a specific feature)

MongoDB

MongoDB is a good option for our use that has text search. It is a NoSQL database which means our schema can be more dynamic which is important for a project that may be making changes rapidly. It supports text search over multiple fields using a text index. It allows ranking results by relevance, stemming and some simple features such as exact phrase matching and term exclusion natively.

Compared to the other options listed, it's search functionality is more limited. It doesn't have any fuzzy search capability natively. It also has a less customizable ranking system. These features are useful, but I don't think they are absolutely necessary for our project. Another concern with Mongo is that it probably won't be super fast for searching large amounts of text. Searching through documents is slower than SQL tables, and, as a general purpose database, Mongo is not optimized for search.

I think MongoDB is a viable option for our project because it has the functionality we need for a basic text search engine. For missing features such as fuzzy search, there are ways of performing this without Mongo having the feature natively. MongoDB is also well documented and widely used, and some of us will have had experience using it before.