Full-Text Search in MongoDB

February 28, 2018



MongoDB, one of the leading NoSQL databases, is well known for its fast performance, flexible schema, scalability and great indexing capabilities. At the core of this fast performance lie MongoDB indexes, which support efficient execution of queries by avoiding full-collection scans and hence limiting the number of documents MongoDB searches.

Starting with version 2.4, MongoDB introduced an experimental feature supporting full-text search using text indexes. This feature has since become an integral part of the product and is no longer experimental. In this article we are going to explore the full-text search functionality of MongoDB, starting from the fundamentals.

If you are new to MongoDB, I recommend that you first read the introductory MongoDB articles on Envato Tuts+, which will help you understand the basic concepts.

Before we get into any details, let us look at some background. Full-text search refers to the technique of searching a full-text database against the search criteria specified by the user. It is something similar to how we search any content on Google (or in fact any other search application) by entering certain string keywords/phrases and getting back the relevant results sorted by their ranking.

Here are some more scenarios where we would see a full-text search happening:

  • Consider searching your favorite topic on Wiki. When you enter a search text on Wiki, the search engine brings up results of all the articles related to the keywords/phrase you searched for (even if those keywords were used deep inside the article). These search results are sorted by relevance based on their matched score.
  • As another example, consider a social networking site where the user can make a search to find all the posts which contain the keyword cats in them; or to be more complex, all the posts which have comments containing the word cats.

Before we move on, there are certain general terms related to full-text search which you should know. These terms are applicable to any full-text search implementation (and not MongoDB-specific).

Stop words are the irrelevant words that should be filtered out from a text. For example: a, an, the, is, at, which, etc.

Stemming is the process of reducing the words to their stem. For example: words like standing, stands, stood, etc. have a common base stand.

Scoring is a relative ranking that measures which of the search results is most relevant.

Before MongoDB came up with the concept of text indexes, we would either model our data to support keyword searches or use regular expressions to implement such search functionality. However, each of these approaches had its own limitations:

  • Firstly, neither approach supports functionality like stemming, stop words, ranking, etc.
  • Using keyword searches would require the creation of multi-key indexes, which fall short of what a full-text index offers.
  • Using regular expressions is not efficient from the performance point of view, since these expressions do not effectively utilize indexes.
  • In addition to that, neither technique can be used to perform phrase searches (like searching for ‘movies released in 2015’) or weighted searches.

Apart from these approaches, for more advanced and complex search-centric applications, there are alternative solutions like Elasticsearch or Solr. But using any of these solutions increases the architectural complexity of the application, since MongoDB now has to talk to an additional external database.

Note that MongoDB’s full-text search is not proposed as a complete replacement for search engine databases like Elasticsearch, Solr, etc. However, it can be effectively used for the majority of applications that are built with MongoDB today.

Using MongoDB full-text search, you can define a text index on any field in the document whose value is a string or an array of strings. When we create a text index on a field, MongoDB tokenizes and stems the indexed field’s text content, and sets up the indexes accordingly.

To understand things further, let us now dive into some practical things. I want you to follow the tutorial with me by trying out the examples in mongo shell. We will first create some sample data which we will be using throughout the article, and then we’ll move on to discuss key concepts.

For the purpose of this article, consider a collection messages which stores documents of the following structure:
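The original listing is not available here. As a stand-in, a hypothetical document shape that uses the three fields the article relies on later (subject, content and year) might look like the sketch below. The field values are invented for illustration.

```javascript
// Hypothetical shape of one document in the messages collection.
// Only the field names subject, content and year come from the article;
// the values are made up.
const messageShape = {
  subject: "Joe owns a dog",              // short text field
  content: "Dogs are man's best friend",  // longer text field
  year: 2015                              // numeric field, used for filtering later
};
```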

Let us insert some sample documents using the insert command to create our test data:
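A minimal sketch of such test data follows. These documents are assumptions chosen to fit the keywords used later in the article (dogs, cats, rats, cook), not the author's original listing, so the scores you see may differ from the ones quoted in the text. The `typeof db` guard is only there so the snippet is harmless when run outside the mongo shell.

```javascript
// Illustrative test data (assumed, not the article's original documents).
const sampleMessages = [
  { subject: "Joe owns a dog", content: "Dogs are man's best friend", year: 2015 },
  { subject: "Dogs eat cats and dog eats pigeons too", content: "Cats are not evil", year: 2015 },
  { subject: "Cats eat rats", content: "Rats do not cook food", year: 2014 },
  { subject: "Rats eat Joe", content: "Joe ate a rat", year: 2014 }
];

// `db` only exists inside the mongo shell; elsewhere the insert is skipped.
if (typeof db !== "undefined") {
  db.messages.insertMany(sampleMessages);
}
```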

A text index is created quite similar to how we create a regular index, except that it specifies the text keyword instead of specifying an ascending/descending order.

Create a text index on the subject field of our document using the following query:
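The index specification is an ordinary key document whose value is the string "text" rather than 1 or -1. A sketch, assuming the collection is called messages as in the rest of the article:

```javascript
// Key document for a text index on the subject field.
const subjectTextIndex = { subject: "text" };

// Runs only inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  db.messages.createIndex(subjectTextIndex);
}
```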

To test this newly created text index on the subject field, we will search documents using the $text operator. We will be looking for all the documents that have the keyword dogs in their subject field.

Since we are running a text search, we are also interested in getting some statistics about how relevant the resultant documents are. For this purpose, we will use the { $meta: "textScore" } expression, which provides information on the processing of the $text operator. We will also sort the documents by their textScore using the sort command. A higher textScore indicates a more relevant match.
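Put together, the search could be sketched as follows. The projected field name score is an arbitrary choice; any name works as long as the projection and the sort use the same one.

```javascript
// Filter: full-text search for the keyword "dogs".
const dogsFilter = { $text: { $search: "dogs" } };

// Projection and sort key: expose the relevance score under the name "score".
const dogsScore = { score: { $meta: "textScore" } };

// Runs only inside the mongo shell, where `db` and `printjson` are defined.
if (typeof db !== "undefined") {
  db.messages.find(dogsFilter, dogsScore)
    .sort(dogsScore)                    // most relevant documents first
    .forEach(doc => printjson(doc));
}
```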

The above query returns the following documents containing the keyword dogs in their subject field.

As you can see, the first document has a score of 1 (since the keyword dog appears twice in its subject) as opposed to the second document with a score of 0.66. The query has also sorted the returned documents in descending order of their score.

One question that might arise in your mind is: if we are searching for the keyword dogs, why is the search engine taking the keyword dog (without ‘s’) into consideration? Remember our discussion on stemming, where any search keywords are reduced to their base? This is the reason why the keyword dogs is reduced to dog.

More often than not, you will be using text search on multiple fields of a document. In our example, we will enable compound text indexing on the subject and content fields. Go ahead and execute the following command in mongo shell:

Did this work? No!! Creating a second text index will give you an error message saying that a full-text search index already exists. Why is it so? The answer is that text indexes come with a limitation of only one text index per collection. Hence if you would like to create another text index, you will have to drop the existing one and recreate the new one.
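The drop-and-recreate step might look like the sketch below. It assumes the earlier index was created on subject with the default name, which MongoDB derives as subject_text; a text index has to be dropped by name rather than by its key document.

```javascript
// Default name MongoDB generates for the index { subject: "text" }.
const oldTextIndexName = "subject_text";

// New compound text index covering both fields.
const compoundTextIndex = { subject: "text", content: "text" };

// Runs only inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  db.messages.dropIndex(oldTextIndexName);     // remove the single-field text index
  db.messages.createIndex(compoundTextIndex);  // then create the combined one
}
```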

After executing the above index creation queries, try searching for all documents with keyword cat.

The above query would output the following documents:

You can see that the score of the first document, which contains the keyword cat in both subject and content fields, is higher.

In the last example, we put a combined index on the subject and content fields. But there can be scenarios where you want any text content in your documents to be searchable.

For example, consider storing emails in MongoDB documents. In the case of emails, all the fields, including Sender, Recipient, Subject and Body, need to be searchable. In such scenarios you can index all the string fields of your document using the $** wildcard specifier.

The query would go something like this (make sure you are deleting the existing index before creating a new one):
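A sketch of the wildcard variant is shown here. For simplicity it calls dropIndexes(), which removes every index on the collection except the one on _id; if you have other indexes you want to keep, drop only the text index by name instead.

```javascript
// "$**" matches every field that contains string data.
const wildcardTextIndex = { "$**": "text" };

// Runs only inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  db.messages.dropIndexes();                   // drops all indexes except _id
  db.messages.createIndex(wildcardTextIndex);
}
```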

This query would automatically set up text indexes on any string fields in our documents. To test this out, insert a new document with a new field location in it:

Now if you try text searching with keyword chicago (query below), it will return the document which we just inserted.
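Both steps could be sketched as follows. The document is a hypothetical example; only the field name location and the search keyword chicago come from the article. Text search is case-insensitive by default, so the lowercase keyword matches the capitalized value.

```javascript
// Hypothetical document with an extra string field, location.
const chicagoMessage = {
  subject: "Birds can cook",
  content: "Birds do not eat rats",
  year: 2015,
  location: "Chicago"
};

// Search for a keyword that only appears in the new field.
const chicagoFilter = { $text: { $search: "chicago" } };

// Runs only inside the mongo shell, where `db` and `printjson` are defined.
if (typeof db !== "undefined") {
  db.messages.insertOne(chicagoMessage);
  db.messages.find(chicagoFilter).forEach(doc => printjson(doc));
}
```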

A few things I would like to focus on here:

  • Observe that we did not explicitly define an index on the location field after we inserted a new document. This is because we have already defined a text index on the entire document using the $** wildcard specifier.
  • Wildcard indexes can be slow at times, especially in scenarios where your data is very large. For this reason, plan your wildcard indexes wisely, as they can cause a performance hit.

You can search for multi-word strings like “smart birds who love cooking” using text indexes. By default, such a search performs a logical OR over the stemmed keywords, i.e. it will look for documents which contain any of the keywords smart, bird, love or cook.
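As a sketch, the whole string is simply passed to $search; MongoDB splits it into terms on its own:

```javascript
// Space-separated terms are combined with a logical OR.
const multiTermFilter = { $text: { $search: "smart birds who love cooking" } };

// Runs only inside the mongo shell, where `db` and `printjson` are defined.
if (typeof db !== "undefined") {
  db.messages.find(multiTermFilter).forEach(doc => printjson(doc));
}
```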

This query would output the following documents:

In case you would like to perform an exact phrase search, in which the words must appear together and in the given order, you can do so by enclosing the phrase in double quotes inside the search text.
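Because the search text is itself a string literal, the inner double quotes have to be escaped. A sketch using the phrase cook food:

```javascript
// The escaped inner quotes mark "cook food" as an exact phrase.
const phraseFilter = { $text: { $search: "\"cook food\"" } };

// Runs only inside the mongo shell, where `db` and `printjson` are defined.
if (typeof db !== "undefined") {
  db.messages.find(phraseFilter).forEach(doc => printjson(doc));
}
```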

This query would result in the following document, which contains the phrase “cook food” together:

Prefixing a search keyword with - (minus sign) excludes all the documents that contain the negated term. For example, try searching for any document which contains the keyword rat but does not contain birds using the following query:
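A sketch of the negated search; note that the minus sign is attached directly to the term it excludes:

```javascript
// Match documents containing "rat" but not "birds".
const negationFilter = { $text: { $search: "rat -birds" } };

// Runs only inside the mongo shell, where `db` and `printjson` are defined.
if (typeof db !== "undefined") {
  db.messages.find(negationFilter).forEach(doc => printjson(doc));
}
```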

One important piece of functionality I have not disclosed until now is how you can look behind the scenes and see how your search keywords are being stemmed, stripped of stop words, negated, etc. explain() to the rescue. You can run explain on a query by passing true as its parameter, which will give you detailed stats on the query execution.
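A sketch of such a call is shown below. The search string is a hypothetical one that mixes plain terms, a stop word, a quoted phrase and a negated term, so that each kind shows up in the output. The code assumes a standalone server, where the plan sits under queryPlanner.winningPlan; the exact nesting of the text stage varies between MongoDB versions.

```javascript
// Hypothetical search string: plain terms, a phrase and a negation.
const explainFilter = {
  $text: { $search: "dogs who cats dont eat ate rats \"dogs eat\" -friends" }
};

// Runs only inside the mongo shell, where `db` and `printjson` are defined.
if (typeof db !== "undefined") {
  // Passing true is interpreted as the most verbose mode;
  // string modes such as "executionStats" are accepted as well.
  const textPlan = db.messages.find(explainFilter).explain(true);
  printjson(textPlan.queryPlanner.winningPlan);
}
```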

If you look at the queryPlanner object returned by the explain command, you will be able to see how MongoDB parsed the given search string. Observe that it ignored stop words like who, and stemmed dogs to dog.

You can also see the terms which we negated in our search and the phrases we used in the parsedTextQuery section.

The explain query will be highly useful as we perform more complex search queries and want to analyze them.

When we have indexes on more than one field in our document, most of the time one field will be more important (i.e. carry more weight) than the other. For example, when you are searching across a blog, the title of the blog should carry the highest weight, followed by the blog content.

The default weight for every indexed field is 1. To assign relative weights for the indexed fields, you can include the weights option while using the createIndex command.

Let’s understand this with an example. If you try searching for the cook keyword with our current indexes, it will result in two documents, both of which have the same score.

Now let us modify our indexes to include weights; with the subject field having a weight of 3 against the content field having a weight of 1.
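Weights are passed in the options document of createIndex. A sketch, again using dropIndexes() to clear the previous text index (this removes every index except the one on _id):

```javascript
// Same two text fields as before.
const weightedKeys = { subject: "text", content: "text" };

// A match in subject counts three times as much as a match in content.
const weightedOptions = { weights: { subject: 3, content: 1 } };

// Runs only inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  db.messages.dropIndexes();
  db.messages.createIndex(weightedKeys, weightedOptions);
}
```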

Try searching for keyword cook now, and you will see that the document which contains this keyword in the subject field has a greater score (of 2) than the other (which has 0.66).

As the data stored in your application grows, the size of your text indexes keeps on growing too. With this increase in size of text indexes, MongoDB has to search against all the indexed entries whenever a text search is made.

As a technique to keep your text search efficient with growing indexes, you can limit the number of scanned index entries by using equality conditions with a regular $text search. A very common example of this would be searching all the posts made during a certain year/month, or searching all the posts with a certain category/tag.

If you observe the documents which we are working upon, we have a year field in them which we have not used yet. A common scenario would be to search messages by year, along with the full-text search that we have been learning about.

For this, we can create a compound index that specifies an ascending/descending index key on year followed by a text index on the subject field. By doing this, we are doing two important things:

  • We are logically partitioning the entire collection data into sets separated by year.
  • This would limit the text search to scan only those documents which fall under a specific year (or call it set).

Drop the indexes that you already have and create a new compound index on (year, subject):
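A sketch of the compound index, with an ascending key on year placed before the text key:

```javascript
// Regular ascending key on year, followed by a text key on subject.
const yearSubjectIndex = { year: 1, subject: "text" };

// Runs only inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  db.messages.dropIndexes();                  // drops all indexes except _id
  db.messages.createIndex(yearSubjectIndex);
}
```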

Now execute the following query to search all the messages that were created in 2015 and contain the cats keyword:
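The query combines an equality condition on year with the $text expression. The sketch below also asks for execution statistics so that the number of examined documents can be inspected; it assumes a standalone server, where that figure is reported as executionStats.totalDocsExamined.

```javascript
// Equality on the leading index key plus a full-text condition.
const catsIn2015 = { year: 2015, $text: { $search: "cats" } };

// Runs only inside the mongo shell, where `db`, `print` and `printjson` are defined.
if (typeof db !== "undefined") {
  db.messages.find(catsIn2015).forEach(doc => printjson(doc));

  const yearStats = db.messages.find(catsIn2015).explain("executionStats");
  print("documents examined: " + yearStats.executionStats.totalDocsExamined);
}
```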

The query would return only one matched document as expected. If you explain this query and look at the executionStats, you will find that totalDocsExamined for this query was 1, which confirms that our new index got utilized correctly and MongoDB had to scan only a single document while safely ignoring all other documents which did not fall under 2015.

We have come a long way in this article learning about text indexes. There are many other concepts that you can experiment with text indexes. But owing to the scope of this article, we will not be able to discuss them in detail today. Nevertheless, let’s have a brief look at what these functionalities are:

  • Text indexes provide multi-language support, allowing you to search in different languages using the $language option of the $text operator. MongoDB currently supports around 15 languages, including French, German, Russian, etc.
  • Text indexes can be used in aggregation pipeline queries. The $match stage in an aggregate search can specify the use of a full-text search query, provided it is the first stage of the pipeline.
  • You can use your regular operators for projections, filters, limits, sorts, etc., while working with text indexes.
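To give a flavor of the first two points, here is a rough sketch of an aggregation pipeline that runs a text search with an explicit language. It assumes the compound (year, subject) index from the previous section is still in place, which is why the $match stage also carries an equality condition on year. The language code, the limit and the projected fields are illustrative choices.

```javascript
// The $text condition must appear in the first $match stage of the pipeline.
const catsPipeline = [
  { $match: { year: 2015, $text: { $search: "cats", $language: "en" } } },
  { $sort: { score: { $meta: "textScore" } } },   // most relevant first
  { $limit: 5 },
  { $project: { _id: 0, subject: 1, score: { $meta: "textScore" } } }
];

// Runs only inside the mongo shell, where `db` and `printjson` are defined.
if (typeof db !== "undefined") {
  db.messages.aggregate(catsPipeline).forEach(doc => printjson(doc));
}
```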

Keeping in mind the fact that MongoDB full-text search is not a complete replacement for traditional search engine databases used with MongoDB, using the native MongoDB functionality is recommended for the following reasons:

  • As per a recent talk at MongoDB, the current scope of text search works perfectly fine for a majority of applications (around 80%) that are built using MongoDB today.
  • Building the search capabilities of your application within the same application database reduces the architectural complexity of the application.
  • MongoDB text search works in real time, without any lags or batch updates. The moment you insert or update a document, the text index entries are updated.
  • Because text search is integrated into the core database functionality of MongoDB, it is fully consistent and works well even with sharding and replication.
  • It integrates perfectly with your existing Mongo features such as filters, aggregation, updates, etc.

Since full-text search is a relatively new feature in MongoDB, there is certain functionality which it currently lacks. I would divide the gaps into three categories: missing functionality, restrictions on existing MongoDB features, and performance costs. Let’s have a look.

  • Text Indexes currently do not have the capability to support pluggable interfaces like pluggable stemmers, stop words, etc.
  • They do not currently support features like searching based on synonyms, similar words, etc.
  • They do not store term positions, i.e. the number of words by which the two keywords are separated.
  • Sort operations cannot obtain their sort order from a text index, not even from a compound text index.
  • A compound text index cannot include any other type of index, like multi-key indexes or geo-spatial indexes. Additionally, if your compound text index includes any index keys before the text index key, all the queries must specify the equality operators for the preceding keys.
  • There are some query-specific limitations. For example, a query can specify only a single $text expression, you can’t use $text with $nor, you can’t use the hint() command with $text, using $text with $or needs all the clauses in your $or expression to be indexed, etc.
  • Text indexes create an overhead while inserting new documents. This in turn hits the insertion throughput.
  • Some queries like phrase searches can be relatively slow.

Full-text search has always been one of the most demanded features of MongoDB. In this article, we started with an introduction to what full-text search is, before moving on to the basics of creating text indexes.

We then explored compound indexing, wildcard indexing, phrase searches and negation searches. Further, we explored some important concepts like analyzing text indexes, weighted search, and logically partitioning your indexes. We can expect some major updates to this functionality in the upcoming releases of MongoDB.

I recommend that you give text-search a try and share your thoughts. If you have already implemented it in your application, kindly share your experience here. Finally, feel free to post your questions, thoughts and suggestions on this article in the comment section.