We got an opportunity to work on an NLP project recently. We had to build a search engine that could fetch top-n results based on semantic similarity between a set of texts and an unknown text. We tried different methods like TF-IDF/ BM25, cosine/euclidean distance between various embeddings such as BERT, RoBERTA, XLNET, etc.
Given an unknown text, the goal was to compare it with each text from a set of 10M texts, calculate a similarity score for each comparison, and get the top 500 results based on the highest similarity scores.
It was a painstaking process but fortunately, we stumbled across an insightful GitHub discussion. Below, we have summarized our lessons from the ordeal and hope anyone else doing a similar project will find it helpful.
Steps to create the engine:
Creating a text-based search engine involved a lot of steps. However, if we had to summarize it, there were only two main milestones to keep in mind:
- To design a robust and accurate similarity algorithm.
- To find a way to fetch the best results from the search space quickly.
We achieved the above with a combination of Elasticsearch (BM25)*, Sentence transformers (word average embedding), and Facebook’s FAISS.
- Elasticsearch (BM25)*:
- High false-negative rate and low false-positive.
- It works very well with long texts with multiple sentences.
- Sentence transformer (word average embedding):
- High false-positive and low false-negative.
- Works very well with short sentences.
If the above terms confuse you, false-negative means that even when the texts are quite similar, the algorithm deems them dissimilar, while the case of false-positive arises when distinct texts are adjudged similar.
To put it simply, if Elasticsearch (BM25)* labels a couple of texts as similar then we can say with high confidence that they are similar but it might miss some similar texts. On the other hand, if two sentences are similar in reality, the sentence transformer model will not miss those, however, it may judge some dissimilar texts as similar.
To find the balance between the unique characteristics of the two approaches, we used both in tandem. First, we used Elasticsearch (BM25)* to find high confidence top-n similar texts and then used sentence transformer embeddings to re-assess the rank within top-n similar texts to get more accurate results compared to each algorithm working alone.
Fetching similar results from search space
Since we were working with a search space of 10M text, the next challenge was to optimise the search.
To solve this challenge, we created an optimised database index for all 10M embeddings using FAISS which drastically reduced the search time to less than 1 second.
Let’s help you understand how FAISS does this. First of all, FAISS is designed to work with embeddings. Secondly, FAISS gives an approximated score which might have a little, albeit very minor, deviation from the original distance. But, since we are interested in only comparing them relatively, this works like a charm.
We followed the workflow to come up with a super-fast semantic similarity based search engine. This has helped our clients build custom search engines for internal data and save a significant amount of time.
Note: We used Elasticsearch BM25 in our workflow however, TF-IDF can also be employed as an equally viable alternative with similar characteristics.
Please feel free to reach out to us at www.aidetic.in or email@example.com for more information. Lastly, thanks to Nils Reimers for the insightful discussion at GitHub.