Recently, we had the opportunity to work on an NLP project to fetch top-n results based on semantic similarity between a set of texts and an unknown text. We tried methods like TF-IDF/ BM25, cosine/euclidean distance between various embeddings BERT, RoBERTA, XLNET, etc.
Given an unknown text, the goal was to compare it with each text from a set of 10M texts, calculate a similarity score for each comparison, and get the top 500 results based on the highest similarity scores.
It was a painstaking process but fortunately, we stumbled across an insightful GitHub discussion. Below, we have summarized our lesson from the ordeal and hope anyone else doing a similar project will find it helpful.
Steps for creating the engine
Creating a text-based search engine involves a lot of steps. However, if we had to summarize it, there are only two main milestones and the rest is easy
- Design a robust and accurate similarity algorithm
- Find a fast way to fetch the best results from the search space
We achieved the above with a combination of Elasticsearch (BM25)*, Sentence transformers (word average embedding) and Facebook’s FAISS.
- Elasticsearch (BM25)*:
- High false negative rate and low false positive
- It works very well with long texts with multiple sentences
- Sentence transformer (word average embedding):
- High false positive and low false negative
- Works very well with short sentences
If the above terms confuse you, false negative means even when the texts were quite similar, the algorithm deems them dissimilar, while case of false positive arises when distinct texts are adjudged similar.
Putting it simply, if Elasticsearch (BM25)* labels a couple of texts as similar then we can say with high confidence that they are similar but it might miss some similar texts. On the other hand, if two sentences are similar in reality, the sentence transformer model will not miss those however, it may also judge some dissimilar texts as similar.
Let’s get back to our previous point. To find the balance between the unique characteristics of the two approaches, we used both in tandem. First, we used Elasticsearch (BM25)* to find high confidence top-n similar texts and then used sentence transformer embeddings to re-asses the rank within top-n similar texts to get more accurate results compared to each algorithm working alone.
Fetching similar results from search space
Since we were working with a search space of 10M text, the next challenge was to optimize the search.
To solve this challenge, we created an optimised database index for all 10M embeddings using FAISS which drastically reduced the search time to within 1 second.
Let’s help you understand how FAISS does this. First of all, FAISS is designed to work with embeddings. Secondly, FAISS gives an approximated score which might have a little, albeit very minor, deviation from the original distance. But since we are interested in only comparing them relatively, this works like a charm.
We followed this workflow to come up with a super fast semantic similarity based search engine. This has helped our clients build custom search engines for internal data and save a significant amount of time.
*FYI: We used Elasticsearch BM25 in our workflow however, TF-IDF can also be employed as an equally viable alternative with similar characteristics.