Embeddings with Elastic Search
Goal: I want to be able to search through some documents (videos) using Elastic Search’s knn feature with embeddings.
Problem: I’m getting subpar results. The results tend to be very similar no matter what I search for. This seems to be partly because embeddings of 0 dominate the results.
Here is a quick outline of how I build the embeddings, mostly using gensim:
- Preprocess the text
- Create a dictionary of all the unique words in the corpus
- Get the bag of words for each document
- Calculate the TF-IDF weights
- Use LSI (a truncated SVD, similar in spirit to PCA) to reduce the dimensionality to 500
Here are some questions I’ve asked to get to the bottom of this:
- Is this an issue with embeddings and search?
No. My normal approach is to take a piece of text, embed it, and then find the best match in my corpus using the Elastic Search knn feature. To rule out the embeddings themselves, I did the matching step manually instead: I stored the embeddings for each document and searched for the best match between my input embedding and the corpus embeddings myself. When I did this, I got good results.
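For reference, a manual search of this kind can be sketched as a brute-force comparison with numpy. The notes above don't say which similarity measure was used, so cosine similarity is an assumption here, and `cosine_top_k` is a hypothetical name.

```python
import numpy as np


def cosine_top_k(query, corpus, k=5):
    """Return the k (index, score) pairs most similar to `query` by cosine similarity."""
    corpus = np.asarray(corpus, dtype=float)
    query = np.asarray(query, dtype=float)
    # Product of norms; it is zero wherever a vector is all zeros
    denom = np.linalg.norm(corpus, axis=1) * np.linalg.norm(query)
    # All-zero vectors have no direction, so rank them last instead of dividing by zero
    scores = np.full(len(corpus), -np.inf)
    valid = denom > 0
    scores[valid] = corpus[valid] @ query / denom[valid]
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]
```

Pushing all-zero vectors to the bottom is a design choice made for this sketch; it makes it easy to compare against a search backend that may treat them differently.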
- Might there be an issue with the embeddings stored in Elastic Search?
I checked, and the embeddings stored in Elastic Search match the ones I have locally.
- What about trying a different version of Elastic Search?
3a. How about a newer version on Amazon?
Right now I’m using an older version of Elastic Search on Amazon to be consistent with what the dev team uses. At one point I did run Elastic Search on my own machine and it seemed to work fine. So maybe trying a more recent version would help?
With a newer version, I got different results but the same issues: the results weren’t very relevant, and different searches led to similar results.
3b. How about a local version of Elastic Search?
I tried to install Elastic Search on my machine but found that the open-source version I installed via Homebrew does not come with the knn plugin.
So I installed the OpenSearch Docker image instead, which worked. However, this too gave me poor recall.
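For anyone reproducing the local setup, the request bodies for the OpenSearch k-NN plugin look roughly like this. It is only a sketch: the field name `embedding` and the helper names are assumptions, and actually sending the bodies needs a client or plain HTTP calls, which are not shown.

```python
def knn_index_body(dimension=500, field="embedding"):
    """Index settings and mapping for a knn_vector field (field name is hypothetical)."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {field: {"type": "knn_vector", "dimension": dimension}}
        },
    }


def knn_query_body(vector, k=10, field="embedding"):
    """Approximate k-NN query asking for the k nearest neighbours of `vector`."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": [float(x) for x in vector], "k": k}}},
    }
```

Running the same query vector through this and through the manual search makes it easier to see where the two rankings start to disagree.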
- Do you get the exact match if you give a particular document’s embedding?
- Can you try a simpler dataset and see how the search works?
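Both of these checks can be sketched together: query with each document's own embedding on a small synthetic dataset and count how often the document comes back as the top hit. `self_match_rate` and `make_brute_force_search` are hypothetical helpers; `search_fn` stands in for whatever backend is being tested (for example, a wrapper around the knn query), and the brute-force version here is only a baseline.

```python
import numpy as np


def self_match_rate(embeddings, search_fn, k=1):
    """Fraction of documents returned among the top k when queried with their own vector."""
    embeddings = np.asarray(embeddings, dtype=float)
    hits = 0
    tested = 0
    for doc_id, vec in enumerate(embeddings):
        if not np.any(vec):
            continue  # skip all-zero vectors; they cannot be matched meaningfully
        tested += 1
        if doc_id in search_fn(vec, k):
            hits += 1
    return hits / tested if tested else 0.0


def make_brute_force_search(embeddings):
    """Baseline search_fn: exact nearest neighbours by Euclidean distance."""
    embeddings = np.asarray(embeddings, dtype=float)

    def search(vec, k):
        dists = np.linalg.norm(embeddings - vec, axis=1)
        return [int(i) for i in np.argsort(dists)[:k]]

    return search


# A simple synthetic dataset: 100 random 16-dimensional vectors
rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 16))
rate = self_match_rate(toy, make_brute_force_search(toy))
```

With the exact baseline every document should find itself, so a noticeably lower rate from the real search backend on the same toy data would point at the index or query configuration and not at the embeddings.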