Embeddings with Elasticsearch

2022-02-17

Goal: I want to search through a set of documents (videos) by embedding, using Elasticsearch’s k-NN feature.

Problem: The results are subpar. They tend to be very similar no matter what I search for. This seems to be partly because all-zero embeddings dominate the results.

Briefly, here is how I build the embeddings, mostly using gensim:

  1. Preprocess the text
  2. Create a dictionary of all the unique words in the corpus
  3. Get the bag of words for each document
  4. Calculate the TF-IDF weights
  5. Use LSI (a truncated SVD, similar in spirit to PCA) to reduce the dimensionality to 500

Here are the questions I’ve asked to get to the bottom of this:

  1. Is this an issue with embeddings and search?

No. My normal approach is to take a piece of text, embed it, and find the closest matches in my corpus using Elasticsearch’s k-NN feature. To test the embeddings on their own, I did the matching manually instead: I stored the embedding for each document and searched by brute force for the best matches between my input embedding and the corpus embeddings. When I did this, I got good results.
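
The manual search can be sketched as a brute-force scan over all stored embeddings. This is a minimal sketch under some assumptions: cosine similarity as the metric, a plain dict mapping document ids to vectors, and hypothetical names (`cosine_similarity`, `brute_force_search`) and toy 3-dimensional data.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Cosine similarity is undefined for a zero vector; rank it last.
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_search(query, corpus, k=3):
    """Return the k (doc_id, score) pairs most similar to the query vector."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional embeddings keyed by document id.
corpus = {
    "video-a": [1.0, 0.0, 0.0],
    "video-b": [0.9, 0.1, 0.0],
    "video-c": [0.0, 1.0, 0.0],
    "video-d": [0.0, 0.0, 0.0],  # an all-zero embedding
}

results = brute_force_search([1.0, 0.0, 0.0], corpus, k=2)
```

Because every document is scored, this gives the exact nearest neighbours under the chosen metric, which makes it a useful baseline to compare the index's results against.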

  2. Could there be an issue with the embeddings stored in Elasticsearch?

No. I checked, and the stored embeddings match the ones I have locally.
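
A check like this can be sketched as a comparison with a small tolerance, since vectors read back from an index may differ slightly from the local ones if they are stored at lower precision. The function name `find_mismatches`, the tolerances, and the sample data are assumptions for illustration; fetching the stored vectors from the index is left out.

```python
import math


def find_mismatches(local, stored, rel_tol=1e-5, abs_tol=1e-6):
    """Return ids whose stored vector is missing or differs from the local one."""
    mismatched = []
    for doc_id, local_vec in local.items():
        stored_vec = stored.get(doc_id)
        if stored_vec is None or len(stored_vec) != len(local_vec):
            mismatched.append(doc_id)
            continue
        same = all(
            math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
            for x, y in zip(local_vec, stored_vec)
        )
        if not same:
            mismatched.append(doc_id)
    return mismatched


# Hypothetical data: local embeddings vs. what was read back from the index.
local = {"video-a": [0.25, -0.5], "video-b": [0.1, 0.2], "video-c": [1.0, 1.0]}
stored = {"video-a": [0.2500001, -0.5], "video-b": [0.1, 0.9]}

mismatches = find_mismatches(local, stored)
```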

  3. What about trying a different version of Elasticsearch?

3a. How about a newer version on Amazon?

Right now I’m using an older version of Elasticsearch on Amazon, to be consistent with what the dev team uses. At one point I ran Elasticsearch on my own machine and it seemed to work fine, so maybe a more recent version would help?

With a newer version, I got different results but the same issues: the results weren’t very relevant, and different searches returned similar results.

3b. How about a local version of Elasticsearch?

I tried installing Elasticsearch on my machine, but found out that the open-source version I installed via Homebrew does not come with the k-NN plugin.

So I ran OpenSearch via Docker instead, which worked. However, this too gave me poor recall.
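
For reference, here is a sketch of the request bodies involved when using the OpenSearch k-NN plugin, written as plain Python dicts so they can be sent with any HTTP client. The field name `embedding` and the helper names are hypothetical, and the exact options may differ between versions, so treat this as an assumption to check against the documentation for the version in use. The second query uses the plugin's exact scoring script rather than the approximate index.

```python
import json

FIELD = "embedding"  # hypothetical field name
DIMENSION = 500      # matches the LSI dimensionality used in this post

# Index creation body: enable k-NN and declare the vector field.
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            FIELD: {"type": "knn_vector", "dimension": DIMENSION},
        }
    },
}


def approximate_knn_query(vector, k=10):
    """Search body for approximate k-NN over the vector field."""
    return {"size": k, "query": {"knn": {FIELD: {"vector": vector, "k": k}}}}


def exact_knn_query(vector, k=10, space_type="cosinesimil"):
    """Search body for exact k-NN via the plugin's scoring script."""
    return {
        "size": k,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": FIELD,
                        "query_value": vector,
                        "space_type": space_type,
                    },
                },
            }
        },
    }


query_vector = [0.0] * DIMENSION
approx_json = json.dumps(approximate_knn_query(query_vector, k=5))
exact_json = json.dumps(exact_knn_query(query_vector, k=5))
```

Running both queries with the same vector is one way to tell whether poor recall comes from the approximate index or from the distance metric: if the exact query also returns poor matches, the approximation is probably not the culprit.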

  4. Do you get the exact match back if you search with a particular document’s own embedding?

  5. Can you try a simpler dataset and see how the search behaves?
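
Both of these checks can be combined into one small experiment: build a simple synthetic dataset, search with each document's own embedding, and count how often the document comes back as the top hit. The sketch below is an assumption about how such a test could look. It uses a brute-force search as a stand-in, and the `search_top1` argument is a hypothetical hook where a function that queries the real index would go.

```python
import math
import random


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_top1(query, corpus):
    """Stand-in search: id of the document closest to the query."""
    return min(corpus, key=lambda doc_id: l2_distance(query, corpus[doc_id]))


def self_retrieval_rate(corpus, search_top1):
    """Fraction of documents that are their own top hit."""
    hits = sum(
        1 for doc_id, vec in corpus.items() if search_top1(vec, corpus) == doc_id
    )
    return hits / len(corpus)


# Simple synthetic dataset: 50 random 8-dimensional vectors (fixed seed).
rng = random.Random(0)
toy_corpus = {
    f"doc-{i}": [rng.gauss(0.0, 1.0) for _ in range(8)] for i in range(50)
}

rate = self_retrieval_rate(toy_corpus, brute_force_top1)
```

With an exact search over distinct vectors the rate is 1.0, so anything noticeably lower from the real index would point at the index or its configuration rather than at the embeddings.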