Keyword Search with OpenSearch

2022-02-24

I’m making using of Amazon’s OpenSearch. I wanted to search for different keywords with different weights. And then I wanted to add some additional search customizations. The following code should help with that. For more details, you can see https://opensearch.org/docs/latest/opensearch/query-dsl/bool/.

Query

search_query = {
    "from" : 0, "size" : 10,
    'query': {
        "bool": {
            'should': [
              {
                'match_phrase': {
                  'keywords': {
                    'query': 'test', 'boost': 2
                  }
                }
              }, 
              {
                'match_phrase': {
                  'keywords': {
                    'query': 'test2', 'boost': 1
                  }
                }
              }
            ], 
            'must': [
              {
                'match': {'team_id': 'me'}
              }, 
              {
                'match': {'clipType': 'Video'}
              }
            ]
        }, 
    },
    '_source': ['id', 'clipType', 'name', 'keywords', "transcript", 'video_title', 'umap_embedding', 'team_name', 'team_id']
}

This query will do a boolean search, matching for two elements, a group of ‘should’s and a group of ‘must’s. - The ‘should’ part are a bunch of keywords that are matched as a phrase (e.g., ‘machine learning’). Each keyword is given a different boost. The word ‘test’ is boosted 2x compared to the word ‘test2’. If none of these keywords appear, then that is fine and we will still get results. - The ‘must’ part are terms that must appear in the result. The ‘_source’ indicates the elements that must appear in the result.

Python Code Version

I had a more complicated formulation when I wrote this in python code. Here it is.

keywords = ['business', 'machine learning', 'data']
weights = [1, 2, 4] # amount to boost each keywords

# Keyword 
boo_type = 'should'
boo_str = [ { 'match_phrase': { 'keywords': { 'query': kwrd, "boost": weights[i]  } } } for i,kwrd in enumerate(keywords) ]

# Compile the query
search_query = {
    "from" : 0, "size" : n,
    'query': {
        "bool": {
            boo_type: boo_str
        }
    },
    '_source': ['id', 'clipType', 'name', 'keywords', "transcript", 'video_title', 'umap_embedding', 'team_name', 'team_id']
}

# Adding additional search parameters
team_id = '00101012'
clip_type = 'Topic'
must_files = [ 
  {'match': {'team_id': team_id} }, 
  {'match': {'clipType': clip_type} }
]
if must_filters:
    search_query['query']['bool']['must'] = must_filters

# Process
search_query = json.dumps(search_query)

# Run
cmd = f"curl -X GET -u '{USERNAME}:{PASSWORD}' '{HOST_URL}/{INDEX_NAME}/_search' -H 'Content-Type: application/json' -d '{search_query}'"
ret = json.loads(os.popen(cmd).read())

search_results = [ x['_source'] for x in ret['hits']['hits'] ]