Keyword Search with OpenSearch
I’m making use of Amazon’s OpenSearch. I wanted to search for several keywords, each with a different weight, and then add some additional search customizations. The following code should help with that. For more details, see https://opensearch.org/docs/latest/opensearch/query-dsl/bool/.
Query
search_query = {
    'from': 0,
    'size': 10,
    'query': {
        'bool': {
            'should': [
                {'match_phrase': {'keywords': {'query': 'test', 'boost': 2}}},
                {'match_phrase': {'keywords': {'query': 'test2', 'boost': 1}}}
            ],
            'must': [
                {'match': {'team_id': 'me'}},
                {'match': {'clipType': 'Video'}}
            ]
        }
    },
    '_source': ['id', 'clipType', 'name', 'keywords', 'transcript', 'video_title', 'umap_embedding', 'team_name', 'team_id']
}
This query runs a boolean search with two groups of clauses, a group of ‘should’s and a group of ‘must’s.
- The ‘should’ part is a list of keywords, each matched as a phrase (e.g., ‘machine learning’). Each keyword is given its own boost; here ‘test’ is boosted 2x relative to ‘test2’. Because the query also has a ‘must’ part, none of these keywords has to appear: documents that match no keyword are still returned, just with a lower score.
- The ‘must’ part lists conditions that every result has to match.
The ‘_source’ entry lists the fields that are returned for each result.
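If at least one keyword should be required, the bool query accepts a `minimum_should_match` parameter. The following is a minimal sketch and not part of the original code: the helper name `require_keyword_match` is hypothetical, and it assumes the query has the same shape as the one above.

```python
import copy


def require_keyword_match(search_query, minimum=1):
    """Return a copy of the query in which at least `minimum` 'should' clauses must match.

    Hypothetical helper: assumes search_query['query']['bool'] exists.
    """
    query = copy.deepcopy(search_query)  # leave the caller's dict untouched
    bool_clause = query['query']['bool']
    if bool_clause.get('should'):
        bool_clause['minimum_should_match'] = minimum
    return query


example_query = {
    'query': {
        'bool': {
            'should': [
                {'match_phrase': {'keywords': {'query': 'test', 'boost': 2}}},
                {'match_phrase': {'keywords': {'query': 'test2', 'boost': 1}}},
            ],
            'must': [{'match': {'team_id': 'me'}}],
        }
    }
}

strict_query = require_keyword_match(example_query)
```

With this setting, documents that satisfy the ‘must’ clauses but contain neither keyword are dropped instead of being returned with a low score.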
Python Code Version
I used a slightly more involved formulation when I wrote this in Python, building the ‘should’ clauses from a list of keywords and weights. Here it is.
import json
import os

# USERNAME, PASSWORD, HOST_URL and INDEX_NAME are assumed to be defined elsewhere
keywords = ['business', 'machine learning', 'data']
weights = [1, 2, 4]  # amount to boost each keyword
n = 10  # number of results to return (assumed value, matching the query above)

# Keyword clauses: one boosted match_phrase per keyword
boo_type = 'should'
boo_str = [
    {'match_phrase': {'keywords': {'query': kwrd, 'boost': weight}}}
    for kwrd, weight in zip(keywords, weights)
]

# Compile the query
search_query = {
    'from': 0,
    'size': n,
    'query': {
        'bool': {
            boo_type: boo_str
        }
    },
    '_source': ['id', 'clipType', 'name', 'keywords', 'transcript', 'video_title', 'umap_embedding', 'team_name', 'team_id']
}

# Adding additional search parameters
team_id = '00101012'
clip_type = 'Topic'
must_filters = [
    {'match': {'team_id': team_id}},
    {'match': {'clipType': clip_type}}
]
if must_filters:
    search_query['query']['bool']['must'] = must_filters

# Process: serialize the query to JSON for the request body
search_query = json.dumps(search_query)

# Run (note: this shell string breaks if the JSON contains a single quote)
cmd = f"curl -X GET -u '{USERNAME}:{PASSWORD}' '{HOST_URL}/{INDEX_NAME}/_search' -H 'Content-Type: application/json' -d '{search_query}'"
ret = json.loads(os.popen(cmd).read())
search_results = [x['_source'] for x in ret['hits']['hits']]
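The curl command above is assembled as a single shell string, so a keyword containing a single quote would break the quoting. One way around this, sketched below under the assumption that curl is still the transport, is to pass the arguments as a list to `subprocess.run`, which bypasses the shell. The function names and the example host, index, and credentials are hypothetical placeholders, and the `-s` flag is added only to suppress curl's progress output.

```python
import json
import subprocess


def build_curl_args(host_url, index_name, username, password, search_query):
    """Build the curl argument list for a _search request (no shell involved)."""
    return [
        'curl', '-s', '-X', 'GET',
        '-u', f'{username}:{password}',
        f'{host_url}/{index_name}/_search',
        '-H', 'Content-Type: application/json',
        '-d', json.dumps(search_query),
    ]


def run_search(host_url, index_name, username, password, search_query):
    """Run the search and return the '_source' of every hit."""
    args = build_curl_args(host_url, index_name, username, password, search_query)
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    ret = json.loads(completed.stdout)
    return [hit['_source'] for hit in ret['hits']['hits']]


# Placeholder values for illustration only; nothing is sent here.
example_query = {
    'query': {'bool': {'should': [
        {'match_phrase': {'keywords': {'query': "don't panic", 'boost': 2}}}
    ]}}
}
args = build_curl_args('https://localhost:9200', 'clips', 'user', 'pass', example_query)
```

Because each list element reaches curl as a separate argument, the JSON body needs no shell escaping, and `check=True` raises an error if curl exits with a non-zero status instead of silently returning an empty response.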