Query().apply_custom_filter method needs to be improved #272

Open
virunew opened this issue Jan 13, 2024 · 4 comments

Comments

@virunew
Contributor

virunew commented Jan 13, 2024

Query().apply_custom_filter does a simple exact-match search for custom filters. This does not serve the purpose most of the time (at least for me). I need to filter using MongoDB operators like $regex, $ne, $gte etc., which this method ignores. The method is used by the semantic_query method, so a filter that contains operators ends up filtering everything out. If we are filtering the text results from a semantic query, we should be able to support MongoDB operators like $regex, $eq etc. We could use a library like pymongoquery to support MongoDB-operator-style filtering of our results without making a trip to the MongoDB database.
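To illustrate the problem, here is a minimal sketch, assuming the current filter compares each filter value to the result field with plain equality. The function name `exact_match_filter`, the field name `file_source` and the sample results are hypothetical stand-ins, not the library's actual code.

```python
def exact_match_filter(results, custom_filter):
    """Keep results whose fields equal the filter values exactly (assumed behavior)."""
    return [
        r for r in results
        if all(r.get(key) == value for key, value in custom_filter.items())
    ]

# Hypothetical results, shaped like the dicts a semantic query might return.
results = [
    {"text": "annual report", "file_source": "report_2024.pdf"},
    {"text": "meeting notes", "file_source": "notes_2023.txt"},
]

# A plain value works as expected: one result is kept.
plain = exact_match_filter(results, {"file_source": "report_2024.pdf"})

# An operator dict is compared literally against the string, so nothing matches.
with_operator = exact_match_filter(results, {"file_source": {"$regex": "2024"}})
```

With equality-only comparison, the operator dict `{"$regex": "2024"}` can never equal a string field, which is why every result is dropped.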

@shneeba
Contributor

shneeba commented Jan 14, 2024

does simple exact search for custom filters

I've found this an issue as well. One way I was thinking of this is if there could be a way to do the semantic search against the file name as well as the text block (user defined, text block only, file name only, both). This would involve embedding the filename and storing this in milvus (or your db of choice) as well. We already store the blockID in there for the text. I need to dig a bit deeper into the embedding process as I don't think it's a small change (🤞 it could be).

This could keep it in line with doing things with natural language and prevent having to configure your regex strings as well.

@virunew
Contributor Author

virunew commented Jan 15, 2024

I just wrote this piece of code; it can match MongoDB operators such as $eq, $ne and $regex. It uses the mongoquery package (https://pypi.org/project/mongoquery/) to do the matching. mongoquery has some limitations with regex, but these can be worked around by compiling the regex pattern first.

    def apply_filter_including_mongodb_operators(self, data, filter_dict):

        # filter_dict is a dict with any number of key:value pairs - every pair must match,
        #   so the pairs are combined like an implicit "$and"
        # mongoquery matches plain values as well as mongodb operators
        import re
        import mongoquery

        query_list = []
        for field, expression in filter_dict.items():
            if isinstance(expression, dict) and "$regex" in expression:
                # compile the pattern (case-insensitive) and keep any other operators
                #   given for the same field
                compiled_pattern = re.compile(expression["$regex"], re.IGNORECASE)
                query_list.append(mongoquery.Query({field: {**expression, "$regex": compiled_pattern}}))
            else:
                # plain value or any other mongo operator: pass it through unchanged
                query_list.append(mongoquery.Query({field: expression}))
        # data passes only if every per-field query matches
        return all(query.match(data) for query in query_list)

    #modified
    def apply_custom_filter(self, results, custom_filter):
        filtered_results = []
        for result in results:
            if self.apply_filter_including_mongodb_operators(result, custom_filter):
                filtered_results.append(result)
        return filtered_results

I tested the apply_filter_including_mongodb_operators() method with the following data:

    filter_dict = {
        "special_field1": {"$ne": "VIRENDRA"},
        "special_field2": {"$regex": "^.*2024$"},
        "content_type": "text"
    }

    data = {
        "special_field1": "virendra",
        "special_field2": "year 2024",
        "content_type": "text"
    }

So this function can match:

  1. strings, e.g. 'text' in my example
  2. regex, e.g. second element of my filter_dict
  3. mongodb operators ( e.g. $ne as in the first element of filter_dict)
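For anyone who wants to try the idea without installing mongoquery, the three cases above can be sketched with the standard library alone. This is only an illustration under assumptions: the function name `matches` is hypothetical, it covers a small subset of operators ($eq, $ne, $gt, $gte, $lt, $lte, $regex), and unlike the code above its regex match is case-sensitive.

```python
import operator
import re

# Comparison operators handled by this sketch (a small subset of MongoDB's).
_COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}

def matches(data, filter_dict):
    """Return True if data satisfies every field condition in filter_dict."""
    for field, expression in filter_dict.items():
        value = data.get(field)
        if not isinstance(expression, dict):
            # Plain value: exact match.
            if value != expression:
                return False
            continue
        for op, operand in expression.items():
            if op == "$regex":
                if not isinstance(value, str) or not re.search(operand, value):
                    return False
            elif op in _COMPARATORS:
                try:
                    if not _COMPARATORS[op](value, operand):
                        return False
                except TypeError:
                    # Incomparable types (e.g. None < 5) count as no match.
                    return False
            else:
                raise ValueError(f"unsupported operator: {op}")
    return True

# Same test data as in the comment above.
filter_dict = {
    "special_field1": {"$ne": "VIRENDRA"},
    "special_field2": {"$regex": "^.*2024$"},
    "content_type": "text",
}
data = {
    "special_field1": "virendra",
    "special_field2": "year 2024",
    "content_type": "text",
}
result = matches(data, filter_dict)
```

The trade-off is coverage: a dedicated package handles many more operators and nested documents, while a hand-rolled matcher like this avoids the extra dependency.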

@virunew
Contributor Author

virunew commented Jan 15, 2024

does simple exact search for custom filters

I've found this an issue as well. One way I was thinking of this is if there could be a way to do the semantic search against the file name as well as the text block (user defined, text block only, file name only, both). This would involve embedding the filename and storing this in milvus (or your db of choice) as well. We already store the blockID in there for the text. I need to dig a bit deeper into the embedding process as I don't think it's a small change (🤞 it could be).

This could keep it in line with doing things with natural language and prevent having to configure your regex strings as well.

I think this would be really useful. I would even go further and make it flexible, so that we can search in any number of fields as the user wishes. I can already think of use cases for it, e.g. if you are searching time-sensitive documents (say, news), you may want to store the date/month/year of the news item in one of the custom fields, and then also search on that custom field and sort by it.
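A rough sketch of the news use case, assuming each result carries a custom field holding an ISO date string. The field name `published`, the helper `filter_and_sort_by_date` and the sample data are all made up for illustration.

```python
from datetime import date

# Hypothetical semantic-query results with a custom "published" field.
results = [
    {"text": "Markets rally", "published": "2024-01-10"},
    {"text": "Old headline", "published": "2023-11-02"},
    {"text": "Rates held steady", "published": "2024-01-03"},
]

def filter_and_sort_by_date(results, field, start, end):
    """Keep results whose date field falls in [start, end], newest first."""
    # Parse the date once per result; results without the field are skipped.
    dated = [(date.fromisoformat(r[field]), r) for r in results if field in r]
    in_range = [(d, r) for d, r in dated if start <= d <= end]
    in_range.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in in_range]

january = filter_and_sort_by_date(
    results, "published", date(2024, 1, 1), date(2024, 1, 31)
)
```

Here `january` holds the two January 2024 items with the most recent one first.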

@shneeba
Contributor

shneeba commented Jan 15, 2024

apply_filter_including_mongodb_operators() method

This is really cool and really expands the custom filter use, thanks for sharing, I'll have a play around!

Yep fully agree. I think it's got some real legs!
