IDF calculation issue #2132

Open
4 of 8 tasks
sanikolaev opened this issue May 4, 2024 · 0 comments


Proposal:

Moved from https://github.com/manticoresoftware/dev/issues/371

Currently, idf is miscalculated for indexes with multiple chunks, e.g.:

drop table if exists t;  
create table t(f text);  
insert into t values(0,'abc'),(0,'def');  
flush ramchunk t;  
insert into t values(0,'abc'),(0,'def');  
flush ramchunk t;  
select *,weight(),packedfactors() from t where match ('abc') option ranker=expr('bm25'); 

idf=0.43067655

The idf calculated here is 0.43067655, while the correct value, calculated manually, is 0.12596481.
We can get this expected value by optimizing the index:

drop table if exists t;  
create table t(f text);  
insert into t values(0,'abc'),(0,'def');  
flush ramchunk t;  
insert into t values(0,'abc'),(0,'def');  
flush ramchunk t;  
optimize index t option sync=1, cutoff=1;  
select *,weight(),packedfactors() from t where match ('abc') option ranker=expr('bm25'); 

idf=0.12596481

The probable cause is that the count of documents containing the search term is retrieved from a single chunk only, rather than summed across all chunks.
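
A quick arithmetic check supports this. The sketch below assumes the normalized BM25-style form idf = ln((N - n + 1) / n) / (2 ln(N + 1)); this is an assumption, not confirmed against the ranker source, chosen because it reproduces both values reported above. `normalized_idf` is a hypothetical helper name. With N = 4 documents in total, n = 2 (the term counted in both chunks) gives the expected value, while n = 1 (the term counted in one chunk only) gives the reported one.

```python
import math


def normalized_idf(total_docs: int, docs_with_term: int) -> float:
    """Assumed normalized IDF form: ln((N - n + 1) / n) / (2 * ln(N + 1))."""
    return math.log((total_docs - docs_with_term + 1) / docs_with_term) / (
        2 * math.log(total_docs + 1)
    )


# The example table holds 4 documents overall; 'abc' occurs in 2 of them,
# one in each chunk.
whole_index = normalized_idf(4, 2)   # term count summed over both chunks
single_chunk = normalized_idf(4, 1)  # term count taken from one chunk only

print(f"whole index:  {whole_index:.8f}")
print(f"single chunk: {single_chunk:.8f}")
```

The two results agree with 0.12596481 and 0.43067655 to about six decimal places. That is consistent with the total document count being index-wide while the per-term document count comes from a single chunk.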

Also, the global_idf option for CREATE TABLE doesn't appear to work with RT indexes. It isn't displayed in the index settings after the table has been created, and if we create a global idf file and set global_idf=1 when searching, all idf values become 0.

Discussion

➤ Stan commented:

We already have the local_df query option, which should work for local indexes of a distributed index; however, it could also work for disk chunks of an RT index.

Could you check that?

➤ Nick Sergeev commented:

I've just checked it with the previous example:

select *,weight(),packedfactors() from t where match ('abc') option ranker=expr('bm25'), local_df='0';

select *,weight(),packedfactors() from t where match ('abc') option ranker=expr('bm25'), local_df='1';

but this had no effect on the idf; it stayed the same.

➤ Sergey Nikolaev commented:

I think local_df=1 should be supported implicitly by RT indexes without even the need to specify it.
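
In effect, this would mean summing the per-chunk document frequencies before computing idf. A minimal sketch of that aggregation for the example table, assuming the normalized form ln((N - n + 1) / n) / (2 ln(N + 1)); the `chunks` tuples and `index_wide_idf` are illustrative names, not Manticore internals:

```python
import math

# Hypothetical per-chunk statistics for the example table:
# (documents in chunk, documents in chunk matching 'abc').
chunks = [(2, 1), (2, 1)]


def index_wide_idf(chunk_stats):
    """Sum document counts and term frequencies over all chunks,
    then apply the assumed normalized IDF form."""
    total_docs = sum(docs for docs, _ in chunk_stats)
    docs_with_term = sum(df for _, df in chunk_stats)
    return math.log((total_docs - docs_with_term + 1) / docs_with_term) / (
        2 * math.log(total_docs + 1)
    )


idf = index_wide_idf(chunks)
print(f"{idf:.8f}")
```

With both chunks aggregated, the result matches the idf seen after `optimize index` to about six decimal places.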

Checklist:

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

  • Task estimated
  • Specification created, reviewed and approved
  • Implementation completed
  • Tests developed
  • Documentation updated
  • Documentation proofread
  • Changelog updated
  • OpenAPI YAML updated and issue created to rebuild clients