is there any way to speed up deletion #544

Open
fenchu opened this issue Oct 18, 2023 · 1 comment
fenchu commented Oct 18, 2023

My TinyDB JSON file grows by 5 GB per week if I do not delete anything.

Currently we just load the TinyDB data.json and delete all internal ids below a given threshold.

But the major problem is that we need to close the TinyDB handle to do this, and that does not work well in a multiprocessing asyncio FastAPI app.

I would like to keep at most 1000 entries in the table and delete everything below the 1000 highest.

Any guidelines on how to do this while keeping the app running would be great.

A suggestion I got was to add a timestamp (epoch) to each entry and delete any entries whose timestamps are below the 1000 highest, but that bloats the table and adds extra logic.
Thanks
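
One way to get "keep the 1000 highest" without an extra timestamp field might be to lean on TinyDB's own doc_ids, which increase with insertion order. This is only a minimal sketch, not something from the issue, assuming TinyDB 4.x where Table.remove() accepts a doc_ids argument; the function name and 'data.json' path are placeholders:

# A sketch: rely on TinyDB's incrementing doc_ids instead of a timestamp field.
from tinydb import TinyDB

def prune_to_newest(db: TinyDB, maxlen: int = 1000) -> list:
    """Remove every document except the maxlen ones with the highest doc_ids."""
    doc_ids = sorted(doc.doc_id for doc in db.all())
    stale = doc_ids[:-maxlen]            # everything below the maxlen highest
    if not stale:
        return []
    return db.remove(doc_ids=stale)      # one batched call instead of one query per document

db = TinyDB('data.json')                 # placeholder path
removed = prune_to_newest(db, maxlen=1000)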

fenchu changed the title from "Feature request: get all the internalids and delete by internalid" to "is there any way to speed up deletion" on Oct 18, 2023

fenchu commented Oct 18, 2023

This can be obtained using db.max(), but it is slow

from typing import List, Optional
from tinydb import TinyDB, where

# db, db_path and log are module-level globals in the app
def keep_newest(key: str = 'jobid', maxlen: int = 1000) -> Optional[List]:
    """ keep the newest maxlen entries in database """
    global db
    if not db:
        db = TinyDB(db_path)
    currlen = len(db.all())
    if currlen <= maxlen:
        #log.warning(f"database size is:{currlen} which is less than {maxlen} - no deletion")
        return None
    ids = []
    # note: the "+ 1" leaves maxlen - 1 entries behind (999 in the run below)
    for d in db.all()[:currlen - maxlen + 1]:
        removed = db.remove(where(key) == d[key])   # one full query and file rewrite per document
        if removed:
            ids.append(removed)
        #log.info(f"removed {d} with index {removed}")
    return ids

number of entries in database: 10000
number of entries in database: 999
deleting 9001 took 450.83sec
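
Most of that time is likely the default JSONStorage rewriting the whole file on every remove() call, plus a full table scan for each where() query. A sketch, not from the issue, of wrapping the storage in CachingMiddleware so the writes stay in memory and hit disk once:

# A sketch: buffer writes in memory so 9000 remove() calls do not each rewrite the file on disk.
from tinydb import TinyDB
from tinydb.storages import JSONStorage
from tinydb.middlewares import CachingMiddleware

db = TinyDB('data.json', storage=CachingMiddleware(JSONStorage))   # placeholder path

# ... run the deletion loop from keep_newest() here ...

db.storage.flush()   # push the cached table back to disk in a single write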

The direct JSON version is way faster, roughly 1875 times faster:

import json
from typing import List, Optional

# log is a logging.Logger global in the app
def keep_newest_json(fname: str, maxlen: int = 1000, table: str = '_default') -> Optional[List]:
    """ keep the newest maxlen entries in database by rewriting the JSON file directly """
    dat = None
    with open(fname, 'r', encoding='utf8') as FR:
        dat = json.load(FR)
    if table not in dat:
        log.fatal(f"table:{table} not found in dat:{list(dat.keys())}")
        return None
    currlen = len(dat[table].keys())
    if currlen <= maxlen:
        log.info(f"table:{table} has {currlen} entries, less than maxlen:{maxlen}")
        return None
    ids = []
    # note: the "+ 1" leaves maxlen - 1 entries behind (999 in the run below)
    for id in list(dat[table].keys())[:currlen - maxlen + 1]:
        del dat[table][id]
        ids.append(id)
        #log.info(f"removed index {id} from {table}")
    with open(fname, 'w', encoding='utf8') as FW:
        FW.write(json.dumps(dat, indent=2, sort_keys=True))
    return ids

number of entries in database: 10000
number of entries in database: 999
deleting 9001 took 0.24sec
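
If the goal is to avoid closing the handle while the FastAPI app is running, another hedged option is to rewrite the table through the open TinyDB object itself. A sketch assuming TinyDB 4.x (where truncate() and insert_multiple() exist); note that doc_ids are reassigned:

# A sketch: drop everything except the newest maxlen documents without closing the db handle.
from tinydb import TinyDB

def compact_keep_newest(db: TinyDB, maxlen: int = 1000) -> int:
    """Keep only the newest maxlen documents; return how many were removed."""
    docs = sorted(db.all(), key=lambda d: d.doc_id)      # oldest first
    if len(docs) <= maxlen:
        return 0
    newest = [dict(d) for d in docs[-maxlen:]]           # plain dicts, doc_ids get reassigned
    db.truncate()                                        # clear the table in one write
    db.insert_multiple(newest)                           # re-insert the newest entries
    return len(docs) - maxlen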
