
perf bug: Inserting data is O(num versions) #2318

Open
wjones127 opened this issue May 9, 2024 · 1 comment
@wjones127
Contributor

It appears the time to write data scales linearly with the number of versions. This is not great. On my local computer, it starts off at 10 ms and after a few thousand versions becomes 30 ms. For a higher-latency store, I bet this is more dramatic. One user reported latency of 1.5 sec after 8k versions.

My best guess is that this is because, to load the latest version, we list all files in the versions directory. We might have to implement the first part of #1362 to fix this.
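
For intuition, here is a minimal sketch of why a listing-based lookup of the latest manifest scales with the version count. This is not the actual Lance implementation; the directory layout and file naming below are assumptions for illustration only.

```python
import os

def find_latest_manifest(versions_dir: str) -> str:
    """Naive latest-version lookup: list *every* manifest and take the max.

    Each commit adds one more file to the directory, so this scan -- and
    therefore every write that needs the latest version first -- slows down
    as versions accumulate: O(num versions) per write.
    """
    # Assumes files are named "<version>.manifest"; purely illustrative.
    manifests = [f for f in os.listdir(versions_dir) if f.endswith(".manifest")]
    latest = max(manifests, key=lambda name: int(name.split(".")[0]))
    return os.path.join(versions_dir, latest)
```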

Reproduce this

```python
from datetime import timedelta
import time
import pyarrow as pa
import lance

data = pa.table({'a': pa.array([1])})

# Uncomment this part to reset and see that once we delete versions,
# the latency goes back down.
# ds = lance.dataset("test_data")
# ds.cleanup_old_versions(older_than=timedelta(seconds=1), delete_unverified=True)

for i in range(10000):
    start = time.monotonic()
    # Use overwrite to eliminate the possibility that it is O(num files)
    lance.write_dataset(data, 'test_data', mode='overwrite')
    print(time.monotonic() - start)
```

wjones127 self-assigned this May 15, 2024
wjones127 added a commit that referenced this issue May 29, 2024
Fixes #2338
Partially addresses #2318

**For a dataset on local SSD with 8,000 versions, we get 6x faster load
time and 3x faster append.**

* Added special code path for local filesystem for finding latest
manifest. This path skips the `metadata` call for paths that aren't
relevant, both fixing #2338 and improving performance on local
filesystems overall.
* Fixed code path where we were reading the manifest file twice
* Changed `CloudObjectReader` and `LocalFileReader` to both cache the
file size, so we aren't making multiple calls to get the size of the
same object/file. Also allowed passing the size when opening, in case we
already have it from a list operation (see the sketch after this list).
* Deprecated some more methods for loading a dataset, in favor of using
`DatasetBuilder`. Also consolidated the implementations to use
`DatasetBuilder`, so we have fewer code paths to worry about and test.
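
The file-size caching described above might look roughly like the following Python sketch. It illustrates the idea only; it is not the actual Rust `CloudObjectReader`/`LocalFileReader` code, and the class and method names are hypothetical.

```python
import os
from typing import Optional

class CachedSizeReader:
    """Illustrative reader that fetches the file size at most once.

    If the caller already knows the size (e.g. from a prior list operation),
    it can pass it at open time and no stat/HEAD call is made at all.
    """

    def __init__(self, path: str, size: Optional[int] = None):
        self.path = path
        self._size = size  # may be pre-populated from a listing

    def size(self) -> int:
        if self._size is None:
            # One stat call, then cached for the lifetime of the reader.
            self._size = os.path.getsize(self.path)
        return self._size

    def read_tail(self, n: int) -> bytes:
        # Reading a footer needs the size; with the cache we never re-stat.
        with open(self.path, "rb") as f:
            f.seek(max(self.size() - n, 0))
            return f.read(n)
```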

## TODO

* [x] Cleanup
* [x] Add IO unit test for loading a dataset
* [x] Check repro from 2318

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
eddyxu pushed a commit that referenced this issue May 29, 2024
@wjones127
Contributor Author

This should be substantially mitigated by #2396. However, there is still an optimization available for stores that support list start-after (GCS, S3). That optimization won't help other stores, such as local file systems or Azure, so it's unclear whether it is worthwhile. It may be more worthwhile to invest in auto-cleanup so users don't accumulate so many versions in the first place.
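
As a rough illustration of the start-after idea: if we already know that some version N exists, we can ask the store to list only keys that sort after N's manifest, so the request cost scales with the number of new versions rather than the total version count. The sketch below uses boto3; the bucket, prefix, and zero-padded manifest naming are assumptions for illustration, not Lance's actual layout.

```python
import boto3  # assumes AWS credentials are configured

def find_manifests_after(bucket: str, prefix: str, last_known_version: int) -> list[str]:
    """List only manifests newer than a version we already know about.

    Assumes manifests are named with zero-padded version numbers
    (e.g. "_versions/00000042.manifest") so lexicographic order matches
    version order -- purely illustrative.
    """
    s3 = boto3.client("s3")
    start_after = f"{prefix}{last_known_version:08d}.manifest"
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, StartAfter=start_after)
    # Only keys sorting after the known manifest come back, so the call
    # scales with the number of new versions, not all versions ever written.
    return [obj["Key"] for obj in resp.get("Contents", [])]
```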
