
perf bug: Inserting data is O(num versions) #2318

Open
wjones127 opened this issue May 9, 2024 · 1 comment
@wjones127
Contributor

It appears the time to write data scales linearly with the number of versions. This is not great. On my local computer, it starts off at 10 ms and after a few thousand versions becomes 30 ms. For a higher-latency store, I bet this is more dramatic. One user reported latency of 1.5 sec after 8k versions.

My best guess is that this is because, to load the latest version, we list all files in the versions directory. We might have to implement the first part of #1362 to fix this.
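
For intuition, here is a minimal sketch of why a listing-based lookup of the latest manifest scales with the version count. This is not the actual Lance implementation; the directory layout and file naming below are assumptions for illustration only.

```python
import os

def find_latest_manifest(versions_dir: str) -> str:
    """Naive latest-version lookup: list *every* manifest and take the max.

    Each commit adds one more file to the directory, so this scan -- and
    therefore every write that needs the latest version first -- slows down
    as versions accumulate: O(num versions) per write.
    """
    # Assumes files are named "<version>.manifest"; purely illustrative.
    manifests = [f for f in os.listdir(versions_dir) if f.endswith(".manifest")]
    latest = max(manifests, key=lambda name: int(name.split(".")[0]))
    return os.path.join(versions_dir, latest)
```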

Reproduce this

```python
from datetime import timedelta
import time
import pyarrow as pa
import lance

data = pa.table({'a': pa.array([1])})

# Uncomment this part to reset and see that once we delete versions,
# the latency goes back down.
# ds = lance.dataset("test_data")
# ds.cleanup_old_versions(older_than=timedelta(seconds=1), delete_unverified=True)

for i in range(10000):
    start = time.monotonic()
    # Use overwrite to eliminate the possibility that it is O(num files)
    lance.write_dataset(data, 'test_data', mode='overwrite')
    print(time.monotonic() - start)
```

wjones127 self-assigned this May 15, 2024
wjones127 added a commit that referenced this issue May 29, 2024
Fixes #2338
Partially addresses #2318

**For a dataset on local SSD with 8,000 versions, we get 6x faster load
time and 3x faster append.**

* Added special code path for local filesystem for finding latest
manifest. This path skips the `metadata` call for paths that aren't
relevant, both fixing #2338 and improving performance on local
filesystems overall.
* Fixed code path where we were reading the manifest file twice
* Changed `CloudObjectReader` and `LocalFileReader` to both cache the
file size, so we aren't making multiple calls to get the size of the
same object/file. Also allowed passing the size when opening, in case we
already have it from a list operation (see the sketch after this list).
* Deprecated some more methods for loading a dataset, in favor of using
`DatasetBuilder`. Also consolidated the implementations to use
`DatasetBuilder`, so we have fewer code paths to worry about and test.
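
The file-size caching described above might look roughly like the following Python sketch. It illustrates the idea only; it is not the actual Rust `CloudObjectReader`/`LocalFileReader` code, and the class and method names are hypothetical.

```python
import os
from typing import Optional

class CachedSizeReader:
    """Illustrative reader that fetches the file size at most once.

    If the caller already knows the size (e.g. from a prior list operation),
    it can pass it at open time and no stat/HEAD call is made at all.
    """

    def __init__(self, path: str, size: Optional[int] = None):
        self.path = path
        self._size = size  # may be pre-populated from a listing

    def size(self) -> int:
        if self._size is None:
            # One stat call, then cached for the lifetime of the reader.
            self._size = os.path.getsize(self.path)
        return self._size

    def read_tail(self, n: int) -> bytes:
        # Reading a footer needs the size; with the cache we never re-stat.
        with open(self.path, "rb") as f:
            f.seek(max(self.size() - n, 0))
            return f.read(n)
```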

## TODO

* [x] Cleanup
* [x] Add IO unit test for loading a dataset
* [x] Check repro from 2318

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
eddyxu pushed a commit that referenced this issue May 29, 2024
@wjones127
Contributor Author

This should be substantially mitigated by #2396. However, there is still an optimization available for stores that support list start-after (GCS, S3). That optimization won't help other stores, such as local file systems or Azure, so it's unclear whether it is worthwhile. It may be more worthwhile to invest in auto-cleanup so users don't accumulate so many versions in the first place.
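
As a rough illustration of the start-after idea: if we already know that some version N exists, we can ask the store to list only keys that sort after N's manifest, so the request cost scales with the number of new versions rather than the total version count. The sketch below uses boto3; the bucket, prefix, and zero-padded manifest naming are assumptions for illustration, not Lance's actual layout.

```python
import boto3  # assumes AWS credentials are configured

def find_manifests_after(bucket: str, prefix: str, last_known_version: int) -> list[str]:
    """List only manifests newer than a version we already know about.

    Assumes manifests are named with zero-padded version numbers
    (e.g. "_versions/00000042.manifest") so lexicographic order matches
    version order -- purely illustrative.
    """
    s3 = boto3.client("s3")
    start_after = f"{prefix}{last_known_version:08d}.manifest"
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, StartAfter=start_after)
    # Only keys sorting after the known manifest come back, so the call
    # scales with the number of new versions, not all versions ever written.
    return [obj["Key"] for obj in resp.get("Contents", [])]
```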
