Dataset not found during frequent writes #2338

wjones127 · 2024-05-15T15:18:54Z

We got a user report of intermittent "Dataset not found" errors. This is happening on AWS NFS. ¹ The error was:

lance error: Dataset at path opt/netapp/wlmai/dbs/lance/chunks.lance was not found: LanceError(IO): Generic LocalFileSystem error: Unable to access metadata for opt/netapp/wlmai/dbs/lance/chunks.lance/_versions/.tmp_37.manifest_bb2eceae-ab02-43b1-8570-e4d1a6667932: IO error for operation on /opt/netapp/wlmai/dbs/lance/chunks.lance/_versions/.tmp_37.manifest_bb2eceae-ab02-43b1-8570-e4d1a6667932: No such file or directory (os error 2), /home/build_user/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lance-table-0.10.16/src/io/commit.rs:89:26, /home/build_user/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lance-0.10.16/src/dataset/builder.rs:229:31

It seems to be failing when getting metadata for a temporary manifest file.

I think this happens where we use list_with_delimiter to get the manifest versions:

lance/rust/lance-table/src/io/commit.rs, lines 89 to 91 in b61850b:

    let manifest_files = object_store
        .list_with_delimiter(Some(&base.child(VERSIONS_DIR)))
        .await?;

The implementation of LocalFileSystem::list_with_delimiter calls WalkDir and then performs various metadata operations on the resulting file references ². My suspicion is that the original WalkDir call sometimes picks up the temporary manifests, but by the time the metadata operations run on them, they have been deleted.
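The suspected race can be illustrated with a minimal sketch using only the Rust standard library. This is not the actual object_store implementation; the function names `snapshot_entries`, `stat_all_strict`, and `stat_all_tolerant` are hypothetical and exist only to show why a two-phase "list, then stat" can fail when a file disappears in between.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Phase 1: take a snapshot of the paths in `dir`.
/// A concurrent writer may delete or rename entries before phase 2 runs.
fn snapshot_entries(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut paths = Vec::new();
    for entry in fs::read_dir(dir)? {
        paths.push(entry?.path());
    }
    Ok(paths)
}

/// Phase 2, strict: stat every snapshotted path and fail on the first error.
/// If a temporary file vanished after the snapshot, this returns NotFound.
fn stat_all_strict(paths: &[PathBuf]) -> io::Result<Vec<(PathBuf, u64)>> {
    paths
        .iter()
        .map(|p| fs::metadata(p).map(|m| (p.clone(), m.len())))
        .collect()
}

/// Phase 2, tolerant: treat NotFound as "vanished since the snapshot" and
/// skip that entry; any other error is still propagated.
fn stat_all_tolerant(paths: &[PathBuf]) -> io::Result<Vec<(PathBuf, u64)>> {
    let mut out = Vec::new();
    for p in paths {
        match fs::metadata(p) {
            Ok(m) => out.push((p.clone(), m.len())),
            Err(e) if e.kind() == io::ErrorKind::NotFound => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(out)
}
```

If a temporary manifest is removed between the snapshot and the stat phase, the strict variant surfaces `No such file or directory`, which matches the shape of the reported error, while the tolerant variant returns only the files that still exist.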

There are two possible solutions to this:

* We move the temporary manifests to a different directory, so they aren't part of the WalkDir.
* We change the logic to use a pointer to the latest version, and use head to check if there are any newer versions. This is what I lean towards, since it will also help with perf bug: Inserting data is O(num versions) #2318. See also Replace _lastest.manifest and change manifest naming scheme #1362.
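The second option could look roughly like the sketch below. It assumes a naming scheme of `<version>.manifest` inside the versions directory and uses a local existence check in place of an object-store head request. Both `manifest_path` and `resolve_latest` are hypothetical helpers, not Lance APIs.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Assumed naming scheme: `<versions_dir>/<version>.manifest`.
fn manifest_path(versions_dir: &Path, version: u64) -> PathBuf {
    versions_dir.join(format!("{version}.manifest"))
}

/// Start from a hinted version (the "pointer") and probe forward one version
/// at a time until the next manifest is missing. Returns the highest version
/// found, or `None` if the hinted manifest itself does not exist.
fn resolve_latest(versions_dir: &Path, hint: u64) -> io::Result<Option<u64>> {
    if !manifest_path(versions_dir, hint).try_exists()? {
        return Ok(None);
    }
    let mut latest = hint;
    while manifest_path(versions_dir, latest + 1).try_exists()? {
        latest += 1;
    }
    Ok(Some(latest))
}
```

Because this only probes exact manifest names, temporary files in the same directory are never touched, and the cost depends on how stale the pointer is rather than on the total number of versions.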


Fixes #2338
Partially addresses #2318

**For a dataset on local SSD with 8,000 versions, we get 6x faster load time and 3x faster append.**

* Added special code path for local filesystem for finding latest manifest. This path skips the `metadata` call for paths that aren't relevant, both fixing #2338 and improving performance on local filesystems overall.
* Fixed code path where we were reading the manifest file twice
* Changed `CloudObjectReader` and `LocalFileReader` to both cache the file size, so we aren't making multiple calls to get the size of the same object/file. Also allowed passing the size when opening, in case we already have it from a list operation.
* Deprecated some more methods for loading a dataset, in favor of using `DatasetBuilder`. Also consolidated the implementations to use `DatasetBuilder`, so we have fewer code paths to worry about and test.

## TODO

* [x] Cleanup
* [x] Add IO unit test for loading a dataset
* [x] Check repro from 2318

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
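As a rough illustration of the first bullet (finding the latest manifest on a local filesystem without calling `metadata` on irrelevant paths), the sketch below decides relevance from the file name alone. It is not the code from the PR; `parse_version` and `latest_version` are hypothetical names, and the `<version>.manifest` naming scheme is an assumption.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Parse a name of the form `<version>.manifest`. Anything else, including
/// temporary files such as `.tmp_37.manifest_<uuid>`, yields `None`.
fn parse_version(file_name: &str) -> Option<u64> {
    file_name.strip_suffix(".manifest")?.parse::<u64>().ok()
}

/// Find the highest manifest version in `versions_dir` using only the names
/// returned by the directory listing. No explicit `fs::metadata` call is made
/// here, so a temporary file that disappears mid-scan cannot cause a failure
/// in a follow-up stat.
fn latest_version(versions_dir: &Path) -> io::Result<Option<u64>> {
    let mut latest = None;
    for entry in fs::read_dir(versions_dir)? {
        let name = entry?.file_name();
        if let Some(v) = name.to_str().and_then(parse_version) {
            // `None < Some(_)`, so this keeps the largest version seen so far.
            latest = latest.max(Some(v));
        }
    }
    Ok(latest)
}
```

Once the latest version is known, a single metadata or open call on that one manifest is enough, instead of one call per entry in the directory.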

wjones127 added the bug Something isn't working label May 15, 2024

wjones127 self-assigned this May 15, 2024

wjones127 mentioned this issue May 26, 2024 in perf: optimize IO path for reading manifest #2396 (merged)

wjones127 closed this as completed in #2396 May 29, 2024
