after using mmap, there's a small chance that the file read operation will get stuck for nearly 2 minutes #18584

Open
wwq2333 opened this issue Apr 19, 2024 · 4 comments
Labels
type-bug This issue is about a bug

Comments

wwq2333 commented Apr 19, 2024

Alluxio Version:
2.9.3

Describe the bug
It's not really a bug; we are just asking for advice.

We use Alluxio to accelerate dataset access in AI training. The training pod reads files via alluxio-fuse (it mmaps each file and reads at non-fixed positions). During training, however, we found that GPU utilization sometimes dropped to 0.
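To reproduce the access pattern outside the training job, a minimal sketch like the following may help. It maps a file read-only, reads at random offsets, and records any access slower than a threshold. The function name, chunk size, and threshold are assumptions chosen for illustration, not part of the training code.

```python
import mmap
import os
import random
import time


def probe_mmap_reads(path, reads=1000, chunk=4096, stall_seconds=1.0, seed=0):
    """Read `chunk` bytes at random offsets through mmap and time each access.

    Returns (slowest_seconds, stalls), where stalls is a list of
    (wall_clock_time, offset, seconds) for accesses that took at least
    `stall_seconds`.
    """
    rng = random.Random(seed)
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError("cannot mmap an empty file")
    stalls = []
    slowest = 0.0
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for _ in range(reads):
            offset = rng.randrange(0, max(1, size - chunk + 1))
            start = time.monotonic()
            _ = mm[offset:offset + chunk]  # slicing copies the bytes, so the pages must be faulted in
            elapsed = time.monotonic() - start
            slowest = max(slowest, elapsed)
            if elapsed >= stall_seconds:
                stalls.append((time.time(), offset, elapsed))
    return slowest, stalls
```

Running it against a file under the alluxio-fuse mount point (the path is whatever your pod uses) gives wall-clock timestamps for each stall, which can then be lined up with the FUSE debug log and the strace output.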

According to the log positions, the data read was stuck (for nearly 2 minutes). Because mmap is used, strace cannot see a read syscall, but there are no other syscalls during the stall either (other interfering factors have been ruled out).
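Since a page fault on an mmapped region does not show up as a syscall, one alternative is to look at thread states under /proc while the stall is happening. The sketch below lists threads of a process that are in uninterruptible sleep (state `D`) together with their `wchan` value. It is an assumption that the stalled thread shows up as `D`; a thread waiting on a FUSE reply may instead appear as `S`, in which case the `states` argument can be widened. `wchan` may also be empty or `0` depending on kernel configuration.

```python
import os


def blocked_threads(pid, proc_root="/proc", states=("D",)):
    """Return [(tid, state, wchan)] for threads of `pid` whose state is in `states`."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    found = []
    for tid in sorted((t for t in os.listdir(task_dir) if t.isdigit()), key=int):
        base = os.path.join(task_dir, tid)
        try:
            with open(os.path.join(base, "stat")) as f:
                stat = f.read()
            # The comm field may contain spaces or ')', so split after the last ')'.
            state = stat.rsplit(")", 1)[1].split()[0]
        except (OSError, IndexError):
            continue  # the thread exited while we were looking
        try:
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            wchan = ""
        if state in states:
            found.append((int(tid), state, wchan))
    return found
```

Polling this once per second for the training process during a GPU_UTIL drop would show whether any thread is parked in the kernel and, if `wchan` is available, roughly where.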

Meanwhile, we enabled the FUSE debug log and found that, during the stall, FUSE did not receive any requests. Both before and after the stall, the time between the entry and exit of each FUSE request was very short (millisecond level).

Furthermore, we used strace to trace the FUSE process's reads and writes on /dev/fuse:

strace -f -tt -q -T -P /dev/fuse -x -y -p ${FUSE_PID}

It appears that no request had particularly high latency.
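Rather than scanning the trace by eye, the per-call durations that `-T` appends (the trailing `<seconds>` field) can be filtered with a short script. This is a sketch that assumes the usual `strace -f -tt -T` line layout and only looks at completed calls; `<unfinished ...>` and `resumed` lines are skipped. The threshold is an arbitrary choice.

```python
import re

# Matches e.g.: [pid 4242] 13:37:21.900100 read(7</dev/fuse>, ..., 135168) = 56 <0.000041>
LINE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"          # optional "[pid N] " or "N " prefix
    r"(?P<ts>\d{2}:\d{2}:\d{2}\.\d+)\s+"       # -tt timestamp
    r"(?P<call>\w+)\(.*<(?P<dur>\d+\.\d+)>\s*$"  # syscall name and -T duration
)


def slow_calls(lines, threshold=0.1):
    """Return [(timestamp, syscall, seconds)] for completed calls taking >= threshold."""
    out = []
    for line in lines:
        m = LINE.match(line)
        if m is None:
            continue  # unfinished/resumed lines, signals, exit messages
        dur = float(m.group("dur"))
        if dur >= threshold:
            out.append((m.group("ts"), m.group("call"), dur))
    return out
```

When reading the result, keep in mind that a long `read` on /dev/fuse in the daemon would typically mean the daemon was waiting for the kernel to hand it a request, not that the daemon itself was slow; a long gap before a `write` is the more interesting signal for slow request handling.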

Under such circumstances, could the kernel be causing this, either by not dispatching requests to FUSE quickly enough or by not returning the responses to the application promptly enough?
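One way to narrow down which side is holding things up is the FUSE control filesystem. When it is mounted at /sys/fs/fuse/connections, each connection directory has a `waiting` file with the number of requests that are queued for userspace or being processed by the daemon. The sketch below just reads those counters; it assumes the control filesystem is mounted and that the caller has permission to read the files (normally the mount owner or root).

```python
import os


def fuse_waiting(root="/sys/fs/fuse/connections"):
    """Return {connection_id: waiting_request_count} for each FUSE connection."""
    counts = {}
    for conn in sorted(os.listdir(root)):
        try:
            with open(os.path.join(root, conn, "waiting")) as f:
                counts[conn] = int(f.read().strip())
        except (OSError, ValueError):
            continue  # connection went away or the file is not readable
    return counts
```

If `waiting` stays at 0 for the Alluxio mount while the training process is stalled, that would suggest no FUSE request is outstanding and the wait is elsewhere in the kernel; a non-zero value would point at requests that were issued but not yet answered. Both readings are tentative and worth cross-checking against the strace output.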

wwq2333 added the type-bug label Apr 19, 2024
wwq2333 (Author) commented Apr 19, 2024

From 13:37:22 to 13:39:20 (+8), the number of requests received from /dev/fuse decreased significantly.
fuse-strace.log
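A drop in request rate like this can be located mechanically by counting /dev/fuse reads per second in the strace log and reporting runs of seconds that fall below a chosen rate. The following is a sketch under the assumption that the log uses `-tt` timestamps and does not cross midnight; the `below` and `min_len` values are arbitrary and would need tuning to the normal request rate.

```python
import re
from collections import Counter

TS = re.compile(r"(?<!\d)(\d{2}):(\d{2}):(\d{2})\.\d+\s")


def requests_per_second(lines, needle="read("):
    """Count lines containing `needle`, keyed by second-of-day from the -tt timestamp."""
    counts = Counter()
    for line in lines:
        m = TS.search(line)
        if m and needle in line:
            h, mi, s = map(int, m.groups())
            counts[h * 3600 + mi * 60 + s] += 1
    return counts


def low_activity_windows(counts, below, min_len=5):
    """Return [(start_sec, end_sec)] runs of >= min_len seconds with fewer than `below` calls."""
    if not counts:
        return []
    first, last = min(counts), max(counts)
    windows, start = [], None
    for sec in range(first, last + 1):
        if counts.get(sec, 0) < below:
            if start is None:
                start = sec
        else:
            if start is not None and sec - start >= min_len:
                windows.append((start, sec - 1))
            start = None
    if start is not None and last + 1 - start >= min_len:
        windows.append((start, last))
    return windows


def hhmmss(sec):
    """Format a second-of-day value back to HH:MM:SS."""
    return "%02d:%02d:%02d" % (sec // 3600, sec % 3600 // 60, sec % 60)
```

Typical use would be `low_activity_windows(requests_per_second(open("fuse-strace.log")), below=5)`, then printing each window with `hhmmss` and comparing it with the GPU utilization graph.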

GPU_UTIL (image)

The FUSE debug log was rolled over and overwritten. However, judging from earlier FUSE debug logs, the behavior appears to be similar: there are only periodic statfs requests while training is stuck.
(two images attached)

jja725 (Contributor) commented Apr 19, 2024

@LuQQiu @jiacheliu3 Do you have any idea on this issue?

wwq2333 (Author) commented Apr 24, 2024

Any suggestions? @LuQQiu @jiacheliu3

LuQQiu (Contributor) commented Apr 24, 2024

The request path is kernel -> JNI C++ -> Java.
The C++ layer attaches threads to Java and matches C++ objects to Java objects; this matching needs to be done in a very case-sensitive way.
I'm not quite sure whether mmap breaks some of the previous assumptions, e.g. the way data is transferred, the way objects are managed, or the way threads are used.
