[C++][Python] Default value for CompressedInputStream kChunkSize might be too small #41604
Comments
Maybe adding a buffer layer here is better: https://arrow.apache.org/docs/python/generated/pyarrow.input_stream.html#pyarrow.input_stream ? CompressedInputStream just manages the decompression, rather than doing buffering for the physical IO.
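A minimal sketch of that suggestion, assuming the buffer layer sits below the compression layer (the bucket, path, and region here are illustrative, not from the issue):

```python
import pyarrow as pa
from pyarrow import fs

# Hypothetical setup; names are illustrative.
s3 = fs.S3FileSystem(region="us-east-1")

# Open the raw bytes without automatic decompression.
raw = s3.open_input_stream("bucket/data.csv.gz", compression=None)

# Buffer the raw stream before adding the gzip layer, so the
# decompressor's small reads are served from memory while the
# physical S3 reads happen in large chunks.
buffered = pa.input_stream(raw, compression="gzip", buffer_size=10_000_000)
```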
@mapleFU Thank you for providing the alternative. My point is …
Hmmm, so the input buffer of s3fs is useless? It doesn't do underlying buffering.
I agree fixed-size buffering is weird, but I think the CompressedInputStream's buffer size is just the "decompress-input-buffer-size" rather than the "s3-io-size". This could be added in C++ first, making the chunk size an input argument, as set here (lines 744 to 746 at 657c4fa).
Actually, after thinking about it, I think the actual problem is that when CompressedInputStream is in use, it will convert a …
But for a new pyarrow user, it is hard to know that pa.BufferedInputStream is needed. For example, the following code is totally fine if path_src is not a compressed file: it will issue one read to load the whole file.
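The example code did not survive in this thread; below is a hedged reconstruction of the kind of code being described, reusing the pyarrow_s3fs and path_src names that appear elsewhere in this issue (assumed to be defined):

```python
import pyarrow.csv as pv

# Without an explicit buffer, this works well for an uncompressed file,
# but for a .gz file every small decompressor read becomes its own
# S3 ranged request.
file_src = pyarrow_s3fs.open_input_stream(path_src)
reader = pv.open_csv(file_src, read_options=pv.ReadOptions(block_size=5_000_000))
```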
However, if the file is compressed, the reads instead go through CompressedInputStream's small fixed-size chunks.
So we should enhance the "csv" module and adjust the IO size before opening the CompressedInputStream. We should do a minor refactor here: arrow/cpp/src/arrow/dataset/file_csv.cc, lines 278 to 294 at 657c4fa.
This is also raised when it's a JSON format. @pitrou @lidavidm should we change …
The bottom line is that I think we should add a new method:

```cpp
/// \brief Return the preferred chunk size for reading at least `nbytes`
///
/// Different file backends have different performance characteristics
/// (especially on the latency / bandwidth spectrum).
/// This method informs the caller on a well-performing read size
/// for the given logical read size.
///
/// Implementations of this method are free to ignore the input `nbytes`
/// when computing the return value. The return value might be smaller,
/// larger or equal to the input value.
///
/// This method should be deterministic: multiple calls on the same object
/// with the same input argument will return the same value. Therefore,
/// calling it once on a given file should be sufficient.
///
/// There are two ways for callers to use this method:
/// 1) callers which support readahead into an internal buffer will
/// use the return value as a hint for their internal buffer's size;
/// 2) callers which require exact read sizes will use the return value as
/// an advisory chunk size when reading.
///
/// \param[in] nbytes the logical number of bytes desired by the caller
/// \return an advisory physical chunk size, in bytes
virtual int64_t preferred_read_size(int64_t nbytes) const;
```

Then the … @felipecrv What do you think?
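To make the two caller patterns from the doc comment concrete, here is an illustrative Python-style sketch (the proposal itself is a C++ virtual method; the file object f and a Python-level preferred_read_size are hypothetical):

```python
def choose_buffer_size(f, logical_read_size: int) -> int:
    # Pattern 1: a readahead caller sizes its internal buffer from the
    # hint; the hint may be smaller, larger or equal to the logical size.
    return max(logical_read_size, f.preferred_read_size(logical_read_size))


def read_in_advisory_chunks(f, nbytes: int) -> bytes:
    # Pattern 2: a caller needing exact read sizes uses the hint as an
    # advisory chunk size and loops until nbytes are accumulated.
    chunk_size = f.preferred_read_size(nbytes)
    out = bytearray()
    while len(out) < nbytes:
        data = f.read(min(chunk_size, nbytes - len(out)))
        if not data:
            break  # EOF
        out += data
    return bytes(out)
```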
@pitrou Note that: …
Or we could even remove the additional buffering and have …
Could work well, but a more flexible alternative to this could be a zero-copy stream interface similar to protobuf's [1].

[1] https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/io/zero_copy_stream.h
That's another possibility, but it forces the caller to allocate …

Note that those are not either/or. We could start with one and add the other if needed/desired.
Anyway, a …
But the caller chooses the buffer size. The stream tries to fill the buffer as much as possible if that is desirable, like in the case of S3 calls.
Which one do you think is simpler?
Right, but they have no idea what value could be reasonable. The caller could think "ok, I know that some filesystems really like large chunk sizes, so I'm going to allocate 100 MiB", only to get 64 KiB as the result. Or conversely, the caller could think "ok, let's be conservative so as not to waste memory unduly, so I'm going to allocate 512 KiB", and S3 will perform awfully. The point here is to let the file implementation inform the caller before making any impacting decision (such as allocating a large memory area).
Both are simple conceptually and should be simple implementation-wise. They have different implications though.
Anyway, shouldn't this: …
Describe the bug, including details regarding any error messages, version, and platform.
I'm using pyarrow.csv.open_csv to stream-read a 15 GB gzipped CSV file over S3. The speed is unusably slow.
Here, even when I set block_size=5_000_000, the reader is issuing 65 KB ranged reads over S3.
This is bad for two reasons: …
After digging into the code, I found that kChunkSize is hard-coded in CompressedInputStream (arrow/cpp/src/arrow/io/compressed.cc, line 432 at f6127a6).
Currently, my workaround is to use a buffered stream:

```python
file_src = pyarrow_s3fs.open_input_stream(path_src, buffer_size=10_000_000)
```
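For context, the rest of the read path from this report would then look something like the following sketch (the ReadOptions mirrors the block_size mentioned above; the loop is illustrative):

```python
import pyarrow.csv as pv

# With the buffered stream above, gzip decompression reads from memory,
# while physical S3 reads happen in ~10 MB ranges.
reader = pv.open_csv(file_src, read_options=pv.ReadOptions(block_size=5_000_000))
for batch in reader:
    ...  # process each RecordBatch
```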
But it is not obvious from the docs. Could we set this value higher, or at least add some documentation to clarify the usage?
Component(s)
C++, Python