ReadString function is inefficient #289

MS-Renan · 2022-05-11T23:57:22Z

Is your feature request related to a problem? Please describe.
Any Helper::DiskIO ReadString function may overestimate the buffer size, because it doubles the buffer until the data fits. The function also scans each char for the delimiter and breaks when it finds it.

Describe the solution you'd like
ReadString should not do resizing or delimiter parsing. It should only be responsible for reading into a buffer of the expected size. The expected size should come from the file size, as this is the exact size that must be read. The parsing shouldn't be done at all; instead, WriteString should always write the terminating point (the delimiter always equals \n and is then replaced by \0, so why not just let WriteString set the end point?).
Both ReadString / WriteString would then basically boil down to ReadBinary / WriteBinary.
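
To illustrate the idea, here is a minimal sketch of a size-driven read using only the standard library. `FileSize` and `ReadExact` are hypothetical names made up for this example, not existing Helper::DiskIO functions, and the real implementation would go through the DiskIO interface rather than `std::ifstream`.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Helper::DiskIO): size of a file in bytes.
std::size_t FileSize(const std::string& path)
{
    // Open positioned at the end so tellg() reports the file size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) throw std::runtime_error("cannot open " + path);
    return static_cast<std::size_t>(in.tellg());
}

// Hypothetical helper (not part of Helper::DiskIO): reads exactly `length`
// bytes starting at `offset`. The buffer is allocated once at the exact size,
// with no doubling and no per-char delimiter scan.
std::string ReadExact(const std::string& path, std::streamoff offset, std::size_t length)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    std::string buffer(length, '\0');
    in.seekg(offset, std::ios::beg);
    in.read(&buffer[0], static_cast<std::streamsize>(length));
    if (static_cast<std::size_t>(in.gcount()) != length)
        throw std::runtime_error("short read from " + path);
    return buffer;
}
```

A single-threaded caller would use `ReadExact(path, 0, FileSize(path))`; a multi-threaded reader would pass each thread its own offset and length.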

Additional context
ReadString is used to read config files (e.g. ini) and metadata files (e.g. tsv).

By default the read buffer size always starts at 2^16 = 65,536 bytes.

The biggest inefficiency comes from metadata files, as they are big. For example, with a 100GB file the read is divided among 32 threads, so each thread needs ~3GB but will eventually double its buffer to ~4GB, about one GB more than necessary. Across all threads, that overestimates the buffer size by ~32GB in total.
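
The estimate can be checked with a small sketch of the doubling arithmetic. The function names are hypothetical and this is not the actual ReadString code; it assumes the buffer starts at 2^16 bytes, doubles until the data fits, and that the file is split evenly across threads.

```cpp
#include <cstdint>

// Capacity reached by doubling from `initial` until it can hold `needed` bytes.
// Assumes `needed` is small enough that doubling does not overflow 64 bits.
std::uint64_t DoubledCapacity(std::uint64_t needed, std::uint64_t initial = 1ULL << 16)
{
    std::uint64_t capacity = initial;
    while (capacity < needed) capacity *= 2;
    return capacity;
}

// Total bytes allocated beyond what is needed when `fileSize` bytes are split
// evenly across `threads` readers, each growing its own buffer by doubling.
std::uint64_t TotalOverallocation(std::uint64_t fileSize, std::uint64_t threads)
{
    const std::uint64_t perThread = fileSize / threads;
    return threads * (DoubledCapacity(perThread) - perThread);
}
```

Taking 100GB as 100 * 2^30 bytes, each of the 32 threads needs 3.125 * 2^30 bytes and ends up with a 2^32 byte (4GB) buffer, so `TotalOverallocation(100ULL << 30, 32)` comes to 28 * 2^30 bytes, in line with the rough ~32GB figure above.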
