Is your feature request related to a problem? Please describe.
When invoking any Helper::DiskIO ReadString function, the function may overestimate the buffer size, because it doubles the buffer until the data fits. It also scans every char for the delim and breaks when it finds one.
Describe the solution you'd like
Instead of having ReadString do the resizing and delim parsing, ReadString should only be responsible for reading into a buffer of the expected size. The expected size should come from the file size, as this is the exact size that must be read. The parsing shouldn't be done at all; instead, WriteString should always write the terminating point (the delim always equals \n and is then replaced by \0, so why not just let WriteString set the end point?).
Both ReadString / WriteString will basically be boiled down to ReadBinary / WriteBinary.
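To make the idea concrete, here is a rough sketch of what an exact-size read could look like. It assumes C++17 and plain standard-library streams rather than the actual Helper::DiskIO interface; `WriteTerminated` and `ReadExact` are hypothetical names used only for illustration.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not the real WriteString): writes the text and then
// an explicit '\0' terminator, so the reader never has to scan for a delim.
void WriteTerminated(const std::filesystem::path& path, const std::string& text) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) throw std::runtime_error("cannot open file for writing");
    out.write(text.data(), static_cast<std::streamsize>(text.size()));
    out.put('\0');
    if (!out) throw std::runtime_error("write failed");
}

// Hypothetical helper (not the real ReadString): sizes the buffer from the
// file size, so there is one allocation and one read, with no doubling.
std::string ReadExact(const std::filesystem::path& path) {
    const std::uintmax_t size = std::filesystem::file_size(path);
    std::string buffer(static_cast<std::size_t>(size), '\0');

    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open file for reading");
    in.read(buffer.data(), static_cast<std::streamsize>(size));
    if (static_cast<std::uintmax_t>(in.gcount()) != size) {
        throw std::runtime_error("short read");
    }
    return buffer;
}
```

With this shape the buffer holds exactly the bytes on disk, terminator included, which is essentially what a binary read/write pair would do.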
Additional context
ReadString is used for reading config files (e.g. ini) and metadata files (e.g. tsv).
By default the read buffer size always starts at 2^16 = 65,536 bytes.
The biggest inefficiency comes from metadata files, as they are big. For example, with a 100GB file the read is divided among threads (32), so each thread needs ~3GB but will eventually resize its buffer to ~4GB, because doubling from 2^16 bytes lands on 2^32 bytes. That is about one GB over per thread, which means the total buffer size is overestimated by roughly 32GB.
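The numbers above can be checked with a small sketch. It assumes the buffer starts at 2^16 bytes and doubles until it is at least as large as the data, and that the file is split evenly across threads; `GrownCapacity` and `TotalOverAllocation` are hypothetical names, not functions from the code base.

```cpp
#include <cstdint>

// Assumed growth policy: start at `initial` bytes and double until the
// capacity is at least `needed`.
std::uint64_t GrownCapacity(std::uint64_t needed, std::uint64_t initial = 1ULL << 16) {
    std::uint64_t capacity = initial;
    while (capacity < needed) {
        capacity *= 2;
    }
    return capacity;
}

// Total bytes allocated beyond what is needed when `fileSize` bytes are
// split evenly across `threads` readers, each growing its own buffer.
std::uint64_t TotalOverAllocation(std::uint64_t fileSize, std::uint64_t threads) {
    const std::uint64_t perThread = fileSize / threads;
    return (GrownCapacity(perThread) - perThread) * threads;
}
```

Taking 100GB as 100 * 2^30 bytes, each thread needs 3.125 GiB, `GrownCapacity` returns 4 GiB, and `TotalOverAllocation` comes to 28 GiB, which is in line with the rough estimate above.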