bgzip performance drop from v1.16 #1767

Open
koriege opened this issue Apr 8, 2024 · 7 comments · May be fixed by #1768
@koriege

koriege commented Apr 8, 2024

Hi,

I stumbled across an issue when testing multiple versions of htslib/bgzip. Using the blazingly fast decompression tool rapidgzip (https://github.com/mxmlnkn/rapidgzip, released in 2023), which achieves a throughput of 1.5GB/s on my machine, I observed a large difference in compression performance between bgzip versions <= 1.16 and >= 1.17 that is not attributable to the change of the default compression level (in v1.16) or to cold-cache effects.

v1.13 - v1.15

rapidgzip -P 8 -kdc --import-index fastq.gzri fastq.gz | bgzip -l 6 -kc -@ 128 | pv > /dev/null
4.40GiB 0:00:12 [ 355MiB/s]

v1.16 (with adapted compression level)

rapidgzip -P 8 -kdc --import-index fastq.gzri fastq.gz | bgzip -l 5 -kc -@ 128 | pv > /dev/null
4.40GiB 0:00:12 [ 351MiB/s]

v1.17 - v1.19 (with adapted compression level)

rapidgzip -P 8 -kdc --import-index fastq.gzri fastq.gz | bgzip -l 5 -kc -@ 128 | pv > /dev/null
4.40GiB 0:00:17 [ 260MiB/s]
@jkbonfield
Contributor

jkbonfield commented Apr 8, 2024

Did you build all of them with libdeflate? samtools --version should report the libraries found ("libdeflate=yes"). (That assumes you have the entire ecosystem installed; if not, running strings on the bgzip binary and grepping the output would tell you.)
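
For example, something along these lines should confirm it (the bgzip path is just a placeholder):

samtools --version | grep libdeflate          # expect a line containing "libdeflate=yes"
strings $(which bgzip) | grep -i libdeflate   # non-empty output suggests a libdeflate build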

I can't see much difference here:

$ v=1.19.1;for i in 1 2 3;do /software/common-apps/samtools-team-managed/htslib/$v/bin/bgzip -@64 -c -l5 < /tmp/_10g | pv > /dev/null;done
3.16GiB 0:00:03 [ 850MiB/s] [     <=>                                                                  ]
3.16GiB 0:00:04 [ 757MiB/s] [       <=>                                                                ]
3.16GiB 0:00:04 [ 747MiB/s] [       <=>                                                                ]

$ v=1.16;for i in 1 2 3;do /software/common-apps/samtools-team-managed/htslib/$v/bin/bgzip -@64 -c -l5 < /tmp/_10g | pv > /dev/null;done
3.16GiB 0:00:03 [ 847MiB/s] [     <=>                                                                  ]
3.16GiB 0:00:04 [ 714MiB/s] [       <=>                                                                ]
3.16GiB 0:00:04 [ 727MiB/s] [       <=>                                                                ]

That said, it's not gaining much beyond 32 threads (this is a 64-core EPYC).

@koriege
Author

koriege commented Apr 9, 2024

Thank you for the reply. I can confirm the peak performance at 32 threads on our EPYC, too. In the meantime I have investigated this issue further. I compiled htslib with and without libdeflate, turned off NUMA and SMT, fixed the CPU frequencies, and tested from an NVMe SSD as well as from /dev/shm. Unfortunately, under all test cases I made the same observation as stated above when using rapidgzip to decompress and then bgzip to compress a fastq data stream (directly or within a longer pipeline). When compressing an uncompressed file, or when feeding it through cat, there is no observable difference in throughput. Now I am a little puzzled as to what actually causes this behavior - bgzip or rapidgzip. Can you confirm a difference in performance on your system when using rapidgzip? The version I used was the latest installed via pip (v0.13.1).

@jkbonfield
Contributor

I haven't tried rapidgzip yet, but you could try adding mbuffer to the pipeline too (i.e. uncompress | mbuffer | compress), as that will provide a bigger buffer and help decouple the threading. Otherwise you're bound by the small pipe buffer size.
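
For example, spliced into your earlier pipeline (the 1GB buffer size is an arbitrary choice):

rapidgzip -P 8 -kdc --import-index fastq.gzri fastq.gz | mbuffer -m 1G -q | bgzip -l 5 -kc -@ 128 | pv > /dev/null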

@jkbonfield
Contributor

I can confirm I see a difference, but only when reading from a pipe. Reading directly from a file it's the same speed. I'm not sure yet what has affected that - maybe some additional error checking somewhere. I'd need to do some bisecting to figure it out.
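
For example, comparing the two input paths on the same data (file name is a placeholder):

bgzip -l5 -@64 -c < data.fastq | pv > /dev/null        # reading directly from a file
cat data.fastq | bgzip -l5 -@64 -c | pv > /dev/null    # reading from a pipe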

@jkbonfield
Contributor

So it's down to 4957188734d8d977f3b494983e28156af9a258e, which added autodetection of text data and ensured bgzip blocks end on newlines for text-based formats. This improves random-access capabilities for certain file types.

However, even with --binary to enable the old behaviour, it still doesn't regain the speed. This is perhaps in part due to switching from UNIX I/O to an hfile, with additional layers of indirection. That said, hfile could be faster on some filesystems, particularly if we're using small block sizes (which we will be, defaulting to 0xff00 I think). It doesn't help that hfile's default block size is 32768, so less than a single BGZF block.

Adding an explicit hfile_set_blksize(f_src, 1e6) call in there regains the lost speed for binary mode. Oddly, this is only exported in hfile_internal.h; I'm not sure why we didn't make it public. Otherwise we can only change it via hts_set_opt, which takes an htsFile rather than an hFILE - yet another level of indirection. (That said, bgzip is internal to htslib, so I don't mind using the internal function.)

I am a bit surprised to see such a performance difference with hfile though.

Commit 2ff03d3

@ ; for i in `seq 1 5`;do cat < ~/scratch/data/enwik9 | ~/work/samtools_master/htslib/bgzip -l5 -@64 |pv > /dev/null;done
 323MiB 0:00:00 [ 466MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 502MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 500MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 423MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 517MiB/s] [ <=>                                                                        ]

Commit e495718

@ ; for i in `seq 1 5`;do cat < ~/scratch/data/enwik9 | ~/work/samtools_master/htslib/bgzip -l5 -@64 |pv > /dev/null;done
 323MiB 0:00:00 [ 373MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 397MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 410MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 398MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 402MiB/s] [ <=>                                                                        ]

Commit e495718 with hfile_set_blksize(f_src, 1e6)

@ ; for i in `seq 1 5`;do cat < ~/scratch/data/enwik9 | ~/work/samtools_master/htslib/bgzip -l5 -@64 |pv > /dev/null;done
 323MiB 0:00:00 [ 482MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 431MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 482MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 504MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 422MiB/s] [ <=>                                                                        ]

Obviously there's some variability, but this basically recovers the lost performance. (Adding --binary didn't make much difference.)

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Apr 9, 2024
Commit e495718 changed bgzip from raw POSIX read() calls to
hread().  Unfortunately hread gets its buffer size from a stat of the
input file descriptor, which can be 4kb for a pipe.  We're reading
0xff00 bytes, so this mostly ends up being split over two reads, with
one or both involving additional memcpys.  This makes the buffered I/O
perform worse than the unbuffered I/O.  In the most extreme cases (cat
data | bgzip -l0 > /dev/null) this is a two-fold slowdown.

The easy solution is just to increase the buffer size to something
sensible.  It's a little messy as we have to use hfile_internal.h to
get hfile_set_blksize, but it works.  I'm not sure why we didn't elect
to make that API more public.  Probably simply out of caution.

Fixes samtools#1767
jkbonfield linked a pull request Apr 9, 2024 that will close this issue
@koriege
Author

koriege commented Apr 10, 2024

Thank you for figuring this out! Does your pull request, which increases the blocksize to 1MB, affect the binary/legacy mode only?

@jkbonfield
Contributor

jkbonfield commented Apr 10, 2024

No, it modifies the hfile buffer regardless of binary/text mode. The speed difference between the two modes isn't that significant compared to the overhead of the buffering.

Further diagnosis shows the problem is down to pipes. Hfile does a stat call on the fd to determine the filesystem-recommended buffer size, but for a pipe this comes back as 4KB. Using strace I can see it mixing big reads (likely direct reads into the destination buffer) with small reads (<= 4KB), which are presumably refilling the hfile buffer. Previously bgzip always made direct read() calls, but those were consistently around 64KB and so performed well enough. In short, the change to use hread led to maybe 50% more underlying read system calls when reading from a pipe - an unintended consequence.
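
For example, a quick way to see the read pattern (file name and thread count are placeholders):

cat data.fastq | strace -e trace=read -o bgzip.trace bgzip -l5 -@8 -c > /dev/null
grep 'read(0,' bgzip.trace | head    # on v1.17+ expect a mix of <=4KB and larger reads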

Maybe htslib should simply enforce a larger minimum buffer size so it doesn't shrink the buffer to something so tiny. 4KB block sizes are a real anachronism, harking back to the bygone era of early unixen.

The other side of this is that even when the filesystem recommends a 4MB buffer size (as Lustre does), there is a hard cutoff of 32KB in htslib. This limit was added over concerns about high memory usage when doing many-way file merges with lots of file handles open. I have some sympathy for that; however, having the maximum default-detected block size be smaller than a typical single bgzf block seems inefficient, as we just end up doing multiple reads anyway in order to uncompress an entire bgzf block. Our in-flight I/O memory is therefore probably around 400KB for bam/vcf.gz, making the hfile layer largely an irrelevance. I'm tempted to say minimum 64KB, maximum 1MB, with the option of selectively increasing it as desired (e.g. in bgzip).
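
For reference, GNU stat can show that hint directly via its %o (optimal I/O transfer size) format; the file name is a placeholder:

stat -c '%n: %o' data.fastq                      # filesystem hint, e.g. 4194304 on Lustre
cat data.fastq | stat -c 'pipe: %o' /dev/stdin   # typically 4096 for a pipe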

Maybe @jmarshall has views on this, as most of this code originated in his era, perhaps mostly from him.

Edit: oops, I got my maths wrong. 64KB is the uncompressed size of a bgzf block, not the compressed size, so the current memory usage isn't as stark as it looks. Nevertheless, I think honouring the 4KB that the system stat returns for a pipe is way too small.

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Apr 10, 2024