bgzip performance drop from v1.16 #1767

Open
koriege opened this issue Apr 8, 2024 · 7 comments · May be fixed by #1768
@koriege

koriege commented Apr 8, 2024

Hi,

I stumbled across an issue when testing multiple versions of htslib/bgzip. Using the blazingly fast decompression tool rapidgzip (https://github.com/mxmlnkn/rapidgzip, released in 2023), which achieves a throughput of 1.5GB/s on my machine, I observed a large difference in compression performance between bgzip versions <= 1.16 and >= 1.17 that is not attributable to the change of the default compression level (in v1.16) or to cold-cache effects.

v1.13 - v1.15

rapidgzip -P 8 -kdc --import-index fastq.gzri fastq.gz | bgzip -l 6 -kc -@ 128 | pv > /dev/null
4.40GiB 0:00:12 [ 355MiB/s]

v1.16 (with adapted compression level)

rapidgzip -P 8 -kdc --import-index fastq.gzri fastq.gz | bgzip -l 5 -kc -@ 128 | pv > /dev/null
4.40GiB 0:00:12 [ 351MiB/s]

v1.17 - v1.19 (with adapted compression level)

rapidgzip -P 8 -kdc --import-index fastq.gzri fastq.gz | bgzip -l 5 -kc -@ 128 | pv > /dev/null
4.40GiB 0:00:17 [ 260MiB/s]
@jkbonfield
Contributor

jkbonfield commented Apr 8, 2024

Did you build all of them with libdeflate? samtools --version should report the libraries found ("libdeflate=yes"). (That assumes you have the entire ecosystem installed; if not, running strings on the bgzip binary and grepping the output would tell you.)
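
For example, something along these lines should confirm it (the bgzip path is just a placeholder):

samtools --version | grep libdeflate          # expect a line containing "libdeflate=yes"
strings $(which bgzip) | grep -i libdeflate   # non-empty output suggests a libdeflate build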

I can't see much difference here:

$ v=1.19.1;for i in 1 2 3;do /software/common-apps/samtools-team-managed/htslib/$v/bin/bgzip -@64 -c -l5 < /tmp/_10g | pv > /dev/null;done
3.16GiB 0:00:03 [ 850MiB/s] [     <=>                                                                  ]
3.16GiB 0:00:04 [ 757MiB/s] [       <=>                                                                ]
3.16GiB 0:00:04 [ 747MiB/s] [       <=>                                                                ]

$ v=1.16;for i in 1 2 3;do /software/common-apps/samtools-team-managed/htslib/$v/bin/bgzip -@64 -c -l5 < /tmp/_10g | pv > /dev/null;done
3.16GiB 0:00:03 [ 847MiB/s] [     <=>                                                                  ]
3.16GiB 0:00:04 [ 714MiB/s] [       <=>                                                                ]
3.16GiB 0:00:04 [ 727MiB/s] [       <=>                                                                ]

That said, it's not gaining much beyond 32 threads (this is a 64-core EPYC).

@koriege
Author

koriege commented Apr 9, 2024

Thank you for the reply. I can confirm the peak performance at 32 threads on our EPYC, too. In the meantime I have investigated this issue further. I compiled htslib with and without libdeflate, turned off NUMA and SMT, fixed the CPU frequencies, and tested from an NVMe SSD as well as from /dev/shm. Unfortunately, under all test cases I made the same observation as stated above when using rapidgzip to decompress and then bgzip to compress a fastq data stream (directly or within a longer pipeline). When compressing an uncompressed file, or when feeding it through cat, there is no observable difference in throughput. Now I am a little puzzled as to what actually causes this behavior - bgzip or rapidgzip. Can you confirm a difference in performance on your system when using rapidgzip? The version I used was the latest installed via pip (v0.13.1).

@jkbonfield
Contributor

I haven't tried rapidgzip yet, but you could try adding mbuffer to the pipeline too (i.e. uncompress | mbuffer | compress), as that will provide a bigger buffer and help decouple the threading. Otherwise you're bound by the small pipe buffer size.
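
For example, spliced into your earlier pipeline (the 1GB buffer size is an arbitrary choice):

rapidgzip -P 8 -kdc --import-index fastq.gzri fastq.gz | mbuffer -m 1G -q | bgzip -l 5 -kc -@ 128 | pv > /dev/null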

@jkbonfield
Contributor

I can confirm I see a difference, but only when reading from a pipe. Reading directly from a file it's the same speed. I'm not sure yet what has affected that - maybe some additional error checking somewhere. I'd need to do some bisecting to figure it out.
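
For example, comparing the two input paths on the same data (file name is a placeholder):

bgzip -l5 -@64 -c < data.fastq | pv > /dev/null        # reading directly from a file
cat data.fastq | bgzip -l5 -@64 -c | pv > /dev/null    # reading from a pipe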

@jkbonfield
Contributor

So it's down to 4957188734d8d977f3b494983e28156af9a258e, which added autodetection of text data and ensured bgzip blocks end on newlines for text-based formats. This improves random-access capabilities for certain file types.

However, even with --binary to enable the old behaviour, it still doesn't regain the speed. This is perhaps in part due to switching from UNIX I/O to an hfile, with additional layers of indirection. That said, hfile could be faster on some filesystems, particularly if we're using small block sizes (which we will be, defaulting to 0xff00 I think). It doesn't help that hfile's default block size is 32768, so less than a single BGZF block.

Adding an explicit hfile_set_blksize(f_src, 1e6) call in there regains the lost speed for binary mode. Oddly, this is only exported in hfile_internal.h; I'm not sure why we didn't make it public. Otherwise we can only change it via hts_set_opt, which takes an htsFile rather than an hFILE - yet another level of indirection. (That said, bgzip is internal to htslib, so I don't mind using the internal function.)

I am a bit surprised to see such a performance difference with hfile though.

Commit 2ff03d3

@ ; for i in `seq 1 5`;do cat < ~/scratch/data/enwik9 | ~/work/samtools_master/htslib/bgzip -l5 -@64 |pv > /dev/null;done
 323MiB 0:00:00 [ 466MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 502MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 500MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 423MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 517MiB/s] [ <=>                                                                        ]

Commit e495718

@ ; for i in `seq 1 5`;do cat < ~/scratch/data/enwik9 | ~/work/samtools_master/htslib/bgzip -l5 -@64 |pv > /dev/null;done
 323MiB 0:00:00 [ 373MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 397MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 410MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 398MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 402MiB/s] [ <=>                                                                        ]

Commit e495718 with hfile_set_blksize(f_src, 1e6)

@ ; for i in `seq 1 5`;do cat < ~/scratch/data/enwik9 | ~/work/samtools_master/htslib/bgzip -l5 -@64 |pv > /dev/null;done
 323MiB 0:00:00 [ 482MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 431MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 482MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 504MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 422MiB/s] [ <=>                                                                        ]

Obviously there's some variability, but this basically recovers the lost performance. (Adding --binary didn't make much difference.)

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Apr 9, 2024
Commit e495718 changed bgzip from raw POSIX read() calls to
hread().  Unfortunately hread gets its buffer size from a stat of the
input file descriptor, which can be 4kb for a pipe.  We're reading
0xff00 bytes, so this mostly ends up being split over two reads, with
one or both involving additional memcpys.  This makes the buffered I/O
perform worse than the unbuffered I/O.  In the most extreme cases (cat
data | bgzip -l0 > /dev/null) this is a two-fold slowdown.

The easy solution is just to increase the buffer size to something
sensible.  It's a little messy as we have to use hfile_internal.h to
get hfile_set_blksize, but it works.  I'm not sure why we didn't elect
to make that API more public.  Probably simply out of caution.

Fixes samtools#1767
jkbonfield linked a pull request Apr 9, 2024 that will close this issue
@koriege
Author

koriege commented Apr 10, 2024

Thank you for figuring this out! Does your pull request, which increases the blocksize to 1MB, affect the binary/legacy mode only?

@jkbonfield
Contributor

jkbonfield commented Apr 10, 2024

No, it modifies the hfile buffer regardless of binary/text mode. The speed difference between the two modes isn't that significant compared to the overhead of the buffering.

Further diagnosis shows the problem is down to pipes. Hfile does a stat call on the fd to determine the filesystem-recommended buffer size, but for a pipe this comes back as 4KB. Using strace I can see it mixing big reads (likely direct reads into the destination buffer) with small reads (<= 4KB), which are presumably refilling the hfile buffer. Previously bgzip always made direct read() calls, but those were consistently around 64KB and so performed well enough. In short, the change to use hread led to maybe 50% more underlying read system calls when reading from a pipe - an unintended consequence.
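
For example, a quick way to see the read pattern (file name and thread count are placeholders):

cat data.fastq | strace -e trace=read -o bgzip.trace bgzip -l5 -@8 -c > /dev/null
grep 'read(0,' bgzip.trace | head    # on v1.17+ expect a mix of <=4KB and larger reads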

Maybe htslib should simply enforce a larger minimum buffer size so it doesn't shrink the buffer to something so tiny. 4KB block sizes are a real anachronism, harking back to the bygone era of early unixen.

The other side of this is that even when the filesystem recommends a 4MB buffer size (as Lustre does), there is a hard cutoff of 32KB in htslib. This limit was added over concerns about high memory usage when doing many-way file merges with lots of file handles open. I have some sympathy for that; however, having the maximum default-detected block size be smaller than a typical single bgzf block seems inefficient, as we just end up doing multiple reads anyway in order to uncompress an entire bgzf block. Our in-flight I/O memory is therefore probably around 400KB for bam/vcf.gz, making the hfile layer largely an irrelevance. I'm tempted to say minimum 64KB, maximum 1MB, with the option of selectively increasing it as desired (e.g. in bgzip).
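
For reference, GNU stat can show that hint directly via its %o (optimal I/O transfer size) format; the file name is a placeholder:

stat -c '%n: %o' data.fastq                      # filesystem hint, e.g. 4194304 on Lustre
cat data.fastq | stat -c 'pipe: %o' /dev/stdin   # typically 4096 for a pipe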

Maybe @jmarshall has views on this, as most of this code originated in his era, perhaps mostly from him.

Edit: oops, I got my maths wrong. 64KB is the uncompressed size of a bgzf block, not the compressed size, so the current memory usage isn't as stark as it looks. Nevertheless, I think honouring the 4KB that the system stat returns for a pipe is way too small.

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Apr 10, 2024