
int overflow for big variant lines in vcf #1539

Open
arnaldurgylfason opened this issue Dec 16, 2022 · 4 comments

@arnaldurgylfason

We have a variant with more than 50 alleles across almost 500,000 samples; the VCF line in question is over 3 GB. Neither bcftools nor our own htslib-based program can read it.

After debugging (with a plain .vcf rather than .vcf.gz), I noticed that hts_getline appeared to read the line, but the length it returned was over 3 GB. Since that return value was captured in an int (4 bytes) in _reader_fill_buffer in synced_bcf_reader.c, it overflowed to a negative number, which was interpreted as a failure.

As bigger datasets become more common, this problem will occur more often, and I'm sure the same overflow exists in other places. I would consider it important to fix this and allow for bigger lines, both for the big data we have today and for the even bigger data of the near future. Switching to 8-byte integers would solve it; note that size_t, which is wide enough, is already used in many places.
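
To make the failure mode concrete, here is a minimal, self-contained C sketch (the ~3 GiB length comes from this report; the variable names and exact wrap-around value are illustrative, and the result of the narrowing conversion is implementation-defined):

```c
#include <limits.h>
#include <stdio.h>

int main(void) {
    /* A line length just over 3 GiB, as in the report above. */
    size_t line_len = ((size_t)3 << 30) + 1;

    /* Narrowing a 64-bit length into a 4-byte int, as happens when the
     * value returned for the line is captured in an int: on common
     * platforms this wraps to a negative number. */
    int ret = (int)line_len;

    if (ret < 0)
        printf("read %zu bytes, but the int holds %d -> treated as an error\n",
               line_len, ret);
    return 0;
}
```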

@jmarshall
Member

See also samtools/bcftools#1614 and #1448.

@arnaldurgylfason
Author

Thanks. We will start using FORMAT/LPL and FORMAT/LAA for variants with many alleles.
I still think htslib should be able to handle bigger lines, though.
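
For context, LAA and LPL are the localized-allele FORMAT fields from the VCF 4.4 specification; header lines along these lines (paraphrased, not quoted verbatim from the spec) declare them:

```
##FORMAT=<ID=LAA,Number=.,Type=Integer,Description="1-based indices into ALT, listing the alleles that are relevant (local) for this sample">
##FORMAT=<ID=LPL,Number=.,Type=Integer,Description="Phred-scaled genotype likelihoods for the local alleles, analogous to PL">
```

Because each sample then stores likelihoods only for its own handful of local alleles, rather than for every genotype combination of every ALT allele, the per-line size stops exploding as the allele count grows.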

@daviesrob
Member

It does look like bgzf_getline() has the same overflow problem. It should be easy to fix in the same way as #1448 did for uncompressed reads.

Note that this may not buy you much, due to other limitations in the library, for example bcf_fmt_t::p_len and bcf_fmt_t::p_off. As these are part of the public interface, we can't easily change them. I'm also not entirely sure we'd want to: by the time individual records reach multi-gigabyte sizes, you have to ask whether you've hit the limit of what can sensibly be done with flat files, and consider more database-oriented solutions (for example, Hail).
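
For reference, the struct in question from htslib's vcf.h (abridged, comments paraphrased); both p_len and the 31-bit p_off cap how much per-record FORMAT data can be addressed, independently of how the line itself is read:

```c
typedef struct bcf_fmt_t {
    int id;                       // numeric tag id; the string form is bcf_hdr_t::id[BCF_DT_ID][id].key
    int n, size, type;            // values per sample, bytes per sample, BCF_BT_* type
    uint8_t *p;                   // pointer into the record's data block
    uint32_t p_len;               // length of the data at p: at most 2^32-1 bytes
    uint32_t p_off:31, p_free:1;  // 31-bit offset into the record buffer
} bcf_fmt_t;
```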

daviesrob self-assigned this Jan 3, 2023
@pd3
Member

pd3 commented Jan 3, 2023

It would be nice to have htslib print a warning and skip such entries. We cannot fully support records this large, as that would break too many things and is not worth it. The feasible solution was proposed above: generate LPL and LAA instead.
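
A rough sketch of what such a warn-and-skip check might look like (hypothetical: the INT_MAX cap, the helper name, and the chrom/pos arguments are illustrative; hts_log_warning and PRIhts_pos are htslib's real logging macro and format specifier):

```c
#include <limits.h>
#include <stdio.h>
#include <htslib/hts.h>       // hts_pos_t, PRIhts_pos
#include <htslib/hts_log.h>   // hts_log_warning
#include <htslib/kstring.h>   // kstring_t

/* Hypothetical helper: returns 1 if a record should be skipped because its
 * serialized length cannot be represented in the library's int-based APIs. */
static int record_too_big(const kstring_t *line, const char *chrom, hts_pos_t pos)
{
    if (line->l > (size_t)INT_MAX) {
        hts_log_warning("skipping %zu byte record at %s:%" PRIhts_pos
                        "; records over 2 GiB are not supported",
                        line->l, chrom, pos);
        return 1;
    }
    return 0;
}

int main(void)
{
    kstring_t line = {0, 0, NULL};   // would normally be filled by hts_getline()
    line.l = (size_t)3 << 30;        // pretend a 3 GiB record was just read
    if (record_too_big(&line, "chr1", 1234))
        fprintf(stderr, "record skipped, continuing with the rest of the file\n");
    return 0;
}
```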
