Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add VCF/BCF support for POS=0 coordinate #1573

Open
wants to merge 7 commits into
base: develop
Choose a base branch
from
Open

Conversation

pd3
Copy link
Member

@pd3 pd3 commented Mar 1, 2023

Resolves #1571

See also samtools/bcftools#1875

pd3 added a commit to samtools/bcftools that referenced this pull request Mar 1, 2023
@daviesrob
Copy link
Member

This accidentally reverts the last htscodecs submodule update. Please could you update your local copy with git submodule update, and then fix up your bt-1871 branch? I'm not entirely sure, but I think you'd need to do something like:

( cd htscodecs && git pull && git checkout v1.4.0 )
git add htscodecs
git commit --amend

Also, if you rebase the branch on to the current develop the AppVeyor tests should now work again.

@jkbonfield
Copy link
Contributor

git pull on develop just did it for me. You only need to do this sort of thing by hand if you've been editing the submodule I think.

Review of `bcftools merge` code and possibly others is necessary because it is relying
on uninitialized POS==-1 in synced_bcf_reader

This should resolve samtools#1571
@pd3
Copy link
Member Author

pd3 commented Mar 1, 2023

This gives me headaches, always manage to mess it up. Did what you suggested and now I've got

On branch bt-1871
Your branch and 'origin/bt-1871' have diverged,
and have 1 and 1 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)

Should I merge as suggested and push the result?

EDIT: push -f fixed it. Hopefully

@daviesrob
Copy link
Member

Hmm, this is a bit of a rabbit-hole. I don't think it's a good idea to change reg2bins(), as it would become non-compliant with the index specification. (BCF defers to SAM for this, and SAM specifically defines how the (-1, 0] interval should be handled.

Now for htslib it turns out that this is moot, because we don't index unmapped reads; we reject SAM positions < 0; and we've actually been pushing this interval into the (0, 1] bin in hts_idx_push() since version 1.4. However, it turns out the Picard puts this interval into bin 4860, as defined by the specification. I think that may be by accident though because htsjdk also prevents you from searching for it in the reg2bins() equivalent so presumably has the same problem you're trying to fix here...

My test file:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20150218
##reference=ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
##source=1000GenomesPhase3Pipeline
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=1,assembly=b37,length=51304566>
##SAMPLE=<ID=SAMPLE_1>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE_1
1	0	ID1	C	G	.	PASS	.	GT	1/0
1	1	ID2	G	T	.	PASS	.	GT	1/0

Picard index dump (obtained by compiling htslib with -DDEBUG_INDEX:

$ ./tabix /tmp/example_p.vcf.gz 1
format='tbi', min_shift=14, n_lvls=5, n_bins=37449, l_meta=30 n=1, m=1, n_no_coor=0
======== BIN Index - tid=0, n_buckets=4, size=2
	bin=4680, level=4, parent=584, n_chunks=1, loff=0, interval=[536739841 - 536870912]
		chunk=0, u=434, v=462
	bin=4681, level=5, parent=585, n_chunks=1, loff=434, interval=[1 - 16384]
		chunk=0, u=462, v=23527424
======== LINEAR Index - tid=0, n_values=1
		entry=0, offset=434, interval=[1 - 16384]
[W::tbx_parse1] Coordinate <= 0 detected. Did you forget to use the -0 option?
1	0	ID1	C	G	.	PASS	.	GT	1/0
1	1	ID2	G	T	.	PASS	.	GT	1/0

Note that the whole-chromosome search works because it asks for an interval wide enough to include bin 4680's official range, which means we also pick up the (zero-based) position -1 stuff in there. It wouldn't work if you asked for a smaller interval including that position though.

For maximum compatibility we probably want to make reg2bins() include bin 4680 (or its csi equivalent) when you ask for a range starting at -1. I don't think it needs to return the full list in the specification though because as far as I can make out intervals starting at -1 can only ever end up in bins 4680 or 0.

This should probably be done as a special case, so we can avoid issues with shifting negative numbers. We may also want to discuss if we should be indexing this position in a currently spec-compliant way, or updating the specification so it follows the practice we've been doing for a while (which would break htsjdk, but as I think it can't do the look-up, probably not too badly compared with the current state).

The restriction is already enforced by hts_idx_push() before
this function is called, so no need to do it again.
When searching for `max_off`, hts_itr_query() looks for a bin
to the right of the end of the region.  For whole chromosomes,
this would be HTS_POS_MAX, which is far beyond the maximum
bin position supported.  The `bin` calculation overflowed leading
to much implementation-defined behaviour as it explored various
negative bin numbers.

Fix by limiting `end` to the maximum value supported by the index.
Since 71db687 htslib has put ranges starting -1 into the
most appropriate bin starting from 0.  However, the SAM
specification actually says [-1, 0) should go in bin 4680 (for
bai / tbx indexes) and htsjdk does this.  To be able to
handle indexes made both ways, add bin 4680 (or the equivalent
number for csi) to the search list in `reg2bins()` as a
special case, and make hts_itr_query() check for it when finding
`min_off`, when ranges start at -1.

Also take the opportunity to make reg2bins() notice when its
memory allocations fail, and ensure the error gets propagated if
necessary.
Required to support VCF files with variants at position 0

Add region parse flag HTS_PARSE_REF_START_MINUS_1 to make
whole-chromosome ranges start at (zero-based position) -1 instead
of 0.

Rework hts_parse_region() so it can correctly interpret region
specifications that start at zero.  If the new
HTS_PARSE_REF_START_MINUS_1 flag is set, it will also return
region [-1, HTS_POS_MAX) for whole-chromosome requests.

Make hts_itr_querys() work out if it's making an iterator for
a vcf or bcf file and pass the HTS_PARSE_REF_START_MINUS_1 flag to
hts_parse_region() if it is.  This detection works by inspecting
the `readrec` parameter to see what sort of file is being read.
This should work as long as hts_itr_querys() is being called
via the bcf_itr_querys() or tbx_itr_querys() wrappers, however
it may be better to make a new interface that can accept an
explicit flag parameter.
These should be interpreted as start of chr. to the given position.
@daviesrob
Copy link
Member

I've pushed up my progress so far on this:

  • Reverted the check that was added in insert_to_l() as a similar one is already done before calling the function.
  • Fixed a signed overflow in hts_itr_query() which caused it to check for numerous negative bin numbers when passed an end value of HTS_POS_MAX.
  • Work around the discrepancy about where the range [-1, 0) is stored in the binning index by looking for both options.
  • Allows range specifications for regions in VCF/BCF to start at base 0, and ensures that ranges like chr:-0 work correctly, both in hts_parse_region() and in the synced reader.
  • Adds some tabix tests for VCF/BCF files with variants at position 0.

The region parsing part adds a new flag HTS_PARSE_REF_START_MINUS_1 to make hts_parse_region() return a start position of -1 instead of zero for the beginning of the chromosome. I don't really like that name, better suggestions are welcome. Actually setting the flag so the revised start only applies to VCF and BCF files is a bit of a kludge at the moment. hts_itr_querys() doesn't get many clues at the moment as to what sort of file it's making an iterator for. Currently I take the somewhat hacky approach of testing its readrec argument to try to find out what soft of file it's reading, and also the hdr if it looks like something for tabix. This does work, but it might be better to pass the information in a more explicit way - which would require adding a new interface to the API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bcftools view with region filter drops variants at POS=0 (telomeres)
3 participants