Feature request: extend mpileup API #1453

Open
pd3 opened this issue Jun 14, 2022 · 3 comments

pd3 (Member) commented Jun 14, 2022

The current mpileup API (bam_mplp_auto()) gives access to an array of per-read structures, but provides no way to store per-position data.

One possible use case is storing the number of overlapping soft-clipped reads at each position; this would make it possible to annotate candidate variants in difficult regions as suspicious. Currently, removed soft-clipped reads are invisible to the mpileup caller.
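For reference, the current loop looks roughly like this (a minimal sketch assuming a single BAM input; the plp_dat struct and read_bam callback are illustrative names, not part of htslib). Everything the caller sees in the inner loop is per-read; there is no slot to hang per-position state such as a soft-clip depth counter:

```c
#include <htslib/sam.h>

typedef struct { samFile *fp; sam_hdr_t *hdr; } plp_dat;

// Read callback handed to bam_mplp_init(); returns <0 at EOF/error
static int read_bam(void *data, bam1_t *b) {
    plp_dat *d = (plp_dat *)data;
    return sam_read1(d->fp, d->hdr, b);
}

void pileup_loop(plp_dat *d) {
    void *dp[1] = { d };
    bam_mplp_t it = bam_mplp_init(1, read_bam, dp);
    const bam_pileup1_t *plp[1];
    int tid, pos, n_plp[1];
    while (bam_mplp_auto(it, &tid, &pos, n_plp, plp) > 0) {
        for (int i = 0; i < n_plp[0]; i++) {
            const bam_pileup1_t *p = plp[0] + i;  // per-read data only
            // p->b, p->qpos, p->indel, p->is_del, ...
        }
        // No hook here for per-position data; reads whose soft-clips
        // overlap this column never appear in plp[] at all.
    }
    bam_mplp_destroy(it);
}
```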

jkbonfield (Contributor) commented

I'm trying to think how, practically, tracking the number of overlapping soft-clipped reads would be implemented.

For the right-hand soft-clip, when we're discarding a BAM from the in-flight list we could record that, while it ends at the current pos P, it has N bases of soft-clip, so for the next N positions we add +1 to a soft-clip depth stat. That would need some sorted list or tree into which we can keep inserting, sorted on right-hand end + SC length (a sketch of this event-counter idea is below). It's also possible to add the BAM object to a secondary "just gone by" list, I guess, and only free it once the soft-clip margin has passed. That would then give us access to the data, but it's cumbersome and high memory.
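To make that concrete, here is a rough standalone sketch of the event-counter version (not htslib code; all names hypothetical). When a record is discarded at its right-hand end P with N soft-clipped bases, push an expiry event at P+N into a min-heap; the number of unexpired events at any later column is the right-hand soft-clip depth there:

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct { int64_t *end; int n, alloc; } sc_heap;  // min-heap of expiry positions

// Called when a record leaves the in-flight list: its right-hand clip
// "covers" positions P .. P+N-1, so the event expires at P+N.
static void sc_push(sc_heap *h, int64_t expiry) {
    if (h->n == h->alloc) {
        h->alloc = h->alloc ? 2 * h->alloc : 16;
        h->end = realloc(h->end, h->alloc * sizeof(*h->end));
    }
    int i = h->n++;
    h->end[i] = expiry;
    while (i && h->end[(i-1)/2] > h->end[i]) {  // sift up
        int64_t t = h->end[i]; h->end[i] = h->end[(i-1)/2]; h->end[(i-1)/2] = t;
        i = (i-1)/2;
    }
}

// Called once per column; pops expired events and returns the number of
// right-hand soft-clips overlapping pos (columns must advance monotonically).
static int sc_depth(sc_heap *h, int64_t pos) {
    while (h->n && h->end[0] <= pos) {
        h->end[0] = h->end[--h->n];
        for (int i = 0;;) {                      // sift down
            int l = 2*i + 1, r = l + 1, m = i;
            if (l < h->n && h->end[l] < h->end[m]) m = l;
            if (r < h->n && h->end[r] < h->end[m]) m = r;
            if (m == i) break;
            int64_t t = h->end[i]; h->end[i] = h->end[m]; h->end[m] = t;
            i = m;
        }
    }
    return h->n;
}
```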

However, the right-hand end is only half the problem. We can't really do this for the left-hand end, as we don't "see" the data until it's too late. The only solution would be to do read-ahead with a maximum window of soft-clips. We may want that anyway to compensate for extremely long reads that may have kbp of soft-clipped data; we only need a small window for the purposes of variant calling. So read-ahead may work in that case too.

It's sounding like a total rewrite from the ground up though, both in implementation and API. That's not necessarily a bad idea, as I've needed to hack my own pileup alternative anyway for dealing with iterating over bases within an insertion. (Sometimes it's simply not sufficient to be given a bunch of insertion sequence; you have to parse the CIGARs manually to look for P operators.) Plus, read-ahead of at least 1 bp is necessary to handle reads that post-realignment start in an insertion, as they too don't "appear" until the event has already happened.
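For illustration, this is the kind of manual CIGAR walk the caller is currently forced into to recover inserted bases while watching for P operators (a sketch, not a proposed API):

```c
#include <stdio.h>
#include <htslib/sam.h>

// Print the bases of each insertion in record b, walking the CIGAR by hand.
static void show_insertions(const bam1_t *b) {
    const uint32_t *cig = bam_get_cigar(b);
    const uint8_t *seq = bam_get_seq(b);
    int64_t q = 0;                                  // query coordinate
    for (uint32_t k = 0; k < b->core.n_cigar; k++) {
        int op  = bam_cigar_op(cig[k]);
        int len = bam_cigar_oplen(cig[k]);
        if (op == BAM_CINS) {
            for (int i = 0; i < len; i++)           // bases within the insertion
                putchar(seq_nt16_str[bam_seqi(seq, q + i)]);
            putchar('\n');
        } else if (op == BAM_CPAD) {
            // P consumes neither query nor reference; it pads one read's
            // insertion so it lines up against a longer insertion elsewhere.
        }
        if (bam_cigar_type(op) & 1) q += len;       // op consumes the query
    }
}
```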

That isn't so much "extend" as "write mpileup2". (See https://github.com/samtools/samtools/blob/develop/consensus_pileup.c for a start at that.)

It's possible we could just have a buffered-pileup wrapper around the existing mpileup. It may not be the most efficient, but it could be implementable, e.g. something that introduces a delay, so that while pileup is on pos P, the delayed pileup is at pos P-D (rough sketch below). That allows you to cope with reads which appear some time in the future with a soft-clip at the start. It's doable, but I think it'd be inefficient, as it would likely have to duplicate memory and copy data around to avoid it being freed up by mpileup. (Or we add reference counting to the various structures involved, so they can be considered to be in use by more than one piece of code.)
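A rough sketch of the delay idea with hypothetical types (the per-read payload is elided, since whether it is deep-copied or reference-counted is exactly the open question):

```c
#include <stdint.h>

// Hypothetical buffered column; per-read data would live here too, either
// deep-copied out of mpileup or reference-counted.
typedef struct {
    int tid;
    int64_t pos;
    int n;         // reads covering the column
} dcol_t;

typedef struct {
    dcol_t *ring;  // D+1 buffered columns
    int D, head, count;
} delayed_plp_t;

// Push the column just produced by bam_mplp_auto(); once the buffer has
// filled, returns the column from D positions back (valid until the next
// push), else NULL.
static dcol_t *delayed_push(delayed_plp_t *d, dcol_t col) {
    d->ring[(d->head + d->count) % (d->D + 1)] = col;
    if (d->count < d->D) { d->count++; return NULL; }
    dcol_t *out = &d->ring[d->head];
    d->head = (d->head + 1) % (d->D + 1);
    return out;
}
```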

pd3 (Member, Author) commented Jun 17, 2022

Thanks for the analysis! My thoughts about this are:

  • I agree that replacing it with mpileup2 is a better approach, and it also gives more flexibility.
  • A secondary "just gone by" list should be okay; it would only require keeping reads with soft-clips around for a bit longer. I'm not sure what is typically encountered in long reads, though.
  • A read-ahead buffer could also be used for other applications (such as local realignment), so providing this level of API abstraction would be useful.

jkbonfield (Contributor) commented
A mixture of ideas from our samtools meeting (and some others):

  • With deep data, a considerable bottleneck is fetching the seq and/or qual, which can fall out of the cache on each pileup loop. I say "or" because the user may not want both in simpler algorithms. (E.g. see the prefetch calls in samtools consensus.)
  • The caller could potentially fetch the next N bases and quals in a single step and cache them in a local buffer along with those of other reads, so we only periodically need to thrash the cache in this way and most of the time memory access is neat and sequential.
  • Often the user simply wants to know the bases and qualities at a position, along with some other meta-data (is it an indel, etc). So a pair of arrays (seq, qual) is potentially more useful to them than an array of structs containing seq and qual members.
  • If hidden behind an API, the internals can obviously be changed as we see fit, e.g. array of structs vs struct of arrays. Asking for the base of record N of M doesn't care whether that's pileup[N]->base or pileup->base[N] behind the scenes, but it may provide great potential for cache use. (A sketch of such an API is below this list.)
  • Separate the layers. These don't all have to be implemented in the 1st iteration, given we want an opaque data structure and a clean API. Just knowing the desired architecture is enough.
    1. Ref pos delay buffer: fetching pos N+1000, but serving up pos N. This allows us to report things we need, such as how many records have soft-clip "overlapping" (assuming a trivial 1:1 alignment) this coordinate for reads that start a few bases downstream.
    2. Depth reduction. The best algorithm here builds on top of the delay buffer, as it can give us better depth smoothing.
    3. The user-driven pileup caller.
  • Drop needless memcpys. If we want to let the user iterate over the N sequences in the pileup then that's fine, but they don't need to be copied into a packed array first.
  • Separate per-record from per-position fields. This isn't done well currently: we have a field indicating whether the column is an indel, which is replicated for every record in that column.
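To make those last points concrete, here is a sketch of what an opaque column type plus accessors might look like (all names hypothetical, not an agreed design). Per-position fields are stored once, and the accessors keep array-of-structs vs struct-of-arrays an internal choice:

```c
#include <stdint.h>

// Definition would be private to the library; shown here for illustration.
typedef struct {
    int64_t pos;
    int     n;            // records in this column
    int     is_indel;     // per-position: stored once, not once per record
    int     sclip_depth;  // per-position: overlapping soft-clipped reads
    uint8_t *base;        // per-record, packed: base[i] for record i
    uint8_t *qual;        // per-record, packed: qual[i] for record i
} pileup2_col;

// Accessors hide AoS-vs-SoA; packed arrays give sequential memory access.
static inline uint8_t p2_base(const pileup2_col *c, int i) { return c->base[i]; }
static inline uint8_t p2_qual(const pileup2_col *c, int i) { return c->qual[i]; }
static inline int     p2_is_indel(const pileup2_col *c)    { return c->is_indel; }
```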
