htslib in pysam.fetch on S3 Bucket #1670

Open
StephanHolgerD opened this issue Sep 7, 2023 · 4 comments

@StephanHolgerD

Hi, I want to report potentially problematic behaviour when using pysam.fetch on BAM files stored in AWS S3. Running the following pseudocode against a BAM file in an S3 bucket produces HTTP requests whose byte range has no defined end.

Code

import pysam
with pysam.AlignmentFile(bamfile_S3, filepath_index=baifile_S3) as f:
    for r in f.fetch(chrom, start, end):
        print(r.query_name)  # placeholder for any per-read processing

Request

image

This kind of 'open' request results in high egress costs, because AWS logs everything from the start byte to the end of the file as delivered, even if you stop reading at the end of your fetch coordinates.
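To make the difference concrete, here is a minimal sketch of the two forms of the HTTP `Range` header and of the byte counts that would be logged under the behaviour described above. The helper names and the example sizes are hypothetical; only the header syntax (`bytes=first-` versus `bytes=first-last`, with an inclusive last byte) comes from the HTTP specification.

```python
def range_header(first_byte, last_byte=None):
    """Build an HTTP Range header value; last_byte is inclusive, None means open-ended."""
    if first_byte < 0:
        raise ValueError("first_byte must be non-negative")
    if last_byte is None:
        # Open-ended form: "everything from first_byte to the end of the object".
        return f"bytes={first_byte}-"
    if last_byte < first_byte:
        raise ValueError("last_byte must be >= first_byte")
    # Bounded form: exactly last_byte - first_byte + 1 bytes are requested.
    return f"bytes={first_byte}-{last_byte}"


def logged_bytes(file_size, first_byte, last_byte=None):
    """Bytes counted as delivered, ASSUMING the accounting reported in this issue:
    an open-ended request is logged through to the end of the file."""
    if last_byte is None:
        return file_size - first_byte
    return min(last_byte, file_size - 1) - first_byte + 1


# Hypothetical numbers: a 50 GiB BAM file, a 64 KiB block starting at 10 GiB.
GIB = 1024 ** 3
file_size = 50 * GIB
first = 10 * GIB
last = first + 64 * 1024 - 1

print(range_header(first), logged_bytes(file_size, first))
print(range_header(first, last), logged_bytes(file_size, first, last))
```

Under that assumption, the open-ended request is accounted as 40 GiB while the bounded one is accounted as 64 KiB, even though the client reads the same data in both cases.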

For comparison, requests from IGV on the same S3 data incur low egress costs, because only the exact byte range is logged:

Request

image

Initially I reported this here:

pysam-developers/pysam#1215

@daviesrob
Member

That looks unfortunate. We'll investigate and see if we can make these requests less open-ended. It may need a bit of rework to how our HTTP requests work, though, so I can't be sure how long it will take.

@StephanHolgerD
Author

Thanks. Initially I used pysam.fetch, which created the problematic open-ended requests.
After some debugging I switched to pysam.view (more or less a wrapper around the samtools view command), which created clean range requests.

@StephanHolgerD
Author

I checked the requests from samtools view: they are also open-ended and will create inflated egress costs.
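For reference, a pysam.view call of the kind discussed in this thread might look like the sketch below. It is not the code actually used here: the helper names and S3 URLs are hypothetical, it assumes the samtools bundled with pysam supports the `-X` option for passing an explicit index path, and it assumes `pysam.view` returns the SAM text. Per the comment above, it should not be expected to avoid open-ended requests.

```python
def samtools_region(chrom, start, end):
    """Convert pysam-style 0-based half-open coordinates to a
    samtools-style 1-based inclusive region string."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return f"{chrom}:{start + 1}-{end}"


def view_args(bam_url, bai_url, chrom, start, end):
    """Argument list for samtools view with an explicitly named index file."""
    # -X: the index path is given as an argument right after the BAM path.
    return ["-X", bam_url, bai_url, samtools_region(chrom, start, end)]


def view_region(bam_url, bai_url, chrom, start, end):
    """Run the view through pysam (needs pysam and access to the bucket)."""
    import pysam  # imported lazily so the helpers above work without pysam

    return pysam.view(*view_args(bam_url, bai_url, chrom, start, end))


# Hypothetical URLs, shown only to illustrate the argument layout.
print(view_args(
    "s3://example-bucket/sample.bam",
    "s3://example-bucket/sample.bam.bai",
    "chr1", 999, 2000,
))
```

The coordinate conversion matters because fetch takes 0-based half-open intervals while samtools region strings are 1-based and inclusive.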

@apena23

apena23 commented May 21, 2024

Hi @StephanHolgerD, did pysam.view end up producing clean range requests? I wonder if you would be able to share some code showing how you implemented it, as I am looking for similar functionality and haven't found an easy solution!
