Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: Add option to tabix to interleave multiple bgzipped indexed files, without the need for external resorting. #1338

Open
ghuls opened this issue Oct 4, 2021 · 12 comments
Milestone

Comments

@ghuls
Copy link

ghuls commented Oct 4, 2021

Add option to tabix to interleave multiple bgzipped indexed files, without the need for external resorting.

# bgzipped file*.bed.gz files with index.

# "Concatenate" indexed bgzipped files so they result in a sorted bgzip output file.
tabix --cat -o output.bed.gz file1.bed.gz file2.bed.gz ...
@jkbonfield
Copy link
Contributor

jkbonfield commented Oct 4, 2021

Just to be clear, are you calling this "cat" as file1.bed.gz, file2.bed.gz, file3.bed.gz etc are internally sorted with the first item in file2 coming after the last in file1, and so on? (Eg chr1.bed, chr2.bed, etc)? I could understand that being called "cat". (Edit: but that doesn't feel like interleaving to me.)

Or are you saying that file1.bed, file2.bed,... are sorted and may overlap, and the function just takes the next record from each, whichever should be sorted first? This is the impression I get when you say "interleave". If so that's "merge" and not "cat". Both choices have valid use cases IMO.

@ghuls
Copy link
Author

ghuls commented Oct 6, 2021

Merge is indeed the word I was looking for. file1.bed and file2.bed are sorted, combine them so the output is still sorted (by only sorting the needed blocks and not the whole files).

So this can be avoided (mainly the full sort step):

zcat file1.bed.gz file2.bed.gz | sort -k 1,1 -k 2,2n -k 3,3n | bgzip ... 

So basically samtools merge but for non BAM/CRAM files.

@jmarshall
Copy link
Member

jmarshall commented Oct 6, 2021

I initially had the impression you meant a way to make a single index file that would enable queries over the set of input files (in place), doing some kind of gathered interleaving of records read directly from the individual input files. While just about plausible, this would be rather impractical to implement!

A form of samtools merge for the sort of generic genome-position-sorted files that tabix indexes is a more plausible beast. While it could be implemented as a separate tool, it would also fit into/alongside tabix without much difficulty.

Because tbx_itr_querys(tbx. ".") (assuming "." works for tbx indexes!) returns curr_tid/curr_beg/curr_end, it would be fairly easy to do the comparison between records to choose which record to output into the merged output first, at least for a.curr_beg <=> b.curr_beg.

The major wrinkle is in comparing tid values. Because at least in theory, file1.bed.gz, file2.bed.gz, … fileN.bed.gz may each contain different subsets of the complete set of chromosomes, the integer tid values are not necessarily meaningfully comparable between different input files. So you would need to do some kind of clever building of and lookup into a merged chromname→tid dictionary, or (easier) also require an option providing a file (e.g. an fai file or bedtools genome file) giving the desired ordering of chromosomes.

@jkbonfield
Copy link
Contributor

jkbonfield commented Oct 6, 2021

It's an interesting one. For SAM and VCF of course there are already dedicated merge tools. We could do another for BED, but tabix is totally generic. It has configurations stating which fields to index on.

However it strikes me there is one thing missing here in that configuration - collation order. I think tabix can index something working on the assumption that the data is sorted and therefore it's just recording the change from one chromosome to another, but I don't think it would care if that order was chr 1,2,3,...9,10,11 or chr 1,10,11,12...,19,2,20,21,... (ie ascii sort order rather than alphanumeric). That works because the entire index is loaded into memory and a genome chromosome query just fetches that part of index. Maybe I'm wrong here, but I'm pretty sure our indices have no notion of collation.

However, you do need to know the collation order when you're merging. For SAM and VCF that's done by the order of the SQ or contig lines in the headers, but no such thing exists in BED (for example), or any other arbitrary tab delimited file. Therefore this would require some substantial tabix additions I think, either a general purpose header parser capability of extracting file specific collation order, or a general purpose mechanism to describe collation orders.

Edit: I guess if the files you're merging have themselves already been indexed, then you have a file specific collation order already defined in each index. I guess that's what John is referring to by the tid comments. As he explains, you'd need to map back to some unified numbering so they can be sorted properly.

@ghuls
Copy link
Author

ghuls commented Oct 6, 2021

@jkbonfield would that still be true when usage of this feature would require index files to be there (and get the ordering from there)?
In the worst case, the user provides an external ordered list (similar to the chromosome file needed for bedtools).

@jkbonfield
Copy link
Contributor

If we require it to be indexed first, then it becomes an easier proposal and no additional collation order information is needed. (See my dawning realisation in the "Edit:" comment above). Thinking further, I'm assuming now you wanted this to work on regions too (hence indexed files). As you didn't mention this I was thinking of whole files without, but as you comment you can just do that with unix sort anyway so why would we even need a new tool; other than time complexity as merging is faster than sort, but not necessarily hugely so.

A new tool however really helps if you want to do this on parts of the data, which implies indices. That's a (simpler) on-the-fly alternative to the multi-file indexing problem that John mentioned above. I can see the benefit of having such a tool.

It's not a totally trivial bit of work though, so I don't see this feature arriving in the immediate future at least.

@ghuls
Copy link
Author

ghuls commented Oct 6, 2021

As you didn't mention this I was thinking of whole files without, but as you comment you can just do that with unix sort anyway so why would we even need a new tool; other than time complexity as merging is faster than sort, but not necessarily hugely so.

In my case the bgzipped BED files are several GB each. I assumed a merging approach should be much faster than a naive full sort of the data.

@jkbonfield
Copy link
Contributor

Maybe :-) It feels more like an improved version of Unix comm.

Anyway in the short term you should probably use bedtools merge:

https://bedtools.readthedocs.io/en/latest/content/tools/merge.html

@ghuls
Copy link
Author

ghuls commented Oct 6, 2021

bedtools merge does something completely different. First you need one file as input, secondly it will merge overlapping intervals to a new bed entry, not "sorting" input intervals and preserving them.

@daviesrob daviesrob added this to the wishlist milestone Oct 19, 2021
@ghuls
Copy link
Author

ghuls commented Dec 8, 2023

If we require it to be indexed first, then it becomes an easier proposal and no additional collation order information is needed. (See my dawning realisation in the "Edit:" comment above). Thinking further, I'm assuming now you wanted this to work on regions too (hence indexed files). As you didn't mention this I was thinking of whole files without, but as you comment you can just do that with unix sort anyway so why would we even need a new tool; other than time complexity as merging is faster than sort, but not necessarily hugely so.

I was looking for using it on whole files.

An example for which it would be useful, would be to merge fragments.tsv.gz (BED with count in column 5) files from scATAC data from different sequencing runs, which are already sorted in the same chromosome order as those files are created directly from mapped BAM files.

It's not a totally trivial bit of work though, so I don't see this feature arriving in the immediate future at least.

Any chance it would be considered now?

@jkbonfield
Copy link
Contributor

Unfortunately we've got 2 new projects on the go in addition to samtools (and the team hasn't been given any more resources) so our resources are stretched. So we can't commit to any specific time line, if ever sadly.

It is a reasonable idea though.

@ghuls
Copy link
Author

ghuls commented Dec 8, 2023

It is a pity your are not getting more resources. Hopefully the situation improves in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants