Feature request: Add option to tabix to interleave multiple bgzipped indexed files, without the need for external resorting. #1338

ghuls · 2021-10-04T09:39:45Z

Add option to tabix to interleave multiple bgzipped indexed files, without the need for external resorting.

# bgzipped file*.bed.gz files with index.

# "Concatenate" indexed bgzipped files so they result in a sorted bgzip output file.
tabix --cat -o output.bed.gz file1.bed.gz file2.bed.gz ...

The text was updated successfully, but these errors were encountered:

jkbonfield · 2021-10-04T10:47:17Z

Just to be clear, are you calling this "cat" as file1.bed.gz, file2.bed.gz, file3.bed.gz etc are internally sorted with the first item in file2 coming after the last in file1, and so on? (Eg chr1.bed, chr2.bed, etc)? I could understand that being called "cat". (Edit: but that doesn't feel like interleaving to me.)

Or are you saying that file1.bed, file2.bed,... are sorted and may overlap, and the function just takes the next record from each, whichever should be sorted first? This is the impression I get when you say "interleave". If so that's "merge" and not "cat". Both choices have valid use cases IMO.

ghuls · 2021-10-06T09:59:24Z

Merge is indeed the word I was looking for. file1.bed and file2.bed are sorted, combine them so the output is still sorted (by only sorting the needed blocks and not the whole files).

So this can be avoided (mainly the full sort step):

zcat file1.bed.gz file2.bed.gz | sort -k 1,1 -k 2,2n -k 3,3n | bgzip ...

So basically samtools merge but for non BAM/CRAM files.

jmarshall · 2021-10-06T10:25:40Z

I initially had the impression you meant a way to make a single index file that would enable queries over the set of input files (in place), doing some kind of gathered interleaving of records read directly from the individual input files. While just about plausible, this would be rather impractical to implement!

A form of samtools merge for the sort of generic genome-position-sorted files that tabix indexes is a more plausible beast. While it could be implemented as a separate tool, it would also fit into/alongside tabix without much difficulty.

Because tbx_itr_querys(tbx. ".") (assuming "." works for tbx indexes!) returns curr_tid/curr_beg/curr_end, it would be fairly easy to do the comparison between records to choose which record to output into the merged output first, at least for a.curr_beg <=> b.curr_beg.

The major wrinkle is in comparing tid values. Because at least in theory, file1.bed.gz, file2.bed.gz, … fileN.bed.gz may each contain different subsets of the complete set of chromosomes, the integer tid values are not necessarily meaningfully comparable between different input files. So you would need to do some kind of clever building of and lookup into a merged chromname→tid dictionary, or (easier) also require an option providing a file (e.g. an fai file or bedtools genome file) giving the desired ordering of chromosomes.

jkbonfield · 2021-10-06T10:39:27Z

It's an interesting one. For SAM and VCF of course there are already dedicated merge tools. We could do another for BED, but tabix is totally generic. It has configurations stating which fields to index on.

However it strikes me there is one thing missing here in that configuration - collation order. I think tabix can index something working on the assumption that the data is sorted and therefore it's just recording the change from one chromosome to another, but I don't think it would care if that order was chr 1,2,3,...9,10,11 or chr 1,10,11,12...,19,2,20,21,... (ie ascii sort order rather than alphanumeric). That works because the entire index is loaded into memory and a genome chromosome query just fetches that part of index. Maybe I'm wrong here, but I'm pretty sure our indices have no notion of collation.

However, you do need to know the collation order when you're merging. For SAM and VCF that's done by the order of the SQ or contig lines in the headers, but no such thing exists in BED (for example), or any other arbitrary tab delimited file. Therefore this would require some substantial tabix additions I think, either a general purpose header parser capability of extracting file specific collation order, or a general purpose mechanism to describe collation orders.

Edit: I guess if the files you're merging have themselves already been indexed, then you have a file specific collation order already defined in each index. I guess that's what John is referring to by the tid comments. As he explains, you'd need to map back to some unified numbering so they can be sorted properly.

ghuls · 2021-10-06T11:10:45Z

@jkbonfield would that still be true when usage of this feature would require index files to be there (and get the ordering from there)?
In the worst case, the user provides an external ordered list (similar to the chromosome file needed for bedtools).

jkbonfield · 2021-10-06T11:23:53Z

If we require it to be indexed first, then it becomes an easier proposal and no additional collation order information is needed. (See my dawning realisation in the "Edit:" comment above). Thinking further, I'm assuming now you wanted this to work on regions too (hence indexed files). As you didn't mention this I was thinking of whole files without, but as you comment you can just do that with unix sort anyway so why would we even need a new tool; other than time complexity as merging is faster than sort, but not necessarily hugely so.

A new tool however really helps if you want to do this on parts of the data, which implies indices. That's a (simpler) on-the-fly alternative to the multi-file indexing problem that John mentioned above. I can see the benefit of having such a tool.

It's not a totally trivial bit of work though, so I don't see this feature arriving in the immediate future at least.

ghuls · 2021-10-06T11:36:32Z

As you didn't mention this I was thinking of whole files without, but as you comment you can just do that with unix sort anyway so why would we even need a new tool; other than time complexity as merging is faster than sort, but not necessarily hugely so.

In my case the bgzipped BED files are several GB each. I assumed a merging approach should be much faster than a naive full sort of the data.

jkbonfield · 2021-10-06T11:46:59Z

Maybe :-) It feels more like an improved version of Unix comm.

Anyway in the short term you should probably use bedtools merge:

https://bedtools.readthedocs.io/en/latest/content/tools/merge.html

ghuls · 2021-10-06T12:14:41Z

bedtools merge does something completely different. First you need one file as input, secondly it will merge overlapping intervals to a new bed entry, not "sorting" input intervals and preserving them.

ghuls · 2023-12-08T09:27:23Z

If we require it to be indexed first, then it becomes an easier proposal and no additional collation order information is needed. (See my dawning realisation in the "Edit:" comment above). Thinking further, I'm assuming now you wanted this to work on regions too (hence indexed files). As you didn't mention this I was thinking of whole files without, but as you comment you can just do that with unix sort anyway so why would we even need a new tool; other than time complexity as merging is faster than sort, but not necessarily hugely so.

I was looking for using it on whole files.

An example for which it would be useful, would be to merge fragments.tsv.gz (BED with count in column 5) files from scATAC data from different sequencing runs, which are already sorted in the same chromosome order as those files are created directly from mapped BAM files.

It's not a totally trivial bit of work though, so I don't see this feature arriving in the immediate future at least.

Any chance it would be considered now?

jkbonfield · 2023-12-08T10:00:31Z

Unfortunately we've got 2 new projects on the go in addition to samtools (and the team hasn't been given any more resources) so our resources are stretched. So we can't commit to any specific time line, if ever sadly.

It is a reasonable idea though.

ghuls · 2023-12-08T13:07:18Z

It is a pity your are not getting more resources. Hopefully the situation improves in the future.

daviesrob added this to the wishlist milestone Oct 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature request: Add option to tabix to interleave multiple bgzipped indexed files, without the need for external resorting. #1338

Feature request: Add option to tabix to interleave multiple bgzipped indexed files, without the need for external resorting. #1338

ghuls commented Oct 4, 2021

jkbonfield commented Oct 4, 2021 •

edited

ghuls commented Oct 6, 2021 •

edited

jmarshall commented Oct 6, 2021 •

edited

jkbonfield commented Oct 6, 2021 •

edited

ghuls commented Oct 6, 2021 •

edited

jkbonfield commented Oct 6, 2021

ghuls commented Oct 6, 2021

jkbonfield commented Oct 6, 2021

ghuls commented Oct 6, 2021

ghuls commented Dec 8, 2023 •

edited

jkbonfield commented Dec 8, 2023

ghuls commented Dec 8, 2023

Feature request: Add option to tabix to interleave multiple bgzipped indexed files, without the need for external resorting. #1338

Feature request: Add option to tabix to interleave multiple bgzipped indexed files, without the need for external resorting. #1338

Comments

ghuls commented Oct 4, 2021

jkbonfield commented Oct 4, 2021 • edited

ghuls commented Oct 6, 2021 • edited

jmarshall commented Oct 6, 2021 • edited

jkbonfield commented Oct 6, 2021 • edited

ghuls commented Oct 6, 2021 • edited

jkbonfield commented Oct 6, 2021

ghuls commented Oct 6, 2021

jkbonfield commented Oct 6, 2021

ghuls commented Oct 6, 2021

ghuls commented Dec 8, 2023 • edited

jkbonfield commented Dec 8, 2023

ghuls commented Dec 8, 2023

jkbonfield commented Oct 4, 2021 •

edited

ghuls commented Oct 6, 2021 •

edited

jmarshall commented Oct 6, 2021 •

edited

jkbonfield commented Oct 6, 2021 •

edited

ghuls commented Oct 6, 2021 •

edited

ghuls commented Dec 8, 2023 •

edited