Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-text characters appearing in SEQ field #402

Open
csoeder opened this issue Sep 6, 2023 · 0 comments
Open

Non-text characters appearing in SEQ field #402

csoeder opened this issue Sep 6, 2023 · 0 comments

Comments

@csoeder
Copy link

csoeder commented Sep 6, 2023

I had an odd error crop up. I aligned my reads as usual:
bwa samse droSim1.fa fragSimulated_BCM10NE.clean.R0.fastq.droSim1.sai fragSimulated_BCM10NE.clean.R0.fastq -r '@RG\tID:foo\tSM:bar' > test.sam
(I have tried this with and without the readgroup flag, with the same outcome)

but when I tried to manipulate the alignment with samtools I got this:
$ samtools view test.sam > /dev/null
[W::sam_read1_sam] Parse error at line 28
samtools view: error reading file "test.sam"

There was nothing obviously wrong with line 28:

$ head -n 28 test.sam | tail -n 2
READ_7 0 chr2h_random 1166746 37 100M * 0 0 GCAAACCTATTTGAGCCTGCTTCAGACACGACGGTGAGGTATGCACTGTTTCGATGTAAAGAGAGTCGGCGCTCGTCTTGCTCATTTTGCCGCTGAGCGC BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB RG:Z:foo XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100
READ_8 4 * 0 0 * * 0 0 TCGGTGCACAGAAAGAAAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGGGGGTTGAGGCTTAGAAGGGGGCGTGGCCGGGCGGAT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB RG:Z:foo

I experimented a bit; deleting the offending line just finds another one at line 353, and so on. Some are unmapped; some are mapped. Oddly, although the file appeared to be an uncompressed SAM, it sometimes behaved as though it were a binary, such as not grepping properly:
grep READ test.sam
Binary file test.sam matches
file test.sam
test.sam: data

Then, I noticed that the problematic lines have a length mismatch between the SEQ and QUAL fields, which seemed like it could be an issue, but have never encountered this happening before.

Finally, while I was examining the file in less, I noticed this:

READ_8 4 * 0 0 * * 0 0 TCGGTGCACAGAAAGAAAANNNNNNNNNNNNNNNNNNNNNNNNN^@NNNNNNNNNNNNNNNNGGGGGGTTGAGGCTTAGAAGGGGGCGTGGCCGGGCGGAT BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB RG:Z:foo

There seems to be a mystery character, "^@", which has found its way into the SEQ field instead of a proper character. I don't know how samtools interprets it, but when sent to standard output it just disappears, giving a string with one fewer character than it's supposed to have. Within less, it appears with the reverse coloration that you see when you open a binary file, which might explain some previous observations.

The only thing I'm doing that's slightly unusual is that I'm aligning pseudoreads synthetically generated by fragmenting a reference genome, but I don't think that's responsible (I've recently used this pipeline without issue).

Any tips on how to make this not happen?

Thanks,
Charlie

@csoeder csoeder changed the title Non-text characters appearing in Non-text characters appearing in SEQ field Sep 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant