Talk:FASTQ format

WikiProject Computational Biology (Rated B-class, Mid-importance)
This article is within the scope of WikiProject Computational Biology, a collaborative effort to improve the coverage of Computational Biology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as B-Class on the quality scale and as Mid-importance on the importance scale.
 

Technically the FASTQ format is multi-line, but its use in short-read sequencing, where each read fits on a single line, obviously disguises this issue.

Hence sequences may be line-wrapped, and quality values too. Given that @ is a legal quality value and may occur just after a newline in a line-wrapped quality string, care must be taken when parsing it. The ideal solution here is simply to count the number of bases in the sequence lines and then parse with the expectation of the same number of characters in the quality lines. (If a new sequence header does not start immediately after the quality string, then the file is in error.)

Unfortunately many people have implemented broken parsers, so you'll sometimes see ghastly messes where the first quality value on each line has been changed to zero (ASCII '!'). This is just a bug!

193.62.203.214 (talk) 15:36, 16 April 2009 (UTC) jkb
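
A minimal Python sketch of the counting approach described above, for illustration only. The function name `parse_fastq` and its list-of-lines input are assumptions made for this example, not part of any existing tool. It reads quality lines until it has as many characters as there were bases, so an '@' at the start of a wrapped quality line is never mistaken for a header.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from possibly line-wrapped FASTQ.

    Hypothetical helper: quality is read by length, never by scanning for '@'.
    """
    i, n = 0, len(lines)
    while i < n:
        line = lines[i].rstrip("\r\n")
        if not line:  # tolerate blank lines between records
            i += 1
            continue
        if not line.startswith("@"):
            # Anything other than a header right after a record is an error.
            raise ValueError(f"line {i + 1}: expected '@' header")
        header = line[1:]
        i += 1

        # Sequence lines run until the '+' separator ('+' is not a base code).
        seq_parts = []
        while i < n and not lines[i].startswith("+"):
            seq_parts.append(lines[i].strip())
            i += 1
        if i >= n:
            raise ValueError(f"record {header!r}: missing '+' separator")
        seq = "".join(seq_parts)
        i += 1  # skip the '+' line

        # Consume quality lines until their total length matches the sequence.
        qual_parts = []
        qual_len = 0
        while i < n and qual_len < len(seq):
            part = lines[i].rstrip("\r\n")
            qual_parts.append(part)
            qual_len += len(part)
            i += 1
        if qual_len != len(seq):
            raise ValueError(
                f"record {header!r}: {len(seq)} bases but {qual_len} quality values"
            )
        yield header, seq, "".join(qual_parts)
```

For example, a record whose second quality line is "@III" is still parsed as one record, because the parser already knows it needs four more quality characters at that point.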

The Celera Assembler implements yet another quality format based on this theme...

The input for the Celera Assembler is a 'frg' file [1]

Apparently they take the (presumably Phred-style) quality score and add 48 before converting to ASCII for storage in the frg file, i.e. "chr(ord(0)+$qual)".

--Dan|(talk) 15:27, 30 July 2009 (UTC)
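
A small Python sketch of the encoding as described in the comment above, assuming the offset really is 48 (the ASCII code of '0') and that the scores are Phred-style. The function names are made up for illustration and are not taken from the Celera Assembler source.

```python
from typing import List

FRG_OFFSET = ord("0")   # 48, as described in the comment above (assumption)
SANGER_OFFSET = 33      # standard Sanger FASTQ offset


def encode_frg_quality(scores: List[int]) -> str:
    """Encode integer quality scores with an offset of 48."""
    return "".join(chr(FRG_OFFSET + q) for q in scores)


def decode_frg_quality(qual: str) -> List[int]:
    """Recover integer quality scores from an offset-48 string."""
    return [ord(c) - FRG_OFFSET for c in qual]


def frg_to_sanger(qual: str) -> str:
    """Re-encode an offset-48 quality string as Sanger (Phred+33)."""
    return "".join(chr(ord(c) - FRG_OFFSET + SANGER_OFFSET) for c in qual)
```

Under these assumptions a score of 0 is stored as '0' and a score of 40 as 'X', whereas Sanger FASTQ would store them as '!' and 'I'.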

The AMOS .afg format uses the same encoding

IonTorrent quality range

I've seen some IonTorrent quality values and they seem to have a different range from Sanger or Illumina. However, I don't have access to such a machine or its output, so I can't be sure. Can anyone with the machine confirm and put the range up? -- Preceding unsigned comment added by Hena wp (talk • contribs) 18:25, 30 April 2013 (UTC)

Would adding color to the FASTQ versions text make it clearer?

  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
  ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
  .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126
  0........................26...31.......40                                
                           -5....0........9.............................40 
                                 0........9.............................40 
                                    3.....9.............................40 
  0........................26...31........41                               

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) 
    (Note: See discussion above).
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)

Colors picked at random, and I don't absolutely guarantee that the alignment is correct. And there appears to be a problem with the J alignment in the original figure.

Tnabtaf (talk) 02:17, 22 October 2012 (UTC)

Got no comments; posting to page.

Tnabtaf (talk) 05:59, 22 January 2013 (UTC)
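
The offsets and typical ranges in the legend above can be written down as a small lookup table. The following Python sketch is only an illustration: the dictionary keys and the function names `decode_quality` and `compatible_encodings` are invented for this example, and the ranges are the "raw reads typically" values from the legend, not hard limits.

```python
from typing import Dict, List, Tuple

# name -> (ASCII offset, typical minimum score, typical maximum score),
# copied from the legend above; the key names are hypothetical.
ENCODINGS: Dict[str, Tuple[int, int, int]] = {
    "sanger": (33, 0, 40),
    "solexa": (64, -5, 40),
    "illumina-1.3": (64, 0, 40),
    "illumina-1.5": (64, 3, 40),  # score 2 ('B') is an indicator, not a raw score
    "illumina-1.8": (33, 0, 41),
}


def decode_quality(qual: str, encoding: str) -> List[int]:
    """Convert a quality string to integer scores using the given offset."""
    offset, _, _ = ENCODINGS[encoding]
    return [ord(c) - offset for c in qual]


def compatible_encodings(qual: str) -> List[str]:
    """List encodings whose typical raw-read range covers every character."""
    if not qual:
        return list(ENCODINGS)
    lo, hi = min(map(ord, qual)), max(map(ord, qual))
    return [
        name
        for name, (offset, qmin, qmax) in ENCODINGS.items()
        if offset + qmin <= lo and hi <= offset + qmax
    ]
```

A string such as "!5I" is only compatible with the two Phred+33 variants, while "hhh" is only compatible with the three +64 variants. Many real strings fall in the overlap, which is why the encoding usually cannot be determined from a single read.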

Sequence letter definitions?

I'm writing a FASTQ parser for Illumina exome data, and I found this article very useful! Thanks for writing it. The only thing I see missing from this article that would help me complete the parser is a set of sequence letter definitions. I see ACTG throughout the Illumina data, which makes sense, but I don't know what 'N' stands for. I'll figure it out, but it would be cool if sequence letters were documented here. WaywardGeek (talk) 12:00, 5 August 2013 (UTC)
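
For reference, in the standard IUPAC nucleotide notation 'N' means "any base", and it generally appears in reads where a base could not be called. A short Python sketch of the DNA codes follows; the table name and the helper `invalid_letters` are hypothetical, and whether a given FASTQ file uses anything beyond A, C, G, T and N is an assumption to check against the data.

```python
from typing import Dict, List

# IUPAC nucleotide codes (DNA) mapped to the bases each one can represent.
IUPAC_NUCLEOTIDES: Dict[str, str] = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",  # any base, e.g. an uncalled position
}


def invalid_letters(seq: str) -> List[str]:
    """Return the sorted letters in seq that are not IUPAC nucleotide codes."""
    return sorted(set(seq.upper()) - set(IUPAC_NUCLEOTIDES))
```

A parser could call `invalid_letters` on each sequence line and reject the record if the result is non-empty.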