N50, L50, and related statistics: Difference between revisions

Revision as of 17:54, 18 February 2014

In computational biology, the N50 statistic summarizes a set of contig or scaffold lengths. It is similar to a mean or median, but gives greater weight to the longer contigs. It is widely used in genome assembly, especially to describe contig lengths within a draft assembly. Given a set of contigs, each with its own length, the N50 length is defined as the length for which the contigs of that length or longer contain at least half of the total length of all contigs, and for which the contigs of that length or shorter also contain at least half of that total. (When more than one length meets both criteria, the N50 is the average of the longest and shortest lengths that meet them.) The N50 can be thought of as the point of half of the mass of the distribution: the number of bases in all contigs shorter than the N50 is close to the number of bases in all contigs longer than the N50. The N90 statistic is smaller than or equal to the N50 statistic; it is the length for which the contigs of that length or longer contain at least 90% of the total length, and for which the contigs of that length or shorter contain at least 10% of the total length.
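The two-sided definition above can be checked directly with a short Python sketch. The function name `nx_by_definition` and its `percent` parameter are hypothetical, and the sketch assumes that only lengths actually present in the set are considered as candidates, which the definition does not state explicitly.

```python
def nx_by_definition(lengths, percent=50):
    """Return the Nx length using the two-sided definition given above.

    `lengths` is an iterable of positive integer contig lengths and
    `percent` is an integer such as 50 (N50) or 90 (N90).
    Assumption: only lengths that occur in the input are candidates.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)

    qualifying = []
    for candidate in sorted(set(lengths)):
        at_or_above = sum(n for n in lengths if n >= candidate)
        at_or_below = sum(n for n in lengths if n <= candidate)
        # Integer arithmetic avoids floating-point trouble at the boundary.
        if (at_or_above * 100 >= percent * total
                and at_or_below * 100 >= (100 - percent) * total):
            qualifying.append(candidate)

    # Average the shortest and longest lengths that meet both criteria.
    return (qualifying[0] + qualifying[-1]) / 2
```

For a set such as (2, 2, 2, 3, 3, 4, 8, 8), both 4 and 8 meet the two criteria, so the sketch returns their average. It recomputes both sums for every candidate, which is simple but quadratic in the number of distinct lengths.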

NG50

Note that N50 is calculated in the context of the assembly size rather than the genome size. Therefore, comparisons of N50 values derived from different organisms, which may have different genome sizes, are usually not informative. To address this, the authors of the Assemblathon competition derived a new measure called NG50. The NG50 statistic is the same as N50 except that the known or estimated genome size is used rather than the assembly size. This allows for meaningful comparisons between different assemblies.

Examples

Consider two fictional, and highly simplified, genome assemblies (A & B) that are derived from two different species. Assembly A contains six contigs of lengths 80 Kbp, 70 Kbp, 50 Kbp, 40 Kbp, 30 Kbp, and 20 Kbp. The total size of assembly A is 290 Kbp, so the N50 contig length is 70 Kbp (because 80 + 70 = 150 Kbp is greater than 50% of 290 Kbp). Now let's assume that the contig lengths of assembly B are the same as those of assembly A except for the presence of two additional contigs of 10 Kbp and 5 Kbp. The size of assembly B is thus 305 Kbp, and the N50 contig length drops to 50 Kbp (80 + 70 = 150 Kbp is less than 50% of 305 Kbp, whereas 80 + 70 + 50 = 200 Kbp is greater). This example illustrates that one can sometimes increase the N50 length simply by removing some of the shortest contigs or scaffolds from an assembly.

If the estimated or known size of the genome of the fictional species A were 500 Kbp, then the NG50 contig length would be 30 Kbp (80 + 70 + 50 + 40 + 30 = 270 Kbp is greater than 50% of 500 Kbp, whereas 80 + 70 + 50 + 40 = 240 Kbp is not). In contrast, if the estimated or known size of the genome of species B were 350 Kbp, then assembly B would have an NG50 contig length of 50 Kbp (80 + 70 + 50 = 200 Kbp is greater than 50% of 350 Kbp, whereas 80 + 70 = 150 Kbp is not).
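The arithmetic in these examples follows the common procedure of sorting contigs from longest to shortest and accumulating lengths until half of a reference size is reached. A minimal Python sketch of that procedure is given below; the function name `nx_length` and the `reference_size` parameter are hypothetical. Note that it returns a single contig length and does not average in the tie case described in the definition at the top of the article.

```python
def nx_length(lengths, percent=50, reference_size=None):
    """Return the contig length at which the cumulative sum of contigs,
    taken from longest to shortest, first reaches `percent` of the
    reference size.

    With reference_size=None the assembly size is used (N50 for percent=50);
    passing a known or estimated genome size gives the NG-style statistic.
    Returns None if the assembly never reaches the threshold.
    """
    ordered = sorted(lengths, reverse=True)
    reference = sum(ordered) if reference_size is None else reference_size
    running = 0
    for n in ordered:
        running += n
        # Integer comparison, equivalent to running >= percent/100 * reference.
        if running * 100 >= percent * reference:
            return n
    return None


# Contig lengths in Kbp, taken from the two fictional assemblies.
assembly_a = [80, 70, 50, 40, 30, 20]
assembly_b = assembly_a + [10, 5]

print(nx_length(assembly_a))                      # N50 of A: 70
print(nx_length(assembly_b))                      # N50 of B: 50
print(nx_length(assembly_a, reference_size=500))  # NG50 of A: 30
print(nx_length(assembly_b, reference_size=350))  # NG50 of B: 50
```

Returning `None` when the threshold is never reached is a design choice of this sketch: an assembly that covers less than half of the genome size has no NG50 under this procedure.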

Alternate computation

N50 can be found mathematically for a list L of positive integers as follows:

Create another list L', which is identical to L, except that every element n in L has been replaced with n copies of itself.
The median of L' is the N50 of L. (The 10% quantile of L' is the N90 statistic.)

For example: if L = (2, 2, 2, 3, 3, 4, 8, 8), then L' consists of six 2s, six 3s, four 4s, and sixteen 8s. That is, L' has twice as many 2s as L, three times as many 3s, four times as many 4s, and so on. The median of the 32-element list L' is the average of the 16th smallest element, 4, and the 17th smallest element, 8, so the N50 is 6. The sum of all values in L that are smaller than or equal to the N50 of 6 is 16 = 2+2+2+3+3+4, and the sum of all values in L that are larger than or equal to 6 is also 16 = 8+8. For comparison with the N50 of 6, note that the mean of the list L is 4 while the median is 3.
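The expansion described in this section can be sketched in a few lines of Python using the standard library's `statistics.median`. The function name `n50_by_expansion` is hypothetical, and the sketch is meant only to illustrate the construction of L'.

```python
from statistics import median


def n50_by_expansion(lengths):
    """Compute N50 as the median of L', where every element n of the
    input list is replaced by n copies of itself."""
    expanded = []
    for n in lengths:
        expanded.extend([n] * n)
    return median(expanded)


print(n50_by_expansion([2, 2, 2, 3, 3, 4, 8, 8]))  # 6.0
```

Because L' has as many elements as the sum of all values in L, this approach is practical only for small illustrative lists; for real contig lengths measured in bases, a cumulative-sum computation avoids building L'.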

Contradictory definitions

Some contradictions have been identified among the definitions of the N50 value, as discussed in a thread on the SEQ Answers forum.

@@ Line 1: / Line 1: @@
-In [[computational biology]], the '''N50 statistic''' is a statistic of a set of [[contig]] lengths.  The ''N50'' is similar to a [[mean]] or [[median]], but has greater weight given to the longer contigs. It is used widely in [[genome assembly]], especially in reference to contig lengths within a draft assembly. Given a set of contigs, each with its own length, the ''N50'' length is defined as the length for which the collection of all contigs of that length or longer contains at least half of the total of the lengths of the contigs, and for which the collection of all contigs of that length or shorter contains at least half of the total of the lengths of the contigs. (When more than one value of length meets both these criteria then the ''N50'' is the average of the longest and shortest lengths that meet these criteria.)  This can be thought of as the point of half of the mass of the distribution; the number of [[nucleotide|base]]s from all contigs shorter than the ''N50'' will be close to equal to the number of bases from all contigs longer than the ''N50''. The '''N90 statistic''' is smaller than or equal to the ''N50'' statistic; it is the length for which the collection of all contigs of that length or longer contains at least 90% of the total of the lengths of the contigs, and for which the collection of all contigs of that length or shorter contains at least 10% of the total of the lengths of the contigs.
+In [[computational biology]], the '''N50 statistic''' is a statistic of a set of [[contig]] or [[scaffold]] lengths.  The ''N50'' is similar to a [[mean]] or [[median]], but has greater weight given to the longer contigs. It is used widely in [[genome assembly]], especially in reference to contig lengths within a draft assembly. Given a set of contigs, each with its own length, the ''N50'' length is defined as the length for which the collection of all contigs of that length or longer contains at least half of the total of the lengths of the contigs, and for which the collection of all contigs of that length or shorter contains at least half of the total of the lengths of the contigs. (When more than one value of length meets both these criteria then the ''N50'' is the average of the longest and shortest lengths that meet these criteria.)  This can be thought of as the point of half of the mass of the distribution; the number of [[nucleotide|base]]s from all contigs shorter than the ''N50'' will be close to equal to the number of bases from all contigs longer than the ''N50''. The '''N90 statistic''' is smaller than or equal to the ''N50'' statistic; it is the length for which the collection of all contigs of that length or longer contains at least 90% of the total of the lengths of the contigs, and for which the collection of all contigs of that length or shorter contains at least 10% of the total of the lengths of the contigs.
+==NG50==
-For example, for a genome of 600[[megabase|Mb]], when the assembled contigs add up to 500Mb, the ''N50'' can be calculated by sorting the contigs from longest to shortest and finding the length of the contig where the cumulative size reaches 250Mb. Note that ''N50'' is calculated in the context of the assembly size rather than the genome size. The '''NG50 statistic''' is the same as the ''N50'' except that the genome size is used rather than the assembly size.
+Note that ''N50'' is calculated in the context of the assembly size rather than the genome size. Therefore, comparisons of N50 values derived from different organisms, that may have different genome sizes,  are usually not informative. To address this, the authors of the [[Assemblathon]] competition derived a new measure called NG50.  The '''NG50 statistic''' is the same as ''N50'' except that the known or estimated genome size is used rather than the assembly size. This allows for meaningful comparisons between different assemblies.
+==Examples==
+Consider two fictional, and highly simplified, genome assemblies (A & B) that are derived from two different species. Assembly A contains six contigs of lengths 80 [[kilobase|Kbp]], 70 Kbp, 50 Kbp, 40 Kbp, 30 Kbp, and 20 Kbp. The total size of assembly A is 290 Kbp, so the N50 contig length is 70 Kbp (because 80 + 70 = 150 Kbp is greater than 50% of 290 Kbp). Now let's assume that the contig lengths of assembly B are the same as those of assembly A except for the presence of two additional contigs of 10 Kbp and 5 Kbp. The size of assembly B is thus 305 Kbp, and the N50 contig length drops to 50 Kbp (80 + 70 = 150 Kbp is less than 50% of 305 Kbp, whereas 80 + 70 + 50 = 200 Kbp is greater). This example illustrates that one can sometimes increase the N50 length simply by removing some of the shortest contigs or scaffolds from an assembly.
+If the estimated or known size of the genome of the fictional species A were 500 Kbp, then the ''NG50'' contig length would be 30 Kbp (80 + 70 + 50 + 40 + 30 = 270 Kbp is greater than 50% of 500 Kbp, whereas 80 + 70 + 50 + 40 = 240 Kbp is not). In contrast, if the estimated or known size of the genome of species B were 350 Kbp, then assembly B would have an NG50 contig length of 50 Kbp (80 + 70 + 50 = 200 Kbp is greater than 50% of 350 Kbp, whereas 80 + 70 = 150 Kbp is not).
 ==Alternate computation==
@@ Line 17: / Line 23: @@
 * [http://www.broad.harvard.edu/crd/wiki/index.php/N50 Arachne wiki] at [[Broad Institute]]
 * [http://www.ncbi.nlm.nih.gov/pubmed/20211242 "Assembly algorithms for next-generation sequencing data", Miller JR, Koren S, Sutton G]
+* [http://genome.cshlp.org/content/21/12/2224.ful "Assemblathon 1: A competitive assessment of de novo short read assembly methods"]
 ==See also==
 * [[Herfindahl–Hirschman Index]]
 [[Category:Bioinformatics]]
 [[Category:Genomics]]
