Biological data

Biological data are data or measurements collected from biological sources. They are often stored or exchanged in digital form, most commonly in files or databases. Examples include DNA base-pair sequences and population data used in ecology.

Data File Formats

Each file format has been designed with specific needs and outputs in mind.

  • AB1 – In DNA sequencing, chromatogram files used by instruments from Applied Biosystems
  • ACE – A sequence assembly format
  • BAM – Binary compressed SAM format
  • BED – The Browser Extensible Data format is used for describing genes and other features of DNA sequences
  • CAF – Common Assembly Format for sequence assembly
  • EMBL – The flatfile format used by the EMBL to represent database records for nucleotide and peptide sequences from EMBL databases
  • FASTA – The FASTA file format, for sequence data. Sometimes also given as FNA or FAA (Fasta Nucleic Acid or Fasta Amino Acid).
  • FASTQ – The FASTQ file format, for sequence data with quality. Sometimes also given as QUAL.
  • GenBank – The flatfile format used by the NCBI to represent database records for nucleotide and peptide sequences from the GenBank and RefSeq databases
  • GFF – The General Feature Format is used for describing genes and other features of DNA, RNA and protein sequences
  • GTF – The Gene Transfer Format is used to hold information about gene structure.
  • NEXUS – The Nexus file encodes mixed information about genetic sequence data in a block-structured format.
  • NWK – The Newick tree format represents graph-theoretical trees with edge lengths using parentheses and commas, and is useful for holding phylogenetic trees.
  • PDB – Structures of biomolecules deposited in the Protein Data Bank. Also used for exchanging protein/nucleic acid structures.
  • PHD – Phred output, from the basecalling software Phred
  • SAM – Sequence Alignment/Map format, a text format for sequence alignments, in which results of the 1000 Genomes Project were released.
  • SCF – Staden chromatogram files used to store data from DNA sequencing
  • SBML – The Systems Biology Markup Language is used to store biochemical network computational models
  • SFF – Standard Flowgram Format
  • Stockholm – The Stockholm format for representing multiple sequence alignments
  • Swiss-Prot – The flatfile format used to represent database records for protein sequences from the Swiss-Prot database
  • VCF – Variant Call Format, a standard created by the 1000 Genomes Project for storing sequence variants. The project used it to list and annotate its entire collection of human variants (with the exception of approximately 1.6 million variants).
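
To illustrate how two of the formats listed above differ, the following is a minimal Python sketch that reads FASTA records (sequence only) and FASTQ records (sequence with quality). It is not a reference parser: the function names parse_fasta and parse_fastq and the sample data are hypothetical, FASTQ records are assumed to span exactly four lines each, and quality characters are assumed to use the Phred+33 offset, which is common but not universal.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical sample data, made up for this illustration.
FASTA_TEXT = ">seq1 example\nACGT\nACGT\n>seq2\nGGCC\n"
FASTQ_TEXT = "@read1\nACGT\n+\nII5!\n"


def parse_fasta(text: str) -> Dict[str, str]:
    """Return {record_id: sequence} from FASTA-formatted text."""
    records: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the ID.
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            records[current] = []
        elif current is not None:
            # Sequence lines may be wrapped; collect them for joining later.
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def parse_fastq(text: str) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (record_id, sequence, quality_scores) per four-line record.

    Simplification: blank lines are dropped and a trailing incomplete
    record is ignored.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at non-blank line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Assumption: Phred+33 encoding (quality = ASCII code minus 33).
        scores = [ord(ch) - 33 for ch in qual]
        tokens = header[1:].split()
        yield (tokens[0] if tokens else ""), seq, scores


if __name__ == "__main__":
    print(parse_fasta(FASTA_TEXT))
    print(list(parse_fastq(FASTQ_TEXT)))
```

With the sample data, the FASTA reader joins the wrapped lines of seq1 into a single sequence, while the FASTQ reader also returns one integer quality score per base. For real work, an established library is usually a better choice than a hand-written parser.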

Biological Data Sharing


See also