InterPro

InterPro
Content
Description	protein families, domains and functional sites
Contact
Research center	EMBL
Laboratory	European Bioinformatics Institute
Primary citation	Finn, et al. (2016)
Release date	1999
Access
Website	www.ebi.ac.uk/interpro/
Download URL	ftp
Miscellaneous
Data release; frequency	8-weekly
Version	56.0 (18 February 2016)

InterPro is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to new protein sequences^[2] in order to functionally characterise them.^[3]^[4]

The contents of InterPro consist of diagnostic signatures and the proteins that they significantly match. The signatures consist of models (simple types, such as regular expressions, or more complex ones, such as hidden Markov models) which describe protein families, domains or sites. Models are built from the amino acid sequences of known families or domains and are subsequently used to search unknown sequences (such as those arising from novel genome sequencing) in order to classify them. Each of the member databases of InterPro contributes towards a different niche, from very high-level, structure-based classifications (SUPERFAMILY and CATH-Gene3D) through to quite specific sub-family classifications (PRINTS and PANTHER).
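As an illustration of the simplest kind of signature, the following Python sketch translates a pattern written in PROSITE-style syntax into a regular expression and scans a sequence with it. It handles only a simplified subset of the syntax, and the function names, the example pattern and the toy sequence are assumptions made for illustration; it is not the scanning code used by InterPro or its member databases.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Simplified subset: 'x' (any residue), [..] (allowed residues),
    {..} (forbidden residues), (n) or (n,m) repeats, '<' and '>' anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("<", "^").replace(">", "$")
        # Forbidden-residue sets become negated character classes.
        element = element.replace("{", "[^").replace("}", "]")
        # Repeat counts use braces in regular-expression syntax.
        element = element.replace("(", "{").replace(")", "}")
        # Lowercase 'x' is the wildcard; residue letters are uppercase.
        element = element.replace("x", ".")
        parts.append(element)
    return "".join(parts)


def scan(sequence: str, pattern: str) -> List[Tuple[int, int]]:
    """Return 0-based, end-exclusive spans of all non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.span() for m in regex.finditer(sequence)]


# Illustrative zinc-finger-like pattern and a made-up sequence.
example_pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H."
toy_sequence = "MK" + "CAAC" + "AAA" + "L" + "A" * 8 + "HAAAH" + "GG"

if __name__ == "__main__":
    print(prosite_to_regex(example_pattern))
    print(scan(toy_sequence, example_pattern))
```

For the toy sequence the scan reports a single match spanning positions 2 to 23 (0-based, end-exclusive). Profile and hidden Markov model signatures score matches probabilistically and cannot be reduced to a regular expression in this way.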

InterPro's intention is to provide a one-stop shop for protein classification, where all the signatures produced by the different member databases are placed into entries within the InterPro database. Signatures which represent equivalent domains, sites or families are put into the same entry, and entries can also be related to one another. Additional information such as a description, consistent names and Gene Ontology (GO) terms is associated with each entry, where possible.

Data contained in InterPro

InterPro contains three main entities: proteins, signatures (also referred to as "methods" or "models") and entries. The proteins in UniProtKB are also the central protein entities in InterPro. Information regarding which signatures significantly match these proteins is calculated as the sequences are released by UniProtKB, and these results are made available to the public (see below). The matches of signatures to proteins determine how signatures are integrated together into InterPro entries: comparative overlap of matched protein sets and the location of the signatures' matches on the sequences are used as indicators of relatedness. Only signatures deemed to be of sufficient quality are integrated into InterPro.
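The two indicators of relatedness mentioned above can be sketched in Python as follows. The specific measures (a Jaccard index for the protein sets, and the fraction of the shorter match that is shared for the locations) are assumptions chosen for illustration, as are the function names and the made-up protein identifiers; the source does not state which formulas or thresholds InterPro applies.

```python
from typing import Dict, Tuple

# Protein identifier -> (start, end) of the signature match, 1-based inclusive.
Matches = Dict[str, Tuple[int, int]]


def protein_set_overlap(a: Matches, b: Matches) -> float:
    """Jaccard index of the two sets of matched proteins."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return len(a.keys() & b.keys()) / len(union)


def mean_location_overlap(a: Matches, b: Matches) -> float:
    """Mean fraction of the shorter match that is shared, over common proteins."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    total = 0.0
    for protein in shared:
        (s1, e1), (s2, e2) = a[protein], b[protein]
        common = max(0, min(e1, e2) - max(s1, s2) + 1)
        shorter = min(e1 - s1 + 1, e2 - s2 + 1)
        total += common / shorter
    return total / len(shared)


# Hypothetical matches for two signatures from different member databases.
sig_a: Matches = {"PROT_1": (10, 110), "PROT_2": (5, 100), "PROT_3": (20, 120)}
sig_b: Matches = {"PROT_1": (15, 105), "PROT_2": (8, 98), "PROT_4": (1, 90)}

if __name__ == "__main__":
    print(protein_set_overlap(sig_a, sig_b))
    print(mean_location_overlap(sig_a, sig_b))
```

In this toy data the signatures share two of four proteins, and on those shared proteins the shorter match lies entirely within the longer one, which suggests that the two signatures may describe the same region.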

InterPro also includes data for splice variants and the proteins contained in the UniParc and UniMES databases.

InterPro member databases

The signatures from InterPro come from 11 "member databases", which are listed below.

CATH-Gene3D: describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov model libraries representing CATH and Pfam domains. Functional annotation is provided to proteins from multiple resources. Functional prediction and analysis of domain architectures is available from the Gene3D website.
HAMAP: stands for High-quality Automated and Manual Annotation of microbial Proteomes. HAMAP profiles are manually created by expert curators; they identify proteins that are part of well-conserved bacterial, archaeal and plastid-encoded (i.e. chloroplasts, cyanelles, apicoplasts, non-photosynthetic plastids) protein families or subfamilies.
PANTHER: is a large collection of protein families that have been subdivided into functionally related subfamilies, using human expertise. These subfamilies model the divergence of specific functions within protein families, allowing more accurate association with function (human-curated molecular function and biological process classifications and pathway diagrams), as well as inference of amino acids important for functional specificity. Hidden Markov models (HMMs) are built for each family and subfamily for classifying additional protein sequences.
Pfam: is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families.
PIRSF: protein classification system is a network with multiple levels of sequence diversity from superfamilies to subfamilies that reflects the evolutionary relationship of full-length proteins and domains. The primary PIRSF classification unit is the homeomorphic family, whose members are both homologous (evolved from a common ancestor) and homeomorphic (sharing full-length sequence similarity and a common domain architecture).
PRINTS: is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of UniProt. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, their full diagnostic potency deriving from the mutual context afforded by motif neighbours.
ProDom: domain database consists of an automatic compilation of homologous domains. Current versions of ProDom are built using a novel procedure based on recursive PSI-BLAST searches.
PROSITE: is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.
SMART: allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 800 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues.
SUPERFAMILY: is a library of profile hidden Markov models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the domain belongs to. SUPERFAMILY has been used to carry out structural assignments to all completely sequenced genomes.
TIGRFAMs: is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs) and annotation, which provides a tool for identifying functionally related proteins based on sequence homology. Those entries which are "equivalogs" group homologous proteins which are conserved with respect to function.
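
The fingerprint idea described for PRINTS, a group of motifs that occur separately along a sequence, can be sketched in Python as below. This is a simplified illustration: the motifs are written as fixed-length regular expressions, and the motifs, sequences and function name are all hypothetical. It should not be read as the scanning method PRINTS itself uses.

```python
import re
from typing import List, Optional, Tuple


def match_fingerprint(sequence: str, motifs: List[str]) -> Optional[List[Tuple[int, int]]]:
    """Find every motif in the given order without overlap.

    Returns the 0-based, end-exclusive span of each motif, or None if any
    motif cannot be found after the previous one.
    """
    spans = []
    position = 0
    for motif in motifs:
        # Search only the part of the sequence after the previous motif.
        found = re.compile(motif).search(sequence, position)
        if found is None:
            return None
        spans.append(found.span())
        position = found.end()
    return spans


# Hypothetical three-motif fingerprint and two made-up sequences.
fingerprint = ["GD[ST]G", "W[LIV]K", "HP..Y"]
member = "AAGDSGAAAWLKAAAHPAAYAA"      # motifs present in the expected order
non_member = "AAWLKAAGDSGAAHPAAY"      # second motif precedes the first

if __name__ == "__main__":
    print(match_fingerprint(member, fingerprint))
    print(match_fingerprint(non_member, fingerprint))
```

Requiring all motifs together is what gives a fingerprint more diagnostic power than any single motif: the second sequence contains every motif individually but is rejected because they do not occur in the expected order.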

Access

The database is available for text- and sequence-based searches via a webserver, and for download via anonymous FTP. Like other EBI databases, it is in the public domain, since its content can be used "by any individual and for any purpose".^[5]

Users can also use the signature scanning software, InterProScan, if they have novel sequences that require characterisation.^[6] InterProScan is frequently used in genome projects in order to obtain a "first-pass" characterisation of the genome of interest.^[7]^[8] As of February 2013, the public version of InterProScan (v4.x) is Perl-based; a new Java-based architecture is under development, which will form the core of InterProScan v5.^[9]

To cite a particular InterPro article in Wikipedia, use the template of the form {{InterPro|IPRxxxxxx}}, where IPRxxxxxx is an InterPro accession number, for instance InterPro: IPR000001.
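
The accession format and template described above can be checked and generated with a short Python sketch. It assumes that each "x" in IPRxxxxxx stands for a decimal digit, which is inferred from the example IPR000001 rather than stated explicitly; the function names are hypothetical.

```python
import re

# Assumption: an accession is "IPR" followed by exactly six digits.
ACCESSION_RE = re.compile(r"IPR\d{6}")


def is_interpro_accession(text: str) -> bool:
    """Return True if the whole string looks like an InterPro accession."""
    return ACCESSION_RE.fullmatch(text) is not None


def wikipedia_template(accession: str) -> str:
    """Build the {{InterPro|...}} template text for a valid accession."""
    if not is_interpro_accession(accession):
        raise ValueError(f"not an InterPro accession: {accession!r}")
    return "{{InterPro|" + accession + "}}"


if __name__ == "__main__":
    print(wikipedia_template("IPR000001"))
```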

InterPro aims to release data to the public every 8 weeks, typically within a day of the UniProtKB release of the same proteins.

References

^ Finn, RD; Attwood, TK; Babbitt, PC; Bateman, A; Bork, P; Bridge, AJ; Chang, HY; Dosztányi, Z; El-Gebali, S; Fraser, M; Gough, J; Haft, D; Holliday, GL; Huang, H; Huang, X; Letunic, I; Lopez, R; Lu, S; Marchler-Bauer, A; Mi, H; Mistry, J; Natale, DA; Necci, M; Nuka, G; Orengo, CA; Park, Y; Pesseat, S; Piovesan, D; Potter, SC; Rawlings, ND; Redaschi, N; Richardson, L; Rivoire, C; Sangrador-Vegas, A; Sigrist, C; Sillitoe, I; Smithers, B; Squizzato, S; Sutton, G; Thanki, N; Thomas, PD; Tosatto, SC; Wu, CH; Xenarios, I; Yeh, LS; Young, SY; Mitchell, AL (29 November 2016). "InterPro in 2017-beyond protein family and domain annotations". Nucleic Acids Research. PMID 27899635.
^ Hunter, S.; Jones, P.; Mitchell, A.; Apweiler, R.; Attwood, T. K.; Bateman, A.; Bernard, T.; Binns, D.; Bork, P.; Burge, S.; De Castro, E.; Coggill, P.; Corbett, M.; Das, U.; Daugherty, L.; Duquenne, L.; Finn, R. D.; Fraser, M.; Gough, J.; Haft, D.; Hulo, N.; Kahn, D.; Kelly, E.; Letunic, I.; Lonsdale, D.; Lopez, R.; Madera, M.; Maslen, J.; McAnulla, C.; McDowall, J. (2011). "InterPro in 2011: New developments in the family and domain prediction database". Nucleic Acids Research. 40 (Database issue): D306–D312. doi:10.1093/nar/gkr948. PMC 3245097. PMID 22096229.
^ Apweiler, R.; Attwood, T. K.; Bairoch, A.; Bateman, A.; Birney, E.; Biswas, M.; Bucher, P.; Cerutti, L.; Corpet, F.; Croning, M. D.; Durbin, R.; Falquet, L.; Fleischmann, W.; Gouzy, J.; Hermjakob, H.; Hulo, N.; Jonassen, I.; Kahn, D.; Kanapin, A.; Karavidopoulou, Y.; Lopez, R.; Marx, B.; Mulder, N. J.; Oinn, T. M.; Pagni, M.; Servant, F.; Sigrist, C. J.; Zdobnov, E. M. (2001). "The InterPro database, an integrated documentation resource for protein families, domains and functional sites". Nucleic Acids Research. 29 (1): 37–40. doi:10.1093/nar/29.1.37. PMC 29841. PMID 11125043.
^ Apweiler, R.; Attwood, T. K.; Bairoch, A.; Bateman, A.; Birney, E.; Biswas, M.; Bucher, P.; Cerutti, L.; Corpet, F.; Croning, M. D. R.; Durbin, R.; Falquet, L.; Fleischmann, W.; Gouzy, J.; Hermjakob, H.; Hulo, N.; Jonassen, I.; Kahn, D.; Kanapin, A.; Karavidopoulou, Y.; Lopez, R.; Marx, B.; Mulder, N. J.; Oinn, T. M.; Pagni, M.; Servant, F.; Sigrist, C. J. A.; Zdobnov, E. M.; Interpro, C. (2000). "InterPro--an integrated documentation resource for protein families, domains and functional sites". Bioinformatics. 16 (12): 1145–1150. doi:10.1093/bioinformatics/16.12.1145. PMID 11159333.
^ http://www.ebi.ac.uk/Information/termsofuse.html
^ Quevillon, E.; Silventoinen, V.; Pillai, S.; Harte, N.; Mulder, N.; Apweiler, R.; Lopez, R. (Jul 2005). "InterProScan: protein domains identifier". Nucleic Acids Research. 33 (Web Server issue): W116–W120. doi:10.1093/nar/gki442. ISSN 0305-1048. PMC 1160203. PMID 15980438.
^ Lander, E. S.; Linton, M.; Birren, B.; Nusbaum, C.; Zody, C.; Baldwin, J.; Devon, K.; Dewar, K.; Doyle, M.; Fitzhugh, W.; Funke, R.; Gage, D.; Harris, K.; Heaford, A.; Howland, J.; Kann, L.; Lehoczky, J.; Levine, R.; McEwan, P.; McKernan, K.; Meldrim, J.; Mesirov, J. P.; Miranda, C.; Morris, W.; Naylor, J.; Raymond, C.; Rosetti, M.; Santos, R.; Sheridan, A.; Sougnez, C. (Feb 2001). "Initial sequencing and analysis of the human genome". Nature. 409 (6822): 860–921. doi:10.1038/35057062. ISSN 0028-0836. PMID 11237011.
^ Holt, A.; Subramanian, M.; Halpern, A.; Sutton, G.; Charlab, R.; Nusskern, R.; Wincker, P.; Clark, G.; Ribeiro, M.; Wides, R.; Salzberg, S. L.; Loftus, B.; Yandell, M.; Majoros, W. H.; Rusch, D. B.; Lai, Z.; Kraft, C. L.; Abril, J. F.; Anthouard, V.; Arensburger, P.; Atkinson, P. W.; Baden, H.; De Berardinis, V.; Baldwin, D.; Benes, V.; Biedler, J.; Blass, C.; Bolanos, R.; Boscus, D.; et al. (Oct 2002). "The genome sequence of the malaria mosquito Anopheles gambiae". Science. 298 (5591): 129–149. Bibcode:2002Sci...298..129H. doi:10.1126/science.1076181. ISSN 0036-8075. PMID 12364791.
^ https://code.google.com/p/interproscan/

External links

Official website — webserver
databases — FTP download

