Jump to content

Talk:HMMER

Page contents not supported in other languages.
From Wikipedia, the free encyclopedia

Plan 7

[edit]

A section outlining the Plan 7 architecture and why this was important for HMMER would constitute a huge improvement to the quality of the article. Maonao (talk) —Preceding undated comment added 14:07, 23 July 2017 (UTC)[reply]

I have addressed this by adding a section defining profile HMMs, and referring to the Plan 7 architecture in this context. A separate article on profile HMMs may need to be created in the future, but in the meantime no such article exists, so I have outlined what they are here. Maonao (talk) 16:52, 23 July 2017 (UTC)[reply]

Short how to by Stefanjanssen (talk) 17:50, 27 July 2010 (UTC)

[edit]

Hi there. This is my first edit to a Wiki entry, so please indulge me if I violate common rules.

For the last three weeks I delved into the sources of HMMer to understand how it builds it's models and score a sequence against a model. Maybe I misunderstood some concepts, but I think it's a good point to start with.

In principle all the probabilities are counted from the multiple sequence alignment, that you provide. But there are several concepts to improve these pure frequencies:

  • mark sequence fragments
  • annotate match and insert columns * --hand
  • resolve delete->insert and insert->delete situations
  • pseudocounts
  • mixture dirichlets * --laplace
  • sequence weighting * --wnone
  • maximum entropy weighting * --enone

(concepts marked by asterisk can be turned off by the mentioned special switches of hmmbuild)

mark sequence fragments

[edit]

Your multiple sequence alignment A is a set of homologue sequences. But it may also contain some homologue sequence fragments, e.g. single exons of a protein. Aligning these fragments typically result in long leading and trailing gaps. But these gaps should not give negative scores, because they don't reflect evolution but missing data. In a first step hmmbuild marks these leading and trailing gaps as missing data points, if the pure sequence (without all gaps) is smaller than * alignmentlength. Default for fragthesh is 0.5.

annotate match and insert columns

[edit]

Given the multiple sequence alignment A, hmmbuild has to know or to decide which of the alignment columns should be modeled as conserved positions or as insertions relative to the consensus. Only the conserved positions (matches) generate states in the profile HMM. The annotation can either be given by an extra RF line the the Stockholm file or guessed by a simple heuristic. The heuristic considers an alignment column as conserved, if it contains less than * (number of sequences in the alignment) gaps. Default for symfrac is 0.5. Missing data points don't count, neither as gaps nor as number of sequences in the alignment.

resolve delete->insert and insert->delete situations

[edit]

First delete a subsequence and then insert another subsequence is not different from first insert and then delete. Thus most sequence analysis programs forbid one of these two adjacent operations. The same is true for hmmbuild. If a delete (a gap in the sequence, where a match is annotated for the alignment column) is followed by an insert (a subsequence, where the alignment columns are not marked as matches) hmmbuild shifts the insert subsequence one position to the left and inserts a gap on the rightmost position. For insert -> delete situations it's the other way around. On ambiguous situations, e.g. Delete->Insert->Delete, hmmbuild always shifts to the left, even if this decision might not be optimal.

train the HMM

[edit]

Figure 1 shows the general architecture of the core profile HMM. I will use some piece of pseudocode to explain how to count emission- and transition- frequencies. Assume your alignment a consists of n sequences with m columns and let j_1 and j_2 be two succeeding, but not necessarily adjacent, alignment columns that are marked as matches: s_i(x) is the character at position x in sequence i. {} denote some kind of hash.

for 0 <= i < n {
	if (isChar(s_i(j_1))) {
		emissionMatch{s_i(j_1)} ++
	}
	for 0 <= l <= j_2 - j_1 + 1 {
		if (isChar(s_i(l))) {
			emissionInsert{s_i(l)} ++
		}
	}

	if (isChar(s_i(j_1)) && isNoMissingDatapoint(s_i(j_2))) {
		if (isGap(s_i(j_2))) {
			transitionMatch_Delete ++
		} else {
			if (j_2 - j_1 + 1 < 1) {
				transitionMatch_Match ++
			} else {
				transitionMatch_Insert++
			}
		}
	}
	if (isNoMissingDatapoint(s_i(j_2-1))) {
		 for 0 <= l <= j_2 - j_1 + 1 {
			if (isChar(s_i(l))) {
				transitionInsert_Insert ++
			}
		}
		transitionInsert_Insert -- if (j_2 - j_1 + 1 > 0)
		transitionInsert_Match ++ 
	}
	if (isGap(s_i(j_1))) {
		if (isChar(s_i(j_2)) {
			transitionDelete_Match ++
		} else {
			transitionDelete_Delete ++
		}
	}
}

We now got the counts for the alignment. To get frequencies we have to divide them by their sums, e.g. the frequency of emitting character 'A' is the count for 'A' divided by the summed counts of all characters. The same holds for the transitions. There are exceptions for the first and the last match position, because the delete state is missing: so the transition frequency for delete->match is set to 1, delete->delete is set to 0 and - only for the last position - match->delete is set to 0. If you inspect the HMM file, you will find some deviations from the above description, because there they use a virtual match 0 position. So there actually are transitions from delete 1 to delete 2 and match 2, but hmmsearch won't use them.

pseudocounts

[edit]

With Bayesian statistic we have prior and posterior probabilities. Priors (or pseudocounts) are prior knowledge about our model, maybe from biologist expertise. Posteriors are the observed counts from the training data, i.e. the sequence alignment. If your training data are large enough all priors are cancelled out, but usually it's impossible to collect thus large training data. The advantage of pseuodcount is, that no transition- or emission probabilities are set to zero, so even if we haven't seen an 'A' in the training we can align a sequence containing an 'A'. If you use the switch --laplace or use RNA instead of Aminoacid input all counter-variables are initialized with 1 instead of 0.

@Stefanjanssen You added your tutorial to the talk page not the main page! Your text is better for a blog as it doesn't fit many rules of wikipedia could you consider rewriting it like an article and use put it to the main page 195.248.49.162 (talk) 13:21, 1 June 2014 (UTC)[reply]

Profile HMM goes to general HMM page

[edit]

Maybe there isn't an article written on protein PROFILE HMM on wiki, the link to profile HMM goes to a general HMM page. Xtalline (talk) 05:29, 13 August 2021 (UTC)[reply]