Markovian discrimination
This article's tone or style may not reflect the encyclopedic tone used on Wikipedia. (May 2024) |
Within the Markov model in probability theory, Markovian discrimination is a spam filtering method used in CRM114 and other spam filters to model the statistical behaviors of spam more accurately than can be done by using simple Bayesian methods.[1] It is based on the theory of Markov chains by Andrey Markov. While a simple bag-of-words model contains only the dictionary of legal words and their relative probabilities, a Markovian model adds to this the relative transition probabilities (the probabilities of the next word given the current word). In other words, a Bayesian filter works on single words alone, while a Markovian filter works on phrases or entire sentences.
Comparison with Bayesian Model and Spam Filtering Application
In contrast to a simple Bayesian model, which represents a message simply as a bag-of-words and examines a dictionary of legitimate words along with their relative probabilities, a Markovian model incorporates the transition probabilities between words. That allows the model to use properties of a sentence or phrase to help determine if something is spam. Furthermore, Markovian filters used in spam filtering are not limited to the word level; they can also process letters or partial word tokens. Weights can then be assigned to these tokens based on their probability of appearing in spam or legitimate content, further enhancing the accuracy of the filter.[2]
Markov Models
There are two variants of Markov models: the visible Markov model and the hidden Markov model (HMM). These differ from the concept of the current word; with a visible Markov model, the current word is considered to contain full information about previous states of the language model, whereas a hidden Markov model obscures the relationship, with the current word being only probabilistically linked to the previously read words.[3]
Application of Hidden Markov Models
To illustrate, in a visible Markov model, the word "the" should predict the subsequent word with a high degree of accuracy. In contrast, a hidden Markov model implies that the entire preceding text indicates the actual state and foresees the subsequent words, but it does not provide a guarantee of that state or prediction. As spam filtering typically encounters the latter scenario, hidden Markov models are predominantly employed. Moreover, due to storage constraints, a specific type of hidden Markov model, known as a Markov random field, proves particularly suitable, typically with a 'sliding window' or clique size ranging between four and six tokens.[4]
See also
References
- ^ Chhabra, S., Yerazunis, W. S., and Siefkes, C. 2004. Spam Filtering using a Markov Random Field Model with Variable Weighting Schemas. In Proceedings of the Fourth IEEE international Conference on Data Mining (November 1–04, 2004). ICDM. IEEE Computer Society, Washington, DC, Mazharul
- ^ Duarte, J. P., Leite, V. C., & Frutuoso e Melo, P. F. F. (January 2013). A comparison between Markovian models and Bayesian networks for treating some dependent events in reliability evaluations. 2013 International Nuclear Atlantic Conference, Recife, Brazil. [1]
- ^ Jurafsky, Daniel & Martin, James H. Speech and Language Processing. 2023. Stanford University. [2]
- ^ Yerazunis, W. S. The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past It. Mitsubishi Electric Research Laboratories, TR2004-091, January 2004. [3]