The Hammersley–Clifford theorem is a result in probability theory , mathematical statistics and statistical mechanics that gives necessary and sufficient conditions under which a strictly positive probability distribution (of events in a probability space ) [clarification needed ] can be represented as events generated by a Markov network (also known as a Markov random field ). It is the fundamental theorem of random fields .[ 1] It states that a probability distribution that has a strictly positive mass or density satisfies one of the Markov properties with respect to an undirected graph G if and only if it is a Gibbs random field , that is, its density can be factorized over the cliques (or complete subgraphs ) of the graph.
The relationship between Markov and Gibbs random fields was initiated by Roland Dobrushin [ 2] and Frank Spitzer [ 3] in the context of statistical mechanics . The theorem is named after John Hammersley and Peter Clifford , who proved the equivalence in an unpublished paper in 1971.[ 4] [ 5] Simpler proofs using the inclusion–exclusion principle were given independently by Geoffrey Grimmett ,[ 6] Preston[ 7] and Sherman[ 8] in 1973, with a further proof by Julian Besag in 1974.[ 9]
A simple Markov network for demonstrating that any Gibbs random field satisfies every Markov property.
It is a trivial matter to show that a Gibbs random field satisfies every Markov property . As an example of this fact, see the following:
In the image to the right, a Gibbs random field over the provided graph has the form
Pr
(
A
,
B
,
C
,
D
,
E
,
F
)
∝
f
1
(
A
,
B
,
D
)
f
2
(
A
,
C
,
D
)
f
3
(
C
,
D
,
F
)
f
4
(
C
,
E
,
F
)
{\displaystyle \Pr(A,B,C,D,E,F)\propto f_{1}(A,B,D)f_{2}(A,C,D)f_{3}(C,D,F)f_{4}(C,E,F)}
. If variables
C
{\displaystyle C}
and
D
{\displaystyle D}
are fixed, then the global Markov property requires that:
A
,
B
⊥
E
,
F
|
C
,
D
{\displaystyle A,B\perp E,F|C,D}
(see conditional independence ), since
C
,
D
{\displaystyle C,D}
forms a barrier between
A
,
B
{\displaystyle A,B}
and
E
,
F
{\displaystyle E,F}
.
With
C
{\displaystyle C}
and
D
{\displaystyle D}
constant,
Pr
(
A
,
B
,
E
,
F
|
C
=
c
,
D
=
d
)
∝
[
f
1
(
A
,
B
,
d
)
f
2
(
A
,
c
,
d
)
]
⋅
[
f
3
(
c
,
d
,
F
)
f
4
(
c
,
E
,
F
)
]
=
g
1
(
A
,
B
)
g
2
(
E
,
F
)
{\displaystyle \Pr(A,B,E,F|C=c,D=d)\propto [f_{1}(A,B,d)f_{2}(A,c,d)]\cdot [f_{3}(c,d,F)f_{4}(c,E,F)]=g_{1}(A,B)g_{2}(E,F)}
where
g
1
(
A
,
B
)
=
f
1
(
A
,
B
,
d
)
f
2
(
A
,
c
,
d
)
{\displaystyle g_{1}(A,B)=f_{1}(A,B,d)f_{2}(A,c,d)}
and
g
2
(
E
,
F
)
=
f
3
(
c
,
d
,
F
)
f
4
(
c
,
E
,
F
)
{\displaystyle g_{2}(E,F)=f_{3}(c,d,F)f_{4}(c,E,F)}
. This implies that
A
,
B
⊥
E
,
F
|
C
,
D
{\displaystyle A,B\perp E,F|C,D}
.
To establish that every positive probability distribution that satisfies the local Markov property is also a Gibbs random field, the following lemma, which provides a means for combining different factorizations, needs to be proved:
Lemma 1 provides a means for combining factorizations as shown in this diagram. Note that in this image, the overlap between sets is ignored.
Lemma 1
Let
U
{\displaystyle U}
denote the set of all random variables under consideration, and let
Θ
,
Φ
1
,
Φ
2
,
…
,
Φ
n
⊆
U
{\displaystyle \Theta ,\Phi _{1},\Phi _{2},\dots ,\Phi _{n}\subseteq U}
and
Ψ
1
,
Ψ
2
,
…
,
Ψ
m
⊆
U
{\displaystyle \Psi _{1},\Psi _{2},\dots ,\Psi _{m}\subseteq U}
denote arbitrary sets of variables. (Here, given an arbitrary set of variables
X
{\displaystyle X}
,
X
{\displaystyle X}
will also denote an arbitrary assignment to the variables from
X
{\displaystyle X}
.)
If
Pr
(
U
)
=
f
(
Θ
)
∏
i
=
1
n
g
i
(
Φ
i
)
=
∏
j
=
1
m
h
j
(
Ψ
j
)
{\displaystyle \Pr(U)=f(\Theta )\prod _{i=1}^{n}g_{i}(\Phi _{i})=\prod _{j=1}^{m}h_{j}(\Psi _{j})}
for functions
f
,
g
1
,
g
2
,
…
g
n
{\displaystyle f,g_{1},g_{2},\dots g_{n}}
and
h
1
,
h
2
,
…
,
h
m
{\displaystyle h_{1},h_{2},\dots ,h_{m}}
, then there exist functions
h
1
′
,
h
2
′
,
…
,
h
m
′
{\displaystyle h'_{1},h'_{2},\dots ,h'_{m}}
and
g
1
′
,
g
2
′
,
…
,
g
n
′
{\displaystyle g'_{1},g'_{2},\dots ,g'_{n}}
such that
Pr
(
U
)
=
(
∏
j
=
1
m
h
j
′
(
Θ
∩
Ψ
j
)
)
(
∏
i
=
1
n
g
i
′
(
Φ
i
)
)
{\displaystyle \Pr(U)={\bigg (}\prod _{j=1}^{m}h'_{j}(\Theta \cap \Psi _{j}){\bigg )}{\bigg (}\prod _{i=1}^{n}g'_{i}(\Phi _{i}){\bigg )}}
In other words,
∏
j
=
1
m
h
j
(
Ψ
j
)
{\displaystyle \prod _{j=1}^{m}h_{j}(\Psi _{j})}
provides a template for further factorization of
f
(
Θ
)
{\displaystyle f(\Theta )}
.
Proof of Lemma 1
In order to use
∏
j
=
1
m
h
j
(
Ψ
j
)
{\displaystyle \prod _{j=1}^{m}h_{j}(\Psi _{j})}
as a template to further factorize
f
(
Θ
)
{\displaystyle f(\Theta )}
, all variables outside of
Θ
{\displaystyle \Theta }
need to be fixed. To this end, let
θ
¯
{\displaystyle {\bar {\theta }}}
be an arbitrary fixed assignment to the variables from
U
∖
Θ
{\displaystyle U\setminus \Theta }
(the variables not in
Θ
{\displaystyle \Theta }
). For an arbitrary set of variables
X
{\displaystyle X}
, let
θ
¯
[
X
]
{\displaystyle {\bar {\theta }}[X]}
denote the assignment
θ
¯
{\displaystyle {\bar {\theta }}}
restricted to the variables from
X
∖
Θ
{\displaystyle X\setminus \Theta }
(the variables from
X
{\displaystyle X}
, excluding the variables from
Θ
{\displaystyle \Theta }
).
Moreover, to factorize only
f
(
Θ
)
{\displaystyle f(\Theta )}
, the other factors
g
1
(
Φ
1
)
,
g
2
(
Φ
2
)
,
.
.
.
,
g
n
(
Φ
n
)
{\displaystyle g_{1}(\Phi _{1}),g_{2}(\Phi _{2}),...,g_{n}(\Phi _{n})}
need to be rendered moot for the variables from
Θ
{\displaystyle \Theta }
. To do this, the factorization
Pr
(
U
)
=
f
(
Θ
)
∏
i
=
1
n
g
i
(
Φ
i
)
{\displaystyle \Pr(U)=f(\Theta )\prod _{i=1}^{n}g_{i}(\Phi _{i})}
will be re-expressed as
Pr
(
U
)
=
(
f
(
Θ
)
∏
i
=
1
n
g
i
(
Φ
i
∩
Θ
,
θ
¯
[
Φ
i
]
)
)
(
∏
i
=
1
n
g
i
(
Φ
i
)
g
i
(
Φ
i
∩
Θ
,
θ
¯
[
Φ
i
]
)
)
{\displaystyle \Pr(U)={\bigg (}f(\Theta )\prod _{i=1}^{n}g_{i}(\Phi _{i}\cap \Theta ,{\bar {\theta }}[\Phi _{i}]){\bigg )}{\bigg (}\prod _{i=1}^{n}{\frac {g_{i}(\Phi _{i})}{g_{i}(\Phi _{i}\cap \Theta ,{\bar {\theta }}[\Phi _{i}])}}{\bigg )}}
For each
i
=
1
,
2
,
.
.
.
,
n
{\displaystyle i=1,2,...,n}
:
g
i
(
Φ
i
∩
Θ
,
θ
¯
[
Φ
i
]
)
{\displaystyle g_{i}(\Phi _{i}\cap \Theta ,{\bar {\theta }}[\Phi _{i}])}
is
g
i
(
Φ
i
)
{\displaystyle g_{i}(\Phi _{i})}
where all variables outside of
Θ
{\displaystyle \Theta }
have been fixed to the values prescribed by
θ
¯
{\displaystyle {\bar {\theta }}}
.
Let
f
′
(
Θ
)
=
f
(
Θ
)
∏
i
=
1
n
g
i
(
Φ
i
∩
Θ
,
θ
¯
[
Φ
i
]
)
{\displaystyle f'(\Theta )=f(\Theta )\prod _{i=1}^{n}g_{i}(\Phi _{i}\cap \Theta ,{\bar {\theta }}[\Phi _{i}])}
and
g
i
′
(
Φ
i
)
=
g
i
(
Φ
i
)
g
i
(
Φ
i
∩
Θ
,
θ
¯
[
Φ
i
]
)
{\displaystyle g'_{i}(\Phi _{i})={\frac {g_{i}(\Phi _{i})}{g_{i}(\Phi _{i}\cap \Theta ,{\bar {\theta }}[\Phi _{i}])}}}
for each
i
=
1
,
2
,
…
,
n
{\displaystyle i=1,2,\dots ,n}
so
Pr
(
U
)
=
f
′
(
Θ
)
∏
i
=
1
n
g
i
′
(
Φ
i
)
=
∏
j
=
1
m
h
j
(
Ψ
j
)
{\displaystyle \Pr(U)=f'(\Theta )\prod _{i=1}^{n}g'_{i}(\Phi _{i})=\prod _{j=1}^{m}h_{j}(\Psi _{j})}
What is most important is that
g
i
′
(
Φ
i
)
=
g
i
(
Φ
i
)
g
i
(
Φ
i
∩
Θ
,
θ
¯
[
Φ
i
]
)
=
1
{\displaystyle g'_{i}(\Phi _{i})={\frac {g_{i}(\Phi _{i})}{g_{i}(\Phi _{i}\cap \Theta ,{\bar {\theta }}[\Phi _{i}])}}=1}
when the values assigned to
Φ
i
{\displaystyle \Phi _{i}}
do not conflict with the values prescribed by
θ
¯
{\displaystyle {\bar {\theta }}}
, making
g
i
′
(
Φ
i
)
{\displaystyle g'_{i}(\Phi _{i})}
"disappear" when all variables not in
Θ
{\displaystyle \Theta }
are fixed to the values from
θ
¯
{\displaystyle {\bar {\theta }}}
.
Fixing all variables not in
Θ
{\displaystyle \Theta }
to the values from
θ
¯
{\displaystyle {\bar {\theta }}}
gives
Pr
(
Θ
,
θ
¯
)
=
f
′
(
Θ
)
∏
i
=
1
n
g
i
′
(
Φ
i
∩
Θ
,
θ
¯
[
Φ
i
]
)
=
∏
j
=
1
m
h
j
(
Ψ
j
∩
Θ
,
θ
¯
[
Ψ
j
]
)
{\displaystyle \Pr(\Theta ,{\bar {\theta }})=f'(\Theta )\prod _{i=1}^{n}g'_{i}(\Phi _{i}\cap \Theta ,{\bar {\theta }}[\Phi _{i}])=\prod _{j=1}^{m}h_{j}(\Psi _{j}\cap \Theta ,{\bar {\theta }}[\Psi _{j}])}
Since
g
i
′
(
Φ
i
∩
Θ
,
θ
¯
[
Φ
i
]
)
=
1
{\displaystyle g'_{i}(\Phi _{i}\cap \Theta ,{\bar {\theta }}[\Phi _{i}])=1}
,
f
′
(
Θ
)
=
∏
j
=
1
m
h
j
(
Ψ
j
∩
Θ
,
θ
¯
[
Ψ
j
]
)
{\displaystyle f'(\Theta )=\prod _{j=1}^{m}h_{j}(\Psi _{j}\cap \Theta ,{\bar {\theta }}[\Psi _{j}])}
Letting
h
j
′
(
Θ
∩
Ψ
j
)
=
h
j
(
Ψ
j
∩
Θ
,
θ
¯
[
Ψ
j
]
)
{\displaystyle h'_{j}(\Theta \cap \Psi _{j})=h_{j}(\Psi _{j}\cap \Theta ,{\bar {\theta }}[\Psi _{j}])}
gives:
f
′
(
Θ
)
=
∏
j
=
1
m
h
j
′
(
Θ
∩
Ψ
j
)
{\displaystyle f'(\Theta )=\prod _{j=1}^{m}h'_{j}(\Theta \cap \Psi _{j})}
which finally gives:
Pr
(
U
)
=
(
∏
j
=
1
m
h
j
′
(
Θ
∩
Ψ
j
)
)
(
∏
i
=
1
n
g
i
′
(
Φ
i
)
)
{\displaystyle \Pr(U)={\bigg (}\prod _{j=1}^{m}h'_{j}(\Theta \cap \Psi _{j}){\bigg )}{\bigg (}\prod _{i=1}^{n}g'_{i}(\Phi _{i}){\bigg )}}
The clique formed by vertices
x
1
{\displaystyle x_{1}}
,
x
2
{\displaystyle x_{2}}
, and
x
3
{\displaystyle x_{3}}
, is the intersection of
{
x
1
}
∪
∂
x
1
{\displaystyle \{x_{1}\}\cup \partial x_{1}}
,
{
x
2
}
∪
∂
x
2
{\displaystyle \{x_{2}\}\cup \partial x_{2}}
, and
{
x
3
}
∪
∂
x
3
{\displaystyle \{x_{3}\}\cup \partial x_{3}}
.
Lemma 1 provides a means of combining two different factorizations of
Pr
(
U
)
{\displaystyle \Pr(U)}
. The local Markov property implies that for any random variable
x
∈
U
{\displaystyle x\in U}
, that there exists factors
f
x
{\displaystyle f_{x}}
and
f
−
x
{\displaystyle f_{-x}}
such that:
Pr
(
U
)
=
f
x
(
x
,
∂
x
)
f
−
x
(
U
∖
{
x
}
)
{\displaystyle \Pr(U)=f_{x}(x,\partial x)f_{-x}(U\setminus \{x\})}
where
∂
x
{\displaystyle \partial x}
are the neighbors of node
x
{\displaystyle x}
. Applying Lemma 1 repeatedly eventually factors
Pr
(
U
)
{\displaystyle \Pr(U)}
into a product of clique potentials (see the image on the right).
End of Proof
^ Lafferty, John D.; Mccallum, Andrew (2001). "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data" . Proc. of the 18th Intl. Conf. on Machine Learning (ICML-2001) . Morgan Kaufmann. ISBN 9781558607781 . Retrieved 14 December 2014 . by the fundamental theorem of random fields (Hammersley & Clifford 1971 )
^ Dobrushin, P. L. (1968), "The Description of a Random Field by Means of Conditional Probabilities and Conditions of Its Regularity" , Theory of Probability and Its Applications , 13 (2): 197–224, doi :10.1137/1113026
^ Spitzer, Frank (1971), "Markov Random Fields and Gibbs Ensembles", The American Mathematical Monthly , 78 (2): 142–154, doi :10.2307/2317621 , JSTOR 2317621
^ Hammersley, J. M.; Clifford, P. (1971), Markov fields on finite graphs and lattices (PDF)
^ Clifford, P. (1990), "Markov random fields in statistics" , in Grimmett, G. R.; Welsh, D. J. A. (eds.), Disorder in Physical Systems: A Volume in Honour of John M. Hammersley , Oxford University Press, pp. 19–32, ISBN 978-0-19-853215-6 , MR 1064553 , retrieved 2009-05-04
^ Grimmett, G. R. (1973), "A theorem about random fields", Bulletin of the London Mathematical Society , 5 (1): 81–84, CiteSeerX 10.1.1.318.3375 , doi :10.1112/blms/5.1.81 , MR 0329039
^ Preston, C. J. (1973), "Generalized Gibbs states and Markov random fields", Advances in Applied Probability , 5 (2): 242–261, doi :10.2307/1426035 , JSTOR 1426035 , MR 0405645
^ Sherman, S. (1973), "Markov random fields and Gibbs random fields", Israel Journal of Mathematics , 14 (1): 92–103, doi :10.1007/BF02761538 , MR 0321185
^ Besag, J. (1974), "Spatial interaction and the statistical analysis of lattice systems", Journal of the Royal Statistical Society, Series B , 36 (2): 192–236, JSTOR 2984812 , MR 0373208