Notes on Stochastic Processes (Joe Chang)

Ch. 1 – Markov Chains

1.1 Specifying and Simulating a Markov Chain

to specify a Markov chain, we need to know its state space $S$, its initial distribution $\pi_0$ (the distribution of $X_0$), and its probability transition matrix $P$, where $P(i,j) = \mathbb{P}\{X_{n+1} = j \mid X_n = i\}$.
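A minimal simulation sketch (the two-state chain, seed, and probabilities below are made-up illustration values, not from the notes): draw $X_0$ from $\pi_0$, then repeatedly draw $X_{n+1}$ from the row of $P$ indexed by the current state.

```python
# Simulate a Markov chain from (S, pi_0, P); the chain here is a hypothetical example.
import numpy as np

rng = np.random.default_rng(0)

states = [0, 1]                      # state space S
pi0 = np.array([0.5, 0.5])           # initial distribution pi_0
P = np.array([[0.9, 0.1],            # transition matrix, P[i, j] = P(X_{n+1}=j | X_n=i)
              [0.2, 0.8]])

def simulate(n_steps):
    """Draw X_0 from pi_0, then X_{n+1} from row X_n of P."""
    x = rng.choice(states, p=pi0)
    path = [x]
    for _ in range(n_steps):
        x = rng.choice(states, p=P[x])
        path.append(x)
    return path

print(simulate(10))
```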

1.2 The Markov Property

a process $X_0, X_1, \ldots$ satisfies the Markov property if

$$\mathbb{P}\{X_{n+1} = i_{n+1} \mid X_n = i_n, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0\} = \mathbb{P}\{X_{n+1} = i_{n+1} \mid X_n = i_n\}$$

for all $n$ and all $i_0, \ldots, i_{n+1} \in S$.

notes

notes

(i.e., dependent on the last r states)

1.3 Matrices

recall $\pi_0$ from 1.1. analogously, let $\pi_n$ denote the distribution of the chain at time $n$: $\pi_n(i) = \mathbb{P}\{X_n = i\}$. consider both as row vectors.

suppose that the state space is finite: $S = \{1, \ldots, N\}$. by LOTP,

$$\pi_{n+1}(j) = \mathbb{P}\{X_{n+1} = j\} = \sum_{i=1}^{N} \mathbb{P}\{X_n = i\}\,\mathbb{P}\{X_{n+1} = j \mid X_n = i\} = \sum_{i=1}^{N} \pi_n(i) P(i,j),$$

i.e., $\pi_{n+1} = \pi_n P$.
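As a quick check of the identity $\pi_{n+1} = \pi_n P$, the distribution can be propagated numerically; the sketch below reuses the same made-up two-state chain as in 1.1.

```python
# Propagate the distribution of X_n via pi_{n+1} = pi_n P (hypothetical two-state chain).
import numpy as np

pi = np.array([0.5, 0.5])            # pi_0
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

for n in range(20):
    pi = pi @ P                      # row vector times matrix
print(pi)                            # distribution of X_20
```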

notes

1.4 Basic Limit Theorem of Markov Chains

notes

1.5 Stationary Distribution

a distribution $\pi$ being stationary amounts to saying that $\pi = \pi P$ is satisfied, i.e.,

$$\pi(j) = \sum_{i \in S} \pi(i) P(i,j)$$

for all $j \in S$.

a Markov chain might have no stationary distribution, one stationary distribution, or infinitely many stationary distributions.
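One numerical way (a sketch, not a method from the notes) to find a stationary distribution of a finite chain: $\pi = \pi P$ says that $\pi^{\top}$ is an eigenvector of $P^{\top}$ with eigenvalue 1.

```python
# Find a stationary distribution of a finite transition matrix by an eigen-decomposition.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))      # column closest to eigenvalue 1
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()                        # normalize to a probability vector
print(pi, pi @ P)                         # pi and pi @ P should agree
```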

for subsets $A, B$ of the state space, define the probability flux from set $A$ into $B$ as

$$\text{flux}(A, B) = \sum_{i \in A} \sum_{j \in B} \pi(i) P(i,j)$$

1.6 Irreducibility, Periodicity, Recurrence

Use $\mathbb{P}_i(A)$ as shorthand for $\mathbb{P}\{A \mid X_0 = i\}$, and similarly for $\mathbb{E}_i$.

Accessibility: for two states $i, j$, we say that $j$ is accessible from $i$ if it is possible for the chain ever to visit state $j$ if the chain starts in state $i$:

$$\mathbb{P}_i\left\{\bigcup_{n=0}^{\infty} \{X_n = j\}\right\} > 0.$$

Equivalently,

$$\sum_{n=0}^{\infty} P^n(i,j) = \sum_{n=0}^{\infty} \mathbb{P}_i\{X_n = j\} > 0.$$

Communication: we say $i$ communicates with $j$ if $i$ is accessible from $j$ and $j$ is accessible from $i$.

Irreducibility: the Markov chain is irreducible if all pairs of states communicate.

The relation “communicates with” is an equivalence relation; hence, the state space $S$ can be partitioned into “communicating classes” or “classes.”

The Basic Limit Theorem requires irreducibility and aperiodicity (see 1.5). Trivial examples why:

Period: Given a Markov chain $\{X_0, X_1, \ldots\}$, define the period of a state $i$ to be the greatest common divisor

$$d_i = \gcd\{n : P^n(i,i) > 0\}.$$

THEOREM: if the states $i$ and $j$ communicate, then $d_i = d_j$.

The period of a state is a “class property.” In particular, all states in an irreducible Markov chain have the same period. Thus, we can speak of the period of a Markov chain if the Markov chain is irreducible.

An irreducible Markov chain is aperiodic if its period is $1$, and periodic otherwise. (A sufficient but not necessary condition for an irreducible chain to be aperiodic is that there exists a state $i$ such that $P(i,i) > 0$.)

One more concept needed for the Basic Limit Theorem: recurrence. We will begin by defining recurrence, then show that it is a class property. In particular, in an irreducible Markov chain, either all states are recurrent or all states are transient.

The idea of recurrence: a state $i$ is recurrent if, starting from the state $i$ at time 0, the chain is sure to return to $i$ eventually. More precisely, define the first hitting time $T_i$ of the state $i$ by

$$T_i = \inf\{n > 0 : X_n = i\}.$$

Recurrence: the state $i$ is recurrent if $\mathbb{P}_i\{T_i < \infty\} = 1$. If $i$ is not recurrent, it is called transient.

(note that accessibility could be defined equivalently: for distinct states $i \neq j$, $j$ is accessible from $i$ iff $\mathbb{P}_i\{T_j < \infty\} > 0$.)

THEOREM: Let $i$ be a recurrent state, and suppose that $j$ is accessible from $i$. Then all of the following hold: (i) $\mathbb{P}_j\{T_i < \infty\} = 1$, (ii) $\mathbb{P}_i\{T_j < \infty\} = 1$, and (iii) the state $j$ is recurrent.

We use the notation $N_i$ for the total number of visits of the Markov chain to the state $i$:

$$N_i = \sum_{n=0}^{\infty} I\{X_n = i\}.$$

THEOREM: The state $i$ is recurrent iff $\mathbb{E}_i(N_i) = \infty$.

COROLLARY: If $j$ is transient, then $\lim_{n \to \infty} P^n(i,j) = 0$ for all states $i$.

Introducing stationary distributions.

PROP: Suppose a Markov chain has a stationary distribution $\pi$. If the state $j$ is transient, then $\pi(j) = 0$.

COROLLARY: If an irreducible Markov chain has a stationary distribution, then the chain is recurrent.

Note that the converse of the above is not true. There are irreducible, recurrent Markov chains that do not have stationary distributions. For example, the simple symmetric random walk on the integers in one dimension is irreducible and recurrent but does not have a stationary distribution. By recurrence we have $\mathbb{P}_0\{T_0 < \infty\} = 1$, but also $\mathbb{E}_0(T_0) = \infty$. The name for this kind of recurrence is null recurrence: a state $i$ is null recurrent if it is recurrent and $\mathbb{E}_i(T_i) = \infty$. A recurrent state with $\mathbb{E}_i(T_i) < \infty$ is positive recurrent.

Positive recurrence is also a class property: if a chain is irreducible, the chain is either transient, null recurrent, or positive recurrent. In fact, an irreducible chain has a stationary distribution iff it is positive recurrent.

1.7 Coupling

Example of the coupling technique: consider a random graph on a given finite set of nodes, in which each pair of nodes is joined by an edge independently with probability $p$. We could simulate a random graph as follows: for each pair of nodes $i, j$ generate a random number $U_{ij} \sim U[0,1]$, and join nodes $i$ and $j$ with an edge if $U_{ij} \leq p$.

How do we show that the probability of the resulting graph being connected is nondecreasing in $p$, i.e., show that for $p_1 < p_2$,

$$\mathbb{P}_{p_1}\{\text{graph connected}\} \leq \mathbb{P}_{p_2}\{\text{graph connected}\}.$$

We could try to find an explicit function for the probability in terms of $p$, which seems inefficient. How to formalize the intuition that this seems obvious?

An idea: show that the corresponding events are ordered, i.e., if $A \subset B$ then $\mathbb{P}(A) \leq \mathbb{P}(B)$.

Let’s make 2 events by making 2 random graphs, $G_1, G_2$, on the same set of nodes. $G_1$ is constructed by having each possible edge appear with probability $p_1$, and for $G_2$, each edge is present with probability $p_2$. We can do this by using two sets of $U[0,1]$ random variables: $\{U_{ij}\}, \{V_{ij}\}$ for the first and second graph, respectively. Is it true that

$$\{G_1 \text{ connected}\} \subset \{G_2 \text{ connected}\}?$$

No, since the two sets of r.v.s are independently generated.

A change: use the same random numbers for each graph. Then

$$\{G_1 \text{ connected}\} \subset \{G_2 \text{ connected}\}$$

becomes true. This establishes monotonicity of the probability being connected.
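A sketch of this coupling in code (the node count, seed, and values $p_1, p_2$ are arbitrary choices): one shared array of uniforms $U_{ij}$ builds both graphs, so every edge of $G_1$ is also an edge of $G_2$ and the connectivity events are nested.

```python
# Build two random graphs from the SAME uniforms, so G1's edges are a subset of G2's.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

def connected(nodes, edges):
    """Simple graph-search connectivity check."""
    adj = {v: set() for v in nodes}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        v = stack.pop()
        for w in adj[v] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == len(nodes)

nodes = list(range(8))
pairs = list(combinations(nodes, 2))
U = rng.uniform(size=len(pairs))        # shared randomness for both graphs

p1, p2 = 0.2, 0.5                       # p1 < p2
G1 = [pair for pair, u in zip(pairs, U) if u <= p1]
G2 = [pair for pair, u in zip(pairs, U) if u <= p2]
# {G1 connected} implies {G2 connected}, since G1's edge set is contained in G2's.
print(connected(nodes, G1), connected(nodes, G2))
```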

Conclusion: what characterizes a coupling argument? Generally, we show that the same set of random variables can be used to construct two different objects about which we want to make a probabilistic statement.

1.8 Proof of Basic Limit Theorem

The Basic Limit Theorem says that if an irreducible, aperiodic Markov chain has a stationary distribution $\pi$, then for each initial distribution $\pi_0$, as $n \to \infty$ we have $\pi_n(i) \to \pi(i)$ for all states $i$.

(Note the wording “a stationary distribution”: if the Basic Limit Theorem is true, then an irreducible and aperiodic Markov chain cannot have two different stationary distributions.)

Equivalently, let’s define a distance between probability distributions, called “total variation distance:”

DEFINITION: Let $\lambda$ and $\mu$ be two probability distributions on the set $S$. Then the total variation distance $\|\lambda - \mu\|$ is defined by

$$\|\lambda - \mu\| = \sup_{A \subset S} [\lambda(A) - \mu(A)].$$

PROP: The total variation distance may also be expressed in the alternative forms

$$\|\lambda - \mu\| = \sup_{A \subset S} [\lambda(A) - \mu(A)] = \frac{1}{2} \sum_{i \in S} |\lambda(i) - \mu(i)| = 1 - \sum_{i \in S} \min\{\lambda(i), \mu(i)\}.$$

We now introduce the coupling method. Let $Y_0, Y_1, \ldots$ be a Markov chain with the same probability transition matrix as $X_0, X_1, \ldots$, but let $Y_0$ have the initial distribution $\pi$ and $X_0$ have the initial distribution $\pi_0$. Note that $\{Y_n\}$ is a stationary Markov chain, with $Y_n \sim \pi$ for all $n$. Let the $Y$ chain be independent of the $X$ chain.

We want to show that, for large $n$, the probabilistic behavior of $X_n$ is close to that of $Y_n$.

Define the coupling time $T$ to be the first time at which $X_n = Y_n$:

$$T = \inf\{n : X_n = Y_n\}.$$

LEMMA: For all $n$ we have

$$\|\pi_n - \pi\| \leq \mathbb{P}\{T > n\}.$$

Hence we need only show that $\mathbb{P}\{T > n\} \to 0$, or equivalently, that $\mathbb{P}\{T < \infty\} = 1$.

Consider the bivariate chain $\{Z_n = (X_n, Y_n) : n \geq 0\}$. $Z_0, Z_1, \ldots$ is clearly a Markov chain on the state space $S \times S$. Since the $X$ and $Y$ chains are independent, the probability transition matrix $P_Z$ of the chain $Z$ can be written

$$P_Z(i_x i_y, j_x j_y) = P(i_x, j_x)\, P(i_y, j_y).$$

$Z$ has stationary distribution

$$\pi_Z(i_x i_y) = \pi(i_x)\, \pi(i_y).$$

We want to show $\mathbb{P}\{T < \infty\} = 1$. So, in terms of the $Z$ chain, we want to show that with probability one, the $Z$ chain hits the “diagonal” $\{(j,j) : j \in S\}$ in $S \times S$ in finite time. To do this, it is sufficient to show that the $Z$ chain is irreducible and recurrent.

This is where we use aperiodicity.

LEMMA: Suppose $A$ is a set of positive integers that is closed under addition and has greatest common divisor one. Then there exists an integer $N$ such that $n \in A$ for all $n \geq N$.

Let $i \in S$ and recall the assumption that the chain is aperiodic. Since the set $\{n : P^n(i,i) > 0\}$ is closed under addition and, from aperiodicity, has greatest common divisor $1$, we can use the previous lemma. So $P^n(i,i) > 0$ for all sufficiently large $n$. From this, for any $i, j \in S$, since irreducibility implies $P^m(i,j) > 0$ for some $m$, it follows that $P^n(i,j) > 0$ for all sufficiently large $n$.

Now we show irreducibility of the $Z$ chain. Let $i_x, i_y, j_x, j_y \in S$. It is sufficient to show that $P^n_Z(i_x i_y, j_x j_y) > 0$ for some $n$. By the assumed independence of $\{X_n\}$ and $\{Y_n\}$, we have

$$P^n_Z(i_x i_y, j_x j_y) = P^n(i_x, j_x)\, P^n(i_y, j_y),$$

which (by the previous argument) is positive for all sufficiently large $n$. Finally, the irreducible $Z$ chain has the stationary distribution $\pi_Z$, so by the corollary in 1.6 it is recurrent, and we are done.

1.9 SLLN for Markov Chains

The usual Strong Law of Large Numbers for iid random variables says that if $X_1, X_2, \ldots$ are iid with mean $\mu$, then

$$\mathbb{P}\left\{\frac{1}{n}\sum_{t=1}^{n} X_t \to \mu \text{ as } n \to \infty\right\} = 1.$$

We will prove a generalization of this result for Markov chains: the fraction of time a Markov chain occupies a state converges to a limit.

Although the successive states of a Markov chain are not independent, certain features of a Markov chain are independent of each other. Here we will use the idea that the path of the chain consists of a succession of independent “cycles,” the segments of the path between successive visits to a recurrent state. This independence allows us to use the LLN that we already know.

THEOREM: Let $X_0, X_1, \ldots$ be a Markov chain starting in the state $X_0 = i$, and suppose that the state $i$ communicates with another state $j$. The limiting fraction of time that the chain spends in state $j$ is $1/\mathbb{E}_j T_j$. That is,

$$\mathbb{P}_i\left\{\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} I\{X_t = j\} = \frac{1}{\mathbb{E}_j T_j}\right\} = 1.$$

In the proof of the previous theorem, we define $V_n(j)$ as the number of visits to state $j$ made by $X_1, \ldots, X_n$, i.e.,

$$V_n(j) = \sum_{t=1}^{n} I\{X_t = j\}.$$

Using the Bounded Convergence Theorem, we have the following:

COROLLARY: For an irreducible Markov chain, we have

$$\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} P^t(i,j) = \frac{1}{\mathbb{E}_j T_j}$$

for all states $i$ and $j$.

Consider an irreducible, aperiodic Markov chain having a stationary distribution $\pi$. From the Basic Limit Theorem, we know that $P^n(i,j) \to \pi(j)$ as $n \to \infty$. Notice also that if a sequence of numbers converges to a limit, then the sequence of Cesàro averages converges to the same limit, i.e., if $a_t \to a$ as $t \to \infty$, then $(1/n)\sum_{t=1}^{n} a_t \to a$ as $n \to \infty$. On the other hand, the previous Corollary shows that the Cesàro averages of the $P^t(i,j)$'s converge to $1/\mathbb{E}_j T_j$. So, we must have

$$\pi(j) = \frac{1}{\mathbb{E}_j T_j}.$$

In fact, aperiodicity is not needed for this conclusion:

THEOREM: An irreducible, positive recurrent Markov chain has a unique stationary distribution $\pi$ given by

$$\pi(j) = \frac{1}{\mathbb{E}_j T_j}.$$
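A quick Monte Carlo sanity check of $\pi(j) = 1/\mathbb{E}_j T_j$ (the two-state matrix and its stationary distribution below are made-up illustration values, not from the notes): estimate the mean return time to a state $j$ by simulation and compare its reciprocal with $\pi(j)$.

```python
# Estimate E_j T_j by simulating return times to j; its reciprocal should match pi(j).
import numpy as np

rng = np.random.default_rng(2)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2/3, 1/3])          # stationary distribution of this P (solves pi = pi P)
j = 1

def return_time(j):
    x, t = j, 0
    while True:
        x = rng.choice(2, p=P[x])
        t += 1
        if x == j:
            return t

times = [return_time(j) for _ in range(20000)]
print(1 / np.mean(times), pi[j])   # both close to 1/3
```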

1.10 Exercises

Ch. 2 – Markov Chains: Examples & Applications

2.1 Branching Processes

Motivating example: the branching process model was formulated by Galton, who was interested in the survival and extinction of family names.

Suppose children inherit their fathers’ names, so we need only to keep track of fathers and sons. Consider a male who is the only member of his generation to have a given family name, so that the responsibility of keeping the family name alive falls upon him–if his line of male descendants terminates, so does the family name.

Suppose for simplicity that each male has probability $f(0)$ of producing no sons, $f(1)$ of producing 1 son, and so on.

What is the probability that the family name eventually becomes extinct?

To formalize this, let $G_t$ denote the number of males in generation $t$ who carry the family name, with $G_0 = 1$.

We are interested in the extinction probability $\rho = \mathbb{P}_1\{G_t = 0 \text{ for some } t\}$.

$\{G_t : t \geq 0\}$ is a Markov chain. State 0 is absorbing. So, for each $i > 0$, since $\mathbb{P}_i\{G_1 = 0\} = (f(0))^i > 0$, the state $i$ must be transient.

Consequently, with probability 1, each state $i > 0$ is visited only a finite number of times. So, with probability 1, the chain must get absorbed at 0 or approach $\infty$.

We can obtain an equation for $\rho$ by conditioning on what happens at the first step of the chain:

$$\rho = \sum_{k=0}^{\infty} \mathbb{P}\{G_1 = k \mid G_0 = 1\}\, \mathbb{P}\{\text{eventual extinction} \mid G_1 = k\}.$$

Since the males all have sons independently,

$$\mathbb{P}\{\text{eventual extinction} \mid G_1 = k\} = \rho^k.$$

Thus, we have

$$\rho = \sum_{k=0}^{\infty} f(k)\, \rho^k = \psi(\rho).$$

For each distribution $f$ there is a corresponding function of $\rho$, denoted $\psi$. So the extinction probability $\rho$ is a fixed point of $\psi$.

$\psi$ is the probability generating function of the probability mass function $f$. The first two derivatives are

$$\psi'(z) = \sum_{k=1}^{\infty} k f(k) z^{k-1}, \qquad \psi''(z) = \sum_{k=2}^{\infty} k(k-1) f(k) z^{k-2}$$

for $z \in (0,1)$. Since these are positive, the function $\psi$ is strictly increasing and convex on $(0,1)$. Also note that $\psi(0) = f(0)$ and $\psi(1) = 1$. Finally, $\psi'(1) = \sum k f(k) = \mu$, where $\mu = E(X)$ is the expected number of sons for each male.

notes

Since $\psi(1) = 1$, there is always a trivial solution at $\rho = 1$. When $\mu \leq 1$, this trivial solution is the only solution, so $\rho = 1$.

When $\mu > 1$, we define $r$ to be the smaller solution of $\psi(r) = r$. Since $\psi(\rho) = \rho$, we know that $\rho$ must be either $r$ or 1. We want to show that $\rho = r$.

Defining $p_t = \mathbb{P}_1\{G_t = 0\}$, observe that as $t \to \infty$,

$$p_t \uparrow \mathbb{P}_1\left[\bigcup_{n=1}^{\infty} \{G_n = 0\}\right] = \rho.$$

Therefore, to rule out $\rho = 1$, it is sufficient to prove that

$$p_t \leq r \text{ for all } t.$$

We prove this by induction. First, $p_0 = 0 \leq r$. Next,

$$p_{t+1} = \mathbb{P}_1\{G_{t+1} = 0\} = \sum_{i=0}^{\infty} \mathbb{P}_1\{G_1 = i\}\, \mathbb{P}_1\{G_{t+1} = 0 \mid G_1 = i\} = \sum_{i=0}^{\infty} f(i)\,(p_t)^i,$$

i.e., $p_{t+1} = \psi(p_t)$. Since $\psi$ is increasing and, by the induction hypothesis, $p_t \leq r$, we have

$$p_{t+1} = \psi(p_t) \leq \psi(r) = r.$$

So $p_t \leq r$ for all nonnegative $t$, which proves $\rho = r$. Hence there is a positive probability that the family name goes on forever.
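The proof also suggests a direct way to compute $\rho$ numerically: iterate $p_{t+1} = \psi(p_t)$ starting from $p_0 = 0$. The offspring distribution $f$ below is a made-up example with $\mu = 1.25 > 1$, for which the smaller fixed point is $r = 1/2$.

```python
# Compute the extinction probability rho by iterating p_{t+1} = psi(p_t) from p_0 = 0.
f = {0: 0.25, 1: 0.25, 2: 0.5}             # hypothetical offspring pmf, mean mu = 1.25

def psi(z):
    """Probability generating function of f."""
    return sum(prob * z**k for k, prob in f.items())

p = 0.0                                     # p_0 = 0
for t in range(200):
    p = psi(p)                              # p_{t+1} = P_1{G_{t+1} = 0}
print(p)                                    # converges to rho = 0.5 for this f
```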

2.2 Time Reversibility

2.3 More on Time Reversibility: A Tandem Queue Model

2.4 The Metropolis Method

2.5 Simulated Annealing

2.6 Ergodicity Concepts

In this section we focus on a time-inhomogeneous Markov chain $\{X_n\}$ on a countably infinite state space $S$.

Let $P_n$ denote the probability transition matrix governing the transition from $X_n$ to $X_{n+1}$, i.e.,

$$P_n(i,j) = \mathbb{P}\{X_{n+1} = j \mid X_n = i\}.$$

For $m < n$, define $P^{(m,n)} = \prod_{k=m}^{n-1} P_k$, so that

$$P^{(m,n)}(i,j) = \mathbb{P}\{X_n = j \mid X_m = i\}.$$

DEFINITION: $\{X_n\}$ is strongly ergodic if there exists a probability distribution $\pi^*$ on $S$ such that

$$\lim_{n \to \infty} \sup_{i \in S} \| P^{(m,n)}(i, \cdot) - \pi^* \| = 0 \quad \forall m.$$

DEFINITION: $\{X_n\}$ is weakly ergodic if

$$\lim_{n \to \infty} \sup_{i,j \in S} \| P^{(m,n)}(i, \cdot) - P^{(m,n)}(j, \cdot) \| = 0 \quad \forall m.$$

We can understand weak ergodicity as something of a “loss of memory” concept. It says that at a large enough time $n$, the chain has nearly forgotten its state at time $m$, in the sense that the distribution at time $n$ would be nearly the same no matter what the state was at time $m$. However, there is no requirement that the distribution be converging to anything as $n \to \infty$. The concept that incorporates convergence in addition to loss of memory is strong ergodicity.

2.6.1 The Ergodic Coefficient

For a probability transition matrix $P = (P(i,j))$, the ergodic coefficient $\delta(P)$ of $P$ is defined to be the maximum total variation distance between pairs of rows of $P$, that is,

DEFINITION: The ergodic coefficient $\delta(P)$ of a probability transition matrix $P$ is

$$\begin{aligned} \delta(P) &= \sup_{i,j \in S} \| P(i, \cdot) - P(j, \cdot) \| \\ &= \frac{1}{2} \sup_{i,j \in S} \sum_{k \in S} | P(i,k) - P(j,k) | \\ &= \sup_{i,j \in S} \sum_{k \in S} \left( P(i,k) - P(j,k) \right)^+. \end{aligned}$$

The basic idea is that $\delta(P)$ being small is “good” for ergodicity. For example, in the extreme case of $\delta(P) = 0$, all the rows of $P$ are identical, so $P$ would cause a Markov chain to lose its memory completely in just one step: $v_1 = v_0 P$ does not depend on $v_0$.
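A small sketch of computing $\delta(P)$ for a finite transition matrix, using the middle expression above, $\frac{1}{2}\sup_{i,j}\sum_k |P(i,k) - P(j,k)|$ (the matrix is an arbitrary example):

```python
# Ergodic coefficient: half the largest L1 distance between two rows of P.
import numpy as np

def ergodic_coefficient(P):
    P = np.asarray(P, dtype=float)
    row_l1 = np.abs(P[:, None, :] - P[None, :, :]).sum(axis=2)  # pairwise row distances
    return 0.5 * row_l1.max()

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(ergodic_coefficient(P))   # 0.5 * (|0.9-0.2| + |0.1-0.8|) = 0.7
```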

LEMMA: $\delta(PQ) \leq \delta(P)\,\delta(Q)$ for probability transition matrices $P, Q$.

2.6.2 Sufficient Conditions for Weak and Strong Ergodicity

Sufficient conditions are given in the next two propositions:

PROP: If there exist $n_0 < n_1 < n_2 < \cdots$ such that $\sum_k [\,1 - \delta(P^{(n_k, n_{k+1})})\,] = \infty$, then $\{X_n\}$ is weakly ergodic.

PROP: If $\{X_n\}$ is weakly ergodic and if there exist $\pi_0, \pi_1, \ldots$ such that $\pi_n$ is a stationary distribution for $P_n$ for all $n$ and $\sum_n \|\pi_n - \pi_{n+1}\| < \infty$, then $\{X_n\}$ is strongly ergodic. In that case, the distribution $\pi^*$ in the definition is given by $\pi^* = \lim_{n \to \infty} \pi_n$.

2.7 Proof of Main Theorem of Simulated Annealing

2.8 Card Shuffling

We have seen that for an irreducible, aperiodic Markov chain $\{X_n\}$ having stationary distribution $\pi$, the distribution $\pi_n$ of $X_n$ converges to $\pi$ in the total variation distance. An example of using this would be generating a nearly uniformly distributed $4 \times 4$ table with given row and column sums by simulating a certain Markov chain for a long enough time. The question is: how long is long enough?

In certain simple Markov chain examples, it is easy to figure out the rate of convergence of $\pi_n$ to $\pi$. In this section we will concentrate on a simple shuffling example considered by Aldous and Diaconis in their article “Shuffling cards and stopping times.” The basic question is: how close is the deck to being “random” (uniformly distributed over the $52!$ possible permutations) after $n$ shuffles? For the riffle shuffle model, the answer is “about 7.”

2.8.1 “Top in at random” Shuffle

The “top in at random” method consists of taking the top card off the deck and then inserting it back into the deck at a random position (this includes back on top). So, altogether, there are 52 equally likely positions. Repeated performance of this shuffle on a deck of cards produces a sequence of “states” of the deck. This sequence of states forms a Markov chain with state space $S_{52}$, the group of permutations of the cards. This Markov chain is irreducible, aperiodic, and has stationary distribution $\pi = \text{Uniform}$ on $S_{52}$ (i.e., probability $1/52!$ for each permutation). Therefore, by the Basic Limit Theorem, we may conclude that $\|\pi_n - \pi\| \to 0$ as $n \to \infty$.

2.8.2 Threshold Phenomenon

Suppose we are working with a fresh deck of $d$ cards in the original order (card 1 on top, card 2 under, etc.). Then $\|\pi_0 - \pi\| = 1 - (1/d!)$. We also know that $\|\pi_n - \pi\| \to 0$ as $n \to \infty$ from the Basic Limit Theorem. It is natural to assume that the distance from stationarity decreases to 0 in a smooth manner; however, it actually experiences what we call the “threshold phenomenon.” An abrupt change happens in a relatively small neighborhood of the value $n = d \log d$. That is, for large $d$ the graph of $\|\pi_n - \pi\|$ versus $n$ looks like the following picture.

notes

The larger the value of the deck size $d$, the sharper (relative to $d \log d$) the drop is near $n = d \log d$.

2.8.3 A random time to exact stationarity

Let’s give a name to each card in the deck (i.e., say the 2 of hearts is card 1, the 3 of hearts is card 2, etc.). Suppose we start with the deck in pristine order (card 1 on top, then card 2, etc.). Though $\pi_n$ will never become exactly uniform, it is possible to find a random time $T$ at which the deck becomes exactly uniformly distributed, that is, $X_T \sim \text{Unif}(S_{52})$.

Here is an example of such a random time. Suppose that “card $i$” always refers to the same card (like, say, card 52, the ace of spades), whereas “top card,” “card in position 2,” etc. refer to whatever card happens to be in that position at the time of consideration. Also note that we may describe a sequence of shuffles simply by a sequence of iid random variables $U_1, U_2, \ldots$ uniformly distributed on $\{1, 2, \ldots, 52\}$: just say that the $i$-th shuffle moves the top card to position $U_i$. Define the following random times:

$$\begin{aligned} T_1 &= \inf\{n : U_n = 52\} = \text{ 1st time a top card goes below card 52} \\ T_2 &= \inf\{n > T_1 : U_n \geq 51\} = \text{ 2nd time a top card goes below card 52} \\ T_3 &= \inf\{n > T_2 : U_n \geq 50\} = \text{ 3rd time a top card goes below card 52} \\ &\;\;\vdots \\ T_{51} &= \inf\{n > T_{50} : U_n \geq 2\} = \text{ 51st time a top card goes below card 52} \end{aligned}$$

and

$$T = T_{52} = T_{51} + 1.$$

It is not hard to see that $T$ has the desired property and that $X_T$ is uniformly distributed. To understand this, start with $T_1$. At time $T_1$, we know that some card is below card 52; we don’t know which card, but that will not matter. After time $T_1$ we continue to shuffle until $T_2$, at which time another card goes below card 52. At time $T_2$, there are 2 cards below card 52. Again, we do not know which cards they are, but conditional on which 2 cards are below card 52, each of the two possible orderings of those 2 cards is equally likely. Similarly, we continue to shuffle until time $T_3$, at which time there are some 3 cards below card 52, and, whatever those 3 cards are, each of their $3!$ possible relative positions in the deck is equally likely. And so on. At time $T_{51}$, card 52 has risen all the way up to become the top card, and the other 51 cards are below card 52 (now we do know which cards they are), and those 51 cards are in random positions (i.e., uniform over $51!$ possibilities). Now all we have to do is shuffle one more time to get card 52 into a random position, so that at time $T = T_{52} = T_{51} + 1$, the whole deck is random.

Let us find $ET$. By the definitions above, $T_1 \sim \text{Geom}(1/52)$, $(T_2 - T_1) \sim \text{Geom}(2/52)$, ..., $(T_{51} - T_{50}) \sim \text{Geom}(51/52)$, $(T_{52} - T_{51}) \sim \text{Geom}(52/52)$. So

$$ET = E(T_1) + E(T_2 - T_1) + \cdots + E(T_{52} - T_{51}) = \frac{52}{1} + \frac{52}{2} + \cdots + \frac{52}{52} \approx 52 \log 52.$$

Analogously, if the deck had $d$ cards rather than 52, we would have obtained $ET \sim d \log d$ (for large $d$), where $T$ is now a random time at which the whole deck of $d$ cards becomes uniformly distributed on $S_d$.
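Since $E(T_k - T_{k-1}) = 52/k$ for a Geometric$(k/52)$ waiting time, the sum above equals $52\,(1 + 1/2 + \cdots + 1/52)$, i.e., 52 times the harmonic number $H_{52}$. A one-line computation compares it with $52 \log 52$:

```python
# ET = d * H_d (harmonic sum), compared with d log d for d = 52.
import math

d = 52
ET = sum(d / k for k in range(1, d + 1))   # 52/1 + 52/2 + ... + 52/52
print(ET, d * math.log(d))                 # roughly 236 vs roughly 205
```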

2.8.4 Strong Stationary Times

As we have observed, the random variable $T$ that we just constructed has the property that $X_T \sim \pi$. $T$ also has two other important properties. First, $X_T$ is independent of $T$. Second, $T$ is a stopping time, that is, for each $n$, one can determine whether or not $T = n$ just by looking at the values of $X_0, \ldots, X_n$. In particular, to determine whether or not $T = n$ it is not necessary to know any future values $X_{n+1}, X_{n+2}, \ldots$.

DEFINITION: A random variable $T$ satisfying (i) $T$ is a stopping time, (ii) $X_T \sim \pi$, and (iii) $X_T$ is independent of $T$, is called a strong stationary time.

What’s so good about strong stationary times?

LEMMA: If $T$ is a strong stationary time for the Markov chain $\{X_n\}$, then $\|\pi_n - \pi\| \leq \mathbb{P}\{T > n\}$ for all $n$.

This tells us that strong stationary times satisfy the same inequality derived for coupling times.

2.8.5 Proof of threshold phenomenon in shuffling

Let $\Delta(n)$ denote $\|\pi_n - \pi\|$. The proof that the threshold phenomenon occurs in the top-in-at-random shuffle consists of two parts. Roughly speaking, the first part shows that $\Delta(n)$ is close to 0 for $n$ slightly larger than $d \log d$, and the second part shows that $\Delta(n)$ is close to 1 for $n$ slightly smaller than $d \log d$, where in both cases the meaning of “slightly” is “small relative to $d \log d$.”

Part 1 is addressed in the following result:

THEOREM: For $T$ defined as in the random time discussion above, we have

$$\Delta(d \log d + cd) \leq \mathbb{P}\{T > d \log d + cd\} \leq e^{-c}$$

for all $c \geq 0$.

Note that for each fixed $c$, $cd$ is small relative to $d \log d$ if $d$ is large enough. FINISH THIS SECTION

Ch. 3 – MRFs and HMMs

This section looks at aspects of Markov random fields (MRF’s), hidden Markov models (HMM’s), and their applications.

3.1 MRFs on Graphs and HMMs

A stochastic process is a collection of random variables $\{X_t : t \in T\}$ indexed by some subset $T$ of the real line $\mathbb{R}$. The elements of $T$ are often interpreted as times, in which case $X_t$ represents the state at time $t$ of the random process under consideration. The term random field refers to a generalization of the notion of a stochastic process: a random field $\{X_s : s \in G\}$ is still a collection of random variables, but now the index set $G$ need not be a subset of $\mathbb{R}$. For example, $G$ could be a subset of the plane $\mathbb{R}^2$. In this section, we’ll consider $G$ as the set of nodes of a graph (the set being at most countable). Important aspects of the dependence among the random variables will be determined by the edges of the graph through a generalization of the Markov property.

NOTATION: Given a graph $G$, we say two nodes $s, t$ are neighbors, denoted $s \sim t$, if they are joined by an edge of the graph. We do not consider a node to be a neighbor of itself. $N(t)$ denotes the set of neighbors of $t$.

DEFINITION: Suppose we are given a graph $G$ with nodes $\{1, \ldots, n\}$ and a neighborhood structure $N(t)$. The collection of random variables $(X_1, \ldots, X_n)$ is a Markov random field on $G$ if

$$\mathbb{P}\{X_t = x_t \mid X_s = x_s \text{ for } s \neq t\} = \mathbb{P}\{X_t = x_t \mid X_s = x_s \text{ for } s \in N(t)\}$$

for all nodes $t \in \{1, \ldots, n\}$.

More compact notation: for a subset of nodes $A \subset G$, let $x_A$ be the vector $(x_s : s \in A)$. We will also write $p(x_t \mid x_{N(t)})$ for $\mathbb{P}\{X_t = x_t \mid X_s = x_s \text{ for } s \in N(t)\}$. HMM’s are Markov random fields in which some of the random variables are observable and others are not. We adopt $X$ for hidden random variables and $Y$ for observed random variables.

3.2 Bayesian Framework

What do we get out of these models & how can we use them? One approach is Bayesian: HMM’s fit nicely in the Bayesian framework. $X$ is the object of interest; it is unknown. For example, in modeling a noisy image, $X$ could be the true image. We consider the unknown $X$ to be random, and we assume it has a certain prior distribution. This distribution, our probabilistic model for $X$, is assumed to be a Markov random field. We also postulate a certain probabilistic model for $Y$ conditional on $X$. This conditional distribution of $Y$ given $X$ reflects our ideas about the noise or blurring or whatever transforms the hidden true image $X$ into the image $Y$ that we observe.

Given our assumed prior distribution of $X$ and conditional distribution of $(Y \mid X)$, Bayes’ formula gives the posterior distribution of $(X \mid Y)$. Thus, given an observed value $Y = y$, in principle we get a posterior distribution $\mathbb{P}\{X = \cdot \mid Y = y\}$ over all possible true images, so that (again, in principle) we could make a variety of reasonable choices of our estimator of $X$. For example, we could choose the $x$ that maximizes $\mathbb{P}\{X = x \mid Y = y\}$. This is called MAP estimation, where MAP stands for “maximum a posteriori”.

3.3 Hammersley - Clifford Theorem

How do we specify a Markov random field? By analogy with the case of Markov chains, we might want to specify it in terms of conditional distributions. The following example suggests why this approach goes wrong.

EXAMPLE: Suppose we are designing a Markov random field for images on the 3x3 lattice:

notes

For each pixel, let us specify the conditional distribution of its color given the colors of its neighbors. Suppose there are two colors, 0 and 1. But it is possible to specify conditional distributions for each pixel that simply don’t work together, i.e., there might be no joint distribution having the given conditional distributions.

In general, we can’t expect to specify a full set of conditional distributions as above. Fortunately, the Hammersley-Clifford Theorem says that a random field’s having the Markov property is equivalent to its having a Gibbs distribution, which is a friendly sort of distribution. Thus, instead of worrying about specifying our MRF’s in terms of consistent conditional distributions, we can just consider Gibbs distributions, which are simple to write down and work with.

Some definitions needed to state HC:

DEF: A set of nodes $C$ is complete if all distinct nodes in $C$ are neighbors of each other. A clique is a maximal complete set of nodes.

DEF: Let $G$ be a finite graph. A Gibbs distribution with respect to $G$ is a probability mass function that can be expressed in the form

$$p(x) = \prod_{C \text{ complete}} V_C(x),$$

where each $V_C$ is a function that depends only on the values $x_C = (x_s : s \in C)$ of $x$ at the nodes in $C$. That is, the function $V_C$ satisfies $V_C(x) = V_C(y)$ if $x_C = y_C$.

By combining functions $V_C$ for sets $C$ that are subsets of the same clique, we see that we can further reduce the product in the definition of a Gibbs distribution to

$$p(x) = \prod_{C \text{ a clique}} V_C(x).$$

THM (HAMMERSLEY-CLIFFORD): Suppose that $X = (X_1, \ldots, X_n)$ has a positive joint probability mass function. Then $X$ is a Markov random field on $G$ iff $X$ has a Gibbs distribution with respect to $G$.

EXAMPLE: A Markov chain $X_0, \ldots, X_n$ has a joint distribution of the form $p(x_0, x_1, \ldots, x_n) = \pi_0(x_0) P_1(x_0, x_1) P_2(x_1, x_2) \cdots P_n(x_{n-1}, x_n)$.

By defining $V_{\{0,1\}}(x_0, x_1) = \pi_0(x_0) P_1(x_0, x_1)$ and $V_{\{k-1,k\}}(x_{k-1}, x_k) = P_k(x_{k-1}, x_k)$ for $k > 1$, we see that this product is a Gibbs distribution on the graph

(x_0) -- (x_1) -- (x_2) -- ... -- (x_n)

PROP: Suppose $(X, Y)$ is a Markov random field on the graph $G$ with the neighborhood structure $N$. Write $G = G_X \cup G_Y$, where $G_X$ and $G_Y$ are the sets of nodes in $G$ corresponding to the $X$ and $Y$ random variables, respectively. Then the marginal distribution of $Y$ is a Markov random field on $G_Y$, where two nodes $y_1, y_2 \in G_Y$ are neighbors if either

The conditional distribution of $X$ given $Y$ is a Markov random field on the graph $G_X$, where nodes $x_1$ and $x_2$ are neighbors if $x_1 \sim x_2$, that is, if $x_1, x_2$ were neighbors in the original graph.

3.4 Long range dependence in the Ising model

For this section, we will work in $\mathbb{Z}^d$. For each $t \in \mathbb{Z}^d$ there is a binary random variable $X_t$ taking values in $\{-1, 1\}$. The Ising model gives a joint probability distribution for these random variables.

We consider a special case of the Ising model as follows.

For $x$ a configuration of $+1$’s and $-1$’s at the nodes of a finite subset of $\mathbb{Z}^d$, let $b(x)$ denote the number of “odd bonds” in $x$, that is, the number of edges $\{t, u\}$ such that $x_t \neq x_u$. Then, under the Ising model, a configuration $x$ has a probability proportional to $\alpha^{b(x)}$, where $\alpha$ is a positive parameter of the distribution (typically $< 1$). The choice $\alpha = 1$ corresponds to the uniform distribution, giving equal probability to all configurations. Distributions with small $\alpha$ strongly discourage odd bonds, placing large probability on configurations with few odd bonds.

For the case $d = 1$, the model corresponds to a stationary Markov chain with probability transition matrix

$$P_\alpha = \begin{pmatrix} 1/(1+\alpha) & \alpha/(1+\alpha) \\ \alpha/(1+\alpha) & 1/(1+\alpha) \end{pmatrix}.$$

The Basic Limit Theorem tells us

$$P_\alpha^n \to \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix} \text{ as } n \to \infty.$$

Thus, the state $X_0$ is asymptotically independent of information at nodes far away from 0.
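A quick numerical illustration of this convergence (the value $\alpha = 0.3$ is an arbitrary choice):

```python
# Powers of P_alpha approach the matrix with all entries 1/2.
import numpy as np

alpha = 0.3
P = np.array([[1, alpha],
              [alpha, 1]]) / (1 + alpha)

print(np.linalg.matrix_power(P, 5))
print(np.linalg.matrix_power(P, 50))   # essentially [[0.5, 0.5], [0.5, 0.5]]
```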

We will show that the situation in $d \geq 2$ is qualitatively different, in that the effect of states at remote nodes does not disappear in 2 dimensions.

To state the result, some notation:

Imagine a “cube” $K_n$ of side length $2n$ centered at 0 in $\mathbb{Z}^d$, consisting of lattice points whose $d$ coordinates all lie between $-n$ and $n$:

$$K_n = \{t \in \mathbb{Z}^d : |t_i| \leq n \text{ for all } i = 1, \ldots, d\}.$$

Let $B_n$ denote the boundary points of the cube $K_n$, that is, the points having at least one coordinate equal to $n$ in absolute value.

Let $P^{(n)}_+\{X = x\}$ denote the Ising probability of $X = x$, conditional on $X_t = +1$ for all $t \in B_n$. Similarly, let $P^{(n)}_-\{X = x\}$ denote probabilities conditional on $-1$’s on the boundary.

THEOREM: For the Ising model on $\mathbb{Z}^2$, the effect of the boundary does not disappear. In particular, there exists $\alpha$ such that $P^{(n)}_+\{X_0 = -1\}$ remains below $0.4$ for all $n$, no matter how large.

3.5 Hidden Markov Chains

3.5.1 Model overview

The hidden Markov chain model is an MRF in which some of the random variables are observed and others are not (hence hidden). In the graph structure for the hidden Markov chain, the hidden chain is $X_0, X_1, \ldots, X_n$ and the observed process is $Y_0, Y_1, \ldots, Y_n$.

notes

Edges join $X_t$ to both $Y_t$ and $X_{t+1}$. The model is parametrized by a marginal distribution $\zeta$ of $X_0$ and, if we assume time-homogeneity of the transition matrices, by two transition matrices $A$ and $B$, where $A(i,j) = \mathbb{P}\{X_{t+1} = j \mid X_t = i\}$ and $B(i,j) = \mathbb{P}\{Y_t = j \mid X_t = i\}$.

Let us write $\theta = (\zeta, A, B)$ for the vector of all parameters. If there are $u$ states possible for each of the hidden $X$ random variables and $v$ outcomes possible for the observed $Y$ random variables, then $\zeta$ is a vector of $u$ probabilities, $A$ is a $u \times u$ probability transition matrix, and $B$ is a $u \times v$ matrix, each of whose rows is a probability mass function.

The hidden Markov chain is pretty general. Examples of cases include

3.5.2 How to calculate likelihoods

The likelihood function $L = L(\theta)$ is the probability of the observed data, as a function of the parameters of the model. The tricky aspect is that we observe only the $Y$’s, so that

$$L(\theta) = p_\theta(y_0, \ldots, y_n) = \sum_{x_0} \sum_{x_1} \cdots \sum_{x_n} p_\theta(x_0, \ldots, x_n, y_0, \ldots, y_n) = \sum_x p_\theta(x, y).$$

This is a large sum. Even if the size of the state space of the hidden variables is just $u = 2$, the sum still has $2^{n+1}$ terms. Without a way around this computational issue, the hidden Markov chain model would be of little practical use. However, it turns out that we can do these calculations in time linear in $n$.

We will denote the state space for the hidden variables $X_t$ by $\mathcal{X}$.

We are thinking of the observed $Y$ values as fixed here; we know them, and will denote them by $y_0, \ldots, y_n$. For each $t = 0, 1, \ldots, n$ and for each $x_t \in \mathcal{X}$, define

$$\alpha_t(x_t) = p_\theta(y_0, \ldots, y_t, x_t).$$

We can calculate the function $\alpha_0$ immediately:

$$\alpha_0(x_0) = p_\theta(y_0, x_0) = \zeta(x_0) B(x_0, y_0).$$

Note the simple recursion that expresses $\alpha_{t+1}$ in terms of $\alpha_t$:

$$\begin{aligned} \alpha_{t+1}(x_{t+1}) &= p_\theta(y_0, \ldots, y_{t+1}, x_{t+1}) \\ &= \sum_{x_t \in \mathcal{X}} p_\theta(y_0, \ldots, y_t, x_t, x_{t+1}, y_{t+1}) \\ &= \sum_{x_t \in \mathcal{X}} p_\theta(y_0, \ldots, y_t, x_t)\, p_\theta(x_{t+1} \mid x_t)\, p_\theta(y_{t+1} \mid x_{t+1}) \\ &= \sum_{x_t \in \mathcal{X}} \alpha_t(x_t) A(x_t, x_{t+1}) B(x_{t+1}, y_{t+1}). \end{aligned}$$

The sum in the recursion

$$\sum_{x_t \in \mathcal{X}} \alpha_t(x_t) A(x_t, x_{t+1}) B(x_{t+1}, y_{t+1})$$

is fairly modest in comparison. Using the recursion to calculate the function $\alpha_{t+1}$ from $\alpha_t$ involves a fixed amount of work, and the task gets no harder as $t$ increases. Thus, the amount of work to calculate all of the probabilities $\alpha_t(x_t)$ for $t = 0, \ldots, n$ and $x_t \in \mathcal{X}$ is linear in $n$.

Having completed the recursion to calculate the function $\alpha_n$, the likelihood is simply

$$L(\theta) = p_\theta(y_0, \ldots, y_n) = \sum_{x_n} p_\theta(y_0, \ldots, y_n, x_n) = \sum_{x_n} \alpha_n(x_n).$$

The above probabilities are called “forward” probabilities. In a similar manner, we can calculate the “backward” probabilities

$$\beta_t(x_t) = p_\theta(y_{t+1}, \ldots, y_n \mid x_t) = \mathbb{P}_\theta\{Y_{t+1} = y_{t+1}, \ldots, Y_n = y_n \mid X_t = x_t\}$$

by using the recursion

$$\beta_{t-1}(x_{t-1}) = \sum_{x_t} A(x_{t-1}, x_t) B(x_t, y_t) \beta_t(x_t).$$
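A sketch of the forward recursion in code; the two-state hidden chain, observation matrix, and observed sequence below are made-up illustration values, not from the notes.

```python
# Forward recursion for the likelihood of an observed sequence under an HMM.
import numpy as np

zeta = np.array([0.6, 0.4])             # distribution of X_0
A = np.array([[0.7, 0.3],               # A[i, j] = P(X_{t+1} = j | X_t = i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],               # B[i, j] = P(Y_t = j | X_t = i)
              [0.2, 0.8]])
y = [0, 1, 1, 0, 1]                     # observed values y_0, ..., y_n

alpha = zeta * B[:, y[0]]               # alpha_0(x_0) = zeta(x_0) B(x_0, y_0)
for t in range(1, len(y)):
    # alpha_t(x_t) = sum_{x_{t-1}} alpha_{t-1}(x_{t-1}) A(x_{t-1}, x_t) B(x_t, y_t)
    alpha = (alpha @ A) * B[:, y[t]]

likelihood = alpha.sum()                # L(theta) = sum over x_n of alpha_n(x_n)
print(likelihood)
```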

3.5.3 Maximum Likelihood and the EM algorithm

Without an organized method for changing $\theta = (\zeta, A, B)$, it is hard to find the $\theta$ that maximizes the likelihood. The EM algorithm is a method for finding maximum likelihood estimates that is applicable to many statistical problems, including hidden Markov chains.

PROP: For $p = (p_1, \ldots, p_k)$ and $q = (q_1, \ldots, q_k)$ two probability mass functions on $\{1, \ldots, k\}$, we have

$$\sum_i p_i \log p_i \geq \sum_i p_i \log q_i.$$

A brief description of the EM algorithm: we’ll assume $X, Y$ are discrete random variables or vectors, so probabilities are given by sums rather than integrals.

We want to find the $\theta$ maximizing $\log L(\theta)$, where $L(\theta) = p_\theta(y) = \sum_x p_\theta(x, y)$. The EM algorithm repeats the following update, which is guaranteed to increase the likelihood at each iteration.

Let $\theta_0$ denote the current value of $\theta$. Replace $\theta_0$ by $\theta_1$, the value of $\theta$ that maximizes

$$\mathbb{E}_{\theta_0}\left[\log p_\theta(X, y) \mid Y = y\right].$$

Why does it work? For a given $\theta_0$, define $g(\theta) = \mathbb{E}_{\theta_0}[\log p_\theta(X, y) \mid Y = y]$. We will see that, in order to have $p_{\theta_1}(y) > p_{\theta_0}(y)$, we do not need to find $\theta_1$ maximizing $g$, but rather it is enough to find a $\theta_1$ such that $g(\theta_1) > g(\theta_0)$.

PROP: If $\mathbb{E}_{\theta_0}[\log p_{\theta_1}(X, y) \mid Y = y] > \mathbb{E}_{\theta_0}[\log p_{\theta_0}(X, y) \mid Y = y]$, then $p_{\theta_1}(y) > p_{\theta_0}(y)$.

3.5.4 Applying the EM algorithm to a hidden Markov chain

Consider a hidden Markov chain for $(X, Y) = (X_0, \ldots, X_n, Y_0, \ldots, Y_n)$, where the r.v.s $Y_t$ are observed and the r.v.s $X_t$ are hidden. The model is parametrized by $\theta = (\zeta, A, B)$, with $\zeta$ the distribution of $X_0$, $A$ the probability transition matrix for the $X$ chain, and $B$ the probability transition matrix from $X_t$ to $Y_t$.

To describe one iteration of the EM method, imagine our current guess for $\theta$ is $\theta_0 = (\zeta_0, A_0, B_0)$. We want a new guess $\theta_1 = (\zeta_1, A_1, B_1)$ that has higher likelihood.

FINISH THIS SECTION PG 107…

In summary, to do EM on the hidden Markov chain model:

$$\begin{aligned} \zeta_1(i) &= \sum_j \gamma_0(i,j) \\ A_1(i,j) &= \frac{\sum_{t=0}^{n-1} \gamma_t(i,j)}{\sum_l \sum_{t=0}^{n-1} \gamma_t(i,l)} \\ B_1(i,j) &= \frac{\sum_{t=0}^{n-1} \sum_l \gamma_t(i,l)\, I(y_t = j) + \sum_m \gamma_{n-1}(m,i)\, I(y_n = j)}{\sum_{t=0}^{n-1} \sum_l \gamma_t(i,l) + \sum_m \gamma_{n-1}(m,i)}. \end{aligned}$$

3.6 Gibbs Sampler

Terms like “Markov chain Monte Carlo” and “Markov sampling” refer to methods for generating random samples from given distributions by running Markov chains. Although such methods have quite a long history, they have become the subject of renewed interest in the last decade, particularly with the introduction of the “Gibbs sampler” by Geman and Geman (1984), who used the method in a Bayesian approach to image reconstruction. The Gibbs sampler itself has enjoyed a recent surge of intense interest within the statistics community, spurred by Gelfand and Smith (1990), who applied the Gibbs sampler to a wide variety of inference problems.

Recall that a distribution $\pi$ being “stationary” for a Markov chain $X_0, X_1, \ldots$ means that, if $X_0 \sim \pi$, then $X_n \sim \pi$ for all $n$. The basic phenomenon underlying all Markov sampling methods is the convergence in distribution of a Markov chain to its stationary distribution: if a Markov chain $X_0, X_1, \ldots$ has stationary distribution $\pi$, then under the conditions of the Basic Limit Theorem, the distribution of $X_n$ for large $n$ is close to $\pi$. Thus, in order to generate an observation from a desired distribution $\pi$, we find a Markov chain $X_0, X_1, \ldots$ that has $\pi$ as its stationary distribution. The Basic Limit Theorem then suggests that running or simulating the chain until a large time $n$ will produce a random variable $X_n$ whose distribution is close to the desired $\pi$. By taking $n$ large enough, in principle we obtain a value that may for practical purposes be considered a random draw from the distribution $\pi$.

The Gibbs sampler is a way of constructing a Markov chain having a desired stationary distribution. A simple setting that illustrates the idea involves a probability mass function $\pi$ of the form $\pi(x, y)$. Suppose we want to generate a random vector $(X, Y) \sim \pi$. Denote the conditional probability distributions by $\pi(\cdot \mid X = \cdot)$ and $\pi(\cdot \mid Y = \cdot)$. To perform a Gibbs sampler, start with any initial point $(X_0, Y_0)$. Then generate $X_1$ from the conditional distribution $\pi(\cdot \mid Y = Y_0)$, and generate $Y_1$ from the conditional distribution $\pi(\cdot \mid X = X_1)$. Continue on in this way, generating $X_2$ from the conditional distribution $\pi(\cdot \mid Y = Y_1)$ and $Y_2$ from the conditional distribution $\pi(\cdot \mid X = X_2)$, and so on. Then the distribution $\pi$ is stationary for the Markov chain $\{(X_n, Y_n) : n = 0, 1, \ldots\}$.

To see this, suppose $(X_0, Y_0) \sim \pi$. In particular, $Y_0$ is distributed according to the $Y$-marginal of $\pi$, so that, since $X_1$ is drawn from the conditional distribution of $X$ given $Y = Y_0$, we have $(X_1, Y_0) \sim \pi$. Now we use the same reasoning again: $X_1$ is distributed according to the $X$-marginal of $\pi$, and $Y_1$ is drawn from the conditional distribution of $Y$ given $X = X_1$, so that $(X_1, Y_1) \sim \pi$. Thus, the Gibbs sampler Markov chain $\{(X_n, Y_n) : n \geq 0\}$ has the property that if $(X_0, Y_0) \sim \pi$ then $(X_1, Y_1) \sim \pi$; that is, the distribution $\pi$ is stationary.
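A sketch of this two-component Gibbs sampler for a small made-up joint pmf $\pi(x, y)$ on $\{0,1\} \times \{0,1\}$; the long-run frequencies of $(X_n, Y_n)$ should be close to $\pi$.

```python
# Two-component Gibbs sampler: alternately draw X from pi(.|Y) and Y from pi(.|X).
import numpy as np

rng = np.random.default_rng(3)
pi = np.array([[0.30, 0.10],            # pi[x, y] for x, y in {0, 1}
               [0.20, 0.40]])

x, y = 0, 0                             # arbitrary starting point (X_0, Y_0)
counts = np.zeros_like(pi)
for n in range(100000):
    px = pi[:, y] / pi[:, y].sum()      # conditional distribution of X given Y = y
    x = rng.choice(2, p=px)
    py = pi[x, :] / pi[x, :].sum()      # conditional distribution of Y given X = x
    y = rng.choice(2, p=py)
    counts[x, y] += 1

print(counts / counts.sum())            # empirical frequencies, compare with pi
```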

Simulating a Markov chain is technically and conceptually simple. We just generate the random variables in the chain, in order, and we are done. However, the index set of a Markov random field has no natural ordering in general. This is what causes iterative methods such as the Gibbs sampler to be necessary.

The Gibbs sampler is well suited to Markov random fields, since it works by repeatedly sampling from the conditional distribution at one node given the values at the remaining nodes, and the Markov property is precisely the statement that these conditional distributions are simple, depending only on the neighbors of the node.

Ch. 4 – Martingales

4.2 Definitions

Notation: Given a process $W = \{W_k\}$, let $W_{m,n}$ denote the portion $W_m, W_{m+1}, \ldots, W_n$ of the process from time $m$ up to time $n$.

DEF: A process $M_0, M_1, \ldots$ is a martingale if

$$\mathbb{E}[M_{n+1} \mid M_{0,n}] = M_n \text{ for each } n \geq 0.$$

Alternatively, a process $M_0, M_1, \ldots$ is a martingale with respect to another process $W_0, W_1, \ldots$ if

$$\mathbb{E}[M_{n+1} \mid W_{0,n}] = M_n \text{ for each } n \geq 0.$$

The crux of the definition $\mathbb{E}[M_{n+1} \mid W_{0,n}] = M_n$ is a “fair game” sort of requirement. If we are playing a fair game, we expect neither to win nor to lose money on the average. Given the history of our fortunes up to time $n$, our expected fortune at the future time $n+1$ should just be the fortune $M_n$ that we have at time $n$.

A minor technical condition: we also require $\mathbb{E}|M_n| < \infty$ for all $n$, so that the conditional expectations in the definition are guaranteed to be well-defined.

How about submartingales and supermartingales? These are processes that are “better than fair” and “worse than fair,” respectively.

DEF: A process $X_0, X_1, \ldots$ is a submartingale with respect to a process $W_0, W_1, \ldots$ if $\mathbb{E}[X_{n+1} \mid W_{0,n}] \geq X_n$ for each $n \geq 0$.

$\{X_n\}$ is a supermartingale with respect to $\{W_n\}$ if $\mathbb{E}[X_{n+1} \mid W_{0,n}] \leq X_n$ for each $n \geq 0$.

4.3 Examples

See book

4.4 Optional Sampling

The optional sampling property is a “conservation of fairness” type of property.

By the “fair game” property, $\mathbb{E}[M_{n+1}] = \mathbb{E}\{\mathbb{E}[M_{n+1} \mid W_{0,n}]\} = \mathbb{E}[M_n]$ for all $n$. This implies that

$$\mathbb{E}M_n = \mathbb{E}M_0 \text{ for all times } n \geq 0.$$

That is, one can stop at any predetermined time $t$, like $t = 8$, and the winnings will be fair: $\mathbb{E}M_8 = \mathbb{E}M_0$.

Fairness is also conserved in many, but not all, cases in which one stops at a random time that is not predetermined but depends on the observed sample path of the game. The issue of optional sampling is this:

If $T$ is a random time, that is, $T$ is a nonnegative random variable, does the equality $\mathbb{E}M_T = \mathbb{E}M_0$ still hold?

There are two sorts of things that shouldn’t be allowed if we want fairness to be conserved:

In fact, ruling out these two sorts of behaviors leaves a class of random times $T$ at which the optional sampling statement $\mathbb{E}M_T = \mathbb{E}M_0$ holds. Disallowing arbitrarily long times is done by assuming $T$ bounded. Random times that disallow the gambler from peeking ahead into the future are called stopping times.

DEF: A random variable $T$ taking values in the set $\{0, 1, 2, \ldots, \infty\}$ is a stopping time with respect to the process $W_0, W_1, \ldots$ if for each integer $k$, the indicator random variable $I\{T = k\}$ is a function of $W_{0,k}$.

The main optional sampling result:

THM: Let $M_0, M_1, \ldots$ be a martingale w.r.t. $W_0, W_1, \ldots$, and let $T$ be a bounded stopping time. Then $\mathbb{E}M_T = \mathbb{E}M_0$.
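A Monte Carlo sanity check of the theorem (the walk, the hitting level, and the bound are arbitrary choices): for a symmetric $\pm 1$ random walk $M_n$ with $M_0 = 0$ and the bounded stopping time $T = \min\{\text{first } n \text{ with } M_n = 3,\ 20\}$, the sample mean of $M_T$ should be near $\mathbb{E}M_0 = 0$.

```python
# Check E M_T = E M_0 for a bounded stopping time of a +/-1 random walk martingale.
import numpy as np

rng = np.random.default_rng(4)

def sample_MT():
    M = 0
    for n in range(1, 21):             # T is bounded by 20
        M += rng.choice([-1, 1])
        if M == 3:
            return M                   # stopped at the hitting time
    return M                           # otherwise stopped at the bound n = 20

samples = [sample_MT() for _ in range(100000)]
print(np.mean(samples))                # close to E M_0 = 0
```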

Ch. 5 – Brownian Motion

Ch. 6 – Diffusions and Stochastic Calculus

Ch. 7 – Likelihood Ratios

Ch. 8 – Extremes and Poisson Clumping