Notes on Stochastic Processes (Joe Chang)

Ch. 1 – Markov Chains

1.1 Specifying and Simulating a Markov Chain

to specify a Markov chain, we need to know its state space $S$, its initial distribution $\pi_0$ (the distribution of $X_0$), and its probability transition matrix $P$, where $P(i,j) = \mathbb{P}\{X_{n+1} = j \mid X_n = i\}$.
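A minimal simulation sketch (the two-state chain, seed, and probabilities below are made-up illustration values, not from the notes): draw $X_0$ from $\pi_0$, then repeatedly draw $X_{n+1}$ from the row of $P$ indexed by the current state.

```python
# Simulate a Markov chain from (S, pi_0, P); the chain here is a hypothetical example.
import numpy as np

rng = np.random.default_rng(0)

states = [0, 1]                      # state space S
pi0 = np.array([0.5, 0.5])           # initial distribution pi_0
P = np.array([[0.9, 0.1],            # transition matrix, P[i, j] = P(X_{n+1}=j | X_n=i)
              [0.2, 0.8]])

def simulate(n_steps):
    """Draw X_0 from pi_0, then X_{n+1} from row X_n of P."""
    x = rng.choice(states, p=pi0)
    path = [x]
    for _ in range(n_steps):
        x = rng.choice(states, p=P[x])
        path.append(x)
    return path

print(simulate(10))
```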

1.2 The Markov Property

a process $X_0, X_1, \ldots$ satisfies the Markov property if

$$\mathbb{P}\{X_{n+1} = i_{n+1} \mid X_n = i_n, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0\} = \mathbb{P}\{X_{n+1} = i_{n+1} \mid X_n = i_n\}$$

for all $n$ and all $i_0, \ldots, i_{n+1} \in S$.

notes

notes

(i.e., dependent on the last r states)

1.3 Matrices

recall $\pi_0$ from 1.1. analogously, let $\pi_n$ denote the distribution of the chain at time $n$: $\pi_n(i) = \mathbb{P}\{X_n = i\}$. consider both as row vectors.

suppose that the state space is finite: $S = \{1, \ldots, N\}$. by LOTP,

$$\pi_{n+1}(j) = \mathbb{P}\{X_{n+1} = j\} = \sum_{i=1}^{N} \mathbb{P}\{X_n = i\}\,\mathbb{P}\{X_{n+1} = j \mid X_n = i\} = \sum_{i=1}^{N} \pi_n(i) P(i,j),$$

i.e., $\pi_{n+1} = \pi_n P$.
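As a quick check of the identity $\pi_{n+1} = \pi_n P$, the distribution can be propagated numerically; the sketch below reuses the same made-up two-state chain as in 1.1.

```python
# Propagate the distribution of X_n via pi_{n+1} = pi_n P (hypothetical two-state chain).
import numpy as np

pi = np.array([0.5, 0.5])            # pi_0
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

for n in range(20):
    pi = pi @ P                      # row vector times matrix
print(pi)                            # distribution of X_20
```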

notes

1.4 Basic Limit Theorem of Markov Chains

notes

1.5 Stationary Distribution

a distribution $\pi$ being stationary amounts to saying that $\pi = \pi P$ is satisfied, i.e.,

$$\pi(j) = \sum_{i \in S} \pi(i) P(i,j)$$

for all $j \in S$.

a Markov chain might have no stationary distribution, one stationary distribution, or infinitely many stationary distributions.
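One numerical way (a sketch, not a method from the notes) to find a stationary distribution of a finite chain: $\pi = \pi P$ says that $\pi^{\top}$ is an eigenvector of $P^{\top}$ with eigenvalue 1.

```python
# Find a stationary distribution of a finite transition matrix by an eigen-decomposition.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))      # column closest to eigenvalue 1
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()                        # normalize to a probability vector
print(pi, pi @ P)                         # pi and pi @ P should agree
```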

for subsets $A, B$ of the state space, define the probability flux from set $A$ into $B$ as

$$\text{flux}(A, B) = \sum_{i \in A} \sum_{j \in B} \pi(i) P(i,j)$$

1.6 Irreducibility, Periodicity, Recurrence

Use $\mathbb{P}_i(A)$ as shorthand for $\mathbb{P}\{A \mid X_0 = i\}$, and similarly for $\mathbb{E}_i$.

Accessibility: for two states $i, j$, we say that $j$ is accessible from $i$ if it is possible for the chain ever to visit state $j$ if the chain starts in state $i$:

$$\mathbb{P}_i\left\{\bigcup_{n=0}^{\infty} \{X_n = j\}\right\} > 0.$$

Equivalently,

$$\sum_{n=0}^{\infty} P^n(i,j) = \sum_{n=0}^{\infty} \mathbb{P}_i\{X_n = j\} > 0.$$

Communication: we say $i$ communicates with $j$ if $i$ is accessible from $j$ and $j$ is accessible from $i$.

Irreducibility: the Markov chain is irreducible if all pairs of states communicate.

The relation “communicates with” is an equivalence relation; hence, the state space $S$ can be partitioned into “communicating classes” or “classes.”

The Basic Limit Theorem requires irreducibility and aperiodicity (see 1.5). Trivial examples why:

Period: Given a Markov chain $\{X_0, X_1, \ldots\}$, define the period of a state $i$ to be the greatest common divisor

$$d_i = \gcd\{n : P^n(i,i) > 0\}.$$

THEOREM: if the states $i$ and $j$ communicate, then $d_i = d_j$.

The period of a state is a “class property.” In particular, all states in an irreducible Markov chain have the same period. Thus, we can speak of the period of a Markov chain if the Markov chain is irreducible.

An irreducible Markov chain is aperiodic if its period is $1$, and periodic otherwise. (A sufficient but not necessary condition for an irreducible chain to be aperiodic is that there exists a state $i$ such that $P(i,i) > 0$.)

One more concept needed for the Basic Limit Theorem: recurrence. We will begin by defining recurrence, then show that it is a class property. In particular, in an irreducible Markov chain, either all states are recurrent or all states are transient.

The idea of recurrence: a state $i$ is recurrent if, starting from the state $i$ at time 0, the chain is sure to return to $i$ eventually. More precisely, define the first hitting time $T_i$ of the state $i$ by

$$T_i = \inf\{n > 0 : X_n = i\}.$$

Recurrence: the state $i$ is recurrent if $\mathbb{P}_i\{T_i < \infty\} = 1$. If $i$ is not recurrent, it is called transient.

(note that accessibility could be defined equivalently: for distinct states $i \neq j$, $j$ is accessible from $i$ iff $\mathbb{P}_i\{T_j < \infty\} > 0$.)

THEOREM: Let $i$ be a recurrent state, and suppose that $j$ is accessible from $i$. Then all of the following hold: (i) $\mathbb{P}_j\{T_i < \infty\} = 1$, (ii) $\mathbb{P}_i\{T_j < \infty\} = 1$, and (iii) the state $j$ is recurrent.

We use the notation $N_i$ for the total number of visits of the Markov chain to the state $i$:

$$N_i = \sum_{n=0}^{\infty} I\{X_n = i\}.$$

THEOREM: The state $i$ is recurrent iff $\mathbb{E}_i(N_i) = \infty$.

COROLLARY: If $j$ is transient, then $\lim_{n \to \infty} P^n(i,j) = 0$ for all states $i$.

Introducing stationary distributions.

PROP: Suppose a Markov chain has a stationary distribution $\pi$. If the state $j$ is transient, then $\pi(j) = 0$.

COROLLARY: If an irreducible Markov chain has a stationary distribution, then the chain is recurrent.

Note that the converse of the above is not true. There are irreducible, recurrent Markov chains that do not have stationary distributions. For example, the simple symmetric random walk on the integers in one dimension is irreducible and recurrent but does not have a stationary distribution. By recurrence we have $\mathbb{P}_0\{T_0 < \infty\} = 1$, but also $\mathbb{E}_0(T_0) = \infty$. The name for this kind of recurrence is null recurrence: a state $i$ is null recurrent if it is recurrent and $\mathbb{E}_i(T_i) = \infty$. A recurrent state with $\mathbb{E}_i(T_i) < \infty$ is positive recurrent.

Positive recurrence is also a class property: if a chain is irreducible, the chain is either transient, null recurrent, or positive recurrent. In fact, an irreducible chain has a stationary distribution iff it is positive recurrent.

1.7 Coupling

Example of the coupling technique: consider a random graph on a given finite set of nodes, in which each pair of nodes is joined by an edge independently with probability $p$. We could simulate a random graph as follows: for each pair of nodes $i, j$ generate a random number $U_{ij} \sim U[0,1]$, and join nodes $i$ and $j$ with an edge if $U_{ij} \leq p$.

How do we show that the probability of the resulting graph being connected is nondecreasing in $p$, i.e., show that for $p_1 < p_2$,

$$\mathbb{P}_{p_1}\{\text{graph connected}\} \leq \mathbb{P}_{p_2}\{\text{graph connected}\}.$$

We could try to find an explicit function for the probability in terms of $p$, which seems inefficient. How to formalize the intuition that this seems obvious?

An idea: show that the corresponding events are ordered, i.e., if $A \subset B$ then $\mathbb{P}(A) \leq \mathbb{P}(B)$.

Let’s make 2 events by making 2 random graphs, $G_1, G_2$, on the same set of nodes. $G_1$ is constructed by having each possible edge appear with probability $p_1$, and for $G_2$, each edge is present with probability $p_2$. We can do this by using two sets of $U[0,1]$ random variables: $\{U_{ij}\}, \{V_{ij}\}$ for the first and second graph, respectively. Is it true that

$$\{G_1 \text{ connected}\} \subset \{G_2 \text{ connected}\}?$$

No, since the two sets of r.v.s are independently generated.

A change: use the same random numbers for each graph. Then

$$\{G_1 \text{ connected}\} \subset \{G_2 \text{ connected}\}$$

becomes true. This establishes monotonicity of the probability being connected.
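A sketch of this coupling in code (the node count, seed, and values $p_1, p_2$ are arbitrary choices): one shared array of uniforms $U_{ij}$ builds both graphs, so every edge of $G_1$ is also an edge of $G_2$ and the connectivity events are nested.

```python
# Build two random graphs from the SAME uniforms, so G1's edges are a subset of G2's.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

def connected(nodes, edges):
    """Simple graph-search connectivity check."""
    adj = {v: set() for v in nodes}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        v = stack.pop()
        for w in adj[v] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == len(nodes)

nodes = list(range(8))
pairs = list(combinations(nodes, 2))
U = rng.uniform(size=len(pairs))        # shared randomness for both graphs

p1, p2 = 0.2, 0.5                       # p1 < p2
G1 = [pair for pair, u in zip(pairs, U) if u <= p1]
G2 = [pair for pair, u in zip(pairs, U) if u <= p2]
# {G1 connected} implies {G2 connected}, since G1's edge set is contained in G2's.
print(connected(nodes, G1), connected(nodes, G2))
```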

Conclusion: what characterizes a coupling argument? Generally, we show that the same set of random variables can be used to construct two different objects about which we want to make a probabilistic statement.

1.8 Proof of Basic Limit Theorem

The Basic Limit Theorem says that if an irreducible, aperiodic Markov chain has a stationary distribution $\pi$, then for each initial distribution $\pi_0$, as $n \to \infty$ we have $\pi_n(i) \to \pi(i)$ for all states $i$.

(Note the wording “a stationary distribution”: if the Basic Limit Theorem is true, then an irreducible and aperiodic Markov chain cannot have two different stationary distributions.)

Equivalently, let’s define a distance between probability distributions, called “total variation distance:”

DEFINITION: Let $\lambda$ and $\mu$ be two probability distributions on the set $S$. Then the total variation distance $\|\lambda - \mu\|$ is defined by

$$\|\lambda - \mu\| = \sup_{A \subset S} [\lambda(A) - \mu(A)].$$

PROP: The total variation distance may also be expressed in the alternative forms

$$\|\lambda - \mu\| = \sup_{A \subset S} [\lambda(A) - \mu(A)] = \frac{1}{2} \sum_{i \in S} |\lambda(i) - \mu(i)| = 1 - \sum_{i \in S} \min\{\lambda(i), \mu(i)\}.$$

We now introduce the coupling method. Let $Y_0, Y_1, \ldots$ be a Markov chain with the same probability transition matrix as $X_0, X_1, \ldots$, but let $Y_0$ have the initial distribution $\pi$ and $X_0$ have the initial distribution $\pi_0$. Note that $\{Y_n\}$ is a stationary Markov chain, with $Y_n \sim \pi$ for all $n$. Let the $Y$ chain be independent of the $X$ chain.

We want to show that, for large $n$, the probabilistic behavior of $X_n$ is close to that of $Y_n$.

Define the coupling time $T$ to be the first time at which $X_n = Y_n$:

$$T = \inf\{n : X_n = Y_n\}.$$

LEMMA: For all $n$ we have

$$\|\pi_n - \pi\| \leq \mathbb{P}\{T > n\}.$$

Hence we need only show that $\mathbb{P}\{T > n\} \to 0$, or equivalently, that $\mathbb{P}\{T < \infty\} = 1$.

Consider the bivariate chain $\{Z_n = (X_n, Y_n) : n \geq 0\}$. $Z_0, Z_1, \ldots$ is clearly a Markov chain on the state space $S \times S$. Since the $X$ and $Y$ chains are independent, the probability transition matrix $P_Z$ of the chain $Z$ can be written

$$P_Z(i_x i_y, j_x j_y) = P(i_x, j_x)\, P(i_y, j_y).$$

$Z$ has stationary distribution

$$\pi_Z(i_x i_y) = \pi(i_x)\, \pi(i_y).$$

We want to show $\mathbb{P}\{T < \infty\} = 1$. So, in terms of the $Z$ chain, we want to show that with probability one, the $Z$ chain hits the “diagonal” $\{(j,j) : j \in S\}$ in $S \times S$ in finite time. To do this, it is sufficient to show that the $Z$ chain is irreducible and recurrent.

This is where we use aperiodicity.

LEMMA: Suppose $A$ is a set of positive integers that is closed under addition and has greatest common divisor one. Then there exists an integer $N$ such that $n \in A$ for all $n \geq N$.

Let $i \in S$ and recall the assumption that the chain is aperiodic. Since the set $\{n : P^n(i,i) > 0\}$ is closed under addition and, from aperiodicity, has greatest common divisor $1$, we can use the previous lemma. So $P^n(i,i) > 0$ for all sufficiently large $n$. From this, for any $i, j \in S$, since irreducibility implies $P^m(i,j) > 0$ for some $m$, it follows that $P^n(i,j) > 0$ for all sufficiently large $n$.

Now we show irreducibility of the $Z$ chain. Let $i_x, i_y, j_x, j_y \in S$. It is sufficient to show that $P^n_Z(i_x i_y, j_x j_y) > 0$ for some $n$. By the assumed independence of $\{X_n\}$ and $\{Y_n\}$, we have

$$P^n_Z(i_x i_y, j_x j_y) = P^n(i_x, j_x)\, P^n(i_y, j_y),$$

which (by the previous argument) is positive for all sufficiently large $n$. Finally, the irreducible $Z$ chain has the stationary distribution $\pi_Z$, so by the corollary in 1.6 it is recurrent, and we are done.

1.9 SLLN for Markov Chains

The usual Strong Law of Large Numbers for iid random variables says that if $X_1, X_2, \ldots$ are iid with mean $\mu$, then

$$\mathbb{P}\left\{\frac{1}{n}\sum_{t=1}^{n} X_t \to \mu \text{ as } n \to \infty\right\} = 1.$$

We will prove a generalization of this result for Markov chains: the fraction of time a Markov chain occupies a state converges to a limit.

Although the successive states of a Markov chain are not independent, certain features of a Markov chain are independent of each other. Here we will use the idea that the path of the chain consists of a succession of independent “cycles,” the segments of the path between successive visits to a recurrent state. This independence allows us to use the LLN that we already know.

THEOREM: Let $X_0, X_1, \ldots$ be a Markov chain starting in the state $X_0 = i$, and suppose that the state $i$ communicates with another state $j$. The limiting fraction of time that the chain spends in state $j$ is $1/\mathbb{E}_j T_j$. That is,

$$\mathbb{P}_i\left\{\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} I\{X_t = j\} = \frac{1}{\mathbb{E}_j T_j}\right\} = 1.$$

In the proof of the previous theorem, we define $V_n(j)$ as the number of visits to state $j$ made by $X_1, \ldots, X_n$, i.e.,

$$V_n(j) = \sum_{t=1}^{n} I\{X_t = j\}.$$

Using the Bounded Convergence Theorem, we have the following:

COROLLARY: For an irreducible Markov chain, we have

$$\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} P^t(i,j) = \frac{1}{\mathbb{E}_j T_j}$$

for all states $i$ and $j$.

Consider an irreducible, aperiodic Markov chain having a stationary distribution $\pi$. From the Basic Limit Theorem, we know that $P^n(i,j) \to \pi(j)$ as $n \to \infty$. Notice also that if a sequence of numbers converges to a limit, then the sequence of Cesàro averages converges to the same limit, i.e., if $a_t \to a$ as $t \to \infty$, then $(1/n)\sum_{t=1}^{n} a_t \to a$ as $n \to \infty$. On the other hand, the previous Corollary shows that the Cesàro averages of the $P^t(i,j)$'s converge to $1/\mathbb{E}_j T_j$. So, we must have

$$\pi(j) = \frac{1}{\mathbb{E}_j T_j}.$$

In fact, aperiodicity is not needed for this conclusion:

THEOREM: An irreducible, positive recurrent Markov chain has a unique stationary distribution $\pi$ given by

$$\pi(j) = \frac{1}{\mathbb{E}_j T_j}.$$
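A quick Monte Carlo sanity check of $\pi(j) = 1/\mathbb{E}_j T_j$ (the two-state matrix and its stationary distribution below are made-up illustration values, not from the notes): estimate the mean return time to a state $j$ by simulation and compare its reciprocal with $\pi(j)$.

```python
# Estimate E_j T_j by simulating return times to j; its reciprocal should match pi(j).
import numpy as np

rng = np.random.default_rng(2)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2/3, 1/3])          # stationary distribution of this P (solves pi = pi P)
j = 1

def return_time(j):
    x, t = j, 0
    while True:
        x = rng.choice(2, p=P[x])
        t += 1
        if x == j:
            return t

times = [return_time(j) for _ in range(20000)]
print(1 / np.mean(times), pi[j])   # both close to 1/3
```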

1.10 Exercises

Ch. 2 – Markov Chains: Examples & Applications

2.1 Branching Processes

Motivating example: the branching process model was formulated by Galton, who was interested in the survival and extinction of family names.

Suppose children inherit their fathers’ names, so we need only to keep track of fathers and sons. Consider a male who is the only member of his generation to have a given family name, so that the responsibility of keeping the family name alive falls upon him–if his line of male descendants terminates, so does the family name.

Suppose for simplicity that each male has probability $f(0)$ of producing no sons, $f(1)$ of producing 1 son, and so on.

What is the probability that the family name eventually becomes extinct?

To formalize this, let $G_t$ denote the number of males in generation $t$ who carry the family name, with $G_0 = 1$.

We are interested in the extinction probability $\rho = \mathbb{P}_1\{G_t = 0 \text{ for some } t\}$.

$\{G_t : t \geq 0\}$ is a Markov chain. State 0 is absorbing. So, for each $i > 0$, since $\mathbb{P}_i\{G_1 = 0\} = (f(0))^i > 0$, the state $i$ must be transient.

Consequently, with probability 1, each state $i > 0$ is visited only a finite number of times. So, with probability 1, the chain must get absorbed at 0 or approach $\infty$.

We can obtain an equation for $\rho$ by conditioning on what happens at the first step of the chain:

$$\rho = \sum_{k=0}^{\infty} \mathbb{P}\{G_1 = k \mid G_0 = 1\}\, \mathbb{P}\{\text{eventual extinction} \mid G_1 = k\}.$$

Since the males all have sons independently,

$$\mathbb{P}\{\text{eventual extinction} \mid G_1 = k\} = \rho^k.$$

Thus, we have

$$\rho = \sum_{k=0}^{\infty} f(k)\, \rho^k = \psi(\rho).$$

For each distribution $f$ there is a corresponding function of $\rho$, denoted $\psi$. So the extinction probability $\rho$ is a fixed point of $\psi$.

$\psi$ is the probability generating function of the probability mass function $f$. The first two derivatives are

$$\psi'(z) = \sum_{k=1}^{\infty} k f(k) z^{k-1}, \qquad \psi''(z) = \sum_{k=2}^{\infty} k(k-1) f(k) z^{k-2}$$

for $z \in (0,1)$. Since these are positive, the function $\psi$ is strictly increasing and convex on $(0,1)$. Also note that $\psi(0) = f(0)$ and $\psi(1) = 1$. Finally, $\psi'(1) = \sum k f(k) = \mu$, where $\mu = E(X)$ is the expected number of sons for each male.

notes

Since $\psi(1) = 1$, there is always a trivial solution at $\rho = 1$. When $\mu \leq 1$, this trivial solution is the only solution, so $\rho = 1$.

When $\mu > 1$, we define $r$ to be the smaller solution of $\psi(r) = r$. Since $\psi(\rho) = \rho$, we know that $\rho$ must be either $r$ or 1. We want to show that $\rho = r$.

Defining $p_t = \mathbb{P}_1\{G_t = 0\}$, observe that as $t \to \infty$,

$$p_t \uparrow \mathbb{P}_1\left[\bigcup_{n=1}^{\infty} \{G_n = 0\}\right] = \rho.$$

Therefore, to rule out $\rho = 1$, it is sufficient to prove that

$$p_t \leq r \text{ for all } t.$$

We prove this by induction. First, $p_0 = 0 \leq r$. Next,

$$p_{t+1} = \mathbb{P}_1\{G_{t+1} = 0\} = \sum_{i=0}^{\infty} \mathbb{P}_1\{G_1 = i\}\, \mathbb{P}_1\{G_{t+1} = 0 \mid G_1 = i\} = \sum_{i=0}^{\infty} f(i)\,(p_t)^i,$$

i.e., $p_{t+1} = \psi(p_t)$. Since $\psi$ is increasing and, by the induction hypothesis, $p_t \leq r$, we have

$$p_{t+1} = \psi(p_t) \leq \psi(r) = r.$$

So $p_t \leq r$ for all nonnegative $t$, which proves $\rho = r$. Hence there is a positive probability that the family name goes on forever.
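The proof also suggests a direct way to compute $\rho$ numerically: iterate $p_{t+1} = \psi(p_t)$ starting from $p_0 = 0$. The offspring distribution $f$ below is a made-up example with $\mu = 1.25 > 1$, for which the smaller fixed point is $r = 1/2$.

```python
# Compute the extinction probability rho by iterating p_{t+1} = psi(p_t) from p_0 = 0.
f = {0: 0.25, 1: 0.25, 2: 0.5}             # hypothetical offspring pmf, mean mu = 1.25

def psi(z):
    """Probability generating function of f."""
    return sum(prob * z**k for k, prob in f.items())

p = 0.0                                     # p_0 = 0
for t in range(200):
    p = psi(p)                              # p_{t+1} = P_1{G_{t+1} = 0}
print(p)                                    # converges to rho = 0.5 for this f
```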

2.2 Time Reversibility

2.3 More on Time Reversibility: A Tandem Queue Model

2.4 The Metropolis Method

2.5 Simulated Annealing

2.6 Ergodicity Concepts

In this section we focus on a time-inhomogeneous Markov chain $\{X_n\}$ on a countably infinite state space $S$.

Let $P_n$ denote the probability transition matrix governing the transition from $X_n$ to $X_{n+1}$, i.e.,

$$P_n(i,j) = \mathbb{P}\{X_{n+1} = j \mid X_n = i\}.$$

For $m < n$, define $P^{(m,n)} = \prod_{k=m}^{n-1} P_k$, so that

$$P^{(m,n)}(i,j) = \mathbb{P}\{X_n = j \mid X_m = i\}.$$

DEFINITION: $\{X_n\}$ is strongly ergodic if there exists a probability distribution $\pi^*$ on $S$ such that

$$\lim_{n \to \infty} \sup_{i \in S} \| P^{(m,n)}(i, \cdot) - \pi^* \| = 0 \quad \forall m.$$

DEFINITION: $\{X_n\}$ is weakly ergodic if

$$\lim_{n \to \infty} \sup_{i,j \in S} \| P^{(m,n)}(i, \cdot) - P^{(m,n)}(j, \cdot) \| = 0 \quad \forall m.$$

We can understand weak ergodicity as something of a “loss of memory” concept. It says that at a large enough time $n$, the chain has nearly forgotten its state at time $m$, in the sense that the distribution at time $n$ would be nearly the same no matter what the state was at time $m$. However, there is no requirement that the distribution be converging to anything as $n \to \infty$. The concept that incorporates convergence in addition to loss of memory is strong ergodicity.

2.6.1 The Ergodic Coefficient

For a probability transition matrix $P = (P(i,j))$, the ergodic coefficient $\delta(P)$ of $P$ is defined to be the maximum total variation distance between pairs of rows of $P$, that is,

DEFINITION: The ergodic coefficient $\delta(P)$ of a probability transition matrix $P$ is

$$\begin{aligned} \delta(P) &= \sup_{i,j \in S} \| P(i, \cdot) - P(j, \cdot) \| \\ &= \frac{1}{2} \sup_{i,j \in S} \sum_{k \in S} | P(i,k) - P(j,k) | \\ &= \sup_{i,j \in S} \sum_{k \in S} \left( P(i,k) - P(j,k) \right)^+. \end{aligned}$$

The basic idea is that $\delta(P)$ being small is “good” for ergodicity. For example, in the extreme case of $\delta(P) = 0$, all the rows of $P$ are identical, so $P$ would cause a Markov chain to lose its memory completely in just one step: $v_1 = v_0 P$ does not depend on $v_0$.
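A small sketch of computing $\delta(P)$ for a finite transition matrix, using the middle expression above, $\frac{1}{2}\sup_{i,j}\sum_k |P(i,k) - P(j,k)|$ (the matrix is an arbitrary example):

```python
# Ergodic coefficient: half the largest L1 distance between two rows of P.
import numpy as np

def ergodic_coefficient(P):
    P = np.asarray(P, dtype=float)
    row_l1 = np.abs(P[:, None, :] - P[None, :, :]).sum(axis=2)  # pairwise row distances
    return 0.5 * row_l1.max()

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(ergodic_coefficient(P))   # 0.5 * (|0.9-0.2| + |0.1-0.8|) = 0.7
```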

LEMMA: $\delta(PQ) \leq \delta(P)\,\delta(Q)$ for probability transition matrices $P, Q$.

2.6.2 Sufficient Conditions for Weak and Strong Ergodicity

Sufficient conditions are given in the next two propositions:

PROP: If there exist $n_0 < n_1 < n_2 < \cdots$ such that $\sum_k [\,1 - \delta(P^{(n_k, n_{k+1})})\,] = \infty$, then $\{X_n\}$ is weakly ergodic.

PROP: If $\{X_n\}$ is weakly ergodic and if there exist $\pi_0, \pi_1, \ldots$ such that $\pi_n$ is a stationary distribution for $P_n$ for all $n$ and $\sum_n \|\pi_n - \pi_{n+1}\| < \infty$, then $\{X_n\}$ is strongly ergodic. In that case, the distribution $\pi^*$ in the definition is given by $\pi^* = \lim_{n \to \infty} \pi_n$.

2.7 Proof of Main Theorem of Simulated Annealing

2.8 Card Shuffling

We have seen that for an irreducible, aperiodic Markov chain $\{X_n\}$ having stationary distribution $\pi$, the distribution $\pi_n$ of $X_n$ converges to $\pi$ in the total variation distance. An example of using this would be generating a nearly uniformly distributed $4 \times 4$ table with given row and column sums by simulating a certain Markov chain for a long enough time. The question is: how long is long enough?

In certain simple Markov chain examples, it is easy to figure out the rate of convergence of $\pi_n$ to $\pi$. In this section we will concentrate on a simple shuffling example considered by Aldous and Diaconis in their article “Shuffling cards and stopping times.” The basic question is: how close is the deck to being “random” (uniformly distributed over the $52!$ possible permutations) after $n$ shuffles? For the riffle shuffle model, the answer is “about 7.”

2.8.1 “Top in at random” Shuffle

The “top in at random” method consists of taking the top card off the deck and then inserting it back into the deck at a random position (this includes back on top). So, altogether, there are 52 equally likely positions. Repeated performance of this shuffle on a deck of cards produces a sequence of “states” of the deck. This sequence of states forms a Markov chain with state space $S_{52}$, the group of permutations of the cards. This Markov chain is irreducible, aperiodic, and has stationary distribution $\pi = \text{Uniform}$ on $S_{52}$ (i.e., probability $1/52!$ for each permutation). Therefore, by the Basic Limit Theorem, we may conclude that $\|\pi_n - \pi\| \to 0$ as $n \to \infty$.

2.8.2 Threshold Phenomenon

Suppose we are working with a fresh deck of $d$ cards in the original order (card 1 on top, card 2 under, etc.). Then $\|\pi_0 - \pi\| = 1 - (1/d!)$. We also know that $\|\pi_n - \pi\| \to 0$ as $n \to \infty$ from the Basic Limit Theorem. It is natural to assume that the distance from stationarity decreases to 0 in a smooth manner; however, it actually experiences what we call the “threshold phenomenon.” An abrupt change happens in a relatively small neighborhood of the value $n = d \log d$. That is, for large $d$ the graph of $\|\pi_n - \pi\|$ versus $n$ looks like the following picture.

notes

The larger the value of the deck size $d$, the sharper (relative to $d \log d$) the drop is near $n = d \log d$.

2.8.3 A random time to exact stationarity

Let’s give a name to each card in the deck (i.e., say the 2 of hearts is card 1, the 3 of hearts is card 2, etc.). Suppose we start with the deck in pristine order (card 1 on top, then card 2, etc.). Though $\pi_n$ will never become exactly uniform, it is possible to find a random time $T$ at which the deck becomes exactly uniformly distributed, that is, $X_T \sim \text{Unif}(S_{52})$.

Here is an example of such a random time. Suppose that “card $i$” always refers to the same card (like, say, card 52, the ace of spades), whereas “top card,” “card in position 2,” etc. refer to whatever card happens to be in that position at the time of consideration. Also note that we may describe a sequence of shuffles simply by a sequence of iid random variables $U_1, U_2, \ldots$ uniformly distributed on $\{1, 2, \ldots, 52\}$: just say that the $i$-th shuffle moves the top card to position $U_i$. Define the following random times:

$$\begin{aligned} T_1 &= \inf\{n : U_n = 52\} = \text{ 1st time a top card goes below card 52} \\ T_2 &= \inf\{n > T_1 : U_n \geq 51\} = \text{ 2nd time a top card goes below card 52} \\ T_3 &= \inf\{n > T_2 : U_n \geq 50\} = \text{ 3rd time a top card goes below card 52} \\ &\;\;\vdots \\ T_{51} &= \inf\{n > T_{50} : U_n \geq 2\} = \text{ 51st time a top card goes below card 52} \end{aligned}$$

and

$$T = T_{52} = T_{51} + 1.$$

It is not hard to see that $T$ has the desired property and that $X_T$ is uniformly distributed. To understand this, start with $T_1$. At time $T_1$, we know that some card is below card 52; we don’t know which card, but that will not matter. After time $T_1$ we continue to shuffle until $T_2$, at which time another card goes below card 52. At time $T_2$, there are 2 cards below card 52. Again, we do not know which cards they are, but conditional on which 2 cards are below card 52, each of the two possible orderings of those 2 cards is equally likely. Similarly, we continue to shuffle until time $T_3$, at which time there are some 3 cards below card 52, and, whatever those 3 cards are, each of their $3!$ possible relative positions in the deck is equally likely. And so on. At time $T_{51}$, card 52 has risen all the way up to become the top card, and the other 51 cards are below card 52 (now we do know which cards they are), and those 51 cards are in random positions (i.e., uniform over $51!$ possibilities). Now all we have to do is shuffle one more time to get card 52 into a random position, so that at time $T = T_{52} = T_{51} + 1$, the whole deck is random.

Let us find $ET$. By the definitions above, $T_1 \sim \text{Geom}(1/52)$, $(T_2 - T_1) \sim \text{Geom}(2/52)$, ..., $(T_{51} - T_{50}) \sim \text{Geom}(51/52)$, $(T_{52} - T_{51}) \sim \text{Geom}(52/52)$. So

$$ET = E(T_1) + E(T_2 - T_1) + \cdots + E(T_{52} - T_{51}) = \frac{52}{1} + \frac{52}{2} + \cdots + \frac{52}{52} \approx 52 \log 52.$$

Analogously, if the deck had $d$ cards rather than 52, we would have obtained $ET \sim d \log d$ (for large $d$), where $T$ is now a random time at which the whole deck of $d$ cards becomes uniformly distributed on $S_d$.
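Since $E(T_k - T_{k-1}) = 52/k$ for a Geometric$(k/52)$ waiting time, the sum above equals $52\,(1 + 1/2 + \cdots + 1/52)$, i.e., 52 times the harmonic number $H_{52}$. A one-line computation compares it with $52 \log 52$:

```python
# ET = d * H_d (harmonic sum), compared with d log d for d = 52.
import math

d = 52
ET = sum(d / k for k in range(1, d + 1))   # 52/1 + 52/2 + ... + 52/52
print(ET, d * math.log(d))                 # roughly 236 vs roughly 205
```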

2.8.4 Strong Stationary Times

As we have observed, the random variable $T$ that we just constructed has the property that $X_T \sim \pi$. $T$ also has two other important properties. First, $X_T$ is independent of $T$. Second, $T$ is a stopping time, that is, for each $n$, one can determine whether or not $T = n$ just by looking at the values of $X_0, \ldots, X_n$. In particular, to determine whether or not $T = n$ it is not necessary to know any future values $X_{n+1}, X_{n+2}, \ldots$.

DEFINITION: A random variable $T$ satisfying (i) $T$ is a stopping time, (ii) $X_T \sim \pi$, and (iii) $X_T$ is independent of $T$, is called a strong stationary time.

What’s so good about strong stationary times?

LEMMA: If $T$ is a strong stationary time for the Markov chain $\{X_n\}$, then $\|\pi_n - \pi\| \leq \mathbb{P}\{T > n\}$ for all $n$.

This tells us that strong stationary times satisfy the same inequality derived for coupling times.

2.8.5 Proof of threshold phenomenon in shuffling

Let $\Delta(n)$ denote $\|\pi_n - \pi\|$. The proof that the threshold phenomenon occurs in the top-in-at-random shuffle consists of two parts. Roughly speaking, the first part shows that $\Delta(n)$ is close to 0 for $n$ slightly larger than $d \log d$, and the second part shows that $\Delta(n)$ is close to 1 for $n$ slightly smaller than $d \log d$, where in both cases the meaning of “slightly” is “small relative to $d \log d$.”

Part 1 is addressed in the following result:

THEOREM: For $T$ defined as in the random time discussion above, we have

$$\Delta(d \log d + cd) \leq \mathbb{P}\{T > d \log d + cd\} \leq e^{-c}$$

for all $c \geq 0$.

Note that for each fixed $c$, $cd$ is small relative to $d \log d$ if $d$ is large enough. FINISH THIS SECTION

Ch. 3 – MRFs and HMMs

This section looks at aspects of Markov random fields (MRF’s), hidden Markov models (HMM’s), and their applications.

3.1 MRFs on Graphs and HMMs

A stochastic process is a collection of random variables $\{X_t : t \in T\}$ indexed by some subset $T$ of the real line $\mathbb{R}$. The elements of $T$ are often interpreted as times, in which case $X_t$ represents the state at time $t$ of the random process under consideration. The term random field refers to a generalization of the notion of a stochastic process: a random field $\{X_s : s \in G\}$ is still a collection of random variables, but now the index set $G$ need not be a subset of $\mathbb{R}$. For example, $G$ could be a subset of the plane $\mathbb{R}^2$. In this section, we’ll consider $G$ as the set of nodes of a graph (the set being at most countable). Important aspects of the dependence among the random variables will be determined by the edges of the graph through a generalization of the Markov property.

NOTATION: Given a graph $G$, we say two nodes $s, t$ are neighbors, denoted $s \sim t$, if they are joined by an edge of the graph. We do not consider a node to be a neighbor of itself. $N(t)$ denotes the set of neighbors of $t$.

DEFINITION: Suppose we are given a graph $G$ with nodes $\{1, \ldots, n\}$ and a neighborhood structure $N(t)$. The collection of random variables $(X_1, \ldots, X_n)$ is a Markov random field on $G$ if

$$\mathbb{P}\{X_t = x_t \mid X_s = x_s \text{ for } s \neq t\} = \mathbb{P}\{X_t = x_t \mid X_s = x_s \text{ for } s \in N(t)\}$$

for all nodes $t \in \{1, \ldots, n\}$.

More compact notation: for a subset of nodes $A \subset G$, let $x_A$ be the vector $(x_s : s \in A)$. We will also write $p(x_t \mid x_{N(t)})$ for $\mathbb{P}\{X_t = x_t \mid X_s = x_s \text{ for } s \in N(t)\}$. HMM’s are Markov random fields in which some of the random variables are observable and others are not. We adopt $X$ for hidden random variables and $Y$ for observed random variables.

3.2 Bayesian Framework

What do we get out of these models & how can we use them? One approach is Bayesian: HMM’s fit nicely in the Bayesian framework. $X$ is the object of interest; it is unknown. For example, in modeling a noisy image, $X$ could be the true image. We consider the unknown $X$ to be random, and we assume it has a certain prior distribution. This distribution, our probabilistic model for $X$, is assumed to be a Markov random field. We also postulate a certain probabilistic model for $Y$ conditional on $X$. This conditional distribution of $Y$ given $X$ reflects our ideas about the noise or blurring or whatever transforms the hidden true image $X$ into the image $Y$ that we observe.

Given our assumed prior distribution of $X$ and conditional distribution of $(Y \mid X)$, Bayes’ formula gives the posterior distribution of $(X \mid Y)$. Thus, given an observed value $Y = y$, in principle we get a posterior distribution $\mathbb{P}\{X = \cdot \mid Y = y\}$ over all possible true images, so that (again, in principle) we could make a variety of reasonable choices of our estimator of $X$. For example, we could choose the $x$ that maximizes $\mathbb{P}\{X = x \mid Y = y\}$. This is called MAP estimation, where MAP stands for “maximum a posteriori”.

3.3 Hammersley - Clifford Theorem

How do we specify a Markov random field? By analogy with the case of Markov chains, we might want to specify it in terms of conditional distributions. The following example suggests why this approach goes wrong.

EXAMPLE: Suppose we are designing a Markov random field for images on the 3x3 lattice:

notes

For each pixel, let us specify the conditional distribution of its color given the colors of its neighbors. Suppose there are two colors, 0 and 1. But it is possible to specify conditional distributions for each pixel that simply don’t work together, i.e., there might be no joint distribution having the given conditional distributions.

In general, we can’t expect to specify a full set of conditional distributions as above. Fortunately, the Hammersley-Clifford Theorem says that a random field’s having the Markov property is equivalent to its having a Gibbs distribution, which is a friendly sort of distribution. Thus, instead of worrying about specifying our MRF’s in terms of consistent conditional distributions, we can just consider Gibbs distributions, which are simple to write down and work with.

Some definitions needed to state HC:

DEF: A set of nodes $C$ is complete if all distinct nodes in $C$ are neighbors of each other. A clique is a maximal complete set of nodes.

DEF: Let $G$ be a finite graph. A Gibbs distribution with respect to $G$ is a probability mass function that can be expressed in the form

$$p(x) = \prod_{C \text{ complete}} V_C(x),$$

where each $V_C$ is a function that depends only on the values $x_C = (x_s : s \in C)$ of $x$ at the nodes in $C$. That is, the function $V_C$ satisfies $V_C(x) = V_C(y)$ if $x_C = y_C$.

By combining functions $V_C$ for sets $C$ that are subsets of the same clique, we see that we can further reduce the product in the definition of a Gibbs distribution to

$$p(x) = \prod_{C \text{ a clique}} V_C(x).$$

THM (HAMMERSLEY-CLIFFORD): Suppose that $X = (X_1, \ldots, X_n)$ has a positive joint probability mass function. Then $X$ is a Markov random field on $G$ iff $X$ has a Gibbs distribution with respect to $G$.

EXAMPLE: A Markov chain $X_0, \ldots, X_n$ has a joint distribution of the form $p(x_0, x_1, \ldots, x_n) = \pi_0(x_0) P_1(x_0, x_1) P_2(x_1, x_2) \cdots P_n(x_{n-1}, x_n)$.

By defining $V_{\{0,1\}}(x_0, x_1) = \pi_0(x_0) P_1(x_0, x_1)$ and $V_{\{k-1,k\}}(x_{k-1}, x_k) = P_k(x_{k-1}, x_k)$ for $k > 1$, we see that this product is a Gibbs distribution on the graph

(x_0) -- (x_1) -- (x_2) -- ... -- (x_n)

PROP: Suppose $(X, Y)$ is a Markov random field on the graph $G$ with the neighborhood structure $N$. Write $G = G_X \cup G_Y$, where $G_X$ and $G_Y$ are the sets of nodes in $G$ corresponding to the $X$ and $Y$ random variables, respectively. Then the marginal distribution of $Y$ is a Markov random field on $G_Y$, where two nodes $y_1, y_2 \in G_Y$ are neighbors if either

The conditional distribution of $X$ given $Y$ is a Markov random field on the graph $G_X$, where nodes $x_1$ and $x_2$ are neighbors if $x_1 \sim x_2$, that is, if $x_1, x_2$ were neighbors in the original graph.

3.4 Long range dependence in the Ising model

For this section, we will work in $\mathbb{Z}^d$. For each $t \in \mathbb{Z}^d$ there is a binary random variable $X_t$ taking values in $\{-1, 1\}$. The Ising model gives a joint probability distribution for these random variables.

We consider a special case of the Ising model as follows.

For $x$ a configuration of $+1$’s and $-1$’s at the nodes of a finite subset of $\mathbb{Z}^d$, let $b(x)$ denote the number of “odd bonds” in $x$, that is, the number of edges $\{t, u\}$ such that $x_t \neq x_u$. Then, under the Ising model, a configuration $x$ has a probability proportional to $\alpha^{b(x)}$, where $\alpha$ is a positive parameter of the distribution (typically $< 1$). The choice $\alpha = 1$ corresponds to the uniform distribution, giving equal probability to all configurations. Distributions with small $\alpha$ strongly discourage odd bonds, placing large probability on configurations with few odd bonds.

For the case $d = 1$, the model corresponds to a stationary Markov chain with probability transition matrix

$$P_\alpha = \begin{pmatrix} 1/(1+\alpha) & \alpha/(1+\alpha) \\ \alpha/(1+\alpha) & 1/(1+\alpha) \end{pmatrix}.$$

The Basic Limit Theorem tells us

$$P_\alpha^n \to \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix} \text{ as } n \to \infty.$$

Thus, the state $X_0$ is asymptotically independent of information at nodes far away from 0.
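A quick numerical illustration of this convergence (the value $\alpha = 0.3$ is an arbitrary choice):

```python
# Powers of P_alpha approach the matrix with all entries 1/2.
import numpy as np

alpha = 0.3
P = np.array([[1, alpha],
              [alpha, 1]]) / (1 + alpha)

print(np.linalg.matrix_power(P, 5))
print(np.linalg.matrix_power(P, 50))   # essentially [[0.5, 0.5], [0.5, 0.5]]
```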

We will show that the situation in $d \geq 2$ is qualitatively different, in that the effect of states at remote nodes does not disappear in 2 dimensions.

To state the result, some notation:

Imagine a “cube” $K_n$ of side length $2n$ centered at 0 in $\mathbb{Z}^d$, consisting of lattice points whose $d$ coordinates all lie between $-n$ and $n$:

$$K_n = \{t \in \mathbb{Z}^d : |t_i| \leq n \text{ for all } i = 1, \ldots, d\}.$$

Let $B_n$ denote the boundary points of the cube $K_n$, that is, the points having at least one coordinate equal to $n$ in absolute value.

Let $P^{(n)}_+\{X = x\}$ denote the Ising probability of $X = x$, conditional on $X_t = +1$ for all $t \in B_n$. Similarly, let $P^{(n)}_-\{X = x\}$ denote probabilities conditional on $-1$’s on the boundary.

THEOREM: For the Ising model on $\mathbb{Z}^2$, the effect of the boundary does not disappear. In particular, there exists $\alpha$ such that $P^{(n)}_+\{X_0 = -1\}$ remains below $0.4$ for all $n$, no matter how large.

3.5 Hidden Markov Chains

3.5.1 Model overview

The hidden Markov chain model is an MRF in which some of the random variables are observed and others are not (hence hidden). In the graph structure for the hidden Markov chain, the hidden chain is $X_0, X_1, \ldots, X_n$ and the observed process is $Y_0, Y_1, \ldots, Y_n$.

notes

Edges join $X_t$ to both $Y_t$ and $X_{t+1}$. The model is parametrized by a marginal distribution $\zeta$ of $X_0$ and, if we assume time-homogeneity of the transition matrices, by two transition matrices $A$ and $B$, where $A(i,j) = \mathbb{P}\{X_{t+1} = j \mid X_t = i\}$ and $B(i,j) = \mathbb{P}\{Y_t = j \mid X_t = i\}$.

Let us write $\theta = (\zeta, A, B)$ for the vector of all parameters. If there are $u$ states possible for each of the hidden $X$ random variables and $v$ outcomes possible for the observed $Y$ random variables, then $\zeta$ is a vector of $u$ probabilities, $A$ is a $u \times u$ probability transition matrix, and $B$ is a $u \times v$ matrix, each of whose rows is a probability mass function.

The hidden Markov chain is pretty general. Examples of cases include

3.5.2 How to calculate likelihoods

The likelihood function $L = L(\theta)$ is the probability of the observed data, as a function of the parameters of the model. The tricky aspect is that we observe only the $Y$’s, so that

$$L(\theta) = p_\theta(y_0, \ldots, y_n) = \sum_{x_0} \sum_{x_1} \cdots \sum_{x_n} p_\theta(x_0, \ldots, x_n, y_0, \ldots, y_n) = \sum_x p_\theta(x, y).$$

This is a large sum. Even if the size of the state space of the hidden variables is just $u = 2$, the sum still has $2^{n+1}$ terms. Without a way around this computational issue, the hidden Markov chain model would be of little practical use. However, it turns out that we can do these calculations in time linear in $n$.

We will denote the state space for the hidden variables $X_t$ by $\mathcal{X}$.

We are thinking of the observed $Y$ values as fixed here; we know them, and will denote them by $y_0, \ldots, y_n$. For each $t = 0, 1, \ldots, n$ and for each $x_t \in \mathcal{X}$, define

$$\alpha_t(x_t) = p_\theta(y_0, \ldots, y_t, x_t).$$

We can calculate the function $\alpha_0$ immediately:

$$\alpha_0(x_0) = p_\theta(y_0, x_0) = \zeta(x_0) B(x_0, y_0).$$

Note the simple recursion that expresses $\alpha_{t+1}$ in terms of $\alpha_t$:

$$\begin{aligned} \alpha_{t+1}(x_{t+1}) &= p_\theta(y_0, \ldots, y_{t+1}, x_{t+1}) \\ &= \sum_{x_t \in \mathcal{X}} p_\theta(y_0, \ldots, y_t, x_t, x_{t+1}, y_{t+1}) \\ &= \sum_{x_t \in \mathcal{X}} p_\theta(y_0, \ldots, y_t, x_t)\, p_\theta(x_{t+1} \mid x_t)\, p_\theta(y_{t+1} \mid x_{t+1}) \\ &= \sum_{x_t \in \mathcal{X}} \alpha_t(x_t) A(x_t, x_{t+1}) B(x_{t+1}, y_{t+1}). \end{aligned}$$

The sum in the recursion

$$\sum_{x_t \in \mathcal{X}} \alpha_t(x_t) A(x_t, x_{t+1}) B(x_{t+1}, y_{t+1})$$

is fairly modest in comparison. Using the recursion to calculate the function $\alpha_{t+1}$ from $\alpha_t$ involves a fixed amount of work, and the task gets no harder as $t$ increases. Thus, the amount of work to calculate all of the probabilities $\alpha_t(x_t)$ for $t = 0, \ldots, n$ and $x_t \in \mathcal{X}$ is linear in $n$.

Having completed the recursion to calculate the function $\alpha_n$, the likelihood is simply

$$L(\theta) = p_\theta(y_0, \ldots, y_n) = \sum_{x_n} p_\theta(y_0, \ldots, y_n, x_n) = \sum_{x_n} \alpha_n(x_n).$$

The above probabilities are called “forward” probabilities. In a similar manner, we can calculate the “backward” probabilities

$$\beta_t(x_t) = p_\theta(y_{t+1}, \ldots, y_n \mid x_t) = \mathbb{P}_\theta\{Y_{t+1} = y_{t+1}, \ldots, Y_n = y_n \mid X_t = x_t\}$$

by using the recursion

$$\beta_{t-1}(x_{t-1}) = \sum_{x_t} A(x_{t-1}, x_t) B(x_t, y_t) \beta_t(x_t).$$
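A sketch of the forward recursion in code; the two-state hidden chain, observation matrix, and observed sequence below are made-up illustration values, not from the notes.

```python
# Forward recursion for the likelihood of an observed sequence under an HMM.
import numpy as np

zeta = np.array([0.6, 0.4])             # distribution of X_0
A = np.array([[0.7, 0.3],               # A[i, j] = P(X_{t+1} = j | X_t = i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],               # B[i, j] = P(Y_t = j | X_t = i)
              [0.2, 0.8]])
y = [0, 1, 1, 0, 1]                     # observed values y_0, ..., y_n

alpha = zeta * B[:, y[0]]               # alpha_0(x_0) = zeta(x_0) B(x_0, y_0)
for t in range(1, len(y)):
    # alpha_t(x_t) = sum_{x_{t-1}} alpha_{t-1}(x_{t-1}) A(x_{t-1}, x_t) B(x_t, y_t)
    alpha = (alpha @ A) * B[:, y[t]]

likelihood = alpha.sum()                # L(theta) = sum over x_n of alpha_n(x_n)
print(likelihood)
```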

3.5.3 Maximum Likelihood and the EM algorithm

Without an organized method for changing $\theta = (\zeta, A, B)$, it is hard to find the $\theta$ that maximizes the likelihood. The EM algorithm is a method for finding maximum likelihood estimates that is applicable to many statistical problems, including hidden Markov chains.

PROP: For $p = (p_1, \ldots, p_k)$ and $q = (q_1, \ldots, q_k)$ two probability mass functions on $\{1, \ldots, k\}$, we have

$$\sum_i p_i \log p_i \geq \sum_i p_i \log q_i.$$

A brief description of the EM algorithm: we’ll assume $X, Y$ are discrete random variables or vectors, so probabilities are given by sums rather than integrals.

We want to find the $\theta$ maximizing $\log L(\theta)$, where $L(\theta) = p_\theta(y) = \sum_x p_\theta(x, y)$. The EM algorithm repeats the following update, which is guaranteed to increase the likelihood at each iteration.

Let $\theta_0$ denote the current value of $\theta$. Replace $\theta_0$ by $\theta_1$, the value of $\theta$ that maximizes

$$\mathbb{E}_{\theta_0}\left[\log p_\theta(X, y) \mid Y = y\right].$$

Why does it work? For a given $\theta_0$, define $g(\theta) = \mathbb{E}_{\theta_0}[\log p_\theta(X, y) \mid Y = y]$. We will see that, in order to have $p_{\theta_1}(y) > p_{\theta_0}(y)$, we do not need to find $\theta_1$ maximizing $g$, but rather it is enough to find a $\theta_1$ such that $g(\theta_1) > g(\theta_0)$.

PROP: If $\mathbb{E}_{\theta_0}[\log p_{\theta_1}(X, y) \mid Y = y] > \mathbb{E}_{\theta_0}[\log p_{\theta_0}(X, y) \mid Y = y]$, then $p_{\theta_1}(y) > p_{\theta_0}(y)$.

3.5.4 Applying the EM algorithm to a hidden Markov chain

Consider a hidden Markov chain for $(X, Y) = (X_0, \ldots, X_n, Y_0, \ldots, Y_n)$, where the r.v.s $Y_t$ are observed and the r.v.s $X_t$ are hidden. The model is parametrized by $\theta = (\zeta, A, B)$, with $\zeta$ the distribution of $X_0$, $A$ the probability transition matrix for the $X$ chain, and $B$ the probability transition matrix from $X_t$ to $Y_t$.

To describe one iteration of the EM method, imagine our current guess for $\theta$ is $\theta_0 = (\zeta_0, A_0, B_0)$. We want a new guess $\theta_1 = (\zeta_1, A_1, B_1)$ that has higher likelihood.

FINISH THIS SECTION PG 107…

In summary, to do EM on the hidden Markov chain model:

$$\begin{aligned} \zeta_1(i) &= \sum_j \gamma_0(i,j) \\ A_1(i,j) &= \frac{\sum_{t=0}^{n-1} \gamma_t(i,j)}{\sum_l \sum_{t=0}^{n-1} \gamma_t(i,l)} \\ B_1(i,j) &= \frac{\sum_{t=0}^{n-1} \sum_l \gamma_t(i,l)\, I(y_t = j) + \sum_m \gamma_{n-1}(m,i)\, I(y_n = j)}{\sum_{t=0}^{n-1} \sum_l \gamma_t(i,l) + \sum_m \gamma_{n-1}(m,i)}. \end{aligned}$$

3.6 Gibbs Sampler

Terms like “Markov chain Monte Carlo” and “Markov sampling” refer to methods for generating random samples from given distributions by running Markov chains. Although such methods have quite a long history, they have become the subject of renewed interest in the last decade, particularly with the introduction of the “Gibbs sampler” by Geman and Geman (1984), who used the method in a Bayesian approach to image reconstruction. The Gibbs sampler itself has enjoyed a recent surge of intense interest within the statistics community, spurred by Gelfand and Smith (1990), who applied the Gibbs sampler to a wide variety of inference problems.

Recall that a distribution $\pi$ being “stationary” for a Markov chain $X_0, X_1, \ldots$ means that, if $X_0 \sim \pi$, then $X_n \sim \pi$ for all $n$. The basic phenomenon underlying all Markov sampling methods is the convergence in distribution of a Markov chain to its stationary distribution: if a Markov chain $X_0, X_1, \ldots$ has stationary distribution $\pi$, then under the conditions of the Basic Limit Theorem, the distribution of $X_n$ for large $n$ is close to $\pi$. Thus, in order to generate an observation from a desired distribution $\pi$, we find a Markov chain $X_0, X_1, \ldots$ that has $\pi$ as its stationary distribution. The Basic Limit Theorem then suggests that running or simulating the chain until a large time $n$ will produce a random variable $X_n$ whose distribution is close to the desired $\pi$. By taking $n$ large enough, in principle we obtain a value that may for practical purposes be considered a random draw from the distribution $\pi$.

The Gibbs sampler is a way of constructing a Markov chain having a desired stationary distribution. A simple setting that illustrates the idea involves a probability mass function $\pi$ of the form $\pi(x, y)$. Suppose we want to generate a random vector $(X, Y) \sim \pi$. Denote the conditional probability distributions by $\pi(\cdot \mid X = \cdot)$ and $\pi(\cdot \mid Y = \cdot)$. To perform a Gibbs sampler, start with any initial point $(X_0, Y_0)$. Then generate $X_1$ from the conditional distribution $\pi(\cdot \mid Y = Y_0)$, and generate $Y_1$ from the conditional distribution $\pi(\cdot \mid X = X_1)$. Continue on in this way, generating $X_2$ from the conditional distribution $\pi(\cdot \mid Y = Y_1)$ and $Y_2$ from the conditional distribution $\pi(\cdot \mid X = X_2)$, and so on. Then the distribution $\pi$ is stationary for the Markov chain $\{(X_n, Y_n) : n = 0, 1, \ldots\}$.

To see this, suppose $(X_0, Y_0) \sim \pi$. In particular, $Y_0$ is distributed according to the $Y$-marginal of $\pi$, so that, since $X_1$ is drawn from the conditional distribution of $X$ given $Y = Y_0$, we have $(X_1, Y_0) \sim \pi$. Now we use the same reasoning again: $X_1$ is distributed according to the $X$-marginal of $\pi$, and $Y_1$ is drawn from the conditional distribution of $Y$ given $X = X_1$, so that $(X_1, Y_1) \sim \pi$. Thus, the Gibbs sampler Markov chain $\{(X_n, Y_n) : n \geq 0\}$ has the property that if $(X_0, Y_0) \sim \pi$ then $(X_1, Y_1) \sim \pi$; that is, the distribution $\pi$ is stationary.
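A sketch of this two-component Gibbs sampler for a small made-up joint pmf $\pi(x, y)$ on $\{0,1\} \times \{0,1\}$; the long-run frequencies of $(X_n, Y_n)$ should be close to $\pi$.

```python
# Two-component Gibbs sampler: alternately draw X from pi(.|Y) and Y from pi(.|X).
import numpy as np

rng = np.random.default_rng(3)
pi = np.array([[0.30, 0.10],            # pi[x, y] for x, y in {0, 1}
               [0.20, 0.40]])

x, y = 0, 0                             # arbitrary starting point (X_0, Y_0)
counts = np.zeros_like(pi)
for n in range(100000):
    px = pi[:, y] / pi[:, y].sum()      # conditional distribution of X given Y = y
    x = rng.choice(2, p=px)
    py = pi[x, :] / pi[x, :].sum()      # conditional distribution of Y given X = x
    y = rng.choice(2, p=py)
    counts[x, y] += 1

print(counts / counts.sum())            # empirical frequencies, compare with pi
```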

Simulating a Markov chain is technically and conceptually simple. We just generate the random variables in the chain, in order, and we are done. However, the index set of a Markov random field has no natural ordering in general. This is what causes iterative methods such as the Gibbs sampler to be necessary.

The Gibbs sampler is well suited to Markov random fields, since it works by repeatedly sampling from the conditional distribution at one node given the values at the remaining nodes, and the Markov property is precisely the statement that these conditional distributions are simple, depending only on the neighbors of the node.

Ch. 4 – Martingales

4.2 Definitions

Notation: Given a process $W = \{W_k\}$, let $W_{m,n}$ denote the portion $W_m, W_{m+1}, \ldots, W_n$ of the process from time $m$ up to time $n$.

DEF: A process $M_0, M_1, \ldots$ is a martingale if

$$\mathbb{E}[M_{n+1} \mid M_{0,n}] = M_n \text{ for each } n \geq 0.$$

Alternatively, a process $M_0, M_1, \ldots$ is a martingale with respect to another process $W_0, W_1, \ldots$ if

$$\mathbb{E}[M_{n+1} \mid W_{0,n}] = M_n \text{ for each } n \geq 0.$$

The crux of the definition $\mathbb{E}[M_{n+1} \mid W_{0,n}] = M_n$ is a “fair game” sort of requirement. If we are playing a fair game, we expect neither to win nor to lose money on the average. Given the history of our fortunes up to time $n$, our expected fortune at the future time $n+1$ should just be the fortune $M_n$ that we have at time $n$.

A minor technical condition: we also require $\mathbb{E}|M_n| < \infty$ for all $n$, so that the conditional expectations in the definition are guaranteed to be well-defined.

How about submartingales and supermartingales? These are processes that are “better than fair” and “worse than fair,” respectively.

DEF: A process $X_0, X_1, \ldots$ is a submartingale with respect to a process $W_0, W_1, \ldots$ if $\mathbb{E}[X_{n+1} \mid W_{0,n}] \geq X_n$ for each $n \geq 0$.

$\{X_n\}$ is a supermartingale with respect to $\{W_n\}$ if $\mathbb{E}[X_{n+1} \mid W_{0,n}] \leq X_n$ for each $n \geq 0$.

4.3 Examples

See book

4.4 Optional Sampling

The optional sampling property is a “conservation of fairness” type of property.

By the “fair game” property, $\mathbb{E}[M_{n+1}] = \mathbb{E}\{\mathbb{E}[M_{n+1} \mid W_{0,n}]\} = \mathbb{E}[M_n]$ for all $n$. This implies that

$$\mathbb{E}M_n = \mathbb{E}M_0 \text{ for all times } n \geq 0.$$

That is, one can stop at any predetermined time $t$, like $t = 8$, and the winnings will be fair: $\mathbb{E}M_8 = \mathbb{E}M_0$.

Fairness is also conserved in many, but not all, cases in which one stops at a random time that is not predetermined but depends on the observed sample path of the game. The issue of optional sampling is this:

If $T$ is a random time, that is, $T$ is a nonnegative random variable, does the equality $\mathbb{E}M_T = \mathbb{E}M_0$ still hold?

There are two sorts of things that shouldn’t be allowed if we want fairness to be conserved:

In fact, ruling out these two sorts of behaviors leaves a class of random times $T$ at which the optional sampling statement $\mathbb{E}M_T = \mathbb{E}M_0$ holds. Disallowing arbitrarily long times is done by assuming $T$ bounded. Random times that disallow the gambler from peeking ahead into the future are called stopping times.

DEF: A random variable $T$ taking values in the set $\{0, 1, 2, \ldots, \infty\}$ is a stopping time with respect to the process $W_0, W_1, \ldots$ if for each integer $k$, the indicator random variable $I\{T = k\}$ is a function of $W_{0,k}$.

The main optional sampling result:

THM: Let $M_0, M_1, \ldots$ be a martingale w.r.t. $W_0, W_1, \ldots$, and let $T$ be a bounded stopping time. Then $\mathbb{E}M_T = \mathbb{E}M_0$.
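A Monte Carlo sanity check of the theorem (the walk, the hitting level, and the bound are arbitrary choices): for a symmetric $\pm 1$ random walk $M_n$ with $M_0 = 0$ and the bounded stopping time $T = \min\{\text{first } n \text{ with } M_n = 3,\ 20\}$, the sample mean of $M_T$ should be near $\mathbb{E}M_0 = 0$.

```python
# Check E M_T = E M_0 for a bounded stopping time of a +/-1 random walk martingale.
import numpy as np

rng = np.random.default_rng(4)

def sample_MT():
    M = 0
    for n in range(1, 21):             # T is bounded by 20
        M += rng.choice([-1, 1])
        if M == 3:
            return M                   # stopped at the hitting time
    return M                           # otherwise stopped at the bound n = 20

samples = [sample_MT() for _ in range(100000)]
print(np.mean(samples))                # close to E M_0 = 0
```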

Ch. 5 – Brownian Motion

Ch. 6 – Diffusions and Stochastic Calculus

Ch. 7 – Likelihood Ratios

Ch. 8 – Extremes and Poisson Clumping