Scattered notes from Prof. Joe Chang’s stochastic processes book, intended for a first undergraduate course. Brownian motion and stochastic integration largely omitted; some sections left incomplete and proofs largely reference the book. See other posts for more details on stochastic calculus.
1 – Markov Chains
1.1 Specifying and Simulating a Markov Chain
to specify a Markov chain, we need to know its
state space S, a finite or countable set of states—values the random variables Xi may take on
initial distribution π0, the probability distribution of the Markov chain at time 0.
for each state i, π0(i):=P{X0=i}, the probability the Markov chain starts in state i. We can also think of π0 as a vector whose i-th entry is π0(i)
probability transition matrix P=(Pij)
if S has N states, then P is N×N
Pij or P(i,j) is the probability that the chain will transition to j from i: Pij=P{Xn+1=j∣Xn=i}.
rows sum to 1 (think: from state i, we are in row i, and there must be total probability 1 of going to some next state j); a small simulation sketch follows below
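As a quick aside (not from the book), here is a minimal Python sketch of simulating a chain from these three ingredients; the two-state matrix `P` and vector `pi0` in the example are made up.

```python
import numpy as np

def simulate_chain(P, pi0, n_steps, rng=None):
    """Simulate a Markov chain with row-stochastic transition matrix P and initial distribution pi0."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(pi0)
    states = np.empty(n_steps + 1, dtype=int)
    states[0] = rng.choice(N, p=pi0)                   # draw X_0 from pi0
    for t in range(n_steps):
        states[t + 1] = rng.choice(N, p=P[states[t]])  # draw X_{t+1} from row X_t of P
    return states

# Hypothetical two-state example
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi0 = np.array([1.0, 0.0])
print(simulate_chain(P, pi0, 20))
```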
1.2 The Markov Property
a process X0,X1,… satisfies the Markov property if P{Xn+1=j∣X0=i0,…,Xn−1=in−1,Xn=i} = P{Xn+1=j∣Xn=i} for all n and all states i0,…,in−1,i,j.
THM (BASIC LIMIT): Let X0,X1,... be an irreducible, aperiodic Markov chain having a stationary distribution π(⋅). Let X0 have the distribution π0, an arbitrary initial distribution.
Then limn→∞πn(i)=π(i) for all states i.
Let’s define “irreducible,” “aperiodic,” and “stationary distribution” so this makes sense.
1.5 Stationary Distribution
π being a stationary distribution amounts to saying π=πP is satisfied, i.e.,
π(j) = ∑_{i∈S} π(i)P(i,j)
for all j∈S.
a Markov chain might have no stationary distribution, one stationary distribution, or infinitely many stationary distributions.
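A small numerical aside (not in the book): for a finite chain, a stationary distribution can be found by solving π=πP together with the normalization ∑π(i)=1. The 2×2 matrix below is a made-up example.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi = pi P together with sum(pi) = 1; works when the stationary distribution is unique."""
    N = P.shape[0]
    A = np.vstack([P.T - np.eye(N), np.ones(N)])   # (P^T - I) pi = 0 plus the normalization row
    b = np.zeros(N + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi = stationary_distribution(P)
print(pi, np.allclose(pi @ P, pi))   # pi = (5/6, 1/6) solves pi = pi P
```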
for subsets A,B of the state space, define the probability flux from set A into B as
flux(A,B) = ∑_{i∈A} ∑_{j∈B} π(i)P(i,j)
1.6 Irreducibility, Periodicity, Recurrence
Use Pi(A) as shorthand for P{A∣X0=i}, and same for Ei.
Accessibility: for two states i,j, we say that j is accessible from i if it is possible for the chain ever to visit state j if the chain starts in state i:
Pi{⋃_{n=0}^∞ {Xn=j}} > 0.
Equivalently,
∑_{n=0}^∞ Pn(i,j) = ∑_{n=0}^∞ Pi{Xn=j} > 0.
Communication: we say i communicates with j if i is accessible from j and j is accessible from i.
Irreducibility: the Markov chain is irreducible if all pairs of states communicate.
The relation “communicates with” is an equivalence relation; hence, the state space S can be partitioned into “communicating classes” or “classes.”
–
The Basic Limit Theorem requires irreducibility and aperiodicity (see 1.6). Trivial examples showing why:
Irreducibility: take S={0,1}, P = [1 0; 0 1] (the identity matrix). Then πn=π0 holds for all n, i.e., πn does not approach a limit independent of π0.
Aperiodicity: same S, take P = [0 1; 1 0]. If, for example, π0=(1,0), then πn alternates between (1,0) and (0,1) for even and odd n, and does not converge to anything.
–
Period: Given a Markov chain {X0,X1,...}, define the period of a state i to be the greatest common divisor
di=gcd{n:Pn(i,i)>0}.
THM: if the states i and j communicate, then di=dj.
The period of a state is a “class property.” In particular, all states in an irreducible Markov chain have the same period. Thus, we can speak of the period of a Markov chain if the Markov chain is irreducible.
An irreducible Markov chain is aperiodic if its period is 1, and periodic otherwise. (A sufficient but not necessary condition for an irreducible chain to be aperiodic is that there exists a state i such that P(i,i)>0.)
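A small sketch (mine, not the book's) of computing the period numerically from the definition, by taking the gcd of the return times n ≤ n_max with Pn(i,i)>0; the two example matrices are hypothetical.

```python
import numpy as np
from math import gcd
from functools import reduce

def period(P, i, n_max=200):
    """Compute gcd{n <= n_max : P^n(i,i) > 0}; for a finite chain this equals the period d_i
    once n_max is large enough that several return times have been seen."""
    return_times = []
    Pn = np.eye(P.shape[0])
    for n in range(1, n_max + 1):
        Pn = Pn @ P
        if Pn[i, i] > 0:
            return_times.append(n)
    return reduce(gcd, return_times) if return_times else 0

flip = np.array([[0.0, 1.0], [1.0, 0.0]])   # deterministic 2-cycle: period 2
lazy = np.array([[0.5, 0.5], [1.0, 0.0]])   # P(0,0) > 0, so aperiodic: period 1
print(period(flip, 0), period(lazy, 0))
```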
–
One more concept for the Basic Limit Theorem: recurrence. We will begin by defining recurrence, then show that it is a class property. In particular, in an irreducible Markov chain, either all states are recurrent or all states are transient.
The idea of recurrence: a state i is recurrent if, starting from the state i at time 0, the chain is sure to return to i eventually. More precisely, define the first hitting time Ti of the state i by
Ti=inf{n>0:Xn=i}.
Recurrence: the state i is recurrent if Pi{Ti<∞}=1. If i is not recurrent, it is called transient.
(note that accessibility could equivalently be defined: for distinct states i≠j, j is accessible from i iff Pi{Tj<∞}>0.)
THM: Let i be a recurrent state, and suppose that j is accessible from i. Then all of the following hold:
Pi{Tj<∞}=1
Pj{Ti<∞}=1
The state j is recurrent
–
We use the notation Ni for the total number of visits of the Markov chain to the state i:
Ni = ∑_{n=0}^∞ I{Xn=i}.
THM: The state i is recurrent iff Ei(Ni)=∞.
COROLLARY: If j is transient, then limn→∞Pn(i,j)=0 for all states i.
–
Introducing stationary distributions.
PROP: Suppose a Markov chain has a stationary distribution π. If the state j is transient, then π(j)=0.
COROLLARY: If an irreducible Markov chain has a stationary distribution, then the chain is recurrent.
Note that the converse for the above is not true. There are irreducible, recurrent Markov chains that do not have stationary distributions. For example, the simple symmetric random walk on the integers in one dimension is irreducible and recurrent but does not have a stationary distribution. By recurrence we have P0{T0<∞}=1, but also E0{T0}=∞. The name for this kind of recurrence is null recurrence, i.e., a state i is null recurrent if it is recurrent and Ei(Ti)=∞. Otherwise, a recurrent state is positive recurrent, where Ei(Ti)<∞.
Positive recurrence is also a class property: if a chain is irreducible, the chain is either transient, null recurrent, or positive recurrent. In fact, an irreducible chain has a stationary distribution iff it is positive recurrent.
1.7 Coupling
Example of coupling technique: consider a random graph on a given finite set of nodes, in which each pair of nodes is joined by an edge independently with probability p. We could simulate a random graph as follows: for each pair of nodes i,j generate a random number Uij∼U[0,1], and join nodes i and j with an edge if Uij≤p.
How do we show that the probability of the resulting graph being connected is nondecreasing in p, i.e., show that for p1<p2,
Pp1{graph connected}≤Pp2{graph connected}.
We could try to find an explicit function for the probability in terms of p, which seems inefficient. How to formalize the intuition that this seems obvious?
An idea: show that corresponding events are ordered, i.e., if A⊂B then P(A)≤P(B).
Let’s make 2 events by making 2 random graphs, G1,G2 on the same set of nodes. G1 is constructed by having each possible edge appear with prob p1, and for G2, each edge present with prob p2. We can do this by using two sets of U[0,1] random variables: {Uij},{Vij} for the first and second graph, respectively. Is it true that
{G1 connected}⊂{G2 connected}?
No, since the two sets of r.v.s are independently generated.
A change: use the same random numbers for each graph. Then
{G1 connected}⊂{G2 connected}
becomes true. This establishes monotonicity of the probability being connected.
Conclusion: what characterizes a coupling argument? Generally, we show that the same set of random variables can be used to construct two different objects about which we want to make a probabilistic statement.
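A quick simulation sketch (not from the book) of the coupled construction: both graphs are built from the same uniforms Uij, so every edge of G1 is also an edge of G2, and the event {G1 connected} is contained in {G2 connected}. The graph size and edge probabilities below are arbitrary.

```python
import numpy as np

def connected(n, edges):
    """Check whether the undirected graph on nodes 0..n-1 with the given edge list is connected (DFS)."""
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

def coupled_graphs(n, p1, p2, rng):
    """Build G1 and G2 from the SAME uniforms U_ij: the edge {i,j} is present iff U_ij <= p."""
    U = {(i, j): rng.random() for i in range(n) for j in range(i + 1, n)}
    return [e for e, u in U.items() if u <= p1], [e for e, u in U.items() if u <= p2]

rng = np.random.default_rng(0)
violations = 0
for _ in range(2000):
    g1, g2 = coupled_graphs(8, 0.2, 0.4, rng)
    if connected(8, g1) and not connected(8, g2):   # never happens with the shared uniforms
        violations += 1
print(violations)  # 0
```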
1.8 Proof of Basic Limit Theorem
The Basic Limit Theorem says that if an irreducible, aperiodic Markov chain has a stationary distribution π, then for each initial distribution π0, as n→∞ we have πn(i)→π(i) for all states i.
(Note the wording “a stationary distribution”: assuming BLT true implies that an irreducible and aperiodic Markov chain cannot have two different stationary distributions)
Equivalently, the conclusion can be phrased in terms of a distance between probability distributions, called “total variation distance”:
DEF: Let λ and μ be two probability distributions on the set S. Then the total variation distance ∣∣λ−μ∣∣ is defined by
∣∣λ−μ∣∣ = sup_{A⊂S} [λ(A)−μ(A)].
PROP: The total variation distance may also be expressed in the alternative forms
∣∣λ−μ∣∣ = (1/2)∑_{i∈S} ∣λ(i)−μ(i)∣ = ∑_{i∈S} [λ(i)−μ(i)]^+.
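A small check of these expressions (my own code, with made-up vectors): the sup over A is attained at A = {i : λ(i) > μ(i)}, which is what `tv_sup` computes.

```python
import numpy as np

def tv_sup(lam, mu):
    """sup over A of [lam(A) - mu(A)]; the sup is attained at A = {i : lam(i) > mu(i)}."""
    diff = lam - mu
    return diff[diff > 0].sum()

def tv_half_l1(lam, mu):
    """(1/2) * sum_i |lam(i) - mu(i)|."""
    return 0.5 * np.abs(lam - mu).sum()

lam = np.array([0.5, 0.3, 0.2])
mu = np.array([0.2, 0.3, 0.5])
print(tv_sup(lam, mu), tv_half_l1(lam, mu))   # both equal 0.3
```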
We now introduce the coupling method. Let Y0,Y1,... be a Markov chain with the same probability transition matrix as X0,X1,..., but let Y0 have the initial distribution π and X0 have the initial distribution π0. Note that {Yn} is a stationary Markov chain with distribution π for all n. Let the Y chain be independent of the X chain.
We want to show that, for large n, the probabilistic behavior of Xn is close to that of Yn.
Define the coupling time T to be the first time at which Xn=Yn:
T=inf{n:Xn=Yn}.
LEMMA: For all n we have
∣∣πn−π∣∣≤P{T>n}.
Hence we need only show that P{T>n}→0, or equivalently, that P{T<∞}=1.
Consider the bivariate chain {Zn=(Xn,Yn):n≥0}. Z0,Z1,... is clearly a Markov chain on the state space S×S. Since the X and Y chains are independent, the probability transition matrix PZ of the chain Z can be written
PZ((ix,iy),(jx,jy)) = P(ix,jx)P(iy,jy).
Z has stationary distribution
πZ(ix,iy) = π(ix)π(iy).
We want to show that P{T<∞}=1. In terms of the Z chain, we want to show that with probability one, the Z chain hits the “diagonal” {(j,j):j∈S} in S×S in finite time. To do this, it is sufficient to show that the Z chain is irreducible and recurrent.
This is where we use aperiodicity.
LEMMA: Suppose A is a set of positive integers that is closed under addition and has greatest common divisor 1. Then there exists an integer N s.t. n∈A for all n≥N.
Let i∈S and recall the assumption that the chain is aperiodic. Since the set {n:Pn(i,i)>0} is closed under addition, and, from aperiodicity, has greatest common divisor 1, we can use the previous lemma. So Pn(i,i)>0 for all sufficiently large n. From this, for any i,j∈S, since irreducibility implies Pm(i,j)>0 for some m, it follows that Pn(i,j)>0 for all sufficiently large n.
Now we show irreducibility of the Z chain. Let ix,iy,jx,jy∈S. It is sufficient to show that PZ^n((ix,iy),(jx,jy))>0 for some n. By the assumed independence of {Xn} and {Yn}, we have
PZ^n((ix,iy),(jx,jy)) = P^n(ix,jx) P^n(iy,jy),
which (by the previous argument) is positive for all sufficiently large n, so we are done. (Recurrence of the Z chain then follows from the earlier corollary, since Z is irreducible and has the stationary distribution πZ.)
1.9 SLLN for Markov Chains
The usual Strong Law of Large Numbers for iid random variables says that if X1,X2,... are iid with mean μ, then
P{(1/n)∑_{t=1}^n Xt → μ as n→∞} = 1.
We will do a generalization of this result for Markov chains: the fraction of times a Markov chain occupies state i converges to a limit.
Although the successive states of a Markov chain are not independent, certain features of a Markov chain are independent of each other. Here we will use the idea that the path of the chain consists of a succession of independent “cycles,” the segments of the path between successive visits to a recurrent state. This independence allows us to use the LLN that we already know.
THM: Let X0,X1,... be a Markov chain starting in the state X0=i, and suppose that the state i communicates with another state j. The limiting fraction of time that the chain spends in state j is 1/(EjTj). That is,
Pi{ lim_{n→∞} (1/n)∑_{t=1}^n I{Xt=j} = 1/(EjTj) } = 1.
In the proof of the previous theorem, we define Vn(j) as the number of visits made to state j by X1,...,Xn, i.e.,
Vn(j) = ∑_{t=1}^n I{Xt=j}.
Using the Bounded Convergence Theorem,
we have the following:
COROLLARY: For an irreducible Markov chain, we have
lim_{n→∞} (1/n)∑_{t=1}^n Pt(i,j) = 1/(EjTj)
for all states i and j.
Consider an irreducible, aperiodic Markov chain having a stationary distribution π. From the Basic Limit Theorem, we know that Pn(i,j)→π(j) as n→∞. Notice also that if a sequence of numbers converges to a limit, then the sequence of Cesaro averages converges to the same limit, i.e., if at→a as t→∞, then (1/n)∑_{t=1}^n at→a as n→∞. On the other hand, the previous Corollary shows that the Cesaro averages of the Pt(i,j)’s converge to 1/(EjTj). So, we must have
π(j) = 1/(EjTj).
In fact, aperiodicity is not needed for this conclusion:
THM: An irreducible, positive recurrent Markov chain has a unique stationary distribution π given by
π(j) = 1/(EjTj).
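A simulation sketch (not from the book) checking π(j)=1/(EjTj) for a made-up two-state chain: the long-run fraction of time in state 1 and the reciprocal of the average return time to state 1 should both be near π(1)=1/6.

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])          # hypothetical chain; its stationary distribution is (5/6, 1/6)

n = 100_000
x = 0                                # start in state 0
time_in_1 = 0
return_times = []                    # gaps between successive visits to state 1, to estimate E_1 T_1
last_visit = None
for t in range(1, n + 1):
    x = rng.choice(2, p=P[x])
    if x == 1:
        time_in_1 += 1
        if last_visit is not None:
            return_times.append(t - last_visit)
        last_visit = t

print(time_in_1 / n)                 # long-run fraction of time in state 1, near 1/6
print(1.0 / np.mean(return_times))   # 1 / (average return time to state 1), also near 1/6
```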
2 – Markov Chains: Examples & Applications
2.1 Branching Processes
Motivating example: the branching process model was formulated by Galton, who was interested in the survival and extinction of family names.
Suppose children inherit their fathers’ names, so we need only to keep track of fathers and sons. Consider a male who is the only member of his generation to have a given family name, so that the responsibility of keeping the family name alive falls upon him–if his line of male descendants terminates, so does the family name.
Suppose for simplicity that each male has probability f(0) of producing no sons, f(1) of producing 1 son, and so on.
What is the probability that the family name eventually becomes extinct?
To formalize this, let:
Gt = the number of males in generation t
Start with G0=1
If Gt=i, write Gt+1=Xt1+Xt2+...+Xti, where Xtj is the number of sons fathered by the j-th man in generation t
Assume the r.v.s {Xtj:t≥0,j≥1} are iid with probability mass function f
Hence P{Xtj=k}=f(k) for k=0,1,....
To avoid trivial cases, assume f(0)>0 and f(0)+f(1)<1.
We are interested in the extinction probability ρ=P1{Gt=0 for some t}.
{Gt:t≥0} is a Markov chain. State 0 is absorbing. So, for each i>0, since Pi{G1=0}=(f(0))i>0, the state i must be transient.
Consequently, with probability 1, each state i>0 is visited only a finite number of times. So, with probability 1, the chain must get absorbed at 0 or approach ∞.
We can obtain an equation for ρ by conditioning on what happens at the first step of the chain:
ρ = ∑_k f(k)ρ^k = ψ(ρ), where ψ(z) = ∑_k f(k)z^k. The derivatives ψ′(z) = ∑_k k f(k)z^{k−1} and ψ′′(z) = ∑_k k(k−1)f(k)z^{k−2} are sums of nonnegative terms for z∈(0,1). Since these are positive, the function ψ is strictly increasing and convex on (0,1). Also note that ψ(0)=f(0), ψ(1)=1. Finally, ψ′(1)=∑_k k f(k)=μ, where μ=E(X), the expected number of sons for each male.
Since ψ(1)=1, there is always a trivial solution at ρ=1. When μ≤1, this trivial solution is the only solution, so ρ=1.
When μ>1, we define r to be the smaller solution of ψ(r)=r. Since ψ(ρ)=ρ, we know that ρ must be either r or 1. We want to show that ρ=r.
Defining pt=P1{Gt=0}, observe that as t→∞,
pt ↑ P1[⋃_{n=1}^∞ {Gn=0}] = ρ.
Therefore, to rule out ρ=1, it is sufficient to prove that pt≤r for all t. We show this by induction; the base case is p0=0≤r. For the induction step, conditioning on the first generation gives P1{Gt+1=0}=ψ(P1{Gt=0}), i.e., pt+1=ψ(pt). Since ψ is increasing and by the induction hypothesis pt≤r, we have that
pt+1=ψ(pt)≤ψ(r)=r.
So pt is bounded by r for all nonnegative t, which proves ρ=r. Hence, since r<1, there is a positive probability that the family name goes on forever.
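Since pt+1=ψ(pt) with p0=0, the extinction probability can be computed numerically by iterating ψ. A minimal sketch (mine, with a made-up offspring distribution f):

```python
import numpy as np

def extinction_probability(f, tol=1e-12, max_iter=10_000):
    """Iterate p_{t+1} = psi(p_t) from p_0 = 0, where psi(z) = sum_k f(k) z^k.

    f is the offspring pmf as an array (f[k] = P{X = k}); the iterates increase to rho,
    the smallest fixed point of psi in [0, 1]."""
    ks = np.arange(len(f))
    psi = lambda z: np.sum(f * z ** ks)
    p = 0.0
    for _ in range(max_iter):
        p_new = psi(p)
        if abs(p_new - p) < tol:
            break
        p = p_new
    return p

# Hypothetical offspring pmf with mean mu = 0*0.25 + 1*0.25 + 2*0.5 = 1.25 > 1
f = np.array([0.25, 0.25, 0.5])
print(extinction_probability(f))   # 0.5: the fixed points of psi(z) = z are z = 0.5 and z = 1
```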
2.2 Time Reversibility
2.3 More on Time Reversibility: A Tandem Queue Model
2.4 The Metropolis Method
2.5 Simulated Annealing
2.6 Ergodicity Concepts
In this section we focus on a time-inhomogeneous Markov chain {Xn} on a countably infinite state space S.
Let Pn denote the probability transition matrix governing the transition from Xn to Xn+1, i.e.,
Pn(i,j)=P{Xn+1=j∣Xn=i}.
For m<n, define P^{(m,n)} = ∏_{k=m}^{n−1} Pk, so that
P^{(m,n)}(i,j) = P{Xn=j∣Xm=i}.
DEF:{Xn} is strongly ergodic if there exists a probability distribution π∗ on S such that
lim_{n→∞} sup_{i∈S} ∣∣P^{(m,n)}(i,⋅) − π∗∣∣ = 0 for all m.
DEF: {Xn} is weakly ergodic if
lim_{n→∞} sup_{i,j∈S} ∣∣P^{(m,n)}(i,⋅) − P^{(m,n)}(j,⋅)∣∣ = 0 for all m.
We can understand weak ergodicity somewhat as a “loss of memory” concept. It says that at a large enough time n, the chain has nearly forgotten its state at time m, in the sense that the distribution at time n would be nearly the same no matter what the state was at time m. However, there is no requirement that the distribution be converging to anything as n→∞. The concept that incorporates convergence in addition to loss of memory is strong ergodicity.
2.6.1 The Ergodic Coefficient
For a probability transition matrix P=(P(i,j)), the ergodic coefficient δ(P) of P is defined to be the maximum total variation distance between pairs of rows of P, that is,
DEF: The ergodic coefficient δ(P) of a probability transition matrix P is
δ(P) = sup_{i,j∈S} ∣∣P(i,⋅) − P(j,⋅)∣∣.
The basic idea is that δ(P) being small is “good” for ergodicity. For example, in the extreme case of δ(P)=0, all the rows of P are identical, so P would cause a Markov chain to lose its memory completely in just one step: v1=v0P does not depend on v0.
LEMMA: δ(PQ) ≤ δ(P)δ(Q) for probability transition matrices P,Q.
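A small numerical sketch (not from the book) of the ergodic coefficient, computed directly as the maximum total variation distance between rows, together with a spot check of the lemma; the matrices P and Q are made up.

```python
import numpy as np

def tv(p, q):
    """Total variation distance between two pmfs (half the L1 distance)."""
    return 0.5 * np.abs(p - q).sum()

def ergodic_coefficient(P):
    """delta(P) = max over pairs of rows i, j of ||P(i,.) - P(j,.)||."""
    n = P.shape[0]
    return max(tv(P[i], P[j]) for i in range(n) for j in range(n))

P = np.array([[0.9, 0.1], [0.5, 0.5]])
Q = np.array([[0.2, 0.8], [0.6, 0.4]])
# Spot check of the lemma: delta(PQ) <= delta(P) * delta(Q)
print(ergodic_coefficient(P @ Q), ergodic_coefficient(P) * ergodic_coefficient(Q))
```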
2.6.2 Sufficient Conditions for Weak and Strong Ergodicity
Sufficient conditions are given in the next two propositions:
PROP: If there exist n0<n1<n2<... such that ∑_k [1−δ(P^{(nk,nk+1)})]=∞, then {Xn} is weakly ergodic.
PROP: If {Xn} is weakly ergodic and if there exist π0,π1,... such that πn is a stationary distribution for Pn for all n and ∑n∣∣πn−πn+1∣∣<∞, then {Xn} is strongly ergodic. In that case, the distribution π∗ in the definition is given by π∗=limn→∞πn.
2.7 Proof of Main Theorem of Simulated Annealing
See book…
2.8 Card Shuffling
We have seen that for an irreducible, aperiodic Markov chain {Xn} having stationary distribution π, the distribution πn of {Xn} converges to π in the total variation distance. An example of using this would be generating a nearly uniformly distributed 4×4 table with given row and column sums by simulating a certain Markov chain for a long enough time. The question is: how long is long enough?
In certain simple Markov chain examples, it is easy to figure out the rate of convergence of πn to π. In this section we will concentrate on a simple shuffling example considered by Aldous and Diaconis in their article “Shuffling cards and stopping times.” The basic question is: How close is the deck to being “random” (uniformly distributed over the 52! possible permutations) after n shuffles? For the riffle shuffle model, the answer is “about 7.”
2.8.1 “Top in at random” Shuffle
The “top in at random” method consists of taking the top card off the deck and then inserting it back into the deck at a random position (this includes back on top). So, altogether, there are 52 equally likely positions. Repeated performance of this shuffle on a deck of cards produces a sequence of “states” of the deck. This sequence of states forms a Markov chain with state space S52, the group of permutations of the cards. This Markov chain is irreducible, aperiodic, and has stationary distribution π = Uniform on S52 (i.e., probability 1/52! for each permutation). Therefore, by the Basic Limit Theorem, we may conclude that ∣∣πn−π∣∣→0 as n→∞.
2.8.2 Threshold Phenomenon
Suppose we are working with a fresh deck of d cards in the original order (card 1 on top, card 2 under it, etc.). Then ∣∣π0−π∣∣=1−(1/d!). We also know that ∣∣πn−π∣∣→0 as n→∞ from the Basic Limit Theorem. It is natural to assume that the distance from stationarity decreases to 0 in a smooth manner; however, it actually exhibits what we call the “threshold phenomenon”: an abrupt change happens in a relatively small neighborhood of the value n = d log d. That is, for large d, the graph of ∣∣πn−π∣∣ versus n stays near 1 and then drops sharply near n = d log d (picture omitted).
The larger the deck size d, the sharper (relative to d log d) the drop is near n = d log d.
2.8.3 A random time to exact stationarity
Let’s give a name to each card in the deck (i.e., say the 2 of hearts is card 1, the 3 of hearts is card 2, etc.). Suppose we start with the deck in pristine order (card 1 on top, then card 2, etc.). Though πn will never become exactly random, it is possible to find a random time T at which the deck becomes exactly uniformly distributed, that is, XT∼Unif(S52).
Here is an example of such a random time. Suppose that “card i” always refers to the same card (say, card 52 is the ace of spades), whereas “top card,” “card in position 2,” etc. refer to whatever card occupies that position at the time of consideration. Also note that we may describe a sequence of shuffles simply by a sequence of iid random variables U1,U2,... uniformly distributed on {1,2,...,52}: just say that the i-th shuffle moves the top card to position Ui. Define the following random times:
T1 = inf{n : Un = 52} = 1st time a top card goes below card 52
T2 = inf{n > T1 : Un ≥ 51} = 2nd time a top card goes below card 52
T3 = inf{n > T2 : Un ≥ 50} = 3rd time a top card goes below card 52
⋮
T51 = inf{n > T50 : Un ≥ 2} = 51st time a top card goes below card 52
and
T=T52=T51+1.
It is not hard to see that T has the desired property and that XT is uniformly distributed. To understand this, start with T1. At time T1, we know that some card is below card 52; we don’t know which card, but that will not matter. After time T1 we continue to shuffle until T2, at which time another card goes below card 52. At time T2, there are 2 cards below card 52. Again, we do not know which cards they are, but conditional on which 2 cards are below card 52, each of the two possible orderings of those 2 cards is equally likely. Similarly, we continue to shuffle until time T3, at which time there are some 3 cards below card 52, and, whatever those 3 cards are, each of their (3!) possible relative positions in the deck is equally likely. And so on. At time T51, card 52 has risen all the way up to become the top card, and the other 51 cards are below card 52 (now we do know which cards they are), and those 51 cards are in random positions (i.e. uniform over 51! possibilities). Now all we have to do is shuffle one more time to get card 52 in random position, so that at time T=T52=T51+1, the whole deck is random.
Let us find ET. By the definitions above, T1∼Geom(1/52),(T2−T1)∼Geom(2/52),...,(T51−T50)∼Geom(51/52),(T52−T51)∼Geom(52/52). So
ET = E(T1) + E(T2−T1) + ... + E(T52−T51) = 52(1 + 1/2 + ... + 1/52) ≈ 52 log 52.
Analogously, if the deck had d cards rather than 52, we would have obtained ET ∼ d log d (for large d), where T is now a random time at which the whole deck of d cards becomes uniformly distributed on Sd.
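A simulation sketch (not from the book) of the random time T defined above, driven by the iid insert positions Un; the sample mean should be close to 52(1 + 1/2 + ... + 1/52) ≈ 236, which is of order d log d.

```python
import numpy as np

def sample_T(d, rng):
    """Simulate the random time T = T_{d-1} + 1 described above for a d-card deck,
    driven by iid uniform insert positions on {1, ..., d}."""
    t = 0
    threshold = d           # first wait for U_n = d (top card inserted below card d)
    successes = 0
    while successes < d - 1:
        t += 1
        if rng.integers(1, d + 1) >= threshold:
            successes += 1
            threshold -= 1  # next wait for U_n >= d-1, then >= d-2, ..., finally >= 2
    return t + 1            # one more shuffle puts card d itself into a random position

rng = np.random.default_rng(2)
d = 52
samples = [sample_T(d, rng) for _ in range(2000)]
print(np.mean(samples))     # about 52*(1 + 1/2 + ... + 1/52), roughly 236
print(d * np.log(d))        # roughly 205, the same order of magnitude
```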
2.8.4 Strong Stationary Times
As we have observed, the random variable T that we just constructed has the property that XT∼π. T also has two other important properties. First, XT is independent of T. Second, T is a stopping time, that is, for each n, one can determine whether or not T=n just by looking at the values of X0,...,Xn. In particular, to determine whether or not T=n it is not necessary to know any future values Xn+1,Xn+2,....
DEF: A random variable T satisfying
T is a stopping time,
XT is distributed as π, and
XT is independent of T
is called a strong stationary time.
What’s so good about strong stationary times?
LEMMA: If T is a strong stationary time for the Markov chain {Xn}, then ∣∣πn−π∣∣≤P{T>n} for all n.
This tells us that strong stationary times satisfy the same inequality derived for coupling times.
To see proof of threshold phenomenon in shuffling, refer to book.
3 – MRFs and HMMs
This section looks at aspects of Markov random fields (MRF’s), hidden Markov models (HMM’s), and their applications.
3.1 MRFs on Graphs and HMMs
A stochastic process is a collection of random variables {Xt:t∈T} indexed by some subset T of the real line R. The elements of T are often interpreted as times, in which case Xt represents the state at time t of the random process under consideration. The term random field refers to a generalization of the notion of a stochastic process: a random field {Xs:s∈G} is still a collection of random variables, but now the index set G need not be a subset of R. For example, G could be a subset of the plane R2. In this section, we’ll consider G as the set of nodes of a graph (the set being at most countable). Important aspects of the dependence among the random variables will be determined by the edges of the graph through a generalization of the Markov property.
NOTATION: Given a graph G, we say two nodes s,t are neighbors, denoted s∼t, if they are joined by an edge of the graph. We do not consider a node to be a neighbor of itself. N(t) the set of neighbors of t.
DEF: Suppose we are given a graph G with nodes {1,...,n} and a neighborhood structure N(t). The collection of random variables (X1,...,Xn) is a Markov random field on G if
P{Xt=xt∣Xs=xs for s≠t} = P{Xt=xt∣Xs=xs for s∈N(t)}
for all nodes t∈{1,...,n}.
More compact notation: for a subset of nodes A⊂G, let xA be the vector (xs:s∈A). We will also write p(xt∣xN(t)) for P{Xt=xt∣Xs=xs for s∈N(t)}. HMM’s are Markov random fields in which some of the random variables are observable and others are not. We adopt X for hidden random variables and Y for observed random variables.
3.2 Bayesian Framework
What do we get out of these models & how can we use them? One approach is Bayesian: HMM’s fit nicely in the Bayesian framework. X is the object of interest; it is unknown. For example, in modeling a noisy image, X could be the true image. We consider the unknown X to be random, and we assume it has a certain prior distribution. This distribution, our probabilistic model for X, is assumed to be a Markov random field. We also postulate a certain probabilistic model for Y conditional on X. This conditional distribution of Y given X reflects our ideas about the noise or blurring or whatever it is that transforms the hidden true image X into the image Y that we observe.
Given our assumed prior distribution of X and conditional distribution of (Y∣X), Bayes’ formula gives the posterior distribution of (X∣Y). Thus, given an observed value Y=y, in principle we get a posterior distribution P{X=⋅∣Y=y} over all possible true images, so that (again, in principle) we could make a variety of reasonable choices of our estimator of X. For example, we could choose the x that maximizes P{X=x∣Y=y}. This is called MAP estimation, where MAP stands for “maximum a posteriori”.
3.3 Hammersley - Clifford Theorem
How do we specify a Markov random field? By analogy with Markov chains, we might try to specify it in terms of conditional distributions. The following example suggests why this approach can go wrong.
EXAMPLE: Suppose we are designing a Markov random field for images on the 3x3 lattice:
For each pixel, let us specify the conditional distribution of its color given the colors of its neighbors. Suppose there are two colors, 0 and 1. It is possible to specify conditional distributions for each pixel that don’t work together, i.e., there might be no joint distribution having the given conditional distributions.
In general, we can’t expect to specify a full set of conditional distributions as above. Fortunately, the Hammersley-Clifford Theorem says that a random field’s having the Markov property is equivalent to its having a Gibbs distribution, which is a friendly sort of distribution. Thus, instead of worrying about specifying our MRF’s in terms of consistent conditional distributions, we can just consider Gibbs distributions, which are simple to write down and work with.
Some definitions needed to state HC:
DEF: A set of nodes C is complete if all distinct nodes in C are neighbors of each other. A clique is a maximal complete set of nodes.
DEF: Let G be a finite graph. A Gibbs distribution with respect to G is a probability mass function that can be expressed in the form
p(x) = ∏_{C complete} VC(x),
where each VC is a function that depends only on the values xC=(xs:s∈C) of x at the nodes in C. That is, the function VC satisfies VC(x)=VC(y) whenever xC=yC.
By combining functions VC for sets C that are subsets of the same clique, we see that we can further reduce the product in the definition of a Gibbs distribution to
p(x) = ∏_{C a clique} VC(x).
THM (HAMMERSLEY-CLIFFORD): Suppose that X=(X1,...,Xn) has a positive joint probability mass function. X is a Markov random field on G iff X has a Gibbs distribution with respect to G.
EXAMPLE: A Markov chain X0,...,Xn has joint distribution of the form p(x0,x1,...,xn)=π0(x0)P1(x0,x1)P2(x1,x2)...Pn(xn−1,xn).
By defining V{0,1}(x0,x1)=π0(x0)P1(x0,x1) and V{k−1,k}(xk−1,xk)=Pk(xk−1,xk) for k=2,...,n, we see that this product is a Gibbs distribution on the graph
(x0)−−(x1)−−(x2)−−...−−(xn)
PROP: Suppose (X,Y) is a Markov random field on the graph G with the neighborhood structure N. Write G=GX∪GY, where GX and GY are the sets of nodes in G corresponding to the X and Y random variables, respectively. Then the marginal distribution of Y is a Markov random field on GY, where two nodes y1,y2∈GY are neighbors if either
They were neighbors in the original graph, or
There are nodes x1,...,xk∈GX such that y1∼x1∼x2∼...∼xk∼y2.
The conditional distribution of X given Y is a Markov random field on the graph GX, where nodes x1 and x2 are neighbors if x1∼x2, that is, if x1,x2 were neighbors in the original graph.
3.4 Long range dependence in the Ising model
For this section, we will work in Zd. For each t∈Zd there is a binary random variable Xt taking values in {−1,1}. The Ising model gives a joint probability distribution for these random variables.
We consider a special case of the Ising model as follows.
For x a configuration of +1’s and −1’s at the nodes of a finite subset of Zd, let b(x) denote the number of “odd bonds” in x, that is, the number of edges {t,u} such that xt≠xu. Then, under the Ising model, a configuration x has probability proportional to α^{b(x)}, where α is a positive parameter of the distribution (typically <1). The choice α=1 corresponds to the uniform distribution, giving equal probability to all configurations. Distributions with small α strongly discourage odd bonds, placing large probability on configurations with few odd bonds.
For the case d=1, the model corresponds to a stationary Markov chain with probability transition matrix
Pα = [ 1/(1+α)  α/(1+α) ; α/(1+α)  1/(1+α) ].
Basic Limit Theorem tells us
Pα^n → [ 1/2  1/2 ; 1/2  1/2 ] as n→∞.
Thus, the state X0 is asymptotically independent of information at nodes far away from 0.
We will show that the situation in d≥2 is qualitatively different, in that the effect of states at remote nodes does not disappear in 2 dimensions.
To state the result, some notation:
Imagine a “cube” Kn of side length 2n centered at 0 in Zd, consisting of lattice points whose d coordinates all lie between −n and n:
Kn={t∈Zd:∣ti∣≤n,∀i=1,...,d}.
Let Bn denote the boundary points of the cube Kn, that is, the points having at least one coordinate equal to n.
Let P+(n){X=x} denote the Ising probability of X=x, conditional on Xt=+1 for all t∈Bn. Similarly, let P−(n){X=x} denote probabilities conditional on −1’s on the boundary.
THM: For the Ising model on Z2, the effect of the boundary does not disappear. In particular, there exists α such that P+(n){X0=−1} remains below 0.4 for all n, no matter how large.
3.5 Hidden Markov Chains
3.5.1 Model overview
The hidden Markov chain model is an MRF in which some of the random variables are observed and others are not (hence hidden). In the graph structure for the hidden Markov chain, the hidden chain is X0,X1,...,Xn and the observed process is Y0,Y1,...,Yn.
Edges join Xt to both Yt and Xt+1. The model is parametrized by a marginal distribution ζ of X0 and, if we assume time-homogeneity of the transition matrices, by two transition matrices A and B, where A(i,j)=P{Xt+1=j∣Xt=i} and B(i,j)=P{Yt=j∣Xt=i}.
Let us write θ=(ζ,A,B) for the vector of all parameters. If there are u states possible for each of the hidden X random variables and v outcomes possible for the observed Y random variables, then ζ is a vector of u probabilities, A is a u×u probability transition matrix, and B is a u×v matrix, each of whose rows is a probability mass function.
The hidden Markov chain is pretty general. Examples of cases include
If u=1, so that there is only one hidden state possible for each X, then A is just the 1×1 transition matrix (1), B is 1×v, and the Y process is simply an iid sequence, where each Yt has the probability mass function given by the single row of B. So iid sequences are a special case of the hidden Markov chain model
If u=v and B is the identity matrix, then Yt=Xt for all t, so Markov chains are a special case of the hidden Markov chain
3.5.2 How to calculate likelihoods
The likelihood function L=L(θ) is the probability of the observed data, as a function of the parameters of the model. The tricky aspect is that we observe only the Y’s, so that
L(θ) = Pθ{Y0=y0,...,Yn=yn} = ∑_{x0,...,xn} pθ(x0,...,xn,y0,...,yn).
This is a large sum. Even if the size of the state space of the hidden variables is just u=2, the sum has 2^{n+1} terms. Without a way around this computational issue, the hidden Markov chain model would be of little practical use. However, it turns out that we can do these calculations in time linear in n.
We will denote the state space for the hidden variables Xt by X.
We are thinking of the observed Y values as fixed here–we know them, and will denote them by y0,...,yn. For each t=0,1,..,n and for each xt∈X, define
αt(xt)=pθ(y0,..,yt,xt).
We can calculate the function α0 immediately:
α0(x0)=pθ(y0,x0)=ζ(x0)B(x0,y0).
Note the simple recursion that expresses αt+1 in terms of αt:
αt+1(xt+1) = ∑_{xt∈X} αt(xt) A(xt,xt+1) B(xt+1,yt+1).
Compared with the huge sum above, the sum here, over the u elements of X, is fairly modest. Using the recursion to calculate the function αt+1 from αt involves a fixed amount of work, and the task gets no harder as t increases. Thus, the amount of work to calculate all of the probabilities αt(xt) for t=0,...,n and xt∈X is linear in n.
Having completed the recursion to calculate the function αn, the likelihood is simply
L(θ) = pθ(y0,...,yn) = ∑_{xn∈X} αn(xn).
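A minimal implementation sketch of this forward recursion (my own code; the two-state, two-symbol model is made up):

```python
import numpy as np

def forward_likelihood(zeta, A, B, y):
    """Forward recursion for a hidden Markov chain.

    zeta: length-u initial distribution of X_0
    A:    u x u transition matrix of the hidden chain
    B:    u x v emission matrix, B[i, k] = P{Y_t = k | X_t = i}
    y:    observed sequence y_0, ..., y_n
    Returns (alpha, L) where alpha[t, i] = p_theta(y_0,...,y_t, X_t = i) and L is the likelihood."""
    n, u = len(y), len(zeta)
    alpha = np.zeros((n, u))
    alpha[0] = zeta * B[:, y[0]]                    # alpha_0(x_0) = zeta(x_0) B(x_0, y_0)
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]  # the recursion above
    return alpha, alpha[-1].sum()                   # L(theta) = sum over x_n of alpha_n(x_n)

zeta = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
y = [0, 0, 1, 0, 1, 1]
alpha, L = forward_likelihood(zeta, A, B, y)
print(L)
```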
Without an organized method for changing θ=(ζ,A,B), it is hard to find the θ that maximizes the likelihood. The EM algorithm is a method for finding maximum likelihood estimates that is applicable to many statistical problems, including hidden Markov chains.
PROP: For p=(p1,...,pk) and q=(q1,...,qk) two probability mass functions on {1,...,k}, we have
∑_i pi log pi ≥ ∑_i pi log qi.
A brief description of the EM algorithm. We’ll assume X,Y are discrete random variables or vectors, so probabilities are given by sums rather than integrals.
We want to find the θ maximizing logL(θ) where L(θ)=pθ(y)=∑xpθ(x,y). The EM algorithm repeats the following update that is guaranteed to increase the likelihood at each iteration.
Let θ0 denote the current value of θ. Replace θ0 by θ1, the value of θ that maximizes
Eθ0[logpθ(X,y)∣Y=y].
Why does it work? For a given θ0, define g(θ)=Eθ0[logpθ(X,y)∣Y=y]. We will see that, in order to have pθ1(y)>pθ0(y), we do not need to find θ1 maximizing g, but rather it is enough to find a θ1 such that g(θ1)>g(θ0).
PROP: If Eθ0[logpθ1(X,y)∣Y=y]>Eθ0[logpθ0(X,y)∣Y=y], then pθ1(y)>pθ0(y).
3.5.4 Applying the EM algorithm to a hidden Markov chain
Consider a hidden Markov chain for (X,Y)=(X0,...,Xn,Y0,...,Yn), where the r.v.s Yt are observed and the r.v.s Xt are hidden. The model is parametrized by θ=(ζ,A,B) for ζ the distribution of X0, A the prob transition matrix for X chain and B the prob transition matrix from Xt to Yt.
To describe one iteration of the EM method, imagine our current guess for θ is θ0=(ζ0,A0,B0). We want a new guess θ1=(ζ1,A1,B1) that has higher likelihood.
See book for more… too much to copy.
In summary, to do EM on hidden Markov chain model:
Start with some choice of parameter values θ0=(ζ0,A0,B0).
Calculate forward and backward probabilities αt(i),βt(i) for t=0,1,...n and i∈X (all with θ taken as θ0).
Calculate the quantities γt(i,j)=Pθ0{Xt=i,Xt+1=j∣y} for t∈{0,...,n−1},i∈X,j∈X. These could all be stored in a u×u×n array.
Terms like “Markov chain Monte Carlo” and “Markov sampling” refer to methods for generating random samples from given distributions by running Markov chains. Although such methods have quite a long history, they have become the subject of renewed interest in the last decade, particularly with the introduction of the “Gibbs sampler” by Geman and Geman (1984), who used the method in a Bayesian approach to image reconstruction. The Gibbs sampler itself has enjoyed a recent surge of intense interest within the statistics community, spurred by Gelfand and Smith (1990), who applied the Gibbs sampler to a wide variety of inference problems.
Recall that a distribution π being “stationary” for a Markov chain X0,X1,... means that, if X0∼π, then Xn∼π for all n. The basic phenomenon underlying all Markov sampling methods is the convergence in distribution of a Markov chain to its stationary distribution: If a Markov chain X0,X1,... has stationary distribution π, then under the conditions of the Basic Limit Theorem, the distribution of Xn for large n is close to π. Thus, in order to generate an observation from a desired distribution π, we find a Markov chain X0,X1,... that has π as its stationary distribution. The Basic Limit Theorem then suggests that running or simulating the chain until a large time n will produce a random variable Xn whose distribution is close to the desired π. By taking n large enough, in principle we obtain a value that may for practical purposes be considered a random draw from the distribution π.
The Gibbs sampler is a way of constructing a Markov chain having a desired stationary distribution. A simple setting that illustrates the idea involves a probability mass function π of the form π(x,y). Suppose we want to generate a random vector (X,Y)∼π. Denote the conditional probability distributions by π(⋅∣X=⋅) and π(⋅∣Y=⋅). To perform a Gibbs sampler, start with any initial point (X0,Y0). Then generate X1 from the conditional distribution π(⋅∣Y=Y0), and generate Y1 from the conditional distribution π(⋅∣X=X1). Continue on in this way, generating X2 from the conditional distribution π(⋅∣Y=Y1) and Y2 from the conditional distribution π(⋅∣X=X2), and so on. Then the distribution π is stationary for the Markov chain {(Xn,Yn):n=0,1,…}.
To see this, suppose (X0,Y0)∼π. In particular, Y0 is distributed according to the Y-marginal of π, so that, since X1 is drawn from the conditional distribution of X given Y=Y0, we have (X1,Y0)∼π. Now we use the same reasoning again: Y1 is distributed according to the Y-marginal of π, so that (X1,Y1)∼π. Thus, the Gibbs sampler Markov chain {(Xn,Yn):n≥0} has the property that if (X0,Y0)∼π then (X1,Y1)∼π—that is, the distribution π is stationary.
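A minimal sketch (not from the book) of this two-variable Gibbs sampler for a made-up joint pmf π on a 3×3 grid of values; the empirical frequencies of the sampled pairs should approach π.

```python
import numpy as np

rng = np.random.default_rng(3)
pi = rng.random((3, 3))        # hypothetical joint pmf pi(x, y) on {0,1,2} x {0,1,2}
pi /= pi.sum()

def gibbs(pi, n_steps, rng):
    """Gibbs sampler for a bivariate pmf pi: alternate draws from pi(x | y) and pi(y | x)."""
    x, y = 0, 0
    samples = []
    for _ in range(n_steps):
        px_given_y = pi[:, y] / pi[:, y].sum()   # conditional of X given Y = y
        x = rng.choice(len(px_given_y), p=px_given_y)
        py_given_x = pi[x, :] / pi[x, :].sum()   # conditional of Y given X = x
        y = rng.choice(len(py_given_x), p=py_given_x)
        samples.append((x, y))
    return samples

counts = np.zeros_like(pi)
for x, y in gibbs(pi, 50_000, rng):
    counts[x, y] += 1
print(np.round(counts / counts.sum(), 3))   # empirical frequencies...
print(np.round(pi, 3))                      # ...approach the target pi
```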
Simulating a Markov chain is technically and conceptually simple. We just generate the random variables in the chain, in order, and we are done. However, the index set of a Markov random field has no natural ordering in general. This is what causes iterative methods such as the Gibbs sampler to be necessary.
The Gibbs sampler is well suited to Markov random fields, since it works by repeatedly sampling from the conditional distribution at one node given the values at the remaining nodes, and the Markov property is precisely the statement that these conditional distributions are simple, depending only on the neighbors of the node.
4 – Martingales
4.2 Definitions
Notation: Given a process W={Wk}, let Wm,n denote the portion Wm,Wm+1,...,Wn of the process from time m up to time n.
DEF: A process M0,M1,..., is a martingale if
E[Mn+1∣M0,n]=Mn for each n≥0.
Alternatively,
A process M0,M1,..., is a martingale w.r.t another process W0,W1,... if
E[Mn+1∣W0,n]=Mn for each n≥0.
The crux of the definition E[Mn+1∣W0,n]=Mn is a “fair game” sort of requirement. If we are playing a fair game, we expect neither to win nor to lose money on the average. Given the history of our fortunes up to time n, our expected fortune Mn+1 at the future time n+1 should just be the fortune Mn that we have at time n.
A minor technical condition: we also require E∣Mn∣<∞ for all n, so that the conditional expectations in the definition are guaranteed to be well-defined.
How about submartingales and supermartingales? These are processes that are “better than fair” and “worse than fair,” respectively.
DEF: A process X0,X1,... is a submartingale w.r.t. a process W0,W1,..., if E[Xn+1∣W0,n]≥Xn for each n≥0.
{Xn} is a supermartingale w.r.t. {Wn} if E[Xn+1∣W0,n]≤Xn for each n≥0.
4.3 Examples
See book
4.4 Optional Sampling
Optional sampling property is a “conservation of fairness” type property.
By the “fair game” property, E{Mn+1}=E{E[Mn+1∣W0,n]}=E{Mn} for all n. This implies that
EMn=EM0 for all times n≥0.
That is, one can stop at any predetermined time t, like t=8, and their winnings will be fair: EM8=EM0.
If one stops at a time that is not predetermined, but random (depending on the observed sample path of the game), fairness is still conserved in many cases, but not in all cases. The issue of optional sampling is this:
If T is a random time, that is, T is a nonnegative random variable, does the equality EMT=EM0 still hold?
There are two sorts of things that shouldn’t be allowed if we want fairness to be conserved:
shouldn’t be able to take an indefinitely long time to stop. For example, consider a simple symmetric random walk with equal probabilities of going +1 or −1. Stopping at time T1=inf{n: the random walk is at 2} is clearly not fair.
shouldn’t be able to “take back” moves, i.e., shouldn’t be able to have information up to time t, then go back to some time s<t and claim to have stopped at s.
In fact, ruling out these two sorts of behaviors leaves a class of random times T at which the optional sampling statement EMT=EM0 holds. Disallowing arbitrarily long times is done by assuming T bounded. Random times that disallow the gambler from peeking ahead into the future are called stopping times.
DEF: A random variable T taking values in the set {0,1,2,...,∞} is a stopping time w.r.t. the process W0,W1,... if for each integer k, the indicator random variable I{T=k} is a function of W0,k.
The main optional sampling result:
THM: Let M0,M1,... be a martingale w.r.t. W0,W1,..., and let T be a bounded stopping time. Then EMT=EM0.
Now we generalize by
applying to general supermartingales rather than martingales
replacing the times 0 and T with two stopping times S,T s.t. S≤T.
THM: Let X0,X1,... be a supermartingale with respect to W0,W1,... and let S and T be bounded stopping times with S≤T. Then EXT≤EXS.
4.5 Stochastic integrals and option pricing in discrete time
oh my great goodness
4.6 Martingale convergence
PROP: Let X0,X1,... be a nonnegative supermartingale, and let X0≤a, where a is a nonnegative number. Then for b>a, defining Tb=inf{t:Xt≥b}, we have P{Tb<∞}≤a/b.
Intuition: suppose we are playing a supermartingale with initial fortune X0≤a. Suppose we adopt the strategy of stopping once our fortune exceeds b. Our expected reward is at least b⋅P{Tb<∞}. But we expect to lose money on a supermartingale, so this expected reward should be no larger than our initial fortune. Then b⋅P{Tb<∞}≤a.
THM: A nonnegative supermartingale converges with probability 1.
4.7 Stochastic approximation
We observe the random variables
Yn=f(Xn)+ηn,
where the random variables X1,X2,... are subject to our choice and η1,η2,... are assumed to be iid “noise” random variables with mean 0 and variance 1. The function f is unknown, and we do not know or observe the noise r.v.s, only the Y’s.
Robbins-Monro iteration: a simple way of choosing Xn+1 given the previously observed Y values.
Some assumptions:
f has a unique but unknown zero x∗
f(x)<0 for x<x∗
f(x)>0 for x>x∗
We want to find x∗. Can we have a method of choosing values Xt such that Xt→x∗ as t→∞?
Suppose we are given a suitable sequence of positive numbers a0,a1,.... Then, given X0, we define X1,X2,... according to the recursion
Xn+1=Xn−anYn
This iteration is qualitatively reasonable. If Yn is positive, then based on our assumptions about f, we would guess that Xn>x∗. So we want Xn+1 to be smaller than Xn, which is what the recursion does (if Yn>0 then Xn+1<Xn), and vice versa.
How should we choose a0,a1,...? We want Xn to converge to something (x∗) as n→∞, so we need an→0 as n→∞.
THM: Let f:R→R and let E[(X0)2]<∞. Consider the sequence {Xn} generated by the recursion
Yn = f(Xn) + ηn
Xn+1 = Xn − anYn,
where we assume the following conditions:
- the r.v.s X0,η0,η1,... are independent, with η0,η1,... iid having mean 0 and variance 1.
- for some 1<c<∞, we have ∣f(x)∣≤c∣x∣ for all x (this incorporates the assumption f(0)=0).
- for all δ>0, inf_{∣x∣>δ} (x f(x)) > 0
- each an is a nonnegative number and ∑an=∞
- ∑an2<∞
See proof in paper
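A small simulation sketch (mine) of the Robbins-Monro iteration with an=1/(n+1) and a hypothetical f whose zero is at x∗=2 (a shifted version of the theorem's setting, where the zero is at 0):

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    """Hypothetical function with a unique zero at x* = 2: negative below it, positive above it."""
    return x - 2.0

x = 0.0                                  # X_0
for n in range(20_000):
    a_n = 1.0 / (n + 1)                  # a_n >= 0, sum a_n = infinity, sum a_n^2 < infinity
    y_n = f(x) + rng.normal()            # Y_n = f(X_n) + eta_n, noise with mean 0 and variance 1
    x = x - a_n * y_n                    # Robbins-Monro update
print(x)                                 # close to x* = 2
```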
5 – Brownian Motion
Previously: Markov processes in discrete time and discrete state space.
Markov processes with continuous sample paths are called diffusions. Brownian motion is the simplest diffusion.
5.1 Definitions
DEF: A standard Brownian motion (SBM) {W(t):t≥0} is a stochastic process having
1. continuous paths,
2. stationary, independent increments, and
3. W(t) ∼ N(0,t) for all t≥0.
To see more about Brownian motion, consider the intro primer, constructions, or (potentially) more on site.
6 – Diffusions and Stochastic Calculus
Previously discussed: Markov processes in discrete time with discrete state space. Also, Brownian motion as a special Markov process in continuous time with continuous sample paths.
Now: a more general continuous-time Markov process with continuous sample paths, called diffusions.
Essentially, there is an entire family of Brownian motions built off of the SBM (standard BM). In some sense, they are all the same, just adding a deterministic linear function (drift) or multiplying by a constant (scaling). This is analogous to how all normal random variables are obtained from a standard normal random variable by adding and multiplying by constants.
Therefore, we can’t really expect a general random process to be well modeled by BM. Unless we think it follows a linear trend, we can’t fit some (μ,σ2) BM to it and expect it to do well.
Diffusions are built up from Brownian motions in the same way that general differentiable functions are built from linear functions. They are the stochastic analog of getting solutions from differential equations (e.g., what if I didn’t know anything about the exponential function, and I wanted to model a function with the behavior x′(t)=−x(t)?).
So a diffusion is always behaving according to a Brownian motion, just continuously adjusting its drift μ and variance σ2 parameters according to its current state. We specify a diffusion by giving a rule for determining what Brownian motion we should be following as a function of the current state. That is, we specify two functions μ(⋅) and σ2(⋅); if the current state is Xt, we should be running a (μ(Xt),σ2(Xt)) Brownian motion.
While we can’t simulate a Brownian motion perfectly, we can get an arbitrarily good approximate simulation by working with a fine grid on the time axis. With diffusions, instead of doing the ideal continuous adjustment, we’ll hold values constant until we get to the next point on the grid. If we take the grid sufficiently thin, we’ll have a simulation that is a good approximation to the real thing.
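A minimal sketch (not from the book) of this grid approximation: over each step of length dt we freeze μ and σ2 at the current state and take one Brownian increment. The example drift and variance functions (an Ornstein-Uhlenbeck-like choice) are arbitrary.

```python
import numpy as np

def simulate_diffusion(mu, sigma2, x0, T, n_steps, rng):
    """Approximate a diffusion on a time grid: over each short step of length dt,
    run a (mu(x), sigma^2(x)) Brownian motion with the parameters frozen at the current state."""
    dt = T / n_steps
    xs = np.empty(n_steps + 1)
    xs[0] = x0
    for k in range(n_steps):
        x = xs[k]
        xs[k + 1] = x + mu(x) * dt + np.sqrt(sigma2(x) * dt) * rng.normal()
    return xs

rng = np.random.default_rng(5)
path = simulate_diffusion(lambda x: -x, lambda x: 1.0, x0=2.0, T=10.0, n_steps=10_000, rng=rng)
print(path[-1])
```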
6.1 Specifying a diffusion
DEF: A stochastic process that has the strong Markov property and almost surely continuous sample paths is called a diffusion process.
As with Markov chains, to specify a diffusion, we need
state space
initial distribution
probability transition structure
The state space will be an interval I with endpoints l and r (possibly −∞ and +∞). The initial distribution is a probability distribution on I. What about the probability transition structure?
Intuitively, we would do something like this:
at time t, check to see our current state Xt
evaluate the functions μ and σ2 at Xt
run a (μ(Xt),σ2(Xt)) Brownian motion for a tiny amount of time
So in the interior of the state space, which would be (l,r), the probability transition structure of a time-homogeneous diffusion is specified by two functions μ=μ(x) and σ2=σ2(x), satisfying the relations
E(X(t+h)−X(t)∣X(t)=x)=μ(x)h+o(h)
and
Var(X(t+h)−X(t)∣X(t)=x)=σ2(x)h+o(h)
as h↓0. Given the first relation, the second may equivalently be expressed in terms of the second moment:
E((X(t+h)−X(t))2∣X(t)=x)=σ2(x)h+o(h).
We call μ(⋅) “infinitesimal drift” or “infinitesimal mean function,” while σ2(⋅) is often called “infinitesimal variance” or “diffusion function.” These two functions alone are enough to specify the entire probability transition structure.
Also, let’s assume moments of the increments X(t+h)−X(t) higher than the second moment are negligible when compared with the first and second moments:
E(∣X(t+h)−X(t)∣p∣X(t)=x)=o(h) as h→0 for all p>2.
The behavior at the boundary points must be specified separately. There are different types of boundary behaviors, including absorbing, reflecting, etc.
6.2 A calculation with diffusions
Consider a time-homogeneous diffusion on the state space I with infinitesimal parameters μ=μ(x) and σ2=σ2(x).
Choose two points a,b in the interior of I. Start the diffusion (X(t)) off at some point x∈(a,b) at time 0 and let it run until the random time T at which it first attains the value a or b. Let g(⋅) be a “cost function” giving the rate g(x) at which “cost” accrues per unit time when the diffusion X is in state x. The total cost accrued by the diffusion up to time T is then the random variable
∫0Tg(Xt)dt.
What is the expected cost,
Ex[∫0Tg(Xt)dt]?
We use Ex to denote expectation for the diffusion started at X0=x.
First consider the special case where g is constant, e.g. g(x)=1 always. Then ∫_0^T g(Xt)dt=T, and the expected cost is Ex(T). In general, write w(x)=Ex[∫_0^T g(Xt)dt]. We claim that the function w satisfies the differential equation
μ(x)w′(x) + (1/2)σ2(x)w′′(x) = −g(x)
with the boundary conditions w(a)=w(b)=0.
Let’s do a heuristic derivation to justify this equation. Begin with
w(x) = Ex[∫_0^h g(Xt)dt + ∫_h^T g(Xt)dt]
= h g(x) + Ex[∫_h^T g(Xt)dt] + o(h)
for some extremely tiny h such that we are virtually certain that T>h. Using the tower property of conditional expectation,
Ex[∫_h^T g(Xt)dt] = Ex[ E[∫_h^T g(Xt)dt ∣ Xh] ] = Ex[w(Xh)] + o(h).
The last equality comes from the Markov property. If Xh=xh and T>h, the expected value of ∫_h^T g(Xt)dt is the same as the expected cost of running the diffusion until time T starting at time 0 from state xh. This is w(xh).
So w(x) = h g(x) + Ex[w(Xh)] + o(h). A Taylor expansion of w about x, together with the infinitesimal mean and variance relations, gives Ex[w(Xh)] = w(x) + μ(x)w′(x)h + (1/2)σ2(x)w′′(x)h + o(h). Substituting and dividing by h, this implies that μ(x)w′(x) + (1/2)σ2(x)w′′(x) + g(x) = 0, as desired.
6.3 Infinitesimal parameters of a function of a diffusion
A common question is as follows.
Suppose {Xt} is a Brownian motion with μX(x)=μ and σX2(x)=σ2. Define the geometric Brownian motion Yt=e^{Xt}. What are the infinitesimal parameters μY(⋅) and σY2(⋅) of the Y process?
PROP: Suppose X is a (μX,σX2) diffusion. Let f be a strictly monotone function that is twice continuously differentiable, and define Yt=f(Xt). Then Y is a diffusion with infinitesimal parameters
μY(f(x)) = μX(x)f′(x) + (1/2)σX2(x)f′′(x)  and  σY2(f(x)) = σX2(x)[f′(x)]^2.
Therefore, for geometric Brownian motion, with μX(x)=μ, σX2(x)=σ2, and f(x)=e^x (so that f′(x)=f′′(x)=e^x=y), we have
μY(y) = μe^x + (1/2)σ2e^x = y(μ + (1/2)σ2)
and
σY2(y)=y2σ2.
6.4 Kolmogorov’s backward and forward equations
What is the probability transition structure of diffusions?
In a Markov chain, the transition rule is specified by a matrix P. In continuous time, there is no “next time” after t, i.e., we cannot imagine doing something like π(n+1)=π(n)P as in a discrete time Markov chain. Can we talk about the process a tiny infinitesimal time later?
The probability transition rule will give the rate of change, per unit time, of the probability density function of the state. This is done with a partial differential equation.
Let X be a diffusion with infinitesimal mean and variance functions μ(⋅),σ2(⋅).
For the “backward” equation, fix a state y and define f(t,x) to be the density of Xt evaluated at y, given X0=x:
f(t,x)=fXt(y∣X0=x).
Kolmogorov’s backward equation says
∂t f(t,x) = μ(x)∂x f(t,x) + (1/2)σ2(x)∂xx f(t,x).
For the “forward” equation, fix an initial probability density for X0 and define g(t,y) to be the density of Xt evaluated at y. The forward equation describes the evolution of this density over time:
∂t g(t,y) = −∂y(μ(y)g(t,y)) + ∂yy((1/2)σ2(y)g(t,y)).
See book for derivation.
6.5 Stationary distributions
The forward equation can be used to find stationary distributions. If π(⋅) is a stationary density for the diffusion X, then starting the diffusion in that density, it should stay in that density. So we should expect that
v(t,y)=π(y)
should satisfy the forward equation. This would be
0 = −(d/dy)(μ(y)π(y)) + (1/2)(d^2/dy^2)(σ2(y)π(y)).
This is an ODE, so we can solve for stationary distributions by solving ODEs.
6.6 Probability flux for diffusions
P(Xt<x,Xt+h>x) can be thought of as the flux from (−∞,x) to (x,∞) over the time interval [t,t+h]. Similarly, the probability P(Xt>x,Xt+h<x) is the flux across x in the other direction. We are interested in the net flux across x,
P(Xt<x,Xt+h>x)−P(Xt>x,Xt+h<x).
This would be 0 for a stationary process.
Let v(t,ζ) be the density of Xt evaluated at the state ζ, so that
P(Xt∈dζ)=v(t,ζ)dζ.
For a small positive h, let Δ(ζ,y) denote the conditional density of the increment Xt+h−Xt, given that Xt=ζ, evaluated at y; that is, Δ(ζ,y)dy = P{Xt+h−Xt ∈ dy ∣ Xt=ζ}.
Consider the integrand as a function of ζ. For small h, only values of ζ very near to x will contribute significantly. Unrigorously, we will say we can do a Taylor expansion about ζ=x:
Dividing by time increment h and letting h→0, we see the rate of net probability flux across x at time t is given by
v(t,x)μ(x) − (1/2)∂x(v(t,x)σ2(x)).
We can use the net flux to get the forward equation (see book).
6.7 Quadratic Variation of Brownian Motion
An important property of standard Brownian motion W is that its quadratic variation over the interval [0,t] is t with probability 1.
Let’s start with quadratic variation for a nonrandom function f.
DEF: Let f be a real valued function defined at least on the interval [0,t]. The quadratic variation qt(f) of f over [0,t] is defined by
qt(f) = lim_{n→∞} ∑_{k=1}^{2^n} [f(kt/2^n) − f((k−1)t/2^n)]^2
if the limit exists (otherwise it is undefined).
Generally, this is uninteresting for the tame functions we see every day. In fact, if f is continuous and of bounded variation, then qt(f)=0 for all t.
Sample paths of SBM are continuous, but with probability 1 have quadratic variation t on the interval [0,t] and infinite total variation.
Let ΔWk,n denote the increment
ΔWk,n = W(kt/2^n) − W((k−1)t/2^n)
and let Qn = ∑_{k=1}^{2^n} (ΔWk,n)^2.
THM: With probability 1, Qn→t as n→∞. Additionally, Qn→t in mean square:
E((Qn−t)2)→0.
See book for proof.
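A quick numerical illustration (mine, not the book's): simulate W on the dyadic grid with 2^n points on [0,t] and compare Qn with t; summing |ΔW| also illustrates how the total variation blows up.

```python
import numpy as np

rng = np.random.default_rng(6)
t = 1.0
n = 16
N = 2 ** n
dt = t / N
# Simulate standard Brownian motion on the dyadic grid by drawing its N(0, dt) increments.
increments = rng.normal(0.0, np.sqrt(dt), size=N)
Q_n = np.sum(increments ** 2)           # sum of squared increments over [0, t]
total_variation = np.sum(np.abs(increments))
print(Q_n)                 # close to t = 1
print(total_variation)     # grows like 2^{n/2}, illustrating infinite total variation in the limit
```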
Side thought: let X(t)=μt+σW(t) be a (μ,σ2) Brownian motion. With probability 1, the quadratic variation of X on [0,t] is σ2t. Suppose we are observing the Brownian motion X and we want to estimate μ and σ2. If we observe it for a long time T, we can estimate the drift μ by the slope X(T)/T, and this estimate gets better as T increases. However, for estimating σ2, it is enough to observe X over any interval of time, and we can infer σ2 exactly…
6.8 Stochastic Differential Equations
First, let’s think about an ordinary, deterministic, first-order differential equation:
dXt/dt = f(Xt,t).
This is “Markov-like” in the sense that, given X(t), all past values are unnecessary for determining future behavior. Second-order equations do not have this property; we would need X(t−Δt) and X(t) to approximate X(t+Δt). In differential equations, one might convert a second-order equation into a first-order one by including X˙(t) in the state along with X(t). So we can convert “non-Markov” processes into “Markov” ones by including more information in the state. But this is impractical.
A simple deterministic DE might be dX/dt=rX, modeling exponential growth of a population. If we wanted to model a “noisy” growth rate, we might propose something like
dX(t)/dt = [r + N(t)]X(t),
for N a noise process (stochastic). More generally, consider equations of the form
dX(t)/dt = μ(Xt,t) + σ(Xt,t)Nt.
Some desirable assumptions to put on this noise process (characteristics of Gaussian white noise):
{Nt, t≥0} is a stationary Gaussian process
ENt=0.
Ns independent of Nt for s≠t.
Rewriting our equation as
dXt=μ(Xt,t)dt+σ(Xt,t)Ntdt
=:μ(Xt,t)dt+σ(Xt,t)dVt,
letting {Vt,t≥0} be a process with increments dVt=Ntdt, we see that {Vt,t≥0} is a Gaussian process with stationary independent increments. So it must be a Brownian motion. Then
dXt=μ(Xt,t)dt+σ(Xt,t)dWt,
where {Wt,t≥0} is a standard Brownian motion.
But what is dWt if all paths of BM are non differentiable? Hence we are very sad about stochastic calculus.
We will be interested in writing
Xt−X0=∫0tμ(Xs,s)ds+∫0tσ(Xs,s)dWs.
The first integral can be defined for each fixed path ω as an ordinary Riemann integral. So it is a random variable (a function of ω): ω ↦ ∫_0^t μ(Xs(ω),s)ds.
The second integral is weird and we will have to construct the stochastic integral to deal with it.
6.9 Stochastic Calculus and Ito’s Formula
A diffusion {Xt,t≥0} with infinitesimal parameters μ(x,t) and σ2(x,t) has stochastic differential
dX=μ(X,t)dt+σ(X,t)dW.
A version of Ito’s formula says if X is a stochastic process having stochastic differential dX and f is a suitably nice function, then the process Y=f(X) has stochastic differential
dYt = f′(Xt)dXt + (1/2)f′′(Xt)(dXt)^2,
where (dXt)2 is computed using the rules
(dt)(d(anything))=0
(dWt)2=dt
In a more general situation where Y may be of the form Y=f(X,t), Ito’s formula says to compute dY first with a Taylor expansion, keeping terms up to quadratic order, and then simplifying by using the rules above.
EXAMPLE:
Consider the Brownian motion with drift Xt=μt+σWt. The stochastic differential is
dXt=μdt+σdWt.
Define the geometric Brownian motion Yt=eXt. What are the infinitesimal mean and variance functions of the diffusion Y?
Y=f(X)=eX. Then f′(X)=f′′(X)=eX=Y.
dYt = f′(Xt)dXt + (1/2)f′′(Xt)(dXt)^2
= Y dXt + (1/2)Y(dXt)^2.
Note that (dXt)2=(μdt+σdWt)2, so
(dXt)2=μ2(dt)2+2μσ(dt)(dWt)+σ2(dWt)2
=0+0+σ2dt.
Plugging this in,
dYt=Yt(μ+(1/2)σ2)dt+σYtdWt.
So Y is a diffusion process with infinitesimal parameters μY(y)=(μ+(1/2)σ2)y and σY2(y)=σ2y^2.
The term Ito process refers to a stochastic process that has a stochastic differential of the form
dZt=XtdWt+Ytdt
for X,Y processes satisfying some conditions, roughly:
X,Y are adapted
Conditions that ensure X,Y not “too big.”
This somewhat concludes my review of the book. I would say I included fewer proofs than I would have liked, but LaTeXing through big Markov expansions is no fun. A couple sections are left incomplete.
Chapter 7 covers likelihood ratios. Chapter 8 covers extremes and Poisson clumping. Both are pretty short, and I might write separate notes on those.
For more detail on stochastic calculus, later notes will be better and explore the construction with more rigor.