Scattered notes from Prof. Joe Chang’s stochastic processes book, intended for a first undergraduate course. Brownian motion and stochastic integration largely omitted; some sections left incomplete and proofs largely reference the book. See other posts for more details on stochastic calculus.
1 – Markov Chains
1.1 Specifying and Simulating a Markov Chain
to specify a Markov chain, we need to know its
state space S, a finite or countable set of states—values the random variables Xi may take on
initial distribution π0, the probability distribution of the Markov chain at time 0.
for each state i, π0(i):=P{X0=i}, the probability the Markov chain starts in state i. We can also think of π0 as a vector whose i-th entry is π0(i)
probability transition matrix P=(Pij)
if S has N states, then P is N×N
Pij or P(i,j) is the probability that the chain will transition to j from i: Pij=P{Xn+1=j∣Xn=i}.
rows sum to 1 (think: from state i, we are in row i, and there must be total probability 1 of going to some next state j); a small simulation sketch follows below
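As a quick aside (not from the book), here is a minimal Python sketch of simulating a chain from these three ingredients; the two-state matrix `P` and vector `pi0` in the example are made up.

```python
import numpy as np

def simulate_chain(P, pi0, n_steps, rng=None):
    """Simulate a Markov chain with row-stochastic transition matrix P and initial distribution pi0."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(pi0)
    states = np.empty(n_steps + 1, dtype=int)
    states[0] = rng.choice(N, p=pi0)                   # draw X_0 from pi0
    for t in range(n_steps):
        states[t + 1] = rng.choice(N, p=P[states[t]])  # draw X_{t+1} from row X_t of P
    return states

# Hypothetical two-state example
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi0 = np.array([1.0, 0.0])
print(simulate_chain(P, pi0, 20))
```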
1.2 The Markov Property
a process X0,X1,… satisfies the Markov property if P{Xn+1=j∣X0=i0,…,Xn−1=in−1,Xn=i} = P{Xn+1=j∣Xn=i} for all n and all states i0,…,in−1,i,j.
THM (BASIC LIMIT): Let X0,X1,... be an irreducible, aperiodic Markov chain having a stationary distribution π(⋅). Let X0 have the distribution π0, an arbitrary initial distribution.
Then limn→∞πn(i)=π(i) for all states i.
Let’s define “irreducible,” “aperiodic,” and “stationary distribution” so this makes sense.
1.5 Stationary Distribution
π being a stationary distribution amounts to saying π=πP is satisfied, i.e.,
π(j) = ∑_{i∈S} π(i)P(i,j)
for all j∈S.
a Markov chain might have no stationary distribution, one stationary distribution, or infinitely many stationary distributions.
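A small numerical aside (not in the book): for a finite chain, a stationary distribution can be found by solving π=πP together with the normalization ∑π(i)=1. The 2×2 matrix below is a made-up example.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi = pi P together with sum(pi) = 1; works when the stationary distribution is unique."""
    N = P.shape[0]
    A = np.vstack([P.T - np.eye(N), np.ones(N)])   # (P^T - I) pi = 0 plus the normalization row
    b = np.zeros(N + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi = stationary_distribution(P)
print(pi, np.allclose(pi @ P, pi))   # pi = (5/6, 1/6) solves pi = pi P
```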
for subsets A,B of the state space, define the probability flux from set A into B as
flux(A,B) = ∑_{i∈A} ∑_{j∈B} π(i)P(i,j)
1.6 Irreducibility, Periodicity, Recurrence
Use Pi(A) as shorthand for P{A∣X0=i}, and same for Ei.
Accessibility: for two states i,j, we say that j is accessible from i if it is possible for the chain ever to visit state j if the chain starts in state i:
Pi{⋃_{n=0}^∞ {Xn=j}} > 0.
Equivalently,
∑_{n=0}^∞ Pn(i,j) = ∑_{n=0}^∞ Pi{Xn=j} > 0.
Communication: we say i communicates with j if i is accessible from j and j is accessible from i.
Irreducibility: the Markov chain is irreducible if all pairs of states communicate.
The relation “communicates with” is an equivalence relation; hence, the state space S can be partitioned into “communicating classes” or “classes.”
–
The Basic Limit Theorem requires irreducibility and aperiodicity (see 1.6). Trivial examples showing why:
Irreducibility: take S={0,1}, P = [1 0; 0 1] (the identity matrix). Then πn=π0 holds for all n, i.e., πn does not approach a limit independent of π0.
Aperiodicity: same S, take P = [0 1; 1 0]. If, for example, π0=(1,0), then πn alternates between (1,0) and (0,1) for even and odd n, and does not converge to anything.
–
Period: Given a Markov chain {X0,X1,...}, define the period of a state i to be the greatest common divisor
di=gcd{n:Pn(i,i)>0}.
THM: if the states i and j communicate, then di=dj.
The period of a state is a “class property.” In particular, all states in an irreducible Markov chain have the same period. Thus, we can speak of the period of a Markov chain if the Markov chain is irreducible.
An irreducible Markov chain is aperiodic if its period is 1, and periodic otherwise. (A sufficient but not necessary condition for an irreducible chain to be aperiodic is that there exists a state i such that P(i,i)>0.)
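A small sketch (mine, not the book's) of computing the period numerically from the definition, by taking the gcd of the return times n ≤ n_max with Pn(i,i)>0; the two example matrices are hypothetical.

```python
import numpy as np
from math import gcd
from functools import reduce

def period(P, i, n_max=200):
    """Compute gcd{n <= n_max : P^n(i,i) > 0}; for a finite chain this equals the period d_i
    once n_max is large enough that several return times have been seen."""
    return_times = []
    Pn = np.eye(P.shape[0])
    for n in range(1, n_max + 1):
        Pn = Pn @ P
        if Pn[i, i] > 0:
            return_times.append(n)
    return reduce(gcd, return_times) if return_times else 0

flip = np.array([[0.0, 1.0], [1.0, 0.0]])   # deterministic 2-cycle: period 2
lazy = np.array([[0.5, 0.5], [1.0, 0.0]])   # P(0,0) > 0, so aperiodic: period 1
print(period(flip, 0), period(lazy, 0))
```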
–
One more concept for the Basic Limit Theorem: recurrence. We will begin by defining recurrence, then show that it is a class property. In particular, in an irreducible Markov chain, either all states are recurrent or all states are transient.
The idea of recurrence: a state i is recurrent if, starting from the state i at time 0, the chain is sure to return to i eventually. More precisely, define the first hitting time Ti of the state i by
Ti=inf{n>0:Xn=i}.
Recurrence: the state i is recurrent if Pi{Ti<∞}=1. If i is not recurrent, it is called transient.
(note that accessibility could equivalently be defined: for distinct states i≠j, j is accessible from i iff Pi{Tj<∞}>0.)
THM: Let i be a recurrent state, and suppose that j is accessible from i. Then all of the following hold:
Pi{Tj<∞}=1
Pj{Ti<∞}=1
The state j is recurrent
–
We use the notation Ni for the total number of visits of the Markov chain to the state i:
Ni = ∑_{n=0}^∞ I{Xn=i}.
THM: The state i is recurrent iff Ei(Ni)=∞.
COROLLARY: If j is transient, then limn→∞Pn(i,j)=0 for all states i.
–
Introducing stationary distributions.
PROP: Suppose a Markov chain has a stationary distribution π. If the state j is transient, then π(j)=0.
COROLLARY: If an irreducible Markov chain has a stationary distribution, then the chain is recurrent.
Note that the converse for the above is not true. There are irreducible, recurrent Markov chains that do not have stationary distributions. For example, the simple symmetric random walk on the integers in one dimension is irreducible and recurrent but does not have a stationary distribution. By recurrence we have P0{T0<∞}=1, but also E0{T0}=∞. The name for this kind of recurrence is null recurrence, i.e., a state i is null recurrent if it is recurrent and Ei(Ti)=∞. Otherwise, a recurrent state is positive recurrent, where Ei(Ti)<∞.
Positive recurrence is also a class property: if a chain is irreducible, the chain is either transient, null recurrent, or positive recurrent. In fact, an irreducible chain has a stationary distribution iff it is positive recurrent.
1.7 Coupling
Example of coupling technique: consider a random graph on a given finite set of nodes, in which each pair of nodes is joined by an edge independently with probability p. We could simulate a random graph as follows: for each pair of nodes i,j generate a random number Uij∼U[0,1], and join nodes i and j with an edge if Uij≤p.
How do we show that the probability of the resulting graph being connected is nondecreasing in p, i.e., show that for p1<p2,
Pp1{graph connected}≤Pp2{graph connected}.
We could try to find an explicit function for the probability in terms of p, which seems inefficient. How to formalize the intuition that this seems obvious?
An idea: show that corresponding events are ordered, i.e., if A⊂B then P(A)≤P(B).
Let’s make 2 events by making 2 random graphs, G1,G2 on the same set of nodes. G1 is constructed by having each possible edge appear with prob p1, and for G2, each edge present with prob p2. We can do this by using two sets of U[0,1] random variables: {Uij},{Vij} for the first and second graph, respectively. Is it true that
{G1 connected}⊂{G2 connected}?
No, since the two sets of r.v.s are independently generated.
A change: use the same random numbers for each graph. Then
{G1 connected}⊂{G2 connected}
becomes true. This establishes monotonicity of the probability being connected.
Conclusion: what characterizes a coupling argument? Generally, we show that the same set of random variables can be used to construct two different objects about which we want to make a probabilistic statement.
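A quick simulation sketch (not from the book) of the coupled construction: both graphs are built from the same uniforms Uij, so every edge of G1 is also an edge of G2, and the event {G1 connected} is contained in {G2 connected}. The graph size and edge probabilities below are arbitrary.

```python
import numpy as np

def connected(n, edges):
    """Check whether the undirected graph on nodes 0..n-1 with the given edge list is connected (DFS)."""
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

def coupled_graphs(n, p1, p2, rng):
    """Build G1 and G2 from the SAME uniforms U_ij: the edge {i,j} is present iff U_ij <= p."""
    U = {(i, j): rng.random() for i in range(n) for j in range(i + 1, n)}
    return [e for e, u in U.items() if u <= p1], [e for e, u in U.items() if u <= p2]

rng = np.random.default_rng(0)
violations = 0
for _ in range(2000):
    g1, g2 = coupled_graphs(8, 0.2, 0.4, rng)
    if connected(8, g1) and not connected(8, g2):   # never happens with the shared uniforms
        violations += 1
print(violations)  # 0
```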
1.8 Proof of Basic Limit Theorem
The Basic Limit Theorem says that if an irreducible, aperiodic Markov chain has a stationary distribution π, then for each initial distribution π0, as n→∞ we have πn(i)→π(i) for all states i.
(Note the wording “a stationary distribution”: assuming BLT true implies that an irreducible and aperiodic Markov chain cannot have two different stationary distributions)
Equivalently, the conclusion can be phrased in terms of a distance between probability distributions, called “total variation distance”:
DEF: Let λ and μ be two probability distributions on the set S. Then the total variation distance ∣∣λ−μ∣∣ is defined by
∣∣λ−μ∣∣ = sup_{A⊂S} [λ(A)−μ(A)].
PROP: The total variation distance may also be expressed in the alternative forms
∣∣λ−μ∣∣ = (1/2)∑_{i∈S} ∣λ(i)−μ(i)∣ = ∑_{i∈S} [λ(i)−μ(i)]^+.
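A small check of these expressions (my own code, with made-up vectors): the sup over A is attained at A = {i : λ(i) > μ(i)}, which is what `tv_sup` computes.

```python
import numpy as np

def tv_sup(lam, mu):
    """sup over A of [lam(A) - mu(A)]; the sup is attained at A = {i : lam(i) > mu(i)}."""
    diff = lam - mu
    return diff[diff > 0].sum()

def tv_half_l1(lam, mu):
    """(1/2) * sum_i |lam(i) - mu(i)|."""
    return 0.5 * np.abs(lam - mu).sum()

lam = np.array([0.5, 0.3, 0.2])
mu = np.array([0.2, 0.3, 0.5])
print(tv_sup(lam, mu), tv_half_l1(lam, mu))   # both equal 0.3
```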
We now introduce the coupling method. Let Y0,Y1,... be a Markov chain with the same probability transition matrix as X0,X1,..., but let Y0 have the initial distribution π and X0 have the initial distribution π0. Note that {Yn} is a stationary Markov chain with distribution π for all n. Let the Y chain be independent of the X chain.
We want to show that, for large n, the probabilistic behavior of Xn is close to that of Yn.
Define the coupling time T to be the first time at which Xn=Yn:
T=inf{n:Xn=Yn}.
LEMMA: For all n we have
∣∣πn−π∣∣≤P{T>n}.
Hence we need only show that P{T>n}→0, or equivalently, that P{T<∞}=1.
Consider the bivariate chain {Zn=(Xn,Yn):n≥0}. Z0,Z1,... is clearly a Markov chain on the state space S×S. Since the X and Y chains are independent, the probability transition matrix PZ of the chain Z can be written
PZ((ix,iy),(jx,jy)) = P(ix,jx)P(iy,jy).
Z has stationary distribution
πZ(ix,iy) = π(ix)π(iy).
We want to show that P{T<∞}=1. In terms of the Z chain, we want to show that with probability one, the Z chain hits the “diagonal” {(j,j):j∈S} in S×S in finite time. To do this, it is sufficient to show that the Z chain is irreducible and recurrent.
This is where we use aperiodicity.
LEMMA: Suppose A is a set of positive integers that is closed under addition and has greatest common divisor 1. Then there exists an integer N s.t. n∈A for all n≥N.
Let i∈S and recall the assumption that the chain is aperiodic. Since the set {n:Pn(i,i)>0} is closed under addition, and, from aperiodicity, has greatest common divisor 1, we can use the previous lemma. So Pn(i,i)>0 for all sufficiently large n. From this, for any i,j∈S, since irreducibility implies Pm(i,j)>0 for some m, it follows that Pn(i,j)>0 for all sufficiently large n.
Now we show irreducibility of the Z chain. Let ix,iy,jx,jy∈S. It is sufficient to show that PZ^n((ix,iy),(jx,jy))>0 for some n. By the assumed independence of {Xn} and {Yn}, we have
PZ^n((ix,iy),(jx,jy)) = P^n(ix,jx) P^n(iy,jy),
which (by the previous argument) is positive for all sufficiently large n, so we are done. (Recurrence of the Z chain then follows from the earlier corollary, since Z is irreducible and has the stationary distribution πZ.)
1.9 SLLN for Markov Chains
The usual Strong Law of Large Numbers for iid random variables says that if X1,X2,... are iid with mean μ, then
P{(1/n)∑_{t=1}^n Xt → μ as n→∞} = 1.
We will do a generalization of this result for Markov chains: the fraction of times a Markov chain occupies state i converges to a limit.
Although the successive states of a Markov chain are not independent, certain features of a Markov chain are independent of each other. Here we will use the idea that the path of the chain consists of a succession of independent “cycles,” the segments of the path between successive visits to a recurrent state. This independence allows us to use the LLN that we already know.
THM: Let X0,X1,... be a Markov chain starting in the state X0=i, and suppose that the state i communicates with another state j. The limiting fraction of time that the chain spends in state j is 1/(EjTj). That is,
Pi{ lim_{n→∞} (1/n)∑_{t=1}^n I{Xt=j} = 1/(EjTj) } = 1.
In the proof of the previous theorem, we define Vn(j) as the number of visits made to state j by X1,...,Xn, i.e.,
Vn(j) = ∑_{t=1}^n I{Xt=j}.
Using the Bounded Convergence Theorem,
we have the following:
COROLLARY: For an irreducible Markov chain, we have
lim_{n→∞} (1/n)∑_{t=1}^n Pt(i,j) = 1/(EjTj)
for all states i and j.
Consider an irreducible, aperiodic Markov chain having a stationary distribution π. From the Basic Limit Theorem, we know that Pn(i,j)→π(j) as n→∞. Notice also that if a sequence of numbers converges to a limit, then the sequence of Cesaro averages converges to the same limit, i.e., if at→a as t→∞, then (1/n)∑_{t=1}^n at→a as n→∞. On the other hand, the previous Corollary shows that the Cesaro averages of the Pt(i,j)’s converge to 1/(EjTj). So, we must have
π(j) = 1/(EjTj).
In fact, aperiodicity is not needed for this conclusion:
THM: An irreducible, positive recurrent Markov chain has a unique stationary distribution π given by
π(j) = 1/(EjTj).
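A simulation sketch (not from the book) checking π(j)=1/(EjTj) for a made-up two-state chain: the long-run fraction of time in state 1 and the reciprocal of the average return time to state 1 should both be near π(1)=1/6.

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])          # hypothetical chain; its stationary distribution is (5/6, 1/6)

n = 100_000
x = 0                                # start in state 0
time_in_1 = 0
return_times = []                    # gaps between successive visits to state 1, to estimate E_1 T_1
last_visit = None
for t in range(1, n + 1):
    x = rng.choice(2, p=P[x])
    if x == 1:
        time_in_1 += 1
        if last_visit is not None:
            return_times.append(t - last_visit)
        last_visit = t

print(time_in_1 / n)                 # long-run fraction of time in state 1, near 1/6
print(1.0 / np.mean(return_times))   # 1 / (average return time to state 1), also near 1/6
```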
2 – Markov Chains: Examples & Applications
2.1 Branching Processes
Motivating example: the branching process model was formulated by Galton, who was interested in the survival and extinction of family names.
Suppose children inherit their fathers’ names, so we need only to keep track of fathers and sons. Consider a male who is the only member of his generation to have a given family name, so that the responsibility of keeping the family name alive falls upon him–if his line of male descendants terminates, so does the family name.
Suppose for simplicity that each male has probability f(0) of producing no sons, f(1) of producing 1 son, and so on.
What is the probability that the family name eventually becomes extinct?
To formalize this, let:
Gt = the number of males in generation t
Start with G0=1
If Gt=i, write Gt+1=Xt1+Xt2+...+Xti, where Xtj is the number of sons fathered by the j-th man in generation t
Assume the r.v.s {Xtj:t≥0,j≥1} are iid with probability mass function f
Hence P{Xtj=k}=f(k) for k=0,1,....
To avoid trivial cases, assume f(0)>0 and f(0)+f(1)<1.
We are interested in the extinction probability ρ=P1{Gt=0 for some t}.
{Gt:t≥0} is a Markov chain. State 0 is absorbing. So, for each i>0, since Pi{G1=0}=(f(0))i>0, the state i must be transient.
Consequently, with probability 1, each state i>0 is visited only a finite number of times. So, with probability 1, the chain must get absorbed at 0 or approach ∞.
We can obtain an equation for ρ by conditioning on what happens at the first step of the chain:
ρ = ∑_k f(k)ρ^k = ψ(ρ), where ψ(z) = ∑_k f(k)z^k. The derivatives ψ′(z) = ∑_k k f(k)z^{k−1} and ψ′′(z) = ∑_k k(k−1)f(k)z^{k−2} are sums of nonnegative terms for z∈(0,1). Since these are positive, the function ψ is strictly increasing and convex on (0,1). Also note that ψ(0)=f(0), ψ(1)=1. Finally, ψ′(1)=∑_k k f(k)=μ, where μ=E(X), the expected number of sons for each male.
Since ψ(1)=1, there is always a trivial solution at ρ=1. When μ≤1, this trivial solution is the only solution, so ρ=1.
When μ>1, we define r to be the smaller solution of ψ(r)=r. Since ψ(ρ)=ρ, we know that ρ must be either r or 1. We want to show that ρ=r.
Defining pt=P1{Gt=0}, observe that as t→∞,
pt ↑ P1[⋃_{n=1}^∞ {Gn=0}] = ρ.
Therefore, to rule out ρ=1, it is sufficient to prove that pt≤r for all t. We show this by induction; the base case is p0=0≤r. For the induction step, conditioning on the first generation gives P1{Gt+1=0}=ψ(P1{Gt=0}), i.e., pt+1=ψ(pt). Since ψ is increasing and by the induction hypothesis pt≤r, we have that
pt+1=ψ(pt)≤ψ(r)=r.
So pt is bounded by r for all nonnegative t, which proves ρ=r. Hence, since r<1, there is a positive probability that the family name goes on forever.
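Since pt+1=ψ(pt) with p0=0, the extinction probability can be computed numerically by iterating ψ. A minimal sketch (mine, with a made-up offspring distribution f):

```python
import numpy as np

def extinction_probability(f, tol=1e-12, max_iter=10_000):
    """Iterate p_{t+1} = psi(p_t) from p_0 = 0, where psi(z) = sum_k f(k) z^k.

    f is the offspring pmf as an array (f[k] = P{X = k}); the iterates increase to rho,
    the smallest fixed point of psi in [0, 1]."""
    ks = np.arange(len(f))
    psi = lambda z: np.sum(f * z ** ks)
    p = 0.0
    for _ in range(max_iter):
        p_new = psi(p)
        if abs(p_new - p) < tol:
            break
        p = p_new
    return p

# Hypothetical offspring pmf with mean mu = 0*0.25 + 1*0.25 + 2*0.5 = 1.25 > 1
f = np.array([0.25, 0.25, 0.5])
print(extinction_probability(f))   # 0.5: the fixed points of psi(z) = z are z = 0.5 and z = 1
```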
2.2 Time Reversibility
2.3 More on Time Reversibility: A Tandem Queue Model
2.4 The Metropolis Method
2.5 Simulated Annealing
2.6 Ergodicity Concepts
In this section we focus on a time-inhomogeneous Markov chain {Xn} on a countably infinite state space S.
Let Pn denote the probability transition matrix governing the transition from Xn to Xn+1, i.e.,
Pn(i,j)=P{Xn+1=j∣Xn=i}.
For m<n, define P^{(m,n)} = ∏_{k=m}^{n−1} Pk, so that
P^{(m,n)}(i,j) = P{Xn=j∣Xm=i}.
DEF:{Xn} is strongly ergodic if there exists a probability distribution π∗ on S such that
lim_{n→∞} sup_{i∈S} ∣∣P^{(m,n)}(i,⋅) − π∗∣∣ = 0 for all m.
DEF: {Xn} is weakly ergodic if
lim_{n→∞} sup_{i,j∈S} ∣∣P^{(m,n)}(i,⋅) − P^{(m,n)}(j,⋅)∣∣ = 0 for all m.
We can understand weak ergodicity somewhat as a “loss of memory” concept. It says that at a large enough time n, the chain has nearly forgotten its state at time m, in the sense that the distribution at time n would be nearly the same no matter what the state was at time m. However, there is no requirement that the distribution be converging to anything as n→∞. The concept that incorporates convergence in addition to loss of memory is strong ergodicity.
2.6.1 The Ergodic Coefficient
For a probability transition matrix P=(P(i,j)), the ergodic coefficient δ(P) of P is defined to be the maximum total variation distance between pairs of rows of P, that is,
DEF: The ergodic coefficient δ(P) of a probability transition matrix P is
δ(P) = sup_{i,j∈S} ∣∣P(i,⋅) − P(j,⋅)∣∣.
The basic idea is that δ(P) being small is “good” for ergodicity. For example, in the extreme case of δ(P)=0, all the rows of P are identical, so P would cause a Markov chain to lose its memory completely in just one step: v1=v0P does not depend on v0.
LEMMA: δ(PQ) ≤ δ(P)δ(Q) for probability transition matrices P,Q.
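A small numerical sketch (not from the book) of the ergodic coefficient, computed directly as the maximum total variation distance between rows, together with a spot check of the lemma; the matrices P and Q are made up.

```python
import numpy as np

def tv(p, q):
    """Total variation distance between two pmfs (half the L1 distance)."""
    return 0.5 * np.abs(p - q).sum()

def ergodic_coefficient(P):
    """delta(P) = max over pairs of rows i, j of ||P(i,.) - P(j,.)||."""
    n = P.shape[0]
    return max(tv(P[i], P[j]) for i in range(n) for j in range(n))

P = np.array([[0.9, 0.1], [0.5, 0.5]])
Q = np.array([[0.2, 0.8], [0.6, 0.4]])
# Spot check of the lemma: delta(PQ) <= delta(P) * delta(Q)
print(ergodic_coefficient(P @ Q), ergodic_coefficient(P) * ergodic_coefficient(Q))
```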
2.6.2 Sufficient Conditions for Weak and Strong Ergodicity
Sufficient conditions are given in the next two propositions:
PROP: If there exist n0<n1<n2<... such that ∑_k [1−δ(P^{(nk,nk+1)})]=∞, then {Xn} is weakly ergodic.
PROP: If {Xn} is weakly ergodic and if there exist π0,π1,... such that πn is a stationary distribution for Pn for all n and ∑n∣∣πn−πn+1∣∣<∞, then {Xn} is strongly ergodic. In that case, the distribution π∗ in the definition is given by π∗=limn→∞πn.
2.7 Proof of Main Theorem of Simulated Annealing
See book…
2.8 Card Shuffling
We have seen that for an irreducible, aperiodic Markov chain {Xn} having stationary distribution π, the distribution πn of {Xn} converges to π in the total variation distance. An example of using this would be generating a nearly uniformly distributed 4×4 table with given row and column sums by simulating a certain Markov chain for a long enough time. The question is: how long is long enough?
In certain simple Markov chain examples, it is easy to figure out the rate of convergence of πn to π. In this section we will concentrate on a simple shuffling example considered by Aldous and Diaconis in their article “Shuffling cards and stopping times.” The basic question is: How close is the deck to being “random” (uniformly distributed over the 52! possible permutations) after n shuffles? For the riffle shuffle model, the answer is “about 7.”
2.8.1 “Top in at random” Shuffle
The “top in at random” method consists of taking the top card off the deck and then inserting it back into the deck at a random position (this includes back on top). So, altogether, there are 52 equally likely positions. Repeated performance of this shuffle on a deck of cards produces a sequence of “states” of the deck. This sequence of states forms a Markov chain with state space S52, the group of permutations of the cards. This Markov chain is irreducible, aperiodic, and has stationary distribution π = Uniform on S52 (i.e., probability 1/52! for each permutation). Therefore, by the Basic Limit Theorem, we may conclude that ∣∣πn−π∣∣→0 as n→∞.
2.8.2 Threshold Phenomenon
Suppose we are working with a fresh deck of d cards in the original order (card 1 on top, card 2 under it, etc.). Then ∣∣π0−π∣∣=1−(1/d!). We also know that ∣∣πn−π∣∣→0 as n→∞ from the Basic Limit Theorem. It is natural to assume that the distance from stationarity decreases to 0 in a smooth manner; however, it actually exhibits what we call the “threshold phenomenon”: an abrupt change happens in a relatively small neighborhood of the value n = d log d. That is, for large d, the graph of ∣∣πn−π∣∣ versus n stays near 1 and then drops sharply near n = d log d (picture omitted).
The larger the deck size d, the sharper (relative to d log d) the drop is near n = d log d.
2.8.3 A random time to exact stationarity
Let’s give a name to each card in the deck (i.e., say the 2 of hearts is card 1, the 3 of hearts is card 2, etc.). Suppose we start with the deck in pristine order (card 1 on top, then card 2, etc.). Though πn will never become exactly random, it is possible to find a random time T at which the deck becomes exactly uniformly distributed, that is, XT∼Unif(S52).
Here is an example of such a random time. Suppose that “card i” always refers to the same card (say, card 52 is the ace of spades), whereas “top card,” “card in position 2,” etc. refer to whatever card occupies that position at the time of consideration. Also note that we may describe a sequence of shuffles simply by a sequence of iid random variables U1,U2,... uniformly distributed on {1,2,...,52}: just say that the i-th shuffle moves the top card to position Ui. Define the following random times:
T1 = inf{n : Un = 52} = 1st time a top card goes below card 52
T2 = inf{n > T1 : Un ≥ 51} = 2nd time a top card goes below card 52
T3 = inf{n > T2 : Un ≥ 50} = 3rd time a top card goes below card 52
⋮
T51 = inf{n > T50 : Un ≥ 2} = 51st time a top card goes below card 52
and
T=T52=T51+1.
It is not hard to see that T has the desired property and that XT is uniformly distributed. To understand this, start with T1. At time T1, we know that some card is below card 52; we don’t know which card, but that will not matter. After time T1 we continue to shuffle until T2, at which time another card goes below card 52. At time T2, there are 2 cards below card 52. Again, we do not know which cards they are, but conditional on which 2 cards are below card 52, each of the two possible orderings of those 2 cards is equally likely. Similarly, we continue to shuffle until time T3, at which time there are some 3 cards below card 52, and, whatever those 3 cards are, each of their (3!) possible relative positions in the deck is equally likely. And so on. At time T51, card 52 has risen all the way up to become the top card, and the other 51 cards are below card 52 (now we do know which cards they are), and those 51 cards are in random positions (i.e. uniform over 51! possibilities). Now all we have to do is shuffle one more time to get card 52 in random position, so that at time T=T52=T51+1, the whole deck is random.
Let us find ET. By the definitions above, T1∼Geom(1/52),(T2−T1)∼Geom(2/52),...,(T51−T50)∼Geom(51/52),(T52−T51)∼Geom(52/52). So
ET = E(T1) + E(T2−T1) + ... + E(T52−T51) = 52(1 + 1/2 + ... + 1/52) ≈ 52 log 52.
Analogously, if the deck had d cards rather than 52, we would have obtained ET ∼ d log d (for large d), where T is now a random time at which the whole deck of d cards becomes uniformly distributed on Sd.
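A simulation sketch (not from the book) of the random time T defined above, driven by the iid insert positions Un; the sample mean should be close to 52(1 + 1/2 + ... + 1/52) ≈ 236, which is of order d log d.

```python
import numpy as np

def sample_T(d, rng):
    """Simulate the random time T = T_{d-1} + 1 described above for a d-card deck,
    driven by iid uniform insert positions on {1, ..., d}."""
    t = 0
    threshold = d           # first wait for U_n = d (top card inserted below card d)
    successes = 0
    while successes < d - 1:
        t += 1
        if rng.integers(1, d + 1) >= threshold:
            successes += 1
            threshold -= 1  # next wait for U_n >= d-1, then >= d-2, ..., finally >= 2
    return t + 1            # one more shuffle puts card d itself into a random position

rng = np.random.default_rng(2)
d = 52
samples = [sample_T(d, rng) for _ in range(2000)]
print(np.mean(samples))     # about 52*(1 + 1/2 + ... + 1/52), roughly 236
print(d * np.log(d))        # roughly 205, the same order of magnitude
```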
2.8.4 Strong Stationary Times
As we have observed, the random variable T that we just constructed has the property that XT∼π. T also has two other important properties. First, XT is independent of T. Second, T is a stopping time, that is, for each n, one can determine whether or not T=n just by looking at the values of X0,...,Xn. In particular, to determine whether or not T=n it is not necessary to know any future values Xn+1,Xn+2,....
DEF: A random variable T satisfying
T is a stopping time,
XT is distributed as π, and
XT is independent of T
is called a strong stationary time.
What’s so good about strong stationary times?
LEMMA: If T is a strong stationary time for the Markov chain {Xn}, then ∣∣πn−π∣∣≤P{T>n} for all n.
This tells us that strong stationary times satisfy the same inequality derived for coupling times.
To see proof of threshold phenomenon in shuffling, refer to book.
3 – MRFs and HMMs
This section looks at aspects of Markov random fields (MRF’s), hidden Markov models (HMM’s), and their applications.
3.1 MRFs on Graphs and HMMs
A stochastic process is a collection of random variables {Xt:t∈T} indexed by some subset T of the real line R. The elements of T are often interpreted as times, in which case Xt represents the state at time t of the random process under consideration. The term random field refers to a generalization of the notion of a stochastic process: a random field {Xs:s∈G} is still a collection of random variables, but now the index set G need not be a subset of R. For example, G could be a subset of the plane R2. In this section, we’ll consider G as the set of nodes of a graph (the set being at most countable). Important aspects of the dependence among the random variables will be determined by the edges of the graph through a generalization of the Markov property.
NOTATION: Given a graph G, we say two nodes s,t are neighbors, denoted s∼t, if they are joined by an edge of the graph. We do not consider a node to be a neighbor of itself. N(t) the set of neighbors of t.
DEF: Suppose we are given a graph G with nodes {1,...,n} and a neighborhood structure N(t). The collection of random variables (X1,...,Xn) is a Markov random field on G if
P{Xt=xt∣Xs=xs for s≠t} = P{Xt=xt∣Xs=xs for s∈N(t)}
for all nodes t∈{1,...,n}.
More compact notation: for a subset of nodes A⊂G, let xA be the vector (xs:s∈A). We will also write p(xt∣xN(t)) for P{Xt=xt∣Xs=xs for s∈N(t)}. HMM’s are Markov random fields in which some of the random variables are observable and others are not. We adopt X for hidden random variables and Y for observed random variables.
3.2 Bayesian Framework
What do we get out of these models & how can we use them? One approach is Bayesian: HMM’s fit nicely in the Bayesian framework. X is the object of interest; it is unknown. For example, in modeling a noisy image, X could be the true image. We consider the unknown X to be random, and we assume it has a certain prior distribution. This distribution, our probabilistic model for X, is assumed to be a Markov random field. We also postulate a certain probabilistic model for Y conditional on X. This conditional distribution of Y given X reflects our ideas about the noise or blurring or whatever it is that transforms the hidden true image X into the image Y that we observe.
Given our assumed prior distribution of X and conditional distribution of (Y∣X), Bayes’ formula gives the posterior distribution of (X∣Y). Thus, given an observed value Y=y, in principle we get a posterior distribution P{X=⋅∣Y=y} over all possible true images, so that (again, in principle) we could make a variety of reasonable choices of our estimator of X. For example, we could choose the x that maximizes P{X=x∣Y=y}. This is called MAP estimation, where MAP stands for “maximum a posteriori”.
3.3 Hammersley - Clifford Theorem
How do we specify a Markov random field? By analogy with Markov chains, we might try to specify it in terms of conditional distributions. The following example suggests why this approach can go wrong.
EXAMPLE: Suppose we are designing a Markov random field for images on the 3x3 lattice:
For each pixel, let us specify the conditional distribution of its color given the colors of its neighbors. Suppose there are two colors, 0 and 1. It is possible to specify conditional distributions for each pixel that don’t work together, i.e., there might be no joint distribution having the given conditional distributions.
In general, we can’t expect to specify a full set of conditional distributions as above. Fortunately, the Hammersley-Clifford Theorem says that a random field’s having the Markov property is equivalent to its having a Gibbs distribution, which is a friendly sort of distribution. Thus, instead of worrying about specifying our MRF’s in terms of consistent conditional distributions, we can just consider Gibbs distributions, which are simple to write down and work with.
Some definitions needed to state HC:
DEF: A set of nodes C is complete if all distinct nodes in C are neighbors of each other. A clique is a maximal complete set of nodes.
DEF: Let G be a finite graph. A Gibbs distribution with respect to G is a probability mass function that can be expressed in the form
p(x) = ∏_{C complete} VC(x),
where each VC is a function that depends only on the values xC=(xs:s∈C) of x at the nodes in C. That is, the function VC satisfies VC(x)=VC(y) whenever xC=yC.
By combining functions VC for sets C that are subsets of the same clique, we see that we can further reduce the product in the definition of a Gibbs distribution to
p(x) = ∏_{C a clique} VC(x).
THM (HAMMERSLEY-CLIFFORD): Suppose that X=(X1,...,Xn) has a positive joint probability mass function. X is a Markov random field on G iff X has a Gibbs distribution with respect to G.
EXAMPLE: A Markov chain X0,...,Xn has joint distribution of the form p(x0,x1,...,xn)=π0(x0)P1(x0,x1)P2(x1,x2)...Pn(xn−1,xn).
By defining V{0,1}(x0,x1)=π0(x0)P1(x0,x1) and V{k−1,k}(xk−1,xk)=Pk(xk−1,xk) for k=2,...,n, we see that this product is a Gibbs distribution on the graph
(x0)−−(x1)−−(x2)−−...−−(xn)
PROP: Suppose (X,Y) is a Markov random field on the graph G with the neighborhood structure N. Write G=GX∪GY, where GX and GY are the sets of nodes in G corresponding to the X and Y random variables, respectively. Then the marginal distribution of Y is a Markov random field on GY, where two nodes y1,y2∈GY are neighbors if either
They were neighbors in the original graph, or
There are nodes x1,...,xk∈GX such that y1∼x1∼x2∼...∼xk∼y2.
The conditional distribution of X given Y is a Markov random field on the graph GX, where nodes x1 and x2 are neighbors if x1∼x2, that is, if x1,x2 were neighbors in the original graph.
3.4 Long range dependence in the Ising model
For this section, we will work in Zd. For each t∈Zd there is a binary random variable Xt taking values in {−1,1}. The Ising model gives a joint probability distribution for these random variables.
We consider a special case of the Ising model as follows.
For x a configuration of +1’s and −1’s at the nodes of a finite subset of Zd, let b(x) denote the number of “odd bonds” in x, that is, the number of edges {t,u} such that xt≠xu. Then, under the Ising model, a configuration x has probability proportional to α^{b(x)}, where α is a positive parameter of the distribution (typically <1). The choice α=1 corresponds to the uniform distribution, giving equal probability to all configurations. Distributions with small α strongly discourage odd bonds, placing large probability on configurations with few odd bonds.
For the case d=1, the model corresponds to a stationary Markov chain with probability transition matrix
Pα = [ 1/(1+α)  α/(1+α) ; α/(1+α)  1/(1+α) ].
Basic Limit Theorem tells us
Pα^n → [ 1/2  1/2 ; 1/2  1/2 ] as n→∞.
Thus, the state X0 is asymptotically independent of information at nodes far away from 0.
We will show that the situation in d≥2 is qualitatively different, in that the effect of states at remote nodes does not disappear in 2 dimensions.
To state the result, some notation:
Imagine a “cube” Kn of side length 2n centered at 0 in Zd, consisting of lattice points whose d coordinates all lie between −n and n:
Kn={t∈Zd:∣ti∣≤n,∀i=1,...,d}.
Let Bn denote the boundary points of the cube Kn, that is, the points having at least one coordinate equal to n.
Let P+(n){X=x} denote the Ising probability of X=x, conditional on Xt=+1 for all t∈Bn. Similarly, let P−(n){X=x} denote probabilities conditional on −1’s on the boundary.
THM: For the Ising model on Z2, the effect of the boundary does not disappear. In particular, there exists α such that P+(n){X0=−1} remains below 0.4 for all n, no matter how large.
3.5 Hidden Markov Chains
3.5.1 Model overview
The hidden Markov chain model is an MRF in which some of the random variables are observed and others are not (hence hidden). In the graph structure for the hidden Markov chain, the hidden chain is X0,X1,...,Xn and the observed process is Y0,Y1,...,Yn.
Edges join Xt to both Yt and Xt+1. The model is parametrized by a marginal distribution ζ of X0 and, if we assume time-homogeneity of the transition matrices, by two transition matrices A and B, where A(i,j)=P{Xt+1=j∣Xt=i} and B(i,j)=P{Yt=j∣Xt=i}.
Let us write θ=(ζ,A,B) for the vector of all parameters. If there are u states possible for each of the hidden X random variables and v outcomes possible for the observed Y random variables, then ζ is a vector of u probabilities, A is a u×u probability transition matrix, and B is a u×v matrix, each of whose rows is a probability mass function.
The hidden Markov chain is pretty general. Examples of cases include
If u=1, so that there is only one hidden state possible for each X, then A is just the 1×1 transition matrix (1), B is 1×v, and the Y process is simply an iid sequence, where each Yt has the probability mass function given by the single row of B. So iid sequences are a special case of the hidden Markov chain model
If u=v and B is the identity matrix, then Yt=Xt for all t, so Markov chains are a special case of the hidden Markov chain
3.5.2 How to calculate likelihoods
The likelihood function L=L(θ) is the probability of the observed data, as a function of the parameters of the model. The tricky aspect is that we observe only the Y’s, so that
L(θ) = Pθ{Y0=y0,...,Yn=yn} = ∑_{x0,...,xn} pθ(x0,...,xn,y0,...,yn).
This is a large sum. Even if the size of the state space of the hidden variables is just u=2, the sum has 2^{n+1} terms. Without a way around this computational issue, the hidden Markov chain model would be of little practical use. However, it turns out that we can do these calculations in time linear in n.
We will denote the state space for the hidden variables Xt by X.
We are thinking of the observed Y values as fixed here–we know them, and will denote them by y0,...,yn. For each t=0,1,..,n and for each xt∈X, define
αt(xt)=pθ(y0,..,yt,xt).
We can calculate the function α0 immediately:
α0(x0)=pθ(y0,x0)=ζ(x0)B(x0,y0).
Note the simple recursion that expresses αt+1 in terms of αt:
αt+1(xt+1) = ∑_{xt∈X} αt(xt) A(xt,xt+1) B(xt+1,yt+1).
Compared with the huge sum above, the sum here, over the u elements of X, is fairly modest. Using the recursion to calculate the function αt+1 from αt involves a fixed amount of work, and the task gets no harder as t increases. Thus, the amount of work to calculate all of the probabilities αt(xt) for t=0,...,n and xt∈X is linear in n.
Having completed the recursion to calculate the function αn, the likelihood is simply
L(θ) = pθ(y0,...,yn) = ∑_{xn∈X} αn(xn).
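A minimal implementation sketch of this forward recursion (my own code; the two-state, two-symbol model is made up):

```python
import numpy as np

def forward_likelihood(zeta, A, B, y):
    """Forward recursion for a hidden Markov chain.

    zeta: length-u initial distribution of X_0
    A:    u x u transition matrix of the hidden chain
    B:    u x v emission matrix, B[i, k] = P{Y_t = k | X_t = i}
    y:    observed sequence y_0, ..., y_n
    Returns (alpha, L) where alpha[t, i] = p_theta(y_0,...,y_t, X_t = i) and L is the likelihood."""
    n, u = len(y), len(zeta)
    alpha = np.zeros((n, u))
    alpha[0] = zeta * B[:, y[0]]                    # alpha_0(x_0) = zeta(x_0) B(x_0, y_0)
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]  # the recursion above
    return alpha, alpha[-1].sum()                   # L(theta) = sum over x_n of alpha_n(x_n)

zeta = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
y = [0, 0, 1, 0, 1, 1]
alpha, L = forward_likelihood(zeta, A, B, y)
print(L)
```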
Without an organized method for changing θ=(ζ,A,B), it is hard to find the θ that maximizes the likelihood. The EM algorithm is a method for finding maximum likelihood estimates that is applicable to many statistical problems, including hidden Markov chains.
PROP: For p=(p1,...,pk) and q=(q1,...,qk) two probability mass functions on {1,...,k}, we have
∑_i pi log pi ≥ ∑_i pi log qi.
A brief description of the EM algorithm. We’ll assume X,Y are discrete random variables or vectors, so probabilities are given by sums rather than integrals.
We want to find the θ maximizing logL(θ) where L(θ)=pθ(y)=∑xpθ(x,y). The EM algorithm repeats the following update that is guaranteed to increase the likelihood at each iteration.
Let θ0 denote the current value of θ. Replace θ0 by θ1, the value of θ that maximizes
Eθ0[logpθ(X,y)∣Y=y].
Why does it work? For a given θ0, define g(θ)=Eθ0[logpθ(X,y)∣Y=y]. We will see that, in order to have pθ1(y)>pθ0(y), we do not need to find θ1 maximizing g, but rather it is enough to find a θ1 such that g(θ1)>g(θ0).
PROP: If Eθ0[logpθ1(X,y)∣Y=y]>Eθ0[logpθ0(X,y)∣Y=y], then pθ1(y)>pθ0(y).
3.5.4 Applying the EM algorithm to a hidden Markov chain
Consider a hidden Markov chain for (X,Y)=(X0,...,Xn,Y0,...,Yn), where the r.v.s Yt are observed and the r.v.s Xt are hidden. The model is parametrized by θ=(ζ,A,B) for ζ the distribution of X0, A the prob transition matrix for X chain and B the prob transition matrix from Xt to Yt.
To describe one iteration of the EM method, imagine our current guess for θ is θ0=(ζ0,A0,B0). We want a new guess θ1=(ζ1,A1,B1) that has higher likelihood.
See book for more… too much to copy.
In summary, to do EM on hidden Markov chain model:
Start with some choice of parameter values θ0=(ζ0,A0,B0).
Calculate forward and backward probabilities αt(i),βt(i) for t=0,1,...n and i∈X (all with θ taken as θ0).
Calculate the quantities γt(i,j)=Pθ0{Xt=i,Xt+1=j∣y} for t∈{0,...,n−1},i∈X,j∈X. These could all be stored in a u×u×n array.
Terms like “Markov chain Monte Carlo” and “Markov sampling” refer to methods for generating random samples from given distributions by running Markov chains. Although such methods have quite a long history, they have become the subject of renewed interest in the last decade, particularly with the introduction of the “Gibbs sampler” by Geman and Geman (1984), who used the method in a Bayesian approach to image reconstruction. The Gibbs sampler itself has enjoyed a recent surge of intense interest within the statistics community, spurred by Gelfand and Smith (1990), who applied the Gibbs sampler to a wide variety of inference problems.
Recall that a distribution π being “stationary” for a Markov chain X0,X1,... means that, if X0∼π, then Xn∼π for all n. The basic phenomenon underlying all Markov sampling methods is the convergence in distribution of a Markov chain to its stationary distribution: If a Markov chain X0,X1,... has stationary distribution π, then under the conditions of the Basic Limit Theorem, the distribution of Xn for large n is close to π. Thus, in order to generate an observation from a desired distribution π, we find a Markov chain X0,X1,... that has π as its stationary distribution. The Basic Limit Theorem then suggests that running or simulating the chain until a large time n will produce a random variable Xn whose distribution is close to the desired π. By taking n large enough, in principle we obtain a value that may for practical purposes be considered a random draw from the distribution π.
The Gibbs sampler is a way of constructing a Markov chain having a desired stationary distribution. A simple setting that illustrates the idea involves a probability mass function π of the form π(x,y). Suppose we want to generate a random vector (X,Y)∼π. Denote the conditional probability distributions by π(⋅∣X=⋅) and π(⋅∣Y=⋅). To perform a Gibbs sampler, start with any initial point (X0,Y0). Then generate X1 from the conditional distribution π(⋅∣Y=Y0), and generate Y1 from the conditional distribution π(⋅∣X=X1). Continue on in this way, generating X2 from the conditional distribution π(⋅∣Y=Y1) and Y2 from the conditional distribution π(⋅∣X=X2), and so on. Then the distribution π is stationary for the Markov chain {(Xn,Yn):n=0,1,…}.
To see this, suppose (X0,Y0)∼π. In particular, Y0 is distributed according to the Y-marginal of π, so that, since X1 is drawn from the conditional distribution of X given Y=Y0, we have (X1,Y0)∼π. Now we use the same reasoning again: Y1 is distributed according to the Y-marginal of π, so that (X1,Y1)∼π. Thus, the Gibbs sampler Markov chain {(Xn,Yn):n≥0} has the property that if (X0,Y0)∼π then (X1,Y1)∼π—that is, the distribution π is stationary.
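A minimal sketch (not from the book) of this two-variable Gibbs sampler for a made-up joint pmf π on a 3×3 grid of values; the empirical frequencies of the sampled pairs should approach π.

```python
import numpy as np

rng = np.random.default_rng(3)
pi = rng.random((3, 3))        # hypothetical joint pmf pi(x, y) on {0,1,2} x {0,1,2}
pi /= pi.sum()

def gibbs(pi, n_steps, rng):
    """Gibbs sampler for a bivariate pmf pi: alternate draws from pi(x | y) and pi(y | x)."""
    x, y = 0, 0
    samples = []
    for _ in range(n_steps):
        px_given_y = pi[:, y] / pi[:, y].sum()   # conditional of X given Y = y
        x = rng.choice(len(px_given_y), p=px_given_y)
        py_given_x = pi[x, :] / pi[x, :].sum()   # conditional of Y given X = x
        y = rng.choice(len(py_given_x), p=py_given_x)
        samples.append((x, y))
    return samples

counts = np.zeros_like(pi)
for x, y in gibbs(pi, 50_000, rng):
    counts[x, y] += 1
print(np.round(counts / counts.sum(), 3))   # empirical frequencies...
print(np.round(pi, 3))                      # ...approach the target pi
```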
Simulating a Markov chain is technically and conceptually simple. We just generate the random variables in the chain, in order, and we are done. However, the index set of a Markov random field has no natural ordering in general. This is what causes iterative methods such as the Gibbs sampler to be necessary.
The Gibbs sampler is well suited to Markov random fields, since it works by repeatedly sampling from the conditional distribution at one node given the values at the remaining nodes, and the Markov property is precisely the statement that these conditional distributions are simple, depending only on the neighbors of the node.
4 – Martingales
4.2 Definitions
Notation: Given a process W={Wk}, let Wm,n denote the portion Wm,Wm+1,...,Wn of the process from time m up to time n.
DEF: A process M0,M1,..., is a martingale if
E[Mn+1∣M0,n]=Mn for each n≥0.
Alternatively,
A process M0,M1,..., is a martingale w.r.t another process W0,W1,... if
E[Mn+1∣W0,n]=Mn for each n≥0.
The crux of the definition E[Mn+1∣W0,n]=Mn is a “fair game” sort of requirement. If we are playing a fair game, we expect neither to win nor to lose money on the average. Given the history of our fortunes up to time n, our expected fortune Mn+1 at the future time n+1 should just be the fortune Mn that we have at time n.
A minor technical condition: we also require E∣Mn∣<∞ for all n, so that the conditional expectations in the definition are guaranteed to be well-defined.
How about submartingales and supermartingales? These are processes that are “better than fair” and “worse than fair,” respectively.
DEF: A process X0,X1,... is a submartingale w.r.t. a process W0,W1,..., if E[Xn+1∣W0,n]≥Xn for each n≥0.
{Xn} is a supermartingale w.r.t. {Wn} if E[Xn+1∣W0,n]≤Xn for each n≥0.
4.3 Examples
See book
4.4 Optional Sampling
Optional sampling property is a “conservation of fairness” type property.
By the “fair game” property, E{Mn+1}=E{E[Mn+1∣W0,n]}=E{Mn} for all n. This implies that
EMn=EM0 for all times n≥0.
That is, one can stop at any predetermined time t, like t=8, and their winnings will be fair: EM8=EM0.
If one stops at a time that is not predetermined, but random (depending on the observed sample path of the game), fairness is still conserved in many cases, but not in all cases. The issue of optional sampling is this:
If T is a random time, that is, T is a nonnegative random variable, does the equality EMT=EM0 still hold?
There are two sorts of things that shouldn’t be allowed if we want fairness to be conserved:
shouldn’t be able to take an indefinitely long time to stop. For example, consider a simple symmetric random walk with equal probabilities of going +1 or −1. Stopping at time T1=inf{n: the random walk is at 2} is clearly not fair.
shouldn’t be able to “take back” moves, i.e., shouldn’t be able to have information up to time t, then go back to some time s<t and claim to have stopped at s.
In fact, ruling out these two sorts of behaviors leaves a class of random times T at which the optional sampling statement EMT=EM0 holds. Disallowing arbitrarily long times is done by assuming T bounded. Random times that disallow the gambler from peeking ahead into the future are called stopping times.
DEF: A random variable T taking values in the set {0,1,2,...,∞} is a stopping time w.r.t. the process W0,W1,... if for each integer k, the indicator random variable I{T=k} is a function of W0,k.
The main optional sampling result:
THM: Let M0,M1,... be a martingale w.r.t. W0,W1,..., and let T be a bounded stopping time. Then EMT=EM0.
Now we generalize by
applying to general supermartingales rather than martingales
replacing the times 0 and T with two stopping times S,T s.t. S≤T.
THM: Let X0,X1,... be a supermartingale with respect to W0,W1,... and let S and T be bounded stopping times with S≤T. Then EXT≤EXS.
4.5 Stochastic integrals and option pricing in discrete time
oh my great goodness
4.6 Martingale convergence
PROP: Let X0,X1,... be a nonnegative supermartingale, and let X0≤a, where a is a nonnegative number. Then for b>a, defining Tb=inf{t:Xt≥b}, we have P{Tb<∞}≤a/b.
Intuition: suppose we are playing a supermartingale with initial fortune X0≤a. Suppose we adopt the strategy of stopping once our fortune exceeds b. Our expected reward is at least b⋅P{Tb<∞}. But we expect to lose money on a supermartingale, so this expected reward should be no larger than our initial fortune. Then b⋅P{Tb<∞}≤a.
THM: A nonnegative supermartingale converges with probability 1.
4.7 Stochastic approximation
We observe the random variables
Yn=f(Xn)+ηn,
where the random variables X1,X2,... are subject to our choice and η1,η2,... are assumed to be iid “noise” random variables with mean 0 and variance 1. The function f is unknown, and we do not know or observe the noise r.v.s, only the Y’s.
Robbins-Monro iteration: a simple way of choosing Xn+1 given the previously observed Y values.
Some assumptions:
f has a unique but unknown zero x∗
f(x)<0 for x<x∗
f(x)>0 for x>x∗
We want to find x∗. Can we have a method of choosing values Xt such that Xt→x∗ as t→∞?
Suppose we are given a suitable sequence of positive numbers a0,a1,.... Then, given X0, we define X1,X2,... according to the recursion
Xn+1=Xn−anYn
This iteration is qualitatively reasonable. If Yn is positive, then based on our assumptions about f, we would guess that Xn>x∗. So we want Xn+1 to be smaller than Xn, which is what the recursion does (if Yn>0 then Xn+1<Xn), and vice versa.
How should we choose a0,a1,...? We want Xn to converge to something (x∗) as n→∞, so we need an→0 as n→∞.
THM: Let f:R→R and let E[(X0)2]<∞. Consider the sequence {Xn} generated by the recursion
Yn = f(Xn) + ηn
Xn+1 = Xn − anYn,
where we assume the following conditions:
- the r.v.s X0,η0,η1,... are independent, with η0,η1,... iid having mean 0 and variance 1.
- for some 1<c<∞, we have ∣f(x)∣≤c∣x∣ for all x (this incorporates the assumption f(0)=0).
- for all δ>0, inf_{∣x∣>δ} (x f(x)) > 0
- each an is a nonnegative number and ∑an=∞
- ∑an2<∞
See proof in paper
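A small simulation sketch (mine) of the Robbins-Monro iteration with an=1/(n+1) and a hypothetical f whose zero is at x∗=2 (a shifted version of the theorem's setting, where the zero is at 0):

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    """Hypothetical function with a unique zero at x* = 2: negative below it, positive above it."""
    return x - 2.0

x = 0.0                                  # X_0
for n in range(20_000):
    a_n = 1.0 / (n + 1)                  # a_n >= 0, sum a_n = infinity, sum a_n^2 < infinity
    y_n = f(x) + rng.normal()            # Y_n = f(X_n) + eta_n, noise with mean 0 and variance 1
    x = x - a_n * y_n                    # Robbins-Monro update
print(x)                                 # close to x* = 2
```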
5 – Brownian Motion
Previously: Markov processes in discrete time and discrete state space.
Markov processes with continuous sample paths are called diffusions. Brownian motion is the simplest diffusion.
5.1 Definitions
DEF: A standard Brownian motion (SBM) {W(t):t≥0} is a stochastic process having
1. continuous paths,
2. stationary, independent increments, and
3. W(t) ∼ N(0,t) for all t≥0.
To see more about Brownian motion, consider the intro primer, constructions, or (potentially) more on site.
6 – Diffusions and Stochastic Calculus
Previously discussed: Markov processes in discrete time with discrete state space. Also, Brownian motion as a special Markov process in continuous time with continuous sample paths.
Now: a more general continuous-time Markov process with continuous sample paths, called diffusions.
Essentially, there is an entire family of Brownian motions built off of the SBM (standard BM). In some sense, they are all the same, just adding a deterministic linear function (drift) or multiplying by a constant (scaling). This is analogous to how all normal random variables are obtained from a standard normal random variable by adding and multiplying by constants.
Therefore, we can’t really expect a general random process to be well modeled by BM. Unless we think it follows a linear trend, we can’t fit some (μ,σ2) BM to it and expect it to do well.
Diffusions are built up from Brownian motions in the same way that general differentiable functions are built from linear functions. They are the stochastic analog of getting solutions from differential equations (e.g., what if I didn’t know anything about the exponential function, and I wanted to model a function with the behavior x′(t)=−x(t)?).
So a diffusion is always behaving according to a Brownian motion, just continuously adjusting its drift μ and variance σ2 parameters according to its current state. We specify a diffusion by giving a rule for determining what Brownian motion we should be following as a function of the current state. That is, we specify two functions μ(⋅) and σ2(⋅); if the current state is Xt, we should be running a (μ(Xt),σ2(Xt)) Brownian motion.
While we can’t simulate a Brownian motion perfectly, we can get an arbitrarily good approximate simulation by working with a fine grid on the time axis. With diffusions, instead of doing the ideal continuous adjustment, we’ll hold values constant until we get to the next point on the grid. If we take the grid sufficiently thin, we’ll have a simulation that is a good approximation to the real thing.
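A minimal sketch (not from the book) of this grid approximation: over each step of length dt we freeze μ and σ2 at the current state and take one Brownian increment. The example drift and variance functions (an Ornstein-Uhlenbeck-like choice) are arbitrary.

```python
import numpy as np

def simulate_diffusion(mu, sigma2, x0, T, n_steps, rng):
    """Approximate a diffusion on a time grid: over each short step of length dt,
    run a (mu(x), sigma^2(x)) Brownian motion with the parameters frozen at the current state."""
    dt = T / n_steps
    xs = np.empty(n_steps + 1)
    xs[0] = x0
    for k in range(n_steps):
        x = xs[k]
        xs[k + 1] = x + mu(x) * dt + np.sqrt(sigma2(x) * dt) * rng.normal()
    return xs

rng = np.random.default_rng(5)
path = simulate_diffusion(lambda x: -x, lambda x: 1.0, x0=2.0, T=10.0, n_steps=10_000, rng=rng)
print(path[-1])
```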
6.1 Specifying a diffusion
DEF: A stochastic process that has the strong Markov property and almost surely continuous sample paths is called a diffusion process.
As with Markov chains, to specify a diffusion, we need
state space
initial distribution
probability transition structure
The state space will be an interval I with endpoints l and r (possibly −∞ and +∞). The initial distribution is a probability distribution on I. What about the probability transition structure?
Intuitively, we would do something like this:
at time t, check to see our current state Xt
evaluate the functions μ and σ2 at Xt
run a (μ(Xt),σ2(Xt)) Brownian motion for a tiny amount of time
So in the interior of the state space, which would be (l,r), the probability transition structure of a time-homogeneous diffusion is specified by two functions μ=μ(x) and σ2=σ2(x), satisfying the relations
E(X(t+h)−X(t)∣X(t)=x)=μ(x)h+o(h)
and
Var(X(t+h)−X(t)∣X(t)=x)=σ2(x)h+o(h)
as h↓0. Given the first relation, the second may equivalently be expressed in terms of the second moment:
E((X(t+h)−X(t))2∣X(t)=x)=σ2(x)h+o(h).
We call μ(⋅) “infinitesimal drift” or “infinitesimal mean function,” while σ2(⋅) is often called “infinitesimal variance” or “diffusion function.” These two functions alone are enough to specify the entire probability transition structure.
Also, let’s assume moments of the increments X(t+h)−X(t) higher than the second moment are negligible when compared with the first and second moments:
E(∣X(t+h)−X(t)∣p∣X(t)=x)=o(h) as h→0 for all p>2.
The behavior at the boundary points must be specified separately. There are different types of boundary behaviors, including absorbing, reflecting, etc.
6.2 A calculation with diffusions
Consider a time-homogeneous diffusion on the state space I with infinitesimal parameters μ=μ(x) and σ2=σ2(x).
Choose two points a,b in the interior of I. Start the diffusion (X(t)) off at some point x∈(a,b) at time 0 and let it run until the random time T at which it first attains the value a or b. Let g(⋅) be a “cost function” giving the rate g(x) at which “cost” accrues per unit time when the diffusion X is in state x. The total cost accrued by the diffusion up to time T is then the random variable
∫0Tg(Xt)dt.
What is the expected cost,
Ex[∫0Tg(Xt)dt]?
We use Ex to denote expectation for the diffusion started at X0=x.
First consider the special case where g is constant, e.g. g(x)=1 always. Then ∫_0^T g(Xt)dt=T, and the expected cost is Ex(T). In general, write w(x)=Ex[∫_0^T g(Xt)dt]. We claim that the function w satisfies the differential equation
μ(x)w′(x) + (1/2)σ2(x)w′′(x) = −g(x)
with the boundary conditions w(a)=w(b)=0.
Let’s do a heuristic derivation to justify this equation. Begin with
w(x) = Ex[∫_0^h g(Xt)dt + ∫_h^T g(Xt)dt]
= h g(x) + Ex[∫_h^T g(Xt)dt] + o(h)
for some extremely tiny h such that we are virtually certain that T>h. Using the tower property of conditional expectation,
Ex[∫_h^T g(Xt)dt] = Ex[ E[∫_h^T g(Xt)dt ∣ Xh] ] = Ex[w(Xh)] + o(h).
The last equality comes from the Markov property. If Xh=xh and T>h, the expected value of ∫_h^T g(Xt)dt is the same as the expected cost of running the diffusion until time T starting at time 0 from state xh. This is w(xh).
So w(x) = h g(x) + Ex[w(Xh)] + o(h). A Taylor expansion of w about x, together with the infinitesimal mean and variance relations, gives Ex[w(Xh)] = w(x) + μ(x)w′(x)h + (1/2)σ2(x)w′′(x)h + o(h). Substituting and dividing by h, this implies that μ(x)w′(x) + (1/2)σ2(x)w′′(x) + g(x) = 0, as desired.
6.3 Infinitesimal parameters of a function of a diffusion
A common question is as follows.
Suppose {Xt} is a Brownian motion with μX(x)=μ and σX2(x)=σ2. Define the geometric Brownian motion Yt=e^{Xt}. What are the infinitesimal parameters μY(⋅) and σY2(⋅) of the Y process?
PROP: Suppose X is a (μX,σX2) diffusion. Let f be a strictly monotone function that is twice continuously differentiable, and define Yt=f(Xt). Then Y is a diffusion with infinitesimal parameters
μY(f(x)) = μX(x)f′(x) + (1/2)σX2(x)f′′(x)  and  σY2(f(x)) = σX2(x)[f′(x)]^2.
Therefore, for geometric Brownian motion, with μX(x)=μ, σX2(x)=σ2, and f(x)=e^x (so that f′(x)=f′′(x)=e^x=y), we have
μY(y) = μe^x + (1/2)σ2e^x = y(μ + (1/2)σ2)
and
σY2(y)=y2σ2.
6.4 Kolmogorov’s backward and forward equations
What is the probability transition structure of diffusions?
In a Markov chain, the transition rule is specified by a matrix P. In continuous time, there is no “next time” after t, i.e., we cannot imagine doing something like π(n+1)=π(n)P as in a discrete time Markov chain. Can we talk about the process a tiny infinitesimal time later?
The probability transition rule will give the rate of change, per unit time, of the probability density function of the state. This is done with a partial differential equation.
Let X be a diffusion with infinitesimal mean and variance functions μ(⋅),σ2(⋅).
For the “backward” equation, fix a state y and define f(t,x) to be the density of Xt evaluated at y, given X0=x:
f(t,x)=fXt(y∣X0=x).
Kolmogorov’s backward equation says
∂t f(t,x) = μ(x)∂x f(t,x) + (1/2)σ2(x)∂xx f(t,x).
For the “forward” equation, fix an initial probability density for X0 and define g(t,y) to be the density of Xt evaluated at y. The forward equation describes the evolution of this density over time:
∂t g(t,y) = −∂y(μ(y)g(t,y)) + ∂yy((1/2)σ2(y)g(t,y)).
See book for derivation.
6.5 Stationary distributions
The forward equation can be used to find stationary distributions. If π(⋅) is a stationary density for the diffusion X, then starting the diffusion in that density, it should stay in that density. So we should expect that
v(t,y)=π(y)
should satisfy the forward equation. This would be
0 = −(d/dy)(μ(y)π(y)) + (1/2)(d^2/dy^2)(σ2(y)π(y)).
This is an ODE, so we can solve for stationary distributions by solving ODEs.
6.6 Probability flux for diffusions
P(Xt<x,Xt+h>x) can be thought of as the flux from (−∞,x) to (x,∞) over the time interval [t,t+h]. Similarly, the probability P(Xt>x,Xt+h<x) is the flux across x in the other direction. We are interested in the net flux across x,
P(Xt<x,Xt+h>x)−P(Xt>x,Xt+h<x).
This would be 0 for a stationary process.
Let v(t,ζ) be the density of Xt evaluated at the state ζ, so that
P(Xt∈dζ)=v(t,ζ)dζ.
For a small positive h, let Δ(ζ,y) denote the conditional density of the increment Xt+h−Xt, given that Xt=ζ, evaluated at y; that is, Δ(ζ,y)dy = P{Xt+h−Xt ∈ dy ∣ Xt=ζ}.
Consider the integrand as a function of ζ. For small h, only values of ζ very near to x will contribute significantly. Unrigorously, we will say we can do a Taylor expansion about ζ=x:
Dividing by time increment h and letting h→0, we see the rate of net probability flux across x at time t is given by
v(t,x)μ(x) − (1/2)∂x(v(t,x)σ2(x)).
We can use the net flux to get the forward equation (see book).
6.7 Quadratic Variation of Brownian Motion
An important property of standard Brownian motion W is that its quadratic variation over the interval [0,t] is t with probability 1.
Let’s start with quadratic variation for a nonrandom function f.
DEF: Let f be a real valued function defined at least on the interval [0,t]. The quadratic variation qt(f) of f over [0,t] is defined by
qt(f) = lim_{n→∞} ∑_{k=1}^{2^n} [f(kt/2^n) − f((k−1)t/2^n)]^2
if the limit exists (otherwise it is undefined).
Generally, this is uninteresting for the tame functions we see every day. In fact, if f is continuous and of bounded variation, then qt(f)=0 for all t.
Sample paths of SBM are continuous, but with probability 1 have quadratic variation t on the interval [0,t] and infinite total variation.
Let ΔWk,n denote the increment
ΔWk,n = W(kt/2^n) − W((k−1)t/2^n)
and let Qn = ∑_{k=1}^{2^n} (ΔWk,n)^2.
THM: With probability 1, Qn→t as n→∞. Additionally, Qn→t in mean square:
E((Qn−t)2)→0.
See book for proof.
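A quick numerical illustration (mine, not the book's): simulate W on the dyadic grid with 2^n points on [0,t] and compare Qn with t; summing |ΔW| also illustrates how the total variation blows up.

```python
import numpy as np

rng = np.random.default_rng(6)
t = 1.0
n = 16
N = 2 ** n
dt = t / N
# Simulate standard Brownian motion on the dyadic grid by drawing its N(0, dt) increments.
increments = rng.normal(0.0, np.sqrt(dt), size=N)
Q_n = np.sum(increments ** 2)           # sum of squared increments over [0, t]
total_variation = np.sum(np.abs(increments))
print(Q_n)                 # close to t = 1
print(total_variation)     # grows like 2^{n/2}, illustrating infinite total variation in the limit
```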
Side thought: let X(t)=μt+σW(t) be a (μ,σ2) Brownian motion. With probability 1, the quadratic variation of X on [0,t] is σ2t. Suppose we are observing the Brownian motion X and we want to estimate μ and σ2. If we observe it for a long time T, we can estimate the drift μ by the slope X(T)/T, and this estimate gets better as T increases. However, for estimating σ2, it is enough to observe X over any interval of time, and we can infer σ2 exactly…
6.8 Stochastic Differential Equations
First, let’s think about an ordinary, deterministic, first-order differential equation:
dXt/dt = f(Xt,t).
This is “Markov-like” in the sense that, given X(t), all past values are unnecessary for determining future behavior. Second-order equations do not have this property; we would need X(t−Δt) and X(t) to approximate X(t+Δt). In differential equations, one might convert a second-order equation into a first-order one by including X˙(t) in the state along with X(t). So we can convert “non-Markov” processes into “Markov” ones by including more information in the state. But this is impractical.
A simple deterministic DE might be dX/dt=rX, modeling exponential growth of a population. If we wanted to model a “noisy” growth rate, we might propose something like
dX(t)/dt = [r + N(t)]X(t),
for N a noise process (stochastic). More generally, consider equations of the form
dX(t)/dt = μ(Xt,t) + σ(Xt,t)Nt.
Some desirable assumptions to put on this noise process (characteristics of Gaussian white noise):
{Nt, t≥0} is a stationary Gaussian process
ENt=0.
Ns independent of Nt for s≠t.
Rewriting our equation as
dXt=μ(Xt,t)dt+σ(Xt,t)Ntdt
=:μ(Xt,t)dt+σ(Xt,t)dVt,
letting {Vt,t≥0} be a process with increments dVt=Ntdt, we see that {Vt,t≥0} is a Gaussian process with stationary independent increments. So it must be a Brownian motion. Then
dXt=μ(Xt,t)dt+σ(Xt,t)dWt,
where {Wt,t≥0} is a standard Brownian motion.
But what is dWt if all paths of BM are non differentiable? Hence we are very sad about stochastic calculus.
We will be interested in writing
Xt−X0=∫0tμ(Xs,s)ds+∫0tσ(Xs,s)dWs.
The first integral can be defined for each fixed path ω as an ordinary Riemann integral. So it is a random variable (a function of ω): ω ↦ ∫_0^t μ(Xs(ω),s)ds.
The second integral is weird and we will have to construct the stochastic integral to deal with it.
6.9 Stochastic Calculus and Ito’s Formula
A diffusion {Xt,t≥0} with infinitesimal parameters μ(x,t) and σ2(x,t) has stochastic differential
dX=μ(X,t)dt+σ(X,t)dW.
A version of Ito’s formula says if X is a stochastic process having stochastic differential dX and f is a suitably nice function, then the process Y=f(X) has stochastic differential
dYt = f′(Xt)dXt + (1/2)f′′(Xt)(dXt)^2,
where (dXt)2 is computed using the rules
(dt)(d(anything))=0
(dWt)2=dt
In a more general situation where Y may be of the form Y=f(X,t), Ito’s formula says to compute dY first with a Taylor expansion, keeping terms up to quadratic order, and then simplifying by using the rules above.
EXAMPLE:
Consider the Brownian motion with drift Xt=μt+σWt. The stochastic differential is
dXt=μdt+σdWt.
Define the geometric Brownian motion Yt=eXt. What are the infinitesimal mean and variance functions of the diffusion Y?
Y=f(X)=eX. Then f′(X)=f′′(X)=eX=Y.
dYt = f′(Xt)dXt + (1/2)f′′(Xt)(dXt)^2
= Y dXt + (1/2)Y(dXt)^2.
Note that (dXt)2=(μdt+σdWt)2, so
(dXt)2=μ2(dt)2+2μσ(dt)(dWt)+σ2(dWt)2
=0+0+σ2dt.
Plugging this in,
dYt=Yt(μ+(1/2)σ2)dt+σYtdWt.
So Y is a diffusion process with infinitesimal parameters μY(y)=(μ+(1/2)σ2)y and σY2(y)=σ2y^2.
The term Ito process refers to a stochastic process that has a stochastic differential of the form
dZt=XtdWt+Ytdt
for X,Y processes satisfying some conditions, roughly:
X,Y are adapted
Conditions that ensure X,Y not “too big.”
This somewhat concludes my review of the book. I would say I included fewer proofs than I would have liked, but LaTeXing through big Markov expansions is no fun. A couple sections are left incomplete.
Chapter 7 covers likelihood ratios. Chapter 8 covers extremes and Poisson clumping. Both are pretty short, and I might write separate notes on those.
For more detail on stochastic calculus, later notes will be better and explore the construction with more rigor.