MATH-4961,6961 / CSCI-4971,6971

A random variable is a measurable function $X : \{ S,{\cal S},\mu \}
\rightarrow M$, where S is a set, ${\cal S}$ is a sigma-algebra on S, and $\mu$ is a measure defined on ${\cal S}$ such that $\mu(S) =
1$. The set M is a metric space in general. Usually it is ${\bf
R}^n$ or a subset thereof. The triple $\{ S,{\cal S},\mu \}$ is called a ``probability space''.

A random variable induces a measure $\mu_X = \mu \circ X^{-1}$ on the target space, M. This is called the ``distribution of X''.

All of this is more formal and general than what we shall need for this course. $\mu$ is usually written as P or Pr, shorthand for ``probability''. S is often finite, in which case ${\cal S}$ is the collection of all subsets of S. In this case, it suffices to define P for all $s \in S$. S could be countably infinite, as in the non-negative integers. Finally, S could be ${\bf R}$, or an interval such as [0,1] or $[0,\infty)$. In this case, the measure is often expressed in terms of a ``density function'', f. Thus $P(a \leq X \leq b) = \int_a^b f$.

Independence

An event is a (measurable) subset of S or M. Events $e_1, e_2, \dots, e_n$ are called ``independent'' if every subcollection of them satisfies the product rule; in particular,

 \begin{displaymath}
P(e_1 \cap e_2 \cap \dots \cap e_n) = \prod_{i=1}^{n} P(e_i).
\end{displaymath} (1)

Note that equation 1 is written for events in S. If the $e_i$ were events in M, we would write

 \begin{displaymath}
\begin{array}{l}
P(X^{-1}(e_1) \cap X^{-1}(e_2) \cap \dots \cap X^{-1}(e_n)) = \prod_{i=1}^{n} P(X^{-1}(e_i)).
\end{array}\end{displaymath} (2)

Random variables are called independent if equation 2 holds for any collection of (measurable) subsets of M. When events or random variables are independent, people often say that ``they have nothing to do with one another''. This is imprecise. A better way to think about independence is to think ``product measure''. Random variables $X_1, X_2, X_3 \dots$ are called i.i.d. (independent and identically distributed) if they all have the same distribution and if any finite collection of them is independent.

Examples of independence and non-independence

EXAMPLE 1 X and Y are discrete random variables on the same space. The joint probabilities for all possible values of X and Y are given in Table 1.


 
Table 1: A table of all joint probability values for 2 independent finite valued random variables, X and Y. Values of X label the rows; values of Y label the columns.

          Y = 1    Y = 6    Y = 11   Y = 13
X = 3     0.12     0.09     0.06     0.03
X = 7     0.20     0.15     0.10     0.05
X = 9     0.08     0.06     0.04     0.02

How can we prove that X and Y are independent in Table 1? It's simple. If $p_{i,j}$ is the probability in row i and column j, then $p_{i,j} = \left( \sum_{k=1}^{4} p_{i,k} \right) \times \left( \sum_{k=1}^{3} p_{k,j} \right) \; \forall \; i,j$; that is, every entry is the product of its row sum and its column sum.

EXAMPLE 2 X and Y are discrete random variables on the same space. A new set of joint probabilities for the values of X and Y is given in Table 2.


 
Table 2: A table of all joint probability values for 2 non-independent finite valued random variables, X and Y. Values of X label the rows; values of Y label the columns.

          Y = 1    Y = 6    Y = 11   Y = 13
X = 3     0.2      0.1      0.1      0.1
X = 7     0.1      0.05     0.05     0.033
X = 9     0.033    0.1      0.034    0.1

How can we prove that X and Y are not independent in Table 2? Again, it's simple: find a counterexample. For example, P(X = 7) = 0.233 and P(Y = 6) = 0.25. The product of these numbers is 0.05825, which is not equal to the table entry 0.05.
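The following short program (a sketch added for illustration; it is not part of the original notes) runs this factorization check on both tables: every entry of Table 1 equals the product of its marginals, while Table 2 fails at the entry examined above.

   # Check p_{i,j} = P(X = x_i) * P(Y = y_j) for the joint tables above.
   def is_independent(joint, tol=1e-9):
       """joint[i][j] holds P(X = x_i, Y = y_j)."""
       row_sums = [sum(row) for row in joint]             # P(X = x_i)
       col_sums = [sum(col) for col in zip(*joint)]       # P(Y = y_j)
       return all(abs(joint[i][j] - row_sums[i] * col_sums[j]) < tol
                  for i in range(len(joint)) for j in range(len(joint[0])))

   table1 = [[0.12, 0.09, 0.06, 0.03],
             [0.20, 0.15, 0.10, 0.05],
             [0.08, 0.06, 0.04, 0.02]]
   table2 = [[0.2,   0.1,  0.1,   0.1],
             [0.1,   0.05, 0.05,  0.033],
             [0.033, 0.1,  0.034, 0.1]]

   print(is_independent(table1))   # True: every entry factors into its marginals
   print(is_independent(table2))   # False: e.g. 0.233 * 0.25 = 0.05825 != 0.05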

Mutually exclusive events

This concept is easier than that of independence. Events $e_1, e_2, \dots, e_n$ are called ``mutually exclusive'' $\iff \; \; e_i \cap
e_j = \emptyset \; \forall \; 1 \leq i < j \leq n$. In this case,

 \begin{displaymath}
P(e_1 \cup e_2 \cup \dots \cup e_n) = \sum_{i=1}^{n} P(e_i).
\end{displaymath} (3)

Note that equation 3 does not imply that the events are mutually exclusive, since overlaps of probability 0 need not be empty. In general,

 \begin{displaymath}
P(e_1 \cup e_2 \cup \dots \cup e_n) \leq \sum_{i=1}^{n} P(e_i).
\end{displaymath} (4)

is true.

Conditional probability and
Bayes Theorem

If A and B are events and P(B) > 0, then we define the ``conditional probability of A given B'', $P(A \;\arrowvert \; B)$, by

 \begin{displaymath}
P(A \;\arrowvert \; B) = \frac{P(A \cap B)}{P(B)}
\end{displaymath} (5)

Any such event, B, defines a new probability distribution, $P(\cdot \;\arrowvert \; B)$. Note that independence of A and B implies $P(A \;\arrowvert \; B) = P(A)$.

Bayes theorem (This bit of trivia should really be called ``Bayes lemma'' at best. Bayes was a very minor figure in probability theory and certainly doesn't deserve to have his name used so often.)

Equation 5 can be rewritten as:

 \begin{displaymath}
P(B) = \frac{P(A \cap B)}{P(A \;\arrowvert \; B)}
\end{displaymath} (6)

From this, it is easy to derive:

 \begin{displaymath}
\begin{array}{lcl}
P(B \; \arrowvert \; A) & = & \frac{P(A \cap B)}{P(A)} \\
& = & \frac{P(B)P(A \;\arrowvert \; B)}{P(A)}
\end{array}\end{displaymath} (7)

Example: Using Bayes Theorem

General Widgets (GW) is a company that manufactures widgets. Each widget needs 3 doodles. The doodles are manufactured by 4 different companies, C1, C2, C3 and C4. 10% of C1's doodles are defective, 5% of C2's doodles are defective, 2% of C3's doodles are defective and 1% of C4's doodles are defective. Because of price and availability, 40% of GW's doodles come from C1, 30% come from C2, 20% from C3 and the final 10% from C4. A widget fails if any one of its doodles is defective. The failures are independent of one another.

We first need to answer the question: ``What is the probability that a randomly selected doodle is defective?''

Let D be the event that a doodle is defective. Let $C_i$ be the event that the doodle was produced by company $C_i$, $1 \leq i \leq 4$. Then:

 \begin{displaymath}
\begin{array}{lcl}
P(D) & = & \displaystyle \sum_{i=1}^{4} P(D \cap C_{i}) \\
& = & \displaystyle \sum_{i=1}^{4} P(C_{i})P(D\vert C_{i}) \\
& = & 0.4 \times 0.1 + 0.3 \times 0.05 + 0.2 \times 0.02 + 0.1 \times 0.01 \\
& = & 0.06
\end{array}\end{displaymath}

If a doodle is defective, what is the probability that it was made by C1?


 \begin{displaymath}
\begin{array}{lcl}
P(C_1 \vert D) & = & \displaystyle \frac{P(D\vert C_{1})P(C_{1})}{P(D)} \\
& = & \displaystyle \frac{0.1 \times 0.4}{0.06} \\
& = & \displaystyle \frac{2}{3}
\end{array}\end{displaymath}
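As a quick sanity check, here is a small script (an illustrative sketch, not part of the original notes) that reproduces both calculations: the total probability 0.06 and the posterior 2/3.

   p_c = [0.40, 0.30, 0.20, 0.10]                # P(C_i)
   p_d_given_c = [0.10, 0.05, 0.02, 0.01]        # P(D | C_i)

   # Law of total probability: P(D) = sum_i P(C_i) P(D | C_i)
   p_d = sum(pc * pd for pc, pd in zip(p_c, p_d_given_c))
   print(p_d)                                    # 0.06

   # Bayes: P(C_1 | D) = P(D | C_1) P(C_1) / P(D)
   print(p_d_given_c[0] * p_c[0] / p_d)          # 0.666..., i.e. 2/3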

Mean, variance and characteristic functions

For a real valued random variable, X, we define its ``expectation'', or ``expected value'' as

\begin{displaymath}
\rm {\bf E}[X] = \int_{-\infty}^{\infty} x dX,
\end{displaymath} (8)

where dX refers to integration with respect to the measure induced on ${\bf R}$ by X. Note that $\rm {\bf E}[X]$ could be $+\infty$, $-\infty$, or simply undefined. If $\rm {\bf E}[X]$ exists, the ``variance of X'', $\sigma^2(X)$, is defined by

\begin{displaymath}
\sigma^2(X) = \int_{-\infty}^{\infty} (x - \rm {\bf E}[X])^2 dX.
\end{displaymath} (9)

This could also be infinite. In general, the nth moment of X is defined by

\begin{displaymath}
\rm {\bf E}[X^n] = \int_{-\infty}^{\infty} x^n dX.
\end{displaymath} (10)

The nth moment of X ``about the mean'' is defined by

\begin{displaymath}
\rm {\bf E}[(X-\rm {\bf E}[X])^n] = \int_{-\infty}^{\infty} (x-\rm {\bf E}[X])^n dX.
\end{displaymath} (11)

The ``characteristic function'', $\hat{X}$, is defined by

\begin{displaymath}
\hat{X}(t) = \rm {\bf E}[e^{itX}]
\end{displaymath} (12)

What good are characteristic functions?

1. $X_1 \; {\rm and} \; X_2$ have the same distribution $\iff \hat{X_1} = \hat{X_2}$.

2. If $X_1, X_2, X_3, \dots, X_n$ are independent and $Y = \sum_{i=1}^{n} X_i$, then

\begin{displaymath}
\hat{Y} = \prod_{i=1}^{n} \hat{X_i}.
\end{displaymath} (13)

The characteristic function is a very useful tool.

Some important probability distributions.

BINOMIAL DISTRIBUTION

Apart from the trivial case where X = 0 with probability 1, the simplest case of a random variable is one that takes on just 2 values. These are often taken as 0 and 1, although -1 and +1 are sometimes used. Sometimes, we might write ``heads'' and ``tails'' for the 2 possibilities, indicating that X represents the results of a coin toss. We'll adopt the following convention: X takes on the 2 values, 1 and 0, with probabilities p and q respectively, where p + q = 1. Let $Y_n = \sum_{i=1}^{n} X_i$, where the $X_i$ are independent copies of X. What is the distribution of $Y_n$? Clearly $Y_n$ can take on any value between 0 and n. In fact, simple combinatorics gives us:

\begin{displaymath}
P( Y_n = k ) = \frac{n!}{k!(n-k)!} \: p^{k}q^{n-k},
\end{displaymath} (14)

$\forall \; 0 \leq k \leq n$. The distribution is called ``binomial'' since $P(Y_n = k)$ is the kth term in the binomial expansion of $(p+q)^n$.
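The following sketch (added for illustration; the choices n = 10 and p = 0.3 are arbitrary) compares equation 14 with a direct simulation of $Y_n$ as a sum of n independent 0/1 tosses.

   import math, random

   def binom_pmf(n, k, p):
       # P(Y_n = k) = n!/(k!(n-k)!) p^k q^(n-k)
       return math.comb(n, k) * p**k * (1 - p)**(n - k)

   n, p, trials = 10, 0.3, 200_000
   random.seed(0)
   counts = [0] * (n + 1)
   for _ in range(trials):
       y = sum(1 for _ in range(n) if random.random() < p)   # Y_n = sum of the X_i
       counts[y] += 1

   for k in range(n + 1):
       print(k, round(binom_pmf(n, k, p), 4), round(counts[k] / trials, 4))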

POISSON DISTRIBUTION

Suppose that p is ``small'' and n is ``large''. To be more explicit, let's set $p = \lambda / n$ and see what happens as $n \rightarrow \infty $. Stirling's formula gives the limiting value of $P(Y_n = k)$ as $e^{-\lambda}\lambda^{k}/k!$. The characteristic function is $\exp(\lambda(e^{it}-1))$. It's easier to do this limiting process in ``reciprocal space''. First, $\hat{X_1}(t) = q + pe^{it}$. Thus $\hat{Y_n}(t) = (q + pe^{it})^n$. This approaches $\exp(\lambda(e^{it}-1))$ as $n \rightarrow \infty $.

This limiting distribution is called the ``Poisson distribution''. Think of it as the ``law of rare events''.
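A small numeric check (an added sketch; the choice $\lambda = 2$ is arbitrary) shows the binomial probabilities with $p = \lambda/n$ approaching the Poisson probabilities as n grows.

   import math

   lam = 2.0
   for n in (10, 100, 10_000):
       p = lam / n
       print("n =", n)
       for k in range(5):
           binom = math.comb(n, k) * p**k * (1 - p)**(n - k)
           poisson = math.exp(-lam) * lam**k / math.factorial(k)
           print("  k =", k, round(binom, 5), round(poisson, 5))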

UNIFORM DISTRIBUTION

This is given by a constant density function, $ f(x) = \frac{1}{b-a}$ for $x \in [a,b]$, where $a < b$, and f(x) = 0 otherwise.

GAUSSIAN DISTRIBUTION

Density is $\frac{1}{\sqrt{2\pi}\sigma} e^{-(x-\mu)^2/(2\sigma^2)}$. Note that $\mu$ and $\sigma^2$ are the mean and variance. This is called the ``normal distribution''. The notation ``N($\mu$,$\sigma^2$)'' refers to this distribution. If $\{X_i\}_{i=1}^{\infty}$ are i.i.d. with mean $\mu$ and variance $\sigma^2$, then

\begin{displaymath}
P \left ( a < \frac{\sum_{i=1}^{n} X_i - n\mu}{\sqrt{n\sigma^2}} < b \right )
\longrightarrow \int_a^b \frac{1}{\sqrt{2\pi}}\exp \left( -\frac{x^2}{2} \right) dx
\end{displaymath} (15)

as $n \rightarrow \infty $. This is known as the central limit theorem and is often invoked to justify the use of a normal distribution.
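Here is an added sketch of equation 15 in action, using i.i.d. uniform [0,1] variables (mean 1/2, variance 1/12); the fraction of standardized sums landing in (a,b) approaches the standard normal probability of (a,b).

   import math, random

   def std_normal_prob(a, b):
       # integral of the N(0,1) density over (a, b), via the error function
       phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
       return phi(b) - phi(a)

   random.seed(1)
   n, trials, a, b = 50, 100_000, -1.0, 1.0
   mu, sigma2 = 0.5, 1.0 / 12.0
   hits = 0
   for _ in range(trials):
       s = sum(random.random() for _ in range(n))
       if a < (s - n * mu) / math.sqrt(n * sigma2) < b:
           hits += 1
   print(hits / trials, std_normal_prob(a, b))   # both close to 0.6827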

EXPONENTIAL AND GAMMA DISTRIBUTIONS

The density of the exponential distribution is $\alpha e^{-\alpha x}$ for $x
\geq 0$. The mean and variance are $1/\alpha$ and $1/\alpha^2$, respectively.

The density of the gamma distribution, for $x \geq 0$, is given by

\begin{displaymath}
f_{\alpha,\nu}(x) = \frac{1}{\Gamma(\nu)} \alpha^{\nu} x^{\nu-1}
e^{-\alpha x}
\end{displaymath} (16)

Note that $\nu = 1$ yields the exponential distribution. If X and Y are independent random variables with gamma distributions $f_{\alpha,\mu}$ and $f_{\alpha,\nu}$, then X+Y has distribution $f_{\alpha,\mu+\nu}$. Thus sums of independent exponentially distributed random variables have gamma distributions with increasing parameter $\nu$.

GEOMETRIC DISTRIBUTION

This is the discrete equivalent of the exponential distribution. $P(X = k) = q^{k-1}p$ for $k \geq 1$, for some constant $p \in (0,1)$, q = 1 - p. A simple calculation shows that $P(X > k) = q^{k}$.
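A quick numeric check of this tail formula (an added sketch; the values p = 0.3 and k = 5 are arbitrary):

   p, k = 0.3, 5
   q = 1 - p
   tail = sum(q**(j - 1) * p for j in range(k + 1, 2000))   # truncated infinite sum
   print(tail, q**k)                                        # both approximately 0.16807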

INFORMATION CONTENT:
ENTROPY AND RELATIVE ENTROPY

For a random variable X with discrete probability values $p_i = P(X = x_i)$, the ``entropy'' is defined by $H(X) = - \sum_i p_i \log p_i$. The sum is replaced by an integral for non-discrete probabilities. Since $\lim_{x\rightarrow 0} x \log x = 0$, there is no problem in taking ``$0 \log 0$'' to be 0. Also, the general $\log$ is written here. In statistical mechanics applications, it is natural to take the natural logarithm. For molecular sequence analysis, base 2 is appropriate and the entropy is then said to be expressed in ``bits''. In any case, the choice of base only changes the result by a constant scale factor.
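For concreteness, here is a small added sketch that computes H(X) in bits for a few discrete distributions (the distributions themselves are arbitrary examples):

   import math

   def entropy_bits(probs):
       return -sum(p * math.log2(p) for p in probs if p > 0)   # take 0 log 0 = 0

   print(entropy_bits([0.5, 0.5]))    # 1.0 bit: a fair coin
   print(entropy_bits([0.25] * 4))    # 2.0 bits: uniform on 4 states (log2 4)
   print(entropy_bits([0.9, 0.1]))    # about 0.469 bits: less spread out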

If X takes on finitely many values, n, a trite calculation using Lagrange multipliers shows that H(X) is maximized when $p_i = 1/n \; \forall \; i$. In this case, the entropy is $\log n$. Entropy is maximized when the distribution is ``most spread out''. Also, note that this maximal entropy increases with the number of states.

Entropy was used by Shannon in his effort to understand the information content of messages. Let's consider the following problem. We're receiving a message that's been encoded into bits (0's and 1's). If there were no noise (no error) at all, we would have 1 bit of information per bit received. Think of the signal as $Y_i + X_i$. $Y_i$ is the pure signal with no error. It is ``deterministic''. $X_i$ is the ``random error''. It is 0 (no error) with probability 1-p and 1 (bit swap error) with probability p. The entropy is $-(1-p)\log(1-p) - p \log p$. This approaches 0 as $p \rightarrow 0$. That is, small p implies small error. For p = 1/2, the entropy is $\log_2 2 = 1$ bit. That is, all information is lost. Note that p = 1 is also useful, since we can simply swap bits.

In this course, we will be computing entropy for ``sample distributions''.

For 2 random variables, X and Y, define the ``relative entropy'' of X with respect to Y as

\begin{displaymath}
H(X \Arrowvert Y) = \sum_i p_i \! \log \! \frac{p_i}{q_i}
\end{displaymath} (17)

where $P(X = x_i) = p_i$ and $P(Y = y_i) = q_i$. This is also known as the Kullback-Leibler ``distance'', although it is not even symmetric, let alone a distance. Relative entropies will arise later in this course as expected values for scores used in molecular sequence alignment. Note that X and Y are ``dummy'' variables; the relative entropy is really defined on 2 probability measures.

Finally we define the ``mutual information'' of X with respect to Y as


\begin{displaymath}
M(X,Y) = \sum_{i,j} P(X\!=\!x_i,Y\!=\!y_j) \log
\frac{P(X\!=\!x_i,Y\!=\!y_j)}{P(X\!=\!x_i)P(Y\!=\!y_j)}
\end{displaymath} (18)

Mutual information is the relative entropy between the joint probability distribution of X and Y and the product of their marginal distributions. As such, it is a good measure of the dependence between X and Y: it is 0 if and only if X and Y are independent.
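The following added sketch computes the mutual information of equation 18 directly from a joint probability table, here Table 2, by applying equation 17 to the joint distribution and the product of the marginals; the result is positive, reflecting the dependence found earlier.

   import math

   def relative_entropy(p, q):
       # H(P || Q) = sum_i p_i log2(p_i / q_i), in bits
       return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

   def mutual_information(joint):
       rows = [sum(r) for r in joint]            # P(X = x_i)
       cols = [sum(c) for c in zip(*joint)]      # P(Y = y_j)
       nr, nc = len(rows), len(cols)
       flat = [joint[i][j] for i in range(nr) for j in range(nc)]
       prod = [rows[i] * cols[j] for i in range(nr) for j in range(nc)]
       return relative_entropy(flat, prod)

   table2 = [[0.2,   0.1,  0.1,   0.1],
             [0.1,   0.05, 0.05,  0.033],
             [0.033, 0.1,  0.034, 0.1]]
   print(mutual_information(table2))   # > 0, since X and Y are not independent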

Coin tossing and longest runs

Infinite tossing of a consistent coin.

Let $\{X_i\}_{i=1}^{\infty}$ be a series of i.i.d. random variables such that $P(X_i = \; 1) = p$ and $P(X_i = \; 0) = q$, where 0 < p < 1 and q = 1 - p. X is 1 for a head and is 0 for a tail.

We want to look at this series in a novel way. Let $R_i$ be the length of the ith ``run of heads'', to be called a ``run'' henceforth. By definition, a run begins at the first position, or at the first position after a T. A run ends as soon as a T is encountered. The length of a run is the number of heads in the run; the terminating tail is not counted. This means that we are including runs of zero length.

Another way of stating this is by defining $B_i$ to be the beginning index of the ith run. Then $B_1 = 1$ by definition and $B_i = B_{i-1} + R_{i-1} + 1$ for i > 1. Note that $R_1 = 0$ if $X_1 = 0$ (the first toss is a tail).

The beginnings of runs for a particular sequence of coin tosses are shown below using the symbol |.

     2   4   6   8  10  12  14  16
   T T H H H T H H H H T T H T T H H ...
   | | |       |         | |   | |

   i       1   2   3   4   5   6   7   8  ...
   B_i     1   2   3   7  12  13  15  16
   R_i     0   0   3   4   0   1   0   ?

The individual tosses are independent and, for the infinite sequence, the runs, each one ending with a tail, are also independent. This means that the random variables $R_i$ are independent. For a sequence of finite length, n, however, the $R_i$s are not independent. In particular, we cannot say when the last run ends unless $X_n = 0$.

What is the distribution of the $R_i$s? They have a common distribution. In fact, $P(R_i = k) = p^{k}q \; \forall \; k \geq 0$. That is, $R_i + 1$ has the geometric distribution with parameter q: $R_i + 1$ is the number of tosses until a T is encountered.

In particular,

 
 \begin{displaymath}
\begin{array}{lcl}
P(R_i \geq k) & = & \displaystyle \sum_{h=k}^{\infty} qp^{h} \\
& = & \displaystyle \frac{qp^{k}}{1-p} \\
& = & p^{k} \; \; \; \forall \; k \geq 0,
\end{array}\end{displaymath} (19)

which will be needed below.

The expectation is given by

 
 \begin{displaymath}
\begin{array}{lcl}
\rm {\bf E}[R_i] & = & \displaystyle q \sum_{h=0}^{\infty} hp^{h} \\
& = & \displaystyle qp \sum_{h=0}^{\infty} hp^{h-1} \\
& = & \displaystyle \frac{qp}{(1-p)^2} \\
& = & p/q.
\end{array}\end{displaymath} (20)

Equation 20 follows from a simple calculation, equation 21, using an infinite geometric progression and differentiation.
 
 \begin{displaymath}
\sum_{h=0}^{\infty} x^{h} = \frac{1}{1-x} \; \Longrightarrow \;
\sum_{h=0}^{\infty} hx^{h-1} = \frac{1}{(1-x)^{2}}, \; \; \; |x| < 1.
\end{displaymath} (21)

We can now very easily answer the question ``What is the expected number of runs of any particular size in n tosses?''. A run of size exactly k may begin anywhere with the same probability $q^2p^k$, except at the first position. In any case, the expected number of such runs is simply $q^2p^k(n-k+1)$. If we let $k = \log_{1/p} n + 2\log_{1/p} q = \log_{1/p}(q^2 n)$ and think of n as ``large'', then:

 \begin{displaymath}
q^2p^k(n-k+1) = 1 + {\cal O} \left ( \frac{\log n}{n} \right ).
\end{displaymath} (22)

The intuition here is that $\log_{1/p} n$ is ``about right''. We get 1 expected run of this size.
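A simulation sketch (added here; n, p and k are arbitrary choices) that checks equation 20 and the expected count of runs of size exactly k:

   import random

   random.seed(2)
   p, n, k = 0.5, 1_000_000, 10
   q = 1 - p
   tosses = [random.random() < p for _ in range(n)]     # True = head

   runs, current = [], 0
   for t in tosses:
       if t:
           current += 1
       else:                    # a tail ends the current run (possibly of length 0)
           runs.append(current)
           current = 0

   print(sum(runs) / len(runs), p / q)                          # mean run length vs p/q
   print(sum(1 for r in runs if r == k), q * q * p**k * (n - k + 1))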

Using a fixed number of coin tosses, n, results in run lengths that are not independent and that do not end cleanly at n. We will now replace the coin toss by a ``Poissonized version'' that has much better properties and that approximates the true coin toss sufficiently well for our conclusions to transfer.

Let $\{R'_i\}_{i=1}^{\infty}$ be a series of i.i.d. random variables with the common ``run'' distribution discussed above (equations 19 & 20). Let N be a Poisson random variable independent of the R's and with mean nq. Consider the ``Poissonized'' coin toss:

$R'_1(H)TR'_2(H)TR'_3(H)TR'_4(H)T \dots R'_N(H)T$

consisting of the empty coin toss if N = 0, where $R'_i(H)$ means ``insert $R'_i$ heads in this position''. The expected length of this ``Poissonized'' coin toss is:

 
 \begin{displaymath}
\begin{array}{lcl}
\rm {\bf E} \left[ \sum_{i=1}^{N} (R'_i + 1) \right] & = & \displaystyle e^{-qn} \sum_{j=1}^{\infty} \frac{(nq)^j}{j!} \, j \, \rm {\bf E}[R'_i + 1] \\
& = & \displaystyle e^{-qn} \sum_{j=1}^{\infty} \frac{(nq)^j}{j!} \, j(p/q + 1) \\
& = & \displaystyle n e^{-qn} \sum_{j=1}^{\infty} \frac{(nq)^{j-1}}{(j-1)!} \\
& = & n.
\end{array}\end{displaymath} (23)

This is as expected and by design. Now let $L_N$ be the length of the longest run in the Poissonized coin toss. $L_N = 0$ when N = 0.

Then

 \begin{displaymath}
\begin{array}{lcl}
P(L_N < k) & = & P( R'_j < k \; \; \forall \; 1 \leq j \leq N) \\
& = & \displaystyle e^{-qn} \sum_{j=0}^{\infty} \frac{(qn)^j}{j!} (1-p^k)^j \\
& = & e^{-qn}e^{qn(1-p^k)} \\
& = & e^{-nqp^k}.
\end{array}\end{displaymath}

We now replace k by $\log_{1/p} qn + \lambda$. We think of $n \rightarrow \infty $ while $\lambda $ is fixed. Then

 \begin{displaymath}
P(L_N - \log_{1/p}(qn) < \lambda) = e^{-p^{\lambda}}.
\end{displaymath} (24)

The distribution in equation 24 is an example of the ``extreme value distribution''.

Its density is given by differentiation and is $f(\lambda) = -\ln{p} \: p^{\lambda} e^{-p^{\lambda}}$. Moments are given by:

 \begin{displaymath}
\int_{-\infty}^{+\infty} \lambda^k f(\lambda) d\!\lambda =
\frac{1}{(\ln p)^k} \int_{0}^{+\infty} (\ln{t})^k e^{-t} dt
\end{displaymath} (25)

To evaluate the mean and variance, we digress briefly to consider the ``gamma function''. If the real part of z is > 0, we can write

\begin{displaymath}
\Gamma(z) = \int_{0}^{\infty} t^{z-1} e^{-t} dt.
\end{displaymath} (26)

$\Gamma$ is analytic in the right half plane and we can easily compute the nth derivative as:

\begin{displaymath}
\Gamma^{(n)}(z) = \int_{0}^{\infty} (\ln t)^{n} t^{z-1} e^{-t} dt,
\end{displaymath} (27)

so that the integral on the right-hand side of equation 25 is $\Gamma^{(n)}(1)$, and the nth moment is $\Gamma^{(n)}(1)/(\ln p)^{n}$. Note that $\Gamma(1) = 1$. A well known power series for $\ln \Gamma(1+z)$ is
 
\begin{displaymath}
\ln \Gamma(1+z) = -\ln(1+z) + z(1-\gamma)
+ \sum_{n=2}^{\infty}(-1)^{n}[\zeta(n)-1]z^{n}/n,
\end{displaymath} (28)

where $\gamma$ is Euler's constant:
\begin{displaymath}
\gamma = \lim_{n \rightarrow \infty} \left( \sum_{i=1}^{n} \frac{1}{i} - \ln{n} \right)
= 0.5772156649\dots,
\end{displaymath} (29)

and

\begin{displaymath}
\zeta(z) = \sum_{k=1}^{\infty} 1/k^z, \; \; {\rm Re}\, z > 1,
\end{displaymath} (30)

is the ``Riemann zeta function''.

Differentiation of equation 28 yields:

 
\begin{displaymath}
\frac{\Gamma'(1+z)}{\Gamma(1+z)} = -\frac{1}{1+z} + 1-\gamma
+ \sum_{n=2}^{\infty}(-1)^{n}[\zeta(n)-1]z^{n-1}.
\end{displaymath} (31)

In particular, $\Gamma'(1) = -\gamma$. Differentiating again gives, with z=0,
 
\begin{displaymath}
\Gamma''(1) - \gamma^2 = 1 + [\zeta(2)-1] = \pi^2/6.
\end{displaymath} (32)

The end result of these calculations is that

\begin{displaymath}
\rm {\bf E}[L_N]= \log_{1/p}(qn) + \gamma/\ln{1/p}
\end{displaymath}

and

\begin{displaymath}
\sigma^2(L_N)= \frac{\pi^2}{6(\ln\!{1/p})^2}.
\end{displaymath}

What we have not proved here is that $L_N$ for the ``Poissonized coin toss'' is a good enough approximation to the longest run problem with exactly n coin tosses. However, this is not the application we are interested in.

Computer simulations of coin tossing.

In $2^{16}$ (65,536) tosses of a fair coin, the computer generated 32,659 heads and 32,877 tails. There were 16,400 runs of heads of size > 0. This and various other simulations gave the results shown in Figures 1, 2, 3 and 4.
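The following is a sketch of the kind of simulation used here (it is not the author's original program, and the random seed is arbitrary): it tosses a fair coin $2^{16}$ times, counts heads, tails and runs of size > 0, and compares the longest run with $\log_{1/p}(qn) + \gamma/\ln(1/p)$.

   import math, random

   random.seed(3)
   p = 0.5
   q = 1 - p
   n = 2**16
   gamma = 0.5772156649

   tosses = [random.random() < p for _ in range(n)]
   print("heads:", sum(tosses), "tails:", n - sum(tosses))

   runs, current = [], 0
   for t in tosses:
       if t:
           current += 1
       else:
           runs.append(current)
           current = 0
   runs.append(current)                         # the last (possibly unfinished) run

   print("runs of size > 0:", sum(1 for r in runs if r > 0))
   print("longest run:", max(runs))
   print("predicted mean:", math.log(q * n, 1 / p) + gamma / math.log(1 / p))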


  
Figure 1: The number of runs of size k versus k for a computer coin toss experiment with p = 1/2. Note the sharp drop at the end.


  
Figure 2: Runs in 1,000,000 tosses of a fair coin. Note the tail of the distribution.


  
Figure 3: Runs in 10,000,000 tosses of a fair coin.


  
Figure 4: Runs in 20,000,000 tosses of an unfair coin, where P(H)=0.4.

Local alignment without gaps

Let's return to sequence alignment.

${\bf A} = a_1a_2a_3{\dots}a_m$ and ${\bf B} = b_1b_2b_3 \dots b_n$ are 2 biomolecular sequences of the same type (from the same alphabet). Let's refer to the alphabet as ${\cal A} = \{A_1, A_2, \dots, A_K\}$, where, for DNA, K=4 and ${\cal A} = \{A,C,G,T\}$. For proteins, K=20 and ${\cal A} = \{A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y\}$. This time, however, we will regard the $\{a_i\}$ as i.i.d. random variables, and the same with the $\{b_j\}$. Let $p = P(a_i = b_j)$. We do not assume that the $a_i$s and $b_j$s have the same distribution. The degree to which they may differ is restricted, but this is beyond the scope of this course. For example, if $P(a_i = A_h)P(b_j = A_h) = 0$ for every member $A_h$ of the alphabet, then nothing useful results.


  
Figure 5: The longest common exact match between 2 sequences, $a_1a_2 \dots a_m$ and $b_1b_2 \dots b_n$, is the largest number of consecutive dots along any diagonal, d=j-i. Some of these diagonals are shown in black dotted lines. The longest common exact match (of size 5) is shown in red.

Let $L_{m,n}$ be the largest non-negative integer, k, such that there is some $i \leq m - k + 1$ and some $j \leq n - k + 1$ such that $a_{i+h} = b_{j+h} \; \forall \; 0 \leq h < k$. In other words, $L_{m,n}$ is the ``longest run of exact matches''. How is this problem similar to the ``longest run of heads'' in coin tossing? We can think of the familiar $m \times n$ ``dot plot'' matrix, where the ``blanks'' are ``tails'' and the ``dots'' are ``heads''. Instead of a linear array of coin tosses, we have a 2D grid and search for the maximum number of consecutive dots along the diagonals, $j - i = d$, for $d = 1-m, 2-m, \dots, n-1$. The total length of all these diagonals is mn. Of course, the dots are not independent random variables, even though they are independent of each other along any single diagonal. Let $p_h$ and $p'_h$ be $P(a_i = A_h)$ and $P(b_j = A_h)$, respectively. Then $p = \sum_{h=1}^{K} p_{h}p'_{h}$, so that $P(a_i = b_j, a_{i+1} = b_j) = \sum_{h=1}^{K} p_h^2p'_h$, while $P(a_i=b_j)P(a_{i+1}=b_j) = p^2$. These 2 quantities will be unequal in general.

It turns out that the non-independence of the dots on different diagonals, and the effect of stringing all the diagonals together into a long ``coin toss'' of size mn, don't make a lot of difference. The basic results hold for large mn. That is, $\log_{1/p}(qmn) + \gamma/\ln\!{1/p}$ is an excellent approximation to $\rm {\bf E}[L_{m,n}]$ and $\frac{\pi^2}{6(\ln{1/p})^2}$ is an excellent approximation to $\sigma^2(L_{m,n})$.
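To illustrate, here is an added sketch (with arbitrary sequence lengths m = n = 1000 and uniform base frequencies, so p = 1/4) that computes $L_{m,n}$ by scanning the diagonals of the dot plot and compares it with the approximation above.

   import math, random

   random.seed(4)
   alphabet = "ACGT"
   m = n = 1000
   a = [random.choice(alphabet) for _ in range(m)]
   b = [random.choice(alphabet) for _ in range(n)]

   # longest run of matches along any diagonal d = j - i of the dot plot
   longest = 0
   for d in range(-(m - 1), n):
       run = 0
       for i in range(m):
           j = i + d
           if 0 <= j < n:
               run = run + 1 if a[i] == b[j] else 0
               longest = max(longest, run)

   p = sum((1 / 4) * (1 / 4) for _ in alphabet)   # p = sum_h p_h p'_h = 1/4 here
   q = 1 - p
   gamma = 0.5772156649
   print("observed L_{m,n}:", longest)
   print("predicted mean:", math.log(q * m * n, 1 / p) + gamma / math.log(1 / p))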

Now we consider local alignment with a scoring function w. Note that $w(a_i,b_j)$ is now a random variable. We will assume that $\alpha = \infty$, so that no gaps will be predicted. Let H(m,n) be the maximum local alignment weight achieved by aligning $a_1a_2 \dots a_m$ with $b_1b_2 \dots b_n$.

We will now define sufficient conditions on w(ai,bj) so that H(m,n) behaves like the longest run in a coin toss.

Let $p_h = P(a_i = A_h)$ and $p'_h = P(b_j = A_h)$. Also, we'll write $w_{h_1,h_2}$ for the more cumbersome $w(A_{h_1},A_{h_2})$. Suppose that the match weights, $w_{h_1,h_2}, \; 1 \leq h_1, h_2 \leq K$, satisfy:

1.
At least one match weight is greater than zero and occurs with positive probability.
That is, $\exists \; h_1', h_2' \in \{1,2, \dots, K\} \; \ni \; p_{h_1'}p'_{h_2'}w_{h_1',h_2'} > 0$.
2.
The expected match weight is negative.
That is, $\sum_{h_1,h_2=1}^{K} p_{h_1}p'_{h_2}w_{h_1,h_2} < 0$.

Then $\exists ! \; \lambda^* > 0 \; \ni \; \sum_{h_1,h_2=1}^{K} p_{h_1}p'_{h_2}e^{\lambda^*w_{h_1,h_2}} = 1$.


  
Figure 6: A plot of $f(\lambda )$ versus $\lambda $ for DNA alignment where $p_A = p_C = p_G = p_T = p'_A = p'_C = p'_G = p'_T = 1/4$, the match weight is 1 and the mismatch weight is -0.6.

Proof: Define $f(\lambda) = \sum_{h_1,h_2=1}^{K} p_{h_1}p'_{h_2}e^{\lambda w_{h_1,h_2}}$. Note that f(0) = 1. The negative expected score assures that f'(0) < 0. Thus $\exists \; \epsilon > 0 \; \ni \; f(\epsilon) < 1$. The existence of $h_1'$ and $h_2'$ as described ensures that $\lim_{\lambda \rightarrow \infty} f(\lambda) = \infty$. Thus $\exists \; \lambda^* > 0 \; \ni \; f(\lambda^*) = 1$. The uniqueness of $\lambda^*$ follows from the fact that f'' > 0 (convex function).
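Numerically, $\lambda^*$ is easy to find by bisection, since f is convex with f(0) = 1. The following added sketch uses the parameters of Figure 6, for which $f(\lambda) = 0.25 e^{\lambda} + 0.75 e^{-0.6 \lambda}$ and $\lambda^* \approx 0.72$.

   import math

   def f(lam):
       # uniform DNA frequencies, match weight 1, mismatch weight -0.6 (Figure 6)
       return 0.25 * math.exp(lam) + 0.75 * math.exp(-0.6 * lam)

   # f(0) = 1, f dips below 1 (negative expected score), then grows to infinity,
   # so bisect on an interval where f crosses 1 from below.
   lo, hi = 0.1, 10.0
   while hi - lo > 1e-10:
       mid = (lo + hi) / 2
       if f(mid) < 1:
           lo = mid
       else:
           hi = mid
   print("lambda* =", (lo + hi) / 2)   # about 0.72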

An immediate consequence of this simple theorem is that we can define new probabilities:

\begin{displaymath}
q_{h_1,h_2} = p_{h_1}p'_{h_2}e^{\lambda^{*}w_{h_1,h_2}}.
\end{displaymath}

They are probabilities since they add up to 1 by definition. Furthermore, the weights can be written as:

\begin{displaymath}
w_{h_1,h_2} = \frac{1}{\lambda^{*}} \ln \frac{q_{h_1,h_2}}{p_{h_1}p'_{h_2}}.
\end{displaymath}

In words, the weights for scoring an alignment can be written as the logarithm to some appropriate base ($e^{\lambda^*}$) of a probability ratio.

Recall equation 24. We can rewrite it, replacing $<\lambda$ with $\geq \lambda$, and derive:

 
 \begin{displaymath}
\begin{array}{lcl}
P(L_N - \log_{1/p}(qn) \geq \lambda) & = & 1 - e^{-p^{\lambda}} \\
& = & 1 - [ 1 - p^{\lambda} + p^{2\lambda}/2! - p^{3\lambda}/3! + \dots ] \\
& = & e^{-\lambda \ln(1/p)}[1 - p^{\lambda}/2! + p^{2\lambda}/3! - \dots ] \\
& < & e^{-\lambda \ln(1/p)}.
\end{array}\end{displaymath} (33)

Replacing $\ln(1/p)$ by $\lambda^*$ and rearranging a bit gives

\begin{displaymath}
P(L_N \geq \frac{\ln n}{\lambda^*} + \lambda) < (1/q)e^{-\lambda^* \lambda}.
\end{displaymath}

In comparison, Karlin showed that for the H(m,n) statistic, with the weights satisfying the above conditions, there is another constant, K* >0 such that

\begin{displaymath}
P(H(m,n) \geq \frac{\ln mn}{\lambda^*} + \lambda) < K^*e^{-\lambda^* \lambda}
\end{displaymath}

for large $\lambda $. The computation of K* is complicated. This inequality yields a conservative estimate for statistical significance, and is the basis for the ``BLAST'' statistics.
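To see how such a bound is used, the added sketch below rearranges the inequality: writing $S = \frac{\ln mn}{\lambda^*} + \lambda$ gives $P(H(m,n) \geq S) < K^{*}mn\,e^{-\lambda^{*}S}$, which is the form behind the ``BLAST'' statistics. The numerical values below are placeholders only; in particular, K* must be computed for the actual scoring system.

   import math

   def tail_bound(S, m, n, lam_star, K_star):
       # P(H(m,n) >= S) < K* m n exp(-lambda* S)
       return K_star * m * n * math.exp(-lam_star * S)

   # illustrative numbers only
   print(tail_bound(S=40, m=1000, n=1_000_000, lam_star=0.72, K_star=0.1))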


Michael Zuker
Professor of Mathematical Sciences
Rensselaer Polytechnic Institute
2003-02-07