A random variable is a measurable function $X: S \to M$, where $S$ is a set, $\mathcal{F}$ is a sigma-algebra on $S$, and $\mu$ is a measure defined on $\mathcal{F}$ such that $\mu(S) = 1$. The set $M$ is a metric space in general. Usually it is $\mathbb{R}$ or a subset thereof. The triple $(S, \mathcal{F}, \mu)$ is called a ``probability space''. A random variable induces a measure, $\mu_X(A) = \mu\!\left(X^{-1}(A)\right)$, on the target space, $M$. This is called the ``distribution of $X$''.
All of this is more formal and general than what we shall need for this course. $\mu$ is usually written as $P$, or $\Pr$; a shorthand for ``probability''. $S$ is often finite, in which case $\mathcal{F}$ is the collection of all subsets of $S$. In this case, it suffices to define $P$ for all $s \in S$. $S$ could be countably infinite, as in the non-negative integers. Finally, $S$ could be $\mathbb{R}$, or an interval such as $[0,1]$ or $[0,\infty)$. In this case, the measure is often expressed in terms of a ``density function'', $f$. Thus
$$P(A) = \int_A f(x)\, dx .$$
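For instance, with the exponential density $f(x) = e^{-x}$ on $[0, \infty)$ (a distribution that appears again later in these notes), the probability assigned to an interval $[a, b]$ is
$$P(X \in [a,b]) = \int_a^b e^{-x}\, dx = e^{-a} - e^{-b} .$$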
Independence
An event is a subset of $S$ or $M$. Events $e_1, e_2, \dots, e_n$ are called ``independent'' if
$$P(e_1 \cap e_2 \cap \cdots \cap e_n) = P(e_1)\, P(e_2) \cdots P(e_n) \qquad (1)$$
(and the same holds for every subcollection of the $e_i$). Note that equation 1 is written for events in $S$. If the $e_i$ were events in $M$, we would write
$$P(X_1 \in e_1,\, X_2 \in e_2,\, \dots,\, X_n \in e_n) = \prod_{i=1}^{n} P(X_i \in e_i) . \qquad (2)$$
Random variables are called independent if equation 2 holds for any collection of subsets of $M$. When events or random variables are independent, people often say that ``they have nothing to do with one another''. This is imprecise. A better way to think about independence is to think ``product measure''. Random variables are called i.i.d. (independent and identically distributed) if they all have the same distribution and if any finite collection are independent.
Examples of independence and non-independence
EXAMPLE 1 X and Y are discrete random variables on the same space. The set of all possibilities for the values of X and Y is given in Table 1.
[Table 1: joint distribution of X and Y]
How can we prove that X and Y are independent in Table 1? It's simple. If $p_{i,j}$ is the probability in row $i$ and column $j$, then $p_{i,j} = p_{i\cdot}\, p_{\cdot j}$ for every cell, where $p_{i\cdot}$ and $p_{\cdot j}$ are the row and column sums.
EXAMPLE 2 X and Y are discrete random variables on the same space. A new set of all possibilities for the values of X and Y is given in Table 2.
[Table 2: joint distribution of X and Y]
How can we prove that X and Y are not independent in Table 2? Again, it's simple. Find a counterexample. For example, P(X = 7) = 0.233 and P(Y = 6) = 0.25. The product of these numbers is 0.05825, which is not equal to the table entry P(X = 7, Y = 6) = 0.05.
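Since the entries of Tables 1 and 2 are not reproduced here, the following sketch applies the same cell-by-cell check to a small made-up joint table; the `joint` array and its values are hypothetical.

```python
import numpy as np

# Hypothetical joint distribution P(X = x_i, Y = y_j); rows index X, columns index Y.
# This particular table is independent: every entry equals the product of its row and column sums.
joint = np.array([[0.06, 0.14],
                  [0.24, 0.56]])

row_marginals = joint.sum(axis=1)   # P(X = x_i)
col_marginals = joint.sum(axis=0)   # P(Y = y_j)
product = np.outer(row_marginals, col_marginals)

# X and Y are independent exactly when p_{i,j} = p_{i.} * p_{.j} for every cell.
print(np.allclose(joint, product))  # True for this table
```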
Mutually exclusive events
This concept is easier than that of independence. Events $e_1, e_2, \dots$ are called ``mutually exclusive'' if $e_i \cap e_j = \emptyset$ whenever $i \neq j$. In this case,
$$P\!\left(\bigcup_i e_i\right) = \sum_i P(e_i) . \qquad (3)$$
Note that equation 3 does not imply mutually exclusive events, since some events can have probability 0. In general,
$$P\!\left(\bigcup_i e_i\right) \le \sum_i P(e_i) . \qquad (4)$$
Conditional probability and Bayes Theorem
If $A$ and $B$ are events, then we define the ``conditional probability of $A$ given $B$'', by
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} . \qquad (5)$$
Bayes theorem (This bit of trivia should really be called ``Bayes lemma'' at best. Bayes was a very minor figure in probability theory and certainly doesn't deserve to have his name used so often.)
Equation 5 can be rewritten as:
$$P(A \cap B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A) . \qquad (6)$$
From this, it is easy to derive:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} . \qquad (7)$$
Example: Using Bayes Theorem
General Widgets (GW) is a company that manufactures widgets. Each widget needs 3 doodles. The doodles are manufactured by 4 different companies, C1, C2, C3 and C4. 10% of C1's doodles are defective, 5% of C2's doodles are defective, 2% of C3's doodles are defective and 1% of C4's doodles are defective. Because of price and availability, 40% of GW's doodles come from C1, 30% come from C2, 20% from C3 and the final 10% from C4. A widget fails if any one of its doodles is defective. The failures are independent of one another.
We first need to answer the question: ``What is the probability that a randomly selected doodle is defective?''
Let $D$ be the event that a doodle is defective. Let $C_i$ be the event that a doodle is produced by company $C_i$, $1 \le i \le 4$. Then:
$$\begin{aligned}
P(D) &= \sum_{i=1}^{4} P(D \mid C_i)\, P(C_i) \\
     &= 0.10 \times 0.40 + 0.05 \times 0.30 + 0.02 \times 0.20 + 0.01 \times 0.10 \\
     &= 0.040 + 0.015 + 0.004 + 0.001 \\
     &= 0.06
\end{aligned}$$
If a doodle is defective, what is the probability that it was made by $C_1$?
$$P(C_1 \mid D) = \frac{P(D \mid C_1)\, P(C_1)}{P(D)} = \frac{0.10 \times 0.40}{0.06} = \frac{2}{3} \approx 0.67$$
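The arithmetic above is easy to reproduce in a few lines; a minimal sketch using the numbers from this example:

```python
# Prior probabilities that a doodle comes from C1..C4, and the defect rate for each company.
prior = [0.40, 0.30, 0.20, 0.10]
p_defective = [0.10, 0.05, 0.02, 0.01]

# Law of total probability: P(D) = sum_i P(D | Ci) P(Ci)
p_d = sum(pc * pd for pc, pd in zip(prior, p_defective))
print(round(p_d, 6))                              # 0.06

# Bayes theorem: P(C1 | D) = P(D | C1) P(C1) / P(D)
print(round(p_defective[0] * prior[0] / p_d, 4))  # 0.6667
```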
Mean, variance and characteristic functions
For a real valued random variable, $X$, we define its ``expectation'', or ``expected value'' as
$$E(X) = \sum_i x_i\, P(X = x_i) \qquad (8)$$
in the discrete case, or
$$E(X) = \int x\, f(x)\, dx \qquad (9)$$
when $X$ has a density $f$. The ``variance'' of $X$ is
$$\mathrm{Var}(X) = E\big[(X - E(X))^2\big] = E(X^2) - E(X)^2 . \qquad (10)$$
The $n$th moment of $X$ ``about the mean'' is defined by
$$\mu_n = E\big[(X - E(X))^n\big] . \qquad (11)$$
The ``characteristic function'', $\phi_X(t)$, is defined by
$$\phi_X(t) = E\big(e^{itX}\big) . \qquad (12)$$
What good are characteristic functions?
1. $X$ and $Y$ have the same distribution if and only if $\phi_X = \phi_Y$.
2. If $X_1, X_2, \dots, X_n$ are independent and $Y = X_1 + X_2 + \cdots + X_n$, then
$$\phi_Y(t) = \prod_{i=1}^{n} \phi_{X_i}(t) . \qquad (13)$$
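As a numerical illustration of property 2, one can compare the empirical characteristic function of a sum of independent samples with the product of the individual empirical characteristic functions. The exponential samples and the grid of t values below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.exponential(scale=1.0, size=n)   # X1
x2 = rng.exponential(scale=2.0, size=n)   # X2, independent of X1
y = x1 + x2

t = np.linspace(-2.0, 2.0, 9)

def ecf(samples, t):
    # Empirical characteristic function: average of exp(i t X) over the sample.
    return np.exp(1j * np.outer(t, samples)).mean(axis=1)

lhs = ecf(y, t)                 # phi_Y(t), estimated
rhs = ecf(x1, t) * ecf(x2, t)   # phi_X1(t) * phi_X2(t), estimated
print(np.max(np.abs(lhs - rhs)))  # small, up to Monte Carlo error
```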
Some important probability distributions.
BINOMIAL DISTRIBUTION
Apart from the trivial case where X = 0 with probability 1, the
simplest case of a random variable is one that takes on just 2 values. These are
often taken as 0 and 1, although -1 and +1 are sometimes used. Sometimes, we
might write ``heads'' and ``tails'' for the 2 possibilities, indicating that
X represents the results of a coin toss. We'll adopt the following
convention. X takes on the 2 values, 1 and 0 with probabilities p
and $q$ respectively, where $p + q = 1$. Let $Y_n = X_1 + X_2 + \cdots + X_n$, where the $X_i$ are independent copies of $X$. What is the distribution of $Y_n$? Clearly $Y_n$ can take on any integer value between 0 and $n$. In fact, simple combinatorics gives us:
$$P(Y_n = k) = \binom{n}{k} p^k q^{n-k}, \qquad k = 0, 1, \dots, n . \qquad (14)$$
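A quick sanity check of equation 14 by simulation; the parameters n = 10 and p = 0.3 are arbitrary choices.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(1)
n, p, trials = 10, 0.3, 100_000
q = 1 - p

# Y_n = X_1 + ... + X_n, each X_i a 0/1 coin flip with P(X_i = 1) = p.
y = rng.binomial(1, p, size=(trials, n)).sum(axis=1)

for k in range(n + 1):
    exact = comb(n, k) * p**k * q**(n - k)     # equation (14)
    empirical = np.mean(y == k)
    print(k, round(exact, 4), round(empirical, 4))
```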
POISSON DISTRIBUTION
Suppose that $p$ is ``small'' and $n$ is ``large''. To be more explicit, let's set $p = \lambda/n$ and see what happens as $n \to \infty$. Stirling's formula gives the limiting value for $P(Y_n = k)$ as $\frac{\lambda^k e^{-\lambda}}{k!}$. The characteristic function is $e^{\lambda(e^{it} - 1)}$. It's easier to do this limiting process in ``reciprocal space''. First, $\phi_X(t) = q + p\, e^{it} = 1 + \frac{\lambda(e^{it} - 1)}{n}$. Thus $\phi_{Y_n}(t) = \left(1 + \frac{\lambda(e^{it} - 1)}{n}\right)^{n}$. This approaches $e^{\lambda(e^{it} - 1)}$ as $n \to \infty$.
This limiting distribution is called the ``Poisson distribution''. Think of it as the ``law of rare events''.
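The limiting process can be seen numerically by fixing λ and letting n grow with p = λ/n; the values λ = 2, k = 3 and the list of n below are arbitrary choices.

```python
from math import comb, exp, factorial

lam = 2.0
k = 3
for n in (10, 100, 1000, 10000):
    p = lam / n
    binom = comb(n, k) * p**k * (1 - p)**(n - k)   # binomial probability P(Y_n = k)
    print(n, round(binom, 6))

# Poisson limit for the same k
print("limit", round(lam**k * exp(-lam) / factorial(k), 6))
```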
UNIFORM DISTRIBUTION
This is given by a constant density function, $f(x) = \frac{1}{b - a}$ for $a \le x \le b$, and $f(x) = 0$ otherwise.
GAUSSIAN DISTRIBUTION
Density is
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-(x - \mu)^2 / (2\sigma^2)} .$$
Note that $\mu$ and $\sigma^2$ are the mean and variance. This is called the ``normal distribution''. The notation ``$N(\mu, \sigma^2)$'' refers to this distribution. If $X_1, X_2, \dots, X_n$ are i.i.d. with mean $\mu$ and variance $\sigma^2$, then
$$\frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma \sqrt{n}} \longrightarrow N(0, 1) \quad \text{in distribution as } n \to \infty . \qquad (15)$$
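A simulation sketch of equation 15, using i.i.d. uniform summands on [0,1] (so μ = 1/2, σ² = 1/12); the sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 500, 10_000
mu, sigma = 0.5, (1 / 12) ** 0.5

x = rng.uniform(0.0, 1.0, size=(trials, n))
z = (x.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))

# If equation (15) holds, z should be approximately N(0, 1).
print(round(z.mean(), 3), round(z.std(), 3))   # near 0 and 1
print(round(np.mean(np.abs(z) < 1.96), 3))     # near 0.95
```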
EXPONENTIAL AND GAMMA DISTRIBUTIONS
The density of the exponential distribution is $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$. The mean and variance are $1/\lambda$ and $1/\lambda^2$, respectively.
The density of the gamma distribution is given by
$$f(x) = \frac{\lambda^n x^{n-1} e^{-\lambda x}}{\Gamma(n)}, \qquad x \ge 0 . \qquad (16)$$
Note that $n = 1$ yields the exponential distribution. If $X$ and $Y$ are independent random variables with gamma distributions $\Gamma(n_1, \lambda)$ and $\Gamma(n_2, \lambda)$, then $X + Y$ has distribution $\Gamma(n_1 + n_2, \lambda)$. Thus sums of exponential distributions give gamma distributions with increasing parameter $n$.
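A quick simulation check of the last statement: a sum of n independent exponentials should have the gamma mean n/λ and variance n/λ². The choices n = 5 and λ = 2 below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam, trials = 5, 2.0, 200_000

# Sum of n independent Exponential(lambda) variables: gamma with parameters (n, lambda).
s = rng.exponential(scale=1.0 / lam, size=(trials, n)).sum(axis=1)

print(round(s.mean(), 3), n / lam)       # mean should be near n / lambda = 2.5
print(round(s.var(), 3), n / lam**2)     # variance should be near n / lambda^2 = 1.25
```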
GEOMETRIC DISTRIBUTION This is the discrete equivalent of the exponential distribution. $P(X = k) = q p^k$ for $k = 0, 1, 2, \dots$, for some constant $p$ with $0 < p < 1$, $q = 1 - p$. A simple calculation shows that $E(X) = p/q$.
INFORMATION CONTENT: ENTROPY AND RELATIVE ENTROPY
For a random variable, $X$, with discrete probability values $p_i = P(X = x_i)$, the ``entropy'' is defined by
$$H(X) = -\sum_i p_i \log p_i .$$
The sum is replaced by an integral for non-discrete probabilities. Since $\lim_{x \to 0^+} x \log x = 0$, there is no problem in taking ``$0 \log 0$'' to be 0. Also, the general $\log$ is written here, without specifying a base. In statistical mechanics applications, it is natural to take the natural logarithm. For molecular sequence analysis, base 2 is appropriate and the entropy is then said to be expressed in ``bits''. In any case, the choice of base only changes the result by a constant scale factor.
If $X$ takes on finitely many values, $n$, a trite calculation using Lagrange multipliers shows that $H(X)$ is maximized when $p_i = 1/n$ for all $i$. In this case, the entropy is $\log n$. Entropy is maximized when the distribution is ``most spread out''. Also, note that the entropy of this most spread out distribution increases with the number of states.
Entropy was used by Shannon in his effort to understand the information content of messages. Let's consider the following problem. We're receiving a message that's been encoded into bits (0's and 1's). If there were no noise (no error) at all, we would have 1 bit of information per bit received. Think of the signal as $Y_i + X_i$ (addition modulo 2). $Y_i$ is the pure signal with no error. It is ``deterministic''. $X_i$ is the ``random error''. It is 0 (no error) with probability $1 - p$ and 1 (bit swap error) with probability $p$. The entropy is $H = -p \log_2 p - (1 - p) \log_2 (1 - p)$. This approaches 0 as $p \to 0$. That is, small $p$ implies small error. For $p = 1/2$, the entropy is 1 in bits. That is, all information is lost. Note that $p = 1$ is useful, since we can simply swap bits.
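The binary entropy function from this example is easy to tabulate; a minimal sketch in base 2, with 0 log 0 taken to be 0:

```python
from math import log2

def binary_entropy(p):
    # H(p) = -p log2 p - (1 - p) log2(1 - p), with 0 log 0 = 0.
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.0, 0.01, 0.1, 0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 4))
# H(0.5) = 1 bit (all information lost); H(0) = H(1) = 0 (no uncertainty).
```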
In this course, we will be computing entropy for ``sample distributions''.
For 2 random variables, $X$ and $Y$, define the ``relative entropy'' of $X$ with respect to $Y$ as
$$H(X \,\|\, Y) = \sum_i P(X = x_i) \log \frac{P(X = x_i)}{P(Y = x_i)} . \qquad (17)$$
Finally we define the ``mutual information'' of $X$ with respect to $Y$ as
$$M(X, Y) = \sum_{i,j} P(X = x_i, Y = y_j) \log \frac{P(X = x_i, Y = y_j)}{P(X = x_i)\, P(Y = y_j)} . \qquad (18)$$
Mutual information is the relative entropy between the joint probability distribution of X and Y and the product of their marginal distributions. As such, it is a good measure of the dependence of X and Y: it is zero exactly when X and Y are independent.
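Mutual information can be computed directly from a joint probability table. The two tables below are made-up examples (not from these notes): one is an exact product of its marginals, the other is fully dependent; logs are base 2, so the answers are in bits.

```python
import numpy as np

def mutual_information(joint):
    # M(X, Y) = sum_{i,j} p_ij * log2( p_ij / (p_i. * p_.j) ), with 0 log 0 = 0.
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    prod = np.outer(px, py)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / prod[mask])))

independent = np.outer([0.2, 0.8], [0.3, 0.7])    # product table
dependent = np.array([[0.3, 0.0],
                      [0.0, 0.7]])                 # X determines Y

print(round(mutual_information(independent), 6))   # 0.0
print(round(mutual_information(dependent), 6))     # entropy of (0.3, 0.7), about 0.8813
```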
Coin tossing and longest runs
Infinite tossing of a consistent coin.
Let $X_1, X_2, X_3, \dots$ be a series of i.i.d. random variables such that $P(X_i = 1) = p$ and $P(X_i = 0) = q$, where $0 < p < 1$ and $q = 1 - p$. $X$ is 1 for a head and is 0 for a tail.
We want to look at this series in a novel way. Let $R_i$ be the length of the $i$th ``run of heads'', to be called ``run'' henceforth. By definition, a run begins at the first position, or at the first position after a T. A run ends as soon as a T is encountered. The length of a run is the number of heads in the run. The tail is not counted. This means that we are including runs of zero length.
Another way of stating this is by defining $B_i$ to be the beginning index of the $i$th run. Then $B_1 = 1$ by definition and $B_i = B_{i-1} + R_{i-1} + 1$ for $i > 1$. Note that $R_1 = 0$ if $X_1 = 0$ (the first toss is a tail).
The beginnings of runs for a particular sequence of coin tosses are shown below using the symbol |.
```
  2   4   6   8  10  12  14  16
T T H H H T H H H H T T H T T H H ...
| | |       |         | |   | |
```

| i     | 1 | 2 | 3 | 4 | 5  | 6  | 7  | 8  | ... |
|-------|---|---|---|---|----|----|----|----|-----|
| B_i   | 1 | 2 | 3 | 7 | 12 | 13 | 15 | 16 |     |
| R_i   | 0 | 0 | 3 | 4 | 0  | 1  | 0  | ?  |     |
The coin tosses are independent, and the runs are built from disjoint blocks of tosses, each one ending with a tail, so the random variables, $R_i$, are also independent. This is true for the infinite sequence of tosses; for a sequence of finite length, $n$, things are different. In particular, the $R_i$s are not independent, and we cannot say when the last run ends unless $X_n = 0$.
What is the distribution of the $R_i$s? They have a common distribution. In fact,
$$P(R_i = k) = q p^k, \qquad k = 0, 1, 2, \dots \qquad (19)$$
That is, $R_i + 1$ has the geometric distribution with parameter $q$. $R_i + 1$ is the number of tosses until a T is encountered. In particular,
$$P(R_i \ge k) = p^k . \qquad (20)$$
The expectation is given by
$$E(R_i) = \sum_{k=0}^{\infty} k\, q p^k = \frac{p}{q} . \qquad (21)$$
We can now very easily answer the question ``What is the expected number of runs of any particular size in $n$ tosses?''. A run of size $k$ may begin anywhere with the same probability $q^2 p^k$, except for the first position. In any case, the expected number of runs is simply $q^2 p^k (n - k + 1)$. If we neglect these edge effects and think of $n$ as ``large'', then the expected number of runs of size $k$ is approximately
$$n q^2 p^k . \qquad (22)$$
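A simulation sketch of this count; p = 0.5 and n = 100000 are arbitrary choices, and the run-splitting convention follows the text (zero-length runs included, any unfinished final run discarded).

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(4)
p, n = 0.5, 100_000
q = 1 - p
tosses = rng.random(n) < p          # True = head, False = tail

# Split into runs of heads: a run ends at each tail (runs of length 0 included).
runs, current = [], 0
for head in tosses:
    if head:
        current += 1
    else:
        runs.append(current)
        current = 0
# Any trailing heads form an unfinished run; it is discarded here for simplicity.

counts = Counter(runs)
for k in range(6):
    print(k, counts[k], round(n * q * q * p**k, 1))   # observed vs expected n q^2 p^k
```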
Using a fixed number of coin tosses, n, results in run lengths that are not independent and that do not end cleanly at n. We will now replace the coin toss by a ``Poissonized version'' that has much better properties and that approximates the true coin toss sufficiently well for our conclusions to transfer.
Let $R'_1, R'_2, R'_3, \dots$ be a series of i.i.d. random variables with the common ``run'' distribution discussed above (equations 19 & 20). Let $N$ be a Poisson random variable independent of the $R'$s and with mean $nq$. Consider the ``Poissonized'' coin toss:
$$R'_1(H)\; T\;\; R'_2(H)\; T\;\; \cdots\;\; R'_N(H)\; T ,$$
consisting of the empty coin toss if $N = 0$, where $R'_i(H)$ means ``insert $R'_i$ heads in this position''. The expected length of this ``Poissonized'' coin toss is:
$$E(N)\left(E(R'_1) + 1\right) = nq\left(\frac{p}{q} + 1\right) = n . \qquad (23)$$
This is as expected and by design. Now let $L_N$ be the length of the longest run in the Poissonized coin toss. $L_N = 0$ when $N = 0$.
Then
$$P(L_N < k) = \sum_{m=0}^{\infty} P(N = m)\left(1 - p^k\right)^m = e^{-nq} \sum_{m=0}^{\infty} \frac{\left(nq(1 - p^k)\right)^m}{m!} = e^{-nq\, p^k} . \qquad (24)$$
We now replace $k$ by $\log_{1/p}(nq) + x$. We think of $n \to \infty$ while $x$ is fixed. Then
$$P\!\left(L_N < \log_{1/p}(nq) + x\right) = \exp\!\left(-nq\, p^{\log_{1/p}(nq) + x}\right) = e^{-p^x} . \qquad (25)$$
The distribution in equation 25 is an example of the ``extreme value distribution''.
Its density is given by differentiation and is $f(x) = \ln(1/p)\, p^x e^{-p^x}$. Moments are given by:
$$E\!\left(e^{sX}\right) = \int_{-\infty}^{\infty} e^{sx}\, \ln(1/p)\, p^x e^{-p^x}\, dx .$$
To evaluate the mean and variance, we digress briefly to consider the ``gamma function''. If the real part of $z$ is $> 0$, we can write
$$\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\, dt . \qquad (26)$$
Integration by parts gives
$$\Gamma(z + 1) = z\, \Gamma(z), \qquad \Gamma(1) = 1 . \qquad (27)$$
Substituting $t = p^x = e^{-\lambda x}$, where $\lambda = \ln(1/p)$, into the moment integral above gives
$$E\!\left(e^{sX}\right) = \int_0^{\infty} t^{-s/\lambda} e^{-t}\, dt \qquad (28)$$
$$= \Gamma\!\left(1 - \frac{s}{\lambda}\right) . \qquad (29)$$
The derivatives of the gamma function at 1 are
$$\Gamma'(1) = -\gamma, \qquad \Gamma''(1) = \gamma^2 + \frac{\pi^2}{6}, \qquad (30)$$
where $\gamma \approx 0.5772$ is Euler's constant.
Differentiation of equation 29 yields:
$$E(X) = -\frac{\Gamma'(1)}{\lambda} = \frac{\gamma}{\ln(1/p)}, \qquad E(X^2) = \frac{\Gamma''(1)}{\lambda^2}, \qquad \mathrm{Var}(X) = \frac{\pi^2}{6 \ln^2(1/p)} .$$
The end result of these calculations is that
$$E(L_N) \approx \log_{1/p}(nq) + \frac{\gamma}{\ln(1/p)}, \qquad \mathrm{Var}(L_N) \approx \frac{\pi^2}{6 \ln^2(1/p)} .$$
What we have not proved here is that $L_N$ for the ``Poissonized coin toss'' is a good enough approximation to the longest run problem with exactly $n$ coin tosses. However, this is not the application we are interested in.
Computer simulations of coin tossing.
In $2^{16}$ (65536) tosses of a fair coin, the computer generated 32659 heads and 32877 tails. There were 16400 runs of heads of size > 0. This and various other simulations gave results shown in Figures 1, 2, 3 and 4.
[Figures 1-4: results of the coin-tossing simulations]
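A sketch of the kind of simulation described here, comparing one observed longest run of heads with the approximate mean $\log_{1/p}(nq) + \gamma/\ln(1/p)$ derived above (fair coin, $n = 2^{16}$):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 0.5, 2**16
q = 1 - p

tosses = rng.random(n) < p
# Longest run of heads: longest block of consecutive True values.
longest, current = 0, 0
for head in tosses:
    current = current + 1 if head else 0
    longest = max(longest, current)

gamma = 0.5772156649                       # Euler's constant
predicted = np.log(n * q) / np.log(1 / p) + gamma / np.log(1 / p)
print(longest, round(predicted, 2))        # observed longest run vs about 15.8
```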
Local alignment without gaps
Let's return to sequence alignment.
$a = a_1 a_2 \cdots a_m$ and $b = b_1 b_2 \cdots b_n$ are 2 biomolecular sequences of the same type (from the same alphabet). Let's refer to the alphabet as $\{A_1, A_2, \dots, A_K\}$, where, for DNA, $K = 4$ and the alphabet is {A, C, G, T}. For proteins, $K = 20$ and the alphabet is {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. This time, however, we will regard the $\{a_i\}$ as i.i.d. random variables, and the same with $\{b_j\}$. Let $p = P(a_i = b_j)$. We do not assume that the $a_i$s and $b_j$s have the same distribution. The degree to which they may differ is restricted, and is beyond the scope of this course. For example, if $P(a_i = A_h)\, P(b_j = A_h) = 0$ for all members of the alphabet, then nothing useful results.
Let $L_{m,n}$ be the largest non-negative integer, $k$, such that there is some $i$, $0 \le i \le m - k$, and some $j$, $0 \le j \le n - k$, such that $a_{i+l} = b_{j+l}$ for $l = 1, 2, \dots, k$. In other words, $L_{m,n}$ is the ``longest run of exact matches''.
How is this problem similar to the ``longest run of heads'' in coin tossing? We can think of the familiar $m \times n$ ``dot plot'' matrix, where the ``blanks'' are ``tails'' and the ``dots'' are ``heads''. Instead of a linear array of coin tosses, we have a 2D grid and search for the maximum number of dots in a row along successive diagonals, $j - i = d$, for $-(m-1) \le d \le n-1$. The total length of all these diagonals is $mn$. Of course, the dots are not independent random variables, even though they are independent of each other along any diagonal. Let $p_h$ and $p'_h$ be $P(a_i = A_h)$ and $P(b_j = A_h)$, respectively. Then $p = \sum_h p_h p'_h$, so that $P(a_i = b_j,\ a_{i+1} = b_j) = \sum_h p_h^2\, p'_h$, while $P(a_i = b_j)\, P(a_{i+1} = b_j) = p^2$. These 2 quantities will be unequal in general.
It turns out that the non-independence of the dots on different diagonals and the effect of stringing all the diagonals together into a long ``coin toss'' of size $mn$ doesn't make a lot of difference. The basic results hold for large $mn$. That is, $\log_{1/p}(mnq) + \gamma/\ln(1/p)$ is an excellent approximation to $E(L_{m,n})$ and $\pi^2/(6 \ln^2(1/p))$ is an excellent approximation to $\mathrm{Var}(L_{m,n})$.
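A small sketch for random sequences: compute $L_{m,n}$ by scanning the diagonals of the (implicit) dot plot and compare it with the approximate mean above. The uniform DNA composition and the sequence lengths are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 1000, 1000
a = rng.integers(0, 4, size=m)     # uniform DNA: p = P(a_i = b_j) = 1/4
b = rng.integers(0, 4, size=n)
p = 0.25
q = 1 - p

longest = 0
for d in range(-(m - 1), n):       # diagonals j - i = d
    i = np.arange(max(0, -d), min(m, n - d))
    matches = a[i] == b[i + d]
    current = best = 0
    for hit in matches:            # longest run of matches along this diagonal
        current = current + 1 if hit else 0
        best = max(best, current)
    longest = max(longest, best)

predicted = np.log(m * n * q) / np.log(1 / p) + 0.5772 / np.log(1 / p)
print(longest, round(predicted, 2))
```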
Now we consider local alignment with a scoring function $w$. Note that $w(a_i, b_j)$ is now a random variable. We will assume that the gap penalties are sufficiently large, so that no gaps will be predicted. Let $H(m,n)$ be the maximum local alignment weight achieved by aligning $a_1 a_2 \cdots a_m$ with $b_1 b_2 \cdots b_n$.
We will now define sufficient conditions on $w(a_i, b_j)$ so that $H(m,n)$ behaves like the longest run in a coin toss.
Let $p_h = P(a_i = A_h)$ and $p'_h = P(b_j = A_h)$. Also, we'll write $w_{h_1, h_2}$ for the more cumbersome $w(A_{h_1}, A_{h_2})$. Suppose that the match weights, $w_{h_1, h_2}$, satisfy:
1. the expected score is negative: $\sum_{h_1, h_2} p_{h_1} p'_{h_2}\, w_{h_1, h_2} < 0$, and
2. a positive score is possible: there exist $h_1'$ and $h_2'$ with $p_{h_1'}\, p'_{h_2'} > 0$ and $w_{h_1', h_2'} > 0$.
Then there is a unique $\lambda^* > 0$ such that
$$f(\lambda^*) = \sum_{h_1, h_2} p_{h_1} p'_{h_2}\, e^{\lambda^* w_{h_1, h_2}} = 1 .$$
Proof: Define $f(\lambda) = \sum_{h_1, h_2} p_{h_1} p'_{h_2}\, e^{\lambda w_{h_1, h_2}}$. Note that $f(0) = 1$. The negative expected score assures that $f'(0) < 0$. Thus $f(\lambda) < 1$ for small $\lambda > 0$. The existence of $h_1'$ and $h_2'$ as described ensures that $f(\lambda) \to \infty$ as $\lambda \to \infty$. Thus $f(\lambda^*) = 1$ for some $\lambda^* > 0$. The uniqueness of $\lambda^*$ follows from the fact that $f'' > 0$ (convex function).
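A numerical sketch of the theorem: for given letter frequencies and a score matrix satisfying the two conditions, $f(\lambda) = 1$ can be solved for $\lambda^*$ by bisection. The +1/−1 match/mismatch scores and the uniform DNA frequencies below are illustrative choices, not taken from these notes.

```python
import numpy as np

# Uniform DNA composition for both sequences and a +1/-1 match/mismatch score.
p = np.full(4, 0.25)
pp = np.full(4, 0.25)
w = np.where(np.eye(4, dtype=bool), 1.0, -1.0)

def f(lam):
    # f(lambda) = sum_{h1,h2} p_{h1} p'_{h2} exp(lambda * w_{h1,h2})
    return float(np.outer(p, pp).ravel() @ np.exp(lam * w).ravel())

# Expected score is negative (0.25*1 + 0.75*(-1) = -0.5) and a positive score
# is possible, so a unique lambda* > 0 with f(lambda*) = 1 exists.
lo, hi = 1e-9, 10.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if f(mid) < 1.0 else (lo, mid)
lam_star = 0.5 * (lo + hi)
print(round(lam_star, 6), round(f(lam_star), 6))   # lambda* and f(lambda*) ~ 1
```

For this particular scoring with uniform composition, the exact answer is $\lambda^* = \ln 3 \approx 1.0986$, which the bisection reproduces.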
An immediate consequence of this simple theorem is that we can define new probabilities:
$$p^*_{h_1, h_2} = p_{h_1} p'_{h_2}\, e^{\lambda^* w_{h_1, h_2}} .$$
Recall equation 25. We can rewrite it replacing $p$ with $p^* = e^{-\lambda^*}$ and derive:
$$P\!\left(H(m,n) > \log_{1/p^*}(mn) + x\right) \approx 1 - e^{-(p^*)^x} .$$
In comparison, Karlin showed that for the $H(m,n)$ statistic, with the weights satisfying the above conditions, there is another constant, $K^* > 0$, such that
$$P\!\left(H(m,n) > \log_{1/p^*}(mn) + x\right) \approx 1 - e^{-K^* (p^*)^x} .$$
Michael Zuker, Professor of Mathematical Sciences, Rensselaer Polytechnic Institute. 2003-02-07