CS 540: Machine Learning
Lecture 3: Bayesian parameter estimation and hypothesis testing for discrete data
AD from KPM
January 2008
AD from KPM () January 2008 1 / 49
Naive Bayes Classifier

Let C ∈ {1, ..., K} represent the class of a document (e.g., C = spam or C = not spam). Let W_i = 1 if word i occurs in this document, otherwise W_i = 0.

A naive Bayes classifier assumes the words (features) are conditionally independent given the class (written as W_i ⊥ W_j | C).

This can be represented as a Bayes net (recall that a node is conditionally independent of its non-descendants given its parents).

[Figure: Bayes net with root node C and leaf nodes for the features.]
Naive Bayes Classifier: Inference

Since W_i ⊥ W_j | C, the joint is

P(C, W_1:N) = P(C) [∏_{i=1}^N P(W_i | C)]

Hence the posterior over class labels is given by

P(C = c | w_1:N) = [P(C = c) ∏_{i=1}^N P(w_i | c)] / [∑_{c'} P(C = c') ∏_{i=1}^N P(w_i | c')]
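As a sketch of this computation (with hypothetical class priors and word probabilities; in practice these would be learned as described on the next slide), the posterior is best computed in log space to avoid underflow:

```python
import math

def nb_posterior(prior, cond, words):
    """Posterior P(C = c | w_1:N) for a Bernoulli naive Bayes model.

    prior: dict mapping class -> P(C = c)
    cond:  dict mapping class -> list of P(W_i = 1 | C = c)
    words: list of binary word-occurrence indicators w_i
    """
    log_joint = {}
    for c in prior:
        lp = math.log(prior[c])
        for theta, w in zip(cond[c], words):
            lp += math.log(theta if w == 1 else 1.0 - theta)
        log_joint[c] = lp
    # Normalize with the log-sum-exp trick.
    m = max(log_joint.values())
    z = m + math.log(sum(math.exp(v - m) for v in log_joint.values()))
    return {c: math.exp(v - z) for c, v in log_joint.items()}

# Hypothetical two-class example: the first two words are more likely in spam.
prior = {"spam": 0.05, "ham": 0.95}
cond = {"spam": [0.8, 0.7, 0.1], "ham": [0.1, 0.2, 0.3]}
post = nb_posterior(prior, cond, [1, 1, 0])
```

Even with a small prior on spam, observing both "spammy" words makes spam the more probable class.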
Naive Bayes Classifier: Learning

The root CPD P(C = c) can be estimated by counting how many times each class occurs (e.g., P(C = spam) = 0.05, P(C = non-spam) = 0.95).

Each leaf CPD P(w_i | c) can have a different kind of distribution, e.g., Bernoulli, Gaussian, etc.

For document classification, P(W_i = 0/1 | C = c) can be estimated by counting how many times word i occurs in documents of class c.

For real-valued data, p(W_i | C = c) can be estimated by fitting a Gaussian to all data points that are labeled as class c.

If the class labels are not observed during training, this model can be used for clustering (see later).
Parameter learning

We said that the root CPD P(C = c) can be estimated by counting how many times each class occurs. Why?

We said P(W_i = 0/1 | C = c) can be estimated by counting how many times word i occurs in documents of class c. Why? And what if the word never occurs?

We now discuss these issues, which are equivalent to estimating the parameters of coins and dice.

We will also discuss how to infer which words are useful for classification (feature selection) by computing the mutual information between two variables.
Bernoulli distribution

Let X ∈ {0, 1} represent heads or tails. Suppose P(X = 1) = μ. Then

P(x | μ) = Be(x | μ) = μ^x (1 − μ)^(1−x)

It is easy to show that

E[X] = μ,  Var[X] = μ(1 − μ)
MLE for a Bernoulli distribution

Given D = (x_1, ..., x_N), the likelihood is

p(D | μ) = ∏_{n=1}^N p(x_n | μ) = ∏_{n=1}^N μ^(x_n) (1 − μ)^(1−x_n)

The log-likelihood is

L(μ) = log p(D | μ) = ∑_n [x_n log μ + (1 − x_n) log(1 − μ)]
     = N_1 log μ + N_0 log(1 − μ)

where N_1 = n = ∑_n x_n is the number of heads and N_0 = m = ∑_n (1 − x_n) is the number of tails (sufficient statistics). Solving dL/dμ = 0 yields

μ_ML = n / (n + m)
Problems with the MLE

Suppose we have seen 3 heads out of 3 trials. Then we predict that all future coins will land heads:

μ_ML = n / (n + m) = 3 / (3 + 0) = 1

This is an example of the sparse data problem: if we fail to see something in the training set (e.g., an unknown word), we predict that it can never happen in the future.

We will now see how to solve this pathology using Bayesian estimation.
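A minimal numerical illustration of this pathology, together with the Bayesian fix derived in the following slides (a uniform Beta(1,1) prior, i.e. Laplace smoothing):

```python
def bernoulli_mle(heads, tails):
    """Maximum likelihood estimate mu_ML = n / (n + m)."""
    return heads / (heads + tails)

def bernoulli_posterior_mean(heads, tails, a=1, b=1):
    """Posterior mean under a Beta(a, b) prior: (n + a) / (n + m + a + b)."""
    return (heads + a) / (heads + tails + a + b)

mle = bernoulli_mle(3, 0)               # 3 heads in 3 trials: predicts heads with certainty
bayes = bernoulli_posterior_mean(3, 0)  # the uniform prior pulls the estimate away from 1
```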
Conjugate priors

A Bayesian estimate of μ requires a prior p(μ).

A prior is called conjugate if, when multiplied by the likelihood p(D | μ), the resulting posterior is in the same parametric family as the prior. (Closed under Bayesian updating.)

The Beta prior is conjugate to the Bernoulli likelihood:

P(μ | D) ∝ P(D | μ) P(μ) ∝ [μ^n (1 − μ)^m] [μ^(a−1) (1 − μ)^(b−1)] = μ^(n+a−1) (1 − μ)^(m+b−1)

where n is the number of heads and m is the number of tails.

a, b are hyperparameters (parameters of the prior) and correspond to the number of "virtual" heads/tails (pseudo counts). The sum a + b is called the effective sample size (strength) of the prior. a = b = 1 is a uniform prior (Laplace smoothing).
The beta distribution

To ensure the prior is normalized, we define

P(μ | a, b) = Beta(μ | a, b) = [Γ(a + b) / (Γ(a) Γ(b))] μ^(a−1) (1 − μ)^(b−1)

where the gamma function is defined as

Γ(x) = ∫_0^∞ u^(x−1) e^(−u) du

Note that Γ(x + 1) = x Γ(x) and Γ(1) = 1. Also, for integers, Γ(x + 1) = x!.

The normalization constant 1/Z(a, b) = Γ(a + b) / (Γ(a) Γ(b)) ensures

∫_0^1 Beta(μ | a, b) dμ = 1
The beta distribution

If μ ~ Be(a, b), then

E[μ] = a / (a + b)

mode[μ] = (a − 1) / (a + b − 2)
Bayesian updating of a beta distribution

If we start with a beta prior Be(μ | a, b) and see n heads and m tails, we end up with a beta posterior Be(μ | a + n, b + m):

P(μ | D) = (1 / P(D)) P(D | μ) P(μ | a, b)
         = (1 / P(D)) [μ^n (1 − μ)^m] (1 / Z(a, b)) [μ^(a−1) (1 − μ)^(b−1)]
         = Be(μ | n + a, m + b)

The marginal likelihood is the ratio of the normalizing constants:

P(D) = Z(a + n, b + m) / Z(a, b)
     = [Γ(a + n) Γ(b + m) / Γ(a + n + b + m)] · [Γ(a + b) / (Γ(a) Γ(b))]
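A sketch of this marginal likelihood in log space using `math.lgamma` (here Z(a, b) = Γ(a)Γ(b)/Γ(a + b); with a uniform Beta(1,1) prior the answer reduces to the closed form n! m! / (n + m + 1)!):

```python
import math

def log_beta_norm(a, b):
    """log Z(a, b), where Z(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_likelihood(n, m, a, b):
    """log P(D) = log [Z(a + n, b + m) / Z(a, b)] for n heads and m tails."""
    return log_beta_norm(a + n, b + m) - log_beta_norm(a, b)

# Probability of a specific sequence of 3 heads and 7 tails under Beta(1,1).
p = math.exp(log_marginal_likelihood(3, 7, 1, 1))
```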
Sequential Bayesian updating

Start with a beta prior p(θ | α_h, α_t) = B(θ; α_h, α_t). Observe N trials with N_h heads and N_t tails. The posterior becomes

p(θ | α_h, α_t, N_h, N_t) = B(θ; α_h + N_h, α_t + N_t) = B(θ; α'_h, α'_t)

Observe another N' trials with N'_h heads and N'_t tails. The posterior becomes

p(θ | α'_h, α'_t, N'_h, N'_t) = B(θ; α'_h + N'_h, α'_t + N'_t) = B(θ; α_h + N_h + N'_h, α_t + N_t + N'_t)

So sequentially absorbing data in any order is equivalent to a batch update (assuming iid data and exact Bayesian updating).

This is useful for online learning and large datasets.
Bayesian updating in pictures
Start with Be(µja = 2, b = 2) and observe x = 1, so the posterior isBe(µja = 3, b = 2).
Posterior predictive distribution

The posterior predictive distribution is

p(X = 1 | D) = ∫_0^1 p(X = 1 | μ) p(μ | D) dμ = ∫_0^1 μ p(μ | D) dμ = E[μ | D] = (n + a) / (n + m + a + b)

With a uniform prior a = b = 1, we get Laplace's rule of succession

p(X = 1 | N_h, N_t) = (N_h + 1) / (N_h + N_t + 2)

Start with Be(μ | a = 2, b = 2) and observe x = 1 to get Be(μ | a = 3, b = 2), so the mean shifts from E[μ] = 2/4 to E[μ | D] = 3/5.
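The mean-shift example above can be checked exactly with rational arithmetic (Be(2,2) prior, one observed head):

```python
from fractions import Fraction

def beta_mean(a, b):
    """E[mu] = a / (a + b) for a Beta(a, b) distribution."""
    return Fraction(a, a + b)

prior_mean = beta_mean(2, 2)         # E[mu] under the Be(2,2) prior
post_mean = beta_mean(2 + 1, 2 + 0)  # observe one head: posterior is Be(3,2)
```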
Effect of prior strength

Let N = N_h + N_t be the number of samples (observations). Let N' be the number of pseudo observations (strength of the prior) and define the prior means

α_h = N' α'_h,  α_t = N' α'_t,  α'_h + α'_t = 1

Then the posterior mean is a convex combination of the prior mean and the MLE (where λ = N' / (N + N')):

P(X = 1 | α_h, α_t, N_h, N_t) = (α_h + N_h) / (α_h + N_h + α_t + N_t)
                              = (N' α'_h + N_h) / (N + N')
                              = [N' / (N + N')] α'_h + [N / (N + N')] (N_h / N)
                              = λ α'_h + (1 − λ) N_h / N
Effect of prior strength

Suppose we have a uniform prior α'_h = α'_t = 0.5, and we observe N_h = 3, N_t = 7.

Weak prior N' = 2. Posterior prediction:

P(X = h | α_h = 1, α_t = 1, N_h = 3, N_t = 7) = (3 + 1) / (3 + 1 + 7 + 1) = 1/3 ≈ 0.33

Strong prior N' = 20. Posterior prediction:

(3 + 10) / (3 + 10 + 7 + 10) = 13/30 ≈ 0.43

However, if we have enough data, it washes away the prior, e.g., N_h = 300, N_t = 700. The estimates are (300 + 1)/(1000 + 2) and (300 + 10)/(1000 + 20), both of which are close to 0.3.

As N → ∞, P(θ | D) → δ(θ − θ̂_ML), so E[θ | D] → θ̂_ML.
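The convex-combination form can be checked numerically against the two predictions on this slide:

```python
def posterior_mean(prior_mean_h, n_prior, n_h, n_t):
    """lambda * prior mean + (1 - lambda) * MLE, with lambda = N' / (N + N')."""
    n = n_h + n_t
    lam = n_prior / (n + n_prior)
    return lam * prior_mean_h + (1 - lam) * (n_h / n)

weak = posterior_mean(0.5, 2, 3, 7)     # equals (3 + 1) / 12
strong = posterior_mean(0.5, 20, 3, 7)  # equals (3 + 10) / 30
```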
Parameter posterior - small sample, uniform prior

[Figure: rows of (prior, likelihood, posterior) panels with a Beta(1,1) prior and data (1 head, 0 tails), (1 head, 1 tail), (10 heads, 1 tail), (10 heads, 5 tails), (10 heads, 10 tails).]
Parameter posterior - small sample, strong prior

[Figure: the same (prior, likelihood, posterior) panels with a Beta(10,10) prior and data (1 head, 0 tails), (1 head, 1 tail), (10 heads, 1 tail), (10 heads, 5 tails), (10 heads, 10 tails).]
Prior smooths parameter estimates

The MLE can change dramatically with small sample sizes.

The Bayesian estimate changes much more smoothly (depending on the strength of the prior).

[Figure legend: lower blue = MLE, red = Beta(1,1), pink = Beta(5,5), upper blue = Beta(10,10).]
Maximum a posteriori (MAP) estimation

MAP estimation picks the mode of the posterior:

θ̂_MAP = argmax_θ p(D | θ) p(θ)

If θ ~ Be(a, b), this is just

θ̂_MAP = (a − 1) / (a + b − 2)

MAP is equivalent to maximizing the penalized log-likelihood

θ̂_MAP = argmax_θ log p(D | θ) − λ c(θ)

where c(θ) = −log p(θ) is called a regularization term. λ is related to the strength of the prior.
Integrate out or Optimize?

θ̂_MAP is not Bayesian (even though it uses a prior) since it is a point estimate.

Consider predicting the future. A Bayesian will integrate out all uncertainty:

p(x_new | X) = ∫ p(x_new, θ | X) dθ = ∫ p(x_new | θ, X) p(θ | X) dθ = ∫ p(x_new | θ) p(θ | X) dθ.

A frequentist will use a "plug-in" estimator, e.g., ML/MAP:

p(x_new | X) = p(x_new | θ̂),  θ̂ = argmax p(θ | X)
From coins to dice

Suppose we observe N iid die rolls (K-sided): D = 3, 1, K, 2, ...

Let [x] ∈ {0, 1}^K be a one-of-K encoding of x, e.g., if x = 3 and K = 6, then [x] = (0, 0, 1, 0, 0, 0)^T.

Multinomial distribution: p(X = k) = θ_k,  ∑_k θ_k = 1

Log-likelihood:

l(θ; D) = log p(D | θ) = ∑_m log ∏_k θ_k^(δ_k(x_m)) = ∑_m ∑_k δ_k(x_m) log θ_k = ∑_k N_k log θ_k

We need to maximize this subject to the constraint ∑_k θ_k = 1, so we use a Lagrange multiplier.
MLE for multinomial

Constrained cost function:

l̃ = ∑_k N_k log θ_k + λ (1 − ∑_k θ_k)

Take derivatives wrt θ_k:

∂l̃/∂θ_k = N_k / θ_k − λ = 0
N_k = λ θ_k
∑_k N_k = N = λ ∑_k θ_k = λ
θ̂_{k,ML} = N_k / N

θ̂_{k,ML} is the fraction of times k occurs.
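A short sketch of this estimator on a hypothetical sequence of die rolls (counts of each face divided by the total number of rolls):

```python
from collections import Counter

def multinomial_mle(rolls, k):
    """theta_hat_k = N_k / N: the fraction of times each face k occurs."""
    counts = Counter(rolls)
    n = len(rolls)
    return [counts.get(face, 0) / n for face in range(1, k + 1)]

theta = multinomial_mle([3, 1, 6, 2, 3, 3, 1, 2, 2, 2], k=6)
```

Note the sparse-data problem reappears here: faces 4 and 5 were never rolled, so their MLE is exactly zero.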
Dirichlet priors

Let X ∈ {1, ..., K} have a multinomial distribution

P(X | θ) = θ_1^(I(X=1)) θ_2^(I(X=2)) ⋯ θ_K^(I(X=K))

For a set of data X_1, ..., X_N, the sufficient statistics are the counts N_i = ∑_n I(X_n = i).

Consider a Dirichlet prior with hyperparameters α:

p(θ | α) = D(θ | α) = (1 / Z(α)) θ_1^(α_1 − 1) θ_2^(α_2 − 1) ⋯ θ_K^(α_K − 1)

where Z(α) is the normalizing constant

Z(α) = ∫ ⋯ ∫ θ_1^(α_1 − 1) ⋯ θ_K^(α_K − 1) dθ_1 ⋯ dθ_K = ∏_{i=1}^K Γ(α_i) / Γ(∑_{i=1}^K α_i)
Properties of the Dirichlet distribution

If θ ~ Dir(θ | α_1, ..., α_K), then

E[θ_k] = α_k / α_0

mode[θ_k] = (α_k − 1) / (α_0 − K)

where α_0 := ∑_{k=1}^K α_k is the total strength of the prior.
Likelihood, prior, posterior, evidence

Likelihood, prior, posterior (writing N = (N_1, ..., N_K) for the count vector):

P(N | θ) = ∏_{i=1}^K θ_i^(N_i)

p(θ | α) = D(θ | α) = (1 / Z(α)) θ_1^(α_1 − 1) θ_2^(α_2 − 1) ⋯ θ_K^(α_K − 1)

p(θ | N, α) = [1 / (Z(α) P(N | α))] θ_1^(α_1 + N_1 − 1) ⋯ θ_K^(α_K + N_K − 1) = D(α_1 + N_1, ..., α_K + N_K)

Marginal likelihood (evidence):

P(N | α) = Z(N + α) / Z(α) = [Γ(∑_k α_k) / Γ(N + ∑_k α_k)] ∏_k [Γ(N_k + α_k) / Γ(α_k)]
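This evidence formula can be evaluated stably with log-gamma functions. A sketch, checked against the closed form Γ(K) ∏_k N_k! / (N + K − 1)! that the formula reduces to under a uniform prior (all α_k = 1):

```python
import math

def log_evidence(counts, alphas):
    """log P(N | alpha) = log [Z(N + alpha) / Z(alpha)] for a Dirichlet prior."""
    def log_z(a):
        return sum(math.lgamma(x) for x in a) - math.lgamma(sum(a))
    posterior = [n + a for n, a in zip(counts, alphas)]
    return log_z(posterior) - log_z(alphas)

# K = 3 faces with counts (2, 3, 5) and a uniform Dirichlet(1, 1, 1) prior.
p = math.exp(log_evidence([2, 3, 5], [1.0, 1.0, 1.0]))
```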
Marginal likelihood as approximation of negative entropy

Marginal likelihood (evidence):

P(N | α) = [Γ(∑_k α_k) / Γ(N + ∑_k α_k)] ∏_k [Γ(N_k + α_k) / Γ(α_k)]

If α_k = 1, this becomes

P(N | α = 1) = [Γ(K) / Γ(N + K)] ∏_k [Γ(N_k + 1) / Γ(1)]

Using the fact that Γ(1) = 1 and Stirling's approximation log Γ(x + 1) ≈ x log x − x, we get

log P(N | α = 1) ≈ −N log N + N + ∑_k (N_k log N_k − N_k)
                 = ∑_k N_k log(N_k / N) = −N H({N_k / N})

(the +N and −∑_k N_k terms cancel since ∑_k N_k = N), where H(p) = −∑_k p_k log p_k is the entropy of a distribution.
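A quick numerical check of this approximation: the exact log evidence (via log-gamma, keeping the Γ(K)/Γ(N + K) factor) versus −N H, which agree to leading order for large counts:

```python
import math

def log_evidence_uniform(counts):
    """Exact log P(N | alpha = 1) = log Gamma(K) - log Gamma(N + K) + sum_k log N_k!"""
    n, k = sum(counts), len(counts)
    return (math.lgamma(k) - math.lgamma(n + k)
            + sum(math.lgamma(c + 1) for c in counts))

def neg_entropy_approx(counts):
    """-N * H({N_k / N}) = sum_k N_k log(N_k / N)."""
    n = sum(counts)
    return sum(c * math.log(c / n) for c in counts if c > 0)

exact = log_evidence_uniform([300, 700])
approx = neg_entropy_approx([300, 700])
```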
Entropy

Surprising (unlikely) events convey more information, so we define the information content of an observation (in bits) to be

h(x) = log_2 (1 / p(x))

The average information content of a random variable X is

H(X) = −∑_x p(x) log_2 p(x)
Data compression

There is a close link between density estimation and data compression.

The noiseless coding theorem (Shannon 1948) says that the entropy is a lower bound on the number of bits needed to transmit the state of X.

More likely states can be given shorter code words.

There is also a close link between data compression and model selection / hypothesis testing.
Hypothesis testing

When spun on edge N = 250 times, a Belgian one-euro coin came up heads Y = 141 times and tails 109 times.

"It looks very suspicious to me. We can reject the null hypothesis (that the coin is unbiased) with a significance level of 5%." - Barry Blight, LSE (modified from quote in The Guardian, 2002)

Does this mean P(H0 | D) < 0.05? Let us compare classical hypothesis testing with a Bayesian approach (using the marginal likelihood).
Classical hypothesis testing

We would like to distinguish two models, or hypotheses: H0 means the coin is unbiased (so p = 0.5); H1 means the coin is biased (has probability of heads p ≠ 0.5).

We need a decision rule that maps data to accept/reject. We will do this by computing a scalar quantity of our data called the deviance, d(D), and comparing its observed value with what we would expect if H0 were true.

We declare "H1" if d(D) > t for some threshold t (to be determined).

In our case, we will use d(D) = N_h, the number of heads.
P-values

The p-value of a threshold t is the probability of falsely rejecting the null hypothesis:

p(t) = P({D' : d(D') > t} | H0, N)

Intuitively, the p-value is the probability of getting data at least that extreme given H0.

Since computing the p-value requires summing over all possible datasets of size N, a standard approximation is to consider the expected distribution of d(D'), assuming D' ~ P(· | H0), as N → ∞.
Significance levels

The p-value of a threshold t is the probability of falsely rejecting the null hypothesis:

pval(t) = P({D' : d(D') > t} | H0, N)

We usually choose a threshold t so that the probability of a false rejection is below some significance level α = 0.05 (i.e., choose t such that pval(t) ≤ α).

This means that on average we will "only" be wrong 1 time in 20 (!).
Classical analysis of the euro-coin data

Blight used a two-sided test and found a p-value of 0.0497, so he said "we can reject the null hypothesis at significance level 0.05".

pval = P(Y ≥ 141 | H0) + P(Y ≤ 109 | H0)
     = (1 − P(Y < 141 | H0)) + P(Y ≤ 109 | H0)
     = (1 − P(Y ≤ 140 | H0)) + P(Y ≤ 109 | H0)
     = 0.0497

% Matlab:
n = 250; p = 0.5;
p1 = 1 - binocdf(140, n, p);  % P(Y >= 141)
p2 = binocdf(109, n, p);      % P(Y <= 109)
pval = p1 + p2
Classical analysis violates the likelihood principle

Why do we care about tail probabilities, such as

P(Y ≥ 141 | H0) = P(Y = 141 | H0) + P(Y = 142 | H0) + ⋯

when the number of heads we observed was 141, not 142 or larger?

"The use of P-values implies that a hypothesis that may be true can be rejected because it has not predicted observable results that have not actually occurred." - Jeffreys, 1961.

P-values (and therefore all classical hypothesis tests) violate the likelihood principle, which says that in order to choose between hypotheses H0 and H1 given observed data D, one should ask how likely the observed data are under each hypothesis; do not ask questions about data that we might have observed but did not.

For more examples, see "What is Bayesian statistics and why everything else is wrong", Michael Lavine (2000).
Bayesian approach

We want to compute the posterior ratio of the 2 hypotheses:

P(H1 | D) / P(H0 | D) = [P(D | H1) P(H1)] / [P(D | H0) P(H0)]

Let us assume a uniform prior P(H0) = P(H1) = 0.5.

Then we just focus on the ratio of the marginal likelihoods:

P(D | H1) = ∫_0^1 P(D | θ, H1) P(θ | H1) dθ

For H0, there is no free parameter, so

P(D | H0) = 0.5^N

where N is the number of coin tosses in D.
Ratio of evidences (Bayes factor)

We compute the ratio of marginal likelihoods (evidence):

BF(1, 0) = P(D | H1) / P(D | H0) = [Z(α_h + N_h, α_t + N_t) / Z(α_h, α_t)] · (1 / 0.5^N)
         = [Γ(140 + α) Γ(110 + α) / Γ(250 + 2α)] · [Γ(2α) / (Γ(α) Γ(α))] · 2^250

We compute BF(1, 0) for a range of prior strengths α_t = α_h = α. We must work in the log domain to avoid underflow!
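A sketch of the log-domain computation, using the counts as they appear in the formula on this slide:

```python
import math

def log_bayes_factor(alpha, n_h=140, n_t=110):
    """log BF(1,0) for the euro-coin data under a symmetric Beta(alpha, alpha) prior."""
    def log_z(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    n = n_h + n_t
    # log [Z(alpha + n_h, alpha + n_t) / Z(alpha, alpha)] - N log 0.5
    return log_z(alpha + n_h, alpha + n_t) - log_z(alpha, alpha) + n * math.log(2)

bf_uniform = math.exp(log_bayes_factor(1.0))  # alpha = 1: about 0.48
```

Working entirely in log space matters here: 0.5^250 underflows double precision, but its logarithm is a perfectly ordinary number.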
So, is the coin biased or not?

We compute the likelihood ratio vs hyperparameter α:

α:                   0.37   1     2.7   7.4   20    55    148   403   1096
P(H1|D) / P(H0|D):   0.21   0.48  0.82  1.32  1.77  1.92  1.72  1.34  1.21

For a uniform prior, P(H1|D) / P(H0|D) = 0.48, (weakly) favoring the fair coin hypothesis H0!

At best, for α ≈ 50, we can make the biased hypothesis twice as likely.

Not as dramatic as saying "we reject the null hypothesis (fair coin) with significance 5%".
Summary: Bayesian vs classical hypothesis testing

The Bayesian approach is simpler and more natural (no need for "p-values", "significance tests", etc.)

The Bayesian approach does not violate the likelihood principle.

The Bayesian approach allows the use of prior knowledge to prevent us from jumping to conclusions too hastily.

See the excellent tutorials on P-values and Bayes factors by Steven Goodman on the web page.
Another example: testing for independence

Suppose we are given N (x, y) pairs, where X has J possible values and Y has K. We want to know if X and Y are independent.

E.g., consider this contingency table:

        y = 1   y = 2   y = 3
x = 1    15      29      14
x = 2    46      83      56

Traditional approach: compare H0 (X ⊥ Y) vs H1 (X and Y dependent). Compute the p-value using the χ² statistic

d_χ²(D) = ∑_{x,y} (O_{x,y} − E_{x,y})² / E_{x,y} = ∑_{x,y} (N(x, y) − N P(x) P(y))² / (N P(x) P(y))

Let us consider a Bayesian approach.
Bayesian test for independence

If independent,

P(D | H0) = p(X | α_j·) p(Y | α_·k)

where α_j· and α_·k are different prior vectors.

If dependent,

P(D | H1) = p(X, Y | α_jk)

We want to compute

p(H0 | D) = p(D | H0) p(H0) / [p(D | H0) p(H0) + p(D | H1) p(H1)]
          = 1 / [1 + (p(D | H1) / p(D | H0)) (p(H1) / p(H0))]

If we assume p(H0) = p(H1), we can focus on the Bayes factor B = p(D | H0) / p(D | H1).
Bayesian test for independence

It is simple to show that

B = p(D | H0) / p(D | H1) = p(X | α_j·) p(Y | α_·k) / p(X, Y | α_jk)

  = [Γ(∑_jk α_jk) / Γ(N + ∑_jk α_jk)] · ∏_{j=1}^J [Γ(N_j· + α_j·) / Γ(α_j·)] · ∏_{k=1}^K [Γ(N_·k + α_·k) / Γ(α_·k)] · ∏_{j,k} [Γ(α_jk) / Γ(N_jk + α_jk)]

Using the entropy approximation, we get

log [p(D | H0) / p(D | H1)] ≈ −N H(N_j· / N) − N H(N_·k / N) + N H(N_jk / N)
                            = −N D(N_jk / N || (N_j· / N)(N_·k / N))
                            = −N I(X, Y)

where

D(p || q) := ∑_k p_k log(p_k / q_k)

is the Kullback-Leibler divergence between distributions p, q, and

I(X, Y) := D(P(X, Y) || P(X) P(Y))

is the mutual information between X and Y.
KL divergence (relative entropy)

KL(p || q) is a "distance" measure of q from p:

D(p || q) := ∑_k p_k log(p_k / q_k)

It is not strictly a distance, since it is asymmetric.

The KL can be rewritten as

D(p || q) = ∑_k p_k log p_k − ∑_k p_k log q_k = −∑_k p_k log q_k − H(p)

This makes it clear that the KL measures the extra number of bits we would need to use to encode X if we thought the distribution was q_k but it was actually p_k.

KL satisfies D(p || q) ≥ 0 with equality iff p = q.
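A small sketch verifying these properties (nonnegativity, asymmetry, and D(p || p) = 0) on a toy pair of distributions:

```python
import math

def kl(p, q):
    """D(p || q) = sum_k p_k log(p_k / q_k), in nats."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

forward = kl(p, q)  # positive, since p != q
reverse = kl(q, p)  # also positive, but a different value: KL is asymmetric
```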
Minimizing KL divergence is maximizing likelihood

We would like to find q(x | θ) s.t. D(p || q) is minimized, where p(x) is the "true" distribution.

Of course p(x) is unknown, but we can approximate it by the empirical distribution given samples. Then

KL(p || q) ≈ (1/N) ∑_n [log p(x_n) − log q(x_n | θ)]

Since p(x) is independent of θ, we find that

argmin_q KL(p || q) = argmax_q (1/N) ∑_n log q(x_n | θ)
Mutual information

The mutual information measures how close the joint and independent distributions are:

I(X, Y) := D(P(X, Y) || P(X) P(Y))

It is easy to show that the mutual information between X and Y is how much our uncertainty about Y decreases when we observe X (or vice versa):

I(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X)

I(X, Y) ≥ 0 with equality iff X ⊥ Y.
Bayesian test for independence

It is simple to show (homework!) that

B = p(D | H0) / p(D | H1) = p(X | α_j·) p(Y | α_·k) / p(X, Y | α_jk)

  = [Γ(∑_jk α_jk) / Γ(N + ∑_jk α_jk)] · ∏_{j=1}^J [Γ(N_j· + α_j·) / Γ(α_j·)] · ∏_{k=1}^K [Γ(N_·k + α_·k) / Γ(α_·k)] · ∏_{j,k} [Γ(α_jk) / Γ(N_jk + α_jk)]

Using the entropy approximation, we get

log [p(D | H0) / p(D | H1)] ≈ −N H(N_j· / N) − N H(N_·k / N) + N H(N_jk / N)
                            = −N D(N_jk / N || (N_j· / N)(N_·k / N))
                            = −N I(X, Y)

where D(p || q) := ∑_k p_k log(p_k / q_k) is the Kullback-Leibler divergence between distributions p, q, and I(X, Y) := D(P(X, Y) || P(X) P(Y)) is the mutual information between X and Y.
Chi square is an approximation to the mutual information

We showed

log [p(D | H0) / p(D | H1)] ≈ −N D(N_jk / N || (N_j· / N)(N_·k / N)) = −N I(X, Y)

If we make the additional approximation

D(p || q) ≈ ∑_k (p_k − q_k)² / (2 q_k)

then we recover the χ² statistic.
Are two histograms from the same distribution?

To see if two samples X and Y come from the same multinomial distribution, create an indicator variable C ∈ {1, 2} which specifies which data set each sample comes from:

        z = 1   z = 2   ⋯   z = K
c = 1    N_1     N_2    ⋯    N_K
c = 2    M_1     M_2    ⋯    M_K

If the two histograms are from the same distribution, then C is independent of Z. So just compute P(C ⊥ Z | D). As before, we get

log [P(D | same) / P(D | diff)] ≈ −N I(C, Z)

which can be further approximated using χ².