Supplement: Loglikelihood and Confidence Intervals


This topic relates to what we do in the analysis of discrete data in the following ways...

Review Topics

The following are the topics that are covered in this lesson:

Review of Important Concepts

Let X1, X2, ..., Xn be a simple random sample from a probability distribution f(x; θ).

A parameter θ of f(x; θ) is a variable that is a characteristic of f(x; θ).

A statistic T is any quantity that can be calculated from a sample; it's a function of X1, ..., Xn.

An estimate theta hat for θ is a single number that is a reasonable value for θ

An estimator theta hat for θ is a statistic that gives the formula for computing the estimate theta hat.

The likelihood of the sample is the joint PDF (or PMF)


The maximum likelihood estimate (MLE) formula maximizes L(θ):


If we use Xi's instead of xi's then theta hat is the maximum likelihood estimator. Usually, MLE:

The loglikelihood function is defined to be the natural logarithm of the likelihood function,

l(θ ; x) = logL(θ ; x).

For a variety of reasons, statisticians often work with the loglikelihood rather than with the likelihood. One reason is that l(θ ; x) tends to be a simpler function than L(θ ; x). When we take logs, products are changed to sums. If X = (X1, X2, . . . , Xn) is an iid sample from a probability distribution f(x; θ), the overall likelihood is the product of the likelihoods for the individual Xi's:


The loglikelihood, however, is the sum of the individual loglikelihoods:


Next, we will go through some examples of loglikelihood functions associated with different distributions.

Binomial Loglikelihood Functions

Suppose X ∼ Bin(n, p) where n is known. The likelihood function is


and so the loglikelihood function is

l(p ; x) = k + x log p + (nx) log(1 − p),

where k is a constant that does not involve the parameter p. In the future we will omit the constant, because it's statistically irrelevant.

Poisson Loglikelihood Functions

Suppose X = (X1, X2, . . . , Xn) is an iid sample from a Poisson distribution with parameter λ. The likelihood is


and the loglikelihood is


ignoring the constant terms that do not depend on λ. As the above examples show, l(θ ; x) often looks nicer than L(θ ; x) because the products become sums and the exponents become multipliers.

Asymptotic Confidence Intervals

Loglikelihood forms the basis for many approximate confidence intervals and hypothesis tests because it behaves in a predictable manner as the sample size grows. The following example illustrates what happens to l(θ ; x) as n becomes large.

Example: Online-Class Exercise

Suppose that we observe X = 1 from a binomial distribution with n = 4 and p unknown. Calculate the loglikelihood. What does the graph of loglikelihood look like? Find the MLE (do you understand the difference between the estimator and the estimate?) Locate the MLE on the graph of the likelihood.

The MLE is phat = 1/4 = .25. Ignoring constants, the loglikelihood is

l(p ; x) = log p + 3 log(1 − p),

which looks like this:


Here is a sample code for plotting this function in R:

loglik=log(p) + 3*log(1-p)

For clarity I omitted from the plot all values of p beyond .8, because for p > .8 the loglikelihood drops down so low that including these values of p would distort the plot's appearance. When plotting loglikelihoods, we don't need to include all θ values in the parameter space; in fact, it's a good idea to limit the domain to those θ's for which the loglikelihood is no more than 2 or 3 units below the maximum value l(theta hat; x) because, in a single-parameter problem, any θ whose loglikelihood is more than 2 or 3 units below the maximum is highly implausible.

Now suppose that we observe X = 10 from a binomial distribution with n = 40. The MLE is again p hat = 10/40 = .25, but the loglikelihood is

l(p ; x) = 10 log p + 30 log(1 − p),


Finally, suppose that we observe X = 100 from a binomial with n = 400. The MLE is still p hat = 100/400 = .25, but the loglikelihood is now

l(p ; x) = 100 log p + 300 log(1 − p),


As n gets larger, two things are happening to the loglikelihood. First, l(p ; x) is becoming more sharply peaked around p hat. Second, l(p ; x) is becoming more symmetric about p hat.

The first point shows that as the sample size grows, we are becoming more confident that the true parameter lies close to p hat. If the loglikelihood is highly peaked—that is, if it drops sharply as we move away from the MLE—then the evidence is strong that p is near p hat. A flatter loglikelihood, on the other hand, means that p is not well estimated and the range of plausible values is wide. In fact, the curvature of the loglikelihood (i.e. the second derivative of l(θ ; x) with respect to θ) is an important measure of statistical information about θ.

The second point, that the loglikelihood function becomes more symmetric about the MLE as the sample size grows, forms the basis for constructing asymptotic (large-sample) confidence intervals for the unknown parameter. In a wide variety of problems, as the sample size grows the loglikelihood approaches a quadratic function (i.e. a parabola) centered at the MLE.

The parabola is significant because that is the shape of the loglikelihood from the normal distribution.

If we had a random sample of any size from a normal distribution with known variance σ2 and unknown mean μ, the loglikelihood would be a perfect parabola centered at the formula

From elementary statistics, we know that if we have a sample from a normal distribution with known variance σ2, a 95% confidence interval for the mean μ is

formula        (1)

The quantity σ/√n is called the standard error; it measures the variability of the sample mean xbar about the true mean μ. The number 1.96 comes from a table of the standard normal distribution; the area under the standard normal density curve between −1.96 and 1.96 is .95 or 95%.

The confidence interval (1) is valid because over repeated samples the estimate xbar is normally distributed about the true value μ with a standard deviation of σ/√n.

There is much confusion about how to interpret a confidence interval (CI).

A CI is NOT a probability statement about θ since θ is a fixed value, not a random variable; unless you are doing a Bayesian analaysis and interperting a credible interval!

One interpretation: if we took many samples, most of our intervals would capture true parameter (e.g. 95% of out intervals will contain the true parameter).

Another Example

The nationwide telephone poll was conducted by NY Times/CBS News between Jan. 14-18. with 1118 adults. About 58% of respondents feel optimistic about next four years. The results are reported with a margin of error of 3%.

In STAT 504, the parameter of interest will not be the mean of a normal population, but some other parameter θ pertaining to a discrete probability distribution.

We will often estimate the parameter by its MLE theta hat.

But because in large samples the loglikelihood function l(θ ; x) approaches a parabola centered at theta hat, we will be able to use a method similar to (1) to form approximate confidence intervals for θ.

Just as x hat is normally distributed about μ, theta hat is approximately normally distributed about θ in large samples.

This property is called the “asymptotic normality of the MLE,” and the technique of forming confidence intervals is called the “asymptotic normal approximation.” This method works for a wide variety of statistical models, including all the models that we will use in this course.

The asymptotic normal 95% confidence interval for a parameter θ has the form

formula       (2)

where l"(theta hat; x) is the second derivative of the loglikelihood function with respect to θ, evaluated at θ = theta hat.

Of course, we can also form intervals with confidence coefficients other than 95%.

All we need to do is to replace 1.96 in (2) by z, a value from a table of the standard normal distribution, where ± z encloses the desired level of confidence.

If we wanted a 90% confidence interval, for example, we would use 1.645.

Observed and Expected Information

The quantity −l"(theta hat; x) is called the “observed information,” and 1√−l"(theta hat; x) is an approximate standard error for theta hat. As the loglikelihood becomes more sharply peaked about the MLE, the second derivative drops and the standard error goes down.

When calculating asymptotic confidence intervals, statisticians often replace the second derivative of the loglikelihood by its expectation; that is, replace −l"(theta hat; x) by the function

I(θ) = −E [l"(theta hat; x)],

which is called the expected information or the Fisher information. In that case, the 95% confidence interval would become

formula       (3)

When the sample size is large, the two confidence intervals (2) and (3) tend to be very close. In some problems, the two are identical.

Next, what follows are a few examples of asymptotic confidence intervals.

Bernoulli Asymptotic Confidence Intervals

If X is Bernoulli with success probability p, the loglikelihood is

l(p ; x) = x log p + (1 − x) log (1 − p),

the first derivative is


and the second derivative is


(to derive this, use the fact that x2 = x). Because

E [(xp)2] = V (x) = p(1 − p),

the Fisher information is


Of course, a single Bernoulli trial does not provide enough information to get a reasonable confidence interval for p. Let's see what happens when we have multiple trials.

Binomial Asymptotic Confidence Intervals

If X ∼ Bin(n, p), then the loglikelihood is

l(p ; x) = x log p + (nx) log (1 − p),

the first derivative is


the second derivative is


and the Fisher information is


Thus an approximate 95% confidence interval for p based on the Fisher information is

formula       (4)

where p hat = x/n is the MLE.

Notice that the Fisher information for the Bin(n, p) model is n times the Fisher information from a single Bernoulli trial. This is a general principle; if we observe a sample size n,

X = (X1, X2, . . . , Xn),

where X1, X2, . . . , Xn are independent random variables, then the Fisher information from X is the sum of the Fisher information functions from the individual Xi's. If X1, X2, . . . , Xn are iid, then the Fisher information from X is n times the Fisher information from a single observation Xi.

What happens if we use the observed information rather than the expected information? Evaluating the second derivative l"(p ; x) at the MLE p hat = x/n gives


so the 95% interval based on the observed information is identical to (4).

Unfortunately, Agresti (2002, p. 15) points out that the interval (4) performs poorly unless n is very large; the actual coverage can be considerably less than the nominal rate of 95%.

The confidence interval (4) has two unusual features:

A variety of fixes are available. One ad hoc fix, which can work surprisingly well, is to replace p hat by


which is equivalent to adding half a success and half a failure; that keeps the interval from becoming degenerate. To keep the endpoints within the parameter space, we can express the parameter on a different scale, such as the log-odds


which we will discuss later.

Poisson Asymptotic Confidence Intervals

If X = (X1, X2, . . . , Xn) is an iid sample from a Poisson distribution with parameter ", the loglikelihood is


the first derivative is


the second derivative is


and the Fisher information is


An approximate 95% interval based on the observed or expected information is


where formula is the MLE.

Once again, this interval may not perform well in some circumstances; we can often get better results by changing the scale of the parameter.

Alternative Parameterizations

Statistical theory tells us that if n is large enough, the true coverage of the approximate intervals (2) or (3) will be very close to 95%. How large n must be in practice depends on the particulars of the problem. Sometimes an approximate interval performs poorly because the loglikelihood function doesn't closely resemble a parabola. If so, we may be able to improve the quality of the approximation by applying a suitable reparameterization, a transformation of the parameter to a new scale. Here is an example.

Suppose we observe X = 2 from a binomial distribution Bin(20, p). The MLE is p hat = 2/20 = .10 and the loglikelihood is not very symmetric:


This asymmetry arises because p hat is close to the boundary of the parameter space. We know that p must lie between zero and one. When p hat is close to zero or one, the loglikelihood tends to be more skewed than it would be if p hat were near .5. The usual 95% confidence interval is


or (−.031, .231), which strays outside the parameter space.

The “logistic” or “logit” transformation is defined as

formula       (6)

The logit is also called the “log odds,” because p/(1 − p) is the odds associated with p. Whereas p is a proportion and must lie between 0 and 1, φ may take any value from − ∞ to + ∞, so the logit transformation solves the problem of a sharp boundary in the parameter space.

Solving (6) for p produces the back-transformation

formula       (7)

Let's rewrite the binomial loglikelihood in terms of φ:


Now let's graph the loglikelihood l(φ; x) versus φ:


It's still skewed, but not quite as sharply as before. This plot strongly suggests that an asymptotic confidence interval constructed on the φ scale will be more accurate in coverage than an interval constructed on the p scale. An approximate 95% confidence interval for φ is


where phi hat is a the MLE of φ, and I(φ) is the Fisher information for φ. To find the MLE for φ, all we need to do is apply the logit transformation to p hat:


Assuming for a moment that we know the Fisher information for φ, we can calculate this 95% confidence interval for φ. Then, because our interest is not really in φ but in p, we can transform the endpoints of the confidence interval back to the p scale. This new confidence interval for p will not be exactly symmetric—i.e. p hat will not lie exactly in the center of it—but the coverage of this procedure should be closer to 95% than for intervals computed directly on the p-scale.

The general method for reparameterization is as follows.

First, we choose a transformation φ = φ(θ) for which we think the loglikelihood will be symmetric.

Then we calculate theta hat, the MLE for θ, and transform it to the φ scale,

phi hat = φ(ˆθ).

Next we need to calculate I(phi hat), the Fisher information for φ. It turns out that this is given by

formula      (8)

where φ(theta hat) is the first derivative of φ with respect to θ. Then the endpoints of a 95% confidence interval for φ are:


The approximate 95% confidence interval for φ is [φ low, φ high]. The corresponding confidence interval for θ is obtained by transforming φ low and φ high back to the original θ scale. A few common transformations are shown in Table 1, along with their back-transformations and derivatives.

Table 1: Some common transformations, their back transformations, and derivatives.


Going back to the binomial example with n = 20 and X = 2, let's form a 95% confidence interval for φ = log p/(1 − p). The MLE for p is p hat = 2/20 = .10, so the MLE for φ is phi hat = log(.1/.9) = −2.197. Using the derivative of the logit transformation from Table 1, the Fisher information for φ is


Evaluating it at the MLE gives

I( phi hat) = 20 × 0.1 × 0.9 = 1.8

The endpoints of the 95% confidence interval for φ are


and the corresponding endpoints of the confidence interval for p are


The MLE p hat = .10 is not exactly in the middle of this interval, but who says that a confidence interval must be symmetric about the point estimate?

Intervals Based on the Likelihood Ratio

Another way to form a confidence interval for a single parameter is to find all values of θ for which the loglikelihood l(θ ; x) is within a given tolerance of the maximum value l(phi hat ; x). Statistical theory tells us that, if θ0 is the true value of the parameter, then the likelihood-ratio statistic

formula       (9)

is approximately distributed as %21 when the sample size n is large. This gives rise to the well known likelihood-ratio (LR) test.

In the LR test of the null hypothesis

H0 : θ = θ0

versus the two-sided alternative

H1 : θ ≠ θ0,

we would reject H0 at the α-level if the LR statistic (9) exceeds the 100(1 − α)th percentile of the %21 distribution. That is, for an α = .05-level test, we would reject H0 if the LR statistic is greater than 3.84.

The LR testing principle can also be used to construct confidence intervals. An approximate 100(1 − α)% confidence interval for θ consists of all the possible θ0's for which the null hypothesis H0 : θ = θ0 would not be rejected at the α level. For a 95% interval, the interval would consist of all the values of θ for which

2 [ l (theta hat ; x) − l (θ ; x)] ≤ 3.84


l(θ ; x) ≥ l(theta hat ; x) − 1.92.

In other words, the 95% interval includes all values of θ for which the loglikelihood function drops off by no more than 1.92 units.

Returning to our binomial example, suppose that we observe X = 2 from a binomial distribution with n = 20 and p unknown. The graph of the loglikelihood function

l(p ; x) = 2 log p + 18 log(1 − p)

looks like this,


the MLE is p hat = x/n = .10, and the maximized loglikelihood is

l(p hat ; x) = 2 log .1 + 18 log .9 = −6.50.

Let's add a horizontal line to the plot at the loglikelihood value −6.50 − 1.92 = −8.42:


The horizontal line intersects the loglikelihood curve at p = .018 and p = .278. Therefore, the LR confidence interval for p is (.018, .278).

When n is large, the LR method will tend to produce intervals very similar to those based on the observed or expected information. Unlike the information-based intervals, however, the LR intervals are scale-invariant. That is, if we find the LR interval for a transformed version of the parameter such as φ = log p/(1 − p) and then transform the endpoints back to the p-scale, we get exactly the same answer as if we apply the LR method directly on the p-scale. For that reason, statisticians tend to like the LR method better.

If the loglikelihood function expressed on a particular scale is nearly quadratic, then a information-based interval calculated on that scale will agree closely with the LR interval. Therefore, if the information-based interval agrees with the LR interval, that provides some evidence that the normal approximation is working well on that particular scale. If the information-based interval is quite different from the LR interval, the appropriateness of the normal approximation is doubtful, and the LR approximation is probably better.


© 2006 The Pennsylvania State University. All rights reserved.