ENTROPY

ALI E. ABBAS
University of Illinois at Urbana-Champaign
Urbana, Illinois

1. INTRODUCTION: WHAT IS ENTROPY?

Many ways to describe the entropy of a system exist. One method interprets the entropy as the quantity of heat, Q, absorbed in a reversible process at temperature T, divided by that temperature:

\[ \text{Entropy} = \frac{Q}{T}. \]

Another method interprets entropy as the amount of energy in a system that is unavailable to do work. The entropy of a system is also a measure of its disorder. The second law of thermodynamics states that the entropy of an isolated system is non-decreasing. As a result, the longer that time elapses, the more disordered the system becomes, until it reaches maximum disorder (or maximum entropy). Consider, for example, the set of molecules in the closed system shown in Fig. 1. The second law of thermodynamics suggests that the system on the left-hand side would precede the system on the right-hand side, as it has less disorder and less entropy.

Entropy is also a measure of the multiplicity of the states of a system. To illustrate this point further, consider tossing a die twice and observing the sum of the two throws. For two throws of the die, the sum can be any of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, or 12.

Assume that the sum is a state variable. Figure 2 shows the possible realizations for each of these states. Each state can be realized in a different number of ways, and some of these states are more likely to occur than others.

If the sum is considered as a macrostate for this system and the possible realizations are considered as microstates, the total number of macrostates found is 11, and the number of microstates is 36. As can be seen, it is more likely that the sum of 7 will appear, as it has a larger number of ways by which it can be realized. If the multiplicity of each state, \(\Omega\), is defined as the number of its microstates, it is said that state 7 has a larger multiplicity than any of the other states. The entropy of each macrostate is proportional to the logarithm of its multiplicity,

\[ \text{entropy} \propto \ln(\Omega). \]

In this example, the multiplicity of each macrostate is proportional to its probability of occurrence, because the probability of a macrostate is simply the multiplicity of this macrostate divided by the total multiplicity of the system. This observation leads to a statistical definition of the entropy of a given macrostate as a measure of the "probability" of its occurrence. The average entropy of the macrostates can also be defined as the average entropy of the system.

The previous definition of entropy using the probability of its macrostates is the basis of Claude Shannon's technical report "A mathematical theory of communication" (1). Shannon's work became the Magna Carta of the information age and the dawn of a new science called information theory. Information theory started with the discovery of the fundamental laws of data compression and transmission. For example, in communication systems, the two main applications of information theory are:

1. Optimal coding of a message, thereby setting a limit on the amount of compression for a source of information or a random process, and

2. The maximum permissible transmission bandwidth for a channel, known as the channel capacity.

Information theory has now become a unifying theory with profound intersections with probability, statistics, computer science, management science, economics, bioinformatics, and politics, and it has numerous applications in many other fields. In the following sections, an intuitive interpretation of entropy as a measure of information is provided. Other elements of information theory are also presented, together with several interpretations and applications for their use.

2. ENTROPY AS A MEASURE OF INFORMATION

Information theory was born when Shannon made the key observation that a source of information should be modeled as a random process. The following is a quote from his original paper:

We can think of a discrete source as generating the message, symbol by symbol. It will choose successive symbols according to certain probabilities depending, in general, on preceding choices as well as the particular symbols in question. A physical system, or a mathematical model of a system which produces such a sequence of symbols governed by a set of probabilities, is known as a stochastic process. We may consider a discrete source, therefore, to be represented by a stochastic process. Conversely, any stochastic process which produces a discrete sequence of symbols chosen from a finite set may be considered a discrete source.

Figure 1. Entropy as a measure of disorder (left: low entropy; right: high entropy).


Wiley Encyclopedia of Biomedical Engineering, Copyright © 2006 John Wiley & Sons, Inc.


Shannon realized the importance of a measure that quantifies the amount of information (or uncertainty) about a discrete random process, as it will also describe the amount of information that the process produces. He then posed the following question:

Can we define a quantity which will measure, in some sense, how much information is "produced" by such a process, or better, at what rate information is produced?

In an attempt to answer his own question, Shannon suggested that the measure of information should satisfy three essential axioms, and he proposed an information measure that uniquely satisfies these axioms. The following is taken from Shannon's original paper:

Suppose we have a set of possible events whose probabilities of occurrence are \(p_1, p_2, \ldots, p_n\). These probabilities are known but that is all we know concerning which event will occur. Can we find a measure of how uncertain we are of the outcome? If there is such a measure, say \(H(p_1, p_2, \ldots, p_n)\), it is reasonable to require of it the following properties:

1. H should be continuous in the \(p_i\).
2. If all the \(p_i\) are equal (\(p_i = \tfrac{1}{n}\)), then H should be a monotonic increasing function of n.
3. If an event can be further decomposed into successive events, the original H should be the weighted sum of the individual values of each H.

The third axiom is an important property of this measure of information. It will be explained further using the following example from Shannon's original paper.

Consider the two equivalent decision trees of Fig. 3. The tree on the left-hand side describes events that will occur with probabilities \(\{\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6}\}\). The tree on the right-hand side is an equivalent tree that has the same probabilities of the events but is further decomposed into conditional probabilities that are assigned to match the original tree on the left-hand side.

Shannon's third axiom requires that the measure of information calculated for the tree on the left-hand side, \(H(\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6})\), be numerically equal to that calculated using the marginal and conditional probabilities from the tree on the right-hand side. The information measure, he proposed, should be calculated from the right-hand-side tree using the marginal and conditional probabilities as \(H(\tfrac{1}{2}, \tfrac{1}{2}) + \tfrac{1}{2}H(\tfrac{2}{3}, \tfrac{1}{3})\), where the coefficient \(\tfrac{1}{2}\) appears because the decomposition occurs half the time.

Shannon pointed out that the only measure that satisfies his three axioms is

\[ H(X) = -k\sum_{i=1}^{n} p(x_i)\log(p(x_i)), \]

where k is a positive constant that will be assigned a value of unity.

Shannon called this term the entropy and proposed it as a measure of information (or uncertainty) about a discrete random variable having a probability mass function p(x). To get an intuitive feel for this entropy expression, consider the following example.

2.1. Example 1: Interpretation of the Entropy

Consider a random variable, X, with four possible outcomes, 0, 1, 2, 3, and with probabilities of occurrence \(\{\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8}\}\), respectively. Calculate the entropy of X using Shannon's entropy definition and using base two for the logarithm in the entropy expression.

Figure 2. Entropy as a measure of multiplicity. (The macrostates are the sums 2 through 12, with multiplicities Ω(2)=1, Ω(3)=2, Ω(4)=3, Ω(5)=4, Ω(6)=5, Ω(7)=6, Ω(8)=5, Ω(9)=4, Ω(10)=3, Ω(11)=2, Ω(12)=1; 36 microstates and 11 macrostates in total.)

Figure 3. Decomposability of the entropy measure. (Two equivalent decision trees illustrating \(H(\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6}) = H(\tfrac{1}{2}, \tfrac{1}{2}) + \tfrac{1}{2}H(\tfrac{2}{3}, \tfrac{1}{3})\).)


\[ H(X) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{8}\log_2\tfrac{1}{8} - \tfrac{1}{8}\log_2\tfrac{1}{8} = 1\tfrac{3}{4}. \]

One intuitive way to explain this number is to consider the minimum expected number of binary (yes/no) questions needed to describe an outcome of X. The most efficient question-selection procedure in this example is to start with the outcome that has the highest probability of occurrence: first ask, does X = 0? If it does, then X has been determined in one question. If it does not, ask, does X = 1? Again, if it does, X has been determined in two questions; if it does not, ask, does X = 2? The question "does X = 3?" never needs to be asked, because if X is not 0, 1, or 2, it must be 3. The expected number of binary questions needed to determine X is then

\[ E(X) = 1\cdot\tfrac{1}{2} + 2\cdot\tfrac{1}{4} + 3\cdot\tfrac{1}{8} + 3\cdot\tfrac{1}{8} = 1\tfrac{3}{4}. \]

Note that the expected number of binary questions needed to determine X in this example was equal to the entropy of the random variable X:

\[ H(X) = E(X). \]

Although this equality does not always hold, an optimal question-selection process based on entropy coding principles always exists such that the expected number of questions is bounded by the following well-known data compression inequality (explained in further detail in Cover and Thomas (2), page 87):

\[ H(X) \le \text{Expected number of questions} \le H(X) + 1. \]

The entropy of a discrete random variable thus has an intuitive interpretation: it provides some notion of the expected number of questions needed to describe the outcome of this random variable. The fact that binary questions were used in this example relates to the base two in the logarithm of the entropy expression. For example, if the entropy had been calculated using base four in the logarithm, the result would provide an idea of the expected number of four-way questions needed to determine X. In coding theory, entropy also provides some notion of the expected length of the optimal code needed to describe the output of a random process. Note that with these interpretations, the entropy of a discrete random variable cannot be negative.
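As a quick check of Example 1, the short Python sketch below (an added illustration, not part of the original article) computes the entropy of the given probability mass function and the expected number of yes/no questions under the greedy questioning scheme described above.

```python
from math import log2

# Probability mass function of X from Example 1 (outcomes 0, 1, 2, 3).
p = [1/2, 1/4, 1/8, 1/8]

# Shannon entropy in bits: H(X) = -sum p_i * log2(p_i).
H = -sum(pi * log2(pi) for pi in p)

# Expected number of yes/no questions under the greedy scheme of Example 1:
# "X = 0?" costs 1 question, "X = 1?" costs 2, and the two remaining outcomes cost 3.
questions = [1, 2, 3, 3]
E_questions = sum(q * pi for q, pi in zip(questions, p))

print(H, E_questions)  # both equal 1.75 bits for this dyadic distribution
```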

2.2. Entropy as a Measure of Spread

The entropy of a discrete random variable is also a measure of the spread of its probability mass function. The minimum entropy (minimum spread) occurs when the variable is deterministic and has no uncertainty, which is equivalent to a probability mass function where one outcome has a probability equal to one and the remaining outcomes have a probability of zero:

\[ H(X) = -1\cdot\log(1) = 0. \]

The maximum entropy (maximum spread) probability mass function of a discrete random variable with m outcomes occurs when all the outcomes have equal probability of \(\tfrac{1}{m}\). This probability mass function has the maximum entropy value

\[ H(X) = -\sum_{i=1}^{m}\frac{1}{m}\log\!\left(\frac{1}{m}\right) = \log(m). \]

As the number of outcomes, m, increases, the entropy of a probability mass function with equally likely outcomes also increases logarithmically, which agrees with Shannon's second axiom.
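As a small added illustration (not from the article), the sketch below compares the entropy of a deterministic, an intermediate, and a uniform probability mass function over the same four outcomes, showing the spread interpretation numerically.

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

print(entropy([1, 0, 0, 0]))          # 0.0  : deterministic, minimum spread
print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75 : intermediate spread
print(entropy([1/4, 1/4, 1/4, 1/4]))  # 2.0  : uniform, maximum spread = log2(4)
```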

2.3. Entropy as a Measure of Realizations

The entropy of a discrete random variable also has an interpretation in terms of the number of ways that the outcomes of an experiment can be realized. Consider, for example, a random variable X with m possible outcomes, \(x_1, x_2, \ldots, x_m\), and an experiment in which n independent trials have been observed, with each trial giving a possible realization of the variable x. Now consider the number of possible observation sequences that can be obtained with this experiment. Using factorials, the number of possible sequences that can be obtained for n trials with the given outcome counts is

\[ W = \frac{n!}{n_1!\,n_2!\cdots n_m!}, \]

where \(n_1, n_2, \ldots, n_m\) are the numbers of observations of outcomes \(x_1, x_2, \ldots, x_m\), respectively, such that \(\sum_{i=1}^{m} n_i = n\).

If the number of observations, n, as well as \(n_1, n_2, \ldots, n_m\), are large compared with m, then Stirling's approximation can be used for the factorial:

\[ n! \approx e^{-n}n^{n}\sqrt{2\pi n}. \]

Alternatively, it can be written as

\[ \log(n!) = n\log(n) - n + O(\log(n)). \]


Now, it can be written that

\[ \log(W) = \log(n!) - \sum_{i=1}^{m}\log(n_i!) = n\log(n) - n - \sum_{i=1}^{m} n_i\log(n_i) + \sum_{i=1}^{m} n_i + O(m\log(n)) = -\sum_{i=1}^{m} n_i\log\!\left(\frac{n_i}{n}\right) + O(m\log(n)). \]

For large n, the quantity

\[ \frac{1}{n}\log(W) \approx -\sum_{i=1}^{m}\frac{n_i}{n}\log\!\left(\frac{n_i}{n}\right) = -\sum_{i=1}^{m} f_i\log(f_i) \]

is obtained, where \(f_i\) is the fraction of observations for outcome \(x_i\) and converges to the probability of \(x_i\) as n increases. The entropy is thus equal to the logarithm of the number of possible realizations of an experiment divided by the total number of trials, as the number of trials approaches infinity.
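The combinatorial interpretation can be checked numerically. The sketch below (an added illustration, not from the article) computes \((1/n)\ln W\) for counts proportional to a fixed set of fractions and compares it with \(-\sum_i f_i\ln f_i \approx 1.213\) nats.

```python
from math import lgamma, log

def log_multiplicity(counts):
    """ln W = ln( n! / (n_1! n_2! ... n_m!) ) computed via log-gamma."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)

fractions = [0.5, 0.25, 0.125, 0.125]
entropy_nats = -sum(f * log(f) for f in fractions)

for n in (8, 80, 800, 8000):
    counts = [int(f * n) for f in fractions]
    print(n, log_multiplicity(counts) / n, entropy_nats)
# (1/n) ln W approaches the entropy (about 1.21 nats) as n increases
```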

2.4. Entropy as a Measure of the Probability of a Sequence: The Asymptotic Equipartition Theorem (AEP)

The AEP theorem states that if a set of values \(X_1, X_2, \ldots, X_n\) is drawn independently from a random variable X distributed according to P(x), then the joint probability \(P(X_1, X_2, \ldots, X_n)\) of observing the sequence satisfies

\[ -\frac{1}{n}\log_2\big(P(X_1, X_2, \ldots, X_n)\big) \to H(X). \]

Rearranging the terms in the AEP expression gives

\[ P(X_1, X_2, \ldots, X_n) \approx 2^{-nH(X)}. \]

The AEP theorem also implies that almost all sequences drawn from an independent and identically distributed source will have approximately the same probability in the long run. Of course, when n is small this may not be true, but as n gets larger, more and more sequences satisfy this equipartition property.

Let us define \(A_{\epsilon}^{n}\) as the set of all sequences that satisfy the equipartition property for a given n and a given \(\epsilon\), such that

\[ 2^{-n(H(X)+\epsilon)} \le p(X_1, X_2, \ldots, X_n) \le 2^{-n(H(X)-\epsilon)}. \]

The set \(A_{\epsilon}^{n}\) is also known as the typical set. The cardinality of the set \(A_{\epsilon}^{n}\) is bounded by

\[ 2^{n(H(X)-\epsilon)} \le |A_{\epsilon}^{n}| \le 2^{n(H(X)+\epsilon)}. \]

Furthermore, because all such sequences have approximately the same probability for large n, the number of possible realizations of the sequences, W, is

\[ W \approx \frac{1}{P(X_1, X_2, \ldots, X_n)} = 2^{nH(X)}, \]

which is the result obtained earlier for entropy as a measure of the long-run number of possible realizations of a sequence.
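The following sketch (an added illustration, not from the article) draws i.i.d. sequences from the distribution of Example 1 and checks that \(-\tfrac{1}{n}\log_2 P(X_1, \ldots, X_n)\) concentrates around \(H(X) = 1.75\) bits as n grows.

```python
import random
from math import log2

random.seed(0)
outcomes = [0, 1, 2, 3]
probs = [1/2, 1/4, 1/8, 1/8]
H = -sum(p * log2(p) for p in probs)  # 1.75 bits

def per_symbol_log_prob(n):
    """Draw an i.i.d. sequence of length n and return -(1/n) log2 of its probability."""
    seq = random.choices(outcomes, weights=probs, k=n)
    return -sum(log2(probs[x]) for x in seq) / n

for n in (10, 100, 1000, 10000):
    print(n, per_symbol_log_prob(n), H)
# the per-symbol value approaches H(X) as n increases (AEP)
```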

2.5. Differential Entropy

Now that the entropy of a discrete random variable has been discussed, the extension of this definition to the continuous case will be discussed. In continuous form, the integral

\[ h(x) = -\int f(x)\ln(f(x))\,dx \]

is known as the differential entropy of a random variable having a probability density function f(x). As opposed to the discrete case, however, the differential entropy can be negative, which is illustrated by the following example.

2.5.1. Example 2: Entropy of a Uniform Bounded Random Variable. Consider a random variable x with a uniform probability density function on the domain [0, a],

\[ f(x) = \frac{1}{a}, \quad 0 \le x \le a. \]

The entropy of this random variable is

\[ h(x) = -\int_0^a f(x)\log(f(x))\,dx = -\int_0^a \frac{1}{a}\log\!\left(\frac{1}{a}\right)dx = \log(a). \]

Note that the entropy expression can be positive or negative depending on the value of a: if 0 < a < 1, the differential entropy is negative; if a = 1, the differential entropy is zero; and if a > 1, the differential entropy is positive. The reader can also verify that the differential entropy of a uniform distribution over the domain [a, b] is equal to \(\log(b-a)\). Thus, the entropy of a uniform distribution is determined only by the width of its interval and will not change by adding a shift to the values of a and b; the entropy is therefore invariant to a shift transformation but not to a scale transformation.
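The sign behavior can be verified numerically. The short sketch below (an added illustration, not from the article) approximates \(h(x) = -\int f\ln f\,dx\) for a uniform density on [0, a] by a Riemann sum and compares the result with \(\ln(a)\).

```python
from math import log

def diff_entropy_uniform(a, steps=100000):
    """Numerically approximate -∫ f ln f dx for f(x) = 1/a on [0, a]."""
    dx = a / steps
    f = 1.0 / a
    return -sum(f * log(f) * dx for _ in range(steps))

for a in (0.5, 1.0, 2.0):
    print(a, diff_entropy_uniform(a), log(a))
# negative for a < 1, zero for a = 1, positive for a > 1
```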

2.5.2. Example 3: Entropy of a Normal Distribution. If a random variable x has a normal distribution with mean \(\mu\) and variance \(\sigma^2\), its differential entropy using the natural logarithm is equal to

\[ h(x) = \frac{1}{2}\log(2\pi e\sigma^2), \]

where e is the base of the natural logarithm. It is again noted that the entropy is a function of the variance only and not of the mean. As such, it is invariant under a shift transformation but is not invariant under a scale transformation. Note further that when the standard deviation


\[ \sigma < \frac{1}{\sqrt{2\pi e}}, \]

the entropy is negative, and it is positive when \(\sigma > \frac{1}{\sqrt{2\pi e}}\).

2.6. Entropy of a Discretized Variable

The possibility of a negative sign in the last two examples shows that the expected-number-of-questions interpretation of entropy does not carry over to the continuous case. Nevertheless, numerous applications of entropy exist in the continuous case. Furthermore, the differential entropy can be related to the discrete entropy by discretizing the continuous density into intervals of length \(\Delta\), as shown in Fig. 4, and using the substitution \(p_i = f(x_i)\Delta\) in the discrete entropy expression as follows:

\[ H(X^{\Delta}) = -\sum_{i=-\infty}^{\infty} p_i\log p_i = -\sum_{i=-\infty}^{\infty} f(x_i)\Delta\log(f(x_i)\Delta) = -\sum_i \Delta\,f(x_i)\log(f(x_i)) - \log(\Delta), \]

where \(H(X^{\Delta})\) is the entropy of the discretized random variable.

If \(f(x)\log(f(x))\) is Riemann integrable, then the first term on the right-hand side of the discretized entropy expression approaches the integral of \(-f(x)\log(f(x))\) by definition of Riemann integrability. Thus,

\[ H(X^{\Delta}) + \log(\Delta) \to h(x) \quad \text{as } \Delta \to 0. \]
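The convergence of the discretized entropy can also be checked numerically. The sketch below (an added illustration, not from the article) discretizes the density \(f(x) = 2x\) on [0, 1] with bin width \(\Delta\) and verifies that \(H(X^{\Delta}) + \ln\Delta\) approaches the differential entropy \(h(x) = \tfrac{1}{2} - \ln 2 \approx -0.193\) nats.

```python
from math import log

def discretized_entropy(f, a, b, delta):
    """H(X^Δ) = -Σ p_i ln p_i with p_i = f(x_i) Δ, x_i at bin midpoints."""
    n_bins = int((b - a) / delta)
    H = 0.0
    for i in range(n_bins):
        x_mid = a + (i + 0.5) * delta
        p = f(x_mid) * delta
        if p > 0:
            H -= p * log(p)
    return H

f = lambda x: 2.0 * x   # density on [0, 1]
h_exact = 0.5 - log(2)  # differential entropy in nats, about -0.193

for delta in (0.1, 0.01, 0.001):
    print(delta, discretized_entropy(f, 0.0, 1.0, delta) + log(delta), h_exact)
```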

Now, several other terms of information theory that relate to the entropy expression are presented.

2.7. Relative Entropy

Kullback and Leibler (3) generalized Shannon's definition of entropy and introduced a measure that is used frequently to determine the "closeness" between two probability distributions. It is now known as the Kullback–Leibler (KL) measure of divergence, and is also known as the relative entropy or cross entropy.

The KL measure from a distribution P to a distribution Q is defined as

\[ \text{KL measure} = \sum_{i=1}^{n} p(x_i)\log\frac{p(x_i)}{q(x_i)}. \]

The KL measure is non-negative and is zero if and only if the two distributions are identical, which makes it an attractive candidate to measure the divergence of one distribution from another. However, the KL measure is not symmetric: the KL measure from P to Q is not necessarily equal to the KL measure from Q to P. Furthermore, the KL measure does not follow the triangle inequality; for example, given three distributions P, Q, and Z, the KL measure from P to Q can be greater than the sum of the KL measures from P to Z and from Z to Q. The following example provides an intuitive interpretation for the KL measure.

2.7.1. Example 4: Interpretation of the Relative Entropy. Refer back to Example 1 and consider another person wishing to determine the value of the outcome of X. However, suppose that this person does not know the generating distribution, P, and believes the outcome is generated by a different probability distribution, \(Q = (\tfrac{1}{8}, \tfrac{1}{8}, \tfrac{1}{2}, \tfrac{1}{4})\). The KL measure (relative entropy) from P to Q is

\[ \text{KL measure from } P \text{ to } Q = \tfrac{1}{2}\log_2\!\frac{1/2}{1/8} + \tfrac{1}{4}\log_2\!\frac{1/4}{1/8} + \tfrac{1}{8}\log_2\!\frac{1/8}{1/2} + \tfrac{1}{8}\log_2\!\frac{1/8}{1/4} = \tfrac{7}{8}. \]

Now, calculate the expected number of binary questions needed for this second person to determine X. His question selection based on the distribution Q would be "is X = 2?", "is X = 3?", "is X = 0?".

The expected number of binary questions needed to determine X is then

\[ E(X) = 1\cdot\tfrac{1}{8} + 2\cdot\tfrac{1}{8} + 3\cdot\tfrac{1}{2} + 3\cdot\tfrac{1}{4} = \tfrac{21}{8}. \]

Note that the difference in the expected number of questions needed to determine X under the two distributions, P and Q, is \(\tfrac{21}{8} - 1\tfrac{3}{4} = \tfrac{7}{8}\), which is the KL measure from P to Q.

This example provides an intuitive explanation for the KL measure as a measure of "closeness" between a distribution P and another distribution Q. Think of the KL measure as the increase in the expected number of questions that distribution Q would introduce if it were used to determine the outcome of a random variable that was generated by the distribution P. As entropy is also a measure of the compression limit for a discrete random variable, one can think of the KL measure as the inefficiency in compression introduced by distribution Q when it is used to code a stochastic process generated by distribution P.
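A short numerical check of Example 4 (an added sketch, not from the article) computes the KL measure from P to Q and the difference in expected question counts between the two questioners.

```python
from math import log2

P = [1/2, 1/4, 1/8, 1/8]   # true generating distribution over outcomes 0..3
Q = [1/8, 1/8, 1/2, 1/4]   # the second person's assumed distribution

# KL measure (relative entropy) from P to Q, in bits.
kl = sum(p * log2(p / q) for p, q in zip(P, Q))

# Expected question counts: each person asks about outcomes in decreasing
# order of *their* assumed probability; costs are evaluated under P.
cost_P = [1, 2, 3, 3]      # ordering 0, 1, 2 (then 3 by elimination)
cost_Q = [3, 3, 1, 2]      # ordering 2, 3, 0 (then 1 by elimination)
E_P = sum(c * p for c, p in zip(cost_P, P))
E_Q = sum(c * p for c, p in zip(cost_Q, P))

print(kl, E_Q - E_P)       # both equal 0.875 = 7/8
```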

3. RELATIVE ENTROPY OF CONTINUOUS VARIABLES

Similar to the entropy expression, the relative entropy also has an expression in continuous form, from a probability density f(x) to a target (reference) density q(x):

\[ \text{KL measure} = \int_a^b f(x)\ln\!\left(\frac{f(x)}{q(x)}\right)dx. \]

3.1. Joint Entropy

Shannon's entropy definition works well for describing the uncertainty (or information) about a single random variable. It is natural to extend this definition to describe the entropy of two or more variables using their joint distribution. For example, when considering two discrete random variables, X and Y, one can measure the amount of information (or uncertainty) associated with them using the definition of their joint entropy as

\[ H(X, Y) = -\sum_{i=1}^{m_1}\sum_{j=1}^{m_2} p(x_i, y_j)\log(p(x_i, y_j)), \]

where X and Y may take \(m_1\) and \(m_2\) possible values, respectively, and \(p(x_i, y_j)\) is their joint probability. The joint entropy is a symmetric measure that ranges from zero (when both variables are deterministic) to a maximum value of \(\log(m_1) + \log(m_2)\) (when the variables are independent and each is uniformly distributed). In other words, the joint entropy of two discrete random variables is bounded by

\[ 0 \le H(X, Y) \le \log(m_1) + \log(m_2). \]

3.2. Mutual Information

The mutual information, I(X; Y), between two variables X and Y is the KL measure between their joint distribution and the product of their marginal distributions:

\[ I(X; Y) = \sum_{i=1}^{m_1}\sum_{j=1}^{m_2} p(x_i, y_j)\log\!\left(\frac{p(x_i, y_j)}{p(x_i)\,p(y_j)}\right). \]

The value of the mutual information is larger when knowing the outcome of one variable leads to a larger expected reduction in the entropy of the other. In computational biology, for example, the mutual information quantifies how much information the expression level of one gene provides about the expression level of the other. The mutual information is thus a symmetric measure of dependence between the two variables and is zero if and only if the two variables are independent, which is a direct result of the properties of the KL measure, which is zero if and only if the two distributions are identical.

The mutual information can also be defined for continuous variables using a joint probability density and marginal probability densities, with a double integral replacing the summation:

\[ I(X; Y) = \int\!\!\int f(x, y)\log\!\left(\frac{f(x, y)}{f(x)f(y)}\right)dx\,dy. \]

The mutual information measure is invariant to any invertible transformation of the individual variables.

The mutual information can be generalized to a larger number of variables, yielding the multi-information measure

\[ \int\!\!\cdots\!\!\int f(x_1, x_2, \ldots, x_n)\log\!\left(\frac{f(x_1, x_2, \ldots, x_n)}{f(x_1)f(x_2)\cdots f(x_n)}\right)dx_1\,dx_2\cdots dx_n. \]
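In the discrete case, the mutual information can be computed directly from a joint probability table, as in the sketch below (an added illustration using a made-up joint distribution, not from the article).

```python
from math import log2

# Hypothetical 2x2 joint probability table p(x, y), for illustration only.
joint = [[0.30, 0.20],
         [0.10, 0.40]]

px = [sum(row) for row in joint]                             # marginal of X
py = [sum(joint[i][j] for i in range(2)) for j in range(2)]  # marginal of Y

# I(X;Y) = sum p(x,y) log2( p(x,y) / (p(x) p(y)) )
I = sum(joint[i][j] * log2(joint[i][j] / (px[i] * py[j]))
        for i in range(2) for j in range(2) if joint[i][j] > 0)

print(I)  # about 0.12 bits; it would be 0 if X and Y were independent
```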

3.3. Conditional Entropy

The conditional entropy of Y given X is defined as

\[ H(Y|X) = -\sum_{i=1}^{m_1}\sum_{j=1}^{m_2} p(x_i, y_j)\log(p(y_j|x_i)), \]

where \(p(y_j|x_i)\) is the conditional probability of \(y_j\) given \(x_i\). The conditional entropy can also be defined as the joint entropy of X and Y minus the entropy of X. In other words, it is the difference between the joint information of X and Y and the information brought by X:

\[ H(Y|X) = H(X, Y) - H(X). \]

One can also define the conditional entropy of Y given X as the difference between the entropy (information) in Y and the mutual information of X and Y:

\[ H(Y|X) = H(Y) - I(X; Y). \]

This equation shows that the mutual information is the reduction in uncertainty about Y brought by knowledge of X. If the two variables are probabilistically independent, their mutual information is zero and the conditional entropy of Y given X is simply equal to the entropy of Y, which places the following upper bound on the conditional entropy:

Figure 4. Continuous and discretized distribution (panels: a continuous probability density function and the corresponding discretized probability mass function).


\[ H(Y|X) \le H(Y). \]

Contrary to the joint entropy or the mutual information, the conditional entropy is not a symmetric measure: conditioning on one variable or the other does not give the same result.

It is useful to think of the relationships between the joint entropy, the conditional entropy, the mutual information, and the individual entropies using the Venn diagram depicted in Fig. 5. This diagram helps us deduce many entropy relations by inspection of the areas. For example, one can see that

\[ I(X; Y) = H(X) + H(Y) - H(X, Y). \]

Another measure of interest is the mutual information index (4), which measures the expected fraction of the entropy of variable Y that is removed by knowledge of variable X and is given as

\[ \text{Mutual Information Index (MII)} = \frac{I(X; Y)}{H(Y)} = 1 - \frac{H(Y|X)}{H(Y)}. \]

The mutual information index is thus normalized to range from zero (when the variables are independent) to one (when the variables are functionally related).

3.4. The Chain Rule for Joint Entropy

From the definition of the conditional entropy for the two-variable case, it has been shown that the joint entropy is equal to

\[ H(X, Y) = H(X) + H(Y|X). \]

For three variables, one can express the joint entropy as

\[ H(X, Y, Z) = H(X) + H(Y|X) + H(Z|X, Y), \]

which is simple to prove because the joint entropy can be written in terms of the conditional entropy in two steps:

\[ H(X, Y, Z) = H(X) + H(Y, Z|X) = H(X) + H(Y|X) + H(Z|X, Y). \]

By induction, for n variables \(X_1, X_2, \ldots, X_n\), the joint entropy can be written as

\[ H(X_1, X_2, \ldots, X_n) = H(X_1) + H(X_2|X_1) + \cdots + H(X_n|X_1, X_2, \ldots, X_{n-1}) = \sum_{i=1}^{n} H(X_i|X_1, \ldots, X_{i-1}). \]
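The chain rule can be verified numerically on a small joint distribution, as in the sketch below (an added illustration using a made-up three-variable joint probability mass function, not from the article); the conditional entropies are computed directly from conditional probabilities rather than from the identity itself.

```python
from itertools import product
from math import log2

# Hypothetical joint pmf p(x, y, z) over binary variables, for illustration only.
vals = [0.10, 0.15, 0.05, 0.20, 0.20, 0.05, 0.10, 0.15]
p = dict(zip(product([0, 1], repeat=3), vals))

def marginal(indices):
    """Marginal pmf over the given variable indices (0=X, 1=Y, 2=Z)."""
    m = {}
    for outcome, prob in p.items():
        key = tuple(outcome[i] for i in indices)
        m[key] = m.get(key, 0.0) + prob
    return m

def H(indices):
    """Joint entropy, in bits, of the variables with the given indices."""
    return -sum(q * log2(q) for q in marginal(indices).values() if q > 0)

def H_cond(target, given):
    """Conditional entropy H(target | given) = -sum p(g,t) log2 p(t|g)."""
    joint, cond = marginal(given + target), marginal(given)
    return -sum(prob * log2(prob / cond[key[:len(given)]])
                for key, prob in joint.items() if prob > 0)

# Chain rule: H(X,Y,Z) = H(X) + H(Y|X) + H(Z|X,Y)
print(H([0, 1, 2]), H([0]) + H_cond([1], [0]) + H_cond([2], [0, 1]))
```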

3.5. The Chain Rule for Mutual Information

A chain rule for mutual information also exists, as shown below. For the case of three variables, one has

\[ I(X, Y; Z) = H(X, Y) - H(X, Y|Z). \]

Using the chain rule for joint entropy gives

\[ I(X, Y; Z) = H(X) + H(Y|X) - H(X|Z) - H(Y|X, Z) = H(X) - H(X|Z) + H(Y|X) - H(Y|X, Z) = I(X; Z) + I(Y; Z|X), \]

where \(I(Y; Z|X) = H(Y|X) - H(Y|X, Z)\) is the conditional mutual information of Y and Z given X.

For n variables \(X_1, X_2, \ldots, X_n\), the chain rule for mutual information is

\[ I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y|X_1, X_2, \ldots, X_{i-1}). \]

3.6. The Maximum Entropy Principle

Jaynes (5) built on the entropy concept and proposed a method to assign probability values with partial information. This method is known as the maximum entropy principle, and it is stated in his original paper as follows:

In making inferences on the basis of partial information we must use that probability distribution which has maximum entropy subject to whatever is known. This is the only unbiased assignment we can make; to use any other would amount to arbitrary assumption of information, which by hypothesis we do not have.

The partial information may have several forms, such as moment constraints, probability constraints, and shape constraints. Jaynes' maximum entropy principle is considered to be an extension of Laplace's principle of insufficient reason to assign "unbiased" probability distributions based on partial information. The term unbiased means that the assignment satisfies the partial information constraints and adds the least amount of information, as measured by Shannon's entropy measure.

Figure 5. Venn diagram for entropy expressions, showing H(X), H(Y), H(X, Y), H(X|Y), H(Y|X), and I(X; Y).


To illustrate this further, consider the examples below.

3.6.1. Example 5: Maximum Entropy Discrete Formulation. The maximum entropy probability mass function for a discrete variable, X, having n outcomes, when no further information is available, is

\[ p_{\text{maxent}} = \arg\max_{p(x_i)}\left(-\sum_{i=1}^{n} p(x_i)\log(p(x_i))\right) \]

such that

\[ \sum_{i=1}^{n} p(x_i) = 1, \qquad p(x_i) \ge 0, \quad i = 1, \ldots, n. \]

The solution to this formulation yields a probability mass function with equal probability values for all outcomes,

\[ p(x_i) = \frac{1}{n}, \quad i = 1, \ldots, n. \]

The maximum entropy value for this probability mass function is equal to \(\log(n)\), which is the maximum value that can be attained for n outcomes. Recall also that this result agrees with Laplace's principle of insufficient reason discussed above.

3.6.2. Example 6: Maximum Entropy Given Moments and Percentiles of the Distribution. The maximum entropy formulation for the probability density function of a continuous variable having prescribed moment and percentile constraints is

\[ f_{\text{maxent}}(x) = \arg\max_{f(x)}\left(-\int_a^b f(x)\ln(f(x))\,dx\right) \]

subject to

\[ \int_a^b h_i(x)f(x)\,dx = m_i, \quad i = 1, \ldots, n, \qquad \int_a^b f(x)\,dx = 1, \quad f(x) \ge 0, \]

where [a, b] is the domain of the variable, \(h_i(x)\) is x raised to a certain power for moment constraints or an indicator function for percentile constraints, and the \(m_i\) are a given set of moments or percentiles of the distribution.

Using the method of Lagrange multipliers, one obtains

\[ L(f(x)) = -\int_a^b f(x)\ln(f(x))\,dx - \alpha_0\left\{\int_a^b f(x)\,dx - 1\right\} - \sum_{i=1}^{n}\alpha_i\left\{\int_a^b h_i(x)f(x)\,dx - m_i\right\}, \]

where \(\alpha_i\) is the Lagrange multiplier for each constraint.

Taking the partial derivative with respect to f(x) and equating it to zero gives

\[ \frac{\partial L(f(x))}{\partial f(x)} = -\ln(f(x)) - 1 - \alpha_0 - \sum_{i=1}^{n}\alpha_i h_i(x) = 0. \]

Rearranging this equation gives

\[ f_{\text{maxent}}(x) = e^{-\alpha_0 - 1 - \alpha_1 h_1(x) - \alpha_2 h_2(x) - \cdots - \alpha_n h_n(x)}. \]

When only the normalization and non-negativity constraints are available, the maximum entropy solution is uniform over a bounded domain:

\[ f_{\text{maxent}}(x) = e^{-\alpha_0 - 1} = \frac{1}{b-a}, \quad a \le x \le b. \]

If the first moment is available, the maximum entropy solution on the non-negative domain is

\[ f_{\text{maxent}}(x) = e^{-\alpha_0 - 1 - \alpha_1 x} = \frac{1}{m_1}e^{-x/m_1}, \quad x \ge 0. \]

If the first and second moments are available over the interval \((-\infty, \infty)\), the maximum entropy solution is a Gaussian distribution,

\[ f_{\text{maxent}}(x) = e^{-\alpha_0 - 1 - \alpha_1 x - \alpha_2 x^2} = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty, \]

where \(\mu\) is the first moment and \(\sigma^2\) is the variance.

When percentiles of the distribution are available, the maximum entropy solution is a staircase probability density function (Fig. 6b) that satisfies the percentile constraints and is uniform over each interval. It integrates to a piecewise linear cumulative probability distribution that has the shape of a taut string. An example of a maximum entropy distribution given five common percentiles used in decision analysis practice (1%, 25%, 50%, 75%, and 99%), with bounded support, is shown in Fig. 6.
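As a small computational illustration of the principle (an added sketch, not from the article), consider the discrete analogue with a single mean constraint: the maximum entropy probability mass function on the faces of a die with a prescribed mean has the exponential form \(p_i \propto e^{-\alpha x_i}\), and the Lagrange multiplier \(\alpha\) can be found by bisection. The target mean of 4.5 is a hypothetical value chosen for the example.

```python
from math import exp

outcomes = [1, 2, 3, 4, 5, 6]
target_mean = 4.5   # prescribed first moment (hypothetical constraint)

def mean_for(alpha):
    """Mean of the exponential-family pmf p_i ∝ exp(-alpha * x_i)."""
    w = [exp(-alpha * x) for x in outcomes]
    z = sum(w)
    return sum(x * wi for x, wi in zip(outcomes, w)) / z

# Bisection on alpha: mean_for is decreasing in alpha.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean_for(mid) > target_mean:
        lo = mid
    else:
        hi = mid

alpha = (lo + hi) / 2
w = [exp(-alpha * x) for x in outcomes]
p = [wi / sum(w) for wi in w]
print(p, sum(x * pi for x, pi in zip(outcomes, p)))  # maxent pmf with mean 4.5
```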

3.6.3. Example 7: Inverse Maximum Entropy Problem. The inverse maximum entropy problem is the problem of finding, for any given probability density function, the constraints in the maximum entropy formulation that lead to its assignment. The inverse maximum entropy problem can be solved by first placing the given probability density function in the form

\[ f(x) = e^{-\alpha_0 - 1 - \alpha_1 h_1(x) - \alpha_2 h_2(x) - \cdots - \alpha_n h_n(x)}. \]

Now, the inverse maximum entropy problem can be solved by inspection, because the constraint set needed for its assignment is

\[ \int_a^b h_i(x)\,u(x)\,dx = m_i, \quad i = 0, \ldots, n. \]

For example, a Beta probability density function can be rewritten in the form

\[ f(x) = \frac{1}{\mathrm{Beta}(m, n)}\,x^{m-1}(1-x)^{n-1} = e^{-\ln(\mathrm{Beta}(m, n)) + (m-1)\ln x + (n-1)\ln(1-x)}, \quad 0 \le x \le 1. \]

By inspection, it can be seen that the constraint set needed to assign a Beta density function is

\[ \int_0^1 \ln(x)\,u(x)\,dx = m_1, \qquad \int_0^1 \ln(1-x)\,u(x)\,dx = m_2, \]

where \(m_1\) and \(m_2\) are given constants.

3.6.4. Dependence on the Choice of the Coordinate System. In the continuous case, the maximum entropy solution depends on the choice of the coordinate system. For example, suppose you have a random variable x on the domain [0, 1] and another random variable, y, on the same domain, where \(y = x^2\). Suppose also that nothing is known about the random variables except their domains. If the problem is solved in the x coordinate system, the maximum entropy distribution for x is uniform,

\[ f_{\text{maxent}}(x) = 1, \quad 0 \le x \le 1. \]

The probability density function for variable y can then be obtained using the well-known formula for a change of variables,

\[ f_y(y) = \left.\frac{f_x(x)}{\left|dy/dx\right|}\right|_{x=\sqrt{y}} = \frac{1}{2x} = \frac{1}{2\sqrt{y}}. \]

On the other hand, if the problem is solved directly in the y coordinate system, one would obtain a maximum entropy distribution for variable y of

\[ f_{\text{maxent}}(y) = 1, \quad 0 \le y \le 1, \]

which is not equal to the solution obtained in the first case. This may seem to pose a problem for entropy formulations at first; however, it will be shown that this dependence on the choice of the coordinate system has a simple remedy using the minimum cross-entropy principle, which is discussed in more detail below.

3.7. The Minimum Cross-Entropy Principle

In many situations, additional knowledge may exist about the shape of the probability density function or its relation to a certain family of probability functions. In this case, the relative entropy measure to a known probability density function is minimized (3). Minimum cross-entropy formulations for a probability density, f(x), and a target density, q(x), take the form

\[ f_{\text{minXent}}(x) = \arg\min_{f(x)}\int_a^b f(x)\ln\!\left(\frac{f(x)}{q(x)}\right)dx \]

such that

\[ \int_a^b h_i(x)f(x)\,dx = m_i, \quad i = 1, \ldots, n, \qquad \int_a^b f(x)\,dx = 1, \quad f(x) \ge 0. \]

Using the method of Lagrange multipliers, one has

\[ L(f(x)) = \int_a^b f(x)\ln\!\left(\frac{f(x)}{q(x)}\right)dx + \alpha_0\left\{\int_a^b f(x)\,dx - 1\right\} + \sum_{i=1}^{n}\alpha_i\left\{\int_a^b h_i(x)f(x)\,dx - m_i\right\}. \]

Taking the first partial derivative with respect to f(x) and equating it to zero gives

\[ \frac{\partial L(f(x))}{\partial f(x)} = \ln\!\left(\frac{f(x)}{q(x)}\right) + 1 + \alpha_0 + \sum_{i=1}^{n}\alpha_i h_i(x) = 0. \]

Figure 6. (a) Fractile (percentile) maximum entropy distribution obtained using the 0.01, 0.25, 0.5, 0.75, and 0.99 percentiles: the cumulative distribution is a piecewise linear "taut string" given the fractile constraints. (b) Staircase probability density function corresponding to the given percentile constraints.


Rearranging shows that the minimum cross-entropy solution has the form

\[ f_{\text{minXent}}(x) = q(x)\,e^{-\alpha_0 - 1 - \alpha_1 h_1(x) - \alpha_2 h_2(x) - \cdots - \alpha_n h_n(x)}, \]

where \(\alpha_i\) is the Lagrange multiplier for each constraint and \(f_{\text{minXent}}(x)\) is the minimum cross-entropy probability density. One can see that maximizing the entropy of f(x) is, therefore, a special case of minimizing the cross entropy when the target density, q(x), is uniform. The minimum cross-entropy principle generalizes the maximum entropy principle and assigns a probability distribution that minimizes the cross-entropy relative to a target (or reference) distribution.

3.7.1. Independence of the Choice of the Coordinate System. Now refer back to the maximum entropy solution in the continuous case and recall its dependence on the choice of the coordinate system. A remedy to this problem is obtained using the minimum cross-entropy principle. When using minimum cross-entropy, a target density function that represents our prior knowledge about any of the variables is first decided on. If the problem is then solved in the x coordinate system, one obtains a minimum cross-entropy distribution equal to

\[ f_{\text{minXent}}(x) = q(x), \]

where q(x) is the target probability density function that represents our prior knowledge about variable x. On the other hand, if one solves the problem in the y coordinate system, the change-of-variables formula is first used to determine the target density for variable y,

\[ q_y(y) = \left.\frac{q_x(x)}{\left|dy/dx\right|}\right|_{x=\sqrt{y}}. \]

Next, one finds the minimum cross-entropy distribution for y. If no further information exists except its domain and the target density, the minimum cross-entropy solution is equal to

\[ f_{\text{minXent}}(y) = q_y(y), \]

a result that is consistent with the solution in the x coordinate system. Thus, solving a minimum cross-entropy problem (1) generalizes the maximum entropy approach and reduces to the maximum entropy solution if the target density, q(x), is uniform; and (2) resolves the coordinate system dependence problem for the continuous case.

4. ENTROPY RATE OF A STOCHASTIC PROCESS

The entropy rate of a stochastic process, \(\{X_i\}\), is defined by

\[ H(\chi) = \lim_{n\to\infty}\frac{1}{n}H(X_1, X_2, \ldots, X_n), \]

when the limit exists.

The entropy rate is well defined for stationary processes. When \(X_1, X_2, \ldots, X_n\) are independent and identically distributed random variables, one can use the chain rule of joint entropy to show that

\[ H(\chi) = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}H(X_i) = \lim_{n\to\infty}\frac{nH(X_1)}{n} = H(X_1). \]

Another definition of the entropy rate is

\[ H(\chi) = \lim_{n\to\infty}H(X_n|X_{n-1}, X_{n-2}, \ldots, X_1). \]

For a stationary process, both definitions exist and are equivalent. Furthermore, for a stationary ergodic process, one has

\[ -\frac{1}{n}\log_2\big(P(X_1, X_2, \ldots, X_n)\big) \to H(\chi). \]

Using the results of the asymptotic equipartition theorem, one sees that a typical generated sequence of a stationary ergodic stochastic process has a probability of approximately \(2^{-nH(X)}\), and approximately \(2^{nH(X)}\) such sequences exist. Therefore, a typical sequence of length n can be represented using \(\log(2^{nH(X)}) = nH(X)\) bits. Thus, the entropy rate is the average per-symbol description length for a stationary ergodic process.

Of special interest is also the entropy rate of a Markov process, which can be calculated easily as

\[ H(\chi) = \lim_{n\to\infty}H(X_n|X_{n-1}, X_{n-2}, \ldots, X_1) = \lim_{n\to\infty}H(X_n|X_{n-1}) = H(X_2|X_1). \]

For a given stationary distribution, \(\mu\), and transition matrix, P, the entropy rate of a Markov process is

\[ H(\chi) = -\sum_{i=1}^{n}\sum_{j=1}^{n}\mu_i P_{ij}\log P_{ij}. \]
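The entropy rate of a Markov process is easy to compute from its transition matrix, as in the sketch below (an added illustration using a made-up two-state chain, not from the article); the stationary distribution is obtained by simple power iteration.

```python
from math import log2

# Hypothetical two-state transition matrix P[i][j] = Pr(next = j | current = i).
P = [[0.9, 0.1],
     [0.4, 0.6]]

# Stationary distribution mu with mu = mu P, found by power iteration.
mu = [0.5, 0.5]
for _ in range(1000):
    mu = [sum(mu[i] * P[i][j] for i in range(2)) for j in range(2)]

# Entropy rate H = -sum_i sum_j mu_i P_ij log2 P_ij  (bits per symbol).
H_rate = -sum(mu[i] * P[i][j] * log2(P[i][j])
              for i in range(2) for j in range(2) if P[i][j] > 0)

print(mu, H_rate)  # stationary distribution [0.8, 0.2]; rate about 0.569 bits/symbol
```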

5. ENTROPY APPLICATIONS IN BIOLOGY

Maximum entropy applications have found wide use in various fields. In computational biology, for example, maximum entropy methods have been used extensively for modeling biological sequences, in particular for modeling the probability distribution p(a|h), the probability that a will be the next amino acid given the history h of amino acids that have already occurred (6,7). Maximum entropy methods have also been applied to the modeling of short sequence motifs (8), where short sequence motif distributions are approximated with the maximum entropy distribution (MED) consistent with low-order marginal constraints estimated from available data, which may include dependencies between nonadjacent as well as adjacent positions. Entropy methods have also been used in computing new distance measures between sequences (9). Mixtures of conditional maximum entropy distributions have also been used to model biological sequences (10) and to model splicing sites with pairwise correlations (11). Maximum entropy analysis of biomedical literature has been used for associating genes with gene ontology codes (12). Maximum entropy applications have also been used to assess the accuracy of structure analysis (13) and in the prediction of protein secondary structure in multiple sequence alignments, by combining the single-sequence predictions using a maximum entropy weighting scheme (14).

Recently, other expressions related to the entropy of a signal have found applications in biomedical engineering. For example, the approximate entropy, ApEn, quantifies regularities in data and time series (15). The approximate entropy has been applied in biology to discriminate atypical EEGs and respiratory patterns from normative counterparts (16–18) and for physiological time series analysis (19). Finally, it is pointed out that other forms of nonextensive entropy measures also exist, such as the Tsallis entropy (20) and Renyi's entropy (21), special cases of which reduce to Shannon's entropy.

BIBLIOGRAPHY

1. C. E. Shannon, A mathematical theory of communication. Bell Sys. Tech. J. 1948; 27:379–423, 623–656.

2. T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: John Wiley, 1991.

3. S. Kullback and R. A. Leibler, On information and sufficiency. Ann. Math. Statist. 1951; 22:79–86.

4. H. Joe, Relative entropy measures of multivariate dependence. J. Am. Statist. Assoc. 1989; 84:157–164.

5. E. T. Jaynes, Information theory and statistical mechanics. Phys. Rev. 1957; 106:620–630.

6. E. C. Buehler and L. H. Ungar, Maximum entropy methods for biological sequence modeling. J. Comput. Biol., in press.

7. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press, 1998.

8. G. Yeo and C. B. Burge, Proceedings of the Seventh Annual International Conference on Research in Computational Molecular Biology (RECOMB '03), April 2003.

9. G. Benson, A new distance measure for comparing sequence profiles based on path lengths along an entropy surface. Bioinformatics 2002; 18(Suppl. 2):S44–S53.

10. D. Pavlov, Sequence modeling with mixtures of conditional maximum entropy distributions. Third IEEE International Conference on Data Mining (ICDM), 2003:251–258.

11. M. Arita, K. Tsuda, and K. Asai, Modeling splicing sites with pairwise correlations. Bioinformatics 2002; 18(Suppl. 2):S27–S34.

12. S. Raychaudhuri, J. T. Chang, P. D. Sutphin, and R. B. Altman, Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature. Genome Res. 2002; 12:203–214.

13. M. Sakata and M. Sato, Accurate structure analysis by the maximum-entropy method. Acta Cryst. 1990; A46:263–270.

14. A. Krogh and G. Mitchison, Maximum entropy weighting of aligned sequences of proteins or DNA. In: C. Rawlings, D. Clark, R. Altman, L. Hunter, T. Lengauer, and S. Wodak, eds., Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology. Menlo Park, CA: AAAI Press, 1995, pp. 215–221.

15. S. Pincus, Approximate entropy as a measure of system complexity. Proc. Natl. Acad. Sci. USA 1991; 88:2297–2301.

16. J. Bruhn, H. Ropcke, B. Rehberg, T. Bouillon, and A. Hoeft, Electroencephalogram approximate entropy correctly classifies the occurrence of burst suppression pattern as increasing anesthetic drug effect. Anesthesiology 2000; 93:981–985.

17. M. Engoren, Approximate entropy of respiratory rate and tidal volume during weaning from mechanical ventilation. Crit. Care Med. 1998; 26:1817–1823.

18. Hornero et al., 2005.

19. J. S. Richman and J. R. Moorman, Physiological time-series analysis using approximate entropy and sample entropy. Am. J. Physiol. Heart Circ. Physiol. 2000; 278(6).

20. C. Tsallis, J. Stat. Phys. 1988; 52:479.

21. A. Renyi, On measures of entropy and information. Proc. Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1960, vol. I. Berkeley, CA: University of California Press, 1961:547.

READING LIST

E. T. Jaynes, Prior probabilities. IEEE Trans. Syst. Sci. Cybernet. 1968; 4:227–241.
