
Artificial Intelligence 2 (Künstliche Intelligenz 2)
Part V: Probabilistic Reasoning

Michael Kohlhase

Professur für Wissensrepräsentation und -verarbeitung, Informatik, FAU Erlangen-Nürnberg

http://kwarc.info

July 12, 2018


Chapter 7 Quantifying Uncertainty


7.1 Dealing with Uncertainty: Probabilities


7.1.1 Sources of Uncertainty


Sources of Uncertainty in Decision-Making

Where's that d... Wumpus? And where am I, anyway?

- Non-deterministic actions:
  - "When I try to go forward in this dark cave, I might actually go forward-left or forward-right."
- Partial observability with unreliable sensors:
  - "Did I feel a breeze right now?"
  - "I think I might smell a Wumpus here, but I got a cold and my nose is blocked."
  - "According to the heat scanner, the Wumpus is probably in cell [2,3]."
- Uncertainty about the domain behavior:
  - "Are you sure the Wumpus never moves?"


Unreliable Sensors

- Robot Localization: Suppose we want to support localization using landmarks to narrow down the area.
  - "If you see the Eiffel tower, then you're in Paris."
- Difficulty: Sensors can be imprecise.
  - Even if a landmark is perceived, we cannot conclude with certainty that the robot is at that location. ("This is the half-scale Las Vegas copy, you dummy.")
  - Even if a landmark is not perceived, we cannot conclude with certainty that the robot is not at that location. ("Top of Eiffel tower hidden in the clouds.")
- Only the probability of being at a location increases or decreases.


7.1.2 Recap: Rational Agents as a Conceptual Framework


Agents and Environments

- Definition 1.1. An agent is anything that
  - perceives its environment via sensors (means of sensing the environment) and
  - acts on it with actuators (means of changing the environment).
- Example 1.2. Agents include humans, robots, softbots, thermostats, etc.


Agent Schema: Visualizing the Internal Agent Structure

- Agent Schema: We will use the following kind of schema to visualize the internal structure of an agent.
- Different agents differ in the contents of the white box in the center.


Rationality

- Idea: Try to design agents that are successful. (do "the right thing")
- Definition 1.3. A performance measure is a function that evaluates a sequence of environments.
- Example 1.4. A performance measure for the vacuum cleaner world could (a code sketch follows after this list)
  - award one point per square cleaned up in time T?
  - award one point per clean square per time step, minus one per move?
  - penalize for > k dirty squares?
- Definition 1.5. An agent is called rational, if it chooses whichever action maximizes the expected value of the performance measure given the percept sequence to date.
- Question: Why is rationality a good quality to aim for?
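To make Example 1.4 concrete, here is a minimal Python sketch of the second candidate measure; the encoding of the environment history as (clean squares, moved) pairs is an assumption for illustration, not part of the lecture.

  # Sketch of a performance measure for the vacuum-cleaner world (Example 1.4):
  # one point per clean square per time step, minus one point per move.
  # The history encoding below is an assumed illustration.
  def performance(history):
      """history: list of (num_clean_squares, moved) pairs, one per time step."""
      return sum(clean - (1 if moved else 0) for clean, moved in history)

  print(performance([(1, True), (2, True), (2, False)]))  # 0 + 1 + 2 = 3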


Consequences of Rationality: Exploration, Learning, Autonomy

- Note: a rational agent need not be perfect
  - it only needs to maximize expected value (rational ≠ omniscient)
  - it need not predict e.g. very unlikely but catastrophic events in the future
- percepts may not supply all relevant information (rational ≠ clairvoyant)
  - if we cannot perceive things, we do not need to react to them.
  - but we may need to try to find out about hidden dangers (exploration)
- action outcomes may not be as expected (rational ≠ successful)
  - but we may need to take action to ensure that they are, more often (learning)
- Rational ⇒ exploration, learning, autonomy
- Definition 1.6. An agent is called autonomous, if it does not rely on the prior knowledge of the designer.
- Autonomy avoids fixed behaviors that can become unsuccessful in a changing environment. (anything else would be irrational)
- The agent has to learn all relevant traits, invariants, and properties of the environment and actions.


PEAS: Describing the Task Environment

- Observation: To design a rational agent, we must specify the task environment in terms of performance measure, environment, actuators, and sensors, together called the PEAS components.
- Example 1.7. Designing an automated taxi:
  - Performance measure: safety, destination, profits, legality, comfort, ...
  - Environment: US streets/freeways, traffic, pedestrians, weather, ...
  - Actuators: steering, accelerator, brake, horn, speaker/display, ...
  - Sensors: video, accelerometers, gauges, engine sensors, keyboard, GPS, ...
- Example 1.8 (Internet Shopping Agent). The task environment:
  - Performance measure: price, quality, appropriateness, efficiency
  - Environment: current and future WWW sites, vendors, shippers
  - Actuators: display to user, follow URL, fill in form
  - Sensors: HTML pages (text, graphics, scripts)


Environment types

- Observation 1.9. The environment type largely determines the agent design.
- Problem: There is a vast number of possible environments in AI.
- Solution: Classify along a handful of "dimensions". (independent characteristics)
- Definition 1.10. For an agent a we call an environment e
  - fully observable, iff a's sensors give it access to the complete state of the environment at any point in time, else partially observable.
  - deterministic, iff the next state of the environment is completely determined by the current state and a's action, else stochastic.
  - episodic, iff a's experience is divided into atomic episodes, where it perceives and then performs a single action. Crucially, the next episode does not depend on previous ones. Non-episodic environments are called sequential.
  - dynamic, iff the environment can change without an action performed by a, else static. If the environment does not change but a's performance measure does, we call e semidynamic.
  - discrete, iff the sets of e's states and a's actions are countable, else continuous.
  - single-agent, iff only a acts on e. (when must we count parts of e as agents?)


Environment types

- Example 1.11. Some environments classified:

                  Solitaire   Backgammon   Internet shopping       Taxi
  observable      Yes         Yes          No                      No
  deterministic   Yes         No           Partly                  No
  episodic        No          No           No                      No
  static          Yes         Semi         Semi                    No
  discrete        Yes         Yes          Yes                     No
  single-agent    Yes         No           Yes (except auctions)   No

- Observation 1.12. The real world is (of course) partially observable, stochastic, sequential, dynamic, continuous, and multi-agent. (the worst case for AI)


Simple reflex agents

- Definition 1.13. A simple reflex agent is an agent a that only bases its actions on the last percept: f_a : P → A.
- Agent Schema:
- Example 1.14.

  procedure Reflex-Vacuum-Agent [location, status] returns an action
    if status = Dirty then ...
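A minimal Python sketch of such a reflex agent, completing the elided rule in the spirit of the two-square vacuum world; the square names "A"/"B" and the action names are assumptions for illustration.

  # Sketch of a simple reflex agent for the two-square vacuum world (Example 1.14).
  # It maps the last percept (location, status) directly to an action and keeps no state.
  def reflex_vacuum_agent(percept):
      location, status = percept
      if status == "Dirty":
          return "Suck"
      elif location == "A":
          return "Right"
      else:
          return "Left"

  print(reflex_vacuum_agent(("A", "Dirty")))  # Suck
  print(reflex_vacuum_agent(("A", "Clean")))  # Right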


Reflex agents with state

- Idea: Keep track of the state of the world we cannot see now in an internal model.
- Definition 1.15. A stateful reflex agent (also called reflex agent with state or model-based agent) is an agent whose agent function depends on a model of the world (called the world model).


7.1.3 Agent Architectures based on Belief States


World Models for Uncertainty

- Problem: We do not know with certainty what state the world is in!
- Idea: Just keep track of all the possible states it could be in.
- Definition 1.16. A stateful reflex agent has a world model consisting of (see the sketch after this list)
  - a belief state that has information about the possible states the world may be in, and
  - a transition model that updates the belief state based on sensor information and actions.
- Idea: The agent environment determines what the world model can be.
- In a fully observable, deterministic environment,
  - we can observe the initial state, and subsequent states are given by the actions alone;
  - thus the belief state is a singleton set (we call its member the world state) and the transition model is a function from states and actions to states: a transition function.
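A toy Python sketch of Definition 1.16 for the non-singleton case: the belief state is a set of possible world states, and a (here deterministic) transition model maps every member under an action. The cell states and the transition function are invented for illustration.

  # Sketch: belief state = set of possible world states, updated by a transition model.
  # The toy world (cells 0..3 on a line) and its transition function are assumptions.
  def transition(state, action):
      if action == "right":
          return min(state + 1, 3)
      if action == "left":
          return max(state - 1, 0)
      return state

  def update_belief(belief_state, action):
      # Apply the transition model to every state the world might be in.
      return {transition(s, action) for s in belief_state}

  belief = {0, 1, 2}                      # we do not know which cell we are in
  print(update_belief(belief, "right"))   # {1, 2, 3}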


World Models by Agent Type

- Note: All of these considerations only give requirements for the world model. What we can do with it depends on representation and inference.
- Search-based Agents: In a fully observable, deterministic environment: world state = "current state", no inference.
- CSP-based Agents: In a fully observable, deterministic environment: world state = constraint network, inference = constraint propagation.
- Logic-based Agents: In a fully observable, deterministic environment: world state = logical formula, inference = e.g. DPLL or resolution.
- Planning Agents: In a fully observable, deterministic environment: world state = PL0, transition model = STRIPS, inference = state/plan space search.


World Models for Complex Environments

- In a fully observable, but stochastic environment,
  - the belief state must deal with a set of possible states;
  - we generalize the transition function to a transition relation.
- Note: this applies even to online problem solving, where we can just perceive the state. (e.g. when we want to optimize utility)
- In a deterministic, but partially observable environment,
  - the belief state must deal with a set of possible states;
  - we can use transition functions;
  - we need a sensor model, which predicts the influence of percepts on the belief state during update (see the sketch after this list).
- In a stochastic, partially observable environment,
  - mix the ideas from the last two. (sensor model + transition relation)
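Continuing the toy example above, a minimal sketch of how a sensor model filters the belief state during update in a deterministic, partially observable setting; the sensor model itself is an assumption for illustration.

  # Sketch: a sensor model restricts the belief state to the states that are
  # consistent with the current percept. The toy sensor below is assumed:
  # the percept "wall_right" is observed exactly in the rightmost cell (3).
  def sensor_model(state, percept):
      return (state == 3) == (percept == "wall_right")

  def update_with_percept(belief_state, percept):
      return {s for s in belief_state if sensor_model(s, percept)}

  print(update_with_percept({1, 2, 3}, "wall_right"))  # {3}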


Preview: New World Models (Belief) ⇝ New Agent Types

- Probabilistic Agents: In a partially observable environment: belief model = Bayesian networks, inference = probabilistic inference.
- Decision-Theoretic Agents: In a partially observable, stochastic environment: belief model + transition model = decision networks, inference = MEU (maximum expected utility).


7.1.4 Modeling Uncertainty


Wumpus World Revisited

- Recall: We have updated agents with world/transition models based on possible worlds.
- Problem: But pure sets of possible worlds are not enough.
- Example 1.17 (The Wumpus is Back).
  - We have a maze with pits that are detected in neighboring squares via breeze (Wumpus and gold will not be assumed now).
  - Where should the agent go, if there is breeze at (1,2) and (2,1)?
  - Problem: (1,3), (2,2), and (3,1) are all unsafe! (there are possible worlds with a pit in any of them)
- Idea: We need world models that estimate the likelihood of a pit in each cell!


Uncertainty and Logic

- Diagnosis: We want to build an expert dental diagnosis system that deduces the cause (the disease) from the symptoms.
- Can we base this on logic?
- Attempt 1: Say we have a toothache. How about:

  ∀p Symptom(p, toothache) ⇒ Disease(p, cavity)

- Is this rule correct?
  - No, toothaches may have different causes. ("cavity" = "Loch im Zahn", i.e. a hole in a tooth)
- Attempt 2: So what about this:

  ∀p Symptom(p, toothache) ⇒ Disease(p, cavity) ∨ Disease(p, gingivitis) ∨ ...

- We don't know all possible causes.
- And we'd like to be able to deduce which causes are more plausible!


Uncertainty and Logic, ctd.

- Attempt 3: Perhaps a causal rule is better?

  ∀p Disease(p, cavity) ⇒ Symptom(p, toothache)

- Is this rule correct?
  - No, not all cavities cause toothaches.
- Does this rule allow us to deduce a cause from a symptom?
  - No, setting Symptom(p, toothache) to true here has no consequence for the truth of Disease(p, cavity).
  - Note: If Symptom(p, toothache) is false, we would conclude ¬Disease(p, cavity) ... which would be incorrect, cf. the previous question.
- Anyway, this still does not allow us to compare the plausibility of different causes.
- Logic does not allow us to weigh different alternatives, and it does not allow us to express incomplete knowledge. ("a cavity does not always come with a toothache, nor vice versa")


Beliefs and Probabilities

- What do we model with probabilities?
- Incomplete knowledge! We are not 100% sure, but we believe to a certain degree that something is true.
- Probability ≈ our degree of belief, given our current knowledge.
- Example 1.18 (Diagnosis).
  - Symptom(p, toothache) ⇒ Disease(p, cavity) with 80% probability.
  - But, for any given p, in reality we do, or do not, have a cavity: 1 or 0!
  - The "probability" depends on our knowledge! The "80%" refers to the fraction of cavities within the set of all p′ that are indistinguishable from p based on our knowledge.
  - If we receive new knowledge (e.g., Disease(p, gingivitis)), the probability changes!
- Probabilities represent and measure the uncertainty that stems from lack of knowledge.


How to Obtain Probabilities?

- Assessing probabilities through statistics:
  - The agent is 90% convinced by its sensor information := in 9 out of 10 cases, the information is correct.
  - Disease(p, cavity) ⇒ Symptom(p, toothache) with 80% probability := 8 out of 10 persons with a cavity have a toothache. (see the sketch below)
  - The process of estimating a probability P using statistics is called assessing P.
- Assessing even a single P can require huge effort! (e.g. "the likelihood of making it to the university within 10 minutes")
- What is probabilistic reasoning? Deducing probabilities from knowledge about other probabilities.
- Probabilistic reasoning determines, based on probabilities that are (relatively) easy to assess, probabilities that are difficult to assess.
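As a small illustration of assessing a probability from counts (the 8-out-of-10 figure from the bullet above; the variable names are of course made up):

  # Assessing P from statistics: 8 out of 10 observed persons with a cavity
  # also had a toothache (the counts are the illustrative ones from the slide).
  persons_with_cavity = 10
  of_those_with_toothache = 8
  p_toothache_given_cavity = of_those_with_toothache / persons_with_cavity
  print(p_toothache_given_cavity)  # 0.8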


7.1.5 Acting under Uncertainty


Decision-Making Under Uncertainty

- Example 1.19. Giving a lecture:
  - Goal: Be in HS002 at 10:15 to give a lecture.
  - Possible plans:
    - P1: Get up at 8:00, leave at 8:40, arrive at 9:00.
    - P2: Get up at 9:50, leave at 10:05, arrive at 10:15.
  - Decision: Both plans are correct, but P2 succeeds only with probability 50%, and giving a lecture is important, so P1 is the plan of choice.
- Better Example: Which train to take to Frankfurt airport?


Uncertainty and Rational Decisions

- Here: We are only concerned with deducing the likelihood of facts, not with action choice. In general, selecting actions is of course important.
- Rational Agents:
  - We have a choice of actions (go to FRA early, go to FRA just in time).
  - These can lead to different solutions with different probabilities.
  - The actions have different costs.
  - The results have different utilities (safe timing / dislike airport food).
- A rational agent chooses the action with the maximum expected utility.
- Decision Theory = Utility Theory + Probability Theory.


Utility-based agents

- Definition 1.20. A utility-based agent uses a world model along with a utility function that influences its preferences among the states of that world. It chooses the action that leads to the best expected utility, which is computed by averaging over all possible outcome states, weighted by the probability of the outcome.
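A minimal Python sketch of the expected-utility computation in Definition 1.20; the outcome distributions and utility values below (two hypothetical trains to Frankfurt airport) are invented for illustration.

  # Sketch of expected-utility action selection (Definition 1.20).
  # Outcome probabilities and utilities are assumed numbers, not from the lecture.
  def expected_utility(action, outcome_model, utility):
      # Average the utility over all outcome states, weighted by their probability.
      return sum(p * utility[state] for state, p in outcome_model[action].items())

  def best_action(actions, outcome_model, utility):
      return max(actions, key=lambda a: expected_utility(a, outcome_model, utility))

  outcome_model = {
      "early_train": {"at_gate_on_time": 0.95, "miss_flight": 0.05},
      "late_train":  {"at_gate_on_time": 0.60, "miss_flight": 0.40},
  }
  utility = {"at_gate_on_time": 100, "miss_flight": -1000}

  print(best_action(outcome_model.keys(), outcome_model, utility))  # early_train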


Utility-based agents

- A utility function allows rational decisions where mere goals are inadequate:
  - conflicting goals (utility gives the tradeoff needed to make rational decisions)
  - goals obtainable only by uncertain actions (utility × likelihood helps)


Decision-Theoretic Agent

- A particular kind of utility-based agent:

  function DT-AGENT(percept) returns an action
    persistent: belief_state, probabilistic beliefs about the current state of the world
                action, the agent's action

    update belief_state based on action and percept
    calculate outcome probabilities for actions,
      given action descriptions and current belief_state
    select action with highest expected utility
      given probabilities of outcomes and utility information
    return action

  Figure 13.1: A decision-theoretic agent that selects rational actions.


7.1.6 Agenda for this Chapter: Basics of Probability Theory


Our Agenda for This Topic

- Our treatment of the topic "Probabilistic Reasoning" consists of this chapter and the next.
  - This Chapter: All the basic machinery in use in Bayesian networks.
  - Chapter 8: Bayesian networks: what they are, how to build them, how to use them.
- Bayesian networks are the most widespread and successful practical framework for probabilistic reasoning.


Our Agenda for This Chapter

- Unconditional Probabilities and Conditional Probabilities: Which concepts and properties of probabilities will be used?
  - Mostly a recap of things you're familiar with from school.
- Independence and Basic Probabilistic Reasoning Methods: What simple methods are there to avoid enumeration and to deduce probabilities from other probabilities?
  - A basic tool set we'll need. (Still familiar from school?)
- Bayes' Rule: What's that "Bayes"? How is it used, and why is it important?
  - The basic insight about how to invert the "direction" of conditional probabilities.
- Conditional Independence: How to capture and exploit complex relations between random variables?
  - Explains the difficulties arising when using Bayes' rule on multiple pieces of evidence. Conditional independence is used to ameliorate these difficulties.


7.2 Unconditional Probabilities


Probabilistic Models

- Definition 2.1. A probability theory is an assertion language for talking about possible worlds and an inference method for quantifying the degree of belief in such assertions.
- Remark: Like logic, but for non-binary degrees of belief.
- The possible worlds are mutually exclusive (two possible worlds cannot both be the case) and exhaustive (one possible world must be the case).
- This determines the set of possible worlds.
- Example 2.2. If we roll two (distinguishable) dice with six sides, then we have 36 possible worlds: (1,1), (2,1), ..., (6,6).
- We will restrict ourselves to a discrete, countable sample space. (others are more complicated and less useful in AI)
- Definition 2.3. A probability model ⟨Ω, P⟩ consists of a set Ω of possible worlds called the sample space, and a probability function P : Ω → ℝ such that 0 ≤ P(ω) ≤ 1 for all ω ∈ Ω and ∑_{ω∈Ω} P(ω) = 1.
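A minimal Python sketch of the probability model from Definition 2.3, instantiated with the two-dice sample space of Example 2.2 and a uniform probability function (the uniform choice is an assumption for illustration):

  # Probability model <Omega, P> for two distinguishable six-sided dice (Example 2.2).
  from itertools import product
  from fractions import Fraction

  # Sample space Omega: all 36 ordered pairs of die results.
  omega = list(product(range(1, 7), repeat=2))

  # Probability function P: Omega -> R, here uniform; 0 <= P(w) <= 1 for all w.
  P = {w: Fraction(1, 36) for w in omega}

  assert sum(P.values()) == 1  # the probabilities over all possible worlds sum to 1
  print(len(omega))            # 36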


Unconditional Probabilities, Random Variables, and Events

- Definition 2.4. A random variable (also called random quantity, aleatory variable, or stochastic variable) is a variable quantity whose value depends on possible outcomes of unknown variables and processes we do not understand.
- Definition 2.5. We will refer to the fact X = x as an outcome, and to a set of outcomes as an event.
- The notation uppercase "X" for a variable and lowercase "x" for one of its values will be used frequently. (follows Russell/Norvig)
- Definition 2.6. Given a random variable X, P(X = x) denotes the prior probability, or unconditional probability, that X has value x in the absence of any other information.
- Example 2.7. P(Cavity = T) = 0.2, where Cavity is a random variable whose value is true iff some given person has a cavity.


Types of Random Variables

- Note: In general, random variables can have arbitrary domains. Here, we consider finite-domain random variables only, and Boolean random variables most of the time.
- Example 2.8.

  P(Weather = sunny)  = 0.7
  P(Weather = rain)   = 0.2
  P(Weather = cloudy) = 0.08
  P(Weather = snow)   = 0.02
  P(Headache = T)     = 0.1

- Unlike us, Russell and Norvig live in California ... :-( :-(
- Convenience Notations:
  - By convention, we denote Boolean random variables with A, B, and more general finite-domain random variables with X, Y.
  - For a Boolean variable Name, we write name for Name = T and ¬name for Name = F. (follows Russell/Norvig)


Probability Distributions

- Definition 2.9. The probability distribution for a random variable X, written P(X), is the vector of probabilities for the (ordered) domain of X.
- Example 2.10. The probability distributions for finite-domain and Boolean random variables

  P(Headache) = ⟨0.1, 0.9⟩
  P(Weather)  = ⟨0.7, 0.2, 0.08, 0.02⟩

  define the probability distribution for the random variables Headache and Weather.
- Definition 2.11. Given a subset Z ⊆ {X1, ..., Xn} of random variables, an event is an assignment of values to the variables in Z. The joint probability distribution, written P(Z), lists the probabilities of all events.
- Example 2.12. P(Headache, Weather) is

                      Headache = T                 Headache = F
  Weather = sunny     P(W = sunny ∧ headache)      P(W = sunny ∧ ¬headache)
  Weather = rain      P(W = rain ∧ headache)       P(W = rain ∧ ¬headache)
  Weather = cloudy    P(W = cloudy ∧ headache)     P(W = cloudy ∧ ¬headache)
  Weather = snow      P(W = snow ∧ headache)       P(W = snow ∧ ¬headache)


The Full Joint Probability Distribution

I Definition 2.13. Given random variables X1, . . . ,Xn, an atomic event is an assignment of values to all variables.

I Example 2.14. If A and B are Boolean random variables, then we have 4 atomic events: a ∧ b, a ∧ ¬ b, ¬ a ∧ b, ¬ a ∧ ¬ b.

I Definition 2.15. Given random variables X1, . . . ,Xn, the full joint probability distribution, denoted P(X1, . . . ,Xn), lists the probabilities of all atomic events.

I Example 2.16. P(Cavity ,Toothache)

               toothache   ¬ toothache
  cavity          0.12        0.08
  ¬ cavity        0.08        0.72

I All atomic events are disjoint (their pairwise conjunctions all are ⊥); the sum of all fields is 1 (corresponds to their disjunction >).


Probabilities of Propositional Formulas

I Definition 2.17. Given random variables X1, . . . ,Xn, a propositional formula, short proposition, is a propositional formula over the atoms Xi = xi where xi is a value in the domain of Xi .
A function P that maps propositions into [0, 1] is a probability measure if
(i) P(>) = 1 and
(ii) for all propositions A, P(A) = ∑_{e |= A} P(e), where e is an atomic event.

I Propositions represent sets of atomic events: the interpretations satisfying the formula.

I Example 2.18. P(cavity ∧ toothache) = 0.12 is the probability that some given person has both a cavity and a toothache. (Note the use of cavity for Cavity = T and toothache for Toothache = T.)

I Notes:
I Instead of P(a ∧ b), we often write P(a, b).
I Propositions can be viewed as Boolean random variables; we will denote them with A, B as well.
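As a sketch (assuming we represent atomic events as tuples of truth values), the full joint distribution of Example 2.16 and the probability measure of Definition 2.17 can be coded as follows; the helper name prob_of is a hypothetical choice, not part of the lecture.

# Full joint distribution P(Cavity, Toothache) from Example 2.16,
# keyed by atomic events (assignments to all variables).
joint = {
    (True,  True):  0.12,   # cavity ∧ toothache
    (True,  False): 0.08,   # cavity ∧ ¬ toothache
    (False, True):  0.08,   # ¬ cavity ∧ toothache
    (False, False): 0.72,   # ¬ cavity ∧ ¬ toothache
}

def prob_of(proposition):
    """P(A) = sum of P(e) over all atomic events e satisfying the proposition A."""
    return sum(p for event, p in joint.items() if proposition(*event))

print(prob_of(lambda cavity, toothache: cavity and toothache))   # 0.12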


Questionnaire

I Theorem 2.19 (Kolmogorov). A function P that maps propositions into [0, 1] is a probability measure if and only if
  (i) P(>) = 1 and
  (ii') for all propositions A, B: P(a ∨ b) = P(a) + P(b) − P(a ∧ b).
I We can equivalently replace
  (ii) for all propositions A, P(A) = ∑_{I |= A} P(I) (cf. previous slide)
with Kolmogorov's (ii').

1. Question: Assume we have
  (iii) P(⊥) = 0.
  How to derive from (i), (ii'), and (iii) that, for all propositions A, P(¬ a) = 1 − P(a)?
  1.1 By (i), P(>) = 1; as (a ∨ ¬ a) ⇔ >, we get P(a ∨ ¬ a) = 1.
  1.2 By (iii), P(⊥) = 0; as (a ∧ ¬ a) ⇔ ⊥, we get P(a ∧ ¬ a) = 0.
  1.3 Inserting this into (ii'), we get P(a ∨ ¬ a) = 1 = P(a) + P(¬ a) − 0, hence P(¬ a) = 1 − P(a).


Questionnaire, ctd.

I Reminder 1: (i) P(>) = 1; (ii') P(a ∨ b) = P(a) + P(b) − P(a ∧ b).
I Reminder 2: "Probabilities model our belief."
I If P represents an objectively observable probability, the axioms clearly make sense. But why should an agent respect these axioms when modeling its own subjective belief?

Question: Do you believe in Kolmogorov's axioms?

I You're free to believe whatever you want, but note this [deFinetti:sssdp31]: If an agent has a belief that violates Kolmogorov's axioms, then there exists a combination of "bets" on propositions such that the agent always loses money.
I If your beliefs are contradictory, then you will not be successful in the long run (and not even in the next minute, if your opponent is clever).


7.3 Conditional Probabilities


Conditional Probabilities: Intuition

I Do probabilities change as we gather new knowledge?

I Yes! Probabilities model our belief, thus they depend on our knowledge.
I Example 3.1. Your "probability of missing the connection train" increases when you are informed that your current train has a 30-minute delay.
I Example 3.2. The "probability of cavity" increases when the doctor is informed that the patient has a toothache.
I In the presence of additional information, we can no longer use the unconditional (prior!) probabilities.

I Given propositions A and B, P(a | b) denotes the conditional probability of a (i.e., A = T) given that all we know is b (i.e., B = T).

I Example 3.3. P(cavity) = 0.2 vs. P(cavity | toothache) = 0.6. And P(cavity | toothache ∧ ¬ cavity) = 0.


Conditional Probabilities: Definition

I Definition 3.4. Given propositions A and B where P(b) ≠ 0, the conditional probability, or posterior probability, of a given b, written P(a | b), is defined as:

P(a | b) := P(a ∧ b) / P(b)

I Intuition: The likelihood of having a and b, within the set of outcomes where we have b.

I Example 3.5. P(cavity ∧ toothache) = 0.12 and P(toothache) = 0.2 yield P(cavity | toothache) = 0.12/0.2 = 0.6.


Conditional Probability Distributions

I Definition 3.6. Given random variables X and Y , the conditional probability distribution of X given Y , written P(X | Y ), is the table of all conditional probabilities of values of X given values of Y .
I For sets of variables: P(X1, . . . ,Xn | Y1, . . . ,Ym).
I Example 3.7. P(Weather | Headache) =

                     Headache = T                  Headache = F
Weather = sunny      P(W = sunny | headache)       P(W = sunny | ¬ headache)
Weather = rain
Weather = cloudy
Weather = snow

What is "the probability of sunshine given that I have a headache"?
I If you're susceptible to headaches depending on weather conditions, this makes sense. Otherwise, the two variables are independent (see next section).


7.4 Independence


Working with the Full Joint Probability Distribution

I Example 4.1. Consider the following joint probability distribution:

               toothache   ¬ toothache
  cavity          0.12        0.08
  ¬ cavity        0.08        0.72

I How to compute P(cavity)?
I Sum across the row:
  P(cavity ∧ toothache) + P(cavity ∧ ¬ toothache) = 0.2
I How to compute P(cavity ∨ toothache)?
I Sum across atomic events:
  P(cavity ∧ toothache) + P(¬ cavity ∧ toothache) + P(cavity ∧ ¬ toothache) = 0.28
I How to compute P(cavity | toothache)?
I P(cavity ∧ toothache) / P(toothache)
I All relevant probabilities can be computed using the full joint probability distribution, by expressing propositions as disjunctions of atomic events.
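The three computations above can be sketched in Python over the dictionary layout used earlier (the variable names are illustrative only):

joint = {(True, True): 0.12, (True, False): 0.08,
         (False, True): 0.08, (False, False): 0.72}   # keys: (cavity, toothache)

p_cavity = sum(p for (c, t), p in joint.items() if c)                    # 0.2
p_cavity_or_toothache = sum(p for (c, t), p in joint.items() if c or t)  # 0.28
p_toothache = sum(p for (c, t), p in joint.items() if t)                 # 0.2
p_cavity_given_toothache = joint[(True, True)] / p_toothache             # 0.6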


Working with the Full Joint Probability Distribution??

I Question: Is it a good idea to use the full joint probability distribution?

I Answer: No:
I Given n random variables with k values each, the joint probability distribution contains k^n probabilities.
I Computational cost of dealing with this size.
I Practically impossible to assess all these probabilities.

I Question: So, is there a compact way to represent the full joint probability distribution? Is there an efficient method to work with that representation?

I Answer: Not in general, but it works in many cases. We can work directly with conditional probabilities, and exploit (conditional) independence.

I Bayesian networks. (First, we do the simple case.)


Independence

I Definition 4.2. Events a and b are independent if P(a ∧ b) = P(a) · P(b).
I Proposition 4.3. Given independent events a and b where P(b) ≠ 0, we have P(a | b) = P(a).
I Proof:
  P.1 By definition, P(a | b) = P(a ∧ b) / P(b),
  P.2 which by independence is equal to (P(a) · P(b)) / P(b) = P(a).
I Similarly, if P(a) ≠ 0, we have P(b | a) = P(b).
I Example 4.4.
I P(Dice1 = 6 ∧ Dice2 = 6) = 1/36.
I P(W = sunny | headache) = P(W = sunny) unless you're weather-sensitive (cf. slide 26).
I But toothache and cavity are NOT independent.
I The fraction of "cavity" is higher within "toothache" than within "¬ toothache": P(toothache) = 0.2 and P(cavity) = 0.2, but P(toothache ∧ cavity) = 0.12 > 0.04 = P(toothache) · P(cavity).
I Definition 4.5. Random variables X and Y are independent if P(X ,Y ) = P(X ) · P(Y ). (System of equations!)
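A quick numerical check of Definition 4.2 on the example numbers above (a sketch; the dice pair is assumed uniform):

# Two dice: the events Dice1 = 6 and Dice2 = 6 are independent.
p_d1_six, p_d2_six, p_both_six = 1/6, 1/6, 1/36
assert abs(p_both_six - p_d1_six * p_d2_six) < 1e-12

# Toothache and Cavity are NOT independent:
p_toothache, p_cavity, p_both = 0.2, 0.2, 0.12
print(p_both, "!=", p_toothache * p_cavity)   # 0.12 != 0.04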


Illustration: Exploiting Independence

I Example 4.6. Consider (again) the following joint probability distribution:

               toothache   ¬ toothache
  cavity          0.12        0.08
  ¬ cavity        0.08        0.72

Adding variable Weather with values sunny, rain, cloudy, snow, the full joint probability distribution contains 16 probabilities. But your teeth do not influence the weather, nor vice versa!
I Weather is independent of each of Cavity and Toothache: For all value combinations (c, t) of Cavity and Toothache, and for all values w of Weather, we have P(c ∧ t ∧ w) = P(c ∧ t) · P(w).
I P(Cavity,Toothache,Weather) can be reconstructed from the separate tables P(Cavity,Toothache) and P(Weather). (8 probabilities)
I Independence can be exploited to represent the full joint probability distribution more compactly.
I Sometimes, variables are independent only under particular conditions: conditional independence, see later.
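A sketch of this factored representation in Python: the 16-entry table P(Cavity, Toothache, Weather) is recovered from the two smaller tables (the weather probabilities are the ones from Example 2.8).

p_cav_tooth = {(True, True): 0.12, (True, False): 0.08,
               (False, True): 0.08, (False, False): 0.72}
p_weather = {"sunny": 0.7, "rain": 0.2, "cloudy": 0.08, "snow": 0.02}

# By independence, P(c ∧ t ∧ w) = P(c ∧ t) · P(w): 4 + 4 = 8 numbers represent all 16.
full_joint = {(c, t, w): p_ct * p_w
              for (c, t), p_ct in p_cav_tooth.items()
              for w, p_w in p_weather.items()}
assert abs(sum(full_joint.values()) - 1.0) < 1e-9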


7.5 Basic Probabilistic Reasoning Methods


The Product Rule

I Proposition 5.1 (Product Rule). Given propositions A and B, P(a ∧ b) = P(a | b) · P(b).
I Example 5.2. P(cavity ∧ toothache) = P(toothache | cavity) · P(cavity).
I If we know the values of P(a | b) and P(b), then we can compute P(a ∧ b).
I Similarly, P(a ∧ b) = P(b | a) · P(a).
I Definition 5.3. P(X ,Y ) = P(X | Y ) · P(Y ) is a system of equations:

P(W = sunny ∧ headache) = P(W = sunny | headache) · P(headache)
P(W = rain ∧ headache) = P(W = rain | headache) · P(headache)
                    ... = ...
P(W = snow ∧ ¬ headache) = P(W = snow | ¬ headache) · P(¬ headache)

I Similar for unconditional distributions, P(X ,Y ) = P(X ) · P(Y ).


The Chain Rule

I Proposition 5.4 (Chain Rule). Given random variables X1, . . . ,Xn, we have

P(X1, . . . ,Xn) = P(Xn | Xn−1, . . . ,X1) · P(Xn−1 | Xn−2, . . . ,X1) · . . . · P(X2 | X1) · P(X1)

I Example 5.5.

P(¬ brush ∧ cavity ∧ toothache)
= P(toothache | cavity, ¬ brush) · P(cavity, ¬ brush)
= P(toothache | cavity, ¬ brush) · P(cavity | ¬ brush) · P(¬ brush)

I Proof: Iterated application of the Product Rule.
  P.1 P(X1, . . . ,Xn) = P(Xn | Xn−1, . . . ,X1) · P(Xn−1, . . . ,X1) by the Product Rule.
  P.2 In turn, P(Xn−1, . . . ,X1) = P(Xn−1 | Xn−2, . . . ,X1) · P(Xn−2, . . . ,X1), etc.

I Note: This works for any ordering of the variables.
I We can recover the probability of atomic events from sequenced conditional probabilities for any ordering of the variables.
I First of the four basic techniques in Bayesian networks.
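Example 5.5 as a Python sketch; the numeric values of the conditional probabilities are made-up assumptions (the slides do not give them) and serve only to show how the chain-rule factors multiply out.

# Assumed (made-up) inputs, only to illustrate the factorization:
p_toothache_given_cavity_nobrush = 0.8   # P(toothache | cavity, ¬ brush), assumed
p_cavity_given_nobrush = 0.3             # P(cavity | ¬ brush), assumed
p_nobrush = 0.4                          # P(¬ brush), assumed

# Chain rule: P(¬ brush ∧ cavity ∧ toothache)
p_atomic = (p_toothache_given_cavity_nobrush
            * p_cavity_given_nobrush
            * p_nobrush)
print(p_atomic)   # 0.096 under these assumptions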


Marginalization

I Extracting a sub-distribution from a larger joint distribution:
I Proposition 5.6 (Marginalization). Given sets X and Y of random variables, we have:

P(X) = ∑_{y∈Y} P(X, y)

where ∑_{y∈Y} sums over all possible value combinations of Y.
I Example 5.7. (Note: Equation system!)

P(Cavity) = ∑_{y∈Toothache} P(Cavity, y)
P(cavity) = P(cavity, toothache) + P(cavity, ¬ toothache)
P(¬ cavity) = P(¬ cavity, toothache) + P(¬ cavity, ¬ toothache)
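Example 5.7 as a sketch over the joint table used in the earlier snippets:

joint = {(True, True): 0.12, (True, False): 0.08,
         (False, True): 0.08, (False, False): 0.72}   # keys: (cavity, toothache)

# P(Cavity): sum out Toothache, i.e. sum P(Cavity, y) over all values y of Toothache.
p_cavity_dist = {c: sum(p for (cv, t), p in joint.items() if cv == c)
                 for c in (True, False)}
print(p_cavity_dist)   # {True: 0.2, False: 0.8}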


Questionnaire

I Say P(dog) = 0.4, (¬ dog) ⇔ cat, and P(likeslasagna | cat) = 0.5.

I Question: Is P(likeslasagna ∧ cat) A: 0.2, B: 0.5, C: 0.475, or D: 0.3?

I Answer: We have P(cat) = 0.6 and P(likeslasagna | cat) = 0.5, hence (D) by the product rule.

I Question: Can we compute the value of P(likeslasagna), given the above information?

I Answer: No. We don't know the probability that dogs like lasagna, i.e. P(likeslasagna | dog).


Normalization: Idea

I Problem: We know P(cavity ∧ toothache) but don't know P(toothache).

I Step 1: Case distinction over values of Cavity: (P(toothache) as an unknown)

P(cavity | toothache) = P(cavity ∧ toothache) / P(toothache) = 0.12 / P(toothache)
P(¬ cavity | toothache) = P(¬ cavity ∧ toothache) / P(toothache) = 0.08 / P(toothache)

I Step 2: Assuming placeholder α := 1/P(toothache):

P(cavity | toothache) = α P(cavity ∧ toothache) = α 0.12
P(¬ cavity | toothache) = α P(¬ cavity ∧ toothache) = α 0.08

I Step 3: Fixing toothache to be true, view P(cavity ∧ toothache) vs. P(¬ cavity ∧ toothache) as the relative weights of P(cavity) vs. P(¬ cavity) within toothache. Then normalize their summed-up weight to 1:
1 = α (0.12 + 0.08) ; α = 1 / (0.12 + 0.08) = 1/0.2 = 5

I α is a normalization constant scaling the sum of relative weights to 1.


Normalization: Formal

I Definition 5.8. Given a vector 〈w1, . . . ,wk〉 of numbers in [0, 1] where ∑_{i=1}^{k} wi ≤ 1, the normalization constant α is α〈w1, . . . ,wk〉 := 1 / (∑_{i=1}^{k} wi).
I Example 5.9. α〈0.12, 0.08〉 = 5 〈0.12, 0.08〉 = 〈0.6, 0.4〉.
I Proposition 5.10 (Normalization). Given a random variable X and an event e, we have P(X | e) = α P(X , e).
I Proof:
  P.1 For each value x of X , P(X = x | e) = P(X = x ∧ e)/P(e).
  P.2 So all we need to prove is that α = 1/P(e).
  P.3 By definition, α = 1/∑_x P(X = x ∧ e), so we need to prove P(e) = ∑_x P(X = x ∧ e), which holds by marginalization.
I Example 5.11. α 〈P(cavity ∧ toothache),P(¬ cavity ∧ toothache)〉 = α 〈0.12, 0.08〉, so P(cavity | toothache) = 0.6 and P(¬ cavity | toothache) = 0.4.
I Another way of saying this is: "We use α as a placeholder for 1/P(e), which we compute using the sum of relative weights by Marginalization."
I Normalization+Marginalization: Given "query variable" X , "observed event" e, and "hidden variables" set Y: P(X | e) = α · P(X , e) = α · ∑_{y∈Y} P(X , e, y).
I Second of the four basic techniques in Bayesian networks.
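Definition 5.8 / Example 5.11 as a Python sketch (the helper name normalize is an ad-hoc choice):

def normalize(weights):
    """Scale a vector of relative weights so that it sums to 1."""
    alpha = 1.0 / sum(weights)
    return [alpha * w for w in weights]

# P(Cavity | toothache) = α 〈P(cavity ∧ toothache), P(¬ cavity ∧ toothache)〉
print(normalize([0.12, 0.08]))   # [0.6, 0.4]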


7.6 Bayes’ Rule


Bayes’ Rule

I Proposition 6.1 (Bayes' Rule). Given propositions A and B where P(a) ≠ 0 and P(b) ≠ 0, we have:

P(a | b) = (P(b | a) · P(a)) / P(b)

I Proof:
  P.1 By definition, P(a | b) = P(a ∧ b) / P(b),
  P.2 which by the product rule P(a ∧ b) = P(b | a) · P(a) is equal to the claim.

I Notation: note that this is a system of equations!

P(X | Y ) = (P(Y | X ) · P(X )) / P(Y )


Applying Bayes’ Rule

I Example 6.2. Say we know that P(toothache | cavity) = 0.6, P(cavity) = 0.2, and P(toothache) = 0.2.
We can compute P(cavity | toothache): By Bayes' rule,
P(cavity | toothache) = (P(toothache | cavity) · P(cavity)) / P(toothache) = (0.6 · 0.2) / 0.2 = 0.6.

I Ok, but: Why don't we simply assess P(cavity | toothache) directly?
I P(toothache | cavity) is causal, P(cavity | toothache) is diagnostic.
I Causal dependencies are robust over the frequency of the causes.
I Example 6.3. If there is a cavity epidemic, then P(cavity | toothache) increases, but P(toothache | cavity) remains the same. (It only depends on how cavities "work".)
I Also, causal dependencies are often easier to assess.
I Bayes' rule allows us to perform diagnosis (observing a symptom, what is the cause?) based on prior probabilities and causal dependencies.


Extended Example: Bayes’ Rule and Meningitis

I Facts known to doctors:
I The prior probabilities of meningitis (m) and stiff neck (s) are P(m) = 0.00002 and P(s) = 0.01.
I Meningitis causes a stiff neck 70% of the time: P(s | m) = 0.7.

I Doctor d uses Bayes' Rule:
P(m | s) = (P(s | m) · P(m)) / P(s) = (0.7 · 0.00002) / 0.01 = 0.0014 ≈ 1/700.
I Even though a stiff neck is strongly indicated by meningitis (P(s | m) = 0.7),
I the probability of meningitis in the patient remains small:
I the prior probability of stiff necks is much higher than that of meningitis.

I Doctor d′ knows P(m | s) from observation; she does not need Bayes' rule!
I Indeed, but what if a meningitis epidemic erupts?
I Then d knows that P(m | s) grows proportionally with P(m) (d′ is clueless).
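The meningitis computation as a two-line sketch:

p_m, p_s, p_s_given_m = 0.00002, 0.01, 0.7
p_m_given_s = p_s_given_m * p_m / p_s    # Bayes' rule
print(p_m_given_s)                       # 0.0014, roughly 1/700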


Questionnaire

I Say P(dog) = 0.4, P(likeschappi | dog) = 0.8, and P(likeschappi) = 0.5.

I Question: What is P(dog | likeschappi)? A: 0.8, B: 0.64, C: 0.9, D: 0.32?

I Answer: By Bayes' rule, P(dog | likeschappi) = (P(likeschappi | dog) · P(dog)) / P(likeschappi) = (0.8 · 0.4) / 0.5 = 0.64, so (B).

I Question: Is P(dog | likeschappi) causal or diagnostic?

I Answer: Diagnostic; liking Chappi does not cause anybody to be a dog.

I Question: Is P(likeschappi | dog) causal or diagnostic?

I Answer: Causal; liking or not liking dog food may be caused by being or not being a dog.


7.7 Conditional Independence


Bayes’ Rule with Multiple Evidence

I Example 7.1. Say we know from medical studies that P(cavity) = 0.2, P(toothache | cavity) = 0.6, P(toothache | ¬ cavity) = 0.1, P(catch | cavity) = 0.9, and P(catch | ¬ cavity) = 0.2.
Now, in case we did observe the symptoms toothache and catch (the dentist's probe catches in the aching tooth), what would be the likelihood of having a cavity? What is P(cavity | toothache ∧ catch)?
I Trial 1: Bayes' rule

P(cavity | toothache ∧ catch) = (P(toothache ∧ catch | cavity) · P(cavity)) / P(toothache ∧ catch)

I Trial 2: Normalization P(X | e) = α P(X , e), then the Product Rule P(X , e) = P(e | X ) P(X ), with X = Cavity, e = toothache ∧ catch:

P(Cavity | catch ∧ toothache) = α P(toothache ∧ catch | Cavity) P(Cavity)
P(cavity | catch ∧ toothache) = α P(toothache ∧ catch | cavity) P(cavity)
P(¬ cavity | catch ∧ toothache) = α P(toothache ∧ catch | ¬ cavity) P(¬ cavity)


Bayes’ Rule with Multiple Evidence, ctd.

I P(Cavity | toothache ∧ catch) = α P(toothache ∧ catch | Cavity) P(Cavity)

I Question: So, is everything fine?

I Answer: No! We need P(toothache ∧ catch | Cavity), i.e. causal dependencies for all combinations of symptoms! (2^n, in general)

I Question: Are Toothache and Catch independent?

I Answer: No. If a probe catches, we probably have a cavity, which probably causes toothache.

I But: They are independent given the presence or absence of cavity!


Conditional Independence

I Definition 7.2. Given sets of random variables Z1, Z2, and Z, we say that Z1 and Z2 are conditionally independent given Z if:

P(Z1,Z2 | Z) = P(Z1 | Z) · P(Z2 | Z)

We alternatively say that Z1 is conditionally independent of Z2 given Z.
I Example 7.3.

P(Toothache,Catch | cavity) = P(Toothache | cavity) P(Catch | cavity)
P(Toothache,Catch | ¬ cavity) = P(Toothache | ¬ cavity) P(Catch | ¬ cavity)

I For cavity: this may cause both, but they don't influence each other.
I For ¬ cavity: catch and/or toothache would each be caused by something else.
I Note: The definition is symmetric regarding the roles of Z1 and Z2: Toothache is conditionally independent of Catch given Cavity, and vice versa.
I But there may be dependencies within Z1 or Z2, e.g. Z2 = {Toothache, Sleeplessness}.


Conditional Independence, ctd.

I Proposition 7.4. If Z1 and Z2 are conditionally independent given Z, then P(Z1 | Z2,Z) = P(Z1 | Z).
I Proof:
  P.1 By definition, P(Z1 | Z2,Z) = P(Z1,Z2,Z) / P(Z2,Z),
  P.2 which by the product rule is equal to (P(Z1,Z2 | Z) · P(Z)) / P(Z2,Z),
  P.3 which by conditional independence is equal to (P(Z1 | Z) · P(Z2 | Z) · P(Z)) / P(Z2,Z).
  P.4 Since (P(Z2 | Z) · P(Z)) / P(Z2,Z) = 1, this proves the claim.
I Example 7.5. Using Toothache as Z1, Catch as Z2, and Cavity as Z: P(Toothache | Catch,Cavity) = P(Toothache | Cavity).
I In the presence of conditional independence, we can drop variables from the right-hand side of conditional probabilities.
I Third of the four basic techniques in Bayesian networks.
I Last missing technique: "Capture variable dependencies in a graph"; illustration see next slide, details see next Chapter.


Exploiting Conditional Independence: Overview

I 1. Graph captures variable dependencies: (Variables X1, . . . ,Xn)

  (Dependency graph: Cavity → Toothache, Cavity → Catch)

I Given evidence e, want to know P(X | e). Remaining vars: Y.

I 2. Normalization+Marginalization:
P(X | e) = α · P(X , e); if Y ≠ ∅ then P(X | e) = α · ∑_{y∈Y} P(X , e, y)
I A sum over atomic events!

I 3. Chain rule: Order X1, . . . ,Xn consistently with the dependency graph.
P(X1, . . . ,Xn) = P(Xn | Xn−1, . . . ,X1) · P(Xn−1 | Xn−2, . . . ,X1) · . . . · P(X1)

I 4. Exploit conditional independence: Instead of P(Xi | Xi−1, . . . ,X1), with the previous slide we can use P(Xi | Parents(Xi )).
I Bayesian networks!


Exploiting Conditional Independence: Example

I 1. Graph captures variable dependencies: (See previous slide.)
I Given toothache, catch, want P(Cavity | toothache, catch). Remaining vars: ∅.

I 2. Normalization+Marginalization:
P(Cavity | toothache, catch) = α · P(Cavity, toothache, catch)

I 3. Chain rule: Order X1 = Cavity, X2 = Toothache, X3 = Catch.
P(Cavity, toothache, catch) = P(catch | toothache,Cavity) · P(toothache | Cavity) · P(Cavity)

I 4. Exploit conditional independence:
Instead of P(catch | toothache,Cavity) use P(catch | Cavity).

I Thus:

P(Cavity | toothache, catch)
= α · P(catch | Cavity) · P(toothache | Cavity) · P(Cavity)
= α · 〈0.9 · 0.6 · 0.2, 0.2 · 0.1 · 0.8〉
= α · 〈0.108, 0.016〉

I So α ≈ 8.06 and P(cavity | toothache ∧ catch) ≈ 0.87.
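The same computation as a Python sketch, using the numbers of Example 7.1 (the variable names are illustrative only):

p_cavity = {True: 0.2, False: 0.8}
p_toothache_given = {True: 0.6, False: 0.1}   # P(toothache | Cavity)
p_catch_given = {True: 0.9, False: 0.2}       # P(catch | Cavity)

# Relative weights α^-1 · P(Cavity | toothache, catch) for Cavity = T, F.
weights = [p_catch_given[c] * p_toothache_given[c] * p_cavity[c]
           for c in (True, False)]            # [0.108, 0.016]
alpha = 1.0 / sum(weights)
posterior = [alpha * w for w in weights]
print(alpha, posterior)   # ≈ 8.06, [≈ 0.871, ≈ 0.129]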


Naive Bayes Models

I Definition 7.6. A Bayesian network in which a single cause directly influences a number of effects, all of which are conditionally independent given the cause, is called a naive Bayes model or Bayesian classifier. (also called idiot Bayes model by Bayesian fundamentalists)

I Observation 7.7. In a naive Bayes model, the full joint probability distribution can be written as

  P(cause, effect1, . . . , effectn) = P(cause) · ∏i P(effecti | cause)

I This kind of model is called "naive" or "idiot" since it is often used as a simplifying model even if the effects are not conditionally independent after all.

I In practice, naive Bayes systems can work surprisingly well, even when the conditional independence assumption is not true.

I Example 7.8. The dentistry example is a (true) naive Bayes model.
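Observation 7.7 translates directly into a generic classifier: multiply the cause prior by one conditional factor per observed effect and normalize. The sketch below is our own illustration (not the lecture's code) and reuses the dentistry CPT values:

    def naive_bayes_posterior(prior, cpts, observed):
        # prior:    {cause_value: P(cause_value)}
        # cpts:     {effect: {cause_value: P(effect = T | cause_value)}}
        # observed: {effect: True/False}; returns the normalized posterior over the cause
        unnorm = {}
        for c, p_c in prior.items():
            p = p_c
            for effect, value in observed.items():
                p_true = cpts[effect][c]
                p *= p_true if value else 1 - p_true
            unnorm[c] = p
        alpha = 1 / sum(unnorm.values())
        return {c: alpha * p for c, p in unnorm.items()}

    prior = {True: 0.2, False: 0.8}                       # P(cavity)
    cpts = {"toothache": {True: 0.6, False: 0.1},
            "catch":     {True: 0.9, False: 0.2}}
    print(naive_bayes_posterior(prior, cpts, {"toothache": True, "catch": True}))
    # -> {True: ~0.87, False: ~0.13}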


Questionnaire

I Consider the random variables X1 = Animal, X2 = LikesChappi, and X3 = LoudNoise; X1 has values {dog, cat, other}, X2 and X3 are Boolean.

I Question: Which statements are correct?
(A) Animal is independent of LikesChappi.
(B) LoudNoise is independent of LikesChappi.
(C) Animal is conditionally independent of LikesChappi given LoudNoise.
(D) LikesChappi is conditionally independent of LoudNoise given Animal.

I Answer:
(A) No: likeschappi indicates dog.
(B) No: Not knowing what animal it is, loudnoise is an indication for dog, which indicates likeschappi.
(C) No: For example, even if we know loudnoise, learning likeschappi in addition gives us a stronger indication of Animal = dog.
(D) Yes: If we know what animal it is, LoudNoise does not influence LikesChappi. (Well, at least that's a reasonable assumption.)


7.8 The Wumpus World Revisited


Wumpus World Revisited

I Example 8.1 (The Wumpus is Back).
I We have a maze with pits that are detected in neighboring squares via a breeze. (forget wumpus and gold for now)
I Where should the agent go if there is a breeze at (1,2) and (2,1)?
I Pure logical inference can conclude nothing about which square is most likely to be safe!
I Idea: Let's see whether our probabilistic reasoning machinery can help!


Wumpus: Probabilistic Model

I Boolean variables:
I Pi,j : pit at square (i, j)
I Bi,j : breeze at square (i, j) (breeze variables only for the observed squares)

I Full joint probability distribution:
1. P(P1,1, . . . ,P4,4,B1,1,B1,2,B2,1) = P(B1,1,B1,2,B2,1 | P1,1, . . . ,P4,4) · P(P1,1, . . . ,P4,4) (product rule)
2. P(P1,1, . . . ,P4,4) = ∏i,j P(Pi,j) (pits are placed independently)
3. P(p1,1, . . . , p4,4) = 0.2^n · 0.8^(16−n) for a configuration with exactly n pits (the probability of a pit is 0.2)
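As a quick sanity check of item 3, a two-line Python sketch (the 0.2 pit probability and the 16 squares are from the model above; the function name is ours):

    def config_prior(n_pits, p=0.2, squares=16):
        # prior probability of one specific pit configuration containing n_pits pits
        return p ** n_pits * (1 - p) ** (squares - n_pits)

    print(config_prior(3))   # ≈ 0.00044 for any fixed configuration with 3 pits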


Wumpus: Query and Simple Reasoning

Assume that we have evidence:
I b = ¬b1,1 ∧ b1,2 ∧ b2,1 and
I κ = ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1

We are interested in answering queries such as P(P1,3 | κ, b). (pit in (1,3) given the evidence)

I The answer can be computed by enumeration of the full joint probability distribution.
I Let U be the set of the remaining Pi,j variables (all except P1,3 and those fixed by κ), then

  P(P1,3 | κ, b) = α ∑u∈U P(P1,3, u, κ, b)

I Problem: We need to explore all possible values of the variables in U (2^12 = 4096 terms!)
I Can we do better (faster)?


Wumpus: Conditional Independence

I Observation 8.2. The observed breezes are conditionally independent of the other variables given the known, frontier, and query variables.

I We split the set of hidden variables into fringe and other variables: U = F ∪ O, where F is the fringe and O the rest.

I From conditional independence we get: P(b | P1,3, κ,U) = P(b | P1,3, κ,F )

I Now, let us exploit this formula.


Wumpus: Reasoning

I We calculate:

  P(P1,3 | κ, b) = α ∑u∈U P(P1,3, u, κ, b)
                = α ∑u∈U P(b | P1,3, κ, u) · P(P1,3, κ, u)
                = α ∑f∈F ∑o∈O P(b | P1,3, κ, f, o) · P(P1,3, κ, f, o)
                = α ∑f∈F P(b | P1,3, κ, f) · ∑o∈O P(P1,3, κ, f, o)
                = α ∑f∈F P(b | P1,3, κ, f) · ∑o∈O P(P1,3) · P(κ) · P(f) · P(o)
                = α P(P1,3) P(κ) ∑f∈F P(b | P1,3, κ, f) · P(f) · ∑o∈O P(o)
                = α′ P(P1,3) ∑f∈F P(b | P1,3, κ, f) · P(f)

  for α′ := α P(κ), since ∑o∈O P(o) = 1.


Wumpus: Solution

I We calculate, using the product rule and conditional independence (see above):

  P(P1,3 | κ, b) = α′ P(P1,3) ∑f∈F P(b | P1,3, κ, f) · P(f)

I Let us enumerate the possible models (value assignments) of the fringe variables F that are compatible with the observation b.

I P(P1,3 | κ, b) = α′ 〈0.2 · (0.04 + 0.16 + 0.16), 0.8 · (0.04 + 0.16)〉 ≈ 〈0.31, 0.69〉
I P(P3,1 | κ, b) ≈ 〈0.31, 0.69〉 by symmetry
I P(P2,2 | κ, b) ≈ 〈0.86, 0.14〉 (definitely avoid)
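The following Python sketch carries out exactly this fringe enumeration (our own illustration, not the lecture's code; the coordinates, the 0.2 pit prior, and the evidence are those of the example above):

    from itertools import product

    PIT_PRIOR = 0.2
    known_pit_free = {(1, 1), (1, 2), (2, 1)}            # kappa: visited squares, no pits
    breezy, not_breezy = {(1, 2), (2, 1)}, {(1, 1)}      # evidence b

    def neighbors(sq):
        c, r = sq
        return {(c + dc, r + dr) for dc, dr in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 1 <= c + dc <= 4 and 1 <= r + dr <= 4}

    def posterior_pit(query):
        # P(pit at query | kappa, b), summing only over the fringe models
        fringe = sorted({n for sq in breezy for n in neighbors(sq)}
                        - known_pit_free - {query})
        weight = {True: 0.0, False: 0.0}
        for q_val in (True, False):
            for f_vals in product((True, False), repeat=len(fringe)):
                pits = {sq for sq, v in zip(fringe, f_vals) if v}
                if q_val:
                    pits.add(query)
                consistent = (all(neighbors(sq) & pits for sq in breezy) and
                              not any(neighbors(sq) & pits for sq in not_breezy))
                if consistent:                            # P(b | ...) is 1, else 0
                    prior = PIT_PRIOR if q_val else 1 - PIT_PRIOR
                    for v in f_vals:
                        prior *= PIT_PRIOR if v else 1 - PIT_PRIOR
                    weight[q_val] += prior
        alpha = 1 / (weight[True] + weight[False])
        return {v: alpha * w for v, w in weight.items()}

    print(posterior_pit((1, 3)))   # ≈ {True: 0.31, False: 0.69}
    print(posterior_pit((2, 2)))   # ≈ {True: 0.86, False: 0.14}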


7.9 Conclusion


Summary

I Uncertainty is unavoidable in many environments, namely whenever agents do not have perfect knowledge.
I Probabilities express the degree of belief of an agent, given its knowledge, in an event.
I Conditional probabilities express the likelihood of an event given observed evidence.
I Assessing a probability means using statistics to approximate the likelihood of an event.
I Bayes' rule allows us to derive, from probabilities that are easy to assess, probabilities that aren't easy to assess.
I Given multiple pieces of evidence, we can exploit conditional independence.
I Bayesian networks (up next) do this in a comprehensive manner.


Chapter 8 Probabilistic Reasoning, Part II: Bayesian Networks


8.1 Introduction


Reminder: Our Agenda for This Topic

I Our treatment of the topic "Probabilistic Reasoning" consists of the previous chapter and this one.
I Chapter 7: All the basic machinery used in Bayesian networks.
I This chapter: Bayesian networks: What they are, how to build them, how to use them.
I The most widespread and successful practical framework for probabilistic reasoning.


Reminder: Our Machinery

I 1. Graph captures variable dependencies: (variables X1, . . . ,Xn)

  [BN: Cavity → Toothache, Cavity → Catch]

I Given evidence e, want to know P(X | e). Remaining vars: Y.
I 2. Normalization+Marginalization:

  P(X | e) = α P(X , e) = α ∑y∈Y P(X , e, y)

I A sum over atomic events!
I 3. Chain rule: Order X1, . . . ,Xn consistently with the dependency graph.

  P(X1, . . . ,Xn) = P(Xn | Xn−1, . . . ,X1) · P(Xn−1 | Xn−2, . . . ,X1) · . . . · P(X1)

I 4. Exploit conditional independence: Instead of P(Xi | Xi−1, . . . ,X1), we can use P(Xi | Parents(Xi )).
I Bayesian networks!


Some Applications

I A ubiquitous problem: Observe "symptoms", need to infer "causes".

  [Four example domains pictured: Medical Diagnosis, Face Recognition, Self-Localization, Nuclear Test Ban]


Our Agenda for This Chapter

I What is a Bayesian Network? What is the syntax?
I Tells you what Bayesian networks look like.
I What is the Meaning of a Bayesian Network? What is the semantics?
I Makes the intuitive meaning precise.
I Constructing Bayesian Networks: How do we design these networks? What effect do our choices have on their size?
I Before you can start doing inference, you need to model your domain.
I Inference in Bayesian Networks: How do we use these networks? What is the associated complexity?
I Inference is our primary purpose. It is important to understand its complexities and how it can be improved.


8.2 What is a Bayesian Network?


What is a Bayesian Network? (Short: BN)

I What do the others say?
I "A Bayesian network is a methodology for representing the full joint probability distribution. In some cases, that representation is compact."
I "A Bayesian network is a graph whose nodes are random variables Xi and whose edges 〈Xj ,Xi 〉 denote a direct influence of Xj on Xi . Each node Xi is associated with a conditional probability table (CPT), specifying P(Xi | Parents(Xi ))."
I "A Bayesian network is a graphical way to depict conditional independence relations within a set of random variables."
I A Bayesian network (BN) represents the structure of a given domain. Probabilistic inference exploits that structure for improved efficiency.
I BN inference: Determine the distribution of a query variable X given observed evidence e: P(X | e).


John, Mary, and My Brand-New Alarm

I Example 2.1 (From Russell/Norvig).
I I got very valuable stuff at home. So I bought an alarm. Unfortunately, the alarm just rings at home and doesn't call me on my mobile.
I I've got two neighbors, Mary and John, who'll call me if they hear the alarm.
I The problem is that, sometimes, the alarm is caused by an earthquake.
I Also, John might confuse the alarm with his telephone, and Mary might miss the alarm altogether because she typically listens to loud music.

Question: Given that both John and Mary call me, what is the probability of a burglary?


John, Mary, and My Alarm: Designing the BN

I Cooking Recipe:
(1) Design the random variables X1, . . . ,Xn;
(2) Identify their dependencies;
(3) Insert the conditional probability tables P(Xi | Parents(Xi )).

I Example 2.2 (Let's cook!).
(1) Random variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls.
(2) Dependencies: Burglaries and earthquakes are independent (this is actually debatable; a design decision!); the alarm might be activated by either. John and Mary call if and only if they hear the alarm (they don't care about earthquakes).
(3) Conditional probability tables: Assess the probabilities, see next slide.


John, Mary, and My Alarm: The BN

  Structure: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls.

  P(B) = .001     P(E) = .002

  B E | P(A)        A | P(J)        A | P(M)
  T T | .95         T | .90         T | .70
  T F | .94         F | .05         F | .01
  F T | .29
  F F | .001

Note: In each P(Xi | Parents(Xi )), we show only P(Xi = T | Parents(Xi )). We don't show P(Xi = F | Parents(Xi )), which is 1 − P(Xi = T | Parents(Xi )).
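In code, the CPTs of this network are just lookup tables, and the note above (storing only the T-rows) is mirrored by computing the F-rows as 1 minus the stored value. A minimal sketch (numbers as in the tables; identifiers are ours):

    # P(X = T | parent values) for each node of the alarm network
    cpt_true = {
        "Burglary":   {(): 0.001},
        "Earthquake": {(): 0.002},
        "Alarm":      {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001},
        "JohnCalls":  {(True,): 0.90, (False,): 0.05},
        "MaryCalls":  {(True,): 0.70, (False,): 0.01},
    }

    def cpt_entry(var, value, parent_values):
        # the F-row is 1 minus the stored T-row
        p_true = cpt_true[var][parent_values]
        return p_true if value else 1 - p_true

    print(cpt_entry("Alarm", False, (True, False)))   # 1 - 0.94 = 0.06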


The Syntax of Bayesian Networks

  [Alarm network with CPTs as on the previous slide]

I Definition 2.3 (Bayesian Network). Given random variables X1, . . . ,Xn with finite domains D1, . . . ,Dn, a Bayesian network (also belief network or probabilistic network) is an acyclic directed graph BN = 〈{X1, . . . ,Xn},E〉. We denote Parents(Xi ) := {Xj | (Xj ,Xi ) ∈ E}. Each Xi is associated with a function CPT(Xi ) : Di × ∏Xj∈Parents(Xi ) Dj → [0, 1], the conditional probability table.
I Related formalisms are summed up under the term graphical models.


8.3 What is the Meaning of a Bayesian Network?


The Semantics of BNs: Illustration

  [Alarm network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]

I Alarm depends on Burglary and Earthquake.
I MaryCalls only depends on Alarm:

  P(MaryCalls | Alarm,Burglary) = P(MaryCalls | Alarm)

I Bayesian networks represent sets of independence assumptions.


The Semantics of BNs: Illustration, ctd.

I Each node X in a BN is conditionally independent of its non-descendants given its parents Parents(X ).

  [Generic node X with parents U1, . . . ,Um, children Y1, . . . ,Yn, and the children's descendants Zij]


The Semantics of BNs: Illustration, ctd.

  [Alarm network as above]

I Given the value of Alarm, MaryCalls is independent of?
I Burglary, Earthquake, JohnCalls.


The Semantics of BNs: Formal

  [Alarm network with CPTs as above]

I Definition 3.1. Given a Bayesian network BN = 〈{X1, . . . ,Xn},E〉, we identify BN with the following two assumptions:
(A) For 1 ≤ i ≤ n, Xi is conditionally independent of NonDesc(Xi ) given Parents(Xi ), where NonDesc(Xi ) := {Xj | (Xi ,Xj ) ∉ E∗}\Parents(Xi ) and E∗ is the transitive-reflexive closure of E .
(B) For 1 ≤ i ≤ n, all values xi of Xi , and all value combinations of Parents(Xi ), we have P(xi | Parents(Xi )) = CPT(xi ,Parents(Xi )).


Recovering the Full Joint Probability Distribution

I "A Bayesian network is a methodology for representing the full joint probability distribution."
I Problem: How to recover the full joint probability distribution P(X1, . . . ,Xn) from BN = 〈{X1, . . . ,Xn},E〉?
I Chain rule: For any ordering X1, . . . ,Xn, we have:

  P(X1, . . . ,Xn) = P(Xn | Xn−1, . . . ,X1) · P(Xn−1 | Xn−2, . . . ,X1) · . . . · P(X1)

  Choose X1, . . . ,Xn consistent with BN: Xj ∈ Parents(Xi ) implies j < i.

I Observation 3.2 (Exploiting Conditional Independence). With BN assumption (A), we can use P(Xi | Parents(Xi )) instead of P(Xi | Xi−1, . . . ,X1):

  P(X1, . . . ,Xn) = ∏i P(Xi | Parents(Xi ))

  The distributions P(Xi | Parents(Xi )) are given by BN assumption (B).
I Same for atomic events P(x1, . . . , xn).
I Observation 3.3 (Why "acyclic"?). For a cyclic BN, this does NOT hold; indeed, cyclic BNs may be self-contradictory. (we need a consistent ordering)


Recovering a Probability for John, Mary, and the Alarm

I Example 3.4. John and Mary called because there was an alarm, but no earthquake or burglary:

  P(j ,m, a,¬b,¬e) = P(j | a) · P(m | a) · P(a | ¬b,¬e) · P(¬b) · P(¬e)
                   = 0.9 · 0.7 · 0.001 · 0.999 · 0.998 ≈ 0.000628

  [Alarm network with CPTs as above]
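Example 3.4 is a one-liner once the CPTs are available as data; the sketch below (our own code, CPT numbers as in the tables above) multiplies one CPT entry per variable:

    parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
    cpt_true = {"B": {(): 0.001}, "E": {(): 0.002},
                "A": {(True, True): 0.95, (True, False): 0.94,
                      (False, True): 0.29, (False, False): 0.001},
                "J": {(True,): 0.90, (False,): 0.05},
                "M": {(True,): 0.70, (False,): 0.01}}

    def joint(event):
        # P(x1, ..., xn) = product over i of P(xi | values of Parents(Xi))
        p = 1.0
        for var, value in event.items():
            p_true = cpt_true[var][tuple(event[par] for par in parents[var])]
            p *= p_true if value else 1 - p_true
        return p

    print(joint({"B": False, "E": False, "A": True, "J": True, "M": True}))  # ≈ 0.000628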


Questionnaire

  [BN: Animal → LoudNoise, Animal → LikesChappi]

I Say BN is the Bayesian network above. Which statements are correct?
(A) Animal is independent of LikesChappi.
(B) LoudNoise is independent of LikesChappi.
(C) Animal is conditionally independent of LikesChappi given LoudNoise.
(D) LikesChappi is conditionally independent of LoudNoise given Animal.

I Answers:
(A) No: likeschappi indicates dog.
(B) No: Not knowing what animal it is, likeschappi is an indication for dog, which indicates loudnoise.
(C) No: For example, even if we know loudnoise, learning likeschappi in addition gives us a stronger indication of Animal = dog.
(D) Yes: Xi = LikesChappi is conditionally independent of NonDesc(Xi ) = {LoudNoise} given Parents(Xi ) = {Animal}.


8.4 Constructing Bayesian Networks


Constructing Bayesian Networks

I BN construction algorithm:
1. Initialize BN := 〈{X1, . . . ,Xn},E〉 where E = ∅.
2. Fix any order of the variables, X1, . . . ,Xn.
3. for i := 1, . . . , n do
   a. Choose a minimal set Parents(Xi ) ⊆ {X1, . . . ,Xi−1} so that P(Xi | Xi−1, . . . ,X1) = P(Xi | Parents(Xi )).
   b. For each Xj ∈ Parents(Xi ), insert (Xj ,Xi ) into E .
   c. Associate Xi with CPT(Xi ) corresponding to P(Xi | Parents(Xi )).

  Attention: Which variables we need to include into Parents(Xi ) depends on what "X1, . . . ,Xi−1" is . . . ! (see the code sketch below)

I The size of the resulting BN depends on the chosen order X1, . . . ,Xn.
I The size of a Bayesian network is not a fixed property of the domain. It depends on the skill of the designer.
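As referenced above, here is a minimal Python sketch of the construction loop. The conditional-independence test is an oracle we have to supply; the toy oracle below encodes only the facts stated in the Animal/LoudNoise/LikesChappi questionnaire (LikesChappi and LoudNoise are conditionally independent given Animal, and nothing else is independent). All identifiers are ours.

    from itertools import combinations

    def independent_given(x, others, given):
        # Toy conditional-independence oracle: the only independence in the
        # Animal/LoudNoise/LikesChappi domain is LikesChappi _|_ LoudNoise given Animal.
        return {x} | set(others) == {"LikesChappi", "LoudNoise"} and "Animal" in given

    def construct_bn(order):
        edges = []
        for i, x in enumerate(order):
            preds = order[:i]
            for k in range(len(preds) + 1):          # try the smallest parent sets first
                chosen = None
                for parents in combinations(preds, k):
                    rest = [p for p in preds if p not in parents]
                    if all(independent_given(x, [r], parents) for r in rest):
                        chosen = parents
                        break
                if chosen is not None:
                    edges += [(p, x) for p in chosen]
                    break
        return edges

    print(construct_bn(["Animal", "LoudNoise", "LikesChappi"]))
    # two edges: Animal -> LoudNoise, Animal -> LikesChappi (the "causal" network)
    print(construct_bn(["LoudNoise", "LikesChappi", "Animal"]))
    # three edges: with this ordering the network becomes fully connected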


John and Mary Depend on the Variable Order!

I Example 4.1. MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.


John and Mary Depend on the Variable Order! Ctd.

I Example 4.2. MaryCalls, JohnCalls, Earthquake, Burglary, Alarm.


John and Mary, What Went Wrong?

I These BNs link from symptoms to causes! (like P(Cavity | Toothache))
I We fail to identify many conditional independence relations (e.g., we get dependencies between conditionally independent symptoms).
I Also recall: Conditional probabilities P(Symptom | Cause) are more robust and often easier to assess than P(Cause | Symptom).
I Rule of Thumb: We should order causes before symptoms.


Compactness of Bayesian Networks

I Definition 4.3. Given random variables X1, . . . ,Xn with finite domains D1, . . . ,Dn, the size of BN = 〈{X1, . . . ,Xn},E〉 is defined as

  size(BN) := ∑i #(Di ) · ∏Xj∈Parents(Xi ) #(Dj)

I That is, the total number of entries in the CPTs.
I Smaller BN: fewer probabilities to assess, more efficient inference.
I The explicit full joint probability distribution has size ∏i #(Di ).
I If #(Parents(Xi )) ≤ k for every Xi , and Dmax is the largest variable domain, then size(BN) ≤ n · #(Dmax)^(k+1).
I For #(Dmax) = 2, n = 20, k = 4 we have 2^20 = 1048576 probabilities, but a Bayesian network of size ≤ 20 · 2^5 = 640 . . . !
I In the worst case, size(BN) = n · ∏i #(Di ), namely if every variable depends on all its predecessors in the chosen order.
I BNs are compact if each variable is directly influenced only by few of its predecessor variables.
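As a quick check of Definition 4.3, the following sketch computes size(BN) for the alarm network from above (all domains are Boolean; identifiers are ours):

    domain_size = {"Burglary": 2, "Earthquake": 2, "Alarm": 2, "JohnCalls": 2, "MaryCalls": 2}
    parents = {"Burglary": [], "Earthquake": [], "Alarm": ["Burglary", "Earthquake"],
               "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"]}

    def bn_size(domain_size, parents):
        # size(BN) = sum over nodes of #(D_i) times the product of the parent domain sizes
        total = 0
        for x, d in domain_size.items():
            rows = 1
            for p in parents[x]:
                rows *= domain_size[p]
            total += d * rows
        return total

    print(bn_size(domain_size, parents))   # 2 + 2 + 8 + 4 + 4 = 20 CPT entries
    print(2 ** len(domain_size))           # explicit full joint: 2^5 = 32 entries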


Representing Conditional Distributions: Deterministic Nodes

I Problem: Even if the number of parents k is small, the CPT has 2^k entries. (worst case)
I Idea: Usually CPTs follow standard patterns called canonical distributions.
I We only need to determine the pattern and some values.
I Definition 4.4. A node X in a Bayesian network is called deterministic, if its value is completely determined by the values of Parents(X ).
I Example 4.5 (Logical Dependencies). In the network below, the node European is deterministic; the CPT corresponds to a logical disjunction, i.e. P(european) = P(greek ∨ german ∨ french).

  [BN: Greek → European, German → European, French → European]

I Example 4.6 (Numerical Dependencies). In the network below, the node Students is deterministic; the CPT corresponds to an arithmetic relation among the parents, i.e. P(S = i − d − g | I = i ,D = d ,G = g) = 1.

  [BN: Inscriptions → Students, Dropouts → Students, Graduations → Students]

I Intuition: Deterministic nodes model direct, causal relationships.
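A deterministic CPT needs no probability assessments at all; for Example 4.5 it can be generated directly from the logical relation (a small sketch; identifiers are ours):

    from itertools import product

    # P(european = T | greek, german, french) for all parent value combinations
    european_cpt = {values: 1.0 if any(values) else 0.0
                    for values in product((True, False), repeat=3)}
    print(european_cpt[(False, False, True)])   # 1.0
    print(european_cpt[(False, False, False)])  # 0.0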


Representing Conditional Distributions: Noisy Nodes

I Problem: Sometimes, values of nodes are only "almost deterministic". (uncertain, but mostly logical)
I Idea: Use "noisy" logical relationships. (generalize logical ones softly to [0, 1])
I Example 4.7 (Inhibited Causal Dependencies). In the network below, a deterministic disjunction for the node Fever is incorrect, since the diseases sometimes fail to cause a fever. The causal relation between parent and child is inhibited.

  [BN: Cold → Fever, Flu → Fever, Malaria → Fever]

I Assumptions: We make the following assumptions for modeling Example 4.7:
1. Cold, Flu, and Malaria is a complete list of fever causes (otherwise, add a leak node for the others).
2. The inhibitions of the parents are independent.
Thus we can model the inhibitions by individual inhibition factors qd .
I Definition 4.8. The CPT of a noisy disjunction node X in a Bayesian network is given by P(¬x | Parents(X )) = ∏j:Xj=T qj , where the qj are the inhibition factors of the parents Xj ∈ Parents(X ).


Representing Conditional Distributions: Noisy Nodes

I Example 4.9. We have the following inhibition factors for Example 4.7:

  qcold    = P(¬fever | cold,¬flu,¬malaria)  = 0.6
  qflu     = P(¬fever | ¬cold, flu,¬malaria) = 0.2
  qmalaria = P(¬fever | ¬cold,¬flu,malaria)  = 0.1

  If we model Fever as a noisy disjunction node, then the general rule P(¬fever | Parents(Fever)) = ∏j:Xj=T qj for the CPT gives the following table:

  Cold  Flu  Malaria | P(Fever) | P(¬Fever)
  F     F    F       | 0.0      | 1.0
  F     F    T       | 0.9      | 0.1
  F     T    F       | 0.8      | 0.2
  F     T    T       | 0.98     | 0.02 = 0.2 · 0.1
  T     F    F       | 0.4      | 0.6
  T     F    T       | 0.94     | 0.06 = 0.6 · 0.1
  T     T    F       | 0.88     | 0.12 = 0.6 · 0.2
  T     T    T       | 0.988    | 0.012 = 0.6 · 0.2 · 0.1
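The table can be generated mechanically from the three inhibition factors. A minimal sketch (numbers from Example 4.9; identifiers are ours):

    from itertools import product

    q = {"cold": 0.6, "flu": 0.2, "malaria": 0.1}   # inhibition factors

    def noisy_or_cpt(q):
        # P(fever = T | parent values): 1 minus the product of the q's of the true parents
        parents = list(q)
        cpt = {}
        for values in product((False, True), repeat=len(parents)):
            p_not = 1.0
            for parent, value in zip(parents, values):
                if value:
                    p_not *= q[parent]
            cpt[values] = 1.0 - p_not
        return cpt

    cpt = noisy_or_cpt(q)
    print(cpt[(True, True, True)])     # ≈ 0.988
    print(cpt[(False, False, False)])  # 0.0 (no disease, no fever)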


Representing Conditional Distributions: Noisy Nodes

I Observation 4.10. In general, noisy logical relationships in which a variable depends on k parents can be described by O(k) parameters instead of O(2^k) for the full conditional probability table. This can make assessment (and learning) tractable.
I Example 4.11. The CPCS network [PraProMid:kelbn94] uses noisy-OR and noisy-MAX distributions to model relationships among diseases and symptoms in internal medicine. With 448 nodes and 906 links, it requires only 8,254 values instead of 133,931,430 for a network with full CPTs.


Questionnaire

I Question: What is the Bayesian network we get by constructing according to the ordering X1 = LoudNoise, X2 = Animal, X3 = LikesChappi?
I Answer:

  [Network over Animal, LoudNoise, LikesChappi; following the construction algorithm with the independencies stated above, this ordering yields LoudNoise → Animal → LikesChappi.]

I Question: What is the Bayesian network we get by constructing according to the ordering X1 = LoudNoise, X2 = LikesChappi, X3 = Animal?
I Answer:

  [Network over Animal, LoudNoise, LikesChappi; this ordering yields a fully connected network: LoudNoise → LikesChappi, LoudNoise → Animal, LikesChappi → Animal.]


8.5 Inference in Bayesian Networks


Inference for Mary and John

I Intuition: Observe evidence variables and draw conclusions on query variables.
I Example 5.1.

  [Alarm network with CPTs as above]

I What is P(Burglary | johncalls)?
I What is P(Burglary | johncalls,marycalls)?


Probabilistic Inference Tasks in Bayesian Networks

I Definition 5.2 (Probabilistic Inference Task). Given random variables X1, . . . ,Xn, a probabilistic inference task consists of a set X ⊆ {X1, . . . ,Xn} of query variables, a set E ⊆ {X1, . . . ,Xn} of evidence variables, and an event e that assigns values to E. We wish to compute the posterior probability distribution P(X | e).
  Y := {X1, . . . ,Xn}\(X ∪ E) are the hidden variables.
I Notes:
I We assume that a BN for X1, . . . ,Xn is given.
I In the remainder, for simplicity, X = {X} is a singleton.
I Example 5.3. In P(Burglary | johncalls,marycalls), X = {Burglary}, e = {johncalls,marycalls}, and Y = {Alarm,Earthquake}.


Inference by Enumeration: The Principle (A Reminder!)

I Problem: Given evidence e, want to know P(X | e). Hidden variables: Y.
I 1. Bayesian network BN captures variable dependencies.
I 2. Normalization+Marginalization:

  P(X | e) = α P(X , e); if Y ≠ ∅ then P(X | e) = α ∑y∈Y P(X , e, y)

I Recover the summed-up probabilities P(X , e, y) from BN!
I 3. Chain rule: Order X1, . . . ,Xn consistent with BN.

  P(X1, . . . ,Xn) = P(Xn | Xn−1, . . . ,X1) · P(Xn−1 | Xn−2, . . . ,X1) · . . . · P(X1)

I 4. Exploit conditional independence: Instead of P(Xi | Xi−1, . . . ,X1), use P(Xi | Parents(Xi )).
I Given a Bayesian network BN, probabilistic inference tasks can be solved as sums of products of conditional probabilities from BN.
I Sum over all value combinations of hidden variables.
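A minimal Python sketch of this recipe for the alarm network (CPT numbers as above; this is our own illustration, not the lecture's code): it sums the product of one CPT entry per variable over all values of the hidden variables, then normalizes.

    from itertools import product

    parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
    cpt_true = {"B": {(): 0.001}, "E": {(): 0.002},
                "A": {(True, True): 0.95, (True, False): 0.94,
                      (False, True): 0.29, (False, False): 0.001},
                "J": {(True,): 0.90, (False,): 0.05},
                "M": {(True,): 0.70, (False,): 0.01}}

    def p(var, value, event):
        # one CPT lookup: P(var = value | parent values taken from event)
        p_true = cpt_true[var][tuple(event[par] for par in parents[var])]
        return p_true if value else 1 - p_true

    def enumeration_ask(query, evidence):
        hidden = [v for v in parents if v != query and v not in evidence]
        weight = {}
        for q_val in (True, False):
            total = 0.0
            for h_vals in product((True, False), repeat=len(hidden)):
                event = dict(evidence, **{query: q_val}, **dict(zip(hidden, h_vals)))
                prob = 1.0
                for var in parents:              # product of conditional probabilities
                    prob *= p(var, event[var], event)
                total += prob
            weight[q_val] = total
        alpha = 1 / sum(weight.values())
        return {v: alpha * w for v, w in weight.items()}

    print(enumeration_ask("B", {"J": True, "M": True}))   # P(Burglary | j, m): True ≈ 0.284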

Kohlhase: Künstliche Intelligenz 2 218 July 12, 2018

Page 204: Artificial Intelligence 2 (Künstliche Intelligenz 2) Part ... · Kohlhase: Künstliche Intelligenz 2 130 July 12, 2018. Environmenttypes IExample1.11. Someenvironmentsclassified:

Inference by Enumeration: The Principle (A Reminder!)

I Problem: Given evidence e, want to know P(X | e). Hidden variables: Y.I 1. Bayesian network BN captures variable dependencies.

I 2. Normalization+Marginalization.P(X | e) = α P(X , e); if Y 6= ∅ then P(X | e) = α

∑y∈Y P(X , e, y)

I Recover the summed-up probabilities P(X , e, y) from BN!I 3. Chain rule. Order X1, . . . ,Xn consistent with BN.

P(X1, . . . ,Xn) = P(Xn | Xn−1, . . . ,X1) P(Xn−1 | Xn−2, . . . ,X1) . . . P(X1)

I 4. Exploit conditional independence. Instead of P(Xi | Xi−1, . . . ,X1), useP(Xi | Parents(Xi )).

I Given a Bayesian network BN, probabilistic inference tasks can be solved assums of products of conditional probabilities from BN.

I Sum over all value combinations of hidden variables.

Kohlhase: Künstliche Intelligenz 2 218 July 12, 2018

Page 205: Artificial Intelligence 2 (Künstliche Intelligenz 2) Part ... · Kohlhase: Künstliche Intelligenz 2 130 July 12, 2018. Environmenttypes IExample1.11. Someenvironmentsclassified:

Inference by Enumeration: The Principle (A Reminder!)

I Problem: Given evidence e, want to know P(X | e). Hidden variables: Y.I 1. Bayesian network BN captures variable dependencies.I 2. Normalization+Marginalization.

P(X | e) = α P(X , e); if Y 6= ∅ then P(X | e) = α∑

y∈Y P(X , e, y)

I Recover the summed-up probabilities P(X , e, y) from BN!I 3. Chain rule. Order X1, . . . ,Xn consistent with BN.

P(X1, . . . ,Xn) = P(Xn | Xn−1, . . . ,X1) P(Xn−1 | Xn−2, . . . ,X1) . . . P(X1)

I 4. Exploit conditional independence. Instead of P(Xi | Xi−1, . . . ,X1), useP(Xi | Parents(Xi )).

I Given a Bayesian network BN, probabilistic inference tasks can be solved assums of products of conditional probabilities from BN.

I Sum over all value combinations of hidden variables.

Kohlhase: Künstliche Intelligenz 2 218 July 12, 2018

Inference by Enumeration: John and Mary

[Bayesian network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls]
  P(B) = .001   P(E) = .002
  P(A | B, E):   b, e: .95   b, ¬e: .94   ¬b, e: .29   ¬b, ¬e: .001
  P(J | A):   a: .90   ¬a: .05
  P(M | A):   a: .70   ¬a: .01

I Want: P(Burglary | johncalls, marycalls). Hidden variables: Y = {Earthquake, Alarm}.
I Normalization+Marginalization:
  P(B | j, m) = α P(B, j, m) = α Σ_{vE} Σ_{vA} P(B, j, m, vE, vA)
I Order X1 = B, X2 = E, X3 = A, X4 = J, X5 = M.
I Chain rule and conditional independence:
  P(B | j, m) = α Σ_{vE} Σ_{vA} P(B) · P(vE) · P(vA | B, vE) · P(j | vA) · P(m | vA)
I Continuation on next slide . . .

Kohlhase: Künstliche Intelligenz 2 219 July 12, 2018

Inference by Enumeration: John and Mary, ctd.

I Move variables outwards (until we hit the first parent):
  P(B | j, m) = α · P(B) · Σ_{vE} P(vE) · Σ_{vA} P(vA | B, vE) · P(j | vA) · P(m | vA)
I The probabilities of the outside variables multiply the entire “rest of the sum”.
I Chain rule and conditional independence, ctd. (expanded for B = b):
  P(B | j, m) = α P(B) Σ_{vE} P(vE) Σ_{vA} P(vA | B, vE) P(j | vA) P(m | vA)
  = α · P(b) · ( P(e) · [P(a | b, e) P(j | a) P(m | a) + P(¬a | b, e) P(j | ¬a) P(m | ¬a)]
               + P(¬e) · [P(a | b, ¬e) P(j | a) P(m | a) + P(¬a | b, ¬e) P(j | ¬a) P(m | ¬a)] )
  = α ⟨0.00059224, 0.0014919⟩ ≈ ⟨0.284, 0.716⟩
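This computation can be checked mechanically. A minimal Python sketch (my own illustration, not the lecture's code) that sums out the hidden variables E and A using the CPT values of the network above:

    CPT_A = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}   # P(A = true | B, E)
    P_B, P_E = 0.001, 0.002
    P_J = {True: 0.90, False: 0.05}   # P(j | A), evidence JohnCalls = true
    P_M = {True: 0.70, False: 0.01}   # P(m | A), evidence MaryCalls = true

    def unnormalized(b):
        # sum out the hidden variables E and A for P(B = b, j, m)
        total = 0.0
        for e in (True, False):
            for a in (True, False):
                p_a = CPT_A[(b, e)] if a else 1 - CPT_A[(b, e)]
                total += (P_E if e else 1 - P_E) * p_a * P_J[a] * P_M[a]
        return (P_B if b else 1 - P_B) * total

    scores = {b: unnormalized(b) for b in (True, False)}
    alpha = 1 / sum(scores.values())
    print({b: round(alpha * s, 3) for b, s in scores.items()})   # {True: 0.284, False: 0.716}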

Kohlhase: Künstliche Intelligenz 2 220 July 12, 2018

The Evaluation of P(b | j, m) as a “Search Tree”

I Inference by enumeration = a tree with “sum nodes” branching over the values of the hidden variables, and with non-branching “multiplication nodes”.

Kohlhase: Künstliche Intelligenz 2 221 July 12, 2018


Inference by Enumeration: Properties

I Inference by Enumeration:
I Evaluates the tree in a depth-first manner.
I Space Complexity: Linear in the number of variables.
I Time Complexity: Exponential in the number of hidden variables, e.g. O(2^#(Y)) in case these variables are Boolean.
I Can we do better than this?
I Variable Elimination:
I Improves on inference by enumeration through (A) avoiding repeated computation, and (B) avoiding irrelevant computation.
I In some special cases, variable elimination runs in polynomial time.

Kohlhase: Künstliche Intelligenz 2 222 July 12, 2018

Variable Elimination: Sketch of Ideas

I (A) Avoiding repeated computation: Evaluate expressions from right to left, storing all intermediate results.
I For query P(B | j, m):
  1. CPTs of BN yield factors (probability tables):
     P(B | j, m) = α P(B) Σ_{vE} P(vE) Σ_{vA} P(vA | B, vE) P(j | vA) P(m | vA)
     with factors f1(B) = P(B), f2(E) = P(E), f3(A, B, E) = P(A | B, E), f4(A) = P(j | A), f5(A) = P(m | A)
  2. Then the computation is performed in terms of factor product and summing out variables from factors:
     P(B | j, m) = α · f1(B) · Σ_{vE} f2(E) · Σ_{vA} f3(A, B, E) · f4(A) · f5(A)
I (B) Avoiding irrelevant computation: Repeatedly remove hidden variables that are leaf nodes.
I For query P(JohnCalls | burglary):
  P(J | b) = α P(b) Σ_{vE} P(vE) Σ_{vA} P(vA | b, vE) P(J | vA) Σ_{vM} P(vM | vA)
I The rightmost sum equals 1 and can be dropped.
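To make the two factor operations concrete, here is a small Python sketch (my own illustration, not the lecture's code); factors over Boolean variables are stored as dicts from value tuples to numbers:

    from itertools import product

    def factor_product(f1, vars1, f2, vars2):
        # pointwise product of two factors; returns the new factor and its variable list
        joint = vars1 + [v for v in vars2 if v not in vars1]
        out = {}
        for values in product((True, False), repeat=len(joint)):
            env = dict(zip(joint, values))
            out[values] = f1[tuple(env[v] for v in vars1)] * f2[tuple(env[v] for v in vars2)]
        return out, joint

    def sum_out(f, vars_, var):
        # sum a Boolean variable out of a factor
        keep = [v for v in vars_ if v != var]
        out = {}
        for values, p in f.items():
            env = dict(zip(vars_, values))
            key = tuple(env[v] for v in keep)
            out[key] = out.get(key, 0.0) + p
        return out, keep

    f4 = {(True,): 0.90, (False,): 0.05}   # f4(A) = P(j | A)
    f5 = {(True,): 0.70, (False,): 0.01}   # f5(A) = P(m | A)
    f45, v45 = factor_product(f4, ["A"], f5, ["A"])   # next: multiply in f3(A,B,E), then sum out A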

Kohlhase: Künstliche Intelligenz 2 223 July 12, 2018

The Complexity of Exact Inference

I Good News:
I Definition 5.4. A graph is called singly connected, or a polytree, if there is at most one undirected path between any two nodes in the graph.
I Theorem 5.5. On polytree Bayesian networks, variable elimination runs in polynomial time.
I Is our BN for Mary & John a polytree?
I Bad News:
I For multiply connected Bayesian networks, probabilistic inference is #P-hard in general.
I #P is harder than NP (i.e. NP ⊆ #P).
I So?
I Life goes on . . . In the hard cases, if need be we can throw exactitude to the winds and approximate.
I Example 5.6. Sampling techniques

Kohlhase: Künstliche Intelligenz 2 224 July 12, 2018

8.6 Conclusion

Kohlhase: Künstliche Intelligenz 2 224 July 12, 2018


Summary

I Bayesian networks (BN) are a widespread tool to model uncertainty and to reason about it. A BN represents conditional independence relations between random variables. It consists of a graph encoding the variable dependencies, and of conditional probability tables (CPTs).
I Given a variable order, the BN is small if every variable depends on only a few of its predecessors.
I Probabilistic inference requires computing the probability distribution of a set of query variables, given a set of evidence variables whose values we know. The remaining variables are hidden.
I Inference by enumeration takes a BN as input, applies Normalization+Marginalization and the Chain rule, and exploits conditional independence. This can be viewed as a tree search that branches over all values of the hidden variables.
I Variable elimination avoids unnecessary computation. It runs in polynomial time for polytree BNs. In general, exact probabilistic inference is #P-hard. Approximate probabilistic inference methods exist.

Kohlhase: Künstliche Intelligenz 2 225 July 12, 2018


Topics We Didn’t Cover Here

I Inference by sampling: A whole zoo of methods for doing this exists.

I Clustering: Pre-combining subsets of variables to reduce the runtime of inference.
I Compilation to SAT: More precisely, to “weighted model counting” in CNF formulas. Model counting extends DPLL with the ability to determine the number of satisfying interpretations. Weighted model counting allows defining a mass for each such interpretation (= the probability of an atomic event).
I Dynamic BN: BN with one slice of variables at each “time step”, encoding probabilistic behavior over time.
I Relational BN: BN with predicates and object variables.
I First-order BN: Relational BN with quantification, i.e. probabilistic logic, e.g. the BLOG language developed by Stuart Russell and co-workers.

Kohlhase: Künstliche Intelligenz 2 226 July 12, 2018

Chapter 9 Making Simple Decisions Rationally

Kohlhase: Künstliche Intelligenz 2 226 July 12, 2018


9.1 Introduction

Kohlhase: Künstliche Intelligenz 2 226 July 12, 2018


Decision Theory

I Definition 1.1. Decision theory investigates how an agent a deals with choosing among actions based on the desirability of their outcomes.
I Wait: Isn’t that what we did in Problem Solving?
I Yes, but: now we do it for stochastic (i.e. non-deterministic), partially observable environments.
I Recall: We call an agent environment
I fully observable, iff a’s sensors give it access to the complete state of the environment at any point in time, else partially observable.
I deterministic, iff the next state of the environment is completely determined by the current state and a’s action, else stochastic.
I episodic, iff a’s experience is divided into atomic episodes, in which it perceives and then performs a single action. Crucially, the next episode does not depend on previous ones. Non-episodic environments are called sequential.
I We restrict ourselves to episodic decision theory, which deals with choosing among actions based on the desirability of their immediate outcomes.

Kohlhase: Künstliche Intelligenz 2 227 July 12, 2018


Preview: Episodic Decision Theory

I Problem: The environment is partially observable, so we do not know the “current state”.
I Idea: rational decisions = choose actions that maximize expected utility (MEU)
I Treat the result of an action a as a random variable R(a) whose values are the possible outcome states.
I Study P(R(a) = s′ | a, e) given evidence observations e.
I Capture the agent’s preferences in a utility function U from states to the non-negative real numbers.
I Definition 1.2. The expected utility EU(a) of an action a (given evidence e) is then
  EU(a | e) = Σ_{s′} P(R(a) = s′ | a, e) · U(s′)
I Intuitively: A formalization of what it means to “do the right thing”.
I Hooray: This solves all of the AI problem. (in principle)
I Problem: There is a long, long way towards an operationalization. (we do that now)
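A minimal Python sketch of Definition 1.2 (names and numbers are made up for illustration):

    def expected_utility(outcome_probs, U):
        # outcome_probs: dict state -> P(R(a) = s' | a, e);  U: dict state -> utility
        return sum(p * U[s] for s, p in outcome_probs.items())

    print(expected_utility({"success": 0.7, "failure": 0.3},
                           {"success": 10, "failure": -2}))   # 6.4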

Kohlhase: Künstliche Intelligenz 2 228 July 12, 2018


Outline of this Chapter

I Rational preferences
I Utilities and Money
I Multiattribute utilities
I Decision networks
I Value of information

Kohlhase: Künstliche Intelligenz 2 229 July 12, 2018


9.2 Rational Preferences

Kohlhase: Künstliche Intelligenz 2 229 July 12, 2018


Preferences

I Problem: We cannot directly measure the utility of (or satisfaction/happiness in) a state.
I Idea: We can let people choose between two states! (subjective preference)
I Definition 2.1. An agent chooses among prizes (A, B, etc.) and lotteries, i.e., situations with uncertain prizes.
  Lottery L = [p, A; (1−p), B]  (a chance node yielding A with probability p and B with probability 1−p)
I Definition 2.2 (Preferences).
  A ≺ B   A preferred to B
  A ∼ B   indifference between A and B
  A ⪯ B   B not preferred to A

Kohlhase: Künstliche Intelligenz 2 230 July 12, 2018


Rational preferences

I Idea: Preferences of a rational agent must obey constraints:
  Rational preferences ↝ behavior describable as maximization of expected utility
I Definition 2.3. Constraints:
  Orderability       A ≺ B ∨ B ≺ A ∨ A ∼ B
  Transitivity       A ≺ B ∧ B ≺ C ⇒ A ≺ C
  Continuity         A ≺ B ≺ C ⇒ (∃p [p, A; (1−p), C] ∼ B)
  Substitutability   A ∼ B ⇒ [p, A; (1−p), C] ∼ [p, B; (1−p), C]
  Monotonicity       A ≺ B ⇒ ((p ≥ q) ⇔ [p, A; (1−p), B] ⪯ [q, A; (1−q), B])

Kohlhase: Künstliche Intelligenz 2 231 July 12, 2018


Rational preferences contd.

I Violating the constraints leads to self-evident irrationality.
I Example 2.4. An agent with intransitive preferences can be induced to give away all its money:
I If B ≺ C, then an agent who has C would pay (say) 1 cent to get B
I If A ≺ B, then an agent who has B would pay (say) 1 cent to get A
I If C ≺ A, then an agent who has A would pay (say) 1 cent to get C

Kohlhase: Künstliche Intelligenz 2 232 July 12, 2018


9.3 Utilities and Money

Kohlhase: Künstliche Intelligenz 2 232 July 12, 2018


Ramsey’s Theorem and Value Functions

I Theorem 3.1 (Ramsey, 1931; von Neumann and Morgenstern, 1944). Given preferences satisfying the constraints, there exists a real-valued function U such that
  (U(A) ≥ U(B)) ⇔ A ⪯ B   and   U([p1, S1; . . .; pn, Sn]) = Σ_i pi U(Si)
I These are existence theorems; uniqueness is not guaranteed.
I Note: Agent behavior is invariant w.r.t. positive linear transformation, i.e.
  U′(x) = k1 U(x) + k2 where k1 > 0
  behaves exactly like U.
I With deterministic prizes only (no lottery choices), only a total order on prizes can be determined.
I Definition 3.2. We call a total ordering on states a value function or ordinal utility function.
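A tiny Python sketch (illustrative, not the lecture's code) of the lottery-utility equation above and of the invariance under positive linear transformations; the prizes and numbers are made up:

    def lottery_utility(U, lottery):
        # expected utility of a lottery [(p1, S1), ..., (pn, Sn)]
        return sum(p * U(s) for p, s in lottery)

    U = {"A": 3.0, "B": 1.0}.get              # some utility function on prizes
    U_prime = lambda s: 2.0 * U(s) + 5.0      # positive linear transformation of U
    L1, L2 = [(0.8, "A"), (0.2, "B")], [(0.5, "A"), (0.5, "B")]
    print(lottery_utility(U, L1) > lottery_utility(U, L2))              # True
    print(lottery_utility(U_prime, L1) > lottery_utility(U_prime, L2))  # True (same ranking)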

Kohlhase: Künstliche Intelligenz 2 233 July 12, 2018


Maximizing Expected Utility

I Definition 3.3 (MEU principle). Choose the action that maximizes expected utility (MEU).
I Note: an agent can be entirely rational (consistent with MEU) without ever representing or manipulating utilities and probabilities.
I Example 3.4. A lookup table for perfect tic-tac-toe.
I But an observer can construct a value function V by observing the agent’s preferences. (even if the agent does not know V)

Kohlhase: Künstliche Intelligenz 2 234 July 12, 2018


Utilities

I Utilities map states to real numbers. Which numbers?
I Definition 3.5 (Standard approach to assessment of human utilities). Compare a given state A to a standard lottery Lp that has
I “best possible prize” u⊤ with probability p
I “worst possible catastrophe” u⊥ with probability 1 − p
  and adjust the lottery probability p until A ∼ Lp. Then U(A) = p.
I Example 3.6. Choose u⊤ = current state, u⊥ = instant death:
  pay $30 ∼ L = [0.999999, continue as before; 0.000001, instant death]

Kohlhase: Künstliche Intelligenz 2 235 July 12, 2018


Utility scales

I Definition 3.7. Normalized utilities: u⊤ = 1, u⊥ = 0
I Definition 3.8. Micromorts: one-millionth chance of death
I Micromorts are useful for Russian roulette, paying to reduce product risks, etc.
I Problem: What is the value of a micromort?
I Ask people directly: What would you pay to avoid playing Russian roulette with a million-barrelled revolver? (very large numbers)
I But their behavior suggests a lower price:
I driving a car for 370 km incurs a risk of one micromort;
I over the life of your car – say, 150,000 km – that’s about 400 micromorts.
I People appear to be willing to pay about €10,000 more for a safer car that halves the risk of death (↝ €25 per micromort).
I This figure has been confirmed across many individuals and risk types.
I Of course, this argument holds only for small risks. Most people won’t agree to kill themselves for €25 million.
I Definition 3.9. QALYs: quality-adjusted life years
I QALYs are useful for medical decisions involving substantial risk.

Kohlhase: Künstliche Intelligenz 2 236 July 12, 2018


Money

I Money does not behave as a utility function.
I Given a lottery L with expected monetary value EMV(L), usually U(L) < U(EMV(L)), i.e., people are risk-averse.
I Utility curve: for what probability p am I indifferent between a prize x and a lottery [p, M$; (1−p), 0$] for large M?
I Typical empirical data, extrapolated with risk-prone behavior for debtors.
I Empirically: the utility of money comes close to the logarithm on the positive numbers.

Kohlhase: Künstliche Intelligenz 2 237 July 12, 2018
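A small numeric sketch (illustrative only) of this risk-averse behavior under an assumed logarithmic utility of money:

    import math

    U = lambda x: math.log(1 + x)                    # assumed utility of money
    lottery = [(0.5, 1_000_000), (0.5, 0)]           # 50/50 chance of 1,000,000$ or nothing
    EMV = sum(p * x for p, x in lottery)             # expected monetary value: 500,000
    U_of_lottery = sum(p * U(x) for p, x in lottery)
    print(U(EMV) > U_of_lottery)                     # True: U(EMV(L)) > U(L), i.e. risk-averse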


9.4 Multi-Attribute Utility

Kohlhase: Künstliche Intelligenz 2 237 July 12, 2018


Multi-Attribute Utility

I How can we handle utility functions of many variables X1, . . ., Xn?
I Example 4.1 (Assessing an Airport Site).
  [Influence diagram with nodes Air Traffic, Litigation, Construction, Deaths, Noise, Cost]
  What is U(Deaths, Noise, Cost) for a projected airport?
I How can complex utility functions be assessed from preference behaviour?
I Idea 1: identify conditions under which decisions can be made without complete identification of U(x1, . . ., xn)
I Idea 2: identify various types of independence in preferences and derive consequent canonical forms for U(x1, . . ., xn)

Kohlhase: Künstliche Intelligenz 2 238 July 12, 2018


Strict Dominance

I Typically define attributes such that U is monotonic in each argument. (wlog. growing)
I Definition 4.2. Choice B strictly dominates choice A iff Xi(B) ≥ Xi(A) for all i (and hence U(B) ≥ U(A)).
I Strict dominance seldom holds in practice (life is difficult)
I but it is useful for narrowing down the field of contenders.
I For uncertain attributes strict dominance is even more unlikely.

Kohlhase: Künstliche Intelligenz 2 239 July 12, 2018


Stochastic Dominance

I Definition 4.3. Distribution p1 stochastically dominates distribution p2 iff the cumulative distribution of p1 lies below that of p2 for all t, i.e.
  ∫_{−∞}^{t} p1(x) dx ≤ ∫_{−∞}^{t} p2(x) dx

I Example 4.4.
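A small Python sketch (my own illustration, with made-up numbers) of how the definition can be checked for discrete distributions over the same increasing outcome grid:

    from itertools import accumulate

    def stochastically_dominates(p1, p2):
        # p1 dominates p2 iff every prefix sum (cumulative distribution) of p1 is <= that of p2
        return all(c1 <= c2 + 1e-12 for c1, c2 in zip(accumulate(p1), accumulate(p2)))

    p1 = [0.1, 0.2, 0.7]   # hypothetical distributions; p1 puts more mass on the best outcome
    p2 = [0.3, 0.4, 0.3]
    print(stochastically_dominates(p1, p2))   # True
    print(stochastically_dominates(p2, p1))   # False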

Kohlhase: Künstliche Intelligenz 2 240 July 12, 2018


Stochastic dominance contd.

I Observation 4.5. If A1 with outcome distribution p1 stochastically dominates A2 with outcome distribution p2, and U is monotonic in x, then
  ∫_{−∞}^{∞} p1(x) U(x) dx ≥ ∫_{−∞}^{∞} p2(x) U(x) dx
I Multiattribute case: stochastic dominance on all attributes ↝ optimal
I Stochastic dominance can often be determined without exact distributions, using qualitative reasoning
I E.g., construction cost increases with distance from the city; S1 is closer to the city than S2 ↝ S1 stochastically dominates S2 on cost
I E.g., injury increases with collision speed
I Can annotate belief networks with stochastic dominance information
I Definition 4.6. X +→ Y (X positively influences Y) means that P(Y | x1, z) stochastically dominates P(Y | x2, z) for every value z of Y’s other parents Z and all x1 and x2 with x1 ≥ x2.

Kohlhase: Künstliche Intelligenz 2 241 July 12, 2018


Label the arcs + or – for influence in a Bayesian Network

Kohlhase: Künstliche Intelligenz 2 242 July 12, 2018

Preference Structure and Multiattribute Utility

I Observation 4.7. n attributes with d values each ↝ we need dⁿ values to determine the utility function U(x1, . . ., xn). (worst case)
I Assumption: Preferences of real agents have much more structure.
I Approach: identify regularities and prove representation theorems based on these:
  U(x1, . . ., xn) = F(f1(x1), . . ., fn(xn))
  where F is simple, e.g. addition.
I Note the similarity to Bayesian networks, which decompose the full joint probability distribution.

Kohlhase: Künstliche Intelligenz 2 243 July 12, 2018


Preference structure: Deterministic

I Recall: In deterministic environments an agent has a value function.
I Definition 4.8. X1 and X2 are preferentially independent of X3 iff the preference between ⟨x1, x2, x3⟩ and ⟨x′1, x′2, x3⟩ does not depend on x3.
I Example 4.9. In ⟨Noise, Cost, Safety⟩, Noise and Cost are preferentially independent of Safety: ⟨20,000 suffer, 4.6 G$, 0.06 deaths/mpm⟩ vs. ⟨70,000 suffer, 4.2 G$, 0.06 deaths/mpm⟩.
I Theorem 4.10 (Leontief, 1947). If every pair of attributes is preferentially independent of its complement, then every subset of attributes is preferentially independent of its complement: mutual preferential independence.
I Theorem 4.11 (Debreu, 1960). Mutual preferential independence implies that there is an additive value function: V(S) = Σ_i Vi(Xi(S)), where Vi is a value function referencing just one variable Xi.
I Hence assess n single-attribute functions; often a good approximation.
I Example 4.12. The value function for the airport decision might be
  V(noise, cost, deaths) = −noise · 10⁴ − cost − deaths · 10¹²
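A tiny Python sketch of this additive value function; the two candidate sites below use made-up attribute values purely for illustration:

    def airport_value(noise, cost, deaths):
        # additive value function from Example 4.12
        return -noise * 1e4 - cost - deaths * 1e12

    site_1 = airport_value(noise=20_000, cost=4.6e9, deaths=6e-8)   # hypothetical numbers
    site_2 = airport_value(noise=70_000, cost=4.2e9, deaths=6e-8)
    print(site_1 > site_2)   # True: site 1 is preferred under this value function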

Kohlhase: Künstliche Intelligenz 2 244 July 12, 2018


Preference structure: Stochastic

I Need to consider preferences over lotteries and real utility functions (not just value functions).
I Definition 4.13. X is utility-independent of Y iff preferences over lotteries in X do not depend on particular values in Y.
I Definition 4.14. A set X is mutually utility-independent iff each subset is utility-independent of its complement.
I Theorem 4.15. For mutually utility-independent sets there is a multiplicative utility function [Keeney:muf74]; for three attributes:
  U = k1U1 + k2U2 + k3U3 + k1k2U1U2 + k2k3U2U3 + k3k1U3U1 + k1k2k3U1U2U3
I Routine procedures and software packages exist for generating preference tests to identify various canonical families of utility functions.

Kohlhase: Künstliche Intelligenz 2 245 July 12, 2018


9.5 Decision Networks

Kohlhase: Künstliche Intelligenz 2 245 July 12, 2018


Utility-Based Agents (Recap)

I

Kohlhase: Künstliche Intelligenz 2 246 July 12, 2018


Decision networks

I Definition 5.1. Add action nodes and utility nodes (also called value nodes) to belief networks to enable rational decision making.
I Example 5.2 (Choosing an Airport Site).
  Algorithm:
    For each value of the action node:
      compute the expected value of the utility node given the action and the evidence
    Return the MEU action (via argmax)
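A minimal Python sketch of this evaluation loop (my own illustration, not the lecture's code; posterior stands for any BN inference routine, and the node names are hypothetical):

    def meu_action(actions, evidence, posterior, utility):
        # return the action with maximal expected utility, and that utility
        best_action, best_eu = None, float("-inf")
        for a in actions:
            # distribution of the chance node feeding the utility node, given action + evidence
            dist = posterior("Outcome", {**evidence, "Action": a})
            eu = sum(p * utility(outcome, a) for outcome, p in dist.items())
            if eu > best_eu:
                best_action, best_eu = a, eu
        return best_action, best_eu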

Kohlhase: Künstliche Intelligenz 2 247 July 12, 2018


A Decision-Theoretic Expert System for Aortic Coarctation

I

Kohlhase: Künstliche Intelligenz 2 248 July 12, 2018


Knowledge Eng. for Decision-Theoretic Expert Systems

I Create a causal model
I symptoms, disorders, treatments, outcomes, and their influences
I Simplify to a qualitative decision model
I remove variables not involved in treatment decisions
I Assign probabilities (↝ Bayesian network)
I e.g. from patient databases, literature studies, or the expert’s subjective assessments
I Assign utilities (e.g. in QALYs or micromorts)
I Verify and refine the model wrt. a gold standard given by experts
I refine by “running the model backwards” and comparing with the literature
I Perform sensitivity analysis (an important step in practice)
I Is the optimal treatment decision robust against small changes in the parameters? (if yes ↝ great! if not, collect better data)

Kohlhase: Künstliche Intelligenz 2 249 July 12, 2018


9.6 The Value of Information

Kohlhase: Künstliche Intelligenz 2 249 July 12, 2018


What if we do not have all information we need?

I It is well-known that one of the most important parts of decision making is knowing what questions to ask.
I Example 6.1 (Medical Diagnosis).
I We do not expect a doctor to already know the results of the diagnostic tests when the patient comes in.
I Tests are often expensive, and sometimes hazardous (directly or by delaying treatment)
I only test, if
I knowing the results leads to a significantly better treatment plan
I information from test results is not drowned out by the a-priori likelihood.
I Information value theory enables the agent to make such decisions rationally.
I This is a simple form of sequential decision making (the action only impacts the belief state).
I Intuition: With the information, we can adapt the action to the actual situation, rather than to the average.

Kohlhase: Künstliche Intelligenz 2 250 July 12, 2018


Value of Information by Example

I Idea: compute the value of acquiring each possible piece of evidence.
I This can be done directly from the decision network.

I Example 6.2 (Buying Oil Drilling Rights). n blocks of rights, exactly one has oil, worth k
I Prior probabilities 1/n each, mutually exclusive
I Current price of each block is k/n
I “Consultant” offers an accurate survey of block 3. Fair price?

Solution: compute expected value of information = expected value of best actiongiven the information minus expected value of best action without information

I Example 6.3 (Oil Drilling Rights contd.).
I Survey may say “oil in block 3”, prob. 1/n ↝ buy block 3 for k/n, make a profit of

k − k/n.I Survey may say “no oil in block 3” prob. (n− 1)/n ; buy another block make profit

of k/(n − 1)− C/n.I Expected profit is 1

n· (n−1)k

n+ n−1

n· kn(n−1)

= kn

I we should pay up to k/n for the information (as much as block 3 is worth)

Kohlhase: Künstliche Intelligenz 2 251 July 12, 2018


General formula (VPI)

I Current evidence E, current best action α
I Possible action outcomes Si:
  EU(α | E) = max_a Σ_i U(Si) · P(Si | E, a)
I Suppose we knew Ej = ejk (new evidence); then we would choose αejk s.t.
  EU(αejk | E, Ej = ejk) = max_a Σ_i U(Si) · P(Si | E, a, Ej = ejk)
  Ej is a random variable whose value is currently unknown.
I So we must compute the expected gain over all possible values:
  VPI_E(Ej) = (Σ_k P(Ej = ejk | E) · EU(αejk | E, Ej = ejk)) − EU(α | E)
I Definition 6.4. VPI = value of perfect information
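A schematic Python sketch of this formula (illustrative only; posterior_of_Ej and best_expected_utility are hypothetical stand-ins for whatever decision-network routines are available):

    def vpi(Ej_values, evidence, posterior_of_Ej, best_expected_utility):
        # expected utility of the best action with the current evidence only
        eu_now = best_expected_utility(evidence)
        # expected utility averaged over the possible answers Ej = ejk
        eu_with_info = sum(
            posterior_of_Ej(v, evidence) * best_expected_utility({**evidence, "Ej": v})
            for v in Ej_values
        )
        return eu_with_info - eu_now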

Kohlhase: Künstliche Intelligenz 2 252 July 12, 2018


Properties of VPI

I Nonnegative: in expectation, not post hoc: VPI_E(Ej) ≥ 0 for all j and E
I Nonadditive: consider, e.g., obtaining Ej twice:
  VPI_E(Ej, Ek) ≠ VPI_E(Ej) + VPI_E(Ek)
I Order-independent:
  VPI_E(Ej, Ek) = VPI_E(Ej) + VPI_{E,Ej}(Ek) = VPI_E(Ek) + VPI_{E,Ek}(Ej)
I Note: when more than one piece of evidence can be gathered, maximizing VPI for each to select one is not always optimal; evidence-gathering becomes a sequential decision problem.

Kohlhase: Künstliche Intelligenz 2 253 July 12, 2018


Questionnaire: Qualitative behaviors

I Question: Say we have three distributions for P(U | Ej)

What is the value of information in these three cases?

I Answers: qualitatively
a) Choice is obvious (a1 almost certainly better) ↝ information worth little
b) Choice is nonobvious (unclear) ↝ information worth a lot
c) Choice is nonobvious (unclear) but makes little difference ↝ information worth little
  The fact that U2 has a high peak in (c) means that its expected value is known with higher certainty than U1. (irrelevant to the argument)

Kohlhase: Künstliche Intelligenz 2 254 July 12, 2018

A simple Information-Gathering Agent

I Definition 6.5. A simple Information-Gathering Agent (gathers information before acting)

  function Information-Gathering-Agent(percept) returns an action
    persistent: D, a decision network
    integrate percept into D
    j := argmax_k (VPI_E(Ek) / Cost(Ek))
    if VPI_E(Ej) > Cost(Ej) then return Request(Ej)
    else return the best action from D

  The next percept after Request(Ej) provides a value for Ej.
I Problem: The information gathering implemented here is myopic, i.e. it calculates VPI as if only a single evidence variable will be acquired. (cf. greedy search)
I But it works relatively well in practice. (e.g. it outperforms humans for selecting diagnostic tests)

Kohlhase: Künstliche Intelligenz 2 255 July 12, 2018


Chapter 10 Temporal Probability Models

Kohlhase: Künstliche Intelligenz 2 255 July 12, 2018


Outline

I Time and uncertainty
I Inference: filtering, prediction, smoothing
I Hidden Markov models
I Dynamic Bayesian networks
I Particle filtering
I Further Algorithms and Topics

Kohlhase: Künstliche Intelligenz 2 256 July 12, 2018


10.1 Modeling Time and Uncertainty

Kohlhase: Künstliche Intelligenz 2 256 July 12, 2018


Time and uncertainty

I Observation 1.1. The world changes; we need to track and predict it.
I Example 1.2. Diabetes management vs. vehicle diagnosis
I Definition 1.3. A temporal probability model is a probability model where the possible worlds are indexed by a time structure ⟨S, ⪯⟩.
I We restrict ourselves to linear, discrete time structures, i.e. ⟨S, ⪯⟩ = ⟨N, ≤⟩. (Step size irrelevant for the theory, depends on the problem in practice)
I Basic idea: index random variables by N.
I Xt = set of unobservable state variables at time t, e.g., BloodSugar_t, StomachContents_t, etc.
I Et = set of observable evidence variables at time t, e.g., MeasuredBloodSugar_t, PulseRate_t, FoodEaten_t
I Example 1.4 (Umbrellas). You are a security guard in a secret underground facility and want to know if it is raining outside. Your only source of information is whether the director comes in with an umbrella.
  State variables R1, R2, R3, . . ., Observations U1, U2, U3, . . .
I Notation: X_{a:b} = Xa, Xa+1, . . ., Xb−1, Xb

Kohlhase: Künstliche Intelligenz 2 257 July 12, 2018


Markov Processes

I Construct a Bayesian network from these variables: parents?

I Definition 1.5. Markov property: Xt only depends on a bounded subset of X_{0:t−1}.
I Definition 1.6. A (discrete-time) Markov process (also called Markov chain) is a sequence of random variables with the Markov property.

I Definition 1.7. First-order Markov process: P(Xt | X0:t−1) = P(Xt | Xt−1)

Second-order Markov process: P(Xt | X0:t−1) = P(Xt | Xt−2,Xt−1)

I We will use Markov processes to model sequential environments.

Kohlhase: Künstliche Intelligenz 2 258 July 12, 2018

Markov Process Example: The Umbrella

I Example 1.8 (Umbrellas continued). We model the situation in a Bayesian network:
  Rain_{t−1} → Rain_t → Rain_{t+1}, with Umbrella_{t−1}, Umbrella_t, Umbrella_{t+1} as children of the respective Rain variables.
I Problem: First-order Markov assumption not exactly true in the real world!
I Possible fixes:
  1. Increase the order of the Markov process
  2. Augment the state, e.g., add Temp_t, Pressure_t
I Example 1.9 (Robot Motion). Augment Position_t and Velocity_t with Battery_t.

Kohlhase: Künstliche Intelligenz 2 259 July 12, 2018


Stationary Markov Processes as Transition Models

I Definition 1.10. We divide the random variables in a Markov process M into a set of (hidden) state variables Xt and a set of (observable) evidence variables Et. We call P(Xt | Xt−1) the transition model and P(Et | Xt) the sensor model of M.
I Problem: Even with the Markov assumption the transition model is infinite (t ∈ N).
I Definition 1.11. A Markov process is called stationary if the transition model is independent of time, i.e. P(Xt | Xt−1) is the same for all t.
I Example 1.12 (Umbrellas are stationary). P(Rt | Rt−1) does not depend on t (we need only one table):
  Rain_{t−1} → Rain_t → Rain_{t+1}, each observed via Umbrella_t
  R_{t−1}   P(Rt)
  T         0.7
  F         0.3
I Don’t confuse “stationary” (processes) with “static” (environments).
I We restrict ourselves to stationary Markov processes in this course.
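A minimal Python sketch (illustrative only) of sampling from this stationary umbrella transition model:

    import random

    P_RAIN = {True: 0.7, False: 0.3}   # P(Rain_t = true | Rain_{t-1}), from the table above

    def sample_chain(r0, steps):
        # sample a trajectory of the Rain variable from the stationary Markov chain
        states = [r0]
        for _ in range(steps):
            states.append(random.random() < P_RAIN[states[-1]])
        return states

    print(sample_chain(True, 10))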

Kohlhase: Künstliche Intelligenz 2 260 July 12, 2018

Markov Sensor Models

I Recap: The sensor model predicts the influence of percepts (and the world state) on the belief state; it is used during the update step.
I Problem: The evidence variables Et could depend on previous variables as well as on the current state.
I We restrict the dependency to the current state (otherwise the state representation would be deficient).
I Definition 1.13. We say that a sensor model has the sensor Markov property iff P(Et | X0:t, E0:t−1) = P(Et | Xt).
I Assumptions on Sensor Models: We usually assume the sensor Markov property and make the sensor model stationary as well: P(Et | Xt) is fixed for all t.


Umbrellas, the full Story

I Example 1.14 (Umbrellas, Transition & Sensor Models). The Bayesian network Raint−1 → Raint → Raint+1 with Raint → Umbrellat has the conditional probability tables

  Rt−1 | P(Rt)        Rt | P(Ut)
  -----+------        ---+------
   T   | 0.7           T | 0.9
   F   | 0.3           F | 0.2

  Note that the influence goes from Raint to Umbrellat (the causal direction).
I Observation 1.15. If we additionally know the initial prior probabilities P(X0) (i.e. at time t = 0), then we can compute the full joint probability distribution as

  P(X0:t, E0:t) = P(X0) · ∏_{i=1}^{t} P(Xi | Xi−1) · P(Ei | Xi)
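
As a sanity check, here is a minimal Python sketch that evaluates this product for a concrete short state/evidence sequence; it assumes the CPT numbers above and the uniform prior P(R0) = 〈0.5, 0.5〉 used later in Example 2.5.

  # Evaluate P(X_{0:2}, E_{1:2}) for the umbrella model (CPTs from Example 1.14).
  P0  = {True: 0.5, False: 0.5}                     # assumed prior P(R_0)
  P_R = {True: 0.7, False: 0.3}                     # P(R_t = T | R_{t-1})
  P_U = {True: 0.9, False: 0.2}                     # P(U_t = T | R_t)

  def trans(r_prev, r):                             # transition model P(r | r_prev)
      p = P_R[r_prev]
      return p if r else 1 - p

  def sensor(r, u):                                 # sensor model P(u | r)
      p = P_U[r]
      return p if u else 1 - p

  def full_joint(rains, umbrellas):                 # rains = [r0, ..., rt], umbrellas = [u1, ..., ut]
      prob = P0[rains[0]]
      for i in range(1, len(rains)):
          prob *= trans(rains[i - 1], rains[i]) * sensor(rains[i], umbrellas[i - 1])
      return prob

  print(full_joint([True, True, True], [True, True]))   # 0.5 * 0.7 * 0.9 * 0.7 * 0.9 = 0.19845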


10.2 Inference: Filtering, Prediction, and Smoothing


Inference tasks

I Definition 2.1. Filtering (or monitoring): P(Xt | e1:t) – computing the current belief state; the input to the decision process of a rational agent.
I Definition 2.2. Prediction (or state estimation): P(Xt+k | e1:t) for k > 0 – evaluation of possible action sequences (= filtering without the evidence).
I Definition 2.3. Smoothing (or hindsight): P(Xk | e1:t) for 0 ≤ k < t – a better estimate of past states (essential for learning).
I Definition 2.4. Most likely explanation: argmax_{x1:t} P(x1:t | e1:t) – used e.g. in speech recognition and decoding with a noisy channel.


Filtering

I Aim: recursive state estimation: P(Xt+1 | e1:t+1) = f(et+1, P(Xt | e1:t))
I Project the current distribution forward from t to t + 1:

  P(Xt+1 | e1:t+1) = P(Xt+1 | e1:t, et+1)                        (dividing up the evidence)
                   = α · P(et+1 | Xt+1, e1:t) · P(Xt+1 | e1:t)    (using Bayes' rule)
                   = α · P(et+1 | Xt+1) · P(Xt+1 | e1:t)          (sensor Markov assumption)

I Note that P(et+1 | Xt+1) can be obtained directly from the sensor model.
I Continue by conditioning on the current state Xt:

  P(Xt+1 | e1:t+1) = α · P(et+1 | Xt+1) · ∑_{xt} P(Xt+1 | xt, e1:t) · P(xt | e1:t)
                   = α · P(et+1 | Xt+1) · ∑_{xt} P(Xt+1 | xt) · P(xt | e1:t)

I P(Xt+1 | Xt) is simply the transition model, and P(xt | e1:t) is the "recursive call".
I So f1:t+1 = α · FORWARD(f1:t, et+1) where f1:t = P(Xt | e1:t) and FORWARD is the update shown above. (Time and space per update are constant, i.e. independent of t.)


Filtering the Umbrellas

I Example 2.5. Say the guard believes P(R0) = 〈0.5, 0.5〉. On days 1 and 2 the umbrella appears. Prediction from t = 0 to t = 1:

  P(R1) = ∑_{r0} P(R1 | r0) · P(r0) = 〈0.7, 0.3〉 · 0.5 + 〈0.3, 0.7〉 · 0.5 = 〈0.5, 0.5〉

  Updating with the evidence for t = 1 gives

  P(R1 | u1) = α · P(u1 | R1) · P(R1) = α · 〈0.9, 0.2〉 · 〈0.5, 0.5〉 = α · 〈0.45, 0.1〉 ≈ 〈0.818, 0.182〉
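
A minimal Python sketch of this FORWARD update for the umbrella model (transition 0.7/0.3, sensor 0.9/0.2, prior 〈0.5, 0.5〉 as above); it reproduces the 〈0.818, 0.182〉 belief for day 1 and continues to day 2.

  # FORWARD update for the two-state umbrella model; f = <P(rain), P(not rain)>.
  def normalize(v):
      s = sum(v)
      return [x / s for x in v]

  def forward(f, umbrella, T=((0.7, 0.3), (0.3, 0.7)), sensor=(0.9, 0.2)):
      # predict: P(X_{t+1}) = sum_x P(X_{t+1} | x) * P(x | e_{1:t})
      pred = [T[0][j] * f[0] + T[1][j] * f[1] for j in range(2)]
      # update: multiply by P(e_{t+1} | X_{t+1}) and renormalize
      like = [sensor[j] if umbrella else 1 - sensor[j] for j in range(2)]
      return normalize([like[j] * pred[j] for j in range(2)])

  f = [0.5, 0.5]                     # prior P(R_0)
  f = forward(f, True); print(f)     # day 1: ~[0.818, 0.182]
  f = forward(f, True); print(f)     # day 2: ~[0.883, 0.117]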


Prediction

I Prediction computes future state distributions P(Xt+k | e1:t) for k > 0.
I Intuition: Prediction is filtering without new evidence.
I Lemma 2.6. P(Xt+k+1 | e1:t) = ∑_{xt+k} P(Xt+k+1 | xt+k) · P(xt+k | e1:t)
I Proof Sketch: By the same reasoning as for the FORWARD algorithm for filtering.
I Observation 2.7. As k → ∞, P(xt+k | e1:t) tends to the stationary distribution of the Markov chain, i.e. a fixed point under prediction.
I The mixing time, i.e. the time until prediction reaches the stationary distribution, depends on how "stochastic" the chain is.


Smoothing

I Smoothing estimates past states by computing P(Xk | e1:t) for 0 ≤ k < t.
I Divide the evidence e1:t into e1:k (up to k) and ek+1:t (after k):

  P(Xk | e1:t) = P(Xk | e1:k, ek+1:t)
               = α · P(Xk | e1:k) · P(ek+1:t | Xk, e1:k)   (Bayes' rule)
               = α · P(Xk | e1:k) · P(ek+1:t | Xk)          (conditional independence)
               = α · f1:k · bk+1:t


Smoothing (continued)

I The backward message bk+1:t = P(ek+1:t | Xk) is computed by a backwards recursion:

  P(ek+1:t | Xk) = ∑_{xk+1} P(ek+1:t | Xk, xk+1) · P(xk+1 | Xk)
                 = ∑_{xk+1} P(ek+1:t | xk+1) · P(xk+1 | Xk)
                 = ∑_{xk+1} P(ek+1, ek+2:t | xk+1) · P(xk+1 | Xk)
                 = ∑_{xk+1} P(ek+1 | xk+1) · P(ek+2:t | xk+1) · P(xk+1 | Xk)

  P(ek+1 | xk+1) and P(xk+1 | Xk) can be obtained directly from the model; P(ek+2:t | xk+1) is the "recursive call" (bk+2:t).
I In message notation: bk+1:t = BACKWARD(bk+2:t, ek+1) where BACKWARD is the update shown above. (Time and space per update are constant, i.e. independent of t.)


Smoothing example

I Example 2.8 (Smoothing Umbrellas). The umbrella appears on days 1 and 2.
I P(R1 | u1, u2) = α · P(R1 | u1) · P(u2 | R1) = α · 〈0.818, 0.182〉 · P(u2 | R1)
I Compute P(u2 | R1) by the backwards recursion:

  P(u2 | R1) = ∑_{r2} P(u2 | r2) · P( | r2) · P(r2 | R1)     (P( | r2) = 1 is the empty backward message)
             = 0.9 · 1 · 〈0.7, 0.3〉 + 0.2 · 1 · 〈0.3, 0.7〉 = 〈0.69, 0.41〉

I So P(R1 | u1, u2) = α · 〈0.818, 0.182〉 · 〈0.69, 0.41〉 ≈ 〈0.883, 0.117〉
I Smoothing gives a higher probability for rain on day 1 than filtering did: the umbrella on day 2 makes rain more likely on day 2, and hence rain more likely on day 1 as well.
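
A short Python sketch of this computation, combining the forward message from Example 2.5 with one backward step (model numbers as above); it reproduces the 〈0.69, 0.41〉 and 〈0.883, 0.117〉 values.

  # Smoothing P(R_1 | u_1, u_2) for the umbrella model (states: 0 = rain, 1 = no rain).
  def normalize(v):
      s = sum(v)
      return [x / s for x in v]

  T = ((0.7, 0.3), (0.3, 0.7))        # T[i][j] = P(X_t = j | X_{t-1} = i)
  def obs(umbrella):                   # P(e | state) for both states
      return [0.9, 0.2] if umbrella else [0.1, 0.8]

  def backward(b, umbrella):           # one step of the backwards recursion: b_{k+1:t} from b_{k+2:t} and e_{k+1}
      like = obs(umbrella)
      return [sum(like[j] * b[j] * T[i][j] for j in range(2)) for i in range(2)]

  f1 = [0.818, 0.182]                 # forward message P(R_1 | u_1) from Example 2.5
  b  = backward([1.0, 1.0], True)     # backward message P(u_2 | R_1) = [0.69, 0.41]
  print(b)
  print(normalize([f1[i] * b[i] for i in range(2)]))   # ~[0.883, 0.117]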


Forward/Backward Algorithm for Smoothing

I Definition 2.9. Forward-backward algorithm: cache the forward messages along the way.

  function Forward−Backward(ev, prior) returns a vector of probability distributions
    inputs: ev, a vector of evidence values for steps 1, . . . , t
            prior, the prior distribution on the initial state, P(X0)
    local:  fv, a vector of forward messages for steps 0, . . . , t
            b, a representation of the backward message, initially all 1s
            sv, a vector of smoothed estimates for steps 1, . . . , t
    fv[0] := prior
    for i = 1 to t do
      fv[i] := FORWARD(fv[i − 1], ev[i])
    for i = t downto 1 do
      sv[i] := NORMALIZE(fv[i] · b)
      b := BACKWARD(b, ev[i])
    return sv

I Time is linear in t (polytree inference), space is O(t · #(f)).
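
A compact runnable rendering of this pseudocode for the two-state umbrella model, reusing the FORWARD/BACKWARD updates sketched above; the smoothed estimates for days 1 and 2 both come out at about 0.883, matching Example 2.8.

  # Forward-backward smoothing for the umbrella chain (states: 0 = rain, 1 = no rain).
  T = ((0.7, 0.3), (0.3, 0.7))
  def obs(u): return [0.9, 0.2] if u else [0.1, 0.8]
  def normalize(v): s = sum(v); return [x / s for x in v]

  def forward(f, u):
      pred = [sum(T[i][j] * f[i] for i in range(2)) for j in range(2)]
      return normalize([obs(u)[j] * pred[j] for j in range(2)])

  def backward(b, u):
      return [sum(obs(u)[j] * b[j] * T[i][j] for j in range(2)) for i in range(2)]

  def forward_backward(ev, prior):
      t = len(ev)
      fv = [prior]
      for i in range(1, t + 1):                      # forward pass, caching messages
          fv.append(forward(fv[i - 1], ev[i - 1]))
      b, sv = [1.0, 1.0], [None] * (t + 1)
      for i in range(t, 0, -1):                      # backward pass, combining on the fly
          sv[i] = normalize([fv[i][s] * b[s] for s in range(2)])
          b = backward(b, ev[i - 1])
      return sv[1:]

  print(forward_backward([True, True], [0.5, 0.5]))  # day 1 ~[0.883, 0.117], day 2 ~[0.883, 0.117]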


Most Likely Explanation

I Observation 2.10. The most likely sequence ≠ the sequence of most likely states!
I Example 2.11. Suppose the umbrella sequence is T,T,F,T,T – what is the most likely weather sequence?
I Idea: Use smoothing to find the posterior distribution at each time step, then construct the sequence of most likely states.
I Problem: These posterior distributions range over a single time step each (and this difference matters).
I Example 2.12. (figure not reproduced)


Most Likely Explanation (continued)

I The most likely path to each xt+1 = the most likely path to some xt plus one more step:

  max_{x1,...,xt} P(x1, . . . , xt, Xt+1 | e1:t+1)
    = P(et+1 | Xt+1) · max_{xt} (P(Xt+1 | xt) · max_{x1,...,xt−1} P(x1, . . . , xt−1, xt | e1:t))

I Identical to filtering, except that f1:t is replaced by

  m1:t = max_{x1,...,xt−1} P(x1, . . . , xt−1, Xt | e1:t)

  I.e., m1:t(i) gives the probability of the most likely path to state i.
I The update has the sum replaced by max, giving the Viterbi algorithm:

  m1:t+1 = P(et+1 | Xt+1) · max_{xt} (P(Xt+1 | xt) · m1:t)

I Observation 2.13. Viterbi has linear time complexity (like filtering), but also linear space complexity (it needs to keep a pointer to the most likely sequence leading to each state).
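
A minimal Viterbi sketch for the umbrella model; on the evidence sequence T,T,F,T,T from Example 2.11 it returns rain on days 1–2, no rain on day 3, and rain on days 4–5.

  # Viterbi for the umbrella HMM: most likely state sequence given the evidence.
  T = ((0.7, 0.3), (0.3, 0.7))                  # T[i][j] = P(X_t = j | X_{t-1} = i); 0 = rain, 1 = no rain
  def obs(u): return (0.9, 0.2) if u else (0.1, 0.8)

  def viterbi(evidence, prior=(0.5, 0.5)):
      # m[j]: probability of the most likely path ending in state j; back[t][j]: its predecessor
      m = [obs(evidence[0])[j] * sum(T[i][j] * prior[i] for i in range(2)) for j in range(2)]
      back = []
      for e in evidence[1:]:
          prev, m, ptr = m, [], []
          for j in range(2):
              i_best = max(range(2), key=lambda i: T[i][j] * prev[i])
              m.append(obs(e)[j] * T[i_best][j] * prev[i_best])
              ptr.append(i_best)
          back.append(ptr)
      path = [max(range(2), key=lambda j: m[j])]  # backtrack from the best final state
      for ptr in reversed(back):
          path.append(ptr[path[-1]])
      return ["rain" if s == 0 else "no rain" for s in reversed(path)]

  print(viterbi([True, True, False, True, True]))   # ['rain', 'rain', 'no rain', 'rain', 'rain']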


10.3 Hidden Markov Models


Hidden Markov Models

I In an HMM, Xt is a single, discrete variable (usually Et is too); the domain of Xt is {1, . . . , S}.
I Observation: The transition model P(Xt | Xt−1) can then be described by a single S × S matrix.
I Definition 3.1. Transition matrix Tij = P(Xt = j | Xt−1 = i).
I Example 3.2. For the umbrella example: T = P(Xt | Xt−1) = (0.7 0.3; 0.3 0.7).
I Definition 3.3. Sensor matrix Ot for each time step: a diagonal matrix with diagonal elements P(et | Xt = i).
I Example 3.4. With U1 = T and U3 = F we have O1 = diag(0.9, 0.2) and O3 = diag(0.1, 0.8).
I Definition 3.5. With the forward and backward messages as column vectors:
  HMM filtering equation:  f1:t+1 = α · (Ot+1 T⊤ f1:t)
  HMM smoothing equation:  bk+1:t = T Ok+1 bk+2:t
I The forward-backward algorithm needs time O(S² t) and space O(S t).
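
A small numpy sketch of these matrix equations for the umbrella HMM (prior 〈0.5, 0.5〉 assumed as before); it reproduces the same day-1/day-2 numbers as the message-passing version.

  import numpy as np

  # Umbrella HMM in matrix form: T[i, j] = P(X_t = j | X_{t-1} = i), O_e = diag of P(e | X_t = i).
  T = np.array([[0.7, 0.3],
                [0.3, 0.7]])
  def O(umbrella):
      return np.diag([0.9, 0.2]) if umbrella else np.diag([0.1, 0.8])

  def filter_step(f, e):                      # f_{1:t+1} = alpha * O_{t+1} T^T f_{1:t}
      f_new = O(e) @ T.T @ f
      return f_new / f_new.sum()

  def backward_step(b, e):                    # b_{k+1:t} = T O_{k+1} b_{k+2:t}
      return T @ O(e) @ b

  f = np.array([0.5, 0.5])                    # prior P(R_0)
  f1 = filter_step(f, True)                   # ~[0.818, 0.182]
  f2 = filter_step(f1, True)                  # ~[0.883, 0.117]
  b  = backward_step(np.ones(2), True)        # P(u_2 | R_1) = [0.69, 0.41]
  print(f1, f2, b)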


HMM Example: Robot Localization

I Example 3.6 (Robot Localization in a Maze). A robot has four sonar sensors that tell it about obstacles in the four directions N, S, W, E.
I Notation: We write the percept where the sensors detect obstacles in the north, south, and east as NSE.
I Example 3.7 (Filter out Impossible States). (figures not reproduced)
  a) Possible robot locations after E1 = NSE
  b) Possible robot locations after E1 = NSE and E2 = NS


HMM Example: Robot Localization

I Example 3.8 (HMM-based Robot Localization).
  I Random variable Xt for the robot location (domain: the 42 empty squares).
  I Transition matrix for the move action (T has 42² = 1764 entries):

    P(Xt+1 = j | Xt = i) = Tij = 1/|N(i)| if j ∈ N(i), and 0 otherwise

    where N(i) is the set of neighboring squares of i.
  I We do not know where the robot starts: P(X0) = 1/n (here n = 42).
  I Sensor variable Et: four-bit presence/absence of obstacles in N, S, W, E. Let dit be the number of wrong bits and ε the error rate of the sensor. Then

    P(Et = et | Xt = i) = Otii = (1 − ε)^(4−dit) · ε^dit

  I For instance, the probability that the sensor on a square with obstacles in the north and south would produce NSE is (1 − ε)³ · ε¹.
I Use the HMM filtering equation f1:t+1 = α · (Ot+1 T⊤ f1:t) for localization. (next slide)
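
A short sketch of this sensor model; the bit-vector encoding of the NSWE readings is a hypothetical choice for illustration. With ε = 0.2 (the value used in Example 3.9) it yields the (1 − ε)³ε = 0.1024 probability from the example.

  # Sensor model for the maze robot: percepts and true wall configurations as 4-bit tuples (N, S, W, E).
  def sensor_prob(percept, walls, eps=0.2):
      wrong = sum(p != w for p, w in zip(percept, walls))   # number of bits the noisy sensor got wrong
      return (1 - eps) ** (4 - wrong) * eps ** wrong

  walls_i = (1, 1, 0, 0)             # a square with obstacles in the north and south
  percept = (1, 1, 0, 1)             # the sensor reports NSE (east bit wrong)
  print(sensor_prob(percept, walls_i))   # (1 - 0.2)^3 * 0.2 = 0.1024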


HMM Example: Robot Localization

I Idea: Use the HMM filtering equation f1:t+1 = α · (Ot+1 T⊤ f1:t) to compute the posterior distribution over locations (i.e. robot localization).
I Example 3.9. We come back to the maze of Example 3.6, with ε = 0.2. (figures not reproduced)
  a) Posterior distribution over the robot location after E1 = NSE
  b) Posterior distribution over the robot location after E1 = NSE and E2 = NS
  The most likely locations are still the same as in the "perfect sensing" case, but now other locations have non-zero probability.


HMM Example: Further Inference Applications

I Idea: Use smoothing (bk+1:t = T Ok+1 bk+2:t) to find out where the robot started, and the Viterbi algorithm to find the most likely path it took.
I Example 3.10. Performance of HMM localization vs. observation length for various error rates ε. (plots not reproduced)
  Left: localization error (Manhattan distance from the true location).
  Right: Viterbi path accuracy (fraction of correct states on the Viterbi path).


Country dance algorithm

I We can avoid storing all forward messages in smoothing by running the forward algorithm backwards:

  f1:t+1 = α · (Ot+1 T⊤ f1:t)
  Ot+1⁻¹ f1:t+1 = α · (T⊤ f1:t)
  α′ · ((T⊤)⁻¹ Ot+1⁻¹ f1:t+1) = f1:t

I Algorithm: the forward pass computes f1:t, the backward pass computes f1:i and bt−i:t together.
I Observation: the backwards pass only needs to store one copy of f1:i and bt−i:t, so it runs in constant space.
I Problem: the algorithm is severely limited: the transition matrix must be invertible, and the sensor matrix cannot have zeroes – that is, every observation must be possible in every state.


10.4 Dynamic Bayesian Networks


Dynamic Bayesian networks

I Definition 4.1. A Bayesian network D is called dynamic (a DBN), iff its random variables are indexed by a time structure. We assume that its structure is
  I time-sliced, i.e. that the time slices Dt – the subgraphs of t-indexed random variables and the edges between them – are isomorphic, and
  I a first-order Markov process, i.e. that variables Xt can only have parents in Dt and Dt−1.
I Xt and Et may contain arbitrarily many variables in such a replicated Bayesian network.
I Example 4.2. (figure not reproduced)


DBNs vs. HMMs

I Every HMM is a single-variable DBN; every discrete DBN is an HMM.
I Sparse dependencies mean exponentially fewer parameters:
  e.g., with 20 Boolean state variables and three parents each, the DBN has 20 · 2³ = 160 parameters, whereas the corresponding HMM has 2²⁰ · 2²⁰ ≈ 10¹².


Exact inference in DBNs

I Definition 4.3 (Naive method). Unroll the network and run any exact inference algorithm.
I Problem: the inference cost for each update grows with t.
I Definition 4.4. Rollup filtering: add slice t + 1, then "sum out" slice t using variable elimination.
I The largest factor is O(d^(n+1)), the update cost O(d^(n+2)) (cf. the HMM update cost of O(d^(2n))).


Summary

I Temporal models use state and sensor variables replicated over time.
I Under the Markov and stationarity assumptions we only need
  I a transition model P(Xt | Xt−1) and
  I a sensor model P(Et | Xt).
I The inference tasks are filtering, prediction, smoothing, and most likely sequence; all can be done recursively with constant cost per time step.
I Hidden Markov models have a single discrete state variable; they are used e.g. for speech recognition.
I Dynamic Bayesian networks subsume HMMs; exact update is intractable.
I Particle filtering is a good approximate filtering algorithm for DBNs.


Chapter 11 Making Complex Decisions


Outline

I Markov Decision Problems (for sequential environments)
I Value/policy iteration for computing utilities in MDPs
I Partially Observable MDPs (POMDPs)
I Decision-theoretic agents for POMDPs


11.1 Sequential Decision Problems


Sequential decision problems

I In sequential decision problems, the agent's utility depends on a sequence of decisions.
I Sequential decision problems incorporate utilities, uncertainty, and sensing.
I Search and planning problems are special cases.


Example MDP

I Example 1.1. A (fully observable) environment with uncertain actions. (figure not reproduced)
I States s ∈ S, actions a ∈ A.
I Transition model: P(s′ | s, a) = probability that doing a in s leads to s′.
I Reward function:

  R(s, a, s′) := −0.04 (a small penalty) for nonterminal states, ±1 for terminal states


Markov Decision Process

I Definition 1.2. A sequential decision problem in a fully observable, stochastic environment with a Markovian transition model and an additive reward function is called a Markov decision process (MDP). It consists of
  I a set S of states (with initial state s0 ∈ S),
  I sets Actions(s) of actions for each state s,
  I a transition model P(s′ | s, a), and
  I a reward function R : S → ℝ.


Solving MDPs

I In search problems, the aim is to find an optimal sequence of actions.
I In MDPs, the aim is to find an optimal policy π(s), i.e. the best action for every possible state s (because the agent cannot predict where it will end up).
I The optimal policy maximizes (say) the expected sum of rewards.
I Example 1.3. Optimal policy when the state penalty R(s) is −0.04. (figure not reproduced)


Risk and Reward (Example and Questionnaire)

I Example 1.4. The optimal policy depends on the reward R(s) on non-terminal states. (figure not reproduced: the 4x3 grid with terminal states +1 and −1, and the optimal policies for four ranges of R(s))
  I −∞ < R(s) < −1.6284: Life is so painful that the agent heads for the nearest exit, even the −1 state.
  I −0.4278 < R(s) < −0.0850: Life is quite unpleasant; the agent takes the shortest route to the +1 state and is willing to risk falling into the −1 state by accident. In particular, it takes the shortcut from (3,1).
  I −0.0221 < R(s) < 0: Life is slightly dreary; the agent takes no risks at all. In (4,1) and (3,2) it heads directly away from the −1 state, so it cannot fall in by accident.
  I R(s) > 0: Life is positively enjoyable, so the agent avoids both exits and reaps infinite rewards.
I Remark: This careful balancing of risk and reward is characteristic of MDPs.


11.2 Utilities over Time


Utility of state sequences

I We need to understand preferences between sequences of states.
I Definition 2.1. We call preferences on reward sequences stationary, iff

  [r, r0, r1, r2, . . .] ≺ [r, r′0, r′1, r′2, . . .]  ⇔  [r0, r1, r2, . . .] ≺ [r′0, r′1, r′2, . . .]

I Theorem 2.2. For stationary preferences, there are only two ways to combine rewards over time:
  I additive rewards: U([s0, s1, . . .]) = R(s0) + R(s1) + · · ·
  I discounted rewards: U([s0, s1, s2, . . .]) = R(s0) + γ R(s1) + γ² R(s2) + · · · , where γ is the discount factor.


Utilities contd.

I Problem: with infinite lifetimes, additive utilities become infinite. Possible solutions:
  1. Finite horizon: termination at a fixed time T leads to a nonstationary policy: π(s) depends on the time left.
  2. Absorbing states: if for any policy π the agent eventually "dies" with probability 1, then the expected utility of every state is finite.
  3. Discounting: assuming γ < 1 and R(s) ≤ Rmax,

     U([s0, . . ., s∞]) = ∑_{t=0}^{∞} γ^t R(st) ≤ ∑_{t=0}^{∞} γ^t Rmax = Rmax/(1 − γ)

     A smaller γ means a shorter effective horizon.


Utility of States

I Intuition: The utility of a state is the expected (discounted) sum of rewards (until termination), assuming optimal actions.
I Definition 2.3. Given a policy π, let St be the state the agent reaches at time t starting at state s0. Then the expected utility obtained by executing π starting in s is given by

  Uπ(s) = E(∑_{t=0}^{∞} γ^t R(St))

  and we define π∗s := argmax_π Uπ(s).
I Observation 2.4. π∗s is independent of s.
I Proof Sketch: If π∗a and π∗b reach a point c, then there is no reason for them to disagree with each other – or with π∗c – from c onwards.
I Definition 2.5. We call π∗ := π∗s for some s the optimal policy.
I Note: This independence does not hold for finite-horizon policies.


Utility of States (continued)

I Definition 2.6. The utility U(s) of a state s is Uπ∗(s).

I Remark: R(s) is the "short-term reward", whereas U(s) is the "long-term reward".
I Given the utilities of the states, choosing the best action is just MEU: maximize the expected utility of the immediate successors:

  π∗(s) = argmax_{a∈A(s)} ∑_{s′} P(s′ | s, a) · U(s′)


11.3 Value/Policy Iteration


Dynamic programming: the Bellman equation

I The definition of the utility of states leads to a simple relationship among the utilities of neighboring states:

  expected sum of rewards = current reward + γ · expected reward sum after the best action

I Theorem 3.1 (Bellman equation (1957)).

  U(s) = R(s) + γ · max_a ∑_{s′} U(s′) · T(s, a, s′)

I Example 3.2.

  U(1, 1) = −0.04 + γ · max( 0.8 U(1, 2) + 0.1 U(2, 1) + 0.1 U(1, 1),   (up)
                             0.9 U(1, 1) + 0.1 U(1, 2),                 (left)
                             0.9 U(1, 1) + 0.1 U(2, 1),                 (down)
                             0.8 U(2, 1) + 0.1 U(1, 2) + 0.1 U(1, 1) )  (right)

I Problem: One equation per state gives n nonlinear equations in n unknowns (max is not linear), so we cannot use linear algebra techniques to solve them.


Value Iteration Algorithm I

I Idea:
  1. Start with arbitrary utility values.
  2. Update them to make them locally consistent with the Bellman equation.
  3. Everywhere locally consistent means globally optimal.
I Definition 3.3. The value iteration algorithm for utility functions is given by

  function VALUE−ITERATION(mdp, ε) returns a utility function
    inputs: mdp, an MDP with states S, actions A(s), transition model P(s′ | s, a),
            rewards R(s), and discount γ
            ε, the maximum error allowed in the utility of any state
    local variables: U, U′, vectors of utilities for states in S, initially zero
                     δ, the maximum change in the utility of any state in an iteration
    repeat
      U := U′; δ := 0
      for each state s in S do
        U′[s] := R(s) + γ · max_a ∑_{s′} U[s′] · P(s′ | s, a)
        if |U′[s] − U[s]| > δ then δ := |U′[s] − U[s]|
    until δ < ε(1 − γ)/γ
    return U
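
A minimal runnable rendering of this pseudocode; the tiny two-state MDP used to exercise it is made up for illustration and is not the 4x3 world.

  # Value iteration, directly following the pseudocode above.
  def value_iteration(S, A, P, R, gamma, eps):
      U1 = {s: 0.0 for s in S}
      while True:
          U, delta = dict(U1), 0.0
          for s in S:
              U1[s] = R[s] + gamma * max(sum(p * U[s2] for s2, p in P[s][a].items())
                                         for a in A(s))
              delta = max(delta, abs(U1[s] - U[s]))
          if delta < eps * (1 - gamma) / gamma:
              return U

  # Hypothetical two-state example: in 'a' you can 'stay' or 'go'; 'b' is the rewarding state.
  S = ['a', 'b']
  R = {'a': -0.04, 'b': 1.0}
  P = {'a': {'stay': {'a': 0.9, 'b': 0.1}, 'go': {'a': 0.2, 'b': 0.8}},
       'b': {'stay': {'b': 1.0},           'go': {'a': 1.0}}}
  print(value_iteration(S, lambda s: P[s].keys(), P, R, gamma=0.9, eps=1e-4))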


Value Iteration Algorithm II

I Example 3.4 (Iteration on the 4x3 world). (plots not reproduced)


Convergence

I Define the max-norm ||U|| = max_s |U(s)|, so ||U − V|| is the maximum difference between U and V.
I Let U^t and U^{t+1} be successive approximations to the true utility U.
I Theorem 3.5. For any two approximations U^t and V^t:

  ||U^{t+1} − V^{t+1}|| ≤ γ · ||U^t − V^t||

  I.e., any two distinct approximations must get closer to each other; so, in particular, any approximation must get closer to the true U, and value iteration converges to a unique, stable, optimal solution.
I Theorem 3.6. If ||U^{t+1} − U^t|| < ε, then ||U^{t+1} − U|| < 2εγ/(1 − γ).
  I.e., once the change in U^t becomes small, we are almost done.
I The MEU policy using U^t may be optimal long before the values have converged.


Policy Iteration

I Recap: value iteration computes utilities, and then the optimal policy is obtained by MEU.
I This even works if the utility estimate is inaccurate (small policy loss).
I Idea: search for the optimal policy and the utility values simultaneously [Howard:dpmp60]. Iterate:
  I policy evaluation: given policy πi, calculate Ui = Uπi, the utility of each state were πi to be executed.
  I policy improvement: calculate a new MEU policy πi+1 using one-step lookahead.
  Terminate if policy improvement yields no change in utilities.
I Observation 3.7. Upon termination Ui is a fixed point of the Bellman update, hence a solution to the Bellman equation, so πi is an optimal policy.
I Observation 3.8. Policy improvement improves the policy, and the policy space is finite, so the iteration terminates.


Policy Iteration Algorithm

I Definition 3.9. The policy iteration algorithm is given by the following pseudocode:

  function POLICY−ITERATION(mdp) returns a policy
    inputs: mdp, an MDP with states S, actions A(s), transition model P(s′ | s, a)
    local variables: U, a vector of utilities for states in S, initially zero
                     π, a policy indexed by state, initially random
    repeat
      U := POLICY−EVALUATION(π, U, mdp)
      unchanged? := true
      for each state s in S do
        if max_{a∈A(s)} ∑_{s′} P(s′ | s, a) · U(s′) > ∑_{s′} P(s′ | s, π[s]) · U(s′) then
          π[s] := argmax_{a∈A(s)} ∑_{s′} P(s′ | s, a) · U(s′)
          unchanged? := false
    until unchanged?
    return π


Policy Evaluation

I Problem: How do we implement the POLICY−EVALUATION algorithm?
I Solution: To compute the utilities given a fixed π, note that for all s we have

  U(s) = R(s) + γ · ∑_{s′} U(s′) · T(s, π(s), s′)

I Example 3.10 (Simplified Bellman Equations for π).

  Ui(1, 1) = −0.04 + 0.8 Ui(1, 2) + 0.1 Ui(1, 1) + 0.1 Ui(2, 1)
  Ui(1, 2) = −0.04 + 0.8 Ui(1, 3) + 0.2 Ui(1, 2)
  ...

I Observation 3.11. These are n simultaneous linear equations in n unknowns, which can be solved in O(n³) with standard linear algebra methods.
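
A small numpy sketch of exact policy evaluation by a linear solve, (I − γ T_π) U = R; the three-state chain and its numbers are hypothetical placeholders, since the 4x3 transition structure is not reproduced here.

  import numpy as np

  # Exact policy evaluation: solve U = R + gamma * T_pi U, i.e. (I - gamma * T_pi) U = R.
  # Hypothetical 3-state chain under some fixed policy pi (rows: P(s' | s, pi(s))).
  T_pi = np.array([[0.1, 0.8, 0.1],
                   [0.0, 0.2, 0.8],
                   [0.0, 0.0, 1.0]])          # state 2 is absorbing
  R = np.array([-0.04, -0.04, 1.0])
  gamma = 0.9

  U = np.linalg.solve(np.eye(3) - gamma * T_pi, R)
  print(U)                                    # utilities of the three states under pi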


Modified policy iteration

I Policy iteration often converges in few iterations, but each iteration is expensive.
I Idea: Use a few steps of value iteration (but with π fixed), starting from the value function produced the last time, to produce an approximate value determination step.
I This often converges much faster than pure value iteration or policy iteration.
I It leads to much more general algorithms where Bellman value updates and Howard policy updates can be performed locally in any order.
I Reinforcement learning algorithms operate by performing such updates based on the observed transitions made in an initially unknown environment.


11.4 Partially Observable MDPs


Partial Observability

I Definition 4.1. A partially observable MDP (a POMDP for short) is an MDP together with an observation model O that is stationary and has the sensor Markov property: O(s, e) = P(e | s).
I Example 4.2 (Noisy 4x3 World). Add a partial and/or noisy sensor, e.g. one that counts the number of adjacent walls (1 ≤ w ≤ 2) with 0.1 error (noise). If the sensor reports 1, we are (probably) in (3, ?).
I Problem: The agent does not know which state it is in, so it makes no sense to talk about a policy π(s)!
I Theorem 4.3 (Astrom 1965). The optimal policy in a POMDP is a function π(b) where b is the belief state (a probability distribution over states).
I Idea: Convert a POMDP into an MDP in belief-state space, where T(b, a, b′) is the probability that the new belief state is b′ given that the current belief state is b and the agent does a – essentially a filtering update step.


POMDP: Filtering at the Belief State Level

I Recap: Filtering updates the belief state for new evidence.
I For a POMDP, we also need to consider actions (but the effect is the same).
I If b(s) is the previous belief state and the agent does action a and then perceives e, then the new belief state is

b′(s′) = α · P(e | s′) · ∑_s P(s′ | s, a) · b(s)

We write b′ = FORWARD(b, a, e) in analogy to recursive state estimation.
I Fundamental Insight for POMDPs: The optimal action only depends on the agent's current belief state. (good, since it does not know the state!)
I Consequence: The optimal policy can be written as a function π∗(b) from belief states to actions.
I POMDP decision cycle: Iterate over
1. Given the current belief state b, execute the action a = π∗(b).
2. Receive percept e.
3. Set the current belief state to FORWARD(b, a, e) and repeat.
I Intuition: The POMDP decision cycle is search in belief state space.
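To make FORWARD and the decision cycle concrete, here is a minimal Python sketch. It assumes the transition and sensor models are given as nested dictionaries and that policy, execute, and get_percept are supplied by the agent's environment interface; all of these names are illustrative, not fixed by the slides.

# Belief states are represented as dicts mapping state -> probability.
# Assumed (illustrative) model encoding:
#   T[s][a][s2] = P(s2 | s, a)   (transition model)
#   O[s][e]     = P(e | s)       (sensor model)

def forward(b, a, e, T, O):
    """One filtering step: the new belief state after doing a and perceiving e."""
    new_b = {}
    successors = {s2 for s in b for s2 in T[s][a]}
    for s2 in successors:
        # prediction: sum_s P(s2 | s, a) * b(s)
        pred = sum(T[s][a].get(s2, 0.0) * p for s, p in b.items())
        # update: weight by the sensor model P(e | s2)
        new_b[s2] = O[s2].get(e, 0.0) * pred
    alpha = sum(new_b.values())              # normalization constant
    return {s2: p / alpha for s2, p in new_b.items()} if alpha > 0 else new_b

def decision_cycle(b, policy, execute, get_percept, T, O, steps=10):
    """POMDP decision cycle: act on the belief state, perceive, filter, repeat."""
    for _ in range(steps):
        a = policy(b)            # the optimal action depends only on b
        execute(a)               # environment callback (illustrative)
        e = get_percept()        # environment callback (illustrative)
        b = forward(b, a, e, T, O)
    return b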


Partial observability contd.

I Recap: The POMDP decision cycle is search in belief state space.
I Observation 4.4. Actions change the belief state, not just the physical state. Thus POMDP solutions automatically include information-gathering behavior.
I Problem: The belief state is continuous: if there are n states, b is an n-dimensional real-valued vector.
I Example 4.5. The belief state of the 4x3 world is a point in an 11-dimensional continuous space (11 states).
I Theorem 4.6. Solving POMDPs is very hard! (actually, PSPACE-hard)
I In particular, none of the algorithms we have learned so far applies directly (they rely on a discreteness assumption).
I The real world is a POMDP (with initially unknown transition model T and observation model O).


11.5 Online Agents with POMDPs


Designing Online Agents for POMDPs

I Definition 5.1 (Dynamic Decision Networks).
I The transition and sensor models are represented as a DBN (a dynamic Bayesian network).
I Action nodes and utility nodes are added to create a dynamic decision network (DDN).
I A filtering algorithm is used to incorporate each new percept and action and to update the belief state representation.
I Decisions are made by projecting forward possible action sequences and choosing the best one.
I Generic structure of a dynamic decision network at time t:
Variables with known values are shaded gray; the agent must choose a value for A_t.
Rewards appear for times t, . . . , t + 2, but the utility node is for t + 3 (= discounted sum of the remaining rewards).


Designing Online Agents for POMDPs (continued)

I Part of the lookahead solution of the DDN above (search over the action tree):
circles = chance nodes (the environment decides)
triangles = belief states (each action decision is taken there)


Designing Online Agents for POMDPs (continued)

I Note: The belief state update is deterministic irrespective of the action outcome, so there are no chance nodes for action outcomes.
I The belief state at a triangle is computed by filtering with the actions/percepts leading to it.
I For the decision A_{t+i}, the agent will have the percepts E_{t+1:t+i} (even if it does not know their values at time t).
I A POMDP agent automatically takes into account the value of information and executes information-gathering actions where appropriate.
I The time complexity of exhaustive search up to depth d is O(|A|^d · |E|^d) (|A| = number of actions, |E| = number of percepts); see the sketch below.
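To see where the O(|A|^d · |E|^d) bound comes from, the following minimal Python sketch spells out the exhaustive depth-d lookahead over belief states. It reuses forward from the sketch above and assumes a belief-state reward function reward(b, a) as well as enumerable action and percept sets; again, all names are illustrative.

def prob_of_percept(b, a, e, T, O):
    """P(e | b, a) = sum_{s'} P(e | s') * sum_s P(s' | s, a) * b(s)."""
    return sum(O[s2].get(e, 0.0) *
               sum(T[s][a].get(s2, 0.0) * p for s, p in b.items())
               for s2 in O)

def lookahead_value(b, depth, T, O, actions, percepts, reward, gamma=1.0):
    """Expected utility of the best depth-limited plan from belief state b
    (exhaustive search; explores O(|A|^d * |E|^d) nodes)."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:                        # |A| action branches per level
        value = reward(b, a)                 # immediate (belief-state) reward
        for e in percepts:                   # |E| chance branches per action
            p_e = prob_of_percept(b, a, e, T, O)
            if p_e > 0.0:
                b2 = forward(b, a, e, T, O)  # deterministic belief update
                value += gamma * p_e * lookahead_value(
                    b2, depth - 1, T, O, actions, percepts, reward, gamma)
        best = max(best, value)
    return best

At the root, the agent would execute the action maximizing this value; every additional level of lookahead multiplies the number of explored belief states by |A| · |E|, which is exactly the exponential bound above.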


Summary

I Decision-theoretic agents for sequential environments
I Building on temporal, probabilistic models/inference (dynamic Bayesian networks)
I MDPs for the fully observable case
I Value/Policy Iteration for MDPs yields optimal policies
I POMDPs for the partially observable case = MDPs on belief state space
I The world is a POMDP with (initially) unknown transition and sensor models.


References I
