Chapter V: Conditional Probability. Stochastic Independence.

Introduction

Hi again! Now we are going to read the next chapter.

(Note that the “next chapter” is Chapter V. If you are reading a book on your own, the order of the book is not always the best order to read in. You can usually find some notes about this at the beginning (see “Notes on the Use of This Book,” page xi). If we had a lot of time, I would probably go in order, but I want to cover the most fundamental things first.)

I will make less detailed reading suggestions this time. As always:

  • read slowly;
  • write down comments and important points;
  • work through examples;
  • ask questions; and
  • for abstract concepts, invent your own examples.

When you have completed a chapter or section, try to summarize what it contains in your notes. (This could be very brief point form, and it’s fine if it’s only understandable to you: the purpose is to be a reminder, and to help you see the structure of the text overall.)

For this chapter, I would also add:

  • leave things for later if you need to. Sometimes the book will have a difficult example or side point, and it isn’t worth struggling through it before going forward. If something seems really tricky, feel free to flag it, continue on, and come back to it later.

This chapter will use the material of Preamble Assignment #1, particularly the material on Cartesian products. It will use a binomial coefficient in a couple of places, from Preamble Assignment #2, but you won’t need a lot. PA #2 will be super important in the next chapter.

1. Conditional Probability

Conditional probability is a fairly simple idea that has many applications, and that lies behind many subtleties in probability. The basic form is: “what is the chance that A happened, knowing that B happened?”

Preparatory Examples (page 114)

Again, the author is giving some concrete examples before the abstract definitions.

I sometimes find it helpful to make a book’s concrete examples even more concrete if I can. For example, in this paragraph, the author introduces a population of N people, with N_A colorblind people and N_H females. I imagine this as, let’s say there are N=1000 people, with N_A=100 colorblind people and N_H=500 women. I find it easier to deal with the abstraction if I have specific numbers in mind. (The numbers don’t have to be realistic. And we are assuming that “female” and “colorblind” are uncomplicated and unambiguous.)

Even when dealing with fancy abstract math, I am still in the habit of trying to go back to concrete things; what does this mean in terms of actual numbers?
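If you like, you can take this habit one step further in code. Here is a minimal sketch (Python) using my hypothetical numbers from above, plus a made-up value N_HA = 10 for the number of colorblind females; that value is my own choice, not the book's (Exercise 2 below asks you to pick your own). It checks the identity $P(A\vert H)=P(AH)/P(H)$ two ways:

```python
# Sanity check of conditional probability with the made-up numbers.
N = 1000      # total population
N_H = 500     # females
N_A = 100     # colorblind people
N_HA = 10     # colorblind females (my own hypothetical choice)

P_H = N_H / N
P_AH = N_HA / N              # P(A and H): count the overlap in the whole population
P_A_given_H = N_HA / N_H     # P(A|H): count directly within the female subpopulation

# The two ways of computing it agree: P(A|H) = P(AH) / P(H)
assert abs(P_A_given_H - P_AH / P_H) < 1e-12
print(P_A_given_H)  # 0.02
```

The point of the exercises is still to reason it through by hand first; code like this is just a way to double-check.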

Exercise 1:
a) For this example, draw a Venn diagram. Label which regions have how many points. (You might get confused at first about how many circles you need. Think about the list of all possibilities.)
b) Note that a Venn diagram doesn’t have to be circles. For example, in this example, the entire sample space (which we have been drawing as an enclosing rectangle) is split into men and women, so we could indicate that with a line down the middle. If I do it this way, how would you draw the region for colorblindness? Do you find this redrawing helpful?
c) Experiment with other ways to draw a Venn diagram, for two events, or for three events.

It’s often useful to draw a Venn diagram when picturing examples like this. Again, it helps to make things more concrete.

This may be taking things to silly extremes, but I find it helpful to picture Venn diagrams even more concretely: in this example, I would have a big field, and make a fence or rope circle with all the men on one side of the fence and women on the other. Then there is another intersecting rope circle for colorblind and not-colorblind. Sometimes I draw little stick figures in my Venn diagram, and imagine who those people are, in each of those regions. Sounds silly, but try it!

Exercise 2:
a) Given my hypothetical numbers above, assign a number for N_HA, the number of people who are both female and colorblind. (What is the range of options that N_HA could be with my numbers?)
b) Calculate the probabilities in (1.2), and check that the claimed equality actually is true.
c) Going back to the abstract: prove the second equality in (1.2), using (1.1).

The previous exercise is illustrating a principle: when the author gives you a new equation, think of it as an exercise: “where did this equation come from? How was it proved?”.

Does the reasoning of equation (1.2) make sense to you? Think it through in words and numbers, for our example, if it doesn’t yet.

Exercise 3: Each of the following refers to the example we are working through on page 114. For each one, (i) say what it means in words; (ii) compute the conditional probability as a number (using our hypothetical numbers); and (iii) give an abstract formula for the conditional probability (similar to (1.2)).
a) $P(A\vert H’)$
b) $P(A’\vert H)$
c) $P(A’\vert H’)$
d) $P(H\vert A)$ (does this even make sense? Can you calculate it?)

Is it clear to you how I generated Exercise 3? Given an example formula/calculation, (1.2), I was asking myself: what else could I have done with this formula, in a similar way? I recommend getting in this habit whenever you are presented with a new formula, computation, or theorem.

OK, I said I would be briefer this time…

Page 115, Second paragraph

For this bridge example, the author was not too specific.

Exercise 4:
a) Make up some specific examples of conditional probabilities about bridge. (Recall the definition of bridge on page 8. You don’t have to know how the game is played, just how the cards are dealt.) My example was: what is the probability that North has at least one ace, assuming that South has at least one ace. (In the notation of Chapter I, Problem 12, page 24, I’m trying to compute $P(N_1\vert S_1)$.) Come up with a couple of additional examples of your own of this type.
b) Try to find those conditional probabilities, using (1.2) or some other way. You might find it is too hard, but give it a try!

When you generate your own questions, you may find that they are too hard to answer. This is OK: it sets you up to want to know the method when you get to it later in the book. (And if the method is never given in the book, well, welcome to mathematical research!)

Exercise 5: I found the conditional probability problems I made up just now in Exercise 4 a little too hard to be helpful right now. I wanted just a simple example that would help me understand the idea. I personally liked the “rolling two dice” experiment. For rolling two dice, make up a conditional probability problem that you CAN solve. (Or, if you prefer another example, do that! Try to find something simple that you can solve.)

Page 116, Second paragraph (“Taking conditional probabilities…”)

If this sentence is confusing, think of it with the colorblind people, with the particular numbers we made up.

Exercise 6:
a) Prove (1.4) on page 116. (By “prove” I don’t mean anything intimidating. I just mean, how did the author get formula (1.4) from what came before?)
b) Prove (1.5) on page 116.
c) Prove (1.6) on page 116.

Page 116, Last paragraph (“We conclude with a simple formula…”)

When you are given a statement about a list $H_1,\dotsc, H_n$, you should always read it through the first time assuming n=2. That is, we are assuming we have two mutually exclusive events, $H_1$ and $H_2$. Makes it easier to read, right?

NOTE: In formula (1.8), the $\Sigma$ is a Greek capital “sigma”, or letter “s”, standing for “sum”. This is called a “summation symbol”. The instruction is to add whatever appears, for each value of the “index” i. So written out in full, formula (1.8) says $$P(A) = P(A\vert H_1)\cdot P(H_1) + \dotsb + P(A\vert H_n)\cdot P(H_n).$$ There are n terms in the sum, corresponding to i=1, 2, 3, …, n.
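Written as code, the summation is just an explicit loop or sum. Here is a small sketch for n=3; all the numbers are invented for illustration, chosen only so that the $H_i$ are mutually exclusive and exhaustive:

```python
# Law of total probability, formula (1.8), written out as an explicit sum.
# All numbers here are invented for illustration.
P_H = [0.2, 0.3, 0.5]          # P(H_1), P(H_2), P(H_3); must sum to 1
P_A_given_H = [0.1, 0.4, 0.9]  # P(A|H_1), P(A|H_2), P(A|H_3)

assert abs(sum(P_H) - 1.0) < 1e-12   # one of the H_i necessarily occurs

# P(A) = sum over i of P(A|H_i) * P(H_i)
P_A = sum(pa * ph for pa, ph in zip(P_A_given_H, P_H))
print(P_A)  # ≈ 0.1*0.2 + 0.4*0.3 + 0.9*0.5 = 0.59
```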

Exercise 7:
a) In our original example of men, women, and colorblindness, what could I take $H_1$ and $H_2$ to be? Remember, they have to be “mutually exclusive events of which one necessarily occurs”.
b) Write out formula (1.7) for two terms, and interpret it in words for this example (with the $H_1$ and $H_2$ you picked in (a)).
c) Write out formula (1.8) for two terms, and interpret it in words for this example (with the $H_1$ and $H_2$ you picked in (a)).
d) Check that (1.8) is numerically true, with the particular numbers we made up.

Exercise 8: Still assuming n=2,
a) Prove (1.7). (It doesn’t have to be formal, you can just explain to yourself why it’s true.)
b) Prove (1.8).

For n=2, say formula (1.8) to yourself in words. Is it starting to make intuitive sense, what (1.8) is saying?

Now that you have read through everything assuming n=2, go back and read it again assuming n=3. Write out (1.7) and (1.8) in the case n=3, and say them to yourself in words.

Finally, the statement with $H_1,\dotsc, H_n$ should make more sense to you now. Roughly, it is saying “this works the same way no matter how many $H_i$ hypotheses you have”.

Examples (page 117)

As before, treat every example as a solved problem. Try to solve it yourself, and check every statement the author makes.

Example (a): This might look abstract at first. Try replacing variables with particular numbers on the first reading. Try some way to imagine it as more concrete. (Once you sort through it, you will find that it is saying something almost trivially simple. Which raises the question, why is he even doing this example? See if you can understand why.)

Example (b):

Exercise 9: Solve example (b)! This example might look intimidating at first, but take the setup, and try to solve it yourself first. Then go back and look at the author’s answer. The author’s answer won’t make much sense until you try solving it yourself—he effectively only gives a hint. It’s a little tricky, so if you don’t get the same answer, go back and try to figure out why. Then go back and try it another way: you should be able to solve this problem by two methods. (The author suggests two different ways you can solve this problem in his discussion.)

(I’m only making this an “exercise” to emphasize that it is important to work through all the examples! I won’t do that in future: assume that for every example in the book, I want you to work through it as an exercise, unless I say otherwise.)

Example (c): This comes back to Puzzle 2′ from the introductory lecture! Follow through the calculation yourself, using formula (1.3). (Again, this IS an exercise! And again, some restrictive and old-fashioned assumptions are being made here.)

Exercise 10: In the last part of Example (c), the author talks about how the probability of a boy is going to depend on how the family was selected. Apply that reasoning to Puzzle 2 from the introductory lecture. Remember, that puzzle was: “I flip two coins, and show you one is heads; what is the probability that the other is heads?” How does this probability depend on how I choose which coin to show you? What would be various ways I could choose?

Exercise 11:
a) Suppose I flip a coin three times. What is the probability that it comes up three heads, knowing that at least two of the flips came up heads?
b) Suppose I flip a coin one hundred times. What is the probability that it comes up 100 heads, knowing that at least 99 of the flips came up heads?
c) As before, what precise conditions do I have to apply to make the conditional probability the correct one here? How could I change the wording slightly so that the conditional probability is NOT the correct one?

Example (d): There isn’t anything numerical to work through here. But it’s abstract, so make up some examples. Again, there are an unknown number of hypotheses $H_1, H_2, \dotsc$; start by assuming there are just two, and write out (1.9) in that case. Come up with a more specific example than is given in the text.

NOTE: Depending on your style, you may prefer to do Exercises 12–14 in a different order. For example, it might be easier for you to prove (1.9) (Exercise 12) if you come up with a good practical example first (Exercise 14) that helps you keep track of what things mean. Or it may be easier to do a numerical example first (Exercise 13). Also, if you find this example and these exercises tricky, you might want to flag them and come back later.

Exercise 12: Prove (1.9) on page 118.

Exercise 13: For our original colorblindness example, use (1.9) to find a formula for the probability that someone is female, given that they are colorblind. Find the answer numerically, using our made-up numbers.

Exercise 14: Try to come up with a practical example of subpopulations $H_1, \dotsc, H_n$ of people, and an event $A$, and say what (1.9) would tell you in those practical terms.

I find it easier to understand (1.9) if I write it in terms of $P(A\vert H_j)$, etc., rather than with these new (not very descriptive) letters $p_j$ and $q_j$. You might want to do that (you probably already have, if you solved Exercise 12). Note that the author gives the solution a bit later: in the “Note on Bayes’ rule” at the bottom of page 124, he proves (1.9), writing it in the more descriptive form that I am suggesting.

2. Probabilities defined by conditional probabilities. Urn models.

Looking ahead, this section is one introductory paragraph and seven pages of examples. You might anticipate needing to skip some of the examples and come back to them later; let’s see how it goes!

Introductory paragraph (page 118)

The author is saying that sometimes what you are given is the conditional probabilities, and you work out other probabilities from that.

For example, in our made-up numbers for colorblindness, rather than giving a sample size, I could have just given conditional probabilities. Suppose $A$ is the event that someone is colorblind, $H_1$ is the event that they are female, and $H_2$ is the event that they are male. (We are making the out-of-date assumptions that $H_1H_2=0$, the empty set, and $H_1\cup H_2=S$, the whole sample space.) Then I could have specified the situation by three numbers:

  • $P(A\vert H_1)$ (say for argument’s sake $P(A\vert H_1)=0.01$)
  • $P(A\vert H_2)$ (say for argument’s sake $P(A\vert H_2)=0.05$)
  • $P(H_1)$ (say for argument’s sake $P(H_1)=0.5$)

Exercise 15:
a) Say the above assumptions in words.
b) Given the above assumptions, calculate $P(H_2)$, $P(A)$, $P(H_1\vert A)$, and $P(H_2\vert A)$ (using formulas that we’ve just established).
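Once you have worked Exercise 15 by hand, you can check your arithmetic with a few lines of code. This sketch just applies the formulas we established in Section 1 to the three numbers in the bullet list above:

```python
# Checking Exercise 15(b) with the made-up numbers from the bullet list.
P_A_given_H1 = 0.01   # P(colorblind | female)
P_A_given_H2 = 0.05   # P(colorblind | male)
P_H1 = 0.5            # P(female)

P_H2 = 1 - P_H1                                   # H1, H2 mutually exclusive, exhaustive
P_A = P_A_given_H1 * P_H1 + P_A_given_H2 * P_H2   # total probability, formula (1.8)
P_H1_given_A = P_A_given_H1 * P_H1 / P_A          # Bayes' rule, formula (1.9)
P_H2_given_A = P_A_given_H2 * P_H2 / P_A

print(P_A)            # ≈ 0.03
print(P_H1_given_A)   # ≈ 1/6
assert abs(P_H1_given_A + P_H2_given_A - 1) < 1e-9
```

Notice how the answer comes out: even though colorblindness is rarer among women in these made-up numbers, a colorblind person still has a 1-in-6 chance of being female, not zero.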

Example (a) (page 118)

Verify everything the author says. (I don’t know about you, but I kind of took this for granted back when we did Example I (5.b) and Problem 5 in Chapter I,8. Is it clear why the author is bringing this up now?)

Example (b) (page 118–119)

As usual, please forgive the dated assumptions.

If the author states a formula for n, try it for n=1, for n=2, for n=3, at least. And don’t forget n=0, if applicable!

As before, be sure to verify everything the author says! Treat it as a solved exercise! You will soon get lost if you just read, rather than working it through yourself.

In particular, make sure you understand:

  • where $P(A)=p_1\cdot 2^{-1} + p_2\cdot 2^{-2} + \dotsb$ comes from
  • why the previous formula is a “special case of (1.8)”
  • how (2.1) is derived.

Example (c): Urn models for aftereffect (first part, pages 119–120)

(xkcd comic transcript:
“Imagine that you’re drawing at random from an urn containing fifteen balls—six red and nine black.”
“OK. I reach in and... ...my grandfather’s ashes?!? Oh God!”
“I... what?”
“Why would you do this to me?!?”)

It might help to set this up a little before you read the subsection. In our examples of flipping coins or rolling dice, the flips or rolls are independent; each flip of the coin remembers nothing about the previous flips. The probability of heads is 1/2 on every flip. We can use idealized coins or dice (or whatever) to model realistic events, like births of children or instances of a disorder. But such a model will only be good if the realistic events have the same property. For example, in the above, we were implicitly assuming that the probability of having a girl is 1/2, and that it is not affected by the sex of previous children you have had.

However, there might be realistic events where something happening makes it more likely to happen the next time (say missing a free throw in sports), or makes it less likely to happen the next time (say catching a disease).

With urn models, Polya was trying to imagine a very simple idealized example (like coins or dice) that would show this effect, of the different trials not being independent.

Instead of flipping a coin, you could put one black and one red ball in a bowl. (An “urn” is a fancy bowl. I’m not sure why people use the word “urn” in this context. Maybe because the neck is narrow, so you can’t see inside?) Pull out a ball at random, black=heads and red=tails. Put the ball back in, and you have a 1/2 probability each of black and red.

But what if, when you pick a black ball out, you put TWO black balls back in (the original plus one more)? And same with red: if you pick a red ball, you put TWO red balls back in. Now, getting black makes it more likely to get another black. And runs become more likely than they would have been for coins.

You could modify this by changing the rules about what you put back in, in each case. For another example, we could put 500 black and 500 red balls in at first, and when we pick a ball, NOT put it back in (i.e. put ZERO balls back in). Then, if we had had a long string of black balls, say, it would make the red balls more likely.

That is Polya’s idea for an idealized model where the probabilities have “memory”. Note that we don’t necessarily care that much about pulling balls out of urns; it is an idealized model, which we can use to think about the ideas, and we can use as a model for more practical situations.
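Polya’s procedure is also easy to simulate, and a simulation makes the “memory” effect vivid. Here is a sketch (my own code, not from the book): with b = r = c = 1 and ten draws, the number of black balls drawn comes out far more spread out than ten fair coin flips would be.

```python
import random

def polya_draws(b, r, c, n, rng):
    """Simulate n draws from a Polya urn that starts with b black and r red
    balls; after each draw the ball goes back in along with c extra balls
    of the same color.  Returns how many of the n draws were black."""
    blacks = 0
    for _ in range(n):
        if rng.random() < b / (b + r):
            blacks += 1
            b += c      # drew black: return it plus c more black balls
        else:
            r += c      # drew red: return it plus c more red balls
    return blacks

rng = random.Random(0)
n, trials = 10, 100_000
counts = [0] * (n + 1)
for _ in range(trials):
    counts[polya_draws(1, 1, 1, n, rng)] += 1
# Empirical probability of drawing 0, 1, ..., 10 black balls:
print([round(k / trials, 3) for k in counts])
```

Compare the histogram with what ten independent fair flips would give (setting c = 0 makes the draws independent, so you can check directly): the urn’s “memory” makes extreme counts like 0 or 10 blacks dramatically more common. You can also try the no-replacement variant (start with 500 of each and pass c = -1).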

OK, now you should go ahead with the reading! Work through pages 119–120, being sure to verify every statement for yourself. (Except you can skip “it is easily verified by induction that the probabilities of all sample points indeed add to unity”, in the middle of page 120.) When you’ve gotten through page 120, come back!

OK, how did it go? Did you understand where the author got each formula? In particular, you should derive (2.3) yourself to see where it comes from. There are some subtle points there, so be careful.

As usual, when introducing something abstract, it can be helpful to pick some simple numbers for the variables to see what is going on:

Exercise 16:
a) Assume b=r=c=1. Talk through to yourself what this means for the procedure. Write out (2.3) in this case.
b) Now, still assuming b=r=c=1, assume that n=2. What does this mean in words? What does it mean when $n_1=0$, $n_1=1$, and $n_1=2$? Work out (2.3) numerically in each of those cases.
c) Do the same, still assuming b=r=c=1, and assuming n=3. Work out (2.3) numerically for $n_1=0$, $n_1=1$, $n_1=2$, and $n_1=3$.

Example (c): Urn models for aftereffect continued (second part, page 121)

What does $p_{n_1,n}$ mean, in your own words? For example, assuming b=r=c=1 and n=2 as in the previous exercise, what do $p_{0,2}$, $p_{1,2}$, and $p_{2,2}$ mean?

It might help to remind you of something about binomial coefficients, from Preamble Assignment #2. In the current situation, we are assuming $n_1$ black balls are chosen, and we want to find out how many ways those $n_1$ black balls could be distributed in the sequence of $n$ balls chosen. For example, if $n_1=2$ and $n=3$, then the possibilities are

BBR, BRB, and RBB,

so there are three ways to get two black balls in three drawings.

The number of ways is the same as the number of ways to choose $n_1$ things (in this case, positions) from $n$ things. The name for this is the binomial coefficient, and the symbol is $$\binom{n}{n_1}.$$ For example, we just demonstrated that $\binom{3}{2}=3$. In general, when we choose the $n_1$ things, there will be $n$ choices for the first thing, $n-1$ choices for the second thing, and so on, up to $n-(n_1-1)=n-n_1+1$ choices for the $n_1$th thing. But it doesn’t matter what order we pick the things, so we have over-counted! The over-counting is by the number of different orders we could have picked the same $n_1$ things, which is $n_1(n_1-1)(n_1-2)\dotsm 3\cdot 2\cdot 1$. So $$\binom{n}{n_1}=\frac{n(n-1)(n-2)\dotsm(n-n_1+1)}{n_1(n_1-1)(n_1-2)\dotsm 3\cdot 2\cdot 1}.$$ (See Preamble Assignment #2 for more detail!)
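The product formula translates directly into code. Here is a small sketch (`binom` is my own helper name) that you can compare against the worked count $\binom{3}{2}=3$:

```python
def binom(n, k):
    """Binomial coefficient via the product formula:
    n(n-1)...(n-k+1) divided by k(k-1)...1 to correct the over-counting."""
    numerator = 1
    for i in range(k):
        numerator *= n - i          # n, n-1, ..., n-k+1 choices in order
    denominator = 1
    for i in range(1, k + 1):
        denominator *= i            # k! different orderings of the same k things
    return numerator // denominator

print(binom(3, 2))   # 3, matching the BBR, BRB, RBB count above
print(binom(4, 2))   # 6
```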

Is it clear to you why we must multiply (2.3) by $\binom{n}{n_1}$ to get the total probability $p_{n_1,n}$? It might help to compute some small cases explicitly:

Exercise 17: For this exercise, assume b=r=c=1.
a) Find $p_{0,2}$, $p_{1,2}$, and $p_{2,2}$ numerically. (Check that they add to 1, because these are all the mutually exclusive possibilities!)
b) Find $p_{0,3}$, $p_{1,3}$, $p_{2,3}$, and $p_{3,3}$ numerically.
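After working Exercise 17 by hand, you can check yourself by brute force. This sketch (my own code) enumerates every possible black/red sequence, computes each sequence’s probability by following the urn procedure, and adds up the ones with the right number of blacks:

```python
from itertools import product

def sequence_prob(seq, b, r, c):
    """Probability of one exact black/red sequence ('B'/'R') from a Polya urn."""
    p = 1.0
    for ball in seq:
        if ball == "B":
            p *= b / (b + r)
            b += c
        else:
            p *= r / (b + r)
            r += c
    return p

def p_n1_n(n1, n, b=1, r=1, c=1):
    """p_{n1,n}: probability of drawing exactly n1 black balls in n draws."""
    return sum(sequence_prob(seq, b, r, c)
               for seq in product("BR", repeat=n) if seq.count("B") == n1)

probs = [p_n1_n(k, 3) for k in range(4)]   # p_{0,3}, p_{1,3}, p_{2,3}, p_{3,3}
print(probs)
assert abs(sum(probs) - 1.0) < 1e-12       # the possibilities are exhaustive
```

(Don’t run it until you’ve done the hand computation, or it will spoil the punchline of part (b)!)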

I’d like to skip explaining (2.4) until after we do more on the binomial coefficients in Chapter VI. The book is assuming material here from Chapter II, which we skipped over for now. (When you are reading on your own, and not doing things in order, this is a judgment call you have to make, about whether it is wise to skip something until later. Of course, if you do skip it and find you need it, you can always come back and fill in as needed.)

Ehrenfest model of heat exchange and Friedman safety campaign (page 121)

Since the Ehrenfest model is related to physics, I’m going to guess that we can skip it. (It is interesting and not hard, but you’re probably getting tired at this point.) Same with the safety campaign example. We can come back to them later if it seems important and we have time.

Examples (d) and (e), pages 121–125

Wow, this section is getting long!

This brings up the question: when you are reading on your own, when is it a good idea to skip ahead?

The first time I read this chapter, I skipped over Examples (d) and (e), and came back to them later. I am going to suggest you do the same. Here was my reasoning:

  • They are additional examples; maybe they are not needed right now.
  • I want to get back to learning the main concepts. Skipping ahead a little, I see that V.3 gets back to central concepts, so maybe it is good to go there now and come back.
  • Skimming over (d) and (e), it seems like the author is making a bit of a philosophical discussion. That might be interesting, but it is also something I could leave to the end, after I get a better understanding of the chapter.

It does turn out that Examples (d) and (e) are very interesting, both philosophically, and also for modern applications. (They concern something called Bayes’ Theorem, which is increasingly important nowadays.) However, I don’t know about you, but I’m starting to feel bogged down. Let’s skip to V.3, and come back to Examples (d) and (e) at the end!!

3. Stochastic Independence

From here on, I will assume that you are getting into the habit of reading carefully (making examples, making notes, etc.). I’ll stop including so much detail, and just make some shorter comments.

First, make sure you derive (3.1) for yourself.

Next, make sure you think of examples for the abstract definition. Specifically:

Exercise 18:
a) Try to think of some examples of events that (i) are independent, according to Definition 1, and also some examples of events that (ii) are NOT independent, according to Definition 1. (Given a definition, it is always good to try to think of examples that obey it, but also examples which do NOT obey it.) You can use abstract examples (flipping coins, rolling dice, etc.), or practical examples (e.g. getting a medical test); preferably, try to come up with a couple of both.
b) According to Definition 1, two events A and H are NOT independent if $P(AH)\neq P(A)\cdot P(H)$. That means two possibilities, either $P(AH)> P(A)\cdot P(H)$, or $P(AH)< P(A)\cdot P(H)$. Are both these cases actually possible, for some choice of events? If you think not, explain why. If you think both are possible, give examples. (HINT: Say what each of these cases would mean, mathematically, about $P(A\vert H)$ versus $P(A)$. Explain what each case would mean in words.)
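If you want a worked instance to compare against once you’ve tried part (a), here is a brute-force check of Definition 1 over two fair dice. The particular events here are my own choices, not from the book:

```python
from itertools import product

# Sample space: ordered pairs of fair dice, each point with probability 1/36.
space = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a predicate on sample points."""
    return sum(1 for pt in space if event(pt)) / len(space)

A = lambda pt: pt[0] % 2 == 0        # first die is even
B = lambda pt: pt[0] + pt[1] == 7    # sum is 7
H = lambda pt: pt[0] + pt[1] >= 10   # sum is at least 10

AB = lambda pt: A(pt) and B(pt)
AH = lambda pt: A(pt) and H(pt)

# A and B turn out to be independent under Definition 1 (perhaps surprisingly):
assert abs(prob(AB) - prob(A) * prob(B)) < 1e-12
# A and H are not. Which way does the inequality go, and why? (See part (b).)
print(prob(AH), prob(A) * prob(H))
```

Notice that A and B being independent here is exactly the “satisfies Definition 1 without being intuitively independent” phenomenon that Example (d) below discusses: knowing the first die is even tells you nothing about whether the sum is 7.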

As usual, work through all the examples, and verify everything the author says. In particular, in Example (c), he says “this is intuitively clear and easily verified”—I’m not sure I agree about “easily”, but do try it!

In Example (d), the author makes an important point: if two events are “intuitively independent”—if they don’t influence each other—then they satisfy Definition 1 and are therefore mathematically independent. But there can be events where Definition 1 is satisfied, so they are independent by definition, even though the events are not “intuitively independent”. The use of the word “independent” means Definition 1 is satisfied, no more and no less.

Be sure to work through the rest of page 126. You could probably skip over Example (e) if you are getting tired, but it isn’t too difficult. Verify all the author’s derivations on the rest of page 127.

For Definition 2, use the usual strategy of setting n=2 first, and writing out what it says. Then set n=3 and write out what it says, then n=4. That should give you a clear idea of the general meaning.

The comments after Definition 2 seem like the sort of side points that one can pretty safely skip on a first reading (and you can).

4. Product Spaces. Independent Trials.

This section is written somewhat abstractly, and it is particularly wordy. I would suggest:

  • Concretize. Come up with a concrete example or two to apply the concepts to. I used rolling a die twice, and rolling a die three times as my two examples. Think of the concrete example for each abstract statement. Actually write down the points of the sample spaces, find any probabilities numerically, and so on.
  • Condense. For each paragraph, write a note about what it means in your own words. The note can just be a brief point, or even just a word—it doesn’t have to make sense to anyone but you.
  • Translate. If English is not your first language, translate as needed. Also, you may be used to different mathematical language (e.g. “Cartesian product” versus “combinatorial product”). Convert into your preferred words or symbols.
  • Summarize. When you reach the end of this section, go back over it and try to see how it was structured. Make a summary for yourself in point form. Again, it can be very brief, just reminders to yourself.

5, 6, and 7. Genetics

The author says you can skip these, so let’s skip them!

If you are interested in genetics, then these sections are definitely worth reading! But not everyone in this class is going to be interested in this application, so I won’t try to cover it.

Conclusion (not really, but almost!)

Now that we have reached the end of the chapter, as before I suggest that you make a summary. List the things that you think are most important from each section. Don’t be too detailed—this is just a reminder for yourself of the things you are trying to know.

Also, try to get a sense of how the chapter is structured, and how it fits in with the first chapter. This can help give you a better sense of where the author is trying to go.

Note that we never got back to Examples (d) and (e) in that long Section 2! I am going to recommend that you go do the assignment first, or at least get it started, to get some more understanding of how these calculations work in practice. Then go look at the “Chapter V Addendum: Bayesian Reasoning”, where I talk about those examples.
