Birthday problem

In probability theory, the birthday problem or birthday paradox concerns the probability that, in a set of $n$ randomly chosen people, some pair of them will have the same birthday. By the pigeonhole principle, the probability reaches 100% when the number of people reaches 367 (since there are only 366 possible birthdays, including February 29). However, 99.9% probability is reached with just 70 people, and 50% probability with 23 people. These conclusions are based on the assumption that each day of the year (excluding February 29) is equally probable for a birthday.

Actual birth records show that different numbers of people are born on different days. In this case, it can be shown that the number of people required to reach the 50% threshold is 23 or fewer.[1] For example, if half the people were born on one day and the other half on another day, then any two people would have a 50% chance of sharing a birthday.

It may seem surprising that a group of just 23 individuals suffices to reach a probability of 50% that at least two of them share a birthday. The result becomes more plausible when one considers that birthdays are compared between every possible pair of individuals, giving 23 × 22/2 = 253 comparisons, which is well over half the number of days in a year (183 at most), rather than fixing on one individual and comparing his or her birthday with everyone else's.

Real-world applications for the birthday paradox include a cryptographic attack called the birthday attack, which uses this probabilistic model to reduce the complexity of finding a collision for a hash function, as well as calculating the approximate risk of a hash collision existing within the hashes of a given size of population.

The history of the problem is obscure. W. W. Rouse Ball indicated (without citation) that it was first discussed by Harold Davenport.[2] However, Richard von Mises proposed an earlier version of what is considered today to be the birthday problem.[3]

The computed probability of at least two people sharing a birthday versus the number of people

Calculating the probability

The problem is to compute an approximate probability that in a group of $n$ people at least two have the same birthday. For simplicity, variations in the distribution, such as leap years, twins, and seasonal or weekday variations, are disregarded, and it is assumed that all 365 possible birthdays are equally likely. (Real-life birthday distributions are not uniform, since not all dates are equally likely, but these irregularities have little effect on the analysis. In fact, a uniform distribution of birth dates is the worst case.[4])

The goal is to compute $P(A)$, the probability that at least two people in the room have the same birthday. However, it is simpler to calculate $P(A')$, the probability that no two people in the room have the same birthday. Then, because $A$ and $A'$ are the only two possibilities and are also mutually exclusive, $P(A) = 1 - P(A')$.

In deference to widely published solutions concluding that 23 is the minimum number of people necessary to have a $P(A)$ that is greater than 50%, the following calculation of $P(A)$ will use 23 people as an example. If one numbers the 23 people from 1 to 23, the event that all 23 people have different birthdays is the same as the event that person 2 does not have the same birthday as person 1, and that person 3 does not have the same birthday as either person 1 or person 2, and so on, and finally that person 23 does not have the same birthday as any of persons 1 through 22. Let these events respectively be called "Event 2", "Event 3", and so on. One may also add an "Event 1", corresponding to the event of person 1 having a birthday, which occurs with probability 1. This conjunction of events may be computed using conditional probability: the probability of Event 2 is 364/365, as person 2 may have any birthday other than the birthday of person 1. Similarly, the probability of Event 3 given that Event 2 occurred is 363/365, as person 3 may have any of the birthdays not already taken by persons 1 and 2. This continues until finally the probability of Event 23 given that all preceding events occurred is 343/365. Finally, the principle of conditional probability implies that $P(A')$ is equal to the product of these individual probabilities:

$$P(A') = \frac{365}{365} \times \frac{364}{365} \times \frac{363}{365} \times \cdots \times \frac{343}{365} \qquad (1)$$

The terms of equation (1) can be collected to arrive at:

$$P(A') = \left(\frac{1}{365}\right)^{23} \times (365 \times 364 \times 363 \times \cdots \times 343) \qquad (2)$$

Evaluating equation (2) gives $P(A') \approx 0.492703$.

Therefore, $P(A) \approx 1 - 0.492703 = 0.507297$ (50.7297%).

This process can be generalized to a group of $n$ people, where $p(n)$ is the probability of at least two of the $n$ people sharing a birthday. It is easier to first calculate the probability $\bar{p}(n)$ that all $n$ birthdays are different. According to the pigeonhole principle, $\bar{p}(n)$ is zero when $n > 365$. When $n \le 365$:

$$\bar{p}(n) = 1 \times \left(1 - \frac{1}{365}\right) \times \left(1 - \frac{2}{365}\right) \times \cdots \times \left(1 - \frac{n-1}{365}\right) = \frac{365 \times 364 \times \cdots \times (365 - n + 1)}{365^n} = \frac{365!}{365^n (365 - n)!} = \frac{n! \cdot \binom{365}{n}}{365^n} = \frac{{}_{365}P_n}{365^n}$$

where $!$ is the factorial operator, $\binom{365}{n}$ is the binomial coefficient and ${}_kP_r$ denotes permutation.

The equation expresses the fact that the first person has no one to share a birthday, the second person cannot have the same birthday as the first ($\frac{364}{365}$), the third cannot have the same birthday as either of the first two ($\frac{363}{365}$), and in general the $n$th birthday cannot be the same as any of the $n - 1$ preceding birthdays.

The event of at least two of the $n$ persons having the same birthday is complementary to all $n$ birthdays being different. Therefore, its probability $p(n)$ is

$$p(n) = 1 - \bar{p}(n).$$
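As a minimal sketch of the calculation described above, the Python function below multiplies the factors (365 − k)/365 for k = 0, …, n − 1 and subtracts the product from 1. The name `p_shared` and the parameter `d` (the number of equally likely days) are illustrative choices, not part of the original text.

```python
from math import prod


def p_shared(n: int, d: int = 365) -> float:
    """Probability that at least two of n people share a birthday,
    assuming d equally likely birthdays."""
    if n > d:
        return 1.0  # pigeonhole principle: a match is certain
    # Probability that all n birthdays are different.
    p_distinct = prod((d - k) / d for k in range(n))
    return 1.0 - p_distinct


print(f"{p_shared(23):.4%}")  # prints 50.7297%
```

Passing `d=366` gives the leap-year variant of the same calculation.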

The following table shows the probability for some other values of $n$ (this table ignores the existence of leap years, as described above, and assumes that each birthday is equally likely):

The probability that no two people share a birthday in a group of $n$ people. Note that the vertical scale is logarithmic (each step down is $10^{20}$ times less likely).

$n$      $p(n)$
1        0.0%
5        2.7%
10       11.7%
20       41.1%
23       50.7%
30       70.6%
40       89.1%
50       97.0%
60       99.4%
70       99.9%
75       99.97%
100      99.99997%
200      99.9999999999999999999999999998%
300      (100 − 6×10^−80)%
350      (100 − 3×10^−129)%
365      (100 − 1.45×10^−155)%
≥ 366    100%

Leap years. If we substitute 366 for 365 in the formula for $\bar{p}(n)$, a similar calculation shows that for leap years, the number of people required for the probability of a match to be more than 50% is also 23; the probability of a match in this case is 50.6%.

Approximations

Graphs showing the approximate probabilities of at least two people sharing a birthday and its complementary event
A graph showing the accuracy of the approximation $1 - e^{-n^2/730}$

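The Taylor series expansion of the exponential function (the constant $e \approx 2.718281828$)

$$e^x = 1 + x + \frac{x^2}{2!} + \cdots$$

provides a first-order approximation for $e^x$ for $|x| \ll 1$:

$$e^x \approx 1 + x.$$

To apply this approximation to the first expression derived for $\bar{p}(n)$, set $x = -\frac{a}{365}$. Thus,

$$e^{-a/365} \approx 1 - \frac{a}{365}.$$

Then, replace $a$ with non-negative integers for each term in the formula of $\bar{p}(n)$ until $a = n - 1$, for example, when $a = 1$,

$$e^{-1/365} \approx 1 - \frac{1}{365}.$$

The first expression derived for $\bar{p}(n)$ can be approximated as

$$\bar{p}(n) \approx 1 \cdot e^{-1/365} \cdot e^{-2/365} \cdots e^{-(n-1)/365} = e^{-(1 + 2 + \cdots + (n-1))/365} = e^{-\frac{n(n-1)}{730}}.$$

Therefore,

$$p(n) = 1 - \bar{p}(n) \approx 1 - e^{-\frac{n(n-1)}{730}}.$$

An even coarser approximation is given by

$$p(n) \approx 1 - e^{-\frac{n^2}{730}},$$

which, as the graph illustrates, is still fairly accurate.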
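To get a feel for how close these approximations are, the sketch below compares the exact product with the two exponential forms, taken here to be 1 − e^(−n(n−1)/(2d)) and the coarser 1 − e^(−n²/(2d)) with d = 365. The function names are illustrative only.

```python
from math import exp, prod


def exact(n: int, d: int = 365) -> float:
    """Exact probability of at least one shared birthday."""
    return 1.0 - prod((d - k) / d for k in range(n))


def approx_pairs(n: int, d: int = 365) -> float:
    """First-order approximation 1 - exp(-n(n-1)/(2d))."""
    return 1.0 - exp(-n * (n - 1) / (2 * d))


def approx_square(n: int, d: int = 365) -> float:
    """Coarser approximation 1 - exp(-n^2/(2d))."""
    return 1.0 - exp(-n * n / (2 * d))


for n in (10, 23, 50):
    print(n, round(exact(n), 4), round(approx_pairs(n), 4), round(approx_square(n), 4))
```

For n = 23 both approximations land within about one percentage point of the exact value.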

According to the approximation, the same approach can be applied to any number of "people" and "days". If rather than 365 days there are $d$, if there are $n$ persons, and if $n \ll d$, then using the same approach as above we achieve the result that if $p(n, d)$ is the probability that at least two out of $n$ people share the same birthday from a set of $d$ available days, then:

$$p(n, d) \approx 1 - e^{-\frac{n(n-1)}{2d}} \approx 1 - e^{-\frac{n^2}{2d}}.$$

A simple exponentiation

The probability of any two people not having the same birthday is $\frac{364}{365}$. In a room containing $n$ people, there are $\binom{n}{2} = \frac{n(n-1)}{2}$ pairs of people, i.e. $\binom{n}{2}$ events. The probability of no two people sharing the same birthday can be approximated by assuming that these events are independent and hence by multiplying their probabilities together. In short, $\frac{364}{365}$ can be multiplied by itself $\binom{n}{2}$ times, which gives us

$$\bar{p}(n) \approx \left(\frac{364}{365}\right)^{\binom{n}{2}}.$$

Since this is the probability of no one having the same birthday, then the probability of someone sharing a birthday is

$$p(n) \approx 1 - \left(\frac{364}{365}\right)^{\binom{n}{2}}.$$
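The pairwise argument translates almost directly into code. The sketch below raises 364/365 to the number of pairs, treating the pairs as independent exactly as the paragraph above does; the function name is an illustrative choice.

```python
from math import comb


def p_shared_pairwise(n: int, d: int = 365) -> float:
    """Approximate match probability, treating all C(n, 2) pairs as independent."""
    pairs = comb(n, 2)  # number of distinct pairs among n people
    return 1.0 - ((d - 1) / d) ** pairs


print(round(p_shared_pairwise(23), 4))
```

With 23 people there are 253 pairs and the result is just over one half, slightly below the exact 50.7%.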

Poisson approximation

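Applying the Poisson approximation for the binomial on the group of 23 people,

$$\operatorname{Poi}\left(\frac{\binom{23}{2}}{365}\right) = \operatorname{Poi}\left(\frac{253}{365}\right) \approx \operatorname{Poi}(0.6932)$$

so

$$\Pr(X > 0) = 1 - \Pr(X = 0) \approx 1 - e^{-0.6932} \approx 1 - 0.499998 = 0.500002.$$

The result is over 50%, as in the previous descriptions. This approximation is the same as the one above based on the Taylor expansion that uses $e^x \approx 1 + x$.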

Square approximation

A good rule of thumb which can be used for mental calculation is the relation

$$p(n) \approx \frac{n^2}{2m}$$

which can also be written as

$$n \approx \sqrt{2m \times p(n)}$$

which works well for probabilities less than or equal to $\frac{1}{2}$. In these equations, $m$ is the number of days in a year.

For instance, to estimate the number of people required for a $\frac{1}{2}$ chance of a shared birthday, we get

$$n \approx \sqrt{2 \times 365 \times \tfrac{1}{2}} = \sqrt{365} \approx 19,$$

which is not too far from the correct answer of 23.

Approximation of number of people

This can also be approximated using the following formula for the number of people necessary to have at least a $\frac{1}{2}$ chance of matching:

$$n \ge \tfrac{1}{2} + \sqrt{\tfrac{1}{4} + 2 \times \ln(2) \times 365} = 22.999943.$$

This is a result of the good approximation that an event with $\frac{1}{k}$ probability will have a $\frac{1}{2}$ chance of occurring at least once if it is repeated $k \ln 2$ times.[5]
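The two rules of thumb are easy to evaluate numerically. The sketch below assumes they take the forms n ≈ √(2m·p) for the square approximation and n ≥ 1/2 + √(1/4 + 2m ln 2) for the even-chance formula; the function names are hypothetical.

```python
from math import ceil, log, sqrt


def people_by_square_rule(p: float, m: int = 365) -> float:
    """Square approximation: n ~ sqrt(2 * m * p), reasonable for p <= 1/2."""
    return sqrt(2 * m * p)


def people_for_even_chance(m: int = 365) -> float:
    """Even-chance estimate: n >= 1/2 + sqrt(1/4 + 2 * m * ln 2)."""
    return 0.5 + sqrt(0.25 + 2 * log(2) * m)


print(round(people_by_square_rule(0.5), 2))
print(round(people_for_even_chance(), 6), ceil(people_for_even_chance()))
```

The square rule undershoots (about 19), while the second estimate rounds up to the correct 23.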

Probability table

Number of hashed elements such that the probability of at least one hash collision is ≥ $p$:

hex length  bits ($b$)  hash space size ($2^b$)  p = 10^−18  p = 10^−15  p = 10^−12  p = 10^−9   p = 10^−6   p = 0.001   p = 0.01    p = 0.25    p = 0.50    p = 0.75
8           32          4.3×10^9                 2           2           2           2.9         93          2.9×10^3    9.3×10^3    5.0×10^4    7.7×10^4    1.1×10^5
(10)        (40)        (1.1×10^12)              2           2           2           47          1.5×10^3    4.7×10^4    1.5×10^5    8.0×10^5    1.2×10^6    1.7×10^6
(12)        (48)        (2.8×10^14)              2           2           24          7.5×10^2    2.4×10^4    7.5×10^5    2.4×10^6    1.3×10^7    2.0×10^7    2.8×10^7
16          64          1.8×10^19                6.1         1.9×10^2    6.1×10^3    1.9×10^5    6.1×10^6    1.9×10^8    6.1×10^8    3.3×10^9    5.1×10^9    7.2×10^9
(24)        (96)        (7.9×10^28)              4.0×10^5    1.3×10^7    4.0×10^8    1.3×10^10   4.0×10^11   1.3×10^13   4.0×10^13   2.1×10^14   3.3×10^14   4.7×10^14
32          128         3.4×10^38                2.6×10^10   8.2×10^11   2.6×10^13   8.2×10^14   2.6×10^16   8.3×10^17   2.6×10^18   1.4×10^19   2.2×10^19   3.1×10^19
(48)        (192)       (6.3×10^57)              1.1×10^20   3.5×10^21   1.1×10^23   3.5×10^24   1.1×10^26   3.5×10^27   1.1×10^28   6.0×10^28   9.3×10^28   1.3×10^29
64          256         1.2×10^77                4.8×10^29   1.5×10^31   4.8×10^32   1.5×10^34   4.8×10^35   1.5×10^37   4.8×10^37   2.6×10^38   4.0×10^38   5.7×10^38
(96)        (384)       (3.9×10^115)             8.9×10^48   2.8×10^50   8.9×10^51   2.8×10^53   8.9×10^54   2.8×10^56   8.9×10^56   4.8×10^57   7.4×10^57   1.0×10^58
128         512         1.3×10^154               1.6×10^68   5.2×10^69   1.6×10^71   5.2×10^72   1.6×10^74   5.2×10^75   1.6×10^76   8.8×10^76   1.4×10^77   1.9×10^77

The entries in this table show the number of hashes needed to achieve the given probability of collision (column) given a hash space of a certain size in bits (row). Using the birthday analogy: the "hash space size" resembles the "available days", the "probability of collision" resembles the "probability of shared birthday", and the "required number of hashed elements" resembles the "required number of people in a group". One could also use this chart to determine the minimum hash size required (given upper bounds on the number of hashes and the probability of error), or the probability of collision (for a fixed number of hashes and hash size).

For comparison, $10^{-18}$ to $10^{-15}$ is the uncorrectable bit error rate of a typical hard disk.[6] In theory, 128-bit hash functions, such as MD5, should stay within that range until about $8.2 \times 10^{11}$ documents, even if their possible outputs are many more.
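Entries of this kind can be estimated with a few lines of Python. The sketch below assumes the standard approximation n ≈ √(2 · 2^b · ln(1/(1 − p))) for a b-bit hash, with a floor of 2 because a collision needs at least two elements; the function name and the floor are illustrative assumptions rather than the method used to build the table.

```python
from math import log1p, sqrt


def hashes_for_collision(bits: int, p: float) -> float:
    """Approximate number of hashed elements giving collision probability p
    in a hash space of 2**bits values."""
    space = 2.0 ** bits
    # -log1p(-p) equals ln(1 / (1 - p)) and stays accurate for very small p.
    estimate = sqrt(2.0 * space * -log1p(-p))
    return max(2.0, estimate)


for bits in (32, 64, 128):
    print(bits, f"{hashes_for_collision(bits, 0.5):.2g}")
```

For a 128-bit hash and p = 10^−15 the estimate is about 8.2×10^11, in line with the figure quoted for MD5 above.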

An upper bound on the probability and a lower bound on the number of people

The argument below is adapted from an argument of Paul Halmos.

As stated above, the probability that no two birthdays coincide is

$$1 - p(n) = \bar{p}(n) = \prod_{k=1}^{n-1} \left(1 - \frac{k}{365}\right).$$

As in earlier paragraphs, interest lies in the smallest $n$ such that $p(n) > \frac{1}{2}$; or equivalently, the smallest $n$ such that $\bar{p}(n) < \frac{1}{2}$.

Using the inequality $1 - x < e^{-x}$ in the above expression we replace $1 - \frac{k}{365}$ with $e^{-k/365}$. This yields

$$\bar{p}(n) = \prod_{k=1}^{n-1} \left(1 - \frac{k}{365}\right) < \prod_{k=1}^{n-1} e^{-k/365} = e^{-\frac{n(n-1)}{730}}.$$

Therefore, the expression above is not only an approximation, but also an upper bound of $\bar{p}(n)$. The inequality

$$e^{-\frac{n(n-1)}{730}} < \frac{1}{2}$$

implies $\bar{p}(n) < \frac{1}{2}$. Solving for $n$ gives

$$n^2 - n > 730 \ln 2.$$

Now, $730 \ln 2$ is approximately 505.997, which is barely below 506, the value of $n^2 - n$ attained when $n = 23$. Therefore, 23 people suffice. Incidentally, solving $n^2 - n = 730 \ln 2$ for $n$ gives the approximate formula of Frank H. Mathis cited above.

This derivation only shows that at most 23 people are needed to ensure a birthday match with even chance; it leaves open the possibility that $n = 22$ or fewer could also work.
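The bound can be checked numerically. The sketch below compares the exact no-match probability with the exponential expression e^(−n(n−1)/730) and confirms that 23 is the first n at which the bound drops below one half; the names are illustrative.

```python
from math import exp, log, prod


def p_distinct(n: int, d: int = 365) -> float:
    """Exact probability that n birthdays are all different (n <= d)."""
    return prod(1 - k / d for k in range(1, n))


def upper_bound(n: int, d: int = 365) -> float:
    """Upper bound exp(-n(n-1)/(2d)) obtained from 1 - x < exp(-x)."""
    return exp(-n * (n - 1) / (2 * d))


threshold = 730 * log(2)  # about 505.997
first_n = next(n for n in range(1, 366) if n * n - n > threshold)
print(first_n, round(upper_bound(first_n), 6), round(p_distinct(first_n), 6))
```

Because the bound only goes one way, it shows that 23 suffices but not that 22 fails; the exact value for 22 people has to be computed separately.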

Generalizations

The generalized birthday problem

Given a year with $d$ days, the generalized birthday problem asks for the minimal number $n(d)$ such that, in a set of $n$ randomly chosen people, the probability of a birthday coincidence is at least 50%. In other words, $n(d)$ is the minimal integer $n$ such that

$$1 - \left(1 - \frac{1}{d}\right)\left(1 - \frac{2}{d}\right) \cdots \left(1 - \frac{n-1}{d}\right) \ge \frac{1}{2}.$$

The classical birthday problem thus corresponds to determining $n(365)$. The first 99 values of $n(d)$ are given here:

$d$      1–2  3–5  6–9  10–16  17–23  24–32  33–42  43–54  55–68  69–82  83–99
$n(d)$   2    3    4    5      6      7      8      9      10     11     12
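The values in this table can be reproduced by direct search. The sketch below uses exact rational arithmetic so that the "at least 50%" comparison is not affected by rounding (for d = 2 the probability is exactly one half); the function name `min_people` is an illustrative choice.

```python
from fractions import Fraction


def min_people(d: int) -> int:
    """Smallest n such that n people drawn from d equally likely days
    share a birthday with probability at least 1/2."""
    p_distinct = Fraction(1)  # probability that the first n birthdays differ
    n = 1
    while p_distinct > Fraction(1, 2):
        # Person n + 1 must avoid the n days already taken.
        p_distinct *= Fraction(d - n, d)
        n += 1
    return n


print([min_people(d) for d in (1, 3, 6, 10, 17, 24, 365)])
```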

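A number of bounds and formulas for $n(d)$ have been published.[7] For any $d \ge 1$, the number $n(d)$ satisfies[8]

$$\frac{3 - 2\ln 2}{6} < n(d) - \sqrt{2d \ln 2} \le 9 - \sqrt{86 \ln 2}.$$

These bounds are optimal in the sense that the sequence $n(d) - \sqrt{2d \ln 2}$ gets arbitrarily close to

$$\frac{3 - 2\ln 2}{6} \approx 0.27,$$

while it has

$$9 - \sqrt{86 \ln 2} \approx 1.28$$

as its maximum, taken for $d = 43$.

The bounds are sufficiently tight to give the exact value of $n(d)$ in 99% of all cases, for example $n(365) = 23$. In general, it follows from these bounds that $n(d)$ always equals either

$$\left\lceil \sqrt{2d \ln 2} \right\rceil \quad \text{or} \quad \left\lceil \sqrt{2d \ln 2} \right\rceil + 1,$$

where $\lceil \cdot \rceil$ denotes the ceiling function. The formula

$$n(d) = \left\lceil \sqrt{2d \ln 2} \right\rceil$$

holds for 73% of all integers $d$.[9] The formula

$$n(d) = \left\lceil \sqrt{2d \ln 2} + \frac{3 - 2\ln 2}{6} \right\rceil$$

holds for almost all $d$, i.e., for a set of integers $d$ with asymptotic density 1.[9]

The formula

$$n(d) = \left\lceil \sqrt{2d \ln 2} + \frac{3 - 2\ln 2}{6} + \frac{9 - 4(\ln 2)^2}{72\sqrt{2d \ln 2}} \right\rceil$$

holds for all $d \le 10^{18}$, but it is conjectured that there are infinitely many counterexamples to this formula.[10]

The formula

$$n(d) = \left\lceil \sqrt{2d \ln 2} + \frac{3 - 2\ln 2}{6} + \frac{9 - 4(\ln 2)^2}{72\sqrt{2d \ln 2}} - \frac{2(\ln 2)^2}{135d} \right\rceil$$

holds for all $d \le 10^{18}$, and it is conjectured that this formula holds for all $d$.[10]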

Cast as a collision problem

The birthday problem can be generalized as follows:

Given $n$ random integers drawn from a discrete uniform distribution with range $[1, d]$, what is the probability $p(n; d)$ that at least two numbers are the same? ($d = 365$ gives the usual birthday problem.)[11]

The generic results can be derived using the same arguments given above.

$$p(n; d) = \begin{cases} 1 - \displaystyle\prod_{k=1}^{n-1} \left(1 - \frac{k}{d}\right) & n \le d \\ 1 & n > d \end{cases}$$

$$p(n; d) \approx 1 - e^{-\frac{n(n-1)}{2d}} \approx 1 - \left(\frac{d-1}{d}\right)^{\frac{n(n-1)}{2}}$$

Conversely, if $n(p; d)$ denotes the number of random integers drawn from $[1, d]$ to obtain a probability $p$ that at least two numbers are the same, then

$$n(p; d) \approx \sqrt{2d \cdot \ln\left(\frac{1}{1-p}\right)}.$$

The birthday problem in this more generic sense applies to hash functions: the expected number of $N$-bit hashes that can be generated before getting a collision is not $2^N$, but rather only $2^{N/2}$. This is exploited by birthday attacks on cryptographic hash functions and is the reason why a small number of collisions in a hash table are, for all practical purposes, inevitable.

The theory behind the birthday problem was used by Zoe Schnabel[12] under the name of capture-recapture statistics to estimate the size of fish population in lakes.

Generalization to multiple types

Plot of the probability of at least one shared birthday between at least one man and one woman

The basic problem considers all trials to be of one "type". The birthday problem has been generalized to consider an arbitrary number of types.[13] In the simplest extension there are two types of people, say $m$ men and $n$ women, and the problem becomes characterizing the probability of a shared birthday between at least one man and one woman. (Shared birthdays between two men or two women do not count.) The probability of no shared birthdays here is

$$p_0 = \frac{1}{d^{m+n}} \sum_{i=1}^{m} \sum_{j=1}^{n} S_2(m, i)\, S_2(n, j) \prod_{k=0}^{i+j-1} (d - k)$$

where $d = 365$ and $S_2$ are Stirling numbers of the second kind. Consequently, the desired probability is $1 - p_0$.

This variation of the birthday problem is interesting because there is not a unique solution for the total number of people $m + n$. For example, the usual 50% probability value is realized for both a 32-member group of 16 men and 16 women and a 49-member group of 43 women and 6 men.
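The two example groups can be checked with exact integer arithmetic. The sketch below assumes the no-match probability is the double sum over Stirling numbers of the second kind, p0 = d^−(m+n) Σ_i Σ_j S2(m, i) S2(n, j) d(d − 1)⋯(d − i − j + 1); all function names are hypothetical.

```python
from fractions import Fraction


def stirling2_table(n_max: int) -> list:
    """Table S[n][k] of Stirling numbers of the second kind up to n_max."""
    S = [[0] * (n_max + 1) for _ in range(n_max + 1)]
    S[0][0] = 1
    for n in range(1, n_max + 1):
        for k in range(1, n + 1):
            S[n][k] = k * S[n - 1][k] + S[n - 1][k - 1]
    return S


def falling_factorial(d: int, r: int) -> int:
    """d * (d - 1) * ... * (d - r + 1)."""
    out = 1
    for k in range(r):
        out *= d - k
    return out


def p_cross_match(m: int, n: int, d: int = 365) -> float:
    """Probability that at least one man and one woman share a birthday."""
    S = stirling2_table(max(m, n))
    # Count assignments where men use i days, women use j days, all distinct.
    no_match = sum(
        S[m][i] * S[n][j] * falling_factorial(d, i + j)
        for i in range(1, m + 1)
        for j in range(1, n + 1)
    )
    return float(1 - Fraction(no_match, d ** (m + n)))


print(round(p_cross_match(16, 16), 4), round(p_cross_match(6, 43), 4))
```

Both groups come out close to one half even though their total sizes differ, because what matters is roughly the product of the two group sizes (256 and 258 mixed pairs).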

Other birthday problems

First match

A related question is, as people enter a room one at a time, which one is most likely to be the first to have the same birthday as someone already in the room? That is, for what $n$ is $p(n) - p(n - 1)$ maximum? The answer is 20: if there is a prize for first match, the best position in line is 20th.[citation needed]
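The claim is straightforward to check by brute force. The sketch below evaluates p(n) − p(n − 1), the probability that the nth arrival produces the first match, for every position and returns the maximizer; the function names are illustrative.

```python
from math import prod


def p_shared(n: int, d: int = 365) -> float:
    """Probability that at least two of n people share a birthday."""
    if n > d:
        return 1.0
    return 1.0 - prod((d - k) / d for k in range(n))


def best_first_match_position(d: int = 365) -> int:
    """Position n maximizing p(n) - p(n - 1), i.e. the chance of being first to match."""
    return max(range(2, d + 2), key=lambda n: p_shared(n, d) - p_shared(n - 1, d))


print(best_first_match_position())  # prints 20
```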

Same birthday as you

Comparing $p(n)$ = probability of a birthday match with $q(n)$ = probability of matching your birthday

In the birthday problem, neither of the two people is chosen in advance. By contrast, the probability $q(n)$ that someone in a room of $n$ other people has the same birthday as a particular person (for example, you) is given by

$$q(n) = 1 - \left(\frac{365 - 1}{365}\right)^n$$

and for general $d$ by

$$q(n; d) = 1 - \left(\frac{d - 1}{d}\right)^n.$$

In the standard case of $d = 365$, substituting $n = 23$ gives about 6.1%, which is less than 1 chance in 16. For a greater than 50% chance that one person in a roomful of $n$ people has the same birthday as you, $n$ would need to be at least 253. This number is significantly higher than $\frac{365}{2} = 182.5$: the reason is that it is likely that there are some birthday matches among the other people in the room.

It is not a coincidence that $253 = \frac{23 \times 22}{2}$; a similar approximate pattern can be found using a number of possibilities different from 365, or a target probability different from 50%.[citation needed]
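A short sketch of the "same birthday as you" calculation follows. It takes q(n; d) = 1 − ((d − 1)/d)^n, the chance that at least one of n other people matches one fixed birthday, and searches for the smallest n that pushes it above one half; the function names are illustrative.

```python
def q(n: int, d: int = 365) -> float:
    """Probability that at least one of n other people shares your birthday."""
    return 1.0 - ((d - 1) / d) ** n


def min_others(target: float = 0.5, d: int = 365) -> int:
    """Smallest n with q(n; d) strictly greater than target."""
    n = 0
    while q(n, d) <= target:
        n += 1
    return n


print(round(q(23), 3), min_others())  # prints 0.061 253
```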

Near matches

Another generalization is to ask for the probability of finding at least one pair in a group of $n$ people with birthdays within $k$ calendar days of each other, if there are $m$ equally likely birthdays.[14]

The number of people required so that the probability that some pair will have a birthday separated by $k$ days or fewer will be higher than 50% is given in the following table:

$k$   $n$ for $m = 365$
0     23
1     14
2     11
3     9
4     8
5     8
6     7
7     7

Thus in a group of just seven random people, it is more likely than not that two of them will have a birthday within a week of each other.[14]
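The text above does not state a formula for near matches. The sketch below assumes the closed form usually attributed to Abramson and Moser[14], p(n, k, m) = 1 − (m − nk − 1)! / (m^(n−1) (m − n(k + 1))!); treat both the formula and the function names as assumptions of this example rather than something taken from the article.

```python
from math import prod


def p_near_match(n: int, k: int, m: int = 365) -> float:
    """Probability that some pair among n people has birthdays within k days,
    under the assumed Abramson-Moser closed form."""
    if n * (k + 1) > m:
        return 1.0  # no room to keep everyone more than k days apart
    # (m - n*k - 1)! / (m - n*(k+1))! written as a product of n - 1 factors.
    p_none = prod((m - n * k - 1 - i) / m for i in range(n - 1))
    return 1.0 - p_none


def min_people_near(k: int, m: int = 365) -> int:
    """Smallest n for which a near match is more likely than not."""
    n = 1
    while p_near_match(n, k, m) <= 0.5:
        n += 1
    return n


print([min_people_near(k) for k in range(8)])
```

Under this assumption the search reproduces the column of the table, including 23 for k = 0, where the formula reduces to the ordinary birthday problem.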

Collision counting

The probability that the $k$th integer randomly chosen from $[1, d]$ will repeat at least one previous choice equals $q(k - 1; d)$ above. The expected total number of times a selection will repeat a previous selection as $n$ such integers are chosen equals[15]

$$\sum_{k=1}^{n} q(k - 1; d) = n - d + d\left(\frac{d - 1}{d}\right)^n.$$
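The expected number of repeats can be computed either by summing the per-draw repeat probabilities or in closed form. The sketch below assumes the kth draw repeats an earlier value with probability 1 − ((d − 1)/d)^(k−1), sums that over k = 1, …, n, and compares the result with the geometric-series closed form n − d + d((d − 1)/d)^n; the function names are illustrative.

```python
def expected_repeats_sum(n: int, d: int = 365) -> float:
    """Sum over draws of the probability that draw k repeats an earlier value."""
    return sum(1.0 - ((d - 1) / d) ** (k - 1) for k in range(1, n + 1))


def expected_repeats_closed(n: int, d: int = 365) -> float:
    """Closed form of the same sum: n - d + d * ((d - 1) / d) ** n."""
    return n - d + d * ((d - 1) / d) ** n


print(round(expected_repeats_sum(23), 4), round(expected_repeats_closed(23), 4))
```

For 23 people the expected number of repeated birthdays is about 0.68, which is less than one even though a match is more likely than not.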

Average number of people

In an alternative formulation of the birthday problem, one asks the average number of people required to find a pair with the same birthday. If we consider the probability function Pr[$n$ people have at least one shared birthday], this average is the mean of the distribution, as opposed to the customary formulation, which asks for the median. The problem is relevant to several hashing algorithms analyzed by Donald Knuth in his book The Art of Computer Programming. It may be shown[16][17] that if one samples uniformly, with replacement, from a population of size $M$, the number of trials required for the first repeated sampling of some individual has expected value $\bar{n} = 1 + Q(M)$, where

$$Q(M) = \sum_{k=1}^{M} \frac{M!}{(M - k)!\, M^k}.$$

The function

$$Q(M) = 1 + \frac{M - 1}{M} + \frac{(M - 1)(M - 2)}{M^2} + \cdots + \frac{(M - 1)(M - 2) \cdots 1}{M^{M-1}}$$

has been studied by Srinivasa Ramanujan and has asymptotic expansion:

$$Q(M) \sim \sqrt{\frac{\pi M}{2}} - \frac{1}{3} + \frac{1}{12}\sqrt{\frac{\pi}{2M}} - \frac{4}{135M} + \cdots.$$

With $M = 365$ days in a year, the average number of people required to find a pair with the same birthday is $\bar{n} = 1 + Q(M) \approx 24.61659$, somewhat more than 23, the number required for a 50% chance. In the best case, two people will suffice; at worst, the maximum possible number of $M + 1 = 366$ people is needed; but on average, only 25 people are required.
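The mean can be evaluated directly. The sketch below assumes Q(M) is the sum of M!/((M − k)! M^k) for k = 1, …, M, built up term by term to avoid large factorials, and compares it with the first four terms of the asymptotic expansion √(πM/2) − 1/3 + (1/12)√(π/(2M)) − 4/(135M); the function names are illustrative.

```python
from math import pi, sqrt


def Q(M: int) -> float:
    """Q(M) = sum_{k=1}^{M} M! / ((M - k)! * M**k), accumulated term by term."""
    total, term = 0.0, 1.0
    for k in range(1, M + 1):
        term *= (M - k + 1) / M  # term now equals M! / ((M - k)! * M**k)
        total += term
    return total


def Q_asymptotic(M: int) -> float:
    """First four terms of the asymptotic expansion of Q(M)."""
    return sqrt(pi * M / 2) - 1 / 3 + sqrt(pi / (2 * M)) / 12 - 4 / (135 * M)


print(round(1 + Q(365), 5), round(1 + Q_asymptotic(365), 5))
```

For M = 365 both routes give an expected value of roughly 24.6 people.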

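An analysis using indicator random variables can provide a simpler but approximate analysis of this problem.[18] For each pair $(i, j)$ of the $k$ people in a room, we define the indicator random variable $X_{ij}$, for $1 \le i < j \le k$, by

$$X_{ij} = I\{\text{person } i \text{ and person } j \text{ have the same birthday}\} = \begin{cases} 1, & \text{if person } i \text{ and person } j \text{ have the same birthday;} \\ 0, & \text{otherwise.} \end{cases}$$

$$E[X_{ij}] = \Pr\{\text{person } i \text{ and person } j \text{ have the same birthday}\} = \frac{1}{n},$$

where $n$ is the number of days in the year. Let $X$ be a random variable counting the pairs of individuals with the same birthday.

$$X = \sum_{i=1}^{k} \sum_{j=i+1}^{k} X_{ij}$$

$$E[X] = \sum_{i=1}^{k} \sum_{j=i+1}^{k} E[X_{ij}] = \binom{k}{2} \frac{1}{n} = \frac{k(k-1)}{2n}$$

For $n = 365$, if $k = 28$, the expected number of pairs with the same birthday is $\frac{28 \times 27}{2 \times 365} \approx 1.0356$. Therefore we can expect at least one matching pair with at least 28 people.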
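The indicator-variable estimate reduces to counting pairs. The sketch below computes the expected number of matching pairs, C(k, 2)/n for k people and n days, and finds the smallest k at which that expectation reaches 1; the function names are illustrative.

```python
from math import comb


def expected_pairs(k: int, n: int = 365) -> float:
    """Expected number of pairs sharing a birthday among k people, n days."""
    return comb(k, 2) / n


def min_people_for_one_pair(n: int = 365) -> int:
    """Smallest k whose expected number of matching pairs is at least 1."""
    k = 2
    while expected_pairs(k, n) < 1.0:
        k += 1
    return k


print(min_people_for_one_pair(), round(expected_pairs(28), 4))  # prints 28 1.0356
```

Note that this criterion (expected count of at least one) is different from the 50% probability threshold, which is why it yields 28 rather than 23.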

An informal demonstration of the problem can be made from the list of Prime Ministers of Australia: among the first 29, Paul Keating, the 24th prime minister, and Edmund Barton, the first prime minister, share the same birthday, 18 January.

In the 2014 FIFA World Cup, each of the 32 squads had 23 players. An analysis of the official squad lists suggested that 16 squads had pairs of players sharing birthdays. Of these, 5 squads (Argentina, France, Iran, South Korea and Switzerland) each had two pairs, and 11 squads (Australia, Bosnia and Herzegovina, Brazil, Cameroon, Colombia, Honduras, Netherlands, Nigeria, Russia, Spain and USA) each had one pair.[19]

Voracek, Tran and Formann showed that the majority of people markedly overestimate the number of people that is necessary to achieve a given probability of people having the same birthday, and markedly underestimate the probability of people having the same birthday when a specific sample size is given.[20] Further results were that psychology students and women did better on the task than casino visitors/personnel or men, but were less confident about their estimates.

Reverse problem

The reverse problem is to find, for a fixed probability $p$, the greatest $n$ for which the probability $p(n)$ is smaller than the given $p$, or the smallest $n$ for which the probability $p(n)$ is greater than the given $p$.[citation needed]

Taking the above formula for $d = 365$, one has

$$n(p; 365) \approx \sqrt{730 \ln\left(\frac{1}{1 - p}\right)}.$$

The following table gives some sample calculations.

$p$    $n$                        $n\downarrow$  $p(n\downarrow)$  $n\uparrow$  $p(n\uparrow)$
0.01   0.14178√365 = 2.70864      2              0.00274           3            0.00820*
0.05   0.32029√365 = 6.11916      6              0.04046           7            0.05624
0.1    0.45904√365 = 8.77002      8              0.07434           9            0.09462*
0.2    0.66805√365 = 12.76302     12             0.16702           13           0.19441*
0.3    0.84460√365 = 16.13607     16             0.28360           17           0.31501
0.5    1.17741√365 = 22.49439     22             0.47570           23           0.50730
0.7    1.55176√365 = 29.64625     29             0.68097           30           0.70632
0.8    1.79412√365 = 34.27666     34             0.79532           35           0.81438
0.9    2.14597√365 = 40.99862     40             0.89123           41           0.90315
0.95   2.44775√365 = 46.76414     46             0.94825           47           0.95477
0.99   3.03485√365 = 57.98081     57             0.99012*          58           0.99166

Values marked with an asterisk fall outside the bounds, that is, $p(n\downarrow) > p$ or $p(n\uparrow) < p$, showing that the approximation is not always exact.
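Rows of this table can be regenerated with a few lines of code. The sketch below assumes the estimate n ≈ √(2d · ln(1/(1 − p))) with d = 365, rounds it down and up, and evaluates the exact probability at both integers; the function names are illustrative.

```python
from math import ceil, floor, log, prod, sqrt


def p_shared(n: int, d: int = 365) -> float:
    """Exact probability that at least two of n people share a birthday."""
    if n > d:
        return 1.0
    return 1.0 - prod((d - k) / d for k in range(n))


def n_estimate(p: float, d: int = 365) -> float:
    """Approximate group size giving match probability p."""
    return sqrt(2 * d * log(1 / (1 - p)))


for p in (0.01, 0.5, 0.99):
    n = n_estimate(p)
    lo, hi = floor(n), ceil(n)
    print(f"p={p}: n~{n:.5f}, p({lo})={p_shared(lo):.5f}, p({hi})={p_shared(hi):.5f}")
```

Comparing the exact values with the target p shows directly which rows fail to bracket it.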

Partition problem

A related problem is the partition problem, a variant of the knapsack problem from operations research. Some weights are put on a balance scale; each weight is an integer number of grams randomly chosen between one gram and one million grams (one tonne). The question is whether one can usually (that is, with probability close to 1) transfer the weights between the left and right arms to balance the scale. (In case the sum of all the weights is an odd number of grams, a discrepancy of one gram is allowed.) If there are only two or three weights, the answer is very clearly no; although there are some combinations which work, the majority of randomly selected combinations of three weights do not. If there are very many weights, the answer is clearly yes. The question is, how many are just sufficient? That is, what is the number of weights such that it is equally likely for it to be possible to balance them as it is to be impossible?

Often, people's intuition is that the answer is above 100,000. Most people's intuition is that it is in the thousands or tens of thousands, while others feel it should at least be in the hundreds. The correct answer is 23.[citation needed]

The reason is that the correct comparison is to the number of partitions of the weights into left and right. There are $2^{N-1}$ different partitions for $N$ weights, and the left sum minus the right sum can be thought of as a new random quantity for each partition. The distribution of the sum of weights is approximately Gaussian, with a peak at $500{,}000N$ and width $1{,}000{,}000\sqrt{N}$, so that when $2^{N-1}$ is approximately equal to $1{,}000{,}000\sqrt{N}$ the transition occurs. $2^{23-1}$ is about 4 million, while the width of the distribution is only 5 million.[21]

In fiction

Arthur C. Clarke's novel A Fall of Moondust, published in 1961, contains a section where the main characters, trapped underground for an indefinite amount of time, are celebrating a birthday and find themselves discussing the validity of the birthday problem. As stated by a physicist passenger: "If you have a group of more than twenty-four people, the odds are better than even that two of them have the same birthday." Eventually, out of 22 present, it is revealed that two characters share the same birthday, May 23.

References

  1. Template:Cite journal
  2. W. W. Rouse Ball, 1960, Other Questions on Probability, in "Mathematical Recreations and Essays", Macmillan, New York, p 45.
  3. Template:Cite book
  4. Template:Cite book
  5. Template:Cite journal
  6. Jim Gray, Catharine van Ingen. Empirical Measurements of Disk Failure Rates and Error Rates
  7. Template:Wikicite
  8. Template:Harvard citations
  9. Template:Harvard citations
  10. Template:Harvard citations
  11. Template:Cite conference
  12. Z. E. Schnabel (1938) The Estimation of the Total Fish Population of a Lake, American Mathematical Monthly 45, 348–352.
  13. M. C. Wendl (2003) Collision Probability Between Sets of Random Variables, Statistics and Probability Letters 64(3), 249–254.
  14. M. Abramson and W. O. J. Moser (1970) More Birthday Surprises, American Mathematical Monthly 77, 856–858.
  15. Template:Cite web
  16. Template:Cite book
  17. Template:Cite journal
  18. Template:Cite book
  19. Template:Cite web
  20. Template:Cite journal
  21. Template:Cite journal