IPSC Logo

Internet Problem Solving Contest

IPSC 2017

Solution to Problem C – Collector's dilemma

Expressing the probability

At a first glance this is a relatively simple task on Bayesian probability. Let X denote the event that we obtain the given multiset of emojis. We are asked to calculate the conditional probability P(d∈[a,b]|X).

Using some simple algebra and Bayes’ theorem, we can rewrite this probability as follows:


$$ P\left( d \in [a,b] \middle| X \right) = \sum_{q=a}^b P\left( d = q \middle| X \right) = \dfrac{\sum_{q=a}^b P\left( X | d = q\right) \cdot P\left(d = q\right)}{P(X)} = \dfrac{\sum_{q=a}^b P\left( X | d = q\right) \cdot P\left(d = q\right)}{\sum_{q=1}^\infty P\left( X | d = q\right) \cdot P\left(d = q\right)}$$

How to find the probability of observing the sample, given the size q of the set from which we’re sampling the emojis? We can use a combinatorial argument of counting the number of ordered samples that match the observation. Below, let s = ∑xi.

First of all, we have to choose which specific emoji types to use. There are q ways of choosing the emoji type for x1, q − 1 ways for x2, and so on. Once we do so, there are
$$C_x := C(x_1, x_2, \dots, x_N) = \dfrac{s!}{\prod x_i!}$$
ways of ordering the individual emojis. However, if we just multiply these values, we are counting some orders multiple times. The formula would be correct if all xi were distinct. However, if we have groups of identical xi, we are counting each sequence multiple times. To correct for this, we have to divide the count by Dx = ∏yj!, where yj is the number of xi that are equal to j.

Thus, the number of sequences of receiving s emojis in such a way that we get the given counts xi can be expressed as $q(q-1)\cdots (q-n+1)\cdot C_x / D_x = q^{\underline{n}} \cdot C_x / D_x$.

The number of all sequences of receiving s emojis is simply qs. Thus, the probability of observing the event X for a given number q of emoji types is


$$P\left(X \middle| d = q \right ) = \dfrac{q^{\underline{n}} \cdot C_x}{q^s \cdot D_x}$$

Easy subproblem

In the easy subproblem we have P(d=q) = 1/m for each q between 1 and m, inclusive. Observing that neither this probability nor the values Cx, Dx depend on q we can cancel out these terms. This leaves us with the following simplified formula:


$$ P\left( d \in [a,b] \middle| X \right) = \dfrac{\sum_{q=a}^b \left( q^{\underline{n}} ~/~ q^s \right) }{\sum_{q=1}^m \left( q^{\underline{n}} ~/~ q^s \right) }$$

Since the limits are small, we can simply evaluate the expression.

Hard subproblem

In the hard subproblem there are a few complications. First, the probability distribution is different. If we unravel the definition, we’ll see that it tells us that the probability of having d = q is proportional to 1/q2. And as q ≥ 11/q2 = π2/6, it must be the case that P(d = q)=6/(πq)2.

How does that affect the probability formula? As it turns out, not much. The constants again cancel each other out, and the q2 just increases the exponent of q in the denominator (the formula will contain qs + 2 instead of qs).

A bigger complication is the fact that now the sum in the denominator has infinitely many terms. Can we perhaps find a closed-form solution of the sum?

We know that P(X|d=q) ⋅ P(d=q) will now simplify to the fraction $\left( q^{\underline{n}} ~/~ q^{s+2} \right)$. The numerator is a polynomial of order n. Its coefficients are the signed Stirling numbers of the first kind: Sn, k. Thus, we can rewrite this fraction as follows:


$$ \dfrac{ q^{\underline{n}} }{ q^{s+2} } = \sum_{k=1}^{n} \dfrac{ S_{n,k} }{ q^{s+2-k} } $$

We can easily see that each of the fractions on the right hand side has the form “constant divided by q to the power 2 or more”, from which we can easily show that if we take an infinite sum of all these fractions over all q, the sum will be absolutely convergent. Hence, we may rearrange the terms arbitrarily. In particular, we may group terms with the same Stirling number together. We get the following sum:


$$ \sum_{q=1}^{\infty} \dfrac{ q^{\underline{n}} }{ q^{s+2} } = \sum_{k=1}^n S_{n,k}\cdot \sum_{q=1}^{\infty} \dfrac{1}{ q^{s+2-k} } = \sum_{k=1}^n S_{n,k}\cdot \zeta(s+2-k) $$

It turns out that we cannot sum this analytically, as we don’t know closed form solutions for Riemann zeta for odd positive integers. But these values are known to many decimal places – for instance the ζ(3) value (Apéry’s constant) is evaluated to more than 1011 decimal digits. That’s certainly much more than we’ll need. We also know algorithms that can compute these values with an arbitrary precision.

The values we work with are rather large. For example, Sn, 1 = (−1)n − 1(n − 1)!. This is a nightmare for numerical stability, and the final division only makes matters worse. We absolutely need more precision than standard floating-point numbers can offer. There are some tricks that we can employ to alleviate this issue, such as substracting 1 from each zeta value (which does not change the solution), giving us the ability to represent the values for large k (which are otherwise very close to 1), reordering the terms in the sum, but this is very hard to get right. A better solution is to use some implementation of big decimals, and to recompute the results with increasing precision until they become numerically stable. For example, we have a solution in Python using mpmath and a solution in Java that uses the built-in BigDecimal data type.

Solution summary

  1. Obtain the values of the Riemann zeta function for small positive integers.
  2. Use those to evaluate the denominator to several hundreds of bits of precision.
  3. Evaluate the numerator as in the easy subproblem – summing it one term at a time.
  4. Divide the two values to get the answer.