We start by reformulating the story. We have n hash tables, the i-th of which has $m_i$ buckets, and we use them to store integers in the range [0, p). The task is to find out how many integers k′ ∈ [0, p) are in collision with k in at least one of the hash tables. Note that the standard notion of collision assumes k ≠ k′, which is not the case in our problem, but this difference would only subtract 1 from the answer.
The hash tables described in the statement are selected from a family of universal hash functions. For these it holds that the number of collisions with k in the i-th hash table is $\lfloor \frac{p}{m_i} \rfloor$ or $\lceil \frac{p}{m_i} \rceil$. In other words, the fraction of integers in collision with k is roughly $\frac{1}{m_i}$. It would be tempting to say that the answer is
$$p \cdot \prod_{i=1}^{n} \left( 1 - \frac{1}{m_i} \right)\,.$$
This would only be true if the parameters of the hash functions were selected independently at random, and we don’t know whether this is the case. A quick inspection of the input files tells us it’s actually the opposite! We need a different approach.
The error tolerance is quite high, which might suggest a probabilistic approach. We can sample an integer k′ uniformly from the range [0, p) and check whether it is in collision with k. If we perform enough experiments, the ratio of observed collisions to all samples converges towards the fraction of actual collisions within the population of integers. Roughly $10^5$ samples for each test case are sufficient to pass the easy subtask.
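To make this concrete, here is a minimal sketch of the sampling approach in Python (whose native big integers will also come in handy later). It assumes the hash functions have the standard universal form $h_i(y) = ((a_i \cdot y + b_i) \bmod p) \bmod m_i$, consistent with the sampling formula derived further below; the names `tables` and `estimate_naive` are ours, and input parsing is omitted.

```python
import random

def collides(y, k, a, b, m, p):
    # y collides with k in a table iff both land in the same bucket
    return (a * y + b) % p % m == (a * k + b) % p % m

def estimate_naive(k, tables, p, samples=10**5):
    # Monte Carlo estimate of |{k' in [0, p) : k' collides with k
    # in at least one table}|; tables is a list of (a, b, m) triples.
    hits = 0
    for _ in range(samples):
        y = random.randrange(p)  # uniform sample from [0, p)
        if any(collides(y, k, a, b, m, p) for a, b, m in tables):
            hits += 1
    return hits * p // samples  # scale the hit ratio back to the population
```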
In the hard subtask, things are a bit more complicated. Firstly, we obviously need to use arbitrary-precision integers. If we overcome this hurdle and run the code for quite some time on the hard subtask, we realise that something is not quite right – the output for the majority of test cases seems to be 0. This is definitely not a correct answer, as there is always at least one collision (namely k itself). Is our approach flawed? Or should we simply use more samples?
It turns out that the answer to both questions is yes. The approach is valid (after all it worked on the easy subtask) – the expected number of collisions that we see is proportional to the actual number of collisions, i.e., the quantity we compute is an unbiased estimator of the answer. The problem is, however, with its concentration! Let $p \approx 10^{100}$ and the actual number of collisions be $m \approx 10^{50}$. Let’s be generous and sample $10^{10}$ integers. How many collisions are we expected to see? About $10^{-40}$. That’s why we don’t see any!
The Chernoff bound tells us that the probability that the sum X of N independent Bernoulli random variables, each with success probability q, is not within a 1 ± δ factor of its expectation is at most:
$$\text{Pr}\left[\left|X-\mu\right| \geq \delta \mu \right] \leq 2e^{-\frac{\delta^2 \mu}{3}}$$
To get a reasonable success probability for our number of test cases, we need at least $\delta^2 \mu \geq 20$. Note that μ = N ⋅ q, where N is the number of samples and q is the per-sample probability of hitting a collision. In the case described above, we have $\delta^2 \cdot N \cdot q = 9 \cdot 10^{-44} \cdot N$. Thus, we would need more than $10^{40}$ experiments, which is clearly way too much. Since δ is fixed, we have to increase q somehow. We can use the following strategy.
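Rearranging $\delta^2 \mu \geq 20$ with μ = N ⋅ q gives the required number of samples directly; a tiny helper (ours, for illustration):

```python
import math

def required_samples(delta, q, target=20):
    # Smallest N such that delta^2 * (N * q) >= target, where q is the
    # per-sample probability of hitting a collision.
    return math.ceil(target / (delta ** 2 * q))
```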
Imagine a huge table, where each row corresponds to a hash table and each column is an integer between 0 and p − 1. In the cell (i, j) we write Y if j is in collision with k in the i-th hash table, and N otherwise. The answer we are looking for is the number of columns with at least one Y. As we know, these might be very rare.
Note that some columns may contain more than one Y. In particular, the column corresponding to the integer k contains n Ys. Now comes a simple yet subtle observation – the number of columns with at least one Y equals the number of top-most Ys (i.e. Ys that have no other Y above them in their respective column).
Given a cell (i, j), we can easily determine whether it is a top-most Y – we just check whether j collides with k in any hash table with a lower index than i.
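In code, reusing the `collides` helper from the earlier sketch, the check might look as follows:

```python
def is_topmost(i, j, k, tables, p):
    # (i, j) is a top-most Y iff j collides with k in table i
    # and in no table with a smaller index.
    if not collides(j, k, *tables[i], p):
        return False
    return not any(collides(j, k, a, b, m, p) for a, b, m in tables[:i])
```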
This gives us a new avenue of attack for this problem. We know the total number of Y cells, because we can exactly compute the number of collisions for each hash table. If we find a way to uniformly sample them, we can look at each sampled cell and verify whether it’s a top-most Y cell. This will then give us an unbiased estimator of the total number of top-most Y cells, which is precisely the answer we seek.
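Under the assumed hash form, the exact per-table counts are easy to compute: the values $(a_i \cdot y + b_i) \bmod p$ that fall into k’s bucket are exactly $x, x + m_i, x + 2m_i, \ldots$ below p, and each corresponds to a unique y because $a_i$ is invertible. A sketch:

```python
def collision_count(k, a, b, m, p):
    # Number of y in [0, p) that land in the same bucket as k:
    # one y for each value x, x + m, x + 2m, ... below p.
    x = (a * k + b) % p % m  # k's bucket in this table
    return (p - 1 - x) // m + 1
```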
This leaves us with just one question: how can we uniformly sample all Y cells? This is easy: all collisions in table i are of the form $a_i^{-1} \cdot (x + t \cdot m_i - b_i) \bmod p$, where x is the bucket to which k belongs in this hash table, $a_i^{-1}$ is the multiplicative inverse of $a_i$ modulo p (recall that p is prime), and t is an integer between 0 and (the number of collisions in this particular hash table, minus 1).
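This yields a uniform sampler over all Y cells: pick a table with probability proportional to its collision count, then pick t uniformly within it. A sketch using exact big-integer arithmetic throughout (the helper names are ours):

```python
import bisect
import itertools
import random

def make_sampler(k, tables, p):
    # A Y cell is addressed by a single index r in [0, total),
    # mapped to a table i and an offset t within that table.
    counts = [collision_count(k, a, b, m, p) for a, b, m in tables]
    cum = list(itertools.accumulate(counts))

    def sample():
        r = random.randrange(cum[-1])      # exact even for huge counts
        i = bisect.bisect_right(cum, r)    # table owning the r-th Y cell
        a, b, m = tables[i]
        x = (a * k + b) % p % m            # k's bucket in table i
        t = r - (cum[i] - counts[i])       # offset of the cell within table i
        # pow(a, -1, p) is the modular inverse of a (Python 3.8+)
        return i, pow(a, -1, p) * (x + t * m - b) % p

    return sample
```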
Finally, the probability that a randomly generated Y cell is a top-most one is at least $\frac{1}{n}$, because the table has only n rows. Coming back to our Chernoff bound, we have $\delta^2 \cdot N \cdot \frac{1}{n} \geq 9 \cdot 10^{-6} \cdot N$, so roughly $10^6$ samples should give us a sufficient precision.
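Putting the pieces together, the final estimator samples Y cells, measures the fraction that are top-most, and scales by the exact total; a sketch built from the helpers above:

```python
def estimate(k, tables, p, samples=10**6):
    total = sum(collision_count(k, a, b, m, p) for a, b, m in tables)
    sample = make_sampler(k, tables, p)
    # The fraction of sampled Y cells that are top-most, scaled by the
    # exact number of Y cells, estimates the number of distinct
    # integers colliding with k in at least one table.
    topmost = sum(is_topmost(*sample(), k, tables, p) for _ in range(samples))
    return topmost * total // samples
```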