A deep dive into the meaning of the two bounds and into the remarkable series of happenings that led to their discovery
It’s not often that the universe tells you that something simply cannot be done. It doesn’t matter how smart you are, how lavishly bank-rolled you are, or what corner of the universe you call your own. When the universe says “Not possible”, there are no two ways about it. In the sciences, such impossibilities are often expressed as limits on the value of some quantity. A famous example is Albert Einstein’s 1905 discovery that when you let a photon loose in the vacuum of space, there is nothing — unquestionably nothing — that can overtake it. Hundreds of such limits or bounds have been discovered and proved. Taken together, they form a fence around the nature of reality.
The Markov and the Bienaymé–Chebyshev inequalities are two such bounds that have profoundly shaped our appreciation of the limits that nature places on how frequently random events can occur.
The discovery and proof of Markov’s inequality is credited to the brilliant and passionately principled Russian mathematician Andrei Andreyevich Markov (1856–1922).
The credit for the Bienaymé–Chebyshev inequality goes to two people: Markov’s teacher and a giant of probability theory — the redoubtable Pafnuty Lvovich Chebyshev (1821–1894) — and Chebyshev’s French colleague and friend Irénée-Jules Bienaymé (1796–1878).
There is such remarkable history attached to the discovery of these inequalities — especially the Bienaymé–Chebyshev inequality — that it will be dismally inadequate to simply tumble out the mathematics without touching upon the characters and the stories that produced it. I’ll try to bring out these backstories. And in doing so, I’ll set the context for explaining the math underlying the inequalities.
I’ll begin with Markov’s inequality, then show how the Bienaymé–Chebyshev inequality arises from a few simple variable substitutions in Markov’s inequality. For bonus pleasure, we’ll claim our grand prize — the proof of the Weak Law of Large Numbers (WLLN) — by showing how the WLLN emerges, almost effortlessly, from another set of variable substitutions, this time in the Bienaymé–Chebyshev inequality.
The name Markov brings to mind “Markov Chains”, “Markov Processes”, and “Markov Models”. Strictly speaking, the Markov chain is what A. A. Markov created. But Markov’s contributions to math went well beyond Markov chains and probability theory. A prolific researcher, Markov published over 120 papers covering a sweeping expanse of ideas in number theory, continued fractions, calculus and statistics. Incidentally, Markov published mostly in Russian-language journals, in contrast to his doctoral advisor P. L. Chebyshev, who published heavily in West European, especially French, publications.
In the year 1900, a period during which Markov was quite likely at the peak of his career, he published a seminal book on probability titled “Ischislenie Veroiatnostei” (translation: Calculus of Probabilities).
The book went through 4 editions and a German language edition. Markov deliberately published the 3rd edition in 1913 to commemorate the 200th anniversary of the Weak Law of Large Numbers (WLLN). A great deal of material in the 3rd edition is devoted to the WLLN. But tucked away in a lemma is Markov’s proof of a law that turned out to be so central to the field of statistical science that it is often used as the starting point for a proof of the WLLN itself.
What Markov showed was the following:
Imagine any non-negative random variable X. X could represent something mundane like your morning wake-up time. Or something immense like the number of stars in a galaxy. X can be discrete or continuous. X can have any kind of probability distribution. In short, X can represent any non-negative random phenomenon. Now pick a value — any value — in the range of X. Let’s denote this value as ‘a’. Markov showed that nature imposes an upper bound on the probability of observing a value of X that is greater than or equal to your chosen value ‘a’. And this upper bound shrinks as ‘a’ grows. The larger your chosen value ‘a’, the lower the probability of observing a value of X that meets or exceeds it. In other words, nature abhors outliers.
To illustrate, take a look at the following plot. It shows the frequency distribution of the per capita personal income of counties in the top-20 richest states of the United States.
Here, the random variable X is the per capita income of a randomly chosen county.
Now let’s work with some threshold ‘a’ for the per capita income. In the following image panel, the red regions represent X ≥ a, for a = $50,000, $70,000, and $80,000.
The probability P(X ≥ a) is the ratio of the area of the red region to the total area under the histogram. It’s easy to see that this probability P(X ≥ a) diminishes as ‘a’ grows. Markov’s theorem imposes an upper bound on this probability that is inversely related to the value of ‘a’. And this relationship holds true regardless of the distribution of X.
But that’s not all that Markov showed.
As part of the same inequality, Markov also showed that the mean value of X directly influences the upper bound on the probability of observing X ≥ a. The larger the mean value of X, the higher the upper bound on this probability, and vice versa. In other words, as the probability mass of X shifts toward the upper end of the range of X, the upper limit of P(X ≥ a) also increases. Conversely, if the probability mass of X shifts toward the lower end, making it ‘bottom heavy’, the probability of observing a large value of X decreases.
Some of this might sound like everyday common sense, but Markov’s brilliance lies in establishing a mathematically precise relationship between ‘a’, P(X ≥ a), and the mean (a.k.a. expected value) of X, denoted as E(X). He showed that:
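$$P(X \ge a) \;\le\; \frac{E(X)}{a}$$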
Proof of Markov’s Inequality
There are a number of ways to prove Markov’s Inequality. I’ll describe a simple technique that works irrespective of whether X is discrete or continuous. X just needs to be non-negative.
As before, we work with some threshold value ‘a’ that you are interested in.
Now let’s define a random variable I such that I = 0 when 0 ≤ X < a, and I = 1 when X ≥ a. In statistical parlance, I is called an indicator variable.
Consider the case when X ≥ a. Multiplying both sides by I:
XI ≥ aI
When X ≥ a, I = 1. So XI = X.
And therefore,
X ≥ aI when I = 1 (Let’s remember this result).
Since X is non-negative, 0 ≤ X, and for some positive ‘a’, X can be either smaller than ‘a’, or greater or equal to ‘a’. We’ve already considered the case of X greater than or equal to a. So let’s consider the case where 0 ≤ X < a.
By definition of I, when X < a, I = 0.
Therefore, aI = a·0 = 0
Since X is assumed to be non-negative, i.e. X ≥ 0, and aI = 0, we again have X ≥ aI.
Thus, whether I = 1 or I = 0, aI ≤ X.
Let’s apply the expectation operator E(.) on both sides of this inequality:
E(aI) ≤ E(X)
Taking the constant ‘a’ out:
aE(I) ≤ E(X)
Let’s work out E(I). The random variable I can take only two values: 0 and 1, corresponding to X < a and X ≥ a respectively. The probabilities of these two events are P(X < a) and P(X ≥ a). So,
E(I) = 0·P(X < a) + 1·P(X ≥ a) = P(X ≥ a)
Substituting this result back into aE(I) ≤ E(X), we have:
aP(X ≥ a) ≤ E(X)
And thus:
P(X ≥ a) ≤ E(X)/a, which is the inequality that Markov proved.
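A quick way to get a feel for the bound is to simulate it. Here is a minimal sketch in Python using NumPy; the exponential distribution and the thresholds are purely illustrative choices, not anything mandated by the inequality:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a non-negative random variable X.
# The exponential distribution (mean = 2.0) is an arbitrary illustrative choice.
X = rng.exponential(scale=2.0, size=1_000_000)

for a in [1.0, 2.0, 5.0, 10.0]:
    empirical = np.mean(X >= a)   # estimate of P(X >= a) from the sample
    bound = X.mean() / a          # Markov's upper bound E(X)/a
    print(f"a = {a:5.1f}   P(X >= a) ~ {empirical:.4f}   E(X)/a = {bound:.4f}")
```

For every threshold, the observed frequency of X ≥ a stays at or below E(X)/a, exactly as the inequality guarantees.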
The Bienaymé–Chebyshev inequality states that the probability of observing a value that is at least ‘a’ units away from the mean of a random variable is bounded in a manner similar to Markov’s inequality. In other words, nature imposes an upper bound on the probability P(|X − E(X)| ≥ a). This upper bound is inversely proportional to a² and directly proportional to how dispersed X is around its mean, in other words, to the variance of X. Notation-wise, the Bienaymé–Chebyshev inequality is expressed as follows:
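$$P\big(|X - E(X)| \ge a\big) \;\le\; \frac{\mathrm{Var}(X)}{a^2}$$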
Just as with Markov’s inequality, the magnificence of the Bienaymé–Chebyshev inequality lies in its making no assumption whatsoever about the probability distribution of X. X can be normally distributed, exponentially distributed or gamma distributed. X can be distributed in the shape of a cow’s shadow for all it cares. The Bienaymé–Chebyshev probability bound still holds rock solid.
A brief detour into the history of the Bienaymé–Chebyshev inequality
Compelling history feeds into the discovery of the Bienaymé–Chebyshev inequality. For starters, there is a reason Jules Bienaymé’s name ought to, and does, precede Chebyshev’s name in this inequality.
In 1853, the French mathematician Irénée-Jules Bienaymé published what came to be one of the most important papers in the proceedings of the French Academy of Sciences. Bienaymé’s paper was ostensibly about his treatment of Laplace’s least squares method. However, as part of that work, he ended up stating and proving the Bienaymé–Chebyshev inequality (which at the time could only have been the Bienaymé inequality, as Chebyshev was nowhere in the picture). But Bienaymé, true to his modest nature, and on account of his attention being wholly devoted to Laplacian least squares, failed to adequately call out the significance of his discovery, and it went basically unnoticed. And so might have languished one of the most important results in probability, had Pafnuty Lvovich Chebyshev not been born with an atrophied leg.
On an early summer’s day in 1821, when the 25-year-old Bienaymé was still finding his feet as a civil servant in France’s Ministry of Finances, Pafnuty Lvovich Chebyshev was born in a village 100 miles south of St. Petersburg in imperial Russia. Chebyshev was one of nine children, and from an early age he showed exceptional aptitude in both mechanics and mathematics. Chebyshev’s father was an army officer who had fought off Napoleon in the latter’s deeply disastrous (for Napoleon, obviously) advance on Russia in 1812. In one of those curiously amusing ironies of history, just two years later, in the messy aftermath of Napoleon’s retreat, Jules Bienaymé would help Napoleon fight off Russian, Austrian and Prussian forces advancing into Paris. Napoleon of course thoroughly failed to protect Paris and instead got himself exiled to Elba.
All of this history was to play out well before Pafnuty Lvovich’s birth. But given his military lineage and family tradition, had it not been for his congenitally atrophied leg, P. L. Chebyshev may very well have followed some of his siblings into the Tsar’s military, and the history of probability would have taken an altogether different turn. Still, Chebyshev’s initiation into mathematics and later into Russian academia weren’t the only catalysts for his introduction to Bienaymé, or, for that matter, for his championing of the latter’s contribution to the Bienaymé–Chebyshev inequality.
As a child, Chebyshev was home-schooled in French. Early in his career, he appears to have realized that if he wanted his work to be read outside his native land, he would have to become known in the 19th century’s global capital of mathematical research, which was Paris.
At every opportunity, Chebyshev traveled to France and other Western European capitals, and he published almost half of his 80 papers in Western European journals. Many of them appeared in the Journal des Mathématiques Pures et Appliquées (Journal of Pure and Applied Mathematics) edited by the French mathematician Joseph Liouville. It was during his 1852 European tour that Chebyshev was introduced to Bienaymé, the start of a mutually beneficial friendship that gave Chebyshev access to many European scientists and publishers, and that later gave Bienaymé’s own work in mathematics the publicity it deserved in major French and Russian journals.
A pivotal work, of course, was Bienaymé’s discovery in 1853 of the inequality which bears his name. Which brings us back to the study of this inequality.
What Bienaymé actually proved in his 1853 paper was the following:
Suppose you were to draw a random sample of size N from a population of values whose mean and variance are respectively μ and σ². Let X_bar be the mean of your random sample. Incidentally, it can be shown that the sample mean X_bar is itself a random variable with its own expected value and variance being respectively μ and σ²/N. If that leaves you scratching your head, rest easy. Soon enough, I’ll show how to derive the expected value and variance of the sample mean. Meanwhile, getting back to the topic at hand, what Bienaymé showed was the following:
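In modern notation, and for any positive number k:

$$P\big(|\bar{X} - \mu| \ge k\sigma\big) \;\le\; \frac{1}{Nk^2}$$

(We will re-derive this bound as equation (4a) a little later.)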
By now, you may be wondering: exactly when does Chebyshev enter the merry orbit of Bienaymé’s discoveries in a way that gets his name attached to this inequality?
It so happens that fourteen years after Bienaymé publishes his inequality, Chebyshev, completely oblivious of Bienaymé’s discovery, publishes a different version of this inequality in the 1867 issue of Joseph Liouville’s journal. Bear in mind, this is the pre-Google, pre-CiteSeer, pre-telephone era. So to say that scientists of the time weren’t fully aware of “previous work” barely hints at the scale of the problem.
At any rate, Chebyshev’s version of the inequality applies to the mean X_bar of N independent but not necessarily identically distributed random variables X1, X2, X3, …, XN. Liouville, who seems to be aware of Bienaymé’s work, astutely republishes Bienaymé’s 1853 paper in the 1867 issue of his journal, positioning it just before Chebyshev’s paper!
To his credit, Chebyshev, in a paper he published in 1874, attributed the discovery of this inequality entirely to Bienaymé:
“The simple and rigorous demonstration of Bernoulli’s law to be found in my note entitled: Des valeurs moyennes, is only one of the results easily deduced from the method of M. Bienaymé, which led him, himself, to demonstrate a theorem on probabilities, from which Bernoulli’s law follows immediately”
In the years that followed, what became known as Chebyshev’s inequality (or, more accurately, the Bienaymé–Chebyshev inequality) was a version that applies simply to any random variable X having an expected value E(X) and a finite variance Var(X).
The Bienaymé–Chebyshev inequality states that for any positive ‘a’, the probability P(|X − E(X)| ≥ a) is bounded as follows:
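$$P\big(|X - E(X)| \ge a\big) \;\le\; \frac{\mathrm{Var}(X)}{a^2}$$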
Proof of the Bienaymé–Chebyshev inequality
The inequality that Markov proved (and which bears his name) in the 1913 edition of his book Calculus of Probabilities is often used to prove the Bienaymé–Chebyshev inequality. Using Markov’s inequality as a starting point, it is a walk in the park to prove the result. We prove it as follows:
Let’s consider a random variable X with mean E(X). Now, let’s define another random variable Z = (X − E(X))². The square term ensures that Z is non-negative, allowing us to apply Markov’s inequality to Z. Let’s assume a threshold value of Z which we’ll call a². The probability of an observed value of Z meeting or exceeding a² is P(Z ≥ a²). Applying Markov’s inequality to Z and a² yields the following:
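$$P(Z \ge a^2) \;\le\; \frac{E(Z)}{a^2} \quad\quad \text{(1)}$$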
Starting with the above expression, we can derive the Bienaymé-Chebyshev inequality as follows:
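Substituting Z = (X − E(X))² back in, and noting that E(Z) = E[(X − E(X))²] is, by definition, the variance of X:

$$P\big((X - E(X))^2 \ge a^2\big) \;\le\; \frac{\mathrm{Var}(X)}{a^2} \quad\quad \text{(2)}$$

For a positive ‘a’, the event (X − E(X))² ≥ a² occurs exactly when |X − E(X)| ≥ a. Therefore:

$$P\big(|X - E(X)| \ge a\big) \;\le\; \frac{\mathrm{Var}(X)}{a^2} \quad\quad \text{(3)}$$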
Equation (3) is the Bienaymé–Chebyshev inequality (or simply Chebyshev’s inequality).
Instead of working with an arbitrary threshold ‘a’, it’s useful to express ‘a’ in terms of the standard deviation σ of X as follows:
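Setting a = kσ (so that a² = k²σ²), and recalling that Var(X) = σ²:

$$P\big(|X - E(X)| \ge k\sigma\big) \;\le\; \frac{\sigma^2}{k^2\sigma^2} \;=\; \frac{1}{k^2} \quad\quad \text{(4)}$$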
The above proof also opens a direct path to Bienaymé’s original result shown in his 1853 publication.
Starting with equation (2), and substituting X with the sample mean X_bar and a² with k²σ², we arrive at Bienaymé’s circa 1853 result as follows:
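$$P\big(|\bar{X} - \mu| \ge k\sigma\big) \;\le\; \frac{\mathrm{Var}(\bar{X})}{k^2\sigma^2} \;=\; \frac{\sigma^2/N}{k^2\sigma^2} \;=\; \frac{1}{Nk^2} \quad\quad \text{(4a)}$$

Here we have used the facts that E(X_bar) = μ and Var(X_bar) = σ²/N, which we will derive shortly as part of the proof of the WLLN.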
Equations (4) and (4a) present an intriguing result. They say that the probability of encountering an observation that is at least k standard deviations away from the mean is bounded from above, and this upper bound is inversely proportional to k².
Put another way, one is strikingly unlikely to run into values that are several standard deviations away from the mean value.
When expressed this way, the Bienaymé–Chebyshev inequality puts a mathematical face on aphorisms such as “If it sounds too good to be true, it probably is”, or, on the all-time favorite of scientists: “Extraordinary claims require extraordinary evidence”.
To illustrate the operation of this inequality, consider the following data set of average daily temperature recorded in the Chicago area on January 1 of each year. There are 100 observations spanning from 1924 to 2023:
The black dotted horizontal line toward the center of the plot depicts the sample mean of 24.98 F. The colored horizontal lines depict temperature values at plus/minus 1.25, 1.5, 1.75 and 2 times the standard deviation of the data sample. These standard deviation lines give us a feel for the bounds within which most of the temperatures are likely to lie.
Applying the Bienaymé–Chebyshev inequality, we can determine the upper bound on the probability P(|X − E(X)| ≥ kσ), where X represents the observed average temperature on January 1 in a randomly selected year, E(X) = 24.98 F, σ = 10.67682 F, and k = 1, 1.25, 1.5, 1.75 and 2.0. The following table lists these probability bounds in the 1/k² column:
The last column of the table shows the actual probabilities of observing such deviations in the data sample. These observed frequencies sit comfortably within the probability bounds generated by the Bienaymé–Chebyshev inequality.
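Here is a small Python sketch of how such a table can be computed. The file name and column name below are hypothetical placeholders for wherever the January 1 temperature series happens to be stored:

```python
import numpy as np
import pandas as pd

# Hypothetical file and column names; substitute your own data source.
temps = pd.read_csv('chicago_jan1_temps.csv')['avg_temp'].to_numpy()

mean = temps.mean()
sigma = temps.std()

for k in [1.0, 1.25, 1.5, 1.75, 2.0]:
    chebyshev_bound = 1.0 / k**2                            # upper bound on P(|X - E(X)| >= k*sigma)
    observed = np.mean(np.abs(temps - mean) >= k * sigma)   # fraction of years actually that far from the mean
    print(f"k = {k:4.2f}   1/k^2 = {chebyshev_bound:.4f}   observed = {observed:.4f}")
```

For k = 1 the bound is the trivial 100%, and for larger k the observed fractions should sit below 1/k², just as the table shows.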
You may have noticed that the probability bounds generated by the Bienaymé–Chebyshev inequality are quite wide. For example, when k=1 (which corresponds to an event lying at least 1 standard deviation away from the mean), the inequality puts the upper bound on the probability at 1/1² = 1.0, or 100%. This renders this specific bound practically useless.
Nevertheless, for all values of k > 1, the inequality is quite useful. Its usefulness also lies in its not assuming any particular shape for the distribution of the random variable. In fact, it goes even further in its applicability: while Markov’s inequality requires the random phenomenon to produce strictly non-negative observations, the Bienaymé–Chebyshev inequality makes no such assumption about X.
The Bienaymé–Chebyshev inequality also gives us a drop-dead simple proof for the Weak Law of Large Numbers. In fact, in 1913, Markov used this inequality to demonstrate the proof of the WLLN in his book on probability theory, and it is essentially the same proof used by many textbooks today.
Suppose you gather a random sample from a theoretically infinitely large population. Let the sample size be N. This random sample has a sample mean X_bar. Since you are dealing with just a sample and not the entire population, your sample mean is likely to be some distance away from the true population mean μ. This is the error in your sample mean. You could express the absolute value of this error as |X_bar − μ|.
The WLLN says that for any positive tolerance ϵ of your choosing, the probability of the error in your sample mean being greater than ϵ will shrink to zero as the sample size N grows to infinity. It doesn’t matter how small a tolerance ϵ you choose. P(|X_bar − μ| ≥ ϵ) will approach zero as the sample size N approaches infinity.
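In symbols, for every ϵ > 0:

$$\lim_{N \to \infty} P\big(|\bar{X} - \mu| \ge \epsilon\big) = 0$$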
The WLLN has a rich history of discovery that runs back more than three centuries, with the who’s who of mathematics — starting with Jacob Bernoulli in 1713 and including such giants as De Moivre, Laplace, Lacroix, Poisson, and our friends Bienaymé and Chebyshev — contributing to its development. And thanks to the Bienaymé–Chebyshev inequality, the proof of the WLLN flows with the ease of water running down a hillside.
Proof of the Weak Law of Large Numbers
As with so many things in statistics, we begin the proof by drawing a random sample of size N from a population. Let’s denote this sample as X1, X2, X3, …, XN. It’s useful to think of X1, X2, X3, …, XN as a set of N variables, like a set of N slots, each of which gets filled with a randomly selected value from the population whenever a sample is drawn. Thus X1, X2, X3, …, XN are themselves random variables. Furthermore, since each of X1, X2, X3, …, XN acquires a random value independently of the others, but all from the same population, they are independent, identically distributed (i.i.d.) random variables.
For any given randomly selected sample, the sample mean X_bar can be calculated as follows:
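$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{1}{N}\sum_{i=1}^{N} X_i$$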
Since drawing another random sample will yield a different sample mean, and drawing a third sample will yield yet another sample mean, and so on, the sample mean X_bar is itself a random variable with its own mean and variance. Let’s calculate the mean of X_bar.
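Using the linearity of expectation, and the fact that each X_i has the same mean μ as the population it is drawn from:

$$E(\bar{X}) = E\!\left(\frac{1}{N}\sum_{i=1}^{N} X_i\right) = \frac{1}{N}\sum_{i=1}^{N} E(X_i) = \frac{1}{N}\cdot N\mu = \mu$$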
Let’s also calculate the variance of the sample mean.
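Because X1, X2, …, XN are independent, the variance of their sum is the sum of their variances, and each of them has the population variance σ²:

$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} X_i\right) = \frac{1}{N^2}\sum_{i=1}^{N} \mathrm{Var}(X_i) = \frac{1}{N^2}\cdot N\sigma^2 = \frac{\sigma^2}{N}$$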
Now let’s apply the Bienaymé–Chebyshev inequality to the sample mean X_bar as follows:
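With ϵ playing the role of the threshold ‘a’, and using E(X_bar) = μ and Var(X_bar) = σ²/N:

$$P\big(|\bar{X} - \mu| \ge \epsilon\big) \;\le\; \frac{\mathrm{Var}(\bar{X})}{\epsilon^2} \;=\; \frac{\sigma^2}{N\epsilon^2}$$

For any fixed tolerance ϵ > 0, the right-hand side shrinks to zero as N grows to infinity. Since a probability can never be negative, P(|X_bar − μ| ≥ ϵ) is squeezed to zero as well, which is exactly the statement of the Weak Law of Large Numbers.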
How something so profound, so incredibly far-reaching, and so central to the field of statistical science as the WLLN can have such a disarmingly simple proof is one of those absurdities of nature that one can only marvel at. At any rate, there you have it.
Taken together, the Markov and the Bienaymé–Chebyshev inequalities, and the Weak Law of Large Numbers form the bedrock on which a large pile of statistical science securely rests. For example, when you train a statistical model (or a neural net model), the training algorithm had better obey the WLLN. If it doesn’t, the coefficient estimates aren’t guaranteed to converge to the true population values. And that makes your training technique, well, basically useless. The WLLN also finds gainful employment in the proof of another epic result — the Central Limit Theorem. And that, deservedly, will be the substance of my next article.