Imagine you’re trying to gauge the average height of all the trees in a vast forest. It’s impractical to measure each one. Instead, you measure a small sample and use those measurements to estimate the average for the entire forest. Bootstrapping, in statistics, works on a similar principle.
It involves taking a small sample from your data and, through repeated sampling, estimating statistics (such as the mean, median, or standard deviation) for your dataset. This technique allows you to make inferences about populations from small samples with greater confidence.
In this article, we will cover:
- The basics of bootstrapping: what is it exactly?
- How to achieve a bootstrapped sample in BigQuery
- An experiment to understand how results change based on varying sample sizes, and how that relates to a known statistic
- A stored procedure you can take away and use yourself
At its core, bootstrapping involves randomly selecting a number of observations from a dataset, with replacement, to form what is known as a “bootstrap sample.”
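To make the definition concrete, here is a minimal Python sketch of drawing one bootstrap sample. It is an illustration only, not the BigQuery approach this article builds towards: the function name `bootstrap_sample`, its `size` parameter, and the `observations` values are all hypothetical.

```python
import random


def bootstrap_sample(data, rng, size=None):
    """Draw `size` observations from `data` at random, with replacement.

    If `size` is not given, the sample is as large as the original data.
    (Hypothetical helper, for illustration only.)
    """
    if size is None:
        size = len(data)
    # random.Random.choices samples WITH replacement, so the same
    # observation can appear more than once in the result.
    return rng.choices(data, k=size)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
observations = [12, 15, 9, 20, 17, 11, 14]  # made-up values
sample = bootstrap_sample(observations, rng)
print(sample)
```

Because the draws are made with replacement, some observations may show up several times in `sample` while others do not appear at all.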
Let’s simplify this concept using a scenario where you have a basket of 25 apples and you’re curious about the average weight of apples in a larger context, like a market.
The Grab and Note Technique
Start by reaching into your basket to grab an apple at random and weigh it. Then, instead of setting it aside, put it right back into the basket. This way, every time you reach in, every single apple, including the one you just weighed, is fair game to be picked again.
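The grab-and-note step can be sketched in Python as follows. The apple weights are randomly generated placeholders, and `grab_and_note` is a hypothetical helper name; the point is only that noting a weight never removes the apple from the basket.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Placeholder weights in grams for the 25 apples (not real data).
basket = [round(rng.uniform(120, 220), 1) for _ in range(25)]


def grab_and_note(basket, rng):
    """Pick one apple at random and return its weight.

    The basket itself is left untouched, which is the same as
    putting the apple straight back after weighing it.
    """
    return rng.choice(basket)


first = grab_and_note(basket, rng)
second = grab_and_note(basket, rng)

print(first, second)
print(len(basket))  # the basket still holds every apple
```

Since the basket is unchanged after each grab, the second grab has exactly the same chance of landing on any apple as the first, including the one just weighed.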
Repeat
Now, you repeat the grab, weigh, and replace action the same number of times as there are apples in your…