Throw on your comfiest lo-fi, grab an oversized sweater, your favorite hot beverage, and let’s python.
It’s that time again in the northern hemisphere — a time for apples, pumpkins, and various configurations of cinnamon, nutmeg, ginger, allspice, and cloves. And as the grocery aisles start getting ready for Halloween, Thanksgiving, and the winter holidays, it’s a great time to dust off my statistical modeling skills. Hold onto your seasoned lattes, and let’s do some function-oriented seasonal modeling. The full code notebook can be found here.
Hypothesis:
Pumpkin Spice’s popularity as a Google search term in the USA will show strong seasonality, since it’s associated with American fall holidays and seasonal food dishes.
Null hypothesis:
Using last week’s or last year’s data as a direct prediction will be more predictive of this week’s level of popularity for the search term “pumpkin spice” than any model I build.
Data:
The last 5 years of data from Google Trends, pulled on the 7th of October, 2023. [1]
- Make a naive model where last week’s/last year’s data is this week’s prediction. Specifically, it’s not enough for my final model to be accurate or inaccurate in a vacuum; my final model must outperform using historical data as a direct prediction.
- The train test split will give me two sets of data: one for the algorithm to learn from, and the other for me to test how well the algorithm performed.
- Seasonal decomposition will give me a rough idea of how predictable my data is by trying to separate the yearly overall trend from the seasonal patterns and the noise. A smaller scale of noise will imply that more of the data can be captured in an algorithm.
- A series of statistical tests to determine if the data is stationary. If the data is not stationary, I’ll need to take a first difference (transform the series so that each time interval shows only the change from the previous interval), which will force the data to become stationary. A sketch of the split and stationarity check appears just after this list.
- Make some SARIMA models, using inferences from autocorrelation plots for the moving average term and inferences from partial autocorrelation plots for the autoregressive term. SARIMA is a go-to for time series modeling, and I’ll be trying ACF and PACF inferencing before I try a brute-force approach with Auto ARIMA.
- Try using Auto ARIMA, which will iterate through many terms and select the best combination. I want to experiment to learn if the parameters it gives me for a SARIMA model yield a better-performing model.
- Try ETS models, using inference from the seasonal decomposition as to whether the seasonality is additive or multiplicative over time. ETS models focus more heavily on seasonality and overall trend than SARIMA-family models do, and may give me an edge in capturing the relationship pumpkin spice has to time.
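Since the split and stationarity check are short, here’s a minimal sketch of both, assuming a weekly Google Trends export with a “pumpkin spice” column (the file name, column names, and 80/20 split point are illustrative, not the notebook’s exact values):

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Hypothetical file name; a Google Trends export has a date column
# and one column per search term.
df = pd.read_csv("pumpkin_spice_trends.csv", parse_dates=["Week"], index_col="Week")

# Keep the split chronological: no shuffling for time series data.
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]

# Augmented Dickey-Fuller test: a p-value below 0.05 suggests stationarity.
p_value = adfuller(train["pumpkin spice"])[1]
if p_value >= 0.05:
    # First difference: each week becomes the change from the prior week.
    train_diff = train["pumpkin spice"].diff().dropna()
```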
Performance and plotting KPIs:
- Try using the MAPE score because it’s an industry standard in many workplaces and folks may be used to it. It’s easy to understand.
- Try using the RMSE score because it penalizes large misses more heavily and is expressed in the same units as the data, which makes it more useful here.
- Plot predictions against the test data and visually check for performance.
As we can see from the above plot, this data shows strong potential for seasonal modeling. There’s a clear spike in the second half of each year, with a taper and another spike before a drop down into our baseline.
However, each year’s primary spike is larger than the last, except in 2021, which makes sense given the pandemic, when folks may not have had celebrating the season on their minds.
Note: These imports appear differently in the notebook itself, as in the notebook I’m relying on seasonal_mod.py, which has a lot of my imports baked in.
These are the libraries I used to make the code notebook. I went for statsmodels over scikit-learn for its time series packages; I also like statsmodels better for most linear regression problems.
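Based on the steps described in this post, a representative set of imports would look something like this (the real notebook pulls many of these in via seasonal_mod.py):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from pmdarima import auto_arima  # used for the brute-force term search
```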
I don’t know about you, but I don’t want to write several lines of code each time I make a new model, and then more code to verify it. So instead I made some functions to keep my code DRY and prevent myself from making errors.
These three little functions work together so I only need to run `metrics_graph()` with `y_true` and `y_preds` as the input, and it will give me a blue line of true data and a red line of predictive data, along with the MAPE and RMSE. That will save me time and hassle.
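The notebook has the real versions; a minimal sketch of how three such functions could fit together might look like this (the helper names `mape` and `rmse` are placeholders of mine):

```python
import numpy as np
import matplotlib.pyplot as plt

def mape(y_true, y_preds):
    """Mean absolute percentage error, as a percentage."""
    y_true, y_preds = np.asarray(y_true, dtype=float), np.asarray(y_preds, dtype=float)
    return np.mean(np.abs((y_true - y_preds) / y_true)) * 100

def rmse(y_true, y_preds):
    """Root mean squared error, in the units of the original data."""
    y_true, y_preds = np.asarray(y_true, dtype=float), np.asarray(y_preds, dtype=float)
    return np.sqrt(np.mean((y_true - y_preds) ** 2))

def metrics_graph(y_true, y_preds):
    """Plot true (blue) vs. predicted (red) values, then print both metrics."""
    plt.plot(y_true.index, y_true, color="blue", label="actual")
    plt.plot(y_true.index, y_preds, color="red", label="predicted")
    plt.legend()
    plt.show()
    print(f"MAPE: {mape(y_true, y_preds):.2f}%  RMSE: {rmse(y_true, y_preds):.2f}")
```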
Using Last Year’s Data as a Benchmark for Success:
My experience in retail management informed my decision to try last week’s data and last year’s data as direct predictions for this year’s data. In retail we often used last season’s data (one unit of time back) as a direct prediction, to plan inventory for Black Friday, for example. Last week’s data didn’t perform as well as last year’s data.
Using last week’s data to predict this week’s gave a MAPE score of just over 18, with an RMSE of about 11. By comparison, last year’s data as a direct prediction of this year’s gave a MAPE score of about 12 with an RMSE of about 7.
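Both baselines are one-liners with pandas’ shift(); a sketch, assuming the weekly series from the split above:

```python
# Naive baselines: shift the series so the past becomes the prediction.
y = df["pumpkin spice"]
weekly_naive = y.shift(1)    # last week predicts this week
yearly_naive = y.shift(52)   # the same week last year predicts this week

# Score on the overlap only, where a prediction exists.
metrics_graph(y[52:], yearly_naive[52:])
```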
Therefore I chose to compare all the statistical models I built against a naive model using last year’s data. This naive model got the timing of the spikes and decreases more accurate than the weekly naive model; however, I still thought I could do better. The next step in modeling was a seasonal decomposition.
The following function helped me run my seasonal decomposition, and I’ll be keeping it as reusable code for future modeling.
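The notebook’s exact function isn’t reproduced here; a minimal sketch of such a reusable wrapper, where the name `seasonal_decomp` and the weekly period of 52 are my assumptions:

```python
from statsmodels.tsa.seasonal import seasonal_decompose

def seasonal_decomp(series, model="additive", period=52):
    """Run a seasonal decomposition, plot the trend/seasonal/residual
    components, and return the result so residuals can be inspected."""
    result = seasonal_decompose(series, model=model, period=period)
    result.plot()
    return result
```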
The below shows how I used that seasonal decomposition.
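Roughly, the usage amounts to calling the wrapper once per decomposition type:

```python
# Compare how well each decomposition type separates out the residuals.
additive_result = seasonal_decomp(train["pumpkin spice"], model="additive")
multiplicative_result = seasonal_decomp(train["pumpkin spice"], model="multiplicative")
```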
The additive model had a recurring yearly pattern in the residuals, evidence that an additive model wasn’t able to completely decompose all the recurring patterns. That was a good reason to try a multiplicative model for the yearly spikes.
The residuals in the multiplicative decomposition were much more promising. They were much more random and on a much smaller scale, suggesting that a multiplicative model would capture the data best. The residuals being so small, on a scale between -1 and 1.5, meant that there was a lot of promise in modeling.
But now I wanted a function for running SARIMA models specifically, only inputting the order. I also wanted to experiment with running `c`, `t`, and `ct` versions of the SARIMA model with those orders, since the seasonal decomposition favored a multiplicative type of model over an additive one. Using `c`, `t`, and `ct` in the `trend=` parameter, I was able to add constant and trend terms to my SARIMA models.
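A sketch of such a function, assuming statsmodels’ SARIMAX and the metrics_graph() helper from earlier (the name `run_sarima` is mine, not the notebook’s):

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

def run_sarima(train, test, order, seasonal_order, trend=None):
    """Fit a SARIMA model and score its forecast against the test set.
    `trend` accepts 'c', 't', or 'ct' to add constant and/or linear
    trend terms to the model."""
    model = SARIMAX(train, order=order, seasonal_order=seasonal_order,
                    trend=trend).fit(disp=False)
    preds = model.forecast(steps=len(test))
    metrics_graph(test, preds)
    return model
```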
I’ll skip describing the part where I looked at the ACF and PACF plots, and the part where I also tried pmdarima’s auto_arima to find the best terms to use in the SARIMA models. If you’re interested in those details, please see my full code notebook.
My best SARIMA model:
So my best SARIMA model had a higher MAPE score than my naive model (nearly 29 versus nearly 12), but a lower RMSE by about a unit (nearly 6 versus nearly 7). My biggest problem with this model is that it badly underpredicted the 2023 spike; there’s a fair amount of area between the red and blue lines from August to September of 2023. There are reasons to like it better or worse than my yearly naive model, depending on your opinions about RMSE versus MAPE. However, I wasn’t done yet: my final model was definitively better than my yearly naive model.
I used an ETS (exponential smoothing) model for my final model, which allowed me to explicitly use the `seasonal` parameter to make it use a multiplicative approach.
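A sketch of that final model with statsmodels, assuming weekly seasonality (52 periods); the exact trend setting in the notebook may differ:

```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# seasonal="mul" requires strictly positive values, which Google Trends
# popularity scores satisfy as long as the series never hits zero.
ets = ExponentialSmoothing(train["pumpkin spice"], trend="add",
                           seasonal="mul", seasonal_periods=52).fit()
ets_preds = ets.forecast(len(test))
metrics_graph(test["pumpkin spice"], ets_preds)
```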
Now you may be thinking, “but this model has a higher MAPE score than the yearly naive model.” And you’d be correct, by about 0.3%. However, I think that’s a more than fair trade considering that I now have an RMSE of about four and a half instead of seven. While this model does struggle a bit more in December of 2022 than my best SARIMA model, it misses by less total area on that spike than on the larger fall 2023 spike, which I care more about. You can find that model here.
I will wait until 10/7/2024, do another data pull, and see how the model performed against a new year of data.
To sum up, I was able to reject the null hypothesis: my final model outperformed a naive yearly model. I showed that pumpkin spice popularity on Google is highly seasonal and can be predicted. Between naive, SARIMA, and ETS models, ETS was best able to capture the relationship between time and pumpkin spice popularity. The multiplicative relationship of pumpkin spice to time implies that pumpkin spice’s popularity depends on at least one independent variable besides time, in the expression `time * unknown_independent_var = pumpkin_spice_popularity`.
What I Learned and Future Work:
My next step is to use some version of Meta’s Graph API to look for “pumpkin spice” being used in business articles; I wonder how correlated that data will be with my Google Trends data. I also learned that when the seasonal decomposition points toward a multiplicative model, I’ll reach for an ETS model much sooner in my process.
Furthermore, I’m interested in automating much of this process. Ideally, I’d like to build a Python module where the input is a CSV directly from Google Trends and the output is a usable model, with good enough documentation that a nontechnical user could make and test their own predictive models. In the event that a user picks data that is hard to predict (i.e., where a naive or random walk model would suit better), I hope to build the module to explain that to them. I could then collect data from an app using that module to showcase findings of seasonality across lots of untested data.
Look out for that app by pumpkin spice season of next year!
[1] Google Trends, https://www.google.com/trends (data pulled October 7, 2023).