GDP is a very strong metric of a country’s economic well-being; therefore, making forecasts of the measurement highly sought after. Policymakers and legislators, for example, may want to have a rough forecast of the trends regarding the country’s GDP prior to passing a new bill or law. Researchers and economists will also consider these forecasts for various endeavors in both academic and industrial settings.
Forecasting GDP, similarly to many other time series problems, follows a general workflow.
- Using the integrated FRED (Federal Reserve Economic Data) library and API, we will create our features by constructing a data frame composed of US GDP along with some other metrics that are closely related (GDP = Consumption + Investment + Govt. Spending + Net Export)
- Using a variety of statistical tests and analyses, we will explore the nuances of our data in order to better understand the underlying relationships between features.
- Finally, we will utilize a variety of statistical and machine-learning models to conclude which approach can lead us to the most accurate and efficient forecast.
Along all of these steps, we will delve into the nuances of the underlying mathematical backbone that supports our tests and models.
To construct our dataset for this project, we will be utilizing the FRED (Federal Reserve Economic Data) API which is the premier application to gather economic data. Note that to use this data, one must register an account on the FRED website and request a custom API key.
Each time series on the website is connected to a specific character string (for example GDP is linked to ‘GDP’, Net Export to ‘NETEXP’, etc.). This is important because when we make a call for each of our features, we need to make sure that we specify the correct character string to go along with it.
Keeping this in mind, lets now construct our data frame:
#used to label and construct each feature dataframe.
def gen_df(category, series):
gen_ser = fred.get_series(series, frequency='q')
return pd.DataFrame({'Date': gen_ser.index, category + ' : Billions of dollars': gen_ser.values})
#used to merge every constructed dataframe.
def merge_dataframes(dataframes, on_column):
merged_df = dataframes[0]
for df in dataframes[1:]:
merged_df = pd.merge(merged_df, df, on=on_column)
return merged_df
#list of features to be used
dataframes_list = [
gen_df('GDP', 'GDP'),
gen_df('PCE', 'PCE'),
gen_df('GPDI', 'GPDI'),
gen_df('NETEXP', 'NETEXP'),
gen_df('GovTotExp', 'W068RCQ027SBEA')
]
#defining and displaying dataset
data = merge_dataframes(dataframes_list,'Date')
data
Notice that since we have defined functions as opposed to static chunks of code, we are free to expand our list of features for further testing. Running this code, our resulting data frame is the following:
We notice that our dataset starts from the 1960s, giving us a fairly broad historical context. In addition, looking at the shape of the data frame, we have 1285 instances of actual economic data to work with, a number that is not necessarily small but not big either. These observations will come into play during our modeling phase.
Now that our dataset is initialized, we can begin visualizing and conducting tests to gather some insights into the behavior of our data and how our features relate to one another.
Visualization (Line plot):
Our first approach to analyzing this dataset is to simply graph each feature on the same plot in order to catch some patterns. We can write the following:
#separating date column from feature columns
date_column = 'Date'
feature_columns = data.columns.difference([date_column])
#set the plot
fig, ax = plt.subplots(figsize=(10, 6))
fig.suptitle('Features vs Time', y=1.02)
#graphing features onto plot
for i, feature in enumerate(feature_columns):
ax.plot(data[date_column], data[feature], label=feature, color=plt.cm.viridis(i / len(feature_columns)))
#label axis
ax.set_xlabel('Date')
ax.set_ylabel('Billions of Dollars')
ax.legend(loc='upper left', bbox_to_anchor=(1, 1))
#display the plot
plt.show()
Running the code, we get the result:
Looking at the graph, we notice below that some of the features resemble GDP far more than others. For instance, GDP and PCE follow almost the exact same trend while NETEXP shares no visible similarities. Though it may be tempting, we can not yet begin selecting and removing certain features before conducting more exploratory tests.
ADF (Augmented Dickey-Fuller) Test:
The ADF (Augmented Dickey-Fuller) Test evaluates the stationarity of a particular time series by checking for the presence of a unit root, a characteristic that defines a time series as nonstationarity. Stationarity essentially means that a time series has a constant mean and variance. This is important to test because many popular forecasting methods (including ones we will use in our modeling phase) require stationarity to function properly.
Although we can determine the stationarity for most of these time series just by looking at the graph, doing the testing is still beneficial because we will likely reuse it in later parts of the forecast. Using the Statsmodel library we write:
from statsmodels.tsa.stattools import adfuller
#iterating through each feature
for column in data.columns:
if column != 'Date':
result = adfuller(data[column])
print(f"ADF Statistic for {column}: {result[0]}")
print(f"P-value for {column}: {result[1]}")
print("Critical Values:")
for key, value in result[4].items():
print(f" {key}: {value}")
#creating separation line between each feature
print("\n" + "=" * 40 + "\n")
giving us the result:
The numbers we are interested from this test are the P-values. A P-value close to zero (equal to or less than 0.05) implies stationarity while a value closer to 1 implies nonstationarity. We can see that all of our time series features are highly nonstationary due to their statistically insignificant p-values, in other words, we are unable to reject the null hypothesis for the absence of a unit root. Below is a simple visual representation of the test for one of our features. The red dotted line represents the P-value where we would be able to determine stationarity for the time series feature, and the blue box represents the P-value where the feature is currently.
VIF (Variance Inflation Factor) Test:
The purpose of finding the Variance Inflation Factor of each feature is to check for multicollinearity, or the degree of correlation the predictors share with one another. High multicollinearity is not necessarily detrimental to our forecast, however, it can make it much harder for us to determine the individual effect of each feature time series for the prediction, thus hurting the interpretability of the model.
Mathematically, the calculation is as follows:
with Xj representing our selected predictor and R²j is the coefficient of determination for our specific predictor. Applying this calculation to our data, we arrive at the following result:
Evidently, our predictors are very closely linked to one another. A VIF score greater than 5 implies multicollinearity, and the scores our features achieved far exceed this amount. Predictably, PCE by far had the highest score which makes sense given how its shape on the line plot resembled many of the other features.
Now that we have looked thoroughly through our data to better understand the relationships and characteristics of each feature, we will begin to make modifications to our dataset in order to prepare it for modeling.
Differencing to achieve stationarity
To begin modeling we need to first ensure our data is stationary. we can achieve this using a technique called differencing, which essentially transforms the raw data using a mathematical formula similar to the tests above.
The concept is defined mathematically as:
This makes it so we are removing the nonlinear trends from the features, resulting in a constant series. In other words, we are taking values from our time series and calculating the change which occurred following the previous point.
We can implement this concept in our dataset and check the results from the previously used ADF test with the following code:
#differencing and storing original dataset
data_diff = data.drop('Date', axis=1).diff().dropna()
#printing ADF test for new dataset
for column in data_diff.columns:
result = adfuller(data_diff[column])
print(f"ADF Statistic for {column}: {result[0]}")
print(f"P-value for {column}: {result[1]}")
print("Critical Values:")
for key, value in result[4].items():
print(f" {key}: {value}")print("\n" + "=" * 40 + "\n")
running this results in:
We notice that our new p-values are less than 0.05, meaning that we can now reject the null hypothesis that our dataset is nonstationary. Taking a look at the graph of the new dataset proves this assertion:
We see how all of our time series are now centered around 0 with the mean and variance remaining constant. In other words, our data now visibly demonstrates characteristics of a stationary system.
VAR (Vector Auto Regression) Model
The first step of the VAR model is performing the Granger Causality Test which will tell us which of our features are statistically significant to our prediction. The test indicates to us if a lagged version of a specific time series can help us predict our target time series, however not necessarily that one time series causes the other (note that causation in the context of statistics is a far more difficult concept to prove).
Using the StatsModels library, we can apply the test as follows:
from statsmodels.tsa.stattools import grangercausalitytests
columns = ['PCE : Billions of dollars', 'GPDI : Billions of dollars', 'NETEXP : Billions of dollars', 'GovTotExp : Billions of dollars']
lags = [6, 9, 1, 1] #determined from individually testing each combinationfor column, lag in zip(columns, lags):
df_new = data_diff[['GDP : Billions of dollars', column]]
print(f'For: {column}')
gc_res = grangercausalitytests(df_new, lag)
print("\n" + "=" * 40 + "\n")
Running the code results in the following table:
Here we are just looking for a single lag for each feature that has statistically significant p-values(>.05). So for example, since on the first lag both NETEXP and GovTotExp, we will consider both these features for our VAR model. Personal consumption expenditures arguably did not make this cut-off (see notebook), however, the sixth lag is so close that I decided to keep it in. Our next step is to create our VAR model now that we have decided that all of our features are significant from the Granger Causality Test.
VAR (Vector Auto Regression) is a model which can leverage different time series to gauge patterns and determine a flexible forecast. Mathematically, the model is defined by:
Where Yt is some time series at a particular time t and Ap is a determined coefficient matrix. We are essentially using the lagged values of a time series (and in our case other time series) to make a prediction for Yt. Knowing this, we can now apply this algorithm to the data_diff dataset and evaluate the results:
Looking at this forecast, we can clearly see that despite missing the mark quite heavily on both evaluation metrics used (MAE and MAPE), our model visually was not too inaccurate barring the outliers caused by the pandemic. We managed to stay on the testing line for the most part from 2018–2019 and from 2022–2024, however, the global events following obviously threw in some unpredictability which affected the model’s ability to precisely judge the trends.
VECM (Vector Error Correction Model)
VECM (Vector Error Correction Model) is similar to VAR, albeit with a few key differences. Unlike VAR, VECM does not rely on stationarity so differencing and normalizing the time series will not be necessary. VECM also assumes cointegration, or long-term equilibrium between the time series. Mathematically, we define the model as:
This equation is similar to the VAR equation, with Π being a coefficient matrix which is the product of two other matrices, along with taking the sum of lagged versions of our time series Yt. Remembering to fit the model on our original (not difference) dataset, we achieve the following result:
Though it is hard to compare to our VAR model to this one given that we are now using nonstationary data, we can still deduce both by the error metric and the visualization that this model was not able to accurately capture the trends in this forecast. With this, it is fair to say that we can rule out traditional statistical methods for approaching this problem.
Machine Learning forecasting
When deciding on a machine learning approach to model this problem, we want to keep in mind the amount of data that we are working with. Prior to creating lagged columns, our dataset has a total of 1275 observations across all time-series. This means that using more complex approaches, such as LSTMs or gradient boosting, are perhaps unnecessary as we can use a more simple model to receive the same amount of accuracy and far more interpretability.
Train-Test Split
Train-test splits for time series problems differ slightly from splits in traditional regression or classification tasks (Note we also used the train-test split in our VAR and VECM models, however, it feels more appropriate to address in the Machine Learning section). We can perform our Train-Test split on our differenced data with the following code:
#90-10 data split
split_index = int(len(data_diff) * 0.90)
train_data = data_diff.iloc[:split_index]
test_data = data_diff.iloc[split_index:]
#Assigning GDP column to target variable
X_train = train_data.drop('GDP : Billions of dollars', axis=1)
y_train = train_data['GDP : Billions of dollars']
X_test = test_data.drop('GDP : Billions of dollars', axis=1)
y_test = test_data['GDP : Billions of dollars']
Here it is imperative that we do not shuffle around our data, since that would mean we are training our model on data from the future which in turn will cause data leakages.
Also in comparison, notice that we are training over a very large portion (90 percent) of the data whereas typically we would train over 75 percent in a common regression task. This is because practically, we are not actually concerned with forecasting over a large time frame. Realistically even forecasting over several years is not probable for this task given the general unpredictability that comes with real-world time series data.
Random Forests
Remembering our VIF test from earlier, we know our features are highly correlated with one another. This partially plays into the decision to choose random forests as one of our machine-learning models. decision trees make binary choices between features, meaning that theoretically our features being highly correlated should not be detrimental to our model.
To add on, random forest is generally a very strong model being robust to overfitting from the stochastic nature of how the trees are computed. Each tree uses a random subset of the total feature space, meaning that certain features are unlikely to dominate the model. Following the construction of the individual trees, the results are averaged in order to make a final prediction using every individual learner.
We can implement the model to our dataset with the following code:
from sklearn.ensemble import RandomForestRegressor
#fitting model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)y_pred = rf_model.predict(X_test)
#plotting results
printevals(y_test,y_pred)
plotresults('Actual vs Forecasted GDP using Random Forest')
running this gives us the results:
We can see that Random Forests was able to produce our best forecast yet, attaining better error metrics than our attempts at VAR and VECM. Perhaps most impressively, visually we can see that our model was almost perfectly encapsulating the data from 2017–2019, just prior to encountering the outliers.
K Nearest Neighbors
KNN (K-Nearest-Neighbors) was one final approach we will attempt. Part of the reasoning for why we choose this specific model is due to the feature-to-observation ratio. KNN is a distanced based algorithm that we are dealing with data which has a low amount of feature space comparative to the number of observations.
To use the model, we must first select a hyperparameter k which defines the number of neighbors our data gets mapped to. A higher k value insinuates a more biased model while a lower k value insinuates a more overfit model. We can choose the optimal one with the following code:
from sklearn.neighbors import KNeighborsRegressor
#iterate over all k=1 to k=10
for i in range (1,10):
knn_model = KNeighborsRegressor(n_neighbors=i)
knn_model.fit(X_train, y_train)y_pred = knn_model.predict(X_test)
#print evaluation for each k
print(f'for k = {i} ')
printevals(y_test,y_pred)
print("\n" + "=" * 40 + "\n")
Running this code gives us:
We can see that our best accuracy measurements are achieved when k=2, following that value the model becomes too biased with increasing values of k. knowing this, we can now apply the model to our dataset:
#applying model with optimal k value
knn_model = KNeighborsRegressor(n_neighbors=2)
knn_model.fit(X_train, y_train)y_pred = knn_model.predict(X_test)
printevals(y_test,y_pred)
plotresults('Actual vs Forecasted GDP using KNN')
resulting in:
We can see KNN in its own right performed very well. Despite being outperformed slightly in terms of error metrics compared to Random Forests, visually the model performed about the same and arguably captured the period before the pandemic from 2018–2019 even better than Random Forests.
Looking at all of our models, we can see the one which performed the best was Random Forests. This is most likely due to Random Forests for the most part being a very strong predictive model that can be fit to a variety of datasets. In general, the machine learning algorithms far outperformed the traditional statistical methods. Perhaps this can be explained by the fact that VAR and VECM both require a great amount of historical background data to work optimally, something which we did not have much of given that our data came out in quarterly intervals. There also may be something to be said about how both the machine learning models used were nonparametric. These models often are governed by fewer assumptions than their counterparts and therefore may be more flexible to unique problem sets like the one here. Below is our final best prediction, removing the differencing transformation we previously used to fit the models.
By far the greatest challenge regarding this forecasting problem was handling the massive outlier caused by the pandemic along with the following instability caused by it. Our methods for forecasting obviously can not predict that this would occur, ultimately decreasing our accuracy for each approach. Had our goal been to forecast the previous decade, our models would most likely have a much easier time finding and predicting trends. In terms of improvement and further research, I think a possible solution would be to perform some sort of normalization and outlier smoothing technique on the time interval from 2020–2024, and then evaluate our fully trained model on new quarterly data that comes in. In addition, it may be beneficial to incorporate new features that have a heavy influence on GDP such as quarterly inflation and personal asset evaluations.
For traditional statistical methods- https://link.springer.com/book/10.1007/978-1-4842-7150-6 , https://www.statsmodels.org/stable/generated/statsmodels.tsa.vector_ar.vecm.VECM.html
For machine learning methods — https://www.statlearning.com/
For dataset — https://fred.stlouisfed.org/docs/api/fred/
FRED provides licensed, free-to-access datasets for any user who owns an API key, read more here — https://fredhelp.stlouisfed.org/fred/about/about-fred/what-is-fred/
All pictures not specifically given credit in the caption belong to me.
please note that in order to run this notebook you must create an account on the FRED website, request an API key, and paste said key into the second cell of the notebook.