Forecasting can offer great value in SEO. Here’s how to get data-driven answers about possible trends in organic search using Python.
Whether it’s search demand, revenue, or traffic from organic search, at some point in your SEO career, you’re bound to be asked to deliver a forecast.
In this column, you’ll learn how to do just that accurately and efficiently, thanks to Python.
We’re going to explore how to:
Before we get into it, let’s define the data. Regardless of the type of metric we’re attempting to forecast, that data happens over time.
In most cases, this is likely to be over a series of dates. So effectively, the techniques we’re covering here are time series forecasting techniques.
To answer a question with a question, why wouldn’t you forecast?
These techniques have been long used in finance for stock prices, for example, and in other fields. Why should SEO be any different?
With multiple stakeholders involved, such as the budget holder and other colleagues (say, the SEO manager and marketing director), there will be expectations as to what the organic search channel can deliver and whether those expectations will be met.
Forecasts provide a data-driven answer.
Taking the data-driven approach using Python, there are a few things to bear in mind:
Forecasts work best when there is a lot of historical data.
The cadence of the data will determine the time frame needed for your forecast.
For example, if you have two years of daily data, as you would in your website analytics, you’ll have over 720 data points, which is plenty.
With Google Trends, which has a weekly cadence, you’ll need at least 5 years to get 250 data points.
In any case, you should aim for a timeframe that gives you at least 200 data points (a number plucked from my personal experience).
Models like consistency.
If your data trend has a pattern — for example, it’s cyclical because there is seasonality — then your forecasts are more likely to be reliable.
For that reason, forecasts don’t handle breakout trends very well because there’s no historical data to base the future on, as we’ll see later.
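The data-point arithmetic above is easy to check for your own cadence. The sketch below is only a rough illustration: the function names and the periods-per-year table are assumptions made for this example, and the two-year daily history is inferred from the "over 720 data points" figure.

```python
# Approximate number of observations per year for each cadence (assumed values).
PERIODS_PER_YEAR = {"daily": 365, "weekly": 52, "monthly": 12}


def data_points(years, cadence):
    """Approximate how many observations a history of `years` yields."""
    return int(years * PERIODS_PER_YEAR[cadence])


def years_needed(target_points, cadence):
    """How many years of history are needed to reach `target_points`."""
    return target_points / PERIODS_PER_YEAR[cadence]


print(data_points(2, "daily"))                 # 730
print(data_points(5, "weekly"))                # 260
print(round(years_needed(200, "weekly"), 1))   # 3.8
```

So for weekly Google Trends data, the 200-point rule of thumb translates to a little under four years of history.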
So how do forecasting models work? There are a few aspects the models will address about the time series data:
Autocorrelation is the extent to which the data point is similar to the data point that came before it.
This can give the model information as to how much impact an event in time has over the search traffic and whether the pattern is seasonal.
Seasonality informs the model as to whether there is a cyclical pattern, and the properties of the pattern, e.g.: how long, or the size of the variation between the highs and lows.
Stationarity is the measure of how the overall trend is changing over time. A non-stationary trend would show a general trend up or down, despite the highs and lows of the seasonal cycles.
With the above in mind, models will “do” things to the data to make it more of a straight line and therefore more predictable.
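To make autocorrelation and the "straightening" idea a little more concrete, here is a minimal sketch on synthetic data (not the Google Trends export). The `autocorrelation` helper and the made-up series are assumptions for illustration; differencing is shown as one common example of what a model may do to the data.

```python
import numpy as np


def autocorrelation(series, lag):
    """Correlation between a series and a copy of itself shifted by `lag`."""
    x = np.asarray(series, dtype=float)
    return float(np.corrcoef(x[:-lag], x[lag:])[0, 1])


# Synthetic weekly data: upward trend + 52-week cycle + a little noise.
rng = np.random.default_rng(0)
t = np.arange(260)
series = 0.1 * t + 20 * np.sin(2 * np.pi * t / 52) + rng.normal(0, 2, 260)

# Seasonality shows up as high autocorrelation one full cycle apart
# and negative autocorrelation half a cycle apart.
print(round(autocorrelation(series, 52), 2))
print(round(autocorrelation(series, 26), 2))

# Seasonal differencing subtracts the value from 52 weeks earlier.
# The cycle cancels out, leaving the yearly growth (0.1 * 52) plus noise.
seasonal_diff = series[52:] - series[:-52]
print(round(float(seasonal_diff.mean()), 1))
```

The differenced series no longer swings with the seasons, which is the kind of flatter, more predictable input the models prefer.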
With the whistlestop theory out of the way, let’s start forecasting.
We’re using Google Trends data, which is a CSV export.
These techniques can be used on any time series data, be it your own, your client’s or company’s clicks, revenues, etc.
As we’d expect, the data from Google Trends is a very simple time series with date, query, and hits spanning a 5-year period.
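As a rough sketch of loading data in that shape with pandas: the column names follow the description above (date, query, hits), while the rows themselves are made up for illustration and stand in for a real export file.

```python
import io

import pandas as pd

# Hypothetical extract in the long layout described above:
# one row per (date, query) pair. The values are invented.
sample_csv = io.StringIO(
    "date,query,hits\n"
    "2020-11-01,ps4,40\n"
    "2020-11-01,ps5,75\n"
    "2020-11-08,ps4,42\n"
    "2020-11-08,ps5,100\n"
)

# With a real export you would pass the file path instead of sample_csv.
df = pd.read_csv(sample_csv, parse_dates=["date"])

print(df.dtypes)
print(df.head())
```

Parsing the date column up front means later steps can treat it as a proper time index rather than plain text.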
It’s time to format the dataframe to go from long to wide.
This allows us to see the data with each search query as columns:
We no longer have a hits column, as these are the values of the queries in their respective columns.
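One way to do that long-to-wide reshape is pandas' `pivot`. This is a minimal sketch on invented values; the frame and variable names are assumptions for this example.

```python
import pandas as pd

# Long format: one row per (date, query) pair, with invented values.
long_df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-11-01", "2020-11-01", "2020-11-08", "2020-11-08"]
    ),
    "query": ["ps4", "ps5", "ps4", "ps5"],
    "hits": [40, 75, 42, 100],
})

# Wide format: dates become the index, each query becomes a column,
# and the hits values fill the cells.
wide_df = long_df.pivot(index="date", columns="query", values="hits")
wide_df.columns.name = None  # drop the leftover "query" label on the columns

print(wide_df)
```

Each column of `wide_df` is now a self-contained time series that can be passed to a model on its own.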
This format is not only useful for SARIMA (which we will be exploring here) but also for neural networks such as Long short-term memory (LSTM).
Let’s plot the data:
From the plot (above), you’ll note that the profiles of “PS4” and “PS5” are quite different. For the non-gamers among you, “PS4” is the fourth generation of the Sony PlayStation console, and “PS5” the fifth.
“PS4” searches are highly seasonal, as it’s an established product, and follow a regular pattern apart from the end, when the “PS5” emerges.
The “PS5” didn’t exist 5 years ago, which would explain the absence of a trend in the first 4 years of the plot above.
I’ve chosen those two queries to help illustrate the difference in forecasting effectiveness for the two very different characteristics.
Let’s now decompose the seasonal (or non-seasonal) characteristics of each trend:
The above shows the time series data and the overall smoothed trend arising from 2020.
The seasonal trend box shows repeated peaks, which indicates that there is seasonality from 2016. However, it doesn’t seem particularly reliable given how flat the time series is from 2016 until 2020.
Also suspicious is the lack of noise, as the seasonal plot shows a virtually uniform pattern repeating periodically.
The Resid (which stands for “Residual”) shows any pattern of what’s left of the time series data after accounting for seasonality and trend, which in effect is nothing until 2020 as it’s at zero most of the time.
We can see fluctuation over the short term (Seasonality) and long term (Trend), with some noise (Resid).
The next step is to use the Augmented Dickey-Fuller (ADF) test to check statistically whether a given time series is stationary or not.
We can see the p-value of “PS5” shown above is more than 0.05, which means that the time series data is not stationary and therefore needs differencing.
“PS4,” on the other hand, is less than 0.05 at 0.01; it’s stationary and doesn’t require differencing.
The point of all of this is to understand the parameters that would be used if we were manually building a model to forecast Google searches.
Since we’ll be using automated methods to estimate the best fit model parameters (later), we’re now going to estimate the number of parameters for our SARIMA model.
I’ve chosen SARIMA because it’s easy to install. Although Facebook’s Prophet is elegant mathematically speaking (it uses Monte Carlo methods), it’s not maintained enough and many users may have problems trying to install it.
In any case, SARIMA compares quite well to Prophet in terms of accuracy.
To estimate the parameters for our SARIMA model, note that we set m to 52 as there are 52 weeks in a year, which is how the periods are spaced in Google Trends.
We also set all of the parameters to start at 0 so that we can let the auto_arima do the heavy lifting and search for the values that best fit the data for forecasting.
Response to above:
The printout above shows that the parameters that get the best results are:
The PS5 estimate is further detailed when printing out the model summary:
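What’s happening is this: the function searches for the parameters that minimize both Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC), which score a model on how well it fits the data while penalizing the number of parameters.
In the AIC formula, L is the likelihood of the data, and k = 1 if c ≠ 0 and k = 0 if c = 0, where c is the model’s constant term.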
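For reference, the definitions usually given for a non-seasonal ARIMA(p, d, q) model, which use the same L, k and c as above, are written out below. T, the number of observations, is a symbol not defined in this article, so treat this as the standard textbook form rather than a quote from the original.

```latex
\[
\begin{aligned}
\mathrm{AIC} &= -2\log(L) + 2\,(p + q + k + 1) \\
\mathrm{BIC} &= \mathrm{AIC} + \bigl[\log(T) - 2\bigr]\,(p + q + k + 1)
\end{aligned}
\]
```

Both criteria reward a higher likelihood and penalize extra parameters, and BIC penalizes them more heavily as the number of observations grows.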
By minimizing AIC and BIC, we get the best-estimated parameters for p and q.
Now that we have the parameters, we can begin making forecasts. First, we’re going to see how the model performs over past data. This gives us some indication as to how well the model could perform for future periods.
The forecasts show that the models perform well when there is enough history, until the underlying pattern suddenly changes, as it has for PS4 from March onwards.
For PS5, the models are hopeless virtually from the get-go.
We know this because the Root Mean Squared Error (RMSE) is 8.62 for PS4, which is less than a third of the PS5 RMSE of 27.5. Given that Google Trends varies from 0 to 100, the PS5 figure amounts to a margin of error of roughly 27%.
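The back-testing idea (hide the most recent weeks, forecast them, then measure the error) can be sketched as follows. The data is synthetic, the 26-week hold-out is an assumption, and the forecast is a simple "same week last year" stand-in rather than a fitted SARIMA model, so only the mechanics carry over.

```python
import numpy as np


def rmse(actual, predicted):
    """Root Mean Squared Error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Synthetic weekly series: 5 years with a 52-week cycle plus noise.
rng = np.random.default_rng(7)
t = np.arange(260)
series = 50 + 20 * np.sin(2 * np.pi * t / 52) + rng.normal(0, 3, 260)

# Hold out the last 26 weeks; the forecast only ever sees `train`.
horizon = 26
train, test = series[:-horizon], series[-horizon:]

# Stand-in forecast: repeat the value observed 52 weeks earlier.
# With a fitted model you would use its predictions here instead.
predicted = train[-52:-52 + horizon]

print(f"RMSE on hold-out: {rmse(test, predicted):.2f}")
```

Because Google Trends values run from 0 to 100, an RMSE can be read roughly as a percentage margin of error, which is how the PS4 and PS5 figures above are compared.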
At this point, we’ll now make the foolhardy attempt to forecast the future based on the data we have to date:
As you can see from the table extract above, we’re now using all available data.
Now, we shall predict the next 6 months (defined as 26 weeks) in the code below:
This time, we automated the finding of the best fitting parameters and fed that directly into the model.
There’s been a lot of change in the last few weeks of the data. Although the forecasted trends look plausible, they don’t look especially accurate, as shown below:
That’s the case for these two keywords; if you try the code on your own data for more established queries, the forecasts will probably be more accurate.
The forecast quality will be dependent on how stable the historic patterns are and will obviously not account for unforeseeable events like COVID-19.
If you weren’t excited by Matplotlib, Python’s data visualization library, fear not! You can export the data and forecasts into Excel, Tableau, or another dashboard front end to make them look nicer.
To export your forecasts:
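A minimal sketch of the export using pandas' `to_csv`: the column names and forecast values below are made up for illustration, and an in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical forecast table: invented values, one row per future week.
forecast_df = pd.DataFrame({
    "date": pd.date_range("2021-01-03", periods=3, freq="W"),
    "query": "ps4",
    "forecast": [41.2, 39.8, 38.5],
})

# Writing to a buffer keeps the example self-contained; pass a file path
# to to_csv instead to create a file for Excel or Tableau.
buffer = io.StringIO()
forecast_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

A plain CSV is the lowest common denominator here, since spreadsheet tools and dashboard front ends can generally read it without extra setup.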
What we learned here is where forecasting with statistical models is likely to add value, particularly in automated systems like dashboards: when there’s plenty of historical data with a stable pattern, and not when there is a sudden spike, as with PS5.
Andreas Voniatis is the Founder of Artios, the SEO systems that help agencies and businesses unlock more ROI from SEO …