how do you create a linear regression forecast on time series data in python

Question

In the text of your question, you clearly state that you would like
upper and lower bounds on your regression output, as well as the output
prediction. You also mention using Holt-Winters algorithms for
forecasting in particular.

The packages suggested by other answerers are useful, but you might note
that sklearn LinearRegression does not give you error bounds “out of
the box”, statsmodels does not provide Holt-Winters right now.

Therefore, I suggest try using this implementation of Holt-Winters.
Unfortunately its license is unclear, so I can’t reproduce it here in
full. Now, I’m not sure whether you actually want Holt-Winters
(seasonal) prediction, or Holt’s linear exponential smoothing
algorithm. I’m guessing the latter given the title of the post. Thus,
you can use the linear() function of the linked library. The
technique is described in detail here for interested readers.

In the interests of not providing a link only answer – I’ll describe the
main features here. A function is defined that takes the data i.e.

 def linear(x, fc, alpha = None, beta = None):

x is the data to be fit, fc is the number of timesteps that you want
to forecast, alpha and beta take their usual Holt-Winters meanings:
roughly a parameter to control the amount of smoothing to the “level”
and to the “trend” respectively. If alpha or beta are not
specified, they are estimated using scipy.optimize.fmin_l_bfgs_b
to minimise the RMSE.

The function simply applies the Holt algorithm by looping through the
existing data points and then returns the forecast as follows:

 return Y[-fc:], alpha, beta, rmse

where Y[-fc:] are the forecast points, alpha and beta are the
values actually used and rmse is the root mean squared error.
Unfortunately, as you can see, there are no upper or lower confidence
intervals. By the way – we should probably refer to them as prediction
intervals.

Prediction intervals maths

Holt’s algorithm and Holt-Winters algorithm are exponential smoothing
techniques and finding confidence intervals for predictions generated
from them is a tricky subject. They have been referred to as “rule of
thumb” methods and, in the case of the Holt-Winters multiplicative
algorithm, without “underlying statistical model”. However, the
final footnote to this page asserts that:

It is possible to calculate confidence intervals around long-term
forecasts produced by exponential smoothing models, by considering them
as special cases of ARIMA models. (Beware: not all software calculates
confidence intervals for these models correctly.) The width of the
confidence intervals depends on (i) the RMS error of the model, (ii) the
type of smoothing (simple or linear); (iii) the value(s) of the
smoothing constant(s); and (iv) the number of periods ahead you are
forecasting. In general, the intervals spread out faster as α gets
larger in the SES model and they spread out much faster when linear
rather than simple smoothing is used.

We see here that an ARIMA(0,2,2) model is equivalent to a Holt
linear model with additive errors

Prediction intervals code (i.e. how to proceed)

You indicate in comments that you “can easily do this in R”. I
guess you may be used to the holt() function provided by the
forecast package in R and therefore expecting such intervals. In
which case – you can adapt the python library to give them to you on the
same basis.

Looking at the R holt code, we can see that it returns an object
based on forecast(ets(...). Under the hood – this eventually calls to
this function class1, which returns a mean mu and variance
var (as well as cj which I have to confess I do not understand).
The variance is used to calculate the upper and lower bounds here.

To do something similar in Python – we would need to produce something
similar to the class1 R function that estimates the variance of each
prediction. This function takes the residuals found in model fitting and
multiplies them by a factor at each time step to get the variance at
that timestep. In the particular case of the linear Holt’s algorithm, the factor is the cumulative sum of alpha + k*beta
where k is the number of timesteps’ prediction. Once you have that
variance at each prediction point, treat the errors as normally
distributed and get the X% value from the normal distribution.

Here’s an idea how to do this in Python (using the code I linked as
your linear function)

#Copy, import or reimplement the RMSE and linear function from
#https://gist.github.com/andrequeiroz/5888967

#factor in case there are not 1 timestep per day - in your case
#assuming the timesteps are UTC epoch - I think they're 5 min
# spaced i.e. 288 per day
timesteps_per_day = 288

# Note - big assumption here - your known data will be regular in time
# i.e. timesteps_per_day observations per day.  From the timestamps this seems valid.
# if you can't guarantee that - you'll need to interpolate the data
def holt_predict(data, timestamps, forecast_days, pred_error_level = 0.95):
    forecast_timesteps = forecast_days*timesteps_per_day
    middle_predictions, alpha, beta, rmse = linear(data,int(forecast_timesteps))
    cum_error = [beta+alpha]
    for k in range(1,forecast_timesteps):
        cum_error.append(cum_error[k-1] + k*beta + alpha)

    cum_error = np.array(cum_error)
    #Use some numpy multiplication to get the intervals
    var = cum_error * rmse**2
    # find the correct ppf on the normal distribution (two-sided)
    p = abs(scipy.stats.norm.ppf((1-pred_error_level)/2))
    interval = np.sqrt(var) * p
    upper = middle_predictions + interval
    lower = middle_predictions - interval
    fcast_timestamps = [timestamps[-1] + i * 86400 / timesteps_per_day for i in range(forecast_timesteps)]

    ret_value = []

    ret_value.append({'target':'Forecast','datapoints': zip(middle_predictions, fcast_timestamps)})
    ret_value.append({'target':'Upper','datapoints':zip(upper,fcast_timestamps)})
    ret_value.append({'target':'Lower','datapoints':zip(lower,fcast_timestamps)})
    return ret_value

if __name__=='__main__':
    import numpy as np
    import scipy.stats
    from math import sqrt

    null = None
    data_in = [{"target": "average", "datapoints": [[null, 1435688679],
    [34.870499801635745, 1435688694], [null, 1435688709], [null,
    1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769],
    [null, 1435688784], [null, 1435688799], [null, 1435688814], [null,
    1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874],
    [null, 1435688889], [null, 1435688904], [null, 1435688919], [null,
    1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979],
    [38.180000209808348, 1435688994], [null, 1435689009], [null,
    1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069],
    [null, 1435689084], [null, 1435689099], [null, 1435689114], [null,
    1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174],
    [null, 1435689189], [null, 1435689204], [null, 1435689219], [null,
    1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279],
    [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324],
    [null, 1435689339], [null, 1435689354], [null, 1435689369], [null,
    1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429],
    [null, 1435689444], [null, 1435689459], [null, 1435689474], [null,
    1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534],
    [null, 1435689549], [null, 1435689564]]}]

    #translate the data.  There may be better ways if you're
    #prepared to use pandas / input data is proper json
    time_series = data_in[0]["datapoints"]

    epoch_in = []
    Y_observed = []

    for (y,x) in time_series:
        if y and x:
            epoch_in.append(x)
            Y_observed.append(y)

    #Pass in the number of days to forecast
    fcast_days = 30
    res = holt_predict(Y_observed,epoch_in,fcast_days)
    data_out = data_in + res
    #data_out now holds the data as you wanted it.

    #Optional plot of results
    import matplotlib.pyplot as plt
    plt.plot(epoch_in,Y_observed)
    m,tstamps = zip(*res[0]['datapoints'])
    u,tstamps = zip(*res[1]['datapoints'])
    l,tstamps = zip(*res[2]['datapoints'])
    plt.plot(tstamps,u, label="upper")
    plt.plot(tstamps,l, label="lower")
    plt.plot(tstamps,m, label="mean")
    plt.show()

N.B. The output I’ve given adds points as tuple type into your object. If you really need list, then replace zip(upper,fcast_timestamps) with map(list,zip(upper,fcast_timestamps)) where the code adds upper, lower and Forecast dicts to the result.

This code is for the particular case of the Holt’s linear algorithm – it is not a generic way to calculate correct prediction intervals.

Important note

Your sample input data seems to have a lot of null and only 3 genuine
data points. This is highly unlikely to be a good basis for doing
timeseries prediction – especially as they all seem to be with 15 minutes and you’re trying to forecast up to 3 months!. Indeed – if you feed that data into the R
holt(), it will say:

You've got to be joking. I need more data!

i’m assuming you have a larger dataset to test on. I tried the code above on the stock market opening prices for 2015 and it seemed to give reasonable results (see below).

Code used on 2015 stock market opening prices

You may think the prediction intervals look a little wide. This blog from the author of the R forecast module implies that is intentional, though :

“confidence intervals for the mean are much narrower than prediction intervals”

Prediction intervals maths

Prediction intervals code (i.e. how to proceed)

Important note

Leave a Comment Cancel reply