时间序列预测模型实战案例(Xgboost)(Python)(机器学习)包括时间序列预测和时间序列分类，点击即可运行！

java1314777Xgboost-ETTh1.zip 407.12KB

资源文件列表:

Xgboost-ETTh1.zip 大约有3个文件

ETTh1.csv 2.47MB
shareXgboost.py 2.74KB
xgboostforecast.py 4.95KB

资源介绍:

内容概要资源包括三部分(时间序列预测部分和时间序列分类部分和所需的测试数据集全部包含在内) 在本次实战案例中，我们将使用Xgboost算法进行时间序列预测。Xgboost是一种强大的梯度提升树算法，适用于各种机器学习任务，它最初主要用于解决分类问题，在此基础上也可以应用于时间序列预测。时间序列预测是通过分析过去的数据模式来预测未来的数值趋势。它在许多领域中都有广泛的应用，包括金融、天气预报、股票市场等。我们将使用Python编程语言来实现这个案例。其中包括模型训练部分和保存部分，可以将模型保存到本地，一旦我们完成了模型的训练，我们可以使用它来进行预测。我们将选择合适的输入特征，并根据模型的预测结果来生成未来的数值序列。最后，我们会将预测结果与实际观测值进行对比，评估模型的准确性和性能。适合人群：时间序列预测的学习者,机器学习的学习者，能学到什么：本模型能够让你对机器学习和时间序列预测有一个清楚的了解，其中还包括数据分析部分和特征工程的代码操作阅读建议：大家可以仔细阅读代码部分，其中包括每一步的注释帮助读者进行理解，其中涉及到的知识有数据分析部分和特征工程的代码操作。

import matplotlib.pyplot as plt import numpy as np import pandas as pd from sklearn.model_selection import cross_val_score, TimeSeriesSplit from sklearn.preprocessing import StandardScaler from xgboost import XGBRegressor def timeseries_train_test_split(X, y, test_size): """ Perform train-test split with respect to time series structure """ # get the index after which test set starts test_index = int(len(X) * (1 - test_size)) X_train = X.iloc[:test_index] y_train = y.iloc[:test_index] X_test = X.iloc[test_index:] y_test = y.iloc[test_index:] return X_train, X_test, y_train, y_test def code_mean(data, cat_feature, real_feature): """ cat_feature:类别型特征，如星期几； real_feature：target字段 """ return dict(data.groupby(cat_feature)[real_feature].mean()) def mean_absolute_percentage_error(y_true, y_pred): return np.mean(np.abs((y_true - y_pred) / y_true)) * 100 def plotModelResults(model, X_train, X_test, plot_intervals=False, plot_anomalies=False, scale=1.96): """ Plots modelled vs fact values, prediction intervals and anomalies """ prediction = model.predict(X_test) plt.figure(figsize=(15, 7)) plt.plot(prediction, "g", label="prediction", linewidth=2.0) plt.plot(y_test.values, label="actual", linewidth=2.0) if plot_intervals: cv = cross_val_score(model, X_train, y_train, cv=tscv, scoring="neg_mean_squared_error") # mae = cv.mean() * (-1) deviation = np.sqrt(cv.std()) lower = prediction - (scale * deviation) upper = prediction + (scale * deviation) plt.plot(lower, "r--", label="upper bond / lower bond", alpha=0.5) plt.plot(upper, "r--", alpha=0.5) if plot_anomalies: anomalies = np.array([np.NaN] * len(y_test)) anomalies[y_test < lower] = y_test[y_test < lower] anomalies[y_test > upper] = y_test[y_test > upper] plt.plot(anomalies, "o", markersize=10, label="Anomalies") error = mean_absolute_percentage_error(prediction, y_test) plt.title("Mean absolute percentage error {0:.2f}%".format(error)) plt.legend(loc="best") plt.tight_layout() plt.grid(True); plt.show() def prepareData(series, lag_start, lag_end, test_size, target_encoding=False): """ series: pd.DataFrame dataframe with timeseries lag_start: int initial step back in time to slice target variable example - lag_start = 1 means that the model will see yesterday's values to predict today lag_end: int final step back in time to slice target variable example - lag_end = 4 means that the model will see up to 4 days back in time to predict today test_size: float size of the test dataset after train/test split as percentage of dataset target_encoding: boolean if True - add target averages to the dataset """ # copy of the initial dataset data = pd.DataFrame(series.copy()).loc[:, ['OT']] data.columns = ["y"] # lags of series for i in range(lag_start, lag_end): data["lag_{}".format(i)] = data.y.shift(i) # # datetime features data.index = pd.to_datetime(data.index) data["hour"] = data.index.hour data["weekday"] = data.index.weekday data['is_weekend'] = data.weekday.isin([5, 6]) * 1 if target_encoding: # calculate averages on train set only test_index = int(len(data.dropna()) * (1 - test_size)) data['weekday_average'] = list(map( code_mean(data[:test_index], 'weekday', "y").get, data.weekday)) # frop encoded variables data.drop(["weekday"], axis=1, inplace=True) # train-test split y = data.dropna().y X = data.dropna().drop(['y'], axis=1) X = pd.get_dummies(X) X_train, X_test, y_train, y_test = \ timeseries_train_test_split(X, y, test_size=test_size) return X_train, X_test, y_train, y_test if __name__ == '__main__': """ "XGBoost（机器学习）", """ df = pd.read_csv('ETTh1.csv') df['OT'].fillna(0, inplace=True) df.set_index('date', inplace=True) hp_raw = df[['OT']] tscv = TimeSeriesSplit(n_splits=5) # reserve 30% of data for testing X_train, X_test, y_train, y_test = \ prepareData(hp_raw, lag_start=1, lag_end=28, test_size=0.1, target_encoding=True) scaler = StandardScaler() X_train_scaled = scaler.fit_transform(X_train) X_test_scaled = scaler.transform(X_test) xgb = XGBRegressor() xgb.fit(X_train_scaled, y_train) plotModelResults(xgb, X_train=X_train_scaled, X_test=X_test_scaled, plot_intervals=True, plot_anomalies=True)