[譯] 時(shí)間序列分析、可視化、和使用 LSTM 預(yù)測(cè)

昵稱47241897 2019-05-31

展開全文

數(shù)據(jù)

該數(shù)據(jù)是在近四年的時(shí)間里對(duì)一個(gè)家庭以一分鐘采樣率測(cè)量的電力消耗，可以在這里下載。

數(shù)據(jù)包括不同的電量值和一些分表的數(shù)值。然而，我們只關(guān)注 Global_active_power 這個(gè)變量。

import numpy as npimport matplotlib.pyplot as pltimport pandas as pd
pd.set_option('display.float_format', lambda x: '%.4f' % x)import seaborn as sns
sns.set_context("paper", font_scale=1.3)
sns.set_style('white')import warnings
warnings.filterwarnings('ignore')from time import timeimport matplotlib.ticker as tkrfrom scipy import statsfrom statsmodels.tsa.stattools import adfullerfrom sklearn import preprocessingfrom statsmodels.tsa.stattools import pacf
%matplotlib inlineimport mathimport kerasfrom keras.models import Sequentialfrom keras.layers import Densefrom keras.layers import LSTMfrom keras.layers import Dropoutfrom keras.layers import *from sklearn.preprocessing import MinMaxScalerfrom sklearn.metrics import mean_squared_errorfrom sklearn.metrics import mean_absolute_errorfrom keras.callbacks import EarlyStopping

df=pd.read_csv('household_power_consumption.txt', delimiter=';')
print('Number of rows and columns:', df.shape)
df.head(5)復(fù)制代碼

以下數(shù)據(jù)預(yù)處理和特征工程步驟需要完成：

將日期和時(shí)間合并到同一列，并轉(zhuǎn)換為 datetime 類型。
將 Global_active_power 轉(zhuǎn)換為數(shù)值型，并移除缺失值（1.2%）。
創(chuàng)建年、季度、月和日的特征。
創(chuàng)建周的特征，“0”表示周末，“1”表示工作日。

df['date_time'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df['Global_active_power'] = pd.to_numeric(df['Global_active_power'], errors='coerce')
df = df.dropna(subset=['Global_active_power'])
df['date_time']=pd.to_datetime(df['date_time'])
df['year'] = df['date_time'].apply(lambda x: x.year)
df['quarter'] = df['date_time'].apply(lambda x: x.quarter)
df['month'] = df['date_time'].apply(lambda x: x.month)
df['day'] = df['date_time'].apply(lambda x: x.day)
df=df.loc[:,['date_time','Global_active_power', 'year','quarter','month','day']]
df.sort_values('date_time', inplace=True, ascending=True)
df = df.reset_index(drop=True)
df["weekday"]=df.apply(lambda row: row["date_time"].weekday(),axis=1)
df["weekday"] = (df["weekday"] < 5).astype(int)

print('Number of rows and columns after removing missing values:', df.shape)
print('The time series starts from: ', df.date_time.min())
print('The time series ends on: ', df.date_time.max())復(fù)制代碼

移除缺失值之后，數(shù)據(jù)包括從 2006 年 12 月到 2010 年 11 月（47 個(gè)月）共 2,049,280 個(gè)測(cè)量值。

初始數(shù)據(jù)包括多個(gè)變量。這里我們只會(huì)關(guān)注一個(gè)單獨(dú)的變量：房屋的 Global_active_power 歷史記錄，也就是整個(gè)房屋平均每分鐘消耗的有功功率，單位是千瓦。

統(tǒng)計(jì)正態(tài)性檢驗(yàn)

有一些統(tǒng)計(jì)測(cè)試方法可以用來(lái)量化我們的數(shù)據(jù)是否看起來(lái)像高斯分布采樣。我們將會(huì)使用 D’Agostino’s K2 檢驗(yàn)。

在 SciPy 對(duì)這個(gè)檢驗(yàn)的實(shí)現(xiàn)中，我們對(duì) p 值做出如下解釋。

p <= alpha：拒絕 H0，非正態(tài)。
p > alpha：不拒絕 H0，正態(tài)。

stat, p = stats.normaltest(df.Global_active_power)
print('Statistics=%.3f, p=%.3f' % (stat, p))
alpha = 0.05if p > alpha:
    print('Data looks Gaussian (fail to reject H0)')else:
    print('Data does not look Gaussian (reject H0)')復(fù)制代碼

同時(shí)我們也會(huì)計(jì)算峰度和偏度，以確定數(shù)據(jù)分布是否偏離正態(tài)分布。

sns.distplot(df.Global_active_power);
print( 'Kurtosis of normal distribution: {}'.format(stats.kurtosis(df.Global_active_power)))
print( 'Skewness of normal distribution: {}'.format(stats.skew(df.Global_active_power)))復(fù)制代碼

峰度：描述分布的尾重

正態(tài)分布的峰度接近于 0。如果峰度大于 0，則分布尾部較重。如果峰度小于 0，則分布尾部較輕。我們計(jì)算出的峰度是大于 0 的。

偏度: 度量分布的不對(duì)稱性

如果偏度介于 -0.5 和 0.5 之間，則數(shù)據(jù)是基本對(duì)稱的。如果偏度介于 -1 和 -0.5 之間或者 0.5 和 1 之間，則數(shù)據(jù)是稍微偏斜的。如果偏度小于 -1 或大于 1，則數(shù)據(jù)是高度偏斜的。我們計(jì)算出的偏度是大于 1 的。

第一個(gè)時(shí)間序列圖像

df1=df.loc[:,['date_time','Global_active_power']]
df1.set_index('date_time',inplace=True)
df1.plot(figsize=(12,5))
plt.ylabel('Global active power')
plt.legend().set_visible(False)
plt.tight_layout()
plt.title('Global Active Power Time Series')
sns.despine(top=True)
plt.show();復(fù)制代碼

很明顯，這個(gè)圖像并不是我們想要的。不要這么做。

年度和季度總體有功功率箱形圖對(duì)比

plt.figure(figsize=(14,5))
plt.subplot(1,2,1)
plt.subplots_adjust(wspace=0.2)
sns.boxplot(x="year", y="Global_active_power", data=df)
plt.xlabel('year')
plt.title('Box plot of Yearly Global Active Power')
sns.despine(left=True)
plt.tight_layout()

plt.subplot(1,2,2)
sns.boxplot(x="quarter", y="Global_active_power", data=df)
plt.xlabel('quarter')
plt.title('Box plot of Quarterly Global Active Power')
sns.despine(left=True)
plt.tight_layout();復(fù)制代碼

當(dāng)并排比較每年的箱形圖時(shí)，我們注意到 2006 年的總體有功功率的中位數(shù)相比于其他年份高很多。其實(shí)這里會(huì)有一點(diǎn)誤導(dǎo)。如果你還記得，我們只有 2006 年 12 月的數(shù)據(jù)。而很明顯 12 月是一個(gè)家庭電力消耗的高峰月。

季度總體有功功率的中位數(shù)就比較符合預(yù)期，第一、四季度（冬季）較高，第三季度（夏季）最低。

總體有功功率分布

plt.figure(figsize=(14,6))
plt.subplot(1,2,1)
df['Global_active_power'].hist(bins=50)
plt.title('Global Active Power Distribution')

plt.subplot(1,2,2)
stats.probplot(df['Global_active_power'], plot=plt);
df1.describe().T復(fù)制代碼

正態(tài)概率圖也顯示這個(gè)數(shù)據(jù)與正態(tài)分布偏離很大。

按照天、周、月、季度和年重新抽樣平均總體有功功率

fig = plt.figure(figsize=(18,16))
fig.subplots_adjust(hspace=.4)
ax1 = fig.add_subplot(5,1,1)
ax1.plot(df1['Global_active_power'].resample('D').mean(),linewidth=1)
ax1.set_title('Mean Global active power resampled over day')
ax1.tick_params(axis='both', which='major')

ax2 = fig.add_subplot(5,1,2, sharex=ax1)
ax2.plot(df1['Global_active_power'].resample('W').mean(),linewidth=1)
ax2.set_title('Mean Global active power resampled over week')
ax2.tick_params(axis='both', which='major')

ax3 = fig.add_subplot(5,1,3, sharex=ax1)
ax3.plot(df1['Global_active_power'].resample('M').mean(),linewidth=1)
ax3.set_title('Mean Global active power resampled over month')
ax3.tick_params(axis='both', which='major')

ax4  = fig.add_subplot(5,1,4, sharex=ax1)
ax4.plot(df1['Global_active_power'].resample('Q').mean(),linewidth=1)
ax4.set_title('Mean Global active power resampled over quarter')
ax4.tick_params(axis='both', which='major')

ax5  = fig.add_subplot(5,1,5, sharex=ax1)
ax5.plot(df1['Global_active_power'].resample('A').mean(),linewidth=1)
ax5.set_title('Mean Global active power resampled over year')
ax5.tick_params(axis='both', which='major');復(fù)制代碼

通常來(lái)說(shuō)，我們的時(shí)間序列不會(huì)存在上升或下降的趨勢(shì)。最高的平均耗電量似乎是在 2007 年之前，實(shí)際上這是因?yàn)槲覀冊(cè)?2007 年只有 12 月的數(shù)據(jù)（譯者注：原文有誤，應(yīng)該是只有 2006 年 12 月的數(shù)據(jù)），而那個(gè)月是用電高峰月。也就是說(shuō)，如果我們逐年比較，這個(gè)序列其實(shí)較為平穩(wěn)。

繪制總體有功功率均值圖，并以年、季、月和天分組

plt.figure(figsize=(14,8))
plt.subplot(2,2,1)
df.groupby('year').Global_active_power.agg('mean').plot()
plt.xlabel('')
plt.title('Mean Global active power by Year')

plt.subplot(2,2,2)
df.groupby('quarter').Global_active_power.agg('mean').plot()
plt.xlabel('')
plt.title('Mean Global active power by Quarter')

plt.subplot(2,2,3)
df.groupby('month').Global_active_power.agg('mean').plot()
plt.xlabel('')
plt.title('Mean Global active power by Month')

plt.subplot(2,2,4)
df.groupby('day').Global_active_power.agg('mean').plot()
plt.xlabel('')
plt.title('Mean Global active power by Day');復(fù)制代碼

以上的圖像證實(shí)了我們之前的發(fā)現(xiàn)。以年為單位，序列較為平穩(wěn)。以季度為單位，最低的平均耗電量處于第三季度。以月為單位，最低的平均耗電量處于七月和八月。以天為單位，最低的平均耗電量大約在每月的 8 號(hào)（不知道為什么）。

每年的總體有功功率

這一次，我們移除 2006 年。

pd.pivot_table(df.loc[df['year'] != 2006], values = "Global_active_power",
               columns = "year", index = "month").plot(subplots = True, figsize=(12, 12), layout=(3, 5), sharey=True);復(fù)制代碼

從 2007 年到 2010 年，每年的模式都很相似。

工作日和周末的總體有功功率對(duì)比

dic={0:'Weekend',1:'Weekday'}
df['Day'] = df.weekday.map(dic)

a=plt.figure(figsize=(9,4))
plt1=sns.boxplot('year','Global_active_power',hue='Day',width=0.6,fliersize=3,
                    data=df)
a.legend(loc='upper center', bbox_to_anchor=(0.5, 1.00), shadow=True, ncol=2)
sns.despine(left=True, bottom=True)
plt.xlabel('')
plt.tight_layout()
plt.legend().set_visible(False);復(fù)制代碼

在 2010 年以前，工作日的總體有功功率的中位數(shù)要比周末低一些。在 2010 年，它們完全相等。

工作日和周末的總體有功功率對(duì)比的因素圖

plt1=sns.factorplot('year','Global_active_power',hue='Day',
                    data=df, size=4, aspect=1.5, legend=False)
plt.title('Factor Plot of Global active power by Weekend/Weekday')
plt.tight_layout()
sns.despine(left=True, bottom=True)
plt.legend(loc='upper right');復(fù)制代碼

以年為單位，工作日和周末都遵循同樣的模式。

原則上，當(dāng)使用 LSTM 時(shí)，我們不需要去檢驗(yàn)或修正平穩(wěn)性。然而，如果數(shù)據(jù)是平穩(wěn)的，它會(huì)幫助模型提高性能，使神經(jīng)網(wǎng)絡(luò)更容易學(xué)習(xí)。

平穩(wěn)性

在統(tǒng)計(jì)學(xué)中，Dickey–Fuller test 檢驗(yàn)了一個(gè)零假設(shè)，即單位根存在于自回歸模型中。備擇假設(shè)依據(jù)使用的檢驗(yàn)方法的不同而不同，但是通常為平穩(wěn)性或趨勢(shì)平穩(wěn)性。

平穩(wěn)序列的均值和方差一直是常數(shù)。時(shí)間序列在滑動(dòng)窗口下的均值和標(biāo)準(zhǔn)差不隨時(shí)間變化。

Dickey-Fuller 檢驗(yàn)

零檢驗(yàn)（H0）：表明時(shí)間序列有一個(gè)單位根，意味著它是非平穩(wěn)的。它包含一些和時(shí)間相關(guān)的成分。

備擇檢驗(yàn)（H1）：表明時(shí)間序列不存在單位根，意味著它是平穩(wěn)的。它不包含和時(shí)間相關(guān)的成分。

p-value > 0.05：接受零檢驗(yàn)（H0），數(shù)據(jù)有單位根且是非平穩(wěn)的。

p-value <= 0.05：拒絕零檢驗(yàn)（H0），數(shù)據(jù)沒有單位根且是平穩(wěn)的。

df2=df1.resample('D', how=np.mean)def test_stationarity(timeseries):
    rolmean = timeseries.rolling(window=30).mean()
    rolstd = timeseries.rolling(window=30).std()

    plt.figure(figsize=(14,5))
    sns.despine(left=True)
    orig = plt.plot(timeseries, color='blue',label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label = 'Rolling Std')

    plt.legend(loc='best'); plt.title('Rolling Mean & Standard Deviation')
    plt.show()    print ('<Results of Dickey-Fuller Test>')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4],
                         index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)
test_stationarity(df2.Global_active_power.dropna())復(fù)制代碼

從以上結(jié)論可得，我們會(huì)拒絕零檢驗(yàn) H0，因?yàn)閿?shù)據(jù)沒有單位根且是平穩(wěn)的。

LSTM

我們的任務(wù)是根據(jù)一個(gè)家庭兩百萬(wàn)分鐘的耗電量歷史記錄，對(duì)這個(gè)時(shí)間序列做預(yù)測(cè)。我們將使用一個(gè)多層的 LSTM 遞歸神經(jīng)網(wǎng)絡(luò)來(lái)預(yù)測(cè)時(shí)間序列的最后一個(gè)值。

如果你想縮減計(jì)算時(shí)間，并快速獲得結(jié)果來(lái)檢驗(yàn)?zāi)Ｐ?，你可以?duì)數(shù)據(jù)以小時(shí)為單位重新采樣。在本文的實(shí)驗(yàn)中我會(huì)維持原單位為分鐘。

在構(gòu)建 LSTM 模型之前，需要進(jìn)行下列數(shù)據(jù)預(yù)處理和特征工程的工作。

創(chuàng)建數(shù)據(jù)集，保證所有的數(shù)據(jù)的類型都是 float。
特征標(biāo)準(zhǔn)化。
分割訓(xùn)練集和測(cè)試集。
將數(shù)值數(shù)組轉(zhuǎn)換為數(shù)據(jù)集矩陣。
將維度轉(zhuǎn)化為 X=t 和 Y=t+1。
將輸入維度轉(zhuǎn)化為三維 (num_samples, num_timesteps, num_features)。

dataset = df.Global_active_power.values #numpy.ndarraydataset = dataset.astype('float32')
dataset = np.reshape(dataset, (-1, 1))
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)
train_size = int(len(dataset) * 0.80)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]def create_dataset(dataset, look_back=1):
    X, Y = [], []    for i in range(len(dataset)-look_back-1):
        a = dataset[i:(i+look_back), 0]
        X.append(a)
        Y.append(dataset[i + look_back, 0])    return np.array(X), np.array(Y)

look_back = 30X_train, Y_train = create_dataset(train, look_back)
X_test, Y_test = create_dataset(test, look_back)# 將輸入維度轉(zhuǎn)化為 [samples, time steps, features]X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))
X_test = np.reshape(X_test, (X_test.shape[0], 1, X_test.shape[1]))復(fù)制代碼

模型結(jié)構(gòu)

定義 LSTM 模型，第一個(gè)隱藏層含有 100 個(gè)神經(jīng)元，輸出層含有 1 個(gè)神經(jīng)元，用于預(yù)測(cè) Global_active_power。輸入的維度是一個(gè)包含 30 個(gè)特征的時(shí)間步長(zhǎng)。
Dropout 20%。
使用均方差損失函數(shù)，和改進(jìn)于隨機(jī)梯度下降的效率更高的 Adam。
模型將會(huì)進(jìn)行 20 個(gè) epochs 的訓(xùn)練，每個(gè) batch 的大小為 70。

model = Sequential()
model.add(LSTM(100, input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

history = model.fit(X_train, Y_train, epochs=20, batch_size=70, validation_data=(X_test, Y_test),
                    callbacks=[EarlyStopping(monitor='val_loss', patience=10)], verbose=1, shuffle=False)

model.summary()復(fù)制代碼

做出預(yù)測(cè)

train_predict = model.predict(X_train)
test_predict = model.predict(X_test)# 預(yù)測(cè)值求逆train_predict = scaler.inverse_transform(train_predict)
Y_train = scaler.inverse_transform([Y_train])
test_predict = scaler.inverse_transform(test_predict)
Y_test = scaler.inverse_transform([Y_test])

print('Train Mean Absolute Error:', mean_absolute_error(Y_train[0], train_predict[:,0]))
print('Train Root Mean Squared Error:',np.sqrt(mean_squared_error(Y_train[0], train_predict[:,0])))
print('Test Mean Absolute Error:', mean_absolute_error(Y_test[0], test_predict[:,0]))
print('Test Root Mean Squared Error:',np.sqrt(mean_squared_error(Y_test[0], test_predict[:,0])))復(fù)制代碼

繪制模型損失

plt.figure(figsize=(8,4))
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Test Loss')
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epochs')
plt.legend(loc='upper right')
plt.show();復(fù)制代碼

比較真實(shí)值和預(yù)測(cè)值

在我的結(jié)果中，每個(gè)時(shí)間步長(zhǎng)是 1 分鐘。如果你之前以小時(shí)重新采樣了數(shù)據(jù)，那么在你的結(jié)果里每個(gè)時(shí)間步長(zhǎng)是 1 小時(shí)。

我將會(huì)比較最近 200 分鐘的真實(shí)值和預(yù)測(cè)值。

aa=[x for x in range(200)]
plt.figure(figsize=(8,4))
plt.plot(aa, Y_test[0][:200], marker='.', label="actual")
plt.plot(aa, test_predict[:,0][:200], 'r', label="prediction")# plt.tick_params(left=False, labelleft=True) # 移除 ticksplt.tight_layout()
sns.despine(top=True)
plt.subplots_adjust(left=0.07)
plt.ylabel('Global_active_power', size=15)
plt.xlabel('Time step', size=15)
plt.legend(fontsize=15)
plt.show();復(fù)制代碼

LSTMs 太神奇了！

Jupyter notebook 可以在 Github 中找到。享受這一周余下的時(shí)光吧！

參考：Multivariate Time Series Forecasting with LSTMs in Keras

本站是提供個(gè)人知識(shí)管理的網(wǎng)絡(luò)存儲(chǔ)空間，所有內(nèi)容均由用戶發(fā)布，不代表本站觀點(diǎn)。請(qǐng)注意甄別內(nèi)容中的聯(lián)系方式、誘導(dǎo)購(gòu)買等信息，謹(jǐn)防詐騙。如發(fā)現(xiàn)有害或侵權(quán)內(nèi)容，請(qǐng)點(diǎn)擊一鍵舉報(bào)。

小男孩‘自慰网亚洲一区二区,亚洲一级在线播放毛片,亚洲中文字幕av每天更新,黄aⅴ永久免费无码,91成人午夜在线精品,色网站免费在线观看,亚洲欧洲wwwww在线观看

[譯] 時(shí)間序列分析、可視化、和使用 LSTM 預(yù)測(cè)