Bike Sharing Demand

Bike sharing demand forecasting [Kaggle]

Goal

Forecast hourly demand for Capital Bikeshare in Washington, D.C.
Analyze historical usage patterns and weather data to predict bike rental demand

Data fields

Historical usage patterns: time (hour), holiday flag
Weather data: season, weather, temperature, feels-like temperature, humidity, wind speed

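Field	Description
`datetime`	Date and hour timestamp
`season`	1: spring, 2: summer, 3: fall, 4: winter
`holiday`	Whether the day is a holiday
`workingday`	Whether the day is neither a weekend nor a holiday
`weather`	1: Clear, Few clouds, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
`temp`	Temperature in Celsius
`atemp`	Feels-like temperature in Celsius
`humidity`	Humidity
`windspeed`	Wind speed
`casual`	Number of rentals by non-registered users
`registered`	Number of rentals by registered users
`count`	Total number of rentals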

The casual, registered, and count fields are the values to predict.
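The three target fields are related: in the first five training rows printed by train.head() in this notebook, count equals casual plus registered. A minimal sketch of that check, using only those five rows (the variable names are illustrative):

```python
import pandas as pd

# First five training rows as shown by train.head() in this notebook
rows = pd.DataFrame({
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "count": [16, 40, 32, 13, 1],
})

# True when count is the sum of the two user groups in every row
consistent = (rows["casual"] + rows["registered"] == rows["count"]).all()
print(consistent)
```

The same expression can be run on the full train DataFrame once it is loaded.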

EDA & prediction

Loading libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

%matplotlib inline
plt.rc("font", family="Malgun Gothic")

Loading the dataset

## Input data files are available in the read-only "../input/" directory
## For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/bike-sharing-demand/sampleSubmission.csv
/kaggle/input/bike-sharing-demand/train.csv
/kaggle/input/bike-sharing-demand/test.csv

train = pd.read_csv("/kaggle/input/bike-sharing-demand/train.csv", parse_dates=['datetime'])
test = pd.read_csv("/kaggle/input/bike-sharing-demand/test.csv", parse_dates=['datetime'])
sample = pd.read_csv("/kaggle/input/bike-sharing-demand/sampleSubmission.csv", parse_dates=['datetime'])

Dataset summary

Data shape

print(train.shape, test.shape, sample.shape)

(10886, 12) (6493, 9) (6493, 2)

Data fields

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   casual      10886 non-null  int64         
 10  registered  10886 non-null  int64         
 11  count       10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    6493 non-null   datetime64[ns]
 1   season      6493 non-null   int64         
 2   holiday     6493 non-null   int64         
 3   workingday  6493 non-null   int64         
 4   weather     6493 non-null   int64         
 5   temp        6493 non-null   float64       
 6   atemp       6493 non-null   float64       
 7   humidity    6493 non-null   int64         
 8   windspeed   6493 non-null   float64       
dtypes: datetime64[ns](1), float64(3), int64(5)
memory usage: 456.7 KB

The casual, registered, and count fields exist in the training data but not in the test data.

According to sampleSubmission.csv, count must be predicted for each date and hour.

Data sample

train.head()

	datetime	season	weather	temp	atemp	humidity	casual	registered	count
0	2011-01-01 00:00:00	1	1	9.84	14.395	81	3	13	16
1	2011-01-01 01:00:00	1	1	9.02	13.635	80	8	32	40
2	2011-01-01 02:00:00	1	1	9.02	13.635	80	5	27	32
3	2011-01-01 03:00:00	1	1	9.84	14.395	75	3	10	13
4	2011-01-01 04:00:00	1	1	9.84	14.395	75	0	1	1

Checking for missing values

train.isnull().sum()

datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
casual        0
registered    0
count         0
dtype: int64

test.isnull().sum()

datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
dtype: int64

fig, (ax1, ax2, ax3, ax4) = plt.subplots(ncols=4)
fig.set_size_inches(20, 10)
sns.regplot(data=train, x='temp', y='count', ax=ax1)
sns.regplot(data=train, x='atemp', y='count', ax=ax2)
sns.regplot(data=train, x='windspeed', y='count', ax=ax3)
sns.regplot(data=train, x='humidity', y='count', ax=ax4)

<AxesSubplot:xlabel='humidity', ylabel='count'>

[Figure: regression plots of count against temp, atemp, windspeed, and humidity]

print(len(train[train['windspeed'] == 0]), str(len(train[train['windspeed'] == 0])/len(train)*100)+"%")

1313 12.061363218813154%

About 12% of the windspeed values are exactly 0, and the interval right above 0 is empty.

It is reasonable to assume that missing values were recorded as 0.
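The two observations behind this assumption, the share of zeros and the gap above zero, can be computed directly. The following is a minimal sketch on toy values; `zero_gap_summary` is a hypothetical helper that is not part of the notebook, and the numbers are illustrative rather than taken from the competition data.

```python
import pandas as pd

def zero_gap_summary(values: pd.Series) -> dict:
    """Report the share of exact zeros and the smallest value above zero."""
    positive = values[values > 0]
    return {
        "zero_share": float((values == 0).mean()),
        "smallest_positive": float(positive.min()) if not positive.empty else None,
    }

# Toy windspeed column (illustrative numbers only)
toy = pd.Series([0.0, 0.0, 6.0032, 7.0015, 8.9981, 11.0014, 0.0, 12.998])
summary = zero_gap_summary(toy)
print(summary)
```

A large zero share combined with a smallest positive value far from 0 is what makes "0 means not recorded" plausible; it is still an assumption, not something the data can prove.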

EDA

Rentals by time period

def build_datetime_features(df):
    ## Build date and time features
    df['season_str'] = df['season'].map({1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"})
    df['year'] = df['datetime'].dt.year
    df['month'] = df['datetime'].dt.month
    df['day'] = df['datetime'].dt.day
    df['weekday'] = df['datetime'].dt.dayofweek
    df['hour'] = df['datetime'].dt.hour
    return df

train = build_datetime_features(train)

train.head()

	datetime	season	weather	temp	atemp	humidity	casual	registered	count	season_str	year	month	day	weekday	hour
0	2011-01-01 00:00:00	1	1	9.84	14.395	81	3	13	16	Spring	2011	1	1	5	0
1	2011-01-01 01:00:00	1	1	9.02	13.635	80	8	32	40	Spring	2011	1	1	5	1
2	2011-01-01 02:00:00	1	1	9.02	13.635	80	5	27	32	Spring	2011	1	1	5	2
3	2011-01-01 03:00:00	1	1	9.84	14.395	75	3	10	13	Spring	2011	1	1	5	3
4	2011-01-01 04:00:00	1	1	9.84	14.395	75	0	1	1	Spring	2011	1	1	5	4

## Barplot
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8)) = plt.subplots(nrows=4, ncols=2)
fig.set_size_inches(20, 15)
plt.subplots_adjust(wspace=0.2, hspace=0.3)

ax1.set(title="Rental by year")
sns.barplot(data=train, x='year', y='count', orient='v', ax=ax1)

ax2.set(title="Rental by season")
sns.barplot(data=train, x='season_str', y='count', ax=ax2)

ax3.set(title="Monthly rental")
sns.barplot(data=train, x='month', y='count', ax=ax3)

ax4.set(title="Daily rental")
sns.barplot(data=train, x='day', y='count', ax=ax4)

ax5.set(title="Hourly rental")
sns.barplot(data=train, x='hour', y='count', ax=ax5)

ax6.set(title="Rental by day of week")
sns.barplot(data=train, x='weekday', y='count', ax=ax6,
            palette=['gray', 'gray', 'gray', 'gray', 'gray', 'blue', 'red'])

ax7.set(title="Weekday/holiday rental")
sns.barplot(data=train, x='holiday', y='count', ax=ax7)

ax8.set(title="Rental by weather")
sns.barplot(data=train, x='weather', y='count', ax=ax8)

<AxesSubplot:title={'center':'Rental by weather'}, xlabel='weather', ylabel='count'>

[Figure: bar plots of count by year, season, month, day, hour, day of week, holiday, and weather]

## Boxplot
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8)) = plt.subplots(nrows=4, ncols=2)
fig.set_size_inches(20, 20)
plt.subplots_adjust(wspace=0.2, hspace=0.3)

ax1.set(title="Rental by year")
sns.boxplot(data=train, x='year', y='count', ax=ax1)

ax2.set(title="Rental by season")
sns.boxplot(data=train, x='season_str', y='count', ax=ax2)

ax3.set(title="Monthly rental")
sns.boxplot(data=train, x='month', y='count', ax=ax3)

ax4.set(title="Daily rental")
sns.boxplot(data=train, x='day', y='count', ax=ax4)

ax5.set(title="Hourly rental")
sns.boxplot(data=train, x='hour', y='count', ax=ax5)

ax6.set(title="Rental by day of week")
sns.boxplot(data=train, x='weekday', y='count', ax=ax6,
            palette=['gray', 'gray', 'gray', 'gray', 'gray', 'blue', 'red'])

ax7.set(title="Weekday/holiday rental")
sns.boxplot(data=train, x='holiday', y='count', ax=ax7)

ax8.set(title="Rental by weather")
sns.boxplot(data=train, x='weather', y='count', ax=ax8)

<AxesSubplot:title={'center':'Rental by weather'}, xlabel='weather', ylabel='count'>

[Figure: box plots of count by year, season, month, day, hour, day of week, holiday, and weather]

## Rentals by hour of day
fig, (ax1, ax2, ax3, ax4, ax5, ax6) = plt.subplots(nrows=6)
fig.set_size_inches(20, 20)

sns.pointplot(data=train, x='hour', y='count', ax=ax1)
sns.pointplot(data=train, x='hour', y='count', hue='workingday', ax=ax2)
sns.pointplot(data=train, x='hour', y='count', hue='holiday', ax=ax3)
sns.pointplot(data=train, x='hour', y='count', hue='weekday', ax=ax4)
sns.pointplot(data=train, x='hour', y='count', hue='weather', ax=ax5)
sns.pointplot(data=train, x='hour', y='count', hue='season', ax=ax6)

<AxesSubplot:xlabel='hour', ylabel='count'>

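[Figure: point plots of count by hour, overall and split by workingday, holiday, weekday, weather, and season]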

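Rentals by year: 2011 < 2012
Rentals by season: fall > summer > winter > spring
Rentals by month: June > July to September > October > May > November > April > December > March > February > January
Rentals by hour: rentals peak during commuting hours, where the spread is also large
Rentals by day type: days with workingday = 0, Saturdays, and Sundays follow a similar trend and are not affected by commuting hours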

print(train[train['season'] == 1]['month'].unique())
print(train[train['season'] == 2]['month'].unique())
print(train[train['season'] == 3]['month'].unique())
print(train[train['season'] == 4]['month'].unique())

[1 2 3]
[4 5 6]
[7 8 9]
[10 11 12]

Here, season does not mean a season in the everyday sense but a calendar quarter.
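The month-to-season mapping printed above is the same as integer-dividing the month into quarters. A small sketch of that relationship; `quarter_from_month` is a hypothetical helper name introduced only for illustration:

```python
import pandas as pd

def quarter_from_month(month: pd.Series) -> pd.Series:
    """Map months 1..12 to quarters 1..4 (Jan-Mar -> 1, ..., Oct-Dec -> 4)."""
    return (month - 1) // 3 + 1

months = pd.Series(range(1, 13))
seasons = quarter_from_month(months)
print(seasons.tolist())
```

On the real data, comparing `quarter_from_month(train['month'])` with `train['season']` would confirm whether the column carries any information beyond the month.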

Removing outliers

train_normalized = train[np.abs(train['count'] - train['count'].mean()) <= (3*train['count'].std())]

print("Shape of before normalization: ", train.shape)
print("Shape of after normalization: ", train_normalized.shape)

Shape of before normalization:  (10886, 18)
Shape of after normalization:  (10739, 18)
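The filter above keeps rows whose count lies within three standard deviations of the mean. The same rule can be sketched as a reusable function on toy data; `drop_outliers_3sigma` is a hypothetical name, and the toy values are made up for illustration.

```python
import pandas as pd

def drop_outliers_3sigma(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Keep rows whose value is within 3 sample standard deviations of the mean."""
    mean = df[column].mean()
    std = df[column].std()
    mask = (df[column] - mean).abs() <= 3 * std
    return df[mask]

# Thirty ordinary values and one extreme value
toy = pd.DataFrame({"count": [10] * 30 + [1000]})
filtered = drop_outliers_3sigma(toy, "count")
print(len(toy), len(filtered))
```

Note that the mean and standard deviation are themselves pulled toward the extreme values, so the rule is fairly conservative on heavily skewed data such as rental counts.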

## Boxplot
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8)) = plt.subplots(nrows=4, ncols=2)
fig.set_size_inches(20, 20)
plt.subplots_adjust(wspace=0.2, hspace=0.3)

ax1.set(title="Rental by year")
sns.boxplot(data=train_normalized, x='year', y='count', ax=ax1)

ax2.set(title="Rental by season")
sns.boxplot(data=train_normalized, x='season_str', y='count', ax=ax2)

ax3.set(title="Monthly rental")
sns.boxplot(data=train_normalized, x='month', y='count', ax=ax3)

ax4.set(title="Daily rental")
sns.boxplot(data=train_normalized, x='day', y='count', ax=ax4)

ax5.set(title="Hourly rental")
sns.boxplot(data=train_normalized, x='hour', y='count', ax=ax5)

ax6.set(title="Rental by day of week")
sns.boxplot(data=train_normalized, x='weekday', y='count', ax=ax6,
            palette=['gray', 'gray', 'gray', 'gray', 'gray', 'blue', 'red'])

ax7.set(title="Weekday/holiday rental")
sns.boxplot(data=train_normalized, x='holiday', y='count', ax=ax7)

ax8.set(title="Rental by weather")
sns.boxplot(data=train_normalized, x='weather', y='count', ax=ax8)

<AxesSubplot:title={'center':'Rental by weather'}, xlabel='weather', ylabel='count'>

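[Figure: box plots of count after outlier removal, by year, season, month, day, hour, day of week, holiday, and weather]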

Imputing missing values

Missing windspeed values are recorded as 0. The windspeed of those rows is predicted from the other features and filled in.

from sklearn.ensemble import RandomForestClassifier

wind0 = train.loc[train['windspeed'] == 0]
wind_not0 = train.loc[train['windspeed'] != 0]

print("Number of rows with 0 windspeed before prediction: ", len(wind0))

Number of rows with 0 windspeed before prediction:  1313

## Correlation with windspeed, sorted by absolute value in descending order
corr = wind_not0.corr()[['windspeed']]
corr.rename(columns={'windspeed': 'corr'}, inplace=True)
corr['corr_abs'] = corr['corr'].abs()
corr.sort_values(by='corr_abs', ascending=False)

	corr	corr_abs
windspeed	1.000000	1.000000
humidity	-0.328272	0.328272
month	-0.142505	0.142505
season	-0.138272	0.138272
hour	0.126289	0.126289
casual	0.085342	0.085342
count	0.085014	0.085014
registered	0.073669	0.073669
atemp	-0.068576	0.068576
temp	-0.038902	0.038902
year	-0.035825	0.035825
weekday	-0.030849	0.030849
workingday	0.021188	0.021188
holiday	0.015603	0.015603
weather	-0.011837	0.011837
day	0.009141	0.009141

def predict_windspeed(df):
    ## Copy the slices so the assignment below does not write to a view of df
    df_wind0 = df.loc[df['windspeed'] == 0].copy()
    df_wind_not0 = df.loc[df['windspeed'] != 0].copy()
    
    columns = ['humidity', 'month', 'hour', 'season', 'weather', 'atemp', 'temp']
    
    ## Each distinct windspeed value is treated as a class label
    rf_model = RandomForestClassifier()
    rf_model.fit(df_wind_not0[columns], df_wind_not0['windspeed'].astype('str'))
    rf_prediction = rf_model.predict(df_wind0[columns])
    ## The predictions are string labels, so the merged column has object dtype
    df_wind0['windspeed'] = rf_prediction
    
    ## pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    ## Rows that had zero windspeed end up after all the other rows
    result = pd.concat([df_wind_not0, df_wind0]).reset_index(drop=True)
    
    return result

train_before_wind = train.copy()
train = predict_windspeed(train)

print("Number of rows with 0 windspeed after prediction: ", len(train[train['windspeed'] == 0]))

Number of rows with 0 windspeed after prediction:  0
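For comparison, here is a minimal sketch of a different way to do the same imputation: a RandomForestRegressor instead of the classifier used above, filling the values in place so that row order and the float dtype are preserved. This is a swapped-in alternative, not the notebook's method; `impute_zero_windspeed`, the feature list, and the toy values are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_zero_windspeed(df: pd.DataFrame, features: list) -> pd.DataFrame:
    """Return a copy of df where zero windspeed values are replaced by predictions."""
    df = df.copy()
    known = df["windspeed"] != 0
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(df.loc[known, features], df.loc[known, "windspeed"])
    # Assign by boolean mask so the original row order is kept
    df.loc[~known, "windspeed"] = model.predict(df.loc[~known, features])
    return df

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "humidity": [80, 75, 60, 55, 40, 35, 70, 50],
    "hour": [0, 3, 6, 9, 12, 15, 18, 21],
    "windspeed": [6.0, 7.0, 0.0, 11.0, 15.0, 0.0, 9.0, 13.0],
})
filled = impute_zero_windspeed(toy, ["humidity", "hour"])
print(filled["windspeed"].tolist())
```

Because the regressor averages positive training values, its predictions are positive, and the filled column stays numeric for later correlation and modeling steps.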

fig, (ax1, ax2) = plt.subplots(nrows=2)
fig.set_size_inches(20, 15)

sns.countplot(data=train_before_wind, x='windspeed', ax=ax1)
sns.countplot(data=train, x='windspeed', ax=ax2)

<AxesSubplot:xlabel='windspeed', ylabel='count'>

[Figure: count plots of windspeed before and after imputation]

Skewness & Kurtosis

Skewness and kurtosis are used to check how lopsided the data distribution is and to correct it.

print("Skewness: ", train['count'].skew())
print("Kurtosis: ", train['count'].kurt())

Skewness:  1.242066211718077
Kurtosis:  1.3000929518398299
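To put these numbers in context: pandas `skew()` is close to 0 for a symmetric distribution and positive for a long right tail, and `kurt()` reports excess kurtosis, which is 0 for a normal distribution. A small sketch on synthetic data (the samples and seed are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A symmetric sample and a right-skewed sample of the same size
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))
right_skewed = pd.Series(rng.exponential(scale=1.0, size=10_000))

for name, sample in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    print(name, "skew:", round(sample.skew(), 2), "kurtosis:", round(sample.kurt(), 2))
```

A skewness above 1, as measured for count, therefore indicates a clearly right-tailed distribution.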

fig, (ax1, ax2) = plt.subplots(ncols=2)
fig.set_size_inches(20, 10)

sns.histplot(train['count'], kde=True, stat="density", ax=ax1)
stats.probplot(train['count'], dist="norm", fit=True, plot=ax2)

((array([-3.83154229, -3.60754977, -3.48462983, ...,  3.48462983,
          3.60754977,  3.83154229]),
  array([  1,   1,   1, ..., 968, 970, 977])),
 (169.82942673231386, 191.5741319125482, 0.9372682766213176))

[Figure: distribution of count and its normal probability plot]

Log-scale transformation

Because the distribution is skewed, a log-scale transformation is applied.

train['count_log'] = np.log(train['count'])

print("Skewness: ", train['count_log'].skew())
print("Kurtosis: ", train['count_log'].kurt())

Skewness:  -0.9712277227866108
Kurtosis:  0.24662183416964067
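Whichever log variant is used for the target, the predictions must later be mapped back with the matching inverse: `np.exp` for `np.log`, and `np.expm1` for `np.log1p`. A short round-trip sketch with illustrative values:

```python
import numpy as np

# np.log / np.exp: valid only when every count is at least 1
counts = np.array([1, 16, 40, 977])
restored = np.exp(np.log(counts))

# np.log1p / np.expm1: also defined when a count is 0
counts_with_zero = np.array([0, 1, 16])
restored_with_zero = np.expm1(np.log1p(counts_with_zero))

print(restored, restored_with_zero)
```

Mixing the pairs, for example training on `np.log1p` and inverting with `np.exp`, would shift every prediction by one.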

fig, (ax1, ax2) = plt.subplots(ncols=2)
fig.set_size_inches(20, 10)

sns.histplot(train['count_log'], kde=True, stat="density", ax=ax1)
stats.probplot(np.log1p(train['count']), dist="norm", fit=True, plot=ax2)

((array([-3.83154229, -3.60754977, -3.48462983, ...,  3.48462983,
          3.60754977,  3.83154229]),
  array([0.69314718, 0.69314718, 0.69314718, ..., 6.87626461, 6.87832647,
         6.88550967])),
 (1.3647396459244172, 4.591363690454027, 0.9611793780126952))

[Figure: distribution of count_log and normal probability plot of log1p(count)]

One-hot encoding

print(train['weather'].unique())
print(train['season'].unique())
print(train['workingday'].unique())
print(train['holiday'].unique())

[2 1 3 4]
[1 2 3 4]
[0 1]
[0 1]

def one_hot_encoding(df):
    df = pd.get_dummies(df, columns=['weather'], prefix='weather')
    df = pd.get_dummies(df, columns=['season'], prefix='season')
    df = pd.get_dummies(df, columns=['workingday'], prefix='workingday')
    df = pd.get_dummies(df, columns=['holiday'], prefix='holiday')
    return df

train_before_encoding = train.copy()
train = one_hot_encoding(train)

train.columns

Index(['datetime', 'temp', 'atemp', 'humidity', 'windspeed', 'casual',
       'registered', 'count', 'season_str', 'year', 'month', 'day', 'weekday',
       'hour', 'count_log', 'weather_1', 'weather_2', 'weather_3', 'weather_4',
       'season_1', 'season_2', 'season_3', 'season_4', 'workingday_0',
       'workingday_1', 'holiday_0', 'holiday_1'],
      dtype='object')
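`pd.get_dummies` only creates columns for the categories it sees, so train and test can end up with different columns if a category is absent from one of them. A minimal sketch of aligning the two, assuming a hypothetical case where the test split has no weather 4 rows (the toy frames are made up):

```python
import pandas as pd

train_toy = pd.DataFrame({"weather": [1, 2, 3, 4]})
test_toy = pd.DataFrame({"weather": [1, 2, 3]})  # category 4 never occurs here

train_enc = pd.get_dummies(train_toy, columns=["weather"], prefix="weather")
test_enc = pd.get_dummies(test_toy, columns=["weather"], prefix="weather")

# Give the test frame the training columns, filling missing dummies with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```

Comparing `train.columns` and `test.columns` after encoding, apart from the target columns, is a quick way to see whether such alignment is needed.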

Correlation analysis

## Visualize Pearson correlation coefficients as heatmaps
fig, (ax1, ax2) = plt.subplots(figsize=(20, 30), nrows=2)
sns.heatmap(train_before_encoding.corr(), annot=True, fmt=".2f", cmap="BuPu", ax=ax1)
sns.heatmap(train.corr(), annot=True, fmt=".2f", cmap="Blues", ax=ax2)

<AxesSubplot:>

[Figure: correlation heatmaps before and after one-hot encoding]

## Correlation with count, sorted by absolute value in descending order
corr = train.corr()[['count']]
corr.rename(columns={'count': 'corr'}, inplace=True)
corr['corr_abs'] = corr['corr'].abs()
corr.sort_values(by='corr_abs', ascending=False)

	corr	corr_abs
count	1.000000	1.000000
registered	0.970948	0.970948
count_log	0.805773	0.805773
casual	0.690414	0.690414
hour	0.400601	0.400601
temp	0.394454	0.394454
atemp	0.389784	0.389784
humidity	-0.317371	0.317371
year	0.260403	0.260403
season_1	-0.237704	0.237704
month	0.166862	0.166862
season_3	0.136942	0.136942
weather_3	-0.117519	0.117519
weather_1	0.105246	0.105246
season_2	0.075681	0.075681
weather_2	-0.041329	0.041329
season_4	0.023704	0.023704
day	0.019826	0.019826
workingday_1	0.011594	0.011594
workingday_0	-0.011594	0.011594
holiday_1	-0.005393	0.005393
holiday_0	0.005393	0.005393
weekday	-0.002283	0.002283
weather_4	-0.001459	0.001459

Model

Feature engineering

The steps applied to the training data during EDA are applied to the test data as well.

test = build_datetime_features(test)
test = predict_windspeed(test)
test = one_hot_encoding(test)

test.head()

	datetime	temp	atemp	humidity	windspeed	season_str	year	month	day	weekday	...	season_1	workingday_1	holiday_0
0	2011-01-20 00:00:00	10.66	11.365	56	26.0027	Spring	2011	1	20	3	...	1	1	1
1	2011-01-20 03:00:00	10.66	12.880	56	11.0014	Spring	2011	1	20	3	...	1	1	1
2	2011-01-20 04:00:00	10.66	12.880	56	11.0014	Spring	2011	1	20	3	...	1	1	1
3	2011-01-20 05:00:00	9.84	11.365	60	15.0013	Spring	2011	1	20	3	...	1	1	1
4	2011-01-20 06:00:00	9.02	10.605	60	15.0013	Spring	2011	1	20	3	...	1	1	1

5 rows × 23 columns

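Field selection

Fields with a high correlation with count
Among columns with overlapping meaning, the field with less variance is selected
- e.g., workingday and holiday are negatively correlated, but workingday has the smaller variance, so workingday is selected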

test_datetime = test['datetime']
train.drop(['datetime', 'season_str', 'holiday_0', 'holiday_1', 'atemp', 'registered', 'casual'], axis=1, inplace=True)
test.drop(['datetime', 'season_str', 'holiday_0', 'holiday_1', 'atemp'], axis=1, inplace=True)

print(train.columns)
print(test.columns)

Index(['temp', 'humidity', 'windspeed', 'count', 'year', 'month', 'day',
       'weekday', 'hour', 'count_log', 'weather_1', 'weather_2', 'weather_3',
       'weather_4', 'season_1', 'season_2', 'season_3', 'season_4',
       'workingday_0', 'workingday_1'],
      dtype='object')
Index(['temp', 'humidity', 'windspeed', 'year', 'month', 'day', 'weekday',
       'hour', 'weather_1', 'weather_2', 'weather_3', 'weather_4', 'season_1',
       'season_2', 'season_3', 'season_4', 'workingday_0', 'workingday_1'],
      dtype='object')

Gradient boosting

from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import metrics

x_train = train.drop(['count_log', 'count'], axis=1).values
target_label = train['count_log'].values
x_test = test.values

x_train, x_val, y_train, y_val = train_test_split(x_train, target_label, test_size=0.2, random_state=2000)

x_train

array([[14.76, 50, 16.9979, ..., 1, 0, 1],
       [33.62, 43, 19.9995, ..., 0, 1, 0],
       [31.16, 58, 19.0012, ..., 0, 0, 1],
       ...,
       [22.96, 37, 19.0012, ..., 0, 0, 1],
       [18.86, 63, 8.9981, ..., 1, 0, 1],
       [17.22, 38, 19.9995, ..., 0, 0, 1]], dtype=object)

gbr_model = GradientBoostingRegressor(
    n_estimators=2000,
    learning_rate=0.05,
    max_depth=5,
    min_samples_leaf=15,
    min_samples_split=10,
    random_state=42
)
gbr_model.fit(x_train, y_train)

GradientBoostingRegressor(learning_rate=0.05, max_depth=5, min_samples_leaf=15,
                          min_samples_split=10, n_estimators=2000,
                          random_state=42)

Validation

train_score = gbr_model.score(x_train, y_train)
validation_score = gbr_model.score(x_val, y_val)
print(train_score, validation_score)

0.9866447588995861 0.957037659130984
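The two numbers above are R² scores on the log-transformed target, because `score()` on a scikit-learn regressor returns the coefficient of determination. If the goal is an error on the original count scale, one option is the root mean squared logarithmic error (RMSLE). The sketch below assumes, since the notebook does not state it, that RMSLE is the metric of interest; `rmsle` is a hypothetical helper and the values are illustrative.

```python
import numpy as np

def rmsle(y_true, y_pred) -> float:
    """Root mean squared logarithmic error on raw (non-log) counts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

actual = np.array([16, 40, 32])
perfect = rmsle(actual, actual)   # identical predictions
doubled = rmsle(actual, actual * 2)  # every prediction twice too large
print(perfect, doubled)
```

With this notebook's variables, a call such as `rmsle(np.exp(y_val), np.exp(gbr_model.predict(x_val)))` would evaluate the validation split on the count scale.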

Bike demand prediction

gbr_prediction = gbr_model.predict(x_test)
predicted_count = np.exp(gbr_prediction)

sample.head()

	datetime	count
0	2011-01-20 00:00:00	0
1	2011-01-20 01:00:00	0
2	2011-01-20 02:00:00	0
3	2011-01-20 03:00:00	0
4	2011-01-20 04:00:00	0

submission = pd.DataFrame()
submission['datetime'] = test_datetime
submission['count'] = predicted_count

submission.head()

	datetime	count
0	2011-01-20 00:00:00	13.892524
1	2011-01-20 03:00:00	2.242351
2	2011-01-20 04:00:00	2.509201
3	2011-01-20 05:00:00	5.775774
4	2011-01-20 06:00:00	31.641682

submission.to_csv("bike.csv", index=False)
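Because predict_windspeed places the rows that originally had zero windspeed after all the other rows, the submission is not in chronological order, as submission.head() shows (00:00, 03:00, 04:00, ...), whereas sample starts with 00:00, 01:00, 02:00. Whether the row order matters for scoring is not stated here, so restoring it is a cheap safeguard. A minimal sketch on toy frames (the prediction values are made up):

```python
import pandas as pd

sample_toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2011-01-20 00:00", "2011-01-20 01:00", "2011-01-20 02:00"]),
    "count": [0, 0, 0],
})
# Predictions in shuffled row order, as after the windspeed imputation
pred_toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2011-01-20 00:00", "2011-01-20 02:00", "2011-01-20 01:00"]),
    "count": [13.9, 2.5, 2.2],
})

# Sort chronologically and check that the timestamps match the sample file
ordered = pred_toy.sort_values("datetime").reset_index(drop=True)
same_rows = ordered["datetime"].equals(sample_toy["datetime"])
print(same_rows)
```

Applied to the notebook's variables, this corresponds to sorting submission by datetime and comparing it with sample['datetime'] before calling to_csv.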