Linear Regression¶
- y = ax + b
- The problem at hand: how do we determine the best a and b?
- Ultimately, a model with less error is a better model.
- We need to find the a and b that minimize a cost function E = f(a, b) (an example is given below).
- How do we minimize a function? Let's use gradient descent!
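As a concrete example (the specific cost function is an assumption here, since the post leaves E unspecified), the usual choice is the mean squared error:

E(a, b) = (1/n) · Σᵢ (yᵢ − (a·xᵢ + b))²

The smaller the average squared gap between the predictions a·xᵢ + b and the observed yᵢ, the lower the cost.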
Gradient Descent¶
- The process of finding a minimum of a function by following its gradient (slope)
- xᵢ₊₁ = xᵢ − α · (gradient at xᵢ)  (learning rate α)
Move x in the direction opposite to the sign of the gradient:
- Positive gradient: as x increases, the function value increases
- Negative gradient: as x increases, the function value decreases
Magnitude of the gradient: the closer we get to the minimum, the smaller the magnitude of the gradient becomes.
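Below is a minimal NumPy sketch of gradient descent fitting y = ax + b, assuming the MSE cost above, a fixed learning rate, and synthetic data (none of these choices come from the original post; they are illustrative only).

In [ ]:
import numpy as np

# Illustrative data: y = 2x + 1 plus noise (hypothetical, for demonstration only)
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 100)
y = 2 * x + 1 + rng.normal(scale=1.0, size=x.shape)

a, b = 0.0, 0.0   # initial parameters
alpha = 0.01      # learning rate
n = len(x)

for _ in range(2000):
    error = (a * x + b) - y
    # Gradients of E(a, b) = (1/n) * sum(((a*x + b) - y)**2)
    grad_a = (2 / n) * np.sum(error * x)
    grad_b = (2 / n) * np.sum(error)
    # Step against the gradient
    a -= alpha * grad_a
    b -= alpha * grad_b

print(a, b)  # should end up near 2 and 1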
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy import stats
from sklearn.datasets import load_boston  # note: load_boston is deprecated and removed in scikit-learn >= 1.2
import warnings
warnings.filterwarnings('ignore')
In [ ]:
boston = load_boston()
boston_df = pd.DataFrame(boston.data, columns = boston.feature_names)
boston_df['Price'] = boston.target
In [ ]:
boston_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  Price    506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB
In [ ]:
fig, axs = plt.subplots(figsize=(16,8), ncols=4, nrows=2)
features = ['RM', 'ZN', 'INDUS', 'NOX', 'AGE', 'PTRATIO', 'LSTAT', 'RAD']
for i, feature in enumerate(features):
    row = i // 4
    col = i % 4
    sns.regplot(x=feature, y='Price', data=boston_df, ax=axs[row][col])
In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
In [ ]:
X = boston_df.drop('Price', axis=1)
y = boston_df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=156)
In [ ]:
lr = LinearRegression()
lr.fit(X_train, y_train)
print('Performance on the training set')
y_train_predict = lr.predict(X_train)
mse = mean_squared_error(y_train, y_train_predict)
print('rmse: {0:.3f}'.format(np.sqrt(mse)))
print('Performance on the test set')
y_test_predict = lr.predict(X_test)
mse = mean_squared_error(y_test, y_test_predict)
print('rmse: {0:.3f}'.format(np.sqrt(mse)))
Performance on the training set
rmse: 4.943
Performance on the test set
rmse: 4.159
Polynomial Regression¶
In [ ]:
from sklearn.preprocessing import PolynomialFeatures
In [ ]:
# Transform the data with a degree-2 polynomial feature transformer for polynomial regression
polynomial_transformer = PolynomialFeatures(2)
polynomial_data = polynomial_transformer.fit_transform(boston.data)
print(polynomial_data.shape)
polynomial_data
(506, 105)
Out[ ]:
array([[1.00000000e+00, 6.32000000e-03, 1.80000000e+01, ..., 1.57529610e+05, 1.97656200e+03, 2.48004000e+01],
       [1.00000000e+00, 2.73100000e-02, 0.00000000e+00, ..., 1.57529610e+05, 3.62766600e+03, 8.35396000e+01],
       [1.00000000e+00, 2.72900000e-02, 0.00000000e+00, ..., 1.54315409e+05, 1.58310490e+03, 1.62409000e+01],
       ...,
       [1.00000000e+00, 6.07600000e-02, 0.00000000e+00, ..., 1.57529610e+05, 2.23851600e+03, 3.18096000e+01],
       [1.00000000e+00, 1.09590000e-01, 0.00000000e+00, ..., 1.54802902e+05, 2.54955600e+03, 4.19904000e+01],
       [1.00000000e+00, 4.74100000e-02, 0.00000000e+00, ..., 1.57529610e+05, 3.12757200e+03, 6.20944000e+01]])
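With degree 2, the 13 original features expand to 105 columns: 1 bias column, 13 linear terms, and 13·14/2 = 91 squared and pairwise interaction terms.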
In [ ]:
# In newer scikit-learn versions this method is get_feature_names_out()
polynomial_feature_names = polynomial_transformer.get_feature_names(boston.feature_names)
In [ ]:
X = pd.DataFrame(polynomial_data, columns=polynomial_feature_names)
y = pd.DataFrame(boston.target, columns=['Price'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
In [ ]:
lr = LinearRegression()
lr.fit(X_train, y_train)
print('Performance on the training set')
y_train_predict = lr.predict(X_train)
mse = mean_squared_error(y_train, y_train_predict)
print('rmse: {0:.3f}'.format(np.sqrt(mse)))
print('Performance on the test set')
y_test_predict = lr.predict(X_test)
mse = mean_squared_error(y_test, y_test_predict)
print('rmse: {0:.3f}'.format(np.sqrt(mse)))
Performance on the training set
rmse: 2.425
Performance on the test set
rmse: 3.197
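Both training and test RMSE are lower than for the plain linear model above, though note that this run uses a different split (test_size=0.2, random_state=5), so the two results are not strictly comparable.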
Regression Trees¶
- Tree-based regression models
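A regression tree splits the feature space into regions and predicts the mean target value of the training samples in each leaf; splits are chosen to reduce the variance (squared error) within the resulting nodes.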
In [ ]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
In [ ]:
# Note: X and y here are still the polynomial feature matrix and target defined in the cells above
rf = RandomForestRegressor(random_state=0, n_estimators=1000)
neg_mse_scores = cross_val_score(rf, X, y, scoring='neg_mean_squared_error', cv=5)
rmse_scores = np.sqrt(-1 * neg_mse_scores)
avg_rmse = np.mean(rmse_scores)
avg_rmse
Out[ ]:
4.386593953202736
In [ ]:
def get_rmse(model, X, y):
    # 5-fold cross-validation; sklearn returns negative MSE, so flip the sign before the square root
    neg_mse_scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=5)
    rmse_scores = np.sqrt(-1 * neg_mse_scores)
    avg_rmse = np.mean(rmse_scores)
    print(model.__class__.__name__)
    print('Average RMSE over 5-fold CV: {0:.3f}'.format(avg_rmse))
In [ ]:
X = boston_df.drop('Price', axis=1)
y = boston_df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=156)
In [ ]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
dt_reg = DecisionTreeRegressor(random_state=0, max_depth=4)
rf_reg = RandomForestRegressor(random_state=0, n_estimators=1000)
gb_reg = GradientBoostingRegressor(random_state=0, n_estimators=1000)
xgb_reg = XGBRegressor(n_estimators=1000)
lgb_reg = LGBMRegressor(n_estimators=1000)

models = [dt_reg, rf_reg, gb_reg, xgb_reg, lgb_reg]
for model in models:
    get_rmse(model, X, y)
DecisionTreeRegressor
Average RMSE over 5-fold CV: 5.978
RandomForestRegressor
Average RMSE over 5-fold CV: 4.423
GradientBoostingRegressor
Average RMSE over 5-fold CV: 4.269
[06:17:32] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[06:17:32] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[06:17:33] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[06:17:33] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[06:17:33] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
XGBRegressor
Average RMSE over 5-fold CV: 4.089
LGBMRegressor
Average RMSE over 5-fold CV: 4.646
- Unlike linear regression, the regression-tree Regressor classes do not provide a coef_ attribute with regression coefficients
- Instead, feature_importances_ can be used to see the importance of each feature
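In scikit-learn these importances are computed as the mean decrease in impurity (here, variance) contributed by splits on each feature, averaged over the trees.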
In [ ]:
import seaborn as sns
rf_reg = RandomForestRegressor(n_estimators=1000)
rf_reg.fit(X, y)
# Sort feature importances in descending order and plot them as a horizontal bar chart
s = pd.Series(data=rf_reg.feature_importances_, index=X.columns).sort_values(ascending=False)
sns.barplot(x=s, y=s.index)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f093ef60250>