Min-max normalization
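- Rescales each variable into the range 0 to 1 using its minimum and maximum values.
- Helps gradient descent converge faster, since all features end up on a comparable scale.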
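As a rough illustration of the rescaling described above, the following sketch applies the usual min-max formula, x' = (x - min) / (max - min), to a plain Python list. The function name `min_max_normalize` and the sample values are made up for this example, and it is not meant to mirror how scikit-learn implements the scaler internally.

```python
def min_max_normalize(values):
    """Rescale a list of numbers into [0, 1] with the min-max formula."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so the range is zero; returning zeros
        # is an assumption made here to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample values, chosen only for illustration.
heights_cm = [175, 202, 229]
print(min_max_normalize(heights_cm))  # [0.0, 0.5, 1.0]
```

The smallest value maps to 0, the largest to 1, and everything else falls in between in proportion to its distance from the minimum.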
import pandas as pd
import numpy as np
from sklearn import preprocessing
nba_player_of_the_week_df = pd.read_csv('/content/drive/MyDrive/data/NBA_player_of_the_week.csv')
nba_player_of_the_week_df.sample()
| | Player | Team | Conference | Date | Position | Height | Weight | Age | Draft Year | Seasons in league | Season | Season short | Pre-draft Team | Real_value | Height CM | Weight KG | Last Season |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
477 | Tony Parker | San Antonio Spurs | West | Mar 30, 2009 | G | 6'2 | 185 | 26 | 2001 | 7 | 2008-2009 | 2009 | Paris Basket Racing (France) | 0.5 | 188 | 83 | 0 |
nba_player_of_the_week_df.describe()
| | Weight | Age | Draft Year | Seasons in league | Season short | Real_value | Height CM | Weight KG | Last Season |
---|---|---|---|---|---|---|---|---|---|
count | 1340.000000 | 1340.000000 | 1340.000000 | 1340.000000 | 1340.000000 | 1340.000000 | 1340.000000 | 1340.000000 | 1340.000000 |
mean | 224.567164 | 26.738060 | 1996.287313 | 5.740299 | 2003.156716 | 0.686940 | 201.071642 | 101.384328 | 0.023881 |
std | 30.798885 | 3.400683 | 11.253558 | 3.293421 | 11.470164 | 0.242007 | 9.367970 | 14.011226 | 0.152734 |
min | 150.000000 | 19.000000 | 1965.000000 | 0.000000 | 1980.000000 | 0.500000 | 175.000000 | 68.000000 | 0.000000 |
25% | 205.000000 | 24.000000 | 1987.000000 | 3.000000 | 1994.000000 | 0.500000 | 193.000000 | 93.000000 | 0.000000 |
50% | 220.000000 | 26.000000 | 1998.000000 | 5.000000 | 2005.000000 | 0.500000 | 201.000000 | 99.000000 | 0.000000 |
75% | 250.000000 | 29.000000 | 2005.000000 | 8.000000 | 2013.000000 | 1.000000 | 208.000000 | 113.000000 | 0.000000 |
max | 325.000000 | 40.000000 | 2018.000000 | 17.000000 | 2020.000000 | 1.000000 | 229.000000 | 147.000000 | 1.000000 |
height_weight_age_df = nba_player_of_the_week_df[['Height CM', 'Weight KG', 'Age']]
height_weight_age_df.sample()
| | Height CM | Weight KG | Age |
---|---|---|---|
456 | 211 | 120 | 24 |
# Min-max normalization: learn each column's min and max, then rescale to [0, 1]
scaler = preprocessing.MinMaxScaler()
normalized_data = scaler.fit_transform(height_weight_age_df)
normalized_data
array([[0.51851852, 0.32911392, 0.0952381 ], [0.7037037 , 0.56962025, 0.28571429], [0.48148148, 0.39240506, 0.19047619], ..., [0.48148148, 0.37974684, 0.23809524], [0.38888889, 0.21518987, 0.23809524], [0.42592593, 0.27848101, 0.52380952]])
normalized_df = pd.DataFrame(normalized_data, columns=['Height', 'Weight', 'Age'])
normalized_df.describe()
| | Height | Weight | Age |
---|---|---|---|
count | 1340.000000 | 1340.000000 | 1340.000000 |
mean | 0.482808 | 0.422586 | 0.368479 |
std | 0.173481 | 0.177357 | 0.161937 |
min | 0.000000 | 0.000000 | 0.000000 |
25% | 0.333333 | 0.316456 | 0.238095 |
50% | 0.481481 | 0.392405 | 0.333333 |
75% | 0.611111 | 0.569620 | 0.476190 |
max | 1.000000 | 1.000000 | 1.000000 |
What is the result of feature scaling the following age data with min-max normalization?
df = pd.DataFrame(np.array([25, 49, 32, 35, 40]), columns=['Age'])
df
| | Age |
---|---|
0 | 25 |
1 | 49 |
2 | 32 |
3 | 35 |
4 | 40 |
# Min-max normalization of the Age column
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
normalized_data = scaler.fit_transform(df)
normalized_data
array([[0. ], [1. ], [0.29166667], [0.41666667], [0.625 ]])
normalized_df = pd.DataFrame(normalized_data, columns=['Age'])
normalized_df
| | Age |
---|---|
0 | 0.000000 |
1 | 1.000000 |
2 | 0.291667 |
3 | 0.416667 |
4 | 0.625000 |
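The same numbers can be reproduced by hand, which may help confirm what `MinMaxScaler` is doing. This sketch assumes only pandas and applies (x - min) / (max - min) directly to the `Age` column; the variable name `manual_df` is introduced here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 49, 32, 35, 40]})

# The minimum is 25 and the maximum is 49, so every value is
# shifted down by 25 and then divided by the range, 24.
manual_df = (df - df.min()) / (df.max() - df.min())
print(manual_df)
```

For example, age 32 becomes (32 - 25) / 24, which is about 0.291667 and matches the scaler output above.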
What is the result of feature scaling the following salary data with min-max normalization?
df = pd.DataFrame(np.array([25000000, 35000000, 30000000, 50000000, 35000000]), columns=['Salary'])
df
| | Salary |
---|---|
0 | 25000000 |
1 | 35000000 |
2 | 30000000 |
3 | 50000000 |
4 | 35000000 |
scaler = preprocessing.MinMaxScaler()
normalized_data = scaler.fit_transform(df)
normalized_data
array([[0. ], [0.4], [0.2], [1. ], [0.4]])
normalized_df = pd.DataFrame(normalized_data, columns=['Salary'])
normalized_df
| | Salary |
---|---|
0 | 0.0 |
1 | 0.4 |
2 | 0.2 |
3 | 1.0 |
4 | 0.4 |
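A fitted scaler keeps the minimum and maximum it saw, so it can also map scaled values back to the original unit or scale new rows. The sketch below refits the scaler on the same salary data. The new salary of 60,000,000 is a made-up value, used to show that inputs outside the fitted range land outside [0, 1] under the default settings.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame(np.array([25000000, 35000000, 30000000, 50000000, 35000000]), columns=['Salary'])

scaler = preprocessing.MinMaxScaler()
scaler.fit(df)

# Map a scaled value back to the original unit: 0.4 * 25,000,000 + 25,000,000.
restored = scaler.inverse_transform(np.array([[0.4]]))

# Scale a new, hypothetical salary that is larger than the fitted maximum.
new_df = pd.DataFrame({'Salary': [60000000]})
scaled_new = scaler.transform(new_df)

print(restored)    # approximately 35,000,000
print(scaled_new)  # approximately 1.4
```

If values outside the original range are expected at prediction time, it may be worth deciding in advance whether results above 1 or below 0 are acceptable for the model.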
One-hot Encoding
- Creates one new column for each category.
- Converts categorical data into numerical data.
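To make the idea concrete, here is a small hand-rolled sketch in plain Python: each distinct category becomes its own 0/1 column. The function name `one_hot` and the sample list are hypothetical, and in practice a library routine such as `pd.get_dummies` would normally be used instead.

```python
def one_hot(values):
    """Return (categories, rows), where each row holds one 0/1 flag per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical sample: port codes similar to the Embarked column.
categories, rows = one_hot(['S', 'C', 'S', 'Q'])
print(categories)  # ['C', 'Q', 'S']
print(rows)        # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Each row contains exactly one 1, in the column of the category that row belongs to.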
import pandas as pd
titanic_df = pd.read_csv('/content/drive/MyDrive/data/titanic.csv')
titanic_df.sample()
| | Unnamed: 0 | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
titanic_sex_embarked = titanic_df[['Sex', 'Embarked']]
titanic_sex_embarked.head()
| | Sex | Embarked |
---|---|---|
0 | male | S |
1 | female | C |
2 | female | S |
3 | female | S |
4 | male | S |
one_hot_encoded_df = pd.get_dummies(titanic_sex_embarked)
one_hot_encoded_df.head()
| | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 1 |
1 | 1 | 0 | 1 | 0 | 0 |
2 | 1 | 0 | 0 | 0 | 1 |
3 | 1 | 0 | 0 | 0 | 1 |
4 | 0 | 1 | 0 | 0 | 1 |
# Encode only the Sex and Embarked columns and keep the other columns of titanic_df as they are
one_hot_encoded_df = pd.get_dummies(data=titanic_df, columns=['Sex', 'Embarked'])
one_hot_encoded_df.head()
| | Unnamed: 0 | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 0 | 1 | 0 | 0 | 1 |
1 | 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1 | 0 | 1 | 0 | 0 |
2 | 2 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 1 | 0 | 0 | 0 | 1 |
3 | 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 1 | 0 | 0 | 0 | 1 |
4 | 4 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 0 | 1 | 0 | 0 | 1 |
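Two optional arguments of `pd.get_dummies` may be worth knowing here. In pandas 2.0 and later the dummy columns default to booleans, so `dtype=int` can be passed to get 0/1 integers like the output above. `drop_first=True` drops the first category of each encoded column, since it is implied by the remaining ones. The tiny DataFrame below is made up for illustration and only stands in for `titanic_df`.

```python
import pandas as pd

# Hypothetical stand-in for the Sex and Embarked columns of titanic_df.
sample_df = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# dtype=int forces 0/1 integers; drop_first=True removes Sex_female and Embarked_C.
encoded_df = pd.get_dummies(sample_df, columns=['Sex', 'Embarked'], dtype=int, drop_first=True)
print(encoded_df.columns.tolist())  # ['Sex_male', 'Embarked_Q', 'Embarked_S']
```

Dropping one column per category is a common choice for linear models, where a full set of dummy columns would be perfectly collinear; whether it helps depends on the model being trained.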