import pandas as pd
netflix_titles = pd.read_csv('https://raw.githubusercontent.com/minaahayley/Python/main/data/netflix_titles.csv')
netflix_titles.sample()
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
7793 | s7794 | Movie | Primal Fear | Gregory Hoblit | Richard Gere, Edward Norton, Laura Linney, Joh... | United States | December 1, 2019 | 1996 | R | 131 min | Dramas, Thrillers | When a blood-spattered altar boy is found runn... |
str.contains()
- To select rows whose values contain a given string, use df[column].str.contains().
When using str.contains(), pass na=False so that NaN values are treated as non-matches.
netflix_titles[netflix_titles['rating'].str.contains('TV', na=False)].sample()
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1479 | s1480 | TV Show | SanPa: Sins of the Savior | Cosima Spender | NaN | Italy | December 30, 2020 | 2020 | TV-MA | 1 Season | Crime TV Shows, Docuseries, International TV S... | Amidst a heroin crisis, Vincenzo Muccioli care... |
To select rows that do not contain the string, negate the mask: ~df[column].str.contains().
netflix_titles[~netflix_titles['rating'].str.contains('TV', na=False)].sample()
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
604 | s605 | Movie | The Life of David Gale | Alan Parker | Kevin Spacey, Kate Winslet, Laura Linney, Gabr... | United Kingdom, Germany, Spain, United States | July 1, 2021 | 2003 | R | 130 min | Dramas | When a Texas professor and advocate for the el... |
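The two filters above can be tried on a small toy Series. This is a minimal sketch: the `ratings` values are made up for illustration and are not taken from the Netflix file.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, including one missing value.
ratings = pd.Series(["TV-MA", "R", np.nan, "PG-13", "TV-14"])

# na=False makes missing values count as "no match",
# so the mask contains only True/False.
has_tv = ratings.str.contains("TV", na=False)

tv_rows = ratings[has_tv]       # ratings that contain "TV"
other_rows = ratings[~has_tv]   # everything else, including the missing value

print(tv_rows.tolist())
print(other_rows.tolist())
```

Note that the negated mask keeps the row with the missing rating, because na=False marked it as a non-match before the negation.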
str.split()
str.split() splits each string on a given delimiter.
netflix_titles['rating'].str.split('-')
0 [PG, 13]
1 [TV, MA]
2 [TV, MA]
3 [TV, MA]
4 [TV, MA]
...
8802 [R]
8803 [TV, Y7]
8804 [R]
8805 [PG]
8806 [TV, 14]
Name: rating, Length: 8807, dtype: object
With expand=True the pieces are returned as separate columns; with expand=False (the default) each row holds a single list.
netflix_titles['rating'].str.split('-', expand=True)
0 | 1 | 2 | |
---|---|---|---|
0 | PG | 13 | None |
1 | TV | MA | None |
2 | TV | MA | None |
3 | TV | MA | None |
4 | TV | MA | None |
... | ... | ... | ... |
8802 | R | None | None |
8803 | TV | Y7 | None |
8804 | R | None | None |
8805 | PG | None | None |
8806 | TV | 14 | None |
8807 rows × 3 columns
With expand=True you can select a particular piece directly by its column label.
netflix_titles['rating'].str.split('-', expand=True)[0]
0 PG
1 TV
2 TV
3 TV
4 TV
..
8802 R
8803 TV
8804 R
8805 PG
8806 TV
Name: 0, Length: 8807, dtype: object
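As a small sketch of the same idea, the first piece can be taken either from the expanded DataFrame or by indexing into each list with .str[0]. The `ratings` Series here is hypothetical toy data.

```python
import pandas as pd

# Hypothetical ratings for illustration.
ratings = pd.Series(["PG-13", "TV-MA", "R", "TV-Y7"])

# expand=True: one column per piece; rows with fewer pieces are padded with missing values.
parts = ratings.str.split("-", expand=True)
first_via_expand = parts[0]

# expand=False (default): each row is a list, so .str[0] picks the first element.
first_via_str = ratings.str.split("-").str[0]

print(parts)
print(first_via_str.tolist())
```

Both approaches give the same first piece; .str[0] avoids building the intermediate DataFrame when only one piece is needed.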
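str.find()
str.find() returns -1 when the substring is not found. The list comprehension below calls Python's built-in str.find() on each title, which behaves the same way. https://pandas.pydata.org/docs/reference/api/pandas.Series.str.find.html
# Find titles that contain 'Zombie'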
zombie = [x for x in netflix_titles['title'] if x.find('Zombie') != -1]
zombie
['IZombie', 'Rise of the Zombie', 'Scooby-Doo on Zombie Island', 'Zombie Dumb', 'Zombieland']
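The same search can also be written with the vectorized string methods, which tolerate missing values. The following is only a sketch on a hypothetical `titles` Series, not the author's code.

```python
import numpy as np
import pandas as pd

# Hypothetical titles, including one missing value.
titles = pd.Series(["Zombieland", "Primal Fear", np.nan, "Rise of the Zombie"])

# Series.str.find gives the position of the first match per element,
# -1 when the substring is absent, and a missing value for missing input.
positions = titles.str.find("Zombie")

# str.contains with regex=False does a plain substring test;
# na=False treats the missing title as a non-match.
zombie_titles = titles[titles.str.contains("Zombie", regex=False, na=False)].tolist()

print(positions.tolist())
print(zombie_titles)
```

A list comprehension that calls x.find() would raise an AttributeError on a missing (float NaN) element, so the vectorized form is the safer choice when a column may contain NaN.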
pd.to_numeric()
- How to ignore NA values when applying pd.to_numeric()
netflix_titles.sample()
show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
73 | s74 | Movie | King of Boys | Kemi Adetiba | Sola Sobowale, Adesua Etomi, Remilekun "Remini... | Nigeria | September 14, 2021 | 2018 | TV-MA | 182 min | Dramas, International Movies | When a powerful businesswoman’s political ambi... |
First, derive added_year from the date_added column.
netflix_titles['added_year'] = pd.to_datetime(netflix_titles['date_added']).dt.year
netflix_titles['added_year']
0 2021.0
1 2021.0
2 2021.0
3 2021.0
4 2021.0
...
8802 2019.0
8803 2019.0
8804 2019.0
8805 2020.0
8806 2019.0
Name: added_year, Length: 8807, dtype: float64
Years between release and addition = df['release_year'] - df['added_year'] (negative when a title was added after its release year)
netflix_titles['release_added'] = netflix_titles['release_year'] - netflix_titles['added_year']
netflix_titles['release_added']
0 -1.0
1 0.0
2 0.0
3 0.0
4 0.0
...
8802 -12.0
8803 -1.0
8804 -10.0
8805 -14.0
8806 -4.0
Name: release_added, Length: 8807, dtype: float64
netflix_titles['release_added'] = netflix_titles['release_added'].astype(str)
After astype(str), missing values become the string "nan", so pd.to_numeric() raises a ValueError.
pd.to_numeric(netflix_titles['release_added'])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "nan"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-15-f88196129310> in <module>
----> 1 pd.to_numeric(netflix_titles['release_added'])
/usr/local/lib/python3.7/dist-packages/pandas/core/tools/numeric.py in to_numeric(arg, errors, downcast)
182 try:
183 values, _ = lib.maybe_convert_numeric(
--> 184 values, set(), coerce_numeric=coerce_numeric
185 )
186 except (ValueError, TypeError):
/usr/local/lib/python3.7/dist-packages/pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "nan" at position 6066
Use replace() to turn the "nan" strings into None, then apply pd.to_numeric().
netflix_titles['release_added'] = pd.to_numeric(netflix_titles['release_added'].replace({'nan' : None}))
netflix_titles['release_added']
0 -1.0
1 0.0
2 0.0
3 0.0
4 0.0
...
8802 -12.0
8803 -1.0
8804 -10.0
8805 -14.0
8806 -4.0
Name: release_added, Length: 8807, dtype: float64
netflix_titles['release_added'].unique()
array([ -1., 0., -28., -3., -25., -23., -24., -11., -8., -4., -46., -43., -38., -34., -9., -20., -7., -19., -18., -17., -10., -13., -12., -14., -16., -15., -27., -6., -2., -5., -39., -32., -31., -30., -22., -35., -29., -37., -41., -60., -21., -26., -36., -45., -62., -33., -40., -49., -57., -76., 1., -66., -64., -50., -47., -44., -93., -51., -55., -48., -42., 2., nan, -54., -59., -61., -52., -63., 3., -72., -71., -75., -65., -73., -70., -74.])
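An alternative to replacing the "nan" strings by hand is the errors="coerce" option of pd.to_numeric(), which turns anything it cannot parse into NaN. This is a sketch on a hypothetical `diffs` Series rather than the release_added column itself.

```python
import pandas as pd

# Hypothetical string column: numbers, a "nan" string, and an unparsable value.
diffs = pd.Series(["-1.0", "0.0", "nan", "-12.0", "unknown"])

# errors="coerce" converts every value that cannot be parsed into NaN
# instead of raising ValueError.
numeric = pd.to_numeric(diffs, errors="coerce")

print(numeric.tolist())
print(numeric.dtype)
```

The trade-off is that coerce hides every unparsable value, not only "nan", so it is worth checking numeric.isna().sum() against the expected number of missing entries.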
apply()
- Using apply() to label values that match a condition
netflix_titles['country'].value_counts()
United States 2818
India 972
United Kingdom 419
Japan 245
South Korea 199
...
Romania, Bulgaria, Hungary 1
Uruguay, Guatemala 1
France, Senegal, Belgium 1
Mexico, United States, Spain, Colombia 1
United Arab Emirates, Jordan 1
Name: country, Length: 748, dtype: int64
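Let's create a new column based on whether the country is United States. Note that the code below overwrites the dataset's original type column (Movie / TV Show).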
netflix_titles['type'] = netflix_titles['country'].apply(lambda x : 'US' if x == 'United States' else 'Non_US')
netflix_titles[['type','country']]
type | country | |
---|---|---|
0 | US | United States |
1 | Non_US | South Africa |
2 | Non_US | NaN |
3 | Non_US | NaN |
4 | Non_US | India |
... | ... | ... |
8802 | US | United States |
8803 | Non_US | NaN |
8804 | US | United States |
8805 | US | United States |
8806 | Non_US | India |
8807 rows × 2 columns
Let's add one more condition so that India is labeled separately as well.
netflix_titles['type'] = netflix_titles['country'].apply(lambda x : 'US' if x == 'United States' else 'India' if x == 'India' else 'Non-US/India')
netflix_titles[['type','country']]
type | country | |
---|---|---|
0 | US | United States |
1 | Non-US/India | South Africa |
2 | Non-US/India | NaN |
3 | Non-US/India | NaN |
4 | India | India |
... | ... | ... |
8802 | US | United States |
8803 | Non-US/India | NaN |
8804 | US | United States |
8805 | US | United States |
8806 | India | India |
8807 rows × 2 columns
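When the conditions are simple equality checks, a chained lambda can be swapped for a dictionary lookup with Series.map(). This is a sketch of that alternative, not the author's approach; the `country` Series and the `labels` dictionary are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical country values, including one missing value.
country = pd.Series(["United States", "South Africa", np.nan, "India"])

# Values found in the dictionary are translated; everything else
# (including NaN) becomes NaN and is then filled with the fallback label.
labels = {"United States": "US", "India": "India"}
region = country.map(labels).fillna("Non-US/India")

print(region.tolist())
```

Adding another country then only requires a new dictionary entry instead of another nested if/else.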
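Applying apply() to a DataFrame row by row (axis=1). In the outputs below, stack() drops the rows where country is NaN, and reset_index(level=1, drop=True) removes the extra index level that stack() adds.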
netflix_titles.apply(lambda x : pd.Series(x['country']), axis=1)
0 | |
---|---|
0 | United States |
1 | South Africa |
2 | NaN |
3 | NaN |
4 | India |
... | ... |
8802 | United States |
8803 | NaN |
8804 | United States |
8805 | United States |
8806 | India |
8807 rows × 1 columns
netflix_titles.apply(lambda x : pd.Series(x['country']), axis=1).stack()
0 0 United States
1 0 South Africa
4 0 India
7 0 United States, Ghana, Burkina Faso, United Kin...
8 0 United Kingdom
...
8801 0 United Arab Emirates, Jordan
8802 0 United States
8804 0 United States
8805 0 United States
8806 0 India
Length: 7976, dtype: object
netflix_titles.apply(lambda x : pd.Series(x['country']), axis=1).stack().reset_index(level=1, drop=True)
0 United States
1 South Africa
4 India
7 United States, Ghana, Burkina Faso, United Kin...
8 United Kingdom
...
8801 United Arab Emirates, Jordan
8802 United States
8804 United States
8805 United States
8806 India
Length: 7976, dtype: object
s = netflix_titles.apply(lambda x : pd.Series(x['country']), axis=1).stack().reset_index(level=1, drop=True)
s[s != 'nan'].value_counts().head(20)
United States 2818
India 972
United Kingdom 419
Japan 245
South Korea 199
Canada 181
Spain 145
France 124
Mexico 110
Egypt 106
Turkey 105
Nigeria 95
Australia 87
Taiwan 81
Indonesia 79
Brazil 77
Philippines 75
United Kingdom, United States 75
United States, Canada 73
Germany 67
dtype: int64
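The counts above still treat combined entries such as "United Kingdom, United States" as a single value. If the goal is to count each country separately, one possible sketch uses str.split() followed by explode(). It assumes the separator is exactly ", ", and the `country` Series is hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical country column with one multi-country entry and one missing value.
country = pd.Series(
    ["United States", "United Kingdom, United States", np.nan, "India"]
)

# Drop missing rows, split on the assumed ", " separator,
# then explode() gives one row per individual country.
per_country = country.dropna().str.split(", ").explode()
counts = per_country.value_counts()

print(counts)
```

The exploded rows keep the original index, so a title listed under two countries appears twice with the same index label.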
np.where()
- Using np.where() to label values that match a single condition
netflix_titles = pd.read_csv('https://raw.githubusercontent.com/minaahayley/Python/main/data/netflix_titles.csv')
Let's create a new column based on whether the country is United States.
import numpy as np
netflix_titles['type'] = np.where(netflix_titles['country']=='United States', 'US', 'Non_US')
netflix_titles[['type','country']]
type | country | |
---|---|---|
0 | US | United States |
1 | Non_US | South Africa |
2 | Non_US | NaN |
3 | Non_US | NaN |
4 | Non_US | India |
... | ... | ... |
8802 | US | United States |
8803 | Non_US | NaN |
8804 | US | United States |
8805 | US | United States |
8806 | Non_US | India |
8807 rows × 2 columns
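np.where() handles one condition; for several conditions, np.select() is a common companion. The following is a minimal sketch on a hypothetical `country` Series, reusing the US / India labels from the apply() section.

```python
import numpy as np
import pandas as pd

# Hypothetical country values, including one missing value.
country = pd.Series(["United States", "South Africa", np.nan, "India"])

# Conditions are checked in order; the first one that is True picks the label.
conditions = [country == "United States", country == "India"]
choices = ["US", "India"]

# Rows matching no condition (including NaN, which compares as False)
# receive the default label.
region = np.select(conditions, choices, default="Non-US/India")

print(region.tolist())
```

np.select() returns a NumPy array, which can be assigned to a DataFrame column in the same way as the np.where() result.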