Connecting Kaggle and Colab
In [ ]:
!pip install kaggle
from google.colab import files
files.upload()
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: kaggle in /usr/local/lib/python3.7/dist-packages (1.5.12)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.7/dist-packages (from kaggle) (6.1.2)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from kaggle) (4.64.1)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.7/dist-packages (from kaggle) (1.15.0)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.7/dist-packages (from kaggle) (1.24.3)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from kaggle) (2.23.0)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.7/dist-packages (from kaggle) (2.8.2)
Requirement already satisfied: certifi in /usr/local/lib/python3.7/dist-packages (from kaggle) (2022.9.24)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.7/dist-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->kaggle) (2.10)
Saving kaggle.json to kaggle.json
Out[ ]:
{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}
In [ ]:
ls -1ha kaggle.json
kaggle.json
In [ ]:
!mkdir -p ~/.kaggle  # create the ~/.kaggle config folder if it does not exist
!cp kaggle.json ~/.kaggle  # copy kaggle.json into the ~/.kaggle folder
!chmod 600 ~/.kaggle/kaggle.json  # owner-only read/write, which avoids the Kaggle permission warning
In [ ]:
# The ls command lists the files found at a given path.
%ls ~/.kaggle
competitions/ kaggle.json
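The three shell commands above (create the folder, copy the token, restrict permissions) can also be sketched in plain Python. This is a minimal sketch, not the Kaggle CLI's own mechanism: the helper name `install_kaggle_credentials` and its `config_dir` parameter are hypothetical, and it assumes a POSIX file system such as the one Colab provides.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_credentials(token_path, config_dir=None):
    """Copy a kaggle.json token into a config folder with 600 permissions."""
    # Default to ~/.kaggle, the folder used by the shell commands above.
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    config_dir.mkdir(parents=True, exist_ok=True)   # like: mkdir -p
    target = config_dir / "kaggle.json"
    shutil.copyfile(token_path, target)             # like: cp
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)   # like: chmod 600
    return target
```

Calling `install_kaggle_credentials("kaggle.json")` after the upload step would place the token in `~/.kaggle`. Passing an explicit `config_dir` is only there to make the helper easy to try in a scratch folder.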
In [ ]:
!kaggle datasets download -d kaggle/kaggle-survey-2017
!unzip kaggle-survey-2017.zip
Downloading kaggle-survey-2017.zip to /content
0% 0.00/3.52M [00:00<?, ?B/s]
100% 3.52M/3.52M [00:00<00:00, 236MB/s]
Archive: kaggle-survey-2017.zip
inflating: RespondentTypeREADME.txt
inflating: conversionRates.csv
inflating: freeformResponses.csv
replace multipleChoiceResponses.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
inflating: multipleChoiceResponses.csv
inflating: schema.csv
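The `replace ...?` prompt in the output above appears because one of the files already existed when `unzip` ran. As an alternative sketch, Python's standard `zipfile` module extracts straight into a target folder and overwrites existing files without asking. The helper name `extract_dataset` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir; return the member names."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # extractall overwrites files that already exist, without prompting.
        archive.extractall(dest_dir)
        return sorted(archive.namelist())
```

For example, `extract_dataset("kaggle-survey-2017.zip", Path.home() / ".kaggle/competitions/kaggle-survey-2017")` would unpack the archive directly into the dataset folder.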
In [ ]:
!mkdir -p ~/.kaggle/competitions/kaggle-survey-2017  # create a folder for this dataset
!cp RespondentTypeREADME.txt ~/.kaggle/competitions/kaggle-survey-2017  # copy each extracted file into it
!cp conversionRates.csv ~/.kaggle/competitions/kaggle-survey-2017
!cp freeformResponses.csv ~/.kaggle/competitions/kaggle-survey-2017
!cp multipleChoiceResponses.csv ~/.kaggle/competitions/kaggle-survey-2017
!cp schema.csv ~/.kaggle/competitions/kaggle-survey-2017
In [ ]:
# The ls command lists the files found at a given path.
%ls ~/.kaggle/competitions/kaggle-survey-2017
conversionRates.csv multipleChoiceResponses.csv SurveySchema.csv
freeformResponses.csv RespondentTypeREADME.txt
freeFormResponses.csv schema.csv
In [ ]:
# Render plots inline in the notebook
%matplotlib inline
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
# Suppress warnings in the notebook output
import warnings
warnings.filterwarnings('ignore')
In [ ]:
survey = pd.read_csv('~/.kaggle/competitions/kaggle-survey-2017/schema.csv')
ffr = pd.read_csv('~/.kaggle/competitions/kaggle-survey-2017/freeformResponses.csv', encoding='ISO-8859-1', low_memory=False)
mcr = pd.read_csv('~/.kaggle/competitions/kaggle-survey-2017/multipleChoiceResponses.csv', encoding='ISO-8859-1', low_memory=False)
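The two response files are read with `encoding='ISO-8859-1'`, presumably because they contain bytes that are not valid UTF-8. The sketch below shows that effect on a small in-memory sample. The sample rows and the helper name `read_latin1` are invented for illustration and are not taken from the survey data.

```python
import io

import pandas as pd

# Hypothetical sample: "é" is the single byte 0xE9 in ISO-8859-1,
# which is not a valid byte sequence in UTF-8 when followed by a newline.
raw = "Country,Note\nFrance,café\nKorea,ok\n".encode("ISO-8859-1")


def read_latin1(buffer):
    """Read CSV bytes that were written with the ISO-8859-1 encoding."""
    return pd.read_csv(buffer, encoding="ISO-8859-1", low_memory=False)


sample = read_latin1(io.BytesIO(raw))
print(sample)
```

Decoding the same bytes as UTF-8 (`raw.decode("utf-8")`) raises `UnicodeDecodeError`, which is the kind of failure the explicit `encoding` argument avoids.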
In [ ]:
mcr.sample()
Out[ ]:
| GenderSelect | Country | Age | EmploymentStatus | StudentStatus | LearningDataScience | CodeWriter | CareerSwitcher | CurrentJobTitleSelect | TitleFit | ... | JobFactorExperienceLevel | JobFactorDepartment | JobFactorTitle | JobFactorCompanyFunding | JobFactorImpact | JobFactorRemote | JobFactorIndustry | JobFactorLeaderReputation | JobFactorDiversity | JobFactorPublishingOpportunity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11420 | Male | Other | 25.0 | Employed full-time | NaN | NaN | Yes | NaN | Data Scientist | Fine | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 rows × 228 columns
sns.countplot()
In [ ]:
# Gender
# countplot shows how many rows fall into each category value.
sns.countplot(y='GenderSelect', data=mcr)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f90472e0ad0>
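The numbers behind a count plot can be checked with `Series.value_counts()`, which tallies rows per category and skips missing values by default. A minimal sketch on invented toy data (the `toy` frame and the helper name `category_counts` are hypothetical):

```python
import pandas as pd


def category_counts(frame, column):
    """Return the number of rows per category, ignoring missing values."""
    return frame[column].value_counts()


# Hypothetical toy data standing in for mcr.
toy = pd.DataFrame({"GenderSelect": ["Male", "Female", "Male", None, "Male"]})
counts = category_counts(toy, "GenderSelect")
print(counts)
```

On the real data, `category_counts(mcr, 'GenderSelect')` would give the bar lengths as a table.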
In [ ]:
# Formal education
sns.countplot(y='FormalEducation', data=mcr)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9043c79310>
In [ ]:
sns.countplot(y='FormalEducation',
              hue='GenderSelect',
              data=mcr).legend(loc='center left',
                               bbox_to_anchor=(1, 0.5))
Out[ ]:
<matplotlib.legend.Legend at 0x7f9040d7b650>
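Splitting the bars with `hue` amounts to counting rows for every pair of categories. `pd.crosstab` produces the same counts as a table, which can be easier to read than a crowded legend. A sketch on invented toy data (the `toy` frame and the helper name `education_by_gender` are hypothetical):

```python
import pandas as pd


def education_by_gender(frame):
    """Count rows for each (FormalEducation, GenderSelect) combination."""
    return pd.crosstab(frame["FormalEducation"], frame["GenderSelect"])


# Hypothetical toy data standing in for mcr.
toy = pd.DataFrame({
    "FormalEducation": ["Master's degree", "Bachelor's degree",
                        "Master's degree", "Doctoral degree"],
    "GenderSelect": ["Male", "Female", "Female", "Male"],
})
table = education_by_gender(toy)
print(table)
```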
sns.distplot()
In [ ]:
# Seaborn's distplot can overlay a rug plot and a kernel density estimate, so it is often used instead of Matplotlib's hist.
sns.distplot(mcr[mcr['Age']>0]['Age'])
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f90472c96d0>
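The filter `mcr['Age'] > 0` drops zero ages and also missing ones, since a comparison with NaN is false. The histogram part of the plot can then be reproduced numerically with `np.histogram`. This is a sketch on invented ages, with a hypothetical helper name `age_histogram`. It covers only the bar counts, not the kernel density curve. Note also that `distplot` has been deprecated since seaborn 0.11 in favor of `histplot` and `displot`, so on newer installations the plotting call may need to change.

```python
import numpy as np
import pandas as pd


def age_histogram(ages, bins=10):
    """Return (counts, bin_edges) for ages that are present and positive."""
    ages = pd.Series(ages, dtype="float64")
    valid = ages[ages > 0]          # NaN > 0 is False, so NaN rows drop out too
    counts, edges = np.histogram(valid, bins=bins)
    return counts, edges


# Hypothetical ages: one zero and one missing value are filtered out.
counts, edges = age_histogram([0, 25, 30, np.nan, 35, 40], bins=3)
print(counts, edges)
```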
In [ ]:
# Plot the age distribution separately for each gender
figure, (ax1, ax2) = plt.subplots(ncols=2)
figure.set_size_inches(12, 5)
sns.distplot(mcr[mcr['GenderSelect'] == 'Female']['Age'].dropna(),
             norm_hist=False, ax=ax1)
# Set the title on the axes object; plt.title would target the current axes (ax2)
ax1.set_title('Female')
sns.distplot(mcr[mcr['GenderSelect'] == 'Male']['Age'].dropna(),
             norm_hist=False, ax=ax2)
ax2.set_title('Male')
Out[ ]:
Text(0.5, 1.0, 'Male')