감성 분석¶
지도학습 기반 감성 분석 - IMDB 영화평¶
In [2]:
import pandas as pd
review_df = pd.read_csv('/content/drive/MyDrive/data/labeledTrainData.tsv', header=0, sep='\t',
quoting=3) # 큰 따옴표 무시
review_df.head()
Out[2]:
id | sentiment | review | |
---|---|---|---|
0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |
In [3]:
review_df['review'][0]
Out[3]:
'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci\'s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ\'s music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ\'s bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i\'ve gave this subject....hmmm well i don\'t know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."'
In [4]:
import re
# <br> html 태그는 replace 함수로 공백으로 변환
review_df['review'] = review_df['review'].str.replace('<br />', '')
# 파이썬의 정규 표현식 모듈인 re를 이용해 영어 문자열이 아닌 문자는 모두 공백으로 변환
review_df['review'].apply(lambda x : re.sub('[^a-zA-Z]', '', x))
Out[4]:
0 WithallthisstuffgoingdownatthemomentwithMJives... 1 TheClassicWaroftheWorldsbyTimothyHinesisaverye... 2 ThefilmstartswithamanagerNicholasBellgivingwel... 3 Itmustbeassumedthatthosewhopraisedthisfilmtheg... 4 Superblytrashyandwondrouslyunpretentioussexplo... ... 24995 ItseemslikemoreconsiderationhasgoneintotheIMDb... 24996 IdontbelievetheymadethisfilmCompletelyunnecess... 24997 GuyisaloserCantgetgirlsneedstobuildupispickedo... 24998 ThisminutedocumentaryBuuelmadeintheearlysabout... 24999 IsawthismovieasachildanditbrokemyheartNoothers... Name: review, Length: 25000, dtype: object
In [8]:
from sklearn.model_selection import train_test_split
class_df = review_df['sentiment']
feature_df = review_df.drop(['id', 'sentiment'], axis=1, inplace=False)
X_train, X_test, y_train, y_test = train_test_split(feature_df, class_df, test_size=0.3, random_state=156)
X_train.shape, X_test.shape
Out[8]:
((17500, 1), (7500, 1))
In [9]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
# 스톱 워드는 English, filtering, ngram은 (1,2)로 설정해 CountVectorization 수행
# LogisticRegression의 C는 10으로 설정
pipeline = Pipeline([
("cnt_vect", CountVectorizer(stop_words="english", ngram_range=(1,2) ) ),
("lr_clf", LogisticRegression(solver='liblinear', C=10) )
])
# Pipeline 객체를 이용해 fit(), predict()로 학습/예측 수행. predict_proba()는 roc_auc 때문에 수행
pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_prob = pipeline.predict_proba(X_test['review'])[:,1]
print("예측 정확도: {0:.4f}, ROC-AUC: {1:.4f}".format(accuracy_score(y_test, pred), roc_auc_score(y_test, pred_prob)))
예측 정확도: 0.8865, ROC-AUC: 0.9501
In [10]:
# 스톱 워드는 English, filtering, ngram은 (1,2)로 설정해 TF-IDF 벡터화 수행
# LogisticRegression의 C는 10으로 설정
pipeline = Pipeline([
('tfidf_vect', TfidfVectorizer(stop_words='english', ngram_range=(1,2))),
("lr_clf", LogisticRegression(solver='liblinear', C=10) )
])
# Pipeline 객체를 이용해 fit(), predict()로 학습/예측 수행. predict_proba()는 roc_auc 때문에 수행
pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_prob = pipeline.predict_proba(X_test['review'])[:,1]
print("예측 정확도: {0:.4f}, ROC-AUC: {1:.4f}".format(accuracy_score(y_test, pred), roc_auc_score(y_test, pred_prob)))
예측 정확도: 0.8937, ROC-AUC: 0.9597
- TF-IDF 기반 피처 벡터화의 예측 성능이 조금 더 나아졌다.
비지도학습 기반 감성 분석 - VADER를 이용한 영화 감상평 감성 분석¶
In [11]:
!pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/ Collecting vaderSentiment Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB) |████████████████████████████████| 125 kB 18.2 MB/s Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from vaderSentiment) (2.23.0) Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->vaderSentiment) (1.24.3) Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->vaderSentiment) (2022.9.24) Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->vaderSentiment) (3.0.4) Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->vaderSentiment) (2.10) Installing collected packages: vaderSentiment Successfully installed vaderSentiment-3.3.2
In [13]:
import nltk
nltk.download('all')
[nltk_data] Downloading collection 'all' [nltk_data] | [nltk_data] | Downloading package abc to /root/nltk_data... [nltk_data] | Unzipping corpora/abc.zip. [nltk_data] | Downloading package alpino to /root/nltk_data... [nltk_data] | Unzipping corpora/alpino.zip. [nltk_data] | Downloading package averaged_perceptron_tagger to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping taggers/averaged_perceptron_tagger.zip. [nltk_data] | Downloading package averaged_perceptron_tagger_ru to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping [nltk_data] | taggers/averaged_perceptron_tagger_ru.zip. [nltk_data] | Downloading package basque_grammars to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping grammars/basque_grammars.zip. [nltk_data] | Downloading package biocreative_ppi to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/biocreative_ppi.zip. [nltk_data] | Downloading package bllip_wsj_no_aux to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping models/bllip_wsj_no_aux.zip. [nltk_data] | Downloading package book_grammars to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping grammars/book_grammars.zip. [nltk_data] | Downloading package brown to /root/nltk_data... [nltk_data] | Unzipping corpora/brown.zip. [nltk_data] | Downloading package brown_tei to /root/nltk_data... [nltk_data] | Unzipping corpora/brown_tei.zip. [nltk_data] | Downloading package cess_cat to /root/nltk_data... [nltk_data] | Unzipping corpora/cess_cat.zip. [nltk_data] | Downloading package cess_esp to /root/nltk_data... [nltk_data] | Unzipping corpora/cess_esp.zip. [nltk_data] | Downloading package chat80 to /root/nltk_data... [nltk_data] | Unzipping corpora/chat80.zip. [nltk_data] | Downloading package city_database to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/city_database.zip. [nltk_data] | Downloading package cmudict to /root/nltk_data... [nltk_data] | Unzipping corpora/cmudict.zip. [nltk_data] | Downloading package comparative_sentences to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/comparative_sentences.zip. [nltk_data] | Downloading package comtrans to /root/nltk_data... [nltk_data] | Downloading package conll2000 to /root/nltk_data... [nltk_data] | Unzipping corpora/conll2000.zip. [nltk_data] | Downloading package conll2002 to /root/nltk_data... [nltk_data] | Unzipping corpora/conll2002.zip. [nltk_data] | Downloading package conll2007 to /root/nltk_data... [nltk_data] | Downloading package crubadan to /root/nltk_data... [nltk_data] | Unzipping corpora/crubadan.zip. [nltk_data] | Downloading package dependency_treebank to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/dependency_treebank.zip. [nltk_data] | Downloading package dolch to /root/nltk_data... [nltk_data] | Unzipping corpora/dolch.zip. [nltk_data] | Downloading package europarl_raw to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/europarl_raw.zip. [nltk_data] | Downloading package extended_omw to [nltk_data] | /root/nltk_data... [nltk_data] | Downloading package floresta to /root/nltk_data... [nltk_data] | Unzipping corpora/floresta.zip. [nltk_data] | Downloading package framenet_v15 to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/framenet_v15.zip. [nltk_data] | Downloading package framenet_v17 to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/framenet_v17.zip. [nltk_data] | Downloading package gazetteers to /root/nltk_data... [nltk_data] | Unzipping corpora/gazetteers.zip. [nltk_data] | Downloading package genesis to /root/nltk_data... [nltk_data] | Unzipping corpora/genesis.zip. [nltk_data] | Downloading package gutenberg to /root/nltk_data... [nltk_data] | Unzipping corpora/gutenberg.zip. [nltk_data] | Downloading package ieer to /root/nltk_data... [nltk_data] | Unzipping corpora/ieer.zip. [nltk_data] | Downloading package inaugural to /root/nltk_data... [nltk_data] | Unzipping corpora/inaugural.zip. [nltk_data] | Downloading package indian to /root/nltk_data... [nltk_data] | Unzipping corpora/indian.zip. [nltk_data] | Downloading package jeita to /root/nltk_data... [nltk_data] | Downloading package kimmo to /root/nltk_data... [nltk_data] | Unzipping corpora/kimmo.zip. [nltk_data] | Downloading package knbc to /root/nltk_data... [nltk_data] | Downloading package large_grammars to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping grammars/large_grammars.zip. [nltk_data] | Downloading package lin_thesaurus to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/lin_thesaurus.zip. [nltk_data] | Downloading package mac_morpho to /root/nltk_data... [nltk_data] | Unzipping corpora/mac_morpho.zip. [nltk_data] | Downloading package machado to /root/nltk_data... [nltk_data] | Downloading package masc_tagged to /root/nltk_data... [nltk_data] | Downloading package maxent_ne_chunker to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping chunkers/maxent_ne_chunker.zip. [nltk_data] | Downloading package maxent_treebank_pos_tagger to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping taggers/maxent_treebank_pos_tagger.zip. [nltk_data] | Downloading package moses_sample to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping models/moses_sample.zip. [nltk_data] | Downloading package movie_reviews to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/movie_reviews.zip. [nltk_data] | Downloading package mte_teip5 to /root/nltk_data... [nltk_data] | Unzipping corpora/mte_teip5.zip. [nltk_data] | Downloading package mwa_ppdb to /root/nltk_data... [nltk_data] | Unzipping misc/mwa_ppdb.zip. [nltk_data] | Downloading package names to /root/nltk_data... [nltk_data] | Unzipping corpora/names.zip. [nltk_data] | Downloading package nombank.1.0 to /root/nltk_data... [nltk_data] | Downloading package nonbreaking_prefixes to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/nonbreaking_prefixes.zip. [nltk_data] | Downloading package nps_chat to /root/nltk_data... [nltk_data] | Unzipping corpora/nps_chat.zip. [nltk_data] | Downloading package omw to /root/nltk_data... [nltk_data] | Downloading package omw-1.4 to /root/nltk_data... [nltk_data] | Downloading package opinion_lexicon to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/opinion_lexicon.zip. [nltk_data] | Downloading package panlex_swadesh to [nltk_data] | /root/nltk_data... [nltk_data] | Downloading package paradigms to /root/nltk_data... [nltk_data] | Unzipping corpora/paradigms.zip. [nltk_data] | Downloading package pe08 to /root/nltk_data... [nltk_data] | Unzipping corpora/pe08.zip. [nltk_data] | Downloading package perluniprops to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping misc/perluniprops.zip. [nltk_data] | Downloading package pil to /root/nltk_data... [nltk_data] | Unzipping corpora/pil.zip. [nltk_data] | Downloading package pl196x to /root/nltk_data... [nltk_data] | Unzipping corpora/pl196x.zip. [nltk_data] | Downloading package porter_test to /root/nltk_data... [nltk_data] | Unzipping stemmers/porter_test.zip. [nltk_data] | Downloading package ppattach to /root/nltk_data... [nltk_data] | Unzipping corpora/ppattach.zip. [nltk_data] | Downloading package problem_reports to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/problem_reports.zip. [nltk_data] | Downloading package product_reviews_1 to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/product_reviews_1.zip. [nltk_data] | Downloading package product_reviews_2 to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/product_reviews_2.zip. [nltk_data] | Downloading package propbank to /root/nltk_data... [nltk_data] | Downloading package pros_cons to /root/nltk_data... [nltk_data] | Unzipping corpora/pros_cons.zip. [nltk_data] | Downloading package ptb to /root/nltk_data... [nltk_data] | Unzipping corpora/ptb.zip. [nltk_data] | Downloading package punkt to /root/nltk_data... [nltk_data] | Unzipping tokenizers/punkt.zip. [nltk_data] | Downloading package qc to /root/nltk_data... [nltk_data] | Unzipping corpora/qc.zip. [nltk_data] | Downloading package reuters to /root/nltk_data... [nltk_data] | Downloading package rslp to /root/nltk_data... [nltk_data] | Unzipping stemmers/rslp.zip. [nltk_data] | Downloading package rte to /root/nltk_data... [nltk_data] | Unzipping corpora/rte.zip. [nltk_data] | Downloading package sample_grammars to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping grammars/sample_grammars.zip. [nltk_data] | Downloading package semcor to /root/nltk_data... [nltk_data] | Downloading package senseval to /root/nltk_data... [nltk_data] | Unzipping corpora/senseval.zip. [nltk_data] | Downloading package sentence_polarity to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/sentence_polarity.zip. [nltk_data] | Downloading package sentiwordnet to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/sentiwordnet.zip. [nltk_data] | Downloading package shakespeare to /root/nltk_data... [nltk_data] | Unzipping corpora/shakespeare.zip. [nltk_data] | Downloading package sinica_treebank to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/sinica_treebank.zip. [nltk_data] | Downloading package smultron to /root/nltk_data... [nltk_data] | Unzipping corpora/smultron.zip. [nltk_data] | Downloading package snowball_data to [nltk_data] | /root/nltk_data... [nltk_data] | Downloading package spanish_grammars to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping grammars/spanish_grammars.zip. [nltk_data] | Downloading package state_union to /root/nltk_data... [nltk_data] | Unzipping corpora/state_union.zip. [nltk_data] | Downloading package stopwords to /root/nltk_data... [nltk_data] | Unzipping corpora/stopwords.zip. [nltk_data] | Downloading package subjectivity to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/subjectivity.zip. [nltk_data] | Downloading package swadesh to /root/nltk_data... [nltk_data] | Unzipping corpora/swadesh.zip. [nltk_data] | Downloading package switchboard to /root/nltk_data... [nltk_data] | Unzipping corpora/switchboard.zip. [nltk_data] | Downloading package tagsets to /root/nltk_data... [nltk_data] | Unzipping help/tagsets.zip. [nltk_data] | Downloading package timit to /root/nltk_data... [nltk_data] | Unzipping corpora/timit.zip. [nltk_data] | Downloading package toolbox to /root/nltk_data... [nltk_data] | Unzipping corpora/toolbox.zip. [nltk_data] | Downloading package treebank to /root/nltk_data... [nltk_data] | Unzipping corpora/treebank.zip. [nltk_data] | Downloading package twitter_samples to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/twitter_samples.zip. [nltk_data] | Downloading package udhr to /root/nltk_data... [nltk_data] | Unzipping corpora/udhr.zip. [nltk_data] | Downloading package udhr2 to /root/nltk_data... [nltk_data] | Unzipping corpora/udhr2.zip. [nltk_data] | Downloading package unicode_samples to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping corpora/unicode_samples.zip. [nltk_data] | Downloading package universal_tagset to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping taggers/universal_tagset.zip. [nltk_data] | Downloading package universal_treebanks_v20 to [nltk_data] | /root/nltk_data... [nltk_data] | Downloading package vader_lexicon to [nltk_data] | /root/nltk_data... [nltk_data] | Downloading package verbnet to /root/nltk_data... [nltk_data] | Unzipping corpora/verbnet.zip. [nltk_data] | Downloading package verbnet3 to /root/nltk_data... [nltk_data] | Unzipping corpora/verbnet3.zip. [nltk_data] | Downloading package webtext to /root/nltk_data... [nltk_data] | Unzipping corpora/webtext.zip. [nltk_data] | Downloading package wmt15_eval to /root/nltk_data... [nltk_data] | Unzipping models/wmt15_eval.zip. [nltk_data] | Downloading package word2vec_sample to [nltk_data] | /root/nltk_data... [nltk_data] | Unzipping models/word2vec_sample.zip. [nltk_data] | Downloading package wordnet to /root/nltk_data... [nltk_data] | Downloading package wordnet2021 to /root/nltk_data... [nltk_data] | Downloading package wordnet31 to /root/nltk_data... [nltk_data] | Downloading package wordnet_ic to /root/nltk_data... [nltk_data] | Unzipping corpora/wordnet_ic.zip. [nltk_data] | Downloading package words to /root/nltk_data... [nltk_data] | Unzipping corpora/words.zip. [nltk_data] | Downloading package ycoe to /root/nltk_data... [nltk_data] | Unzipping corpora/ycoe.zip. [nltk_data] | [nltk_data] Done downloading collection all
Out[13]:
True
In [14]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
senti_analyzer = SentimentIntensityAnalyzer()
senti_scores = senti_analyzer.polarity_scores(review_df['review'][0])
senti_scores
Out[14]:
{'neg': 0.132, 'neu': 0.74, 'pos': 0.128, 'compound': -0.8278}
- 'neg'는 부정 감성 지수, 'neu'는 중립적인 감성 지수, 'pos'는 긍정 감성 지수
- compound score가 0.1 이상이면 긍정 감성, 그 이하면 부정 감성
In [22]:
def vader_polarity(review, threshold=0.1):
analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores(review)
# compound 값에 기반해 threshold 입력값보다 크면 1, 그렇지 않으면 0을 반환
agg_score = scores['compound']
final_sentiment = 1 if agg_score >= threshold else 0
return final_sentiment
# apply lambda 식을 이용해 레코드별로 vader_polarity()를 수행하고 결과를 vader_preds에 저장
review_df['vader_preds'] = review_df['review'].apply(lambda x : vader_polarity(x, 0.1))
y_target = review_df['sentiment'].values
vader_preds = review_df['vader_preds'].values
In [23]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
import numpy as np
print(confusion_matrix(y_target, vader_preds))
print('정확도:', np.round(accuracy_score(y_target, vader_preds),4))
print('정밀도:', np.round(precision_score(y_target, vader_preds),4))
print('재현도:', np.round(recall_score(y_target, vader_preds),4))
[[ 6815 5685] [ 1942 10558]] 정확도: 0.6949 정밀도: 0.65 재현도: 0.8446
- 감성 사전을 이용한 감성 분석 예측 성능은 지도학습 분류 기반의 예측 성능에 비해 아직은 낮은 수준이지만,
- 결정 클래스 값이 없는 상황을 고려한다면 일정 수준 만족할 수 있을 것이다.
토픽 모델링¶
- 문서 집합에 숨어있는 주제를 찾아내는 것
- LDA (Latent Dirichlet Allocation)는 카운트 기반의 벡터화만 사용
In [24]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
# 모터사이클, 야구, 그래픽스, 윈도우즈, 중동, 기독교, 전자공학, 의학 8개 주제를 추출
cats = ['rec.motorcycles', 'rec.sport.baseball', 'comp.graphics', 'comp.windows.x',
'talk.politics.mideast', 'soc.religion.christian', 'sci.electronics', 'sci.med']
# 위에서 cats 변수로 기재된 카테고리만 추출
news_df = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'),
categories=cats, random_state=0)
# LDA는 Count 기반의 벡터화만 적용한다
count_vect = CountVectorizer(max_df=0.95, # 문서에서 95% 이상 나온 단어 제거 (너무 많이 나온 단어 제거)
min_df=2, # 문서에서 2개 이하로 나온 단어 제거 (너무 적게 나온 단어 제거)
max_features=1000, # vector의 최대 feature를 설정
stop_words='english', ngram_range=(1, 2))
feat_vect = count_vect.fit_transform(news_df.data)
print('CountVectorizer Shape:', feat_vect.shape)
CountVectorizer Shape: (7862, 1000)
- 7862개의 문서가 1000개의 피처로 구성된 행렬 데이터
In [25]:
# 피처 벡터화된 데이터 세트를 기반으로 LDA 토픽 모델링 수행
# 토픽의 개수는 위의 뉴스그룹에서 추출한 주제와 동일한 8개
lda = LatentDirichletAllocation(n_components=8, random_state=0)
lda.fit(feat_vect)
Out[25]:
LatentDirichletAllocation(n_components=8, random_state=0)
In [28]:
print(lda.components_.shape)
lda.components_
(8, 1000)
Out[28]:
array([[3.60992018e+01, 1.35626798e+02, 2.15751867e+01, ..., 3.02911688e+01, 8.66830093e+01, 6.79285199e+01], [1.25199920e-01, 1.44401815e+01, 1.25045596e-01, ..., 1.81506995e+02, 1.25097844e-01, 9.39593286e+01], [3.34762663e+02, 1.25176265e-01, 1.46743299e+02, ..., 1.25105772e-01, 3.63689741e+01, 1.25025218e-01], ..., [3.60204965e+01, 2.08640688e+01, 4.29606813e+00, ..., 1.45056650e+01, 8.33854413e+00, 1.55690009e+01], [1.25128711e-01, 1.25247756e-01, 1.25005143e-01, ..., 9.17278769e+01, 1.25177668e-01, 3.74575887e+01], [5.49258690e+01, 4.47009532e+00, 9.88524814e+00, ..., 4.87048440e+01, 1.25034678e-01, 1.25074632e-01]])
- 8개의 토픽별로 1000개의 word 피처가 해당 토픽별로 연관도 값을 가지고 있다.
캐글 Mercari Price Suggestion Challenge¶
데이터 전처리¶
In [58]:
from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
mercari_df = pd.read_csv('/content/drive/MyDrive/data/mercari_train.tsv', sep='\t')
print(mercari_df.shape)
mercari_df.head()
(1482535, 8)
Out[58]:
train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description | |
---|---|---|---|---|---|---|---|---|
0 | 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | NaN | 10.0 | 1 | No description yet |
1 | 1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... | Razer | 52.0 | 0 | This keyboard is in great condition and works ... |
2 | 2 | AVA-VIV Blouse | 1 | Women/Tops & Blouses/Blouse | Target | 10.0 | 1 | Adorable top with a hint of lace and a key hol... |
3 | 3 | Leather Horse Statues | 1 | Home/Home Décor/Home Décor Accents | NaN | 35.0 | 1 | New with tags. Leather horses. Retail for [rm]... |
4 | 4 | 24K GOLD plated rose | 1 | Women/Jewelry/Necklaces | NaN | 44.0 | 0 | Complete with certificate of authenticity |
In [59]:
mercari_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1482535 entries, 0 to 1482534 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 train_id 1482535 non-null int64 1 name 1482535 non-null object 2 item_condition_id 1482535 non-null int64 3 category_name 1476208 non-null object 4 brand_name 849853 non-null object 5 price 1482535 non-null float64 6 shipping 1482535 non-null int64 7 item_description 1482531 non-null object dtypes: float64(1), int64(3), object(4) memory usage: 90.5+ MB
In [61]:
# 회귀에서 Target 값의 정규 분포도는 매우 중요
import matplotlib.pyplot as plt
import seaborn as sns
y_train_df = mercari_df['price']
plt.figure(figsize=(6,4))
sns.histplot(y_train_df, bins=100)
plt.show()
In [62]:
# Price 컬럼을 로그 변환
import numpy as np
y_train_df = np.log1p(y_train_df)
sns.histplot(y_train_df, bins=100)
plt.show()
In [63]:
mercari_df['price'] = np.log1p(mercari_df['price'])
mercari_df['price']
Out[63]:
0 2.397895 1 3.970292 2 2.397895 3 3.583519 4 3.806662 ... 1482530 3.044522 1482531 2.708050 1482532 2.564949 1482533 3.828641 1482534 3.135494 Name: price, Length: 1482535, dtype: float64
In [64]:
# 배송비 유무
mercari_df['shipping'].value_counts()
Out[64]:
0 819435 1 663100 Name: shipping, dtype: int64
In [65]:
# 판매자가 제공하는 제품 상태
mercari_df['item_condition_id'].value_counts()
Out[65]:
1 640549 3 432161 2 375479 4 31962 5 2384 Name: item_condition_id, dtype: int64
In [66]:
mercari_df[mercari_df['item_description'] == 'No description yet']['item_description'].count()
Out[66]:
82489
In [74]:
def split_cat(category_name):
try:
return category_name.split('/')
except:
return ['Other_Null', 'Other_Null', 'Other_Null']
# 위의 split_cat()를 apply lambda에서 호출해 대, 중, 소 컬럼을 mercari_df에 생성
# zip 함수의 *는 unpack
mercari_df['cat_dae'], mercari_df['cat_jung'], mercari_df['cat_so'] = zip(*mercari_df['category_name']
.apply(lambda x : split_cat(x)))
print('대분류 유형 :\n', mercari_df['cat_dae'].value_counts())
print('중분류 개수 :', mercari_df['cat_jung'].nunique())
print('소분류 개수 :',mercari_df['cat_so'].nunique())
대분류 유형 : Women 664385 Beauty 207828 Kids 171689 Electronics 122690 Men 93680 Home 67871 Vintage & Collectibles 46530 Other 45351 Handmade 30842 Sports & Outdoors 25342 Other_Null 6327 Name: cat_dae, dtype: int64 중분류 개수 : 114 소분류 개수 : 871
In [75]:
mercari_df['brand_name'] = mercari_df['brand_name'].fillna(value='Other_Null')
mercari_df['category_name'] = mercari_df['category_name'].fillna(value='Other_Null')
mercari_df['item_description'] = mercari_df['item_description'].fillna(value='Other_Null')
mercari_df.isnull().sum()
Out[75]:
train_id 0 name 0 item_condition_id 0 category_name 0 brand_name 0 price 0 shipping 0 item_description 0 cat_dae 0 cat_jung 0 cat_so 0 dtype: int64
피처 인코딩과 피처 벡터화¶
- 특히 선형 회귀의 경우 원 핫 인코딩 적용을 훨씬 선호
- 비교적 짧은 텍스트는 Count 기반의 벡터화를, 긴 텍스트는 TF-IDF 기반의 벡터화를 적용
In [76]:
print(mercari_df['brand_name'].nunique())
mercari_df['brand_name'].value_counts()
# brand_name의 경우 명료한 문자열로 되어 있어
# 별도의 피처 벡터화 형태로 만들 필요 없이 인코딩 적용
4810
Out[76]:
Other_Null 632682 PINK 54088 Nike 54043 Victoria's Secret 48036 LuLaRoe 31024 ... The Learning Journey 1 Pampers Baby Fresh 1 Huggies One & Done 1 Classic Media 1 Kids Only 1 Name: brand_name, Length: 4810, dtype: int64
In [77]:
print(mercari_df['name'].nunique())
mercari_df['name'].value_counts()
# Name은 유형이 매우 많고, 적은 단어 위주의 텍스트이므로
# Count 기반으로 피처 벡터화 변환을 적용
1225273
Out[77]:
Bundle 2232 Reserved 453 Converse 445 BUNDLE 418 Dress 410 ... Medium Le Pliage Longchamp bag, pink 1 Victoria secret dream angel heavenly 1 American Eagle kickboot Khakis 8L 1 Jack slippers 1 Brand new lux de ville wallet 1 Name: name, Length: 1225273, dtype: int64
In [78]:
# name 속성에 대한 피처 벡터화 변환
cnt_vect = CountVectorizer()
X_name = cnt_vect.fit_transform(mercari_df.name)
# item_description에 대한 피처 벡터화 변환
tfidf_vect = TfidfVectorizer(max_features=50000, ngram_range=(1,3), stop_words='english')
X_descp = tfidf_vect.fit_transform(mercari_df.item_description)
print('name vectorization shape', X_name.shape)
print('item_description vectorization shape',X_descp.shape)
name vectorization shape (1482535, 105757) item_description vectorization shape (1482535, 50000)
- CountVectorizer, TfidfVectorizer가 반환하는 데이터는 희소 행렬 (Sparse Matrix)
- 사이킷런은 원 핫 인코딩을 위해 OneHotEncoder와 LabelBinarizer 클래스 제공
- LabelBinarizer를 이용해 원 핫 인코딩 피처들을 희소 행렬로 변환 후 피처벡터화된 희소 행렬들과 결합
In [82]:
from sklearn.preprocessing import LabelBinarizer
# brand_name, item_condition_id, shipping 각 피처들을 희소 행렬 원 핫 인코딩 변환
lb_brand_name = LabelBinarizer(sparse_output=True)
X_brand = lb_brand_name.fit_transform(mercari_df['brand_name'])
lb_item_condition_id = LabelBinarizer(sparse_output=True)
X_item_condition_id = lb_item_condition_id.fit_transform(mercari_df['item_condition_id'])
lb_shipping = LabelBinarizer(sparse_output=True)
X_shipping = lb_shipping.fit_transform(mercari_df['shipping'])
# cat_dae, cat_jung, cat_so 각 피처들을 희소 행렬 원 핫 인코딩 변환
lb_cat_dae = LabelBinarizer(sparse_output=True)
X_cat_dae = lb_cat_dae.fit_transform(mercari_df['cat_dae'])
lb_cat_jung = LabelBinarizer(sparse_output=True)
X_cat_jung = lb_cat_jung.fit_transform(mercari_df['cat_jung'])
lb_cat_so = LabelBinarizer(sparse_output=True)
X_cat_so = lb_cat_so.fit_transform(mercari_df['cat_so'])
In [84]:
print(type(X_brand))
print(type(X_item_condition_id))
print(type(X_shipping))
# 인코딩 변환된 데이터 세트가 CSR 형태로 변환된 csr_matrix 타입
# 인코딩 컬럼이 매우 많이 생겼지만,
# 피처 벡터화로 텍스트 형태의 문자열이 가지는 벡처 형태의 매우 많은 컬럼과 함께 결합되므로 괜찮다.
<class 'scipy.sparse.csr.csr_matrix'> <class 'scipy.sparse.csr.csr_matrix'> <class 'scipy.sparse.csr.csr_matrix'>
In [88]:
from scipy.sparse import hstack
import gc
sparse_matrix_list = (X_name, X_descp, # 피처 벡터화
X_brand, X_item_condition_id, X_shipping, X_cat_dae, X_cat_jung, X_cat_so)
# hstack()를 이용해 인코딩과 벡터화를 수행한 데이터 세트를 모두 결합
X_features_sparse = hstack(sparse_matrix_list).tocsr()
print(type(X_features_sparse))
print(X_features_sparse.shape)
# 데이터 세트가 메모리를 많이 차지하므로 사용 목적이 끝났으면 바로 메모리에서 삭제
del X_features_sparse
gc.collect()
<class 'scipy.sparse.csr.csr_matrix'> (1482535, 161569)
Out[88]:
96
- hstack()으로 결합한 데이터 세트는 csr_matrix 타입이며, 총 161569개의 피처를 가지게 되었다.
릿지 회귀 모델 구축 및 평가¶
In [89]:
def rmsle(y, y_pred):
# underflow, overflow를 막기 위해 log가 아닌 log1p로 rmsle 계산
return np.sqrt(np.mean(np.power(np.log1p(y) - np.log1p(y_pred), 2)))
def evaluate_org_price(y_test, preds):
# 원본 데이터는 log1p로 변환되었으므로 expm1로 원복 필요
preds_expm = np.expm1(preds)
y_test_expm = np.expm1(y_test)
# rmsle로 RMSLE 값 추출
rmsle_result = rmsle(y_test_expm, preds_expm)
return rmsle_result
In [91]:
import gc
from scipy.sparse import hstack
def model_train_predict(model, matrix_list):
# scipy.sparse 모듈의 hstack을 이용해 희소 행렬 결합
X = hstack(matrix_list).tocsr()
X_train, X_test, y_train, y_test = train_test_split(X, mercari_df['price'], test_size=0.2, random_state=156)
# 모델 학습 및 예측
model.fit(X_train, y_train)
pred = model.predict(X_test)
del X, X_train, X_test, y_train
gc.collect()
return pred, y_test
In [92]:
linear_model = Ridge(solver='lsqr',
fit_intercept=False) # 모형에 상수항 (절편)이 있는가 없는가를 결정하는 인수 (default : True)
sparse_matrix_list = (X_name, X_brand, X_item_condition_id, \
X_shipping, X_cat_dae, X_cat_jung, X_cat_so)
linear_preds, y_test = model_train_predict(model=linear_model, matrix_list = sparse_matrix_list)
print('Item Description을 제외했을 때 rmsle 값:',evaluate_org_price(y_test, linear_preds))
linear_model = Ridge(solver = 'lsqr', fit_intercept=False)
sparse_matrix_list = (X_descp, X_name, X_brand, X_item_condition_id, \
X_shipping, X_cat_dae, X_cat_jung, X_cat_so)
linear_preds, y_test = model_train_predict(model=linear_model, matrix_list = sparse_matrix_list)
print('Item Description을 포함한 rmsle 값:',evaluate_org_price(y_test, linear_preds))
Item Description을 제외했을 때 rmsle 값: 0.5023727038010556 Item Description을 포함한 rmsle 값: 0.47121951434336345
- Item Description을 포함했을 때 rmsle 값이 많이 감소하므로, Item Description 영향이 중요함을 알 수 있다.
LightGBM 회귀 모델 구축과 앙상블을 이용한 최종 예측 평가¶
In [93]:
from lightgbm import LGBMRegressor
sparse_matrix_list = (X_descp, X_name, X_brand, X_item_condition_id, \
X_shipping, X_cat_dae, X_cat_jung, X_cat_so)
lgbm_model = LGBMRegressor(n_estimators=200, learning_rate=0.5, num_leave=125, random_state=156)
lgbm_preds, y_test = model_train_predict(model=lgbm_model, matrix_list=sparse_matrix_list)
print('LightGBM rmsle 값:',evaluate_org_price(y_test, lgbm_preds))
LightGBM rmsle 값: 0.47469487477978434
In [94]:
preds = lgbm_preds * 0.45 + linear_preds * 0.55
print('LightGBM과 Ridge를 ensemble한 최종 rmsle 값:',evaluate_org_price(y_test, preds))
LightGBM과 Ridge를 ensemble한 최종 rmsle 값: 0.460505576845082
- LightGBM의 예측 결과값과 Ridge의 예측 결과값을 서로 앙상블해 최종 예측 결과값을 도출한다.
'machine_learning' 카테고리의 다른 글
최근접 이웃(K-Nearest Neighbor) KNN (0) | 2022.11.25 |
---|---|
추천 시스템 (0) | 2022.11.20 |
텍스트 분석 (1) 텍스트 정규화 및 피처 벡터화 (0) | 2022.11.18 |
RFM 기반 고객 세그먼테이션 (0) | 2022.11.18 |
차원 축소와 군집화 (0) | 2022.11.18 |