[Text Mining] Reducing the dimensionality of Word2vec results (PCA)

Author : tmlab / Date : 2017. 12. 29. 21:36 / Category : Text Mining/Python

1. Loading the data

In [5]:
import os
os.chdir("/home/ajoumis2/quara/csv")
In [6]:
import numpy as np
import pandas as pd
raw = pd.read_csv('stop_words_data.csv', header=0) 
len(raw)
Out[6]:
404288
In [7]:
raw.isnull().sum().sum()
Out[7]:
157
In [4]:
raw[pd.isnull(raw.index)]  # empty: the 157 missing values are in the columns, not the index
Out[4]:
Empty DataFrame
Columns: [Unnamed: 0, question1, question2, full_question, label]
Index: []
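Since isnull().sum().sum() reports 157 missing cells while pd.isnull(raw.index) only tests the index (hence the empty result above), a small sketch like the following, which is not part of the original notebook, would list and optionally drop the affected rows before building the Word2Vec input:

# Sketch: rows that contain at least one of the 157 missing values
null_rows = raw[raw.isnull().any(axis=1)]
print(len(null_rows))

# Optionally drop them and reset the index
raw = raw.dropna().reset_index(drop=True)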

2. Building the Word2Vec input data

In [3]:
q1_list = list(raw['question1'])
q2_list = list(raw['question2'])
q_list = q1_list + q2_list
len(q_list)
Out[3]:
808576
In [4]:
w2v_input = []

for w2v_sentence in q_list:
    w2v_wordlist = str(w2v_sentence).split()  # split each question into a list of words
    w2v_input.append(w2v_wordlist)            # collect the token lists for gensim
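gensim's Word2Vec expects an iterable of token lists, so a quick shape check on w2v_input can catch problems early; this is just an illustrative sketch, not a cell from the original notebook:

# Sketch: every element should be a list of word strings
print(len(w2v_input))      # expected: 808576, one entry per question
print(w2v_input[0][:10])   # first tokens of the first question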

3. Training the Word2Vec model

3.1 All words, window size 5, 300 features

In [37]:
def hash32(value):
    return hash(value) & 0xffffffff  # mask the hash to 32 bits, passed to Word2Vec as hashfxn
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
    level=logging.INFO)

num_features = 300    # Word vector dimensionality                      
min_word_count = 1   # Minimum word count                        
num_workers = 50     # Number of threads to run in parallel
context = 5          # Context window size                                                                                    
downsampling = 1e-3  # Downsample setting for frequent words

# Initialize and train the model 
from gensim.models import word2vec
print ("Training model...")
model = word2vec.Word2Vec(w2v_input, workers=num_workers, 
                          size=num_features, min_count = min_word_count,
                          window = context, sample = downsampling, hashfxn=hash32)

model_name = "stop_300features_5context"
model.save(model_name)
2017-06-01 19:59:57,291 : INFO : collecting all words and their counts
2017-06-01 19:59:57,294 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-06-01 19:59:57,332 : INFO : PROGRESS: at sentence #10000, processed 54675 words, keeping 11984 word types
2017-06-01 19:59:57,376 : INFO : PROGRESS: at sentence #20000, processed 109532 words, keeping 17458 word types
2017-06-01 19:59:57,409 : INFO : PROGRESS: at sentence #30000, processed 164063 words, keeping 21574 word types
2017-06-01 19:59:57,448 : INFO : PROGRESS: at sentence #40000, processed 218212 words, keeping 24966 word types
2017-06-01 19:59:57,486 : INFO : PROGRESS: at sentence #50000, processed 273032 words, keeping 28132 word types
Training model...

... : INFO : PROGRESS: at sentence #180000, processed 983726 words, keeping 52850 word types
2017-06-01 19:59:57,965 : INFO : PROGRESS: at sentence #190000, processed 1038420 words, keeping 54248 word types
2017-06-01 20:00:39,468 : INFO : training on 22264380 raw words (21482397 effective words) took 36.6s, 587481 effective words/s
2017-06-01 20:00:39,553 : INFO : saving Word2Vec object under stop_300features_5context, separately None
2017-06-01 20:00:39,555 : INFO : storing np array 'syn0' to stop_300features_5context.wv.syn0.npy
2017-06-01 20:00:39,649 : INFO : storing np array 'syn1neg' to stop_300features_5context.syn1neg.npy
2017-06-01 20:00:39,743 : INFO : not storing attribute syn0norm
2017-06-01 20:00:39,745 : INFO : not storing attribute cum_table
2017-06-01 20:00:40,099 : INFO : saved stop_300features_5context
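Before reducing dimensionality, a quick neighbour query helps confirm the embeddings look sensible. The snippet below is a sketch, not part of the original notebook, and assumes 'india' (a frequent token in this corpus) is in the vocabulary:

# Sketch: nearest neighbours of a common token as a sanity check
print(model.wv.most_similar('india', topn=5))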

In [38]:
model.save_word2vec_format('stop_300.txt', binary=False)  # text-format export (newer gensim: model.wv.save_word2vec_format)
2017-06-01 20:00:52,854 : INFO : storing 103674x300 projection weights into stop_300.txt
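As an aside, the exported text file can also be reloaded with gensim itself instead of pandas; a minimal sketch, assuming a gensim version that provides KeyedVectors:

# Sketch: reload the exported vectors as KeyedVectors
from gensim.models import KeyedVectors
wv = KeyedVectors.load_word2vec_format('stop_300.txt', binary=False)
print(wv['best'][:5])   # first 5 of the 300 dimensions for 'best'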

4. Dimensionality reduction (PCA)

In [39]:
wordvec = pd.read_csv('stop_300.txt',
                      names=np.arange(0, 301, 1),  # column 0 = word, columns 1-300 = vector dimensions
                      sep=" ")
In [43]:
wordvec = wordvec[1:]  # drop the "vocab_size vector_size" header row of the word2vec text format
wordvec.head()
Out[43]:
       0          1         2         3         4         5         6         7         8         9  ...     291       292       293       294       295       296       297       298       299       300
1   best   -0.577359  0.328978  0.360973  1.348463 -0.458925  0.562232 -1.066533  0.508290 -0.095075  ...  0.090041  0.504813  0.952245  0.747423 -0.945117 -0.405453 -1.577781  0.969832  0.489959 -0.786234
2   get    -0.162952 -0.328455  0.423307  0.877100  0.103763  0.486245 -1.039555  1.097103 -0.616748  ... -1.342751 -0.532536 -0.123314  0.359062 -0.654478  0.068062 -0.286974 -0.245059 -0.336912  0.388487
3   india   0.158580 -1.731864 -0.656390 -0.110334  1.109474 -0.227837 -0.952561  0.614622  0.284045  ... -0.612629  0.768930  0.101692  0.735992 -0.384114 -0.681974 -0.807551  1.598546  0.887515 -0.467995
4   people -0.504831  0.333736 -0.536498  0.509518  0.430136 -0.586826 -0.544074 -0.751538 -0.163667  ...  0.397640  0.568188 -0.726848 -0.628043  0.557408  1.604163 -0.339792  0.856929 -0.455798  0.445229
5   like   -0.223943 -0.641955 -0.061691 -0.046772  1.272367 -0.687771 -0.276923 -0.574637  0.292714  ...  0.849850 -0.616438 -0.530650  0.171090  0.036058  0.318488  0.043461  0.683442  0.402371  0.481683

5 rows × 301 columns

In [113]:
pca_data = wordvec[np.arange(1, 301, 1)]  # keep only the 300 numeric columns (drop the word column)
len(pca_data)
Out[113]:
103674
In [102]:
from sklearn.decomposition import PCA
pca = PCA(n_components=40)
pca.fit(pca_data)
Out[102]:
PCA(copy=True, iterated_power='auto', n_components=40, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
In [103]:
%matplotlib inline
import matplotlib.pyplot as plt
var = pca.explained_variance_ratio_
var1 = np.cumsum(np.round(var, decimals=4) * 100)  # cumulative explained variance in percent
plt.plot(var1)
Out[103]:
[<matplotlib.lines.Line2D at 0x7fbdb33b1048>]
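The plotted curve is the cumulative explained variance (in percent) across the 40 components. To read the endpoint numerically instead of off the plot, a one-line sketch (not in the original notebook):

print(var1[-1])   # total variance retained by the 40 components, in %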
In [104]:
pca.fit_transform(pca_data)
Out[104]:
array([[  1.53447146e+00,   3.39023873e+00,   1.08127260e-01, ...,
          2.01060983e+00,   1.28154938e+00,  -1.75942159e+00],
       [  6.18207164e-01,   3.42729725e+00,   2.81663213e+00, ...,
          2.55504965e-01,   1.54289260e+00,   7.97892561e-01],
       [  1.88674743e+00,   3.41535169e+00,   4.73622261e+00, ...,
          1.86097684e+00,   1.52052731e+00,   1.40849692e+00],
       ..., 
       [ -3.77297627e-01,   6.06470989e-02,  -3.92401039e-02, ...,
         -1.30437759e-02,   4.79383547e-03,  -8.41005516e-03],
       [ -3.98396248e-01,   1.31898811e-02,  -3.59753547e-02, ...,
         -1.45744723e-02,   6.89606871e-03,   1.00892222e-02],
       [ -2.46134171e-01,  -7.74727566e-03,  -1.64871864e-02, ...,
         -3.79921235e-03,  -4.07326473e-03,  -1.65871335e-02]])
In [105]:
pca_40 = pca.fit_transform(pca_data)  # refit and keep the 40-component projection
In [146]:
pca_40 = pd.DataFrame(pca_40)
pca_40.index = np.arange(1, len(pca_40)+1)  # align with word_list's index (wordvec rows start at 1)
pca_40.head()
Out[146]:
          0         1         2         3         4         5         6         7         8         9  ...      30        31        32        33        34        35        36        37        38        39
1  1.534470  3.390230  0.108129 -0.031619  0.521591  0.308070  1.658023  3.601194  0.983245 -1.524013  ...  1.188226 -0.867652  0.107822  2.070843  1.505776 -0.464603 -0.171775  2.164998 -0.017432 -2.602023
2  0.618207  3.427290  2.816625  3.679018 -1.492741  0.433359  0.730821  1.118609 -1.361869 -1.731525  ...  0.131332 -0.135671 -0.281151 -2.794766  1.373580 -1.282340 -0.610911  0.003826  1.359621  0.503281
3  1.886747  3.415355  4.736232 -0.436743 -2.702472 -3.970952 -0.713557  0.097076  2.549259  1.023274  ... -0.508882  1.363599  0.347434 -0.137511  0.157223 -0.677026  0.056873  2.721850  1.255052  1.461079
4  1.040738 -1.763100  5.145316  3.539930  1.515109 -1.726618 -1.751083  1.819226  2.211810  0.779057  ...  0.902022 -1.365069  0.417042 -0.227497  3.568700 -0.367797 -2.057030 -1.143381 -1.344948 -0.674561
5  0.655346  0.275508  1.940825  1.300419  0.356013 -0.408101  0.531542  1.178360  1.325056 -2.677399  ...  0.290439  0.867025 -2.645487  1.205273 -1.179467  1.082302  0.105457  0.536885  0.650930 -0.990342

5 rows × 40 columns

In [159]:
word_list = pd.DataFrame(wordvec[0])
word_list = word_list.rename(columns = {0:'word'})
word_list.head()
Out[159]:
     word
1    best
2     get
3   india
4  people
5    like
In [165]:
w2v_pca40 = pd.concat([word_list, pca_40], axis=1)  # indices match (1..n), so each word aligns with its PCA vector
w2v_pca40.head()
Out[165]:
     word         0         1         2         3         4         5         6         7         8  ...       30        31        32        33        34        35        36        37        38        39
1    best  1.534470  3.390230  0.108129 -0.031619  0.521591  0.308070  1.658023  3.601194  0.983245  ...  1.188226 -0.867652  0.107822  2.070843  1.505776 -0.464603 -0.171775  2.164998 -0.017432 -2.602023
2     get  0.618207  3.427290  2.816625  3.679018 -1.492741  0.433359  0.730821  1.118609 -1.361869  ...  0.131332 -0.135671 -0.281151 -2.794766  1.373580 -1.282340 -0.610911  0.003826  1.359621  0.503281
3   india  1.886747  3.415355  4.736232 -0.436743 -2.702472 -3.970952 -0.713557  0.097076  2.549259  ... -0.508882  1.363599  0.347434 -0.137511  0.157223 -0.677026  0.056873  2.721850  1.255052  1.461079
4  people  1.040738 -1.763100  5.145316  3.539930  1.515109 -1.726618 -1.751083  1.819226  2.211810  ...  0.902022 -1.365069  0.417042 -0.227497  3.568700 -0.367797 -2.057030 -1.143381 -1.344948 -0.674561
5    like  0.655346  0.275508  1.940825  1.300419  0.356013 -0.408101  0.531542  1.178360  1.325056  ...  0.290439  0.867025 -2.645487  1.205273 -1.179467  1.082302  0.105457  0.536885  0.650930 -0.990342

5 rows × 41 columns

In [166]:
w2v_pca40.to_csv('w2v_pca40.csv')
In [168]:
wordvec.to_csv('w2v_300.csv')
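For later reuse, the saved CSV can be reloaded and a word's 40-dimensional vector looked up by name; a sketch assuming the file was written as above:

# Sketch: reload the reduced vectors and fetch one word's 40-d representation
w2v_pca40 = pd.read_csv('w2v_pca40.csv', index_col=0)
india_vec = w2v_pca40.loc[w2v_pca40['word'] == 'india'].iloc[0, 1:].values
print(india_vec.shape)   # expected: (40,)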

