[Text Mining] Reducing the dimensionality of Word2vec results (PCA)

Author : tmlab / Date : 2017. 12. 29. 21:36 / Category : Text Mining/Python

1. Loading the data

In [5]:
import os
os.chdir("/home/ajoumis2/quara/csv")
In [6]:
import numpy as np
import pandas as pd
raw = pd.read_csv('stop_words_data.csv', header=0) 
len(raw)
Out[6]:
404288
In [7]:
raw.isnull().sum().sum()
Out[7]:
157
In [4]:
raw[pd.isnull(raw.index)]  # empty: the 157 missing values are in the columns, not the index
Out[4]:
Empty DataFrame
Columns: [Unnamed: 0, question1, question2, full_question, label]
Index: []
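Since isnull().sum().sum() reports 157 missing cells while pd.isnull(raw.index) only tests the index (hence the empty result above), a small sketch like the following, which is not part of the original notebook, would list and optionally drop the affected rows before building the Word2Vec input:

# Sketch: rows that contain at least one of the 157 missing values
null_rows = raw[raw.isnull().any(axis=1)]
print(len(null_rows))

# Optionally drop them and reset the index
raw = raw.dropna().reset_index(drop=True)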

2. Building the Word2Vec input data

In [3]:
q1_list = list(raw['question1'])
q2_list = list(raw['question2'])
q_list = q1_list + q2_list
len(q_list)
Out[3]:
808576
In [4]:
w2v_input = []

for w2v_sentence in q_list:
    w2v_wordlist = str(w2v_sentence).split()  # split each question into a list of words
    w2v_input.append(w2v_wordlist)            # collect the token lists for gensim
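gensim's Word2Vec expects an iterable of token lists, so a quick shape check on w2v_input can catch problems early; this is just an illustrative sketch, not a cell from the original notebook:

# Sketch: every element should be a list of word strings
print(len(w2v_input))      # expected: 808576, one entry per question
print(w2v_input[0][:10])   # first tokens of the first question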

3. Training the Word2Vec model

3.1 All words, window size 5, 300 features

In [37]:
def hash32(value):
    return hash(value) & 0xffffffff  # mask the hash to 32 bits, passed to Word2Vec as hashfxn
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
    level=logging.INFO)

num_features = 300    # Word vector dimensionality                      
min_word_count = 1   # Minimum word count                        
num_workers = 50     # Number of threads to run in parallel
context = 5          # Context window size                                                                                    
downsampling = 1e-3  # Downsample setting for frequent words

# Initialize and train the model 
from gensim.models import word2vec
print ("Training model...")
model = word2vec.Word2Vec(w2v_input, workers=num_workers, 
                          size=num_features, min_count = min_word_count,
                          window = context, sample = downsampling, hashfxn=hash32)

model_name = "stop_300features_5context"
model.save(model_name)
2017-06-01 19:59:57,291 : INFO : collecting all words and their counts
2017-06-01 19:59:57,294 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-06-01 19:59:57,332 : INFO : PROGRESS: at sentence #10000, processed 54675 words, keeping 11984 word types
2017-06-01 19:59:57,376 : INFO : PROGRESS: at sentence #20000, processed 109532 words, keeping 17458 word types
2017-06-01 19:59:57,409 : INFO : PROGRESS: at sentence #30000, processed 164063 words, keeping 21574 word types
2017-06-01 19:59:57,448 : INFO : PROGRESS: at sentence #40000, processed 218212 words, keeping 24966 word types
2017-06-01 19:59:57,486 : INFO : PROGRESS: at sentence #50000, processed 273032 words, keeping 28132 word types
Training model...

... : INFO : PROGRESS: at sentence #180000, processed 983726 words, keeping 52850 word types
2017-06-01 19:59:57,965 : INFO : PROGRESS: at sentence #190000, processed 1038420 words, keeping 54248 word types
2017-06-01 20:00:39,468 : INFO : training on 22264380 raw words (21482397 effective words) took 36.6s, 587481 effective words/s
2017-06-01 20:00:39,553 : INFO : saving Word2Vec object under stop_300features_5context, separately None
2017-06-01 20:00:39,555 : INFO : storing np array 'syn0' to stop_300features_5context.wv.syn0.npy
2017-06-01 20:00:39,649 : INFO : storing np array 'syn1neg' to stop_300features_5context.syn1neg.npy
2017-06-01 20:00:39,743 : INFO : not storing attribute syn0norm
2017-06-01 20:00:39,745 : INFO : not storing attribute cum_table
2017-06-01 20:00:40,099 : INFO : saved stop_300features_5context
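Before reducing dimensionality, a quick neighbour query helps confirm the embeddings look sensible. The snippet below is a sketch, not part of the original notebook, and assumes 'india' (a frequent token in this corpus) is in the vocabulary:

# Sketch: nearest neighbours of a common token as a sanity check
print(model.wv.most_similar('india', topn=5))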

In [38]:
model.save_word2vec_format('stop_300.txt', binary=False)  # text-format export (newer gensim: model.wv.save_word2vec_format)
2017-06-01 20:00:52,854 : INFO : storing 103674x300 projection weights into stop_300.txt
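As an aside, the exported text file can also be reloaded with gensim itself instead of pandas; a minimal sketch, assuming a gensim version that provides KeyedVectors:

# Sketch: reload the exported vectors as KeyedVectors
from gensim.models import KeyedVectors
wv = KeyedVectors.load_word2vec_format('stop_300.txt', binary=False)
print(wv['best'][:5])   # first 5 of the 300 dimensions for 'best'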

4. Dimensionality reduction (PCA)

In [39]:
wordvec = pd.read_csv('stop_300.txt',
                      names=np.arange(0, 301, 1),  # column 0 = word, columns 1-300 = vector dimensions
                      sep=" ")
In [43]:
wordvec = wordvec[1:]  # drop the "vocab_size vector_size" header row of the word2vec text format
wordvec.head()
Out[43]:
       0          1         2         3         4         5         6         7         8         9  ...     291       292       293       294       295       296       297       298       299       300
1   best   -0.577359  0.328978  0.360973  1.348463 -0.458925  0.562232 -1.066533  0.508290 -0.095075  ...  0.090041  0.504813  0.952245  0.747423 -0.945117 -0.405453 -1.577781  0.969832  0.489959 -0.786234
2   get    -0.162952 -0.328455  0.423307  0.877100  0.103763  0.486245 -1.039555  1.097103 -0.616748  ... -1.342751 -0.532536 -0.123314  0.359062 -0.654478  0.068062 -0.286974 -0.245059 -0.336912  0.388487
3   india   0.158580 -1.731864 -0.656390 -0.110334  1.109474 -0.227837 -0.952561  0.614622  0.284045  ... -0.612629  0.768930  0.101692  0.735992 -0.384114 -0.681974 -0.807551  1.598546  0.887515 -0.467995
4   people -0.504831  0.333736 -0.536498  0.509518  0.430136 -0.586826 -0.544074 -0.751538 -0.163667  ...  0.397640  0.568188 -0.726848 -0.628043  0.557408  1.604163 -0.339792  0.856929 -0.455798  0.445229
5   like   -0.223943 -0.641955 -0.061691 -0.046772  1.272367 -0.687771 -0.276923 -0.574637  0.292714  ...  0.849850 -0.616438 -0.530650  0.171090  0.036058  0.318488  0.043461  0.683442  0.402371  0.481683

5 rows × 301 columns

In [113]:
pca_data = wordvec[np.arange(1, 301, 1)]  # keep only the 300 numeric columns (drop the word column)
len(pca_data)
Out[113]:
103674
In [102]:
from sklearn.decomposition import PCA
pca = PCA(n_components=40)
pca.fit(pca_data)
Out[102]:
PCA(copy=True, iterated_power='auto', n_components=40, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
In [103]:
%matplotlib inline
import matplotlib.pyplot as plt
var = pca.explained_variance_ratio_
var1 = np.cumsum(np.round(var, decimals=4) * 100)  # cumulative explained variance in percent
plt.plot(var1)
Out[103]:
[<matplotlib.lines.Line2D at 0x7fbdb33b1048>]
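The plotted curve is the cumulative explained variance (in percent) across the 40 components. To read the endpoint numerically instead of off the plot, a one-line sketch (not in the original notebook):

print(var1[-1])   # total variance retained by the 40 components, in %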
In [104]:
pca.fit_transform(pca_data)
Out[104]:
array([[  1.53447146e+00,   3.39023873e+00,   1.08127260e-01, ...,
          2.01060983e+00,   1.28154938e+00,  -1.75942159e+00],
       [  6.18207164e-01,   3.42729725e+00,   2.81663213e+00, ...,
          2.55504965e-01,   1.54289260e+00,   7.97892561e-01],
       [  1.88674743e+00,   3.41535169e+00,   4.73622261e+00, ...,
          1.86097684e+00,   1.52052731e+00,   1.40849692e+00],
       ..., 
       [ -3.77297627e-01,   6.06470989e-02,  -3.92401039e-02, ...,
         -1.30437759e-02,   4.79383547e-03,  -8.41005516e-03],
       [ -3.98396248e-01,   1.31898811e-02,  -3.59753547e-02, ...,
         -1.45744723e-02,   6.89606871e-03,   1.00892222e-02],
       [ -2.46134171e-01,  -7.74727566e-03,  -1.64871864e-02, ...,
         -3.79921235e-03,  -4.07326473e-03,  -1.65871335e-02]])
In [105]:
pca_40 = pca.fit_transform(pca_data)  # refit and keep the 40-component projection
In [146]:
pca_40 = pd.DataFrame(pca_40)
pca_40.index = np.arange(1, len(pca_40)+1)  # align with word_list's index (wordvec rows start at 1)
pca_40.head()
Out[146]:
          0         1         2         3         4         5         6         7         8         9  ...      30        31        32        33        34        35        36        37        38        39
1  1.534470  3.390230  0.108129 -0.031619  0.521591  0.308070  1.658023  3.601194  0.983245 -1.524013  ...  1.188226 -0.867652  0.107822  2.070843  1.505776 -0.464603 -0.171775  2.164998 -0.017432 -2.602023
2  0.618207  3.427290  2.816625  3.679018 -1.492741  0.433359  0.730821  1.118609 -1.361869 -1.731525  ...  0.131332 -0.135671 -0.281151 -2.794766  1.373580 -1.282340 -0.610911  0.003826  1.359621  0.503281
3  1.886747  3.415355  4.736232 -0.436743 -2.702472 -3.970952 -0.713557  0.097076  2.549259  1.023274  ... -0.508882  1.363599  0.347434 -0.137511  0.157223 -0.677026  0.056873  2.721850  1.255052  1.461079
4  1.040738 -1.763100  5.145316  3.539930  1.515109 -1.726618 -1.751083  1.819226  2.211810  0.779057  ...  0.902022 -1.365069  0.417042 -0.227497  3.568700 -0.367797 -2.057030 -1.143381 -1.344948 -0.674561
5  0.655346  0.275508  1.940825  1.300419  0.356013 -0.408101  0.531542  1.178360  1.325056 -2.677399  ...  0.290439  0.867025 -2.645487  1.205273 -1.179467  1.082302  0.105457  0.536885  0.650930 -0.990342

5 rows × 40 columns

In [159]:
word_list = pd.DataFrame(wordvec[0])
word_list = word_list.rename(columns = {0:'word'})
word_list.head()
Out[159]:
     word
1    best
2     get
3   india
4  people
5    like
In [165]:
w2v_pca40 = pd.concat([word_list, pca_40], axis=1)  # indices match (1..n), so each word aligns with its PCA vector
w2v_pca40.head()
Out[165]:
     word         0         1         2         3         4         5         6         7         8  ...       30        31        32        33        34        35        36        37        38        39
1    best  1.534470  3.390230  0.108129 -0.031619  0.521591  0.308070  1.658023  3.601194  0.983245  ...  1.188226 -0.867652  0.107822  2.070843  1.505776 -0.464603 -0.171775  2.164998 -0.017432 -2.602023
2     get  0.618207  3.427290  2.816625  3.679018 -1.492741  0.433359  0.730821  1.118609 -1.361869  ...  0.131332 -0.135671 -0.281151 -2.794766  1.373580 -1.282340 -0.610911  0.003826  1.359621  0.503281
3   india  1.886747  3.415355  4.736232 -0.436743 -2.702472 -3.970952 -0.713557  0.097076  2.549259  ... -0.508882  1.363599  0.347434 -0.137511  0.157223 -0.677026  0.056873  2.721850  1.255052  1.461079
4  people  1.040738 -1.763100  5.145316  3.539930  1.515109 -1.726618 -1.751083  1.819226  2.211810  ...  0.902022 -1.365069  0.417042 -0.227497  3.568700 -0.367797 -2.057030 -1.143381 -1.344948 -0.674561
5    like  0.655346  0.275508  1.940825  1.300419  0.356013 -0.408101  0.531542  1.178360  1.325056  ...  0.290439  0.867025 -2.645487  1.205273 -1.179467  1.082302  0.105457  0.536885  0.650930 -0.990342

5 rows × 41 columns

In [166]:
w2v_pca40.to_csv('w2v_pca40.csv')
In [168]:
wordvec.to_csv('w2v_300.csv')
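For later reuse, the saved CSV can be reloaded and a word's 40-dimensional vector looked up by name; a sketch assuming the file was written as above:

# Sketch: reload the reduced vectors and fetch one word's 40-d representation
w2v_pca40 = pd.read_csv('w2v_pca40.csv', index_col=0)
india_vec = w2v_pca40.loc[w2v_pca40['word'] == 'india'].iloc[0, 1:].values
print(india_vec.shape)   # expected: (40,)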

