[ML] Dimensionality Reduction

Author : tmlab / Date : 2017. 12. 29. 19:10 / Category : Analytics

Chapter 8. Dimensionality Reduction

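Machine learning problems often involve a very large number of variables (features) per training instance.
Too many features make training slow and make it harder to find a good solution (the curse of dimensionality).
Dimensionality reduction makes it possible to train on a much smaller set of informative features instead.
Apart from speeding up training, dimensionality reduction is also very useful for data visualization.
Reducing the data to two dimensions lets us plot a high-dimensional training set.
This chapter covers three of the most widely used techniques: PCA, Kernel PCA, and LLE.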

Main Approaches for Dimensionality Reduction

Projection

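High-dimensional data often actually lies in (or close to) a much lower-dimensional subspace.
In the accompanying figure, the data points all lie close to a plane.
Projecting them perpendicularly onto that plane produces a new 2D dataset.
However, when the subspace twists and turns (as in the Swiss roll dataset), simple projection is not the best approach.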
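The projection idea can be sketched with NumPy alone. This is only an illustration under stated assumptions: the data is synthetic (points near a tilted plane in 3D), and every name in it (`plane_basis`, `X_plane_3d`, `X_plane_2d`) is hypothetical rather than taken from the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 200 points on a tilted plane in 3D, plus a little noise.
m = 200
coords_2d = rng.normal(size=(m, 2))                 # true 2D coordinates
plane_basis = np.array([[1.0, 0.0, 0.5],
                        [0.0, 1.0, 0.3]])           # two vectors spanning the plane
X_plane_3d = coords_2d @ plane_basis + 0.05 * rng.normal(size=(m, 3))

# Center the data, then use the two leading right singular vectors
# as an orthonormal basis of the best-fitting plane.
X_centered = X_plane_3d - X_plane_3d.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
W2 = Vt[:2].T                                       # shape (3, 2)

# Orthogonal projection onto the plane gives the new 2D dataset.
X_plane_2d = X_centered @ W2

# Mapping back to 3D shows how little information the projection lost.
X_back = X_plane_2d @ W2.T
mean_sq_error = np.mean(np.sum((X_centered - X_back) ** 2, axis=1))
print(X_plane_2d.shape, mean_sq_error)
```

Because the points lie almost exactly on a plane, the error stays close to the noise level. For a twisted subspace such as the Swiss roll, the same projection would squash different layers on top of each other.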

Manifold Learning

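Simply projecting the Swiss roll onto a plane gives the left plot, while unrolling the data gives the right plot.
Manifold hypothesis: most real-world high-dimensional datasets lie close to a much lower-dimensional manifold (a low-dimensional shape that has been bent and twisted in the higher-dimensional space).
However, the accompanying assumption that the task becomes simpler once the manifold is unrolled does not always hold.
In the first case in the figure, unrolling yields a simple decision boundary.
In the second case, the decision boundary is simple before unrolling but becomes more complicated afterwards.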

In [49]:

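import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib versions
from sklearn.datasets import make_swiss_roll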
X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

axes = [-11.5, 14, -2, 23, -12, 15]

fig = plt.figure(figsize=(6, 5))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=t, cmap=plt.cm.hot)
ax.view_init(10, -70)
ax.set_xlabel("$x_1$", fontsize=18)
ax.set_ylabel("$x_2$", fontsize=18)
ax.set_zlabel("$x_3$", fontsize=18)
ax.set_xlim(axes[0:2])
ax.set_ylim(axes[2:4])
ax.set_zlim(axes[4:6])

plt.show()

In [51]:

plt.figure(figsize=(11, 4))

plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1], c=t, cmap=plt.cm.hot)
plt.axis(axes[:4])
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$x_2$", fontsize=18, rotation=0)
plt.grid(True)

plt.subplot(122)
plt.scatter(t, X[:, 1], c=t, cmap=plt.cm.hot)
plt.axis([4, 15, axes[2], axes[3]])
plt.xlabel("$z_1$", fontsize=18)
plt.grid(True)

plt.show()

PCA

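Before projecting the data onto a lower-dimensional hyperplane, we need to choose the right hyperplane.
Of the three candidate axes in the figure, some preserve a large amount of variance and others very little.
It is reasonable to choose the axis that preserves the most variance, because it loses the least information.
Another way to justify this choice: it is the axis that minimizes the mean squared distance between the original data and its projection onto that axis.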
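The two criteria above describe the same axis. A rough numerical check, assuming synthetic 2D data and hypothetical names (`candidate_axes`, `proj_var`, `mean_sq_dist`), might look like this:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2D data, stretched so that one direction dominates.
X_demo = rng.normal(size=(300, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])
X_centered = X_demo - X_demo.mean(axis=0)

# Candidate axes: unit vectors at evenly spaced angles in [0, pi).
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
candidate_axes = np.column_stack([np.cos(angles), np.sin(angles)])

# Variance of the data after projecting onto each candidate axis.
proj_var = np.mean((X_centered @ candidate_axes.T) ** 2, axis=0)

# Mean squared distance between each point and its projection on the axis.
mean_sq_dist = np.array([
    np.mean(np.sum((X_centered - np.outer(X_centered @ u, u)) ** 2, axis=1))
    for u in candidate_axes
])

# The two quantities always add up to the total variance, so the axis
# that maximizes one necessarily minimizes the other.
total_var = np.mean(np.sum(X_centered ** 2, axis=1))
print(np.argmax(proj_var), np.argmin(mean_sq_dist))
```

This is just the Pythagorean theorem applied to every point: squared length equals squared projection plus squared distance to the axis.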

Principal Components

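PCA finds the axis with the largest variance, then a second axis, orthogonal to the first, that accounts for the largest remaining variance,
then a third, a fourth, and so on, each orthogonal to all the previous axes.
The axes selected this way are called principal components, and they are mutually orthogonal.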
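One common way to obtain the principal components is the singular value decomposition of the centered data. The sketch below assumes synthetic data with three features of clearly different spread; `X_demo`, `components`, and `gram` are illustrative names only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 3 features with clearly different spreads.
X_demo = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.5])
X_centered = X_demo - X_demo.mean(axis=0)

# Rows of Vt are the principal components, ordered by decreasing variance.
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt

# Variance of the data along each principal component.
variances = s ** 2 / (len(X_demo) - 1)

# Orthonormality check: components @ components.T should be the identity.
gram = components @ components.T
print(variances)
print(np.round(gram, 6))
```

PCA assumes the data is centered around the origin, which is why the mean is subtracted first. Scikit-Learn's PCA class handles that step itself.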

In [6]:

import numpy as np
import pandas as pd

In [8]:

v = np.random.randn(10,40)
d = pd.DataFrame(v)
d.head()

Out[8]:

	0	1	2	3	4	5	6	7	8	9	...	30	31	32	33	34	35	36	37	38	39
0	0.351018	-0.119800	-0.718493	-0.740850	1.581186	0.774540	-1.157924	-0.168408	0.565454	0.784565	...	-0.668057	0.294512	1.082019	0.525322	0.001352	0.155054	0.778266	-0.312397	1.382781	-1.725499
1	-0.043966	-0.844776	1.808656	-0.179956	-0.229946	0.308178	1.438602	1.352366	1.015635	-0.383788	...	1.164712	-0.922477	0.539614	0.025156	-0.649611	1.338892	-1.384603	-2.400337	0.665583	1.946209
2	-2.401206	0.280304	0.515568	-0.374604	-0.786899	-2.273216	0.323331	0.247565	1.578874	-1.985093	...	1.515767	0.293571	1.182913	-0.418601	1.565785	-0.592084	1.921905	-0.555680	-0.203455	-1.787619
3	1.225570	-0.271455	-0.613597	-1.766873	0.733403	-0.627064	0.343243	0.341076	0.110216	-1.398652	...	-0.720289	2.033244	0.104902	-0.809709	-1.239592	-0.412163	0.195152	0.057190	0.799866	0.725846
4	1.341134	-0.355600	0.997310	0.076466	-1.398431	0.412625	0.380109	-0.946474	-0.522614	-0.714244	...	1.606234	0.418034	0.401373	-0.171500	-1.632700	-1.434530	0.916681	-2.079704	0.083835	-0.153679

5 rows × 40 columns

In [11]:

from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X2D = pca.fit_transform(d)
X2D

Out[11]:

array([[-3.66128672, -1.96917352],
       [ 2.16687198, -1.12895956],
       [-2.20320267,  5.28435699],
       [-2.22558041,  1.83009002],
       [ 1.39722487,  1.99045134],
       [-0.63044477, -1.39104517],
       [-0.14403358, -0.53582112],
       [ 2.54194736, -1.05811535],
       [-2.69269812, -3.23909778],
       [ 5.45120207,  0.21731417]])

Explained Variance Ratio

The share of the dataset's variance explained by each principal component is available through explained_variance_ratio_.

In [12]:

print(pca.explained_variance_ratio_)

[ 0.21015839  0.15473633]
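To see where these ratios come from, they can be recomputed from the singular values of the centered data. This is a sketch on freshly generated random data of the same shape (10 x 40), so the numbers will differ from the output above; `X_demo`, `ratio_manual`, and `pca_demo` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 40))       # same shape as the random data above

# Singular values of the centered data.
X_centered = X_demo - X_demo.mean(axis=0)
s = np.linalg.svd(X_centered, compute_uv=False)

# Each component's share of the total variance.
ratio_manual = s ** 2 / np.sum(s ** 2)

pca_demo = PCA(n_components=2).fit(X_demo)
print(ratio_manual[:2])
print(pca_demo.explained_variance_ratio_)
```

With only 10 samples, at most 9 components can carry any variance after centering, which is why a handful of components is enough to reach a high cumulative ratio even for pure noise.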

Choosing the Right Number of Dimensions

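We need to choose a number of dimensions that explains a sufficiently large portion of the variance.
The code below finds the smallest number of dimensions that preserves 95% of the variance.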

In [19]:

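pca = PCA() # create the estimator without fixing the number of components
pca.fit(v)  # run the principal component analysis
cumsum = np.cumsum(pca.explained_variance_ratio_) # cumulative sum of the explained variance ratios
num_d = np.argmax(cumsum >= 0.95) + 1 # number of dimensions needed to explain at least 95% of the variance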

In [20]:

num_d

Out[20]:

In [16]:

pca = PCA(n_components=0.95) # keep enough components to explain at least 95% of the variance
new_d = pca.fit_transform(d)

In [18]:

pd.DataFrame(new_d)

Out[18]:

	0	1	2	3	4	5	6	7
0	-3.661287	-1.969174	-1.107184	-1.133582	3.089122	-0.571625	-1.955845	1.788355
1	2.166872	-1.128960	3.509412	-2.021778	-2.440670	-0.725021	0.455621	2.422714
2	-2.203203	5.284357	-2.085148	-2.294970	-1.983251	0.566881	-0.300139	0.165222
3	-2.225580	1.830090	1.546845	2.282433	1.801640	-2.059553	2.466430	0.197336
4	1.397225	1.990451	3.769831	0.256392	2.224518	1.525479	-1.108107	-1.554014
5	-0.630445	-1.391045	0.925328	1.203302	-2.068589	0.213964	-2.128127	-1.404717
6	-0.144034	-0.535821	-1.631251	4.860234	-1.888884	0.576945	-0.153440	0.839492
7	2.541947	-1.058115	-1.798441	-0.516718	1.391478	3.929712	1.508982	0.549507
8	-2.692698	-3.239098	-0.479139	-2.091570	-1.068400	-0.378743	1.669057	-2.269898
9	5.451202	0.217314	-2.650253	-0.543744	0.943036	-3.078039	-0.454431	-0.733997

PCA for Compression

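After PCA, the dataset takes up much less space while retaining most of its information.
Applying PCA to MNIST while preserving 95% of the variance reduces the 784 features to roughly 150 (the examples that follow use 154).
Applying the inverse transformation recovers data close to the original; the small amount that is lost is the reconstruction error.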
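A minimal sketch of compression and decompression follows. As an assumption made to keep it download-free, it uses the small 8x8 digits dataset bundled with Scikit-Learn as a stand-in for MNIST, so the feature counts differ from the 784 and 154 mentioned in this post. The names `X_digits`, `X_compressed`, and `X_recovered` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumption: the bundled 8x8 digits dataset stands in for MNIST.
X_digits = load_digits().data                     # shape (1797, 64)

pca_demo = PCA(n_components=0.95)                 # keep 95% of the variance
X_compressed = pca_demo.fit_transform(X_digits)
X_recovered = pca_demo.inverse_transform(X_compressed)

# Reconstruction error: mean squared distance between original and recovered data.
reconstruction_error = np.mean(np.sum((X_digits - X_recovered) ** 2, axis=1))
total_variance = np.sum(np.var(X_digits, axis=0))
lost_fraction = reconstruction_error / total_variance

print(X_digits.shape[1], "->", X_compressed.shape[1])
print(lost_fraction)                              # fraction of the variance that was lost
```

The lost fraction is exactly the share of variance that was not preserved, so it stays below 5% here by construction.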

Incremental PCA

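Standard PCA needs the whole dataset in memory in order to run the SVD algorithm.
Incremental PCA (IPCA) splits the data into mini-batches and feeds them to the algorithm one at a time, so PCA can also be applied on the fly as new data arrives.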

In [28]:

from sklearn.datasets import fetch_openml

# Download MNIST (70,000 images, 784 features) from OpenML; as_frame=False returns NumPy arrays
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

In [29]:

from sklearn.model_selection import train_test_split

X = mnist["data"]
y = mnist["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [32]:

from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)

for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)
    
X_mnist_reduced = inc_pca.transform(X_train)

In [37]:

print(len(X_mnist_reduced[0]), len(X_mnist_reduced))

154 52500

Randomized PCA

Scikit-Learn offers another option for PCA, called Randomized PCA.
It is a stochastic algorithm that quickly finds an approximation of the first d principal components.

In [38]:

rnd_pca = PCA(n_components=154, svd_solver="randomized")
X_reduced = rnd_pca.fit_transform(X_train)

print(len(X_reduced[0]), len(X_reduced))

154 52500

Kernel PCA

Just as support vector machines use the kernel trick, PCA can use the kernel trick too.
This makes it possible to perform complex nonlinear projections for dimensionality reduction.

In [54]:

X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

In [55]:

from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components = 2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)

In [56]:

X_reduced

Out[56]:

array([[ 0.20318153,  0.04192012],
       [ 0.12291985,  0.08891651],
       [-0.06294914,  0.06770846],
       ..., 
       [ 0.01755176, -0.50273796],
       [ 0.09990453, -0.00253754],
       [ 0.19161337,  0.0417062 ]])

In [58]:

from sklearn.decomposition import KernelPCA

lin_pca = KernelPCA(n_components = 2, kernel="linear", fit_inverse_transform=True)
rbf_pca = KernelPCA(n_components = 2, kernel="rbf", gamma=0.0433, fit_inverse_transform=True)
sig_pca = KernelPCA(n_components = 2, kernel="sigmoid", gamma=0.001, coef0=1, fit_inverse_transform=True)

y = t > 6.9

plt.figure(figsize=(11, 4))
for subplot, pca, title in ((131, lin_pca, "Linear kernel"), 
                            (132, rbf_pca, r"RBF kernel, $\gamma=0.04$"), 
                            (133, sig_pca, r"Sigmoid kernel, $\gamma=10^{-3}, r=1$")):
    X_reduced = pca.fit_transform(X)
    if subplot == 132:
        X_reduced_rbf = X_reduced
    
    plt.subplot(subplot)
    #plt.plot(X_reduced[y, 0], X_reduced[y, 1], "gs")
    #plt.plot(X_reduced[~y, 0], X_reduced[~y, 1], "y^")
    plt.title(title, fontsize=14)
    plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
    plt.xlabel("$z_1$", fontsize=18)
    if subplot == 131:
        plt.ylabel("$z_2$", fontsize=18, rotation=0)
    plt.grid(True)

plt.show()

In [60]:

plt.figure(figsize=(6, 5))

X_inverse = rbf_pca.inverse_transform(X_reduced_rbf)  # use the RBF model; after the loop, pca refers to sig_pca

ax = plt.subplot(111, projection='3d')
ax.view_init(10, -70)
ax.scatter(X_inverse[:, 0], X_inverse[:, 1], X_inverse[:, 2], c=t, cmap=plt.cm.hot, marker="x")
ax.set_xlabel("")
ax.set_ylabel("")
ax.set_zlabel("")
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_zticklabels([])

plt.show()

In [61]:

X_reduced = rbf_pca.fit_transform(X)

plt.figure(figsize=(11, 4))
plt.subplot(132)
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot, marker="x")
plt.xlabel("$z_1$", fontsize=18)
plt.ylabel("$z_2$", fontsize=18, rotation=0)
plt.grid(True)

Selecting a Kernel and Tuning Hyperparameters

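A grid search can be used to select the kernel and hyperparameters.
This requires a two-step pipeline: 1) reduce the data to two dimensions, 2) apply logistic regression for classification.
GridSearchCV then finds the kernel and gamma value that give the best classification accuracy.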

In [62]:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

In [63]:

clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression())
])

param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "kpca__kernel": ["rbf", "sigmoid"]
}]

grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)

Out[63]:

GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(steps=[('kpca', KernelPCA(alpha=1.0, coef0=1, copy_X=True, degree=3, eigen_solver='auto',
     fit_inverse_transform=False, gamma=None, kernel='linear',
     kernel_params=None, max_iter=None, n_components=2, n_jobs=1,
     random_state=None, remove_zero_eig=False, tol=0)), ('log_reg', LogisticRegre...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'kpca__gamma': array([ 0.03   ,  0.03222,  0.03444,  0.03667,  0.03889,  0.04111,
        0.04333,  0.04556,  0.04778,  0.05   ]), 'kpca__kernel': ['rbf', 'sigmoid']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [64]:

print(grid_search.best_params_)

{'kpca__gamma': 0.043333333333333335, 'kpca__kernel': 'rbf'}

LLE

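Locally Linear Embedding (LLE) is another powerful nonlinear dimensionality reduction technique (a manifold learning technique).
LLE first measures how each instance relates linearly to its closest neighbors,
then looks for a low-dimensional representation of the dataset in which these local relationships are best preserved.
It is particularly good at unrolling twisted manifolds.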
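The first step, measuring how an instance relates linearly to its neighbors, can be sketched for a single point with NumPy. This is a simplified illustration on hypothetical random data, not Scikit-Learn's actual implementation. The regularization constant `1e-3` and the names `X_demo`, `neighbors`, and `w` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))         # hypothetical 3D points
k = 10                                     # number of neighbors
i = 0                                      # index of the point to reconstruct

# k nearest neighbors of X_demo[i], excluding the point itself.
dists = np.linalg.norm(X_demo - X_demo[i], axis=1)
neighbors = np.argsort(dists)[1:k + 1]

# Local Gram matrix of the neighbors, expressed relative to the point.
Z = X_demo[neighbors] - X_demo[i]          # shape (k, 3)
G = Z @ Z.T                                # shape (k, k)
G += 1e-3 * np.trace(G) * np.eye(k)        # regularization, needed because k > 3

# Weights minimizing the squared reconstruction error, constrained to sum to 1.
w = np.linalg.solve(G, np.ones(k))
w /= w.sum()

reconstruction = w @ X_demo[neighbors]
print(w.sum(), np.linalg.norm(X_demo[i] - reconstruction))
```

LLE repeats this for every instance, then searches for low-dimensional coordinates in which each point is still reconstructed by the same weights from the same neighbors.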

In [65]:

X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

In [66]:

from sklearn.manifold import LocallyLinearEmbedding
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced = lle.fit_transform(X)

In [68]:

plt.title("Unrolled swiss roll using LLE", fontsize=14)
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
plt.xlabel("$z_1$", fontsize=18)
plt.ylabel("$z_2$", fontsize=18)
plt.axis([-0.065, 0.055, -0.1, 0.12])
plt.grid(True)

plt.show()

Other Dimensionality Reduction Techniques

Multidimensional Scaling (MDS) reduces dimensionality while trying to preserve the distances between instances (see Figure 8-13).

In [69]:

from sklearn.manifold import MDS

mds = MDS(n_components=2, random_state=42)
X_reduced_mds = mds.fit_transform(X)

Isomap builds a graph by connecting each instance to its nearest neighbors, then reduces dimensionality while trying to preserve the geodesic distances between instances.

In [70]:

from sklearn.manifold import Isomap

isomap = Isomap(n_components=2)
X_reduced_isomap = isomap.fit_transform(X)

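t-SNE (t-Distributed Stochastic Neighbor Embedding) reduces dimensionality while trying to keep similar instances close together and dissimilar instances apart. It is mostly used for visualization, in particular to visualize clusters of instances in a high-dimensional space (for example, visualizing the MNIST images in 2D).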

In [71]:

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)
X_reduced_tsne = tsne.fit_transform(X)

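Linear Discriminant Analysis (LDA) is actually a classification algorithm, but during training it learns the most discriminative axes between the classes, and these axes can then be used to define a hyperplane onto which to project the data. The benefit is that the projection keeps the classes as far apart as possible, so LDA is a good technique for reducing dimensionality before running another classification algorithm such as an SVM classifier.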
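Unlike the other techniques in this section, LDA needs class labels, so the unlabeled Swiss roll is not a natural fit. The sketch below assumes the iris dataset as a small labeled example; `X_iris`, `y_iris`, and `X_reduced_lda` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumption: the iris dataset (4 features, 3 classes) as a small labeled example.
X_iris, y_iris = load_iris(return_X_y=True)

# LDA can keep at most n_classes - 1 axes, which is 2 for three classes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced_lda = lda.fit_transform(X_iris, y_iris)

print(X_reduced_lda.shape)
print(lda.explained_variance_ratio_)
```

The reduced data could then be passed to another classifier, in the same way the Kernel PCA output feeds logistic regression in the pipeline earlier in this post.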

In [72]:

titles = ["MDS", "Isomap", "t-SNE"]

plt.figure(figsize=(11,4))

for subplot, title, X_reduced in zip((131, 132, 133), titles,
                                     (X_reduced_mds, X_reduced_isomap, X_reduced_tsne)):
    plt.subplot(subplot)
    plt.title(title, fontsize=14)
    plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=t, cmap=plt.cm.hot)
    plt.xlabel("$z_1$", fontsize=18)
    if subplot == 131:
        plt.ylabel("$z_2$", fontsize=18, rotation=0)
    plt.grid(True)

plt.show()

License: Attribution, NonCommercial, NoDerivatives
