[BMLP] Chapter 1. Machine Learning with Python

Author : tmlab / Date : 2017. 3. 25. 17:36 / Category : Analytics

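Ch. 1 Getting Started with Python Machine Learning

What is machine learning?

Teaching machines to carry out tasks by themselves

Goals of this textbook

Reading and cleaning the data
- Handling invalid formats and missing values
Exploring and understanding the input data
- Extracting the most suitable samples from the data
Analyzing how best to present the data to the learning algorithm
- Ways to refine parts of the data before training
- This is called feature engineering
Choosing the right model and learning algorithm
Measuring the performance correctly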

Useful sites

1. http://metaoptimize.com/qa
- A Q&A site focused on machine learning topics
2. http://stats.stackexchange.com
- A Q&A site similar to site 1
- More focused on statistics
3. http://stackoverflow.com
- A Q&A site
- More focused on programming
4. #machinelearning on Freenode
- An IRC channel; a community of machine learning experts
5. http://www.TwoToReal.com
- An instant Q&A site run by the authors for topics that do not fit sites 1-4.

Python modules used

Modules for numerical work
- NumPy
- SciPy
Module for visualization
- matplotlib

How to install the modules

numpy
- pip install numpy
scipy
- Requires numpy+mkl
- To add it to a plain Python installation, download the numpy+mkl module from http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy and install it, then download scipy the same way
- After that, install Visual Studio 2010 or the Community edition, download Intel MKL, and then install scipy.
- The simplest way is to install Anaconda
matplotlib
- pip install matplotlib

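Python basics

Development environment

As with most programming languages, a text file written in something like Notepad and saved with the .py extension is treated as Python code and can be run
- To run it: python filename.py
- Editors that help with this include Notepad++, Sublime Text, and Atom
There are also full IDEs comparable to Visual Studio or Eclipse, such as PyCharm
- Worth knowing about
This presentation uses Jupyter Notebook, which can combine markdown and Python code
- It is a Python package: install it with pip install jupyter and launch it with jupyter-notebook

Anaconda

A Python distribution made by Continuum Analytics that bundles about 195 Python packages
- It includes the packages used in this study group as well as Jupyter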

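Data Type

Numeric (including int and float)

Type	Examples
Integer	123, -345, 0
Float	123.45, -1234.5, 3.4e10
Complex	1 + 2j, -3j
Octal	0o34, 0o25
Hexadecimal	0x2A, 0xFF

String

Enclose in '...', "...", '''...''', or """..."""
'''...''' and """...""" keep the whole text as one value even if it contains line breaks
String operations
- '+' operator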

In [1]:

a = "hel"
b = "lo"
a+b

Out[1]:

'hello'

- '*' operator

In [2]:

a*2

Out[2]:

'helhel'

String indexing
- Indexing starts at 0
- For a string, an index returns a single character
- Negative indexing (e.g., a[-1]) counts from the end

In [3]:

print(a)
print(a[1])
print(a[-1])

hel
e
l

String slicing
- A way to extract only the characters you want

In [4]:

a = "Life is too short, You eat Chiken"
a[0:4] # the character at index 4 is not included

Out[4]:

'Life'

Other ways to slice a string
- The index after the : is excluded

In [5]:

print(a[5:])
print(a[:11])
print(a[11:-1])
print(a[:])

is too short, You eat Chiken
Life is too
 short, You eat Chike
Life is too short, You eat Chiken

String formatting
- A way to insert values into a string
- Uses C-style format codes

String format codes

Code	Description
%s	String
%c	Single character
%d	Integer
%f	Floating-point number
%o	Octal
%x	Hexadecimal
%%	Literal % (the % character itself)

In [6]:

print("I eat %d apples" % 3)
print("i eat %d apples, %d peaches" % (3,4))

I eat 3 apples
i eat 3 apples, 4 peaches

To show a % inside a formatted string, use %%
- If the string is not used with the % formatting operator, a plain % is fine

In [7]:

print("i drink %d%% of waters"% 50)

i drink 50% of waters
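
For comparison, the same strings can be built with str.format or with f-strings (Python 3.6 and later). The notes above only cover the % operator, so this is a side-by-side sketch with illustrative variable names rather than part of the original material.

```python
# Three equivalent ways to build the same string (variable names are illustrative).
apples, peaches = 3, 4

old_style = "i eat %d apples, %d peaches" % (apples, peaches)   # C-style, as above
with_format = "i eat {} apples, {} peaches".format(apples, peaches)
f_string = f"i eat {apples} apples, {peaches} peaches"           # Python 3.6+

percent = "i drink %d%% of waters" % 50   # %% is needed only with the % operator
percent_f = f"i drink {50}% of waters"    # a plain % works in an f-string

print(old_style)
print(percent)
```

All three forms produce identical text; which one to use is mostly a matter of the Python version and of taste.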

List

A data type that can hold anything
Numbers, strings, lists, and any other data type can be stored as elements.

In [8]:

a = [1,"a",[1,2,3],{1,2,3},(1,2,3)]
a

Out[8]:

[1, 'a', [1, 2, 3], {1, 2, 3}, (1, 2, 3)]

List indexing and slicing
- Same as for strings
- Nested lists can be accessed with chained indexes, like a multidimensional array

In [9]:

print(a[0])
print(a[2])
print(a[2][1])

1
[1, 2, 3]
2

List operators
- Same as for strings
- However, + only works between operands of the same type

In [10]:

a=[1,2,3]
b=[2,3,4]
print(a+b)
print(a*2)
print(str(a[2])+"hi")
print(a[2]+"hi")

[1, 2, 3, 2, 3, 4]
[1, 2, 3, 1, 2, 3]
3hi

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-335068cca300> in <module>()
      4 print(a*2)
      5 print(str(a[2])+"hi")
----> 6 print(a[2]+"hi")

TypeError: unsupported operand type(s) for +: 'int' and 'str'

Tuple

Almost the same as a list
Values cannot be modified or deleted
A single-element tuple needs a trailing comma after the element
Indexing, slicing, and operators work as for strings

In [11]:

t1=()
t2=(1,)
t3=(1,2,3)
t4=1,2,3
t5=('a','b',('ab','cd'))

In [12]:

t2[0]=-1

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-50fc36303cfc> in <module>()
----> 1 t2[0]=-1

TypeError: 'tuple' object does not support item assignment

In [13]:

t2[0]=()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-1ed67c7c7f46> in <module>()
----> 1 t2[0]=()

TypeError: 'tuple' object does not support item assignment

Dictionary

A data type of key-value pairs, like an associative array or hash
Values are looked up by key rather than by position

Creating a dictionary

In [14]:

dic = {'name':'chj','age':'29','gender':'male'}
dic

Out[14]:

{'age': '29', 'gender': 'male', 'name': 'chj'}

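Adding and deleting dictionary pairs

Do not rely on the order of a dictionary
- In the Python 3.5 used here the order is arbitrary (since Python 3.7, insertion order is preserved)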

In [15]:

# add pairs
a = {1:'a'}
print(a)
a[2]='b'
print(a)
a["hi"]="test"
print(a)

{1: 'a'}
{1: 'a', 2: 'b'}
{1: 'a', 2: 'b', 'hi': 'test'}

In [16]:

# delete pairs
del a[1]
print(a)
del a['hi']
print(a)

{2: 'b', 'hi': 'test'}
{2: 'b'}

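Caveats

Keys are unique, so if the same key is given more than once, all but one of the entries are discarded
In a dict literal the last value given for the key is the one that is kept
- Bottom line: do not use duplicate keys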

In [17]:

a={1:'a',1:'b'}
print(a)

{1: 'b'}

Set

Built for performing set operations

Properties

No duplicates
No ordering
To index, convert to a list or tuple first

Creating a set

In [18]:

a = set("Hello")
a

Out[18]:

{'H', 'e', 'l', 'o'}

Indexing

In [19]:

b = list(a)
print(b)
b[1]

['o', 'e', 'l', 'H']

Out[19]:

'e'

Usage

In [20]:

s1 = set([1,2,3,4,5,6])
s2 = set([4,5,6,7,8,9])
# intersection
print(s1&s2)
print(s1.intersection(s2))
# union
print(s1|s2)
print(s1.union(s2))
# difference
print(s1-s2)
print(s1.difference(s2))

{4, 5, 6}
{4, 5, 6}
{1, 2, 3, 4, 5, 6, 7, 8, 9}
{1, 2, 3, 4, 5, 6, 7, 8, 9}
{1, 2, 3}
{1, 2, 3}

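Boolean

True/false logical values
- Unlike R, there are no abbreviated forms
- Only two values: True and False

Truthiness of data types

Numbers: 0 is false, anything else is true
String, List, Tuple, Dictionary: false when empty, true otherwise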
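
The truthiness rules above can be checked directly with bool(). This is a small sketch; the sample values are chosen for illustration, and set() and None are additions beyond the types listed above (both are treated as false).

```python
# bool() shows how each value is treated in an `if` test.
values = [0, 1, -2.5, "", "hi", [], [0], (), {}, {"k": 1}, set(), None]
for v in values:
    print(repr(v), "->", bool(v))

# Idiomatic emptiness check: rely on truthiness instead of len(x) == 0.
items = []
status = "has items" if items else "empty"
print(status)
```

Note that [0] is true: a container is judged by whether it is empty, not by what it contains.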

Conditionals

'if'-'else if'-'else'

Python does not use { } and instead delimits blocks by indentation
So pay close attention to indentation
Python's if statement is split into if-elif-else

In [21]:

a = "hi"

if a == "oh":
    print("hello")
elif a == "hi":
    print("world")
else:
    print("!!")

world

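`is` and `is not` are not substitutes for `==` and `!=`: `is` tests whether two names refer to the same object, while `==` compares values

In CPython, small integers and short string literals are often cached, so `is` can appear to behave like `==`, but this is not guaranteed; use `==` and `!=` to compare values
Other word-based operators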

Operator	Description
x or y	True if at least one of x and y is true
x and y	True only if both x and y are true
not x	True if x is false

in	not in
x in list	x not in list
x in tuple	x not in tuple
x in string	x not in string

e.g.,

In [22]:

a = "oh"

if a == "oh":  # compare values with ==, not `is`
    print("hello")
elif a == "hi":
    print("world")
else:
    print("!!")

hello

In [23]:

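a = 10

# | is the bitwise operator and binds tighter than ==, so each comparison needs parentheses
if (a == 0) | (a == 10):
    print("hello")
elif (a == 5) | (a == 1):
    print("world")
else:
    print("!!")

###############################

a = 10

# the usual way: the boolean operator `or`
if a == 0 or a == 10:
    print("hello")
elif a == 5 or a == 1:
    print("world")
else:
    print("!!")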

hello
hello

In [24]:

a = 7

if a in [1,2,3]:
    print("hello")
elif a in [7,8,9]:
    print("world")
else:
    print("!!")

world
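
To make the difference between value comparison and identity concrete, here is a minimal sketch using lists, where the outcome is defined by the language rather than by interpreter caching. The variable names are illustrative.

```python
# `==` compares values; `is` compares object identity.
x = [1, 2, 3]
y = [1, 2, 3]   # a second list with equal contents
z = x           # another name for the same list object

print(x == y)   # True: same contents
print(x is y)   # False: two distinct objects
print(x is z)   # True: same object

# `is` is the right tool for singletons such as None.
result = None
if result is None:
    print("no result yet")
```

In short, reach for `==` when comparing numbers or strings, and keep `is` for checks such as `is None`.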

Loops

while

while <condition>:
    statements to run

In [25]:

a=1
while a<10:
    print(a)
    a+=1

for

for variable in list, tuple, or string:
    statements to run

In [26]:

test_list = ["one","two","three"]
for i in test_list:
    print(i)

one
two
three

When the elements are themselves tuples or lists

In [27]:

test_list = [(1,2,3),(4,5,6),[7,8,9]]
for (first,second,third) in test_list:
    print(first+second+third)

6
15
24

continue and break
- Same as in C or Java

In [28]:

test_list = [1,2,3,4,5,6]

for i in test_list:
    if i == 5: continue  # compare values with ==, not `is`
    print(i)

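The range function

A function that generates a sequence of numbers automatically
- Similar to seq in R, except that the end value is excluded
- range(start, stop, step)
- start and step default to 0 and 1
- In Python 3 it returns a lazy range object rather than a list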
In [29]:

print(range(5,10,2))
for i in range(5,10,2):
    print(i)

range(5, 10, 2)
5
7
9

In [30]:

range(10)

Out[30]:

range(0, 10)
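
Because printing a range only shows its start, stop, and step, it can help to materialize it with list() when exploring. A short sketch of the behaviour described above:

```python
# range is lazy: it stores start/stop/step and produces values on demand.
r = range(5, 10, 2)
print(r)                # range(5, 10, 2)
print(list(r))          # [5, 7, 9]
print(len(r), 7 in r, r[-1])

# The stop value is excluded, and a negative step counts down.
print(list(range(3)))           # [0, 1, 2]
print(list(range(10, 0, -3)))   # [10, 7, 4, 1]
```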

Life is too short, You need Python

One-line for expressions
- Wrapped in parentheses, an inline for creates a generator object that yields the results of the computation

In [31]:

import statistics as stat

In [32]:

print(sum(x+1 for x in range(10)))
print(stat.mean(x for x in range(10)))
(x for x in range(10))

55
4.5

Out[32]:

<generator object <genexpr> at 0x7f1560399c50>
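
One practical consequence of getting a generator object is that it can be consumed only once, unlike a list built with square brackets. A minimal sketch with illustrative names:

```python
# A list comprehension builds the whole list; a generator yields one item at a time.
squares_list = [x * x for x in range(5)]
squares_gen = (x * x for x in range(5))

print(squares_list)       # [0, 1, 4, 9, 16]
print(sum(squares_gen))   # 30, this consumes the generator
print(sum(squares_gen))   # 0, the generator is now exhausted
```

If the values are needed more than once, build a list; if they are only fed to a single call such as sum(), the generator avoids creating the intermediate list.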

Learning NumPy

In [33]:

import numpy
numpy.version.full_version

Out[33]:

'1.11.1'

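numpy.array can potentially shadow the array package of standard Python if everything is imported into the namespace, so NumPy is imported under the alias np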

In [34]:

import numpy as np
a = np.array([0,1,2,3,4,5])
a

Out[34]:

array([0, 1, 2, 3, 4, 5])

In [35]:

print(a.ndim)
print(a.shape)

1
(6,)

NumPy arrays carry extra information such as ndim and shape
As shown, a above is a one-dimensional array of 6 elements

Reshaping a into a two-dimensional array (matrix)

In [36]:

b = a.reshape((3,2))
print(b)
print(b.ndim)
print(b.shape)

[[0 1]
 [2 3]
 [4 5]]
2
(3, 2)

NumPy is optimized to avoid copies wherever possible
- Changing a value in b also changes a (the two are not independent)
e.g.,

In [37]:

b[1][0]=77
print(b)

[[ 0  1]
 [77  3]
 [ 4  5]]

In [38]:

print(a)

[ 0  1 77  3  4  5]

To get an independent copy, use .copy()

In [39]:

c = a.reshape((3,2)).copy()
print(c)

[[ 0  1]
 [77  3]
 [ 4  5]]

In [40]:

c[0][0]=-99
print(c)

[[-99   1]
 [ 77   3]
 [  4   5]]

In [41]:

print(a)

[ 0  1 77  3  4  5]
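
Whether two arrays share their data can also be checked explicitly. The sketch below repeats the experiment above on a fresh array and uses np.shares_memory, which is not part of the original notes; the names view and copy are illustrative.

```python
import numpy as np

a = np.arange(6)
view = a.reshape((3, 2))          # shares memory with a
copy = a.reshape((3, 2)).copy()   # owns its own data

print(np.shares_memory(a, view))  # True
print(np.shares_memory(a, copy))  # False

view[1, 0] = 77    # also changes a[2]
copy[0, 0] = -99   # leaves a untouched
print(a)
```

Checking this before modifying a reshaped or sliced array is a cheap way to avoid accidentally changing the original data.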

NumPy operators are applied element-wise

In [42]:

a*2

Out[42]:

array([  0,   2, 154,   6,   8,  10])

In [43]:

a**2

Out[43]:

array([   0,    1, 5929,    9,   16,   25])

The same operators applied to a Python list

In [44]:

[1,2,3,4,5]*2

Out[44]:

[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

In [45]:

[1,2,3,4,5]**2

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-45-c54f0681399f> in <module>()
----> 1 [1,2,3,4,5]**2

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

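For simple operations such as appending and deleting, a Python list can be more convenient

Indexing

Besides ordinary list indexing, an array can itself be used as an index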

In [46]:

print(a)
print(a[np.array([2,3,4])])

[ 0  1 77  3  4  5]
[77  3  4]

Conditions can also be used for indexing

In [47]:

a>4

Out[47]:

array([False, False,  True, False, False,  True], dtype=bool)

In [48]:

a[a>4]

Out[48]:

array([77,  5])

A condition can be used to trim values that fall outside it (outliers); this modifies the array in place

In [49]:

a[a>4]=4

In [50]:

a

Out[50]:

array([0, 1, 4, 3, 4, 4])

Use clip to cut off values beyond both ends of a range

In [51]:

a.clip(0,4)

Out[51]:

array([0, 1, 4, 3, 4, 4])

Runtime comparison: Python list vs NumPy array

Sum of the squares of 0 through 999
Repeated 10,000 times, then compare the run times

In [52]:

import timeit
normal_py_sec = timeit.timeit('sum(x*x for x in range(1000))',
                              number=10000)
naive_np_sec = timeit.timeit('sum(na*na)',
                            setup="import numpy as np; na=np.arange(1000)",
                            number=10000)
good_np_sec = timeit.timeit("na.dot(na)",
                           setup="import numpy as np; na=np.arange(1000)",
                           number=10000)

print("Normal Python: %f sec" % normal_py_sec)
print("Naive NumPy: %f sec" % naive_np_sec)
print("Good NumPy: %f sec" % good_np_sec)

Normal Python: 1.204513 sec
Naive NumPy: 0.918649 sec
Good NumPy: 0.013261 sec

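Using NumPy only as data storage (Naive NumPy) gains little even though NumPy is a C extension: in the run above it is only slightly faster than plain Python (0.92 s vs 1.20 s)
- ... the result depends heavily on the machine
Accessing individual elements from Python itself is fairly costly
NumPy's dot() function is by far the fastest
For any algorithm we implement, it is better to use the optimized extension functions of NumPy or SciPy than to loop over individual elements in Python
- However, NumPy arrays give up the flexibility of lists, which can hold elements of different types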

In [53]:

a= np.array([1,2,3])
a.dtype

Out[53]:

dtype('int64')

In [54]:

np.array([1,"stringy"])

Out[54]:

array(['1', 'stringy'], 
      dtype='<U21')

In [55]:

np.array([1,"stringy",set([1,2,3])])

Out[55]:

array([1, 'stringy', {1, 2, 3}], dtype=object)

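Learning SciPy

On top of NumPy's efficient data structures, SciPy provides a large number of algorithms that work on those arrays

Numerical algorithms provided by SciPy

Matrix manipulation
Linear algebra
Optimization
Clustering
Spatial operations
Fast Fourier transform

In the SciPy 0.17 used here, NumPy's namespace is also accessible through SciPy (later SciPy releases deprecated these aliases in favor of calling NumPy directly)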

In [56]:

import scipy, numpy
scipy.version.full_version

Out[56]:

'0.17.1'

In [57]:

scipy.dot is numpy.dot

Out[57]:

True

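SciPy's algorithms

SciPy package	Functionality
cluster	Hierarchical clustering (cluster.hierarchy), vector quantization / K-means (cluster.vq)
constants	Physical and mathematical constants, conversion methods
fftpack	Discrete Fourier transform functions
integrate	Integration routines
interpolate	Interpolation (quadratic, cubic, and so on)
io	Data input and output
linalg	Linear algebra routines using the optimized BLAS and LAPACK libraries
maxentropy	Functions for fitting maximum entropy models
ndimage	n-dimensional image package
odr	Orthogonal distance regression
optimize	Optimization
signal	Signal processing
sparse	Sparse matrices
spatial	Spatial data structures and algorithms
special	Special mathematical functions such as Bessel or Jacobi
stats	Statistics toolkit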

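Our first machine learning application

The data can be downloaded from http://www.acornpub.co.kr/book/machine-learning-python-2e
The file is ch01/data/web_traffic.tsv
In the example, a company sells machine learning algorithms over the web; business is growing, so it wants to expand its infrastructure to serve all incoming web requests
- It cannot just add expensive equipment blindly
- If it lacks the equipment to serve the requests, it loses money
The question: assuming the current infrastructure can handle at most 100,000 requests per hour, when will it reach that limit?
- We want to know in advance when additional servers have to be provisioned in the cloud

Reading the file

Use genfromtxt(), reached here through the scipy namespace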

In [58]:

import scipy as sp
data = sp.genfromtxt("web_traffic.tsv",
                     delimiter="\t")

In [59]:

print(data[:10])

[[  1.00000000e+00   2.27200000e+03]
 [  2.00000000e+00              nan]
 [  3.00000000e+00   1.38600000e+03]
 [  4.00000000e+00   1.36500000e+03]
 [  5.00000000e+00   1.48800000e+03]
 [  6.00000000e+00   1.33700000e+03]
 [  7.00000000e+00   1.88300000e+03]
 [  8.00000000e+00   2.28300000e+03]
 [  9.00000000e+00   1.33500000e+03]
 [  1.00000000e+01   1.02500000e+03]]

In [60]:

print(data.shape)

(743, 2)

Cleaning and preprocessing the data

data is a two-dimensional array with 743 rows
- Split it into two variables
- x is the time
- y is the number of requests at that time
How to select data in SciPy
- See http://www.scipy.org/Tentative_NumPy_Tutorial

In [61]:

x= data[:,0]
y= data[:,1]

y contains 8 NaN values in total
- Remove them (negation uses ~, not !)

In [62]:

sp.sum(sp.isnan(y))

Out[62]:

8

In [63]:

x = x[~sp.isnan(y)]
y = y[~sp.isnan(y)]

In [64]:

sp.sum(sp.isnan(y))

Out[64]:

0
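
The masking step can be tried on a tiny array without the data file. This is a sketch with made-up values, calling NumPy directly (np.isnan) rather than through the scipy namespace; computing the mask once and reusing it for both arrays keeps x and y aligned.

```python
import numpy as np

# Made-up stand-in for the two columns; the real values come from web_traffic.tsv.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, np.nan, 30.0, np.nan, 50.0])

valid = ~np.isnan(y)     # True where y holds a real number
print(np.sum(~valid))    # number of NaN entries

# Apply the same mask to both arrays so the pairs stay matched.
x, y = x[valid], y[valid]
print(x, y)
```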

Drawing a scatter plot with matplotlib

It mimics Matlab's interface
- See http://matplotlib.org/users/pyplot_tutorial.html

In [65]:

import matplotlib.pyplot as plt
# plot the (x,y) points with dots of size 10
plt.scatter(x,y,s=10)
plt.title("Web traffic over the last month")
plt.xlabel("Time")
plt.ylabel("Hits/hour")
plt.xticks([w*7*24 for w in range(10)],
          ['week %i' % w for w in range(10)])
plt.autoscale(tight=True)
# draw a light gray grid
plt.grid(True, linestyle="-", color="0.75")
plt.show()

---------------------------------------------------------------------------
TclError                                  Traceback (most recent call last)
<ipython-input-65-1638a688634c> in <module>()
      1 import matplotlib.pyplot as plt
      2 # plot the (x,y) points with dots of size 10
----> 3 plt.scatter(x,y,s=10)
      4 plt.title("Web traffic over the last month")
      5 plt.xlabel("Time")

/usr/lib/python3/dist-packages/matplotlib/pyplot.py in scatter(x, y, s, c, marker, cmap, norm, vmin, vmax, alpha, linewidths, verts, edgecolors, hold, data, **kwargs)
   3239             vmax=None, alpha=None, linewidths=None, verts=None, edgecolors=None,
   3240             hold=None, data=None, **kwargs):
-> 3241     ax = gca()
   3242     # allow callers to override the hold state by passing hold=True|False
   3243     washold = ax.ishold()

/usr/lib/python3/dist-packages/matplotlib/pyplot.py in gca(**kwargs)
    926     matplotlib.figure.Figure.gca : The figure's gca method.
    927     """
--> 928     return gcf().gca(**kwargs)
    929 
    930 # More ways of creating axes:

/usr/lib/python3/dist-packages/matplotlib/pyplot.py in gcf()
    576         return figManager.canvas.figure
    577     else:
--> 578         return figure()
    579 
    580 

/usr/lib/python3/dist-packages/matplotlib/pyplot.py in figure(num, figsize, dpi, facecolor, edgecolor, frameon, FigureClass, **kwargs)
    525                                         frameon=frameon,
    526                                         FigureClass=FigureClass,
--> 527                                         **kwargs)
    528 
    529         if figLabel:

/usr/lib/python3/dist-packages/matplotlib/backends/backend_tkagg.py in new_figure_manager(num, *args, **kwargs)
     82     FigureClass = kwargs.pop('FigureClass', Figure)
     83     figure = FigureClass(*args, **kwargs)
---> 84     return new_figure_manager_given_figure(num, figure)
     85 
     86 

/usr/lib/python3/dist-packages/matplotlib/backends/backend_tkagg.py in new_figure_manager_given_figure(num, figure)
     90     """
     91     _focus = windowing.FocusManager()
---> 92     window = Tk.Tk()
     93     window.withdraw()
     94 

/usr/lib/python3.5/tkinter/__init__.py in __init__(self, screenName, baseName, className, useTk, sync, use)
   1869                 baseName = baseName + ext
   1870         interactive = 0
-> 1871         self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use)
   1872         if useTk:
   1873             self._loadtk()

TclError: no display name and no $DISPLAY environment variable

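Showing the result inline

The TclError above is raised because the Tk backend needs a display and none is available ($DISPLAY is not set); a notebook backend avoids this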

In [66]:

import matplotlib.pyplot as plt
## magic command that renders the figure inside the notebook
%matplotlib nbagg
# plot the (x,y) points with dots of size 10
plt.scatter(x,y,s=10)
plt.title("Web traffic over the last month")
plt.xlabel("Time")
plt.ylabel("Hits/hour")
plt.xticks([w*7*24 for w in range(10)],
          ['week %i' % w for w in range(10)])
plt.autoscale(tight=True)
# draw a light gray grid
plt.grid(True, linestyle="-", color="0.75")
plt.show()

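Choosing the right model and learning algorithm

What we need in order to choose a model

Find the real model behind the noisy data
Use the model to infer when the infrastructure has to be extended

Before building a model

A model can be thought of as a simplified theoretical approximation of a complex reality
- There is always a gap to reality, which is called the approximation error
- The error lets us pick the right model among many choices; it is computed as the squared distance between the model's prediction and the real value
- The error of a learned model function f is defined as follows
The vectors x and y are the variables split from data earlier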

In [67]:

def error(f,x,y):
    return sp.sum((f(x)-y)**2)
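
Written as a formula, the function above computes the sum of squared differences between prediction and observation. The index i and the sample count N are notation introduced here for illustration; they do not appear in the original notes.

```latex
E(f) = \sum_{i=1}^{N} \bigl( f(x_i) - y_i \bigr)^{2}
```

Squaring makes every term non-negative and penalizes large misses more heavily than small ones.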

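Starting with a simple straight line

Linear regression
- How should a straight line be drawn in the chart so that the approximation error is smallest?
- Use SciPy's polyfit() function
  - Given the polynomial order we want, it finds the model function that minimizes the error function defined earlier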

In [68]:

fp1, residuals, rank, sv, rcond = sp.polyfit(x,y,1,full=True)

polyfit() returns the parameters of the fitted model function, fp1
- Setting full=True returns additional information about the fitting process
- Here we focus on the residuals

In [69]:

print("Model parameter: %s" % fp1)
print(residuals)

Model parameter: [   2.59619213  989.02487106]
[  3.17389767e+08]

The linear regression obtained above:
- $f (x) = 2.59619213 x + 989.02487106$

To build a model function from the model parameters, use poly1d()

In [70]:

f1 = sp.poly1d(fp1)
print(error(f1,x,y))

317389767.34
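
The polyfit / poly1d / error workflow can be checked on data where the answer is known in advance. The sketch below uses synthetic, noise-free points on the line y = 2x + 1 (not the web traffic data) and calls NumPy directly, since polyfit and poly1d are NumPy functions.

```python
import numpy as np

def error(f, x, y):
    """Sum of squared differences between predictions f(x) and targets y."""
    return np.sum((f(x) - y) ** 2)

# Synthetic, noise-free data on the line y = 2x + 1.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

coeffs = np.polyfit(x, y, 1)   # highest power first: [slope, intercept]
f1 = np.poly1d(coeffs)         # callable polynomial built from the coefficients

print(coeffs)
print(f1(20.0))                # prediction outside the fitted range
print(error(f1, x, y))         # close to zero for noise-free data
```

With real, noisy data the error will not be near zero; its value is only meaningful relative to the error of other models on the same data.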

선형 회귀 선 추가

In [71]:

# plot the (x,y) points with dots of size 10
plt.scatter(x,y,s=10)
plt.title("Web traffic over the last month")
plt.xlabel("Time")
plt.ylabel("Hits/hour")
plt.xticks([w*7*24 for w in range(10)],
          ['week %i' % w for w in range(10)])
plt.autoscale(tight=True)
# draw a light gray grid
plt.grid(True, linestyle="-", color="0.75")
fx = sp.linspace(0,x[-1],1000)
plt.show()
plt.plot(fx,f1(fx),linewidth=4)
plt.legend(["d = %i" % f1.order], loc="upper left")

Out[71]:

<matplotlib.legend.Legend at 0x7f1500ad06d8>

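The absolute value of the error is rarely useful on its own; it is used to compare models
- Until we find a better model, the error of the linear regression (317389767.34) serves as the baseline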

A more complex model

Degree-2 polynomial model

In [72]:

f2p = sp.polyfit(x,y,2)
print(f2p)
f2=sp.poly1d(f2p)
print(error(f2,x,y))

[  1.05322215e-02  -5.26545650e+00   1.97476082e+03]
179983507.878

In [73]:

from matplotlib.legend_handler import HandlerLine2D
# plot the (x,y) points with dots of size 10
plt.scatter(x,y,s=10)
plt.title("Web traffic over the last month")
plt.xlabel("Time")
plt.ylabel("Hits/hour")
plt.xticks([w*7*24 for w in range(10)],
          ['week %i' % w for w in range(10)])
plt.autoscale(tight=True)
# draw a light gray grid
plt.grid(True, linestyle="-", color="0.75")
fx = sp.linspace(0,x[-1],1000)
plt.show()
line1,=plt.plot(fx,f1(fx),linewidth=4, color="blue",label="d = 1")
line2,=plt.plot(fx,f2(fx),linewidth=4,color="red",label="d = 2")
plt.legend(handler_map={line2:HandlerLine2D(numpoints=4)}, loc="upper left")

Out[73]:

<matplotlib.legend.Legend at 0x7f15009e8d68>

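The error is much smaller than that of the straight line
- Degree 2 (179983507.878) vs degree 1 (317389767.34)
The result is better, but the model is more complex than the straight line and
uses one more parameter
The degree-2 regression:
- $f (x) = 0.0105322215 x^{2} - 5.26545650 x + 1974.76082$
The more complex model fits the training data better
- Let's increase the degree further to 3, 10, and 100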

In [73]:

f3p = sp.polyfit(x,y,3)
f3=sp.poly1d(f3p)
print(error(f3,x,y))
f10p = sp.polyfit(x,y,10)
f10=sp.poly1d(f10p)
print(error(f10,x,y))
f100p = sp.polyfit(x,y,100)
f100 = sp.poly1d(f100p)
print(error(f100,x,y))

139350144.032
121942326.364
109452391.107

/home/shinminchul/.local/lib/python3.5/site-packages/numpy/lib/polynomial.py:587: RuntimeWarning: overflow encountered in multiply
  scale = NX.sqrt((lhs*lhs).sum(axis=0))
/home/shinminchul/.local/lib/python3.5/site-packages/numpy/lib/polynomial.py:595: RankWarning: Polyfit may be poorly conditioned
  warnings.warn(msg, RankWarning)

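polyfit cannot determine a degree-100 polynomial reliably (see the warnings above), so the fit is effectively a degree-53 polynomial
- The more complex the model, the more closely it fits the data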

In [74]:

# plot the (x,y) points with dots of size 10
plt.scatter(x,y,s=10)
plt.title("Web traffic over the last month")
plt.xlabel("Time")
plt.ylabel("Hits/hour")
plt.xticks([w*7*24 for w in range(10)],
          ['week %i' % w for w in range(10)])
plt.autoscale(tight=True)
# draw a light gray grid
plt.grid(True, linestyle="-", color="0.75")
fx = sp.linspace(0,x[-1],1000)
plt.show()
line1,=plt.plot(fx,f1(fx),linewidth=3, label="d = 1")
line2,=plt.plot(fx,f2(fx),linewidth=3,color="red",label="d = 2")
line3,=plt.plot(fx,f3(fx),linewidth=3,color="green",label="d = 3")
line4,=plt.plot(fx,f10(fx),linewidth=3,color="yellow",label="d = 10")
line5,=plt.plot(fx,f100(fx),linewidth=3,color="orange",label="d = %i" % f100.order)
plt.legend(handler_map={line1:HandlerLine2D(numpoints=5)}, loc="upper left")

Out[74]:

<matplotlib.legend.Legend at 0x7fdb59521128>

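So is the highest-degree polynomial the best model?

As the degree increases, the model fits too much of the data
- It captures the noise as well
This is called overfitting

At this point we have to choose one of the following

Pick one of the fitted polynomial models
Switch to a more complex model class, such as splines
Think about the data differently and start over

Which model fits best?

The degree-1 polynomial is too simple
The degree-10 and degree-53 polynomials overfit badly
The degree-2 and degree-3 polynomials can also go badly wrong when extrapolated beyond either end of the data
We need to rethink from here
What is the input and what is the output?
- The approach above did not really understand the data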

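Stepping back to go forward: another look at the data

There is an inflection point between week 3 and week 4
- Split the data in two at week 3.5 and train two lines separately
- The first line is trained on the data up to the split and the second on the remaining weeks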

In [75]:

inflection = int(3.5*7*24) # inflection point in hours; int() because slice indices must be integers
xa = x[:inflection] # data before the inflection point
ya = y[:inflection]
xb = x[inflection:] # data after the inflection point
yb = y[inflection:]

In [76]:

fa = sp.poly1d(sp.polyfit(xa,ya,1))
fb = sp.poly1d(sp.polyfit(xb,yb,1))

fa_error = error(fa,xa,ya)
fb_error = error(fb,xb,yb)

print("Error inflection=%f"%(fa_error+fb_error))

Error inflection=132950348.197616

In [77]:

from matplotlib.legend_handler import HandlerLine2D
# plot the (x,y) points with dots of size 10
plt.scatter(x,y,s=10)
plt.title("Web traffic over the last month")
plt.xlabel("Time")
plt.ylabel("Hits/hour")
plt.xticks([w*7*24 for w in range(10)],
          ['week %i' % w for w in range(10)])
plt.autoscale(tight=True)
# draw a light gray grid
plt.grid(True, linestyle="-", color="0.75")
fx = sp.linspace(0,x[-1],1000)
plt.show()
line1,=plt.plot(fx,fa(fx),linewidth=4, color="blue",label="d = 1")
line2,=plt.plot(fx[730:],fb(fx[730:]),linewidth=4,color="red",label="d = 1")
plt.legend(handler_map={line2:HandlerLine2D(numpoints=4)}, loc="upper left")

Out[77]:

<matplotlib.legend.Legend at 0x7fdb594d48d0>

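The two lines together fit the data very well
- Their combined error is still higher than that of the high-degree polynomials
- Even so, the straight-line model is expected to predict future data better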

In [78]:

# plot the (x,y) points with dots of size 10
plt.scatter(x,y,s=10)
plt.title("Web traffic over the last month")
plt.xlabel("Time")
plt.ylabel("Hits/hour")
plt.xticks([w*7*24 for w in range(10)],
          ['week %i' % w for w in range(10)])
axes=plt.gca()
axes.set_xlim([0,x[-1]+300])
axes.set_ylim([0,10000])
# draw a light gray grid
plt.grid(True, linestyle="-", color="0.75")
fx = sp.linspace(0,x[-1]+300,1000)
plt.show()
line1,=plt.plot(fx,f1(fx),linewidth=4, label="d = 1")
line2,=plt.plot(fx,f2(fx),linewidth=4,color="red",label="d = 2")
line3,=plt.plot(fx,f3(fx),linewidth=4,color="green",label="d = 3")
line4,=plt.plot(fx,f10(fx),linewidth=4,color="yellow",label="d = 10")
line5,=plt.plot(fx,f100(fx),linewidth=4,color="orange",label="d = %i" % f100.order)
plt.legend(handler_map={line1:HandlerLine2D(numpoints=5)}, loc="upper left")

Out[78]:

<matplotlib.legend.Legend at 0x7fdb5999fc50>

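With the degree-10 and degree-53 models, predictions of the future are useless
- The harm done by overfitting
The low-degree models do not seem to capture the data adequately
- Underfitting

Building models on the most recent data only (after the inflection point)
- Apply the higher-degree polynomials as well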
In [79]:

print("Error inflection=%f"%fb_error)
# degree-2 polynomial
f2b = sp.poly1d(sp.polyfit(xb,yb,2))
f2b_error = error(f2b,xb,yb)

print("Error inflection=%f"%f2b_error)

# degree-3 polynomial

f3b = sp.poly1d(sp.polyfit(xb,yb,3))
f3b_error = error(f3b,xb,yb)

print("Error inflection=%f"%f3b_error)

# degree-10 polynomial

f10b = sp.poly1d(sp.polyfit(xb,yb,10))
f10b_error = error(f10b,xb,yb)

print("Error inflection=%f"%(f10b_error))

# degree-100 polynomial
f100b = sp.poly1d(sp.polyfit(xb,yb,100))
f100b_error = error(f100b,xb,yb)

print("Error inflection=%f"%f100b_error)

Error inflection=22143941.107618
Error inflection=19768846.989176
Error inflection=19766452.361027
Error inflection=18949296.721348
Error inflection=18300750.219522

/home/shinminchul/.local/lib/python3.5/site-packages/numpy/lib/polynomial.py:595: RankWarning: Polyfit may be poorly conditioned
  warnings.warn(msg, RankWarning)
/home/shinminchul/.local/lib/python3.5/site-packages/numpy/lib/polynomial.py:587: RuntimeWarning: overflow encountered in multiply
  scale = NX.sqrt((lhs*lhs).sum(axis=0))
/home/shinminchul/.local/lib/python3.5/site-packages/numpy/lib/polynomial.py:595: RankWarning: Polyfit may be poorly conditioned
  warnings.warn(msg, RankWarning)

In [80]:

# plot the (x,y) points with dots of size 10
plt.scatter(x,y,s=10)
plt.title("Web traffic over the last month")
plt.xlabel("Time")
plt.ylabel("Hits/hour")
plt.xticks([w*7*24 for w in range(10)],
          ['week %i' % w for w in range(10)])
plt.autoscale(tight=True)
axes=plt.gca()
axes.set_xlim([0,x[-1]+300])
axes.set_ylim([0,10000])
# draw a light gray grid
plt.grid(True, linestyle="-", color="0.75")
fx = sp.linspace(0,x[-1]+300,1000)
plt.show()
line1,=plt.plot(fx,fb(fx),linewidth=4, label="d = 1")
line2,=plt.plot(fx,f2b(fx),linewidth=4,color="red",label="d = 2")
line3,=plt.plot(fx,f3b(fx),linewidth=4,color="green",label="d = 3")
line4,=plt.plot(fx,f10b(fx),linewidth=4,color="yellow",label="d = 10")
line5,=plt.plot(fx,f100b(fx),linewidth=4,color="orange",label="d = %i" % f100b.order)  # label the model actually plotted
plt.legend(handler_map={line1:HandlerLine2D(numpoints=5)}, loc="upper left")

Out[80]:

<matplotlib.legend.Legend at 0x7fdb598b5c50>

This plot alone is hard to interpret
- Judging by the residuals only, we would still have to choose the highest-degree model

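Training and testing

If we had data from the future to measure the models against, we could choose a model purely on the resulting approximation error
Even though we cannot look into the future, we can simulate a similar effect by holding out part of the data
- e.g., remove a portion of the data and train on the rest (train:test = 7:3)
- The data-splitting code is taken from analyze_webstats.py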

In [81]:

frac = 0.3
split_idx = int(frac * len(xb))
shuffled = sp.random.permutation(list(range(len(xb))))
test = sorted(shuffled[:split_idx])
train = sorted(shuffled[split_idx:])
fbt1 = sp.poly1d(sp.polyfit(xb[train], yb[train], 1))
fbt2 = sp.poly1d(sp.polyfit(xb[train], yb[train], 2))
print("fbt2(x)= \n%s"%fbt2)
print("fbt2(x)-100,000= \n%s"%(fbt2-100000))
fbt3 = sp.poly1d(sp.polyfit(xb[train], yb[train], 3))
fbt10 = sp.poly1d(sp.polyfit(xb[train], yb[train], 10))
fbt100 = sp.poly1d(sp.polyfit(xb[train], yb[train], 100))

print("Test errors for only the time after inflection point")
for f in [fbt1, fbt2, fbt3, fbt10, fbt100]:
    print("Error d=%i: %f" % (f.order, error(f, xb[test], yb[test])))

fbt2(x)= 
         2
0.08758 x - 96.87 x + 2.863e+04
fbt2(x)-100,000= 
         2
0.08758 x - 96.87 x - 7.137e+04
Test errors for only the time after inflection point
Error d=1: 7839329.477734
Error d=2: 7480233.510157
Error d=3: 7571605.430102
Error d=10: 7792170.852484
Error d=53: 8150400.038378

/home/shinminchul/.local/lib/python3.5/site-packages/numpy/lib/polynomial.py:595: RankWarning: Polyfit may be poorly conditioned
  warnings.warn(msg, RankWarning)
/home/shinminchul/.local/lib/python3.5/site-packages/numpy/lib/polynomial.py:587: RuntimeWarning: overflow encountered in multiply
  scale = NX.sqrt((lhs*lhs).sum(axis=0))
/home/shinminchul/.local/lib/python3.5/site-packages/numpy/lib/polynomial.py:595: RankWarning: Polyfit may be poorly conditioned
  warnings.warn(msg, RankWarning)
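
The hold-out split itself needs nothing beyond the standard library. The sketch below is a stand-alone variant of the index shuffling used above: holdout_split is a hypothetical helper name, and it uses Python's random module with a fixed seed instead of sp.random.permutation, so the same split (and therefore comparable test errors) comes out on every run.

```python
import random

def holdout_split(n_samples, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx): disjoint, sorted index lists covering range(n_samples)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # seeded so the split is reproducible
    split = int(test_fraction * n_samples)
    return sorted(indices[split:]), sorted(indices[:split])

train_idx, test_idx = holdout_split(10)
print(train_idx, test_idx)
```

The run recorded above is unseeded, so its exact test errors will differ from one run to the next; fixing the seed is one way to make model comparisons repeatable.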

In [82]:

# plot the (x,y) points with dots of size 10
plt.scatter(x,y,s=10)
plt.title("Web traffic over the last month")
plt.xlabel("Time")
plt.ylabel("Hits/hour")
plt.xticks([w*7*24 for w in range(10)],
          ['week %i' % w for w in range(10)])
plt.autoscale(tight=True)
axes=plt.gca()
axes.set_xlim([0,x[-1]+300])
axes.set_ylim([0,10000])
# draw a light gray grid
plt.grid(True, linestyle="-", color="0.75")
fx = sp.linspace(0,x[-1]+300,1000)
plt.show()
line1,=plt.plot(fx,fbt1(fx),linewidth=4, label="d = 1")
line2,=plt.plot(fx,fbt2(fx),linewidth=4,color="red",label="d = 2")
line3,=plt.plot(fx,fbt3(fx),linewidth=4,color="green",label="d = 3")
line4,=plt.plot(fx,fbt10(fx),linewidth=4,color="yellow",label="d = 10")
line5,=plt.plot(fx,fbt100(fx),linewidth=4,color="orange",label="d = %i" % fbt100.order)  # label the model actually plotted
plt.legend(handler_map={line1:HandlerLine2D(numpoints=5)}, loc="upper left")

Out[82]:

<matplotlib.legend.Legend at 0x7fdb5984aba8>

In the end, the degree-2 polynomial (fbt2), which has the smallest test error, is chosen as the model

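Answering the initial question

The initial question: when will we reach 100,000 requests per hour?
- Find the point where the model function reaches 100,000
  - Subtract 100,000 from the degree-2 polynomial and find the root of the resulting polynomial
- The fsolve function in SciPy's optimize module can find the root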

In [83]:

fbt2=sp.poly1d(sp.polyfit(xb[train],yb[train],2))
print("fbt2(x)=\n%s"%fbt2)
print("fbt2(x)-100,000 = \n%s"%(fbt2-100000))
from scipy.optimize import fsolve
reached_max = fsolve(fbt2-100000,x0=800)/(7*24)
print("100,000 hits/hour expected at week %f" % reached_max[0])

fbt2(x)=
         2
0.08758 x - 96.87 x + 2.863e+04
fbt2(x)-100,000 = 
         2
0.08758 x - 96.87 x - 7.137e+04
100,000 hits/hour expected at week 9.593955
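
Since the model is a quadratic, the root can be cross-checked by hand with the quadratic formula. The sketch below plugs in the rounded coefficients printed above, so its result should only be expected to land close to the value fsolve reports, not to match it digit for digit.

```python
import math

# Rounded coefficients of fbt2(x) - 100,000 as printed above: a*x**2 + b*x + c
a, b, c = 0.08758, -96.87, -7.137e4

disc = b * b - 4 * a * c
root_hours = (-b + math.sqrt(disc)) / (2 * a)   # the positive root is the one in the future
root_weeks = root_hours / (7 * 24)              # convert hours to weeks

print("hours: %.1f, weeks: %.2f" % (root_hours, root_weeks))
```

The other root of the quadratic is negative, that is, before the start of the data, which is why only the positive one is of interest here.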

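From this result, we predict that requests will reach 100,000 per hour at around week 9.6, that is, during the tenth week
- This is of course not a perfect result
- But watching the traffic more closely lets us find the point at which new equipment is needed
For predictions further into the future, more sophisticated statistics that take the variance into account can be used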

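What we learned in this chapter

Most of the time in a machine learning task is spent on the most important part: cleaning and understanding the data
We learned how important a correct experimental setup is, and that training and test data must not be mixed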

License: Attribution, NonCommercial, NoDerivatives
