[ML] Car Price Prediction with Regression Analysis

Author : tmlab / Date : 2017. 12. 29. 19:18 / Category : Analytics

Car price prediction

Loading the data

In [1]:

import pandas as pd
data = pd.read_csv('Automobile_data_.csv')

In [2]:

data.head()

Out[2]:

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	...	engine-size	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	3	?	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111	5000	21	27	13495
1	3	?	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111	5000	21	27	16500
2	1	?	alfa-romero	gas	std	two	hatchback	rwd	front	94.5	...	152	mpfi	2.68	3.47	9.0	154	5000	19	26	16500
3	2	164	audi	gas	std	four	sedan	fwd	front	99.8	...	109	mpfi	3.19	3.4	10.0	102	5500	24	30	13950
4	2	164	audi	gas	std	four	sedan	4wd	front	99.4	...	136	mpfi	3.19	3.4	8.0	115	5500	18	22	17450

5 rows × 26 columns

Replacing missing values: "?" -> NaN

In [65]:

import numpy as np
data = data.replace('?', np.nan)  # np.nan is the spelling that works across NumPy versions
data.head()

Out[65]:

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	...	engine-size	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	3	NaN	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111	5000	21	27	13495.0
1	3	NaN	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111	5000	21	27	16500.0
2	1	NaN	alfa-romero	gas	std	two	hatchback	rwd	front	94.5	...	152	mpfi	2.68	3.47	9.0	154	5000	19	26	16500.0
3	2	164	audi	gas	std	four	sedan	fwd	front	99.8	...	109	mpfi	3.19	3.4	10.0	102	5500	24	30	13950.0
4	2	164	audi	gas	std	four	sedan	4wd	front	99.4	...	136	mpfi	3.19	3.4	8.0	115	5500	18	22	17450.0

5 rows × 26 columns
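
As an alternative to replacing "?" after loading, pandas can treat it as missing while parsing the file. A minimal sketch, assuming a small inline CSV: the `sample_csv` string and its three rows are made up for illustration and stand in for `Automobile_data_.csv`.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for Automobile_data_.csv
sample_csv = io.StringIO(
    "make,horsepower,price\n"
    "alfa-romero,111,13495\n"
    "audi,?,13950\n"
    "audi,115,?\n"
)

# na_values="?" tells read_csv to parse "?" as NaN
sample = pd.read_csv(sample_csv, na_values="?")

print(sample.dtypes)        # horsepower and price are parsed as float64
print(sample.isna().sum())  # one missing value each in horsepower and price
```

With this option, columns whose only non-numeric entry is "?" come out as numeric directly, so they would not need a separate conversion step.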

In [53]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
symboling            205 non-null int64
normalized-losses    205 non-null object
make                 205 non-null object
fuel-type            205 non-null object
aspiration           205 non-null object
num-of-doors         205 non-null object
body-style           205 non-null object
drive-wheels         205 non-null object
engine-location      205 non-null object
wheel-base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
curb-weight          205 non-null int64
engine-type          205 non-null object
num-of-cylinders     205 non-null object
engine-size          205 non-null int64
fuel-system          205 non-null object
bore                 205 non-null object
stroke               205 non-null object
compression-ratio    205 non-null float64
horsepower           205 non-null object
peak-rpm             205 non-null object
city-mpg             205 non-null int64
highway-mpg          205 non-null int64
price                205 non-null object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.7+ KB

In [54]:

data.describe()

Out[54]:

	symboling	wheel-base	length	width	height	curb-weight	engine-size	compression-ratio	city-mpg	highway-mpg
count	205.000000	205.000000	205.000000	205.000000	205.000000	205.000000	205.000000	205.000000	205.000000	205.000000
mean	0.834146	98.756585	174.049268	65.907805	53.724878	2555.565854	126.907317	10.142537	25.219512	30.751220
std	1.245307	6.021776	12.337289	2.145204	2.443522	520.680204	41.642693	3.972040	6.542142	6.886443
min	-2.000000	86.600000	141.100000	60.300000	47.800000	1488.000000	61.000000	7.000000	13.000000	16.000000
25%	0.000000	94.500000	166.300000	64.100000	52.000000	2145.000000	97.000000	8.600000	19.000000	25.000000
50%	1.000000	97.000000	173.200000	65.500000	54.100000	2414.000000	120.000000	9.000000	24.000000	30.000000
75%	2.000000	102.400000	183.100000	66.900000	55.500000	2935.000000	141.000000	9.400000	30.000000	34.000000
max	3.000000	120.900000	208.100000	72.300000	59.800000	4066.000000	326.000000	23.000000	49.000000	54.000000

Converting the price variable to a numeric type

In [58]:

data['price'] = pd.to_numeric(data['price'])

Identifying the continuous variables

In [59]:

cols = data.columns  # all column names
num_cols = data._get_numeric_data().columns
num_cols = list(num_cols)  # continuous variables
num_cols

Out[59]:

['symboling',
 'wheel-base',
 'length',
 'width',
 'height',
 'curb-weight',
 'engine-size',
 'compression-ratio',
 'city-mpg',
 'highway-mpg',
 'price']
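
`_get_numeric_data()` is a private pandas method, as the leading underscore signals. A sketch of the same selection with the public `select_dtypes` method, using a small made-up frame (`toy` and its values are illustrative):

```python
import pandas as pd

# Made-up frame with one numeric and one text column
toy = pd.DataFrame({
    "engine-size": [130, 152, 109],
    "make": ["alfa-romero", "alfa-romero", "audi"],
})

# include="number" keeps integer and float columns
toy_num_cols = toy.select_dtypes(include="number").columns.tolist()
print(toy_num_cols)
```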

Identifying the categorical variables

In [60]:

cate_cols = list(set(cols) - set(num_cols))  # categorical variables
cate_cols

Out[60]:

['engine-location',
 'fuel-system',
 'make',
 'stroke',
 'num-of-cylinders',
 'fuel-type',
 'aspiration',
 'peak-rpm',
 'bore',
 'normalized-losses',
 'engine-type',
 'horsepower',
 'body-style',
 'drive-wheels',
 'num-of-doors']
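
Several columns in this list, such as bore, stroke, horsepower, peak-rpm and normalized-losses, hold numbers in the preview above; they appear to be typed as object only because of the "?" placeholders. A sketch of converting such columns with `pd.to_numeric`, on a made-up frame (`toy`, its values and the `numeric_like` list are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Made-up frame: numeric values stored as strings, with NaN for missing entries
toy = pd.DataFrame({
    "horsepower": ["111", np.nan, "154"],
    "peak-rpm": ["5000", "5500", np.nan],
    "make": ["alfa-romero", "audi", "audi"],
})

# Columns assumed to be numeric despite their object dtype
numeric_like = ["horsepower", "peak-rpm"]

# errors="coerce" turns anything unparseable into NaN instead of raising
for col in numeric_like:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
```

After such a conversion these columns would be picked up as continuous variables and could be considered as regression features.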

EDA of the continuous variables

In [61]:

data.hist(bins=50, figsize=(20,15))

Out[61]:

[Figure: 4 x 3 grid of histograms, one per numeric column]

EDA of the categorical variables

In [62]:

cate_data = data[cate_cols]
cate_data.columns

Out[62]:

Index(['engine-location', 'fuel-system', 'make', 'stroke', 'num-of-cylinders',
       'fuel-type', 'aspiration', 'peak-rpm', 'bore', 'normalized-losses',
       'engine-type', 'horsepower', 'body-style', 'drive-wheels',
       'num-of-doors'],
      dtype='object')

In [181]:

# cate_cols was built from a set, so its order is not stable; select 'make' by name
cate_data['make'].value_counts().plot(kind="bar")

Out[181]:

[Figure: bar chart of the number of cars per make]
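
The same count can be looked at for each categorical column in turn. A sketch on a made-up frame (the column values are illustrative); the plotting call is left out so the snippet runs without a display:

```python
import pandas as pd

# Made-up categorical columns
toy = pd.DataFrame({
    "fuel-type": ["gas", "gas", "diesel", "gas"],
    "body-style": ["sedan", "hatchback", "sedan", "sedan"],
})

# Collect the frequency table of every column
counts = {col: toy[col].value_counts() for col in toy.columns}

for col, freq in counts.items():
    print(col)
    print(freq)
```

Calling `.plot(kind="bar")` on any of these Series would draw the corresponding bar chart.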

Building the model

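1) Select a subset of columns (continuous variables only, since this is a regression model)
2) Handle missing values
3) Select the variables to use in the model
4) Split into training and test sets
5) Train the model
6) Validate the model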

1) Selecting a subset of columns (continuous only)

In [106]:

num_data = data._get_numeric_data()
num_data.head()

Out[106]:

	symboling	wheel-base	length	width	height	curb-weight	engine-size	compression-ratio	city-mpg	highway-mpg	price
0	3	88.6	168.8	64.1	48.8	2548	130	9.0	21	27	13495.0
1	3	88.6	168.8	64.1	48.8	2548	130	9.0	21	27	16500.0
2	1	94.5	171.2	65.5	52.4	2823	152	9.0	19	26	16500.0
3	2	99.8	176.6	66.2	54.3	2337	109	10.0	24	30	13950.0
4	2	99.4	176.6	66.4	54.3	2824	136	8.0	18	22	17450.0

2) Handling missing values

In [107]:

def cnt_NA(df):
    # Print the count and ratio of missing values for every column that has any
    for col in df.columns:
        na = df[col].isnull().sum()
        if na != 0:
            print(col + ":" + str(na) + ", NA_ratio:" + str(na / len(df)))
    print("NA test end")

In [79]:

cnt_NA(num_data)

price:4, NA_ratio:0.019512195122
NA test end

Dropping missing values

In [110]:

num_data = num_data.dropna(axis=0, how='any')
cnt_NA(num_data)

NA test end
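
The per-column loop in `cnt_NA` can also be written with vectorized pandas calls. A sketch, assuming a made-up frame with a few missing prices (`toy` and `na_summary` are illustrative names):

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing prices
toy = pd.DataFrame({
    "engine-size": [130, 152, 109, 136],
    "price": [13495.0, np.nan, 13950.0, np.nan],
})

# Count and share of missing values per column
na_summary = pd.DataFrame({
    "na_count": toy.isnull().sum(),
    "na_ratio": toy.isnull().mean(),
})

# Keep only the columns that have at least one missing value
na_summary = na_summary[na_summary["na_count"] > 0]
print(na_summary)

# Drop every row that contains a missing value, as dropna(how='any') does
toy_clean = toy.dropna(axis=0, how="any")
print(len(toy_clean))
```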

3) Selecting the data to use in the model

We simply use all of the continuous variables: num_data

4) Splitting into training and test sets

We use the function that scikit-learn provides for this:
from sklearn.model_selection import train_test_split

In [133]:

from sklearn.model_selection import train_test_split

train, test = train_test_split(num_data, test_size=0.2)
print(len(train), len(test))

160 41

In [134]:

test.head()

Out[134]:

	symboling	wheel-base	length	width	height	curb-weight	engine-size	compression-ratio	city-mpg	highway-mpg	price
196	-2	104.3	188.8	67.2	56.2	2935	141	9.5	24	28	15985.0
92	1	94.5	165.3	63.8	54.5	1938	97	9.4	31	37	6849.0
87	1	96.3	172.4	65.4	51.6	2403	110	7.5	23	30	9279.0
135	2	99.1	186.6	66.5	56.1	2758	121	9.3	21	28	15510.0
128	3	89.5	168.9	65.0	51.6	2800	194	9.5	17	25	37028.0

In [135]:

train_x = train.iloc[:,:-1]
train_y = train.iloc[:, -1]

In [136]:

train_x.head()

Out[136]:

	symboling	wheel-base	length	width	height	curb-weight	engine-size	compression-ratio	city-mpg	highway-mpg
83	3	95.9	173.2	66.3	50.2	2921	156	7.0	19	24
70	-1	115.6	202.6	71.7	56.3	3770	183	21.5	22	25
185	2	97.3	171.7	65.5	55.7	2212	109	9.0	27	34
62	0	98.8	177.8	66.5	55.5	2410	122	8.6	26	32
198	-2	104.3	188.8	67.2	56.2	3045	130	7.5	17	22

In [138]:

train_y.head()

Out[138]:

83     14869.0
70     31600.0
185     8195.0
62     10245.0
198    18420.0
Name: price, dtype: float64
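
Because `train_test_split` shuffles randomly, the rows shown above will differ from run to run. A sketch of a reproducible split that also separates the target by column name rather than by position; the synthetic frame and the seed value 0 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 201 rows, the number left after dropping missing prices
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "engine-size": rng.integers(60, 330, size=201),
    "curb-weight": rng.integers(1400, 4100, size=201),
    "price": rng.uniform(5000, 45000, size=201),
})

# Split features and target by name, so column order does not matter
X = toy.drop(columns="price")
y = toy["price"]

# random_state fixes the shuffle; 0 is an arbitrary choice
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))
```

With 201 rows and `test_size=0.2` the split sizes match the 160 and 41 rows reported above.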

Training the regression

In [158]:

train_x = np.asarray(train_x)
train_y = np.asarray(train_y)

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train_x, train_y)

Out[158]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [159]:

lr.coef_

Out[159]:

array([   88.64668674,   -51.52298832,  -128.1369366 ,  1099.27904606,
         147.23503498,     2.64572556,   114.26626391,   168.08428786,
        -450.23937409,   245.84972674])

In [160]:

lr.intercept_

Out[160]:

-58987.916226805588
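
The coefficient array is in the same order as the feature columns, so pairing the two makes it easier to read. A sketch on made-up data where the true relationship is known (y = 2*a + 3*b + 1); the column names `a` and `b` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up features and a target built as y = 2*a + 3*b + 1
X = pd.DataFrame({
    "a": [0.0, 1.0, 2.0, 3.0, 4.0],
    "b": [1.0, 0.0, 2.0, 5.0, 3.0],
})
y = 2 * X["a"] + 3 * X["b"] + 1

model = LinearRegression()
model.fit(X, y)

# Label each coefficient with its feature name
coef = pd.Series(model.coef_, index=X.columns)
print(coef)
print(model.intercept_)
```

Coefficient sizes depend on each feature's units, so a large value does not by itself indicate a more important feature.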

Testing

In [161]:

# Convert to arrays to match the array inputs the model was fitted on
test_x = np.asarray(test.iloc[:, :-1])
test_y = np.asarray(test.iloc[:, -1])

In [162]:

y_pred = lr.predict(test_x)

In [163]:

from sklearn.metrics import mean_squared_error
mean_squared_error(test_y, y_pred)

Out[163]:

13039778.151573779

Ridge regression

In [164]:

from sklearn.linear_model import Ridge
clf = Ridge(alpha=1.0)
clf.fit(train_x, train_y)

Out[164]:

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
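
The ridge penalty acts on the size of the coefficients, so it is sensitive to the scale of each feature. A sketch that standardizes the features before `Ridge`, on synthetic data; the pipeline, the data, the value `alpha=10.0` and the comparison with plain least squares are illustrative additions, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on very different scales plus noise
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(100, 6, size=200),     # roughly wheel-base sized values
    rng.normal(2500, 500, size=200),  # roughly curb-weight sized values
])
y = 50 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 1000, size=200)

# Standardize, then fit each model
ols = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
ridge = make_pipeline(StandardScaler(), Ridge(alpha=10.0)).fit(X, y)

ols_coef = ols.named_steps["linearregression"].coef_
ridge_coef = ridge.named_steps["ridge"].coef_

# Ridge coefficients are shrunk toward zero relative to least squares
print(np.linalg.norm(ols_coef), np.linalg.norm(ridge_coef))
```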

In [167]:

y_pred = clf.predict(test_x)

In [174]:

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(test_y, y_pred)
mse

Out[174]:

13010424.150200721

In [175]:

np.sqrt(mse)  # RMSE, in the same units as price

Out[175]:

3606.9965553352999
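
The square root of the MSE is the root mean squared error (RMSE), which is expressed in the same units as price. A sketch of both quantities computed by hand with NumPy on made-up numbers (`y_true` and `y_hat` are illustrative):

```python
import numpy as np

# Made-up true values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# RMSE: square root of the MSE
rmse_manual = np.sqrt(mse_manual)

print(mse_manual, rmse_manual)
```

Read this way, the RMSE of about 3607 above is a rough measure of how far the ridge predictions fall from the actual prices on the test set.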

License: Attribution-NonCommercial-NoDerivatives
