[ML] Car Price Prediction with Regression Analysis

Author : tmlab / Date : 2017. 12. 29. 19:18 / Category : Analytics

Car Price Prediction

Loading the Data

In [1]:
import pandas as pd
data = pd.read_csv('Automobile_data_.csv')
In [2]:
data.head()
Out[2]:
   symboling  normalized-losses         make  fuel-type  aspiration  num-of-doors   body-style  drive-wheels  engine-location  wheel-base  ...  engine-size  fuel-system  bore  stroke  compression-ratio  horsepower  peak-rpm  city-mpg  highway-mpg  price
0          3                  ?  alfa-romero        gas         std           two  convertible           rwd            front        88.6  ...          130         mpfi  3.47    2.68                9.0         111      5000        21           27  13495
1          3                  ?  alfa-romero        gas         std           two  convertible           rwd            front        88.6  ...          130         mpfi  3.47    2.68                9.0         111      5000        21           27  16500
2          1                  ?  alfa-romero        gas         std           two    hatchback           rwd            front        94.5  ...          152         mpfi  2.68    3.47                9.0         154      5000        19           26  16500
3          2                164         audi        gas         std          four        sedan           fwd            front        99.8  ...          109         mpfi  3.19     3.4               10.0         102      5500        24           30  13950
4          2                164         audi        gas         std          four        sedan           4wd            front        99.4  ...          136         mpfi  3.19     3.4                8.0         115      5500        18           22  17450

5 rows × 26 columns

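  • Replace the missing-value marker: "?" -> NaN
In [65]:
import numpy as np
data = data.replace('?', np.nan)  # np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0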
data.head()
Out[65]:
   symboling  normalized-losses         make  fuel-type  aspiration  num-of-doors   body-style  drive-wheels  engine-location  wheel-base  ...  engine-size  fuel-system  bore  stroke  compression-ratio  horsepower  peak-rpm  city-mpg  highway-mpg    price
0          3                NaN  alfa-romero        gas         std           two  convertible           rwd            front        88.6  ...          130         mpfi  3.47    2.68                9.0         111      5000        21           27  13495.0
1          3                NaN  alfa-romero        gas         std           two  convertible           rwd            front        88.6  ...          130         mpfi  3.47    2.68                9.0         111      5000        21           27  16500.0
2          1                NaN  alfa-romero        gas         std           two    hatchback           rwd            front        94.5  ...          152         mpfi  2.68    3.47                9.0         154      5000        19           26  16500.0
3          2                164         audi        gas         std          four        sedan           fwd            front        99.8  ...          109         mpfi  3.19     3.4               10.0         102      5500        24           30  13950.0
4          2                164         audi        gas         std          four        sedan           4wd            front        99.4  ...          136         mpfi  3.19     3.4                8.0         115      5500        18           22  17450.0

5 rows × 26 columns
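
The "?" markers can also be turned into NaN while the file is being read, instead of in a separate replace step. The sketch below is a possible alternative, not the approach used in this post: it assumes the same "?" convention and reads a tiny in-memory CSV (the last row is a made-up example) through the `na_values` parameter of `read_csv`.

```python
import io
import pandas as pd

# Tiny stand-in for the CSV file; the last row is a made-up example
# that uses "?" as the missing-value marker.
csv_text = """make,horsepower,price
alfa-romero,111,13495
audi,102,13950
audi,?,?
"""

# na_values tells read_csv to treat "?" as missing while parsing,
# so the affected columns are read as numeric instead of object.
sample = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(sample.dtypes)
print(sample.isnull().sum())
```

With this option, columns such as price are parsed as floats directly, so a later `pd.to_numeric` call would not be needed for them.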

In [53]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
symboling            205 non-null int64
normalized-losses    205 non-null object
make                 205 non-null object
fuel-type            205 non-null object
aspiration           205 non-null object
num-of-doors         205 non-null object
body-style           205 non-null object
drive-wheels         205 non-null object
engine-location      205 non-null object
wheel-base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
curb-weight          205 non-null int64
engine-type          205 non-null object
num-of-cylinders     205 non-null object
engine-size          205 non-null int64
fuel-system          205 non-null object
bore                 205 non-null object
stroke               205 non-null object
compression-ratio    205 non-null float64
horsepower           205 non-null object
peak-rpm             205 non-null object
city-mpg             205 non-null int64
highway-mpg          205 non-null int64
price                205 non-null object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.7+ KB
In [54]:
data.describe()
Out[54]:
        symboling  wheel-base      length       width      height  curb-weight  engine-size  compression-ratio    city-mpg  highway-mpg
count  205.000000  205.000000  205.000000  205.000000  205.000000   205.000000   205.000000         205.000000  205.000000   205.000000
mean     0.834146   98.756585  174.049268   65.907805   53.724878  2555.565854   126.907317          10.142537   25.219512    30.751220
std      1.245307    6.021776   12.337289    2.145204    2.443522   520.680204    41.642693           3.972040    6.542142     6.886443
min     -2.000000   86.600000  141.100000   60.300000   47.800000  1488.000000    61.000000           7.000000   13.000000    16.000000
25%      0.000000   94.500000  166.300000   64.100000   52.000000  2145.000000    97.000000           8.600000   19.000000    25.000000
50%      1.000000   97.000000  173.200000   65.500000   54.100000  2414.000000   120.000000           9.000000   24.000000    30.000000
75%      2.000000  102.400000  183.100000   66.900000   55.500000  2935.000000   141.000000           9.400000   30.000000    34.000000
max      3.000000  120.900000  208.100000   72.300000   59.800000  4066.000000   326.000000          23.000000   49.000000    54.000000
  • Convert the price variable to a continuous (numeric) type
In [58]:
data['price'] = pd.to_numeric(data['price'])
  • Identify the continuous variables
In [59]:
cols = data.columns  # all column names
num_cols = data._get_numeric_data().columns
num_cols = list(num_cols)  # continuous (numeric) variables
num_cols
Out[59]:
['symboling',
 'wheel-base',
 'length',
 'width',
 'height',
 'curb-weight',
 'engine-size',
 'compression-ratio',
 'city-mpg',
 'highway-mpg',
 'price']
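
`_get_numeric_data()` is a private pandas method (note the leading underscore). The public `select_dtypes` method can make the same kind of selection. The sketch below shows it on a small illustrative frame built from values in the rows above; it is an alternative spelling, not the code used in this post.

```python
import pandas as pd

# Small illustrative frame with the same mix of dtypes as the car data.
df = pd.DataFrame({
    "symboling": [3, 1, 2],
    "make": ["alfa-romero", "alfa-romero", "audi"],
    "wheel-base": [88.6, 94.5, 99.8],
    "price": [13495.0, 16500.0, 13950.0],
})

# Public-API way to pick the numeric (int and float) columns.
num_cols = df.select_dtypes(include="number").columns.tolist()
print(num_cols)
```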
  • Identify the categorical variables
In [60]:
cate_cols = list(set(cols) - set(num_cols))  # categorical variables (set difference, so the order is arbitrary)
cate_cols
Out[60]:
['engine-location',
 'fuel-system',
 'make',
 'stroke',
 'num-of-cylinders',
 'fuel-type',
 'aspiration',
 'peak-rpm',
 'bore',
 'normalized-losses',
 'engine-type',
 'horsepower',
 'body-style',
 'drive-wheels',
 'num-of-doors']
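
Several columns in this list (for example bore, stroke, horsepower, peak-rpm and normalized-losses) hold numbers but were read as object because of the "?" markers. This post leaves them out of the model. As a possible extension, the sketch below shows how such columns could be converted with `pd.to_numeric`; the frame is a small illustration using values from the rows above, with None standing in for a missing entry.

```python
import pandas as pd

# Object-typed columns that actually contain numbers stored as strings.
df = pd.DataFrame({
    "make": ["alfa-romero", "audi", "audi"],
    "horsepower": ["111", "102", None],
    "bore": ["3.47", "3.19", "3.19"],
})

numeric_like = ["horsepower", "bore"]

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
df[numeric_like] = df[numeric_like].apply(pd.to_numeric, errors="coerce")

print(df.dtypes)
```

After such a conversion the columns would show up among the numeric columns and could be considered as additional regression features.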
  • EDA of the continuous variables
In [61]:
data.hist(bins=50, figsize=(20,15))
Out[61]:
[Figure: histograms of the numeric columns, laid out in a 4 x 3 grid]
  • EDA of the categorical variables
In [62]:
cate_data = data[cate_cols]
cate_data.columns
Out[62]:
Index(['engine-location', 'fuel-system', 'make', 'stroke', 'num-of-cylinders',
       'fuel-type', 'aspiration', 'peak-rpm', 'bore', 'normalized-losses',
       'engine-type', 'horsepower', 'body-style', 'drive-wheels',
       'num-of-doors'],
      dtype='object')
In [181]:
cate_data['make'].value_counts().plot(kind="bar")  # 'make' was cate_cols[2] in this run; set ordering is not guaranteed
Out[181]:
[Figure: bar chart of the number of cars per make]

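Building the Model

  • 1) Select a subset of columns (continuous variables only, since this is a regression model)
  • 2) Handle missing values
  • 3) Re-select only the variables to use in the model
  • 4) Split into training and test sets
  • 5) Train the model
  • 6) Validate the model

1) Select a subset of columns (continuous only)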

In [106]:
num_data = data._get_numeric_data()
num_data.head()
Out[106]:
   symboling  wheel-base  length  width  height  curb-weight  engine-size  compression-ratio  city-mpg  highway-mpg    price
0          3        88.6   168.8   64.1    48.8         2548          130                9.0        21           27  13495.0
1          3        88.6   168.8   64.1    48.8         2548          130                9.0        21           27  16500.0
2          1        94.5   171.2   65.5    52.4         2823          152                9.0        19           26  16500.0
3          2        99.8   176.6   66.2    54.3         2337          109               10.0        24           30  13950.0
4          2        99.4   176.6   66.4    54.3         2824          136                8.0        18           22  17450.0

2) Handle missing values

In [107]:
def cnt_NA(df):
    """Print the count and ratio of missing values for every column that has any."""
    for col in df.columns:
        na = df[col].isnull().sum()  # number of missing values in this column
        if na != 0:
            print(col + ":" + str(na) + ", NA_ratio:" + str(na / len(df)))
    print("NA test end")
In [79]:
cnt_NA(num_data)
price:4, NA_ratio:0.019512195122
NA test end
  • Remove the rows with missing values
In [110]:
num_data = num_data.dropna(axis=0, how='any')
cnt_NA(num_data)
NA test end
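
The same per-column summary can be produced without an explicit loop. The sketch below is an illustration on a small frame (one missing price out of four rows, values taken from the tables above), not the author's code: `isnull().sum()` gives the count, `isnull().mean()` gives the ratio, and `dropna(subset=...)` drops only the rows whose target value is missing.

```python
import numpy as np
import pandas as pd

# Illustrative frame: one missing price out of four rows.
df = pd.DataFrame({
    "engine-size": [130, 152, 109, 136],
    "price": [13495.0, np.nan, 13950.0, 17450.0],
})

na_count = df.isnull().sum()    # missing values per column
na_ratio = df.isnull().mean()   # fraction of missing values per column
print(pd.DataFrame({"count": na_count, "ratio": na_ratio}))

# Drop only the rows whose target value (price) is missing.
clean = df.dropna(subset=["price"])
print(len(df), "->", len(clean))
```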

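3) Select only the data to use in the model

  • Simply use all of the continuous variables: num_data

4) Split into training and test sets

  • Use the helper function provided by scikit-learn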
  • from sklearn.model_selection import train_test_split
In [133]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(num_data, test_size=0.2)
print(len(train), len(test))
160 41
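
The split above is random, so the rows in each set (and the error figures later on) change from run to run. A minimal sketch of a reproducible split follows, assuming stand-in random data with the same number of rows as num_data (201) and an arbitrary seed of 42. Passing features and target together also yields the x and y parts in one call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as num_data after dropna:
# 201 rows, 10 feature columns plus a target.
rng = np.random.default_rng(0)
X = rng.normal(size=(201, 10))
y = rng.normal(size=201)

# random_state (an arbitrary value here) fixes the shuffle so the
# split is identical on every run.
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(train_x), len(test_x))
```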
In [134]:
test.head()
Out[134]:
     symboling  wheel-base  length  width  height  curb-weight  engine-size  compression-ratio  city-mpg  highway-mpg    price
196         -2       104.3   188.8   67.2    56.2         2935          141                9.5        24           28  15985.0
 92          1        94.5   165.3   63.8    54.5         1938           97                9.4        31           37   6849.0
 87          1        96.3   172.4   65.4    51.6         2403          110                7.5        23           30   9279.0
135          2        99.1   186.6   66.5    56.1         2758          121                9.3        21           28  15510.0
128          3        89.5   168.9   65.0    51.6         2800          194                9.5        17           25  37028.0
In [135]:
train_x = train.iloc[:,:-1]
train_y = train.iloc[:, -1]
In [136]:
train_x.head()
Out[136]:
     symboling  wheel-base  length  width  height  curb-weight  engine-size  compression-ratio  city-mpg  highway-mpg
 83          3        95.9   173.2   66.3    50.2         2921          156                7.0        19           24
 70         -1       115.6   202.6   71.7    56.3         3770          183               21.5        22           25
185          2        97.3   171.7   65.5    55.7         2212          109                9.0        27           34
 62          0        98.8   177.8   66.5    55.5         2410          122                8.6        26           32
198         -2       104.3   188.8   67.2    56.2         3045          130                7.5        17           22
In [138]:
train_y.head()
Out[138]:
83     14869.0
70     31600.0
185     8195.0
62     10245.0
198    18420.0
Name: price, dtype: float64

Training the Regression Model

In [158]:
train_x = np.asarray(train_x)
train_y = np.asarray(train_y)

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(train_x, train_y)
Out[158]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [159]:
lr.coef_
Out[159]:
array([   88.64668674,   -51.52298832,  -128.1369366 ,  1099.27904606,
         147.23503498,     2.64572556,   114.26626391,   168.08428786,
        -450.23937409,   245.84972674])
In [160]:
lr.intercept_
Out[160]:
-58987.916226805588
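
The fitted coefficients and intercept define the prediction: the predicted price is the intercept plus the sum of each feature value times its coefficient. The sketch below checks this on synthetic stand-in data (not the car dataset) by comparing the manual formula with `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (not the car dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + rng.normal(scale=0.1, size=50)

lr = LinearRegression().fit(X, y)

# A prediction is the intercept plus the dot product of the
# feature values and the coefficients.
manual = X @ lr.coef_ + lr.intercept_
print(np.allclose(manual, lr.predict(X)))
```

Because the features are on different scales (curb-weight is in the thousands, symboling ranges from -2 to 3), the raw coefficient magnitudes above are not directly comparable as measures of importance.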

Testing

In [161]:
test_x = test.iloc[:,:-1]
test_y = test.iloc[:, -1]
In [162]:
y_pred = lr.predict(test_x)
In [163]:
from sklearn.metrics import mean_squared_error
mean_squared_error(test_y, y_pred)
Out[163]:
13039778.151573779
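
The mean squared error is the average of the squared differences between the true and predicted prices, so it is expressed in squared price units. Its square root (RMSE) is in the same units as price; for the value above that is roughly 3,611. A minimal sketch of both computations with NumPy, using toy numbers for the first part:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Toy values for illustration only.
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]
print(mse(y_true, y_pred))   # (4 + 4 + 9) / 3

# RMSE of the linear model, from the MSE reported above.
rmse_linear = np.sqrt(13039778.151573779)
print(rmse_linear)
```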

Ridge Regression

In [164]:
from sklearn.linear_model import Ridge
clf = Ridge(alpha=1.0)
clf.fit(train_x, train_y) 
Out[164]:
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
In [167]:
y_pred = clf.predict(test_x)
In [174]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(test_y, y_pred)
mse
Out[174]:
13010424.150200721
In [175]:
np.sqrt(mse)
Out[175]:
3606.9965553352999
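
The ridge model above uses a fixed alpha=1.0 on unscaled features. Since the ridge penalty depends on the scale of each feature, one possible refinement is to standardize the features and choose alpha by cross-validation. The sketch below does this with a scikit-learn pipeline on synthetic stand-in data; the alpha grid is an arbitrary assumption and the result says nothing about the car dataset itself.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y = X @ np.array([3.0, 0.5, 0.02, 0.001]) + rng.normal(size=200)

# Candidate penalties: an arbitrary illustrative grid.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

# Standardize the features, then let RidgeCV pick alpha by cross-validation.
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
model.fit(X, y)

best_alpha = model.named_steps["ridgecv"].alpha_
print(best_alpha)
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so the same transformation is applied consistently when predicting on the test set.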

