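Converting scikit-learn's NumPy data to pandas
NumPy is a Python library for efficient numerical computation, and pandas is a library that provides data-analysis tools.
Loading NumPy data into pandas makes it easier to inspect, because the values are shown as a table with labelled rows and columns.
Here we take a dataset provided by scikit-learn (Boston housing prices) and convert its NumPy data to pandas format.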
Source code
    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame
    import matplotlib.pyplot as plt
    %matplotlib inline
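    # Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
    # so this cell needs an older scikit-learn release.
    from sklearn.datasets import load_boston
    data = load_boston()
    print(data.DESCR)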
    Boston House Prices dataset
    ===========================

    Notes
    ------
    Data Set Characteristics:

        :Number of Instances: 506

        :Number of Attributes: 13 numeric/categorical predictive

        :Median Value (attribute 14) is usually the target

        :Attribute Information (in order):
            - CRIM     per capita crime rate by town
            - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
            - INDUS    proportion of non-retail business acres per town
            - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
            - NOX      nitric oxides concentration (parts per 10 million)
            - RM       average number of rooms per dwelling
            - AGE      proportion of owner-occupied units built prior to 1940
            - DIS      weighted distances to five Boston employment centres
            - RAD      index of accessibility to radial highways
            - TAX      full-value property-tax rate per $10,000
            - PTRATIO  pupil-teacher ratio by town
            - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
            - LSTAT    % lower status of the population
            - MEDV     Median value of owner-occupied homes in $1000's

        :Missing Attribute Values: None

        :Creator: Harrison, D. and Rubinfeld, D.L.
The description lists 14 attributes: 13 predictive features plus MEDV, which is normally used as the target label. The dataset contains 506 instances.
    print(data.feature_names)
    print(data.feature_names.shape)
    ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
     'B' 'LSTAT']
    (13,)
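First, we print the feature names of the housing data, which are stored as a NumPy array. There are 13 of them because the target (MEDV) is kept separately.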
    print(data.data)
    print(data.data.shape)
    [[  6.32000000e-03   1.80000000e+01   2.31000000e+00 ...,   1.53000000e+01
        3.96900000e+02   4.98000000e+00]
     [  2.73100000e-02   0.00000000e+00   7.07000000e+00 ...,   1.78000000e+01
        3.96900000e+02   9.14000000e+00]
     [  2.72900000e-02   0.00000000e+00   7.07000000e+00 ...,   1.78000000e+01
        3.92830000e+02   4.03000000e+00]
     ...,
     [  6.07600000e-02   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
        3.96900000e+02   5.64000000e+00]
     [  1.09590000e-01   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
        3.93450000e+02   6.48000000e+00]
     [  4.74100000e-02   0.00000000e+00   1.19300000e+01 ...,   2.10000000e+01
        3.96900000e+02   7.88000000e+00]]
    (506, 13)
Next, we print the actual housing data as a NumPy array. It has 506 rows and 13 columns, and the raw array is hard to read at that size, so we convert it to a pandas DataFrame.
    data_df = DataFrame(data.data)
    data_df.columns = data.feature_names
    data_df.head()
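Passing the NumPy array to DataFrame(...) gives a table that is much easier to read. At this point the table only holds the feature data, though, so we also add the target label as a column.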
    data_df['Price'] = data.target
    data_df.head()
The target label has been added as the Price column.
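The steps so far (wrap the array, name the columns, attach the target) can also be written more compactly. The following is only a sketch: it uses a tiny stand-in array built from the first three RM and LSTAT values and the first three prices shown in this article, not the full dataset, and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in data: first three rows of the RM and LSTAT columns and of the target,
# copied from the outputs in this article (not the full 506-row dataset).
features = np.array([[6.575, 4.98],
                     [6.421, 9.14],
                     [7.185, 4.03]])
feature_names = ["RM", "LSTAT"]
target = np.array([24.0, 21.6, 34.7])

# Pass the column names at construction time instead of assigning .columns later.
df = pd.DataFrame(features, columns=feature_names)

# assign() returns a new DataFrame with the extra column and leaves df unchanged.
df_with_price = df.assign(Price=target)

print(df_with_price)
```

Whether to mutate the DataFrame in place, as above with data_df['Price'] = ..., or to build a new one with assign() is mostly a matter of taste; assign() is handy when the feature-only table is still needed afterwards.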
    data.data[:, 5]
array([ 6.575, 6.421, 7.185, 6.998, 7.147, 6.43 , 6.012, 6.172, 5.631, 6.004, 6.377, 6.009, 5.889, 5.949, 6.096, 5.834, 5.935, 5.99 , 5.456, 5.727, 5.57 , 5.965, 6.142, 5.813, 5.924, 5.599, 5.813, 6.047, 6.495, 6.674, 5.713, 6.072, 5.95 , 5.701, 6.096, 5.933, 5.841, 5.85 , 5.966, 6.595, 7.024, 6.77 , 6.169, 6.211, 6.069, 5.682, 5.786, 6.03 , 5.399, 5.602, 5.963, 6.115, 6.511, 5.998, 5.888, 7.249, 6.383, 6.816, 6.145, 5.927, 5.741, 5.966, 6.456, 6.762, 7.104, 6.29 , 5.787, 5.878, 5.594, 5.885, 6.417, 5.961, 6.065, 6.245, 6.273, 6.286, 6.279, 6.14 , 6.232, 5.874, 6.727, 6.619, 6.302, 6.167, 6.389, 6.63 , 6.015, 6.121, 7.007, 7.079, 6.417, 6.405, 6.442, 6.211, 6.249, 6.625, 6.163, 8.069, 7.82 , 7.416, 6.727, 6.781, 6.405, 6.137, 6.167, 5.851, 5.836, 6.127, 6.474, 6.229, 6.195, 6.715, 5.913, 6.092, 6.254, 5.928, 6.176, 6.021, 5.872, 5.731, 5.87 , 6.004, 5.961, 5.856, 5.879, 5.986, 5.613, 5.693, 6.431, 5.637, 6.458, 6.326, 6.372, 5.822, 5.757, 6.335, 5.942, 6.454, 5.857, 6.151, 6.174, 5.019, 5.403, 5.468, 4.903, 6.13 , 5.628, 4.926, 5.186, 5.597, 6.122, 5.404, 5.012, 5.709, 6.129, 6.152, 5.272, 6.943, 6.066, 6.51 , 6.25 , 7.489, 7.802, 8.375, 5.854, 6.101, 7.929, 5.877, 6.319, 6.402, 5.875, 5.88 , 5.572, 6.416, 5.859, 6.546, 6.02 , 6.315, 6.86 , 6.98 , 7.765, 6.144, 7.155, 6.563, 5.604, 6.153, 7.831, 6.782, 6.556, 7.185, 6.951, 6.739, 7.178, 6.8 , 6.604, 7.875, 7.287, 7.107, 7.274, 6.975, 7.135, 6.162, 7.61 , 7.853, 8.034, 5.891, 6.326, 5.783, 6.064, 5.344, 5.96 , 5.404, 5.807, 6.375, 5.412, 6.182, 5.888, 6.642, 5.951, 6.373, 6.951, 6.164, 6.879, 6.618, 8.266, 8.725, 8.04 , 7.163, 7.686, 6.552, 5.981, 7.412, 8.337, 8.247, 6.726, 6.086, 6.631, 7.358, 6.481, 6.606, 6.897, 6.095, 6.358, 6.393, 5.593, 5.605, 6.108, 6.226, 6.433, 6.718, 6.487, 6.438, 6.957, 8.259, 6.108, 5.876, 7.454, 8.704, 7.333, 6.842, 7.203, 7.52 , 8.398, 7.327, 7.206, 5.56 , 7.014, 8.297, 7.47 , 5.92 , 5.856, 6.24 , 6.538, 7.691, 6.758, 6.854, 7.267, 6.826, 6.482, 6.812, 7.82 , 6.968, 7.645, 7.923, 
7.088, 6.453, 6.23 , 6.209, 6.315, 6.565, 6.861, 7.148, 6.63 , 6.127, 6.009, 6.678, 6.549, 5.79 , 6.345, 7.041, 6.871, 6.59 , 6.495, 6.982, 7.236, 6.616, 7.42 , 6.849, 6.635, 5.972, 4.973, 6.122, 6.023, 6.266, 6.567, 5.705, 5.914, 5.782, 6.382, 6.113, 6.426, 6.376, 6.041, 5.708, 6.415, 6.431, 6.312, 6.083, 5.868, 6.333, 6.144, 5.706, 6.031, 6.316, 6.31 , 6.037, 5.869, 5.895, 6.059, 5.985, 5.968, 7.241, 6.54 , 6.696, 6.874, 6.014, 5.898, 6.516, 6.635, 6.939, 6.49 , 6.579, 5.884, 6.728, 5.663, 5.936, 6.212, 6.395, 6.127, 6.112, 6.398, 6.251, 5.362, 5.803, 8.78 , 3.561, 4.963, 3.863, 4.97 , 6.683, 7.016, 6.216, 5.875, 4.906, 4.138, 7.313, 6.649, 6.794, 6.38 , 6.223, 6.968, 6.545, 5.536, 5.52 , 4.368, 5.277, 4.652, 5. , 4.88 , 5.39 , 5.713, 6.051, 5.036, 6.193, 5.887, 6.471, 6.405, 5.747, 5.453, 5.852, 5.987, 6.343, 6.404, 5.349, 5.531, 5.683, 4.138, 5.608, 5.617, 6.852, 5.757, 6.657, 4.628, 5.155, 4.519, 6.434, 6.782, 5.304, 5.957, 6.824, 6.411, 6.006, 5.648, 6.103, 5.565, 5.896, 5.837, 6.202, 6.193, 6.38 , 6.348, 6.833, 6.425, 6.436, 6.208, 6.629, 6.461, 6.152, 5.935, 5.627, 5.818, 6.406, 6.219, 6.485, 5.854, 6.459, 6.341, 6.251, 6.185, 6.417, 6.749, 6.655, 6.297, 7.393, 6.728, 6.525, 5.976, 5.936, 6.301, 6.081, 6.701, 6.376, 6.317, 6.513, 6.209, 5.759, 5.952, 6.003, 5.926, 5.713, 6.167, 6.229, 6.437, 6.98 , 5.427, 6.162, 6.484, 5.304, 6.185, 6.229, 6.242, 6.75 , 7.061, 5.762, 5.871, 6.312, 6.114, 5.905, 5.454, 5.414, 5.093, 5.983, 5.983, 5.707, 5.926, 5.67 , 5.39 , 5.794, 6.019, 5.569, 6.027, 6.593, 6.12 , 6.976, 6.794, 6.03 ]) |
Individual columns of the NumPy array can be accessed as well. For example, the number of rooms (RM) is the sixth feature, so it is selected with data.data[:, 5].
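Hard-coding the index 5 works, but it is easy to get wrong. One possible alternative is to look the position up from the feature names. The sketch below reuses the feature names printed earlier; the demo array is made up purely to show the slice and stands in for data.data.

```python
import numpy as np

# Feature names exactly as printed earlier in this article.
feature_names = np.array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                          'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])

# Find the column position of 'RM' by name instead of hard-coding it.
rm_index = int(np.where(feature_names == 'RM')[0][0])
print(rm_index)  # 5

# Made-up 2x13 array standing in for data.data, just to demonstrate the slice.
demo = np.arange(26, dtype=float).reshape(2, 13)
rm_column = demo[:, rm_index]
print(rm_column)
```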
    plt.scatter(data.data[:, 5], data.target)
    plt.ylabel("Price ($1,000)")
    plt.xlabel("Number of rooms")
Plotting the number of rooms on the X axis and the price on the Y axis in this way produces a scatter plot that shows how the two are correlated.
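The relationship seen in the scatter plot can also be summarised as a single number with NumPy's corrcoef. This is a rough sketch that uses only the first five RM and price values shown in this article, so the result says nothing reliable about the full dataset; with the real data you would pass data.data[:, 5] and data.target instead.

```python
import numpy as np

# First five RM values and prices, copied from the outputs in this article.
rooms = np.array([6.575, 6.421, 7.185, 6.998, 7.147])
price = np.array([24.0, 21.6, 34.7, 33.4, 36.2])

# corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is Pearson's r.
r = np.corrcoef(rooms, price)[0, 1]
print(f"Pearson r (5-row sample): {r:.3f}")
```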
Plotting by column name with pandas
With pandas, you can also draw a plot by referring to columns by name.
    data_df.RM
    0      6.575
    1      6.421
    2      7.185
    3      6.998
    4      7.147
    5      6.430
    6      6.012
    7      6.172
    8      5.631
    9      6.004
    10     6.377
    11     6.009
    12     5.889
    13     5.949
    14     6.096
    15     5.834
    16     5.935
    17     5.990
    18     5.456
    19     5.727
    20     5.570
    21     5.965
    22     6.142
    23     5.813
    24     5.924
    25     5.599
    26     5.813
    27     6.047
    28     6.495
    29     6.674
           ...
    476    6.484
    477    5.304
    478    6.185
    479    6.229
    480    6.242
    481    6.750
    482    7.061
    483    5.762
    484    5.871
    485    6.312
    486    6.114
    487    5.905
    488    5.454
    489    5.414
    490    5.093
    491    5.983
    492    5.983
    493    5.707
    494    5.926
    495    5.670
    496    5.390
    497    5.794
    498    6.019
    499    5.569
    500    6.027
    501    6.593
    502    6.120
    503    6.976
    504    6.794
    505    6.030
    Name: RM, Length: 506, dtype: float64
    data_df.Price
    0      24.0
    1      21.6
    2      34.7
    3      33.4
    4      36.2
    5      28.7
    6      22.9
    7      27.1
    8      16.5
    9      18.9
    10     15.0
    11     18.9
    12     21.7
    13     20.4
    14     18.2
    15     19.9
    16     23.1
    17     17.5
    18     20.2
    19     18.2
    20     13.6
    21     19.6
    22     15.2
    23     14.5
    24     15.6
    25     13.9
    26     16.6
    27     14.8
    28     18.4
    29     21.0
           ...
    476    16.7
    477    12.0
    478    14.6
    479    21.4
    480    23.0
    481    23.7
    482    25.0
    483    21.8
    484    20.6
    485    21.2
    486    19.1
    487    20.6
    488    15.2
    489     7.0
    490     8.1
    491    13.6
    492    20.1
    493    21.8
    494    24.5
    495    23.1
    496    19.7
    497    18.3
    498    21.2
    499    17.5
    500    16.8
    501    22.4
    502    20.6
    503    23.9
    504    22.0
    505    11.9
    Name: Price, Length: 506, dtype: float64
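Attribute-style access such as data_df.RM is shorthand for bracket access, data_df['RM']. The bracket form is worth knowing because it also works for column names that are not valid Python identifiers. The sketch below uses a three-row stand-in DataFrame with values from this article; the "Price per room" column is a hypothetical name invented only to show a name containing spaces.

```python
import pandas as pd

# Three-row stand-in for data_df, using values shown in this article.
df = pd.DataFrame({"RM": [6.575, 6.421, 7.185],
                   "Price": [24.0, 21.6, 34.7]})

# Both forms return the same Series.
same = df["RM"].equals(df.RM)
print(same)

# A column name with spaces (hypothetical) can only be reached with brackets.
df["Price per room"] = df["Price"] / df["RM"]

# A list of names selects several columns at once and returns a DataFrame.
subset = df[["RM", "Price"]]
print(subset)
```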
Now that we have seen that each column can be accessed by name, let's draw the plot.
    plt.scatter(data_df.RM, data_df.Price)
In this way, pandas lets you draw a plot by specifying column names.
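pandas can also draw the plot itself through DataFrame.plot.scatter, where the columns are passed as strings. The sketch below is one way this might look, again using a five-row stand-in DataFrame with values from this article rather than the full data. The Agg backend line is only there so the script runs without a display; in a notebook with %matplotlib inline it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Five-row stand-in for data_df, using values shown in this article.
df = pd.DataFrame({"RM": [6.575, 6.421, 7.185, 6.998, 7.147],
                   "Price": [24.0, 21.6, 34.7, 33.4, 36.2]})

# Columns are chosen by name; the call returns a matplotlib Axes object.
ax = df.plot.scatter(x="RM", y="Price")

# The Axes can then be customised with the usual matplotlib methods.
ax.set_xlabel("Number of rooms")
ax.set_ylabel("Price ($1,000)")
```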