Reducing Data Dimensionality for Visualization in Python

1. Import the required modules

import numpy as np
from sklearn.datasets import fetch_openml

2. MNIST Dataset

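# as_frame=False returns NumPy arrays, which the DataFrame construction further down expects
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X = mnist.data / 255.0  # scale pixel values to the range [0, 1]
y = mnist.target
X.shape, y.shape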

Convert the data to a pandas DataFrame

import pandas as pd

feat_cols = ['pixel' + str(i) for i in range(X.shape[1])]

df = pd.DataFrame(X, columns=feat_cols)
df['label'] = y
df['label'] = df['label'].apply(lambda i: str(i))

X, y = None, None

print('Size of the dataframe: {}'.format(df.shape))

Because the examples in the DataFrame may be ordered by class, we need an index vector in random order to shuffle them.

rndperm = np.random.permutation(df.shape[0])
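
As a small illustration of what the permutation provides, the sketch below shuffles a toy DataFrame whose rows are sorted by label. The toy frame, the names `toy`, `perm`, and `subset`, and the fixed seed are assumptions added here for reproducibility. They are not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

# Toy frame sorted by class, standing in for the real data (illustrative only)
toy = pd.DataFrame({"label": ["0"] * 3 + ["1"] * 3 + ["2"] * 3})

rng = np.random.default_rng(0)         # fixed seed so the shuffle is repeatable
perm = rng.permutation(toy.shape[0])   # random ordering of the row labels 0..8

subset = toy.loc[perm[:4], :]          # first 4 rows in shuffled order
print(subset["label"].tolist())
```

Indexing with a slice of the permutation, as in `perm[:4]`, is the same pattern the tutorial uses later to draw a random subset of rows.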

Visualize random images

matshow displays a two-dimensional matrix or array as a color-mapped image.

%matplotlib inline
import matplotlib.pyplot as plt

# Plot the first 15 shuffled digits in a 3 x 5 grid
plt.gray()
fig = plt.figure(figsize=(16, 7))
for i in range(0, 15):
    ax = fig.add_subplot(3, 5, i + 1)
    # Each row holds 784 pixel values; reshape them into a 28 x 28 image
    ax.matshow(df.loc[rndperm[i], feat_cols].values.reshape((28, 28)).astype(float))
plt.show()

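3. PCA

PCA reduces the number of dimensions (features) in a dataset while preserving as much of its essential information (variance) as possible. We will focus on the first three principal components of the images.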

from sklearn.decomposition import PCA

pca = PCA(n_components=3)
pca_result = pca.fit_transform(df[feat_cols].values)

df['PC1'] = pca_result[:, 0]
df['PC2'] = pca_result[:, 1]
df['PC3'] = pca_result[:, 2]

print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))

The first three components explain about 23% of the variance in the original data.
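
To see how the explained variance accumulates as more components are kept, one option is to fit PCA with all components and take a cumulative sum of `explained_variance_ratio_`. The sketch below is an illustration under stated assumptions: it uses scikit-learn's small bundled digits dataset (`load_digits`, 8 x 8 images) rather than MNIST so that it runs without a download, and the 90% threshold is an arbitrary choice. Its numbers will therefore differ from the 23% reported above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Small bundled digits dataset (64 features), used as a stand-in for MNIST
X_small = load_digits().data / 16.0    # pixel values range from 0 to 16

pca_full = PCA().fit(X_small)          # keep every component
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 90%
n_90 = int(np.searchsorted(cumulative, 0.90) + 1)

print("Variance explained by the first 3 components: {:.3f}".format(cumulative[2]))
print("Components needed for 90% of the variance: {}".format(n_90))
```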

from ggplot import *

chart = ggplot( df.loc[rndperm[:3000],:], aes(x='PC1', y='PC2', color='label') ) \
+ geom_point(size=50, alpha=0.8) \
+ ggtitle("First and Second Principal Components colored by digit")
chart

<ggplot: (9387058)>
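
In environments where the `ggplot` package is unavailable or fails to import (it may not work with recent pandas releases), a similar chart can be drawn with matplotlib alone. The helper `scatter_by_label` below is a hypothetical name introduced for this sketch, and the random `demo` frame only stands in for the real data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def scatter_by_label(frame, x, y, label, title):
    """Draw one scatter series per distinct value in the `label` column."""
    fig, ax = plt.subplots(figsize=(10, 7))
    for value, group in frame.groupby(label):
        ax.scatter(group[x], group[y], s=15, alpha=0.8, label=str(value))
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.set_title(title)
    ax.legend(title=label)
    return ax


# Synthetic stand-in for the real DataFrame (illustrative only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "PC1": rng.normal(size=60),
    "PC2": rng.normal(size=60),
    "label": np.repeat(["0", "1", "2"], 20),
})
ax = scatter_by_label(demo, "PC1", "PC2", "label",
                      "First and Second Principal Components colored by digit")
```

With the tutorial's data, the call would look like `scatter_by_label(df.loc[rndperm[:3000], :], "PC1", "PC2", "label", "...")`. In a notebook the figure renders inline; in a script, call `plt.show()` afterwards.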

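4. t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE works by minimizing the divergence between two distributions: the distribution of pairwise similarities (distances) among the input objects, and the distribution of pairwise similarities among the corresponding points in the low-dimensional space.

The method is resource-intensive, so when the data has a large number of features it is advisable to reduce them first with PCA (or with TruncatedSVD for sparse data).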
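
For reference, the two distributions and the divergence between them are usually written as follows. The notation is an assumption introduced here, not taken from the original text: $x_i$ are the input points, $y_i$ their low-dimensional counterparts, $N$ the number of points, and $\sigma_i$ a per-point bandwidth determined by the perplexity setting.

```latex
% Similarities in the input space (Gaussian kernel, then symmetrized)
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Similarities in the low-dimensional space (Student t kernel, one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Objective: Kullback-Leibler divergence between the two distributions
C = \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The positions $y_i$ are adjusted by gradient descent to reduce $C$.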

import time

from sklearn.manifold import TSNE

n_sne = 7000

time_start = time.time()
# max_iter is the iteration cap; scikit-learn releases before 1.5 call this parameter n_iter
tsne = TSNE(n_components=2, verbose=1, perplexity=40, max_iter=300)
tsne_results = tsne.fit_transform(df.loc[rndperm[:n_sne],feat_cols].values)

print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))

df_tsne = df.loc[rndperm[:n_sne],:].copy()
df_tsne['x-tsne'] = tsne_results[:,0]
df_tsne['y-tsne'] = tsne_results[:,1]

chart = ggplot( df_tsne, aes(x='x-tsne', y='y-tsne', color='label') ) \
+ geom_point(size=15,alpha=0.8) \
+ ggtitle("tSNE dimensions colored by digit")
chart

<ggplot: (5784710)>

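You can also process the data with PCA or TruncatedSVD first, keeping for example 50 components, and then use t-SNE to reduce the dimensionality further. Keep in mind, however, that the computational cost of t-SNE grows quickly with the number of samples (quadratically for the exact algorithm), so the method becomes impractical when the dataset contains hundreds of thousands or millions of objects.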
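
The two-stage approach can be sketched as below. This is a minimal illustration under stated assumptions, not a tuned pipeline: it uses scikit-learn's bundled digits dataset (64 features) in place of MNIST, limits the run to 500 samples to stay fast, and leaves the t-SNE iteration count at its default. The variable names are chosen for this example only.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Small bundled digits dataset as a stand-in for MNIST (illustrative only)
X_small = load_digits().data[:500] / 16.0   # 500 samples, 64 features

# Step 1: compress the 64 pixel features down to 50 principal components
X_pca50 = PCA(n_components=50).fit_transform(X_small)

# Step 2: run t-SNE on the reduced data to obtain a 2D embedding
embedding = TSNE(n_components=2, perplexity=40, random_state=0).fit_transform(X_pca50)

print(X_pca50.shape, embedding.shape)
```

For sparse input, `TruncatedSVD(n_components=50)` from `sklearn.decomposition` can take the place of `PCA` in the first step.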