Cluster Analysis in R

Using R's built-in iris data set

Step 1: Preliminary statistical analysis of the data set

Check the dimensions of the data set:

> dim(iris)
[1] 150   5

Show the column names:

> names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

Show the internal structure of the data set:

> str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Show the attributes of the data set:

> attributes(iris)
$names        # the column names
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

$row.names    # the label of each row, here the integers 1 to 150
  [1]   1   2   3   4   5   6   7   8   9  10 ...
  (output continues through 150)

$class        # the class of the object
[1] "data.frame"

View the first five rows:

> iris[1:5, ]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

View the first 10 values of the Sepal.Length column:

> iris[1:10, "Sepal.Length"]
 [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9

The same result, selecting the column with $:

> iris$Sepal.Length[1:10]
 [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9

Show the distribution of every variable:

> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

Show how often each value occurs in the Species column:

> table(iris$Species)

    setosa versicolor  virginica
        50         50         50

Draw a pie chart of the Species frequencies:

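> pie(table(iris$Species))

Compute the variance of the Sepal.Length column:

> var(iris$Sepal.Length)
[1] 0.6856935

Compute the covariance of iris$Sepal.Length and iris$Petal.Length:

> cov(iris$Sepal.Length, iris$Petal.Length)
[1] 1.274315

Compute the correlation coefficient of iris$Sepal.Length and iris$Petal.Length. A value of about 0.87 indicates that the two variables are strongly positively correlated.

> cor(iris$Sepal.Length, iris$Petal.Length)
[1] 0.8717538

Draw a histogram of iris$Sepal.Length:

> hist(iris$Sepal.Length)

Draw the estimated density of iris$Sepal.Length:

> plot(density(iris$Sepal.Length))

Draw a scatter plot of iris$Sepal.Length against iris$Sepal.Width:

> plot(iris$Sepal.Length, iris$Sepal.Width)

Draw a scatter-plot matrix of all columns:

> plot(iris)

or

> pairs(iris)

Step 2: K-means clustering with kmeans()

The kmeans() function belongs to the stats package that ships with R, so no extra package has to be installed. First copy the data set and set the Species column of the copy to NULL, so that only the four numeric measurements are clustered:

> newiris <- iris
> newiris$Species <- NULL

Run k-means on newiris and store the result in kc. The number of clusters is set to 3. Fixing the random seed first keeps the result from changing between runs:

> set.seed(123)
> (kc <- kmeans(newiris, 3))
K-means clustering with 3 clusters of sizes 50, 62, 38

The algorithm produced 3 clusters containing 50, 62 and 38 observations.

Cluster means:    # the final mean of each column within each cluster
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     5.006000    3.428000     1.462000    0.246000
2     5.901613    2.748387     4.393548    1.433871
3     6.850000    3.073684     5.742105    2.071053

Clustering vector:    # the cluster each row belongs to (1 = first cluster, 2 = second, 3 = third)
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [37] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [73] 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3
[109] 3 3 3 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3
[145] 3 3 2 3 3 2

Within cluster sum of squares by cluster:    # sum of squared distances inside each cluster
[1] 15.15100 39.82097 23.87947
 (between_SS / total_SS =  88.4 %)

The between-cluster sum of squares accounts for 88.4% of the total sum of squares. In other words, most of the variation in the data lies between the clusters rather than inside them, so the clusters are compact relative to their separation. K-means works by minimizing the within-cluster sum of squares, and it may stop at a local optimum.

Available components:    # the parts of the object returned by kmeans()
[1] "cluster"      "centers"      "totss"        "withinss"
[5] "tot.withinss" "betweenss"    "size"

"cluster" is an integer vector giving the cluster of each row.
"centers" is a matrix holding the center of each cluster for every variable.
"totss" is the total sum of squares.
"withinss" is the within-cluster sum of squares, one value per cluster.
"tot.withinss" is the sum of the withinss values.
"betweenss" is the between-cluster sum of squares.
"size" is the number of observations in each cluster.

Build a contingency table that counts how many flowers of each species fall into each cluster:

> table(iris$Species, kc$cluster)

              1  2  3
  setosa     50  0  0
  versicolor  0 48  2
  virginica   0 14 36

Cluster numbers are arbitrary labels, so a different random start can permute the columns without changing the grouping. All 50 setosa flowers form a cluster of their own, while 16 of the 100 versicolor and virginica flowers (2 + 14) end up in the cluster dominated by the other species.

Plot the clustering result as a scatter plot of the columns "Sepal.Length" and "Sepal.Width", colored by cluster number (1, 2 and 3 map to the default colors black, red and green):

> plot(newiris[c("Sepal.Length", "Sepal.Width")], col = kc$cluster)

Mark the center of each cluster on the plot:

> points(kc$centers[, c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex = 2)

Conclusion: the value of k is chosen by the user, and different values of k can give quite different results. The plot above shows the result for k = 3, which is a rather coarse partition: the red cluster and the green cluster could each arguably be split into two.

(Figure: clustering result for k = 5.)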
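
Since k has to be chosen by the user, one common heuristic is to compare the total within-cluster sum of squares over a range of k values and look for the point where the curve stops dropping sharply (the "elbow"). The following is a minimal sketch of that idea, not part of the original analysis. The names features, k_values, wss and kc3, the range 1:8 and the setting nstart = 25 are all illustrative assumptions.

```r
# Sketch: compare the total within-cluster sum of squares across candidate k.
features <- iris[, 1:4]      # the four numeric columns (same data as newiris)
k_values <- 1:8              # candidate cluster counts; the range is an arbitrary choice

set.seed(123)                # make the random starts reproducible
wss <- sapply(k_values, function(k) {
  # nstart = 25 reruns k-means from 25 random starts and keeps the best fit,
  # which reduces the chance of reporting a poor local optimum
  kmeans(features, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which adding clusters yields little improvement
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# Sanity check on a k = 3 fit: the total sum of squares splits into
# a within-cluster part and a between-cluster part
set.seed(123)
kc3 <- kmeans(features, centers = 3, nstart = 25)
all.equal(kc3$totss, kc3$tot.withinss + kc3$betweenss)
```

The curve only suggests a reasonable k. Whether 3, 5 or another value is appropriate still depends on how the clusters are going to be used.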
