K-Means Parallelization Algorithm Based on Canopy

Wang Ying (王颖)

Computer & Telecommunication (电脑与电信), 2019, Vol. 1, Issue 7: 30.

Abstract

Traditional data mining methods can no longer cope with the massive volumes of information produced by big data. In recent years many researchers have proposed new data mining methods or improved traditional ones, but these remain far from adequate for data at this scale. Building on a review of existing methods, this paper proposes a Canopy-based parallel K-Means algorithm. Unlike traditional K-Means, the improved method first determines the initial cluster centers by density and then runs K-Means on a Hadoop distributed cluster. Experiments show that the method reduces computational complexity, and thereby improves computational efficiency, while maintaining accuracy.
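The two-stage idea summarized in the abstract can be illustrated with a minimal single-machine sketch. It uses the textbook Canopy procedure (a loose threshold and a tight threshold) to pick initial centers and then seeds ordinary K-Means iterations with them. This is an assumption-laden illustration, not the paper's implementation: the function names `canopy_centers` and `kmeans`, the thresholds `t1` and `t2`, and the sample points are all hypothetical, the paper's density-based center selection may differ from plain Canopy, and the Hadoop distribution step is not shown.

```python
import math


def canopy_centers(points, t1, t2):
    """Textbook Canopy pass: return a list of (center, members) pairs.

    t1 is the loose threshold (canopy membership), t2 the tight threshold
    (points this close to a center stop being center candidates).
    Both names are illustrative assumptions, not taken from the paper.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]
        # Every point within t1 of the center belongs to this canopy.
        members = [p for p in points if math.dist(p, center) <= t1]
        canopies.append((center, members))
        # Points within t2 (including the center itself) are no longer candidates.
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return canopies


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        for p in points
    ]


def kmeans(points, initial_centers, max_iter=100):
    """Plain Lloyd iterations started from the given centers."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                # Mean of each coordinate over the cluster members.
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centers.append(old)  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


if __name__ == "__main__":
    data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
            (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
    seeds = [center for center, _ in canopy_centers(data, t1=3.0, t2=1.5)]
    final_centers, labels = kmeans(data, seeds)
    print(len(seeds), labels)
```

Because the Canopy pass fixes both the number of clusters and their starting positions, K-Means no longer depends on a random initialization. In a distributed setting the assignment step would typically run in the map phase and the center update in the reduce phase, but that mapping is an assumption here and not something the abstract specifies.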