Abstract: Faced with the massive volume of information that big data brings, traditional data mining methods are no longer adequate. In recent years, many researchers have proposed new data mining methods or improved traditional ones, but these still fall short of handling data at this scale. After reviewing previous methods, this paper proposes an improved K-Means algorithm based on Canopy. Compared with traditional K-Means, the improved method first determines the initial centers by density and then processes the reduced data on a Hadoop distributed cluster. The experimental results show that the method reduces computational complexity and improves computational efficiency while maintaining accuracy.
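
As a rough illustration of the idea summarized in the abstract, the following Python sketch seeds K-Means with centers chosen by a Canopy pass. It is a minimal single-machine sketch under stated assumptions, not the paper's implementation: the function names `canopy_centers` and `kmeans`, the tight threshold `t2`, and the use of Euclidean distance are hypothetical choices made for illustration, and the density-based center selection and the Hadoop execution described in the paper are omitted.

```python
import math


def canopy_centers(points, t2):
    """Pick Canopy centers (assumed simplification: only the tight threshold t2).

    Each chosen center removes every point within distance t2 of it from the
    candidate list, so the surviving centers are more than t2 apart.
    """
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining[0]
        centers.append(center)
        # Drop the center itself and all points tightly bound to it.
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return centers


def kmeans(points, centers, max_iter=100):
    """Standard Lloyd iterations starting from the given initial centers."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, clusters


# Usage: the number of clusters k is implied by the Canopy pass.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = canopy_centers(data, t2=3.0)
final_centers, final_clusters = kmeans(data, seeds)
```

In this sketch the Canopy pass fixes both the number of clusters and their starting positions, so K-Means does not depend on random initialization. The value of `t2` is an assumed tuning parameter and would need to be chosen for the data at hand.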