Research on Personal Microblog Clustering Based on CBOW Model

SONG Tian-shu, LI Jiang-yu, ZHANG Qin-zhe

Computer & Telecommunication ›› 2018, Vol. 1 ›› Issue (4): 69-72.

Applied Technology and Research


Research on Personal Microblog Clustering Based on CBOW Model

  • SONG Tian-shu, LI Jiang-yu, ZHANG Qin-zhe

Abstract

Personal microblogs are a popular social tool, but their sheer volume and variety make them hard for users to browse. This paper clusters microblogs with high semantic similarity to make browsing easier. The main work is as follows: 1. segment the personal microblogs with the jieba tokenizer in Python and remove stop words; 2. train word vectors on the segmented dataset with the CBOW model; 3. represent each microblog as a sentence vector built from its word vectors; 4. treat the sentence vectors as points in space and compute the distance between them, taken as the similarity between microblogs, with a modified Manhattan sentence algorithm; 5. cluster the microblogs with a modified CLARANS algorithm. Experiments show that the proposed method clearly improves on traditional clustering algorithms such as partitioning, hierarchical, and density-based methods.
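Steps 3 to 5 of the pipeline can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the Manhattan distance or CLARANS were modified, so the sketch uses the plain versions of both. The word vectors are made-up 3-dimensional values standing in for CBOW output, the posts are assumed to be already segmented, and the sentence vector is assumed to be the mean of its word vectors. All function and variable names are hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]

# Made-up word vectors standing in for the output of a trained CBOW model.
WORD_VECTORS: Dict[str, Vector] = {
    "天气": [0.9, 0.1, 0.0],
    "下雨": [0.8, 0.2, 0.1],
    "足球": [0.0, 0.9, 0.3],
    "比赛": [0.1, 0.8, 0.4],
}


def sentence_vector(tokens: Sequence[str], word_vectors: Dict[str, Vector], dim: int) -> Vector:
    """Average the vectors of the known tokens (an assumed composition rule)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain Manhattan (L1) distance; the paper's modification is not specified."""
    return sum(abs(x - y) for x, y in zip(a, b))


def total_cost(points: List[Vector], medoids: List[int], dist: Callable) -> float:
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def clarans(points: List[Vector], k: int, dist: Callable, num_local: int = 3,
            max_neighbor: int = 20, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Plain CLARANS: randomized search over single medoid swaps. Requires k < len(points)."""
    rng = random.Random(seed)
    n = len(points)
    best: List[int] = []
    best_cost = float("inf")
    for _ in range(num_local):
        current = rng.sample(range(n), k)  # random starting medoids
        current_cost = total_cost(points, current, dist)
        tries = 0
        while tries < max_neighbor:
            # A neighbor differs from the current solution in exactly one medoid.
            neighbor = current[:]
            non_medoids = [i for i in range(n) if i not in current]
            neighbor[rng.randrange(k)] = rng.choice(non_medoids)
            cost = total_cost(points, neighbor, dist)
            if cost < current_cost:
                current, current_cost, tries = neighbor, cost, 0
            else:
                tries += 1
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    # Label each point with the position of its nearest medoid.
    labels = [min(range(k), key=lambda j: dist(p, points[best[j]])) for p in points]
    return best, labels


# Toy posts, already segmented; unknown tokens are skipped when averaging.
posts = [["今天", "天气", "下雨"], ["下雨"], ["足球", "比赛"], ["比赛", "精彩"]]
vectors = [sentence_vector(tokens, WORD_VECTORS, dim=3) for tokens in posts]
medoids, labels = clarans(vectors, k=2, dist=manhattan)
print(labels)
```

With these toy vectors the two weather posts and the two sports posts land in separate clusters. In a real run the tokens might come from `jieba.lcut` and the vectors from a CBOW model, for example gensim's `Word2Vec` with `sg=0`.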

Keywords

personal microblog / semantics / clustering / machine learning

Cite this article

SONG Tian-shu, LI Jiang-yu, ZHANG Qin-zhe. Research on Personal Microblog Clustering Based on CBOW Model[J]. Computer & Telecommunication. 2018, 1(4): 69-72
CLC number: TP391
