Clustering Analysis of High-Dimensional Data Based on Penalized Gaussian Mixture Models (基于惩罚高斯混合模型的高维数据聚类分析)
Alternative Title: Penalized Gaussian Mixture Model-Based High-Dimensional Data Clustering
朱桂菊
Thesis Advisor: 赵学靖
2016-05-14
Degree Grantor: Lanzhou University
Place of Conferral: Lanzhou
Degree Name: Master's
Keywords: Gap Statistics; BIC; variable selection; Adaptive-H-GMM model
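Abstract: Clustering of high-dimensional, small-sample data has a wide range of applications. This thesis assumes that the data come from a Gaussian mixture model and performs variable selection and clustering by imposing penalty functions on this class of models. We choose three penalty functions on the mean parameters: the L1 penalty, the Adaptive-L1 penalty, and the Adaptive hierarchical penalty; the corresponding models are denoted L1-GMM, Adaptive-L1-GMM, and Adaptive-H-GMM. Once the model is specified, we first estimate the number of clusters with Gap Statistics and then estimate the model parameters with the EM algorithm; during this process the estimated means indicate whether the p-th variable is an informative variable, and a modified BIC serves as the model selection criterion for choosing the penalty coefficient. The validity of the models is checked through experiments on simulated data and gene expression data. On the simulated data set, all three models perform well: the clustering agrees with the original data and the non-informative variables are correctly identified. On the gene expression data set, the three models perform differently; the Adaptive-H-GMM model ultimately selects 14 informative variables out of 300, effectively reducing computation and complexity, with a clustering error rate of 4/72, which is a fairly good result.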
Other Abstract: This thesis is devoted to the clustering of "high dimension, low sample size" data. The data are assumed to be drawn from a Gaussian mixture model in which each component corresponds to a cluster. Variables are selected during the clustering procedure: the variables that carry important information are identified, and the data are then clustered on the basis of these informative variables. Clustering and variable selection are explored through Gaussian mixture models with a penalty function. Three penalty functions on the mean parameters are investigated: the L1 penalty, the Adaptive-L1 penalty, and the Adaptive hierarchical penalty, which give rise to the three models L1-GMM, Adaptive-L1-GMM, and Adaptive-H-GMM, respectively. Gap Statistics is used to estimate the number of clusters, and the EM algorithm is used to estimate the parameters. Whether a variable is an informative variable can be determined from the estimated means, and the tuning parameter is chosen by a modified BIC. The three models are applied to simulated data and to real gene expression data. All three models perform well on the simulated data, meaning that the clustering results and the variable selection results are consistent with the original data. On the gene expression data the three models perform differently, and Adaptive-H-GMM is the best one. With Adaptive-H-GMM, 14 informative variables are selected from 300 variables, which reduces the amount of computation and the complexity of the model; the clustering error rate is 4/72, which is acceptable.
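The abstract does not state the update formulas, so the following is only a minimal sketch of how an L1 penalty on the means can produce variable selection inside the EM M-step. It assumes standardized data, a common diagonal covariance, and the soft-thresholding update that is commonly used for L1-penalized Gaussian mixtures; the function names (`soft_threshold`, `penalized_mean_update`, `informative_variables`) are hypothetical and this is not the thesis's own implementation.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; return 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def penalized_mean_update(X, resp, sigma2, lam):
    """One M-step update of the cluster means under an L1 penalty.

    X      : n rows of p standardized features (assumed global mean 0)
    resp   : n rows of K posterior cluster probabilities from the E-step
    sigma2 : p variances, assumed shared by all clusters (diagonal covariance)
    lam    : penalty coefficient (assumed to be chosen elsewhere, e.g. by BIC)
    """
    n, p = len(X), len(X[0])
    K = len(resp[0])
    mu = []
    for k in range(K):
        nk = sum(resp[i][k] for i in range(n))  # effective cluster size
        row = []
        for j in range(p):
            # Unpenalized weighted mean, then shrinkage toward zero.
            mu_tilde = sum(resp[i][k] * X[i][j] for i in range(n)) / nk
            row.append(soft_threshold(mu_tilde, lam * sigma2[j] / nk))
        mu.append(row)
    return mu


def informative_variables(mu):
    """A variable is treated as informative if any cluster mean is nonzero."""
    p = len(mu[0])
    return [any(row[j] != 0.0 for row in mu) for j in range(p)]
```

Under these assumptions, a variable whose mean is shrunk exactly to zero in every cluster contributes nothing to separating the clusters, which is the sense in which the means indicate whether the p-th variable is informative.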
Language: Chinese
Document Type: Thesis
Identifier: https://ir.lzu.edu.cn/handle/262010/225191
Collection: School of Mathematics and Statistics
Recommended Citation
GB/T 7714
朱桂菊. 基于惩罚高斯混合模型的高维数据聚类分析[D]. 兰州. 兰州大学,2016.
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.