| 基于惩罚高斯混合模型的高维数据聚类分析 |
Alternative Title | Penalized Gaussian Mixture Model-Based High-Dimensional Data Clustering
|
| 朱桂菊 |
Thesis Advisor | 赵学靖
|
| 2016-05-14
|
Degree Grantor | Lanzhou University
|
Place of Conferral | Lanzhou
|
Degree Name | Master's
|
Keyword | Gap Statistics
BIC
Variable selection
Adaptive-H-GMM model
|
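Abstract | Cluster analysis of high-dimensional, small-sample data has broad applications. This thesis assumes that the data come from a Gaussian mixture model and performs variable selection and clustering by imposing a penalty function on the model. We consider three penalty functions on the mean parameters: the L1 penalty, the Adaptive-L1 penalty, and the Adaptive hierarchical penalty; the corresponding models are denoted L1-GMM, Adaptive-L1-GMM, and Adaptive-H-GMM. Once the model is specified, we first estimate the number of clusters with Gap Statistics and then estimate the model parameters with the EM algorithm. During this process, the estimated means indicate whether the p-th variable is an informative variable, and a modified BIC serves as the model selection criterion for choosing the penalty coefficient.
The effectiveness of the models is tested in experiments on simulated data and gene expression data. On the simulated data set, all three models perform well: the clustering agrees with the original data, and the non-informative variables are correctly identified. On the gene expression data set, the three models perform differently; the Adaptive-H-GMM model ultimately selects 14 informative variables out of 300, which effectively reduces computation and complexity, and attains a clustering error rate of 4/72, a good result. |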
Other Abstract | This thesis is devoted to clustering “high dimension, low sample size” data. The data are assumed to be drawn from a Gaussian mixture model in which each component corresponds to a cluster. Variables are selected during the clustering procedure: the variables that carry important information are identified, and the data are then clustered on the basis of these informative variables. Clustering and variable selection are explored through Gaussian mixture models with a penalty function. Three penalty functions on the mean parameters are investigated: the L1 penalty, the Adaptive-L1 penalty, and the Adaptive hierarchical penalty, which give rise to the three models L1-GMM, Adaptive-L1-GMM, and Adaptive-H-GMM. Gap Statistics is used to estimate the number of clusters, and the EM algorithm is used to estimate the parameters. Whether a variable is informative can be determined from its estimated means, and the tuning parameter is chosen by a modified BIC. The three models are applied to simulated data and to real gene expression data. All three perform well on the simulated data: both the clustering results and the variable selection results are consistent with the original data. On the gene expression data the three models perform differently, and Adaptive-H-GMM is the best. With Adaptive-H-GMM, 14 informative variables are selected from 300, which reduces the amount of computation and the complexity of the model; the clustering error rate is 4/72, which is acceptable. |
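The abstract says that the estimated means reveal whether a variable is informative, but this record does not give the update formula. The sketch below illustrates one common way an L1 penalty on the means is handled inside the M-step of EM: soft-thresholding the weighted cluster means. It is not the thesis's own algorithm. It assumes standardized data (each column centered at zero) and a diagonal covariance shared by all clusters, and the function and parameter names (`penalized_mean_update`, `informative_variables`, `lam`) are hypothetical.

```python
import numpy as np


def soft_threshold(x, t):
    """Shrink x toward zero by t; entries with |x| <= t become exactly zero."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def penalized_mean_update(X, resp, sigma2, lam):
    """One M-step update of the cluster means under an L1 penalty (sketch).

    X      : (n, p) data, assumed standardized so each column has mean zero
    resp   : (n, K) posterior cluster probabilities from the E-step
    sigma2 : (p,) diagonal variances, assumed shared by all clusters
    lam    : penalty coefficient (hypothetical name)
    """
    nk = resp.sum(axis=0)                    # effective cluster sizes, shape (K,)
    mu_tilde = (resp.T @ X) / nk[:, None]    # unpenalized weighted means, shape (K, p)
    thresh = lam * sigma2[None, :] / nk[:, None]
    return soft_threshold(mu_tilde, thresh)


def informative_variables(mu):
    """Indices of variables whose mean is nonzero in at least one cluster."""
    return np.flatnonzero(np.any(mu != 0.0, axis=0))


# Toy example: only the first of three variables separates the two clusters.
X = np.array([[2.0, 0.1, 0.0],
              [2.0, -0.1, 0.0],
              [-2.0, 0.1, 0.0],
              [-2.0, -0.1, 0.0]])
resp = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
mu = penalized_mean_update(X, resp, sigma2=np.ones(3), lam=1.0)
selected = informative_variables(mu)
```

Under these assumptions, a variable whose mean is shrunk to zero in every cluster no longer influences the cluster assignments. That is the sense in which the means can flag non-informative variables; the penalty coefficient would then be chosen by a criterion such as the modified BIC mentioned in the abstract.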
URL | View original
|
Language | Chinese
|
Document Type | Thesis
|
Identifier | https://ir.lzu.edu.cn/handle/262010/225191
|
Collection | School of Mathematics and Statistics
|
Recommended Citation GB/T 7714 |
朱桂菊. 基于惩罚高斯混合模型的高维数据聚类分析[D]. 兰州. 兰州大学,2016.
|