基于维数约简的无监督聚类算法研究 (Research on Unsupervised Clustering Algorithms Based on Dimensionality Reduction) | |
Alternative Title | Unsupervised Clustering Algorithm Based on Dimension Reduction |
杜世强 (Du Shiqiang) | |
Subtype | Doctoral dissertation
Thesis Advisor | 马义德 (Ma Yide)
2017-03-31 | |
Degree Grantor | 兰州大学 (Lanzhou University)
Place of Conferral | 兰州 (Lanzhou)
Degree Name | 博士 (Ph.D.)
Keyword | clustering analysis; dimensionality reduction; matrix factorization; l2,1-norm; feature selection; feature extraction; subspace clustering; low-rank representation
Abstract | In recent years, with the continuous improvement of data-acquisition capability and the rapid development of computing, the amount of available data and its dimensionality have grown dramatically. Discovering the latent regularities in such massive high-dimensional data is one of the challenges facing machine learning. In the absence of label information, reducing the dimensionality of high-dimensional data while simultaneously clustering it, so as to uncover the data's intrinsic structure, is both a difficulty and a hot topic in current machine learning. This dissertation studies, in the unsupervised setting and on the basis of matrix factorization, how to reduce the dimensionality of the original high-dimensional samples while simultaneously performing cluster analysis, thereby revealing the intrinsic structure of the data. The main contributions are as follows:

1. Existing regression-based feature-selection algorithms usually take a 0-1 pseudo-label matrix as the target matrix, which makes the model an NP-hard problem. To address this, a robust unsupervised feature-selection algorithm based on matrix factorization (RUFSM) is proposed. RUFSM first decomposes the target matrix into the product of two matrices (an orthogonal cluster-center matrix and a low-dimensional sparse representation matrix); this not only makes the model easy to solve iteratively, but also lets the feature-selection (projection) matrix better select class-discriminative features. Second, the orthogonality constraint on the cluster centers and the sparsity constraint on the low-dimensional representation keep projected samples of different classes far apart while drawing samples of the same class together. Finally, the l2,1-norm used as the error measure effectively suppresses the influence of noisy samples and outliers on the intrinsic features of the data, and performing robust feature selection and robust clustering simultaneously guarantees an overall better solution. Extensive experiments show that RUFSM outperforms related robust feature-selection algorithms in both robustness and clustering performance.

2. To deal with the non-differentiability of the nuclear norm in the low-rank-representation objective, a nonnegative graph-regularized low-rank concept factorization algorithm (GLCF) is proposed. Using matrix theory, GLCF first converts the low-rank constraint that preserves the global structure into minimizing the sum of the Frobenius norms of two factors; considering the semantic relevance of nonnegativity in cluster analysis, the factor matrices are constrained to be nonnegative, and a manifold-regularization term makes the low-dimensional representation preserve the local geometric structure of the original samples. Second, a multi-step update scheme for optimizing the objective is given and its convergence is proved theoretically. Finally, the relationship between the proposed update rules and gradient descent is analyzed, and an update rule for data matrices with negative entries is also given. Experiments show that GLCF achieves better clustering performance than related nonnegativity-constrained matrix-factorization algorithms.

3. Existing subspace-clustering algorithms based on low-rank representation usually use the noisy original data directly as the dictionary for computing the low-rank representation, and they construct the affinity matrix and perform clustering in two independent steps. To overcome these drawbacks, a graph-regularized compact low-rank representation algorithm (GCLRR) is proposed. First, to eliminate the influence of noisy samples in the dictionary on the low-rank representation, GCLRR uses linear combinations of the original data as the dictionary, so the dictionary is learned during optimization and the low-dimensional representation is updated along with it. Second, the orthogonal linear-combination coefficient matrix together with the low-dimensional low-rank representation matrix can be viewed as a factorization of the low-rank representation matrix in LRR, so the low-dimensional low-rank representation obtained during optimization can be used directly for clustering. Finally, low-rank and manifold regularization, which preserve the global and local structure respectively, are imposed directly on the low-dimensional representation, giving it good class properties. Clustering experiments show that GCLRR outperforms state-of-the-art LRR-based algorithms at uncovering the latent subspaces of data. |
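The l2,1-norm loss used by RUFSM sums the l2 norms of the residual rows (one row per sample), so a corrupted sample contributes only linearly in its row norm rather than quadratically as under a squared Frobenius loss. A minimal NumPy sketch of this robustness property (illustrative only, not code from the thesis):

```python
import numpy as np

def l21_norm(M):
    """||M||_{2,1}: sum of the l2 norms of the rows of M."""
    return np.sum(np.linalg.norm(M, axis=1))

rng = np.random.default_rng(0)
E = rng.normal(size=(100, 5)) * 0.01   # residual matrix, one row per sample
E_out = E.copy()
E_out[0] *= 100.0                      # corrupt a single sample (row)

# Squared Frobenius loss grows quadratically with the outlier's row norm,
# while the l2,1 loss grows only linearly, so the outlier dominates far less.
frob_ratio = np.linalg.norm(E_out) ** 2 / np.linalg.norm(E) ** 2
l21_ratio = l21_norm(E_out) / l21_norm(E)
print(frob_ratio, l21_ratio)  # frob_ratio is much larger than l21_ratio
```

This is why minimizing the l2,1 residual tends to discount noisy samples and outliers when the projection and cluster structure are estimated jointly.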
Other Abstract | In recent years, with the continuous improvement of data-acquisition capacity and the rapid development of computing, the amount of available data and its dimensionality have both grown dramatically; how to effectively analyze these massive high-dimensional data and extract their latent regularities is one of the challenges of machine learning. In the absence of label information, performing dimensionality reduction and clustering on high-dimensional data simultaneously in order to mine its internal structure is one of the difficulties and hot topics of current machine learning. This dissertation focuses on representing the original high-dimensional data in a low-dimensional space based on matrix factorization in the absence of label information, and simultaneously clustering the low-dimensional representation in order to discover the intrinsic structure of the categories. The main contributions and innovations of this dissertation are as follows:

1. Most current regression-based feature-selection methods use pseudo cluster labels as the target matrix for selecting discriminative features. Such a target matrix is by definition a 0-1 matrix, so optimizing these models is an NP-hard problem. To deal with this, a novel robust unsupervised feature-selection method via matrix factorization (RUFSM) is proposed. Its advantages are threefold. First, the target matrix is decomposed into the product of two matrices, an orthogonal cluster-center matrix and a low-dimensional sparse representation matrix; this not only makes the model easy to solve iteratively, but also enables the feature-selection (projection) matrix to better select discriminative features. Second, the orthogonality constraint on the cluster centers and the sparsity constraint on the low-dimensional representation ensure that projected samples from different classes stay far apart while samples from the same class stay close together. Finally, the l2,1-norm is used as the loss function to filter out noisy samples and outliers, and performing robust feature selection and robust clustering simultaneously guarantees an overall optimal solution. Compared with several state-of-the-art unsupervised feature-selection methods, the proposed algorithm achieves better clustering performance on almost all of the datasets experimented with here.

2. To deal with the non-differentiable nuclear norm in the low-rank-representation objective, we propose nonnegative graph-regularized low-rank concept factorization (GLCF). First, using matrix theory, we transform the low-rank constraint, which preserves the global structure of the data, into minimizing the sum of the Frobenius norms of two factors; taking into account the semantic meaning of nonnegativity in cluster analysis, the factors are constrained to be nonnegative, and a manifold-regularization term makes the low-dimensional representation preserve the local geometric structure of the original samples. Second, a multi-step iterative updating scheme is proposed and the convergence of the algorithm is proved theoretically. Third, the connection between the iterative updating rules and gradient descent is analyzed, and an updating rule for data matrices with negative entries is proposed. Compared with related matrix-factorization algorithms based on nonnegativity constraints, experimental results demonstrate GLCF's better clustering performance.

3. Current LRR-based approaches have the following drawbacks: 1) the original data, which are used as the dictionary in LRR for seeking the lowest-rank representations of the data points, usually contain noise and may not be representative; and 2) the affinity matrix and the subspace clustering are obtained in two independent steps. We therefore propose an improved LRR-based approach, called graph-regularized compact LRR (GCLRR). First, to eliminate the effect of noisy samples in the dictionary on the low-rank representation, a linear combination of the original data is used as the dictionary; when the combination coefficients are updated, the dictionary is updated accordingly. Second, the orthogonal linear-combination coefficient matrix together with the low-dimensional low-rank representation matrix can be viewed as a factorization of the low-rank representation matrix in LRR, so the low-rank representation obtained during optimization can be used directly for clustering. Third, the low-dimensional representation used for subspace clustering captures both the global subspace structure (through low-rankness) and the local manifold structure (through manifold regularization) of the original data. Extensive experiments on both synthetic and real datasets verify the effectiveness and robustness of the proposed method and demonstrate its higher clustering capacity compared with state-of-the-art LRR-based clustering algorithms. |
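The reformulation underlying GLCF rests on the standard variational identity for the nuclear norm, ||Z||_* = min over factorizations Z = UV of (||U||_F^2 + ||V||_F^2)/2, which replaces the non-differentiable nuclear norm with smooth Frobenius terms. A numerical sanity check of that identity (NumPy sketch, not code from the thesis; the minimizer splits the singular values evenly between the two factors):

```python
import numpy as np

rng = np.random.default_rng(1)
# Build a rank-3 matrix Z as a product of two thin random factors.
Z = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 15))

# Nuclear norm: the sum of the singular values of Z.
nuclear = np.linalg.svd(Z, compute_uv=False).sum()

# Variational form: ||Z||_* = min_{Z=UV} 0.5*(||U||_F^2 + ||V||_F^2).
# The minimum is attained at U = U_s*sqrt(S), V = sqrt(S)*Vt_s.
U_s, S, Vt = np.linalg.svd(Z, full_matrices=False)
U = U_s * np.sqrt(S)            # scale columns by sqrt of singular values
V = np.sqrt(S)[:, None] * Vt    # scale rows by sqrt of singular values
bound = 0.5 * (np.linalg.norm(U) ** 2 + np.linalg.norm(V) ** 2)

assert np.allclose(Z, U @ V)    # the split is still a valid factorization
print(nuclear, bound)           # equal up to floating-point precision
```

In GLCF the factors are additionally constrained to be nonnegative and coupled with a graph-regularization term, but the Frobenius substitution above is what removes the non-differentiability.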
URL | View Original
Language | Chinese
Document Type | Dissertation
Identifier | https://ir.lzu.edu.cn/handle/262010/232455 |
Collection | School of Information Science and Engineering (信息科学与工程学院)
Recommended Citation GB/T 7714 | 杜世强. 基于维数约简的无监督聚类算法研究[D]. 兰州: 兰州大学, 2017.
Files in This Item: | There are no files associated with this item. |