Document clustering [1], [2], [3], [4] techniques find relevance in a wide range of tasks, from a simple search with a few terms to large-scale information retrieval processes. Early document clustering techniques were developed mainly to enhance information retrieval systems [5]: they were designed to find documents matching a query, but could not formulate a query, generate a synopsis of the documents, or provide an interface to the search results. The growth of the internet, digital libraries, news sources and company-wide intranets has made huge volumes of text documents available. The tremendous increase in the already vast quantity of web data, and the need to classify web documents into a relevant and moderate number of clusters, has led to the development of a large number of web clustering engines and high-performing clustering algorithms.
The process of document clustering involves four stages: i) data collection, which covers crawling to accumulate the documents, indexing the document set in a structured fashion, and filtering the data with techniques such as tokenization, stop-word removal, stemming and lemmatization; ii) preprocessing, where the data is represented in a suitable form such as a vector and measurable factors are applied to determine similarity; iii) document clustering, where a clustering technique and an efficient clustering algorithm are identified for clustering based on preset criteria; and iv) post-processing, which adapts the document clustering technique to business and scientific requirements.
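As an illustration of the first two stages, the following minimal Python sketch runs tokenization, stop-word removal, crude suffix-stripping stemming and a bag-of-words representation on a toy corpus; the stop-word list and stemmer are simplified stand-ins for full tools such as NLTK or Snowball, not part of any cited system.

```python
# Minimal sketch of the collection/preprocessing stages: tokenization,
# stop-word removal, crude suffix-stripping stemming, and a bag-of-words
# (term-frequency) vector representation.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}  # illustrative only

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    # Very crude stemming: strip a few common English suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(doc):
    return [stem(t) for t in tokenize(doc) if t not in STOP_WORDS]

docs = ["Clustering groups similar documents together.",
        "Search engines cluster web documents for retrieval."]
vectors = [Counter(preprocess(d)) for d in docs]  # term-frequency vectors
print(vectors)
```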
The applications of document clustering are diverse: i) creation of document taxonomies; ii) IR processes of search, access and collection [6], identification of similar documents, review and classification of results [7], automatic topic extraction [8] and content summarization; iii) recommendation systems; and iv) search optimization. For instance, such processes are used extensively in data classification tasks such as the Google Web Directory and social media data classification.
Although clustering techniques have been studied for several years, they still face many of the same challenges. These challenges [9,10] of document clustering mainly concern: i) the huge volume of data, ii) the high dimensionality of the feature space, iii) finding a feasible clustering method under constraints such as cluster quality and performance, and iv) representing the results in an effective browsing interface. A current challenge for text clustering is the need for dynamic clustering techniques that incrementally update clusters as new data is added [11,12]. For instance, social media platforms have to generate user-specific content [13] instantly, which requires real-time data clustering methodologies.
The remainder of this paper is organized as follows. Section 2 discusses the taxonomy of document clustering, Section 3 evaluates the contemporary literature on clustering techniques, and Section 4 concludes the paper.
The clustering functionality can be expressed as a function that maps a document set D to a set of clusters. The minimum and maximum of this function under the specified constraints define the clustering difficulty, and the algorithms applied over the similarity criteria determine the clustering quality.
The preprocessing step of clustering, which determines document similarity, uses methods based on the following strategies: (i) phrase or pair-wise methodology, (ii) tree-form data representation, (iii) component-dependent data representation, (iv) semantic-relation-dependent document representation, and (v) concept and feature vector dependent representation.
Clustering methods are generally of two types: 1) word pattern and phrase based and 2) feature based.
Clustering algorithms are mostly of two types: 1) hierarchical methods and 2) partitioning (non-hierarchical) methods [14,15,16]. Hierarchical clustering algorithms represent data sets as a cluster tree and are of two types: 1-1) agglomerative [17] and 1-2) divisive hierarchical clustering methods. Partitional clustering algorithms [17] are of two types: 2-1) iterative and 2-2) single-pass methods. K-means and its variants are the most popular partitioning methods. Hierarchical clustering algorithms are considered more effective than the other algorithms [18]; however, due to their inherent complexity they are not applicable to huge document sets.
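The following brief sketch contrasts the two families on a toy TF-IDF matrix using scikit-learn; the corpus, the number of clusters and the linkage defaults are illustrative assumptions, not taken from any cited work.

```python
# Illustrative comparison of a partitional method (k-means) and an
# agglomerative hierarchical method on a toy TF-IDF matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering

docs = ["stock market prices fall", "investors watch the stock market",
        "new football season begins", "the team won the football match"]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Agglomerative clustering expects a dense array.
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X.toarray())
print("k-means:", kmeans_labels, "agglomerative:", hier_labels)
```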
Techniques for determining inter-cluster similarity in classification [19, 20], e.g., single-link, and for improving cluster quality when cluster sizes differ or fluctuate by a large factor [17], especially in the case of high-performing clustering algorithms, have been studied widely in recent years.
The widely used document clustering methods are spectral clustering, LSI-based cluster development and NMF-based clustering. Spectral clustering methods [21] include LPI, LSI, etc. Latent semantic indexing (LSI) [22], a feature extraction approach [23], attempts to find an optimal low-dimensional approximation of the document space and is a widely used linear document indexing method [24]. LSI does not scale well to very large document collections [24]; similarly, when spectral clustering is applied in a high-dimensional space, the dimensionality reduction is very costly, which limits its usability.
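A minimal sketch of LSI-style indexing is shown below, realized here by truncated SVD of a TF-IDF matrix followed by clustering in the latent space; the corpus and the number of latent dimensions are illustrative assumptions.

```python
# Sketch of LSI-style indexing: project the TF-IDF term-document space onto
# a small number of latent dimensions with truncated SVD, then cluster there.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["car engine repair", "automobile motor service",
        "chocolate cake recipe", "baking a sponge cake"]
X = TfidfVectorizer().fit_transform(docs)
lsi = TruncatedSVD(n_components=2, random_state=0)   # latent semantic space
X_lsi = lsi.fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_lsi)
print(labels)
```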
Word pattern and phrase based approaches are the traditional strategies, where clustering depends on document features such as words, phrases and sequences [25,26]. These methods are of four types: 1-1) clustering with frequent word patterns, 1-2) application of word clusters in document clusters, 1-3) co-clustering of words and documents, co-clustering with graph partitioning, and information-theoretic co-clustering, and 1-4) clustering based on frequent phrases. The vector space model (VSM) is used in almost all document clustering methods in use today [27]; it is a data model that represents the terms of a document as a feature vector.
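As a minimal illustration of the VSM just described, the following sketch builds a term-document count matrix and compares documents with cosine similarity; the toy corpus and the choice of raw counts (rather than weighted terms) are assumptions made only for brevity.

```python
# Minimal vector space model: each document is a term-count vector,
# and documents are compared with cosine similarity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["data mining of web documents", "web document clustering",
        "cooking pasta at home"]
X = CountVectorizer().fit_transform(docs)   # term-document matrix
print(cosine_similarity(X))                 # pairwise document similarity
```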
The feature-based clustering approaches are of two types: 2-1) feature extraction and 2-2) feature selection.
Feature extraction approaches are based on algorithms of two types: i) linear and ii) nonlinear techniques. Examples of linear algorithms are unsupervised PCA, OCA and MMC; examples of nonlinear algorithms are LLE, Laplacian Eigenmaps and ISOMAP. The linear methods show better operational performance than nonlinear approaches, but underperform when clustering huge and complicated internet data. Feature extraction finds applications in IR based on human language learning ability, in comparing reviewed and submitted papers across languages or networks, and in data filtering. Feature selection algorithms are of two types: 2-2-1) metric-based feature ranking and 2-2-2) subset selection from the possible features. Feature selection algorithms fall into two categories, i) supervised and ii) unsupervised. The supervised feature selection algorithms are the most researched and used, namely IG, CHI and MI. The most popular unsupervised methods are i) DF-based selection relying on term strength and ranking based on entropy or term contribution, ii) LSI-based methods and iii) NMF-based methods. Unsupervised techniques such as decision trees, statistics, NLP and ML are used in business intelligence and analytics, in neural networks for developing AI or bio-neural networks, and in rule-based AI systems for intelligent content development, database development, information retrieval and automatic grouping of web documents with enterprise search engines or open-source software in web and text mining.
The feature selection strategies used most often are i) wrapper, ii) filter and iii) embedded methods [28]; however, a study [29] has shown that supervised feature selection methods based on algorithms using the IG filter metric are more efficient than other techniques.
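The sketch below shows a filter-style supervised feature selection step of the kind discussed above; note that it uses the chi-square metric from scikit-learn as a readily available stand-in for IG, and the corpus, class labels and value of k are illustrative assumptions.

```python
# Filter-style supervised feature selection: score terms against class
# labels and keep the top-k features. Chi-square is used here only as a
# convenient stand-in for the IG metric favoured by the cited study.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["stock prices fall", "market shares rise",
        "football match tonight", "the team scored a goal"]
labels = [0, 0, 1, 1]                      # illustrative class labels
vec = CountVectorizer()
X = vec.fit_transform(docs)
selector = SelectKBest(chi2, k=4).fit(X, labels)
selected = selector.get_support(indices=True)
print([vec.get_feature_names_out()[i] for i in selected])
```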
III. Recent Literature
The bisecting k-means algorithm proposed by Steinbach, M., Karypis, G., & Kumar, V. [14] repeatedly splits a large cluster into smaller clusters to generate k clusters of high intra-cluster similarity, which can then be used to filter clusters and collect similar texts.
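A minimal recursive sketch of the bisecting idea follows; it is not the authors' implementation, and the split-selection rule (always split the largest cluster) and the random toy data are simplifying assumptions. Real implementations typically pick the cluster to split by size or sum-of-squared error and retry each split several times.

```python
# Minimal sketch of bisecting k-means: repeatedly pick the largest cluster
# and split it with 2-means until k clusters remain.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, random_state=0):
    clusters = [np.arange(X.shape[0])]           # start with one big cluster
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()                  # split the largest cluster
        km = KMeans(n_clusters=2, n_init=10, random_state=random_state)
        split = km.fit_predict(X[largest])
        clusters.append(largest[split == 0])
        clusters.append(largest[split == 1])
    return clusters

X = np.random.RandomState(0).rand(20, 5)          # toy feature matrix
for i, idx in enumerate(bisecting_kmeans(X, k=4)):
    print("cluster", i, "->", idx.tolist())
```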
A technique called CCA [30], widely used in machine learning and related fields, applies correlation to measure similar features in documents. However, CCA has its own limitations in clustering.
LPI [31], a spectral clustering approach based on a graph partitioning strategy, was proposed; however, it does not perform feature selection and inherits the existing problems of distance-based document clustering.
Frequent Term-based Clustering (HFTC) [32], an approach for document clustering, has been a topic of extensive research. However, it is not scalable to large document collections.
A technique known as Frequent Itemset-based Hierarchical Clustering (FIHC), proposed by Fung, B., Wang, K., & Ester, M., is discussed in [33]. Although FIHC performs better than HFTC, it underperforms in clustering efficiency compared to existing approaches such as UPGMA and bisecting k-means.
The TDC algorithm, based on closed frequent itemsets for clustering, was proposed by Yu, H., Searsmith, D., Li, X., & Han, J. [34]. The algorithm performs better than HFTC and FIHC; however, its reliance on closed itemsets is a drawback.
A strategy of Hierarchical Clustering using Closed Interesting Itemsets (HCCI), proposed by Malik, H.H., & Kender, J.R. [35], is among the best-performing itemset-based clustering methods available. However, the technique may cause information loss.
An approach based on a phrase-semantic similarity histogram by Gad and Kamel [36] combines text semantics with incremental clustering and measures document similarity to adjust the insertion order of documents into clusters for better quality.
An improved incremental clustering technique proposed by Gavin and Yue [37] improves the categorization of web data incrementally; based on multiple cluster-specific information measures, a new document is assigned to a cluster.
A concept-based approach for improving text clustering by Shehata, S., Karray, F., & Kamel, M. S. [38] outperforms existing techniques such as HAC and k-NN.
A progressive clustering algorithm by Liu, Y., Ouyang, Y., Sheng, H., & Xiong, Z. (2008) [39], based on the cluster average similarity area, determines cluster coherence and progressively assigns new data items to clusters.
A technique for enhancing clustering based on partial disambiguation of words by means of their PoS [40] is recommended by its developers, as it addresses the inefficiency of relying on synonyms and hypernymy when the sense of a word is disambiguated solely by PoS tags.
The CFWS technique proposed by Y. Li and S.M. Chung enhances document processing by considering word sequences in addition to individual words [41].
The nonlinear data representation technique by J.B. Tenenbaum, V. de Silva, and J.C. Langford [42] preserves local structure while optimizing a global objective; however, it has high computational complexity.
A study of approaches for reducing the complexity of feature extraction, based on a new approximation algorithm [43], [44], [45], reports promising results.
A software system for automatically retrieving information from websites, by Zamir, O. and Etzioni, O. [46], is designed for websites containing vast amounts of data.
An approach integrating clustering and feature selection for text clustering, based on the semantic relations of text documents with an ontology, was proposed by Thangamani, M. and Thangaraj, P. in [47]. The approach reduces dimensionality and improves feature selection.
A clustering technique that assesses clustering quality based on WordNet [48] noun phrases and semantic relationships [49] shows better performance with a hypernymy-based strategy than with other noun-phrase strategies.
A system for determining the ontology-related semantic relations of a term or word and the associated weight measure is given by K. Raja and C. Prakash Narayanan [9]. However, the technique suffers from dimensionality and other problems.
A description of ontology-based automatic categorization of web documents [50], and of the scope of ontologies for improving current machine learning and IR approaches, is given by Andreas Hotho. The integration of ontologies for combining various information types from multiple resources is described by Young-Woo et al. in [51].
The use of domain-specific ontologies for enhancing text classification performance, where text learning and IR are used to generate ontologies with minimal user interaction, is described in [52,53].
Methods utilizing the Wikipedia ontology to improve document representation and cluster quality were proposed by Gabrilovich and Markovitch [54], and a further extension provided a structure based on the Wikipedia guidelines and groups [55,56]. The Wikipedia ontology is especially relevant because it applies to a large cross-section of domains and is regularly updated.
A technique for feature selection in text clustering, based on supervised feature selection over intermediate clustering outcomes, by Xu, J. and Xu, B. [57], generates an efficient feature subset for classification. The suggested technique performs efficiently compared to a manual process.
A feature selection technique based on the ACO algorithm by M. Janaki Meena, K.R. Chandran, and J. Mary Brinda [58] is a unique method. Comparative tests of the approach with existing chi-square and CHIR techniques show that the proposed approach achieves better feature selection (FS) performance.
An entropy-based FS approach, i.e., a filter solution [59], tested with various data types, reduces dimensionality and is efficient at finding the subset of major features.
A feature co-selection method called MFCC (multitype feature co-selection), proposed by Shen Huang, Zheng Chen, Yong Yu, and Wei-Ying Ma [60], shows enhanced clustering performance for web documents based on intermediate clustering results.
A method that remodels the data similarity matrix as a bi-stochastic matrix prior to executing clustering algorithms, by F. Wang, P. Li, and A. C. König, showed better clustering performance [61].
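The small NumPy sketch below only illustrates the bi-stochastic target via Sinkhorn-style alternating row/column normalization; the cited work [61] instead learns the matrix through a constrained optimization, so this is an approximation of the idea rather than the authors' method.

```python
# Turn a nonnegative similarity matrix into an (approximately)
# bi-stochastic matrix by alternating row/column normalization.
import numpy as np

def sinkhorn_bistochastic(S, n_iter=100):
    K = S.copy().astype(float)
    for _ in range(n_iter):
        K /= K.sum(axis=1, keepdims=True)   # normalize rows
        K /= K.sum(axis=0, keepdims=True)   # normalize columns
    return K

rng = np.random.RandomState(0)
S = rng.rand(4, 4)
S = (S + S.T) / 2                            # symmetric toy similarity matrix
B = sinkhorn_bistochastic(S)
print(B.sum(axis=0), B.sum(axis=1))          # both close to 1
```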
Term-based document clustering techniques for dynamic environments are given in [11] by Wang, X., Tang, J., & Liu, H.; synonym- and hypernymy-based techniques by Bharathi and Vengatesan [62]; and synonym- and hyponym-based techniques by Nadig, R., Ramanand, J., & Bhattacharyya, P. in [12]. These approaches are, however, not applicable to technically similar documents.
A document clustering approach [63] based on phrases and the STC technique by O. Zamir, O. Etzioni, O. Madani, and R.M. Karp builds clusters from suffixes common to documents. Although efficient in terms of cluster quality, the method is associated with a high degree of term redundancy.
A study of the TF-IDF method of clustering [64], of term-frequency-based algorithms [65] and a review of clustering algorithms [66] showed that the majority of clustering approaches are TF-IDF based, though they are associated with several problems.
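For reference, the following short sketch computes one common variant of the TF-IDF weight, tf(t, d) * log(N / df(t)), on a toy tokenized corpus; the exact smoothing convention differs between implementations, so this is only an illustration.

```python
# Manual TF-IDF weighting for a toy corpus: tf(t, d) * log(N / df(t)).
import math
from collections import Counter

docs = [["web", "document", "clustering"],
        ["web", "search", "engine"],
        ["document", "retrieval"]]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequency

def tfidf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

for doc in docs:
    print(tfidf(doc))
```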
The NMF (nonnegative matrix factorization) technique for text classification [67] improved clustering performance compared with existing approaches [68]; the relationship of NMF techniques with earlier clustering techniques is studied in [69], [70], [71]. Established NMF techniques such as multiplicative updates [72] and projected gradients [73], though efficient, suffer from memory problems for huge datasets that are streamed rather than disk based [74]. To overcome these problems, approaches such as random projections [61,75] and sketch/sampling algorithms [76] have been proposed. An NMF-based technique by Li and Zhu in 2011 [77] for research-specific documents reduces high dimensionality, finds relevant topics for clustering and shows comparatively efficient classification performance. An online algorithm based on nonnegative matrix factorization [78] and an NMF-based method that uses weighted features and cluster similarity by Sun Park, Dong Un An, and Choi Im Cheon [79] perform comparatively more efficiently than other NMF-based strategies.
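The generic NMF-based clustering scheme discussed above can be sketched as follows with scikit-learn: factorize the TF-IDF matrix as X ≈ WH and assign each document to its dominant latent topic. The corpus, the number of components and the initialization are illustrative assumptions, not taken from any specific cited system.

```python
# Minimal NMF-based clustering sketch: X ≈ W H, with each document assigned
# to its dominant latent topic (row-wise argmax of W).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = ["stock market investment", "shares and market prices",
        "football league results", "the football team won"]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = nmf.fit_transform(X)           # document-topic weights
labels = np.argmax(W, axis=1)      # cluster = dominant topic
print(labels)
```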
IV. Conclusion
In this paper we analyzed several techniques developed for clustering documents, with their applications and relevance in terms of today's requirements. Although developing perfect strategies for classifying documents of varied forms and types toward a near-optimal solution, or finding accurate ways of assessing the quality of the resulting clustering, remains out of reach and is growing in complexity, the field today tackles demanding tasks such as generating granular taxonomies, sentiment analysis and document summarization to produce reliable and relevant insights applicable to several fields. In conclusion, document clustering will continue to be widely studied and will find relevance in a number of newer areas.

Clustering Technique with Feature Selection for Text Documents. Proceedings of the Int. Conf.
Google news personalization: Scalable online collaborative filtering. Proceedings of the 16th International Conference on World Wide Web (WWW), 2007. p. .
Ontology-based text clustering. Proceedings of the IJCAI-2001 Workshop Text Learning: Beyond Supervision, Seattle, USA.
On Spectral Clustering: Analysis and an Algorithm. Advances in Neural Information Processing Systems 2001. MIT Press. 14 p. .
Towards semantic web mining. Proceedings of International Semantic Web Conference (ISWC), 2002. p. .
Hierarchical document clustering using frequent Itemsets. Proceedings of SIAM International Conference on Data Mining, 2003.
Detect and track latent factors with online nonnegative matrix factorization. Proc. International Joint Conference on Artificial Intelligence, 2007. p. .
A survey of web clustering engines. ACM Comput. Surv 2009. 41 (3) p. .
On the equivalence of nonnegative matrix factorization and spectral clustering. Proceedings of the 5th SIAM Int'l Conf. Data Mining (SDM), 2005. p. .
Convex and seminonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence 2010.
Efficient Phrase-Based Document Similarity for Clustering. IEEE Transactions on Knowledge and Data Engineering, September 2008. 20 p. .
Projected gradient methods for nonnegative matrix factorization. Neural Computation 19 (10) p. .
Document Clustering Using Locality Preserving Indexing. IEEE Trans. Knowledge and Data Eng., Dec. 2005. 17 (12) p. .
Learning the parts of objects with nonnegative matrix factorization. Nature 1999. 401 p. .
Algorithms for nonnegative matrix factorization. Advances in Neural Information Processing System (NIPS), 2000. p. .
Canonical Correlation Analysis: An Overview with Application to Learning Methods. J. Neural Computation 2004. 16 (12) p. .
Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. Proc. of The 20th Intl. Joint Conf. on Artificial Intelligence, 2007.
Relation between PLSA and NMF and implications. Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2005. p. .
Machine Learning in Automated Text Categorization. ACM Computing Surveys March 2002. 34 (1) .
Frequent Term-based Text Clustering. Proc. of Intl. Conf. on Knowledge Discovery and Data Mining, 2002.
Document clustering in research literature based on NMF and testor theory. Journal of Software 2011. 6 (1) p. .
Document clustering using nonnegative matrix factorization. Information Processing and Management 2006. 42 (2) p. .
Learning a bistochastic data similarity matrix. Proceedings of the 10th IEEE International Conference on Data Mining (ICDM), 2010.
Efficient non-negative matrix factorization with random projections. Proceedings of the 10th SIAM International Conference on Data Mining (SDM), 2010. p. .
Improving information retrieval using document clusters and semantic synonym extraction. Journal of Theoretical and Applied Information Technology 2012. 36 (2) p. .
Wordnet: A lexical database for English. CACM 1995. 38 (11) p. .
Term-weighting approaches in automatic text retrieval. Information Processing & Management 1998. 24 (5) p. .
High Quality, Efficient Hierarchical Document Clustering Using Closed Interesting Itemsets. Proc. of IEEE Intl. Conf. on Data Mining, 2006.
Scalable Construction of Topic Directory with Nonparametric Closed Termset Mining. Proc. of Fourth IEEE Intl. Conf. on Data Mining, 2004.
Concept decompositions for large sparse text data using clustering. Machine Learning, 2001. 42 p. .
A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 2000. 290 p. .
Enhancing Text Clustering by Leveraging Wikipedia Semantics. Proc. of 31st Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2008.
Document clustering and cluster topic extraction in multilingual corpora. Proceedings of the 1st IEEE International Conference on Data Mining (ICDM), 2001. p. .
Candid Covariance-Free Incremental Principal Component Analysis. IEEE Trans. Pattern Analysis and Machine Intelligence Aug.2003. 25 (8) p. .
A new feature selection method for text clustering. Wuhan University Journal of Natural Sciences 2007. 12 p. .
IMMC: Incremental Maximum Margin Criterion. Proc. 10th ACM SIGKDD, 2004. p. .
Exploiting noun phrases and semantic relationships for text document clustering. Information Science 2009. 179 p. .
On Successive Learning Type Algorithm for Linear Discriminant Analysis. IEIC Technical Report 1999. 99 p. . (in Japanese)
Feature Selection for Clustering: A Filter Solution. Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM'02), © 2002 IEEE.
Text document clustering based on frequent word meaning sequences. Data & Knowledge Engineering 2008. 64 p. .
Data Stream Clustering: Challenges and Issues. Proceedings of the International Multiconference of Engineers and Computer Scientists IMECS 2010, Hong Kong.
Integrating swarm intelligence and statistical data for feature selection in text categorization. International Journal of Computer Applications 2010. 1 (11) p. .
Web Document Clustering: A Feasibility Demonstration. Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval.
Grouper: A Dynamic Clustering Interface to Web Search Results. Computer Networks 1999. 31 p. .
One sketch for all: Theory and application of conditional random sampling. Advances in Neural Information Processing System (NIPS), 2008. p. .
Integrated clustering and feature selection scheme for text documents. J. Comput. Sci. 6, p. 536. DOI: 10.3844/jcssp.2010.536.541.
Providing QoS with the Deficit Table Scheduler. IEEE Transactions on Parallel and Distributed Systems 2010. 21 (3) p. .
Automatic evaluation of WordNet synonyms and hypernymy. Proceedings of ICON-2008, 6th International Conference on Natural Language Processing, India, 2008.
Indexing by Latent Semantic Analysis. J. Am.Soc. Information Science 1990. 41 (6) p. .
Enhancing an incremental clustering algorithm for Web page collections. ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, 2009. p. .
Recent Advances in Clustering: A Brief Survey. WSEAS Trans. Information Science and Applications 2004. 1 (1) p. .
An efficient concept-based mining model for enhancing text clustering. IEEE Transactions On Knowledge And Data Engineering 2010. 22 (10) p. .
Restrictive Clustering and Metaclustering for Self-Organizing Document Collections. Proc. Int'l Conf. Research and Development in Information Retrieval, July 2004. p. .
Latent Semantic Indexing (LSI) and TREC-2. Proc. Second Text Retrieval Conf. (TREC), 1993. p. .
Document Clustering Method Using Weighted Semantic Features and Cluster Similarity. Third IEEE International Conference on Digital Game and Intelligent Toy Enhanced Learning (DIGITEL), 2010. p. .
Efficient streaming text clustering. Neural Networks 2005. 18 (5-6) p. .
Multitype Features Coselection for Web Document Clustering. IEEE Transactions on Knowledge and Data Engineering, April 2006. 18 (4).
Incremental clustering algorithm based on phrase-semantic similarity histogram. Proceedings of the Ninth International Conference on Machine Learning and Cybernetics, 2010. 11 p. .
Document Clustering by Concept Factorization. Proc. Int'l Conf. Research and Development in Information Retrieval, July 2004. p. .
Exploiting Wikipedia as External Knowledge for Document Clustering. Proc. of Knowledge Discovery and Data Mining, 2009.
Cluster-based retrieval using language models. Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2004. p. .
Survey of Clustering Algorithms. IEEE Transactions on Neural Networks 2005. 16 (3) p. .
Document clustering via matrix representation. 11th IEEE International Conference on Data Mining (ICDM 2011), 2011. p. .
An Incremental Algorithm for Clustering Search Results. IEEE International Conference on Signal Image Technology and Internet Based Systems, 2008. p. .
Text Document Clustering Based on Frequent Word Sequences. Proceedings of the CIKM, Bremen, Germany.
A survey paper on concept based text clustering. International Journal of Research in IT & Management 2011. 1 (3) p. .
A Comparative Study on Feature Selection in Text Categorization. Proc. 14th Int'l Conf. Machine Learning, 1997. p. .