Cluster-based Retrieval System for News Groups Data: Comparative Analysis

NB: We do not resell your papers. Upon ordering, we do an original paper exclusively for you.

NB: All your data is kept safe from the public.

Introduction

Abstract

The main aim of this paper is to build a cluster-based retrieval system for categorizing the news groups data and performing a comparative performance analysis using hard and soft clustering methods. Hard clustering being the most popular method, where a data point is given a hard assignment to just one cluster, eg. k-means, hierarchical clustering,etc. On the other hand soft clustering is where the datapoint can belong to more than one cluster, eg: fuzzy clustering, latent semantic analysis, etc. This paper will explore both the types of clustering methods with a comparative performance analysis and evaluation has been performed on the semantic retrieval system by using a test collection of queries.

Motivation

To explore soft clustering techniques on newsgroups data. Even though the most popular clustering methods are ones with hard assignment, the fact that a particular document text could be related to more than one category is lost in case of hard clustering. Thus, the motive was to do comparative analysis on these clustering techniques and build the clustering based retrieval system. Building an IR that meets the user’s needs and intent is always challenging and hence was motivated towards building the semantic retrieval system, as it gives the flexibility to the user with respect to the categorization of topics and get to the similar relevant documents as well.

Potential benefits

The potential benefits of having this system is that it leads to better categorization of news data and would be helpful for building new techniques that could nullify the cons of the hard and soft clustering methods and for better understanding with respect to the semantics of the documents.

Clustering

Clustering, as the name suggests is the grouping of similar objects together, where objects in the same group are more similar to each other than the objects in other groups. It is an unsupervised machine learning technique, which learns from the data about which data points are similar to each other based on various clustering techniques. Text clustering deals with clustering of text documents(unstructured data), where similar documents will be clustered together.

Automatic document organization, topic extraction, information retrieval and filtering have one thing in common, which is text clustering (“What is Text Clustering?”, 2018).

How it works?

Descriptors, set of words that describe the topic matter are extracted from the document, after which they are analyzed with respect to frequency in which they are found in the document compared to other terms (“What is Text Clustering?”, 2018). Then, the clusters of descriptors are identified and then auto-tagged.

Cluster-based retrieval system

A system that discovers semantically similar terms in the documents, with the hypothesis that documents in the same cluster behave similarly with respect to the relevance in information needs (“Introduction to Information Retrieval”, n.d., p. 350). According to the cluster hypothesis, clustering can increase the efficiency and effectiveness of the retrieval system.

Dataset

“20 Newsgroups” (n.d.) dataset is a collection of 20,000 documents partitioned evenly across the 20 news categories related to technology, religion, politics, recreation, science and miscellaneous categories. These topics can be viewed as a class. In order to perform clustering, we assume as if the labels are not available and find the clusters among the documents, where documents in each group are more similar to each other compared to the documents in other groups. In this paper, the main intent is to evaluate how well the known classes of the dataset are reconstructed using the clustering methods. The following shows the 20 news categories grouped together into 6 main categories that includes Technology, Recreation, Science, Politics, Religion and Misc.

comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
misc.forsale talk.politics.misc
talk.politics.guns
talk.politics.mideast
talk.religion.misc
alt.atheism
soc.religion.christian

Literature Review

Existing work

A lot of comparative studies are performed between hard and soft clustering methods. An unsupervised study is performed where data of similar types are put into one cluster, while data of another types are put into different clusters. Fuzzy C means is a very important clustering technique based on fuzzy logic, showing an experimental comparative study between fuzzy clustering algorithm and the K-means(hard) clustering algorithm (Bora & Gupta, 2014, p.108).

This study has made more research on the two methods and concluded that K-means lesser complexity compared to FCM with respect to the computational time. As, fuzzy clustering involves more fuzzy logic, so its computational time increases comparatively. The choice of clustering method depends on the data we want to cluster. In some situations, we cannot directly consider the data belongs to only one cluster, it may be possible that some data’s properties may contribute to more than one cluster. Soft clustering has proved to perform better for noisy data, and hence used in a number of real-life applications. Chen(2017) informs about how soft clustering is useful in handling very large data sets and in what kind of applications it comes into play. In comparison to hard clustering, it is referred to as more realistic due to the ability of handling impreciseness, uncertainty and vagueness for real-world problems (Chen, 2017, p.102).

A work on evaluating hard and soft-flat clustering methods for text documents shows that fuzzy clustering showed better results than K-means for most of the datasets and hence is also referred to as a more stable method (Singh, Siddiqui & Singh, 2012, p. 102). A lot of analysis is performed on the 20 newsgroups as the data is huge and there is a lot of scope to perform exploratory data analysis on the text documents. Liu & Croft (2004) has used language modelling to show that cluster-based retrieval can perform consistently across collections of realistic size and significant improvements over document-based retrieval in a fully automated manner and without relevance information provided by humans (Liu & Croft, 2004, p. 286).

Proposed solution

The main aim of this paper is to build a cluster-based retrieval system for the newsgroups data by performing a comparative analysis on hard and soft clustering techniques like k-means, hierarchical, FCM and latent semantic analysis and evaluating the semantic retrieval system. The newsgroups data contains about 20 sub-categories where sub-categories are grouped into six main news categories like technology, religion, recreation, politics, science and misc. Performance across the clustering methods are compared by using clustering validation techniques such as external validation by validating the clusters against the ground truth available in the dataset. Evaluation of the retrieval system is determined using metrics such as recall and precision.

Methodology

To build the clustering retrieval system on applying the clustering techniques on the news data, with the main categories as: Technology, Science, Politics, Recreation, Religion and Misc.

The steps involved are as follows:

● Data pre-processing

The initial step in dealing with unstructured data is to pre-process the text documents. The data-preprocessing of the documents involves a series of steps.

Tokenization, which involves removing punctuations and standardizing the tokens to all lower case letters.
Removal of stop words like conjunctions will be removed.
Performing stemming of words using the standard Porter’s stemmer algorithm.

The atheism news category which contains around 799 documents have been pre-processed in Python following the above steps which gives a vocabulary size of around 10656. Below, is the screenshot which shows a subset of the vocabulary and the numbers for just the atheism category:

● Build the term-document matrix

Once the data has been pre-processed, we will have the vocabulary of words using which the term-document matrix can be built, which describes the frequency of terms that occur in a collection of documents. This is one of the scoring measures that is widely used in information retrieval or summarization. Other methods could be tf-idf (term-frequency-inverse document frequency), whose values shows how relevant is the term in the given document. The tf-idf of a term in the document indicates the importance of the term in that particular document. Later, tf-idf technique can be used to compare with the performance of term-frequency.

● Apply hard clustering algorithms

Apply hard clustering algorithms: k-means and hierarchical clustering on the text documents using the built term document matrix. K-means: K-means is one of the simplest and popular unsupervised machine learning algorithms. It starts with an initial group of randomly selected centroids and performs repetitive operations to optimize the positions of the centroids (Garbade, 2018). Here, the value of K (number of clusters) is pre-defined or determined using the elbow method.

HAC: Hierarchical Agglomerative clustering is a bottom-up clustering method, which treats each document as a singleton cluster and then successively merge pairs of clusters until all clusters have been merged into a single cluster that contains all of the documents (“Introduction to Information Retrieval”, n.d., p.378). Hierarchical clustering creates a hierarchy with some explicit structure which is computationally expensive for big data, as its complexity is quadratic.

● Apply soft clustering methods

On the other hand, soft clustering algorithms are applied to the news groups data. Soft clustering algorithms like latent semantic analysis and fuzzy c-means are applied. Applying soft clustering, helps us identify the topics that are covered by each document, rather than hardly assigning the document to just one cluster.

Latent Semantic Analysis: LSA is a text mining dimension reduction technique, where each document is assigned a set of topic loadings (Boling & Das, 2015, p. 9). LSA learns latent topics by performing a matrix decomposition on the term-document matrix using Singular value decomposition. It is typically used as dimension reduction technique (“Latent Semantic Analysis using Python”, n.d.).

Fuzzy-c means clustering: Fuzzy clustering, a soft clustering method in which each element has a probability of belonging to each cluster. FCM is one of the widely used fuzzy clustering algorithms, where the centroid of the cluster is calculated as the mean of the points, weighted by their degree of belonging to the cluster (“Fuzzy Clustering Essentials”, n.d.).

● Comparative analysis

Having applied hard and soft clustering, a comparative analysis on both hard and soft clusters is performed with respect to the external validation results. As the ground truth is available in the provided dataset, the comparison can be shown based on the accuracy of both clustering types.

● Build the clustering retrieval system

After building the clustering models and comparative analysis, the information retrieval system with the clustered results on querying is built. On searching a query, the relevant documents are retrieved with the clusters related to those user’s query, which helps the user to navigate to the topic of concern, thus considering both generality and specificity of the user’s information needs.

● User interface for the IR system

Building a user interface for the retrieval system, showing up the most relevant documents with the clusters relevant to the user’s query, giving the flexibility to the user for navigating to the topic(cluster) of concern.

● Evaluation of the retrieval system

In order to perform evaluation of the retrieval system, a test collection with queries and relevant documents needs to be prepared. Making use of the test collection, the two frequent and basic measures of information retrieval effectiveness are precision and recall (“Introduction to Information Retrieval”, n.d., p.155).

Precision: Precision is the fraction of retrieved documents that are relevant.

P(relevant/retrieved) = #(retrieved items)

#(relevant items retrieved)

Recall: Recall is the fraction of relevant documents that are retrieved.

P(retrieved/relevant) = #(relevant items)

#(relevant items retrieved)

Deliverables

Document collection and pre-processed data with the vocabulary of words built.
Term-document matrix for the vocabulary of terms and the documents.
Clusters generated from the hard clustering methods on the news groups data.
Clusters generated on applying the soft clustering methods like LSA and FCM.
Comparative analysis on the above clusters generated with hard and soft clustering using external cluster validation (ground truth of the data).
The retrieval system that shows up relevant documents with the clustered results.
Prepare the test collection of queries and perform evaluation of the semantic retrieval system.

Timeline

Feasibility
Challenges/barriers
Dealing with soft clustering algorithms, which could be computationally expensive on dealing with big data compared to hard clustering. Building the test collection of queries for evaluation of the semantic retrieval system.

Project Scope

The scope of this project is within RIT campus and can be used as an add-on to understanding the comparative performance of hard and soft clustering, for building better text-clustering models.

Software

With respect to the software, Python will be used as the programming language for building the clustering models and retrieval system, as this deals with huge data. The clustering models and term-document matrix will be pre-loaded, which can be used for the retrieval system.

For the user interface, Javascript can be used to build a simple user interface for showing the demo of the retrieval system, which in-turn queries the models built using Python.

Fig: Tentative timeline for the project

References

Bora, D. J., & Gupta, D. A. K. (2014). A Comparative study Between Fuzzy Clustering Algorithm and Hard Clustering Algorithm. International Journal of Computer Trends and Technology, 10(2), 108–113. doi: 10.14445/22312803/ijctt-v10p119.
Chen, M. (2017). Soft Clustering for Very Large Data Sets. International Journal of Computer Science and Network Security, January, 2017, 17(1).
Singh, V. K., Siddiqui, T. J., & Singh, M. K. (2012). Evaluating Hard and Soft Flat-Clustering Algorithms for Text Documents. Advances in Intelligent Systems and Computing Proceedings of the Third International Conference on Intelligent Human Computer Interaction (IHCI 2011), Prague, Czech Republic, August, 2011, 63–76. doi: 10.1007/978-3-642-31603-6_6.
Liu, X., & Croft, W. B. (2004). Cluster-based retrieval using language models. Proceedings of the 27th Annual International Conference on Research and Development in Information Retrieval – SIGIR 04. doi: 10.1145/1008992.1009026.
20 Newsgroups. (n.d.). Retrieved from http://qwone.com/~jason/20Newsgroups/
What is Text Clustering? (2018, July 27). Retrieved from https://insidebigdata.com/2018/07/26/what-is-text-clustering/
“Introduction to Information Retrieval” (n.d.). Retrieved from https://nlp.stanford.edu/IR-book/
http://qwone.com/~jason/20Newsgroups/
https://insidebigdata.com/2018/07/26/what-is-text-clustering/
https://nlp.stanford.edu/IR-book/html/htmledition/clustering-in-information-retrieval-1.html
Boling, C., & Das, K. (2015). Reducing Dimensionality of text documents using Latent Semantic Analysis. International Journal of Computer Applications, February, 2015, 112(5). Latent Semantic Analysis using Python. (n.d.). Retrieved from https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
Garbade, M. J. (2018, September 12). Understanding K-means Clustering in Machine Learning. Retrieved from https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1
Fuzzy Clustering Essentials. (n.d.). Retrieved from https://www.datanovia.com/en/lessons/fuzzy-clustering-essentials/
https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1
https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1
https://www.datanovia.com/en/lessons/fuzzy-clustering-essentials/

Click Here To Order Now!