Critical Analysis of the Use of Profanity

Introduction

1.1 Background of the Research Problem

The use of profanity is not an uncommon phenomenon. The term is also known as swearing, obscenity, foul language, taboo words, and the like; in this research, all of those terms are included under profanity. People of various ages, educational backgrounds, cultural backgrounds, and social statuses use and/or understand profanities, although they are usually cautious in using them, since profanity is seen as rude and offensive.

The term profanity has undergone a semantic shift. In the past, profanity, with its plural form profanities, referred to acts of disrespect towards godly things. Today, however, it also refers to the use of particular forms of a language that some people in a culture judge as intolerable in certain settings (Bergen, 2016:15). For instance, the word suck would be acceptable in the sentence "that little girl always sucks on her thumb", but it would be deemed improper if a newscaster in a broadcast stated "our economy sucks".

For years, there has been much research on profanities and their usage. Many researchers found that despite the negative meanings behind them, the use of profanities is actually contextual. The main purpose of using profanities is to express emotion, especially anger and frustration (Jay & Janschewitz, 2008:267). However, there are also studies proving that they can be used to express positive emotions: profanities can be used in utterances to show surprise, joy, sadness, love, thankfulness, or solidarity (Wang, 2013:75; Wang et al., 2014:424).

In addition, in a preliminary observation, the researcher likewise found that profanities can be used to convey positive emotions, particularly when complimenting someone. In linguistic study, this is known as the compliment speech act.

For example, in the television series How I Met Your Mother, Season 7 Episode 1 at 15:55, the profanity son of a bitch is used. One of the main characters, Lily, asked her husband, Marshall, to keep her pregnancy secret for a while. However, Marshall, who encountered many babies at a party they attended, could not hold the secret any longer. Lily uttered the profanity while complimenting the baby Marshall was holding for being cute (Alexander & Fryman, 2010):

  • Lily: Marshall, stop it. Stay strong. You can’t let holding some, some little baby, oh with the cute little cheeks and the…the little arms and little legs. This son of a bitch has knee dimples.
  • Marshall: (smiles and laughs)
  • Lily: Let’s go tell everyone I’m pregnant.
  • Marshall: (laughs) Yeah!

The phrase son of a bitch, according to the Oxford Online Dictionary (2019), is "used as a general term of contempt or abuse". Nevertheless, in the example above the phrase was not meant as contempt or abuse directed at the baby, as the context and the surrounding utterance show. Lily described how cute she thought the baby was by stating "…with the cute little cheeks, little arms and little legs". Marshall's response of smiling and laughing indicates that he understood that the phrase son of a bitch was not intended as an insult and that the whole utterance was a compliment for the baby.

In another television series, Brooklyn 99, profanities are also used in compliment speech acts. Brooklyn 99 is a comedy series about police officers. In Season 4 Episode 15, while the two main characters, police officers Jake and Charles, were chasing a criminal, Charles revealed that he had planted a GPS tracker on the criminal. Jake complimented him, and in his compliment speech act, he inserted the word bastard (Campbell & Mendoza, 2017).

  • Charles: I planted a bug with a GPS tracker on it.
  • Jake: Chip Rockets, you beautiful bastard!
  • Charles: (smiles and lifts his shoulders)

Chip Rockets is Jake’s nickname for Charles. Jake, who was happy that Charles had done something to help them catch the criminal, said Chip Rockets, you beautiful bastard! The word bastard is a profanity, a derogatory term for a person whose parents are not married to each other (Oxford Online Dictionary, 2019). However, given the context of the conversation, in which Jake is happy with what Charles did, as well as Jake's expression and the word beautiful, Charles and the viewers understood that Jake meant to compliment him. This was confirmed by Charles' response: he smiled and lifted his shoulders, showing that he was happy with and proud of the compliment.

Not only is this found in fiction, but researchers have also found the use of profanities in compliment speech acts in real-world interaction, one venue being the social media platform Twitter. Twitter is the second most popular social media website, with 326 million monthly active users (Maina, 2016; "Number of Monthly Active", 2018). It allows its users to interact through messages called tweets.

In some tweets, the researcher found samples of profanities used in compliment speech acts. For example, Noah Centineo, a celebrity with a verified account, tweeted on August 30, 2018,

Picture 1. Noah Centineo’s tweet

Centineo did not refer to anybody in particular in that tweet. Nevertheless, judging from the replies, other Twitter users knew that it was meant as a compliment; for example, one reply said I know thank you. The word fuck is a profanity which, as an exclamation, is used to "express annoyance, contempt, or impatience" (Oxford Online Dictionary, 2019). In this context, however, Centineo used it to show his surprise at how cute someone is. Most of the replies thanked him and acted as if the tweet was directed at them, which shows that the utterance is a compliment speech act.

Another example is from another user, @KaijuGreaser, who on November 28, 2018, tweeted,

Picture 2. @KaijuGreaser’s tweet

The user @KaijuGreaser tweeted pictures of her dog bothering her while she was working on her laptop. She wrote "UM EXCUSE ME YOU LITTLE FUCKER I'M TRYING TO GET SOME WORK DONE HERE" in all capital letters. People usually write in all capitals on social media to show strong emotion; here, however, the strong emotion only emphasizes the compliment speech act being uttered. The whole utterance compliments her dog's antics. She also used the profanity fucker, which, according to the Oxford Online Dictionary (2019), is an abusive term for a stupid person. In this utterance, on the other hand, she used it as a noun to refer to her dog without any malicious meaning, as emphasized by the adjective cute and by the replies to the tweet, which suggest that other users knew the tweet was meant as a compliment.

The findings from previous research and from the researcher's preliminary observation confirm a language phenomenon in which profanities also convey positive emotions; in particular, they can be used to emphasize compliment speech acts. This fact piqued the researcher's interest. To date, the researcher has not found any study on the use of profanities in compliment speech acts, which gives this research its novelty. Hence, the researcher decided to conduct this study, entitled Profanities in Compliment Speech Acts on Twitter.

1.2 Scope and Limitation of the Research

Many researchers have studied the use of profanities, with different perspectives and limitations. In this research, as implied in the background of the research problem, the researcher uses a pragmatic perspective, studying the use of profanities based on the context and the utterances of the speakers. The researcher also limits the study to English profanities in compliment speech acts found on Twitter.

This research focuses on three aspects: the linguistic forms of the profanities, the pragmatic meanings of the profanities, and the pragmatic functions of the profanities. To find out the linguistic forms of the profanities, the researcher uses the theories of Jurafsky and Martin (2005:3), Cruse (2006:190), Richards and Schmidt (2010:81), Slawson et al. (in Southern Writing Center, 2011:11), and Ramlan (in Giyatmi et al., 2017:67). To identify the role of the profanities in clause-formed utterances, the researcher also utilizes the theories of Gerott and Wignell (1994:52-73) and Butt et al. (2000:52-55). Meanwhile, the themes of the profanities found are determined using the swearing themes of Ljung (2011:35).

The second aspect of this study is the pragmatic meanings of the profanities. Their identification is based on the theory of meaning by Kreidler (2002:49). In his book, Introducing English Semantics, Kreidler lists four types of meaning: lexical meaning, grammatical meaning, linguistic meaning, and utterance meaning. His theory is used in this study to determine the pragmatic meanings of the profanities found in compliment speech acts on Twitter.

Finally, the last aspect is the pragmatic functions of the profanities. The researcher uses the theory of functions of profanities by Ljung (2011:30), who mentions fourteen functions grouped into three categories: stand-alone, slot fillers, and replacive swearing.

1.3 Identification of the Problem

In this study, the researcher is concerned with the profanities used in compliment speech acts on the social media platform Twitter. Based on the background of the research problem and the scope of the research, the following research questions can be derived:

  1. What are the linguistic forms of the profanities used in compliment speech acts on Twitter?
  2. What are the pragmatic meanings of the profanities used in compliment speech acts on Twitter?
  3. What are the pragmatic functions of the profanities used in compliment speech acts on Twitter?

1.4 Objective of the Research

Based on the statements of the research problem above, some objectives of this research can be drawn. The objectives of this research can be stated as follows:

  1. To identify the linguistic forms of the profanities used in compliment speech acts on Twitter
  2. To find out the pragmatic meanings of the profanities used in compliment speech acts on Twitter
  3. To explain the pragmatic functions of the profanities used in compliment speech acts on Twitter

1.5 Significance of the Research

The significance of this research can be viewed from two perspectives. As a theoretical contribution, this research helps readers, especially linguistics students, to better understand profanities. It is hoped to contribute to students' understanding of profanities found in compliment speech acts.

On the other hand, as a practical contribution, this research can hopefully encourage English teachers, especially EFL (English as a Foreign Language) teachers, to teach this subject to their students. Since profanity is also used to express positive emotions, such as in compliment speech acts, and the closest English-speaking environment for EFL students is social media, including Twitter, this research can hopefully contribute to students' better communication with English speakers from other parts of the world.

Usage of Profanity by Malaysian Teenagers: Analytical Essay

Research Methodology

3.0 Introduction

This chapter describes the procedures and methodology utilized in the course of this research. The research design, participants, sampling method, data collection methods, research instruments, and data analysis procedures are explained in this chapter.

3.1 Research design

3.1.1 Pilot Study

To determine the feasibility of this research, a pilot study was conducted in March 2018 as a preliminary study of whether Malaysians use bad language in English and whether Facebook could serve as a data source. The pilot study focused on a small portion of the main data: one picture, shared by the admin of the "Only in Malaysia" page on Facebook, was purposively chosen, since not all the pictures shared by the admin triggered and stimulated netizens' emotions. The selected picture, which provoked both anger and surprise among Malaysian netizens, was a group photo of a lecturer who condemned Tun Dr. Mahathir, together with Tun Dr. Mahathir, his wife, and some other people. Netizens left 129 comments on the picture, and 20 of these included profanity, both as single words, in combination with other words, and as phrases. The profanity used by the netizens was investigated using Jay's (2009) model. This preliminary study was thus used to gauge the suitability and usefulness of the framework, the model, and the procedure for conducting the present study. The result of the pilot study indicated that one approach could not cover the range of bad language found in the data; thus, it was more relevant to use a combination of approaches encompassing Pinker's (2010) five categories of using profanity.

3.1.2 Mixed-Method Approach

This study used a mixed-methods design (Tashakkori & Teddlie, 2003), a procedure for collecting, analyzing, and mixing both quantitative and qualitative data at some stage of the research process within a single study, in order to understand a research problem more completely (Creswell, 2002). The rationale for mixing is that neither quantitative nor qualitative methods are sufficient by themselves to capture the trends and details of the situation, namely the usage of profanity by Malaysian teenagers. When used in combination, quantitative and qualitative methods complement each other and allow for a more complete analysis (Greene, Caracelli, & Graham, 1989; Tashakkori & Teddlie, 1998).

This study used a descriptive mixed-methods design. In the quantitative phase, questionnaires were distributed using a web-based survey, and the data were subjected to a discriminant function analysis. In the qualitative phase, a case study approach was used to collect text data through a Facebook page.

3.2 Sample

Reid (2018) described the population in a study as all units possessing certain characteristics that are of interest to the researcher. From this definition, the population can be understood as the targeted community or group of people selected by the researcher for a study. Therefore, for this study, the population from which the samples were derived consists of the following groups of participants:

  • Malaysian teenagers
  • Facebook users who leave comments containing profanity on "Only in Malaysia" postings

In this study, the researcher employed the simple random sampling method for the selection of the participants.

3.3 Instrumentation

The present study employed mixed methods, that is, more than one approach to investigate the research questions. According to Tayyebian (2015), using more than one approach helps the researcher enhance confidence in the findings; moreover, employing two or more independent measurement processes that confirm a proposition may reduce the uncertainty and ambiguity of the interpretation. Hence, combining discourse analysis of data taken from Facebook with the questionnaire helped the researcher obtain more valid data and more reliable results. The questionnaire was used not only to support the findings of this study but also to measure and examine the use of profanity among teenagers in Malaysia. The data obtained from the Facebook page were especially effective in obtaining culturally specific information about values, behaviors, and social contexts among Malaysians on Facebook.

3.4 Data Procedures

Figure 1: Collecting Data Procedures

The data for the study were collected from a Facebook page called "Only in Malaysia" over one year, from April 2018 to April 2019. Each topic was initiated by a picture uploaded and shared by the admin of the page, the person controlling the page, who shared everyday topics concerning Malaysia. Comments were collected for the study when they contained profanity.

In collecting the data, the researcher first liked the page named “Only in Malaysia” where Malaysians can share and express their ideas and emotions about events, news, photos, and issues concerning Malaysia. By liking this page, the researcher was able to trace news and photos shared by the admin of the page and have access to all the comments shared by the members of “Only in Malaysia” as well as to find more information about the variety of bad words used by these netizens through a large corpus.

Secondly, the researcher asked permission from the page admin to use the topics shared by him as well as the members' comments related to the present study, with the assurance that information about the users of bad language would remain confidential. After getting permission from the page administrator, the researcher could conduct an ethical data collection.

The researcher followed everyday topics, debates, status messages, and comments shared by either the members or the administrator. Over the year, it was found that profanity was used noticeably more, and at a higher rate, on some topics than on others. There were also topics for which the members did not use bad words at all, because those topics did not stimulate or trigger the emotions of Malaysian netizens.

In the next phase of this study, comments written by the members of "Only in Malaysia" were identified as containing profanity if they possessed the following characteristics:

  1. They were considered as swear words, curse words, obscene and vulgar terms, profane, and blasphemous terms, insults and slurs, epithets, and slang language (Jay, 1990)
  2. Words related to taboo themes, words related to organs and acts of sex, defecation, death, killing, bodies and their effluvia as well as food leftovers
  3. Expletive swearing including the moderate expletive, euphemistic expletive, and taboo expletive
  4. Abusive swearing related to ritual insults, name-calling, unfriendly suggestions, and sarcastic expressions
  5. Auxiliary swearing
  6. Humoristic swearing.

For storing the basic data, the researcher created a database table to store the data collected from each sample:

  1. User IDs for future reference
  2. Comments containing profanity, including words, phrases, and expressions
  3. Profanity words saved within their full comments, to be analyzed in their context of use according to Pinker's (2010) model.

3.5 Data analysis

3.5.1 Facebook Data

In order to obtain rich data for this study, a descriptive approach was used. The descriptive approach is a research technique for the objective, systematic, and quantitative description of the manifest content of communication, used to investigate messages and reduce them into categories (Rosenberry & Vicker, 2009). According to Zhang and Wildemuth (2009), qualitative content analysis pays attention to unique themes that illustrate the range of meanings of the phenomenon rather than the statistical significance of the occurrence of a particular text or concept. Using a qualitative content analysis guided by theory, this study examined the profanity words found on a Facebook page called 'Only in Malaysia'. Postings by the admin of the page from April 2018 to April 2019 were analyzed, and the researcher identified comments containing profanity posted by the netizens. Through this approach, the researcher could observe the lexical categories of the profanity words as well as the contexts in which Malaysians used them.

A framework analysis approach was used in the study, since it is a more advanced method consisting of several stages: familiarization, identifying a thematic framework, coding, charting, mapping, and interpretation. The lexical categories of the profanity words found in this study were analyzed using Jay's (2009) model, which distinguishes profanity referents into nine categories, namely sexual references, profanity or blasphemy, scatological and disgusting objects, animal names, ethnic-racial-gender slurs, psychological-physical-social deviations, ancestral allusions, substandard vulgar terms, and offensive slang. Sexual references relate to sexual acts (e.g. fuck), sexual anatomies (e.g. cock, dick, cunt), and sexual deviations (e.g. motherfucker, cocksucker). Profane and blasphemous swear words refer to religious terms (e.g. Jesus Christ or damn), while scatological and disgusting objects refer to feces (e.g. crap), excretion organs (e.g. asshole), excretion processes (e.g. shitting), and body products (e.g. piss). Profanity may also take the form of animal names (e.g. bitch, monkey) and ethnic-racial-gender slurs (e.g. nigger, fag). Psychological-physical-social deviations are also often used as profanity (e.g. moron, pox, whore). Ancestral allusions are profanities that involve or relate to family relationships and ancestors (e.g. son of a bitch, bastard). Substandard vulgar terms are vulgar words whose constructions are below the satisfactory standard of language (e.g. on the rag, fart face). Lastly, offensive slang refers to offensive substandard words invented to ease communication (e.g. bang, suck).

After gathering the data and compiling a list of profanity words, each profanity word was first checked for its frequency of occurrence in the data. The total number of bad words was counted manually; the count of each bad word was then divided by this total and multiplied by 100 to work out the percentage of each bad word used by Malaysian netizens on Facebook. A higher percentage indicates more frequent use of a bad word.
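The frequency calculation described above can be sketched as a short script. This is a minimal illustration only; the word sample below is hypothetical, not taken from the study's actual data.

```python
from collections import Counter

def profanity_percentages(found_words):
    """Count each bad word and express its count as a percentage of
    all bad-word occurrences, as in the frequency analysis above."""
    counts = Counter(found_words)
    total = sum(counts.values())
    return {word: round(count / total * 100, 2)
            for word, count in counts.items()}

# Hypothetical sample of bad words extracted from comments:
sample = ["damn", "damn", "shit", "crap"]
print(profanity_percentages(sample))  # damn: 50.0, shit: 25.0, crap: 25.0
```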

Next, each comment was examined individually to gain a good comprehension of each bad word in its comment. Each comment was analyzed for its lexical categories, and the bad words in each comment were examined thoroughly for their characteristics with respect to Jay's (2009) model (Table 1).

Table 1: Jay’s (2009) Model

  • Sexual references: sexual acts (e.g. fuck); sexual anatomies (e.g. cock, dick, cunt); sexual deviations (e.g. motherfucker, cocksucker)
  • Profanity or blasphemy: religious terms (e.g. Jesus Christ, damn)
  • Scatological and disgusting objects: feces (e.g. crap); excretion organs (e.g. asshole); excretion processes (e.g. shitting); body products (e.g. piss)
  • Animal names: e.g. bitch, monkey
  • Ethnic-racial-gender slurs: e.g. nigger, fag
  • Psychological-physical-social deviations: e.g. moron, pox, whore
  • Ancestral allusions: family relationships and ancestors (e.g. son of a bitch, bastard)
  • Substandard vulgar terms: vulgar words whose constructions are below the satisfactory standard of language (e.g. on the rag, fart face)
  • Offensive slang: offensive substandard words invented to ease communication (e.g. bang, suck)
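The coding of bad words against Jay's (2009) lexical categories could be sketched as a simple lookup. The word lists below are only the illustrative examples from Table 1; any real coding scheme would be far larger and would require context-sensitive judgment rather than a plain dictionary.

```python
# Word lists drawn from the examples in Table 1 (illustrative only).
JAY_CATEGORIES = {
    "sexual references": {"fuck", "cock", "dick", "cunt", "motherfucker", "cocksucker"},
    "profanity or blasphemy": {"jesus christ", "damn"},
    "scatological and disgusting objects": {"crap", "asshole", "shitting", "piss"},
    "animal names": {"bitch", "monkey"},
    "ethnic-racial-gender slurs": {"nigger", "fag"},
    "psychological-physical-social deviations": {"moron", "pox", "whore"},
    "ancestral allusions": {"son of a bitch", "bastard"},
    "substandard vulgar terms": {"on the rag", "fart face"},
    "offensive slang": {"bang", "suck"},
}

def jay_category(term):
    """Return the Jay (2009) lexical category for a term, or None
    if the term is not in the (tiny) example word lists."""
    term = term.lower()
    for category, words in JAY_CATEGORIES.items():
        if term in words:
            return category
    return None

print(jay_category("bastard"))  # ancestral allusions
```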

Finally, the data were analyzed based on the model of swearing types proposed by Pinker (2010), which distinguishes five categories of using profanity: dysphemistic, abusive, idiomatic, emphatic, and cathartic (Pinker, 2010).

3.5.2 Questionnaire

The questionnaire was designed by the researcher to answer some of the questions raised after analyzing the corpus and to capture Malaysian perceptions of profanity. Consequently, to further support the findings on profanity among Malaysians, a questionnaire link in Google Forms was distributed in WhatsApp groups. The questionnaire was both adopted and adapted from Fagersten's (2000) questionnaire; the questions were also designed based on the literature search and the findings of the primary data. The questionnaire consists of two parts; a sample questionnaire is provided in Appendix B. Each part of the questionnaire is analyzed separately in Chapter 4.

(Steps shown in Figure 1: liked the Facebook page "Only in Malaysia"; conducted the pilot study; obtained permission from the admin; observed comments in comment boxes; listed them according to Jay's (2009) and Pinker's (2010) models; created the database; distributed the Google Form in WhatsApp groups.)

Detection of Inappropriate Anonymous Comments Using NLP and Sentiment Analysis: Analysis of Profanity

Abstract

The world has become interactive and socially active because of the growth of different types of content-sharing applications. These content-sharing applications are social media platforms that provide various features so that users can effectively interact and share their thoughts and ideology. One such platform is the discussion forum, which allows the anonymous posting of users' views and complaints. As forums have grown in popularity, they have become targets for spammers. Though these platforms act as a medium of knowledge sharing, not all users use them for a positive cause; some are used to abuse or bully targeted people, taking advantage of the anonymity feature. Spamming and cyberbullying have grown so rapidly that social media is being termed harmful. By reading spam and vulgar comments, readers are misled, which results in several harmful misconceptions. The main aim of this work is to detect bad comments which are vulgar, inappropriate, or not related to the specific context. The research is not based on static content; instead, the comments are live-streamed and the entire analysis is performed on the stream. The research is based on NLP, sentiment calculation, and topic detection.

Keywords— PPM Algorithm · TEM Algorithm · SAM Algorithm · Latent Dirichlet Allocation (LDA) · Natural Language Processing (NLP) · Machine Learning · Topic Extraction

1. Introduction

Social media platforms allow us to interact and share our ideas and opinions through posts, tweets, comments, and many other means. By reading comments, we get to know people's views; this is most useful in online shopping, where we form an opinion of a product by reading its comments. Comments may be posted anonymously to elicit more genuine views. However, although we can read the comments and come to a decision, they may be spam, which has an irrelevant and irresponsible impact on the reader. As an example, YouTube has a feature that deletes a comment once its number of dislikes reaches a particular threshold; from this we can understand that the real motive is not to entertain any spam comments. In this paper, our approach deals not only with spam comments but also with bad, vulgar, and irrelevant comments that manipulate the reader's mind or are off-topic and of no use. The first elimination step removes all extra spaces and tabs in order to split the comments into tokens. This preprocessing is done only after checking the vulgarity of each comment and its topic relevancy; comments failing these checks are deleted. We then deal with topic extraction and topic similarity. We built a mechanism to identify spam comments and apply natural language processing techniques in the later stages along with machine learning algorithms.
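The first elimination step described above (collapsing extra spaces and tabs, then splitting into tokens) can be sketched as follows. This is a simplified whitespace tokenizer, not the paper's actual implementation.

```python
import re

def tokenize(comment):
    """Collapse runs of spaces and tabs into one space, then split the
    comment into tokens, mirroring the first elimination step."""
    cleaned = re.sub(r"[ \t]+", " ", comment.strip())
    return cleaned.split(" ")

print(tokenize("this   product \t is   great"))  # ['this', 'product', 'is', 'great']
```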

1.1. Objective

With this paper, we intend to create a system where forums, websites, and social media sites are spam-free. Society nowadays depends heavily on social media, which is responsible for changing and routing people's behavior. This attention paves a way for spammers to promote irrelevant content and malicious behavior. The main idea of this paper is to provide a system in which spam is detected and the sentiment of the people, as expressed in the comments, is calculated.

1.2. Existing System

In existing forums nowadays, people comment randomly, and we face a number of cases where the comments are:

  1. Irrelevant to the topic
  2. Offensive or vulgar
  3. Spam

Currently, forum pages display all comments, and no effort is made to remove the comments in the above-mentioned categories until someone reports them as vulgar. By that time, people will already have seen the comments, and removing them only after receiving reports is a waste of time.

We may even observe cases where tweets are extracted from Twitter and processed with different techniques to find profanity, but this is applied only to a particular hashtag or to tweets extracted from a particular username. This is not efficient for common platforms like web forums, which people access from everywhere across the world for different reasons. We therefore use NLP and different machine learning algorithms to provide a solution to this problem.

1.3. Related Work

The initial research in this area was the recognition of email spam by Sahami et al. [1], who used probabilistic learning methods to produce filters useful for this task, with domain-specific features considered for detecting spam. Carreras et al. [2] proved that AdaBoost is more effective than Naïve Bayes and decision trees. As the World Wide Web grew, the need for spam detection grew with it, and researchers started to detect spam comments analytically: a decision-tree algorithm was used in [3] by Davison to identify link-based web spam. Drost et al. [4] used SVMs to classify web spam with content-based features; their method is based on link spam detection, and the solution is trained on data revealing the effectiveness of classes of intrinsic and relational attributes.

In [5], Mishne et al. used machine learning techniques for the detection of spam comments, checking the comments in posts against the comments of linked pages. In [6], Bhattarai et al. used a data corpus from the same set to detect spam comments. In [7], the authors studied the comments and commenting behavior on videos with more than 6M views, and made a comparison in sentiment analysis to find how comments are influenced by other people's comments. In [8], Ruihai Dong et al. worked on topic detection using NLP.

1.4. Problem Statement

Online review comments strongly influence viewers; this behavior can be observed in online shopping, movies, videos, and posts on social networking sites. The detection and deletion of spam comments is a big challenge faced by the internet today, which motivates a system in which comments are streamed continuously, rather than taken as static data, then classified as spam or not, with action taken based on the result. Further, the sentiment of the comments is analyzed to depict the mindset of the people.

2. Proposed methodology

The main goal of this paper is to remove spam from social networking sites. Social media shapes people's minds, and spamming changes their perspective and is therefore harmful. We propose a system in which comments are streamed from forums and assessed for vulgarity using a profanity module.

Preprocessing and the later stages proceed based on the profanity check results; topic detection is then performed and the sentiment of the users is calculated.

2.1. System Architecture

In our research, we have dealt with detecting spam comments using NLP and machine learning. The first step is detecting the profanity of a comment; it then goes through preprocessing, where we tokenize, stem, and lemmatize, along with other preprocessing steps. We then determine topic similarity and perform topic detection using the algorithms described in this section, compute the results with sentiment analysis, and display the sentiment of the comments using a word cloud and a bar plot.

Our approach consists of four modules. The first module finds profanity. The second handles all the preprocessing. The third applies the algorithms that determine topic similarity and perform topic detection through dictionary and corpus formation. The fourth performs sentiment analysis.

Generally, the research could stop after the spam comments are detected and blocked, but since we developed the system for an educational institution, we continued further to calculate sentiment. Through sentiment analysis, by tallying the positives and negatives, we can extract which features of the institution are working, what changes are required, and what the students need.

The results are visualized using a word cloud, which lets us see the most frequently discussed issues: the greater the weight of a word, the larger its representation.

Fig. 1. Proposed Architecture

2.2. System Components

  • A. Profanity Check Module
  • B. Preprocessing Module
  • C. Topic Extraction Module
  • D. Sentiment Analysis Module

The first step is the profanity check of the comments, which can be done in different ways; here we use existing modules. We used SentiWordnet, which lets us check comments against a list of profane words. If a word containing profanity is found, further preprocessing is stopped, and since such a comment is of no use in the forum, we do not display it.

The research then deals with preprocessing: the comments must first be in the form of tokens, so they are split and tokens are formed. We then perform stemming and lemmatization, followed by POS tagging, a crucial preprocessing step. Although all tokens are preprocessed based on POS tagging, our main aim is topic detection and topic similarity, so we need only words such as nouns and the adjectives that describe them; we therefore focus on these parts of speech. After POS tagging we arrive at the stage of topic detection and topic similarity detection.

The third module deals with the TEM algorithm, which depends on normalized values, i.e. the sum of the frequencies of the topics from the comments in the post. By topic we mean the main theme discussed in a comment, given by a set of unigrams and bigrams with a predominant number of occurrences in the comment. Here we applied LDA to find the topic similarity; based on the LDA results we can determine the topic on which the comments are based.

The final module deals with the SAM algorithm, which gives the sentiment of the users; the results display the positivity and negativity of the comments and how they affect further comments on the forum.

A) Profanity Check Module

As the number of web users grows, the presence of inappropriate content from users becomes more problematic. Social news sites, forums, and any online community must manage user-generated content, censoring that which is not in line with the social norms and expectations of the community. Failure to remove such content not only deters potential users/visitors but can also signal that such content is acceptable.

A review of previous work shows that the current systems are not up to the mark. The general practice is to use a static list each time profanity is checked, which does not work when the vulgarity appears as misspelled words, in different languages, and so on. These drawbacks make the current profanity-detection systems obsolete; some even depend on outsiders who are assigned to detect spam comments for the posts. This is doable up to a point, but it does not scale once the task becomes huge. In our system, all comments are therefore checked for profanity based on vulgarity, abusive words, and irrelevant topic discussion.

  1. List-based approach:

This is the most standard approach: to determine whether a comment in a particular forum contains profanity, these systems simply examine each word in the document, and if any word is present on a list of profane terms, the comment is labeled as profane. In our system, as soon as a comment is posted in the forum it is checked for profanity by a module running in the background, and if it is found to be profane, further preprocessing is stopped. The profanity module is from Google, where the list is updated periodically; we make sure the list in our profanity module is kept up to date, which takes care of alternate spellings, partially censored words, and other issues.
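The list-based check described above can be sketched in a few lines. This is a minimal illustration, not the Google module itself; the entries in PROFANE_WORDS are hypothetical stand-ins for the periodically updated list.

```python
# Minimal sketch of the list-based profanity check; the word list here
# is a small hypothetical stand-in for the real, regularly updated list.
PROFANE_WORDS = {"badword", "vulgarterm"}  # hypothetical entries

def is_profane(comment: str) -> bool:
    """Label a comment profane if any of its words is on the list."""
    words = comment.lower().split()
    return any(w.strip(".,!?") in PROFANE_WORDS for w in words)

print(is_profane("This post is a badword example"))  # True
print(is_profane("This post is perfectly fine"))     # False
```

When a comment fails this check, the pipeline stops before preprocessing and the comment is never displayed.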

B) Preprocessing module

Preprocessing is an important stage in natural language processing because the words, tokens, and other sentence units identified in this stage are used further to find n-grams and apply the algorithms.

(i)Stop word removal:

Many words in a sentence serve as joining words but do not by themselves convey meaning unless combined and framed grammatically into a sentence. Their presence does not contribute to the content or context of the document, and removing these stop words is necessary because their high frequency obstructs understanding of the document's actual content.

One could build one's own stop word module to detect and remove them, but this is not practical: keeping the stop word list up to date and checking every word of each sentence becomes a hectic task. It is better to use existing stop word modules to eliminate the stop words, leaving the actual content words for the next preprocessing step.
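As a sketch of this step, the filter below uses a small hard-coded stop word set standing in for the full English list shipped with nltk (`nltk.corpus.stopwords.words('english')`):

```python
# STOP_WORDS is a tiny illustrative subset; a real system would load
# the full English list from the nltk module instead.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stop word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["The", "library", "is", "a", "quiet", "place", "to", "study"]
print(remove_stopwords(tokens))  # ['library', 'quiet', 'place', 'study']
```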

(ii)Tokenization:

Tokenization is the process where the sentence is broken into tokens. The main aim behind tokenization is to explore the meaning of the tokens formed and how they are preprocessed further to make meaningful outcomes after performing NLP.

One may wonder why tokenization is needed when the text is already readable, but we are still left with much punctuation and many expressions that need to be separated. The main aim of tokenizing is thus to obtain meaningful tokens and remove inconsistency. Tokenizing is based on a delimiter, which depends on the language: space is the delimiter in English, whereas some languages, like Chinese, have no explicit word boundary, so extra care is needed in those cases.
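A simple regex tokenizer illustrates the idea of separating words from punctuation; this is a sketch in the spirit of the nltk tokenizers rather than a drop-in replacement for them:

```python
import re

def tokenize(text):
    """Split text into word tokens, keeping punctuation as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Great lecture, isn't it?"))
# ['Great', 'lecture', ',', 'isn', "'", 't', 'it', '?']
```

Note how the contraction is split apart: handling such cases well is exactly why dedicated tokenizers exist.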

(iii) Stemming:

Each token is reduced to its root word. The root form is not necessarily a word by itself, but it can be used to generate words by concatenating the right suffix. Stemming is a type of normalization; the sequence is returned with its root word.

(iv) Lemmatization:

Lemmatization finds the lemma of the word: the extra endings are removed, and the word returned is called the lemma. The two techniques are not the same. Stemming just finds the root word, which is often not preferable, whereas lemmatization performs a morphological analysis of the word and returns a lemma.
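The difference between the two techniques can be shown with a toy example. A real system would use nltk's PorterStemmer and WordNetLemmatizer; here a crude suffix stripper and a tiny hand-made lemma table stand in for them:

```python
def crude_stem(word):
    """Naive suffix stripping; the result need not be a dictionary word."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma table standing in for WordNet's morphological analysis.
LEMMAS = {"studies": "study", "better": "good", "ran": "run"}

def lemmatize(word):
    """Look up the dictionary lemma; fall back to the word itself."""
    return LEMMAS.get(word, word)

print(crude_stem("studies"))  # 'stud'  (root need not be a word)
print(lemmatize("studies"))   # 'study' (a valid dictionary lemma)
```

The contrast on "studies" shows why lemmatization is often preferred: the stem "stud" is not a word, while the lemma "study" is.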

Fig. 2. Flow of Pre-processing module

Algorithm: PPM
Input: Comments entered in the forum
Output: Filtered tokens
  a) Get the profanity list from the profanity module
  b) If the comment contains profanity then
       Delete the comment and do not display it in the forum
  c) Else
       Get the English-language stop words from the nltk module
       Remove the stop words present in the comment
       Tokenize the comment using any tokenizer in the nltk module (e.g. PunktTokenizer)
       Stem the tokens
       Lemmatize the tokens using WordNetLemmatizer

Fig. 3. Pseudo code Pre-processing module

C) Topic extraction module

The tokenized comments now need to be preprocessed in such a way that we can extract the topic of discussion and categorize the comments based on the topic being extracted.

(i) POS tagging:

In this process, for each token that has been formed after the preprocessing, we assign the part of speech to which it belongs. The words are classified based on their part of speech, tense, number, case, etc.

The set of all POS tags used in tagging is called a tagset, and tagsets differ from language to language; basic tags include N for noun, A for adjective, and so on. Not every tag is useful for topic extraction: to obtain the topic discussed in a sentence, paragraph, or document we rely mostly on the nouns, but we also consider the adjectives and verbs, since they describe the nouns and the situation being talked about.
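Filtering tagged tokens down to the topic-bearing words can be sketched as follows. The (token, tag) pairs would normally come from `nltk.pos_tag`; here they are hard-coded so the example is self-contained:

```python
# Hard-coded (token, tag) pairs standing in for nltk.pos_tag output.
tagged = [("students", "NNS"), ("love", "VBP"), ("the", "DT"),
          ("new", "JJ"), ("library", "NN")]

# Keep nouns and adjectives, the parts of speech the paper focuses on.
KEEP_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ"}

topic_words = [tok for tok, tag in tagged if tag in KEEP_TAGS]
print(topic_words)  # ['students', 'new', 'library']
```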

(ii)Unigrams and bigrams:

As we have our pos tagged words further we can group the words based on the distance and then we can conclude what topic is being discussed.

N-grams are sequences of items where N can take values 1, 2, and so on, but we do not need a large value of N. In our project we generate bigrams for further application. Bigrams are formed by taking adjacent tokens and grouping them together; before the bigrams are formed, the tokens are in the unigram stage. Unigrams and bigrams are generated because they are essential inputs for LDA.
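Bigram formation from adjacent tokens is a one-liner in Python; this sketch mirrors what `nltk.bigrams` produces:

```python
def bigrams(tokens):
    """Pair each token with its right-hand neighbour."""
    return list(zip(tokens, tokens[1:]))

tokens = ["exam", "schedule", "announcement"]
print(bigrams(tokens))
# [('exam', 'schedule'), ('schedule', 'announcement')]
```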

(iii)Topic extraction:

Once we have our bigrams, an algorithm is needed to extract the topic. Here we used LDA (Latent Dirichlet Allocation). The input to this algorithm must be in the form of a dictionary and a corpus. LDA is a widely used topic extraction model that identifies the topic of a document by classifying the text it contains; the algorithm builds Dirichlet distributions (i.e. a topic-per-document model and a words-per-topic model).

First, we analyze the frequency of terms (i.e. the number of occurrences of each term in the document) using a document-term matrix. We then generate an LDA model over the document: each token is given a unique integer id, the documents are transformed into bags of words known as the corpus, and LDA is applied. Based on the frequency of the words, the topic is extracted; from the extracted topics, the forum's main discussion is obtained on a daily and weekly basis.
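The dictionary and corpus formation that precedes LDA can be sketched in plain Python. In gensim, `corpora.Dictionary` assigns each token a unique integer id and `doc2bow` turns a document into (id, frequency) pairs; the same idea looks like this:

```python
from collections import Counter

# Two toy documents (hypothetical forum comments, already tokenized).
docs = [["exam", "schedule", "exam"], ["library", "schedule"]]

# Dictionary: map each distinct token to a unique integer id.
dictionary = {tok: i
              for i, tok in enumerate(sorted({t for d in docs for t in d}))}

# Corpus: each document as a bag of words, i.e. (token_id, count) pairs.
corpus = [sorted(Counter(dictionary[t] for t in d).items()) for d in docs]

print(dictionary)  # {'exam': 0, 'library': 1, 'schedule': 2}
print(corpus)      # [[(0, 2), (2, 1)], [(1, 1), (2, 1)]]
```

This corpus is exactly the shape gensim's `LdaModel` consumes; the actual LDA fitting is left to the library.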

Fig. 4. Flow of Topic extraction module

Algorithm: TEM
Input: Filtered tokens
Output: Topic of the comment
  a) POS tagging using the nltk module
  b) Find the candidate nouns (i.e. tokens that are singular nouns (NN), plural nouns (NNS), singular proper nouns (NNP), or plural proper nouns (NNPS))
  c) Generate bigrams using the nltk module
  d) Form the dictionary using the gensim module
  e) Form the corpus using the gensim module
  f) Apply the LDA algorithm using the gensim module
  g) If the topic of the comment is not relevant to the topic of the forum then
       Delete the comment and do not display it in the forum
  h) Create a word cloud of the most discussed topics using the wordcloud module

Fig. 5. Pseudocode Topic extraction module

D) Sentiment analysis module

Sentiment calculation shows the sentiment of the people on the topic being discussed, how it shapes future comments, and how people's opinions are formed. This is a classification task in which each inserted phrase is judged as carrying negative, positive, or neutral sentiment.

In our research we used SentiWordNet, a resource containing all the synsets of WordNet along with their "positivity", "negativity", and "neutrality". Each synset has three scores, a positive score, a negative score, and an objective score, representing how positive, negative, and "objective" (i.e. neutral) the synset is. Each score ranges from 0.0 to 1.0, and the three scores sum to 1.0; a synset term may have a non-zero value for each of the three scores. SentiWordNet synset term scores have been computed semi-automatically using a semi-supervised algorithm.
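Aggregating such scores over a comment's tokens can be sketched as follows. SCORES is a tiny hypothetical table, with each entry being (positive, negative, objective) and the three values summing to 1.0 as in SentiWordNet; unknown tokens default to fully objective:

```python
# Hypothetical SentiWordNet-style score table: (positive, negative, objective).
SCORES = {
    "good":    (0.75, 0.00, 0.25),
    "bad":     (0.00, 0.75, 0.25),
    "library": (0.00, 0.00, 1.00),
}

def sentiment(tokens):
    """Sum positive and negative scores over the tokens and compare them."""
    pos = sum(SCORES.get(t, (0, 0, 1))[0] for t in tokens)
    neg = sum(SCORES.get(t, (0, 0, 1))[1] for t in tokens)
    return "positive" if pos > neg else "negative" if neg > pos else "neutral"

print(sentiment(["good", "library"]))  # positive
print(sentiment(["bad", "library"]))   # negative
```

A real implementation would look up synset scores via nltk's SentiWordNet corpus reader rather than a hand-made table.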

The resulting sentiment of a phrase describes people's opinion on the topic being discussed. This is especially useful in our forum, where students discuss their issues: it gives the management and teachers a way to see which issues need attention and how they should be handled. We also provide a suggestions section so that students can reach the staff and have their issues resolved. The system thus helps not only the faculty and institution but also the students who want their issues solved.

Algorithm: SAM
Input: Filtered tokens from the preprocessing module; the SentiWordnet module, containing synset terms with their positive and negative sentiment scores
  a) For each token in the filtered token list:
       (i) If token = 'not':
             positivescore = 0
             negativescore = thresholdvalue
       (ii) Else:
             positivescore = positive score of the synset term in SentiWordnet
             negativescore = negative score of the synset term in SentiWordnet
  b) Create a plot visualizing the positive and negative sentiments using the matplotlib module

Fig. 6. Pseudocode of Sentiment Analysis Module

3 Results

The results are shown as plots: word clouds and bar plots depicting the sentiment. The plots are updated each time a comment is posted, so we can observe how the sentiment changes and what actions are required. We can also compare the topic being discussed with the topic we extracted, to evaluate how well the system performs.

Fig. 7. Barplot visualizing the sentiment

Fig. 8. Wordcloud depicting the most discussed topics

4 Conclusion and Future scope

Our research has overcome the problems with spam comments and the disadvantages of the existing systems. We proposed a system in which spam comments are detected based on their features, and in which topic-irrelevant comments that lead to misconceptions are also dealt with.

Future enhancements can be made to this project: since we stream the comments rather than taking static content, there is great scope not only for removing spam comments but also for applying this topic evaluation in other areas of interest.

References

  1. M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, “A Bayesian approach to filtering junk e-mail”. In AAAI-98 Workshop on Learning for Text Categorization, July 1998, Madison, Wisconsin, pp. 98-105
  2. Carreras, X. and Marquez, L., “Boosting trees for anti-spam email filtering”. In Proceedings of RANLP-01, 4th International Conference on Recent Advances in Natural Language Processing, 2001, pp. 58-64
  3. Davison B. D., “Recognizing Nepotistic Links on the Web”. In AAAI 2000 Workshop on Artificial Intelligence for Web Search, 2000, pp.23- 28.
  4. I. Drost and T. Scheffer., “Thwarting the nigritude ultramarine: Learning to identify link spam”. In ECML’05 Proceedings of the 16th European Conference on Machine Learning, 2005, Berlin, Germany, pp.96-107.
  5. Gilad Mishne, David Carmel, and Ronny Lempel, “Blocking blog spam with language model disagreement”. In Proceedings of the First International Workshop on Adversarial Information Web (AIRWeb), Chiba, Japan, May 2005, pp. 1-6
  6. A. Bhattarai and D. Dasgupta, “A Self-supervised Approach to Comment Spam Detection based on Content Analysis”. In International Journal of Information Security and Privacy (IJISP), Volume 5, Issue 1, 2011, pp. 14-32.
  7. Stefan Siersdorfer and Sergiu Chelaru, “How useful are your comments? analyzing and predicting YouTube comments and comment ratings”. In Proceedings of the 19th international conference on World wide web, 2010, pp. 891-900.
  8. Ruihai Dong, Markus Schaal, and Barry Smyth, “Topic extraction from online reviews for classification and recommendation”. In