Data mining has been defined differently in diverse contexts, but the common underlying theme is the nontrivial extraction of previously unknown, useful information from data, information that can be applied in a wide range of settings.
Today, more than ever before, individuals, organizations and governments have access to seemingly endless amounts of data stored electronically on the World Wide Web and the Internet. It therefore makes sense for these entities to analyze and synthesize this data in a focused attempt to discover the meaningful patterns hidden within it (Ethics in Computing para. 1-2).
However, it should always be remembered that data mining can have devastating effects if proper regulations are not adopted by the participating entities or stakeholders. The purpose of this paper, therefore, is to demonstrate that data mining is acceptable and advantageous if proper regulations are put in place.
The first reason why data mining is acceptable is that it enables organizations to provide better services to customers. This capability gives organizations a distinct advantage over competitors, as they are better able to learn about customer purchase behaviors, beliefs and expectations through data mining. Indeed, extant literature demonstrates that data mining assists organizations “to build detailed customer profiles, and gain marketing intelligence” (van Wel & Royakkers 129).
Within the business context, therefore, it can be argued that data mining assists marketers and business organizations to not only build models based on data to predict the target audience that is likely to respond to new marketing initiatives or new products in the market, but also to reinforce customer buying behavior and experience (Zentut para. 2).
However, companies utilizing data mining to forecast customer trends should understand that there is always a risk of a security breach, which may have several negative ramifications, including the theft of sensitive personally identifiable information. Such risks should be countered by putting in place adequate security measures to guarantee information privacy (van Wel & Royakkers 138).
The second reason why data mining is acceptable is that it assists governments to not only provide effective and efficient services to citizens, but also identify and deal with criminal activities.
Indeed, extant literature demonstrates that “data mining helps government agency by digging and analyzing records of financial transaction to build patterns that can detect money laundering or criminal activities” (van Wel & Royakkers 132). This is a worthy function considering that governments all over the world lose huge amounts of money annually to money launderers.
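As a rough illustration of the kind of transaction-pattern analysis described above, the sketch below flags a classic “structuring” pattern: several deposits kept just below a reporting threshold. The account names, threshold and window size are purely illustrative assumptions, not any agency's actual rules.

```python
from collections import defaultdict

def flag_structuring(transactions, threshold=10000, min_hits=3):
    """Flag accounts with several deposits just below a reporting
    threshold -- an illustrative "structuring" pattern, not a real
    regulatory rule."""
    by_account = defaultdict(list)
    for account, amount in transactions:
        by_account[account].append(amount)
    flagged = []
    for account, amounts in by_account.items():
        # Count deposits in the suspicious band just under the threshold.
        near_limit = [a for a in amounts if 0.9 * threshold <= a < threshold]
        if len(near_limit) >= min_hits:
            flagged.append(account)
    return flagged

transactions = [
    ("acct-1", 9500), ("acct-1", 9800), ("acct-1", 9900),  # suspicious
    ("acct-2", 1200), ("acct-2", 300),                     # ordinary
]
print(flag_structuring(transactions))  # ['acct-1']
```

Real anti-money-laundering systems mine far richer features, but the core idea of building detectable patterns from transaction records is the same.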
Of course, there is a threat that government agencies may fail to exercise ethical responsibility in disclosing personally identifiable information, but such a threat can be addressed by developing and implementing stringent security measures and rules of engagement for handling sensitive personal data. The privacy and confidentiality of data must be maintained at all times for data mining to achieve its desired objective within this context (Seltzer 1442).
The third reason is premised on the fact that data mining can be applied within the manufacturing sector to “detect faulty equipments and determine optimal control parameters” (Zentut para. 5).
This is a valuable achievement, since it helps ensure that the products coming out of our factories are safe to use and can be depended upon to make life easier and more fulfilling. This benefit is tied to the disadvantage that industries may misuse information or rely on inaccurate information (Zentut para. 9), but statutory and governmental rules and regulations governing the use of data mining information should be put in place to prevent such misuse.
Data mining is the computer-based classification and summarization of a large set of data into grouped information based on correlations in the data. Data mining tools present the information in a manner that is easy to interpret and that supports prediction. The data mining process entails the use of large relational databases to identify the correlations that exist in given data.
Data used in this process is drawn from different sources that are related to the outcome of the target phenomenon (Chattamvelli 18). Experts use data mining software to predict the occurrence of various situations and events (Miner 35). This paper analyzes how data mining is used commercially to earn money, optimize performance and improve service delivery.
Business organizations use data mining technology to improve business management. These organizations use customized data mining applications to generate business trends and patterns from their vast relational databases. The applications have specialized built-in algorithms that perform correlation detection.
The principal role of the applications is to sift the data to identify correlations. Eventually, the software presents the correlations as statistical trends and patterns. Various business departments use data mining for management as follows (Chattamvelli 27).
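To make the correlation-detection step concrete, here is a minimal sketch that computes the Pearson correlation coefficient between two business series; the ad-spend and sales figures are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length
    series -- the basic statistic behind 'correlation detection'."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical monthly figures: advertising spend vs. units sold.
ad_spend = [10, 20, 30, 40, 50]
units    = [12, 24, 33, 41, 55]
print(round(pearson(ad_spend, units), 3))
```

A coefficient near 1 (as here) is the kind of statistical pattern that a data mining application would surface as a trend for managers.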
Marketing departments use data mining to perform market analysis. Market analysis is the process of monitoring changes in marketing patterns and trends. Marketers therefore use data mining applications to investigate the possible causes of changes in marketing trends (Miner 53). Research asserts that common changes detected in market analysis include customer complaints, sales declines, and loss of customer loyalty.
In addition, marketing departments use market analysis to monitor new products launched in the market. Specialized data mining applications display customer buying trends, post purchasing behavior, customer reviews, and demand trends among others. Marketers use the prevailing market trends to identify the best promotional strategy that will outdo the competitors (Rahman 68).
Marketing departments also use data mining technology to facilitate direct and interactive marketing. Direct marketing is the process of developing a comprehensive business mailing list. The mailing list contains the addresses of all the business stakeholders (Han & Kamber 56).
Business experts assert that direct marketing is one of the most efficient ways of winning customer and supplier loyalty. Interactive marketing, on the other hand, is the process of optimizing a business's web accessibility. The process ensures the business website provides users with satisfactory information. Successful interactive marketing leads to improved customer loyalty and online purchases (Soares & Ghani 58).
Customer care departments use customer data mining applications to improve service delivery in customer relations. The main function of these applications is to automate the customer handling process to minimize response time. Some advanced data mining tools have a built-in emailing feature that automatically responds to general customer concerns (Miner 65). Marketing departments also use this feature to promote direct marketing.
Furthermore, the departments use specialized applications that automate the answering of frequently asked questions (FAQs). An FAQ tool provides customers with a quick online guide for solving various problems without contacting customer care. This practice improves service delivery, leading to improved customer loyalty (Soares & Ghani 67).
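A minimal sketch of such an automated responder might match keywords against a table of canned answers. The FAQ entries and messages below are purely illustrative assumptions, not any vendor's actual product.

```python
def auto_reply(message, faq):
    """Tiny keyword matcher standing in for the automated e-mail /
    FAQ feature described above. `faq` maps a keyword to a canned
    reply; unmatched messages are escalated to a human agent."""
    text = message.lower()
    for keyword, reply in faq.items():
        if keyword in text:
            return reply
    return "Forwarded to a customer-care agent."

faq = {
    "password": "Use the 'Forgot password' link on the sign-in page.",
    "refund": "Refunds are processed within 5 business days.",
}
print(auto_reply("How do I reset my password?", faq))
```

Production systems use far richer text mining, but the goal is the same: answer common concerns instantly and route only the rest to staff.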
Customer care departments also use customized data mining applications to manage business promotions. Promotional data mining applications automate the process of selecting winners in uplift modeling promotions (Rahman 45). Furthermore, the departments use data clustering applications to automate customer segmentation analysis. Market segments are distinct groups of customers who buy similar goods or services.
The market segmentation process helps the business identify the most profitable customer groups for promotions and marketing. In addition, customer care departments rely heavily on data mining software for catalogue marketing. The departments manage a huge database that hosts the profiles of all other departments in the organization. They therefore use the product profiles at their disposal to produce product catalogues for sales promotion (Han & Kamber 77).
Commercial companies widely use data mining analysis in the human resources (HR) department. HR departments use customized applications to monitor and optimize employee performance. These applications provide information on employees' working history.
This information is essential in planning for recruitment, promotion, training, and rewarding (Miner 84). HR departments also use strategic management applications to generate key performance indicators (KPIs). KPI analysis provides business management with a performance progress report. The report summarizes the likelihood of achieving the organization's goals. Production and sales managers therefore use KPIs to correct performance declines, hence improving the profit margin (Rahman 73).
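The KPI reporting described above can be sketched as a simple attainment calculation. The metric names, targets and actuals below are assumptions chosen for illustration.

```python
def kpi_report(actuals, targets):
    """Compute simple key performance indicators: an attainment
    ratio per metric plus a flag for metrics behind target."""
    report = {}
    for name, target in targets.items():
        actual = actuals.get(name, 0)
        report[name] = {
            "attainment": round(actual / target, 2),
            "on_track": actual >= target,
        }
    return report

# Hypothetical quarterly targets and results.
targets = {"sales": 100000, "new_customers": 50}
actuals = {"sales": 87000, "new_customers": 61}
print(kpi_report(actuals, targets))
```

A manager reading this report sees immediately that customer acquisition is ahead of plan while sales are lagging, which is exactly the progress summary the text describes.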
Entrepreneurs use data mining technology to make money. For instance, software development experts design and sell data mining software to companies, businesses, and institutions. Research reveals that hundreds of entrepreneurs have become millionaires from the sale of data mining software. In addition, entrepreneurs do not need established companies to sell their software.
Other entrepreneurs make money from data mining by operating data mining agencies. These professionals use available data mining software to open a data mining consulting agency. They then approach institutions, businesses and companies to help them run their businesses. The agency charges a consultation fee at a favorable rate depending on the amount of work and its urgency (Rahman 80).
Business people can also use data mining technology to save or make money. Business owners experienced in data mining can use this knowledge to analyze market trends. From the analysis, they can predict market threats such as price declines or the loss of a market for their goods and services.
The business can use this opportunity to reduce production or the supply of services, hence avoiding losses. On the other hand, business people can predict a possible rise in demand for their goods or services. In that case, the business increases the supply of its goods and services to meet the expected demand. Research asserts that businesses using this approach are highly profitable since they avoid losses and maximize profits (Soares & Ghani 126).
Web and search engine optimization (SEO) designers use data mining to make money online. Using reliable data mining software, designers identify the most-searched keywords for their websites in different locations. The designers then use these keywords to optimize their website content.
Such websites receive many organic visitors, which is profitable for the businesses. Research shows that the most successful online selling websites use this approach to generate traffic. The method is also very profitable for SEO experts who are paid on a commission basis. When their websites rank high in the search engines, they are likely to get more clicks or signups, hence earning more commission (Rahman 142).
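As a toy version of the keyword-research step, the snippet below tallies terms from a hypothetical search-query log; the queries are invented for illustration.

```python
from collections import Counter
import re

def top_keywords(queries, n=3):
    """Rank the most frequent terms in a log of search queries --
    a miniature version of keyword research for SEO."""
    words = []
    for q in queries:
        words.extend(re.findall(r"[a-z]+", q.lower()))
    return Counter(words).most_common(n)

queries = [
    "cheap running shoes", "running shoes for women",
    "best running shoes", "trail shoes",
]
print(top_keywords(queries))
```

The designer would then weave the dominant terms ("shoes", "running") into page titles and copy to attract organic traffic.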
In conclusion, commercial organizations widely use data mining as follows. Sales departments of commercial companies use data mining to carry out customer churn analysis. Churn analysis uses buying behavior to predict whether the company is losing its customers to its rivals. Management uses the churn results to regulate production, hence averting losses. Marketing departments use data mining techniques to facilitate market analysis and segmentation, maximizing profits and minimizing expenditure.
Human resources departments use data mining technology to oversee staff monitoring, development and termination. Finally, entrepreneurs have made a lot of money using data mining software. In summary, data mining is instrumental in business organization management and investment (Soares & Ghani 158).
According to Mihai & Crisan (2010), in the world today almost every transaction that takes place is recorded and kept in files for later reference. This has resulted in an increase in the data produced and stored in various fields of activity. In fact, every enterprise has accumulated data on operations, activities and performance.
All this data holds valuable information, e.g., trends and patterns, which can be utilized to improve business decisions and optimize success. Mihai & Crisan (2010) point out that traditional methods of data analysis, based mostly on humans dealing directly with the data, simply do not scale to such large data sets. For this reason, advanced technologies called data mining techniques have been developed to process the huge volumes of data.
According to Han & Kamber (2000), data mining is the process of discovering correlations, patterns, trends or relationships by searching through a large amount of data that in most circumstances is stored in repositories, business databases and data warehouses. The data mining process is employed by several sectors to manage and deal with problems that are normally related to clients and which hinder the efficient operation of various entities.
Important features of data mining tools
Data mining tools integrate many operations and provide an easy-to-use way to perform the data mining process. There are many different types of data mining tools. The tools are diverse in design and implementation. However, there are several important features for data mining tools that enable them to be accommodated by users in the most efficient and effective manner.
The first important feature is the capability to access various data sources. Usually, data is obtained from a variety of sources in diverse formats. A good tool should enable the user to access different data sources without difficulty. It should also possess the ability to run on huge data sets. This is very important in circumstances where an entity stores large amounts of data (Mihai & Crisan, 2010).
Han & Kamber (2000) argue that good data mining tools should be user friendly. According to them, this is significant since most of the time the persons using the tools are not specialists. Data mining tools should possess the ability to process data in the most efficient manner, since this is often crucial in problem solving. Connolly, Begg & Holowczak (2008) point out that data mining tools should also enable good data and model visualization.
This will enable users to properly analyze the data and make sound decisions. They further state that data mining tools should also enable users to easily integrate different techniques during the mining process. This is because no single tool can effectively handle all prevailing problems. Superior data mining tools should therefore allow users to incorporate a variety of procedures in order to deal with various problems.
Han & Kamber (2000) assert that, because the rate of innovation is very rapid and new techniques and algorithms constantly appear in the market, it is also important that data mining tools be easily extensible. This design allows users to extend the tools in the most efficient and effective way. Good data mining tools should also enable interoperability with other tools, support the exchange of data and models, and comply with good data mining standards.
Data warehouse
A data warehouse is a site where information is stored. The idea of data warehousing was developed out of the need by different parties to have easy access to structured store of quality data that can be used for decision-making. Generally, information is a very powerful asset that can provide important benefits to any enterprise and a competitive advantage in the business world.
The massive amount of data possessed by firms has made it difficult for the firms to access it and make use of it. This is because it is in many different formats, exists on many different platforms, and resides in many different file and database structures developed by different vendors. Data warehousing offers a better approach to managing these data (Connolly, Begg & Holowczak, 2008).
How data mining can realise the value of data warehouses
According to Connolly, Begg & Holowczak (2008), data mining represents one of the most important applications of data warehousing. This is because most of the information that can be used to analyze various problems is accumulated in a data warehouse. Data mining techniques enable users to extract the relevant information needed for making good decisions, information that is otherwise hard to obtain.
This kind of data enables realization of the value of a data warehouse when used appropriately. Han & Kamber (2000) point out that data mining can also realize the value of a data warehouse through advanced data analysis techniques that help strategic management interpret the information stored there. It is crucial that administrators' decisions be made on an informed basis and not based exclusively on the talent and knowledge of the administrator.
This application of data mining techniques is made possible by making predictions based on the data that an enterprise has access to, i.e. data from its own databases (data warehouses). This is very important since the collected data is available in time for analysis when required.
According to Mihai & Crisan (2010), data mining can also aid in designing data warehouses for a specific application. In this way, the value of the data warehouse can more easily be realized because the amount of pre-processing required before data is mined can be determined according to the data available.
For instance, if the data is stored in relational databases, it is easier to analyze, and most data mining tools can be used without difficulty. Data mining also transforms the detailed operational data stored in the data warehouse into a relational form that makes the information more amenable to analytical processing.
Conclusion
Data mining is an exceptionally valuable tool for exploring essential data to create a reasonable advantage in an ever-changing environment. Data mining is employed by several sectors to manage and deal with problems that are normally related to clients and which hinder the efficient operation of various entities.
There are many different types of data mining tools. The tools have different features and are diverse in design and implementation. Data warehouses are sources of new information and are built to provide simple means to access source of high quality data. Therefore, data mining can easily realise value of data warehouses by making use of the stored information.
References
Connolly, T., Begg, C., & Holowczak, R. (2008). Business database systems. Harlow: Addison Wesley.
Han, J., & Kamber, M. (2000). Data Mining: Concepts and Techniques. Massachusetts: Morgan Kaufmann.
Mihai, A., & Crisan, D. (2010). Commercially Available Data Mining Tools Used in the Economic Environment. Database Systems Journal, 1(2), 45-54.
Data mining is aimed at providing detailed reports that allow analysis of the necessary statistical data and give access to website traffic information. Data mining is an important element in the political and economic spheres of state activity, as it is used in various policies and practices by law enforcement agencies and the Department of Justice. Its role is considered decisive in the national fight against terrorism and in the tracking of criminals.
Main Text
It is a well-known fact that the FBI relies on data mining for the purpose of tracking potential terrorists as well as individuals breaking national law. The US Department of Justice states that the development of data mining programs will be especially stressed: “Each of these initiatives is extremely valuable for investigators, allowing them to analyze and process lawfully acquired information more effectively in order to detect potential criminal activity and focus resources appropriately” (Vijayan, 2007). The principal issues covered by such programs include: decreasing the level of terrorism in the USA by focusing on individuals identified as being of interest to the FBI; gathering identity-theft intelligence through customers' complaints; and examining real estate transactions. It should be noted that data mining also targets Internet pharmacy fraud and automobile insurance fraud (Vijayan, 2007).
Law enforcement agencies focus on the use of data mining to promote national security interests. The Federal Bureau of Investigation has a pressing need to obtain data referring not only to criminals but also to the people the criminals have had contact with. In the wake of the tragedy of September 11, 2001, the FBI turned to data mining techniques in order to uncover more links to potential terrorists and criminals.
The FBI petitioned the Federal Communications Commission for access to Internet connections, on the grounds that terrorists can use voice-over-internet-protocol technologies to evade detection. The aim is to introduce changes into the networks so that terrorist patterns can be discerned and data can be tapped. Nevertheless, this caused some problems for the development of the FBI's project, as the absence of discrete circuits hinders location identification. The use of data mining also has its disadvantages, because it can produce false facts or conflate crimes. “Internet data, whether it is transmitted via a digital subscriber line (DSL), cable modem or dial-up modem, mixes and mingles with packets of data from thousands of other users,” said the CEO of Aaxis Technologies (Koprowski, 2004).
Conclusion
Data mining can thus be analyzed from both a positive and a negative side, though its advantages far outweigh its disadvantages. It is necessary to note that the use of data mining helps the FBI access the information necessary for tracking terrorism and crime. Besides, it is useful for reducing the economic and social thefts and crimes spread throughout the modern world. Certainly, the Internet is highly complex, and one cannot avoid mistakes and data mixing, which cause trouble and make the search more difficult (Schneider, 2007). Despite this, promoting data mining in the departments of justice, governmental bodies and security agencies will make it possible to reduce the number of crimes and improve the chances of fighting the terrorism spreading worldwide.
References
Koprowski, G. (2004). FBI Plans to Track Suspects with Data-Mining Techniques. TechNewsWorld. Web.
Data mining can be defined as the process through which crucial data patterns can be identified from a large quantity of data. Data mining finds its applications in different industries due to a number of benefits that can be derived from its use. Various methods of data mining include predictive analysis, web mining, and clustering and association discovery (Han, Kamber and Pei, 2011).
Each of these has a number of benefits for a business. In predictive analysis, analytical models are used to deliver solutions. Using this model, a business can uncover hidden data that can be utilized for identifying trends and therefore predicting the future.
This method requires a business to define the problem before the data can be explored. Predictive models are then developed and must be tested. Finally, these models are applied in population identification and in the prediction of behavior. The process helps a business identify its current position in relation to the industry (Simsion and Witt, 2004).
From this, businesses can plan on how best they can improve their positions in relation to other companies in the industry. The trends obtained from analysis of the acquired data can be used for the purpose of planning which might further give a company an edge over its competitors.
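The predictive-modeling workflow described above (define the problem, fit a model, predict behavior) can be sketched with a deliberately simple nearest-centroid rule. The customer features and class labels below are invented for illustration; real systems use far richer models.

```python
def train_centroids(rows, labels):
    """Fit a tiny nearest-centroid model: average the feature
    vectors of each class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, v in enumerate(row):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def predict(centroids, row):
    """Assign the class whose centroid is closest (squared distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, row))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Hypothetical customers: [monthly visits, average basket value].
rows = [[1, 10], [2, 12], [8, 90], [9, 110]]
labels = ["casual", "casual", "loyal", "loyal"]
model = train_centroids(rows, labels)
print(predict(model, [7, 95]))  # 'loyal'
```

Once tested against held-out data, such a model could score a whole customer population to find the segment most likely to respond to a campaign.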
In association discovery, the main aim is to discover correlation among different items that make up a shopping basket. The knowledge of these correlations is important in the development of effective marketing strategies. This is possible due to the insight gained on products that customers purchase together.
This method of data analysis can also help retailers with the design layout of their stores. In such a layout, the retailer can conveniently place items that customers purchase together, in order to make the shopping experience pleasant for customers as well as to increase the chances of higher sales (Kantardzic, 2011). The method can also be used by a business to determine which products to place on sale in order to promote the sale of the items that go together with them.
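The co-occurrence counting at the heart of association discovery can be sketched as below; the baskets and the minimum-support threshold are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support=2):
    """Count how often each pair of items appears in the same basket
    and keep pairs meeting a minimum support count -- the core step
    of market basket analysis."""
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: c for pair, c in pair_counts.items() if c >= min_support}

baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]
print(frequent_pairs(baskets))  # {('bread', 'butter'): 3}
```

A retailer seeing that bread and butter co-occur in three of four baskets might shelve them together or promote one to lift sales of the other, exactly as the text describes.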
Web mining is the process through which data present in the World Wide Web or data that has a relationship with a given website activity is made available for various business purposes.
This data can include the contents of web pages on various websites, the profiles of website users, and information about the number of visitors to a given website, among others. Web mining can be used by a business to personalize its products or services in order to meet specific customer needs. This is possible by tracking the movements of a given target customer across various web pages.
The method can also help a business improve its marketing strategies through effective advertising. This can be achieved when web mining is used together with business intelligence. It also helps a business identify the relevance of the information on its websites and how it can improve this information with a view to increasing its visibility in the market.
Clustering involves grouping data into classes based on shared characteristics (Han, Kamber and Pei, 2011). The process helps in discovering the specific groups that the business should focus on. The method also provides a business with specific information that can be used to win over a given class of customers.
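A minimal one-dimensional k-means sketch shows how clustering separates customers into groups; the spend figures and the naive seeding are illustrative assumptions, not a production algorithm.

```python
def kmeans_1d(values, k=2, iterations=10):
    """Minimal 1-D k-means: repeatedly assign each value to the
    nearest centroid, then recompute centroids as cluster means."""
    centroids = sorted(values)[:k]  # naive seeding for the sketch
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical annual spend per customer: two obvious groups.
spend = [10, 12, 11, 95, 100, 98]
centroids, clusters = kmeans_1d(spend)
print(sorted(clusters[0]), sorted(clusters[1]))
```

The low-spend and high-spend customers fall into separate clusters, giving the business two distinct classes to target with different strategies.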
Data mining follows a sequence that ensures the data mined meets the requirements set down by the person mining it. Different algorithms handle the process of data mining differently based on the content of the data to be mined. Therefore, the reliability of the data obtained depends highly on the method used and the nature of data. Speed of data mining process is important as it has a role to play in the relevance of the data mined.
Therefore, a given algorithm should support speedy mining of data. The accuracy of the data is another factor that can be used to measure the reliability of the mined data. For this reason, an algorithm should be able to use the specifications issued in the process of data mining. Most algorithms meet these two requirements, which makes them reliable for the purposes of data mining.
Various concerns arise over data mining and include invasion of privacy, ethics and legality. The issue of privacy arises when private information is obtained without the consent of its owners. Application of such information for business purposes can have detrimental effects to the business. Ethical issues arise when information mined is used by a business to take advantage of the owner of such information (Kantardzic, 2011).
There is also the question of the legality of data mining without the consent of the person who owns the information. To address these issues, some businesses request permission from individuals before using information about them, and the purposes for which it will be used must be disclosed.
Predictive analysis is used by businesses in market segmentation, analysis of the shopping basket and the planning of demand. Market segmentation enables a business to serve a given market better than if it had to serve a diverse market. In shopping basket analysis, a business can easily identify the products that are needed at specific times. The business can also determine demand and effectively plan how to meet it.
References
Han, J., Kamber, M. and Pei, J. (2011). Data Mining: Concepts and Techniques. Amsterdam: Elsevier.
Kantardzic, M. (2011). Data Mining: Concepts, Models, Methods, and Algorithms. New York: John Wiley & Sons.
Simsion, G. C. and Witt, G. C. (2004). Data Modeling Essentials. Massachusetts: Morgan Kaufmann.
The increasing adoption of data mining in various sectors illustrates the potential of the technology regarding the analysis of data by entities that seek information crucial to their operations. Data mining tools enable entities to establish relationships such as associations, classes, clusters and sequential patterns.
The analysis of data can occur using a variety of techniques, such as genetic algorithms, rule induction, neural networks and data visualization. Information obtained through data mining has transformed business operations by aiding companies in decision-making and in the prediction of crucial factors such as customer behavior, which helps companies gain a competitive advantage.
Benefits of data mining to businesses
The generation of predictive scores for organizational elements is crucial in the analysis of the behavior of customers. Predictive analytics enable companies to optimize marketing strategies by modeling trends in customers’ responses. This information helps businesses to allocate funds for various campaigns based on their potential of succeeding.
In addition, it minimizes wastage of time and money caused by the use of manually analyzed marketing strategies. Research shows that manual analysis of marketing methods is a cumbersome process that is error-prone due to the extensive skills required. Predictive analytics eliminate guesswork in the identification of marketing methods by providing reliable data on customers’ preferences and habits (Pyle, 2003).
Analyzing past and present trends about customers’ habits provides patterns that aid in making decisions on future undertakings of a business. A company that implements measures in response to future patterns of customers’ behavior is likely to gain a competitive advantage.
Data mining enables businesses to establish relationships among items in a transaction. The association technique facilitates identification of products that customers purchase frequently. Using the association rule, businesses can determine the way one product influences the sale of another product.
For example, an association between bagels and potato chips provides insight into the products that a fast-food business can sell alongside bagels so that sales of potato chips increase.
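Under the usual association-rule definitions, the strength of such a rule is measured by support (how often both items appear together) and confidence (how often the consequent appears given the antecedent). The sketch below computes both for an invented set of baskets.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support and confidence for the rule antecedent -> consequent.
    support    = P(antecedent and consequent)
    confidence = P(consequent | antecedent)"""
    n = len(baskets)
    both = sum(1 for b in baskets if antecedent in b and consequent in b)
    ante = sum(1 for b in baskets if antecedent in b)
    return {"support": both / n, "confidence": both / ante}

# Hypothetical transactions echoing the bagels / potato chips example.
baskets = [
    {"bagels", "potato chips"},
    {"bagels", "potato chips", "soda"},
    {"bagels", "cream cheese"},
    {"soda"},
]
print(rule_metrics(baskets, "bagels", "potato chips"))
```

Here two-thirds of bagel buyers also buy potato chips, which is the kind of evidence a marketer would use to bundle or cross-promote the two products.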
Market basket analysis provides businesses with important information that guides the implementation of marketing campaigns by enabling them to form hypotheses about customers' buying patterns. Relevant marketing strategies boost sales and promote higher profits.
Web mining enables business entities to establish patterns of customers’ behavior from the web. Using data mining techniques, companies can identify customers’ interests on the web concerning textual or multimedia data (Soares, 2010).
Web usage mining facilitates the analysis of target demographics. Web content and structured mining enable companies to monitor brands and analyze the content and structure of competitors. Such undertakings create strategic advantages.
Clustering enables companies to identify distinct groups of customers and implement strategies to retain customers that are above a cluster and gain the confidence and loyalty of customers that are below the cluster.
Customer-relationship management uses clusters to segment customers based on particular variables identified through data mining. Variables such as customer-retention probability help companies identify marketing opportunities.
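Customer segmentation of this kind is often done with the k-means algorithm. The sketch below runs a minimal, self-contained k-means on hypothetical customers described by annual spend and monthly visits; the data and the two-segment split are invented for illustration.

```python
import random

# Hypothetical customers: (annual spend in $, store visits per month).
customers = [(120, 2), (150, 3), (130, 2), (900, 12), (950, 14), (880, 11)]

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centre, then
    move each centre to the mean of its assigned points; repeat."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centres[i])))
            clusters[nearest].append(p)
        new_centres = []
        for i, cl in enumerate(clusters):
            if cl:
                new_centres.append(tuple(sum(v) / len(cl) for v in zip(*cl)))
            else:
                new_centres.append(centres[i])  # keep an empty cluster's centre
        centres = new_centres
    return clusters

segments = kmeans(customers, 2)
print(sorted(len(s) for s in segments))  # [3, 3]: low-value and high-value groups
```

On this toy data the algorithm separates the low-spend and high-spend customers, the kind of segmentation a CRM system would then attach retention variables to.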
Reliability of data mining algorithms
The reliability of data mining algorithms depends on the nature of the data under analysis. Some datasets contain erroneous or invalid information. Research shows that different algorithms respond to such errors differently and therefore compromise the analysis results in different ways. Assumptions such as noise-free data influence the accuracy of data mining algorithms.
Another factor that interferes with the accuracy of data mining algorithms is the size of data. The search space varies depending on the dimensions in a domain space.
Research shows that the search space grows exponentially with the number of dimensions in the domain space. This relationship gives rise to a phenomenon known as the curse of dimensionality, which undermines the reliability of data mining algorithms (Kantardzic, 2003).
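The exponential growth can be made concrete with a small, purely illustrative calculation: if each attribute is divided into 10 bins, a d-dimensional domain contains 10^d distinct cells, so a fixed dataset covers an exponentially shrinking share of the space. The record count below is hypothetical.

```python
# Illustration of the curse of dimensionality with invented numbers:
# a fixed dataset covers an ever-smaller fraction of the domain space
# as the number of dimensions grows.
n_records = 1_000_000
bins = 10
for d in (2, 3, 6, 9):
    cells = bins ** d
    coverage = min(1.0, n_records / cells)
    print(f"d={d}: {cells} cells, coverage <= {coverage}")
```

At d = 9 the million records cover at most 0.1% of the cells, which is why distance- and density-based algorithms become unreliable in high dimensions.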
Various errors arise from the factors that affect data-mining algorithms. Systematic errors are likely to stem from the assumption of clean data, and preprocessing helps to minimize them. Other errors include training and pessimistic errors, which arise from invalid data and assumptions such as noise-free input.
Data mining infringement on privacy
Data collected for mining purposes raises many privacy concerns. First, data intended for profiling customers and analyzing their behavior contains a lot of personal information. The collection and storage of confidential information about individuals introduces controversies due to the possibility of illegal access to the information.
Another issue concerns the dissemination of implicit information about an individual or a group of customers. Thirdly, data mining discovers valuable information that is subject to sale. This creates loopholes for the distribution of confidential information without control.
To address the issue of privacy protection in data mining, concerned bodies have established measures that promote reliable data mining results while meeting privacy requirements. The OECD Guidelines extensively cover the use of personal information obtained through data mining (Aggarwal & Yu, 2008).
First, executors of data mining should clearly inform subjects about the process and the intended use of the collected data. This ensures that people participate of their own free will.
Secondly, the Forthcoming Policy regulates the use of data mining results by stipulating the purpose of the data, its allowed uses, and the persons who should access the information. The Disclosure Policy enables the subjects of data mining to determine the purposes for which knowledge is disclosed by giving or denying consent to the anticipated use of the data.
Businesses that have used predictive analysis to gain a competitive advantage
Flight 540 employed predictive analytics to boost its customer appeal and gain a competitive advantage.
The company used data from the Internet, customer spending histories, comment cards and various surveys to model its flights so that customers could easily make purchase decisions based on the flight packages on offer. Instead of planning flights on a standardized basis, the company tailored them to groups of customers with similar preferences.
Netflix became a key player in the video rental business by using predictive results on customer preferences to model a customer-centric business approach. The company introduced products such as video rentals without late fees and video streaming on customers’ request. Apart from expanding its market share, the company succeeded in reducing its promotional expenses.
Tusk supermarket increased its sales considerably by using predictive analytics to establish feasible pricing strategies from customer survey feedback. The data facilitated the estimation of price sensitivity and the determination of price ranges that would have minimal impact on sales. This promoted customer retention, increased market share, and stabilized revenue generation for the supermarket.
Conclusion
Data mining provides companies with a basis for analyzing customers’ behavior and responses of potential customers by enabling them to establish relationships among internal and external factors of business operation. In this regard, companies can effectively determine the role of factors such as price, competition and demographics in influencing customers’ behavior.
References
Aggarwal, C. C., & Yu, P. S. (2008). Privacy-preserving data mining models and algorithms. New York: Springer.
Kantardzic, M. (2003). Data mining: Concepts, models, methods, and algorithms. Hoboken, NJ: Wiley-Interscience.
Pyle, D. (2003). Business modeling and data mining. Amsterdam: Morgan Kaufmann Publishers.
Soares, C. (2010). Data mining for business applications. Amsterdam: IOS Press.
Select two application areas for data mining NOT discussed in the textbook and briefly discuss how data mining is being used to solve a problem (or to explore an opportunity)?
Data mining involves rearranging large volumes of data to create comprehensible information that can be used to solve problems. There are several ways in which data mining can be applied in the real world (Han et al. 76). It can be used to solve problems and explore opportunities.
Data Mining and the Detection of Disturbances in the Ecosystem
The use of data mining to detect disturbances in the ecosystem can help to avert problems that are destructive to the environment and to society. Such calamities include floods and droughts (Kumar and Bhardwaj 258). Remote sensing and earth science techniques are used to understand the radical changes in the environment. Data is collected and archived. It is later mined and used to detect disturbances.
Data Mining in Sports
Data mining can be used to predict sporting activities. A case in point is the Advanced Scout System developed by IBM (Leung and Kyle 715). The application is used by coaches to improve the performance of players. In most cases, fans predict games by watching. They may also use archived data, which is mined and statistically used to make predictions based on the history of the game.
What is association rule mining? Explain how market-basket analysis helps retail businesses maximize profit from business transactions.
Association Rule Mining
It is the retrieval of data based on the relationship between a given set of objects. It takes into consideration the ‘togetherness’ of these objects and how they appear in a database. It involves the identification of connections and correlations between objects (Ramageri 304).
Market-Basket Analysis and Retail Business
Market basket analysis and association rule mining can be used to maximize profits and improve transactions in the retail business. It is used to study the behavior of customers and their shopping trends. Marketers use the information to design catalogs and undertake customer behavior analysis (Han et al. 99). Consequently, the information can be used in marketing and advertisement to maximize profits and improve business transactions.
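Beyond support and confidence, market-basket analysis commonly uses lift to decide whether an association is worth acting on: a lift above 1 means the two products co-occur more often than chance. The transactions below are hypothetical, invented only for illustration.

```python
# Sketch: using lift to judge an association rule on invented baskets.
transactions = [
    {"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread", "eggs"},
    {"milk", "eggs"}, {"milk", "bread"}, {"eggs"},
]

def support(items):
    """Fraction of transactions containing all of `items`."""
    return sum(items <= t for t in transactions) / len(transactions)

def lift(a, b):
    """lift > 1 means buying `a` makes `b` more likely than chance."""
    return support(a | b) / (support(a) * support(b))

print(round(lift({"milk"}, {"bread"}), 3))  # → 1.125
```

A retailer would compute such measures over the full transaction log and use high-lift pairs to guide catalog design, shelf placement and cross-promotions.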
Discuss k-Nearest Neighbor (KNN) learning algorithm. What is the significance of the value of k in k-NN?
K-Nearest Neighbor (KNN) Learning Algorithm
The algorithm classifies new data by comparing it with existing records that have similar sets of parameters. It relies on the known classifications in an existing database and uses the separate classes to predict the pattern of a new sample and assign it to a class. The ‘neighbors’ in this case are the existing records with the most similar characteristics (Bhatia and Vandana 304). For instance, a bank may receive a loan applicant but lack the time to calculate the applicant’s credit rating. The bank can instead use the previous credit ratings of people with similar characteristics, such as earnings and collateral.
The Significance of the Value of k in k-NN
The k represents the number of neighbors consulted in the comparison. Small values of k follow the training data closely but are sensitive to noisy samples, while larger values smooth out noise at the cost of blurring class boundaries (Bhatia and Vandana 304). As such, tuning k is how one obtains the most accurate approximation in data classification and regression.
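The bank example from the discussion above can be sketched directly. The applicant records (income and collateral, with a credit rating label) are hypothetical, and squared Euclidean distance is one common choice among several.

```python
from collections import Counter

# Hypothetical past applicants: (income, collateral) in $1000s -> rating.
history = [
    ((30, 5), "low"), ((35, 10), "low"), ((40, 8), "low"),
    ((80, 60), "high"), ((90, 75), "high"), ((85, 50), "high"),
]

def knn_classify(query, k=3):
    """Majority vote among the k nearest past cases
    (squared Euclidean distance on the feature tuples)."""
    ranked = sorted(history,
                    key=lambda item: sum((a - b) ** 2
                                         for a, b in zip(item[0], query)))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_classify((38, 9)))   # → low
print(knn_classify((88, 70)))  # → high
```

With k = 3, a new applicant is rated by the three most similar past customers, exactly the ‘neighbors’ idea described above; changing k changes how many past cases get a vote.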
Discuss the two estimation methods of classification-type data mining models while considering ANN as a classifier
Supervised Learning
It is one of the estimation methods for classification data mining models in artificial neural networks (ANN). In this case, a set of example pairs is provided. The objective is to identify or ‘estimate’ a function. The function has to lie within the permitted class of functions (Nikam 15). In addition, it has to reflect the given examples.
Unsupervised Learning
In this estimation method, the ANN works with a given set of data, usually denoted x, and a cost function to be minimized. The cost can be any function of x and of the network’s output, usually denoted f. Its form depends on what the network is trying to model (Nikam 16) and on the assumptions made.
Works Cited
Bhatia, Nitin, and Ashev Vandana. “Survey of Nearest Neighbor Techniques.” International Journal of Computer Science and Information Security, vol. 8, no. 2, 2010, pp. 302-305.
Han, Jiawei, et al. Data Mining: Concepts and Techniques. 3rd ed., Morgan Kaufmann Publishers, 2011.
Kumar, Dharminder, and Deepak Bhardwaj. “Rise of Data Mining: Current and Future Application Areas.” International Journal of Computer Science Issues, vol. 8, no. 5, 2011, pp. 256-260.
Leung, Carson, and Joseph Kyle. “Sports Data Mining: Predicting Results for the College Football Games.” Procedia Computer Science, vol. 35, 2014, pp. 710-719.
Nikam, Sagar. “A Comparative Study of Classification Techniques in Data Mining Algorithms.” Oriental Journal of Computer Science & Technology, vol. 8, no. 1, 2015, pp. 13-19.
Ramageri, Bharati. “Data Mining Techniques and Applications.” Indian Journal of Computer Science and Engineering, vol. 1, no. 4, 2011, pp. 301-305.
The C4.5 algorithm builds decision trees that allow an unlimited number of branches per node. Because it works only with a discrete dependent attribute, it can solve only classification tasks. C4.5 is considered one of the most famous and widely used algorithms for generating decision trees. Working with C4.5 imposes the following requirements:
Each record in the dataset must be associated with one of the predefined classes, meaning that one attribute serves as the class label. Every sample must be assigned to exactly one class; otherwise, mistakes are inevitable.
Each class should be discrete. Each sample should belong to one of the classes.
The number of classes should be much smaller than the number of samples in the considered dataset.
One should also understand that C4.5 runs slowly on very large datasets.
Like ID3, C4.5 uses the concept of information entropy to build decision trees from a set of data. The files that C4.5 reads and writes take the form filestem.ext, where filestem is the file name and ext is an extension defining the file type. To run the program, one needs at least two files: the first contains the file name and class definitions, and the second contains the data, that is, the set of objects described by the values of the class attributes. Structurally, a decision tree built by C4.5 is either a leaf, which identifies a class, or a decision node with a number of branches and subtrees showing the possible outcomes of the test (Quinlan 5).
The algorithm can generate decision trees in two ways: batch mode and iterative mode. Batch mode (often called default mode) generates a single decision tree covering all the data available for the decision. Iterative mode works on a random basis: a subset of the data is selected randomly, a decision tree is generated, and specific objects that were misclassified are then added to the subset.
These steps are repeated, and the tree is refined until it classifies the data correctly or no further progress is made. Because iterative mode operates on randomly selected subsets, many trials may be used to generate decision trees from the same data. Since multiple trials can yield many different trees, the file filestem.unpruned is created to collect the decision trees produced in the process. If the same data is used to generate decision trees, the latest variant of the tree is used, and the machine saves the best generated decision tree in the file filestem.tree.
Works Cited
Quinlan, John Ross. C4.5: Programs for Machine Learning. Burlington: Morgan Kaufmann, 1993.
Data Mining Technology (DMT) is a well-developed technology that helps organizations make informed decisions, rather than rely on assumptions and guesswork, by mining or extracting useful information from the large volumes of data collected in the past (Wook, Yusof, & Nazri, 2014). DMT consists of different approaches and techniques that have been thoroughly deployed in business and industrial processes. It has been noted that the data mining techniques applied in these organizations can also be applied in the education sector, specifically higher education. This paper proposes the implementation and use of data mining at the Canadian University Dubai.
Data mining refers to “the extraction of hidden predictive information from large databases, and it is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses” (Thearling, 2012, p. 1). Given the robust nature of data mining technology, it can be applied in understanding unique elements of data stored in educational databases (Luan, 2004). The aim of mining data in the education environment is to enhance the quality of education for the mass through proactive and knowledge-based decision-making approaches. By assessing such benefits, Institutions of Higher Learning should focus on Data Mining Technology approaches as new strategies for comprehending and improving processes of learning and teaching. In fact, higher education requires technology to enhance competition and improve educational outcomes.
While some institutions of higher learning have made significant investments in Data Mining Technology, they have paid attention either to the invention of complex algorithms or to technical elements (Luan, 2004). However, these institutions have paid little attention to user perception of data mining technologies. In addition, few studies have examined the aspects of utilizing data mining technology that could inhibit its adoption and appreciation (Jan & Contreras, 2011). As with other technologies applied in different environments, user support is extremely critical for such technologies to succeed; otherwise, DMT may experience low rates of adoption and acceptance irrespective of notable outcomes and benefits (Jan & Contreras, 2011). It is therefore imperative to study user behaviors before proposing and implementing any technology. This approach would reduce low usage rates or outright abandonment.
The Technology Acceptance Model (TAM) and various related models have been developed to understand user behavior toward technology adoption. Hence, it is advisable for institutions of higher learning to integrate such tools into their strategies before adopting DMT.
Who uses it?
As previously noted, several organizations, businesses and industries use data mining technology to enhance decision-making. That is, data mining technology cuts across all sectors and organizations, including small and mid-sized ones. For instance, one pharmaceutical firm has been analyzing past sales activities to enhance customer targeting and identify the most robust marketing strategies. Financial institutions have turned to data mining to leverage abundant data from customer transactions for credit scoring and to identify customers who are most likely to apply for new credit facilities. Customer experiences and needs can be used for market segmentation. Finally, retailers may mine data from loyalty cards to understand buying patterns, improve sales processes and enhance customer experience. Generally, any organization can apply Data Mining Technology, including universities.
How does it work?
Data Mining Technology in education is possible. Educational Data Mining (EDM) is “the area of scientific inquiry centered on the development of methods for making discoveries within the unique kinds of data that come from educational settings” (de Baker, 2010, p. 2). It also reflects the use of data mining techniques to a given set of data in an educational setting.
Data analysts or scientists have used different techniques and a wide range of methods, such as clustering, prediction and relationship mining, decision tree, k-means, Bayesian networks and neural network among others (Erdoğan & Timor, 2005). These techniques and methods are applicable in educational setting and data (Romero & Ventura, 2010). Specifically, neural networks, Bayesian networks and decision trees have been noted as more appropriate for the education sector (Romero & Ventura, 2010).
In education, data mining focuses on analysis of large volumes of data from various sources for improved comprehension of learning, processes and outcomes. According to figure 1, the process of Educational Data Mining involves transformation of large volumes of data into valuable knowledge for users.
Figure 2 shows the “phases, and the iterative nature of a data mining project” (Oracle, 2015, p. 1). From this process, one can observe that the flow of the process never stops even if a specific solution is found and implemented. Instead, new questions emerge, which can be subsequently used to create models that are more robust.
Data Mining Technology is useful for gaining access to data from both structured and unstructured sources – from the Web and traditional data storage tools.
Data can be mined to determine or predict likely behaviors and trends of learners in a course session. These behaviors and trends may concern performance and curriculum. For instance, effective deployment of Data Mining Technology can predict students’ failure or success in a given course. Classifiers use variables to determine the relationships that predict outcomes. At the same time, the university can determine dropout rates and develop appropriate intervention systems to support the learners inclined to drop out of their courses.
At the same time, the recent use of Data Mining Technology has increased within the educational Web-based setting. For instance, data obtained from e-learning, learning management applications, tutoring systems and adaptive systems have been crucial in helping educators to better comprehend learners’ online learning behaviors and trends.
The predictive approaches rely on students’ activities and other related pieces of information, including time spent online, assignment submission, content studied and study results. In this case, educators can use results from predictive analytics to identify at-risk students. Consequently, they can design the most effective intervention methods to enhance outcomes for such students. In this regard, educators and institutions can rely on Data Mining Technology to enhance learning environments for learners and improve their operational activities.
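A minimal sketch of such at-risk prediction is a one-rule classifier that learns a cutoff on a single activity feature. The student records below (hours online, assignments submitted, final outcome) and the resulting threshold are entirely hypothetical, chosen only to illustrate the approach.

```python
# Hedged sketch: flag at-risk students with a one-rule (decision-stump)
# classifier learned from invented activity records.
# Record format: (hours online, assignments submitted, outcome).
records = [
    (2,  1, "fail"), (3,  2, "fail"), (4,  3, "fail"),
    (10, 8, "pass"), (12, 9, "pass"), (11, 7, "pass"),
]

def best_threshold(feature_index):
    """Pick the cutoff on one feature that misclassifies fewest students,
    predicting 'fail' below the cutoff and 'pass' at or above it."""
    best = None
    for rec in records:
        cut = rec[feature_index]
        errors = sum(("fail" if r[feature_index] < cut else "pass") != r[2]
                     for r in records)
        if best is None or errors < best[1]:
            best = (cut, errors)
    return best

cut, errors = best_threshold(0)  # learn a cutoff on hours online
print(cut, errors)               # → 10 0
```

Students logging fewer than the learned number of hours would be flagged for intervention; a real deployment would use many features and a stronger model, but the workflow (learn from past outcomes, then flag current students) is the same.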
TCO of such technology
Many data mining tools are available in the market today. Some, such as WEKA from the University of Waikato, New Zealand, are completely free. Others, from vendors such as Oracle, SAP, IBM SPSS and SAS, are offered at varied and potentially significant costs. Hence, it is imperative to determine whether CUD should deploy an open-source tool or purchase expensive commercial data mining software from a specific vendor.
Costs associated with hiring or training data scientists for the University could be enormous. It is imperative to note that there are currently not enough data scientists to meet the rising demand.
Literature Review about Selected Technology
Many scholars have recognized the myriad challenges facing higher learning institutions (Delavari, Phon-Amnuaisuk, & Beikzadeh, 2008). Specifically, decision-making processes have become more difficult because of many interrelated factors within the education system. In this regard, data mining with more efficient technology and expertise has been identified as a technique that can assist universities in decision-making. The information needed to facilitate decision-making lies in the databases universities already possess; only the appropriate tools, techniques and expertise are required to gain new knowledge from the data (Thearling, 2012).
Researchers continue to develop new data mining models for use in higher education institutions (Al-Twijri & Noaman, 2015). These models are designed to facilitate decision-making processes and control elements of student admission and graduation. In this regard, the major characteristics of learners associated with higher retention and graduation can be mined from available data and predicted early enough, within the first semester (Raju & Schumacker, 2015).
For a long time, studies have demonstrated that learners are unique: they have diverse knowledge levels, learn at different paces, face different socioeconomic challenges and differ in topic familiarity. On this note, an intelligent curriculum should account for the unique needs of every learner rather than using a standard curriculum for all. Learning content should be flexible, adaptive and regularly reviewed. This calls for analytics, so that significant insights can be discovered and applied to shape the content for different learners and improve learning experiences (Wagner & Ice, 2012).
At the lower levels, higher education should use data mining for student acquisitions; course selection; improving performance; student work groups; retention; teacher effectiveness; and attrition (Schmarzo, 2014). These diverse applications show that higher education outcomes can be enhanced through Data Mining Technology. Data Mining can act as the basis for making informed decisions, changing curriculum design and delivery, learning content evaluation, student learning activities, resource allocations and outcome monitoring among others.
Universities need to go deeper with analytics, beyond student acquisition and attrition rates. While those are a good starting point, deeper insights can be obtained by monitoring, for instance, students at greater risk of dropping a course or leaving college. By identifying these issues early enough, higher education can develop programs to reduce dropout rates effectively.
The rapid expansion of universities, online education and new technologies is putting much pressure on these institutions to increase performance and graduation rates. Fortunately, universities can benefit from data analytics to enhance education outcomes. They can exploit the opportunities identified in studies and adopt data mining to address the multiple challenges of higher education. The process requires collaboration with technology vendors and data experts to show how these new technologies can support learning. Universities should adopt best practices and models for data mining that give them the opportunity to transform learning experiences and outcomes for students, teachers and other stakeholders.
Applying the Technology in the Canadian University Dubai
Data Mining Technology is proposed for the Canadian University Dubai (CUD). As globalization increases, global universities have focused on expanding their reach. The University was established in 2006 in Dubai. It provides students with “Canadian education but with the respect of the culture and values of the United Arab Emirates” (Canadian University Dubai, 2015, p. 1).
The goal of the University is to help every student to “move forward and ensure well-rounded lifelong learner” (Canadian University Dubai, 2015, p. 1). Consequently, it has focused on academic achievement and extracurricular engagement. While these goals sound good, achieving them is likely to take longer than necessary because the University lacks robust decision support systems. It has not adopted Data Mining Technology to gain useful insights from its existing database.
To demonstrate how the University can benefit from Data Mining Technology, various case studies will be used to show how it can overcome specific challenges, as well as benefits of the outcomes. For instance, the University can use Data Mining to understand students that take most credit hours, classes that are most likely to be popular, students that are most likely to come for additional classes, or even predict pledges that alumni will make.
A certain institution of higher learning wanted to create meaningful learning-outcome typologies but faced the challenge of a limited understanding of its students. To overcome this issue, unsupervised data mining was applied (Luan, 2004). A typical university would have an enrolment of about “15,000 with students grouped as ‘transfer based’, vocational or basic skill upgraders” (Luan, 2004, p. 4). These categories capture only the basic information learners declared during enrolment and do not reflect the specific differences between learner types. In this case, the university used data mining techniques to develop exact typologies for its 15,000 students. The researchers applied two techniques of “clustering algorithms, TwoStep and K-means and used the algorithms on the three general classifications noted above and obtained mixed results” (Luan, 2004, p. 4). There were no distinct boundaries between clusters even after data cleaning and repeated measures, and no significant improvements were observed in the results. Thus, the researchers concluded that the initial declarations at enrolment did not reflect the actual behaviors of learners. They then applied a replacement technique by concentrating on “educational outcomes alongside lengths of study” (Luan, 2004, p. 4).
The educational outcomes were difficult to define. A specific learning period was required to establish that a learner had attained a given milestone. Dropout was also assessed as an “outcome of learning while researchers had to deal with ‘stopouts’ – learners who dropped out but later returned to continue with studies” (Luan, 2004, p. 5). Hence, the data scientists had to account for all these diverse variables in order to answer the specific typology question and research objective. After taking care of the outliers by eliminating them or including them in other clusters, the TwoStep algorithm generated the following clusters: “Transfers,” “vocational students,” “basic skills students,” “students with mixed outcomes,” and “dropouts” (Luan, 2004, p. 5). The researchers then used k-means to validate the generated clusters. The length of study was also factored into the variables for every cluster, and new perspectives emerged. “Some transfer cluster learners completed quickly; some vocational learners took longer; and other students appeared to simply take one or two courses at a time” (Luan, 2004, p. 5).
The results were informative for the university. From data mining, the university was able to understand demographic and other related information about student typologies. It was also established that “some older students took more time to complete their studies while younger ones with more socioeconomic advantages settled for high credit courses and completed studies faster” (Luan, 2004, p. 5). The university was able to group students as ‘transfer speeders’, ‘college historians’, ‘fence sitters’ and ‘skill upgraders’, among others. The typologies ensured that the university could understand students beyond the usual homogeneous grouping. The data mining project discovered hidden vital information that the university could use to meet the diverse needs of learners.
Another demonstration of how the University can apply Data Mining Technology is in the area of academic planning and interventions (Luan, 2004). In this case study, the college faced the difficult challenge of accurately predicting academic outcomes in order to develop appropriate interventions for learners. Colleges apply data mining techniques to identify learners at risk of low performance; consequently, they can develop appropriate interventions to prevent failure even before learners realize the risks. Transferring to a four-year academic program is the main objective of many students. However, academic challenges lead to extended transfer periods, and some students fail to transfer altogether. Traditionally, it has been difficult to understand these issues and transfer behavior among students. However, data miners can mine and match data from various sources to understand the behaviors and characteristics of students who transfer or fail to transfer. Data scientists and decision-makers can then relate these data to students’ academic behaviors and outcomes to determine transfer outcomes.
The solution to the transfer issue was found after applying data mining techniques. Various typologies and domain knowledge were used to develop an appropriate data mining model (Luan, 2004). In this case, transfer-education domain knowledge indicated that the most reliable means of handling transfer was to identify transfer-oriented students at the earliest opportunity. Supporting students who are potential candidates for transfer is more relevant than concentrating on students who have already gathered adequate points to transfer. By relying on transfer outcome data, data miners developed “a dataset with different students’ variables under the major transfer clusters of laggards and speeders” (Luan, 2004, p. 5).
The researchers then created a test dataset and a validation dataset from the original dataset through a proprietary randomization technique (Luan, 2004). Transfer was regarded as the outcome variable, while the other variables, including “units earned, courses taken, demographics, and financial assistance were classified as predictors” (Luan, 2004, p. 5). Thus, they were analyzed without a focus on stepwise testing for significance (Luan, 2004). The data mining process “tolerated interactions between variables and non-linear relations” (Luan, 2004, p. 5). The researchers used supervised data mining for the study. Neural network and rule induction algorithms were “performed at the same time for ease of comparing and contrasting the accuracy of the prediction” (Luan, 2004, p. 5).
The result allowed the college to identify students with better transfer opportunities. The extensive machine learning through “neural network algorithm increased the accuracy of the prediction” (Luan, 2004, p. 5). Thus, the researchers could easily identify patterns found in the data.
These are just a few cases that illustrate how the Canadian University Dubai can apply Data Mining Technology to improve learning and teaching outcomes. Robust algorithms and tools, applied by highly qualified data scientists, can help the University attain its goals.
Discussion and Analysis
One major issue that institutions of higher learning now encounter is predicting the outcomes of their students and alumni. Universities need to forecast enrolment and identify students who would require help to transfer. Other issues associated with traditional management of learning continue to motivate universities to seek better alternatives. Consequently, some universities have noted the potential of applying Data Mining Technology to overcome such challenges. Data mining gives organizations the ability to exploit their existing data resources, together with data mining tools, techniques, and expertise, to uncover and comprehend hidden patterns in large databases. The identified patterns are developed into data mining models that provide useful information universities can use to predict student behaviors. From the insight obtained, universities can allocate teaching and learning resources more accurately and effectively.
Data Mining Technology would, for instance, give the university the information needed to predict dropout and develop appropriate interventions to avert it, or possibly to provide more resources for a given course based on the prediction.
Past studies on data mining and its application in higher learning reveal a powerful tool that can transform the education sector. For example, at-risk students can be identified and given the support they require. In addition, there are specific functional areas in higher education where analytics and prediction can be applied to support outcomes, including critical areas such as finance and budgeting, enrollment, and instructional and student progress management (Mattingly, Rice, & Berge, 2012).
Although few institutions of higher learning have embraced Data Mining Technology, they can leverage student performance, usage, behaviors, faculty performance, and social insights such as tendencies, propensities, and trends to maximize learner engagement, reduce attrition rates, enhance lifetime value, and promote learning advocacy (Schmarzo, 2014).
Summary
The paper proposes the implementation and use of data mining at the Canadian University Dubai. Data mining refers to the “extraction of unknown predictive information from large databases” (Thearling, 2012, p. 1). The technology is considered powerful, with robust predictive analytics that can help organizations concentrate on the valuable data stored in their databases. Data Mining Technology helps organizations predict possible future trends; from the results, organizations can act proactively on available knowledge, making decisions more reliable. It is imperative to recognize that data mining is robust and goes beyond data analysis: it includes the extraction of large volumes of data from various sources, which are then analyzed to support decisions. Many data mining tools, including open, freely available ones, can tackle challenging issues that were once considered complex or tedious. These tools can scour multiple databases for predictive information and can suggest potential future behaviors.
Results from organizations, including the few institutions of higher learning that have embraced data mining, show that it is a robust analytical tool that can help organizations overcome some of their challenges. For instance, institutions of higher learning can use data mining techniques to understand student behaviors, improve resource and staff allocation, and enhance relationships with alumni. The hidden patterns can provide critical information, based on predictive models, for managing issues of enrolment, dropouts, graduation, and even alumni relations.
Different organizations have deployed data mining tools and techniques to analyze large volumes of data and gain insights that aid decision-making. The Canadian University Dubai can also adopt Data Mining Technology for analytical purposes. Doing so, however, requires expertise in data science and effective machine learning tools that can handle large volumes of structured and unstructured data from different sources.
References
Al-Twijri, M. I., & Noaman, A. Y. (2015). A New Data Mining Model Adopted for Higher Institutions. Procedia Computer Science, 65, 836–844. Web.
de Baker, R. J. (2010). Data Mining for Education. Oxford, UK: Elsevier.
Delavari, N., Phon-Amnuaisuk, S., & Beikzadeh, M. R. (2008). Data Mining Application in Higher Learning Institutions. Informatics in Education, 7(1), 31–54.
Erdoğan, Ş. Z., & Timor, M. (2005). A Data Mining Application in a Student Database. Journal of Aeronautics and Space Technologies, 2(2), 53-57.
Jan, A. U., & Contreras, V. (2011). Technology acceptance model for the use of information technology in universities. Computers in Human Behavior, 27(2), 845–851. Web.
Luan, J. (2004). Data Mining Applications in Higher Education. Chicago: SPSS Inc.
Mattingly, K. D., Rice, M. C., & Berge, Z. L. (2012). Learning analytics as a tool for closing the assessment loop in higher education. Knowledge Management & E-Learning: An International Journal, 4(3), 236-247.
Raju, D., & Schumacker, R. (2015). Exploring Student Characteristics of Retention that Lead to Graduation in Higher Education Using Data Mining Models. Journal of College Student Retention: Research, Theory & Practice, 16(4), 563-591. Web.
Romero, C., & Ventura, S. (2010). Educational Data Mining: A Review of the State of the Art. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 40(6), 601-618. Web.
Schmarzo, B. (2014). What Universities Can Learn from Big Data – Higher Education Analytics. Web.
Thearling, K. (2012). An Introduction to Data Mining. Web.
Wagner, E., & Ice, P. (2012). Data Changes Everything: Delivering on the Promise of Learning Analytics in Higher Education. EDUCAUSE Review, 47(4).
Wook, M., Yusof, Z. M., & Nazri, M. Z. (2014). Data Mining Technology Adoption in Institutions of Higher Learning: A Conceptual Framework Incorporating Technology Readiness Index Model and Technology Acceptance Model 3. Journal of Applied Sciences, 14, 2129-2138. Web.
Classifiers produced by C4.5 are expressed either as decision trees or as rulesets. Our focus in this discussion is on decision trees, and here we look at their advantages. The C4.5 classifier is based on the ID3 algorithm, whose primary aim is to find small decision trees; on this basis, we can say that the trees produced are small and simple to understand (Witten & Frank, 2000). Another advantage of this classifier is that its source code is readily available. Thirdly, unavailable attribute values are accounted for in C4.5 by assessing the gain using only the records where the particular attribute is defined. A fourth advantage is that it can handle both continuous and discrete attributes. For attributes with a continuous range, partitions are created based on thresholds drawn from the training set and the gain is calculated on each partition (Quinlan, 1993); the partition that maximizes the gain is then picked. Lastly, pruning the tree after creation keeps it simple by replacing some branches with leaf nodes.
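The gain-based handling of continuous attributes described above can be sketched as follows. This is a generic illustration of entropy-based threshold selection, not the actual C4.5 source; the toy attribute values and labels are invented.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_for_threshold(values, labels, threshold):
    """Information gain of splitting a continuous attribute at `threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

def best_split(values, labels):
    """Try midpoints between sorted distinct values; keep the max-gain one."""
    candidates = sorted(set(values))
    thresholds = [(a + b) / 2 for a, b in zip(candidates, candidates[1:])]
    return max(thresholds, key=lambda t: gain_for_threshold(values, labels, t))

# Invented toy data: one continuous attribute and a binary class label.
vals = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
labs = ["no", "no", "no", "yes", "yes", "yes"]
print(best_split(vals, labs))  # → 5.5
```

Here the midpoint 5.5 separates the classes perfectly, so it maximizes the gain.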
Disadvantages
C4.5 classifiers are relatively slow in terms of processing speed: a task that takes C4.5 15 hours to complete may take C5.0 only 2.5 minutes. Secondly, C4.5 is inefficient in memory usage, meaning that some tasks will not complete on 32-bit systems (Witten & Frank, 2000). In terms of accuracy, the rulesets produced by C4.5 often contain many errors. C4.5 supports limited data types and lacks a facility to label data as not applicable (Quinlan, 1993). The decision trees produced by C4.5 are relatively large, though this is addressed in the C5.0 version. All classification errors are treated equally, even though some are more serious than others, which is another disadvantage. The lack of a provision to quantify the importance of cases is a drawback, because not all cases are equally important. Lastly, attributes need to be winnowed before a classifier is generated, and C4.5 lacks this facility.
KStar Algorithm classifiers
Advantages
Firstly, in terms of accuracy, this algorithm is comparable to C4.5 on voluminous UCI benchmark datasets. Secondly, KStar performs better, and faster, than C4.5 on large text-classification tasks. Thirdly, the algorithm has low time complexity, which means it is very fast; its speed is comparable to that of naive Bayes. In addition, it can be sped up further by combining it with other scaling-up methods (Cormen et al., 1990). Another advantage is that it uses entropy as a measure of distance, providing a consistent approach to handling symbolic attributes, unknown values, and continuous-valued attributes. The results it presents compare favourably with those of many machine learning algorithms. KStar is an instance-based classifier: it classifies a new instance by comparing it to pre-classified examples stored in a database. A concept-description updater determines which new instances are added to the instance database and which stored instances are used in classification (Gray, 1990). This helps reduce memory requirements and improves tolerance to noise in the data.
Disadvantages
One of the major disadvantages of this algorithm is that it has to generate distance measures for all the recorded attributes (Cormen et al., 1990). Another disadvantage is that it is not cost-effective: both the building and learning processes are quite expensive. In terms of memory, although the algorithm uses an instance updater to select the instances to be used, storing the sample instances still consumes considerable space (Gray, 1990). Finally, although the algorithm can be sped up by combining it with other algorithms, this is sometimes a disadvantage, since the combination process involves extra cost.
Bayesian Network classifier
Advantages
The first advantage of this classifier is its computational efficiency: large and complex computational problems are decomposed into smaller, simpler, self-sufficient models. Secondly, it simplifies the incorporation of domain knowledge into the model design by using the structure of the problem domain. Thirdly, the natural combination of the EM algorithm with the probabilistic representation helps address problems with missing data. The classifier's ability to indicate all the possible classes a new sample may belong to, together with their probabilities, is another advantage: in the case of a misclassification, it is easy to tell which other class the sample may belong to (Mitchell, 1997). A further advantage is the presence of an updater that enables the classifier to learn from new data samples. Bayesian networks are also able to capture the complexity of decision making. Lastly, the system's classification rules can be semantically determined and justified to both the novice and the expert (Jensen & Graven, 2007).
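The decomposition of a joint distribution into small conditional models can be illustrated with a minimal two-node network. The structure (Rain -> WetGrass) and all probabilities below are toy assumptions, not drawn from the cited sources.

```python
# Minimal two-node Bayesian network: Rain -> WetGrass.
# The joint factorizes as P(Rain) * P(WetGrass | Rain), so only two
# small conditional tables are needed. All numbers are invented.
P_rain = {True: 0.2, False: 0.8}
P_wet_given_rain = {True:  {True: 0.9, False: 0.1},
                    False: {True: 0.2, False: 0.8}}

def p_rain_given_wet(wet=True):
    """Infer P(Rain | WetGrass=wet) by enumerating the joint distribution."""
    joint = {r: P_rain[r] * P_wet_given_rain[r][wet] for r in (True, False)}
    total = sum(joint.values())  # evidence probability P(WetGrass=wet)
    return joint[True] / total

print(round(p_rain_given_wet(True), 3))  # → 0.529
```

Observing wet grass raises the probability of rain from the 0.2 prior to roughly 0.53, which is the "capture the complexity of decision making" point in miniature.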
Disadvantages
The mismatch between the data likelihood and the actual label-prediction accuracy tends to make the learning method suboptimal. Secondly, Bayesian networks require an expert to supply domain information for the creation of the network (Jensen & Graven, 2007). Despite the substantial amount of research carried out, the creation of these networks is still limited to datasets consisting of only a few, highly informative variables. Another disadvantage is that interpretation and efficiency are quite limited when rulesets are drawn from the network: compared with rules derived from decision trees, whose interpretation is simple and direct, Bayesian networks are more complex (Mitchell, 1997).
Fast effective rule induction (JRip classifier)
Advantages
The first advantage of this classifier is its use of a propositional rule learner, which keeps errors minimal. Also, the phase-by-phase implementation of the algorithm ensures that the overall results are near perfect. During the rule-growing phase, the condition with the highest information gain is picked, which keeps the rule set accurate (Arthur, 1996). Another advantage is that the algorithm is made shorter and simpler by pruning the useless parts of a rule, which makes the classifier easy to understand and interpret. The ease of generating this classifier is a fourth advantage. The classifier is highly expressive, comparable to a decision tree, and even in terms of performance it can be ranked at the same level (Bishop, 1997). JRip can also classify new instances rapidly. Lastly, it handles missing values and numeric attributes with ease.
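The growing-phase step of picking the condition with the highest information gain can be sketched with the FOIL-style gain used by RIPPER-family learners. This is a simplified, single-condition sketch, not Weka's actual JRip implementation; the dataset and attribute names are invented.

```python
from math import log2

def foil_gain(pos_before, neg_before, pos_after, neg_after):
    """FOIL information gain of adding a condition to a growing rule
    (a simplified sketch of RIPPER-style rule growing)."""
    if pos_after == 0:
        return float("-inf")  # a condition covering no positives is useless
    before = log2(pos_before / (pos_before + neg_before))
    after = log2(pos_after / (pos_after + neg_after))
    return pos_after * (after - before)

def grow_one_condition(examples, label_key="label"):
    """Pick the single attribute=value condition with the highest gain."""
    pos = [e for e in examples if e[label_key]]
    neg = [e for e in examples if not e[label_key]]
    best, best_gain = None, float("-inf")
    for attr in examples[0]:
        if attr == label_key:
            continue
        for value in {e[attr] for e in examples}:
            p = sum(1 for e in pos if e[attr] == value)
            n = sum(1 for e in neg if e[attr] == value)
            g = foil_gain(len(pos), len(neg), p, n)
            if g > best_gain:
                best, best_gain = (attr, value), g
    return best

# Invented toy data with a binary class label.
data = [
    {"outlook": "sunny", "windy": False, "label": True},
    {"outlook": "sunny", "windy": True,  "label": True},
    {"outlook": "rain",  "windy": True,  "label": False},
    {"outlook": "rain",  "windy": False, "label": False},
]
print(grow_one_condition(data))  # → ('outlook', 'sunny')
```

A full learner would keep adding conditions until the rule covers no negatives, then prune, which is the grow-and-prune cycle the paragraph describes.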
Disadvantages
The JRip algorithm requires a large investment of time to learn the algorithm and to test the features that can be customized. Another disadvantage is that the RIPPER algorithm uses induced rules, which sometimes have to be replaced by expert-derived rules for certain applications. In addition, the accuracy of JRip's results varies, because the results produced differ depending on the rule-voting method used (Arthur, 1996). Highly accurate results can only be achieved by running the algorithm many times. Finally, assigning salience, a prescribed order for firing rules, may render the inference engine of an expert system powerless and may also degrade the performance of such a rule-based system (Bishop, 1997).
K-Nearest Neighbour Algorithm
Advantages
This algorithm uses local information, which can produce highly adaptive behaviour. Secondly, it is robust to noisy training data and simple to implement, and it lends itself easily to parallel implementations. Another advantage of the k-nearest neighbour algorithm is the ease and simplicity of learning. In addition, training is very fast, and the results are near-optimal in the large-sample limit, meaning the algorithm is more effective when the training set is large (Witten, 2005). Its ability to approximate complex target concepts locally, and differently for every new instance, is another advantage. Lastly, the algorithm is intuitive and easy to understand, which makes implementation and modification easy, and it provides favourable generalisation accuracy on many domains (Duda & Hart, 2000).
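The algorithm itself is short enough to sketch in full. The version below uses plain Euclidean distance and an unweighted majority vote, the simplest variant; the training points are invented.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs. A minimal sketch;
    practical versions add distance weighting and explicit tie-breaking.
    """
    neighbours = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Invented 2-D training points with two class labels.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"),
         ((5.0, 5.0), "b"), ((5.5, 4.5), "b"), ((4.8, 5.2), "b")]
print(knn_predict(train, (5.1, 5.0)))  # → b
```

Note how all the work happens at query time: the "training" is just storing the examples, which is exactly the memory and classification-speed trade-off discussed under the disadvantages.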
Disadvantages
One disadvantage of this algorithm is its large memory requirement: it needs to store all the training data. Secondly, during classification all training instances have to be visited, which makes the algorithm slow at that stage. A third disadvantage is that the algorithm is easily fooled by irrelevant attributes, so accuracy decreases as irrelevant attributes increase. Accuracy also decreases with increased noise in the training set. Another shortcoming is its computational complexity (Witten, 2005), owing to its intensive computational recall (Duda & Hart, 2000). Because it defers all computation until a query arrives, classification is slow. The algorithm is also highly vulnerable to the curse of dimensionality and, lastly, is strongly influenced by the choice of k.
Naive Bayes classifier
Advantages
Naive Bayes classifiers rest on very simple assumptions and a naive design, and hence are easy to build. The structure is basically the same for every problem, which eliminates the structure-learning procedure; model building is highly scalable, and scoring can be parallelised regardless of the algorithm. Secondly, these classifiers have worked favourably in numerous real-world scenarios, outperforming many more complex classifiers on large numbers of datasets despite their simplicity (Box & Tiao, 1992). Thirdly, the classifier requires only a small amount of training data to estimate the parameters required for classification, namely the per-class averages and variances of the variables. A fourth advantage is that only the class-conditional variances need to be resolved, not the covariance of the entire dataset (Witten & Frank, 2003). Both binary and multiclass classification problems can be solved by this algorithm. It relies on basic probability rules, making it simple to operate, and, being probabilistic, it presents results in a form favourable for incident-management policy. Lastly, it allows a broader set of model parameters to be used.
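The point that only class-conditional averages and variances are needed can be seen in a from-scratch Gaussian naive Bayes sketch. The toy data and the small variance-smoothing constant are assumptions for illustration.

```python
from collections import defaultdict
from math import exp, pi, sqrt

def fit_gaussian_nb(rows, labels):
    """Estimate per-class feature means and variances, plus class priors.

    Only class-conditional statistics are computed, never the covariance
    of the whole dataset, as noted above.
    """
    by_class = defaultdict(list)
    for x, y in zip(rows, labels):
        by_class[y].append(x)
    model = {}
    for y, xs in by_class.items():
        n = len(xs)
        means = [sum(col) / n for col in zip(*xs)]
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9  # smoothing
                     for col, m in zip(zip(*xs), means)]
        model[y] = (n / len(rows), means, variances)
    return model

def predict(model, x):
    """Pick the class maximizing prior times the product of likelihoods."""
    def gauss(v, m, var):
        return exp(-((v - m) ** 2) / (2 * var)) / sqrt(2 * pi * var)
    scores = {}
    for y, (prior, means, variances) in model.items():
        p = prior
        for v, m, var in zip(x, means, variances):
            p *= gauss(v, m, var)  # naive independence assumption
        scores[y] = p
    return max(scores, key=scores.get)

# Invented 2-D toy data with two classes.
X = [(1.0, 2.0), (1.2, 1.8), (6.0, 9.0), (5.8, 9.2)]
y = ["low", "low", "high", "high"]
model = fit_gaussian_nb(X, y)
print(predict(model, (6.1, 8.9)))  # → high
```

Because each feature is scored independently per class, training touches each value once, which is why so little data is needed to estimate the parameters.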
Disadvantages
The assumption that every variable is independent of the others is sometimes a problem, the class estimates are sometimes absurd, and the decision threshold must be tuned empirically rather than set analytically. This classifier has been outperformed by newer approaches such as boosted trees, and it lacks the ability to solve more complex classification problems. The naive Bayes algorithm is used in Bayesian spam filtering, where it is vulnerable to Bayesian poisoning (Box & Tiao, 1992); the spam filter can also be beaten simply by replacing text with pictures. Another disadvantage is that poor parameter estimates degrade the effectiveness of the classification. When using a Bayesian filter, a user-specific database containing the word probabilities used to detect spam must be consulted for every message (Witten & Frank, 2003). Lastly, initialising a naive-Bayes-based filter is somewhat time-consuming.
Works Cited
Bishop, Christopher. Pattern Recognition and Machine Learning. New York, NY: Springer, 1997.
Box, G., and Tiao, G. Bayesian Inference in Statistical Analysis. New York. John Wiley & Sons, 1992.