1. Introduction
1.1 Applications
Data mining is applied in a wide variety of areas, and many commercial data mining systems are available today. In financial data analysis, it is used for loan payment prediction and customer credit policy analysis. In the retail industry, it is used to analyze the effectiveness of sales campaigns. In telecommunication data analysis, it plays an important role in providing visualization tools. Biological data mining is a very important part of bioinformatics. [10] In biological data analysis, data mining supports the discovery of structural patterns and the analysis of genetic networks and protein pathways. It also contributes to intrusion detection. [10]
1.4 Research motivation and problem statement
1.4.1 Research motivation
Bone diseases such as trauma, inflammation, arthritis, osteoporosis, and bone tumors are now very common, as are bone fractures. We therefore propose a system that analyzes the previously stored symptoms of bone disease patients and predicts the status of a new patient, that is, the complexity and type of bone disease the patient has. As a result, unnecessary tests on bones can be avoided.
1.5 Research objectives and contributions
1.5.1 Primary objectives
Bone is one of the most important components of the human body: it supports and protects various organs and provides structure for the body. Despite this importance, bones can be affected by various diseases and infections. Our primary objective is therefore to identify these diseases or infections at an early stage, which avoids unnecessary tests on bones.
1.5.2 Main contributions
In our project, we consider the symptoms, signs, age, and gender of previously affected patients. When a new patient presents with similar symptoms or signs, we use the previously stored and analyzed data to predict whether the patient is affected by a similar disease. As a result, the patient may avoid some unnecessary tests.
1.6 Organization of the report
The report is divided into nine chapters. The first chapter is the Introduction, which covers the background, a brief history of the technology, applications, research motivation and problem statement, research objectives, contributions, organization of the report, and summary. The remaining chapters are Literature Survey, System Requirements Specifications, Design, Implementation, Testcases, Results, Conclusions, and References.
2. Literature survey
2.1 Introduction
Several research works related to “Bone Disease Prediction Using Data Mining Techniques” have been found. This chapter reviews the datasets, algorithms, and methodologies used by the authors, along with their observed results and the future work proposed toward efficient models of medical diagnosis for various bone diseases.
Bone diseases are mainly of the following types:
Bone disease is any of the diseases or injuries that affect human bones. Osteoporosis is a condition that weakens bones, making them fragile and more likely to break. More than 300,000 people receive hospital treatment for fragility fractures every year as a result of osteoporosis.
2.2 Related work
This section briefly discusses the work on bone disease prediction that has been carried out in the past few years.
- 1. Prediction of fracture risk in postmenopausal white women with peripheral bone densitometry: Evidence from the national osteoporosis risk assessment.
Low Bone Mineral Density (BMD) is a risk factor for fracture and is considered an important predictor of future fractures. The authors studied the relationship between BMD measurements at peripheral sites and subsequent fracture risk at the hip, wrist/forearm, spine, and rib in 149,524 postmenopausal white women without a prior diagnosis of osteoporosis. At enrollment, each participant completed a risk assessment questionnaire and had BMD testing at the heel, forearm, or finger. The main outcomes were new fractures of the hip, wrist/forearm, spine, or rib within the first 12 months after testing. BMD was expressed as T-scores, which served as the measure for prediction. The authors used BMD measurement, questionnaires, and data analysis for the prediction, and assessed performance through evaluation criteria such as sensitivity and specificity.
- 2. Biochemical Markers of Bone Metabolism and Prediction of Fracture in Elderly Women.
The authors studied whether different markers of bone turnover predict fracture in 1040 elderly women. The markers considered were serum bone-specific alkaline phosphatase and four different forms of serum osteocalcin (S-OC), along with others as markers of bone resorption. They considered sampling procedures, bone formation markers, bone resorption markers, urine osteocalcin, and other measurements for the prediction.
- 3. Fracture prediction from bone mineral density in Japanese men and women
The paper mainly focuses on low bone mineral density, which is an important predictor of future fractures. The authors considered the association of Bone Mineral Density (BMD) with the risk of fracture of the spine or hip in a cohort of 2356 men and women aged 47–95 years, who were followed up by biennial health examinations. Follow-up averaged 4 years after baseline measurements of BMD taken with Dual-energy X-ray Absorptiometry (DXA). Vertebral fracture was assessed using semiquantitative methods, and the diagnosis of hip fracture was based on medical records. Poisson and Cox regression analyses were the models used.
- 4. Prediction and Informative Risk Factor Selection for Bone Disease.
The authors trained an independent model based on a specific group of patients. They examined Comprehensive Disease Memory (CDM) which captures the characteristics for all patients to predict the disease. Bone disease memory (BDM) memorizes the characteristics of those considered individuals who suffer from bone diseases. Similarly, the Non-Disease Memory (NDM) memorizes attributes for non-diseased individuals. They have used Shallow Restricted Boltzmann Machine and 2-Layer Deep Belief Network for the prediction purpose.
- 5. Predict and prevent bone disease using data mining techniques.
This work applied several models to predict and prevent various bone diseases. The author used the oomph model as the starting point of the proposed method, and then introduced single-layer and multi-layer learning approaches to construct the different disease memories. Finally, the author proposed a model that focuses on prediction and educational Risk Factor selection for bone diseases. The author analyzes the performance of the algorithms through evaluation criteria such as sensitivity to skewed classes, sensitivity to noisy data, and parameter selection.
2.3 Study of tools/technology
We use the Python interpreter for implementation and the Jupyter Notebook web application for better visualization of the data and for identifying the most accurate algorithm among those we have used: Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbor (KNN), Decision Tree (DT), Naïve Bayes (NB), and Support Vector Machine (SVM).
Jupyter Notebook: The Jupyter Notebook is an open-source web application that can be used to create and share documents that contain live code, equations, visualizations, and text.
Summary
3. System requirements specifications
3.1 General Description
3.1.1 Product Perspective
The system predicts the type of bone disease occurring in the human body by analyzing a dataset, which is divided into a training set and a test set, based on a number of features and using the most accurate algorithm. The prediction system predicts two types of bone disease: Traumatic bone disease and Degenerative bone disease.
3.2 System Requirements
3.2.1 Hardware Requirements
- Processor: 2.10 GHz Dual Core (Faster is better)
- RAM: 4 GB (at least 8 GB recommended)
- System type: 64-bit Operating System
3.2.2 Software Requirements
- 1. Anaconda-Jupyter
The Jupyter Notebook is an open-source web application which can be used to create and share the documents that contain live code, equations, visualizations, and text.
Fig 3.2.2.1. Jupyter web application
- 2. Python 3.6.4 IDLE
Fig 3.2.2.2. Python 3.6.4 IDLE
3.2.2.1 Functional Requirements & Non-functional Requirements
- Efficiency: The system determines the accuracy of each considered algorithm and selects the most accurate one for the prediction of bone diseases.
- Graph: The accuracies of the algorithms are represented in a bar graph for better visualization.
3.2.2.2 User Requirements
The different modules are explained below:
- The real-time dataset consists of the age, region affected, symptoms, signs, and functional disabilities of all the patients.
- The system determines the accuracy of all the considered algorithms, from which the LR algorithm is selected for further processing.
- The LR model analyzes the stored data and predicts the type of bone disease that a new patient has.
3.3 Summary
In this chapter, we have discussed and organized the functional requirements for the prediction system. The system requirements, which include the hardware and software requirements, are also specified. These requirements should be fulfilled to complete this project successfully.
4. Design
4.1 Architectural design
Fig 4.1: Architectural Design
Fig. 4.1 shows the architectural design for carrying out the prediction. The bone disease dataset is fed to the ‘Training phase’, where six algorithms (KNN, Support Vector Machine, Decision Trees, Linear Discriminant Analysis, Logistic Regression, and Naïve Bayes) are applied to the dataset to compare their efficiency. In the ‘Testing phase’, the most efficient of the six algorithms is selected and then used to predict the type of bone disease when new data is fed in for prediction.
4.2 Dataflow Diagram
Fig 4.2: Dataflow Diagram
Fig. 4.2 shows how data flows through the proposed system when carrying out the prediction. The dataset is separated into a Training set and a Test set. The Training set is the data used to train a model; during training, specific features are picked out from it. The Test set is the data on which the model is applied to predict the output. In the figure, the training set undergoes pre-processing to identify and select the relevant features. The pre-processed data is then fed to several algorithms, among which the most efficient is selected for the prediction. Test data is fed to the selected model to obtain the predicted output.
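The flow described above can be sketched in a few lines of Python. This is a minimal illustration and not the project's actual code: the helper names `accuracy` and `select_best`, and the interface in which each candidate is a `fit` function that returns a `predict` function, are assumptions made for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of test examples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def select_best(models, train_X, train_y, test_X, test_y):
    """Train every candidate on the training set and score it on the test set.

    `models` maps a name to a hypothetical fit(train_X, train_y) function
    that returns a predict(x) function for a single feature vector.
    """
    scores = {}
    for name, fit in models.items():
        predict = fit(train_X, train_y)
        scores[name] = accuracy(test_y, [predict(x) for x in test_X])
    best = max(scores, key=scores.get)  # name with the highest test accuracy
    return best, scores
```

Keeping the scoring step separate from the individual models means that each of the six algorithms can be plugged in without changing the comparison logic.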
4.3 Class hierarchy Diagram
Fig 4.3: Class Hierarchy Diagram
Fig. 4.3 shows the class design for carrying out the prediction. The class diagram consists of all the classes defined in the project, with their attributes and functions. In our project, there are four classes: main, get_data, predict_disease, and get_data1. The get_data class retrieves the data from the database. The predict_disease class is where the prediction of bone disease is carried out.
4.4 Usecase Diagram
Fig 4.4: Usecase Diagram
Fig. 4.4 shows the use case design for carrying out the prediction. The use case diagram describes how the user interacts with the system and the database. Here, data is collected from several patients and stored in the database. The database feeds this data to several algorithms. The predicted results are stored in an Excel sheet, where they can be viewed by the patient.
4.5 Sequence Diagram
Fig 4.5: Sequence Diagram
Fig. 4.5 shows the sequence design for carrying out the prediction. The sequence diagram shows the order in which the whole process of disease prediction takes place. Here, data is collected from the patient and stored in the database. The dataset stored in the database is then fed to the algorithms, which are compared based on efficiency. The algorithm with the best efficiency is then used to predict the result once new data is fed to it. The predicted result and the dataset are stored in the database.
4.6 Activity diagram
Fig 4.6: Activity diagram
Fig. 4.6 shows the activity design for carrying out the prediction. The dataset is fed to the six considered algorithms (KNN, SVM, LR, LDA, DT, and NB), and the best of the six is selected based on efficiency. New data is fed as input to the selected algorithm and the result is obtained.
5. Implementation
Dataset
The Dataset is collected from Sun Orthopedic Hospital, Mathikere. It has 47 rows, 29 attributes, and 2 classes.
Patient names do not contribute to bone disease prediction, so this attribute is not used in our models. The dataset description is given in Table 5.1.1.
| Attribute | Value | Description |
|---|---|---|
| Name | String | Name of the patient |
| | | Age of the patient entered in years |
| | | Yes, if Male. No, if Female. |
| | | The region affected due to bone disease is body. (0=No; 1=Yes) |
| | | The region affected due to bone disease is knee. (0=No; 1=Yes) |
| | | The region affected due to bone disease is ankle. (0=No; 1=Yes) |
| | | The region affected due to bone disease is foot. (0=No; 1=Yes) |
| | | The region affected due to bone disease is lumbous hip. (0=No; 1=Yes) |
| | | The region affected due to bone disease is shoulder. (0=No; 1=Yes) |
| | | The region affected due to bone disease is wrist. (0=No; 1=Yes) |
| | | The region affected due to bone disease is thumb. (0=No; 1=Yes) |
| | | The region affected due to bone disease is shoulder joint. (0=No; 1=Yes) |
| | | The region affected due to bone disease is hand. (0=No; 1=Yes) |
| | | The region affected due to bone disease is leg. (0=No; 1=Yes) |
| | | The region affected due to bone disease is rib. (0=No; 1=Yes) |
| | | The region affected due to bone disease is multiple joint. (0=No; 1=Yes) |
| | | The region affected due to bone disease is lower back. (0=No; 1=Yes) |
| | | The region affected due to bone disease is thigh. (0=No; 1=Yes) |
| | | The region affected due to bone disease is around neck. (0=No; 1=Yes) |
| | | The region affected due to bone disease is spine. (0=No; 1=Yes) |
| | | The region affected due to bone disease is hip. (0=No; 1=Yes) |
| | | The region affected due to bone disease is elbow. (0=No; 1=Yes) |
| | | The region affected due to bone disease is knee joint. (0=No; 1=Yes) |
| | | The symptom that indicates the bone disease is pain in affected region. (0=No; 1=Yes) |
| Difficulty in Movement | Boolean | The symptom that indicates the bone disease is difficulty in movement. (0=No; 1=Yes) |
| | | The symptom that indicates the bone disease is buckling. (0=No; 1=Yes) |
| Weakness in Muscle | Boolean | The symptom that indicates the bone disease is weakness in muscle. (0=No; 1=Yes) |
| | | The sign that indicates the bone disease is swelling. (0=No; 1=Yes) |
| | | The sign that indicates the bone disease is redness. (0=No; 1=Yes) |
| | | The sign that indicates the bone disease is itching. (0=No; 1=Yes) |
| | | The sign that indicates the bone disease is ankle deformality. (0=No; 1=Yes) |
| Feverish due to Pain | Boolean | The sign that indicates the bone disease is fever due to pain. (0=No; 1=Yes) |
| | | The sign that indicates the bone disease is pain in ribs. (0=No; 1=Yes) |
| | | The sign that indicates the bone disease is tenderness. (0=No; 1=Yes) |
| | | The sign that indicates the bone disease is sweating. (0=No; 1=Yes) |
| Vomiting Sensation | Boolean | The sign that indicates the bone disease is vomiting sensation. (0=No; 1=Yes) |
| | | The sign that indicates the bone disease is latchment. (0=No; 1=Yes) |
| | | The sign that indicates the bone disease is minimal pain in affected region. (0=No; 1=Yes) |
| | | The sign that indicates the bone disease is diabetic control. (0=No; 1=Yes) |
| | | The sign that indicates the bone disease is bending. (0=No; 1=Yes) |
| Water Content in Joint | Boolean | The sign that indicates the bone disease is water content in joint. (0=No; 1=Yes) |
| | | The patient has no signs that indicate bone disease. (0=No; 1=Yes) |
| | | The physical incapacity to do chores due to bone disease is pain in affected region. (0=No; 1=Yes) |
| | | The physical incapacity to do chores due to bone disease is bone dislocation. (0=No; 1=Yes) |
| Difficulty in Doing Daily Activities | Boolean | The functional disability due to bone disease is difficulty in doing daily activities. (0=No; 1=Yes) |
| Difficulty in Movement | Boolean | The functional disability due to bone disease is difficulty in movement. (0=No; 1=Yes) |
| | | Indicates the type of bone disease. (1=Traumatic bone disease; 2=Degenerative bone disease) |
Table 5.1.1. Dataset Description
5.1 Methodology
5.1.1. Algorithms
The algorithms used for the prediction of bone diseases are Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbor (KNN), Decision Trees (DT), Naïve Bayes (NB), and Support Vector Machine (SVM). These algorithms are described below.
- 1. Logistic Regression (LR):
Logistic regression is a classification and predictive algorithm. LR is used to describe data and to explain the relationship between one dependent binary variable and one or more independent variables that determine an outcome. The binary logistic model estimates the probability of a binary response based on one or more predictors. It is used to predict a binary outcome such as “0” or “1”, which may represent “Yes” or “No”, or “True” or “False”, for a given set of independent variables. [9]
The logistic regression equation is shown in Equation (1), and its sigmoid curve is shown in Fig. 5.1.1.1.

$$S(x) = \frac{1}{1 + e^{-x}} \qquad (1)$$

- where S(x) represents the sigmoid function,
- x represents a real number.
Fig. 5.1.1.1. Logistic Regression Sigmoid Curve
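The sigmoid function in Equation (1) can be evaluated directly. The sketch below is illustrative only: the function names, the 0.5 threshold, and the weights and bias passed to `predict_label` are assumptions for this example and are not taken from the trained model.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid S(x) = 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # for negative x, use the equivalent form e^x / (1 + e^x)
    return z / (1.0 + z)


def predict_label(weights, bias, features, threshold=0.5):
    """Return 1 if the modelled probability reaches the threshold, else 0."""
    score = bias + sum(w * f for w, f in zip(weights, features))
    return 1 if sigmoid(score) >= threshold else 0
```

The function maps any real number into the interval from 0 to 1, which is why its output can be read as the probability of the positive class.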
- 2. Linear discriminant Analysis (LDA):
LDA is most commonly used as a dimensionality reduction technique in the pre-processing step of pattern-classification and machine learning applications. The goal is to project a dataset onto a lower-dimensional space with good class separability, in order to avoid overfitting and to reduce computational costs.
- 3. K-Nearest Neighbor (KNN):
The KNN algorithm is a non-parametric method used for classification and regression. The input consists of the k closest training examples in the feature space. The output depends on whether KNN is used for classification or regression. [7] The k nearest neighbors are determined by a distance function; the distance function considered here is Euclidean distance.
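A minimal sketch of KNN classification with Euclidean distance is given below. The function names, the default k = 3, and the majority-vote rule for choosing the class are assumptions made for illustration; the report does not state the value of k that was used.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Using an odd k with two classes avoids tied votes.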
- 4. Decision Trees (DT):
The Decision Trees algorithm can be used to solve regression and classification problems. A decision tree builds a training model that predicts the class or value of the target variable by learning decision rules inferred from the training data.
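To illustrate how a single decision rule can be inferred from training data, the sketch below picks the 0/1 feature whose split gives the purest child nodes. The report does not state which splitting criterion was used; Gini impurity is assumed here, and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_boolean_split(rows, labels):
    """Return (feature_index, weighted_gini) of the best split on a 0/1 feature."""
    best_index, best_score = None, float("inf")
    for j in range(len(rows[0])):
        left = [y for r, y in zip(rows, labels) if r[j] == 0]
        right = [y for r, y in zip(rows, labels) if r[j] == 1]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_index, best_score = j, score
    return best_index, best_score
```

A full tree repeats this choice recursively on each child node until the nodes are pure or a stopping condition is met.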
- 5. Naïve Bayes (NB):
The Naive Bayes algorithm is a classification technique based on Bayes' theorem, with an assumption of independence between predictors. The Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. The Naive Bayes model is easy to implement and useful for large datasets.
Bayes' theorem describes how to calculate the posterior probability P(Cx|X) from P(Cx), P(X), and P(X|Cx), as shown in equation (3.1).

$$P(C_x \mid X) = \frac{P(X \mid C_x)\, P(C_x)}{P(X)} \qquad (3.1)$$
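As a worked illustration of the theorem, the sketch below estimates the class priors and the per-feature likelihoods from 0/1 features and combines them under the independence assumption. The function names and the use of Laplace smoothing (adding 1 to each count) are assumptions for this example, not details taken from the project.

```python
def fit_naive_bayes(rows, labels):
    """Estimate P(C) and P(feature_j = 1 | C) from rows of 0/1 features."""
    priors, cond = {}, {}
    n_features = len(rows[0])
    for c in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == c]
        priors[c] = len(members) / len(rows)
        # Laplace smoothing (assumed): avoids zero probabilities for unseen values
        cond[c] = [(sum(r[j] for r in members) + 1) / (len(members) + 2)
                   for j in range(n_features)]
    return priors, cond


def posterior_naive_bayes(priors, cond, row):
    """Return P(C | X) for every class C using Bayes' theorem."""
    joint = {}
    for c, prior in priors.items():
        p = prior
        for j, value in enumerate(row):
            p *= cond[c][j] if value == 1 else 1.0 - cond[c][j]
        joint[c] = p  # P(X | C) * P(C)
    evidence = sum(joint.values())  # P(X)
    return {c: p / evidence for c, p in joint.items()}
```

The predicted class is the one with the largest posterior probability.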
- 6. Support Vector Machine (SVM):
Support Vector Machine is used for classification and regression analysis.
5.2 Description of process
The implementation can be broadly divided into 6 parts:
- 1. Dataset: The real-time dataset is collected and split into Training and Test datasets.
- 2. Efficiency: Summarize the efficiencies of all algorithms.
- 3. Select: Choose the most accurate algorithm, which is LR.
- 4. Logistic Regression (LR): LR predicts bone disease.
- 5. Response: Generate a response from a set of data instances.
- 6. Main: It reads the Training Dataset and analyzes it, to predict the newly entered data. It stores the newly entered data in Excel.
- Dataset: The real-time dataset is collected and split into Training and Test datasets. The data is in CSV format. We open the file using the open function and read the data lines using the reader function in the csv module. The data is split into a training dataset, which LR uses to make predictions, and a test dataset, which is used to evaluate the accuracy of the model. A 70/30 train/test ratio is a standard choice. We define a function called read_csv that loads the dataset from the provided filename and splits it randomly into training and test datasets using the provided split ratio.
- Efficiency: This step summarizes the efficiencies of the algorithms. We have considered LR, LDA, KNN, DT, NB, and SVM. The accuracy of each algorithm is determined, and the system selects the most accurate algorithm for the prediction of bone diseases. The accuracies are represented in a bar graph for better visualization.
- Select: Choose the most accurate algorithm. Once all algorithms have been evaluated, the accuracy of each one is known. From this comparison, LR came out as the most accurate algorithm and is therefore selected as the predicting model.
- Logistic Regression (LR): LR predicts bone diseases. LR is used to describe data and to explain the relationship between one dependent binary variable and one or more independent variables that determine an outcome. The binary logistic model determines the probability of a binary response based on one or more predictors. It is used to predict a binary outcome such as “0” or “1”, which may represent “Yes” or “No”, or “True” or “False”, for a given set of independent variables.
- Response: The LR model compares the stored dataset with the newly entered data and, on that basis, predicts the type of bone disease, which is either Traumatic bone disease or Degenerative bone disease.
- Main: It reads the Training dataset and analyzes it in order to predict the class of the newly entered data. It stores the newly entered data in Excel. The dataset is split in the ratio 70/30, where 70% is the Training dataset and 30% is the Test dataset. The LR model, which has the highest accuracy, is used to compare the stored dataset with the newly entered data. As a result, the type of bone disease, either Traumatic or Degenerative, is predicted.
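The loading and splitting step described in the Dataset item can be sketched as follows. Only the function name read_csv, the csv reader, and the 70/30 ratio come from the description above; the parameter names, the optional `seed`, and the assumption that the first row of the file is a header are hypothetical.

```python
import csv
import random


def read_csv(filename, split_ratio=0.7, seed=None):
    """Load a CSV file and split its records randomly into train and test sets."""
    with open(filename, newline="") as handle:
        rows = [row for row in csv.reader(handle) if row]  # skip blank lines
    header, records = rows[0], rows[1:]  # assumes the first row holds column names
    rng = random.Random(seed)            # a fixed seed makes the split repeatable
    rng.shuffle(records)
    cut = int(round(len(records) * split_ratio))
    return header, records[:cut], records[cut:]
```

With this rounding rule, a dataset of 47 records and a ratio of 0.7 would give 33 training records and 14 test records.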
6. Testcases
| Name of the Bone Disease | Case | Condition | Status |
|---|---|---|---|
| Traumatic | 1 | Disease Predicted | Display and data entry to the Excel sheet |
| Degenerative | 2 | Disease Predicted | Display and data entry to the Excel sheet |
| Dataset(Age) | 1 | Age55 years | Degenerative |
Table 6.1. Testcases
7. Results
| Algorithm | Accuracy | Error rate |
|---|---|---|
| NB | 0.51 | 0.48 |
| DT | 0.68 | 0.32 |
| LR | 0.74 | 0.25 |
| LDA | 0.46 | 0.53 |
| SVM | 0.65 | 0.34 |
| KNN | 0.72 | 0.27 |
Table.7.1. Accuracy and Error rate of Algorithms
The above table gives the accuracy and error rate of the algorithms. The system checks the accuracy of each algorithm and keeps the one that is most accurate, and therefore has the lowest error rate, for the prediction of newly entered bone disease data. In our case, the LR algorithm is chosen because it has the highest accuracy and the lowest error rate.
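The selection rule can be checked with a short calculation. The accuracy values below are the ones reported in this report for the six algorithms; the variable names are chosen for illustration, and the error rate is taken as one minus the accuracy.

```python
# Accuracy of each algorithm as reported in this report
accuracy = {
    "NB": 0.515556,
    "DT": 0.680000,
    "LR": 0.746667,
    "LDA": 0.466667,
    "SVM": 0.657778,
    "KNN": 0.724444,
}

# Error rate is the complement of accuracy
error_rate = {name: 1.0 - acc for name, acc in accuracy.items()}

# The selected model is the one with the highest accuracy (lowest error rate)
best = max(accuracy, key=accuracy.get)
print(best, round(accuracy[best], 2), round(error_rate[best], 2))
```

Because the error rate is the complement of the accuracy, choosing the highest accuracy and choosing the lowest error rate always select the same algorithm.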
Fig.7.2. Comparative performance analysis of models
The above figure shows the comparative performance analysis of models. Accuracy and error rate are considered as the legends of the graph.
Fig.7.1. The Accuracy of all Algorithms
Fig.7.2. The Splitting of Training and Test Data
Fig.7.3. A Simple Login Application
Fig 7.4. Incomplete login
Fig.7.5. Enter Username and Password to login
Fig.7.6. Invalid Username and Password
Fig.7.7. Data entering interface
Fig.7.8. New Patient (Test Data) entry
Fig.7.9. Traumatic Bone Disease
Fig.7.10. New Patient (Test Data) entry
Fig.7.11. Degenerative Bone Disease
Fig.7.12. Description of Bone Disease
8. Conclusions
In this project, we consider a larger number of features and efficient algorithms to predict bone diseases more accurately. The system evaluates various data mining techniques such as Support Vector Machine, Logistic Regression, Decision Trees, and K-Nearest Neighbor. Through our evaluation, we found that the Logistic Regression algorithm gives better accuracy than the other data mining algorithms, and it is therefore used for predicting bone diseases. With the training dataset, we develop a model that predicts the type of bone disease. The developed Logistic Regression model performs classification and prediction on the test dataset based on the training dataset. The bone diseases considered in our work are Degenerative and Traumatic diseases.
9. References
- Paul D. Miller M.D., “Prediction of Fracture Risk in Postmenopausal White Women With Peripheral Bone Densitometry: Evidence From the National Osteoporosis Risk Assessment,” Journal of Bone and Mineral Research, published.
- Paul Gerdhem, “Biochemical Markers of Bone Metabolism and Prediction of Fracture in Elderly Women,” Journal of Bone and Mineral Research, published.
- Saeko Fujiwara, “Fracture Prediction From Bone Mineral Density in Japanese Men and Women,” Journal of Bone and Mineral Research, published.
- Hui Li, Xiaoyi Li, Murali Ramanathan, and Aidong Zhang, “Prediction and Informative Risk Factor Selection for Bone Disease”, IEEE, DOI 10.1109/TCBB.2014.2330579, 2013.
- M Saranya and Dr. K Sarojini, “An Improved and Optimal Prediction of Bone Disease Based on Risk Factors”, IJCSIT, ISSN: 0975-9646, Volume 7(2), 2016.
- A Keerthana and Mrs. P Renukadevi, “Predict and Prevent the Bone Disease using Data Mining Techniques”, IJETCSE, ISSN: 0976-1353, Volume 21, Issue 4, APRIL 2016.
- Belur V. Dasarathy, ed. (1991). Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. ISBN 978-0-8186-8930-7.
- Data Flair, “SVM – Support Vector Machine Tutorial for Beginners”, Data Flair team, November 19, 2018, [Online]. Available: https://data-flair.training/blogs/svm-support-vectormachine-tutorial/ [Accessed: April 14, 2019].
- Jason Brownlee, Machine Learning Mastery, “Logistic Regression Tutorial For Machine Learning”, April 4, 2016, [Online]. Available: https://machinelearningmastery.com/logisticregression-tutorial-for-machine-learning/ [Accessed: April 14, 2019].
- [Online]. Available: https://www.tutorialspoint.com/data_mining/dm_applications_trends.htm.
Chart data (accuracy and error rate by algorithm). Accuracy: NB 0.515556, DT 0.680000, LR 0.746667, LDA 0.466667, SVM 0.657778, KNN 0.724444. Error rate: NB 0.484444, DT 0.320000, LR 0.253333, LDA 0.533333, SVM 0.342222, KNN 0.275556.