Diabetes Risk Prediction Using Machine Learning

Abstract

Changing lifestyles and habits, such as lack of proper sleep, lack of exercise and poor eating, have led to a rapid increase in the number of people with diabetes, so it is necessary to curb this trend. The proposed system predicts a person’s risk of developing diabetes and classifies it into one of three categories: low, medium or high. Depending on the risk level, a diet plan or a nearby diabetologist is suggested. The risk level is derived from lifestyle parameters, which avoids complex medical jargon. The main advantages of the proposed system are its simplicity, ease of use and easy access. The system uses random forest, a supervised machine learning algorithm, to classify the person into the appropriate risk category based on inputs that are expert-approved lifestyle parameters. The accuracy of the proposed system using the random forest algorithm is 88%. The system allows users to understand and analyze their lifestyle habits and encourages them to adopt a more active lifestyle and better eating habits according to their risk category. In this way, the system contributes to creating a healthier society as a whole.

Introduction

Diabetes is a rapidly growing disorder, driven by unhealthy lifestyles and imbalanced nutrition, so preventing it at an early stage and spreading awareness about it have become an absolute necessity. The age range of people affected by diabetes is widening every day, which is why diabetes risk prediction has become the need of the hour. The diabetes risk predictor helps users learn their risk level. Knowing their risk level, users can take preventive measures before diabetes actually sets in. The proposed system therefore plays a vital role in keeping people informed and cautious.

A few systems for calculating the risk of diabetes have recently surfaced online. Typically named “Diabetes Risk Calculator”, they estimate a person’s risk of developing diabetes and also provide trivia about the disease. Most of these systems do not apply machine learning; instead, the risk is derived from fixed value ranges for a few parameters, so the accuracy of the calculated risk is questionable. Other systems include technical parameters that the user cannot enter without medical help, which affects the accuracy of the prediction and makes the systems harder to use, less economical and less user-friendly.

Drawing inspiration from these systems while taking their drawbacks into account, the proposed system calculates the risk using a machine learning algorithm called random forest, which is intended to improve both the accuracy and the reliability of the system. In addition to the risk classification, the proposed system gives diet suggestions and a list of nearby diabetologists based on the user’s location. The proposed system therefore combines the strengths of the previous systems and improves on them. In this way, an effective system for predicting the risk of diabetes can be provided to society.

Training and Testing

The first step was to choose a programming language and a platform for training and testing. Python was chosen as the language, and Jupyter Notebook, installed through Anaconda, was selected as the platform.

The next step is to access the collected dataset. The pandas library is used to import and read the data, which is stored in .csv format. The pd.read_csv function reads a comma-separated values (csv) file into a DataFrame, and display(data.head()) previews the first rows. We also used the data cleaning features of pandas, such as filling, replacing or imputing null values.
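The loading and cleaning step can be sketched as follows. This is a minimal illustration rather than the exact code used for the system: the column names and values are assumptions standing in for the collected dataset, the CSV is read from an in-memory string so the snippet runs on its own, and print replaces the notebook-only display call. Imputing with the median and the mode is likewise just one reasonable choice.

```python
import io

import pandas as pd

# Hypothetical stand-in for the collected .csv file; the column names and
# values are illustrative assumptions, not the real dataset.
csv_text = """age,gender,smoking,sleep_hours,risk
34,male,no,7,low
51,female,yes,,high
45,female,no,6,medium
29,male,,8,low
"""

# pd.read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))
print(data.head())

# Count the missing values per column before cleaning.
print(data.isnull().sum())

# Impute the numeric column with its median and the categorical
# column with its most frequent value (mode).
data["sleep_hours"] = data["sleep_hours"].fillna(data["sleep_hours"].median())
data["smoking"] = data["smoking"].fillna(data["smoking"].mode()[0])
```

With a real file, the io.StringIO(...) argument would simply be replaced by the path to the .csv file.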

Next, the data is encoded into numeric labels using LabelEncoder. Since the collected data is in string format, it cannot be processed or transformed until the string values are converted to numeric values. LabelEncoder therefore encodes the string data column by column, assigning integer codes according to the alphabetical order of the values in each column.
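A minimal sketch of this column-wise encoding with scikit-learn’s LabelEncoder is shown below. The column names and values are assumed for illustration, and keeping one fitted encoder per column in a dictionary is a design choice of this sketch, not something the text prescribes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative string-valued columns (assumed names, not the real dataset).
data = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "smoking": ["no", "yes", "no", "no"],
    "risk": ["low", "high", "medium", "low"],
})

# Fit one encoder per column and keep it, so the same mapping can be
# reused for new inputs or inverted to recover the original labels.
encoders = {}
for column in data.columns:
    encoder = LabelEncoder()
    data[column] = encoder.fit_transform(data[column])
    encoders[column] = encoder

# classes_ is sorted, so the integer codes follow alphabetical order.
print(list(encoders["risk"].classes_))
print(data)
```

Because the codes follow alphabetical order, the risk classes here map to high = 0, low = 1 and medium = 2, so the integer codes do not reflect the severity ordering of the categories.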

After this, the dataset is split into training and testing data. The training data contains known outputs, and the model learns from it so that it can generalize to other data afterwards. The test dataset is used to evaluate the model’s predictions.

The test set should be large enough to give meaningful results and should be representative of the dataset as a whole. The aim is to produce a model that generalizes and classifies new data well, so the test set serves as a proxy for new data. A model that generalizes well does about as well on the test data as it does on the training data. The scikit-learn library is used to divide the data: its model_selection module provides the ‘train_test_split’ function, which splits the dataset into training and testing sets in a 70-30 ratio.

The dataset is first separated into the independent features, X, and the dependent variable, y, which is the risk class. X is then split into X_train and X_test, and y is split into y_train and y_test.
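The 70-30 split into X_train, X_test, y_train and y_test might look like the sketch below. The data is randomly generated as a placeholder for the encoded dataset, and the random_state and stratify arguments are additions of this sketch (for reproducibility and for keeping class proportions similar in both parts) that the text does not mention.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder for the encoded dataset: 100 rows, 13 lifestyle features
# and a risk class coded as 0, 1 or 2 (random values, for illustration).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 13))
y = rng.integers(0, 3, size=100)

# test_size=0.3 gives the 70-30 split; random_state and stratify are
# assumptions added here, not settings taken from the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```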

Performance Comparison

After the split into training and test data, the collected data was used to train and test three different algorithms. They were compared to see which gives the best performance, measured as the highest accuracy. The three algorithms compared are K-Nearest Neighbor (KNN), Support Vector Machine (SVM) and Random Forest.
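A comparison of this kind can be sketched as below. The snippet uses a synthetic three-class dataset from make_classification in place of the collected data, and the hyperparameters (five neighbours, an RBF kernel, 100 trees) are assumed defaults, so the accuracies it prints say nothing about the results reported for the actual system.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the encoded dataset: 13 features, 3 risk classes.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hyperparameters are illustrative assumptions, not the authors' settings.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Train each model on the same split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Report the models from highest to lowest accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```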

Methodology

The proposed system includes a web application through which the user interacts with the machine learning model. The user enters the thirteen basic lifestyle parameters, such as gender, age, calorie intake, heredity, smoking, alcohol consumption, mental issues, daily physical activity, sleeping pattern, blood pressure, PCOS and dark skin patches, through a form. The values entered through the form are passed to the already trained and tested machine learning model, which returns the risk level for those parameter values. The calculated risk is then displayed in the web application, and the user can act on it by visiting a nearby diabetologist or by working on the parameters that increase the risk of diabetes.
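The hand-off from the form to the trained model could be sketched roughly as follows. Everything here is hypothetical: predict_risk, RISK_LABELS and N_PARAMETERS are names invented for this illustration, the model is fitted on random data only so that the snippet runs, and the form values are assumed to be already encoded as numbers. The web framework itself is left out.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assumed mapping from class code to risk label (hypothetical).
RISK_LABELS = ["low", "medium", "high"]
N_PARAMETERS = 13  # number of lifestyle parameters collected by the form

# Stand-in for the already trained model: fitted on random data here,
# whereas the real system would load the model trained on the dataset.
rng = np.random.default_rng(0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(
    rng.integers(0, 3, size=(60, N_PARAMETERS)),
    rng.integers(0, 3, size=60),
)


def predict_risk(form_values):
    """Return the risk label for one submitted, already encoded form."""
    if len(form_values) != N_PARAMETERS:
        raise ValueError(
            f"expected {N_PARAMETERS} values, got {len(form_values)}"
        )
    # The model expects a 2-D array: one row per person.
    features = np.asarray(form_values, dtype=float).reshape(1, -1)
    code = int(model.predict(features)[0])
    return RISK_LABELS[code]


# Example submission with made-up encoded values.
print(predict_risk([1, 0, 2, 1, 0, 0, 1, 2, 1, 0, 0, 1, 2]))
```

In a web application, a function like this would be called from the form handler, and the returned label would decide whether a diet plan or a list of nearby diabetologists is shown.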

Conclusion

The system developed can predict a person’s risk of developing diabetes before its onset, thereby encouraging people to adopt a healthier and more active lifestyle. Its ease of use is another factor that enables people to make full use of it. The system is at a nascent stage and has a lot of scope for improvement in the near future. It can be made more accurate by collecting more data, and it can be extended to predict the risk of type I diabetes as well as gestational diabetes. The diet suggestions can also be customized according to each user’s individual habits.
