An enormous amount of data is available from the medical industry, and it can be useful to medical practitioners when existing data mining techniques are used to discover hidden patterns in it. Even the basic medical records in a patient’s profile can help identify such patterns. In this paper, the Naïve Bayes algorithm is implemented to predict heart disease from basic patient records such as age, sex, heart rate, and blood pressure, taken from a sample dataset. The benefits, limitations, and technical details of this implementation are also discussed.

1 Introduction

Over the years, many types of medical problems have been identified, and a large amount of data is available about each of them. Not all medical data are alike, but many patterns are hidden inside the data and need to be identified. Data mining techniques can help identify these hidden patterns through knowledge discovery. In the medical field, a patient’s health issues are often predicted from the doctor’s intuition or experience [2], while the knowledge-rich data goes unused, which results in high medical expenses and unnecessary medical tests. In recent years, much research has been conducted to find hidden patterns in basic medical data [1]. Identifying these hidden patterns would allow an efficient decision-making system to be developed for the medical industry, one that serves as a tool to support doctors’ decisions or at least as a prediction system for medical issues.

In this paper, we consider heart disease and predict it from existing data with the help of a data mining technique. The algorithm we have chosen is the Naïve Bayes algorithm, which is well suited to large databases that may contain hundreds or thousands of rows and columns. The Naïve Bayes algorithm is fast, and its predictions tend to become more accurate as the number of records in the database increases.

1.1 Problem Scenario

Only a few decision support systems are available in the medical industry, and their functionality is very limited. As mentioned earlier, medical decisions are made on the basis of the doctor’s intuition rather than the rich data in medical databases. Wrong treatment due to misdiagnosis poses a serious threat in the medical field. To address these issues, a data mining solution based on medical databases was introduced.

1.2 Related Work

Many techniques are available to discover knowledge from medical databases [1]. Researchers at Southern California used a data mining technique to study the success and failure of back surgery in order to improve medical treatment [3]. Shouman et al. [4] implemented predictive data mining to diagnose heart disease in patients. Palaniappan et al. [2] developed a prototype Intelligent Heart Disease Prediction System (IHDPS) using data mining techniques.

1.3 Objective

In this paper, the Naïve Bayes algorithm is implemented to predict heart disease from basic patient records such as age, sex, heart rate, and blood pressure, taken from a sample dataset. Based on the literature survey, the Naïve Bayes algorithm was found to be an effective technique. Its probabilistic method helps in finding the converse probability of a conditional relationship, that is, the probability of the class given the observed attribute values. A dependence relation may exist between two attributes of the dataset, and this algorithm can be used to determine it.

2 Data Preparation

To implement the algorithm, a medical dataset was required. The sample dataset used for the implementation was obtained from the Cleveland Clinic Foundation. A sample of the dataset is shown in Figure 1.

Figure1. Sample dataset

2.1 Dataset Source

The Cleveland Clinic Foundation medical data was downloaded from the website of the University of California, Irvine.

2.2 Dataset Attributes

The dataset consists of 16 attributes. The last attribute takes the value 0 or 1: ‘0’ indicates that the patient does not have heart disease, whereas ‘1’ indicates that the patient has heart disease. The algorithm’s prediction can be checked against this value during evaluation. The first 15 attributes are shown in Figure 2.

Figure2. Dataset attributes
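
The layout described above can be sketched in Java as follows. This is only an illustration: the class name PatientRecord, the method parse, and the comma-separated file format are assumptions, not details taken from the implementation. The sketch treats the last field of a row as the 0/1 diagnosis and every earlier field as an input attribute.

```java
// Hypothetical holder for one dataset row: input attribute values plus the 0/1 diagnosis.
final class PatientRecord {
    final double[] features; // input attributes, in file order
    final int diagnosis;     // 0 = no heart disease, 1 = heart disease

    PatientRecord(double[] features, int diagnosis) {
        this.features = features;
        this.diagnosis = diagnosis;
    }

    // Parses one comma-separated line (assumed format); the last field is the diagnosis.
    static PatientRecord parse(String line) {
        String[] parts = line.trim().split("\\s*,\\s*");
        if (parts.length < 2) {
            throw new IllegalArgumentException("expected at least one attribute and a label");
        }
        double[] features = new double[parts.length - 1];
        for (int i = 0; i < features.length; i++) {
            features[i] = Double.parseDouble(parts[i]);
        }
        int label = (int) Double.parseDouble(parts[parts.length - 1]);
        if (label != 0 && label != 1) {
            throw new IllegalArgumentException("diagnosis must be 0 or 1, got " + label);
        }
        return new PatientRecord(features, label);
    }
}
```

Keeping the label separate from the input attributes makes it straightforward to compare a prediction with the stored diagnosis during evaluation.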

3 Program Architecture

The program was implemented in Java; an Apache Tomcat server and a MySQL database are also used. The Naïve Bayes implementation has three class files: Calculation.java, Prediction.java, and Detection.java. Detection.java reads the data file from the source path and stores the attributes in temporary array lists. Prediction.java performs the mean, standard deviation, and probability calculations. Calculation.java defines all the dataset attributes for which the mean and standard deviation are calculated, and it calls the other two classes while the program executes. Figure 3 shows the program architecture.

Figure3. Architecture

3.1 Building and running a Demo

The Tomcat server presents the output in web-based form, and the application runs on localhost. The MySQL database is used to identify patient records. When the program is executed, localhost is accessed and 15 questions are displayed. The user’s answers are collected, and the algorithm is called to calculate and predict the possibility of heart disease for that person. At the end of the demo, a report is generated that states whether or not the person is predicted to have heart disease.

In general, the program:

1. Obtains the values from the user.

2. Reads the data file.

3. Calls the algorithm and calculates the mean, standard deviation, and probability of the attributes.

4. Generates a report displaying the given values together with the disease prediction.
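
Steps 3 and 4 above can be sketched as a small Gaussian Naïve Bayes classifier. This is a minimal illustration under stated assumptions, not the paper's Calculation.java or Prediction.java: the class name GaussianNaiveBayesSketch and its methods are hypothetical, each attribute is assumed to follow a normal distribution within each class, and log-probabilities are used for numerical stability.

```java
// Minimal two-class Gaussian Naive Bayes sketch (0 = no disease, 1 = disease).
// All names here are hypothetical; they are not the paper's source files.
final class GaussianNaiveBayesSketch {
    private final double[][] mean = new double[2][]; // mean[c][j]: mean of attribute j in class c
    private final double[][] std = new double[2][];  // std[c][j]: standard deviation of attribute j in class c
    private final double[] prior = new double[2];    // prior[c]: fraction of training rows in class c

    // Step 3: estimate the per-class mean and standard deviation of every attribute.
    void fit(double[][] rows, int[] labels) {
        int d = rows[0].length;
        int[] count = new int[2];
        for (int c = 0; c < 2; c++) {
            mean[c] = new double[d];
            std[c] = new double[d];
        }
        for (int i = 0; i < rows.length; i++) {
            count[labels[i]]++;
            for (int j = 0; j < d; j++) mean[labels[i]][j] += rows[i][j];
        }
        for (int c = 0; c < 2; c++) {
            if (count[c] < 2) throw new IllegalArgumentException("need at least two rows per class");
            prior[c] = (double) count[c] / rows.length;
            for (int j = 0; j < d; j++) mean[c][j] /= count[c];
        }
        for (int i = 0; i < rows.length; i++) {
            int c = labels[i];
            for (int j = 0; j < d; j++) {
                double diff = rows[i][j] - mean[c][j];
                std[c][j] += diff * diff; // accumulate squared deviations first
            }
        }
        for (int c = 0; c < 2; c++) {
            for (int j = 0; j < d; j++) {
                // Sample standard deviation; the small floor avoids division by zero
                // when an attribute is constant within a class (an assumption of this sketch).
                std[c][j] = Math.max(Math.sqrt(std[c][j] / (count[c] - 1)), 1e-6);
            }
        }
    }

    // Log of (class prior times the product of the Gaussian densities of each attribute).
    double logScore(double[] x, int c) {
        double score = Math.log(prior[c]);
        for (int j = 0; j < x.length; j++) {
            double z = (x[j] - mean[c][j]) / std[c][j];
            score += -0.5 * z * z - Math.log(std[c][j]) - 0.5 * Math.log(2 * Math.PI);
        }
        return score;
    }

    // Step 4: report the class with the larger score.
    int predict(double[] x) {
        return logScore(x, 1) > logScore(x, 0) ? 1 : 0;
    }
}
```

In use, fit would be called once with the rows read from the data file, and predict would then be called with the values entered by the user.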

4 Implementation

All attributes of the dataset are numerical values, each with a defined meaning. The meanings of the dataset attributes are shown in Figure 2. For example, the attribute sex takes the values ‘1’ and ‘0’, where ‘1’ denotes male and ‘0’ denotes female. Fasting blood sugar is likewise coded as ‘1’ and ‘0’, where ‘1’ denotes a fasting blood sugar level above 120 mg/dl and ‘0’ denotes a level below 120 mg/dl. The other attributes are coded in the same way.
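
The coding of the two example attributes can be sketched as follows. The class and method names are hypothetical, and the accepted answer strings ("male", "female") are assumptions about the interface. Treating a reading of exactly 120 mg/dl as ‘0’ is also an assumption, since the coding above does not cover that case.

```java
// Hypothetical helpers that turn raw answers into the numeric codes described in the text.
final class AttributeEncoding {
    // Sex: 1 denotes male, 0 denotes female.
    static int encodeSex(String answer) {
        String s = answer.trim();
        if (s.equalsIgnoreCase("male")) return 1;
        if (s.equalsIgnoreCase("female")) return 0;
        throw new IllegalArgumentException("unrecognised sex: " + answer);
    }

    // Fasting blood sugar: 1 when above 120 mg/dl, otherwise 0.
    // Mapping exactly 120 to 0 is an assumption of this sketch.
    static int encodeFastingBloodSugar(double mgPerDl) {
        return mgPerDl > 120.0 ? 1 : 0;
    }
}
```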

The Naïve Bayes algorithm accesses these values from the data file. The coded values are extracted from the data file and stored in one array list per attribute (for example, an age array list, a sex array list, and a chest pain type array list) so that the calculations can be performed. The meaning of each value is defined before it is stored in the array list. A sample of the interface (for obtaining the slope value) is shown in Figure 4, where upsloping, flat, and downsloping represent the values 1, 2, and 3 respectively.

Figure 4. Interface Sample

Figure5. Sample of report format

5 Modules Description

5.1 Analyzing the Dataset

The attribute “Diagnosis” was identified as the predictable attribute with value “1” for patients with heart disease and value “0” for patients with no heart disease. The attribute “PatientID” was used as the key; the rest are input attributes. It is assumed that problems such as missing data, inconsistent data, and duplicate data have all been resolved.

5.2 Naïve Bayes Implementation in Mining

Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred. If ‘B’ represents the dependent event and ‘A’ represents the prior event, Bayes’ theorem can be stated as follows.

5.2.1 Bayes’ Theorem

Prob (B given A) = Prob (A and B)/Prob (A)

To calculate the probability of B given A, the algorithm counts the number of cases where A and B occur together and divides it by the number of cases where A occurs, with or without B. For data with numerical attributes, the class is predicted by Naïve Bayes classification using the quantities in Figure 6.

Figure6 (a) Top – Mean (b) Bottom – Standard Deviation

Figure6 (c) Laplace Transform
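
A common way to apply Naïve Bayes to numerical attributes, given here as an assumption rather than as a reproduction of Figure 6, is to model each attribute within each class as normally distributed. In the notation below (all symbols are introduced for this sketch), $x_{i,j}$ is the value of attribute $j$ in record $i$, $y_i$ is its diagnosis, and $n_c$ is the number of records in class $c$:

```latex
\begin{align}
\mu_{c,j} &= \frac{1}{n_c} \sum_{i \,:\, y_i = c} x_{i,j} \\
\sigma_{c,j} &= \sqrt{\frac{1}{n_c - 1} \sum_{i \,:\, y_i = c} \left(x_{i,j} - \mu_{c,j}\right)^2} \\
P(x_j \mid c) &= \frac{1}{\sqrt{2\pi}\,\sigma_{c,j}}
  \exp\!\left(-\frac{\left(x_j - \mu_{c,j}\right)^2}{2\sigma_{c,j}^2}\right) \\
\hat{c} &= \arg\max_{c \in \{0, 1\}} \; P(c) \prod_{j} P(x_j \mid c)
\end{align}
```

The last line is where the "naïve" assumption enters: the attributes are treated as conditionally independent given the class, so their densities are simply multiplied together with the class prior $P(c) = n_c / n$.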

6 Evaluation

The user enters values for the questionnaire to find out whether or not the patient has heart disease. When sample data from the dataset is fed in and the mining operations are performed with the Naïve Bayes algorithm, the algorithm predicts whether a patient has heart disease with 95% accuracy. An accuracy of 95% is quite good for use as a decision support system.
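
How the 95% figure was computed is not described here. One common measure, shown below as an assumption, is the fraction of records whose predicted label matches the stored diagnosis. The class and method names are hypothetical.

```java
// Hypothetical helper: fraction of records whose predicted label equals the stored diagnosis.
final class AccuracySketch {
    static double accuracy(int[] predicted, int[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        int correct = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (predicted[i] == actual[i]) correct++; // count matching predictions
        }
        return (double) correct / predicted.length;
    }
}
```

An accuracy estimate is more trustworthy when the records used for evaluation were not also used to estimate the means and standard deviations.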

Figure 7 shows the accuracy of the Naïve Bayes algorithm: it has the highest probability of correct predictions and the lowest probability of incorrect predictions.

Figure7. Model Results of three algorithms [2]

7 Limitations

Apart from benefits such as its probabilistic approach and its speed and reliability, a serious shortcoming of the Naïve Bayes algorithm is its limited ability to handle small datasets: the classifier requires a relatively large dataset to obtain the best results. Even so, studies have shown that the Naïve Bayes algorithm outperforms other algorithms in accuracy and efficiency. A notable limitation of this paper is the use of a small dataset, which is suitable for training or testing purposes only. The dataset could also include more attributes for more effective prediction in support of clinical decisions.

8 Future Work

The algorithm works well with this sample dataset. Implementing it with a larger dataset could give better results and make it a useful supporting tool for medical decisions. In future, other algorithms could be implemented and the efficiency of all of them analyzed to decide on the most suitable technique in terms of speed, reliability, and accuracy.

9 Conclusion

In this paper, Naïve Bayes is the only algorithm used for the attribute calculations and prediction. The efficiency and accuracy of the algorithm in making predictions were discussed. The design of effective models is constrained by the size of the datasets and by noisy, incorrect, or missing data values. The prototype developed so far has been tested by computer experts in general and not by doctors. For an effective understanding of the health issues, medical experts have to work collaboratively and test the prototypes so that the system can be implemented in real life to support clinical decisions.