A Comparative Analysis of C4.5 Classification Algorithm, Naïve Bayes and Support Vector Machine Based on Particle Swarm Optimization (PSO) for Heart Disease Prediction

ABSTRACT


INTRODUCTION
The soaring number of patients suffering from various diseases, especially heart disease, drives up the cost of treatment and pushes governments and health practitioners to seek early prevention. One answer to this problem can be found in artificial intelligence and data mining [1]. The heart is one of the vital organs of the body and serves to pump blood throughout the body via the blood vessels. According to Sloane, the heart is a hollow organ with four chambers, located between the two lungs in the middle of the thoracic cavity [2] [3]. More than 36 million people die each year from Non-Communicable Diseases (NCDs). Globally, the leading cause of death among NCDs is cardiovascular disease. Cardiovascular disease (CVD) is a disease caused by impaired heart and vascular function, such as coronary heart disease, hypertension, and stroke [4]. In 2008 an estimated 17.3 million deaths were caused by cardiovascular disease. Deaths caused by cardiovascular disease, especially coronary heart disease and stroke, are expected to continue to rise to 23.3 million by 2030. Coronary Heart Disease (CHD) is a disorder of heart function caused by insufficient blood supply to the heart muscle due to the narrowing of the coronary blood vessels. Because a heart attack can occur without obvious symptoms, some existing methods for early detection cannot catch it [5]. Clinically, CHD is characterized by chest pain or discomfort, or a feeling of heavy pressure on the chest, when climbing, doing heavy work, walking hurriedly on a flat road, or walking long distances. WHO data for 2015 show that 70% of deaths worldwide are caused by Non-Communicable Diseases (39.5 million out of 56.4 million deaths). Of all deaths from NCDs, 45% were caused by heart and vascular disease, which is 17.7 million out of 39.5 million deaths [6].
The development of software repositories for gaining a better understanding of data is currently increasing. Interest in data mining and education systems has made educational data mining a growing new research community. In the real world, predicting a disease is a challenging task. Data mining is the process of finding patterns in large amounts of data [8]. Data mining is emerging as a promising new workflow that provides a wide range of techniques, methods and tools for comprehensive analysis of the available data in various fields.
Classification is a type of data analysis that helps predict the class labels of samples, with the aim of predicting the class of objects whose labels are not yet known. Another goal is to predict discrete, unordered labels in very large data sets. In classification, a training set is used to build the predictive model, while an independent test set is used to assess its accuracy. Classification performance also varies between categorical and numeric attributes. Of the twenty classification methods compared, Bayes Net, Naïve Bayes, Classification via Regression, Logistic Regression and Random Forest are the best. For mixed attribute datasets, Naïve Bayes, Bayes Net and Random Forest are the best classification methods. For numerical attribute datasets, Classification via Regression, NBTree and Multiclass Classifier are the best methods. For categorical attribute datasets, NBTree, Classification via Regression and Bayes Net are the best methods [9]. Among the five rule-based classification methods, PART and the Decision Tree method are the best. However, a method gives its best results only when the right technique is chosen for the right data. No particular classifier performs best on all datasets.
Quality health services are a basic necessity for every person. To predict the potential for heart disease, research on heart disease detection is needed [10]. Ideally, data mining draws on several disciplines, such as machine learning, artificial intelligence, probability, and statistics [11]. In a study of the Naïve Bayes data mining method based on Particle Swarm Optimization for the detection of heart disease [12], Naïve Bayes alone achieved an accuracy of 82.14%, while Naïve Bayes based on Particle Swarm Optimization increased the accuracy to 92.86%. Another study describes the C4.5 algorithm as among the simplest classifiers and easy to implement, but notes that it still has weaknesses in handling high-dimensional data. That research applied a genetic algorithm for attribute selection to reduce the dimensionality of the data and to identify relevant features of the data set. The C4.5 algorithm alone had a fairly good accuracy of 76.66%; with attribute selection by the genetic algorithm, the accuracy of the resulting model increased to 77.40% in the classification of heart disease [13].
Support Vector Machine is one of the classical machine learning techniques that can still help solve big data classification problems. In particular, it can support multidomain applications in a big data environment. Support Vector Machine is a machine learning method based on the principle of structural risk minimization (SRM), with the goal of finding the best hyperplane separating two classes in the input space [14] [15].

RESEARCH METHOD
This study uses population-based data in which the research subjects are people with heart disease, with test data derived from the training data serving to provide research information [16]. Each study on heart disease prediction uses a different method, and the most suitable algorithm for this prediction is not yet known. For this reason, an approach model is proposed with the research framework shown in Figure 1, so that this study can compare several classification algorithms, namely Decision Tree (C4.5), Naïve Bayes and Support Vector Machine (SVM), each with and without PSO, and then test their performance. The final data is obtained after preprocessing. Six algorithm configurations are then tested using the RapidMiner 9.1 data mining software: C4.5, Naïve Bayes, Support Vector Machine, C4.5+PSO, Naïve Bayes+PSO and Support Vector Machine+PSO. The algorithm with the best accuracy will later be used to predict heart disease.

Data Mining
Data mining is the activity of collecting and using historical data to find regularities, patterns or relationships in large data sets. The output of data mining can be used to improve future decision making. Data mining is an automated process applied to existing data. The data to be processed is large, and the aim is to obtain relationships or patterns that may give useful indications [17]. According to [18], in the book entitled "Data mining for Classification and Clustering of Data", data mining is an analysis within the process of knowledge discovery in databases, abbreviated as KDD. According to Sumathi and Sivandham (2009), data mining is also defined as part of the process of extracting knowledge from a database, known as Knowledge Discovery in Database (KDD), which consists of several stages: Cleaning and Integration, Selection and Transformation, Data mining, and Evaluation and Interpretation [19]. From these definitions it can be concluded that data mining is a method of digging up valuable information that is buried or hidden in a very large data set, so that interesting and previously unknown patterns are found. The word mining itself means an attempt to obtain a small amount of valuable material from a large amount of raw material. Data mining therefore has deep roots in fields such as artificial intelligence, machine learning, statistics and databases.

C4.5 Algorithm
The C4.5 algorithm is one of the most effective decision tree algorithms for classification. The decision tree method turns a very large set of facts into a decision tree that represents rules. A decision tree consists of internal nodes that specify tests on individual input variables or attributes and divide the data into smaller subsets, and a set of leaf nodes that assign a class to each observation in the resulting segment [20].
The C4.5 algorithm is one of the algorithms used to build decision trees, especially in the machine learning area [21]. According to Larose, the C4.5 algorithm is Quinlan's extension of his own ID3 algorithm for generating decision trees. C4.5 is a very powerful and very widely used classification and prediction method. C4.5 uses information gain to select the attributes that will be used to split the objects [22].
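The attribute selection step can be illustrated with a small worked example. The sketch below is a minimal illustration, not the implementation used in this study: it computes information gain for categorical attributes and also the gain ratio, the normalized form of information gain that Quinlan's C4.5 uses to reduce the bias toward many-valued attributes. The attribute names (`chest_pain`, `sex`) and the four toy records are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the labels on one categorical attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

def gain_ratio(rows, labels, attribute):
    """Information gain divided by the entropy of the split itself (split information)."""
    split_info = entropy([row[attribute] for row in rows])
    gain = information_gain(rows, labels, attribute)
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy records, for illustration only (not data from this study).
rows = [
    {"chest_pain": "yes", "sex": "m"},
    {"chest_pain": "yes", "sex": "f"},
    {"chest_pain": "no", "sex": "m"},
    {"chest_pain": "no", "sex": "f"},
]
labels = ["sick", "sick", "healthy", "healthy"]

# The attribute with the highest score would be chosen for the split.
best_attribute = max(["chest_pain", "sex"], key=lambda a: gain_ratio(rows, labels, a))
```

In this toy set `chest_pain` separates the two classes perfectly while `sex` carries no information, so `chest_pain` is selected as the splitting attribute.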

Naïve Bayes
Naïve Bayes is a classification algorithm with a formula that is simple and easy to apply [23]. Naïve Bayes belongs to supervised learning, so at the learning stage it needs initial data in the form of training data in order to make decisions [24]. Naïve Bayes is a statistical classification method that makes it possible to capture uncertainty about a model in a principled way by determining the probabilities of the outcomes [25] [26].
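How the training data turn into a decision can be sketched as follows. This is a minimal categorical Naïve Bayes, assuming conditionally independent attributes and add-one (Laplace) smoothing, which is a common choice and not necessarily the configuration used in this study. The function names, attribute names and records are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute) -> counts of values
    domains = defaultdict(set)           # attribute -> set of values seen in training
    for row, label in zip(rows, labels):
        for attribute, value in row.items():
            value_counts[(label, attribute)][value] += 1
            domains[attribute].add(value)
    return class_counts, value_counts, domains

def predict(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        log_prob = math.log(count / total)  # log prior P(class)
        for attribute, value in row.items():
            # Add-one smoothing so unseen values do not zero out the product.
            numerator = value_counts[(label, attribute)][value] + 1
            denominator = count + len(domains[attribute])
            log_prob += math.log(numerator / denominator)
        scores[label] = log_prob
    return max(scores, key=scores.get)

# Hypothetical toy records, for illustration only (not data from this study).
rows = [
    {"chest_pain": "yes", "high_bp": "yes"},
    {"chest_pain": "yes", "high_bp": "no"},
    {"chest_pain": "no", "high_bp": "no"},
    {"chest_pain": "no", "high_bp": "yes"},
]
labels = ["sick", "sick", "healthy", "healthy"]

model = train_naive_bayes(rows, labels)
prediction_a = predict(model, {"chest_pain": "yes", "high_bp": "yes"})
prediction_b = predict(model, {"chest_pain": "no", "high_bp": "no"})
```

Working in log space avoids numerical underflow when many attribute probabilities are multiplied together.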

Support Vector Machine (SVM)
Support Vector Machine (SVM) was first presented by Vapnik in 1992 as a harmonious combination of leading concepts in the field of pattern recognition. Support Vector Machine is a machine learning method that works on the principle of structural risk minimization (SRM), with the aim of finding the best hyperplane that separates two classes in the input space [14]. The optimum hyperplane is the hyperplane located midway between the two sets of objects of the two classes. The optimum separating hyperplane between the two classes can be found by measuring the hyperplane's margin and finding its maximum. The margin is the distance between the hyperplane and the closest pattern of each class. These closest patterns are referred to as support vectors [27].
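The margin maximization described above can be written compactly for the linearly separable (hard-margin) case. The notation is introduced here for illustration and does not come from the source: $\mathbf{x}_i$ are the training patterns, $y_i \in \{-1, +1\}$ their class labels, $\mathbf{w}$ the normal vector of the hyperplane and $b$ its offset.

```latex
\begin{aligned}
&\text{Hyperplane: } \mathbf{w}^{\top}\mathbf{x} + b = 0,
\qquad \text{margin} = \frac{2}{\lVert \mathbf{w} \rVert} \\
&\min_{\mathbf{w},\, b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1, \quad i = 1, \dots, n
\end{aligned}
```

Maximizing the margin $2 / \lVert \mathbf{w} \rVert$ is equivalent to minimizing $\tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}$, and the patterns for which the constraint holds with equality are the support vectors.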
According to [28] [29], Support Vector Machine (SVM) is defined as a set of related learning methods that analyze data and recognize patterns, which are then used for classification and regression analysis. SVM takes a set of input data and predicts, for each given input, which of two classes it belongs to; the inputs are classified by finding the best hyperplane.
According to [30], Support Vector Machine (SVM) is a classification model that finds the best hyperplane and is capable of finding a globally optimal solution, so that its accuracy does not fluctuate easily.
According to [31], Support Vector Machine (SVM) is a learning method that leads to a quadratic programming problem with linear constraints. Based on the structural risk minimization principle, SVM strives to minimize the upper bound of the generalization error rather than the empirical error, so that the resulting prediction models tend to avoid over-fitting. In addition, SVM models operate in high-dimensional attribute spaces formed by a nonlinear mapping of the N-dimensional input vector x into a K-dimensional attribute space (K>N) through a nonlinear function φ(x).
SVM is in principle a linear classifier, that is, it handles classification cases that can be linearly separated, but it has been extended to work on non-linear problems by introducing the kernel concept in high-dimensional feature spaces [32].
Based on the above, it can be concluded that Support Vector Machine (SVM) is a classification method that maximizes the margin of the hyperplane (maximal margin hyperplane). In an ANN, all training data are learned during the training process, whereas in SVM only a selected subset of the data contributes to forming the model used in classification [33]. This is an advantage of SVM, because not all training data need to be involved in each training iteration. The contributing data points are referred to as support vectors, hence the name Support Vector Machine.

Particle Swarm Optimization (PSO)
Particle Swarm Optimization (PSO) is a global heuristic optimization technique introduced by Kennedy and Eberhart in 1995, inspired by the social behavior of flocks of birds trying to reach an unknown goal [34]. According to [28], Particle Swarm Optimization (PSO) is a type of swarm intelligence algorithm capable of optimizing the related variables most effectively.
According to [35], Particle Swarm Optimization (PSO) is an evolutionary computation technique. Like genetic algorithms, PSO is an optimization tool. It is inspired by social behavior among individuals. Particles (individuals) that represent potential solutions to the problem move through an n-dimensional search space. Each particle i keeps a record of the position of its best performance in a vector called pbest.
According to [28], Particle Swarm Optimization (PSO) is a computational method that optimizes a problem by iteratively trying to improve candidate solutions with regard to a given measure of quality. The movement of each particle is influenced by its own best-known position and is also guided toward the best-known positions in the search space, which are updated as better positions are found by other particles.
According to [36], Particle Swarm Optimization (PSO) is an evolutionary computing technique capable of producing globally optimal solutions in the search space through the interaction of individuals in a swarm of particles. Each particle conveys its best position to the other particles and adjusts its own position and velocity based on the information received about the best position.
Particle Swarm Optimization (PSO) is a tool for dealing with optimization problems [37]. Although relatively new, the PSO algorithm has been widely applied because it is quite simple and computes faster than other optimization algorithms such as the Genetic Algorithm (GA). Each particle in PSO is also associated with a velocity, and particles fly through the search space with velocities that are dynamically adjusted according to their historical behavior. Therefore, particles tend to fly toward better search areas during the search process.
Based on the above, it can be concluded that Particle Swarm Optimization (PSO) is an optimization method that is able to optimize the relevant variables to achieve maximum accuracy.
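The velocity and position updates described above can be sketched in a few lines. This is a minimal illustration of the commonly used inertia-weight variant of PSO, not the RapidMiner operator used in this study. The function name `pso_minimize`, the parameter values (`inertia`, `c1`, `c2`, swarm size, iteration count) and the sphere test function are all assumptions chosen for the example.

```python
import random

def pso_minimize(objective, dim, bounds, n_particles=20, n_iters=100,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over a box using a basic global-best PSO."""
    rng = random.Random(seed)
    low, high = bounds
    positions = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in positions]               # best position seen by each particle
    pbest_val = [objective(p) for p in positions]
    best_idx = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[best_idx][:], pbest_val[best_idx]  # best position of the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # New velocity: inertia + pull toward own best + pull toward swarm best.
                velocities[i][d] = (inertia * velocities[i][d]
                                    + c1 * r1 * (pbest[i][d] - positions[i][d])
                                    + c2 * r2 * (gbest[d] - positions[i][d]))
                # Move the particle and keep it inside the search bounds.
                positions[i][d] = min(high, max(low, positions[i][d] + velocities[i][d]))
            value = objective(positions[i])
            if value < pbest_val[i]:
                pbest[i], pbest_val[i] = positions[i][:], value
                if value < gbest_val:
                    gbest, gbest_val = positions[i][:], value
    return gbest, gbest_val

def sphere(x):
    """Simple test function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)

best_pos, best_val = pso_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
```

When PSO is paired with a classifier, the objective would typically be a validation error computed for a candidate set of attribute weights or parameters, so that the swarm searches for the configuration with the best accuracy.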

RESULTS AND DISCUSSION
In this study, the method used was experimental research, which involves investigating causal relationships using tests controlled by the researchers themselves. To predict heart disease, the study uses data on people with heart disease, with test data derived from the training data; the final data is obtained after preprocessing. Six algorithm configurations were then tested using the RapidMiner 9.1 data mining software: C4.5, Naïve Bayes, Support Vector Machine, C4.5 + PSO, Naïve Bayes + PSO and Support Vector Machine + PSO. The algorithm with the best accuracy will be used to predict heart disease.

Algorithm C4.5
In this study, there are two kinds of variables: the dependent (bound) variable and the independent (free) variables. Based on the test results from the preprocessed data on patients with heart disease, the accuracy on the testing data using the C4.5 algorithm is as follows.

Naïve Bayes Algorithm
Based on the test results from the preprocessed data on patients with heart disease, the accuracy on the testing data using the Naïve Bayes algorithm is as follows.

Support Vector Machine Algorithm
Based on the test results from the preprocessed data on patients with heart disease, the accuracy on the testing data using the Support Vector Machine algorithm is as follows.

Naïve Bayes Algorithm with Particle Swarm Optimization
Based on the test results from the preprocessed data on patients with heart disease, the accuracy on the testing data using the Naïve Bayes algorithm with Particle Swarm Optimization is as follows. From the discussion above, the Naïve Bayes algorithm combined with Particle Swarm Optimization has the highest accuracy (86.30%), AUC (0.895) and precision (87.01%), while the highest recall (96.00%) is obtained by the Support Vector Machine algorithm combined with Particle Swarm Optimization.
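Accuracy, precision and recall are all derived from the counts of a binary confusion matrix. The sketch below shows the standard definitions under the assumption that "heart disease" is the positive class. The function name and the counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,                     # share of all correct predictions
        "precision": tp / (tp + fp) if tp + fp else 0.0,   # predicted positives that are correct
        "recall": tp / (tp + fn) if tp + fn else 0.0,      # actual positives that are found
    }

# Hypothetical counts, for illustration only (not results from this study).
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
```

A model can have the highest recall without the highest precision or accuracy, because recall ignores false positives. This is why different algorithms can lead on different measures.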

Figure 1. Review of literature (research framework)
PROBLEMS: The exact algorithm model for predicting heart disease is not yet known
ALGORITHM MODEL: C4.5, Naïve Bayes, Support Vector Machine, C4.5+PSO, Naïve Bayes+PSO and Support Vector Machine+PSO
DEVELOPMENT: Using RapidMiner 9.1 data mining software
IMPLEMENTATION: The data used is based on data on patients with heart disease
MEASUREMENT: Testing using the Cross Validation method, Confusion Matrix and ROC curve
RESULT: Most appropriate algorithm for use in predicting heart disease
ISSN: 2721-3056

Figure 4. The AUC value of the C4.5 algorithm

Figure 7. The AUC value of the Naïve Bayes algorithm

Figure 16. The AUC value of the Naïve Bayes algorithm + Particle Swarm Optimization

Table 1. Comparison of all algorithm values