K-Nearest Neighbor with K-Fold Cross Validation and Analytic Hierarchy Process on Data Classification

ABSTRACT

This study analyzes the performance of the k-Nearest Neighbor (k-NN) method for data classification, using k-Fold Cross Validation (k-FCV) as the model-evaluation scheme and the Analytic Hierarchy Process (AHP) for feature selection, on a cervical cancer risk dataset from the UCI machine learning repository. Testing from fold 1 to fold 5 with k values of 3, 5, 7, and 9 always produced accuracy above 90%, with the best result of 95% at fold 3, and the AHP-selected feature subsets (4 and 7 features) achieved accuracy close to classification with all features while reducing the computational load of k-NN.
INTRODUCTION
Data is a collection of facts or pieces of information about a subject. Data classification is the grouping of objects into certain classes based on their values; it is a central task of machine learning, a branch of artificial intelligence, and is widely used in decision making.
Machine learning is a complex field [1]. Many machine learning methods can be used for classification; one of them is the k-Nearest Neighbor (k-NN) method. k-NN is a data classification method that requires no prior knowledge: the label of a new sample is determined only by its nearest neighbors [2], [3]. The k-NN method is also simple and intuitive, easy to implement quickly, and one of the simplest and most popular machine learning algorithms [4].
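As a minimal illustration of this simplicity, the sketch below fits an off-the-shelf k-NN classifier; scikit-learn and its bundled Iris data are assumed as stand-ins, since the paper's own dataset is not included here.

```python
# Minimal k-NN sketch on a stand-in dataset (Iris); scikit-learn assumed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)  # labels come from the 5 nearest neighbors
knn.fit(X_train, y_train)                  # k-NN builds no model, it only stores samples
print("test accuracy:", knn.score(X_test, y_test))
```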
Li & Zhang, in their research on personalized music recommendation, used the k-NN algorithm for collaborative filtering and modified the basic algorithm to make it more effective (improved k-NN) [5]. Meanwhile, Adege et al., in their research on indoor localization, compared the k-NN method with the backpropagation method and found that k-NN obtained better results [6]. In the classification process, the k-NN method requires features (criteria).
Each feature has its own values that define certain classes, but too many features slow down the performance of the k-NN method. The method proposed here to reduce the number of features is the Analytic Hierarchy Process (AHP). Ren et al., in their research on the problem of selecting artificial intelligence strategies, proposed the Analytic Hierarchy Process for multi-criteria group decision making, so that the many criteria of a problem can be weighted according to their importance in the decision [7].
For training, a method is also proposed to increase accuracy, namely k-Fold Cross Validation (k-FCV). Cross validation is a statistical technique that can be used in model selection to better estimate the test error of a predictive model [8], [9]. Caon et al. explained that k-Fold Cross Validation is the best technique to apply in each case and that it completes the choice of method for further adaptation iterations [10].
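As a sketch of the idea (again on the stand-in Iris data, with scikit-learn assumed), the snippet below runs a 5-fold cross validation of a k-NN model and reports the per-fold accuracies:

```python
# Sketch: 5-fold cross validation (k-FCV) of a k-NN classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Each of the 5 folds serves once as test data, the rest as training data.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print("per-fold accuracy:", scores, "mean:", scores.mean())
```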
Building on the work above, this study analyzes the performance of the k-Nearest Neighbor method combined with the k-Fold Cross Validation algorithm, used as a model-evaluation scheme that splits the data into training and test sets to obtain the best machine learning model, together with the Analytic Hierarchy Process for selecting features from the classified data according to the importance of each feature in decision making.

RESEARCH METHOD

Data Collection
The data used is the cervical cancer risk dataset obtained from the UCI machine learning repository. This dataset focuses on the prediction of indicators for the diagnosis of cervical cancer; its features include demographic information, habits, and historical medical records. In addition, literature studies and references to national and international journals were used to obtain additional knowledge about the theoretical foundations, analytical concepts, and methods of data classification.

Research
At this stage, the dataset is analyzed to gain knowledge of the algorithms and methods under study, namely the k-Fold Cross Validation algorithm, the k-Nearest Neighbor method, and the Analytic Hierarchy Process method, applied to classifying cervical cancer risk data and determining the accuracy of the approach. The flow of this research is as follows (steps a and b are sketched in code after the list):
a. Data cleaning. A machine learning model cannot directly process data gathered from multiple sources. The term "garbage in, garbage out" captures this: the results of machine learning will be bad if the input is bad. Typical data cleaning tasks include enforcing format consistency and data scale and handling data duplication, missing values, and skewness.
b. Data preparation. Many machine learning models cannot process categorical data, so categorical data must be converted into numeric data. This step is called data preparation.
c. Data storage. The processed data is stored in a data store, following the concept of the Relational Database Management System (RDBMS), so that it can be processed again later.

d. Data evaluation. The evaluation method used in this study is k-Fold Cross Validation. In cross validation, the dataset is divided into k folds; at each iteration one fold is used as test data and the remaining folds are used as training data, and the process is repeated until every fold has been evaluated once.
e. Data classification. After obtaining the split of training and test data from the model evaluation with k-Fold Cross Validation, the data is classified with k-NN to measure the accuracy of the model being built. This is repeated up to a 10-fold split or until the highest accuracy is obtained.
f. Feature selection. Feature selection removes features (attributes) that do not really affect machine learning performance. The feature selection method proposed in this research is the Analytic Hierarchy Process.
g. Analysis. Finally, an analysis is carried out to determine whether the model built and the results of the feature selection improve the performance and accuracy of machine learning for data classification.
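The cleaning and preparation steps (a and b above) might look like the pandas sketch below; the CSV file name, the "?" missing-value marker, the 50% sparsity threshold, and median imputation are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of data cleaning (a) and data preparation (b) with pandas.
import pandas as pd

# The UCI cervical cancer CSV encodes missing values as "?" (file name assumed).
df = pd.read_csv("risk_factors_cervical_cancer.csv", na_values="?")

# Cleaning: drop columns that are mostly missing, impute the remaining gaps.
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))  # assumed 50% threshold
df = df.fillna(df.median(numeric_only=True))       # assumed median imputation

# Preparation: convert categorical columns to numeric codes.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category").cat.codes

# Storage (c): persist the processed table, e.g. to an RDBMS; a file stands in here.
df.to_csv("cervical_cancer_clean.csv", index=False)
```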

Analysis Method
The dataset has 858 records and 36 attributes, for a total of 858 × 36 = 30,888 values, of which 3,622 were missing. After the data cleaning process, the initial 36 attributes were reduced to 34 attributes including the label (target); two attributes were omitted because they had too much missing data.
a. Evaluate the k-NN model with k-Fold Cross Validation. An overview of the k-FCV process as the evaluation model in this study is given in Figure 1, which shows the evaluation of the k-NN method with k-FCV. The number of folds used in this study is 5. Each fold is tested with k-NN k values varying over 3, 5, 7, and 9, and the dataset is split into training data and test data by the k-FCV model evaluation. After the training and test data have been determined by the k-FCV algorithm, the data classification process with the k-NN method begins. The stages of the k-NN method are as follows (a from-scratch sketch follows the list):
1) Determine the value of k.
2) Calculate the Euclidean distance between the test data and each training datum in the dataset.
3) Sort the Euclidean distances in ascending order.
4) Take the k smallest distances.
5) Classify the test data according to the majority class among those k nearest neighbors.
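A from-scratch sketch of these five steps, using NumPy and Euclidean distance (the toy training points are made up for illustration):

```python
# Sketch of the five k-NN steps with NumPy.
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):             # 1) choose k
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))  # 2) Euclidean distances
    order = np.argsort(dists)                               # 3) sort ascending
    nearest = y_train[order[:k]]                            # 4) k smallest distances
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]                        # 5) majority label wins

# Toy usage: the test point sits inside the class-1 cluster.
X_tr = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [6, 5]])
y_tr = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_tr, y_tr, np.array([5, 6]), k=3))       # -> 1
```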

b. Feature (attribute) selection with AHP
The steps taken to select the optimal attributes (features) for the continued testing in this study are as follows (a numerical sketch follows the list):
1) Create a pairwise comparison matrix.
2) Sum each attribute (feature) column.
3) Compute the normalized criteria (attribute) value matrix.
4) Take the average of each row to obtain the priority weights.
After the consistency vector has been determined, two further values must be calculated before the final consistency ratio (CR): lambda (λ), the average of the consistency vector, and the consistency index (CI). Because CR < 0.1, the consistency ratio of the calculation is acceptable, and the priority weights in Table 4 can therefore be used as the parameter values for feature (attribute) selection.

Test Result
The tests used 150 test data points with varying k values (3, 5, 7, and 9) and varying folds (1, 2, 3, 4, and 5). The attributes (features) used in the tests also varied: 33 features (the default dataset) and the 4 and 7 features selected by AHP, giving 9,000 tests in total. The 4 AHP-selected features are age, number of sexual partners, first sexual intercourse, and number of pregnancies; the 7 AHP-selected features add smokes, smokes (years), and smokes (packs/year). Testing was assisted by a self-programmed application; Figure 2 shows the application used, displaying the data classification process during testing. The test results are presented in several tables. Table 7 compares the results of evaluating the k-NN model with k-FCV on fold 3 against the results of feature selection with AHP (4 and 7 features). Figure 3 shows the accuracy of each test with varying folds and k values. Choosing the right k value noticeably influences the classification, although no single value succeeds on all test data. The highest accuracy occurred at fold 3, identifying the fold-3 test as the best machine learning model in this study. A sketch of this test grid follows.
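The grid could be reproduced along the lines of the sketch below, where df is the cleaned DataFrame from the earlier sketch, "Biopsy" is assumed as the target label, and the AHP-selected column names follow the text:

```python
# Sketch of the test grid: 3 feature sets x 4 k values x 5 folds.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# df: cleaned DataFrame from the data-preparation sketch above (assumed).
ahp4 = ["Age", "Number of sexual partners",
        "First sexual intercourse", "Num of pregnancies"]
ahp7 = ahp4 + ["Smokes", "Smokes (years)", "Smokes (packs/year)"]
y = df["Biopsy"]                                   # assumed target column

for name, cols in [("all features", list(df.columns.drop("Biopsy"))),
                   ("AHP-4", ahp4), ("AHP-7", ahp7)]:
    for k in (3, 5, 7, 9):
        acc = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                              df[cols], y, cv=5)   # one accuracy per fold
        print("%s, k=%d:" % (name, k), acc.round(3))
```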

CONCLUSION
Based on the results of testing and analysis, the k-Nearest Neighbor method with the k-Fold Cross Validation algorithm as the evaluation model is shown to split training data and test data quite well. Likewise, the Analytic Hierarchy Process method for selecting features from the classified data achieves an accuracy level almost the same as classification without feature selection: testing from fold 1 to fold 5 always yields an accuracy above 90%, with the best test at fold 3 reaching 95% accuracy. It can therefore be concluded that evaluating the k-Nearest Neighbor model with k-Fold Cross Validation yields a good machine learning model, and that the Analytic Hierarchy Process as feature selection also obtains optimal results and can reduce the computational load of the k-Nearest Neighbor method, since only the attributes (features) selected by their importance for decision making are used.

Figure 2. Display of the data classification process

Table 2. The sum of each attribute

Table 3. Attribute value matrix

Table 4. Priority weight value

Table 5. Weight sum vector