Improving cancer diseases classification using a hybrid filter and wrapper feature subset selection

In the classification of cancer data sets, we note that they contain a number of additional features that affect the classification accuracy. Many evolutionary algorithms are used to select features and reduce the dimensionality of patterns, such as the gray wolf optimization (GWO) algorithm after it is converted from a continuous space to a discrete space. In this paper, a feature selection method is proposed that works in two consecutive stages. In the first stage, the fuzzy mutual information (FMI) technique is used to determine the most important features of the disease dataset through a fuzzy model built on the data size. In the second stage, the binary gray wolf optimization (BGWO) algorithm is used to select a specific number of the features that came from the first stage and that affect the classification process. The proposed algorithm, FMI_BGWO, demonstrates efficiency and effectiveness by obtaining a higher classification accuracy with a smaller number of selected genes than other competitor algorithms.
*Address for Correspondence: Noor Muhammed Noori, Department of Mathematics, University of Mosul, Mosul, Iraq, Tel: +964 7715381243; Email: NoorMuhammedNoori@uomosul.edu.iq
Submitted: 19 December 2019. Approved: 10 February 2020. Published: 11 February 2020.
How to cite this article: Noori NM, Qasim OS. Improving cancer diseases classification using a hybrid filter and wrapper feature subset selection. Ann Proteom Bioinform. 2020; 4: 006-0011. DOI: dx.doi.org/10.29328/journal.apb.1001010
Copyright: © 2020 Noori NM, et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Introduction
Mutual information (MI) is one of the filtering methods. It is calculated between two random variables using entropy, which measures the average amount of information required to describe a random variable [1].
Gray wolf optimization (GWO) is a swarm intelligence technique developed by Mirjalili, et al. [2]. The hunting techniques and the social hierarchy of wolves are mathematically modeled in order to develop GWO and perform optimization. Tests on standard benchmark functions indicate that GWO has better exploration and exploitation characteristics than other swarm intelligence techniques [3]. Researchers have extended the basic GWO algorithm by converting it from a continuous search space to a discrete search space. This modified algorithm is called the binary gray wolf optimization (BGWO) algorithm; it works in binary search spaces and uses binary values equal to 0 or 1 [3,4].
Feature subset selection is a procedure that removes unnecessary features from the original feature set [5,6]. It is used when a dataset has a large number of features, and its task is to keep only the features that are necessary. Feature selection can be applied in several areas and is intended to improve the accuracy of the classification. Two basic approaches are used in selecting features: filtering approaches and wrapper approaches.
In this study, a new FMI_BGWO algorithm is suggested to determine the best feature selection. The proposed algorithm efficiently exploits the strengths of both the FMI and BGWO algorithms in finding the most important features. The experimental results show the excellent performance of the proposed algorithm when the number of features is high and the sample size is low.
The remainder of this paper is organized as follows. The GWO is presented in Section 2. Sections 3 and 4 give a brief description of feature selection and MI, respectively. In Section 5, the proposed algorithm is explained. Section 6 covers the obtained results and their discussion. Finally, the most important general conclusions are given in Section 7.

Grey Wolf Optimization (GWO)
The gray wolf (Canis lupus) is part of the Canidae family. Gray wolves live in packs with a very strict social dominance hierarchy. The leaders of the pack, a male and a female, are called alphas [2]. The alpha wolf is also known as the dominant wolf because its orders must be followed by all wolves in the pack. Only alpha wolves are allowed to mate in the pack. This suggests that the organization and discipline of the pack are more important than its strength. Beta is the second level in the hierarchy of gray wolves.
Betas are wolves that help the alphas in decision making and other pack activities. The beta wolf is the best candidate to become alpha if one of the alpha wolves becomes too old or dies.
Omega is the lowest-ranking gray wolf. The omega wolf always has to submit to all the other dominant wolves. Omega acts as a scapegoat, and omegas are the last wolves allowed to eat. It may seem that the omega is not an important member of the group, but in fact, if the omega is lost, the entire pack faces problems and internal fighting.
Delta is the fourth class in the pack. Delta wolves dominate the omega, but they must submit to alphas and betas. Guards, scouts, hunters, elders, and caretakers belong to this class, and each has its own specific responsibility [7].

Mathematical modelling
The first best solution in GWO is alpha (α), the second-best solution is beta (β), and the third-best solution is delta (δ). The rest of the solutions are omega (ω), which follow the first three solutions [8].
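Encircling prey: Gray wolves encircle their prey while hunting. The encircling behavior is modeled by the following equations:
\[ \vec{D} = \left| \vec{C} \circ \vec{X}_p(t) - \vec{X}(t) \right| \qquad (1) \]
\[ \vec{X}(t+1) = \vec{X}_p(t) - \vec{A} \circ \vec{D} \qquad (2) \]
where $t$ is the iteration, $\vec{X}_p$ is the prey position, $\vec{X}$ is the gray wolf position, and the $\circ$ operator indicates entrywise vector multiplication.
$\vec{A}$ and $\vec{C}$ are coefficient vectors calculated as follows:
\[ \vec{A} = 2\vec{a} \circ \vec{r}_1 - \vec{a} \qquad (3) \]
\[ \vec{C} = 2\vec{r}_2 \qquad (4) \]
where the components of $\vec{a}$ are linearly decreased from 2 to 0 over the course of the iterations and $\vec{r}_1$, $\vec{r}_2$ are random vectors in [0,1].
Hunting: Gray wolves have the ability to locate prey and encircle it. Hunting is performed by the complete pack based on information from the alpha, beta and delta wolves, so the wolves' positions are updated as in the following equations:
\[ \vec{D}_\alpha = \left| \vec{C}_1 \circ \vec{X}_\alpha - \vec{X} \right|, \quad \vec{D}_\beta = \left| \vec{C}_2 \circ \vec{X}_\beta - \vec{X} \right|, \quad \vec{D}_\delta = \left| \vec{C}_3 \circ \vec{X}_\delta - \vec{X} \right| \]
\[ \vec{X}_1 = \vec{X}_\alpha - \vec{A}_1 \circ \vec{D}_\alpha, \quad \vec{X}_2 = \vec{X}_\beta - \vec{A}_2 \circ \vec{D}_\beta, \quad \vec{X}_3 = \vec{X}_\delta - \vec{A}_3 \circ \vec{D}_\delta \]
\[ \vec{X}(t+1) = \frac{\vec{X}_1 + \vec{X}_2 + \vec{X}_3}{3} \]
The parameter $a$ is updated through the following equation:
\[ a = 2 - t \cdot \frac{2}{Max_{iter}} \qquad (8) \]
where $t$ is the iteration number and $Max_{iter}$ is the total number of iterations allowed for the optimization [9]. The pseudo code of the GWO algorithm is displayed in figure 1.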

Start
    Generate an initial population of gray wolves (i = 1, …, n)
    Initialize A, a, and C
    Evaluate the fitness of each search agent
    x_α = the first best search agent
    x_β = the second-best search agent
    x_δ = the third-best search agent
    while (t < Max. number of iterations)
        for each search agent
            Update the position of the current search agent by equation (5)
        end for
        Update A, a, and C
        Evaluate the fitness of each search agent
        Update x_α, x_β, and x_δ
        set t = t + 1
    end while
    return x_α
End
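The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `gwo`, the bounds, the population size, the clamping to the search box and the choice to keep the previous leaders in the ranking pool are all hypothetical choices made for the example, and the fitness is minimized.

```python
import random


def gwo(fitness, dim, n_wolves=20, max_iter=100, lower=-10.0, upper=10.0, seed=0):
    """Minimize `fitness` with a basic continuous GWO loop (illustrative sketch)."""
    rng = random.Random(seed)
    # Random initial population inside the search box (bounds are assumptions).
    wolves = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(n_wolves)]
    # alpha, beta, delta = the three best agents found so far.
    leaders = [w[:] for w in sorted(wolves, key=fitness)[:3]]
    for t in range(max_iter):
        a = 2.0 - t * (2.0 / max_iter)  # decreases linearly from 2 toward 0
        for i, x in enumerate(wolves):
            new_x = []
            for d in range(dim):
                pulls = []
                for leader in leaders:
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    D = abs(C * leader[d] - x[d])
                    pulls.append(leader[d] - A * D)
                value = sum(pulls) / 3.0  # average of the three leader pulls
                new_x.append(min(upper, max(lower, value)))  # clamp (assumption)
            wolves[i] = new_x
        # Re-rank, keeping the old leaders so the best solution is never lost.
        leaders = [w[:] for w in sorted(wolves + leaders, key=fitness)[:3]]
    return leaders[0]


def sphere(x):
    """Toy objective used only to exercise the sketch."""
    return sum(v * v for v in x)


best = gwo(sphere, dim=3, max_iter=50)
print(best, sphere(best))
```

Because the previous leaders stay in the ranking pool, the best fitness found can only improve or stay the same from one iteration to the next.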

Binary Gray Wolf Optimization (BGWO)
In the continuous GWO, the positions of the gray wolves can change to any point in space. In some special problems, such as feature selection, solutions are limited to binary values, 0 or 1. In our research, we perform feature selection through a binary GWO algorithm. The update equation of each wolf is a function of three position vectors, X_α, X_β and X_δ, which represent the best three solutions and attract each wolf towards them. At any given time, all solutions are in binary form and lie at the corners of a hypercube. The wolf positions are updated as in the basic GWO algorithm, while the binary restriction is maintained according to equation (9).
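The main update equation in the BGWO algorithm is written as [3]:
\[ X_i^{t+1} = \mathrm{Crossover}(X_1, X_2, X_3) \qquad (9) \]
where $X_1$, $X_2$, $X_3$ are binary vectors representing the effect of the alpha, beta and delta wolves on wolf $i$ in bGWO, and Crossover(a, b, c) is a suitable crossover between solutions a, b, c. The vectors $X_1$, $X_2$ and $X_3$ are calculated in order using equations (10), (13) and (16). For the alpha wolf:
\[ x_1^d = \begin{cases} 1 & \text{if } (x_\alpha^d + bstep_\alpha^d) \geq 1 \\ 0 & \text{otherwise} \end{cases} \qquad (10) \]
\[ bstep_\alpha^d = \begin{cases} 1 & \text{if } cstep_\alpha^d \geq rand \\ 0 & \text{otherwise} \end{cases} \qquad (11) \]
\[ cstep_\alpha^d = \frac{1}{1 + e^{-10 (A_1^d D_\alpha^d - 0.5)}} \qquad (12) \]
where $x_\alpha^d$ represents the position vector of the alpha wolf in dimension $d$, $bstep_\alpha^d$ is a binary step in dimension $d$, and $A_1^d$ and $D_\alpha^d$ are calculated using equations (3) and (5) in the dimension $d$.
$X_2$ is obtained in the same way in equations (13)-(15), where $x_\beta^d$ represents the position vector of the beta wolf in dimension $d$, and $X_3$ is obtained from the delta wolf in equations (16)-(18), with the corresponding coefficients again calculated using equations (3) and (5) in the dimension $d$.
The crossover is then applied per dimension to the solutions a, b, c:
\[ x_d = \begin{cases} a_d & \text{if } rand < \frac{1}{3} \\ b_d & \text{if } \frac{1}{3} \leq rand < \frac{2}{3} \\ c_d & \text{otherwise} \end{cases} \qquad (19) \]
where rand is a random number drawn from the uniform distribution on the closed interval [0,1], $a_d$, $b_d$ and $c_d$ represent the binary values of the first, second and third parameters in dimension $d$, and $x_d$ is the output of the crossover in dimension $d$.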

Start
    Generate an initial population of gray wolves (i = 1, …, n)
    Initialize A, a, and C
    Evaluate the fitness of each search agent
    x_α = the first best search agent
    x_β = the second-best search agent
    x_δ = the third-best search agent
    while (t < Max. number of iterations)
        for each search agent
            Calculate x_1, x_2, x_3 using equations (10), (13), and (16)
            Apply the crossover among x_1, x_2, x_3 using equation (9)
        end for
        Update A, a, and C
        Evaluate the fitness of each search agent
        Update x_α, x_β, and x_δ
        set t = t + 1
    end while
    return x_α
End
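A single binary position update from this loop might look as follows in Python. This is a hedged sketch rather than the authors' code: it assumes the sigmoid-step formulation commonly given for BGWO in the literature cited as [3] (a continuous step squashed by a sigmoid with constants 10 and 0.5, thresholded against a uniform random number, followed by a three-way uniform crossover). The function names `binary_step`, `crossover` and `bgwo_update` are hypothetical.

```python
import math
import random


def binary_step(leader_bit, A, D, rng):
    """Binary move toward one leader in one dimension (assumed sigmoid form)."""
    cstep = 1.0 / (1.0 + math.exp(-10.0 * (A * D - 0.5)))  # continuous step
    bstep = 1 if cstep >= rng.random() else 0               # binary step
    return 1 if leader_bit + bstep >= 1 else 0


def crossover(a_bit, b_bit, c_bit, rng):
    """Pick one of the three candidate bits with equal probability."""
    r = rng.random()
    if r < 1.0 / 3.0:
        return a_bit
    if r < 2.0 / 3.0:
        return b_bit
    return c_bit


def bgwo_update(x, alpha, beta, delta, a, rng):
    """Return the new binary position of wolf `x` given the three leaders."""
    new_x = []
    for d in range(len(x)):
        bits = []
        for leader in (alpha, beta, delta):
            A = 2.0 * a * rng.random() - a
            C = 2.0 * rng.random()
            D = abs(C * leader[d] - x[d])
            bits.append(binary_step(leader[d], A, D, rng))
        new_x.append(crossover(bits[0], bits[1], bits[2], rng))
    return new_x


rng = random.Random(1)
print(bgwo_update([0, 1, 0, 1], [1, 1, 0, 0], [1, 0, 0, 1], [1, 1, 1, 0], a=1.0, rng=rng))
```

One consequence of this formulation is that a bit set to 1 in all three leaders is always set to 1 in the updated wolf, since each candidate bit is at least the leader's bit.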

Naïve Bayes classifier
Naïve Bayes is one of the most widely used classification models. It calculates the probability that a particular sample belongs to a particular category, under the assumption that the features are conditionally independent given the category [10]. Naïve Bayes generally performs well and is simple to construct and easy to implement. Given a sample described by the feature vector $(x_1, x_2, \ldots, x_n)$, we look for the class $G$ that maximizes the probability:
\[ P(x_1, x_2, \ldots, x_n \mid G) \qquad (20) \]
The Naïve Bayes method assumes conditional independence between the features in the data, so this probability is expressed as the product of the individual probabilities [11,12]:
\[ P(x_1, x_2, \ldots, x_n \mid G) = \prod_{i=1}^{n} P(x_i \mid G) \]
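To make the product rule concrete, here is a small categorical Naïve Bayes sketch in plain Python. It is an illustration under assumptions, not the classifier used in the experiments: it includes the class prior P(G) in the score (the usual maximum a posteriori form), works in log space to avoid underflow, and applies add-one (Laplace) smoothing. The names `train_nb` and `predict_nb` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values = defaultdict(set)            # feature index -> distinct values seen
    for row, g in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(g, i)][v] += 1
            values[i].add(v)
    return class_counts, value_counts, values


def predict_nb(model, row):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_counts, value_counts, values = model
    n = sum(class_counts.values())
    best, best_score = None, -math.inf
    for g, cg in class_counts.items():
        score = math.log(cg / n)  # class prior P(G), an assumption of this sketch
        for i, v in enumerate(row):
            # Add-one smoothing so unseen values do not give probability zero.
            score += math.log((value_counts[(g, i)][v] + 1) / (cg + len(values[i])))
        if score > best_score:
            best, best_score = g, score
    return best


rows = [("high", "yes"), ("high", "yes"), ("low", "no"), ("low", "no")]
labels = [1, 1, 0, 0]
model = train_nb(rows, labels)
print(predict_nb(model, ("high", "yes")), predict_nb(model, ("low", "no")))
```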

Feature selection
The feature selection method improves the performance of the algorithm by reducing the number of attributes used to describe a data set [6,13]. The purpose of feature selection in an algorithm is to reduce the number of genes used and thereby increase the classification accuracy [14]. Feature selection algorithms consist of three parts:
1. Search algorithm: generates subsets of properties (features) from the original features.
2. Fitness function: takes a subset as input and returns a numerical assessment of it. The goal of the search algorithm is to maximize this function.
3. Classifier: the target algorithm that uses the final subset of genes (i.e., the most important features selected) [15,16] (Figure 3).
Mutual information values are large when a feature and the category are closely correlated; in the absence of a correlation between the feature and the category, the mutual information is 0. The mutual information can also be described as the amount of uncertainty in a feature that is removed by knowing its category:
\[ MutualInfo(X_i, C) = H(X_i) - H(X_i \mid C) \]
where $H(X_i)$ is the entropy of $X_i$ and $H(X_i \mid C)$ is the entropy of $X_i$ after knowing $C$. If knowledge of $C$ provides a lot of information, then $H(X_i \mid C)$ is low and MutualInfo is therefore high. One disadvantage of mutual information is that it favors rare features among those whose conditional probabilities are equal, and this may not always be the desired result [19].
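The entropy view of mutual information can be checked on discrete data with a few lines of Python. The sketch below is illustrative only: it uses empirical frequencies as probabilities and base-2 logarithms (both assumptions), and the helper names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, c):
    """H(X | C): entropy of x within each category, weighted by category size."""
    n = len(x)
    groups = {}
    for xv, cv in zip(x, c):
        groups.setdefault(cv, []).append(xv)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def mutual_info(x, c):
    """Uncertainty about x that is removed by knowing the category c."""
    return entropy(x) - conditional_entropy(x, c)


labels = [0, 0, 1, 1]
print(mutual_info([0, 0, 1, 1], labels))  # feature fully determined by the label
print(mutual_info([0, 1, 0, 1], labels))  # feature unrelated to the label
```

In the first call the feature is a copy of the label, so knowing the category removes all of its uncertainty; in the second, the feature is split evenly within each category, so nothing is removed.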

The proposed algorithm
The proposed FMI_BGWO method consists of two basic stages. In the first stage, the fuzzy mutual information (FMI) method is used to determine the most important features, based on a fuzzy Mamdani model built on two input variables that represent the size of the sample data and the number of genes per sample. Three linguistic values (high, medium and low) were selected for both input and output. The output range was fixed between 10 and 50; it gives the number of genes, ranked by the MI method, that will be submitted to the second stage. In the second stage, the BGWO algorithm is used to reduce and define a certain number of the features that came from the first stage. In the BGWO method, a fitness function (FF) is used to evaluate the wolves' positions. It depends on (q), the length of the selected gene subset, (p), the total number of features, and (c), the accuracy of the classification model. The vector in figure 2 consists of a series of binary values 0 and 1 that represents a subset of features; in this vector, any feature can be selected.
The second stage of the proposed algorithm FMI_BGWO concentrates on the BGWO, specifically as the wrapper in feature selection (Figure 4). The wrapper method selects features based on the predictive performance of a predefined learning algorithm, which it uses to evaluate the quality and efficiency of candidate features [17]. Given a specific algorithm, it works in two steps: (1) search for a subset of features, and (2) evaluate the selected features. Steps (1) and (2) are repeated until the desired learning performance is obtained or a stopping criterion is reached [12,18].
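The exact form of the fitness function is not reproduced in this text, so the following Python sketch only illustrates how the three quantities q, p and c could be combined. It assumes a weighted sum that rewards accuracy and penalizes large subsets, a form that is common in wrapper feature selection; the weight `0.9`, the function names, the convention that a bit of 1 means "selected", and the zero score for an empty subset are all assumptions for illustration, not the authors' definition.

```python
def selected_indices(mask):
    """Indices of the features whose bit is 1 (assumed to mean 'selected')."""
    return [i for i, bit in enumerate(mask) if bit == 1]


def fitness(mask, accuracy, weight=0.9):
    """Hypothetical fitness: weighted accuracy plus a reward for small subsets.

    mask     -- binary vector of length p (one bit per feature)
    accuracy -- classification accuracy c obtained with the selected features
    weight   -- assumed trade-off between accuracy and subset size
    """
    p = len(mask)
    q = sum(mask)
    if q == 0:
        return 0.0  # an empty subset cannot be classified (assumption)
    return weight * accuracy + (1.0 - weight) * (p - q) / p


mask = [1, 0, 1, 1, 0]
print(selected_indices(mask))
print(fitness(mask, accuracy=0.8))
```

With this form, two subsets that reach the same accuracy are ranked by size, so the search is nudged toward fewer genes.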
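The two-step wrapper loop can be sketched generically in Python. To keep the example short, the search step here is a random single-bit flip with greedy acceptance rather than BGWO, and the evaluation step is a user-supplied function; `wrapper_search`, `toy_score` and the round limit used as the stopping criterion are hypothetical names and choices.

```python
import random


def wrapper_search(n_features, evaluate, max_rounds=50, seed=0):
    """Repeat (1) propose a subset and (2) evaluate it, keeping the best found."""
    rng = random.Random(seed)
    best_mask = [1] * n_features        # start from the full feature set
    best_score = evaluate(best_mask)
    for _ in range(max_rounds):         # stopping criterion: fixed round budget
        candidate = best_mask[:]
        flip = rng.randrange(n_features)
        candidate[flip] = 1 - candidate[flip]  # step (1): search a new subset
        score = evaluate(candidate)            # step (2): evaluate the subset
        if score > best_score:
            best_mask, best_score = candidate, score
    return best_mask, best_score


def toy_score(mask):
    """Stand-in for classifier accuracy: features 0 and 2 help, size costs."""
    return (mask[0] + mask[2]) - 0.1 * sum(mask)


mask, score = wrapper_search(4, toy_score)
print(mask, score)
```

In a real wrapper, `evaluate` would train and test the chosen classifier on the selected columns, which is what makes wrapper methods more expensive than filters.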

Mutual Information (MI)
Mutual information is widely used in feature selection and also in sequence alignment. It ranks features according to their importance to the data [12]. The relationship between a feature and the category labels is measured through the following relationship:
\[ MutualInfo(X_i, C) = \sum_{x \in X_i} \sum_{c \in C} P(x, c) \log \frac{P(x, c)}{P(x)\,P(c)} \]
where $C$ is the set of categories and $X_i$ is the feature in position $i$; the sums run over the values of $X_i$ and the categories in $C$.
In mutual information, if there is a close correlation between the category and the feature, then the mutual information between them is large.
The most important features obtained by the FMI method are passed to the BGWO algorithm, which reduces them further as a second stage. Using the BGWO algorithm, a certain number of features are ranked and selected for the final classification process, which uses the Naive Bayes classifier. Figure 5 presents a detailed flowchart of the proposed FMI_BGWO structure.
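The hand-off between the two stages amounts to ranking the features by their filter score and keeping the top k, where k is the number produced by the fuzzy model (between 10 and 50 in this work). A minimal Python sketch, with the hypothetical function name `top_k_features` and made-up scores:

```python
def top_k_features(scores, k):
    """Return the indices of the k highest-scoring features, in index order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# Hypothetical mutual information scores for four features.
mi_scores = [0.1, 0.9, 0.5, 0.7]
print(top_k_features(mi_scores, 2))
```

Only the columns returned here would be encoded in the binary vectors that the second stage searches over, which shortens each wolf's position vector considerably.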

Experimental results
The proposed algorithm FMI_BGWO is evaluated, and its performance is compared with that of the other competitor algorithms.

Datasets
We have selected (3) different classification problems from the literature to verify the effectiveness of the proposed algorithm for classification problems. The data sets were obtained from the UCI repository [20].
The target variable is binary and represents the condition of the patient, with good = 1 and bad = 0.
The following table shows the overall description of the data set (Table 1).

Discussion and analysis
To correctly evaluate the proposed algorithm FMI-BGWO, the results were compared with the BGWO algorithm for feature selection.
On both the training and testing datasets, the proposed algorithm, FMI-BGWO, achieved the best classification results. For instance, on the breast dataset, the classification accuracy (CA) on the testing dataset is 82% with FMI-BGWO, compared with 100% with BGWO.
When comparing the BGWO algorithm with the proposed algorithm FMI-BGWO, we note that FMI-BGWO has a clear advantage in terms of accuracy and classification capacity, and that BGWO performs worse than FMI-BGWO on the two datasets (Table 2).

Conclusion
In this paper, the FMI_BGWO method has been proposed to improve classification performance on large datasets. After the feature selection process, the datasets were sent to the Naïve Bayes classifier. The results of the FMI_BGWO method were compared with the results of BGWO in table 2. The experimental results indicate that the proposed algorithm FMI_BGWO achieves a better classification accuracy and selects fewer features than BGWO.