A Comparative Study on Vaccination Prediction Using ML Algorithms
This article was written by Alparslan Mesri and Hale Kizilduman.
In late 2009 and early 2010, H1N1 flu surveys were conducted by telephone in the USA. Besides social, economic, and demographic questions, the respondents were asked whether they had received the H1N1 vaccine or the seasonal flu vaccine. The aim is to use this information to predict whether a respondent received the H1N1 and seasonal flu vaccines.
This study is a preliminary step toward future studies. Five classification algorithms were used: Random Forest, XGBoost, Gradient Descent, Logistic Regression, and KNN. For each target variable, the 3 independent variables with the highest correlation were selected, and a comparison table showing the accuracy of the models is given at the end of the study.
The study and dataset can be accessed here.
First, the necessary libraries are imported.
Then the CSV files are loaded. The contents of the df1 table are shown in the following tables.
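As a minimal sketch of this loading step, the snippet below reads two small in-memory CSVs with pandas. The CSV contents, the respondent_id index column, and the use of io.StringIO in place of real file paths are assumptions for illustration, since the article does not give the file names.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two CSV files (feature table and label table).
# With real files, a file path would be passed to pd.read_csv instead.
features_csv = io.StringIO(
    "respondent_id,doctor_recc_h1n1,opinion_h1n1_risk\n"
    "0,0.0,1.0\n"
    "1,1.0,4.0\n"
    "2,,2.0\n"
)
labels_csv = io.StringIO(
    "respondent_id,h1n1_vaccine,seasonal_vaccine\n"
    "0,0,0\n"
    "1,1,1\n"
    "2,0,1\n"
)

# df1 holds the independent variables, df2 the two target columns.
df1 = pd.read_csv(features_csv, index_col="respondent_id")
df2 = pd.read_csv(labels_csv, index_col="respondent_id")

print(df1.head())
print(df2.head())
```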
There are several independent variables in the data set.

Then the df2 variable, which contains the dependent variables, is examined. In this problem, the h1n1_vaccine and seasonal_vaccine columns are the targets to be predicted.

In the next step, the amount of missing data in each column is checked.
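One common way to do this check in pandas is sketched below. The toy df1 and its values are made up for illustration, and the missing_report name is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy feature table with deliberately missing values (illustrative only).
df1 = pd.DataFrame({
    "doctor_recc_h1n1": [0.0, 1.0, np.nan, 0.0],
    "opinion_h1n1_risk": [1.0, 4.0, 2.0, np.nan],
    "health_insurance": [np.nan, np.nan, 1.0, np.nan],
})

# Count of missing values per column, and the same as a fraction of rows.
missing_counts = df1.isnull().sum()
missing_share = df1.isnull().mean()

# Combine both views and list the columns with the most gaps first.
missing_report = pd.DataFrame(
    {"missing": missing_counts, "share": missing_share}
).sort_values("missing", ascending=False)
print(missing_report)
```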

The describe function gives a quick overview of the summary statistics of each column.

The dependent and independent variables are combined in the united_df variable to look at the correlations between the columns.
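The combination and correlation step can be sketched roughly as follows. The data here is synthetic, unrelated_feature is a hypothetical column, and the use of pd.concat is an assumption about how united_df was built.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic independent variables (illustrative values only).
df1 = pd.DataFrame({
    "doctor_recc_h1n1": rng.integers(0, 2, n).astype(float),
    "opinion_h1n1_risk": rng.integers(1, 6, n).astype(float),
    "unrelated_feature": rng.integers(0, 4, n).astype(float),
})

# Synthetic targets; h1n1_vaccine is made to depend on doctor_recc_h1n1.
df2 = pd.DataFrame({
    "h1n1_vaccine": ((df1["doctor_recc_h1n1"] + rng.random(n)) > 1.2).astype(int),
    "seasonal_vaccine": rng.integers(0, 2, n),
})

# Put features and targets side by side, then compute pairwise correlations.
united_df = pd.concat([df1, df2], axis=1)
corr = united_df.corr()

# Rank the features by absolute correlation with one target,
# leaving both target columns out of the ranking.
targets = ["h1n1_vaccine", "seasonal_vaccine"]
h1n1_corr = corr["h1n1_vaccine"].drop(targets).abs().sort_values(ascending=False)
print(h1n1_corr)
```

The corr matrix could then be passed to a plotting function such as seaborn's heatmap to obtain a correlation heatmap.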

When examining the correlation heatmap, the columns most correlated with the first target variable h1n1_vaccine are as follows:
#doctor_recc_h1n1: 0.39
#opinion_h1n1_risk: 0.32
#opinion_h1n1_vacc_effective: 0.27
#opinion_seas_risk: 0.26
#health_insurance: 0.22
#doctor_recc_seasonal: 0.21
The columns most correlated with the second target variable, the seasonal_vaccine column, are as follows:
#opinion_seas_risk: 0.39
#doctor_recc_seasonal: 0.37
#opinion_seas_vacc_effective: 0.36
#opinion_h1n1_risk: 0.22
#opinion_h1n1_vacc_effective: 0.21
#doctor_recc_h1n1: 0.2
#health_insurance: 0.2
In addition to these highly correlated variables, there is also a high correlation between the h1n1_vaccine and seasonal_vaccine variables. However, since a dependent variable cannot be used as a predictor in the estimation process, the correlation between these two columns is ignored.
In the code block below, the dependent variables are copied to the variable y. The df1 and y variables are then split into train and validation data at a ratio of approximately 67% / 33%. After this, the NaN values in the x_train and x_val variables are filled with the column mean.
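A minimal sketch of the split and imputation, using scikit-learn's train_test_split on synthetic data, might look like this. The test_size=0.33 and random_state values are assumptions, and filling both splits with the training-set means is one possible design choice rather than necessarily what the original notebook did.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300

# Synthetic feature table with some missing entries (illustrative only).
df1 = pd.DataFrame({
    "doctor_recc_h1n1": rng.integers(0, 2, n).astype(float),
    "opinion_h1n1_risk": rng.integers(1, 6, n).astype(float),
})
df1.loc[rng.choice(n, 30, replace=False), "doctor_recc_h1n1"] = np.nan

# Synthetic label table with the two target columns.
df2 = pd.DataFrame({
    "h1n1_vaccine": rng.integers(0, 2, n),
    "seasonal_vaccine": rng.integers(0, 2, n),
})

# Copy the dependent variables to y and split roughly 67% / 33%.
y = df2.copy()
x_train, x_val, y_train, y_val = train_test_split(
    df1, y, test_size=0.33, random_state=42
)

# Assumption: column means are computed on the training split only and
# reused for the validation split, so no validation data leaks into training.
train_means = x_train.mean()
x_train = x_train.fillna(train_means)
x_val = x_val.fillna(train_means)
```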
In the next step, x_train1 is created to predict the first target variable, h1n1_vaccine, while x_train2 is created to predict the second target variable, seasonal_vaccine. Each contains only the 3 columns of df1 that are most correlated with the corresponding target variable.
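Using the top three columns from each correlation list above, the selection can be sketched as follows. The toy x_train and x_val frames and the x_val1 and x_val2 names are assumptions for illustration.

```python
import pandas as pd

# Top-3 columns per target, taken from the correlation lists in the article.
h1n1_features = [
    "doctor_recc_h1n1",
    "opinion_h1n1_risk",
    "opinion_h1n1_vacc_effective",
]
seasonal_features = [
    "opinion_seas_risk",
    "doctor_recc_seasonal",
    "opinion_seas_vacc_effective",
]

# Toy stand-ins for the imputed train/validation features (made-up values).
columns = h1n1_features + seasonal_features + ["health_insurance"]
x_train = pd.DataFrame(
    [[1, 4, 5, 2, 0, 4, 1], [0, 1, 3, 1, 1, 5, 0]], columns=columns, dtype=float
)
x_val = pd.DataFrame([[0, 2, 4, 4, 1, 4, 1]], columns=columns, dtype=float)

# One feature subset per target variable.
x_train1, x_val1 = x_train[h1n1_features], x_val[h1n1_features]
x_train2, x_val2 = x_train[seasonal_features], x_val[seasonal_features]
```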
The variables required for the comparison table are created in the next code block.
Five machine learning algorithms are called and run for h1n1_vaccine. The accuracy score of each model is added to the h1n1_accuracy variable with the append function.
The same five machine learning algorithms are called and run for seasonal_vaccine. The accuracy score of each model is added to the seasonal_accuracy variable with the append function.
The Model_accuracy_scores variable is converted to a DataFrame, which is then displayed.
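The fit, score, and append pattern described above can be sketched for one target as follows. This is not the original notebook: the data is synthetic, the model settings are assumptions, SGDClassifier is assumed as a stand-in for "Gradient Descent", and XGBoost is left out so the sketch depends only on scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for the three selected h1n1_vaccine features.
X = pd.DataFrame({
    "doctor_recc_h1n1": rng.integers(0, 2, n),
    "opinion_h1n1_risk": rng.integers(1, 6, n),
    "opinion_h1n1_vacc_effective": rng.integers(1, 6, n),
})
# Synthetic target loosely driven by the first two features.
signal = X["doctor_recc_h1n1"] + 0.2 * X["opinion_h1n1_risk"] + rng.normal(0, 0.5, n)
y = (signal > 1.3).astype(int).rename("h1n1_vaccine")

x_train1, x_val1, y_train1, y_val1 = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Assumed model settings; the article does not list its hyperparameters.
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Descent (SGD)": SGDClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
}

model_names = []
h1n1_accuracy = []
for name, model in models.items():
    model.fit(x_train1, y_train1)        # train on the training split
    preds = model.predict(x_val1)        # predict on the validation split
    model_names.append(name)
    h1n1_accuracy.append(accuracy_score(y_val1, preds))

# Comparison table; a seasonal_accuracy column would be built the same way.
model_accuracy_scores = pd.DataFrame(
    {"model": model_names, "h1n1_accuracy": h1n1_accuracy}
)
print(model_accuracy_scores)
```

If the xgboost package is installed, an XGBClassifier instance could be added to the models dictionary in the same way.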
The results are as follows:

The XGBoost algorithm predicted the h1n1_vaccine target variable best, with an accuracy score of 0.824030, while the KNN algorithm showed the weakest performance with an accuracy score of 0.814613. For seasonal_vaccine, Random Forest showed the best performance with an accuracy score of 0.745859, while KNN was again the weakest with an accuracy score of 0.740186. In terms of the accuracy metric, there is very little difference between the algorithms: the gap between the best and worst model is about 0.009 for h1n1_vaccine and about 0.006 for seasonal_vaccine.
This article was prepared as a quick first solution to the vaccination prediction problem and is the first step of a more comprehensive study. In future studies, different metrics, different techniques for selecting independent variables, model hyperparameter optimization, and stacking techniques can be used.
Resources: