Genetic Disorder Prediction:

Preprocessing with KNN-Imputer

Alparslan Mesri
5 min read · May 28, 2023

This article is written by Alparslan Mesri and Ugur Ziya Cifci.

You can download the dataset from this link.

In this case study, you are hired as a Machine Learning Engineer by a government agency. You are given a dataset that contains medical information about children who have genetic disorders. Your task is to predict the following:

  • Genetic disorder
  • Disorder subclass

To kickstart the project, we first import the necessary libraries and assign the data to two variables, namely df_train and df_test.
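The setup can be sketched as follows. The `read_csv` calls are commented out because the actual file paths depend on where you saved the dataset; a tiny stand-in frame is built inline so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical paths -- substitute the actual dataset files you downloaded.
# df_train = pd.read_csv("train.csv")
# df_test = pd.read_csv("test.csv")

# Tiny stand-in frame so the snippet is self-contained:
df_train = pd.DataFrame({
    "Patient Age": [4, 7, 2],
    "Genetic Disorder": ["Mitochondrial genetic inheritance disorders",
                         None,
                         "Single-gene inheritance diseases"],
})
print(df_train.shape)
```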

The df_train variable contains 45 columns, making it difficult to view them all at once. To simplify this, we split it into three parts and took a quick glance at the data.
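One way to split a wide frame into three viewable chunks is `np.array_split` over the column index; this is a sketch with a dummy 45-column frame standing in for the real data.

```python
import numpy as np
import pandas as pd

# Stand-in frame with 45 columns, mirroring the width of the real df_train.
df_train = pd.DataFrame(np.zeros((2, 45)),
                        columns=[f"col_{i}" for i in range(45)])

# Split the column list into three roughly equal parts and preview each.
parts = np.array_split(df_train.columns, 3)
for part in parts:
    print(df_train[part].head())
```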

The target columns, “Genetic Disorder” and “Disorder Subclass”, contain some NaN values. These are the two columns we will ultimately predict. As a simple approach, we dropped every row in which either target column contained a NaN. For a better score, you could instead drop NaN rows for each target column separately.
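The simple approach described above is a single `dropna` with `subset`; here it is on a toy frame (column names match the dataset, values are made up).

```python
import pandas as pd

df_train = pd.DataFrame({
    "Patient Age": [4, 7, 2, 9],
    "Genetic Disorder": ["A", None, "B", "A"],
    "Disorder Subclass": ["a1", "b1", None, "a2"],
})

targets = ["Genetic Disorder", "Disorder Subclass"]

# Drop a row if *any* of the target values is missing.
df_train = df_train.dropna(subset=targets)
```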

As can be seen above, the “Mother’s age” and “Father’s age” columns need binning. After creating the new binned columns, we deleted the original “Mother’s age” and “Father’s age” columns.
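Binning can be done with `pd.cut`; the bin edges and labels below are illustrative assumptions, not the ones from the article, so adjust them to the actual age range.

```python
import pandas as pd

df = pd.DataFrame({"Mother's age": [19, 26, 34, 42, 51],
                   "Father's age": [22, 30, 39, 47, 58]})

# Hypothetical edges/labels -- tune these to the real distribution.
bins = [0, 25, 35, 45, 100]
labels = ["<=25", "26-35", "36-45", "46+"]

for col in ["Mother's age", "Father's age"]:
    df[col + " bin"] = pd.cut(df[col], bins=bins, labels=labels)

# Drop the original continuous columns after binning.
df = df.drop(columns=["Mother's age", "Father's age"])
```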

As a next step, we dropped irrelevant columns. For further research, “Location of Institute” and “Institute Name” might contain some useful information. Nevertheless, as a quick solution, we will skip these columns and drop them.
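A sketch of the drop step; “Patient Id” is an assumed example of an identifier-style column alongside the two named above.

```python
import pandas as pd

df = pd.DataFrame({"Patient Id": ["P1", "P2"],
                   "Institute Name": ["X", "Y"],
                   "Location of Institute": ["A", "B"],
                   "Patient Age": [4, 7]})

# Identifier-style columns carry no predictive signal for this quick pass.
irrelevant = ["Patient Id", "Institute Name", "Location of Institute"]
df = df.drop(columns=irrelevant)
```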

The data still contains string-valued columns, which we need to convert into numerical ones. There are two common encoding techniques: label encoding and one-hot encoding. For columns with exactly 2 unique values, we will use label encoding. Therefore, we need to identify which columns have 2 unique values, which have more, and which have fewer.
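Counting unique values per string column is a one-liner with `nunique`; the toy columns below are stand-ins for the real ones.

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "M"],
                   "Status": ["Alive", "Alive", "Alive"],
                   "Blood test result": ["normal", "abnormal", "inconclusive"]})

obj_cols = df.select_dtypes(include="object").columns

binary_cols   = [c for c in obj_cols if df[c].nunique(dropna=True) == 2]
multi_cols    = [c for c in obj_cols if df[c].nunique(dropna=True) > 2]
constant_cols = [c for c in obj_cols if df[c].nunique(dropna=True) <= 1]
```

Columns with a single unique value (zero variance) can be dropped outright, which is exactly what happens next.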

According to the information provided above, we have identified that the “Test 1–2–3–4–5” and “Parental consent” columns have zero variance. Therefore, we will proceed to drop these columns.

After identifying the columns with 2 unique values, we applied ordinal encoding to those columns. Ordinal encoding is preferred over label encoding in this case because it allows us to handle missing values (NaNs) using parameters such as “handle_unknown” and “unknown_value”. Unlike the label encoder, the ordinal encoder provides the flexibility to encode string columns while preserving the missing values.

To assign the training data to the variable X, we first dropped the target columns from the dataset. After that, we proceeded to split the remaining data into training and validation sets.
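The X/y split and the train/validation split can be sketched like this; `test_size` and `random_state` are illustrative choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"Patient Age": range(10),
                   "Genetic Disorder": ["A", "B"] * 5,
                   "Disorder Subclass": ["a", "b"] * 5})

targets = ["Genetic Disorder", "Disorder Subclass"]
X = df.drop(columns=targets)
y = df[targets]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)
```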

1- Feature Engineering of Train_DF

To prevent data leakage, it is important to perform feature engineering separately on the training and validation data. This ensures that the feature engineering techniques and transformations are applied only to the training data and then replicated on the validation data. By doing this, we avoid any information from the validation set leaking into the training set, which could lead to biased results and inaccurate performance evaluation.

After applying feature engineering techniques, the data is transformed as follows:

  • Numerical columns: The StandardScaler method was used to standardize the numerical columns, which ensures that they have zero mean and unit variance.
  • Categorical columns: The “get_dummies” function was used to create dummy variables for the categorical columns. This technique converts each categorical value into a binary column, indicating the presence or absence of that value in the original column.

Finally, the transformed numerical and categorical variables were concatenated back together to form the updated dataset.
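The scale-then-dummy-then-concat pipeline can be sketched on a toy frame; note the scaler is fitted here on the training split only, which matters for the validation step later.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"Patient Age": [2.0, 6.0, 10.0],
                   "Blood test result": ["normal", "abnormal", "normal"]})

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(include="object").columns

# Standardize numerical columns: zero mean, unit variance.
scaler = StandardScaler()
num_part = pd.DataFrame(scaler.fit_transform(df[num_cols]),
                        columns=num_cols, index=df.index)

# One binary dummy column per category value.
cat_part = pd.get_dummies(df[cat_cols])

# Concatenate the transformed pieces back together.
df_fe = pd.concat([num_part, cat_part], axis=1)
```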

Please note that this is the final state of the data after feature engineering.

To fill the missing values in the dataset, we will use the KNN Imputer, which is a suitable tool for imputing missing values based on the values of the nearest neighbors. Before applying the KNN Imputer, we need to ensure that all the data is in numerical format.

After implementing the KNN Imputer, we will convert the imputed data back into a dataframe format.
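A small worked example of the KNN Imputer: each missing value is replaced by the mean of that feature over the `n_neighbors` closest rows, with distances computed on the non-missing features. Here the missing `a` at row 2 gets the mean of its two nearest neighbours' `a` values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                   "b": [1.0, 2.0, 3.0, 4.0]})

# k=2: the missing a (row with b=3.0) is filled with the mean of the
# a-values of the two rows closest in b, i.e. (2.0 + 4.0) / 2 = 3.0.
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df),
                          columns=df.columns, index=df.index)
```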

2- Feature Engineering of Val_DF

We repeat the same process for the validation data.
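The crucial difference on the validation side is to reuse the transformers fitted on the training data rather than refitting them, as the data-leakage note above explains. A minimal sketch with the scaler:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_train = pd.DataFrame({"Patient Age": [2.0, 6.0, 10.0]})
X_val = pd.DataFrame({"Patient Age": [6.0]})

scaler = StandardScaler().fit(X_train)   # fit on training data only
X_val_scaled = scaler.transform(X_val)   # reuse; do NOT refit on validation

# Dummy columns should likewise be aligned to the training layout, e.g.:
# X_val = X_val.reindex(columns=X_train.columns, fill_value=0)
```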

In the article, the main focus was on demonstrating the use of KNN imputer for handling missing values. Therefore, the prediction part is not extensively covered. However, now that the data is prepared, you can proceed with applying various models and algorithms for prediction. You can explore different machine learning techniques such as decision trees, random forests, logistic regression, or neural networks, depending on the nature of your data and the specific prediction task at hand.

Above, we imported the XGBoost classifier. In this problem, we have two target columns. We applied a one-hot encoder approach to encode the “Genetic Disorder” column into three different columns, and the “Disorder Subclass” column into nine different columns. These encoded columns are the last 12 columns in the dataset.

Since we have multiple target columns with multiple encoded columns, this problem requires a MultiOutput Classifier. You can implement your algorithm, such as XGBoost, inside the MultiOutput Classifier to handle the multi-target classification task.
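A compact sketch of the multi-output setup on synthetic data. To stay self-contained it keeps the two targets as plain label columns (rather than the article's 12 one-hot columns) and uses scikit-learn's `RandomForestClassifier` as a stand-in estimator; `xgboost.XGBClassifier` drops into the same slot.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in: 3-class "Genetic Disorder" labels from
# make_classification, random 9-class "Disorder Subclass" labels.
X, y1 = make_classification(n_samples=80, n_features=6, n_informative=4,
                            n_classes=3, random_state=0)
y2 = np.random.RandomState(0).randint(0, 9, size=80)
Y = np.column_stack([y1, y2])

# MultiOutputClassifier fits one independent classifier per target column.
clf = MultiOutputClassifier(RandomForestClassifier(random_state=0))
clf.fit(X, Y)
pred = clf.predict(X)
```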

Finally, we calculated the accuracy scores for both the “Genetic Disorder” and “Disorder Subclass” predictions.
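Accuracy is simply the fraction of exact matches per target column, as in this tiny example:

```python
from sklearn.metrics import accuracy_score

y_true = ["A", "B", "A", "C"]
y_pred = ["A", "B", "B", "C"]

acc = accuracy_score(y_true, y_pred)  # 3 of 4 correct -> 0.75
```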

As you can see, the accuracy for predicting the “Genetic Disorder” is 42%, and the accuracy for predicting the “Disorder Subclass” is 14%.

Considering that the “Genetic Disorder” column has 3 possible values and the “Disorder Subclass” column has 9 possible values, if we were to make random predictions, we would expect an accuracy of approximately 33% for “Genetic Disorder” and 11% for “Disorder Subclass”. Therefore, the achieved accuracies are slightly higher than random predictions, indicating some level of predictive power in the model.

We hope you enjoyed the article. See you in the next one.

You can follow us on LinkedIn:

https://www.linkedin.com/in/alparslan-mesri-063473106/

https://www.linkedin.com/in/ugur-ziya-cifci/
