Predicting the number of coronavirus cases based on GDP and other factors influencing it

mudassir ali
5 min read · Aug 3, 2020

With coronavirus cases rising around the globe, many researchers are trying to build models that predict the number of cases in each area, which could help in finding new ways to tackle the virus. People are taking different approaches to these machine learning models: some use a country’s GDP, HDI and population, while others use recovered cases, deaths per million, the number of tests done, the number of times each person is tested, and the recovery time for each affected person.

This blog mainly covers the spread of coronavirus and the number of deaths it has caused. It focuses on developing a machine learning model that predicts the number of coronavirus cases in each country from GDP growth (and the factors affecting GDP) and the human development index (HDI). This project was part of my summer internship.

First we needed datasets for the input variables. The dataset consists of three CSV files: the first contains country data with the number of cases and deaths, the second contains GDP and its factors for each country, and the last contains HDI data for each country in the world. We combined the three files into a final data frame using the pandas Python library.

The main problem with the dataset was variance: when we computed the Pearson correlation on the final data frame, we found that very few input columns had a notable correlation with our output column, the number of COVID cases. We needed to build a regression model, and since this is a new problem statement on which very little research has been done, we started testing with the most basic regression models. Plotting each input column against cases per million for each country showed some unusual patterns: no single input column is strictly increasing or decreasing with respect to the number of coronavirus cases. Along with the histograms, the figures below show the probability distribution of each input factor, which tells us which scaler to use to normalize the data, since we are dealing with huge numbers.

Fig 1. X-axis : Cases per Million for each country, Y-axis : HDI for each country.
Fig 2. X-axis : Cases per Million for each country, Y-axis : GDP per Capita for each country.
Fig 3
Fig 4
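The merge and correlation step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the three small data frames stand in for the three CSV files, and the column names (`country`, `cases_per_million`, `gdp_per_capita`, `hdi`, and so on) are assumptions. With real files, each frame would come from `pd.read_csv` instead.

```python
import pandas as pd

# Hypothetical stand-ins for the three CSV files; all column names and values are made up.
cases = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "cases_per_million": [100.0, 250.0, 400.0, 900.0],
    "deaths_per_million": [2.0, 5.0, 9.0, 30.0],
})
gdp = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "gdp_per_capita": [1500.0, 4000.0, 9000.0, 42000.0],
})
hdi = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "hdi": [0.50, 0.62, 0.74, 0.92],
})

# Inner joins keep only the countries present in all three sources.
final_df = cases.merge(gdp, on="country").merge(hdi, on="country")

# Pearson correlation of every numeric input column with the output column.
corr_with_target = (
    final_df.drop(columns="country")
    .corr(method="pearson")["cases_per_million"]
    .drop("cases_per_million")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Sorting the correlations makes it easy to see at a glance which inputs are worth keeping.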

Of all the input columns shown in fig 5, only 9 factors were selected for the ML model, as shown in fig 6: those whose correlation coefficient with the output column is at least 0.2. To normalize the data we considered both a standard scaler, which centers each column to zero mean, and a min-max scaler; overall, both gave similar normalized inputs.

Fig 5. — Dataset
Fig 6. — Revised Dataset with only 9 columns
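A rough sketch of the selection and scaling step is given here, on a tiny made-up data frame. The column names are hypothetical, and comparing the absolute value of the correlation against the 0.2 threshold is an assumption on my part, so that strongly negative correlations would also be kept.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

target_col = "cases_per_million"  # hypothetical name for the output column

# Toy data: one input that tracks the target closely and one that barely does.
df = pd.DataFrame({
    target_col: np.arange(10, dtype=float),
    "strong_factor": 2.0 * np.arange(10) + 1.0,
    "weak_factor": [1, -1, -1, 1, 1, -1, -1, 1, 1, -1],
})

# Keep the inputs whose |Pearson correlation| with the target is at least 0.2.
corr = df.corr(method="pearson")[target_col].drop(target_col)
selected = corr[corr.abs() >= 0.2].index.tolist()

X = df[selected].to_numpy()

# StandardScaler: zero mean and unit variance per column.
X_standard = StandardScaler().fit_transform(X)
# MinMaxScaler: each column rescaled to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

print(selected)
```

Raising the threshold from 0.2 to 0.4 in the same expression gives the stricter selection used later in the post.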

With the data ready, we implemented different regression models on it: Linear Regression, Polynomial Regression, Decision Tree Regressor, SVR and ANN. The results are shown in fig 7. Linear regression performs best, with an accuracy score of 0.57. Because the dataset is small, with only one row per country, there is a risk of overfitting; to guard against this we performed cross validation on all the models.

Fig 7. — Result-1
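For readers who want to try a similar comparison, here is a minimal sketch using scikit-learn. The data is synthetic (generated with `make_regression`), and the hyperparameters are illustrative assumptions, not the settings behind fig 7. In scikit-learn a regressor's `score` method returns R², which I am assuming is comparable to the accuracy score reported above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the country-level data (an assumption, not the real dataset).
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Illustrative model settings; scaling is done inside each pipeline.
models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "polynomial": make_pipeline(
        StandardScaler(), PolynomialFeatures(degree=2), LinearRegression()
    ),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "svr": make_pipeline(StandardScaler(), SVR()),
    "ann": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on the held-out rows

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>13}: {score:.3f}")
```

On this synthetic data the target is linear by construction, so the ranking says nothing about the real dataset; the point is only the shape of the comparison loop.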

Using the cross validation utilities in the sklearn package, each regression model is trained and evaluated on different splits of the same data. By default the model is run five times, each time on a different split, and an accuracy score is computed for each run; for each model we take the maximum of its 5 scores (fig 8). There are slight improvements for each model compared to their previous scores, especially ANN, but linear regression has the highest score, 0.55.

Fig 8. — Result-2
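The cross validation procedure can be sketched with `cross_val_score` from scikit-learn, again on synthetic data rather than the real dataset. Taking the maximum fold score follows the description above; the mean is computed alongside it because it is usually the less optimistic summary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the country-level data (an assumption).
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

# cv=5 splits the data into five folds; each fold is held out once for scoring.
fold_scores = cross_val_score(LinearRegression(), X, y, cv=5)

best_score = fold_scores.max()   # the summary described in this post
mean_score = fold_scores.mean()  # a more conservative alternative

print(fold_scores, best_score, mean_score)
```

The same call works for any of the other regressors by swapping in a different estimator.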

To improve the accuracy score, we reduced the inputs to the 6 variables whose correlation coefficient with the output column is at least 0.4, and ran the above models again on the new inputs. This time the Artificial Neural Network (Multi-Layer Perceptron) achieved the highest accuracy score, 0.67, using the Adam optimizer with an adaptive learning rate, as shown in fig 9.

Fig 9. — Result-3
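A multi-layer perceptron trained with Adam might look like the sketch below, assuming scikit-learn's `MLPRegressor` was the implementation (the post does not say). The layer sizes, learning rate and synthetic data are all illustrative assumptions. Note that in scikit-learn the `learning_rate="adaptive"` option only applies to the `sgd` solver; with `solver="adam"` the per-parameter step sizes are adapted by the optimizer itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with 6 input columns, mirroring the reduced feature set (values are made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
weights = np.array([1.0, -0.5, 0.8, 0.3, -1.2, 0.6])
y = X @ weights + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical architecture: two hidden layers, Adam optimizer.
ann = make_pipeline(
    StandardScaler(),
    MLPRegressor(
        hidden_layer_sizes=(32, 16),
        solver="adam",
        learning_rate_init=0.001,
        max_iter=2000,
        random_state=0,
    ),
)
ann.fit(X_train, y_train)

predictions = ann.predict(X_test)
r2 = ann.score(X_test, y_test)  # R^2 on the held-out rows
print(f"ANN R^2: {r2:.3f}")
```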

To conclude, we compared accuracy scores for different kinds of regression models, along with different scaler transformations. We learnt different cross validation techniques for achieving better accuracy. Next, we plan to create plots using the pydash library and integrate this ML model with those plots.

Feel free to drop your suggestions; you can contact me at