Predicting students’ academic progress and related attributes in first-year medical students: an analysis with artificial neural networks and Naïve Bayes | BMC Medical Education


This study followed the general knowledge discovery process: pre-processing, data analysis, and validation. The following sections detail the process for each phase.

Preprocessing

Data sources

The undergraduate medical program at the National Autonomous University of Mexico (UNAM) Faculty of Medicine has a large student population: 10,104 students in 2022 [36]. When students apply for admission, information about their family environment, socio-economic status and prior academic trajectory is collected through a questionnaire. After students enrol in the program, their knowledge of eight subjects is assessed with a standardized multiple-choice question (MCQ) diagnostic exam. During the program, data on their progress are recorded: grades, type of final exam (regular or resit) and credits achieved. At the end of each academic year, students need a specific number of credits in order to enrol in the following year's courses. Students with fewer than the required credits must re-enrol in the courses they failed and cannot be promoted to the next curricular year, so they lose a year and fall behind in their academic trajectory.

Students were categorised as either regular or irregular, based on the number of credits they obtained during their first academic year:

  • Regular: students who successfully completed all the required first-year courses and obtained all the required credits at that time (value 1).

  • Irregular: students who failed one or more of the required first-year courses and consequently did not obtain all the required credits (value 0).
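The labelling rule above can be sketched as a simple threshold on credits obtained; the credit requirement used here is hypothetical, since the actual first-year requirement comes from the curriculum, not from this example.

```python
# Sketch of the regular/irregular labelling rule. REQUIRED_CREDITS is a
# hypothetical value used only for illustration.
REQUIRED_CREDITS = 52

def academic_status(credits_obtained: int) -> int:
    """Return 1 (regular) if all required credits were obtained, else 0 (irregular)."""
    return 1 if credits_obtained >= REQUIRED_CREDITS else 0

statuses = [academic_status(c) for c in (52, 40, 60)]
```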

Data characteristics

This study used data from 7,976 anonymized students from the 2011 to 2017 cohorts of the program. It included information from the students’ diagnostic exam results, academic history, sociodemographic characteristics and family environment [12, 37, 38]. The main dataset included 48 variables (24 categorical, 8 discrete numerical, and 16 continuous numerical). Table 1 lists all the variables in the dataset, classified into groups: student demographics, family environment, socio-economic status, prior educational experience, type of admission and student progress.

Table 1 Explanatory variables collected at admission to the medical school (UNAM Faculty of Medicine, Mexico City)

Table 2 lists the groups of variables from the students’ performance in the admission diagnostic exam: students’ scores in eight high school subjects, including proficiency in Spanish and English. From this point forward, we will use the term students’ prior knowledge to refer to this group of variables.

Table 2 Variables from scores in the admission diagnostic exam (UNAM Faculty of Medicine, Mexico City)

The dependent variable (ACADEMIC_STATUS_1STY) was calculated using the percentage of credits completed at the end of the first year (PROGRESS).

Dataset preparation

From the initial 7,976 records, 910 (11.4%) were excluded from analysis because they had a significant percentage of missing data (they had little or no information in their demographics survey or they did not take the diagnostic exam). There was a slight difference in the class distribution: 47.8% of the students were categorised as irregular and 52.7% as regular. The EDM models used in this study had different data pre-processing requirements. For the Naïve Bayes model, numeric variables were converted to categorical ones in order to obtain a more balanced distribution of students across each attribute’s possible values. For example, few students share an exact numeric grade, compared with the number that fall within a grade range (e.g., 50–60%). This conversion helps in interpreting how different values influence the model and improves the model’s accuracy [39, 40].
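The grade-range conversion described above can be sketched with pandas binning; the grade values and bin edges below are illustrative, not the study's actual cut-points.

```python
import pandas as pd

# Sketch of converting a numeric grade into categorical grade bands.
# The values and bin edges are hypothetical examples.
grades = pd.Series([55, 62, 71, 48, 90, 67, 83])
bins = [0, 50, 60, 70, 80, 100]
labels = ["<50%", "50-60%", "60-70%", "70-80%", "80-100%"]

# Each student falls into a band instead of keeping an exact grade value.
grade_bands = pd.cut(grades, bins=bins, labels=labels)
```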

For both models, the initial dataset was divided into a “training set” consisting of 80% randomly selected student records, and a “test set” with the remaining 20%. This split was chosen empirically, aiming to balance the models’ accuracy while avoiding overfitting.
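An 80/20 random split of this kind is commonly done with scikit-learn's `train_test_split`; the feature matrix below is synthetic, and stratification by the class label is an assumption, not something the paper states.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Sketch of the 80/20 random train/test split on synthetic records.
X = np.arange(100).reshape(50, 2)   # 50 synthetic students, 2 features
y = np.array([0, 1] * 25)           # synthetic regular/irregular labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)
```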

Artificial neural networks

The categorical variables were converted to numerical values by applying one-hot encoding, which separates the categories within each variable and transforms them into dichotomous variables with a value of 1 if the attribute is present and 0 if it is not [41]. Missing values were replaced using simple imputation with the SimpleImputer class of the scikit-learn Python library. For numerical variables, missing values were replaced by the mean; for categorical variables, the mode was used, since the percentage of missing values was less than 10% [42].
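These two steps (mode/mean imputation with `SimpleImputer`, then one-hot encoding) can be sketched on a toy frame; the column names are illustrative, not the study's real variables.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one categorical and one numerical column, each with a
# missing value. Column names are hypothetical.
df = pd.DataFrame({
    "school_type": ["public", "private", "public", np.nan],
    "exam_score": [65.0, np.nan, 80.0, 72.0],
})

# Mode imputation for the categorical column, then one-hot encoding.
cat = SimpleImputer(strategy="most_frequent").fit_transform(df[["school_type"]])
dummies = pd.get_dummies(pd.Series(cat.ravel(), name="school_type"))

# Mean imputation for the numerical column.
num = SimpleImputer(strategy="mean").fit_transform(df[["exam_score"]])
```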

Naïve Bayes

Continuous numeric variables associated with percentages were categorised into five groups using percentile values as a reference [43]. Categories for discrete numeric variables were redefined so that each one contained a comparable number of cases. Since missing values were treated as a possible value of each variable, imputation techniques were not used.
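Percentile-based categorisation into five groups, with missing values kept as an explicit category, can be sketched with pandas quantile binning; the study implemented Naïve Bayes in R, so this Python sketch on synthetic scores is only illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic percentage scores, one of them missing.
scores = pd.Series([12.0, 35.0, np.nan, 58.0, 71.0, 90.0,
                    44.0, 66.0, 23.0, 80.0, 51.0])

# Five groups at the 20th/40th/60th/80th percentiles (quintiles).
bands = pd.qcut(scores, q=5, labels=["Q1", "Q2", "Q3", "Q4", "Q5"])

# Keep missing values as their own category instead of imputing them.
bands = bands.cat.add_categories("missing").fillna("missing")
```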

Data analysis

Data mining models

The ANN and the Naïve Bayes models were selected due to their reported high performance in classification tasks [39, 44]. Two classification tasks were carried out with both models: one predicting students’ regularity and the other their irregularity. Although predicting both scenarios might seem redundant, this was done to explore whether the models would differ in their results and in the influence of the predictor variables.

Artificial neural networks

ANNs are machine learning algorithms inspired by the physiology of neurons [27], in particular how a neuron transmits an impulse through its connections. In the model, a neuron is a unit that outputs a numeric result by combining its input values, weights and a bias through an activation function [27]. For this study, a multilayer perceptron (MLP) neural network with backpropagation (BP) and two hidden layers was used. The models were built with the Python scikit-learn library for data management and Google’s TensorFlow, through the Keras interface, for setting up and running the models. The ANNs were fine-tuned based on their accuracy, specificity and sensitivity.
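A two-hidden-layer MLP of this kind can be sketched as follows. The paper used TensorFlow/Keras; scikit-learn's `MLPClassifier` is substituted here so the example is self-contained, and the layer sizes and synthetic data are illustrative only.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic data: 200 students, 10 features, and a label that depends
# on the first two features (purely for demonstration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# MLP with two hidden layers trained by backpropagation; the layer
# sizes (32, 16) are an assumption, not the study's configuration.
clf = MLPClassifier(hidden_layer_sizes=(32, 16), activation="relu",
                    max_iter=500, random_state=0)
clf.fit(X, y)
train_accuracy = clf.score(X, y)
```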

A disadvantage of this model is that ANNs are considered “black boxes”: it is impossible to dissect exactly how the network produces a given result or how each variable influences it [27]. However, some methodologies, such as sensitivity analysis, can estimate the influence of each variable on the model. A series of datasets was prepared, each with one variable removed; multiple ANNs were then trained and their accuracy obtained through cross-validation. Finally, the variables were ranked by how much their removal affected the model’s accuracy.
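The leave-one-variable-out procedure above can be sketched as follows, again with synthetic data and a small MLP standing in for the study's networks; by construction only the first feature is informative, so it should rank as most influential.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic data where only feature 0 determines the label.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = (X[:, 0] > 0).astype(int)

def cv_accuracy(features):
    """Mean cross-validated accuracy of a small MLP on the given features."""
    model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
    return cross_val_score(model, features, y, cv=3).mean()

# Remove one variable at a time and measure the accuracy lost.
baseline = cv_accuracy(X)
drop_in_accuracy = {
    i: baseline - cv_accuracy(np.delete(X, i, axis=1)) for i in range(X.shape[1])
}
most_influential = max(drop_in_accuracy, key=drop_in_accuracy.get)
```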

Naïve Bayes

NB is a probabilistic classification method well suited to datasets with a high number of variables. As its name implies, it is based on Bayes’ theorem and assumes that the predictive variables are conditionally independent given the class. It estimates the posterior probability of an event or condition given the values of the predictive variables [45].

The model was created using the R programming language. First, the probability of belonging to a class (e.g. regularity) was calculated for each possible value of each variable. Second, a score was estimated for each variable value, considering its likelihood of belonging to the target class. Third, the score for each student was obtained by adding the individual scores of their variable values. Finally, a ROC curve analysis was carried out to select the optimal score threshold for classifying a student. Multiple models were trained with different thresholds (from −9.73 to 8.48) to determine whether a student was at risk or not. The best threshold (0.43) was chosen by considering the models’ sensitivity and false positive rate.
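The threshold-selection step can be sketched as follows: given per-student scores and true labels, sweep the ROC curve and pick the cut-off that best trades sensitivity against the false positive rate (here via Youden's J statistic, one common criterion; the paper does not name its exact rule). The study implemented this in R; this Python sketch uses synthetic scores.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic labels and scores in which higher scores indicate the
# target class (purely illustrative).
rng = np.random.default_rng(2)
labels = rng.integers(0, 2, size=300)
scores = labels * 2.0 + rng.normal(size=300)

# Sweep candidate thresholds along the ROC curve and keep the one
# maximising sensitivity minus the false positive rate (Youden's J).
fpr, tpr, thresholds = roc_curve(labels, scores)
best_threshold = thresholds[np.argmax(tpr - fpr)]
predicted = (scores >= best_threshold).astype(int)
```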

In contrast with ANNs, with the Naïve Bayes model it is possible to analyse the influence that each variable has on the model based on its predictive value [42]. To better understand the significance of each variable and its values, epsilon values were calculated:

$${\epsilon}_{{X}_{i}}=\frac{{N}_{{X}_{i}}\left[P\left({C}_{k}\mid{X}_{i}\right)-P\left({C}_{k}\right)\right]}{{\left[{N}_{{X}_{i}}\,P\left({C}_{k}\right)\left(1-P\left({C}_{k}\right)\right)\right]}^{1/2}}$$

Where \({C}_{k}\) represents the class, \({X}_{i}\) the attribute in accordance with the response category, and \({N}_{{X}_{i}}\) the number of students with attribute \({X}_{i}\). Categories with epsilon values greater than 2 or lower than −2 are considered significant for prediction [42, 43].
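The epsilon statistic can be computed directly from the counts and probabilities in the formula above; the counts in this sketch are hypothetical, chosen only to show a significant category.

```python
import math

def epsilon(n_attr: int, p_class_given_attr: float, p_class: float) -> float:
    """Standardised difference between the conditional class probability
    for an attribute category and the prior class probability."""
    num = n_attr * (p_class_given_attr - p_class)
    den = math.sqrt(n_attr * p_class * (1 - p_class))
    return num / den

# Hypothetical example: 100 students share the attribute, 70% of them
# are regular, against a 52% overall regular rate.
eps = epsilon(100, 0.70, 0.52)
significant = abs(eps) > 2
```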

Validation

Both models were validated using a test dataset to assess how they would perform with new data. The models were analysed based on their accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. These parameters were used because of their usefulness for designing interventions: they provide a better understanding of the models’ limitations and of how the models can be implemented at an individual or population scale.

Accuracy represents the percentage of correct classifications achieved by a model.

$$accuracy=\frac{Correctly\;classified}{Total\;population}$$

Specificity indicates the percentage of students that do not belong to the target class and were classified correctly by the model.

$$specificity=\frac{True\;negatives}{Actual\;negatives}$$

Sensitivity denotes the percentage of students that belong to the target group and were classified correctly by the model.

$$sensitivity=\frac{True\;positives}{Actual\;positives}$$

The positive predictive value indicates the probability that a student belongs to the target group given that the model predicted they belong to it.

$$ppv=\frac{True\;positives}{Predicted\;positives}$$

The negative predictive value is the probability that a student does not belong to the target group given that the model did not classify them as such.

$$npv=\frac{True\;negatives}{Predicted\;negatives}$$
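The five metrics follow directly from the four cells of a confusion matrix; the counts below are illustrative, not the study's results.

```python
# Hypothetical test-set confusion-matrix counts.
tp, fn, tn, fp = 70, 30, 80, 20

accuracy = (tp + tn) / (tp + fn + tn + fp)
sensitivity = tp / (tp + fn)   # true positives over actual positives
specificity = tn / (tn + fp)   # true negatives over actual negatives
ppv = tp / (tp + fp)           # true positives over predicted positives
npv = tn / (tn + fn)           # true negatives over predicted negatives
```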


