The aim of model building is to select the variables that yield the best model to explain the observed data. Model building relies on statistical methods, experience and common sense. The epidemiologist, not the software package, is responsible for the analysis and the model-building process.

The most common approach to model building is to seek the smallest model (the fewest variables) that still explains the data. The smallest model is preferred because it is also the most stable. A further objective is to provide the best possible control of confounding within the data set.

The selection of variables should start with a careful univariate analysis of each variable. This involves deciding whether the variable is best described as dichotomous, polytomous or continuous, and verifying linearity assumptions. It also involves, before the logistic regression analysis, a careful stratified analysis by means of 2xn contingency tables. This provides a unique way to look at the data (what is in each cell of the 2x2 tables, including zeros).
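As a rough illustration of this stratified look at the data, the sketch below computes stratum-specific and crude odds ratios from 2x2 tables and lists the strata that contain zero cells. The counts, the stratum labels and the function names are hypothetical. The Mantel-Haenszel summary is one common choice for a pooled estimate and is an assumption here, not a method prescribed by this text.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table, or None when a zero cell makes it undefined.

    a = exposed cases, b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases.
    """
    if b == 0 or c == 0:
        return None
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    """Mantel-Haenszel summary odds ratio across strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


def zero_cells(strata):
    """Names of the strata whose 2x2 table contains at least one empty cell."""
    return [name for name, table in strata.items() if 0 in table]


# Hypothetical counts, one 2x2 table per age stratum.
strata = {
    "age < 40": (10, 20, 5, 40),
    "age >= 40": (30, 15, 20, 25),
}

# Collapsing the strata cell by cell gives the crude table.
crude_table = tuple(sum(cells) for cells in zip(*strata.values()))

for name, table in strata.items():
    print(name, "OR =", odds_ratio(*table))
print("crude OR =", round(odds_ratio(*crude_table), 2))
print("Mantel-Haenszel OR =", round(mantel_haenszel_or(strata.values()), 2))
print("strata with zero cells:", zero_cells(strata))
```

Printing each table before any modelling makes sparse or empty cells visible early, which is where a logistic model is most likely to give unstable estimates.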

Once the univariate analysis is completed, we select all variables for which the statistical test gives a p-value below a predefined cut-off level. A cut-off of p < 0.25 is often used. We should also include all variables that we believe have biological or public health importance. According to the literature, a more conservative, traditional level (p < 0.05) does not always identify all variables known to be important. One should also keep in mind that a group of variables that are not individually important in the model may play a collective role (confounding).
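The screening step can be sketched as follows, assuming each candidate variable is dichotomous and summarised in a 2x2 table against the outcome. The variable names, the counts and the helper functions `chi2_test_2x2` and `screen` are hypothetical. The test used is an uncorrected Pearson chi-square, which is only one of several tests that could serve at this stage.

```python
import math


def chi2_test_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and p-value for a 2x2 table."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper tail probability is erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def screen(tables, cutoff=0.25, forced=()):
    """Keep variables with p < cutoff, plus those forced in on subject-matter grounds."""
    return [
        name
        for name, table in tables.items()
        if name in forced or chi2_test_2x2(*table)[1] < cutoff
    ]


# Hypothetical 2x2 tables: (exposed cases, exposed non-cases,
#                           unexposed cases, unexposed non-cases)
tables = {
    "raw_milk": (30, 20, 15, 35),
    "swimming": (22, 28, 20, 30),
    "salad": (28, 22, 22, 28),
    "age_group": (25, 25, 24, 26),
}

# age_group is forced in as a variable judged biologically important.
candidates = screen(tables, cutoff=0.25, forced=("age_group",))
strict = screen(tables, cutoff=0.05)
print("cut-off 0.25:", candidates)
print("cut-off 0.05:", strict)
```

With these made-up counts, `salad` passes the 0.25 cut-off but would be discarded at 0.05, which is the situation the more liberal cut-off is meant to guard against.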

Several methods can be used to arrive at the best-fitting model. They include:

  • forward or backward step by step approach monitored by the analyst,
  • stepwise forward or backward (the software uses a precise algorithm to add or drop variables),
  • the best subset method.
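To give an idea of what a backward approach does at each step, here is a minimal sketch in plain Python. It fits the logistic model by Newton-Raphson and drops, one at a time, the variable whose removal gives the largest likelihood-ratio p-value. The data, the variable names, the removal threshold `p_remove=0.05` and the `protected` argument are assumptions made for this illustration. In practice the fitting would be done by a statistical package, with the analyst reviewing each step.

```python
import math


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def predict(X, beta):
    """Fitted probabilities for design matrix X and coefficients beta."""
    return [1 / (1 + math.exp(-sum(b * x for b, x in zip(beta, xi)))) for xi in X]


def log_likelihood(rows, outcome, variables, iterations=25):
    """Maximised log-likelihood of a logistic model with an intercept."""
    X = [[1.0] + [float(r[v]) for v in variables] for r in rows]
    y = [r[outcome] for r in rows]
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iterations):
        p = predict(X, beta)
        grad = [sum((yi - pi) * xi[j] for xi, yi, pi in zip(X, y, p)) for j in range(k)]
        hess = [[sum(pi * (1 - pi) * xi[j] * xi[l] for xi, pi in zip(X, p))
                 for l in range(k)] for j in range(k)]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    p = predict(X, beta)
    return sum(math.log(pi) if yi else math.log(1 - pi) for yi, pi in zip(y, p))


def backward_elimination(rows, outcome, variables, p_remove=0.05, protected=()):
    """Drop the weakest variable until every remaining one has p <= p_remove."""
    current = list(variables)
    while True:
        candidates = [v for v in current if v not in protected]
        if not candidates:
            return current
        ll_full = log_likelihood(rows, outcome, current)
        p_values = {}
        for v in candidates:
            reduced = [w for w in current if w != v]
            # Likelihood-ratio statistic, 1 degree of freedom per binary variable.
            g = max(0.0, 2 * (ll_full - log_likelihood(rows, outcome, reduced)))
            p_values[v] = math.erfc(math.sqrt(g / 2))
        weakest = max(p_values, key=p_values.get)
        if p_values[weakest] <= p_remove:
            return current
        current.remove(weakest)


# Hypothetical data: (raw_milk, salad, cases, non_cases), repeated for swimming = 0 and 1
# so that swimming is unrelated to illness by construction.
cells = [(0, 0, 5, 25), (0, 1, 12, 18), (1, 0, 15, 15), (1, 1, 24, 6)]
rows = []
for milk, salad, cases, non_cases in cells:
    for swimming in (0, 1):
        for ill, count in ((1, cases), (0, non_cases)):
            rows += [{"ill": ill, "raw_milk": milk, "salad": salad,
                      "swimming": swimming}] * count

print(backward_elimination(rows, "ill", ["raw_milk", "salad", "swimming"]))
```

The `protected` argument stands in for the analyst's judgement: a variable listed there is never dropped, whatever its p-value.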

Once the best model fit has been achieved, the importance of each variable should be verified by comparing the crude association with the results of the model, including a comparison of the confidence intervals and their statistical significance. The process of adding, fitting, dropping and refitting continues until all variables in the model are judged either statistically or biologically important.
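The comparison of crude and adjusted results can be laid out as in the sketch below. The crude table, the coefficient and the standard error are hypothetical values standing in for real output, and the function names are invented for this example. The crude interval uses the Woolf (log-based) approximation and the adjusted interval a Wald approximation, both assumed here for simplicity.

```python
import math


def or_ci_from_table(a, b, c, d, z=1.96):
    """Crude odds ratio with a Woolf (log-based) 95% confidence interval."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


def or_ci_from_coefficient(beta, se, z=1.96):
    """Adjusted odds ratio and 95% confidence interval from a logistic coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def percent_change(crude_or, adjusted_or):
    """Relative change from the crude to the adjusted odds ratio, in percent."""
    return 100 * (adjusted_or - crude_or) / crude_or


# Hypothetical inputs: a crude 2x2 table and the coefficient and standard error
# that a fitted multivariable model might report for the same exposure.
crude = or_ci_from_table(40, 35, 25, 65)
adjusted = or_ci_from_coefficient(beta=0.80, se=0.34)

print("crude    OR %.2f (95%% CI %.2f to %.2f)" % crude)
print("adjusted OR %.2f (95%% CI %.2f to %.2f)" % adjusted)
print("change   %.1f%%" % percent_change(crude[0], adjusted[0]))
```

How large a change counts as meaningful is left to the analyst; no threshold is built into the sketch.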

Once we have a model with all relevant variables, we should then consider whether interaction terms should be added. This presupposes that the categories or linearity assumptions have been verified for polytomous and continuous variables.
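To make the role of an interaction term concrete, a model with two variables and their product can be written as below. The symbols x_1, x_2 and beta_0 to beta_3 are notation assumed here for illustration.

```latex
\operatorname{logit} P(Y = 1 \mid x_1, x_2)
  = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2

% Odds ratio for a one-unit increase in x_1 at a given value of x_2
\mathrm{OR}_{x_1 \mid x_2} = \exp(\beta_1 + \beta_3 x_2)
```

If beta_3 = 0, the odds ratio for x_1 is the same at every level of x_2 and the interaction term adds nothing to the model. If beta_3 differs from 0, the effect of x_1 has to be reported separately for each level of x_2.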
