The aim of model building is to select the variables that result in the
best model to explain the observed data. Model building relies on
statistical methods, experience and common sense. The epidemiologist, not
the software package, is responsible for the analysis and the model
building process.
The most frequent approach to model building is to seek the smallest
model (the fewest variables) that still explains the data. The smallest
model is preferred because it is also the most stable. A further objective
is to provide the best possible control of confounding within the data set.
The selection of variables should start with a careful univariate
analysis of each variable. This involves deciding whether the variable is
best described as dichotomous, polytomous or continuous, and verifying
linearity assumptions. It also involves, before the logistic regression
analysis, a careful stratified analysis by means of 2xn contingency
tables. This provides a unique way to look at the data (what is in each
cell of the tables, including zeros).
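
As an illustration of the tabulation step, the sketch below cross-tabulates
an outcome against a polytomous exposure and lists the empty cells. It is a
minimal example rather than a prescribed procedure: the records, the field
names `ill` and `age_group`, and both helper functions are hypothetical.

```python
from collections import Counter

def contingency_table(records, outcome, exposure):
    """Cross-tabulate outcome (rows) against exposure levels (columns)."""
    counts = Counter((r[outcome], r[exposure]) for r in records)
    outcomes = sorted({r[outcome] for r in records})
    levels = sorted({r[exposure] for r in records})
    # Fill every cell explicitly so that empty cells show up as 0.
    return {o: {lv: counts[(o, lv)] for lv in levels} for o in outcomes}

def zero_cells(table):
    """Return the (outcome, exposure level) pairs whose count is 0."""
    return [(o, lv) for o, row in table.items()
            for lv, n in row.items() if n == 0]

# Hypothetical line list: illness status (1 = case, 0 = non-case) by age group.
records = (
    [{"ill": 1, "age_group": "00-14"}] * 4
    + [{"ill": 0, "age_group": "00-14"}] * 6
    + [{"ill": 1, "age_group": "15-59"}] * 7
    + [{"ill": 0, "age_group": "15-59"}] * 3
    + [{"ill": 0, "age_group": "60+"}] * 5  # no cases in this group
)

table = contingency_table(records, "ill", "age_group")
for outcome_value, row in table.items():
    print(outcome_value, row)
print("Empty cells:", zero_cells(table))
```

An empty cell such as the one in the oldest age group is worth spotting at
this stage, because it can later lead to unstable or undefined estimates
in the regression model.
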
Once the univariate analysis is completed, we select all variables for
which a statistical test gives a p-value below a predefined cut-off level.
A cut-off of p < 0.25 is often used. We should also include all variables
we believe to be of biological or public health importance. According to
the literature, the more conservative, traditional level (p < 0.05) does
not always identify all variables known to be important. One should also
keep in mind that a group of variables that are not individually important
in the model may play a collective role (confounding).
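
The screening rule can be sketched as follows. This is only an
illustration under stated assumptions: each candidate variable is
dichotomous and summarised by a 2x2 table, the test is an uncorrected
Pearson chi-square, and the variable names, counts and the `forced`
argument (variables kept on biological or public health grounds) are
hypothetical.

```python
import math

def chi2_p_value(a, b, c, d):
    """Pearson chi-square test (1 df, no continuity correction), 2x2 table.

    a, b = exposed cases, exposed non-cases
    c, d = unexposed cases, unexposed non-cases
    """
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("test undefined: a row or column total is zero")
    chi2 = n * (a * d - b * c) ** 2 / margins
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

def screen(tables, cutoff=0.25, forced=()):
    """Keep variables with p < cutoff, plus those forced in by the analyst."""
    selected = []
    for name, cells in tables.items():
        if name in forced or chi2_p_value(*cells) < cutoff:
            selected.append(name)
    return selected

# Hypothetical 2x2 counts (a, b, c, d) for four candidate variables.
tables = {
    "ate_salad": (30, 10, 20, 40),
    "age_over_60": (28, 22, 22, 28),
    "sex": (25, 25, 24, 26),
    "smoker": (25, 25, 25, 25),
}

print(screen(tables, cutoff=0.25, forced={"sex"}))
print(screen(tables, cutoff=0.05))
```

With these made-up counts, `age_over_60` has a p-value of about 0.23: it
passes the 0.25 cut-off but would be lost at 0.05, which is the situation
the paragraph above warns about.
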
Several methods can be used to arrive at the best-fitting model. They
include:
- a forward or backward step-by-step approach monitored by the analyst,
- automated stepwise forward or backward selection (the software uses a
  precise algorithm to add or drop variables),
- the best subset method.
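
To make the backward approach concrete, here is a rough sketch of the
elimination loop only. The model fit itself is left to a caller-supplied
function, since it depends on the statistical package in use. The
function names, the removal threshold of 0.05 and the p-values in the
stand-in fit are assumptions for illustration, not values from a real
analysis.

```python
def backward_elimination(candidates, fit_p_values, threshold=0.05, forced=()):
    """Drop the least significant variable, refit, and repeat.

    fit_p_values(variables) is supplied by the caller: it fits the
    logistic model on `variables` and returns {variable: p-value}.
    Variables listed in `forced` are never dropped.
    """
    current = list(candidates)
    history = []
    while current:
        p_values = fit_p_values(current)  # refit on the current set
        droppable = [v for v in current if v not in forced]
        if not droppable:
            break
        worst = max(droppable, key=lambda v: p_values[v])
        if p_values[worst] < threshold:
            break  # every droppable variable is significant: stop
        current.remove(worst)
        history.append((worst, p_values[worst]))  # log for analyst review
    return current, history

# Stand-in for a real model fit: fixed, made-up p-values. In a real
# analysis the p-values change each time the model is refitted.
HYPOTHETICAL_P = {
    "ate_salad": 0.001,
    "age_over_60": 0.04,
    "sex": 0.60,
    "smoker": 0.35,
}

def fake_fit(variables):
    return {v: HYPOTHETICAL_P[v] for v in variables}

kept, dropped = backward_elimination(
    list(HYPOTHETICAL_P), fake_fit, threshold=0.05, forced={"sex"}
)
print("Kept:", kept)
print("Dropped:", dropped)
```

Returning the list of dropped variables keeps the analyst in control: each
removal can be reviewed and reversed if the variable matters on biological
grounds or acts as a confounder.
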
Once the best-fitting model has been obtained, the importance of each
variable should be verified by comparing its crude association with the
results of the model, including the confidence intervals and statistical
significance. The process of adding, fitting, dropping and refitting
continues until all variables in the model are judged either statistically
or biologically important.
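
One way to set up the crude versus adjusted comparison is sketched below.
The crude odds ratio and its Woolf (log-based) confidence interval are
computed from a 2x2 table. The counts and the adjusted odds ratio are
hypothetical, and the adjusted value would in practice be read off the
fitted model.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Crude odds ratio with Woolf 95% CI; requires no zero cell.

    a, b = exposed cases, exposed non-cases
    c, d = unexposed cases, unexposed non-cases
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

def percent_change(crude, adjusted):
    """Relative change of the adjusted estimate from the crude one, in %."""
    return 100 * (adjusted - crude) / crude

crude_or, lower, upper = odds_ratio_ci(30, 10, 20, 40)
adjusted_or = 4.5  # hypothetical value taken from a fitted model
change = percent_change(crude_or, adjusted_or)

print(f"Crude OR {crude_or:.2f} (95% CI {lower:.2f} to {upper:.2f})")
print(f"Adjusted OR {adjusted_or:.2f}, change {change:.0f}%")
```

A sizeable shift between the crude and adjusted estimates suggests that
other variables in the model confound the association. A change of more
than about 10% is a commonly quoted rule of thumb, but that threshold is a
convention and not something prescribed here.
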
Once we have a model with all relevant variables, we should then consider
whether interaction terms should be added. This presupposes that the
categories of polytomous variables and the linearity assumptions for
continuous variables have been verified.
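
One common way to judge whether an interaction term is worth keeping is a
likelihood ratio test between the models with and without the term. The
sketch below assumes a single product term between two dichotomous
variables (1 degree of freedom). The log-likelihood values are
hypothetical and would normally come from the software output.

```python
import math

def lr_test_1df(loglik_without, loglik_with):
    """Likelihood ratio test for one added parameter (1 degree of freedom)."""
    statistic = 2 * (loglik_with - loglik_without)
    if statistic < 0:
        raise ValueError("the larger model cannot have a lower log-likelihood")
    # For 1 degree of freedom, P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical log-likelihoods of two nested models fitted to the same data.
statistic, p_value = lr_test_1df(loglik_without=-120.4, loglik_with=-117.9)
print(f"LR statistic {statistic:.2f}, p = {p_value:.3f}")
```

Both models must be fitted on exactly the same observations. An
interaction term spanning several categories adds more than one parameter
and needs a chi-square distribution with the matching degrees of freedom.
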