Field Epidemiology Manual Wiki

Quality checking

Last modified at 3/3/2016 3:47 PM by Vladimir Prikazsky

The first step in analyzing surveillance data is to assess its quality by detecting data entry errors, inconsistent data and incomplete reporting. This is achieved by computing the frequency distributions of the variables in the data set. Reviewing these distributions makes it possible to detect data entry errors and missing fields, and to correct them where possible.
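As an illustration of this first step, the following minimal Python sketch tabulates the values of one variable and counts blank or missing entries separately. The record layout, the field names (`sex`, `age`) and the helper `frequency_table` are hypothetical and serve only as an example.

```python
from collections import Counter

def frequency_table(records, field):
    """Count each distinct value of `field`; blank or absent values are tallied under None."""
    counts = Counter()
    for rec in records:
        value = rec.get(field)
        if value is None or str(value).strip() == "":
            value = None
        counts[value] += 1
    return counts

# Hypothetical line list with typical quality problems
records = [
    {"sex": "F", "age": 34},
    {"sex": "M", "age": 5},
    {"sex": "m", "age": 250},   # inconsistent coding and an implausible age
    {"sex": "", "age": None},   # incomplete record
]

sex_counts = frequency_table(records, "sex")
age_counts = frequency_table(records, "age")

for value, n in sex_counts.items():
    print(value, n)
```

Reading such a table by eye is often enough to spot unexpected codes (here "m" next to "M"), out-of-range values and the number of missing entries.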

It is not uncommon to notice round-digit attraction (digit preference) in numeric fields such as age (ages ending in 0 and 5 being more frequent than expected) or in dates (days 01, 10, 15 and 20 being overrepresented compared with other days of the month). Such a lack of precision in the data cannot be corrected at the time of analysis, but it needs to be taken into account when interpreting data plotted by age or date.
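One simple way to look for this attraction in age data is to tabulate the last digit of each reported age. The sketch below is only an illustration: the function name and the ages are invented, it assumes a non-empty list of whole-year ages, and the "roughly 20%" benchmark assumes that terminal digits would otherwise be about evenly distributed.

```python
from collections import Counter

def terminal_digit_share(ages):
    """Return the proportion of ages ending in each digit 0-9 (ages must be non-empty)."""
    counts = Counter(int(a) % 10 for a in ages)
    total = sum(counts.values())
    return {digit: counts.get(digit, 0) / total for digit in range(10)}

# Made-up ages in years
ages = [20, 25, 30, 30, 35, 40, 41, 45, 50, 52]

share = terminal_digit_share(ages)
heaped = share[0] + share[5]   # share of ages ending in 0 or 5
print(f"Ages ending in 0 or 5: {heaped:.0%}")
```

A share well above roughly 20% suggests that ages were rounded at the time of reporting. The same approach can be applied to the day of the month in date fields.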

When several date fields are part of the data set, such as date of onset, admission, confirmation or notification, calculating the delays between these sequential steps may highlight data entry errors (e.g. large delays due to an error in the year) or inconsistencies (e.g. negative delays due to confirmation being recorded before onset).
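The delay check described above can be sketched as follows. The function name and the 60-day cut-off are assumptions chosen for illustration; a plausible maximum delay depends on the disease and on the surveillance system.

```python
from datetime import date

def check_delay(onset, confirmation, max_days=60):
    """Return the onset-to-confirmation delay in days and a status flag.

    max_days is an arbitrary illustrative cut-off, not a recommended value.
    """
    delay = (confirmation - onset).days
    if delay < 0:
        return delay, "negative"      # confirmation recorded before onset
    if delay > max_days:
        return delay, "too long"      # possibly an error in the year
    return delay, "ok"

plausible = check_delay(date(2015, 3, 1), date(2015, 3, 5))
reversed_dates = check_delay(date(2015, 3, 10), date(2015, 3, 5))
wrong_year = check_delay(date(2014, 3, 1), date(2015, 3, 5))

print(plausible, reversed_dates, wrong_year)
```

Records flagged as "negative" or "too long" are candidates for verification against the original notification forms rather than for automatic correction.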

Frequency distributions by disease and by age or sex may help detect additional errors (e.g. neonatal tetanus among adults).
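Such cross-checks can be expressed as simple plausibility rules. In the hypothetical sketch below, the rule table, the field names and the 28-day age limit (taken here as the length of the neonatal period) are assumptions made for the example and are not a case definition from this manual.

```python
# Hypothetical rule table: disease -> maximum plausible age in days
MAX_AGE_DAYS = {"neonatal tetanus": 28}

def implausible_records(records):
    """Return the records whose age exceeds the limit defined for their disease."""
    flagged = []
    for rec in records:
        limit = MAX_AGE_DAYS.get(rec["disease"])
        if limit is not None and rec["age_days"] > limit:
            flagged.append(rec)
    return flagged

# Made-up records; ages are expressed in days
records = [
    {"id": 1, "disease": "neonatal tetanus", "age_days": 9},
    {"id": 2, "disease": "neonatal tetanus", "age_days": 35 * 365},  # adult: likely coding error
    {"id": 3, "disease": "measles", "age_days": 4 * 365},            # no rule defined
]

flagged = implausible_records(records)
print([rec["id"] for rec in flagged])
```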

Not all errors can be corrected at the time of analysis. However, it is crucial to gain a good understanding of the quality of the data and its limitations before analyzing and interpreting results.

To design an effective surveillance system, it is necessary to define, for each disease, which surveillance indicators are best suited to trigger signals and which value of the indicator (threshold) is considered abnormal or unusual.

Indicators can be expressed as absolute numbers (usually appropriate for rare diseases with immediate notification), as proportions of notifications for a disease (proportional morbidity, in the absence of denominators) or as incidence rates (weekly notification of the number of cases using the population as denominator, for common diseases).

Indicators have to be defined in terms of time and place (e.g. number of cases/week/district).
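The three ways of expressing an indicator can be illustrated with a short calculation for one disease, one week and one district. All figures below are invented, the function names are hypothetical, and the scaling of the rate per 100,000 population is a common convention assumed here, not one prescribed by the text.

```python
def proportional_morbidity(disease_cases, all_notifications):
    """Share of all notifications that are due to the disease of interest."""
    return disease_cases / all_notifications

def incidence_rate(cases, population, per=100_000):
    """Cases per `per` population for the time period considered."""
    return cases / population * per

# Invented figures for one district and one week
cases = 12                 # absolute number of notified cases
all_notifications = 300    # notifications for all diseases
population = 240_000       # district population

proportion = proportional_morbidity(cases, all_notifications)
rate = incidence_rate(cases, population)

print(f"{cases} cases; {proportion:.1%} of notifications; "
      f"{rate:.1f} per 100,000 per week")
```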

Thresholds are values of indicators above which the disease pattern is considered abnormal or unusual and may require a public health intervention. For most epidemic-prone diseases under immediate notification, the threshold is set to 1, as the occurrence of a single case is considered to require a public health intervention (e.g. acute flaccid paralysis (AFP), rabies, plague...). For more common diseases, thresholds can be set on the rate observed over a given time period (e.g. meningitis in Africa), or based on an increase in comparison with baseline data (e.g. influenza-like illness). Methods for setting thresholds are presented in the chapter "Methods for setting thresholds in time series analysis".
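The sketch below contrasts a fixed threshold with a baseline-derived one. It rests on two assumptions that do not come from this chapter: a signal is raised when the indicator reaches the threshold (so that a threshold of 1 is triggered by a single case), and the baseline threshold is taken as the mean plus two standard deviations of past weekly counts, which is only one of several possible conventions. The weekly counts are invented.

```python
from statistics import mean, stdev

def is_signal(value, threshold):
    """Raise a signal when the indicator reaches or exceeds the threshold (assumed convention)."""
    return value >= threshold

def baseline_threshold(history, k=2):
    """Mean plus k sample standard deviations of historical counts (illustrative method only)."""
    return mean(history) + k * stdev(history)

# Fixed threshold of 1 for a disease under immediate notification
single_case_signal = is_signal(1, threshold=1)

# Baseline-derived threshold for a common syndrome, from invented weekly counts
history = [10, 12, 11, 9, 13]
threshold = baseline_threshold(history)
this_week_signal = is_signal(15, threshold)
quiet_week_signal = is_signal(14, threshold)

print(single_case_signal, round(threshold, 1), this_week_signal, quiet_week_signal)
```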

At this stage, it is also important to define indicators for monitoring the surveillance process itself (e.g. timeliness, completeness).
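A minimal sketch of two such process indicators follows. The definitions used are assumptions made for the example: completeness as the proportion of expected reports that were received, and timeliness as the proportion of received reports that arrived within a deadline. The 7-day deadline and all figures are invented.

```python
def completeness(received, expected):
    """Proportion of expected reports that were actually received (assumed definition)."""
    return received / expected

def timeliness(delays_days, deadline_days):
    """Proportion of received reports that arrived within the deadline (assumed definition)."""
    on_time = sum(1 for delay in delays_days if delay <= deadline_days)
    return on_time / len(delays_days)

# Invented figures: 20 reporting units expected, 18 reported this week
weekly_completeness = completeness(received=18, expected=20)

# Invented reporting delays in days, with a hypothetical 7-day deadline
delays = [1, 2, 2, 3, 8, 10]
weekly_timeliness = timeliness(delays, deadline_days=7)

print(f"Completeness {weekly_completeness:.0%}, timeliness {weekly_timeliness:.0%}")
```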