Organising epidemiological data

 

 

Introduction

 

To use information for decision making, epidemiologists first need to organise the collected data in a standard format that allows many observations to be summarised. Available tools include tables, graphics and diagrams. They facilitate the description and interpretation of distributions, trends and relationships in the data. Organising data in this way also serves to communicate results to various audiences. This is a rigorous task and, even if there are no fixed rules, some guiding principles can be defined.

 

 

Tables

 

A table is a data set organised in rows and columns. The simplest table includes 2 columns. The first column lists the categories in which data are grouped. The second column shows the number of events or individuals falling in each category. A third column may show the percentage of the total that each category represents.

 

Table I: Number of cases of disease X by age groups, among residents of sample-city, 2012.

Age groups              Number of cases
_________________________________________
0-9 years                           422
10-19 years                         783
20-29 years                         565
30-39 years                         904
40-49 years                         237
50-59 years                         676
60-69 years                         898
70-79 years                         239
80 and more years                   120
Unknown                             220
_________________________________________
Total                              5061

 

  • The title should be concise and include the what, when and where content of the table (time, place, person). The title should be preceded by a table number (e.g. Table II).
  • Each row and column should be clearly and concisely labelled. Units of measurements should be noted (e.g. years, meters, cases/1000, etc.). Categories should be mutually exclusive.
  • Row and column totals should always be shown, as should missing or unknown information and any exclusions (as a separate row or in foot notes).
  • Any abbreviation or code should be explained in a foot note (e.g. OR = odds ratio).
  • The source of the data should be mentioned in a foot note unless these are original data.
  • Lines to separate columns are not needed. They are easily replaced by proper alignment and justification of columns. Horizontal lines are reduced to the strict minimum.
  • Any table and its attached foot note should be self-explanatory. No additional text should be needed to understand the table.
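
As a minimal sketch of the layout rules above, the following Python fragment builds a one-variable table with counts, percentages, an explicit "Unknown" row and a total. The counts are hypothetical and are not taken from Table I; the function name `one_variable_table` is an assumption made for illustration.

```python
from collections import Counter

def one_variable_table(values, order):
    """Return rows (category, count, percent) followed by a total row.

    `values` holds one category per case; `order` fixes the logical
    order in which the categories appear in the table.
    """
    counts = Counter(values)
    total = sum(counts.values())
    rows = [(cat, counts.get(cat, 0), 100.0 * counts.get(cat, 0) / total)
            for cat in order]
    rows.append(("Total", total, 100.0))
    return rows

# Hypothetical line list: one age group per case, None when age is missing.
cases = ["0-9 years"] * 4 + ["10-19 years"] * 10 + ["20-29 years"] * 5 + [None]
# Missing values get their own row instead of being silently dropped.
labelled = [c if c is not None else "Unknown" for c in cases]
order = ["0-9 years", "10-19 years", "20-29 years", "Unknown"]

table = one_variable_table(labelled, order)
for category, n, pct in table:
    print(f"{category:<15}{n:>6}{pct:>8.1f}")
```

Keeping "Unknown" as a category of its own means that the rows add up to the total, which is what the third rule asks for.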

 

The above table shows case counts according to only one variable (age group). Data can be broken down further by a second variable or by several other variables. This can be illustrated as follows.

 

Table II

Number of cases of disease X by age groups, sex and X/Y characteristic, among residents of sample-city, 2012.

 

________________________________________________________________________
                                             Number of cases
                                   _____________________________________
Age groups            Sex               X            Y          Total
________________________________________________________________________
0-9 years             Males
                      Females
                      Total

10-19 years           Males
                      Females
                      Total

20-29 years           Males
                      Females
                      Total

30-39 years           Males
                      Females
                      Total

40-49 years           Males
                      Females
                      Total

50-59 years           Males
                      Females
                      Total

60-69 years           Males
                      Females
                      Total

70-79 years           Males
                      Females
                      Total

80 and more years     Males
                      Females
                      Total

Unknown               Males
                      Females
                      Total
________________________________________________________________________
Total                 Males
                      Females
                      Total
________________________________________________________________________
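
A table shell of this kind can be filled from a line list by counting cases in each combination of categories. The sketch below is one possible way to do this in Python; the line list is hypothetical and the helper `count` is an assumption made for illustration.

```python
from collections import Counter

# Hypothetical line list: (age group, sex, characteristic) for each case.
line_list = [
    ("0-9 years", "Males", "X"),
    ("0-9 years", "Females", "Y"),
    ("0-9 years", "Females", "X"),
    ("10-19 years", "Males", "Y"),
    ("10-19 years", "Males", "Y"),
    ("10-19 years", "Females", "X"),
]

cells = Counter(line_list)

def count(age=None, sex=None, characteristic=None):
    """Sum the cells matching the given values; None means 'all categories'."""
    return sum(n for (a, s, c), n in cells.items()
               if age in (None, a) and sex in (None, s)
               and characteristic in (None, c))

# One row per age group and sex, with subtotals and a grand total.
for age in ["0-9 years", "10-19 years", None]:
    for sex in ["Males", "Females", None]:
        row = [count(age, sex, "X"), count(age, sex, "Y"), count(age, sex)]
        print(f"{age or 'Total':<14}{sex or 'Total':<9}"
              + "".join(f"{v:>6}" for v in row))
```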

 

 

 

 

 

 


Two by two tables

 

Cohort studies and case control studies are classical methods used by epidemiologists to identify an association between an exposure and a disease. The crude results of such studies are frequently presented as contingency tables, also called 2 by 2 tables. They can be illustrated as follows.

 

 

Cohort studies

 

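Table III: Cases of disease X according to consumption of food X, among customers of restaurant Y, 29 February 2012.

Consumption of food X      Total     Cases     Risk %     Risk ratio
____________________________________________________________________
Yes                          100        40       40.0     2.0
No                            50        10       20.0     Reference
____________________________________________________________________
Total                        150        50       33.3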
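
The arithmetic behind a cohort table can be sketched as follows, using the counts of Table III (40 cases among 100 persons who ate food X, 10 cases among 50 who did not). The function names are assumptions made for illustration.

```python
def risk(cases, total):
    """Risk (attack rate) as a proportion: cases divided by persons at risk."""
    return cases / total

def risk_ratio(cases_exposed, total_exposed, cases_unexposed, total_unexposed):
    """Risk in the exposed divided by risk in the unexposed (reference) group."""
    return risk(cases_exposed, total_exposed) / risk(cases_unexposed, total_unexposed)

risk_yes = risk(40, 100)      # ate food X
risk_no = risk(10, 50)        # did not eat food X (reference)
risk_all = risk(50, 150)      # all customers
rr = risk_ratio(40, 100, 10, 50)

print(f"Risk if exposed:   {100 * risk_yes:.1f}%")
print(f"Risk if unexposed: {100 * risk_no:.1f}%")
print(f"Overall risk:      {100 * risk_all:.1f}%")
print(f"Risk ratio:        {rr:.1f}")
```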

 

 

 

 

Case control study

 

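Table IV: Cases of disease X and controls according to consumption of food X, among customers of restaurant Y, 29 February 2012.

Consumption of food X      Cases     Controls     Odds ratio
____________________________________________________________
Yes                           80           30     9.3
No                            20           70     Reference
____________________________________________________________
Total                        100          100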
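
In a case control study risks cannot be computed, so the measure of association is the odds ratio, obtained as the cross-product of the four cells. A minimal sketch with the counts of Table IV follows; the function name is an assumption.

```python
def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Cross-product ratio of a 2 by 2 case control table.

    Odds of exposure among cases divided by odds of exposure among controls.
    """
    return (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)

# Table IV: 80 cases and 30 controls ate food X, 20 cases and 70 controls did not.
or_food_x = odds_ratio(80, 20, 30, 70)
print(f"Odds ratio: {or_food_x:.1f}")
```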

 

 

 

 

 

Although epidemiologists cannot analyse data before they are collected, they usually prepare their analysis by designing dummy tables (empty shells) which will later hold the results. This is an important part of any plan of analysis. It helps to ensure that the data to be collected fit the study design, the hypothesis tested and the way questions are asked.

 

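Dummy table for food specific attack rates.

Table V: Cases of gastroenteritis according to consumption of specific food items and beverages, among customers of restaurant X, date.

                           Have eaten                  Did not eat
                    ________________________    ________________________
Food item           Total    Cases    Risk %    Total    Cases    Risk %    Risk ratio    95% CL*
_________________________________________________________________________________________________
Potato salad
Fruit salad
Tiramisu
Roasted chicken
Milk
Beer
_________________________________________________________________________________________________
* CL: confidence limits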
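
Once data are collected, each row of such a dummy table can be filled with a few lines of code. The sketch below is only an illustration: the counts are hypothetical, the function name `food_row` is an assumption, and the 95% confidence limits use the common large-sample approximation on the logarithmic scale, a choice the text above does not specify.

```python
import math

def food_row(item, ate_total, ate_cases, not_total, not_cases):
    """Fill one row of the dummy table for a single food item."""
    risk_ate = ate_cases / ate_total
    risk_not = not_cases / not_total
    rr = risk_ate / risk_not
    # Approximate standard error of ln(RR); it is undefined when a group
    # has no cases, so zero cells would need separate handling.
    se = math.sqrt(1 / ate_cases - 1 / ate_total
                   + 1 / not_cases - 1 / not_total)
    lower = math.exp(math.log(rr) - 1.96 * se)
    upper = math.exp(math.log(rr) + 1.96 * se)
    return {"item": item, "risk_ate": 100 * risk_ate,
            "risk_not": 100 * risk_not, "rr": rr, "cl": (lower, upper)}

# Hypothetical counts: 36 cases among 60 who ate, 8 cases among 40 who did not.
row = food_row("Tiramisu", 60, 36, 40, 8)
print(f"{row['item']:<16}{row['risk_ate']:>6.1f}{row['risk_not']:>6.1f}"
      f"{row['rr']:>6.1f}  {row['cl'][0]:.1f}-{row['cl'][1]:.1f}")
```

Running the same function over every food item produces the complete table, one row per item.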

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Graphics

 

A graphic is a way to visualize quantitative data using a system of coordinates. It helps us to see magnitude, trends, differences and similarities in the data. It is a key aspect in scientific communication whatever the audience.

 

 

Line graphs

 

In epidemiology we use rectangular coordinates. They consist of a horizontal and a vertical line, each with specific units of measurement, which intersect at a right angle. These are the x (horizontal) and y (vertical) axes. The scale used for x is arithmetic. Scales used for y can be arithmetic or logarithmic. We usually express y according to values of x. X (also called the independent variable) usually represents time or classes of a variable. Y (the dependent variable) represents counts, proportions or rates.

 

 

 

Arithmetic line graphs

 

An arithmetic line graph shows the distribution of an event (y) according to x (frequently time in epidemiology). Several events, or several series of data, can be shown in the same line graph. The scale used on the x axis depends on the interval (of time) used to collect the data. The unit of measurement on the y axis depends on the magnitude of the highest value of y.

 

Example of a line graph showing number of tetanus cases reported in France from 1945 to 2003.

 

 

Source: InVS, Saint Maurice, France

 

Tips to select scales on axis:

 

  • Select a y axis shorter than the x axis. The graph will then be horizontal (ratio length y axis / length x axis = 3/5).
  • Always start the y axis with 0.
  • Determine the range of value needed on the y axis by identifying the highest y value in the data set.
  • Choose then an interval on the y axis which fits with the range (interval size which will give enough intervals and show enough details).
  • If y values are missing for one or several values of x, the line should be interrupted accordingly and start again after the gap. However, the x and y axes should stay continuous (e.g. on the x axis, time should always be proportional to distance from zero).
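
The tips on the y axis (start at 0, cover the highest value, choose a sensible interval) can be sketched in a few lines of Python. The rule used here to pick a round interval (1, 2 or 5 times a power of ten) is an assumption made for illustration and is not prescribed by the text; the counts are hypothetical.

```python
import math

def nice_interval(max_value, target_intervals=5):
    """Pick a round interval (1, 2 or 5 times a power of ten) for the y axis."""
    raw = max_value / target_intervals
    power = 10 ** math.floor(math.log10(raw))
    for multiple in (1, 2, 5, 10):
        if multiple * power >= raw:
            return multiple * power

def y_axis_ticks(values, target_intervals=5):
    """Tick values from 0 up to the first tick at or above the highest y value."""
    top = max(values)
    step = nice_interval(top, target_intervals)
    n_steps = math.ceil(top / step)
    return [i * step for i in range(n_steps + 1)]

# Hypothetical yearly case counts; the highest value (183) drives the scale.
counts = [12, 47, 183, 96, 30]
ticks = y_axis_ticks(counts)
print(ticks)
```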

 

The following graphic is incorrect.

 

Arithmetic line graphs can show several categories of the same characteristic (e.g. sex or age groups) on the same graph. The following example shows reported incidence rates of gonorrhoea in Sweden by sex.

 

It is however difficult to show many series of data on the same line graph. The following example is a further breakdown of the above data into 6 age groups. Interpretation becomes more difficult.

 

Source : SMI, Sweden

 

 

 

Semilogarithmic-scale line graphs

 

If we use a logarithmic scale on the y axis and if the x axis remains the same (arithmetic scale), we create a semi-logarithmic scale line graph. With a logarithmic scale on the y axis we represent the relative change of y over time rather than its absolute change over time. Semi-logarithmic scale line graphs are used to present and interpret rates of change over time rather than magnitude of change. They also allow showing very different magnitudes and ranges of rates between two lines (e.g. high incidence and low mortality rates for the same disease).

 

 

Semi-logarithmic scale paper

 

  • On the y axis, intervals are logarithmic and no longer arithmetic.
  • There are several cycles of tick marks on the y axis. Each corresponds to an equal distance on the y axis.
  • The values of one cycle are 10 times greater than the values of the previous cycle.
  • Within a cycle the 10 tick marks are not equally distant (distance from 2 to 3 is different than distance from 3 to 4). Their progression is geometric, not arithmetic.
  • The y axis can cover a large range of y values.
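
The spacing of tick marks on semi-logarithmic paper follows directly from the base 10 logarithm. The short sketch below computes the position of each tick within a cycle and shows that the gaps shrink from 1 to 10, while each cycle has the same length. The function name and the unit cycle length are assumptions made for illustration.

```python
import math

def tick_position(value, cycle_length=1.0):
    """Distance of a tick mark from the bottom of the axis (where value = 1)."""
    return cycle_length * math.log10(value)

# Tick marks of the first cycle (1 to 10) and the gaps between them.
first_cycle = [tick_position(v) for v in range(1, 11)]
gaps = [b - a for a, b in zip(first_cycle, first_cycle[1:])]
for v, gap in zip(range(1, 10), gaps):
    print(f"{v} to {v + 1}: {gap:.3f}")

# Each cycle (1-10, 10-100, ...) covers the same distance on the axis.
length_cycle_1 = tick_position(10) - tick_position(1)
length_cycle_2 = tick_position(100) - tick_position(10)
```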

 

 

 

The following characteristics are noteworthy:

 

  • The slope of the line indicates the rate of change (the relative change) of y over time.
  • A horizontal straight line indicates no change.
  • A straight line sloping upward or downward indicates a constant rate of increase or decrease in the measured indicator (e.g. rate) over time.
  • Two parallel lines indicate similar rates of change over time.

Source: Istituto Superiore di Sanità, Rome
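
The characteristics listed above can be checked numerically. In the sketch below two hypothetical series decline by 20% a year from very different starting levels: their year-to-year differences on the log scale are constant and identical (two parallel straight lines), while the absolute changes are not constant.

```python
import math

# Hypothetical rates per 100 000 population, each falling by 20% a year.
series_a = [500 * 0.8 ** year for year in range(6)]
series_b = [5 * 0.8 ** year for year in range(6)]

def log_slopes(series):
    """Year-to-year differences of log10(y): the slope on a semi-log graph."""
    logs = [math.log10(y) for y in series]
    return [b - a for a, b in zip(logs, logs[1:])]

slopes_a = log_slopes(series_a)
slopes_b = log_slopes(series_b)
# On an arithmetic scale the same series shows shrinking yearly changes.
absolute_changes_a = [b - a for a, b in zip(series_a, series_a[1:])]

print([round(s, 4) for s in slopes_a])
print([round(s, 4) for s in slopes_b])
print([round(c, 1) for c in absolute_changes_a])
```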

 

The following example shows the occurrence of cases and deaths of measles in England and Wales from 1940 to 2002. On an arithmetic scale line graph it is impossible to see the trend in death rates over time, because the very different magnitudes of incidence and mortality rates do not allow both to be shown on the same graph. The solution is to use a logarithmic scale for the y axis, which allows very different rates to be shown together.

 

 

Source : CDSC, HPA, Colindale, UK.

 

 


 

 

 

 

 

 

 

The decision to use an arithmetic or a semi-logarithmic scale line graph depends on what we want to show: absolute magnitudes or rates of change over time.

 

 

 

 

 


Histograms

 

A histogram shows the frequency distribution of a continuous variable. Adjoining columns represent the number of observations in each class interval of the distribution. The area of each column is proportional to the number of observations it represents. There should be no scale break on the x axis; otherwise the graph would not represent 100% of the data and units of area would no longer be proportional to the number of observations.

 

In intervention epidemiology histograms are frequently used to present occurrence (distribution) of onsets of illness according to time. This is frequently called an epidemic curve even if it is not a curve.

 

Several principles apply:

 

  • Time is represented on the x axis.
  • The choice of the appropriate time interval depends upon the duration of the epidemic and upon the incubation period. As a general rule, the time unit on the x axis should be less than one fourth of the incubation period.
  • The x axis should begin before the outbreak, so that any cases occurring before it are shown. These may be background cases or index cases.
  • Each member (case) is centred between the two tick marks limiting a time interval.
  • One square represents one case. Using vertical or horizontal rectangles instead of squares would bias the interpretation of the shape of the curve by falsely creating or masking a peak.
  • In the legend we note beside a square what it represents (1 case).
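
The principles above can be sketched as a small text-based epidemic curve. The onset dates are hypothetical and the function names are assumptions. For readability in plain text the curve is drawn sideways (time runs down the page), with one square per case and empty intervals kept so that the time axis stays continuous.

```python
from collections import Counter
from datetime import date, timedelta

def epidemic_curve(onsets, start, interval_days, n_intervals):
    """Count cases per time interval, keeping empty intervals on the axis.

    Onsets outside the plotted period are not counted.
    """
    counts = Counter((d - start).days // interval_days for d in onsets)
    return [counts.get(i, 0) for i in range(n_intervals)]

def draw(counts, start, interval_days):
    """Text rendering: one line per time interval, one square per case."""
    lines = []
    for i, n in enumerate(counts):
        day = start + timedelta(days=i * interval_days)
        lines.append(f"{day.isoformat()} " + "\u25a0" * n)
    return lines

# Hypothetical dates of onset; the axis starts before the first case.
onsets = [date(2012, 2, 27), date(2012, 3, 1), date(2012, 3, 1),
          date(2012, 3, 2), date(2012, 3, 2), date(2012, 3, 2),
          date(2012, 3, 3), date(2012, 3, 5)]
start = date(2012, 2, 26)
# Assumed incubation period of 4 days, hence a time unit of 1 day.
counts = epidemic_curve(onsets, start, interval_days=1, n_intervals=10)
lines = draw(counts, start, interval_days=1)
print("\n".join(lines))
```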

 

 

The following histogram shows cases of tetanus reported after the tsunami in Banda Aceh, Indonesia, in 2004-2005.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Source: Prof. Leegross, WHO

 

We may show a second variable, or several additional variables, on a histogram by shading the different components of a bar. However, too many components in a bar are difficult to interpret. In this case it is better to draw one histogram for each component.

Source: Prof Leegross, WHO

 

Histograms with unequal class intervals can also be constructed. They are more difficult to draw and to interpret. Whatever the interval, the unit of area used should always be proportional to the amount of information (number of cases).
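
With unequal class intervals, the height of each column has to be the number of cases divided by the width of the interval, so that the area of the column still represents the number of cases. A minimal sketch with hypothetical age classes follows; the function name is an assumption.

```python
def column_heights(class_limits, counts):
    """Height of each histogram column so that area = number of cases.

    `class_limits` holds the lower limit of each class plus the upper
    limit of the last one; height is cases per unit of x (here, per year).
    """
    widths = [hi - lo for lo, hi in zip(class_limits, class_limits[1:])]
    return [n / w for n, w in zip(counts, widths)]

# Hypothetical cases by age class: 0-4, 5-14, 15-44 and 45-64 years.
limits = [0, 5, 15, 45, 65]
cases = [20, 30, 60, 20]
heights = column_heights(limits, cases)

# Check: height times width gives back the number of cases in each class.
areas = [h * (hi - lo) for h, lo, hi in zip(heights, limits, limits[1:])]
print(heights)
print(areas)
```

In this example the 15-44 years class has the most cases but not the tallest column, because its cases are spread over a wider interval.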

 


 

 

Frequency polygons

 

A frequency polygon shows a frequency distribution. It is constructed from a histogram by joining the mid points of the tops of the bars. The first point is on the x axis (y = 0), in the middle of the interval which precedes the first bar of the histogram. The last point is on the x axis, in the middle of the interval immediately following the last bar of the histogram. The important point is that, by joining the mid point of each bar and the x axis at each end, the area under the frequency polygon is exactly the same as the area of the histogram. The principle of the histogram is therefore respected: the same area represents the same amount of data (cases).

 

Frequency polygons represent an easy way to show several histograms on the same graphic.
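
The construction of a frequency polygon, and the equality of areas, can be verified with a short sketch. It assumes a histogram with equal class intervals; the counts and the function names are hypothetical.

```python
def frequency_polygon(first_lower_limit, width, counts):
    """Vertices (x, y) of the frequency polygon of an equal-interval histogram."""
    padded = [0] + list(counts) + [0]   # one empty interval added at each end
    # Vertex i sits at the mid point of interval i - 1 (interval -1 precedes
    # the first bar), at the height of that bar.
    return [(first_lower_limit + (i - 0.5) * width, y)
            for i, y in enumerate(padded)]

def area_under(points):
    """Area between the polygon and the x axis (sum of trapezoids)."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical weekly case counts: intervals of 7 days starting at day 0.
counts = [2, 5, 9, 4, 1]
width = 7
polygon = frequency_polygon(0, width, counts)
histogram_area = width * sum(counts)
polygon_area = area_under(polygon)
print(polygon)
print(histogram_area, polygon_area)
```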

 

 

Bar graphs

 

Bar graphs are methods to display information using only one coordinate. They are mainly used to compare data between discrete categories.

 

The simplest bar graph displays data from a table with one variable. Each bar represents one category. Bar graphs can be organised horizontally or vertically. Vertical bar graphs differ from histograms in that the bars are separated by a space. The height of a bar is proportional to the number of events (e.g. cases) in the category. However, the area of a bar is not necessarily proportional to the number of events, since bars have the same width whatever the width of the category on the x axis (e.g. age groups of different widths). If there is a logical order between categories, it should be respected. Otherwise categories can be ordered by decreasing or increasing value of their bars. Variables in a bar graph are discrete (sex, region, race, etc.) or continuous but organised in categories (e.g. age groups). The x axis does not need to be continuous. The following bar graph shows the distribution of EPIET fellows by country of origin.

 

 

Grouped bar graphs

 

Sometimes several sub-categories are shown next to each other within a larger category (grouped bar graph). The following grouped bar graph represents the distribution of cases of Ebola haemorrhagic fever in Bumba zone, Zaire, in 1976. The graph shows two variables, age group and sex.

 

 

 

The following example illustrates how difficult it is to interpret too many bars on the same grouped bar graph.

 


 

 

 

Stacked bar graphs

 

An alternative to the grouped bar graph is the stacked bar graph. On this type of graph each bar is subdivided into components. The height of each component is proportional to its share of the bar. The following stacked bar graph shows the distribution of cases of Salmonella Typhimurium infection by age and sex in Norway. In each age group the bar is divided into two sub-categories, males and females.

 

Source: National institute of health, Norway

 

 

 

 

 

Component bar graphs

 

Component bar graphs, also called proportional component bar graphs, differ from stacked bar graphs in that they represent proportions rather than absolute values. Each bar has the same height and represents 100% of the data in that bar. Each component of a bar is expressed as the percentage of the total bar that it represents. The following 100% component bar graph shows the same data as above. Proportions are visible but the absolute magnitude of the distribution is no longer visible.

 

Source: National institute of health, Norway
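
Moving from a stacked bar graph to a component bar graph amounts to dividing each component by the total of its bar. The sketch below uses hypothetical counts (not the Norwegian data) and an assumed function name.

```python
def to_percent_components(stacked):
    """Convert counts per component into percentages of each bar's total."""
    result = {}
    for category, components in stacked.items():
        total = sum(components.values())
        result[category] = {name: 100.0 * n / total
                            for name, n in components.items()}
    return result

# Hypothetical counts by age group and sex.
stacked = {
    "0-9 years": {"Males": 30, "Females": 20},
    "10-19 years": {"Males": 5, "Females": 15},
}
component = to_percent_components(stacked)
for category, shares in component.items():
    print(category, {name: f"{pct:.0f}%" for name, pct in shares.items()})
```

Both bars now add up to 100%, although the first represents 50 cases and the second only 20: this is the loss of absolute magnitude mentioned above.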

 

Pie graphs

 

A pie graph is a graph in which the size of each "slice" is proportional to the amount of data (e.g. number of cases) it represents. Pie graphs are used to show the components of a larger group. However, small differences between slices are more difficult to see than differences between bars on a bar graph. This is why the proportion (%) of the total that each slice represents is frequently added on the slice. Different shading or colours can also be used to identify the various slices. In addition, slices can be ordered by decreasing or increasing magnitude. This ordering can be reinforced by a regular darkening gradient of one colour.

 

 

 

Source: National institute of health, Norway

 

 

 

 

Distribution of notifiable diseases reported during the World football cup, 4 June - 10 July 1998

 

 

 

 

Source: National institute of health, Norway

Maps

 

Maps use geographic coordinates to locate events (cases by place of onset, residence, etc.). Field epidemiologists frequently use spot maps and area maps to illustrate the occurrence of disease by place.

 

 

Spot maps

 

In a spot map, a dot or any other symbol is placed on the map at the exact place where the event occurred. This can be very precise when a global positioning system (GPS) is used to locate events. A spot map is useful to locate events but, since it does not take into account the size of the population, it does not show the risk of occurrence of the event by place. Even when many dots are located in an area, this does not tell us whether they simply reflect population density or a higher risk of occurrence of the event.

 

 

 

 

 

 

 

Area maps

 

Area maps use shaded or coloured areas to show counts, risks or rates of an event by place. Shading or colour patterns are organised in a logical order. Usually the darker the area, the higher the count, risk or rate. If risks or rates are used, they are computed for each area from the numerator (number of events in the area during a specific period of time) and the denominator (persons living in the area, or person-time experienced by the area population). The range of the risk distribution across areas is divided into mutually exclusive categories, and areas are shaded or coloured accordingly.
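
The computation behind an area map can be sketched as follows: a rate is computed for each area, then assigned to one of several mutually exclusive categories, each with its own shade. The areas, counts, populations, cut-off values and shade names are all hypothetical.

```python
def rate_per_100000(events, population):
    """Events per 100 000 population over the period considered."""
    return 100000 * events / population

def category(rate, upper_limits):
    """Index of the category containing the rate.

    Categories are [0, u0), [u0, u1), ... plus a last open-ended one,
    so every rate falls in exactly one category.
    """
    for index, limit in enumerate(upper_limits):
        if rate < limit:
            return index
    return len(upper_limits)

# Hypothetical areas: (new cases in one year, resident population).
areas = {"Area A": (12, 400000), "Area B": (45, 300000), "Area C": (90, 250000)}
limits = [5, 10, 20]      # hypothetical cut-offs, cases per 100 000 per year
shades = ["lightest", "light", "dark", "darkest"]

shading = {}
for name, (events, population) in areas.items():
    rate = rate_per_100000(events, population)
    shading[name] = (rate, shades[category(rate, limits)])
    print(f"{name}: {rate:.1f} per 100 000 -> {shading[name][1]}")
```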

 

The following area map illustrates rates of tuberculosis cases in France in 1996. Rates are expressed as the number of new cases per 100 000 population per year. Different shading or colour patterns are used to describe categories of magnitude of incidence rates in each of the 100 French health districts.

 


 


Diagrams and pictograms

 

Diagrams are popular tools among epidemiologists. They are frequently used to illustrate the transmission and spread patterns during an epidemic. The following diagram illustrates some aspect of the transmission during the SARS outbreak in Singapore in 2003.

 

 


 

 

 

 

Use of computers

 

Graphs are now exclusively computer generated. Tasks are becoming very easy to carry out. On the other hand, epidemiologists are becoming more and more dependent upon the possibilities and the limits of computer graphics software packages.

 

Whatever the software used, some principles should probably be respected. We should avoid three-dimensional graphs: they do not improve communication and are not easier to interpret. Colours should be selected according to complementary colour criteria (see chapter on visual aids). Many software packages cannot produce a true histogram; what they call a histogram is in fact a bar graph. Units and scales on the x and y axes are not always clear in some software packages. Most packages cannot produce a perfect epidemic curve (one square = one case). Particularly when the x axis represents time, most packages are not flexible enough to do what the epidemiologist wants.

 

 

 

 

Conclusion

 

Tables, graphics and diagrams are useful and effective tools to summarise and communicate findings from epidemiological studies. A few guiding principles bear repeating here. Two small and simple tables are better than one large complicated table. What is difficult to show in a table may become simpler and clearer on a graph. Any of them may one day be used out of context; they therefore need precise titles, labels, legends, foot notes and sources. These tools help us to interpret and communicate. However attractive computer technology may be, one should always have a clear purpose in mind when choosing a specific type of table, graphic or diagram.