Editor
Original Author
Alain Moren
Contributors
Organising epidemiological data
Introduction
To use information for decision making, epidemiologists first need to organise collected data in a standard format that allows many observations to be summarised. Available tools include tables, graphics and diagrams. They facilitate the description and interpretation of distributions, trends and relationships in the data. Organising data in this way also serves to communicate results to various audiences. This is a rigorous task and, even if there are no fixed rules, some guiding principles can be defined.
Tables
A table is a data set organised in rows and columns. The simplest table includes 2 columns. The first column lists the categories in which data are grouped. The second column shows the number of events or individuals falling in each category. A third column may show the percentage of the total that each category represents.
Table I:
Number of cases of disease X by age groups, among residents of sample-city, 2012.
Age groups            Number of cases
_________________________________________
0-9 years             422
10-19 years           783
20-29 years           565
30-39 years           904
40-49 years           237
50-59 years           676
60-69 years           898
70-79 years           239
80 and more years     120
Unknown               220
Total                 5064
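The optional third column described above (percentage of the total) can be derived directly from the counts. The following is a minimal sketch using the counts of Table I; the function name `add_percentages` is an illustrative choice, and the total is recomputed from the rows rather than typed in.

```python
# Counts from Table I (cases of disease X by age group).
cases_by_age = {
    "0-9 years": 422,
    "10-19 years": 783,
    "20-29 years": 565,
    "30-39 years": 904,
    "40-49 years": 237,
    "50-59 years": 676,
    "60-69 years": 898,
    "70-79 years": 239,
    "80 and more years": 120,
    "Unknown": 220,
}


def add_percentages(counts):
    """Return (category, count, percent of total) rows plus a total row."""
    total = sum(counts.values())
    rows = [(cat, n, round(100 * n / total, 1)) for cat, n in counts.items()]
    rows.append(("Total", total, 100.0))
    return rows


for category, n, pct in add_percentages(cases_by_age):
    print(f"{category:<20}{n:>6}{pct:>8.1f}")
```

Computing the total from the rows keeps the table internally consistent when a count is corrected later.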
The above table shows case counts according to only one variable (age groups). Data can also be broken down by a second or several other variables. This can be illustrated as follows.
Table II
Number of cases of disease X by age groups, sex and X/Y characteristic, among residents of sample-city, 2012.
_____________________________________________________________
                                      Number of cases
                                 ____________________________
Age groups           Sex         X         Y         Total
_____________________________________________________________
0-9 years            Males
                     Females
                     Total
10-19 years
20-29 years
30-39 years
40-49 years
50-59 years
60-69 years
70-79 years
80 and more years
Unknown
Two by two tables
Cohort studies and case-control studies are classical methods used by epidemiologists to identify an association between an exposure and a disease. The crude results of such studies are frequently presented as contingency tables, also called 2 by 2 tables. They can be illustrated as follows.
Cohort studies
Table III: Cases of disease X according to consumption of food X, among customers of restaurant Y, 29 February 2012.
Consumption of
food X            Total    Cases    Risk %    Risk ratio
Yes               100      40       40.0      2.0
No                50       10       20.0      Reference
Total             150      50       33.3
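The risk and risk ratio columns of a cohort table follow from the counts alone. The sketch below recomputes them from the counts of Table III; the function names are illustrative and not part of the original text.

```python
def risk(cases, total):
    """Risk (attack rate) as a proportion: cases / persons at risk."""
    return cases / total


def risk_ratio(cases_exp, total_exp, cases_unexp, total_unexp):
    """Risk among the exposed divided by risk among the unexposed."""
    return risk(cases_exp, total_exp) / risk(cases_unexp, total_unexp)


# Counts from Table III: 40 cases among 100 customers who ate food X,
# 10 cases among 50 customers who did not.
risk_exposed = risk(40, 100)       # 40 / 100
risk_unexposed = risk(10, 50)      # 10 / 50
rr = risk_ratio(40, 100, 10, 50)

print(f"Risk if exposed:   {100 * risk_exposed:.1f}%")
print(f"Risk if unexposed: {100 * risk_unexposed:.1f}%")
print(f"Risk ratio:        {rr:.1f}")
```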
Case-control studies
Table IV: Cases of disease X and controls according to consumption of food X, among customers of restaurant Y, 29 February 2012.
Consumption of
food X            Cases    Controls    Odds ratio
Yes               80       30          9.3
No                20       70          Reference
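In a case-control table the measure of association is the odds ratio, obtained as the cross-product of the four cells. A minimal sketch with the counts of Table IV follows; the function name `odds_ratio` is an illustrative choice.

```python
def odds_ratio(cases_exp, cases_unexp, controls_exp, controls_unexp):
    """Cross-product ratio of a 2 by 2 case-control table."""
    return (cases_exp * controls_unexp) / (controls_exp * cases_unexp)


# Counts from Table IV: 80 exposed and 20 unexposed cases,
# 30 exposed and 70 unexposed controls.
or_food_x = odds_ratio(80, 20, 30, 70)
print(round(or_food_x, 1))  # 9.3
```

The function does not guard against empty cells; a zero among the exposed controls or unexposed cases would need separate handling.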
Although epidemiologists cannot analyse data before they are collected, they usually prepare their analysis by designing dummy tables (empty shells) that will later hold the results. This is an important part of any plan of analysis. It helps ensure that the responses obtained will fit the study design, the hypotheses tested and the way questions are asked.
Dummy table for food-specific attack rates.
Table V: Cases of gastroenteritis according to consumption of specific food items and beverages, among customers of restaurant X, date.
                       Have eaten                Did not eat
Food item          Total   Cases   Risk %    Total   Cases   Risk %    Risk ratio    95% CL*
Potato salad
Fruit salad
Tiramisu
Roasted chicken
Milk
Beer
* CL: confidence limits
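Once data are collected, each row of a dummy table such as Table V can be filled from four counts. The sketch below is one possible way to do it. The counts are invented for illustration only, and the log-based (Katz) method used for the 95% confidence limits is an assumption, since the text does not say which method is intended.

```python
import math


def attack_rate_row(cases_ate, total_ate, cases_not, total_not, z=1.96):
    """Fill one row of a food-specific attack rate table.

    Confidence limits use the log (Katz) approximation, an assumed choice.
    All four counts must be non-zero for this approximation to work.
    """
    risk_ate = cases_ate / total_ate
    risk_not = cases_not / total_not
    rr = risk_ate / risk_not
    se_log_rr = math.sqrt(1 / cases_ate - 1 / total_ate
                          + 1 / cases_not - 1 / total_not)
    return {
        "risk_ate_pct": 100 * risk_ate,
        "risk_not_pct": 100 * risk_not,
        "risk_ratio": rr,
        "cl_lower": math.exp(math.log(rr) - z * se_log_rr),
        "cl_upper": math.exp(math.log(rr) + z * se_log_rr),
    }


# Hypothetical counts: 30 cases among 40 who ate the item,
# 10 cases among 40 who did not.
row = attack_rate_row(30, 40, 10, 40)
print({key: round(value, 2) for key, value in row.items()})
```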
Graphics
A graphic is a way to visualize quantitative data using a system of coordinates. It helps us to see magnitude, trends, differences and similarities in the data. It is a key aspect in scientific communication whatever the audience.
Line graphs
In epidemiology we use rectangular coordinates. They consist of a vertical and a horizontal line, each with specific units of measurement, which intersect at a right angle. These are the x (horizontal) and y (vertical) axes. The scale used for x is arithmetic. Scales used for y can be arithmetic or logarithmic. We usually express y according to values of x. X (also called the independent variable) usually represents classes of a variable or time. Y (the dependent variable) represents counts, proportions or rates.
Arithmetic line graphs
An arithmetic line graph shows the distribution of an event (y) according to x (frequently time in epidemiology). Several events, or several series of data, can be shown in the same line graph. The scale used on the x axis depends on the interval (time) used to collect data. The y unit of measurement depends on the magnitude of the highest value for y.
Example of a line graph showing number of tetanus cases reported in France from 1945 to 2003.
Source: InVS, Saint Maurice, France
Tips to select scales on axis:
The following graphic is incorrect.
Arithmetic line graphs can show several categories of the same characteristic (e.g. age groups) on the same graph. The following example shows reported incidence rates of gonorrhoea in Sweden by sex.
It is however difficult to show many series of data on the same line graph. The following example is a further breakdown of the above data into 6 age groups. Interpretation becomes more difficult.
Source : SMI, Sweden
Semilogarithmic-scale line graphs
If we use a logarithmic scale on the y axis and if the x axis remains the same (arithmetic scale), we create a semi-logarithmic scale line graph. With a logarithmic scale on the y axis we represent the relative change of y over time rather than its absolute change over time. Semi-logarithmic scale line graphs are used to present and interpret rates of change over time rather than magnitude of change. They also allow showing very different magnitudes and ranges of rates between two lines (e.g. high incidence and low mortality rates for the same disease).
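The difference between absolute and relative change can be checked numerically. In the sketch below the rates are hypothetical (a rate that halves at every period). The absolute steps shrink, while the steps on a log10 scale are all equal, which is why such a series appears as a straight line on a semi-logarithmic graph.

```python
import math

# Hypothetical rates per 100 000, halving at every period (not real data).
rates = [800.0, 400.0, 200.0, 100.0, 50.0]

# Change between consecutive periods on an arithmetic scale.
absolute_changes = [b - a for a, b in zip(rates, rates[1:])]

# Position of each point on a logarithmic y axis, and the step between them.
log_positions = [math.log10(r) for r in rates]
log_steps = [b - a for a, b in zip(log_positions, log_positions[1:])]

print(absolute_changes)                  # [-400.0, -200.0, -100.0, -50.0]
print([round(s, 3) for s in log_steps])  # [-0.301, -0.301, -0.301, -0.301]
```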
Semi-logarithmic scale paper
The following characteristics are noteworthy:
Source: Istituto Superiore di Sanità, Rome
The following example shows the occurrence of cases and deaths of measles in England and Wales from 1940 to 2002. On an arithmetic-scale line graph it is impossible to see the trend in death rates over time, because the very different magnitudes of incidence and mortality rates cannot both be shown on the same graph. The solution is to use a logarithmic scale for the y axis, which allows very different rates to be shown together.
Source : CDSC, HPA, Colindale, UK.
The decision to use arithmetic or semi-logarithmic-scale line graphs depends on what we want to show, absolute magnitudes or rates of change over time.
Histograms
A histogram shows the frequency distribution of a continuous variable. Adjoining columns represent the number of observations in each class interval of the distribution. The area of each column is proportional to the number of observations in that column. There should be no scale break in the x axis; otherwise the graph would not represent 100% of the data and units of area would no longer be proportional to the number of observations.
In intervention epidemiology, histograms are frequently used to present the distribution of onsets of illness over time. This is frequently called an epidemic curve, even though it is not a curve.
Several principles apply:
The following histogram shows cases of tetanus reported after the tsunami in Banda Aceh, Indonesia, in 2004-5.
Source: Prof. Leegross, WHO
We may show a second or several additional variables on a histogram by shading the different components of a bar. However, too many components in a bar are difficult to interpret. In this case it is better to draw one histogram for each component.
Source: Prof Leegross, WHO
Histograms with unequal class intervals can also be constructed. They are more difficult to draw and to interpret. Whatever the interval, the unit of area used should always be proportional to the amount of information (number of cases).
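With unequal class intervals, keeping area proportional to the number of cases means that the bar height is the count divided by the class width. The following is a minimal sketch of that calculation; the age classes and counts are hypothetical.

```python
def histogram_heights(class_limits, counts):
    """Bar heights (cases per unit of x) so that bar area equals the count."""
    widths = [hi - lo for lo, hi in zip(class_limits, class_limits[1:])]
    return [n / w for n, w in zip(counts, widths)]


# Hypothetical age classes in years: 0-5, 5-15, 15-45 and 45-65.
limits = [0, 5, 15, 45, 65]
counts = [20, 30, 60, 20]

heights = histogram_heights(limits, counts)
print(heights)  # [4.0, 3.0, 2.0, 1.0] cases per year of age
```

Plotting the raw counts as heights would make the 15-45 class look three times as important as the 0-5 class, although it holds fewer cases per year of age.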
Frequency polygons
A frequency polygon shows a frequency distribution. It is constructed from a histogram by joining the mid points of the tops of the bars. The first point is on the x axis (y = 0), in the middle of the interval which precedes the first bar of the histogram. The last point is on the x axis, in the middle of the interval immediately following the last bar of the histogram. The important point is that, by joining the mid point of each bar and the x axis at each end, the area under the frequency polygon is exactly the same as the area of the histogram. The principle of the histogram is therefore respected: the same area represents the same amount of data (cases).
Frequency polygons represent an easy way to show several histograms on the same graphic.
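The construction described above, and the equality of areas, can be verified with a short sketch. It assumes classes of equal width; the counts are hypothetical and the function names are illustrative.

```python
def frequency_polygon(start, width, counts):
    """Vertices of the frequency polygon for equal-width classes.

    The first and last vertices lie on the x axis, in the middle of the
    empty classes just before the first bar and just after the last bar.
    """
    xs = [start + (i - 0.5) * width for i in range(len(counts) + 2)]
    ys = [0] + list(counts) + [0]
    return list(zip(xs, ys))


def area_under(points):
    """Area under the line joining the points (sum of trapezoids)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical cases per day, first class starting at day 0.
daily_cases = [2, 5, 9, 4, 1]
polygon = frequency_polygon(0, 1, daily_cases)

histogram_area = 1 * sum(daily_cases)
print(area_under(polygon), histogram_area)
```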
Bar graphs
Bar graphs are methods to display information using only one coordinate. They are mainly used to compare data between discrete categories.
The simplest bar graph displays data from a table with one variable. Each bar represents one category. Bar graphs can be organised horizontally or vertically. Vertical bar graphs differ from histograms in that the bars are separated by a space. The height of the bar is proportional to the number of events (e.g. cases) in the category, but the area of the bar is not always proportional to the width of the category on the x axis (e.g. age groups of different widths). If there is a logical order between categories, it should be respected. Otherwise categories can be ordered by decreasing or increasing value of their bars. Variables in a bar graph are discrete (sex, region, race, etc.) or continuous but organised in categories (e.g. age groups). The x axis does not need to be continuous. The following bar graph shows the distribution of EPIET fellows by country of origin.
Grouped bar graphs
Sometimes several sub-categories can be shown next to each other within a larger category (grouped bar graphs). The following grouped bar graph represents the distribution of cases of Ebola haemorrhagic fever in Bumba zone, Zaire, in 1976. The graphic shows two variables, age groups and sex.
The following example illustrates how difficult it is to interpret too many bars on the same grouped bar graph.
Stacked bar graphs
An alternative to the grouped bar graph is the stacked bar graph. On this type of graph a bar is subdivided into components. The height of each component is proportional to its share of the bar. The following stacked bar graph shows the distribution of cases of Salmonella Typhimurium infection by age and sex in Norway. In each age group the bar is divided into two sub-categories, males and females.
Source: National institute of health, Norway
Component bar graphs
Component bar graphs, also called proportional component bar graphs, differ from stacked bar graphs in that they represent proportions rather than absolute values. Each bar has the same height and represents 100% of the data in that bar. Components of the bar are expressed as the percentage of the total bar they represent. The following 100% component bar graph shows the same data as above. Proportions are visible but the absolute magnitude of the distribution is no longer visible.
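Converting a stacked bar graph into a component bar graph amounts to dividing each component by the total of its bar. The sketch below uses hypothetical counts by age group and sex, chosen so that two bars of very different size end up looking identical once expressed as percentages.

```python
def to_components(stacked):
    """Convert counts per bar into percentages that sum to 100 per bar."""
    result = {}
    for bar, parts in stacked.items():
        total = sum(parts.values())
        result[bar] = {name: 100 * n / total for name, n in parts.items()}
    return result


# Hypothetical counts (not the Norwegian data referred to in the text).
stacked_counts = {
    "0-9 years": {"males": 30, "females": 20},
    "10-19 years": {"males": 6, "females": 4},
}

components = to_components(stacked_counts)
print(components)
```

Both age groups come out as 60% males and 40% females, although the first bar holds five times as many cases as the second.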
Pie graphs
A pie graph is a graph in which the size of each "slice" is proportional to the amount of data (e.g. number of cases) it represents. Pie graphs are used to show the components of a larger group. However, small differences between slices are more difficult to see than differences between bars on a bar graph. This is why the proportion (%) of the total that each slice represents is frequently added on the slice. Different shading or colours can also be used to identify the various slices. In addition, slices can be ordered by decreasing or increasing magnitude. This can also be supported by a regular darkening gradient of a colour.
Distribution of notifiable diseases reported during the World football cup, 4 June - 10 July 1998
Maps
Maps use geographic coordinates to locate events (cases by place of onset, residence, etc.). Field epidemiologists frequently use spot maps and area maps to illustrate the occurrence of disease by place.
Spot maps
In a spot map, a dot or any other symbol is placed at the exact location where the event occurred. This can be very precise when a global positioning system (GPS) is used to locate events. A spot map is useful to locate events, but since it does not take into account the size of the population it cannot show the risk of occurrence of the event by place. Even when many dots are located in an area, the map does not tell us whether this simply reflects population density or a higher risk of occurrence of the event.
Area maps
Area maps use shaded or coloured areas to show counts, risks or rates of an event by place. Shading or colour patterns are organised in a logical order. Usually the darker the area, the higher the count, risk or rate. If risks or rates are used, they are computed for each area taking into account the numerator (number of events in the area during a specific period of time) and the denominator (persons living in the area, or person-time experienced by the area population). The range of the risk distribution by area is divided into mutually exclusive categories, which are shaded or coloured accordingly.
The following area map illustrates rates of tuberculosis cases in France in 1996. Rates are expressed as the number of new cases per 100 000 population per year. Different shading and colour patterns are used to describe categories of magnitude of incidence rates in each of the 100 French health districts.
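The two steps behind an area map, computing a rate per area and assigning it to one of several exclusive categories, can be sketched as follows. The areas, counts, populations and class limits are all hypothetical; they are not the French tuberculosis data.

```python
def rate_per_100000(events, population):
    """Events per 100 000 population over the study period."""
    return 100000 * events / population


def shade_class(rate, upper_limits):
    """Index of the shading class; classes are exclusive and ordered.

    Class i holds rates below upper_limits[i] and not below the previous
    limit; rates at or above the last limit fall in the darkest class.
    """
    for index, limit in enumerate(upper_limits):
        if rate < limit:
            return index
    return len(upper_limits)


# Hypothetical areas: (number of new cases, resident population).
areas = {"A": (12, 400000), "B": (45, 300000), "C": (50, 200000)}
limits = [5, 10, 20]  # classes: <5, 5 to <10, 10 to <20, 20 and more

classes = {}
for name, (events, population) in areas.items():
    rate = rate_per_100000(events, population)
    classes[name] = shade_class(rate, limits)
    print(name, round(rate, 1), classes[name])
```

A higher class index would be drawn with a darker shade, in line with the convention described above.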
Diagrams and pictograms
Diagrams are popular tools among epidemiologists. They are frequently used to illustrate transmission and spread patterns during an epidemic. The following diagram illustrates some aspects of transmission during the SARS outbreak in Singapore in 2003.
Use of computers
Graphs are now exclusively computer generated. Tasks are becoming very easy to carry out. On the other hand, epidemiologists are becoming more and more dependent on the possibilities and the limits of computer graphics software packages.
Whatever the software used, some principles should be respected. We should avoid three-dimensional graphs: they do not improve communication and they are not easier to interpret. Colours should be selected according to complementary colour criteria (see chapter on visual aids). Many software packages cannot produce a histogram; what they call a histogram is in fact a bar graph. Units and scales on the x and y axes are not always clear in some software packages. Most software cannot draw a perfect epidemic curve (one square = one case). Particularly when the x axis represents time, most software is not flexible enough to do what the epidemiologist wants.
Conclusion
Tables, graphics and diagrams are useful and effective tools to summarise and communicate findings from epidemiological studies. A few guiding principles are worth repeating here. Two small and simple tables are better than one large complicated table. What is difficult to show in a table may become simpler and clearer on a graph. Any table or graph may one day be used out of context, so each needs a precise title, labels, legend, footnotes and sources. These tools help us to interpret and communicate. However attractive computer technology is, one should always have a clear purpose in mind when choosing a specific type of table, graphic or diagram.
2 Comments
Webster posted on 8/8/2010 9:24:14 PM:
Hey Alain and Ágnes,
Congratulations to a clear and concise delineation on how to present data.
I have only one general issue, which I like to bring to your attention in short. The web-based format offers some advantages over the classical book form, and I am unsure whether so far the medium has been suitably exploited/explored. Namely, the "introduction" quite rightly states that the way data are most appropriately displayed depends on the scale of the variable of interest. Well, imagine a new intrepid field epidemiologist (NIFE) is eager to display his freshly collected data and has appropriately started by determining the scale of his/her variables. What now? (S)he knows that the variable is measured on, say, a nominal scale, but is unsure as to how best summarise the information of the variable. (S)he needs to read the entire chapter to find out which of all the different possibilities applies to a particular scale.
An alternative would be to have a page where to every variable scale (type) one would find the tables, graphs, etc. that are an appropriate display. For example:
Nominal variable: Tables: frequency table. Bars: simple bar chart (depending on how many groups). And so on...
One could also display it in a table with two columns (gridlines invisible). On the left are the scales of the variables (nominal, etc.), and on the right the different presentation formats, ideally blockwise (according to tables, graphs, etc.). If you then click on the scale of a variable, the "Appropriate Presentation Formats" (APFs) are highlighted in bold, or arrows would point from the scale to the different APFs, or otherwise. By clicking on the single APFs, one would jump to the appropriate text.
Even without such a (new) "decision tree", for each display tool (e.g. stacked bar chart), for each APF, it should be stated for which variable type it is suitable (ideally at the beginning or end). For example, it doesn't tell you for which variable type line graphs are appropriate.
The rest of my few comments are not nitty-gritty but rather picky:
- The order of the Headings (links) of the chapter on the bottom is not the same as on the right hand side.
- it would be good to have a button at the end of each page that jumps you back to the beginning rather than having to scroll up there.
- "case-control study" should be hyphenated.
Subchapter "Types of variables":
- Numerical variable is introduced in plural; sometimes an "A" precedes the type of variable, sometimes not.
- the point "organisation of data" is rather slim. Consider deleting it from the title of the subheading chapter and place the text (even with a line list example) before the introduction of variable types.
Subchapter "Types of variables":
- there is a table-heading called "two-by-two tables". Consider using the generic term "contingency tables"
- dummy tables: I tend to put the column "cases" before "total", anyway.
Subchapter "Other types .."
- it should be explained what a box-and-whisker plot displays.
Hope this mail finds you well and you will find some of it helpful.
Dirk
Agnes Hajdu replied on 8/12/2010 11:17:34 PM:
Dirk,
Thank you for the thorough review of the chapter and for the valuable suggestions! (and sorry for the late reply...)
Currently I am exploring possible formats for the guide you recommended - summarising appropriate displays for each type of data (with conditions that may apply). More challenging than I thought. :-) Your minor comments are also appreciated, will do the modifications.
Alain is on holidays now, he promised to reply after the 16th. I hope you are available to discuss any issues remaining.
Thanks again!
Ágnes