## Graphs, charts and diagrams

A graph is a way to visualize quantitative data and the relationships between variables by using a system of coordinates. Examples include line graphs, histograms and bar graphs. Graphs help us to see magnitude, trends, differences and similarities in the data. They are a key part of scientific communication for any audience.

### Line graphs

In epidemiology we typically use rectangular coordinates: a horizontal line and a vertical line, each with its own unit of measurement, which intersect at a right angle. These are the x (horizontal) and y (vertical) axes. The scale used for x is arithmetic. The scale used for y can be arithmetic or logarithmic. Each point on a graph represents a relationship and is determined by a pair of coordinates (x and y). We usually express y according to the values of x. X (the independent variable) usually represents time or the classes of another variable. Y (the dependent variable) represents counts, proportions or rates.

#### Arithmetic line graphs

An arithmetic line graph shows the distribution of an event (y) according to x (frequently time in epidemiology). Several events, i.e. several series of data, can be shown on the same line graph. The scale used on the x-axis depends on the interval (time) used to collect the data. The unit of measurement on the y-axis depends on the magnitude of the highest value of y. The example below is a line graph of the number of tetanus cases reported in France from 1945 to 2003.

Source: InVS, Saint Maurice, France

Tips to select scales on the axes:

- Make the y-axis shorter than the x-axis, so that the graph is wider than it is tall (ratio of y-axis length to x-axis length = 3/5).
- Always start the y-axis at 0.
- Determine the range of values needed on the y-axis by identifying the highest y value in the data set.
- Then choose an interval on the y-axis that fits this range (an interval size which gives enough intervals and shows enough detail).
- If y values are missing for one or several values of x, the line should be interrupted accordingly and started again after the gap. However, the x and y axes should stay continuous (e.g. on the x-axis, time should always be proportional to the distance from zero).
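As an illustration of the tips above, here is a minimal Python sketch. It assumes that a "fitting" interval is 1, 2 or 5 times a power of ten and that about five intervals are wanted. Both are assumptions made for the example, not rules from this chapter. The function names `choose_y_axis` and `fill_gaps` are hypothetical.

```python
import math


def choose_y_axis(max_value, target_intervals=5):
    """Return (interval, axis_top) for a y-axis that starts at 0.

    Assumption: the interval is rounded up to 1, 2 or 5 times a power
    of ten so that about `target_intervals` intervals cover the data.
    """
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    raw = max_value / target_intervals
    magnitude = 10 ** math.floor(math.log10(raw))
    for factor in (1, 2, 5, 10):
        interval = factor * magnitude
        if interval >= raw:
            break
    # The axis ends at the first multiple of the interval above the data
    axis_top = math.ceil(max_value / interval) * interval
    return interval, axis_top


def fill_gaps(series, start, end):
    """Return [(x, y_or_None)] for every x from start to end inclusive.

    Missing x values are kept with y = None, so the line can be
    interrupted there while the x-axis itself stays continuous.
    """
    return [(x, series.get(x)) for x in range(start, end + 1)]


# Hypothetical yearly counts with no report for 2001
cases = {2000: 470, 2002: 310}
print(choose_y_axis(max(cases.values())))
print(fill_gaps(cases, 2000, 2002))
```

Keeping the missing year in the series, rather than dropping it, is what prevents a "time gap" on the x-axis.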

The following graph is incorrect since the x-axis is interrupted and there is a "time gap".

Arithmetic line graphs can show several categories of the same characteristic (e.g. sex or age groups) on the same graph. The following example shows reported incidence rates of gonorrhoea in Sweden by sex.

However, it is difficult to show many series of data on the same line graph. The following example is a further breakdown of the above data into six age groups. Interpretation becomes more difficult.

Source: SMI, Solna, Sweden

#### Semi-logarithmic scale line graphs

If we use a logarithmic scale on the y-axis and keep an arithmetic scale on the x-axis, we create a semi-logarithmic scale line graph. With a logarithmic scale on the y-axis we represent the relative change of y over time rather than its absolute change. Semi-logarithmic scale line graphs are used to present and interpret rates of change over time rather than magnitudes of change. They also make it possible to show lines with very different magnitudes and ranges of rates on the same graph (e.g. high incidence and low mortality rates for the same disease).

Semi-logarithmic scale paper:

- On the y-axis, intervals are logarithmic and no longer arithmetic.
- There are several cycles of tick marks on the y-axis. Each corresponds to an equal distance on the y-axis.
- The values of one cycle are 10 times greater than the values of the previous cycle.
- Within a cycle the 10 tick marks are not equally spaced (the distance from 2 to 3 differs from the distance from 3 to 4). Their progression is geometric, not arithmetic.
- The y-axis can cover a large range of y values.
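The spacing of tick marks described above follows from plotting each value at its base-10 logarithm. The short Python sketch below illustrates this. The function name `tick_position` is hypothetical.

```python
import math


def tick_position(value):
    """Distance of a tick from the tick for 1, in units of one cycle."""
    return math.log10(value)


# Within a cycle, the ticks get closer together as the values increase
gap_2_to_3 = tick_position(3) - tick_position(2)
gap_3_to_4 = tick_position(4) - tick_position(3)
print(gap_2_to_3 > gap_3_to_4)

# Multiplying a value by 10 always moves it up by exactly one cycle
print(tick_position(50) - tick_position(5))
print(tick_position(5000) - tick_position(500))
```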

The following characteristics are noteworthy:

- The slope of the line indicates the rate of change (the relative change) of y over time.
- A horizontal straight line indicates no change.
- A straight line sloping upward or downward indicates a constant rate of increase or decrease in the measured indicator (e.g. rate) over time.
- Two parallel lines indicate the same rate of change over time.
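These characteristics can be checked numerically. The sketch below uses two hypothetical series that both fall by 20% per year but differ 100-fold in magnitude. The figures are invented for illustration and the function name `log_slopes` is an assumption.

```python
import math


def log_slopes(values):
    """Change in log10(y) between consecutive, equally spaced time points."""
    logs = [math.log10(v) for v in values]
    return [later - earlier for earlier, later in zip(logs, logs[1:])]


# Hypothetical rates, both decreasing by 20% each year
incidence = [1000 * 0.8 ** year for year in range(6)]
mortality = [10 * 0.8 ** year for year in range(6)]

# On an arithmetic scale the yearly changes differ from year to year
print([round(b - a, 1) for a, b in zip(incidence, incidence[1:])])

# On a logarithmic scale every yearly step is the same (a straight line),
# and it is the same for both series (parallel lines)
print([round(s, 4) for s in log_slopes(incidence)])
print([round(s, 4) for s in log_slopes(mortality)])
```

A constant relative change therefore appears as a constant slope, whatever the magnitude of the series.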

Source: Istituto Superiore di Sanità, Rome, Italy

The following example shows cases of and deaths from measles in England and Wales from 1940 to 2002. On an arithmetic-scale line graph it is impossible to see the trend in death rates over time, because the magnitudes of the incidence and mortality rates are too different for both to be read on the same graph. The solution is to use a logarithmic scale for the y-axis, which allows very different rates to be shown together.

Figure XX: Arithmetic-scale line graph xx

Figure XX: Semi-logarithmic scale line graph

Source: CDSC, HPA, Colindale, UK

The decision to use an arithmetic or a semi-logarithmic scale line graph depends on what we want to show: absolute magnitudes or rates of change over time.

### Histograms

A histogram shows the frequency distribution of a continuous variable. Adjoining columns represent the number of observations in each class interval of the distribution. The area of each column is proportional to the number of observations it contains. There should be no scale break on the x-axis; otherwise the graph would not represent 100% of the data and units of area would no longer be proportional to the number of observations. Histograms can help visualise gaps in the data, outliers or other unusual observations.

In intervention epidemiology histograms are frequently used to present the distribution of onsets of illness over time. Such a histogram is commonly called an epidemic curve, even though it is not a curve.

Several principles apply:

- Time is represented on the x-axis.
- The choice of appropriate time interval depends on the duration of the epidemic and on the incubation period. As a general rule, the time unit on the x-axis should be less than one fourth of the incubation period.
- The x-axis begins before the outbreak, so that any cases occurring earlier are shown. These can be background cases or index cases.
- Each case is centred between the two tick marks that delimit its time interval.
- One square represents one case. Using vertical or horizontal rectangles instead of squares would bias the interpretation of the shape of the curve by falsely creating or masking a peak.
- In the legend, we indicate beside a square what it represents (1 case).
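The counting step behind an epidemic curve can be sketched in Python as follows. The onset dates are hypothetical, the function name `epidemic_curve` and its parameters are assumptions, and starting the x-axis a fixed number of intervals before the first onset is only one possible way to leave room before the outbreak.

```python
from collections import Counter
from datetime import date, timedelta


def epidemic_curve(onset_dates, interval_days, lead_intervals=2):
    """Count cases per time interval on a continuous x-axis.

    Returns a list of (interval_start, n_cases). The axis starts
    `lead_intervals` intervals before the first onset (an assumption
    for this sketch) and includes the intervals with no cases.
    """
    start = min(onset_dates) - timedelta(days=interval_days * lead_intervals)
    counts = Counter((d - start).days // interval_days for d in onset_dates)
    n_intervals = max(counts) + 1
    return [
        (start + timedelta(days=i * interval_days), counts.get(i, 0))
        for i in range(n_intervals)
    ]


# Hypothetical dates of onset; interval_days should be chosen
# according to the incubation period, as described above
onsets = [date(2005, 1, 3), date(2005, 1, 3), date(2005, 1, 4), date(2005, 1, 8)]
for interval_start, n_cases in epidemic_curve(onsets, interval_days=2, lead_intervals=1):
    # One square per case
    print(interval_start.isoformat(), "#" * n_cases)
```

Intervals without cases are kept in the output so that there is no scale break on the x-axis.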

The following histogram shows cases of tetanus reported after the Tsunami in Banda Aceh, Indonesia in 2004-2005.

Source: Prof. Leegross, WHO

We may show a second variable, or several additional variables, on a histogram by shading the different components of a bar. However, too many components in a bar may be difficult to interpret. In this case it is better to draw one histogram for each component.

Source: Prof. Leegross, WHO

Histograms with unequal class intervals can also be constructed. They are more difficult to create and to interpret. Whatever the interval, the unit of area used should always be proportional to the amount of information (number of cases).
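With unequal class intervals, keeping area proportional to the number of cases means that the height of a column is the number of cases divided by the width of the interval. A minimal Python sketch follows. The age classes and counts are hypothetical and the function name `column_heights` is an assumption.

```python
def column_heights(class_limits, counts):
    """Height of each column so that height x width = number of cases."""
    widths = [upper - lower for lower, upper in zip(class_limits, class_limits[1:])]
    return [n / width for n, width in zip(counts, widths)]


# Hypothetical age classes 0-4, 5-14 and 15-44 years (limits 0, 5, 15, 45)
limits = [0, 5, 15, 45]
cases = [10, 20, 60]
print(column_heights(limits, cases))
```

In this example the three columns have the same height although the counts differ, because the wider classes spread their cases over more years of age.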


### Frequency polygons

A frequency polygon shows a frequency distribution. It is constructed from a histogram. The frequency polygon is a polygon (a closed two-dimensional figure made of straight line segments) joining the midpoints of the tops of the bars of a histogram. The first point is on the x-axis (y = 0), in the middle of the interval preceding the first bar of the histogram. The last point is on the x-axis, in the middle of the interval immediately following the last bar. The important point is that, by joining the midpoint of each bar and the x-axis at each end, the area under the frequency polygon is exactly the same as the area of the histogram. The principle of the histogram is therefore respected: the same area represents the same amount of data (cases).
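The construction can be sketched in Python for a histogram with class intervals of equal width, which is the case assumed here. The counts are hypothetical and the function names are assumptions. The sketch also checks that the area under the polygon equals the area of the histogram.

```python
def frequency_polygon(first_left_edge, width, counts):
    """Vertices (x, y) of the frequency polygon of an equal-width histogram."""
    # Add the empty intervals before the first bar and after the last bar
    padded = [0] + list(counts) + [0]
    first_midpoint = first_left_edge - width / 2
    return [(first_midpoint + i * width, y) for i, y in enumerate(padded)]


def area_under(points):
    """Area between the polygon and the x-axis (sum of trapezoids)."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Hypothetical histogram: three bars of width 2 starting at x = 0
counts = [3, 5, 2]
polygon = frequency_polygon(first_left_edge=0, width=2, counts=counts)
print(polygon)
print(area_under(polygon), 2 * sum(counts))
```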

Frequency polygons are a simple way to show the data of several histograms on the same graph.

### Bar graphs

Bar graphs (or bar charts) display information using only one coordinate. They are mainly used to compare discrete data in distinct categories.

#### Simple bar graph

The simplest bar graph displays data from a table with one variable. Each bar represents one category. Bars can be drawn horizontally or vertically. Vertical bar graphs differ from histograms in that the bars are separated by a space. The height of a bar is proportional to the number of events (e.g. cases) in the category. Unlike in a histogram, however, bars usually have the same width whatever the width of the category they represent (e.g. age groups of different widths), so the area of a bar is not necessarily proportional to the number of events. If the categories have a logical order, it should be respected. Otherwise categories can be arranged by decreasing or increasing bar values. Variables in a bar graph can be discrete (e.g. sex, region, race) or continuous (e.g. age), provided the latter are organised in categories (e.g. age groups). The x-axis does not need to be continuous. The following horizontal bar graph shows the distribution of EPIET fellows by country of origin.

#### Grouped bar graphs

Sometimes several subcategories are shown next to each other within a larger category (grouped bar graphs). Information should be presented in the same order in each category. The following vertical grouped bar graph represents the distribution of cases of Ebola haemorrhagic fever in Bumba zone, Zaire in 1976. The graph shows two variables, age group and sex.

The following example illustrates how difficult it is to interpret too many bars on the same grouped bar graph.


#### Stacked bar graphs

An alternative to grouped bar graphs is the stacked bar graph (also known as sub-divided bar graph). On this type of graph each bar is divided into components that show segments of the total. The height of each component is proportional to its share of the bar. As with grouped bar graphs, information should be presented in the same sequence on each bar. The following stacked bar graph shows the distribution of cases of Salmonella Typhimurium infection by age and sex in Norway. In each age group the bar is divided into two sub-categories, males and females.

Source: Norwegian Institute of Public Health, Oslo, Norway

#### Component bar graphs

Component bar graphs (also called proportional component bar graphs) are different from stacked bar graphs in the sense that they represent proportions rather than absolute values. Each bar has the same height and represents 100% of the data in that bar. Bar components are expressed as the percentage of the total bar they represent. The following 100% component bar graph shows the same data as above. Proportions are visible but the absolute magnitude of the distribution is no longer visible.

Source: Norwegian Institute of Public Health, Oslo, Norway
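Going from a stacked bar to a 100% component bar amounts to dividing each component by the total of its bar. The Python sketch below uses hypothetical counts (not the Norwegian data) and the function name `to_percentages` is an assumption.

```python
def to_percentages(components):
    """Express each component of one bar as a percentage of that bar."""
    total = sum(components.values())
    return {name: 100 * n / total for name, n in components.items()}


# Hypothetical numbers of cases by sex in two age groups
bars = {
    "0-4 years": {"males": 30, "females": 20},
    "5-14 years": {"males": 3, "females": 2},
}
for age_group, components in bars.items():
    print(age_group, to_percentages(components))
```

Both age groups give the same percentages although one has ten times as many cases, which is how the absolute magnitude is lost.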

#### Deviation bar graphs

XXX

### Pie graphs

A pie graph (also known as pie chart or circle chart) is a circular graph split into segments. The size of a given "slice" is proportional to the amount of data (e.g. number of cases) it represents. Pie graphs are used to show the components of a larger group. However, small differences between slices are more difficult to see than differences between bars on a bar graph. This is why the proportion (%) of the total that each slice represents is frequently added to the slice. Different shading or colours can also be used to identify the various slices. In addition, slices can be ordered by decreasing or increasing magnitude. This ordering can be reinforced by a regular gradient from light to dark in one colour.

Source: Norwegian Institute of Public Health, Oslo, Norway
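The size of each slice, its percentage label and an ordering by decreasing magnitude can be computed as in the following Python sketch. The categories and counts are hypothetical and the function name `pie_slices` is an assumption.

```python
def pie_slices(counts):
    """Return (label, percent, angle_in_degrees), largest slice first."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(label, 100 * n / total, 360 * n / total) for label, n in ordered]


# Hypothetical numbers of cases by category
for label, percent, angle in pie_slices({"B": 30, "A": 50, "C": 20}):
    print(f"{label}: {percent:.0f}% ({angle:.0f} degrees)")
```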

Another example for pie charts:

Distribution of notifiable diseases reported during the World football cup, 4 June - 10 July 1998

Source: Norwegian Institute of Public Health, Oslo, Norway

### Other types of data display

The following types of diagrams are popular among epidemiologists because they show not only quantitative data but also relationships within the data and abstract information.

#### Scatter plot

The scatter plot (also called scatter graph, scatter diagram or scatter chart) displays the relationship between two variables, which may be discrete or continuous. One variable is plotted on the x-axis and the other on the y-axis. The pattern of the plotted points can reveal a relationship between the variables, e.g. a negative correlation. A scatter diagram is most often used to explore a suspected cause-and-effect relationship between two variables. However, a correlation seen on a scatter plot does not by itself prove that one variable causes the other.

Figure XX: Strong positive correlation between variables X and Y
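The strength and direction of a linear pattern on a scatter plot are often summarised by a correlation coefficient. This chapter does not name a specific coefficient, so the Python sketch below assumes Pearson's r and uses hypothetical data. The function name `pearson_r` is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
print(round(pearson_r(x, [2, 4, 6, 8, 10]), 3))  # y rises with x
print(round(pearson_r(x, [10, 8, 6, 4, 2]), 3))  # y falls as x rises
```

A coefficient close to +1 or -1 describes how closely the points follow a line. It says nothing about which variable, if any, causes the other.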

#### Box plot

The box plot graphically displays the lowest value, the highest value and the median of the dataset, as well as the first and third quartiles. A box plot is a good alternative or complement to a histogram and is usually better suited to displaying several distributions side by side for comparison.

Figure XX: box plot
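The five values displayed by a box plot can be computed with the Python standard library, as in the sketch below. The data are hypothetical and the function name `five_number_summary` is an assumption. Several conventions exist for computing quartiles. The "inclusive" method used here is one of them, and other software may give slightly different values.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical incubation periods in days
incubation_days = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(five_number_summary(incubation_days))
```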

#### Forest plot

XX

#### Radar chart

A radar chart (also known as spider chart, web chart, polar chart or irregular polygon) displays several quantitative variables in a way that makes it quick and easy to see highs and lows in the data, as well as similarities and differences between the variables. Each axis represents a different category, and the data points belonging to a given variable are joined to form a star-shaped figure. Zero is located at the intersection of the axes, so a point near the outer edge represents a high value. For comparing time series, a line chart can be a simpler alternative to a radar chart.

Figure XX: radar chart

#### Network diagram

Network diagrams are frequently used to illustrate the transmission and spread patterns during an epidemic. The following network diagram illustrates some aspects of the transmission during the SARS outbreak in Singapore in 2003.