Choosing a reference group

Draft provided by: James Stuart, Alain Moren


Section I: Introduction

Making comparisons is fundamental to epidemiological investigations and studies. We need to compare risks or rates of illness in exposed and unexposed groups, or odds of exposure in cases and controls. Without a comparison with a reference group, we cannot say from data analysis that an association with a given outcome is anything other than spurious. Such a reference group is designated as the control group in case control studies and the unexposed group in cohort studies (see Chapter X). For the field epidemiologist, difficulties arise more often in choosing controls for case control studies than in choosing an unexposed group in cohort studies. This lecture will focus mainly on the former.


It is helpful first to be clear about who the cases are, in other words, to start with a case definition. The case definition then helps to define the population from which the cases arise. This population is also the population from which controls should be drawn.


The most important principle to follow is that controls should be representative of the population from which cases arise, the source population. Cases can be defined in any way that the investigator decides, but this definition is key to determining the source population of cases, and hence the source population of controls.


Controls should then have the following characteristics:


(i) be representative of the exposure distribution in the source population

(ii) have an equal chance of being identified as cases if they had the disease under study

(iii) have the same exclusion and restriction criteria as cases


Section II: Case and control definitions


Case definition: resident of London aged under 10 years with faecal isolate of E. coli O157 during June 2006.

Exclusion: travel abroad in the week before onset of illness.


(i) The source population for cases is residents of London in June 2006 aged under 10 years. Controls should be representative of this source population with regard to the exposure of interest.


(ii) Since E. coli O157 infection is severe in children, we would expect all children in London to have a similarly high chance of being detected as cases if they had this infection. However, the proportion of cases diagnosed may vary by geographical area through variation in factors such as health seeking behaviour, primary care sampling and diagnostic facilities. This may introduce a selection bias when we come to choose controls, as it will be difficult to identify this same source population. This bias will not matter unless the proportion exposed differs between cases identified for our study and those cases who remain undetected.


(iii) In this definition cases have been excluded if they travelled abroad in the week before onset of illness. An equivalent suitable exclusion period for controls might be travel abroad in the week before interview. However, if cases mostly arise during school term, and if controls are interviewed in the summer holidays, some controls may be excluded unnecessarily. Another option might be to exclude those who travelled abroad in June. Or, if controls are individually matched on potential time of exposure, the travel exclusion could be restricted to the dates of the week before onset of illness of the matched case.



Control definition: resident of London aged under 10 years during June 2006.

Exclusion: travel abroad in the week before interview.
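
The control definition and exclusion above can be written as a simple eligibility check. The following is a minimal sketch only. The record fields (london_resident_june_2006, age_years, interview_date, travel_abroad_dates) are hypothetical and do not come from any real dataset, and "the week before interview" is assumed here to mean the seven days up to and including the interview date.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Candidate:
    """Hypothetical record for a potential control (field names are illustrative)."""
    london_resident_june_2006: bool
    age_years: int
    interview_date: date
    travel_abroad_dates: list = field(default_factory=list)


def is_eligible_control(c: Candidate) -> bool:
    """Apply the control definition, then the travel exclusion."""
    # Definition: resident of London aged under 10 years during June 2006.
    if not c.london_resident_june_2006 or c.age_years >= 10:
        return False
    # Exclusion (assumed window): abroad on any of the 7 days before interview.
    window_start = c.interview_date - timedelta(days=7)
    travelled = any(window_start <= d <= c.interview_date
                    for d in c.travel_abroad_dates)
    return not travelled
```

If controls were instead individually matched on time of exposure, the window would be anchored on the onset date of the matched case rather than on the interview date.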


Section III: Control selection

Let us now return to the important decision about selecting as controls a sample that is representative of the source population.

Options include:

a. Population controls

    a random sample from a population register/list/directory, or a sample stratified by some characteristic such as age/sex/general practice, known as matching (Chapter Y)

b. Neighbourhood controls

c. Friend controls

d. Family controls

e. Hospital controls


a. Population controls. As the aim is to obtain a random sample of the population that gives rise to cases, it is preferable to seek controls from a population register. A random sample of this population should be achievable if the register

(i) has a high level of completeness

(ii) contains the cases (it should be possible to check that all the cases are identified in the register)

(iii) can identify the parameters for the control definition (in this example, city residency and age)

(iv) is accessible to the investigator


If a register is not available or is not suitable, other methods of population sampling can be considered. A commonly used method is random digit dialling. This involves phoning random numbers (cold calling), a system that has the advantage of speed and convenience but has important limitations. The source population is limited to those who have a phone and to those who are available to answer. It may be difficult to be sure that the relevant geographical area is covered, or alternatively one may find that such a large area is covered by the phone listings that it is difficult to find controls from the (smaller) source area. This is more of a problem if phone numbers are used that do not have an area code, e.g. mobile phone numbers. Co-operation from those receiving such calls may be low.


b. Neighbourhood controls. This involves selecting controls from the same neighbourhood as the cases, i.e. they are matched for neighbourhood. One advantage is that there is no need for a population register. Also, controls are likely to be similar to cases in respect of socio-economic factors. This may be helpful if we wish to control for such complex factors and cannot measure them adequately. We cover this in more detail during the lecture on matching (Chapter Y).


Disadvantages are that co-operation may be low (selection bias), that recruitment may be time consuming and expensive (low efficiency), and that if we wish to measure the risk associated with socio-economic factors, we may not be able to do so. In a case control study of a disease that has a socio-economic gradient, e.g. invasive meningococcal disease, picking neighbourhood controls may not show any association between illness and level of income, because people living in the same neighbourhood are likely to have the same or similar socio-economic characteristics.


c. Friend controls are another way of selecting matched controls. Where speed of investigation is of the essence, e.g. in a suspected outbreak of E. coli O157, friends offer a rapid and convenient means of finding controls. Similarity of socio-economic characteristics and social behaviours brings the same advantages and disadvantages as for neighbourhood controls. In investigations of outbreaks of food borne infection, our aim is to identify a common source. Although friends may be more likely to share the food habits of their corresponding case, leading to an underestimate of the strength of association, the relative risk estimates can still be very high (Killalea). More of a problem may be a reluctance on the part of the case to give the names of friends to be interviewed (Boccia).


d.  Family controls are rarely used in field epidemiology as exposures in family controls are often so similar to those of the cases that the association of interest may not be shown at all.


e.  Hospital controls are useful if the cases have all been admitted to hospital or are on a specific disease register. Controls are easily identified and available at low cost from the same dataset that contains the cases e.g. hospital episode statistics, cancer register. Disadvantages may be that there are different catchment populations for different diseases so that the controls are not representative of the source population for the cases. More particularly the same causative factors can be responsible for the disease under study and other diseases that result in hospital admission. This will reduce the chances of showing a true association with the causative factor (bring the OR towards 1). In the study of any disease caused by smoking, selection of hospital controls would have a high chance of selecting people who were admitted with other conditions caused by smoking.
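
The pull of the odds ratio towards 1 can be illustrated with a small calculation. The sketch below uses invented figures only (50% of cases exposed, 25% exposure in the source population, 40% exposure among hospital controls). None of these numbers comes from a real study.

```python
def odds(proportion_exposed: float) -> float:
    """Convert a proportion exposed into exposure odds."""
    return proportion_exposed / (1.0 - proportion_exposed)


def odds_ratio(p_cases: float, p_controls: float) -> float:
    """Odds ratio from the proportions exposed among cases and controls."""
    return odds(p_cases) / odds(p_controls)


# Hypothetical proportions exposed (e.g. smokers); illustration only.
P_CASES = 0.50
P_SOURCE_POPULATION = 0.25   # controls representative of the source population
P_HOSPITAL_CONTROLS = 0.40   # controls admitted with other exposure-related diseases

or_population = odds_ratio(P_CASES, P_SOURCE_POPULATION)
or_hospital = odds_ratio(P_CASES, P_HOSPITAL_CONTROLS)
print(round(or_population, 2), round(or_hospital, 2))
```

With these assumed figures the odds ratio falls from about 3.0 with representative controls to about 1.5 with hospital controls, although the exposure of the cases has not changed.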


Section IV: Special considerations

(a) Controls in different types of case control studies: case cohort, traditional case control, density case control. 

Let us come back to one of the characteristics of the control population: that they should be representative of exposures in the source population. In selecting controls for a case cohort study, a random sample of the source population should, if done correctly, be representative of the exposure distribution in the population that gives rise to the cases. In a traditional case control study, where cases are excluded from the control selection, a bias has been introduced as the exposure distribution in potential controls is no longer representative of the source population. If the attack rate is low, this bias will also be low, but if the attack rate is high, the potential for bias will also be high (Chapter X). In a density case control study where cases occur over a long time period, controls should be selected from the source population still free of disease at the time the case occurs. In this way they should be representative of the person time experience of the source population (Rothman).
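
The effect of excluding cases from the pool of controls can be sketched numerically. The example below is an illustration under stated assumptions, not a method taken from the text. It assumes a closed cohort with hypothetical sizes and risks, and it uses expected counts rather than random sampling so that the result is deterministic. The function name compare_designs is invented for this example. It compares the odds ratio obtained when controls represent the whole source population (case cohort) with the one obtained when controls are drawn only from non-cases (traditional).

```python
def compare_designs(n_exposed, n_unexposed, risk_exposed, risk_unexposed):
    """Expected odds ratios under two control-sampling schemes (hypothetical inputs)."""
    cases_exposed = n_exposed * risk_exposed
    cases_unexposed = n_unexposed * risk_unexposed
    case_exposure_odds = cases_exposed / cases_unexposed

    # Case cohort: controls are a random sample of the whole source
    # population, whether or not they later become cases.
    population_odds = n_exposed / n_unexposed

    # Traditional: controls are sampled only from those who have not
    # become cases by the end of the study period.
    noncase_odds = (n_exposed - cases_exposed) / (n_unexposed - cases_unexposed)

    return {
        "risk_ratio": risk_exposed / risk_unexposed,
        "or_case_cohort": case_exposure_odds / population_odds,
        "or_traditional": case_exposure_odds / noncase_odds,
    }


low = compare_designs(1000, 1000, 0.02, 0.01)    # low attack rate
high = compare_designs(1000, 1000, 0.60, 0.30)   # high attack rate
```

In both scenarios the risk ratio is 2. With the low attack rate the traditional odds ratio is about 2.02, very close to the risk ratio. With the high attack rate it is 3.5, which shows how far the exposure distribution among non-cases has moved away from that of the source population.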





(b) Asymptomatic cases

Does failure to identify those with mild or asymptomatic infection as cases introduce bias? This situation is analogous to non-response among cases. If the exposures among symptomatic and asymptomatic cases are the same, then no bias is introduced. There is only a reduction in power of the study. There is no difference in control selection as controls should be representative of the source population.


In a hypothetical case control study with 40 cases and 40 controls, and 50% exposure among cases, Odds Ratio = (20 x 30) / (10 x 20) = 600/200 = 3.0

                Cases    Controls
Exposed           20        10
Not exposed       20        30
Total             40        40




If we only detect 20 cases with the same number of controls, the Odds Ratio is unchanged ((10 x 30) / (10 x 10) = 300/100 = 3.0) as long as the percentage exposed is the same in detected and undetected cases.

                Cases    Controls
Exposed           10        10
Not exposed       10        30
Total             20        40
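
The two calculations can be checked with a few lines of Python. This is a minimal sketch. The cell counts are inferred from the cross products quoted in the text (600/200 and 300/100) together with the 50% exposure among cases, and the function name is illustrative.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Cross-product odds ratio from the four cells of a 2 x 2 table."""
    return (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)


# All 40 cases detected: 20 exposed, 20 not exposed; controls 10 and 30.
or_all_cases = odds_ratio(20, 20, 10, 30)

# Only 20 cases detected, same exposure proportion, same 40 controls.
or_half_cases = odds_ratio(10, 10, 10, 30)

print(or_all_cases, or_half_cases)  # prints 3.0 3.0
```

Halving the number of detected cases leaves the point estimate unchanged. What is lost is precision, because the case cells are smaller.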





(c) Immune subjects

If some of the population is immune at the start of the study, then they are not eligible to be cases. They should then also be excluded as controls as they are not part of the source population. In practice we do not usually know who is immune. Again this may not matter if the percentage exposed is the same in immune and non-immune subjects. However, it may be that subjects are immune because they have already been cases in the past and that they have a similar level of exposure to the risk factor that caused the cases in the outbreak under study. This introduces bias that reduces the OR towards 1 and may result in a failure to detect a true association, especially if the proportion immune is high. For example, the inclusion of immune subjects in the control group is thought to explain the results of some case control studies that fail to show an association between contaminated drinking water and cryptosporidiosis (Hunter).


(d) Power and sample size in case control studies

A question often arises about the number of controls given a limited number of cases. Statistical programmes like Epi-Info can be used to estimate the sample size required to detect a specified odds ratio. It is unusual to select more than 3 or 4 controls per case as little statistical advantage is gained beyond this number (Kirkwood and Sterne, Figure). Alternatively, we could show that power increases and plateaus with an increasing number of controls per case. The graph would then have the same shape but inverted.
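
The diminishing return from extra controls can be sketched with a short calculation. The example below does not compute power. As a simpler stand-in it uses Woolf's approximation for the standard error of the log odds ratio, sqrt(1/a + 1/b + 1/c + 1/d), which is a technique brought in here and not one named in the text. The counts are hypothetical: 40 cases of whom 50% are exposed, and controls of whom 25% are exposed.

```python
import math


def se_log_or(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Woolf's approximate standard error of the log odds ratio."""
    return math.sqrt(
        1 / exposed_cases
        + 1 / unexposed_cases
        + 1 / exposed_controls
        + 1 / unexposed_controls
    )


# Hypothetical study: 20 exposed and 20 unexposed cases, with
# m controls per case and 25% of controls exposed.
results = {}
for m in (1, 2, 3, 4, 10):
    results[m] = se_log_or(20, 20, 10 * m, 30 * m)
    print(m, round(results[m], 3))
```

Under these assumptions the standard error falls noticeably between 1 and 2 controls per case and then flattens, because the contribution of the 40 cases to the variance is fixed however many controls are added.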




Section V: Review of control definition

If we now review the control definition for the investigation of the E. coli O157 outbreak, a decision is taken to select population controls from the same general practice as the case. These controls will have some geographical and social similarities to the cases, but are likely to provide a representative sample of the population giving rise to the cases.



Control definition: resident of London aged under 10 years during June 2006.

Exclusion: travel abroad in the week before interview.

Three controls per case will be selected at random from the same primary care register as the case.
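
The selection step can be sketched as follows. This is an illustration only. The register is represented as a list of dictionaries with hypothetical keys "id" and "practice", the function name select_controls is invented, and it is assumed that the register has already been restricted to people who meet the control definition and exclusion above.

```python
import random


def select_controls(register, case_id, case_practice, n_controls=3, seed=None):
    """Pick n_controls at random from the same practice register as the case.

    register: list of dicts with hypothetical keys "id" and "practice".
    The case is removed from the pool so that it cannot act as its own control.
    """
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    eligible = [
        person for person in register
        if person["practice"] == case_practice and person["id"] != case_id
    ]
    if len(eligible) < n_controls:
        raise ValueError("not enough eligible people in this practice register")
    # Simple random sample without replacement.
    return rng.sample(eligible, n_controls)
```

Recording the seed used for each draw allows the selection to be audited later, which can be useful if replacement controls have to be drawn after refusals.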




Section VI: Cohort studies

For cohort studies, the field epidemiologist is likely to be involved in retrospective studies. In other words the investigation takes place after both exposure and disease have occurred. The commonest situation is an outbreak of food poisoning after a clearly defined event such as a party or wedding. Following the same principles as for the case control study, it is first essential to define the source population. This population then forms the cohort, usually defined as those who attended the function in question. Individuals within the cohort are then classified into exposed or unexposed, for example, according to whether they ate or did not eat specified items of food or drink. The unexposed constitute the reference group for each item.
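
The classification into exposed and unexposed for each item can be sketched in code. The example below is a minimal illustration. Each attendee is a dictionary with hypothetical keys "ill" (True or False) and "ate" (a set of item names), and the function name is invented for this example.

```python
def food_specific_risk_ratio(attendees, item):
    """Attack rates in those who ate and did not eat an item, and their ratio.

    attendees: list of dicts with hypothetical keys "ill" (bool) and "ate" (set).
    Returns (attack_rate_exposed, attack_rate_unexposed, risk_ratio).
    The risk ratio is None when no unexposed attendee was ill.
    """
    exposed = [a for a in attendees if item in a["ate"]]
    unexposed = [a for a in attendees if item not in a["ate"]]
    if not exposed or not unexposed:
        raise ValueError("both an exposed and an unexposed group are needed")

    ar_exposed = sum(a["ill"] for a in exposed) / len(exposed)
    ar_unexposed = sum(a["ill"] for a in unexposed) / len(unexposed)
    risk_ratio = ar_exposed / ar_unexposed if ar_unexposed > 0 else None
    return ar_exposed, ar_unexposed, risk_ratio
```

Calling the function once per menu item gives a table of food-specific attack rates in which the unexposed group changes from item to item, as described above.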


Questions may arise about whether the unexposed should include those who did not eat any food. As for case control studies, this depends on your definition of the source population. Is the cohort defined as everyone who attended or everyone who attended AND who ate something?  As the number who did not eat anything will probably be small, it may be sensible to include them. If we should discover a substantial proportion of cases among those who attended but did not eat any food, food may not be the source of the outbreak. 


What happens if everyone ate the food in question, i.e. there is no unexposed group? Luckily for the epidemiologist, our investigations involve human behaviour, which usually offers a rich variety of exposures. In a food borne outbreak where everyone ate the delectable tiramisu, we then rely on trying to measure different levels of exposure (different amounts of tiramisu consumed). The reference group then becomes those with the lowest level of exposure.
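
Using the lowest exposure level as the reference group can be sketched as below. The counts are invented for illustration, the levels are assumed to be recorded as numbers of portions, and the function name is hypothetical.

```python
def risk_ratios_by_level(counts):
    """Risk ratio at each exposure level relative to the lowest level.

    counts: dict mapping exposure level (e.g. portions eaten) to a tuple
    (number ill, number of people at that level). The lowest level must
    contain at least one ill person, otherwise the ratio is undefined.
    """
    reference_level = min(counts)
    ref_ill, ref_total = counts[reference_level]
    reference_risk = ref_ill / ref_total
    return {
        level: (ill / total) / reference_risk
        for level, (ill, total) in sorted(counts.items())
    }


# Hypothetical outbreak in which everyone ate some tiramisu.
portions = {1: (2, 20), 2: (6, 20), 3: (12, 20)}
ratios = risk_ratios_by_level(portions)
```

With these invented counts the risk ratios are approximately 1, 3 and 6 for one, two and three portions, the kind of gradient that supports the item as the vehicle even without an unexposed group.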



Section VII: Key points in selecting controls

(i) Define the source population. It is helpful to imagine the cohort study that we could have done instead. The total of exposed and unexposed represents the source population.

(ii) Aim for a sample that is representative of the source population

(iii) Review advantages and disadvantages of available options, taking account of urgency and available resources

(iv) Controls selected from a population list are preferable, but this is not always feasible

(v) No control group is perfect

(vi) Make a decision and do the study!




Section VIII: References

1. Rothman KJ. Epidemiology: an introduction. Oxford University Press 2002.

2. Hennekens CH. Epidemiology in Medicine. Lippincott-Williams and Wilkins 1987.

3. Gregg MB. Field epidemiology. Oxford University Press 1996.

4. Wacholder S, McLaughlin JK, Silverman DT, Mandel JS. Selection of controls in case control studies I-III. Am J Epidemiol 1992; 135: 1019-50.

5. Kirkwood BR, Sterne JAC. Essential Medical Statistics (2nd Ed). Blackwell Science 2003.

6. Killalea D, Ward LR, Roberts D, de Louvois J, Sufi F et al. International epidemiological and microbiological study of outbreak of Salmonella agona infection from a ready to eat savoury snack - I: England and Wales and the United States. BMJ 1996; 313: 1105-7.