Biostatistics Simplified Sanjeev B Sarmukaddam
Biostatistics Simplified
Sanjeev B Sarmukaddam, MSc DBS MPS PhD, Consultant (Biostatistics), Maharashtra Institute of Mental Health, BJ Medical College and Sassoon Hospital Campus, Pune, Maharashtra, India
Published by
Jitendar P Vij
Jaypee Brothers Medical Publishers (P) Ltd
Corporate Office
4838/24 Ansari Road, Daryaganj, New Delhi - 110002, India, Phone: +91-11-43574357, Fax: +91-11-43574314
Registered Office
B-3 EMCA House, 23/23B Ansari Road, Daryaganj, New Delhi - 110 002, India
Phones: +91-11-23272143, +91-11-23272703, +91-11-23282021
+91-11-23245672, Rel: +91-11-32558559, Fax: +91-11-23276490, +91-11-23245683
Offices in India
Overseas Offices
Biostatistics Simplified
© 2010, Jaypee Brothers Medical Publishers (P) Ltd.
All rights reserved. No part of this publication should be reproduced, stored in a retrieval system, or transmitted in any form or by any means: electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the author and the publisher.
First Edition: 2010
9788184487480
Typeset at JPBMP typesetting unit
Printed at Ajanta Offset
Preface
Although the book ‘Biostatistics Simplified’ is not written in a ‘question-answer’ format, one can find the answer to almost any question posed in any exam or entrance exam (such as the MBBS final exam, PG entrance in India, USMLE, etc.) somewhere in the book. Many clinicians are not actually producers of studies involving statistics/biostatistics, but almost all will be consumers. They must read the medical literature to keep up-to-date, and they will find that many papers contain a good deal of statistical analysis. I hope that the book will prove useful to all medical/health personnel (students of any specialty, and faculty of all disciplines including nursing and other pre- and paramedical fields) for all of the above purposes.
No special background knowledge is required to understand this text. However, when essential, some algebra and notation are used. I have tried to explain all statistical concepts in simple terms. Any further reduction in the length of the text would have led to losing the rigor of the subject; nevertheless, all fundamental concepts and important terms are covered.
I am grateful to M/s Jaypee Brothers Medical Publishers (P) Ltd, New Delhi (particularly Tarun Duneja, Director-Publishing) for their active co-operation.

1. Introduction

Quality control methods are very popular these days, but very few people are aware that Indian housewives are among the best followers of quality control principles. One important principle is that of ‘minimum wastage or no wastage’; Indian women implement it instinctively while managing their kitchens. Carpenters and masons work according to the principles of “CPM” or “PERT” without knowing what they are. No wonder, then, that one may not be aware of, yet still observe or follow, the important principles enumerated in a relevant subject. Biostatistics has been in the same state. People often argue that medical science used to be practiced quite well in the olden days when biostatistics did not exist as a separate subject (or was not taught). I believe that even if biostatistics was not recognized as a separate entity at that time, it was still very much followed since, even in those days, medical practitioners used to compare any observed or test value with a normal or reference range. While taking an appropriate treatment decision, doctors used to take into consideration all types of possible risks and benefits without actually performing a formal “decision analysis”.
Although biostatistics will not help you directly in prescribing the most appropriate medicines or in performing a surgical operation, a working knowledge of this subject will definitely be helpful, though indirectly, in enhancing scientific abilities (in any discipline). It will train you to begin thinking in a more scientific, reasonable and logical way. Contrary to the general belief that biostatistics is a science of ‘groups of subjects’, it is actually the science of management of uncertainties in health and medicine, and so could be helpful even while dealing with a single subject in day-to-day clinical practice.
No one can deny the important contributions that the proper use of biostatistics has made to the advancement of medicine and public health. Through the intelligent collection and analysis of health data, epidemics are brought under control, drugs are tested for their efficacy, important factors in the etiology of diseases and the risk factors involved are identified, health problems and health programs are assessed, and so on. One can keep up with the rapid, almost daily, advancement in the medical sciences only by being able to critically read and interpret the findings reported in the medical literature, such as journals and the electronic media. For this, at least some knowledge of biostatistics is vitally necessary, as it helps develop such ability so that one can apply research findings judiciously in practice.
“Medical Biostatistics” with ISBN:0-824-0426-6 published by Marcel Dekker of New York in 2001 and “Fundamentals of Biostatistics” with ISBN:81-8061-814-5 published by Jaypee Brothers Medical Publishers (P) Ltd. of New Delhi in 2006 are good books on the subject; this book, however, will hopefully be useful to medical students for any entrance or other examinations. Considering that unless the subject knowledge seems applicable and interesting one may not try to digest it, I have made every effort in this book to keep it applicable and interesting. You may realize that most of the topics are just above the common sense level, making the subject easily understandable.
I have tried to touch upon as many topics as possible, mainly through the miscellany chapter, which contains varied material. Though there are only ten chapters in this book, a few bulleted (•) paragraphs appear in most chapters. These contain comments which are important as they are more like answers to study questions, and I have tried to present them crisply and briefly. A few essential statistical tables are included in the Appendix.
All statistical methods have a sound logic behind them, and most of them are just above common sense. Unfortunately, however, common sense is not so common but needs to be learnt systematically. What makes statistics (and so biostatistics) useful in medical science is that one of the characteristics of people is the marked variation that is present or occurs between individuals. For example, the average height of an Indian man is 5′3″, but one fairly frequently sees men from 4′10″ up to 6′0″. People on average consult their general practitioner about 3 times in a year, but there are a few people who hardly ever see a doctor and a few who are in and out every week. We often find such marked variation in most of the characteristics of the individuals with whom we have to deal in medical science. Despite such large variation, statistics helps us to find some pattern in it (i.e. unity in diversity).
The most common source of uncertainty in medicine (and health) is the natural biologic variability between and within individuals. Variations between samples, laboratories, instruments, observers, etc. further add to these uncertainties. There are other sources, like incomplete information, imperfect instruments, lack of sufficient medical knowledge, poor compliance with the regimen, etc. The role of biostatistics is to handle these factors so as to minimize their effect. It is a science of management of uncertainties: to measure them and to minimize their impact on decisions.
Researchers themselves make errors in analysis, but even when they do not, their conclusions are often misinterpreted or misused by people who have access to them. In fact, what often finds its way to the general public is distorted and garbled to the extent that it can cause more harm than good. Many people find themselves increasingly frustrated with the confusing and contradictory health advice in the media.
 
 
“Garbage in – Garbage out”
One of the disadvantages of computer technology is that it can make a presentation such that poor data may look good. The most unreliable and invalid information, collected in the most unsystematic and/or most unscientific way, can easily be analyzed with the help of a computer and can look thoroughly impressive. In this respect, the important adage to remember with regard to data processing is “garbage in – garbage out”.
 
‘Normal’
A general practitioner in his practice/clinical work is constantly comparing the symptoms presented by patients with his own perception of the acceptable range of human physiology and behavior. Many times the common meanings of the word ‘normal’ are socially defined. These definitions are often dependent on the health beliefs and expectations of individuals, which may vary widely. A 75-year-old lady with treatable congestive cardiac failure may regard her breathlessness as ‘normal’ because it is what she has become used to over many months. Likewise, an 80-year-old man may regard deafness as ‘normal’ and not bother to ask his doctor whether the situation could be improved by removing wax or supplying a hearing aid. Therefore, it is very important to know where physiology ends and pathology starts.
 
‘Uses’
To develop an ability to critically read, interpret and use the findings of others' studies or knowledge from other sources is one of the important purposes of acquiring knowledge of this subject, apart from, of course, planning and analyzing one's own study. Keeping one's information up to date has become a necessity as medical science is progressing day by day. The prevalence of diseases like AIDS is increasing, new drugs and new therapies are developed, and new diagnostic tools are used. Unless one completely understands research done elsewhere, one cannot fully utilize the findings for patients in day-to-day practice. The science of planning good studies/experiments/surveys and interpreting others' conclusions is well developed these days. Sufficient knowledge of “Research Methodology” is essential for this purpose. A wrong study design yields wrong conclusions. A poorly designed study yields poor conclusions (with much less power).
Many uses of statistics/biostatistics could be listed; however, the main uses of statistics in general are:
  1. To collect data in the best possible way. This includes methods of:
    • Designing forms for data collection
    • Organising the collection procedure
    • Designing and executing research
    • Conducting surveys in a population.
  2. To describe the characteristics of a group or a situation. This is accomplished mainly by:
    • Data collection
    • Data summary
    • Data presentation.
  3. To analyze data and to draw conclusions from such analyses. This involves the use of various analytical techniques and the use of probability in drawing conclusions.
 
‘Term’
It might be useful to point out two different meanings attached to the term “statistics”. The term statistics has been used to indicate facts and figures of any kind: health statistics, hospital statistics, vital statistics, business statistics, etc. For this purpose, a plural sense of the term is used. The term is also used to refer to a branch of science developed for handling data in general. The essential features of statistics are evident from the various definitions of statistics:
  • Principles and methods for the collection, presentation, analysis and interpretation of numerical data of different kinds:
    • Observational data (measurement, survey data)
    • Data that have been obtained by repetitive operations (experimental data)
    • Data affected to a marked degree by a multiplicity of causes.
  • The science and art of dealing with variation in such a way as to obtain reliable results.
  • Controlled, objective methods whereby group trends are abstracted from observations on many separate individuals.
  • The science of experimentation which may be regarded as mathematics applied to observational data.
Generally the singular sense of the term is used for this purpose. In fact, this term is very old. You will find similar words in ancient languages, for example:
  • Status – Latin
  • Statista – Italian
  • Statistik – German
The meaning of all these terms is “political state”. In ancient times, the government used to collect information regarding the population and the property/wealth of the country, the former enabling the government to have an idea of the manpower of the country and the latter providing it a basis for introducing new taxes and levies.
There are many branches of applied statistics depending on the area of application. Few special methods are also developed to deal with specific/peculiar situations arising in those fields. One such major branch is Biostatistics. When the data being analyzed are derived from the biological sciences or medicine the term biostatistics is used to distinguish this particular application of statistical tools and concepts.
Biometry is a term generally used as a synonym for biostatistics, but biometry is more than biostatistics, as a biometrician is supposed to be in a position to take the measurements (or collect the relevant data) himself. Therefore, biometry can be defined, in short, as quantitative biology. Information that is presented in numerical form is data. This information could either be generated through observation/experimentation or be compiled from other sources like old records or documents, medical records, the census, etc. Data could be of different types or measured at different levels. Identification of the type and level of measurement is important because further treatment for drawing valid and useful conclusions depends on it.
How does biostatistics work? (in brief, and only in the hypothesis-testing situation)
This can best be illustrated with one simple example. Consider a trial of “tossing a fair coin”. Since the coin is fair, the probability of a head coming up is one-half. Suppose we perform 20 trials and keep a count of the heads coming up. About 10 heads are expected. Although 10 heads are expected, the actual number could be different. If the number is 9, 8 or 7 (or as large as 11, 12 or 13), we are not bothered. However, if the number is 6 or smaller (or larger than 13), we doubt the ‘fairness’ of the coin. If the number is very small, e.g. 0 or 1 (or very large, 19 or 20), we are almost sure about ‘unfairness’. We can find the actual probability of all possible events. These are calculated using the ‘binomial’ distribution and are displayed in Table 1.1.
Table 1.1   Coin tossing experiment with n = 20
Number of heads    Probability (%)
0                  0.00009
1                  0.002
2                  0.018
3                  0.109
4                  0.462
5                  1.479
6                  3.696
7                  7.393
8                  12.013
9                  16.018
10                 17.620
11                 16.018
12                 12.013
13                 7.393
14                 3.696
15                 1.479
16                 0.462
17                 0.109
18                 0.018
19                 0.002
20                 0.00009
This is called as ‘sampling distribution’.
That means the probability of 0 (or all 20) head(s) in 20 tosses is only 0.00009%, the probability of 5 or fewer (or 15 or more) heads in 20 tosses is about 2% (cumulative), and so on. With varying degrees of probability and within varying limits, the single sample proportion (or mean and SD) can be related to the corresponding parameter of the population from which it is drawn. This depends on already knowing the appropriate characteristics of the population. But in real life, population parameters are usually unknown. Statistical methods/theory help overcome this. Moreover, one does not need to draw such a sampling distribution in practice (a virtual sampling distribution is enough, and therefore only one sample will do). How this is done in various situations is shown in the subsequent chapters.
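The probabilities in Table 1.1 follow directly from the binomial formula nCr π^r (1 − π)^(n − r) with n = 20 and π = 0.5. A minimal Python sketch (not part of the original text) that reproduces the table, up to rounding, and the roughly 2% tail probability mentioned above:

```python
# Sampling distribution of the number of heads in n = 20 tosses of a fair coin
from math import comb

n, p = 20, 0.5
for r in range(n + 1):
    prob = comb(n, r) * p**r * (1 - p)**(n - r)   # binomial probability
    print(f"{r:2d} heads: {100 * prob:.5f}%")

# Cumulative probability of 5 or fewer heads (about 2%, as stated in the text)
tail = sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(6))
print(f"P(5 or fewer heads) = {100 * tail:.2f}%")
```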

2. Data Description

 
TYPES OF DATA AND LEVELS OF MEASUREMENT
The data mainly are of two types:
  1. Qualitative data
  2. Quantitative data
A characteristic, which may take on different values, i.e. which may vary in different persons, places, or things is called variable. Therefore, data are realized values of this variable. Variable can be of qualitative or quantitative type. Qualitative data are also called, sometimes, as categorical or enumeration data and quantitative data are called measurement or metric data. It is essential to know type of data because they require separate statistical treatment (i.e. any further statistical treatment depends on what type of data you are handling).
Quantitative data can be either continuous (the value can be fractional, e.g. weight 39.6 kg) or discrete (only full integer values, e.g. family size).
Qualitative variables are measured either on a nominal or an ordinal scale and quantitative variables are measured on an interval or a ratio scale.
 
Nominal
Observations are placed into broad categories which may be denoted by symbols or labels or names. For example Blood groups, Diagnostic categories, Gender, Marital status and Cause of death.
 
Ordinal
Categories or observations are ranked or ordered. Each category is in a unique position in relation to other categories, but the distances between the categories are not known. For example: severity of illness, socioeconomic status and ranking of students according to marks.
 
Interval
In addition to the ordinal level of measurement, the distance between any two numbers (values of the variable) is fixed and equal. The origin is arbitrary, i.e. the zero point of the scale may be arbitrary (as it is, for example, in the Fahrenheit and Celsius scales of temperature measurement). Examples of interval scale measurements include year of birth and other variables measured in calendar time. These particular variables may be readily transformed to ratio scale measurements by, for example, conversion of year of birth to age at some fixed point. Other examples: intelligence quotient (IQ) and scores on most psychiatric scales.
 
Ratio
In addition to the interval level of measurement, it has a true zero point as its origin, i.e. zero indicates absence of the variable. That is, a ratio scale has an absolute or natural zero which has empirical meaning. The ratio of any two scale points is meaningful. Examples: height and weight, and generally most physiological or biochemical variables.
All scales have certain formal properties. For nominal scale, the only relation involved is that of equivalence. Ordinal scale incorporates not only the relation of equivalence but also the relation ‘greater than’. The equivalence (=) relation holds among members of the same class, and the > relation holds between any pair of classes. Any order-preserving transformation does not change the information contained in an ordinal scale.
Any change in the numbers associated with the positions of the objects measured on an interval scale must preserve not only the ordering of the objects but also the relative differences between the objects. That is, the interval scale is unique up to a linear transformation. Thus the information yielded by the scale is not affected if each number is multiplied by a positive constant and then a constant is added to this product. The operations of arithmetic are permissible on the intervals between the numbers assigned to the objects. Ratio scales are achieved only when all four of these relations (namely 1. equivalence, 2. greater than, 3. known ratio of any two intervals, and 4. known ratio of any two scale values) are operationally possible to attain. All arithmetic operations are permissible (for a ratio scale) on the numerical values assigned to the objects themselves as well as on the intervals. A ratio scale variable can be measured as interval, ordinal or nominal. One can move from the bottom of the above list to the top (i.e. from a higher to a lower scale) but not from the top to the bottom, if one wants to convert the data to a different (lower) scale.
 
PRESENTATION OF DATA
Raw data should be presented in a correct manner so that it—
  1. Arouses interest in the reader.
  2. Makes the data sufficiently concise without losing important details.
  3. Enables the readers to form quick impression and to draw some conclusion directly or indirectly.
  4. Facilitates further statistical analysis.
    Presentation of data is generally done by either of these two methods:
    1. Tabular presentation
    2. Graphical presentation.
 
Tabular Presentation
 
Frequency Distribution Table
It is the method by which the data of a long series of observations are systematically organized and recorded. Each one of the observations should fall into one and only one of the categories and there should be at least one category for each observation, i.e. the categories should be mutually exclusive and exhaustive.
 
Steps for Presentation of a Frequency Distribution Table
  1. Determine the range between the highest and lowest observation.
  2. Settle upon either number or size of the groupings (class intervals) for making classification.
    Estimate the other (number of classes or size/width) by using the relationship.
    W = R/K; where W is the width or size of the class, K is the number of class intervals, and R is the range (difference between the smallest and the largest observation).
    A commonly followed rule of thumb states that there should be no fewer than six intervals and no more than 15. Those who wish to have more specific guidance in deciding how many class intervals are needed may use the formula K = 1 + 3.322 log(N), where K stands for the number of class intervals and N is the number of values in the data set under consideration (a small sketch applying this rule follows these steps). Class intervals should generally be of the same width; this is essential for comparison of different class frequencies.
  3. Tally the observations in their proper class. (Give four straight and one oblique line to make a bundle of five.)
  4. Count tallies and present the classification in tabular form with a suitable heading.
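As a rough illustration of step 2, the sketch below (using made-up observations, not data from the book) applies W = R/K together with the rule of thumb K = 1 + 3.322 log10(N):

```python
import math

values = [12, 15, 21, 9, 30, 18, 25, 11, 27, 16, 22, 19]  # hypothetical observations

N = len(values)
R = max(values) - min(values)            # range
K = round(1 + 3.322 * math.log10(N))     # suggested number of class intervals
W = math.ceil(R / K)                     # class width, rounded up to a whole number

print(f"N = {N}, range R = {R}, classes K = {K}, width W = {W}")
```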
 
Notes
  1. The procedure is the same for any type of data. However, if the data are qualitative, there is no need to follow steps 1 and 2.
  2. If there is one variable, the classification is called one-way classification. If there are two variables, the classification is called two-way classification. If three, three-way classification and so on.
  3. Any multi-way (two-way, three-way, etc.) is also sometimes called as cross-classification or cross tabulation.
  4. If in two-way classification, both the variables are dichotomous (two categories only) then a resulting table is called as four-fold table.
 
GRAPHICAL PRESENTATION
Graphs and diagrams are useful in the presentation of data. They give a better grasp of the information at a glance. However, graphs are not substitutes for the table. It is very essential to note the scale while interpreting a graph (particularly a line graph) for comparison of graphs, because a change in the scale will show a different pattern (examples are shown in Figures 2.1A and B). Graphs must have clear and fully defined headings so that the contents can be grasped without referring to the text. Important graphical techniques are described here very briefly. These are the basic techniques, but one can think of (or use) many more modifications.
 
Bar Diagram
The length of the bar indicates the frequency of the class or character.
Bar can be of four types:
  1. Simple
  2. Multiple
  3. Component
  4. Proportional
 
Pie Diagram
Pie diagram is used to represent proportions (of mutually exclusive components). A circle is divided into different sectors where area of sectors represent different proportions. Angle of the sector is (3.6* percentage).
 
Pictogram
Data are presented in the form of pictures. Each picture indicates a unit of the characteristic. The whole/universe is represented by a fixed number of pictures, and an appropriate number of pictures is then shaded.
 
Line Graph
It is used to represent a trend over a period of time or for continuous types of data. The presentation of the chronological distribution of the number of cases of a disease is called an ‘epidemic curve’; it is a type of line graph. Important note: the scale on the Y-axis should preferably start from zero. However, if all values on the Y-axis are much above zero, a break in the scale should be shown clearly.
 
Histogram
It is used for continuous type of data. It is a graphical presentation of frequency distribution. Area of each rectangle represents frequency of the corresponding class.
 
Population Pyramid
Two histograms showing age distribution of population separately for males and females are placed base to base.
 
Frequency Polygon
The midpoints of the intervals of the corresponding rectangles in a histogram (i.e. the frequencies of the classes plotted at the midpoints of the intervals) are joined by straight lines.
 
Cumulative Frequency Polygon or Ogive
Instead of frequencies, cumulative frequencies (successive class frequencies added up) are used to draw a polygon similar to the frequency polygon. The number or percentage of observations falling below or above a specific value can be read from this graph.
Any point below which a certain percentage of the observations lie is called a percentile. For example, the 90th percentile is the observation below which 90 percent of the observations lie. A graph drawn from such points is called a percentile graph.
 
Growth Chart or Road to Health Chart
It is a visible display of the child's physical growth and development. It is designed primarily for the growth monitoring of child so that changes over time can be interpreted. When the child's weight is plotted on the growth chart at monthly intervals against his or her age, it gives what is known as weight-for-age growth curve. For purposes of comparison, reference curves are provided. Any deviation from “Normal” (Road to Health) or malnutrition grade can be detected easily by comparison with these reference curves.
 
Spot or Contour Map
This is used to show geographical distribution of events in pictorial form. Geographic distribution may provide evidence (or clues) of the source of disease and its mode of spread. Sometimes different parts (areas) of the map are shaded according to the prevalence (or incidence) of the part (area).
 
Scatter Diagram
This is the most suitable graphical technique for finding any possible relationship that may exist between two quantitative variables. When observations on two variables are made/available for each subject, to study the possible relationship, the values are plotted on the two axes of the graph paper. The characters (variables) are read on the horizontal (X) and vertical (Y) axes, and the perpendiculars drawn from these readings meet to give one scatter point. After plotting all the points, when they are viewed collectively the trend of the points will suggest any possible relationship (strength as well as type) that may exist between the variables.
All of the above are illustrated in the correspondingly numbered figures (Figs 2.1A to 2.12B).
Fig. 2.1A: Example of bar diagram—simple bars
Fig. 2.1B: Example of bar diagram—multiple bars
Fig. 2.1C: Example of bar diagram—component bars
Fig. 2.2: Example of pie diagram
Fig. 2.3: Example of pictogram
Fig. 2.4: Example of line graph
Fig. 2.5A: Example of histogram
Fig. 2.5B: Example of histogram
Fig. 2.5C: Example of histogram
Fig. 2.6: Example of population pyramid
Fig. 2.7A: Example of frequency polygon
Fig. 2.7B: Example of frequency polygon
Fig. 2.8A: Example of cumulative frequency polygon
Fig. 2.8B: Example of cumulative frequency polygon
Fig. 2.9: Example of growth chart
Fig. 2.10A: Example of spot map (city)
Fig. 2.10B: Example of spot map (country)
Fig. 2.11: Example of scatter diagram
Fig. 2.12A: Example of effect of change/choice of scale
Fig. 2.12B: Example of effect of change/choice of scale
 
MEASURES OF CENTRAL TENDENCY
Three commonly used measures of central tendency are Mean, Median, and Mode.
 
Mean
Mean is a simple arithmetic average of the observations. This is calculated by dividing the total of all the observations by number of observations.
(Note: A descriptive measure computed from the data of a sample is called a “statistic” and from the data of a population is called a “parameter“.)
The sample mean is denoted by x̄ (x-bar).
Population mean is denoted by µ.
Example: An investigation of household size in a village gave the following numbers of persons in each household: 5, 3, 9, 7, 1.
To calculate mean
  1. add = 5 + 3 + 9 + 7 + 1 = 25
  2. divide sum by number = 25/5 = 5
    Therefore mean = 5
    The major advantage of the mean is that it uses all the data values, and is, in statistical sense, efficient.
 
Median
The median is the magnitude of the observation that occupies the middle position when all the observations are arranged in order of their magnitude. With an even number of observations, it is the mean of the two middle observations. LD50 is a term used to indicate the same idea.
Therefore there will be equal number of observations with higher and lower magnitude than the median.
Example: In a hospital ward the following are the number of days of stay—
13, 52, 8, 9, 8.
Order = 8, 8, 9, 13, 52.
Median is 9.
Note: Mean = 90/5 = 18. Only one observation is higher than the mean. Therefore, the mean is not a correct representative of these observations. Remove the extreme value 52 and then the mean = 38/4 = 9.5. The median has the advantage that it is not affected by outliers. However, it is not statistically efficient, as it does not use all the individual data values.
 
Mode
Mode is the most frequently occurring value. If all the values are different there is no mode. However, a set of values may have more than one mode.
Example: Number of children in 5 families
4, 2, 3, 2, 0.
Mode = 2
Bimodal (two modes)
Suppose, number of children in 5 families
4, 2, 3, 2, 3.
Modes = 2, 3.
Different measures are applicable for different levels of measurement.
Level of measurement    Measure(s) of central tendency (which can be meaningfully computed)
Nominal                 Mode
Ordinal                 Mode, Median
Interval or Ratio       Mode, Median, Mean
 
Measure(s) of Central Tendency For Grouped Data
 
Notations
Number of class intervals – K
  • Midpoint of ith class – mi
  • Frequency of ith class – fi
  • Lower limit of the ith class – Li
Then mean = Sum (mi fi)/Sum (fi), i.e. Σ (mi fi)/Σ (fi)
Median class: class interval containing median [i.e. (n/2)th observation].
Then median = Lmd + (Wmd/fmd)* [(n/2) – C]
Where Lmd is the lower limit of the median class, Wmd is the width of the median class, fmd is the frequency of the median class and C is the cumulative frequency of the class preceding median class.
Modal class: Class interval containing mode (i.e. maximum/highest frequency).
Then mode = Lmo + [Wmo (f1 – f0)]/[2*f1 – f0 – f2]
Where Lmo is the lower limit of the modal class, Wmo is the width of the modal class, f1 is the frequency of the modal class, f0 is the frequency of the preceding class, and f2 is the frequency of the succeeding class.
Sometimes, the relationship used to find out mode is [Mode = {(3*Median) – (2*Mean)}].
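A small Python sketch (with made-up classes and frequencies, not data from the book) applying the grouped-data formulas above for the mean, median and mode:

```python
lower_limits = [10, 20, 30, 40, 50]   # L_i; equal class width W = 10
freqs        = [ 4,  8, 15,  9,  4]   # f_i
W = 10
n = sum(freqs)
midpoints = [L + W / 2 for L in lower_limits]   # m_i

# Mean = sum(m_i * f_i) / sum(f_i)
mean = sum(m * f for m, f in zip(midpoints, freqs)) / n

# Median class: the class containing the (n/2)th observation
cum = 0
for md, f in enumerate(freqs):
    if cum + f >= n / 2:
        break
    cum += f
median = lower_limits[md] + (W / freqs[md]) * (n / 2 - cum)

# Modal class: the class with the highest frequency
# (assumes the modal class is neither the first nor the last class)
mo = freqs.index(max(freqs))
f1, f0, f2 = freqs[mo], freqs[mo - 1], freqs[mo + 1]
mode = lower_limits[mo] + W * (f1 - f0) / (2 * f1 - f0 - f2)

print(round(mean, 2), round(median, 2), round(mode, 2))  # 35.25, 35.33, 35.38
```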
 
When to Use Various Measures of Central Tendency
 
Mean
  • When observations are distributed symmetrically around a central point, i.e. when the distribution is not badly skewed.
    • Mean is the center of gravity of the distribution
    • Each observation contributes to its determination.
  • When the measure of central tendency having the greatest stability is wanted.
    • Mean is more stable than median or mode.
  • When other statistics like the SD or Pearson's correlation coefficient are to be computed later.
    • Many statistics are based upon the mean.
 
Median
  • When the exact midpoint of the distribution is wanted as median divides the distribution into two equal halves (50% point).
  • When there are extreme observations which would affect the mean. (Extreme observations do not disturb the median).
  • When data are measured in ordinal scale.
 
Mode
  • When a quick and approximate measure of central tendency is all that is wanted.
  • When the measure of central tendency should be the most typical value.
  • When data are measured in nominal scale.
 
OTHER MEASURES OF CENTRAL TENDENCY
 
 
Quantiles
Quantiles are the values of the variable that divide the total number of subjects/observations into ordered groups of equal size.
Names such as “percentiles”, “deciles”, “quintiles”, “quartiles” and “tertiles” are used when the number of divisions is 100, 10, 5, 4, and 3 respectively.
In general, the values dividing subjects into ‘S’ equal groups may be called “s-tiles”.
The total number of s-tiles is (s − 1).
Pth s-tile = (p × n/s) th value in ascending order of magnitude.
Where n = total number of subjects/observations.
For n = 200 subjects
35th percentile = (35 * 200/100) = 70th value
7th decile = (7 * 200/10) = 140th value
3rd quintile = (3 * 200/5) = 120th value
1st quartile = (1* 200/4) = 50th value
2nd tertile = (2 * 200/3) = 133rd value in ascending order of magnitude.
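A quick sketch of this position rule (the helper name is my own), for n = 200:

```python
def stile_position(p, s, n):
    """Position of the p-th s-tile: the (p * n / s)-th value in ascending order."""
    return p * n / s

n = 200
print(stile_position(35, 100, n))  # 35th percentile -> 70th value
print(stile_position(7, 10, n))    # 7th decile      -> 140th value
print(stile_position(3, 5, n))     # 3rd quintile    -> 120th value
print(stile_position(2, 3, n))     # 2nd tertile     -> 133.3, i.e. about the 133rd value
```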
Percentiles and quartiles are of special interest many times.
 
Percentile and Quartiles
A percentile is an observation below which that many percent of the observations lie.
For example: Kth percentile is an observation such that K% observations are less than that observation and (100-K)% observations are greater than that observation.
Subscripts on P serve to distinguish one percentile from another. The ninetieth percentile is designated as “P90”.
Note that the fiftieth (P50) percentile is the median. The twenty-fifth percentile is also referred to as the first quartile and denoted as “Q1”.
The fiftieth percentile (median) is referred to as the second (middle) quartile.
The seventy-fifth percentile is referred to as third quartile.
 
Geometric Mean
‘n’ values are multiplied and then their nth root is taken.
GM = (X1 × X2 × … × Xn)^(1/n)
log GM = (1/n) Sum of log Xi
Xi = ith observation,
i.e. log GM = mean of log values
GM = antilog (mean of log values)
The mean of the logarithms of a set of numbers is equal to the logarithm of the geometric mean of the numbers.
Example: Let X1 = 60, X2 = 40, X3 = 20, X4 = 80, X5 = 40.
Then GM = (60 × 40 × 20 × 80 × 40)^(1/5) = (153600000)^0.2 = 43.379
Using logarithms: log 60 = 1.778, log 40 = 1.602, log 20 = 1.301, log 80 = 1.903, log 40 = 1.602.
The mean of these log values is 1.637.
GM = antilog (1.637) = 43.379
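A quick check of this worked example, computing the geometric mean both directly and via the mean of the logarithms:

```python
import math

x = [60, 40, 20, 80, 40]

gm_direct = math.prod(x) ** (1 / len(x))
gm_via_logs = 10 ** (sum(math.log10(v) for v in x) / len(x))

print(round(gm_direct, 3), round(gm_via_logs, 3))  # both about 43.38
```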
 
Harmonic Mean
HM = n/(1/X1 + 1/X2 + … + 1/Xn)
Example 1: Equal-size populations

Area     Pop. served per doctor
Rural    1000
Urban    500

HM = 2/(1/1000 + 1/500) = 667

For a population of size 50,000 each:

Area     Pop. size    No. of doctors
Rural    50,000       50
Urban    50,000       100
Total    100,000      150

Mean = 100,000/150 = 667

Example 2: Unequal-size populations

Area     Pop. served per doctor    Pop. size    No. of doctors
Rural    1,000                     75,000       75
Urban    500                       25,000       50
Total                              100,000      125

Mean = 100,000/125 = 800
HM = (75,000 + 25,000)/[(75,000/1000) + (25,000/500)] = 800
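A short sketch of the harmonic mean and the weighted check in the example above (population served per doctor, figures from the text):

```python
def harmonic_mean(values):
    return len(values) / sum(1 / v for v in values)

print(round(harmonic_mean([1000, 500])))   # 667: equal-sized populations

# Unequal populations: overall population per doctor, a weighted form of the HM
pops, per_doctor = [75000, 25000], [1000, 500]
doctors = [p / d for p, d in zip(pops, per_doctor)]    # 75 and 50 doctors
print(sum(pops) / sum(doctors))                        # 800.0
```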
 
Weighted Mean/Weighted Average
Type of worker    No. of workers    Wage
Skilled           20                100
Semi-skilled      30                75
Unskilled         50                50
Total             100               225
Weighted averages are used in working out “index numbers”.
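A minimal sketch computing the weighted average for the table above, with the numbers of workers as weights; the unweighted mean of the three wages is shown for contrast:

```python
workers = [20, 30, 50]
wages = [100, 75, 50]

weighted_mean = sum(w * x for w, x in zip(workers, wages)) / sum(workers)
simple_mean = sum(wages) / len(wages)

print(weighted_mean, simple_mean)  # 67.5 versus 75.0: the weights change the answer
```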
 
Measures of Variability
 
Range
It is the interval between the highest and lowest observations.
Example: Diastolic BP of five individuals is 90, 80, 78, 84, 98.
Highest observation is 98
Lowest observation is 78
Range is 98 − 78 = 20 or 98 to 78
 
Standard Deviation (SD)
It is the square root of the mean of the squared deviations of the individual observations from the mean.
Steps in calculations are explained with the help of one example.
Example: Diastolic BP of five individuals is 90, 80, 78, 84, 98.
 
Step-1
Sum = 430
Mean = Sum/number = 430/5 = 86
 
Step-2
Deviations from mean: 4, − 6, − 8, − 2, 12.
 
Step-3
Square of the deviations: 16, 36, 64, 4, 144.
 
Step-4
Sum (of squared deviations) is 264.
 
Step-5
This sum is divided by n − 1: 264/(5 − 1) = 66.
This is called the “variance”.
Note: n − 1 is used in place of n to calculate the sample SD because only then is the SD an “unbiased” estimate of the population SD.
 
Step-6
Square root of this figure (variance) is SD.
SD = √ (66) = 8.12
Note: If, instead of squaring the deviations, we take the absolute deviations (ignoring sign), then the mean of these absolute deviations is called the “mean deviation”.
Sample SD (statistic) is denoted as “s” and Population SD (parameter) by “σ”.
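A small sketch reproducing the step-by-step SD calculation above (diastolic BP values from the example):

```python
values = [90, 80, 78, 84, 98]
n = len(values)

mean = sum(values) / n                                  # 86
deviations = [x - mean for x in values]                 # 4, -6, -8, -2, 12
variance = sum(d ** 2 for d in deviations) / (n - 1)    # 264 / 4 = 66
sd = variance ** 0.5                                    # 8.12

print(mean, variance, round(sd, 2))
```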
Since titers tend to be positively skewed, they are best analyzed using a logarithmic transformation, i.e. the data are converted to log to the base 10. When using a transformation, all analyses are carried out on the transformed values. This includes the calculation of confidence intervals also. When reporting the final results, however, they are often transformed back to the original units by taking antilogs. The antilog of the mean of the transformed values is called the geometric mean (GMT: Geometric Mean Titer). Similarly, the antilog of the SD of the log-transformed data is sometimes called the GSD.
 
Coefficient of Variation (CV)
The interpretation of an absolute measure of variability (for example SD) depends on the scale of the observations. The effect of the magnitude of the observations is removed by standardizing the absolute measure (say SD) by a quantity that is representative of the magnitude (say mean). CV is that measure. That is, Coefficient of variation is the standard deviation (SD) expressed as a percentage of the mean.
Example: For the earlier set of data, CV = (SD/Mean) × 100 = (8.12/86) × 100 ≈ 9.4%.
Since CV is dimensionless (independent of any unit of measurement) it is useful for comparison of variability in two distributions having variables expressed in different units (e.g. height expressed in inches for one distribution and in meters for the other).
Or
For comparison of variability in two different variables in the same group of individuals, or in the same variable having different means in two different groups.
Note: CV can be used only when mean is > 0.
 
Probability
There are two notions of “objective probability”
  1. Classical or a-priori
  2. Relative frequency or posteriori
 
Classical or A-priori Probability
It is based on the presumption that the probability for occurrence of a specified event can be determined by an a-priori analysis in which we can recognize all the “equally possible” events.
If there are a total of ‘n’ mutually exclusive and equally likely outcomes of a trial and if nA of these outcomes have an attribute A, then the probability of A is the fraction nA/n.
 
Mutually exclusive
Two outcomes are said to be mutually exclusive when the happening of one of them precludes the happening of the other.
 
Equally likely
The outcomes are equally likely if they have the same probability of occurrence.
It may be noted that: Probability can be measured on a continuous scale of values between 0 and 1 (both inclusive).
An event that is impossible is said to have a probability of occurrence of ‘0’ and an event that is certain is said to have a probability of occurrence equal to ‘1’.
  1. The more likely the event, closer the probability to one.
  2. The more unlikely the event, closer the probability to zero.
  3. The sum of the probabilities of all possible mutually exclusive events is unity, i.e. one.
 
Laws of Probability
 
Addition Rule
“The probability of occurrence of any number of mutually exclusive events is the sum of component probabilities.”
Two events A and B are mutually exclusive if only one of them (either A or B) can occur in a single trial.
Then, P(A or B) = P(A) + P(B).
Example: A six-faced die is cast. What is the probability of an even number coming up? P(2 or 4 or 6) = 1/6 + 1/6 + 1/6 = 1/2.
 
Multiplication Rule
“The probability of occurrence of two or more independent events is the product of the component probabilities.”
Two events A and B are independent if the occurrence or nonoccurrence of A can in no way affect the probability of occurrence of B. Then, P(A and B) = P(A) × P(B).
Example: If two coins are tossed simultaneously, what is the probability of both being heads? P(both heads) = 1/2 × 1/2 = 1/4.
Various possibilities of number of sons and daughters (M and F) in 2 children families
Possibilities:
2M: MM
1M: MF, FM
0M: FF
 
SUMMARY
No. of M    No. of F    Probability
2           0           1/4 = 0.25
1           1           1/2 = 0.50
0           2           1/4 = 0.25
Total                   1.00
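A tiny sketch (my own enumeration) verifying the summary table by listing all four equally likely birth sequences in two-child families:

```python
from itertools import product
from collections import Counter

sequences = ["".join(p) for p in product("MF", repeat=2)]   # MM, MF, FM, FF
counts = Counter(seq.count("M") for seq in sequences)

for boys in sorted(counts, reverse=True):
    print(f"{boys} sons: probability {counts[boys] / len(sequences)}")
```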
 
CONDITIONAL PROBABILITY
The conditional probability of A given B is equal to the probability that the events occur simultaneously divided by the probability of B (provided that the probability of B is not zero): P(A/B) = P(A and B)/P(B).
Note: If two events A and B are independent, then P(A/B) = P(A) and P(A and B) = P(A) × P(B).
 
Counting Methods
 
Factorials
Given the positive integer ‘n’, the product of all the whole numbers from ‘n’ down through ‘1’, is called “n factorials” and is written as “n!”.
n! = n × (n − 1) × (n − 2) × (n − 3) × … × 1
 
Permutations
A permutation is an ordered arrangement of objects. The number of possible such ordered arrangements is referred to as the number of permutations of ‘n’ things (total no. of things) taken ‘r’ at a time (‘r’ is the no. of objects in the arrangement) and written as “nPr”.
The first place in the permutation may be filled in ‘n’ ways, the second in ‘(n − 1)’ ways from among all the objects which remain, the third similarly in ‘(n − 2)’ ways. We continue in this manner until the final choice is made from the remaining (n – r + 1) objects.
This number (nPr) is n!/(n – r)! in factorial notations.
 
Combinations
A combination is an arrangement of objects without regard to order. The number of possible such arrangements (without regard to order) is referred to as the number of combinations of ‘n’ things taken ‘r’ at a time and written as “nCr”.
We will have r! permutations of each combination of ‘n’ things taken ‘r’ at a time. That is, we have “r!” times as many permutations as combinations. Hence nCr = nPr/r! = n!/[r! (n – r)!].
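A quick sketch of nPr and nCr from the factorial definitions above, checked against Python's built-ins:

```python
import math

n, r = 5, 2
nPr = math.factorial(n) // math.factorial(n - r)   # n! / (n - r)!
nCr = nPr // math.factorial(r)                     # nPr / r!

print(nPr, nCr, math.perm(n, r), math.comb(n, r))  # 20 10 20 10
```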
 
Bayes' Rule
Medical knowledge is sometimes such as to provide probabilities of the form P(complaints/disease), whereas the probabilities actually required in practice are P(disease/complaints). For simplicity and for generalizability, let us denote the set of complaints and investigation results by C and the particular disease by D. When some additional information is available, we can find P(D/C) from P(C/D) by using Bayes' rule.
By the definition of conditional probability, P(C/D) = P(CD)/P(D), which implies that P(CD) = P(C/D) P(D).
Similarly, we also have P(DC) = P(D/C) P(C).
Because DC = CD, we get P(D/C) = P(C/D) P(D)/P(C).
This is called Bayes' rule and helps to convert P(C/D) into the more useful P(D/C). For this conversion, two additional probabilities are required. The first is P(D). This is the prior probability of the disease and is the same as the prevalence of the disease in the population under investigation. This probability is generally available from various reports or books, or can be derived from records. The second is P(C), which is the relative frequency of the complaints in the population. Special efforts may have to be made to compute P(C). In many cases, it may still be found easier to obtain P(C) than P(D/C) directly. Once these probabilities are available, the required probability P(D/C) can be computed from the above equation. This is the posterior probability after the complaints are known.
Example: Suppose in a hospital, 1 in 5000 patients on average is finally diagnosed as a case of abdTB. Thus P(D) = 1/5000 = 0.0002. Suppose further that the complaints of pain in the abdomen, vomiting, and constipation of long duration are seen in 70% of cases of abdTB. Thus P(C/D) = 0.70. If a survey of records of that hospital also shows that these complaints are reported by nearly 1 in 1000 patients with all diseases, then P(C) = 0.001. Therefore, P(D/C) = P(C/D) P(D)/P(C) = (0.70 × 0.0002)/0.001 = 0.14.
Thus, there is only a 14% chance that a random patient reporting in that hospital with those complaints is a case of abdTB. Compare this with P(C/D), which is 70%. P(C/D) is high but has very little diagnostic value when the complaints are nonspecific and can occur in some other conditions also.
The probabilities P(C/D), P(D), and P(C) may be exactly available in a research setup but an estimate has to be made for everyday practice on the basis of knowledge and experience. A practitioner of long standing can have a fair idea of P(C/D), P(D), and P(C) from his experience in his area. Bayes' rule is very useful in estimating ‘predictive’ values from sensitivity and specificity.
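A minimal sketch of the abdTB calculation above (figures from the text: P(D) = 0.0002, P(C/D) = 0.70, P(C) = 0.001):

```python
def posterior(p_c_given_d, p_d, p_c):
    """Bayes' rule: P(D/C) = P(C/D) * P(D) / P(C)."""
    return p_c_given_d * p_d / p_c

print(round(posterior(0.70, 0.0002, 0.001), 2))  # 0.14, i.e. about a 14% chance
```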
 
FEW IMPORTANT PROBABILITY DISTRIBUTIONS
 
Binomial Distribution
Independent trials are made, in which the probability of success at any trial is π, and of failure is (1 – π). What is the probability of r successes in n trials? This provides a reasonable model for many real situations. A trial may be to toss a coin, in which success is to obtain a head and failure is to obtain a tail. Each toss is independent of every other unless the tossing is done in a very careless way. Or a trial may be to throw a die, in which success is to score a 6 and failure is to obtain a score other than 6. Each throw should again be unaffected by what happened at previous throws, i.e. independent of them. In both these cases, there is no reason for the probability π of success to change from one trial to another.
The sexes of individual children within a family of n children can often be considered independent. Each trial is a new birth, success is a girl and failure a boy (success and failure are merely convenient technical words and therefore can very well be reversed!), and the probability π of success is constant. The value of π in this case may not be exactly half, but as a first approximation could be taken so. Suppose there are two successes (S) and one failure (F) in three trials, i.e. n = 3 and r = 2. One possible sequence of results for the three trials is then SFS, and, since the results of each trial are independent of those at other trials, the probability for this sequence is π × (1 – π) × π = π^2 (1 – π). The two other possible sequences are FSS and SSF, and each of these has probability π^2 (1 – π) also. The possible sequences of results are mutually exclusive events, so the probability of r = 2 is the sum of the probabilities of the three sequences, which is 3 π^2 (1 – π).
In the general case, the probability of any one particular sequence that contains r successes and (n – r) failures is π^r (1 – π)^(n – r). The number of different possible sequences of this type is nCr, and so the total probability of r successes in n trials is nCr π^r (1 – π)^(n – r).
 
Definition
A discrete random variable R is said to follow the binomial distribution if
P(R = r) = nCr π^r (1 – π)^(n – r), for r = 0, 1, 2, …, n;
where 0 < π < 1.
The conditions that give rise to a binomial distribution are:
  1. There is a fixed number n of trials;
  2. Only two outcomes, ‘success’ and ‘failure’, are possible at each trial;
  3. The trials are independent;
  4. There is a constant probability π of success at each trial;
  5. The variable is the total number of successes in n trials.
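A small sketch of the binomial probability in the definition above, applied to the family example (two girls, i.e. 'successes', in three births with π = 0.5):

```python
from math import comb

def binomial_pmf(r, n, pi):
    """P(R = r) = nCr * pi^r * (1 - pi)^(n - r)."""
    return comb(n, r) * pi**r * (1 - pi) ** (n - r)

print(binomial_pmf(2, 3, 0.5))                           # 0.375 = 3 * pi^2 * (1 - pi)
print(sum(binomial_pmf(r, 3, 0.5) for r in range(4)))    # the probabilities sum to 1
```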
 
Poisson Distribution
There are situations in which the number of times an event occurs is meaningful and can be counted, but the number of times the event did not occur is either meaningless or cannot be counted. For example, the number of goals scored by a football team in a series of matches can be meaningfully counted but not the number of goals not scored (it is meaningless to think of). Similarly, the number of spells of diarrhea observed in a group of infants over a predetermined period can be counted but not the number of spells that did not occur. The probability of observing one spell, two spells, etc. in a given sample in such cases can theoretically be found out by use of the Poisson distribution.
 
Definition
A discrete random variable R is said to have a Poisson distribution if
P(R = r) = e^(–λ) λ^r / r!, for r = 0, 1, 2, …
Note that the Poisson distribution is a discrete distribution that has an infinite number of possible values. It has a single parameter λ. When we use it as an approximation to the binomial distribution, we must not think of n, π individually, but only as their product, which is λ.
Example: The mean number of misprints per page in a book is 1.2. What is the probability of finding on a particular page: (a) no misprints; (b) three or more misprints?
We might assume that there is a constant probability π of making a mistake with each letter, and that the probabilities of making mistakes with different letters are independent. This latter assumption is more reasonable if we count the common mistake of interchanging two letters as one misprint. The number of misprints on a page is then binomially distributed with n equal to the number of letters on a page (say about 2000) and probability π (say about 0.0006). The value of n is large and that of π is small so we may use the Poisson approximation. We are told that the mean number of misprints per page is 1.2; on the binomial model this is equal to nπ. This is sufficient information to apply the Poisson approximation; we do not need the individual values of n and π, because we only need to replace the product nπ by λ. We thus have λ = 1.2.
Pr (3 or more misprints) = 1 – Pr (0 or 1 or 2 misprints)
= 1 – e^(–λ) [1 + λ + (λ^2/2)] = 1 – e^(–1.2) (1 + 1.2 + 0.72)
= 0.121
For p < 1/2, generally the normal approximation (to Binomial) is adequate if the mean µ = np is greater than 15. But in many applications the event we study is rare so that even if n is large, the mean np is less than 15. The binomial distribution then is noticeably skewed and the normal approximation is unsatisfactory. However, Poisson distribution works well in this situation. Poisson distribution is a limiting form of the Binomial distribution when n tends to infinity and p tends to zero at the same time, in such a way that µ = np is constant.
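A quick check of the misprint example with λ = 1.2 misprints per page:

```python
import math

def poisson_pmf(r, lam):
    """P(R = r) = e^(-lambda) * lambda^r / r!"""
    return math.exp(-lam) * lam**r / math.factorial(r)

lam = 1.2
p_none = poisson_pmf(0, lam)                                      # (a) no misprints
p_three_or_more = 1 - sum(poisson_pmf(r, lam) for r in range(3))  # (b) 3 or more

print(round(p_none, 3), round(p_three_or_more, 3))  # 0.301 and 0.121
```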
 
Gaussian or Normal Distribution (Fig. 2.13)
The binomial and Poisson distributions deal with the occurrence of discrete events such as the number of sick, number of accidents, number of cells, number of coins, etc. However, there are many situations in which the quantity that is studied is continuous in magnitude. For such characteristics, many populations, and also their sampling distributions, are very close to a pattern of frequency distribution known as the Normal (Gaussian) distribution. The mathematical function that generates the probabilities is [1/(σ√(2π))] exp[–(x – µ)^2/(2σ^2)], where x is the variable in question, i.e. the characteristic studied, µ is the mean, and σ is the standard deviation.
Important characteristics of normal distribution are:
  1. The shape of the distribution resembles a bell and is symmetric around the midpoint, (center divides the curve in two equal halves which are mirror images of each other).
  2. The distribution is peaked at the center, where the three important measures of central tendency, namely the mean, median and mode, coincide.
  3. The area under the curve between any two points which correspond to the proportion of observations between any two values of the variate can be found out in terms of a relationship between the mean and the standard deviation.
    • Mean ± 1 SD covers 68.3% of the observations
    • Mean ± 2 SD covers 95.4% of the observations
    • Mean ± 3 SD covers 99.7% of the observations
Fig. 2.13: Normal Distribution
The normal distribution is completely determined by the two parameters, mean (µ) and SD (σ). This implies that the normal distribution is a family of distributions in which one member is distinguished from another on the basis of the values of the mean (µ) and SD (σ). The most important member of this family is the ‘standard normal distribution’, which has a mean of zero and an SD of one.
 
Standard Normal Variate (SNV) or Relative Deviate (RD)
When a variable follows a normal distribution, the ratio [(observation – Mean)/SD] follows a normal distribution with mean equal to zero and SD equal to one (i.e. Standard Normal Distribution) and is called the Standard Normal Variate (generally denoted as ‘Z’).
Any normal distribution can thus be converted into standard form and since the tables (that provide the area under the curve corresponding to Z value) are available, we can find the results in which we might be interested.
 
Practical Application
The pulse rate values of healthy individuals follow a normal distribution with a mean of 72 per minute and an SD of 3.5 per minute. In what percentage of individuals will the pulse rate be 78 or more per minute?
Here Z = (78 − 72)/3.5 = 1.71. The area corresponding to a SNV (or Z) value of 1.71 is 0.0436 (by referring to the standard normal table in the Appendix). That means 4.36% of individuals will have a pulse rate of 78 or more per minute.
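A short sketch of the pulse-rate example; the upper-tail area is computed with the error function instead of the printed table:

```python
import math

mean, sd, x = 72, 3.5, 78
z = (x - mean) / sd                              # (78 - 72) / 3.5 = 1.71
upper_tail = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z) for the standard normal

print(round(z, 2), round(upper_tail, 4))  # 1.71 and ~0.0432 (table value 0.0436 for Z = 1.71)
```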

3. Sample to Population

 
INTRODUCTION
The idea of sampling is neither new nor unfamiliar in everyday life. A person examining a handful of grains from a sack, a cook examining a few grains of rice to find out whether they are properly cooked, someone examining a few tubes of water to decide the quantity of bleaching powder needed for chlorination of all the water in a well (by Horrock's test), the physician performing a blood test—all these are employing the method of sampling. They do so with extreme confidence because they have good reason to believe that the material they are sampling is so homogeneous or well mixed that the sample will adequately represent the whole. But in many situations the material is not so homogeneous. Therefore, we need to take the help of sampling theory in statistics to select a sample so that it becomes easy to know about the population.
 
Advantages of Sampling
The aggregate of all units or whole of material is called as population. A portion or a part of such population is called as a sample. Principal advantages of sampling as compared with complete enumeration are:
  1. Reduced Cost: If data are secured from only a small fraction of the aggregate, expenditures are smaller than if a complete census is attempted.
  2. Greater Speed: The data can be collected and summarized more quickly with a sample than with a complete count.
  3. Greater Accuracy: Because personnel of higher quality can be employed and given intensive training and because more careful supervision of the fieldwork and processing of results become feasible when the volume of work is reduced, a sample may produce more accurate results.
  4. Greater Scope: In certain types of enquiries highly trained personnel or specialized equipment, which are limited in availability, must be used to obtain the data. A complete census is impracticable and so the choice is between obtaining the information by sampling or not at all. Thus, surveys based on sampling have more scope and flexibility regarding the types of information that can be obtained.
 
Probability and Nonprobability Sampling
A probability sampling method is one in which the selection of a sample is governed by ascertainable laws of chance. In other words, in a probability sampling scheme each member of the population has a known (and non-zero) chance of being selected into the sample. Probability sampling methods are also called random (or statistical) sampling methods. As against this, methods that are not based on probability are nonrandom sampling methods.
Following are some common types of non-probability sampling:
  1. Accessible: The sample is restricted to a part of the population that is readily accessible. Ex. A sample of coal from an open wagon may be taken from the top portion.
  2. Haphazard: The sample is haphazardly (without any conscious planning) chosen, e.g. picking from a large cage in the laboratory those 10 rabbits that the hands rest on.
  3. Typical/Judgment: Sample of typical units or sample of units of average type by judgment.
  4. Volunteers: The sample of volunteers.
These methods may give useful results, but when the sample has been obtained by any of these methods there is no way of knowing how accurate the estimate is. They are not necessarily inaccurate, but if they are accurate the accuracy is usually unknown. On the other hand, what gives probability sampling an advantage over many other ways of choosing a part of the population (a sample) is that, when the estimates of the population characteristics are made from the sample results, the precision (probable accuracy) of these estimates can also be gauged from the sample results themselves.
 
TERMINOLOGY
A few new terms frequently used while discussing sampling theory are as follows. The aggregate of all units is called the population. The basic elements in the population are called elementary units. A portion of the population is called a sample. The population from which the sample is selected is called the sampled population, but the population about which information is wanted is called the target population. Information is sought from units called sampling units. Generally the sample selection is done on the basis of these sampling units. They could be elementary units or groups of elementary units, but are convenient for sampling besides being clearly defined, identifiable and observable. For example, in a family expenditure survey, the family is the sampling unit and each individual is an elementary unit. The list of all sampling units useful for drawing up a sample is called the sampling frame. The number of sampling units in the population is termed the population size (N) and the number of sampling units selected in a sample is termed the sample size (n). The ratio of sample size to population size, i.e. n/N, is termed the sampling fraction. Sometimes the term "universe" is used to refer to the set of units in the population and "population" to the set of measurements associated with those units. Generally a capital letter is used to denote a population characteristic and the same letter in lower case is used to indicate the same characteristic of the sample.
 
Principal Steps in a Sample Survey
The steps involved in the planning and execution of a sample survey are:
  • Specification of the objectives
  • Definition of the population
  • Selection of proper sampling design and choice of sampling unit
  • Questionnaire or study schedule preparation
  • Degree of precision desired and sample size determination
  • Construction of the frame
  • Pilot Survey or pre-test
  • Organization of the field work
  • Dealing with nonrespondents
  • Analysis of the data
Objectives of the sample survey should be clearly specified and all members of the participating team should be aware of them. The population to be sampled (the sampled population) should coincide with the target population. Sometimes, for reasons of practicability or convenience, the sampled population is more restricted than the target population. If so, it should be remembered that conclusions drawn from the sample apply only to the sampled population. There are many sampling designs by which the sample may be selected; depending on the applicability of the design in a given situation, an appropriate design should be selected. Choice of a suitable sampling unit is also important. Preparation of the questionnaire (which is self-administered or mailed, the respondent himself filling in the answers) or the study/recording schedule (on which an interviewer asks the questions and records the answers) is a tricky job, as it is largely a matter of art rather than science. The results of sample surveys are always subject to some uncertainty because only part of the population has been measured and because of errors of measurement. This uncertainty can be reduced by taking larger samples and by using superior instruments of measurement, but this usually costs time and money. Consequently, the specification of the degree of precision wanted in the results is an important step, and an appropriate sample size should be determined to achieve this precision.
The construction of the list of sampling units in the population is an important task. Although it is sometimes one of the major practical problems, it is essential that the frame is complete (i.e. it covers the whole of the population) and that the sampling units do not overlap. It has been found useful to try out the questionnaire and the field methods on a small scale. This nearly always results in improvements in the questionnaire and may reveal other troubles that would be serious on a large scale. The personnel must receive training in the purpose of the survey and in the methods of measurement to be employed, and must be adequately supervised in their work. Plans must be made for handling nonresponse. There can be various reasons for nonresponse, and there are a few methods to deal with it. Completed questionnaires/study schedules are edited in the hope of amending recording errors and deleting data that are obviously erroneous. The further steps of data processing, analysis of the data, interpretation of results and report writing also need to be carried out carefully.
 
Probability Sampling Designs
There are many probability sampling designs but only a few important ones are described here. The most basic of all probability sampling designs is simple random sampling. In simple random sampling each possible sample of n different units has an equal chance of being selected. This implies that every member of the population has an equal chance of selection into the sample. In other words, if every member of the population (of size N) is given an equal chance of selection into the sample, this ensures that all possible samples (of size n) get an equal chance of selection. The number of possible samples is NCn and so the chance of selection of any one sample is 1/(NCn). Now consider one distinct sample, that is, one set of n specified units. At the first draw the probability that some one of the n specified units is selected is n/N. At the second draw the probability that some one of the remaining (n − 1) specified units is drawn is (n − 1)/(N − 1), and so on. Hence the probability that all n specified units are selected in n draws is:

(n/N) × [(n − 1)/(N − 1)] × … × [1/(N − n + 1)] = 1/(NCn)
To select a sample by this method, prepare chits of the same size and similar appearance, each bearing one number from 1 to N, mix them thoroughly and select one chit. The individual corresponding to the selected number is taken into the sample. In this way each individual has the same chance of selection. Next, without replacing the first chit, select a second one, so that the remaining N − 1 individuals have an equal chance of entering the sample. Repeat this process till you get a sample of the desired size. This procedure is called simple random sampling without replacement and the method of selection is called the lottery method. The other method of selecting a sample is by using random number tables. The members of the population are numbered from 1 to N and n numbers are selected from one of the tables in any convenient and systematic way. This group of n corresponding subjects becomes the sample.
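As an illustration only (not part of the original text), the lottery and random-number selection just described can be sketched in Python; the population size, sample size and seed below are arbitrary choices.

    import math
    import random

    N, n = 20, 5                               # illustrative population and sample sizes
    frame = list(range(1, N + 1))              # sampling frame: units numbered 1 to N

    rng = random.Random(2010)                  # fixed seed so the draw can be reproduced
    sample = rng.sample(frame, n)              # draw n units without replacement (the lottery method)

    print("selected units:", sorted(sample))
    print("possible samples, NCn:", math.comb(N, n))
    print("chance of any one sample:", 1 / math.comb(N, n))

Each of the math.comb(N, n) possible samples is equally likely under this procedure, which is exactly the defining property of simple random sampling noted above.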
The accuracy of a sample estimate refers to its closeness to the correct population value and since generally the population value is not known, the actual accuracy of the sample estimate cannot usually be assessed, though its probable accuracy, which is termed precision, can be.
This sampling procedure (simple random sampling without replacement) will yield a sample mean x̄ which, when averaged over all the possible samples that can occur, is exactly equal to µ. The characteristic that E(x̄) = µ is called the property of "unbiasedness". The sample mean is called an unbiased estimator of the population mean.
For a given sampling design, the estimator is the method of estimating the population parameter from the sample data. An estimate is the value obtained by using the method of estimation for a specific sample. So if, for the given sampling design, the expected value of the estimator is equal to the population parameter, the estimator is called unbiased; if not, it is called biased. The difference between the expected value and the true population value is termed the bias.
The sample variance calculated with a divisor of n (i.e. the sum of squared deviations from the sample mean divided by n) is not an unbiased estimator of the population variance; the divisor n is the source of this bias. However, the sample variance s² calculated with a divisor of (n − 1) is an unbiased estimator of the population variance S² calculated with a divisor of (N − 1), that is, E(s²) = S². This is the reason why the divisor (n − 1) is frequently used for the sum of squares of deviations of measurements from their mean.
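The unbiasedness of the sample mean and of s² (with divisor n − 1) can be checked by brute force on a small, purely hypothetical population, since every possible sample can be listed. A minimal sketch, assuming made-up measurements on N = 8 units:

    from itertools import combinations
    from statistics import mean

    population = [12, 15, 9, 21, 18, 10, 14, 17]   # hypothetical values for N = 8 units
    N, n = len(population), 3
    mu = mean(population)
    S2 = sum((y - mu) ** 2 for y in population) / (N - 1)    # population variance, divisor N - 1

    samples = list(combinations(population, n))              # all NCn equally likely samples
    xbar_avg = mean(mean(s) for s in samples)
    s2_avg = mean(sum((y - mean(s)) ** 2 for y in s) / (n - 1) for s in samples)

    print(round(xbar_avg, 6), mu)                 # the average of the sample means equals mu
    print(round(s2_avg, 6), round(S2, 6))         # the average of s^2 equals S^2 (unbiasedness)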
Although the sample mean is unbiased, the estimates from different samples differ somewhat from each other and from µ. We need a measure of the extent to which the estimates derived from different samples are likely to differ from each other, in other words, a measure of the spread of the sampling distribution. The most usual measure of the spread of any distribution is the variance (or its square root, the standard deviation). Var(x̄) is in fact the expectation of the squared deviation, E(x̄ − µ)². Var(x̄) can be alternatively computed by the formula:

Var(x̄) = (1 − n/N) × S²/n

Since this formula puts the sampling variance, Var(x̄), in terms of the values n, N and S² (the dispersion in the population), the calculation of Var(x̄) has been shifted from requiring knowledge of the sampling distribution to requiring knowledge of the population.
In the formula for Var(x̄), the multiplier (1 − n/N) is called the finite population correction (fpc) factor. If the population size, N, is very large relative to the sample size, n, the fpc is essentially equal to one and the formula for Var(x̄) reduces to S²/n. If the sample size is equal to the size of the population (i.e. n = N) then Var(x̄) = 0. In this case, zero variance is reasonable as then all the units of the population will have been observed and the sample mean x̄ will be exactly equal to the population mean (µ).
We have already noted that S² can be estimated unbiasedly by s², and so we substitute s² for S² in the formula for Var(x̄) to give var(x̄) = (1 − n/N) × s²/n as an estimate of Var(x̄). It is easy to verify that, over all possible samples, var(x̄) is an unbiased estimate of Var(x̄). This unbiasedness follows directly from the fact that s² is unbiased for S² and the fact that the factor (1 − n/N) × 1/n is the same for all samples.
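Continuing the hypothetical N = 8 population used in the earlier sketch, the formula Var(x̄) = (1 − n/N) × S²/n can be checked directly against the enumerated sampling distribution:

    from itertools import combinations
    from statistics import mean, pvariance

    population = [12, 15, 9, 21, 18, 10, 14, 17]   # same hypothetical population as before
    N, n = len(population), 3
    mu = mean(population)
    S2 = sum((y - mu) ** 2 for y in population) / (N - 1)

    var_formula = (1 - n / N) * S2 / n             # (1 - n/N) * S^2 / n, including the fpc factor

    xbars = [mean(s) for s in combinations(population, n)]
    var_direct = pvariance(xbars, mu)              # E(xbar - mu)^2 over all possible samples

    print(round(var_formula, 6), round(var_direct, 6))   # the two values agree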
The square root of Var(x̄), that is, the standard deviation of the sampling distribution, is called the standard error of the mean. The square root of the variance is frequently taken because it expresses the spread in the original, basic units of measurement. The standard error measures the fluctuations of the estimates around their own mean. If the estimator is biased, then the expected value of the estimator is not equal to the population value, and this bias is not included in the standard error. There is a measure, called the mean square error (MSE), which combines sampling variance and bias:
MSE = Var(x̄) + (Bias)²

showing that the variability around the population value (i.e. the MSE) is equal to that around the expected value, Var(x̄), plus the square of the bias.
 
Disadvantages of Simple Random Sampling
There are some disadvantages of the simple random sampling method, and these disadvantages lead to alternative sampling methods. Important disadvantages and the corresponding alternative methods are tabulated in Table 3.1.
Table 3.1   Disadvantages of simple random sampling method and alternatives

  Disadvantage                                              Name of the alternative method
  1. Requires a complete frame                              Systematic sampling
  2. Does not ensure representation of specific groups      Stratified sampling
  3. May lead to selection of units from far-flung areas    Cluster sampling
  4. Not suitable for sampling a very large population      Multi-stage sampling
The alternative sampling designs mentioned here have a few more advantages beyond overcoming the respective disadvantages of simple random sampling given in the table. There are in fact many more sampling designs available, but only a few important ones, namely the systematic, stratified, cluster and multi-stage designs, are described briefly here.
 
Systematic Sampling
In this method the first unit of the sample is selected at random and the subsequent units are selected in a systematic way (as the name itself indicates). If there are N units in the population and n units are to be selected, let K be the integer part of the quotient obtained on dividing N by n. One number is selected at random out of the first K numbers, and the remaining units are selected by repeatedly adding K to the previously selected number. If the first selected number is r, the other units selected are the (r + K)th, (r + 2K)th and so on. The quotient K is known as the sampling interval.
This method is useful when the total size of the population is known but the other particulars of the units are not. This type of sampling can also be adopted when selecting samples from patients attending a clinic, where the sampling frame cannot be prepared in advance. The main advantage of this method is that it is easier to draw the sample and often easier to execute without mistakes. If the order of the population (or list) is random with respect to the measurement (or the characteristic in which we are interested), then a systematic sample can be analyzed just as if it were a completely random sample (i.e. a simple random sample).
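A minimal sketch of the selection rule described above, assuming a population of N = 100 listed units and a required sample of n = 10; the seed is arbitrary.

    import random

    def systematic_sample(N, n, rng=random):
        # K = integer part of N/n is the sampling interval; r is the random start.
        K = N // n
        r = rng.randint(1, K)
        return [r + i * K for i in range(n)]   # 1-based unit numbers r, r + K, r + 2K, ...

    print(systematic_sample(100, 10, random.Random(1)))
    # a random start of r = 3, for example, would give units 3, 13, 23, ..., 93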
 
Stratified Sampling
In this method the entire population is divided into certain homogeneous subgroups depending upon the characteristic(s) to be studied, and then a simple random sample is drawn independently from each subgroup. These subgroups are known as strata. The strata are formed such that they are homogeneous with respect to the characteristic under study. For stratified sampling to be efficient, between-strata variability should be high. This type of sampling is used when the population is heterogeneous with regard to the characteristic(s) under study.
By this type of sampling the precision of the estimate of the characteristic under study is increased, and due representation of the population is also maintained. Another advantage of this type of sampling is that estimates of the characteristics under study can be made for each stratum separately. Therefore, when data (or estimates) of known precision are wanted for certain subdivisions of the population, we should use stratified sampling.
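As a sketch only, stratified selection with proportional allocation (the same sampling fraction in every stratum, one common choice) might look like this; the stratum names and sizes are invented.

    import random

    strata = {
        "urban": list(range(1, 41)),         # 40 hypothetical units
        "semi-urban": list(range(41, 71)),   # 30 units
        "rural": list(range(71, 101)),       # 30 units
    }

    rng = random.Random(7)
    fraction = 0.10                          # proportional allocation: 10% from every stratum

    sample = {
        name: sorted(rng.sample(units, max(1, round(fraction * len(units)))))
        for name, units in strata.items()    # independent simple random sample within each stratum
    }
    print(sample)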
 
Cluster Sampling
In this method the sampling unit is a group (called a cluster) of subunits. These groups (clusters) are selected randomly by giving an equal chance to every cluster, and then all the subunits in the selected groups (clusters) are completely enumerated. No further sampling is done within the clusters. This method is advocated in a situation where the units within the clusters are heterogeneous with respect to the characteristic under study. For cluster sampling to be efficient, between-cluster variability should be low. Note that this situation is exactly opposite to what is needed for stratified sampling.
If the clusters are of unequal size then the best way is to choose clusters with probability proportional to size (PPS). If one cluster is twice as large as another, it is given twice the chance of being selected. The method is then self-weighting and the estimation procedures of SRS are applicable.
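A rough sketch of PPS selection of clusters using the standard library; the cluster names and sizes are invented, and for simplicity the draw here is with replacement (a real design would usually use a without-replacement or systematic PPS scheme).

    import random

    clusters = {"A": 120, "B": 60, "C": 240, "D": 180}   # hypothetical cluster sizes

    rng = random.Random(11)
    names = list(clusters)
    sizes = [clusters[c] for c in names]

    chosen = rng.choices(names, weights=sizes, k=2)      # a cluster twice as large gets twice the chance
    print(chosen)    # every subunit of each chosen cluster would then be enumerated completely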
 
Multistage Sampling
The method of selecting a sample of subunits from a selected cluster is called subsampling. The process of subsampling can be carried to any stage; therefore this design is called multistage sampling. Multistage sampling is useful when the population units are arranged in hierarchical groups. For example, farmers in India may be grouped into various states, then into districts within states, talukas/tahsils within districts, villages within talukas/tahsils and then farmers within villages. Multistage sampling of this population (of farmers) would consist of first randomly selecting a subset of states, then, within these selected states, selecting a subset of districts, then selecting a subset of talukas/tahsils from these selected districts, then selecting a subset of villages from these selected talukas/tahsils and finally selecting a sample of farmers within the villages. In such a scheme the states are referred to as primary sampling units, the districts as secondary units and so on. Each of these levels contributes a component to the total sampling variance. Efficient sample design depends heavily on knowledge of the magnitude of these different components.
 
Other Sampling Designs
Multiphase sampling is a design in which some information is collected from every unit and additional information is collected (either at the same time or later) from a subsample. Note the difference from multistage sampling (in which different types of sampling units are sampled at different stages): in multiphase sampling the same type of sampling unit is used at each phase, but some units are asked for more information than others.
Sampling in which maps rather than lists serve as the sampling frame is known as area sampling. If repeated samples of the same area or population are to be taken, there are advantages in preparing some sort of master sample from which subsamples can be taken as and when required. It is occasionally useful to split a sample into two or more independent parts, each forming a self-contained and adequate subsample of the population. These are called interpenetrating samples.
In quota sampling the general composition of the sample, e.g. in terms of age, sex, social class, etc. is decided and the quota assignments are allocated to interviewers. The choice of the actual sample units to fit into this framework is left to the interviewers. Quota sampling is therefore a method of stratified sampling in which the selection within strata is nonrandom. It is this nonrandom element that constitutes its greatest weakness.
In the panel method, the aim is to collect data on broadly the same questions from the same sample on more than one occasion. The panel begins as a randomly selected sample of the survey population. Information is then sought from this sample at regular intervals, either by mail or by personal interview. For sampling of a rare item, a method called inverse sampling is used. In this method, sampling is continued until a pre-determined fixed number, say m, of the rare items has been found in the sample. If n is the sample size at which the mth rare item appears (m greater than one), an unbiased estimate of P is p = (m − 1)/(n − 1). For N (population size) very large, P small and m > 10, a good approximation to the variance of this estimator is [m P²(1 − P)]/(m − 1)².
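The inverse sampling estimate and its approximate variance (with the estimate substituted for P) can be wrapped in a small helper; the numbers in the example call are invented.

    def inverse_sampling_estimate(m, n):
        # Sampling continued until the m-th rare item appeared, at sample size n.
        p = (m - 1) / (n - 1)                       # unbiased estimate of P, as in the text
        var = m * p ** 2 * (1 - p) / (m - 1) ** 2   # approximation for large N, small P, m > 10
        return p, var

    p_hat, var_hat = inverse_sampling_estimate(m=12, n=600)
    print(round(p_hat, 5), round(var_hat, 8))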
The term Exploratory Study is often applied to a descriptive survey designed to increase the investigator's familiarity with the problem he wishes to study or to describe the situation. The aim may be to formulate a problem for more precise investigation, to develop hypotheses or clarify concepts. A Pilot Study is a dress rehearsal of an investigation performed in order to identify defects in the study design. The term Household Survey usually refers to a descriptive survey of illnesses and disability performed by interviewing persons in their own homes, often by questioning a single informant about other members of the household. KAP Studies are studies of Knowledge, Attitudes and Practices.
 
Sampling and Nonsampling Errors
The error arising due to drawing inferences about the population on the basis of observations on a part (sample) of it is termed sampling error. Clearly the sampling error in this sense is nonexistent in a complete enumeration (i.e. census) survey, since the whole population is surveyed.
The sampling error will be high if (1) a faulty sampling design is used or (2) the size of the sample is small. It usually decreases with increase in sample size, and in fact in many situations the sampling error is inversely proportional to the square root of the sample size.
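A two-line sketch of that 1/√n relationship (the same shape plotted in Figure 3.1), assuming an arbitrary population standard deviation and ignoring the finite population correction:

    import math

    S = 10.0                                   # assumed population standard deviation (illustrative)
    for n in (25, 100, 400, 1600):
        print(n, round(S / math.sqrt(n), 2))   # quadrupling n only halves the sampling error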
From Figure 3.1, it can be seen that though the reduction in sampling error is substantial for initial increases in sample size, it becomes marginal after a certain stage. The sampling error can be reduced by using an appropriate sampling design and an adequate sample size, and it can be estimated from the sample. Nonsampling errors possibly exist both in sample surveys and in complete enumeration surveys (i.e. censuses).
Fig. 3.1: Relationship between sample size and sampling error
The amount of such errors could be more in the case of a complete enumeration survey than in a sample survey, because it is possible to reduce nonsampling errors to a great extent in the latter by using better organization and suitably trained personnel at the field and tabulation stages. But nonsampling error(s) cannot be estimated. These errors can be placed in six categories as follows:
 
Coverage Errors
These are caused by failure to sample the target population adequately. This failure may arise through inadequacy of the sampling frame, because some part of the population is inaccessible, or due to resource limitations. It can be reduced by making sure that the sampling frame consists of all units, i.e. no omissions and no duplications.
 
Response Errors
These could be due to nonresponse (because the respondent is not available, not at home, or may have moved or died) or wrong response (due to wrong beliefs, for prestige, because questions are sensitive, or due to recall bias). They could be reduced by contacting nonrespondents through intensive efforts and by assuring respondents that the information given by them will not be used for any purpose other than the study. One method known as "Randomized Response" for dealing with sensitive questions is very popular.
 
Selection Bias
This occurs when the real probabilities of selection of some units are different from what they are intended to be. This could be reduced by more alertness at the time of selection.
 
Observational Errors
These may be caused by the fault of the investigator (inaccurate recording, imperfect or misunderstood instructions, noncompliance with instructions, bias for or against a question, failure to convey the question, etc.), by an imperfect measuring instrument or questionnaire, by the person or object observed, or by the interaction of two or more of these factors. They can be reduced through training of the investigators and by using better instruments. Pretesting of the questionnaire is of great help.
 
Processing Errors
This category includes errors at tabulation or summarization stage, errors in method of statistical analysis, clerical errors in copying material, data errors and computational errors in data processing. These can be controlled by suitable administrative action.
 
Publication Errors
This includes printing errors or failure to point out the limitations.
 
Data Collection Tools
There are a few regular systems of data collection, where data are collected routinely, for example registration of births and deaths and medical records. But many times data are collected on an ad-hoc basis. A census (where data on all the individuals in the population are collected) is a special case, which is not completely ad-hoc as the census is conducted fairly regularly (in India every 10 years), but it is not a regular system of data collection in the sense that the operation is not a continuous one. Sometimes a dual system (both systems in combination) is used; for example, in the Sample Registration System (SRS), operational in India since the mid-1960s, registration of vital events is done continuously, but the same information is also collected through a survey (carried out by a person different from the one responsible for the continuous recording) every six months.
Basically there are three methods of data collection:
  1. Observation: Observation could be just visual or with the help of an instrument. It could be done completely by a mechanical/electronic apparatus, by a human being, or by a combination of the two. Observation can fairly be called the classic method of scientific enquiry. The accumulated knowledge of biologists, physicists, astronomers and other natural scientists is built upon centuries of systematic observation. But observation as a method of collecting survey data is quite another matter, for it is not sufficient that the subject matter is available to be observed. The method must be suitable for investigating the problem and appropriate to the population of study. It is rarely the most appropriate method for studying opinions and attitudes.
  2. Interview or self-administered Questionnaire: Interviews may be more or less structured (the wording and order of the questions being decided in advance). A questionnaire usually means a self-administered questionnaire, and the term is not applied to the interview schedule used by an interviewer. An interview could be semistructured in the sense that a check list (of topics to be covered) is prepared in advance, but it is not decided in advance exactly what questions will be asked, or in what order.
    Standardized methods of asking questions are usually preferred in community medicine research, since they provide more assurance that the data will be reproducible. The field investigator asks the questions one by one and records the responses simultaneously on what can be called a study schedule (or recording schedule). It is necessary to make clear which questions are to be answered by whom. Sometimes whole sections concern only a subclass of respondents, in which case this should be quite clear to whoever is recording the answers. Before a questionnaire (or a study schedule) is constructed, the variables it is designed to measure/record should be listed. Then suitable questions should be formulated. Although there are no rules or fixed guidelines for the construction of a questionnaire or a study schedule, the following points may be found useful.
    Any question should have relevance to the topic and the objective of the study. The questions should be short, simple (easy to understand), non-offending and corroborative, and they should be well planned and arranged in a logical sequence (e.g. "Do you have children?" should not come before "Are you married?"). If the questionnaire includes both specific and general questions about attitudes, the general questions should come first, since specific questions tend to be answered in the same way wherever they are placed.
    The first few questions should be easy to answer but important and possibly interesting. Difficult questions, which may occasion embarrassment or resentment, should be left until later; even questions about age, education, etc. are sometimes left to the end for this reason. The temptation is always to cover too much, to ask everything that might turn out to be interesting. This must be resisted. Lengthy questionnaires are demoralizing for the interviewer and for the respondents as well. The questionnaire should be no longer than is absolutely necessary for the purpose.
    Three types of questions are distinguished: factual, opinion and knowledge. Survey interviewers are allowed to rely on their own discretion for probing on factual questions, but with opinion and knowledge questions they are allowed less freedom in probing: they may use only survey stock phrases and they must not invent their own probes. Any question could be either of open or closed type. A typical example of the contrast is:
    Open: What sports and other physical activities do you undertake each week on a regular basis (at least 30 minutes)?
    Closed: For each of the following sports tick the box if you regularly spend more than 30 minutes each week in that activity.
    • Walking
    • Jogging
    • Cycling
    • Swimming
    • Racket sports
    In the first situation, the subject is being asked to consider what activities he or she undertakes that would fit this description and to write them down. In the alternative formulation, the subject is presented with a list of options and makes a decision (yes or no) about each one. The decision about which is the better approach is not absolutely clear. Subject recall may be enhanced by the closed approach; for example, walking and cycling may not be considered by some subjects as "physical activities" and thus are ignored. By contrast, the open question can detect an enormous number of activities, which would be impossible in the closed approach without an overlong survey form, much of which would be left blank.
    Unlike social science surveys, most epidemiological surveys adopt the closed question approach as the most efficient approach for data handling and analysis. Nevertheless, a semi-closed approach (where the last option in the closed format is "other – please specify" and is kept open) is more advisable, since the advantages of both approaches are then achieved.
  3. Use of the documentary sources: Documents are frequently the only or the most convenient source of information at the investigator's disposal. However, secondary data (obtained by using any documents, for example – census records, medical records) should always be used with circumspection.
    Data collected by methods 1 or 2 above are called primary data. There are two properties of any method of data collection: reliability and validity. Reliability (also termed reproducibility or repeatability) refers to the stability or consistency of information, i.e. the extent to which similar information is supplied when a measurement is performed more than once. The validity of a measure is the adequacy with which the method of measurement does its job – how well does it measure the characteristic that the investigator actually wants to measure? If a measure is not reliable, this must reduce its validity; but if reliability is high the measure is not necessarily valid. That is, a measure or a method could be completely reliable but not valid. For example, suppose for measuring length we use a measuring tape that is not calibrated correctly. This measuring tape yields a reading that is an underestimate by 2 inches for every 4 ft, i.e. for an actual length of 4 ft the reading is about 3.8 ft. The tape is completely reliable, as it yields the same reading for a given object on repeated measurements; however, it is not valid, as it does not measure the length correctly. It is more meaningful to consider validity in relation to the conceptual definition of the variable, i.e. in relation to what the investigator would like to measure. A measure or a data collection method could be of different validity in different groups or populations. If we contemplate the use of a study method that others have developed and validated, we should satisfy ourselves of its validity in our own study population.
 
Pretests and Pilot Surveys
It is difficult to plan a survey without a good deal of knowledge of its subject matter, the population it is to cover, the way people will react to questions and even the answers they are likely to give (though this sounds paradoxical). How is one to estimate how long the survey will take, how many interviews will be needed (i.e. the sample size), how much it will cost, etc. if one has not done some part of it? How, without trial interviews, can one be sure that the questions will be as meaningful to the average respondent as to the survey expert? How is one to decide which questions are worth asking?
To get answers to such questions, it is necessary to conduct a series of small "pretests" on isolated problems of the design and then, when the broad plan of the enquiry is established, a pilot survey, which is a small-scale replica of the main survey. A pilot survey is in fact like a dress rehearsal. Pretests and pilot surveys are standard practice as they provide guidance on:
  1. The adequacy of the sampling frame from which it is proposed to select the sample. Testing the completeness, accuracy, up-to-dateness and convenience of the sampling frame is essential.
  2. The variability, with regard to the subject under investigation, within the population to be surveyed. This is of importance in determining an efficient sampling design. Moreover decision on sample size requires some knowledge of the variability of the population.
  3. The nonresponse rate to be expected.
  4. The suitability of the method of collecting the data.
  5. The adequacy of the questionnaire/study schedule. This is probably the most valuable function of the pretest or pilot survey. Several points will have to be watched: the ease of handling the questionnaire in the field, the efficiency of its layout, the clarity of the definitions and the adequacy of the questions themselves (is the wording simple, clear, unambiguous, free from technical terms?).
  6. The efficiency of the instructions and general briefing of interviewers. The pretests and pilot survey are, of course, themselves part of the interviewer training.
  7. The codes chosen for precoded questions. In an open question the respondent is given freedom to decide the aspect, form, detail and length of his answer, but in the case of precoded questions either the respondent is given a limited number of answers from which to choose, or the question is asked as an open question and the interviewer allocates the answer to the appropriate code category.
  8. The probable cost and duration of the main survey and its various stages.
  9. The efficiency of the organization in the field, in the office and in the communication between the two.
The pilot survey is the researcher's last safeguard against the possibility that the main survey may be ineffective. The size and design of the pilot survey is a matter of convenience, time and money. However, it should be large enough to fulfill the above functions. The number of subjects investigated does not need to be large. Nevertheless, ample time and resources spent at this stage will be rewarded by a greatly enhanced chance of a successful study outcome.
 
Sample Size
Sample size requirement is the first question when designing a new survey. However, estimation of sample size requires specification of an estimator and method of sampling, that is, sampling design. It is not very difficult to estimate sample size if simple random sampling is used to select a sample but if a different sampling design is used then the sampling variance formula would be different and a different sample size would achieve the desired accuracy.
The length of confidence intervals depends directly upon the variance of the sampling distribution. Efficient design of sampling studies therefore involves searching for methods that have sampling distributions with smaller variance. If carefully coordinated with the grouping of the population, a good selection procedure can realize a very large reduction in variance. Much statistical literature gives the impression that increasing sample size is the only way to reduce variance. However, the magnitude of the potential gains implies that the method of sample selection (i.e. selection of an appropriate sampling design) should be anybody's first choice for reducing variance.
 
General Procedure of Determining Sample Size
Let the population parameter under estimation be denoted by M and its sample estimate by m. Let δ be the absolute difference between the two, i.e. δ = |M − m|. Suppose the investigator requires that this difference should not exceed a specified limit L in at least (1 − α)100% of repeated samples. The quantity L is called the precision and (1 − α) the confidence level; α is the chance of being wrong. If a Gaussian (i.e. normal) form of distribution is assumed valid, then L = 2*SE(m), where the coefficient 2 (instead of the precise 1.96) comes from the fact that for the standard Gaussian distribution the interval (−2, 2) covers nearly 95% of the probability. For other values of α the table of probabilities of the Gaussian curve needs to be consulted to find a cut-off Z such that the probability between −Z and Z is (1 − α). The formula L = 2*SE(m) is basic for the calculation of sample size. SE(m) would invariably have n in the denominator, which can then be worked out when the other values are known. A difficulty is that SE(m) would many times contain unknown parameters such as a population variability measure σ or a population proportion π. This has to be substituted by its estimate, which may be taken from a previous study or estimated from a pilot study; sometimes it even involves guesswork.
For estimating a proportion (the formula for testing equality of two population parameters, for quantitative as well as qualitative variables, is given in the chapter on clinical trials), the above formula for calculating the required sample size reduces to

n = 4p(1 − p)/L²

where L is the precision, i.e. we want the confidence interval to extend L units on either side of the estimate, and p is the expected or assumed proportion (or percentage, with L on the same scale) of the characteristic in the population. Thus, there is a 95% chance that a 95% CI calculated with our estimate will include the real population proportion P.
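A small helper for this calculation, using the coefficient 2 from the text (z = 1.96 could be substituted for a slightly smaller n); the prevalence and precision in the example are arbitrary.

    import math

    def sample_size_for_proportion(p, L, z=2.0):
        # n = z^2 * p * (1 - p) / L^2, rounded up to the next whole subject
        return math.ceil(z ** 2 * p * (1 - p) / L ** 2)

    # Expected prevalence 20%, precision of +/- 5 percentage points:
    print(sample_size_for_proportion(p=0.20, L=0.05))   # 256 with z = 2 (about 246 with z = 1.96)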
 
Validity of Assumptions while Determining Sample Size
Main assumptions while calculating sample size are:
  1. The sampling method used is simple random sampling.
  2. Proportion or variability in the population is known.
  3. Only one variable is of interest.
    Any or all of these assumptions could be wrong, and the error in the estimate of the required sample size will increase in proportion to the departure of these assumptions from reality. Very large samples (as large as possible) are needed for estimation. However, for testing of hypotheses, the sample size should be just appropriate. To calculate the appropriate sample size there are numerous methods, but it is essential to consider the validity of the assumptions involved while using any of them.
 
The Design Effect
Simple random sampling generally acts as a useful basis of comparison for all varieties of random sampling. The ratio of the variance of an estimator for a particular design to the variance of the estimator for a simple random sample of the same size is known as the design effect. The design effect has two primary uses – in sample size estimation and in appraising the efficiency of more complex plans. We can estimate the sample size needed with simple random sampling and multiply it by the design effect; that gives the sample size needed for that particular design. We can also judge whether the complex plan is advantageous in efficiency relative to its cost and complexity.
For estimating the design effect in a particular situation we can use pilot survey results. Although it is necessary to find the design effect for each particular situation, in general it is observed that for cluster sampling the design effect is approximately 2. This would mean that, to obtain the same precision, twice as many individuals would have to be studied as with the simple random sampling strategy. For stratified random sampling, in the few situations where the stratification is nearly perfect, the design effect might be only between 0.40 and 0.80. In practice, the use of naturally occurring strata seems to result in about a 20% reduction in variance when compared to simple random sampling. Therefore one can calculate the required sample size for a given situation as if simple random sampling were to be used and then, if cluster sampling is intended, double the sample size, and if stratified sampling is intended, reduce the sample size by 20% (provided the stratification is satisfactory). However, it is impossible to make any universal statement for any design, because the value of the design effect depends not only on the sampling design but also on which variable (its variability in the population) is under study. Nevertheless, a substantial reduction in sampling error can very well be achieved by using the appropriate sampling design. A sampler should maximally utilize the known information about the population while planning a sample survey.
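The adjustment described above amounts to one multiplication; a sketch, using the approximate design effects quoted in this section and an arbitrary simple-random-sampling size:

    def adjusted_sample_size(n_srs, design_effect):
        # Multiply the size worked out under simple random sampling by the design effect.
        return round(n_srs * design_effect)

    n_srs = 256                                  # size calculated assuming simple random sampling
    print(adjusted_sample_size(n_srs, 2.0))      # cluster sampling, deff about 2: roughly 512
    print(adjusted_sample_size(n_srs, 0.8))      # well-stratified design, deff about 0.8: roughly 205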
 
Sample Size Versus Choice of Appropriate Sampling Design
It is well known that increasing the sample size decreases the standard error, as the standard error is inversely proportional to the square root of the sample size. However, a reduction in sampling error can instead be achieved by using an appropriate sampling design. This is illustrated with the help of one very simple hypothetical example, which is similar to that in Williams' book "Sampler on Sampling". Such an extreme and simple situation may seldom arise in practice, but this empirical, artificially constructed example will suffice to demonstrate the point.
Let us assume that the population consists of nine units only and that these nine are children whose mean age we wish to find. Their ages are: 3, 6, 3, 9, 3, 9, 6, 9 and 6 years. We decide to select a sample of size three from this population. If we select three units from the nine by simple random sampling, 84 samples having an equal chance of selection are possible. The sampling distribution has a mean equal to 6, which is equal to the population mean, and a standard error of about 1.2 years. This standard error can be reduced by increasing the sample size from three units to four, from four units to five, and so on. If we study the whole population, the standard error becomes zero. Therefore, it is often said that the sample should be as large as possible. But the nonsampling error may increase as a result. For estimation this is acceptable, but for tests of significance the side effect of a large sample size is that even a small difference could be significant; it is of course necessary to differentiate between statistical and medical significance. Instead, the reduction in sampling error can be achieved by using an appropriate sampling design, making the best possible use of the information available about the population in exploiting the provisions offered by the sampling design.
Let us assume that the additional information about the population is that these nine children are from three standards: three are from nursery, three from the first standard and the remaining three from the second standard. We divide the population of nine units into three homogeneous groups of size three each and decide to use a stratified sampling design. All the possible 27 samples will yield a mean of 6 and therefore the standard error is zero. Instead of their standard in school, suppose we come to know that the nine children are from three families and in each family there are three children with ages 3, 6 and 9. If we treat these families as clusters and decide to select any one cluster randomly for a sample of size three, any cluster selected will have a mean equal to 6 and therefore the standard error will be zero. If we select the sample according to a systematic sampling design with a random start and a sampling interval of three, then one of the three samples with unit numbers 1st, 4th, 7th or 2nd, 5th, 8th or 3rd, 6th, 9th will be selected. Each one of these samples has a mean equal to 6 and therefore the standard error is zero. Though many other sampling designs are available, only four important ones are considered here. Choice of an appropriate design depends on the situation and on the availability and nature of the information about the population from which the sample is to be selected. It is also known that a sampling scheme which is not appropriate in a given situation will yield a much larger standard error. The required sample size for a given situation may be calculated by any of the methods available, but it should be modified/adjusted taking into consideration the design effect.
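The whole nine-children illustration can be reproduced by enumeration; the sketch below lists all 84 simple random samples, all 27 stratified samples and the 3 systematic samples, and confirms the standard errors quoted above.

    from itertools import combinations, product
    from statistics import mean, pstdev

    ages = [3, 6, 3, 9, 3, 9, 6, 9, 6]           # the nine children, in the order given above

    # Simple random sampling: all 84 equally likely samples of size three.
    srs_means = [mean(s) for s in combinations(ages, 3)]
    print(len(srs_means), round(pstdev(srs_means), 2))   # 84 samples, SE about 1.22 years

    # Stratified sampling: one child from each standard (all 3s, all 6s, all 9s).
    strat_means = [mean(s) for s in product([3, 3, 3], [6, 6, 6], [9, 9, 9])]
    print(len(strat_means), pstdev(strat_means))         # 27 samples, every mean is 6, SE = 0

    # Systematic sampling with interval three: random start 1, 2 or 3.
    sys_means = [mean(ages[start::3]) for start in range(3)]
    print(sys_means, pstdev(sys_means))                  # each sample mean is 6, SE = 0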
It is possible to quote many real life examples to show the role of an appropriate choice of a sampling design which ultimately proved to be a successful sampling scheme. To quote a recent example, while sampling families from an earthquake affected area, we fully utilized the information about the present physical rehabilitation status, which was likely to have an impact on the characteristics we wanted to study. The sampling scheme proved to be very successful in yielding the required type of sample for further clinical studies.
 
FEW POINTS TO BE REMEMBERED
  • Choice of an appropriate design depends on the situation and on the availability and nature of the information about the population from which the sample is to be selected. It can easily be shown that a sampling scheme that is more appropriate in a given situation yields a much smaller standard error, while a sampling scheme that is not appropriate in a given situation yields a much larger one. The sample size should be modified/adjusted taking into consideration the design effect, as all the methods available for calculating the required sample size assume simple random sampling, whereas in practice we may use some other design.
  • Apart from the technical considerations, the decision on sample size depends on the resources available for the survey. It is well known that the larger the sample size, the better it would be from the point of view of achieving higher precision of the estimate(s) as well as enabling one to perform more elaborate analysis of the data. However, the larger the size, the greater is the requirement for resources and the strain on having effective management and supervision of the entire survey work, which can be quite crucial. Moreover, that may lead to a larger nonsampling error.
  • While planning studies, the modal and often the first question asked of the consulting statistician seems to be "how large a sample do I need?" The problem of choosing the number of observations is in some ways one of the most difficult in applied statistics. It takes technical knowledge and experience to approach the problem properly. Unfortunately, to laymen the problem of selecting a sample size may appear to be an easy one, one that statisticians deal with routinely many times each day. The problem is not routine. In fact, sample size estimation is one of the last questions to be answered, not the first. And it is all the more important to choose an appropriate method of selecting the sample (i.e. sampling design) from the population. Any study has two important aspects, one generalizability and the other validity. Sample size and sample selection are vital for generalizability, which is no doubt important. But the study should be valid in the first place. The methods used in the study are responsible for increasing the validity of the study and so of the results. Therefore, it is not always the sample size which is important; the methodology used should be sound and followed rigorously.
  • A questionnaire (all the questions) should be appropriate (i.e. capable of providing answers to the questions being asked), intelligible (the respondent can understand it), unambiguous (it means the same to both the respondent and the inquirer), unbiased (i.e. without recall bias), capable of coping with all possible responses (i.e. omnicompetent), satisfactorily coded (the coding system must be carefully checked for ambiguity and overlap), piloted, and ethical. The key steps in designing a questionnaire are: decide what data you need, select items for inclusion, design the individual questions, compose the wording, design the layout and presentation, think about coding, prepare the first draft and pretest, pilot and evaluate the form, and perform the survey.
  • For better "stratification", strata should be homogeneous within but heterogeneous between. For better "cluster" formation, clusters should be heterogeneous within, i.e. the variability in the population is represented fully in each cluster, but homogeneous between, i.e. the clusters are not different from each other; all clusters are alike.
  • All questionnaires should be pretested before being put into routine use. The particular objectives of pretesting are to see whether the questions are understood and elicit appropriate responses and to ensure that where alternative answers have been provided they cover the full range of relevant answers. Pretesting should include, at least, initial evaluation by peers and testing on a sample of subjects from the population to be studied. Translation of a questionnaire into another language should be carried out by a person who is expert in both languages and, ideally, also understands the subject matter. A cycle of translation and back-translation to the original language should be followed until the back-translated questionnaire appears to have the same meaning as the original.
  • Response rates tend to be highest for face-to-face interviews, intermediate for telephone interviews, and lowest for mailed questionnaires. The use of multiple methods of approach after initial nonresponse, however, can narrow the response gap between methods while not reducing the cost advantage of telephone and mail approaches. Failure to participate in surveys is not a random phenomenon; it correlates with a variety of characteristics of individuals, particularly greater age, less education, lower socioeconomic status, current smoking, and poor present health. Thus, in theory at least, nonresponse can lead to appreciable bias in some characteristics of the sample of subjects although, when samples of the same population with low and high response rates have been compared, the differences in prevalence proportions of relevant characteristics have been small.
  • A variety of techniques can be used to maximize response rates in different kinds of epidemiological studies. They include: Blanket publicity before the survey, appropriate advance notice of contact, personalization of correspondence and approaches by telephone, use of special class mailings (e.g. registered post), a carefully constructed introductory letter or statement signed, if possible, by someone salient to the subject, telephone interviewers perceived as sounding confident and competent, the offer of a monetary incentive, stamped return envelopes, avoidance of signed consent procedures at the beginning of interviews, use of experienced interviewers, multiple approaches to subjects, etc. Randomized response technique is very useful for sensitive questions.

Study Designs4

 
EPIDEMIOLOGY
Epidemiology is concerned with the distribution and determinants of disease frequency in human populations. The basic design strategies used in epidemiologic research can be broadly categorized according to whether such investigations focus on describing the distributions of disease or on elucidating its determinants. Descriptive epidemiology is concerned with the distribution of disease, including consideration of what populations or subgroups do or do not develop a disease, in what geographic location it is most or least common, and how the frequency of occurrence varies over time. Information on each of these characteristics can provide clues leading to the formulation of an epidemiologic hypothesis that is consistent with existing knowledge of disease occurrence. Analytic epidemiology focuses on the determinants of a disease by testing the hypotheses formulated from descriptive studies, with the ultimate goal of judging whether a particular exposure causes or prevents disease.
Descriptive epidemiology is concerned with describing the general characteristics of the distribution of a disease, particularly in relation to person, place and time. Descriptive data provide valuable information to enable health care providers and administrators to allocate resources efficiently and to plan effective prevention or education programs. In addition, descriptive studies have often provided the first important clues about possible determinants of a disease. Due to limitations inherent in their design, however, descriptive studies are primarily useful for the formulation of hypotheses that can be tested subsequently using an analytic design.
There are three main types of descriptive studies, which are listed in Table 4.1. The first type, correlation study, uses data from entire populations to compare disease frequencies between different groups during the same period of time or in the same population at different points in time.
Table 4.1   Overview of epidemiologic design strategies
  • Descriptive studies
    • Populations (correlational studies)
    • Individuals
      • Case reports
      • Case series
      • Cross-sectional surveys
  • Analytic studies
    • Observational studies
      • Case-control studies
      • Cohort studies – retrospective and prospective
    • Intervention studies (clinical trials)
The case report is the most basic type of descriptive study of individuals, consisting of a careful, detailed report by one or more clinicians of the profile of a single patient. The individual case report can be expanded to a case series, which describes characteristics of a number of patients with a given disease. The third type of descriptive epidemiologic study design in individuals is the cross-sectional survey, in which the status of an individual with respect to the presence or absence of both exposure and disease is assessed at the same point in time.
All study designs involve some implicit (descriptive) or explicit (analytic) type of comparison of exposure and disease status. In a case report, for example, where a clinician observes a particular feature of a single case, a hypothesis is formulated based on an implicit comparison with the "expected" or usual experience. In analytic study designs the comparison is explicit, since the investigator assembles groups of individuals for the specific purpose of systematically determining whether or not the risk of disease is different for individuals exposed or not exposed to a factor of interest. It is the use of an appropriate comparison group that allows testing of epidemiologic hypotheses in analytic study designs.
There are a number of specific analytic study design options that can be employed. These can be divided into two broad design strategies: observational and interventional. The major difference between the two lies in the role played by the investigator. In observational studies, the investigator simply observes the natural course of events, noting who is exposed and nonexposed and who has and has not developed the outcome of interest. In intervention studies, the investigators themselves allocate the exposure and then follow the subjects for the subsequent development of disease.
 
Observational Studies
There are two basic types of observational analytic investigations—the cohort study and the case-control study. In theory, it is possible to test a hypothesis using either design strategy. In practice, however, each design offers certain unique advantages and disadvantages. In general, the decision to use a particular design strategy is based on features of the exposure and disease, the current state of knowledge, and logistic considerations such as available time and resources.
In a cohort study, subjects are classified on the basis of the presence or absence of exposure to a particular factor and then followed for a specified period of time to determine the development of disease in each exposure group. In a case-control study, a case group or series of patients who have a disease of interest and a control, or comparison, group of individuals, without the disease are selected for investigation and the proportions with the exposure of interest in each group are compared.
Considerable confusion arises concerning the terms retrospective and prospective as applied to epidemiologic studies. Some investigators have used these terms synonymously with case-control and cohort, respectively, reasoning that the former looks backward from a disease to a possible cause, while the latter looks forward from an exposure to an outcome. It is more informative to use these terms to refer to the temporal relationship between initiation of the study by the investigator and the occurrence of the disease outcomes being studied. The feature that distinguishes a prospective from a retrospective cohort is simply and solely whether the outcome of interest has occurred at the time the investigator initiates the study. Thus, at the beginning of a prospective cohort study, the groups of exposed and unexposed subjects have been assembled, but the disease has not yet occurred, so that the investigator must conduct follow-up during an appropriate interval to ascertain the outcome of interest.
It is often possible to investigate a particular hypothesis using either a case-control or a cohort study design. For example, the hypothesis that oral contraceptive use increases the risk of breast cancer has been evaluated in a number of case-control studies that identified women with and without breast cancer and compared the proportions who were users of oral contraceptives. In addition, the question has been examined using a cohort design as well, where women initially free from disease were classified according to their use of oral contraceptives and then followed forward over time to compare the development of breast cancer in the two groups.
The choice of which type of design to use to study a particular exposure-disease relationship depends on the nature of the disease under investigation, the type of exposure, and the available resources. For example, the case-control design is particularly efficient for the investigation of a relatively rare disease since it selects a group of individuals who have already developed the outcome. In the example of artificial sweeteners and bladder cancer, it would have been much more costly and time-consuming to use a cohort approach involving identification of a necessarily very large group of individuals who used artificial sweeteners and a comparable group of nonusers, and following them to compare how many would subsequently develop bladder cancer over the next 20 years. Conversely, since cohort studies enroll individuals who are initially healthy and observe the subsequent development of disease over time, this design is best suited to investigations of relatively common outcomes that will accrue in sufficiently large numbers over a reasonably short period of follow-up. In general, therefore, cohort studies of rare diseases are far less feasible.
Intervention studies, also referred to as experimental studies or clinical trials, may be viewed as a type of prospective cohort study, because participants are identified on the basis of their exposure status and followed to determine whether they develop the disease. The distinguishing feature of the intervention design is that the exposure status of each participant is assigned by the investigator. When well designed and conducted, intervention studies can indeed provide the most direct epidemiologic evidence on which to judge whether an exposure causes or prevents a disease.
 
Other Operational Details: Cohort Study
Because participants are free from the disease at the time their exposure status is defined, the temporal sequence between exposure and disease can be more clearly established. Cohort studies are particularly well suited for assessing the effects of rare exposures and allow for the examination of multiple effects of a single exposure.
Cohort studies may be classified as either prospective or retrospective, depending on the temporal relationship between the initiation of the study and the occurrence of the disease. By definition, both prospective and retrospective cohort study designs classify subjects in the study on the basis of the presence or absence of exposure. In retrospective cohort studies, however, all the relevant events (both the exposures and the outcomes of interest) have already occurred when the study is initiated. In prospective studies, the relevant exposures may or may not have occurred at the time the study is begun, but the outcomes have certainly not yet occurred. Because retrospective cohort studies usually evaluate exposures that occurred many years previously, they depend on the routine availability of relevant exposure data in adequate detail from pre-existing records.
In cohort studies in which a single, general cohort is entered and its members then classified into exposure categories, an internal comparison group can be utilized. That is, the experience of those cohort members classified as having a particular exposure is compared with that of members of the same cohort who are either nonexposed or exposed to a different degree. For cohort studies that involve the use of a special exposure group, such as in an occupational setting or for a particular environment, it is often not possible to identify a portion of the cohort that can safely be assumed to be nonexposed for comparison. In this instance, an external comparison group is used, such as the general population of the area in which the exposed individuals reside.
The major disadvantage of using the general population as a comparison group is that its members may not be directly comparable to those of the study cohort. People who are employed are on average healthier than those who are not. Since the general population includes people who are unable to work due to illness as well as those who are employed, rates of disease and death among the general population are almost always higher than they are for members of the work force. The effect of this phenomenon, termed the "healthy worker" effect, is that any excess risk associated with a particular occupation will tend to be underestimated by a comparison with the general population.
In many cohort studies, a single classification of exposure is made for each individual at the time of his or her entry into the study. Frequently, however, changes in exposure levels for the factors of interest will occur during the course of long-term follow-up. Individuals may cut down on their smoking habits, change jobs, or begin to eat a diet lower in saturated fats. Similarly, the introduction into the workplace of a new piece of standard equipment may affect the level of exposure experienced by all workers in a single plant or an entire industry. Such changes will tend to result in an underestimate of the true strength of the association between an exposure and disease. Consequently, many cohort studies are designed to allow for periodic re-examination or resurvey of the members of the cohort so that exposure categories can be revised according to the new information. The analysis can then take these changes into account.
Whatever the procedures for identifying the outcomes of interest, it is crucial for the validity of the study that they are applied equally to all exposed and nonexposed individuals. As with any epidemiologic investigation, an evaluation of the validity of a cohort study requires consideration of chance, bias, and confounding as alternative explanations for the study findings.
A principal advantage of cohort studies is that they are optimal for the investigation of the effects of rare exposures. A second advantage is their ability to examine multiple effects of a single exposure, thus providing a picture of the range of health outcomes that could be related to a factor or factors of interest. Third, since the participants are disease-free at the time exposure status is identified, the temporal sequence between exposure and disease can be more clearly elucidated. Moreover, since in a prospective cohort study the outcomes of interest have not yet occurred at the time the study is begun, bias in the selection of subjects and ascertainment of exposure is minimized. For retrospective cohort studies, where all the relevant events have already occurred when the study begins, the potential for these biases is similar to that of a case-control study. Finally, cohort studies allow the direct calculation of incidence rates of the outcomes under investigation in the exposed and nonexposed groups.
With respect to disadvantages, a cohort study is not an efficient design for the evaluation of a rare outcome unless the study population is extremely large or the outcome is common among those who are exposed. Moreover, if prospective, a cohort study is very expensive and time-consuming compared with either a case-control or a retrospective cohort study. Finally, a potential for bias unique to the cohort design is the problem of losses to follow-up. If the proportion lost to follow-up is high, or even if it is low but is related to both the exposure and the outcome under investigation, the study findings may not be valid.
 
Other Operational Details: Case-control Study
The case-control design offers a solution to the difficulties of studying diseases with very long latency periods, since investigators can identify affected and unaffected individuals and then look backward in time to assess their antecedent exposures, rather than having to wait a number of years for the disease to develop. For a case-control study to provide sound evidence of whether there is a valid statistical association between an exposure and disease, comparability of cases and controls is essential. An important issue to be considered in the design of a case-control study is the definition of the disease or outcome of interest. It is important that this represent as homogeneous a disease entity as possible, since similar manifestational entities of disease often have very different etiologies. To help ensure that cases selected for study represent a homogeneous entity, one of the first tasks in any study is to establish strict diagnostic criteria for the disease.
Depending on the certainty of the diagnosis and the amount of information available, it is often useful to perform analyses separately for cases classified as definite, probable or possible. Once the diagnostic criteria and definition of the disease have been clearly established, the individuals with this condition can be selected from a number of sources. These include identifying persons with the disease who have been treated at a particular hospital or medical care facility during a specified period of time, or selecting all persons with the disease in a defined general population at a single point or during a given period of time. The first approach, referred to as a hospital-based case-control study, is more common because it is relatively easy and inexpensive to conduct. The second, the population-based case-control study, involves locating and obtaining data from all affected individuals, or a random sample of them, in a defined population. The scientific advantage of a population-based design is that it avoids the bias arising from whatever selection factors lead an affected individual to utilize a particular health care facility or physician. Moreover, the population-based approach allows a description of the entire experience of the disease in that population and the direct computation of rates of disease in exposed and nonexposed individuals. Because the logistic and cost considerations are often prohibitively large, however, population-based case-control studies are not routinely done.
Regardless of the source, the affected individuals can represent either incident (newly diagnosed) or prevalent (existing at a point in time) cases of the disease. Certainly, the inclusion of prevalent cases, especially of a rare condition, will greatly increase the sample size available for study in a given time period. Since prevalent cases reflect determinants of the duration as well as the development of the disease, however, the interpretation of the findings from such a study may be complicated. Although clarification of the temporal sequence between exposure and disease is always an issue in case-control studies, it is a more serious problem when prevalent rather than incident cases are used.
The cases in a case-control study should ideally be selected to be representative of all persons with the disease. The rationale for selecting a random sample derives from the fact that findings from a study that utilized representative cases would be more readily generalizable to the broader population of all persons with the disease. However, the primary concern in the design of any study must be validity, not generalizability. A case-control study can therefore be restricted to a particular type of case on whom complete and reliable information on exposure and disease can be obtained.
The control subjects should be selected to be comparable to the cases and, as a consequence, will represent not the population of all non-diseased persons but the population of non-diseased persons who would have been included as cases had they developed the disease. Such a case-control study will therefore provide a valid estimate of the association between the exposure and the disease, and a judgment concerning the generalizability of the findings can then safely be made. Validity should not be compromised in an attempt to achieve generalizability, since a lack of confidence in the validity of findings from a study will preclude any ability to generalize the results.
The selection of an appropriate comparison group is perhaps the most difficult and critical issue in the design of a case-control study. Controls are necessary to allow the evaluation of whether the frequency of an exposure or specified characteristic observed in the case group is different from that which would have been expected based on the experience of a series of comparable individuals who do not have the disease. Among the specific issues to be considered in selecting controls is the source of subjects. Several sources are commonly used, including hospital controls, general population controls, and special control series such as friends, neighbors, or relatives of the cases. Each offers particular advantages and disadvantages that must be considered for any particular study in view of the nature of the cases, their source, and the type of information to be obtained.
There are a number of important practical and scientific advantages to using hospitalized controls. The first is that they are easily identified and readily available in sufficient numbers, thus minimizing the costs and effort involved in their assembly. Second, because they are hospitalized, they are more likely than healthy individuals to be aware of antecedent exposures or events. In this respect, their comparability to cases in the accuracy of reporting information will reduce the potential for recall bias. Third, using patients hospitalized with other diseases as controls means that they are likely to have been subject to the same selection factors that influenced the cases to come to the particular physician or hospital. Finally, they are more likely to be willing to cooperate than healthy individuals, thus minimizing bias due to nonresponse.
The chief disadvantage of using hospitalized controls is that they are, by definition, ill and therefore differ from healthy individuals in a number of ways that may be associated with illness or hospitalization in general. Studies using both hospitalized and nonhospitalized controls have demonstrated that, as a group, hospitalized patients are more likely to smoke cigarettes, use oral contraceptives, and drink alcohol heavily than nonhospitalized individuals. The result of using hospital controls for a study of any of these risk factors could be a biased estimate of effect. The customary recommendation is to choose the control group from a community population rather than from other hospitalized patients. However, because this bias (called Berkson's bias) arises in the "case" group, owing to increased hospitalization of the subpopulation that is both exposed and diseased, a change in the control group from a hospital to a community source will not eliminate the problem. Berkson's bias can be removed only if the cases and controls are selected from the same community.
Procedures used to obtain information must be as similar as possible for cases and controls. For example, it is preferable that the place and circumstances of interviews be the same, to avoid a situation in which cases are interviewed in the hospital and the population controls at home. In some situations it may be possible for the interviewers or medical record abstractors to be blinded to the case or control status of subjects. Equal diagnostic examination is needed to check that similar procedures and criteria have been used for the diagnosis of the disease in both cases and controls. It is important to ensure that members of the control group are free of the disease being studied. Fulfillment of this standard is particularly important when the target disease is an ailment, such as cancer, that can occur in an asymptomatic form, allowing the disease to escape detection unless specifically sought.
The analysis of a case-control study is basically a comparison between cases and controls with respect to the frequency of an exposure whose potential etiologic role is being evaluated. In the vast majority of case-control studies, this comparison is made by estimating the relative risk as approximated by the odds ratio. The role of chance can be evaluated by testing the significance of this measure of association and by calculating confidence intervals. Cases and controls must, of course, also be compared to ensure similarity with respect to other baseline characteristics that could be associated with the risk of developing the outcome under study. As in all analytic epidemiologic studies, an evaluation of the validity of the findings requires consideration of the roles of chance, bias, and confounding as possible alternative explanations.
 
On Analysis of Epidemiologic Studies
It must be clearly kept in mind that tests of statistical significance and confidence intervals evaluate only the role of chance as an alternative explanation of an observed association between an exposure and disease. While an examination of the P value and/or confidence interval may lead to the conclusion that chance is an unlikely explanation for the findings, this provides absolutely no information concerning the possibility that the observed association is due to the effects of uncontrolled bias or confounding. All three possible alternative explanations (chance, bias, confounding) must always be considered in the interpretation of the results of every study.
Unlike chance and confounding, which can be evaluated quantitatively, the effects of bias are far more difficult to evaluate and may even be impossible to take into account in the analysis. For this reason, it is of paramount importance to design and conduct each study in such a way that every possibility for introducing bias has been anticipated and that steps have been taken to minimize its occurrence.
The validity of the study results must always be the primary objective, because it is clearly not possible to generalize an invalid finding. It is far more important to restrict admissibility to individuals who are comparable with respect to other risk factors for the outcome under study, and on whom complete and accurate information can be obtained. Selecting a study population on which accurate and complete information can be obtained enhances confidence in the validity of the findings of any study. It is important to keep in mind that validity should not be compromised in an effort to achieve generalizability, because generalizability can be inferred only for a valid result.
In a given study, if chance, bias and confounding are all determined to be unlikely alternative explanations of the findings, one can then conclude that a valid statistical association exists between the exposure and disease in these data. It is then necessary to consider whether this relationship can be judged one of cause and effect, since the presence of a valid statistical association in no way implies causality. Such a judgment can only be made in the context of all evidence available at that moment and as such must be re-evaluated with each new finding. Criteria that can aid in the judgment concerning causality include strength of the association, biologic credibility of the hypothesis, consistency of the findings, as well as other information concerning the temporal sequence and the presence of a dose-response relationship. Framework for the interpretation of an epidemiologic study is then:
  1. Is there a valid statistical association?
    1. Is the association likely to be due to chance?
    2. Is the association likely to be due to bias?
    3. Is the association likely to be due to confounding?
  2. Can this valid statistical association be judged as cause and effect?
    1. Is there a strong association?
    2. Is there biologic credibility to the hypothesis?
    3. Is there consistency with other studies?
    4. Is the time sequence compatible?
    5. Is there evidence of a dose-response relationship?
 
MEASUREMENT OF RISK
Relative risk (RR) is the ratio of the risk of developing an outcome such as disease (D) in those with an antecedent factor (A) compared to those without this factor. The antecedent factor would generally be an exposure believed to cause the disease. The term risk here has the same meaning as incidence. In terms of probabilities,
RR = P(D+ | A+) ÷ P(D+ | A–)
The ‘+’ sign is for presence and the ‘–’ sign is for absence. In terms of the notation of Table 4.2, RR = π11/π12, subject to the condition that π11 + π21 = 1 and π12 + π22 = 1. Computation of RR requires a prospective study. It measures the degree of association of the outcome with the antecedent factor. If the incidence of lung cancer among heavy smokers is 5% and among nonsmokers 0.5%, then RR = 10. That is, heavy smokers are at 10 times the risk of developing lung cancer compared with nonsmokers.
Estimate of RR: RR = (O11/O.1) ÷ (O12/O.2)
where the O's are the observed frequencies in Table 4.2 and O.1, O.2 are the column totals (O.1 = n1, O.2 = n2). If any Orc is zero, then a modified estimate of RR is used:
RRmod = [(O11 + 0.5)/O.1] ÷ [(O12 + 0.5)/O.2]
Table 4.2   Structure of cohort (prospective) study – Independent samples

Outcome  | Antecedent present | Antecedent absent | Total
Present  | O11 (π11)          | O12 (π12)         | O1. (π1.)
Absent   | O21 (π21)          | O22 (π22)         | O2. (π2.)
Total    | n1 (1)             | n2 (1)            |
  • RR = 1 implies complete independence, i.e. the risk in the exposed is the same as in the nonexposed. RR > 1 means a higher risk in the exposed and RR < 1 means a lower risk. The latter can be interpreted as a protective effect in place of a risk.
  • It is preferable to keep the adverse category of the antecedent in the first column of the 2 × 2 contingency table and the adverse outcome in the first row. The interpretation is then easy: RR is interpreted in terms of the first row and first column. The concept of RR is applicable to prevalence also, but a slight caution is advised while interpreting an RR based on prevalence: it describes the risk of the disease being present rather than the risk of developing the disease.
RR in case of matched pairs: A general situation of matched pairs with regard to antecedent and outcome is shown in Table 4.3. In all there are n matched pairs, and b is the number of pairs in which the exposed partner develops the disease and the nonexposed partner does not. The matched-pairs estimate is
RRM = (a + b)/(a + c)
The numerator is the number of subjects developing the disease among the exposed and the denominator is the number of subjects developing the disease among the nonexposed.
Table 4.3   Structure of cohort (prospective) study – Matched pairs
(rows: outcome in the exposed partner; columns: outcome in the nonexposed partner)

                             | Disease + | Disease – | Total
Positive outcome (disease +) | a         | b         | a + b
Negative outcome (disease –) | c         | d         | c + d
Total                        | a + c     | b + d     | n = a + b + c + d
  • The term risk in the literal sense relates to an adverse outcome. In the statistical sense, this is not necessarily so. For example, the "risk" may be of relief within one week, which is a positive feature. The term "protective effect" is used for such a factor.
Attributable risk (AR): Attributable risk is the difference in the risk between the exposed and the nonexposed. In terms of probabilities, AR = π11 – π12,
provided π11 + π21 = 1 and π12 + π22 = 1. This is estimated as:
AR = (O11/O.1) – (O12/O.2)
AR measures the expected reduction in risk if the exposure factor is eliminated. Thus it has public health importance. RR and AR can sometimes lead to very different conclusions. Consider the data in Table 4.4, which summarize results from the famous Doll and Hill study of British doctors. It compares mortality from lung cancer and cardiovascular disease in nonsmokers and heavy (> 25 cigarettes per day) smokers from 1951 to 1961.
Table 4.4   Comparison of RR and AR of death due to lung cancer and cardiovascular disease in heavy smokers among British male physicians

Cause of death         | Annual death rate/1000: Nonsmokers | Annual death rate/1000: Heavy smokers | RR    | AR
Lung cancer            | 0.07                               | 2.27                                  | 32.43 | 2.20
Cardiovascular disease | 7.32                               | 9.93                                  | 1.36  | 2.61
The RR for lung cancer is very high (32.43). This indicates a very strong association of lung cancer deaths with heavy smoking and underscores the importance of smoking in the etiology of lung cancer. The association of cardiovascular disease deaths with heavy smoking was mild, as the RR is only 1.36. The ARs in the two cases, however, were nearly the same. During 1951–61, elimination of heavy smoking among British male doctors would have reduced the cause-specific mortality for lung cancer almost as much as that for cardiovascular disease.
In view of the public health importance of AR, it is sometimes of interest to estimate the excess rate of disease attributable to the exposure in the total population under study. This excess is called the population attributable risk (PAR) and is calculated as: PAR = incidence in the total population (exposed and nonexposed combined) – incidence in the nonexposed. The first component of PAR can be estimated from the study sample only when the sample contains exposed and nonexposed subjects in the same proportion as in the total population. Otherwise, an extraneous estimate is needed.
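The following minimal sketch (not from the text) illustrates this calculation in Python, using the lung cancer rates of Table 4.4 and a purely hypothetical assumption that 25% of the population are heavy smokers.

    # Minimal sketch: population attributable risk (PAR).
    # Rates are annual deaths per 1000 (Table 4.4, lung cancer);
    # the 25% exposure prevalence is a hypothetical assumption.
    def population_attributable_risk(rate_exposed, rate_unexposed, prevalence_exposed):
        # PAR = incidence in the total population - incidence in the nonexposed
        total_incidence = (prevalence_exposed * rate_exposed
                           + (1 - prevalence_exposed) * rate_unexposed)
        return total_incidence - rate_unexposed

    par = population_attributable_risk(rate_exposed=2.27, rate_unexposed=0.07,
                                       prevalence_exposed=0.25)
    print(round(par, 2))  # about 0.55 excess deaths per 1000 per year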
The attributable risk in the case of matched pairs is estimated as:
ARM = (b – c)/n
where b and c are as in Table 4.3 above.
Test of H0 on RR and AR: The null hypothesis generally of interest for rejection in the case of RR is H0: RR = 1. This is the same as H0: AR = 0. Since RR measures the degree of association (as does AR, in a different way), this null hypothesis is the same as the hypothesis of homogeneity of probabilities in a 2 × 2 table. Thus, for this H0 too, the chi-square test for independent samples and the chi-square test for paired samples (McNemar's test) are the respective criteria.
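A minimal sketch of this test for independent samples, using SciPy and hypothetical counts laid out as in Table 4.2 (columns: antecedent present/absent; rows: outcome present/absent):

    # Chi-square test of H0: RR = 1 for a 2x2 cohort table (hypothetical counts)
    from scipy.stats import chi2_contingency

    table = [[30, 10],   # outcome present: O11, O12
             [70, 90]]   # outcome absent:  O21, O22
    chi2, p_value, dof, expected = chi2_contingency(table)

    rr = (30 / 100) / (10 / 100)   # (O11/O.1) / (O12/O.2); each column total is 100
    ar = (30 / 100) - (10 / 100)   # attributable risk
    print(round(rr, 2), round(ar, 2), round(chi2, 2), round(p_value, 4))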
CI for RR: Suppose there are two populations in which the probabilities that an individual shows characteristic A are π1 and π2, respectively. Suppose a random sample of size n1 from the first population has r1 members showing the characteristic (an observed proportion p1 = r1/n1), while the corresponding values for an independent sample from the second population are n2, r2 and p2 = r2/n2. Then the estimated relative risk (RR) is the simple ratio of these proportions, RR = p1/p2.
For large samples,
SE (ln RR) = √(1/r1 – 1/n1 + 1/r2 – 1/n2)
Note: ln is the natural (Napierian) logarithm. Therefore, the 95% CI for RR is exp [ln RR ± 1.96 SE (ln RR)].
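A minimal sketch of this interval, with hypothetical counts:

    # 95% CI for RR by the log method described above (hypothetical counts)
    import math

    r1, n1 = 30, 100   # events / total among the exposed
    r2, n2 = 10, 100   # events / total among the nonexposed

    rr = (r1 / n1) / (r2 / n2)
    se_ln_rr = math.sqrt(1/r1 - 1/n1 + 1/r2 - 1/n2)
    low = math.exp(math.log(rr) - 1.96 * se_ln_rr)
    high = math.exp(math.log(rr) + 1.96 * se_ln_rr)
    print(round(rr, 2), round(low, 2), round(high, 2))  # 3.0, about 1.55 to 5.80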
 
Odds Ratio
The term odds ratio is used in the context of retrospective or case-control studies. The comparison in such studies is between the frequency of presence of the antecedent among the cases relative to that among the controls. Ideally, all other possible factors are appropriately matched so that they do not influence the results. If some factors are not matched, then the statistical analysis is geared to minimize their influence on the result. For this, methods such as logistic regression are used.
In betting, it is stated, for example, that the odds of winning are 1:3. These odds mean that a loss is 3 times more likely than a win; that is, the odds in favor of a win are 1/3. Similarly, in case-control studies, the odds are the frequency of presence of the antecedent relative to its absence (Table 4.5). This is calculated separately for the cases and the controls. The ratio of these two odds is called the odds ratio.
Table 4.5   Structure of case-control study – Independent samples

Outcome           | Antecedent present | Antecedent absent | Total
Present (Cases)   | O11 (π11)          | O12 (π12)         | n1. (1)
Absent (Controls) | O21 (π21)          | O22 (π22)         | n2. (1)
Total             | n.1 (π.1)          | n.2 (π.2)         | n
Thus, the odds ratio is
OR = (π11/π12) ÷ (π21/π22) = (π11 π22)/(π12 π21)
This is estimated as:
OR = (O11 O22)/(O12 O21)
Since the numerator is the product of the elements in the leading diagonal and the denominator is the product of the other diagonal elements, OR is also sometimes called the cross-product ratio. The above equation becomes undefined if any of the cell frequencies is zero. Then the modified estimate of OR,
ORmod = [(O11 + 0.5) (O22 + 0.5)] ÷ [(O12 + 0.5) (O21 + 0.5)],
should be used.
The interpretation of the odds ratio is similar to that of the relative risk. An OR = 2 says that the odds of the antecedent being present are twice as high among the cases as among the controls. It can be shown that the odds ratio approximates the relative risk when the outcome of interest is rare in the target population. In fact, it can also be shown that this rare disease assumption is not necessary in many situations.
  • The sample odds ratio is always a good estimate of the population odds ratio whether or not the disease is rare in the population.
  • Note that in case-control studies, the reference is to the presence or absence of an antecedent characteristic. For this, it would be inappropriate to use the term incidence, nor does the term 'rate' seem appropriate for the presence of an antecedent characteristic. Thus the term 'odds' is used.
  • Relative risk can give a very different conclusion depending on whether the positive outcome or the negative outcome is being measured. The odds ratio gives the same conclusion either way.
OR in matched pairs: Consider a table on matched pairs (Table 4.6) similar to the earlier one. We now use capital A, B, C, D as the notation for cell frequencies, in place of lower case a, b, c, d, to distinguish the case-control setup from the prospective setup.
Table 4.6   Structure of case-control study – Matched pairs

Cases                               | Controls: antecedent present (exposed) | Controls: antecedent not present (nonexposed)
Antecedent present (exposed)        | A                                      | B
Antecedent not present (nonexposed) | C                                      | D
The total number of pairs is A + B + C + D. A is the number of pairs with both the case and the control subject exposed, and D is the number of pairs with both nonexposed.
Odds ratio (Matched pairs): ORM = B/C
Test of H0 on OR: The H0 in this case is almost invariably that OR = 1. This says that the presence of the antecedent is as common in cases as in controls. Since OR = RR when the outcome is rare, H0: OR = 1 also says, in that case, that the presence or absence of the antecedent does not influence the outcome. A simple statement which takes care of both directions of the relationship is that under H0 there is no association between antecedent and outcome. The alternative could be one-sided, H1: OR < 1 or H1: OR > 1, or two-sided, H1: OR ≠ 1. The latter is applicable when there is no a priori assurance that the relationship could be one-sided. The hypothesis is tested by the classical chi-square test.
CI for OR: OR is a ratio, and taking its natural logarithm (ln) converts it into a difference of log odds. The distribution of OR can be shown to be highly skewed, but ln (OR) has a nearly Gaussian pattern for large n. It has been established that, for large n,
SE (ln OR) = √(1/O11 + 1/O12 + 1/O21 + 1/O22)
where O11, O12, O21 and O22 are as in Table 4.5. If any Orc = 0, then add 0.5 to each Orc in the denominators. For large n,
95% CI for OR: exp [ln OR ± 1.96 SE (ln OR)],
where exp denotes exponentiation to the Napierian base e, i.e. the antilogarithm of the natural logarithm.
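A minimal sketch of the OR and its confidence interval, with hypothetical counts laid out as in Table 4.5:

    # Odds ratio and 95% CI (hypothetical counts); 0.5 is added to every cell
    # only when some cell is zero, as described above.
    import math

    o11, o12 = 40, 60   # cases:    antecedent present, absent
    o21, o22 = 20, 80   # controls: antecedent present, absent

    if 0 in (o11, o12, o21, o22):
        o11, o12, o21, o22 = (x + 0.5 for x in (o11, o12, o21, o22))

    odds_ratio = (o11 * o22) / (o12 * o21)
    se_ln_or = math.sqrt(1/o11 + 1/o12 + 1/o21 + 1/o22)
    low = math.exp(math.log(odds_ratio) - 1.96 * se_ln_or)
    high = math.exp(math.log(odds_ratio) + 1.96 * se_ln_or)
    print(round(odds_ratio, 2), round(low, 2), round(high, 2))  # 2.67, about 1.42 to 5.02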
A very useful method of comparing the cases with the controls with respect to one or more dichotomous antecedents is logistic regression. The odds ratio arises naturally in this kind of model. The odds ratio approximates the relative risk when the probability of the endpoint is lower than 10%. Above this threshold the odds ratio will overestimate the relative risk. To derive an approximate relative risk from the odds ratio, one can use the following formula:
Relative risk = {[odds ratio]/[1 + Pc × (odds ratio − 1)]}
where Pc is the proportion of events in the control group.
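A minimal sketch of this conversion, with hypothetical values:

    # Approximate RR from OR, given the control-group event proportion Pc
    def rr_from_or(odds_ratio, pc):
        return odds_ratio / (1 + pc * (odds_ratio - 1))

    print(round(rr_from_or(odds_ratio=2.67, pc=0.20), 2))  # about 2.0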
  • When the odds ratio (OR) is interpreted as a relative risk (RR) it always overstates the effect size: the OR is smaller than the RR for ORs of less than one, and bigger than the RR for ORs of greater than one. The extent of overstatement increases as the initial risk increases and as the odds ratio departs from unity. For risks less than about 20% the odds are not greatly dissimilar to the risk, but as the risk climbs above 50% the odds start to look very different. That is, serious divergence between the OR and the RR occurs only with large effects in groups at high initial risk. Therefore, qualitative judgments based on interpreting odds ratios as though they were relative risks are unlikely to be seriously in error.
  • The odds ratio is the ratio of the odds of exposure in cases to the odds of exposure in noncases (which is mathematically equivalent to the ratio of the odds of disease among the exposed to the odds of disease among the nonexposed). The relative risk is the ratio of the odds of exposure in cases to the odds of exposure in the population. Therefore, if the odds of exposure in the population are used in the calculation instead of the odds of exposure in the control group (i.e. when the control group is selected from the population and not from the noncases), then the odds ratio is exactly equal to the relative risk; no rare disease assumption is needed in that case.
 
Experimental Studies: Clinical Trials
Clinical trials are planned experiments designed to assess the efficacy of intervention techniques applied to individuals. These techniques may be therapeutic agents, devices, regimens or procedures (therapeutic trials), preventive measures (prophylactic trials), or rehabilitative, educational and similar interventions. The objectives are usually to measure the efficacy and safety of the procedure, and their variation among patients with different characteristics, which is generally done by comparing the outcomes with the test treatment with those observed in a comparable group of patients receiving a control (i.e. comparison) treatment. The groups may be established through randomization, in which case the trial is called a Randomized Clinical Trial (RCT). A randomized clinical trial is thus an experiment in which individuals are randomly allocated to two groups, known as the "experimental" (or study, intervention or treatment) group and the "control" (or comparison) group. The experimental group is given the treatment (drug, vaccine or therapy) being tested; the control group is not given that treatment but may be given the drug in current use or, if no such drug exists, a placebo, an inert substance such as a sugar pill or a saline injection.
The experimental and control groups must be comparable in all factors except the one being studied, i.e. the treatment. Comparability on factors that are known to have an influence on the outcome, such as age, sex or severity of disease, can be achieved by matching for these factors. However, one cannot match individuals on factors whose influence is not known or cannot be measured. This problem can be resolved by the random allocation of individuals to the experimental and control groups, which assures the comparability of these groups with respect to all factors (known or unknown, measurable or not measurable) except the one being studied. In addition, randomization is the means by which the investigator avoids introducing conscious or subconscious bias into the process of allocating individuals to the experimental or control groups, thereby increasing the degree of comparability. It also provides a basis for statistical inference.
The plan of a clinical trial is formally stated in a document called the "protocol" (a sort of blueprint), which contains the objectives and the specific procedures to be used in the trial. It must be written before the start of the trial. A general outline of a protocol (the steps of preparation) is given below:
  1. Rationale of study and collection of background information.
  2. Defining objectives and formulation of the hypothesis.
  3. Concise statement of the study design, (including – Blinding, Randomization, Type of trial, Duration of treatment, Sampling technique, Sample size, etc.)
  4. Criteria for including and excluding subjects.
  5. Outline of treatment procedures.
  6. Definition of all clinical, laboratory, etc. methods and diagnostic criteria.
  7. Ethical considerations and methods of assuring the integrity of the data.
  8. Major and minor endpoints.
  9. Provisions for observing and recording side effects.
  10. Procedures for handling problem cases.
  11. Procedures for obtaining informed consent of subjects.
  12. Procedures for collection and analysis of data.
  13. Interpretation and recommendation (s) of the trial.
It is often argued, correctly, that there is no underlying "population model" in clinical trials (the model behind all statistical testing/thinking); only an "invoked population model" (also called the "randomization model") is present, since an unspecified sampling procedure is used to select patients from an unspecified patient population, and that therefore statistical tests are not validly applicable. It has been shown, however, that the final results from so-called randomization (sequence) tests and from the other usually used tests are almost the same, and in fact 'linear rank tests' are more valid in these situations. Therefore, the use of non-parametric tests is recommended in clinical trials.
Every proposed trial must be weighed in the ethical balance, each according to its own circumstances and its own problems. The most important question is: "Is it proper to withhold from any patient a treatment that might, perhaps, give him benefit?" The value of the treatment is, clearly, not proven (if it were, there would be no need for a trial). On the other hand, there must be some basis for it. Where the value of a treatment, new or old, is doubtful, there may be a higher moral obligation to test it critically than to continue to prescribe it. It may be far more ethical to use a new treatment under careful and designed observation, in comparison with patients not so treated, than to use it widely and indiscriminately before its dangers as well as its merits have been determined.
The beneficial effects of some therapies are so great (the 'slam-bang' effect) that they are immediately obvious and are accepted by clinicians without question: for example, the use of quinine for malaria, vaccination against smallpox, sulfonamides for lobar pneumonia and meningococcal meningitis, and penicillin for gonorrhea. For the great majority of new treatments, however, the benefits are not so obvious. To obtain evidence of effectiveness it is necessary to conduct a scientific trial.
Important ethical considerations are as follows:
  1. Is the proposed treatment safe and therefore unlikely to bring harm to the patient?
  2. For the sake of a controlled trial, can a treatment ethically be withheld from any patient in the doctor's care?
  3. What patients may be brought into a controlled trial and allocated randomly to different treatments?
  4. Is it necessary to obtain the patient's consent to inclusion in a controlled trial?
  5. Is it ethical to use a placebo or dummy treatment?
  6. Is it proper for a trial to be double (or in any way) blind?
A category of investigation that has occasionally raised questions in the minds of investigators is that in which a new preventive, such as a vaccine, is tried. Necessarily, preventives are given to people who are not, at the moment, suffering from the relevant illness. However, the ethical and legal considerations are the same as those that govern the introduction of a new treatment. The intention is to benefit an individual by protecting him against a future hazard, and it is a matter of professional judgment whether the procedure in question offers a better chance of doing so than previously existing measures. In general, the propriety of procedures intended to benefit the individual – whether these are directed to treatment, to prevention or to assessment – is determined by the same considerations as govern the care of patients.
 
Need for Control or Comparison Group
The table below gives one remarkable example of the entirely inaccurate conclusions that an investigator might draw in the absence of a control group.
Group     | Total number | No. cured/improved | Cure rate (%)
Treatment | 579          | 489                | 84.5
Control   | 577          | 471                | 81.6
These are the results recorded after 3 days in an antihistamine trial conducted by the Medical Research Council, UK (1950), where:
Treatment: three 50 mg tablets of thonzylamine on 3 consecutive days
Control: three totally inactive tablets indistinguishable from the former.
About an 85% cure rate in 3 days is a highly encouraging finding for the use of antihistamines. However, practically the same proportion, namely 82% of the placebo patients (control group), were also cured (or improved) in 3 days. Obviously, therefore, the antihistamine had no effect, and the improvement noted was probably the natural course of the common cold.
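A minimal sketch of the corresponding formal comparison, a chi-square test on the two proportions, using the counts from the table above:

    # Comparison of cure rates in the antihistamine trial
    from scipy.stats import chi2_contingency

    table = [[489, 579 - 489],   # treatment: cured/improved, not cured
             [471, 577 - 471]]   # control:   cured/improved, not cured
    chi2, p_value, dof, expected = chi2_contingency(table)
    print(round(chi2, 2), round(p_value, 3))  # P is well above 0.05: no real difference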
 
Placebo
A patient's knowledge that he or she is receiving no treatment may have an adverse psychological effect. To avoid this autosuggestion phenomenon, the use of a placebo is essential. A placebo is a totally inert treatment which is given as a substitute for an active treatment to a patient who is unaware of the difference. It is a similar-looking drug without any therapeutic ingredient in it. The psychological (and so even physiological) effect produced by the knowledge that a treatment is being given, even when it is only a placebo, is called the 'placebo effect.' To keep this effect comparable in both groups, single blinding of the trial is necessary.
 
Blinding or Masking
Knowledge of whether the participant is in the treatment or control group can influence the response or the observation, consciously or subconsciously, resulting in biased inferences. To remove these sources of bias, the subjects (i.e. participants or patients), the observer (i.e. investigators or evaluators) and/or the data analyst are kept unaware of the allocation/assignment to groups. This is called blinding or masking and, depending on who is kept blinded, the following three terms are used.
 
Single Blind
Subjects are not given any indication of whether they are in the treatment or control group, to prevent them from introducing bias into the responses/observations. This is usually accomplished by means of a placebo.
 
Double Blind
Both the subject and the observer of the subject are blind regarding the subject's group allocation, to remove biases that occur as a result of either the subject or the observer being influenced by knowledge that the subject is in the treatment or control group. It is generally agreed that the attitude of the doctor towards the treatment can influence the course of the disease (the heterosuggestion phenomenon).
 
Triple Blind
The subject, the observer of the subject and the person analyzing the data are all blind with regard to the group to which a specific individual belongs.
The following chart summarizes the types of blinding:

Type of blinding | Single | Double | Triple
Subject          | X      | X      | X
Observer         | –      | X      | X
Data analyst     | –      | –      | X

X: blind with respect to the subject's allocation; –: may be aware of the subject's allocation
Depending on the type of blinding, a trial is so named or classified. The types of trial are therefore: unblinded or open (when all three – subject, observer and data analyst – are aware of the subject's allocation), single blind, double blind, or triple blind. When a pharmaceutical trial is conducted without any blinding, it is called 'open-label'.
 
Randomization
As already mentioned, the experimental and control groups must be comparable in all factors except the one being studied. Random allocation of individuals to the experimental and control groups is the best way to achieve this comparability. Randomization is the procedure of random allocation, ensuring that any subject who has been recruited for the trial has an equal chance of assignment to any group, i.e. he or she is equally likely to be assigned to either the experimental or the control group.
 
Advantages of Randomization
  1. Randomization generally implies equal distribution of subject characteristics in each group and thereby facilitates causal inference. It tends to balance treatment groups in covariates (prognostic factors) whether or not these variables are known. This balance means that the treatment groups being compared will in fact tend to be truly comparable.
    • This does not mean that in any single experiment (trial) all such characteristics will be perfectly balanced between the two groups. Randomization is a sort of insurance, not a guarantee scheme. Therefore, it is necessary to check that the groups are comparable with respect to all important/relevant variables. Stratified randomization/matching before the trial, and stratified analysis, standardization and other adjustment procedures such as analysis of covariance after the trial, should be used.
  2. Randomization eliminates selection bias/selection effects. If individuals found eligible for a study (trial) are randomized into groups, there is no possibility that the investigator's initial biases or preferences about which subjects should receive what program could influence the results. The investigator may or may not be conscious of his or her own selectivity in a non-randomized allocation but randomization will assure him or her as well as others that subtle selection effects have not operated.
  3. Randomization provides a basis for statistical inference. The process of randomization allows us to assign probabilities to observed differences in outcome under the assumption that the treatment has no effect and to perform significance tests. The purpose of a significance test is to rule out the random explanation. If it is used in conjunction with randomization it rules out every explanation other than the treatment.
 
TYPES OF RANDOMIZATION
Randomization is the process by which all subjects are equally likely to be assigned to either the experimental group or the control group. There are three types of randomization methods.
 
Simple Randomization
Randomization is achieved by a simple method such as coin tossing (if the outcome is a head, assign the subject to one group, say the experimental group; if a tail, assign to the control group), dice throwing (odd number on the uppermost face – experimental group; even – control group), or the use of random number tables (again, odd numbers – experimental, even – control).
Using any one of these methods, one may end up with unequal numbers of subjects in the two groups. For example, if we want to assign 50 subjects to two groups, the result could be 30 subjects in one group and 20 in the other.
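A minimal sketch of simple randomization (each subject assigned with probability 1/2, independently):

    # Simple randomization: the final group sizes may turn out unequal
    import random

    def simple_randomization(n_subjects, seed=None):
        rng = random.Random(seed)
        return [rng.choice(["experimental", "control"]) for _ in range(n_subjects)]

    allocation = simple_randomization(50, seed=1)
    print(allocation.count("experimental"), allocation.count("control"))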
 
Block Randomization
Block randomization is used in order to avoid an imbalance in the numbers of subjects assigned to each group, an imbalance that could occur with the simple randomization procedure.
 
Example of Blocks of Size 4
There are six possible combinations of group assignment:
AABB, ABAB, ABBA, BBAA, BABA, BAAB
(A – assignment to experimental group, B – assignment to control group)
One of these arrangements is selected at random (perhaps by the lottery method) and the four subjects are assigned accordingly. This process is repeated as many times as needed.
If there are two groups (experimental and control), then the block size is always an even number, and the total number of possible arrangements of group assignments can be calculated by the combinations formula (for a block of size 4, C(4, 2) = 6).
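A minimal sketch of block randomization with blocks of size 4:

    # Block randomization: every block of 4 contains exactly 2 A's and 2 B's
    import random

    def block_randomization(n_subjects, block_size=4, seed=None):
        rng = random.Random(seed)
        half = block_size // 2
        allocation = []
        while len(allocation) < n_subjects:
            block = ["A"] * half + ["B"] * half
            rng.shuffle(block)          # one of the C(4, 2) = 6 arrangements, chosen at random
            allocation.extend(block)
        return allocation[:n_subjects]

    print(block_randomization(12, seed=1))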
 
Stratified Randomization
This process involves measuring the level of selected factors for a subject (say age and sex), determining to which stratum he or she belongs (for young/old and male/female, the four strata are: young male, young female, old male, old female) and then performing the randomization within that stratum. Within each stratum the randomization process itself can be simple or blocked randomization. (Stratifying is sometimes mistakenly called blocking.)
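A minimal sketch of stratified randomization (hypothetical subjects; block randomization with blocks of size 2 within each age-by-sex stratum):

    # Stratified randomization: randomize within each stratum
    import random
    from collections import defaultdict

    def stratified_randomization(subjects, seed=None):
        rng = random.Random(seed)
        strata = defaultdict(list)
        for subject_id, age_group, sex in subjects:
            strata[(age_group, sex)].append(subject_id)
        allocation = {}
        for members in strata.values():
            block = []
            for i, subject_id in enumerate(members):
                if i % 2 == 0:                        # start a new block of two
                    block = ["experimental", "control"]
                    rng.shuffle(block)
                allocation[subject_id] = block[i % 2]
        return allocation

    subjects = [(1, "young", "M"), (2, "young", "M"), (3, "old", "F"), (4, "old", "F")]
    print(stratified_randomization(subjects, seed=1))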
 
TYPES OF STUDY DESIGNS
Sound scientific clinical investigation almost always demands that a control group be used against which the new intervention can be compared. Randomization is the preferred way of assigning subjects to control and intervention (experimental) groups.
There are basically three types of study designs:
  1. Randomized Control Studies
  2. Nonrandomized
    1. Concurrent Control Studies
    2. Historical Control Studies
  3. Cross-over Designs
 
 
Randomized Control Studies
Assignment of subjects to a group (either the intervention group or the control group) is determined by the formal procedure of randomization. Randomization in the simplest case is a process by which each subject is equally likely to be assigned to either the intervention or the control group. Since both groups are treated concurrently, these are also called 'parallel group' designs.
 
Non-randomized Concurrent Control Studies
Controls in this type of study are subjects treated without the new intervention at approximately the same time as the intervention group is treated. Subjects are allocated to one of the two groups, but not by a random process. The major weakness of the nonrandomized concurrent control study is the potential that the intervention group and the control group are not strictly comparable. Since both groups are treated concurrently, these are also called 'parallel group' designs, but their non-randomized nature should be stated.
 
Non-randomized Historical Control Studies
In historical control studies, a new intervention is used in a series of subjects and the results are compared to the outcome in a previous series of comparable subjects (treated in the past). Use of recent historical controls, particularly when there has been a sequence of studies admitting similar kinds of patients and using similar criteria for evaluating response, is clearly preferable. However, this design provides limited protection against bias introduced by changes over time in the nature of the patient population, in exposure to pathogenic agents, or in supportive care and diagnostic criteria.
 
Cross-over Studies
The cross-over design is a special case of a randomized control trial that allows each subject to serve as his own control. In the simplest case, namely the two-period cross-over design, each subject receives either the intervention or the control (A or B) in the first period and the alternative in the succeeding period. The order in which A and B are given to each subject is randomized: half of the subjects receive the intervention in the sequence AB and the other half in the sequence BA, so that any trend from the first period to the second (called the 'order effect') is eliminated in the estimate of the difference in response.
In order to use the cross-over design, the assumption to be made is that the effect of the intervention during the first period does not carry over into the second period (the carry-over effect). Therefore, it is necessary that the therapies under study have no carry-over effect. A differential carry-over effect may be eliminated by interposing a long wash-out period (the length of which is determined by the pharmacological properties of the drugs being tested) between the termination of the treatment given first and the beginning of the treatment given second.
Cross-over studies offer a number of advantages. The appeal of this design is that it avoids 'between-subject' variation in estimating the intervention effect; in addition, all patients can be assured that at some time during the course of the investigation they will receive the new therapy. However, this method of study is not suitable if the drug of interest cures the disease, if the drug is effective only during a certain stage of the disease, or if the disease changes radically during the period of time required for the study.
 
Confounding Factors
Background factors which satisfy conditions 1 and 2 below are called confounding factors.
  1. The risk groups differ on the background factor.
  2. The background factor itself influences the outcome.
In other words, a confounding factor is a variable that has the following properties.
  1. Is statistically associated with the risk factors, i.e. it is different in two groups (study and control).
  2. Directly affects the outcome.
The judgment that a particular variable exerts a direct causal influence on the outcome cannot be based on statistical considerations; it requires a logical argument or evidence from other investigations. A confounding variable is also called a covariate or concomitant variable. A prognostic variable whose distribution differs between the two groups is a confounding variable; if it differs between the groups, it affects the estimate of the treatment effect. This can be seen more clearly in the following example pertaining to estimation of the effect of smoking on blood pressure (Fig. 4.1).
Fig. 4.1: Example of Confounding
In this example age is a confounding variable because its distribution is different in the two groups (smokers and nonsmokers). Randomization is the best way to make the groups comparable, but randomization does not guarantee balance in a single experiment/trial. Another method of making the groups comparable is matching. Matching can be used for those confounding variables that are identified in advance (before the trial begins).
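A minimal numerical sketch of such confounding by age (all numbers hypothetical): the crude smoker-versus-nonsmoker difference in mean blood pressure is inflated because the smokers are older, while the age-stratified comparison shows the much smaller within-stratum difference.

    # Crude vs age-stratified comparison of mean blood pressure (hypothetical data)
    smokers    = [("young", 20, 120.0), ("old", 80, 140.0)]   # (age group, n, mean BP)
    nonsmokers = [("young", 80, 118.0), ("old", 20, 138.0)]

    def crude_mean(group):
        total_n = sum(n for _, n, _ in group)
        return sum(n * bp for _, n, bp in group) / total_n

    print("crude difference:", crude_mean(smokers) - crude_mean(nonsmokers))  # 14.0
    for (age, _, bp_s), (_, _, bp_ns) in zip(smokers, nonsmokers):
        print(age, "stratum difference:", bp_s - bp_ns)                        # 2.0 in each stratum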
 
Matching
Matching of the experimental and control samples (groups) with respect to covariables can be accomplished in a number of ways. Conceptually the simplest is the method of pairing. Each member of the experimental sample is taken in turn and a partner is sought from the control population which has the same values as the experimental member (within defined limits) for each of the covariables. Depending on these defined limits, there are three types.
  1. Caliper matching: defining two subjects to be a match if they differ in the value of the numerical (i.e. quantitative) covariable by no more than a small tolerance. Exact matching corresponds to caliper matching with a tolerance of zero.
  2. Nearest available matching: this does not use a fixed tolerance as caliper matching does. As the name indicates, each experimental member is matched to the individual in the remaining control population with the nearest available value of the variable (a minimal sketch follows this list).
  3. Stratified matching: matching is done on strata rather than on the exact value of the variable. It is an appropriate pair-matching procedure for a categorical (i.e. qualitative) covariate. Often, however, the covariate is numerical, but the investigator may choose to work with the variable in its categorical form.
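The sketch below (hypothetical ages and a purely illustrative helper) shows nearest-available matching on a single numerical covariable; supplying a tolerance turns it into caliper matching.

    # Nearest-available (and, with a tolerance, caliper) matching on one covariable
    def nearest_available_matching(experimental_ages, control_ages, caliper=None):
        remaining = list(control_ages)
        pairs = []
        for age in experimental_ages:
            nearest = min(remaining, key=lambda c: abs(c - age))
            if caliper is None or abs(nearest - age) <= caliper:
                pairs.append((age, nearest))
                remaining.remove(nearest)
            else:
                pairs.append((age, None))   # no acceptable match within the caliper
        return pairs

    print(nearest_available_matching([34, 52, 47], [51, 30, 60, 48], caliper=5))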
Many times, instead of pair matching, the overall groups (experimental and control) are matched (group matching). In the technique known as "balancing" we do not pair individually but select the control sample so that its means agree with those of the experimental sample for each of the covariables (mean matching). It gives the same precision as pairing if the regression of the outcome on the covariable(s) is linear; balancing is less precise if the regression is non-linear.
Another method of group matching is frequency matching. This involves stratifying the distribution of the covariate in the experimental group and then finding control subjects so that the number of experimental and control subjects is the same within each stratum. Frequency matching is relatively effective in reducing bias in the parallel linear response situation, provided that enough strata are used. If there is more than one covariable to be matched and matching is done simultaneously for all covariables (by a combination of the above methods, or by some method for each variable), the matching is called multivariate matching.
Even when matching for a few important covariates and randomization are used, it is essential to check that the groups are comparable. Sometimes a confounding variable is identified only after the trial. In both these situations, the effect of the confounding variable can be reduced or even eliminated by using one of the adjustment procedures: post-stratification, analysis of covariance, standardization, or the use of regression or logistic regression models.
 
Phases of Clinical Trial
A full clinical trial programme with any new drug passes through three main phases, each with its own distinct set of objectives.
 
Phase I Trial
The primary objectives of phase I trial are:
  1. To establish that the new compound may be safely administered to human subjects.
  2. To provide sufficient pharmacological data upon which to base the subsequent therapeutic (phase II and III) trials.
The subjects recruited for phase I studies of most drugs may be normal volunteers or patients not taking other drugs. But since the first trials invariably demand close 24-hour supervision, it is frequently more convenient to use hospitalized patients.
 
Phase II Trial
Having established an acceptable level of safety by means of phase I trials, the therapeutic assessment of a new drug formally begins with the initiation of phase II trials. These trials have the following objectives.
  1. To establish clinical efficacy and incidence of side effects.
  2. To define most appropriate clinical dosage schedules.
  3. To provide detailed pharmacological and metabolic data so as to further the optimal use of the drug.
The number and detailed design of phase II trials depend largely on the particular type of drug under investigation. These trials allow the clinical investigators to become familiar with the new drug; they serve to define possible dose ranges for consideration in subsequent controlled trials and, through careful definition of diagnostic criteria, can begin to give some indication of the type of patient or type of symptoms that will benefit most from the drug.
 
Phase III Trials
Once the early clinical trials (phase I and II) have established the overall safety of the drug, its basic clinical pharmacology, its therapeutic properties and its most important side effects, phase III clinical trial (clinical trial proper) is undertaken to provide a full picture of its likely clinical performance when released for marketing.
Major areas that require clinical investigations during this phase are:
  1. The exact place of the new drug in the therapy of the target disease, including identification of the diagnostic groups that respond more or less well in comparison with the major existing drugs in the same therapeutic area, in terms of both beneficial and adverse effects.
  2. An increase in patient exposure, both in the number of patients and in the length of administration to individuals, to identify less common and late side effects.
  3. The possibility of adverse interaction with other drugs with which the new drug is likely to be prescribed.
  4. The ideal dosage regimen for different types of patients.
  5. Further clinical pharmacological studies.
  6. Medical attitudes to the drug in countries with varying medical cultures.
 
Multicenter Trials
Often it becomes necessary to extend a trial to several centres. This could be due to a scarcity of patients at individual centres, a desire to complete the trial within a relatively short period of time, or a desire to represent as broad a geographic spectrum as possible. Multi-centre trials bring with them their own problems. A great deal of attention should be paid to the practical administration of the trial as a whole, and it is best for a trial to be coordinated centrally by one person or by a small group of experienced workers. Stratification by centre and allocation within each centre should invariably be employed, to minimize difficulties in the interpretation of the findings.
The reference population is the general group to whom the investigators expect the results of the particular trial to be applicable. The reference population may include all human beings, if it seems likely that the study findings are universally applicable. Conversely, the reference population may be restricted by geography, age, sex, particular type of illness/disorder or any other characteristic that is thought to modify the existence or magnitude of the effects seen in the trial. The experimental population is the actual group in which the trial is conducted. While, in general, it is preferable that this group not differ from the reference population in ways that would make generalization to the latter impossible, the primary consideration in the design of the trial should always be to obtain a valid result.
 
Preventive Trials, Program Trials, Community Trials
A preventive (or primary prevention) trial involves the evaluation of whether an agent or procedure reduces the risk of developing disease among those free from that condition at enrollment. Thus, preventive trials can be conducted among healthy individuals at usual risk or among those already recognized to be at high risk of developing a disease. The most frequently occurring types of preventive trials are trials of vaccines and chemoprophylactic drugs. The basic principles of experimental design are applicable to these trials also.
While therapeutic (or secondary prevention) trials are virtually always conducted among individuals, preventive (or primary prevention) measures can be studied among either individuals or entire populations (i.e. groups of individuals). Such trials (group based trials) are called 'community trials'. When the trial involves studying the effectiveness of a health programme, the trial is called a 'programme trial'. Programme trials may be individual based or group based. In individual based trials of programmes, individuals are allocated (preferably randomly) to groups that are exposed (or not exposed) to the programme under study. These trials do not differ in their design from other clinical trials.
In group based trials (i.e. community trials) the experimental units are not individuals but groups or communities that are exposed (or not exposed) to the programme. Since investigators seldom have the power to decide where and when programmes will be established, most programme trials are quasi-experiments. 'Post-measure only' studies, where the findings in an intervention (experimental) group after exposure to a programme are compared with those in a control group, are useful only if it can be assumed that the groups were similar before the institution of the programme. The minimal requirement is generally a 'Pre-measure – Post-measure' design, with 'before' and 'after' measurements for both groups.
In ‘self-controlled’ community trials, observations before and after the institution of the program are compared. As in clinical trials of this sort, the main biases are those connected with extraneous events or changes that occur between the observations, non-specific effects caused by the trial itself, and changes in methods of measurement. ‘Before-after’ experiments of this sort, without external controls are common in public health but to be reasonably convincing, the before-after trial should be replicated in different populations or at different times. It is also helpful if a before-after study can be extended to an examination of what happens when the program is withdrawn. It must be very rare, however, for an investigator to have the power or ethical justification for a decision to discontinue a program that has shown an apparent effect.
Randomized controlled trials (clinical or community) have been extended to assess the effectiveness and efficiency of health services. Often choices have to be made between alternative policies of health care delivery because resources are limited and priorities must be set for the implementation of the large number of activities that contribute to the welfare of society. One very good example of such an evaluation is the controlled trial of the chemotherapy of tuberculosis in India, which demonstrated that 'domiciliary treatment' of pulmonary tuberculosis was as effective as the more costly hospital or sanatorium treatment.
The actual study population of trial is often not only a relatively small but also a select subgroup of the experimental population. It is well recognized that those who participate in an intervention study are very likely to differ from non-participants in many ways that may affect the rate of development of the end points under investigation. Among all who are eligible, those willing to participate in clinical trials tend to experience lower morbidity and mortality rates than those who do not, regardless not only of the hypothesis under study but of the actual treatment to which they are assigned. Volunteerism is likely to be associated with age, sex, socio-economic status, education and other less well-defined correlates of health consciousness that might significantly influence subsequent morbidity and mortality.
Whether the subgroup of participants is representative of the entire experimental population will not affect the validity of the results of a trial conducted among that group. It may, however, affect the ability to generalize those results, to either the experimental or the reference population.
 
Compliance
An intervention study requires the active participation and cooperation of the study subjects. After agreeing to participate (i.e. after written consent), subjects in a trial of medical therapy may deviate from the protocol for a variety of reasons, including developing side effects, forgetting to take their medication or simply withdrawing their consent after randomization. Analogously, in a trial of surgical therapy those who were randomized to one group may choose to obtain the alternative treatment on their own initiative. In addition there will be instances where participants cannot comply, such as when the condition of a randomized patient rapidly worsens to the point where the therapy becomes contraindicated. Consequently, the problem of achieving and maintaining high compliance is an issue in the design and conduct of all clinical trials.
The extent of noncompliance in any trial is related to the length of time that participants are expected to adhere to the intervention, as well as to the complexity of the study protocol. There are a number of possible strategies that can be adopted to try to enhance compliance among the participants in a trial. Selection of a population of individuals who are both interested and reliable can enhance compliance rates. Compliance levels must be measured but the measures available to estimate compliance have inherent limitations. The simplest measure is a self-report. For some interventions, such as exercise programs or behaviour modifications, this may be the only practical way to assess compliance. In trials of pharmacologic agents, pill counts have been used, where participants bring unused medication to each clinic visit or return it to the investigator at specified intervals. It assumes that the subject has ingested all medication that has not been returned to the clinic.
A more objective way of assessing compliance is the use of biochemical parameters to validate self-reports, but this is expensive and logistically difficult. Monitoring compliance is important because noncompliance will decrease the statistical power of a trial to detect any true effect of the study treatment. Some proportion of participants in a trial will become non-compliant despite all reasonable efforts. Bias due to withdrawals and dropouts can be avoided by comparing the outcomes in all subjects originally allocated to each group (intention-to-treat analysis). Therefore it is essential to obtain as complete follow-up information as possible on those who have discontinued the treatment program. Such an approach may underestimate the efficacy of the treatment, and an 'on randomized treatment' analysis may be performed as well, comparing the experience of subjects while they were still on their allocated treatment.
 
NATURAL EXPERIMENTS, CESSATION EXPERIMENTS
The term natural experiment is often applied to circumstances where 'naturally' occurring changes or differences provide a rare or unique opportunity to observe or study the effects of specific factor(s). For example, the September 1993 earthquake in the Marathwada region of Maharashtra, India, provided an opportunity to study the health consequences of a disaster.
John Snow's discovery that cholera is a water-borne disease was the outcome of a natural experiment. Snow identified two randomly mixed populations, alike in other important respects except the source of water supply in their households. The great difference in the occurrence of cholera between these two populations gave a clear demonstration that cholera is a water-borne disease. This demonstration (1854) came long before the advent of the bacteriological era. It also led to the institution of public health measures to control cholera.
Experiments in which an attempt is made to evaluate the termination of a habit (or removal of a suspected agent) considered to be causally related to a disease are called 'cessation experiments'. If such action is followed by a significant reduction in the disease, the hypothesis of cause is greatly strengthened. The familiar example is cigarette smoking and lung cancer. If in a randomized controlled trial one group of cigarette smokers continues to smoke and the other group gives up, the demonstration of a decrease in the incidence of lung cancer in the study group greatly strengthens the hypothesis of a causal relationship.
 
Sequential Trials
The aim of the classical sequential design is to minimize the number of subjects that must be entered into a study. The decision to continue to enroll subjects depends on results from those already entered. It is necessary that the outcome is known in a short time relative to the duration of the trial. Therefore, these methods are applicable to trials involving acute illness but are not so useful for studies involving chronic diseases.
The sequential analysis method as originally developed by Wald and applied by Armitage to the clinical trial, involves repeated testing of data in a single experiment. The method assumes that the only decision to be made is whether the trial should continue or be terminated because one of the groups is responding significantly better or worse than the other. This classical sequential design rule is called “open plan” because there is no guarantee of when a decision to terminate will be reached. The method requires data to be paired, one observation from each group (Fig. 4.2).
Net Intervention Effect is the number of pairs intervention superior minus the number of pairs intervention inferior.
Fig. 4.2: Sequential trials – example of open plan
The expected sample size (for given α and 1 − β) is smaller for a sequential design than for its fixed sample size counterpart. But there is a chance, although very small, that the treatment difference will remain within the defined boundaries no matter how many pairs of patients are enrolled. This possibility is eliminated by imposing a limit on the number of patients that may be enrolled. Such a design is called a closed sequential design. One example for a binary response variable is shown in the figure below (a two-sided test with α = 0.05 and 1 − β = 0.95). For both examples (this one and the earlier one of the open plan) the null hypothesis (H0) is "no difference between the two groups" and the alternative hypothesis (HA) is that the probability of the intervention being superior is 0.6 (or of its being inferior is 0.6) (Fig. 4.3).
Fig. 4.3: Sequential trials – example of closed plan
For a specific alternative hypothesis, Armitage (Sequential Clinical Trials, John Wiley and Sons, New York, 2nd Edition, 1975) has given the appropriate boundary parameters for binomially and normally distributed response variables and the maximum number of pairs of subjects needed. The use of a sequential design is limited to situations in which outcome assessment can be made shortly after patients are enrolled in the trial. Moreover, the enrollment of patients in pairs is needed. (One member of each pair is assigned to the test treatment and the other member is assigned to the control treatment. The decision as to whether to enroll the next pair of patients is based on the results observed for patients already enrolled.)
Pocock modified the concept and developed a group sequential method for clinical trials. This method divides the subjects into a series of K equal sized groups with 2n subjects in each, n assigned to intervention and n to control. K is the number of times the data will be monitored during the course of the trial. The test statistic is computed and plotted as soon as data for the first group of 2n subjects are available, and is recomputed as data from each successive group become known. Group sequential methods have the advantage over the classical sequential methods that the data do not have to be continuously tested and individual subjects do not have to be paired.
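As an illustration of the group sequential idea, the following sketch (in Python) monitors the cumulative data at each of K planned looks and stops if the z statistic for the difference in proportions crosses a constant Pocock-type boundary. The data, the function name and the boundary value (about 2.41 for K = 5 looks at two-sided α = 0.05) are illustrative assumptions, not taken from the text.

import math

def pocock_monitor(successes_t, totals_t, successes_c, totals_c, z_crit=2.413):
    # Compare two proportions at each interim look against a constant
    # Pocock-type critical value (z_crit ~ 2.413 for K = 5 looks, two-sided alpha = 0.05).
    for k, (st, nt, sc, nc) in enumerate(zip(successes_t, totals_t, successes_c, totals_c), start=1):
        pt, pc = st / nt, sc / nc
        p = (st + sc) / (nt + nc)                        # pooled proportion
        se = math.sqrt(p * (1 - p) * (1 / nt + 1 / nc))
        z = (pt - pc) / se
        print(f"look {k}: z = {z:.2f}")
        if abs(z) >= z_crit:
            return f"stop at look {k} (boundary crossed)"
    return "continue to the planned end of the trial"

# cumulative data after each of 5 equally spaced looks (illustrative numbers)
print(pocock_monitor([12, 25, 40, 52, 65], [20, 40, 60, 80, 100],
                     [8, 15, 24, 33, 41], [20, 40, 60, 80, 100]))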
Group sequential trials are one type of 'flexible clinical trial', the other two being 'information based designs' and 'adaptive designs'. Any type of sequential design is often impracticable when there is a long delay between starting a treatment and observing its outcome; when there is more than one response variable; when more than two treatments are compared; or when patients have very different prognoses.
 
Adaptive Procedures
There are two basic methods for allocating treatments to patients (or patients to treatments): fixed allocation and adaptive allocation. Fixed allocation is a procedure that assigns a pre-specified proportion (usually equal proportions) of patient population to each of the treatment groups in a manner that does not depend on the accumulating data. An adaptive allocation procedure uses information obtained during the course of the trial to determine the treatment assignment for the next patient about to enter the trial.
The aim in adaptive procedures is to minimize the number of patients allocated to an inferior treatment (or to maximize the number of patients allocated to a superior treatment). Adaptive procedures usually assume that:
  1. There are only two treatments (i.e. two groups, experimental and control);
  2. Patients enter the trial sequentially;
  3. The response for the previous patient is known before the next patient arrives, and
  4. The response depends only on the treatment given.
The first three of these assumptions are necessary even for sequential trials. While the first two are reasonable, the third may hold good only in specific situations. A difficulty in using adaptive procedures is that most clinical trials have more than one outcome of interest, such as efficacy and safety. A more serious limitation of adaptive procedures arises from the underlying assumption that the patients admitted throughout the study are homogeneous in characteristics that affect the response to treatment.
Following are two specific examples of adaptive procedures.
 
Play the Winner Rule
Assign the next patient to the same treatment as the previous patient until a treatment failure occurs; after each such failure the next patient is assigned to the alternative treatment.
Here S denotes a success, F a failure, and A and B the two treatments.
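A minimal simulation of the rule may make the allocation sequence concrete; the success probabilities and function name below are purely illustrative.

import random

def play_the_winner(outcome_prob, n_patients, seed=1):
    # Allocate patients by the play-the-winner rule: stay on the same
    # treatment after a success (S), switch after a failure (F).
    random.seed(seed)
    current = 'A'                      # first patient's treatment (could be chosen at random)
    history = []
    for _ in range(n_patients):
        success = random.random() < outcome_prob[current]
        history.append((current, 'S' if success else 'F'))
        if not success:                # switch treatments only after a failure
            current = 'B' if current == 'A' else 'A'
    return history

# illustrative success probabilities: A is the better treatment
print(play_the_winner({'A': 0.7, 'B': 0.4}, 10))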
 
Two-armed-bandit Problem
Continually estimate the probability of success for each treatment as the trial progresses and use that information to determine allocation. Here the evidence of superiority of a treatment is based on all the information accumulated so far, instead of on the success or failure of just the previous treatment.
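A comparable sketch for the two-armed-bandit idea, again with illustrative numbers: each new patient is allocated to the arm with the higher estimated success rate so far.

import random

def two_armed_bandit(outcome_prob, n_patients, seed=1):
    # Allocate each new patient to the treatment with the higher estimated
    # success rate so far (ties or no data yet: allocate at random).
    random.seed(seed)
    counts = {'A': [0, 0], 'B': [0, 0]}          # [successes, patients] per arm
    for _ in range(n_patients):
        rates = {t: (s / n if n else 0.5) for t, (s, n) in counts.items()}
        arm = 'A' if rates['A'] > rates['B'] else 'B' if rates['B'] > rates['A'] else random.choice('AB')
        success = random.random() < outcome_prob[arm]
        counts[arm][0] += int(success)
        counts[arm][1] += 1
    return counts

print(two_armed_bandit({'A': 0.7, 'B': 0.4}, 50))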
 
“N-of-1” Trials
The ‘N-of-1’ clinical trial or ‘single-case experiment’ is a special kind of crossover study aimed at determining the efficacy of a treatment, or the relative value of alternative treatments, for a specific patient. The patient is repeatedly given a treatment and placebo, or different treatments, in successive time periods. Thus, an N-of-1 randomized controlled trial requires that the patient receive active therapy during one period of each pair; during the other the patient receives a placebo or alternative therapy. The order of the two treatments within each pair is determined by random allocation, and both the patient and the treating clinician may be kept blind to the treatment actually received at any one time (double blind trial). This approach is an example of the crossover design. Conventional trials generally use a single crossover from one treatment to the other, but occasionally multiple crossovers similar to those in N-of-1 trials have been used. Conventional studies attempt to determine the overall treatment effect in a group of patients, whereas the N-of-1 study attempts to determine whether the treatment is beneficial in a particular patient.
 
Power and OC Curves
Although the logic of tests of significance, the two types of errors and power are explained in the next chapter of this book, they are discussed here in the context of clinical trials. We assume that there are only two groups, one experimental or treatment group and the other a comparison or control group. In a typical randomized controlled trial the aim is to estimate the true response rate, πt, for a treatment under study and compare it with the estimate of the true response rate, πc, for a competing control therapy, whether it is an existing standard treatment or a placebo.
The treatment group of patients yields the observed response rate pt, which is the estimate of πt, and the control group of patients produces the observed response rate pc, which is the estimate of πc. The observed difference pc − pt is then an estimate of the effectiveness of the treatment. But even if πt and πc are truly equal, non-zero observed differences (pc − pt) will occur through the working of chance (called sampling variation) in the samples of patients. Various sizes of observed differences will occur with various probabilities. When the difference actually observed has a very small probability of occurring by chance, we make the 'possible but not probable' argument, and reject the notion that πc = πt. Of course, we have to have a rule for how small a 'small' probability must be. This is the famous 'level of significance', which is designated as α.
In a test of significance (i.e. test of hypothesis) we set up the null hypothesis (denoted H0) that πc − πt = 0. After we obtain pc − pt, we can figure out the probability that a difference as large in magnitude as the one actually observed would occur if H0 were true. This is the well-known P value. If P ≤ α we reject H0. When we reject H0, we are running a risk of making an erroneous decision. The two rates πt and πc may indeed be equal, so that our rejection decision is an error. This false positive error (made by rejecting H0 when in fact it is true) is called 'Type I error'. We have fixed the probability of Type I error at α because we use the rejection rule P ≤ α.
In the above procedure of testing we run another risk. If pc − pt is such that P > α, we shall decide 'do not reject H0', meaning that the observed difference is 'not statistically significant'. Clearly, this may be an error too. The true difference πc − πt may be non-zero, and chance delivers an observed difference pc − pt which is not large enough to satisfy our rejection rule. We then make the error of not rejecting H0 when in fact it is false. This false negative error is called 'Type II error' and its probability is designated as β.
The probability of Type II error is not one single value like the probability of Type I error. The probability of Type I error is calculated on the basis that H0 is correct, i.e. on the basis of πc − πt = 0. The probability of Type II error is based on the situation where H0 is false, i.e. on the basis of πc − πt ≠ 0. But if πc − πt is not zero, there are infinitely many values that such a difference (say Δ) can have. For each value of Δ there is a probability of Type II error when we run the test of H0. These values make up an entire curve of β as a function of Δ. This function is called the 'operating characteristic (OC) curve' of the test.
Many times the complement of β, i.e. (1 − β), which is called the 'power' of the test, is used instead of the value of β. 1 − β expresses the probability of avoiding a Type II error, that is to say, the probability of detecting the difference Δ. Formally, the power of a significance test is a measure of how likely that test is to produce a statistically significant result for a population difference of any given magnitude. Practically, it indicates the ability to detect a true difference of clinical importance. The power may be calculated retrospectively to see how much chance a completed study had of detecting (as significant) a clinically relevant difference. More importantly, it may be used prospectively to calculate a suitable sample size. If the smallest difference of clinical relevance can be specified, we can calculate the sample size necessary to have a high probability of obtaining a statistically significant result, i.e. power, if that is the true difference. For a quantitative variable, such as weight or blood pressure, it is also necessary to have a measure of variability.
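The following sketch shows how power can be computed prospectively for a comparison of two proportions using the normal approximation; the response rates and group size are illustrative assumptions (they roughly reproduce the 80% power aimed at in Example 2 of the next section).

from math import sqrt
from statistics import NormalDist

def power_two_proportions(p_c, p_t, n_per_group, alpha=0.05):
    # Approximate power of a two-sided z-test comparing two proportions,
    # each group of size n_per_group (normal approximation).
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p_c * (1 - p_c) / n_per_group + p_t * (1 - p_t) / n_per_group)
    delta = abs(p_t - p_c)
    # probability that |observed z| exceeds the critical value when the true difference is delta
    return 1 - NormalDist().cdf(z_alpha - delta / se) + NormalDist().cdf(-z_alpha - delta / se)

# e.g. true response rates 0.70 (control) vs 0.90 (treatment), 63 patients per group
print(round(power_two_proportions(0.70, 0.90, 63), 2))   # about 0.83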
 
Sample Size Determination
Determination of the appropriate sample size is a problem of great interest and implication for an investigator planning a clinical study/trial. It should be decided by balancing statistical against clinical and practical considerations. A study with an overly large sample size may be deemed unethical through unnecessary involvement of extra subjects and the correspondingly increased cost. Very large sample sizes almost always guarantee statistical significance, even for small, practically unimportant differences. On the other hand, a study with too small a sample size will be unable to detect clinically important results and sometimes may lead to false negative findings.
To determine the appropriate sample size the investigator must specify the following:
  1. The minimum difference between parameters which is of sufficient clinical importance (for example, the difference between two population means or two population proportions, designated as d).
  2. The ethical and medical implications of a wrong decision. That is, specifying:
    1. The acceptable level of type I error, namely, the error of rejecting the null hypothesis of equality of two population means or two population proportions when the null hypothesis is true (i.e. α) and,
    2. The acceptable level of type II error (the complement of which is called power), namely, the error of not rejecting the null hypothesis when it is untrue (i.e. β).
  3. The approximate proportion of the condition or its variability (or estimated standard deviation).
Then one of the following formulas (depending on the type of variable) is used to determine the sample size (n):
 
Quantitative (comparison of two means):
n = 2σ²(Zα − Z1−β)²/d²

Qualitative (comparison of two proportions):
n = 2PQ(Zα − Z1−β)²/d²

(Z1−β carries a negative sign in Table 4.7, so that Zα − Z1−β = Zα + |Z1−β|.)
Where,
  • Zα = value of the standard normal variate corresponding to the type I error α
  • Z1−β = value of the standard normal variate corresponding to (1 − type II error), i.e. power 1 − β
    (Zα and Z1−β for the most frequently used α and 1 − β are given in Table 4.7.)
  • σ = approximate variability (estimated SD) in the population
  • d = difference between the two means or proportions worth detecting, i.e. the difference of clinical importance
  • P = pooled value of P1 and P2, i.e. P = (P1 + P2)/2, where P1 is the assumed or expected proportion in sample 1 (say the experimental group) and P2 that in sample 2 (say the control group); since d = P1 − P2 (or P2 − P1), P2 = P1 − d (or P1 + d)
  • Q = complement of P (i.e. Q = 1 − P, or 100 − P when P is expressed as a percentage)
 
Notes:
  1. n is the number of subjects needed in each group. Equal sample size for both groups is assumed.
  2. The table gives two-sided values; the tabulated Zα for a given α also serves as the critical value for a one-sided test at level α/2 (for example, Zα = 1.96 for a one-sided test at 0.025).
Table 4.7   Values of Standard Normal Variate (For Two-Sided Test)
α         Zα          1 − β      Z1−β
0.10      1.64        0.90       −1.282
0.05      1.96        0.85       −1.037
0.025     2.24        0.80       −0.842
0.01      2.58        0.75       −0.675
0.005     2.81
0.001     3.27
 
Illustrative Examples
  1. A milk feeding trial on 5 year old children is to be carried out. It is known that at this age children's height gain in 12 months has a mean of about 6 cm and a standard deviation of 2 cm. An extra increase in height of 0.5 cm on average in the milk (experimental) group can be considered an important difference. How many children are needed in each group to conduct such a trial so as to achieve a 90% chance of detecting the specified difference of 0.5 cm, significant at the 5% level?
    1 − β = 0.90, ∴ Z1−β = −1.282
    For a one-sided test at α = 0.025 (equivalent to a two-sided test at the 5% level), Zα = 1.96
    ∴ n = 2σ²(Zα − Z1−β)²/d² = 2 × 2² × (1.96 + 1.282)²/(0.5)² ≈ 336.4, i.e. about 337 children are needed in each group.
  2. An investigation of past records reveals that a standard treatment in disease A gives 70% cure rate. A new drug for the treatment of this disease is to be accepted only if it has 90% cure rate. The level of significance (type I error) is set at 0.05 and value of type II error is set at 0.2. Calculate the appropriate sample size required in each group.
    P1 = 0.70, P2 = 0.90, so d = 0.20, P = (0.70 + 0.90)/2 = 0.80 and Q = 0.20
    α = 0.05, ∴ Zα = 1.96; β = 0.2, so 1 − β = 0.80 and Z1−β = −0.842
    ∴ n = 2PQ(Zα − Z1−β)²/d² = 2 × 0.80 × 0.20 × (1.96 + 0.842)²/(0.20)² ≈ 62.8, i.e. about 63 patients are required in each group.
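Both calculations are easy to reproduce; the short Python sketch below implements the two formulas given above (Z1−β is entered as a positive magnitude here, whereas Table 4.7 lists it with a negative sign). The function names are illustrative.

from math import ceil

def n_quantitative(sigma, d, z_alpha, z_power):
    # n per group for comparing two means: 2*sigma^2*(Z_alpha + Z_power)^2 / d^2
    return ceil(2 * sigma ** 2 * (z_alpha + z_power) ** 2 / d ** 2)

def n_qualitative(p1, p2, z_alpha, z_power):
    # n per group for comparing two proportions, using the pooled P and Q defined above
    p, d = (p1 + p2) / 2, abs(p1 - p2)
    return ceil(2 * p * (1 - p) * (z_alpha + z_power) ** 2 / d ** 2)

print(n_quantitative(2, 0.5, 1.96, 1.282))     # milk feeding example: about 337 per group
print(n_qualitative(0.70, 0.90, 1.96, 0.842))  # cure rate example: about 63 per group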
 
EQUIVALENCE TRIALS
A non-inferiority (one-sided equivalence) trial of therapeutic efficacy is typically designed to show that the relative risk (or relative incidence rate, or relative hazard rate) of disease, infection, etc. with the new therapy, compared to the control, is not greater than a pre-specified clinically relevant quantity. In a non-inferiority trial based on immune response, the relative effect of interest may be a difference in proportions responding in a pre-specified manner, or a ratio of means (arithmetic or geometric) or of geometric mean titres or concentrations. The trial is designed to show that the proportion responding to the new therapy is not less than that with the control by as much as a pre-specified quantity. In contrast, the classical trial is called a 'superiority trial' (a trial with the primary objective of showing that the response to the investigational product is superior to that to a comparative agent, active or placebo). For the evaluation of, say, titres, the trial may be designed to demonstrate that the ratio of the geometric mean titre/concentration of the new vaccine relative to the control is not less than some pre-specified ratio.
A non-inferiority trial for an adverse event can have as its comparative outcome measure either a difference or a ratio of risks. Since non-inferiority evaluations are one-sided, statistical inference is based only on the upper or lower confidence limit. The null hypothesis is that the treatment difference lies beyond the lower or upper equivalence margin. A two-sided equivalence trial, such as might be used to compare two lots, is designed to show that the outcome measure for one group is similar, in both directions, to that for the other group. The choice of the equivalence margins should be scientifically justified. Conclusions should be drawn on the basis of an appropriate confidence interval using the pre-specified criteria of equivalence (used in the sample size calculation). The result of the analysis of the primary endpoint should be one of the following: 1. the confidence interval for the difference between the two treatments lies entirely within the equivalence range, so that equivalence may be concluded with only a small probability of error; 2. the confidence interval covers at least some points which lie outside the equivalence range, so that a difference of potential clinical importance remains a real possibility and equivalence cannot safely be concluded; or 3. the confidence interval lies wholly outside the equivalence range.
In classical clinical trials the standard analysis uses statistical significance to determine whether the null hypothesis of "no difference" may be rejected, together with a confidence interval to place bounds on the possible size of the difference between the treatments. In an equivalence trial this conventional significance test has little relevance: failure to detect a difference does not imply equivalence, and a difference which is detected may not have any clinical relevance. The relevance of the confidence interval, however, is easier to see. It defines a range for the possible difference between the treatments, any point of which is reasonably compatible with the observed data. If no point within this range corresponds to a difference of clinical importance, then the treatments may be considered to be equivalent.
It is important to emphasize that absolute equivalence can never be demonstrated: it is possible only to assess whether the true difference is unlikely to be outside a range that depends on the size of the trial. If we have pre-defined the range of equivalence as an interval (−Δ, +Δ), we then simply check whether the confidence interval calculated on the observed difference lies entirely between −Δ and +Δ. If it does, equivalence is demonstrated; if it does not, then there is still room for doubt. The selection of α and β follows similar lines as for classical comparative trials. The use of a 95% confidence interval in an equivalence trial corresponds to a value of 0.025 for α. However, β is treated identically (generally set at 0.1 to give a power of 90%, or 0.2 to give a power of 80%). The distinction between one-sided and two-sided tests of statistical significance also carries over into the confidence interval approach. For a one-sided assessment, equivalence (non-inferiority) is declared if the lower one-sided confidence limit exceeds −Δ. This approach is indicated when the objective is to ensure that the new agent is not inferior to the standard. Often the motivation is that a new intervention may offer an advantage (superiority) that is not reflected in the measure of efficacy, such as fewer side effects, simpler dosing, less invasiveness, or lower cost.
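The confidence-interval check described above can be sketched as follows; the response rates, group sizes and margin Δ = 0.10 are illustrative assumptions.

from math import sqrt
from statistics import NormalDist

def equivalence_check(p_new, p_std, n_new, n_std, delta, level=0.95):
    # Two-sided CI for the difference in proportions (new - standard) checked
    # against a pre-specified equivalence margin (-delta, +delta).
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = sqrt(p_new * (1 - p_new) / n_new + p_std * (1 - p_std) / n_std)
    diff = p_new - p_std
    low, high = diff - z * se, diff + z * se
    if low > -delta and high < delta:
        verdict = "equivalence may be concluded"
    elif high < -delta or low > delta:
        verdict = "confidence interval lies wholly outside the equivalence range"
    else:
        verdict = "equivalence cannot safely be concluded"
    return (round(low, 3), round(high, 3), verdict)

# illustrative: 85% vs 83% response, 300 subjects per arm, margin delta = 0.10
print(equivalence_check(0.85, 0.83, 300, 300, 0.10))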
 
Other Terms
The treatment or the drug can be released for marketing after successful completion of phase III trials. However, the monitoring continues. This is called post-marketing surveillance (PMS) and is generally restricted to monitoring of adverse reactions. Sometimes a physician's experience program (PEP) is conducted. Systematic collection of such information is also called a phase IV study/trial. A whole new discipline, pharmacoepidemiology, studies these issues and applies epidemiological techniques to the study of drugs, especially drug safety.
Despite all the care that is exercised in clinical trials, it is important to realize that the efficacy of a regimen is not the same as effectiveness. Efficacy is evaluated under controlled conditions, whereas effectiveness is evaluated in the field, where some conditions may deviate from those kept under control in a clinical trial. Effectiveness is determined not only by efficacy but also by coverage, compliance, provider performance, etc.
An adverse event is any untoward medical occurrence in a clinical trial subject administered a treatment/vaccine/placebo; it does not necessarily have a causal relationship with the treatment/vaccine/placebo. An adverse reaction is a response to a treatment/vaccine/placebo that is noxious and unintended. A serious adverse event is an event that is associated with death, admission to hospital, prolongation of hospital stay, persistent disability or incapacity, or is otherwise life-threatening in connection with the clinical trial. A case report/record form (CRF) is a document used to record data on a clinical trial subject during the course of the clinical trial, as defined by the protocol. The data should be collected by procedures that guarantee preservation, retention and retrieval of information and allow easy access for verification, audit and inspection.
A subject's voluntary confirmation of willingness to participate in the trial, and the documentation thereof, is called informed consent. This consent should be sought after appropriate information has been given about the trial, including an explanation of its status as research, its objectives, potential benefits, risks and inconveniences, alternative treatment(s) that may be available, and the subject's rights and responsibilities, in accordance with the current revision of the Declaration of Helsinki. Meta-analysis (which seeks to combine evidence from different studies) should include studies performed using similar methodology. It is quite common in the medical literature that an overall assessment is made by pooling results from different publications. However, two points to be remembered in this regard are: 1. the literature is generally unduly loaded with "positive" results (negative or indifferent results are either not sent for publication or are not published by the journals with the same frequency; any conclusion based on what is common in publications can therefore magnify this bias), and 2. individual studies with small sample sizes may give statistically non-significant results but, when combined, can give significant results because the sample size is so much increased by pooling.
In an intention to treat analysis patients are analyzed according to their randomized treatment, irrespective of whether they actually received the treatment (ITT analysis is not appropriate for examining adverse effects). On the other hand, a per protocol analysis compares patients according to the treatment actually received and includes only those patients who satisfied the entry criteria and properly followed the protocol. Biological assays (bioassays) are methods for the estimation of the nature, constitution, or potency of a material (or of a process, drug) by means of the reaction that follows its application to living matter. Stochastics is the branch of statistics which deals with the study of sequences of events that are governed by probabilistic laws. Sometimes trials are classified as either 'explanatory' (trials that aim to determine the exact pharmacological action of a drug) or 'pragmatic' (trials that aim to determine the efficacy of a drug as used in day-to-day clinical practice).
  • The total therapeutic effect arises as a combination of pharmaceutical action together with the doctor's behavior and the patient's anticipation. Therefore, it is often said that ‘Statistical measurements of therapy have often dealt with pharmaceutical but not therapeutic effectiveness’. This fact needs to be taken into consideration while conducting a clinical trial.
  • It is often argued that "If these people are not a random sample of the disease or condition under treatment, the results cannot be validly extrapolated for statistical purposes. The results are highly limited." Many randomized trials have been performed as a type of 'in vitro' experiment. The ability to generalize the results of randomized trials would be greatly enhanced if better techniques were developed for identifying and stratifying clusters of patients with distinctive prognostic characteristics.
  • As already mentioned in ‘sampling’ chapter, any study/trial has two important aspects, one generalizability and the other validity. Sample size and sample selections (i.e. random sampling) are vital for generalizability, which is no doubt important. But the study should be valid in the first place. Random allocation is one important component to increase the validity of a trial.
  • A single trial may demonstrate that an agent is efficacious, but cannot prove the absence of efficacy. No clinical trial can prove that an agent is safe. The trial can only show that the agent brought no harm to the people who received it. The agent may not have been maintained long enough for toxic effects to develop, or the trial may not have included the particular kind of patients for whom the agent is unsafe.
  • Trials in humans are generally classified into three phases: phase I, phase II and phase III. In certain countries formal regulatory approval is required in order to undertake phase I, II and III studies. This approval takes different forms, such as an ‘Investigational New Drug (IND) Application’ in the United States of America and a ‘Clinical Trial Certificate (CTC)’ or ‘Clinical Trial Exemption (CTX)’ in the United Kingdom. This is in addition to the ethical clearance, consideration and review required in accordance with the “Declaration of Helsinki”. All trials should adhere to the standards described in Good Clinical Practice (GCP).
  • Toxicity studies in animals may be considered to assess the potential toxic effects in target organs, including the hematopoietic and immune systems as well as to assess systemic toxicity. Toxicity studies may help to identify potential toxic problems requiring further clinical monitoring. However, it should be recognized that a suitable animal model may not be available for undertaking toxicological evaluation and that such models are not necessarily predictive of human responses.
  • The determination of study population sample size as well as the duration of the trial to achieve a statistically meaningful result with respect to efficacy and safety requires a clear understanding of background disease incidence as well as an understanding of the background incidence of various adverse reactions.
  • The gold standard in clinical research is the randomized, placebo-controlled, double-blind clinical trial. This design is favored for confirmatory trials carried out as part of the phase III development of new medicines. Because of the number and range of medicines already available, new medicines are increasingly being developed for indications in which a placebo control group would be unethical. In such situations, one obvious solution is to use as an active comparator an existing drug already licensed and regularly used for the indications in question. The new treatment is simply expected to match the efficacy of the standard treatment but have advantages in safety, convenience, or cost; in some cases the new treatment may have no immediate advantage but may present an alternative or second line therapy. Under these circumstances, the objective of the trial is to show equivalent efficacy — the so called “equivalence” trial. Such trials have been referred to as “active control equivalence studies” or “positive control studies.”
  • Censoring of the subjects means ‘not being observed for the full period until the occurrence of the event of interest’. Censored observations contribute to the denominator but not to the numerator in risk calculations.
  • In meta-analysis studies relevant to a conceptual issue are collected, summary statistics from each study (e.g. means or correlations) are treated as units of analysis, and the aggregate data are then analyzed in quantitative tests of propositions under examination.
  • It may be accepted as a maxim that a poorly or improperly designed study involving human subjects is by definition unethical. Moreover, when a study is in itself scientifically invalid, all other ethical considerations become irrelevant. There is no point in obtaining ‘informed consent’ to perform a useless study.

Confidence Intervals and Logic of Tests of Significance5

 
Depending upon whether the variable is qualitative or quantitative, the parameter generally of interest is the proportion (or probability) π or the mean µ, respectively. The sample values p and x̄ are considered to be point estimates of π and µ respectively. Samples are subject to fluctuations and different samples tend to give different estimates. This inter-sample variability in p and x̄ should be estimated. As in the case of individuals, the variability in p and x̄ from sample to sample is measured in terms of their standard deviation, but it is now called the standard error. Thus the standard error (SE) is the measure of variability of sample estimates from sample to sample. Use of these terms helps to distinguish between inter-individual variability, measured by the standard deviation (SD), and inter-sample variability, measured by the SE. The SE of p measures the variability in the proportion p from sample to sample, and the SE of x̄ measures the variability in the mean x̄ from sample to sample. The larger the SE, the more the variability and the less our confidence in the sample results.
The SE is calculated on the basis of all possible samples of specific size n from the specified target population. However, these samples are not actually drawn. Statistical theory helps to obtain SE on the basis of just one sample, provided it is randomly drawn. In case of simple random sampling, the SEs are computed as follows.
Standard errors: SE(p) = √[π(1 − π)/n] and SE(x̄) = σ/√n,
where n is the size of the sample, π is the proportion in the population and σ is the SD of the individuals in the population. If the measurements of all N subjects in the entire target population are really available, σ is computed with denominator N and not (N − 1). In practice, though, the population parameters are seldom known (if they were known there would be no need of a sample); they are estimated by the corresponding values in the sample. Since the estimator of π is p and of µ is x̄, we get,
Estimated standard errors: est. SE(p) = √[p(1 − p)/n] and est. SE(x̄) = s/√n, where s is now computed with denominator (n − 1). This adjustment helps to achieve a more accurate estimate in the long run, called 'unbiased' in statistics.
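A small sketch of these estimated standard errors (the data are illustrative):

from math import sqrt
from statistics import stdev, mean

def estimated_se_proportion(p, n):
    # Estimated SE of a sample proportion: sqrt(p(1-p)/n)
    return sqrt(p * (1 - p) / n)

def estimated_se_mean(sample):
    # Estimated SE of a sample mean: s/sqrt(n), with s using denominator (n - 1)
    return stdev(sample) / sqrt(len(sample))

print(round(estimated_se_proportion(0.64, 80), 3))           # about 0.054
data = [96, 98, 101, 95, 99, 104, 100, 97, 102, 98]          # illustrative measurements
print(round(mean(data), 1), round(estimated_se_mean(data), 2))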
It has been theoretically established that, in almost all situations arising in medical practice, the distribution of p and x̄ over repeated samples tends to become Gaussian when n is large—even when the underlying distribution among individuals is not Gaussian. This statistical result, called the central limit theorem (CLT), is very useful in drawing inferences on π and µ when n is large. The criteria generally followed to decide whether n is large are:
For proportion, n is large if np ≥ 5 and n(1 – p) ≥ 5
For mean, n is large if n ≥ 30.
If p = 0.30 then n must be a minimum of 17 and if p = 0.02 then n must be at least 250.
 
Confidence Intervals
The method of random sampling has one more advantage. When an average value or a proportion (or any other quantity such as a ratio) is calculated from the sample, we can estimate the range within which the corresponding population parameter is expected to lie with a given degree of probability. This probability is called confidence and the range so obtained is called a confidence interval (CI). A confidence interval gives a range in the hope that it will include the parameter of interest. The confidence level associated with the interval (say 90, 95, or 99 percent) is the percentage of all such possible intervals that will actually include the true value of the parameter. Note that a CI tells us what to expect in the long run. It does not say anything about what happened in a particular sample.
Most of the statistical procedures are designed to help decide whether or not a set of observations is compatible with some hypothesis. These procedures yield P values to estimate the chance of reporting that a treatment has an effect when it really does not and the power to estimate the chance that the test would detect a treatment effect of some specified size. This decision-making paradigm does not characterize the size of the difference. It can be more informative to think not only in terms of the accept-reject approach of statistical hypothesis testing but also to estimate the size of the treatment effect together with some measure of the uncertainty in that estimate.
The specific 95 percent confidence interval we obtain depends on the specific random sample we happen to select for observation. The confidence interval gives a range that is computed in the hope that it will include the parameter of interest. A particular interval (associated with a given set of data) will or will not actually include the true value of the parameter. The confidence level associated with the interval (say 90, 95, or 99 percent) gives the percentage of all such possible intervals that will actually include the true value of the parameter. Unfortunately, you can never know whether or not a particular interval does. All you can say is that the chance of selecting an interval that does not include the true value is small (10, 5, or 1 percent). Therefore, a specific 95 percent confidence interval associated with a given set of data may or may not include the true size of the treatment effect, but in the long run 95 percent of all possible confidence intervals will include the true value associated with the treatment. So it describes not only the size of the effect but also quantifies the certainty with which one can estimate the size of the treatment effect.
 
Confidence interval for π and µ
CI for π: When n is large, the property of the Gaussian distribution is invoked to calculate the CI for π.
In the long run, if repeated samples are taken, the sample proportion p would vary around the proportion π in the target population. The Gaussian distribution says that there is a 95% chance that the interval extending 2SE on either side of p would contain the value of the parameter π, and only a 5% chance that the parameter value would be outside such limits. These are called the ± 2SE limits—the lower limit is 2SE less than p and the upper limit is 2SE more than p; that is, the 95% CI for π is p ± 2√[p(1 − p)/n]. Note that we are using the estimate of the SE in place of the SE itself. This can safely be done when n is large.
Example: The management of cases of bronchiolitis in infants may become easier if somehow the course of the disease can be predicted on the basis of the condition at the time of hospital admission. One criterion for this could be the respiratory rate (RR). Consider an investigation in which cases with RR ≥ 68 per minute are observed for their hospital course. Suppose in a random block of 80 consecutive cases of bronchiolitis coming to a hospital with RR ≥ 68, a total of 51 (64%) are ultimately observed to have a severe form of the disease, i.e. they either had a prolonged stay in the hospital, developed some complication, required endotracheal intubation or mechanical ventilation, or died. This 64% is the percentage observed in the present sample. Other samples from the same hospital or from the same area may give a different percentage. What could be the percentage of cases having the severe form of disease in the entire 'population' admitted to the hospital with a diagnosis of bronchiolitis and RR ≥ 68 per minute?
The best point estimate as per this sample is 64 percent. However, this estimate is likely to differ from the actual percentage in the whole population or when another sample is taken. Since n = 80 and p = 0.64, np and n(1 − p) are large enough to ensure a Gaussian pattern. The ± 2SE limits in this example are 0.64 ± 2√(0.64 × 0.36/80) = 0.64 ± 0.11, i.e. (0.53, 0.75).
Thus, the percentage with poor prognosis is likely to be anywhere between 53 and 75 in cases of bronchiolitis with RR ≥ 68. In other words, there are between 53% and 75% chances that a case of bronchiolitis with RR ≥ 68 per minute at the time of hospitalization will require special handling.
Suppose that 6% of those with RR ≥ 68 per minute in the above example fail to survive. The 95% CI for the proportion dying is 0.06 ± 2√(0.06 × 0.94/80) = 0.06 ± 0.05, i.e. approximately (0.01, 0.11).
Thus the actual case fatality rate could be anywhere between 1% and 11%. This is a wide interval relative to the case fatality 6% observed in the sample, in the sense that the case fatality could in fact be nearly double of what is actually observed. Compare it with (53%, 75%) interval earlier obtained for cases with poor prognosis. This interval is narrow relative to the 64% rate observed in the sample. In general, the CI is narrow relative to p when p is around 0.5, say between 20% and 80% and wide relative to p when p is either very low or very high.
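The two intervals above can be reproduced with a few lines of code; the ± 2SE rule is the one used in the text.

from math import sqrt

def approx_ci_proportion(p, n, multiplier=2):
    # Approximate CI for a population proportion: p +/- multiplier * sqrt(p(1-p)/n)
    se = sqrt(p * (1 - p) / n)
    return (round(p - multiplier * se, 2), round(p + multiplier * se, 2))

print(approx_ci_proportion(0.64, 80))   # severe disease: about (0.53, 0.75)
print(approx_ci_proportion(0.06, 80))   # case fatality: about (0.01, 0.11)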
CI for mean µ: The central limit theorem tells us that the distribution of the sample mean is nearly always Gaussian for large n. The underlying distribution of measurements on individual subjects is then immaterial. The distribution of the duration of survival after detection in leukaemic patients is skewed and far from Gaussian. Yet, when mean survival times are obtained in many samples, each of large size, the means would still follow a near Gaussian pattern. We can again invoke the property of the Gaussian distribution and obtain that the 95% CI for µ is x̄ ± 2σ/√n. But σ (and hence the SE) is seldom known and is replaced by its estimate. This changes the standard Gaussian pattern to a different form called Student's t. Fortunately, this shift is negligible for large n and we continue to use, as the 95% CI for µ (large n), x̄ ± 2s/√n.
Example: A random sample of 100 hypertensives with a mean diastolic BP of 102 mm Hg is given a new hypotensive drug for one week as a trial. The mean level after the therapy came down to 96 mm Hg. The SD of the decrease in these 100 subjects is 5 mm Hg. What is the 95% CI for the actual mean decrease?
The 95% CI for the mean decrease is 6 ± 2 × 5/√100 = 6 ± 1, i.e. (5, 7) mm Hg.
Thus, there are 95% chances that the actual mean decrease after one week regimen would be between 5 and 7 mm Hg. The other interpretation of such a CI is that there are only 5% chances that the decrease after one week regimen would either be less than 5 mm Hg or more than 7 mm Hg when a very large number of subjects or when the entire target population is investigated.
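The same ± 2SE rule applied to a mean reproduces the interval just obtained.

from math import sqrt

def approx_ci_mean(sample_mean, sd, n, multiplier=2):
    # Approximate large-sample CI for a population mean: mean +/- multiplier * s/sqrt(n)
    se = sd / sqrt(n)
    return (sample_mean - multiplier * se, sample_mean + multiplier * se)

print(approx_ci_mean(6, 5, 100))   # mean decrease in diastolic BP: (5.0, 7.0) mm Hg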
  • Note that the CI (5, 7) mm Hg in this example is for the mean decrease. Decreases in 95% of individual patients would vary between 6 ± 2SD, i.e. 6 ± 2 × 5, or between (6 − 10, 6 + 10) = (−4, 16) mm Hg. Individual variation is much higher than the variation in means. The minus sign indicates that some patients may show a rise instead of a decline in diastolic BP, to the extent of 4 mm Hg. Care is required to maintain the distinction between SD and SE.
  • It is important to realize that a 95% CI is very likely to contain the value of the parameter. Thus the CI is to be interpreted as useful in the long run. If 100 such trials are done, nearly 95 of them are likely to give a mean decrease between 5 mm Hg and 7 mm Hg. The other 5 trials can give either a higher or a lower decrease. In fact, the value of a CI lies more in what it does not contain. In this example, the CI says that the chance of the mean decrease being either less than 5 mm Hg or more than 7 mm Hg is very remote.
  • The multiplier 2 is an approximation to the exact value 1.96 of the Gaussian distribution. It is preferred not only for convenience but also because this slight increase tends to cover mild variation from the Gaussian pattern; in practice, the distribution is seldom exactly Gaussian. Secondly, replacing σ by the sample SD (s) in any case transforms the Gaussian distribution to a t-distribution, which has a larger variance. In fact, for relatively small n, the multiplier could be more than 2.0; for example, for n = 30, the exact value is 2.042 (refer to the 't' table of the Appendix).
  • Though the discussion here is restricted to the confidence level of 95% (because that is the most common level), the CI can be obtained with any other confidence level. The multiplier is approximately 2.6 for 99% confidence, i.e. the 99% CI is obtained by the ± 2.6SE limits. The 90% CI is obtained by ± 1.7SE and the 80% CI by ± 1.3SE. The exact multipliers are in the Z table of the Appendix.
Proportion and mean are just about the two most common parameters on which confidence intervals are drawn. There might be isolated examples where the interest is in a CI for the median or for a decile, or even for σ. The basic methodology to obtain a CI is to get the 2.5th and 97.5th percentiles of the distribution of the corresponding value in the sample. The difficulty is that the distribution, and hence the 2.5th and 97.5th percentiles, of the sample median or of a sample decile is not easy to obtain.
CIs for difference: The interest in medicine often is in the difference between two groups, such as placebo and drug group, drug 1 and drug 2 groups and males and females. We can estimate the limits within which the true difference is likely to lie.
(1 − α)100% CI for (µ1 − µ2): We explain the calculations and interpretation with the help of an example.
Example: Suppose we wish to study the potential value of a proposed antihypertensive drug; we select two samples of 80 people each and administer the drug to one group and placebo to the other. The groups were similar with respect to diastolic blood pressure before the administration. The treatment group now (i.e. after administration) has a mean diastolic blood pressure of 80 mm Hg and a standard deviation of 10 mm Hg; the control (i.e. placebo) group has a mean blood pressure of 85 mm Hg and a standard deviation of 12 mm Hg. To test the null hypothesis that the mean diastolic blood pressure is the same in both groups (i.e. the difference in means is zero), the independent samples t-test is used: t = −5/1.74 = −2.87, which is significant at 158 df for α = 0.05. But the question is—'is this result clinically significant?' To gain a feeling for this, compute the 95 percent confidence interval for the mean difference in diastolic blood pressure for people taking the drug versus the placebo. Since t0.05 for 158 df is 1.977, the confidence interval is
− 5 − 1.977(1.74) < µ1 − µ2 < − 5 + 1.977(1.74)
− 8.44 < µ1 − µ2 < − 1.56
In other words, we can be 95 percent confident that the drug lowers blood pressure by between 1.56 and 8.44 mm Hg. This is not a very large effect, especially when compared with the standard deviations of the blood pressures observed within each of the samples, which are around 11 mm Hg. Thus, while the drug does seem to lower blood pressure on the average, examining the confidence interval permits us to see that the size of the effect is not very impressive. The small value of P was more a reflection of the sample size than of the size of the effect on blood pressure.
We can summarize the formulas used in the calculation of the (1 − α)100% CI for (µ1 − µ2) as follows. The CI is
(x̄1 − x̄2) ± tα × SE(x̄1 − x̄2)
where SE(x̄1 − x̄2) = √[sp²{(1/n1) + (1/n2)}], sp² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2), and
tα is the t value corresponding to α with df = (n1 + n2 − 2).
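A sketch of this calculation for the blood-pressure example (the pooled-variance form of the SE is assumed, which reproduces the SE of about 1.74 used above):

from math import sqrt

def ci_diff_means(m1, s1, n1, m2, s2, n2, t_crit):
    # (1 - alpha)100% CI for (mu1 - mu2) using the pooled-variance t interval
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)   # pooled variance
    se = sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = m1 - m2
    return (round(diff - t_crit * se, 2), round(diff + t_crit * se, 2))

# drug group: mean 80, SD 10, n 80; placebo group: mean 85, SD 12, n 80; t(158, 0.05) ~ 1.977
print(ci_diff_means(80, 10, 80, 85, 12, 80, 1.977))   # about (-8.45, -1.55)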
(1 – α) 100% CI for (p1 – p2): We explain the calculations and interpretation with the help of an example.
Example: Many studies have reported evidence that administering low-dose aspirin to people receiving regular kidney dialysis reduces the proportion of people who develop thromboses. In one such study, 70% of the 40 people taking the placebo developed thromboses and 30% of the 40 people taking aspirin did. Given only this information, we would report that aspirin reduces the proportion of patients who develop thromboses by 40%. It is most probably true that aspirin reduces the proportion of patients who develop thromboses, but the reduction could be by only 18%. A 95% confidence interval reveals this fact. The standard error of the difference in proportions of patients who developed thromboses is 0.11, so the 95% confidence interval for the true difference in the proportion of patients who developed thromboses is 18 to 62%. That means we can be 95 percent confident that aspirin reduces the rate of thromboses by somewhere between 18 and 62% compared with placebo.
We can summarize the formulas used in calculation of (1 – α)100% CI for (π1 – π2) as follows.
CI is (p1 − p2) ± Zα × SE(p1 − p2)
where SE(p1 − p2) = √[p(1 − p) × {(1/n1) + (1/n2)}]
p = [(n1p1) + (n2p2)]/(n1 + n2)
Zα is the Z value corresponding to α.
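A sketch of this calculation for the aspirin example given above:

from math import sqrt

def ci_diff_proportions(p1, n1, p2, n2, z=1.96):
    # (1 - alpha)100% CI for (pi1 - pi2) using the pooled-proportion standard error
    p = (n1 * p1 + n2 * p2) / (n1 + n2)                 # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    diff = p1 - p2
    return (round(diff - z * se, 2), round(diff + z * se, 2))

# placebo: 70% thromboses in 40 patients; aspirin: 30% in 40 patients
print(ci_diff_proportions(0.70, 40, 0.30, 40))   # about (0.18, 0.62)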
The interest sometimes is not in both the lower and upper limits of confidence but only in one of them. Such limits are then called bounds. Suppose in a group of heart patients the 5-year survival rate is 60% when they are on medication. Surgery is expensive and can be advised only if it substantially increases the survival rate. This increase could be quantified as a minimum of, say, 20%. This is the lower bound on the increase in the survival rate in this case.
 
Exact Confidence Intervals for Rates and Proportions
When the sample size or the observed proportion is too small, the approximate confidence interval based on the normal distribution is not reliable, and the confidence interval has to be computed from the exact theoretical distribution of the proportion. When the sample size is small or the result is extreme we use the binomial distribution, and when the observed proportion is small we use the Poisson distribution, to calculate the confidence interval for the population proportion or population rate. Since results based on small sample sizes with low observed rates of events turn up frequently in the medical literature, the results of computation of confidence intervals using the binomial and Poisson distributions are presented below.
To illustrate the failure of the usual procedure (when np is below about 5, where n is the sample size and p is the proportion), consider an example. Suppose a particular surgeon has done 10 operations without a single complication. His observed complication rate p is 0/10 = 0 percent for the 10 specific patients he operated on. This is impressive, but it is unlikely that the surgeon will continue operating forever without a complication. Therefore the fact that p = 0 probably reflects good luck in the patients who happened to be operated on during the period in question. To obtain a better estimate of the surgeon's true complication rate, we will compute the 95 percent confidence interval for p.
Let us try to apply the existing (usual) procedure: SE(p) = √[p(1 − p)/n] = √[0 × 1/10] = 0, and the 95 percent confidence interval is p ± 1.96 SE(p) = zero to zero. This result does not make sense. Obviously, the approximation breaks down.
A figure in Biometrika, vol. 26, 1934, page 410, gives a graphical presentation of the 95% confidence intervals for proportions which is very easy to use. The upper and lower limits are read off the vertical axis using the pair of curves corresponding to the size of the sample n used to estimate p at the point on the horizontal axis corresponding to the observed p. For our surgeon, p = 0 and n = 10 so the 95% confidence interval for his true complication rate is from 0 to .27. In other words, we can be 95% confident that his true complication rate, based on the 10 cases we happened to observe, is somewhere between 0 and 27%.
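For the extreme cases (0% or 100% observed), the exact binomial (Clopper-Pearson) limits have a simple closed form, sketched below; it reproduces the entries of Table 5.1 that follows.

def exact_limits_extreme(n, alpha=0.05):
    # Exact 95% confidence limits when the observed proportion is 0% or 100%
    # (closed form of the binomial / Clopper-Pearson limits for x = 0 or x = n).
    upper_when_zero = 1 - (alpha / 2) ** (1 / n)   # highest true % compatible with 0 observed events
    lower_when_all = (alpha / 2) ** (1 / n)        # lowest true % compatible with n events out of n
    return round(100 * upper_when_zero, 2), round(100 * lower_when_all, 2)

for n in (10, 30, 100):
    print(n, exact_limits_extreme(n))
# 10 -> (30.85, 69.15); 30 -> (11.57, 88.43); 100 -> (3.62, 96.38), matching Table 5.1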
Since we often have to find a confidence interval for an extreme result (0 or 100%), 95 percent confidence intervals for extreme results for various sample sizes, calculated using EPI-INFO, are displayed in Table 5.1.
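A minimal sketch of the exact (Clopper-Pearson) binomial limits, assuming SciPy is available; the helper function below is illustrative rather than part of any particular package.

from scipy.stats import beta

def exact_binomial_ci(x, n, conf=0.95):
    # Clopper-Pearson limits obtained from the beta distribution
    a = 1 - conf
    lower = 0.0 if x == 0 else beta.ppf(a / 2, x, n - x + 1)
    upper = 1.0 if x == n else beta.ppf(1 - a / 2, x + 1, n - x)
    return lower, upper

# Surgeon with 0 complications in 10 operations
print(exact_binomial_ci(0, 10))   # (0.0, 0.3085), i.e. the 30.85% entry for n = 10 in Table 5.1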
When a simple random sample of independent events in a large population is used to estimate the mortality rate, incidence rate, or prevalence rate of a given disease in that larger population, it is extremely helpful to describe the precision of the estimate. One such measure of precision is the confidence interval, which allows us to estimate, with a given degree of probability, what the rate is in the larger population. In calculating confidence intervals for relatively uncommon events, such as mortality or morbidity rates for diseases of the nervous system (rare diseases), the Poisson distribution is most appropriate.
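For rare events the exact Poisson limits can be obtained from the chi-square distribution; a sketch under the same assumption that SciPy is available (the observed count of 3 events is hypothetical).

from scipy.stats import chi2

def exact_poisson_ci(x, conf=0.95):
    # Exact limits for the expected number of events when x events are observed
    a = 1 - conf
    lower = 0.0 if x == 0 else chi2.ppf(a / 2, 2 * x) / 2
    upper = chi2.ppf(1 - a / 2, 2 * (x + 1)) / 2
    return lower, upper

# e.g. 3 deaths observed: the expected count lies between about 0.62 and 8.77;
# dividing by the person-time denominator converts these limits into rates.
print(exact_poisson_ci(3))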
Table 5.1   95% confidence limits for extreme results

Denominator    If the observed % is 0%,          If the observed % is 100%,
(n)            the true % could be as high as    the true % could be as low as
1              97.50%                            2.50%
2              84.19%                            15.81%
3              70.76%                            29.24%
4              60.24%                            39.76%
5              52.18%                            47.82%
6              45.93%                            54.07%
7              40.96%                            59.04%
8              36.94%                            63.06%
9              33.63%                            66.37%
10             30.85%                            69.15%
15             21.80%                            78.20%
20             16.84%                            83.16%
25             13.72%                            86.28%
30             11.57%                            88.43%
35             10.00%                            89.99%
40             8.81%                             91.19%
45             7.87%                             92.13%
50             7.11%                             92.89%
55             6.49%                             93.51%
60             5.96%                             94.04%
65             5.52%                             94.48%
70             5.13%                             94.87%
75             4.80%                             95.20%
80             4.51%                             95.49%
85             4.25%                             95.75%
90             4.02%                             95.98%
95             3.81%                             96.19%
100            3.62%                             96.38%
150            2.43%                             97.57%
200            1.83%                             98.17%
300            1.22%                             98.80%
400            0.92%                             99.08%
500            0.73%                             99.26%
1000           0.37%                             99.63%
 
Differentiation between Standard Deviation and Standard Error, and between Confidence Interval and Tolerance Interval
The standard deviation and the standard error of the mean measure two very different things. The standard error of the mean tells us not about variability in the original population, as the standard deviation does, but about the certainty with which a sample mean estimates the true population mean. Since the certainty with which we can estimate the mean increases as the sample size increases, the standard error of the mean decreases as the sample size increases. Conversely, the more variability in the original population, the more variability will appear in possible mean values of samples. Therefore, the standard error of the mean increases as the population standard deviation increases.
It has been shown that the distribution of mean values will always approximately follow a normal distribution, regardless of how the population from which the original samples were drawn is distributed. This result is called the "central limit theorem", which can be summarized as:
  • The distribution of sample means will be approximately normal regardless of the distribution of values in the original population from which the samples were drawn.
  • The mean value of the collection of all possible sample means will equal the mean of the original population.
  • The standard deviation of the collection of all possible means of samples of a given size, called the standard error of the mean, depends on both the standard deviation of the original population and the size of the sample; specifically, it equals σ/√n, where σ is the population standard deviation and n is the sample size.
Since the possible values of the sample mean tend to follow a normal distribution, the true (but unobserved) mean of the original population will be within 2 standard errors of the sample mean about 95% of the time. The greater precision with which the sample mean estimates the population mean at larger sample sizes is reflected in the smaller standard error of the mean.
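A small simulation makes these points concrete; the exponential population, the sample size of 50 and the 10,000 repetitions below are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
pop_mean, pop_sd, n = 1.0, 1.0, 50        # exponential(1) population: mean 1, SD 1
se = pop_sd / np.sqrt(n)                  # standard error of the mean

# Draw 10,000 samples of size n from the skewed population and compute each sample mean
means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)

# Proportion of sample means lying within 2 standard errors of the population mean
within_2se = np.mean(np.abs(means - pop_mean) < 2 * se)
print(round(within_2se, 3))               # roughly 0.95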
Most medical investigators summarize their data with the standard error of the mean because it is always smaller than the standard deviation; it makes their data look better. However, unlike the standard deviation, which quantifies the variability in the population, the standard error of the mean quantifies uncertainty in the estimate of the mean. Since readers are generally interested in knowing about the population, data should never be summarized with the standard error of the mean.
To understand the difference between the standard deviation and standard error of the mean and why one ought to summarize data using the standard deviation, consider the following example.
Suppose that the average duration of gestation in 100 women was found to be 280 days with a standard deviation of 5 days. Since the sample size is 100, the standard error is 5/√100 = 0.5, and the 95% confidence interval for the average gestation period of the entire population is 279 to 281 days. These values describe the range which, with about 95% confidence, contains the average gestation period of the entire population from which the sample of 100 women was drawn. This is not the interval that contains the gestation periods of 95% of the women. If we want that interval, then we should use the standard deviation and not the standard error. So the interval which contains the gestation periods of 95% of the women is 280 ± 2(5) = 270 to 290 days. Such an interval is called a "tolerance interval" and the end points of such an interval are called "tolerance limits."
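The distinction can be checked numerically with the gestation figures above (mean 280 days, SD 5 days, n = 100), using the same multiplier of 2 as in the text.

from math import sqrt

mean, sd, n = 280, 5, 100
se = sd / sqrt(n)                            # 0.5

ci = (mean - 2 * se, mean + 2 * se)          # where the population MEAN lies: 279 to 281
tolerance = (mean - 2 * sd, mean + 2 * sd)   # where ~95% of INDIVIDUAL gestations lie: 270 to 290
print(ci, tolerance)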
 
P-Values and Statistical Significance
Intersample variability has another type of implication. If the mean decrease in cholesterol level after a therapy is 9 mg/dL in a sample of 60 subjects aged 40-49 years and 13 mg/dL in a second sample of 25 subjects aged 50-59 years, can it be safely concluded that the average decrease in the two groups is really different? Or did this difference occur just by chance in these samples? Statistical methods help to take a decision one way or the other on the basis of the probability of occurrence of such a difference. When the conclusion is that the difference is very likely to be real, the difference is called statistically significant. In order to describe the concept of statistical significance more fully, we briefly visit the methodology followed in all empirical conclusions. This will also help in understanding the concepts of the null hypothesis and of P-values, which are so vital to the concept of statistical significance. These concepts are intimately related to the confidence interval.
 
Null Hypothesis and P-value
The concepts are best understood with the help of an example of a court decision in a crime case. Consider the possibilities mentioned in Table 5.2. When a case is presented before a court by the prosecutor, the judge is supposed to start with the presumption of innocence.
Table 5.2   Error in various settings – Court setting
Decision                 Assumption of innocence true    Assumption of innocence false
Pronounced guilty        Serious error
Pronounced not guilty                                    Error
In a court of law, it is up to the prosecutor to put up sufficient evidence against the innocence of the person and change the initial opinion of the judge. Guilt should be proved beyond reasonable doubt. If the evidence is not sufficient, the person is acquitted whether the crime was committed or not. Sometimes the circumstantial evidence is strong and an innocent person is wrongly pronounced guilty. This is considered a very serious error. Special caution is exercised to guard against this type of error, even at the cost of acquitting some criminals!
Diagnostic Journey: In the process of diagnosis, a healthy individual may be wrongly classified as ill (false positive—misdiagnosis) and a really ill person may fail the detection procedure (false negative—missed diagnosis). The diagnostic journey always starts with the presumption of "no diagnosis" with respect to a particular disease. To rule out or to confirm the presence of the disease, a thorough clinical examination and/or some diagnostic test is used. But since the whole procedure is not foolproof, the above two types of errors (misdiagnosis and missed diagnosis) are possible (Table 5.3).
Table 5.3   Error in various settings – Diagnosis setting
Diagnosis            Actual condition: disease absent    Actual condition: disease present
Disease present      Misdiagnosed
Disease absent                                           Missed diagnosis
In the process of hypothesis testing, two types of errors can be committed. The error committed when a true null hypothesis is rejected (i.e. a conclusion of significant difference where in fact there is no real difference) is called "Type I error". The probability of committing a type I error is generally denoted by α. The error committed when a false null hypothesis is not rejected (i.e. a conclusion of no significant difference in the presence of a true difference) is called "Type II error". The probability of committing a type II error is generally denoted by β. The complement of the type II error, 1 − β (i.e. the probability of rejecting H0 whenever H0 is false), is called the power (Table 5.4).
Table 5.4   Error in various settings – Empirical setting
Decision        Null hypothesis true    Null hypothesis false
Rejected        Type I error
Not rejected                            Type II error
Null hypothesis: In case of empirical decisions, the initial assumption is that there is no difference between the groups. This is equivalent to the presumption of innocence in the court setting and is called the null hypothesis. The notation used for this is H0.
The sample observations serve as evidence. Depending upon this evidence, H0 is either rejected or not rejected. In the empirical set-up, H0 is never accepted. The conclusion reached is that the evidence is not enough to reject H0. This may mean two things: (i) carry out further investigations and collect more evidence, or (ii) continue to accept the present knowledge as though this investigation was never done. The 'truth' remains unchanged.
Consider the claim of a manufacturer that his drug is superior to the existing ACE inhibitors in improving insulin sensitivity in diabetic hypertensives. In a trial on matched cases, improvement was seen in, say, 4 out of 10 patients on the new drug compared with 3 out of 10 on the existing drug. The sample size n = 10 in each group is too small, and the difference too small, to provide sufficient evidence to reject the H0 of no difference between the drugs. If so, the claim of superiority is not tenable. The manufacturer needs to withdraw the claim forever or till such time that more evidence is available for scrutiny. The hypothesis of equality of the two drugs is called the null hypothesis because it nullifies the effect which we want to prove. It is not always the case that the null hypothesis is the statement which nullifies the effect; rather, it is the statement under which there exists only one condition.
That is why, for a given null hypothesis, there is only one α value but a number of β values, because under the alternative hypothesis there exist many possibilities, and for each possibility there is one β value. In a test of hypothesis we make the α value (the type I error, which is in fact the significance level) small, but we generally exercise no control over the β value (type II error).
Alternative hypothesis: The claim made is called the alternative hypothesis. This is denoted by H1. In the above example, the claim is that of superiority of the new drug. This is the alternative hypothesis, H1, in this case. This is a one-sided alternative because only superiority is claimed and inferiority is ruled out. A one-sided alternative asserts that one group is better (or, alternatively, worse) than the other, while a two-sided alternative asserts only that one group is either better or worse than the other group. Sometimes it is not possible to claim that one group is better than the other; the only claim is that they are different. In the case of peak expiratory flow rate (PEFR) in factory workers exposed to different pollutants, there may not be any a priori reason to assert that it would be higher in one group than in the other. Then the alternative is that the PEFRs are unequal. This is called a two-sided alternative. The null is that they are equal in the two groups.
P-value: The values observed in the sample serve as evidence against H0. But these values are subject to sampling fluctuations and may or may not lead to a correct conclusion. The error of rejecting a true null hypothesis is similar to punishing an innocent person. This is more serious and is called Type I error. This is popularly referred to as the P-value. Thus the P-value is the probability that a true null hypothesis is wrongly rejected. This is the probability that the conclusion of presence of a difference is reached when actually there is no difference. In a clinical trial setup, this is the probability that the drug is declared effective or better when actually it is not. This wrong conclusion can allow an ineffective drug to be marketed as being effective. This clearly is unacceptable and needs to be guarded against. For this reason, the P-value is kept at a low level, mostly less than 5%, or P < 0.05. The maximum P-value allowed in a problem is called the level of significance or sometimes the α-level. When the P-value is this small or smaller, it is generally considered safe to conclude that the groups are indeed different.
Type II error: The second type of error is failing to reject H0 when it is false. The probability of this error is denoted by β. In a clinical trial setup, this is equivalent to declaring a drug ineffective when it actually is effective. A drug which could possibly provide better relief to hundreds of patients is denied entry into the market. If the manufacturer believes that the drug is really effective, the company will carry out further trials and collect further evidence. Thus, the introduction of the drug is delayed but not denied.
Let us suppose that treatment A produces 30% cure and treatment B produces 55% cure. Now let us conduct a study, taking a pair of samples of size 30 each and administering treatment A to one group and treatment B to the other group. We use a test of significance with a Type I error of 5% and judge whether there is any significant difference between the treatments based on these samples. Suppose that we have repeated such studies a large number of times and have judged that there is a significant difference between the treatments in only 40 per cent of the studies. This means that in 60% of the pairs of samples we have failed to detect a difference that is large enough to reject the null hypothesis. We call this 60% the Type II error. The magnitude of the risk of this error is related to the actual difference between the populations.
Every investigator is anxious to keep both Type I and Type II errors as low as possible, but it is not possible to reduce both errors simultaneously. For a given sample size, if one is reduced, the other automatically increases. Usually the Type I error is fixed at a tolerable limit and the Type II error is minimized. After fixing the Type I error, the Type II error can be decreased by increasing the size of the sample. There is another useful concept closely associated with Type II error. If the Type II error is 60%, its complement, i.e. 40%, is known as the 'power' of the test. The power is a numerical value indicating the sensitivity of a test.
Power: The complement of the Type II error is called power. Thus, the power of a test is the probability of rejecting H0 when it is false. This depends on the magnitude of the difference between the observed value and the real value present in the target population. The power of a test is high if it is able to detect a small difference and reject H0. Suppose the mean PEFR in workers of the tyre manufacturing industry is 296 liters per minute and that in workers of the paint-varnish industry 307 liters per minute. The mean difference is 11 liters per minute. This difference seems small relative to the PEFR values. A test with high power is needed to detect this difference and to call it significant. A low-power test will not be able to reject the H0 of equality and will conclude that the difference is likely to have arisen due to chance in the samples studied.
Power becomes a specially important consideration when the investigator does not want to miss a specified difference. For example, a hypotensive drug may be considered useful if it reduces diastolic BP by an average of at least 5 mm Hg after use for one week. A sufficiently powerful statistical test would be needed to detect this kind of difference; (1 − β) is an important consideration in this setup. However, one would like the difference (5 mm Hg in this case) to be chosen on some objective basis. Increasing the size of the sample, besides choosing an appropriate design of the study, can increase the power of a test.
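As a rough illustration of such a sample-size/power calculation (the standard deviation of 10 mm Hg and the target power of 80% below are assumed values, not taken from the text), the usual normal-approximation formula can be sketched as follows.

from math import ceil
from scipy.stats import norm

delta = 5          # smallest mean reduction worth detecting (mm Hg), from the text
sigma = 10         # assumed SD of the reductions (mm Hg)
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)      # 1.96 for a two-sided 5% test
z_beta = norm.ppf(power)               # 0.84 for 80% power

n = ceil(((z_alpha + z_beta) * sigma / delta) ** 2)
print(n)                               # about 32 subjects under these assumptions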
 
Testing Statistical Significance
When the probability of Type I error, P, is less than a low threshold such as 0.05, the null hypothesis is rejected and the result is said to be statistically significant. The exact form of the test criterion for obtaining the P-value depends mostly on:
  1. The nature of the data (qualitative or quantitative).
  2. The form of the distribution if the data are quantitative (Gaussian or non-Gaussian).
  3. The number of groups to be compared (two or more than two).
  4. The parameter to be compared (this can be a mean, median, correlation coefficient, etc. in the case of quantitative data; it is always a proportion π or a ratio in the case of qualitative data).
  5. The size of sample (small or large).
  6. The number of variables considered together (one, two or more). The distributional forms of these criteria have been obtained and are known. These are used to find P-values as per the following procedure.
 
Hypothesis Testing Procedure and Possible Errors
The procedure of hypothesis testing is summarized in following steps.
  1. Identify the problem under study. Identify the type of data that form the basis of the testing procedure (since this determines the particular test to be employed).
  2. State the Null hypothesis which is denoted as H0. Null hypothesis is sometimes referred to as hypothesis of no difference since it is a statement of agreement with or no difference from conditions presumed to be true in the population of interest.
    In the testing process the null hypothesis either is rejected or is not rejected. If the null hypothesis is not rejected, we say that the data on which the test is based do not provide sufficient evidence to cause rejection. If the testing procedure leads to rejection, we say that the data at hand are not compatible with the null hypothesis but are supportive of some other hypothesis. This other hypothesis is known as the alternative hypothesis and is designated as HA or H1.
  3. Calculate the Critical Ratio/Test statistic.
    Critical Ratio is generally of the form [(Relevant statistic – Hypothesized parameter)/ (Standard Error of the relevant statistic)]
    Therefore in most of the situations, standard error of the relevant statistic has to be calculated separately.
  4. Find out the probability of obtaining a value of the critical ratio as extreme as, or more extreme than (in the appropriate direction), the one actually computed, by chance (i.e. under the null hypothesis). This probability is called the P-value.
    To find out this probability it is necessary to specify the probability distribution of the critical ratio (i.e. sampling distribution of critical ratio). If the tables are available, P-value can be found out easily by referring to these tables.
  5. Reject the null hypothesis if the P-value is very small. Otherwise do not reject the null hypothesis. Of course, we have to have a rule for how small a "very small" probability must be. This is called the 'level of significance'. By convention a 5% (0.05) level of significance is fixed. If the P-value is less than 5% the difference is said to be "significant", i.e. unlikely to have arisen by chance, and we reject the null hypothesis. If it is greater than 5% the difference is said to be "not significant", i.e. it might have easily arisen by chance, and we do not reject the null hypothesis. If the P-value is less than 1% the difference is said to be highly significant.
    Whenever the distribution of the critical ratio depends on degrees of freedom, the available tables give only limited P-values. Therefore it becomes difficult to find the exact P-value. In such situations, reject H0 if the calculated value of the critical ratio is greater than the value tabulated for a fixed significance level (known as the critical value). Do not reject H0 if the calculated value of the critical ratio is smaller than the tabulated critical value.
    In short, identify the rejection region and the acceptance region. The values of the critical ratio forming the rejection region are those values that are less likely to occur if the null hypothesis is true (i.e. they yield a P-value which is smaller than the significance level), while the values making up the acceptance region are more likely to occur if the null hypothesis is true (i.e. the corresponding P-value is greater than the significance level). The decision rule tells us to reject the null hypothesis if the value of the critical ratio that we compute from our sample is one of the values in the rejection region, and to not reject the null hypothesis if the computed value of the critical ratio is one of the values in the acceptance region.
    Whether the rejection region is split between the two sides or tails of the distribution of the critical ratio (in which case the test is called a two-sided or two-tailed test) or the entire rejection region is in one or the other tail of the distribution (in which case the test is called a one-sided or one-tailed test) depends on the nature of the alternative hypothesis (one-sided or two-sided).
  6. A decision regarding rejecting or not rejecting the null hypothesis is taken (using the above decision rule). If H0 is rejected, we conclude that H1 (HA) is true, and if H0 is not rejected we conclude that H0 may be true. Give the interpretation in non-statistical terms, i.e. explain the result in terms of the original problem.
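The six steps can be traced in a few lines of code; the sketch below works through a one-sample Z-test on made-up numbers (a hypothesized mean of 120, an observed mean of 126, a known σ of 15 and n = 25), purely to make the procedure concrete.

from scipy.stats import norm

# Steps 1-2: quantitative data; H0: mu = 120, H1: mu != 120 (two-sided)
mu0, sigma = 120, 15
xbar, n = 126, 25

# Step 3: critical ratio = (relevant statistic - hypothesized parameter) / SE
se = sigma / n ** 0.5
z = (xbar - mu0) / se

# Step 4: P-value = probability of a value as extreme or more extreme under H0
p = 2 * norm.sf(abs(z))

# Steps 5-6: compare with the significance level and interpret
print(z, round(p, 4), "reject H0" if p < 0.05 else "do not reject H0")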
 
Errors in Statistical Testing
In the above process, two types of errors can be committed. The error committed when a true null hypothesis is rejected (i.e. a conclusion of significant difference where in fact there is no real difference) is called "Type I error". The error committed when a false null hypothesis is not rejected (i.e. a conclusion of no significant difference in the presence of a true difference) is called "Type II error". The following table summarizes these (Table 5.5).
Table 5.5   Error in statistical testing
Decision on application of test    Actual situation: H0 true    Actual situation: H0 false
Reject H0                          Type I error                 Correct decision
Do not reject H0                   Correct decision             Type II error
The following remarks are pertinent.
  • We cannot have a null hypothesis that says that some difference is present. It is stated in the form of no difference. However, we can have an H0 that states that the difference between two groups is a specified quantity D. For example, a null hypothesis could be that the difference in mean birth weight of babies born to educated and illiterate mothers is at least 300 g. The alternative hypothesis in this case is that it is less than 300 g.
  • Not being able to reject H0 is analogous to pronouncing in a court that the person is ‘not proven guilty’. Note that this is different from saying that the person is ‘not guilty’. The other way that this could be understood is that a null hypothesis is ‘conceded’ but not accepted.
  • A distinction must also be made between 'not significant' and 'insignificant'. Statistical tests are for the former and not for the latter. A statistically 'not significant' difference is not necessarily 'insignificant'.
  • With statistical inference, the results can seldom, if ever, be absolutely conclusive; the P-value never becomes zero. There is always a possibility, however small, that the observed difference arose by chance alone.
  • Whenever statistical significance is not reached, the evidence is not considered in favour of H0—it is only not sufficiently against it. Samples provide evidence against H0 and in favour of H1, but never in favour of H0 and against H1.
  • It can be shown that any null hypothesis value of the population parameter which lies outside the 95% confidence interval will be rejected at the 5% level of significance. This means that a confidence interval can be used to test any number of hypotheses, whereas a hypothesis test only indicates whether we should reject a particular hypothesis. In this sense a confidence interval is much more useful than a hypothesis test. Moreover, the P value is a measure of evidence that does not take into account the size of the observed effect. A small effect in a study with a large sample size can have the same P value as a large effect in a small study.
  • The P value is a conditional probability statement representing the likelihood that the result observed in the data, or one more extreme, is due to chance, given that the null hypothesis is true. If the P value is low, conventionally less than 0.05, the null hypothesis of no association is rejected: chance is considered an unlikely explanation for the findings, and the results are termed statistically significant at that level. If the P value is greater than or equal to 0.05, the interpretation is that chance cannot be excluded as a likely explanation for the finding, the null hypothesis cannot be rejected and the results are not statistically significant at the 0.05 level. The confidence interval provides the range within which the true estimate of effect lies with a certain level of confidence. Thus, a 95 percent confidence interval around an observed relative risk estimate indicates that, with 95 percent confidence, the true relative risk will be no less than the lower bound and no greater than the upper bound.
  • There are two factors that affect the size of the P value: the magnitude of the difference between the study groups and the size of the sample. Consequently, even a small difference can be statistically significant if the sample size is sufficiently large, and conversely, a difference that may be of interest clinically may not achieve statistical significance if the sample size is small. The confidence interval is more informative than the P value since it separates information about the magnitude of effect from that concerning the variability of the estimate. The width of the interval indicates the amount of variability inherent in the estimate of effect. The larger the sample size, the more stable the estimate, and the narrower the confidence interval. Moreover, the confidence interval indicates whether the findings are statistically significant at a given level. If, for example, the null value for a relative risk, 1.0, is included in the 95 percent confidence interval, the results are not statistically significant at the 0.05 level (a small computational sketch follows this list).
  • The power of a study is defined as the probability of detecting a difference between the study groups if one truly exists. This ability is dependent on two factors: the magnitude of the effect as well as the size of the study population. The issue of power needs to be considered both in the design of study and in the interpretation of its findings. In designing any investigation, it is crucial to calculate the sample size necessary to ensure that the study has adequate power to detect the most likely magnitude of effect with the desired level of confidence. Moreover, in interpreting a finding that does not achieve statistical significance, it is necessary to consider whether the study had adequate ability to detect an effect of the observed magnitude. If not, the finding is not merely a null result, but a null result that is uninformative.
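As an illustration of the interplay between confidence intervals and statistical significance described above, the sketch below computes a 95% confidence interval for a relative risk on the log scale; the 2 × 2 counts (30 cases out of 100 exposed, 15 out of 100 unexposed) are made up purely for illustration.

from math import exp, log, sqrt

a, n1 = 30, 100      # cases / total in the exposed group (hypothetical)
b, n2 = 15, 100      # cases / total in the unexposed group (hypothetical)

rr = (a / n1) / (b / n2)
se_log_rr = sqrt(1 / a - 1 / n1 + 1 / b - 1 / n2)   # SE of log(RR) for cumulative-incidence data

lower = exp(log(rr) - 1.96 * se_log_rr)
upper = exp(log(rr) + 1.96 * se_log_rr)
print(rr, (round(lower, 2), round(upper, 2)))        # RR = 2.0, CI roughly 1.15 to 3.48
# The interval excludes the null value 1.0, so this hypothetical association
# would be statistically significant at the 0.05 level.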

Inference: Quantitative Variables6

 
COMPARISON OF MEANS (STUDENT'S ‘t’ TEST)
 
One Sample Setup
Suppose the interest is in finding out whether patients with chronic diarrhea have the same Hb level as normally seen in healthy subjects in the area. The normal level of Hb is known; suppose it is 14.6 g/dl. This is assumed to be known and fixed. A random sample of 10 patients with chronic diarrhea is investigated and the average Hb level is found to be 13.8 g/dl. The sample mean is lower than the normal, but the difference could be because the sample happens to comprise subjects with lower levels. Such subjects are not uncommon in the healthy population as well. If another sample of 10 patients were studied, the average could well be 14.8 g/dl. Can we say with reasonable confidence that patients with chronic diarrhea have a lower Hb level on average?
We have only one sample in this example and the comparison is with the known normal of the healthy population. It is a one-sample problem though the comparison is of two values—one mean found in the sample and the other the known normal for the healthy population. The null hypothesis in this case is H0: µ = 14.6 g/dl. The possibility of a higher Hb level in the chronic patients is ruled out, so that H1: µ < 14.6 g/dl. If H0 is not true then H1 is considered true. An appropriate criterion is chosen and calculated assuming that H0 is true. Then the probability of the observed or more extreme values is calculated. This is the P-value. If the P-value is very small, H0 is rejected. This procedure is the same as described earlier. Heuristically, the answer depends on the magnitude of the difference of the sample mean from the known normal in the healthy population. In the example above, this difference is 0.8 g/dl. This magnitude is assessed relative to the variation in means expected from sample to sample. The latter is measured by its SE, namely (σ/√n). The SD σ would rarely be known and is replaced by its estimate s. Thus, the criterion in this setup is Student's t = (x̄ − µ0)/(s/√n),
where µ0 is the value of the mean under H0. (Student was a pen name of WS Gosset.) The larger the value of t, the greater the chance that the sample has not come from a population with mean µ0; that is, the decision to reject H0 has less chance of error. When this probability (P-value) of Type I error is very low, say less than 0.05, H0 can be safely rejected. Exact P-values for t are provided by most standard statistical packages. Alternatively, consult the Table in the Appendix to check whether P is below a threshold such as 0.10, 0.05 or 0.01. The distribution of t depends on the degrees of freedom. For the t above, df = n − 1. This is (sometimes) specified as a subscript of t.
Example: Let us assume that Hb level in 10 chronic diarrhea patients are as follows.
11.5, 12.2, 14.9, 14.0, 15.4, 13.8, 15.0, 11.2, 16.1, 13.9
The hypothesis under test is that the average Hb level in the patients of chronic diarrhea is the same 14.6 g/dl as is normal in healthy subjects. Thus, H0: µ = 14.6 g/dl. The alternative is H1: µ < 14.6 g/dl.
Here x̄ = 13.8 g/dl and s = 1.67 g/dl, so SE = 1.67/√10 = 0.53 and t = (13.8 − 14.6)/0.53 = −1.51 with 9 df. P(t < −1.51) = 0.0827. This H1 is one-sided and so only the one-sided probability is required. If the exact P-value is not available from a computer package, the Table in the Appendix gives P > 0.05. That is, the chance of observing such a difference, or a larger one, when H0 is true is more than 5%. Thus, H0 cannot be rejected. The difference between the sample mean 13.8 g/dl and the normal 14.6 g/dl is not statistically significant. This sample does not provide sufficient evidence that the mean Hb level in chronic diarrhea is less than the normal in healthy subjects. This conclusion is reached despite the sample mean being so much less than the normal. There is high variability in the Hb level in the patients. It was only 11.2 g/dl in one patient and 16.1 g/dl in another. The various values are widely scattered. This gave a high value of the sample SD and led us to expect that the intersample variability in means too would be high. Thus, the reliability of the result from this sample is low.
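The same calculation can be checked with a statistical package; a sketch using SciPy (the one-sided 'alternative' argument requires SciPy 1.6 or later).

from scipy.stats import ttest_1samp

hb = [11.5, 12.2, 14.9, 14.0, 15.4, 13.8, 15.0, 11.2, 16.1, 13.9]
t, p = ttest_1samp(hb, popmean=14.6, alternative='less')
print(round(t, 2), round(p, 4))   # t is about -1.51 and the one-sided P about 0.08, as above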
 
Two Sample Setup
Consider a situation now where two samples are available, one from each of two groups. These groups could be males and females, diseased and healthy, suffering from disease A and from disease B, mildly suffering and moderately suffering, or even one group measured before treatment and after treatment. The general form of criterion in the setup of two samples is
Student's t = (mean difference)/(SE of mean difference).
The procedure to calculate the SE in the case of paired samples is different. The observations are said to be paired when they are obtained twice on the same subjects, such as BP before and after treatment, or when the subjects in the two groups, such as in a case-control study, are one-to-one matched. In this case, obtain the difference between the pairs, di = x2i − x1i, i = 1, 2, …, n, where n is the number of pairs, and calculate the SD of these di's by sd = √[Σ(di − d̄)²/(n − 1)], where d̄ is the mean of the differences. The null hypothesis of interest for rejection in this case is H0: µ1 = µ2. This is the same as that for two independent samples but says that there is no difference in means, e.g. before treatment and after treatment. Under this H0, paired t = d̄/(sd/√n), with n − 1 degrees of freedom. This is basically the same as for one sample: after the differences are obtained, it reduces to a one-sample problem on these differences.
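A minimal sketch of the paired comparison in SciPy; the before/after diastolic BP readings are hypothetical and serve only to show the call.

from scipy.stats import ttest_rel

before = [150, 160, 155, 148, 162, 158]
after  = [142, 155, 150, 146, 157, 153]

t, p = ttest_rel(after, before)    # works on the differences d_i = after - before
print(round(t, 2), round(p, 4))    # the mean difference here is -5 mm Hg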
In the case of unpaired or independent samples, the SE of the mean difference is calculated by SE(x̄1 − x̄2) = √{[((n1 − 1)S1² + (n2 − 1)S2²)/(n1 + n2 − 2)] × (1/n1 + 1/n2)}, where S1 and n1 are the SD and size of the first sample, and S2 and n2 are the corresponding values of the second sample. Thus, in this case, for H0: µ1 − µ2 = 0, t = (x̄1 − x̄2)/SE(x̄1 − x̄2),
where the subscripts refer to the samples or to the groups. The degrees of freedom are n1 + n2 − 2. It assumes that S1 = S2, i.e. the variability in the two groups is the same. This can be tested by Levene's test, which is done by computing the absolute deviation of each observation from its group mean and then performing a two-sample t-test on these absolute deviations. This test has been shown to be quite robust to lack of normality and is therefore preferred to a similar test known as Bartlett's test.
If the null hypothesis of equal variances (H0: σ1 = σ2) is rejected (this problem of unequal variances is called the Fisher-Behrens problem), then one might consider using the separate-variances t-test, which does not assume equal population variances while testing the same null hypothesis as the pooled t-test given above. The separate-variances two-sample t-test statistic is t' = (x̄1 − x̄2)/√(S1²/n1 + S2²/n2), and its significance level is assessed approximately by referring t' to a t distribution with degrees of freedom given, for example, by the Welch-Satterthwaite approximation, df ≈ (S1²/n1 + S2²/n2)²/[(S1²/n1)²/(n1 − 1) + (S2²/n2)²/(n2 − 1)].
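The whole sequence (Levene's test for equal variances, then either the pooled or the separate-variances t-test) can be sketched as follows; the two small samples are hypothetical.

from scipy.stats import levene, ttest_ind

group1 = [9.1, 8.4, 10.2, 7.9, 9.5, 8.8]
group2 = [11.0, 10.4, 12.1, 9.8, 11.6, 10.9]

# Levene's test based on absolute deviations from the group means
w, p_var = levene(group1, group2, center='mean')

if p_var > 0.05:
    t, p = ttest_ind(group1, group2, equal_var=True)    # pooled t-test, df = n1 + n2 - 2
else:
    t, p = ttest_ind(group1, group2, equal_var=False)   # separate-variances (Welch) t-test
print(round(t, 2), round(p, 4))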
 
NONPARAMETRIC STATISTICAL INFERENCE
 
Comparison of Means from Non-Gaussian Distributions
The procedures discussed earlier for quantitative data are applicable only when (i) the sample size is large (in this case, the underlying distribution may or may not be Gaussian), or (ii) the sample size is small but the underlying distribution is Gaussian. The case of a small sample from a non-Gaussian distribution requires special methods, called nonparametric or distribution-free methods.
The term nonparametric method implies a method that is not for any specific parameter. For example, the hypothesis may be concerned only with the form of the population, as in goodness-of-fit tests, or with some characteristic of the distribution of the variable, such as randomness or trend. On the other hand, distribution-free methods are based on functions of the sample observations whose distribution does not depend on the specific underlying distribution in the population from which the sample was drawn. Therefore, assumptions regarding the form of the underlying distribution are not necessary. Although nonparametric tests and distribution-free tests are not synonymous labels, the customary practice is to consider both types of tests as nonparametric methods.
This implies a definition of nonparametric methods then as “methods which do not involve hypothesis concerning specific values of parameters or which are based on some function of the sample observations whose sampling distribution can be determined without knowledge of the specific distribution function of the underlying population”.
When the t-test is used on non-normal data, two things happen. The significance probabilities are changed, i.e. the probability that t exceeds t0.05 when the null hypothesis is true is no longer 0.05, but may be, say, 0.041 or 0.097. Secondly, the sensitivity or power of the test in finding a significant result when the null hypothesis is false is altered. Much of the work on non-parametric methods is motivated by a desire to find tests whose significance probabilities do not change and whose sensitivity relative to competing tests remains high when the data are non-normal. With the rank tests, the significance levels remain the same for any continuous distribution, except that they are affected to some extent by ties, and by zeros in the signed-rank test. In large normal samples, the rank tests have an efficiency of about 95% relative to the t-test, and in small normal samples they have been shown to have an efficiency slightly higher than this. With non-normal data from a continuous distribution, the efficiency of the rank tests relative to t never falls below 86% in large samples and may exceed 100% for distributions that have long tails. Since they are relatively quick to carry out, the rank tests are highly useful for the investigator who is doubtful whether his data can be regarded as normal.
 
Advantages of Non-parametric Methods
  1. If the data are inherently of the nature of ranks, not measurements, they can be treated directly by non-parametric methods without assuming some specific form for the underlying distribution.
  2. Non-parametric methods are available to treat data which are simply classificatory, i.e. those measured in a nominal scale. No parametric technique applies to such data.
  3. Whatever may be the form of the distribution from which the sample has been drawn, a non-parametric test of a specified significance level actually has that significance level.
  4. If samples are very small (e.g. six), there is in effect no alternative to a non-parametric test (unless the parent distribution really is known).
  5. There are suitable non-parametric tests for treating samples made-up of observations from several different populations of different distributions.
  6. Non-parametric methods are usually easier to apply than the classical techniques.
 
Disadvantages of Non-parametric Methods
  1. If all the assumptions of the parametric statistical model are in fact met in the data, and if the measurement is of the required strength, then non-parametric statistical tests are wasteful of data.
  2. For large samples some of the non-parametric methods require a great amount of labor unless approximations are employed. Non-parametric methods for multivariate data are tedious for application.
 
Two Samples Comparison
Only two non-parametric methods (for two samples comparison) are described here, one for paired case and one for unpaired/independent samples case.
 
Case I – Paired
A simple test, called the sign test, utilizes information simply about the direction of the differences (negative or positive) within pairs. However, this test is deficient because it does not consider the magnitude of the differences. The sign test is not discussed further in this text. If the relative magnitude as well as the direction of the differences is considered, a more powerful test, namely Wilcoxon's matched-pairs signed-ranks test, can be used. It gives more weight to a pair that shows a large difference between the two conditions than to a pair that shows a small difference. This is a very useful test for behavioral problems. With behavioral data, it is not uncommon that the researcher can (i) tell which member of a pair is "greater than" the other, that is, state the sign of the difference between any pair, and (ii) rank the differences in order of absolute size. With such information, the researcher may use Wilcoxon's test.
 
Wilcoxon's Signed-Ranks Test
Suppose a child psychologist wants to test whether nursery school attendance has any effect on children's social perceptiveness. He obtains, say, 10 pairs of identical twins to serve as subjects. One twin from each pair is randomly assigned to attend nursery school for a term and the other to remain out of school. At the end of the term, the 20 children are each given a test of social perceptiveness. Suppose he scores social perceptiveness by rating each child's response to a group of pictures depicting a variety of social situations. He is confident that a higher score represents higher social perceptiveness than a lower score, but he is not sure that the scores are sufficiently exact to be treated numerically. In such situations, the appropriate way of analyzing the data is by applying Wilcoxon's test.
Let di = the difference in the variable value for any matched pair, representing the difference between the pair's scores under the two treatments. Each pair has one di. To use Wilcoxon's test, rank all the di's without regard to sign: give the rank of 1 to the smallest |di|, the rank of 2 to the next smallest, and so on. When one ranks scores without respect to sign, a di of −1 is given a lower rank than a di of either −2 or +2. Then to each rank affix the sign of the difference, that is, indicate which ranks arose from negative di's and which arose from positive di's.
Now if treatments A and B are equivalent, that is, if H0 is true, we should expect to find some of the larger di's favoring treatment A and some favoring treatment B. That is, some of the larger ranks would come from positive di's while others would come from negative di's. Thus, if we summed the ranks having a plus sign and summed the ranks having a minus sign, we would expect the two sums to be about equal under H0. But if the sum of the positive ranks is very much different from the sum of the negative ranks, we would infer that treatment A differs from treatment B, and thus we would reject H0. That is, we reject H0 if either the sum of the ranks for the negative di's or the sum of the ranks for the positive di's is too small. Suppose we denote this test statistic by W. Then W = minimum{T+, T−}, where T+ = the sum of the ranks with a positive sign and T− = the sum of the ranks with a negative sign.
 
Ties
Occasionally the two scores of a pair are equal. That is, no difference between the two treatments is observed for the pair, so that di = 0. Such pairs are dropped from the analysis, and n = the number of matched pairs minus the number of pairs whose di = 0. Another sort of tie can occur: two or more di's can be of the same size. We assign such tied cases the same rank. The rank assigned is the average of the ranks which would have been assigned if the di's had differed slightly. Thus three pairs might yield di's of −1, −1 and +1. Each pair would be assigned the rank of 2, for (1 + 2 + 3)/3 = 2. Then the next di in order would receive the rank of 4, because ranks 1, 2, and 3 have already been used. If two pairs had yielded di's of 1, both would receive the rank of 1.5, and the next largest di would receive the rank of 3. The practice of giving tied observations the average of the ranks they would otherwise have received has a negligible effect on W, the statistic on which Wilcoxon's test is based.
If there is only one sample and we want to compare the location of its distribution with a specified one, we may use this test. We calculate the difference of each observation from the specified value, say the median, rank all the absolute (unsigned) differences together, and then treat them as two groups according to whether the difference is positive or negative. For small samples, exact significance tables are given in many books.
 
Large Samples
Though non-parametric methods are specially suited for small samples, some prefer them for large samples also. It can be shown that when n is larger than 20 the value of W is practically normally distributed, with—
Mean = XW = [n(n + 1)]/4 and
Variance = sW² = [n(n + 1)(2n + 1)]/24, so the standard deviation sW is the square root of this.
Therefore Z = (W − XW)/sW is approximately normally distributed with zero mean and unit variance. Thus the standard normal distribution table in the appendix can be used for finding various probabilities. The procedure is to calculate Z for a given pair of samples and find one-tail or two-tail P-values from the Table.
Example: Suppose a dental research team wished to know if teaching people how to brush their teeth would be beneficial. Twenty pairs of patients seen in a dental clinic, after careful matching with respect to age, sex, intelligence, and initial oral hygiene scores, were included in the study. One member of each pair received instruction on how to brush the teeth and also on other oral hygiene matters. Six months later all 40 subjects were examined and assigned an oral hygiene score by a dental hygienist unaware of which subjects had received the instructions. A low score indicates a high level of oral hygiene. The results are displayed below.
Pair number    Score, instructed group    Score, not instructed group
1              1.0                        1.5
2              2.5                        1.5
3              1.0                        2.5
4              2.7                        2.5
5              2.8                        2.8
6              2.0                        1.5
7              2.5                        1.8
8              2.0                        1.9
9              1.8                        2.2
10             1.7                        1.1
11             2.0                        1.5
12             2.8                        2.3
13             2.4                        2.3
14             2.6                        1.9
15             3.0                        2.1
16             2.7                        2.5
17             2.9                        2.2
18             1.8                        1.9
19             2.2                        2.1
20             3.4                        2.7
For these data, W = 37.00 and n = 19, as one di has a zero value. A computer package gives a P-value of 0.0194 based on W's exact sampling distribution. The large-sample approximation yields XW = 85.5, sW = 22.96 and Z = −2.11. Referring to the Table in the appendix, the corresponding P-value is 0.0174. We can conclude that the instruction was beneficial.
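The same analysis can be reproduced with SciPy; because packages handle zero differences and ties somewhat differently, the statistic and P-value may not agree exactly with the figures quoted above.

from scipy.stats import wilcoxon

instructed     = [1.0, 2.5, 1.0, 2.7, 2.8, 2.0, 2.5, 2.0, 1.8, 1.7,
                  2.0, 2.8, 2.4, 2.6, 3.0, 2.7, 2.9, 1.8, 2.2, 3.4]
not_instructed = [1.5, 1.5, 2.5, 2.5, 2.8, 1.5, 1.8, 1.9, 2.2, 1.1,
                  1.5, 2.3, 2.3, 1.9, 2.1, 2.5, 2.2, 1.9, 2.1, 2.7]

w, p = wilcoxon(instructed, not_instructed)   # two-sided by default; the zero-difference pair is dropped
print(w, round(p, 4))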
 
Case II—Unpaired
One of the most powerful of the nonparametric tests and a most useful alternative to the parametric t test (when the researcher wants to avoid the t test's assumptions) while testing whether two independent groups have been drawn from the same population, is the Mann-Whitney U test.
 
Mann-Whitney U test
To apply the U test, we first combine the observations or scores from both groups and rank them. Now focus on one of the groups, say the group with n1 cases (where n1 is the number of cases in the smaller of the two independent groups and n2 is the number of cases in the larger group). The value of U is the number of times that a score in the group with n2 cases precedes a score in the group with n1 cases in the ranking. For fairly large values of n1 and n2, the counting method of determining the value of U may be rather tedious. An alternative method, which gives identical results, is to assign the rank of 1 to the lowest score in the combined (n1 + n2) group of scores, assign rank 2 to the next lowest score, and so on. Then U1 = n1n2 + [n1(n1 + 1)]/2 − R1 and U2 = n1n2 + [n2(n2 + 1)]/2 − R2,
where, R1 = sum of the ranks assigned to group whose sample size is n1
R2 = sum of the ranks assigned to group whose sample size is n2
and U = minimum{U1, U2}. The problem of ties is handled in the same way as described for Wilcoxon's test. For small samples, exact significance tables are given in many books.
 
Large Samples
It has been shown that as n1 and n2 increase in size, the sampling distribution of U rapidly approaches the normal distribution, with mean XU = (n1n2)/2 and standard deviation sU = √{[n1n2(n1 + n2 + 1)]/12}. Therefore Z = (U − XU)/sU, which is practically normally distributed with zero mean and unit variance. That is, the probability associated with the occurrence under H0 of values as extreme as an observed Z may be determined by reference to the table of the standard normal distribution in the appendix.
Example: The following are the lengths of time (in minutes) spent in the operating room by 20 subjects undergoing the same operative procedure. Eleven of the subjects were in hospital A and 9 were in hospital B.
Hospital A: 29, 30, 31, 32, 36, 33, 39, 37, 31, 30, 31.
Hospital B: 41, 39, 43, 48, 38, 49, 46, 45, 40.
On the basis of these data can one conclude that, for the same operative procedure, patients in hospital B tend to be in the operating room longer than the patients in hospital A?
Rank sum for hospital B, i.e. R1 = 142.5 and for hospital A, i.e. R2 = 67.5.
U1 = 1.5 and U2 = 97.5 so that U = 1.5.
A computer package gives a P-value of 0.0003 based on U's exact sampling distribution. The large-sample approximation yields XU = (9 × 11)/2 = 49.5, sU = 13.16 and Z = −3.647. Referring to the standard normal table, the P-value is less than 0.0002. We can conclude that, for the same operative procedure, patients in hospital B tend to be in the operating room longer than patients in hospital A.
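With SciPy the comparison reads as follows; depending on the version, the reported statistic may be U for the first sample (97.5) or the smaller U (1.5), but the one-sided P-value is in either case far below 0.001, in line with the conclusion above.

from scipy.stats import mannwhitneyu

hospital_a = [29, 30, 31, 32, 36, 33, 39, 37, 31, 30, 31]
hospital_b = [41, 39, 43, 48, 38, 49, 46, 45, 40]

# One-sided test: are operating times in hospital B stochastically greater than in hospital A?
u, p = mannwhitneyu(hospital_b, hospital_a, alternative='greater')
print(u, round(p, 5))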
 
Comparison of Several Means—Analysis of Variance
More complex sets of data comprising more than two groups are common, and their analysis often involves the comparison of the means of the component subgroups. For example, it may be desired to analyze hemoglobin measurements collected as part of a community survey to see how they vary with age and sex, and to see whether any sex difference is the same for all age groups. Conceivably it would be possible to do this using a series of t tests, comparing the groups two at a time. This is not only practically tedious, however, but theoretically unsound, since carrying out a large number of significance tests is likely to lead to spurious significant results. For example, a significant result at the 5% level would be expected from 5% (one in 20) of all tests performed even if there were no real differences.
A completely different approach, called analysis of variance, is used instead. The methodology is fairly complex. The calculations are time-consuming and are usually carried out using a standard computer package. For these reasons the emphasis in this chapter is on the principles involved, with the aim of giving the reader sufficient knowledge to be able to specify the form of analysis required and to interpret the results. Details of the calculations are included, however, for the simplest case, that of one-way analysis of variance, as these are helpful in understanding the basis of the method and its relationship to the t test.
One-way analysis of variance is appropriate when the subgroups to be compared are defined by just one factor, for example in the comparison of means between different socioeconomic classes or between different ethnic groups. Two-way analysis of variance is also described and is appropriate when the subdivision is based on two factors such as age and sex. The methods are easily extended to the comparison of subgroups cross-classified by more than two factors.
A factor is chosen for inclusion in an analysis of variance either because it is desired to compare its different levels or because it represents a source of variation that it is important to take into account. Consider the following example. Following the discovery that the coronary heart disease rates differ markedly between different ethnic groups, a survey was carried out to see whether this was reflected in differing mean lipid concentrations for the different ethnic groups. As lipid concentrations are known to vary with age and sex, it is appropriate to include age-group and sex in the analysis of variance along with ethnic group, although these were not themselves of particular interest in the study. Their inclusion has two benefits. Firstly, the resulting significance test of differences between the ethnic groups is more powerful, that is it is more likely to detect any real differences as being statistically significant. Secondly, it ensures that the ethnic group comparisons are not biased due to different age-sex compositions as confounding variables.
It is also possible to analyze data subdivided by one or more factors using the closely related but more general technique of multiple regression. Both approaches give identical results, but the more general regression approach is less efficient in these simpler situations.
Some readers may find the content of the rest of this chapter a little difficult. It may be omitted at a first reading.
 
One-way Analysis of Variance
One-way analysis of variance is used to compare the means of several groups, for example, the mean hemoglobin levels of patients with different types of sickle cell disease. The analysis is called one-way as the data are classified just one way, in this case by type of sickle cell disease. The method is based on assessing how much of the overall variation in the data is attributable to differences between the group means and comparing this with the amount attributable to difference between individuals in the same group. Hence the name analysis of variance.
We start by calculating the variance of all the observations, ignoring their subdivision into groups. The variance is the square of the standard deviation and equals the sum of squared deviations of the observations about the overall mean, divided by the degrees of freedom. One-way analysis of variance partitions this sum of squares (SS) into two distinct components.
The first is the sum of squares due to differences between the group means (the between-groups SS); the second is the sum of squares due to differences between the observations within each group, which is also called the residual (within-groups) sum of squares. The total degrees of freedom are similarly divided. The calculations for the sickle cell data are shown below and the results laid out in an analysis of variance table in Table 6.1.
The fourth column of the table gives the amount of variation per degree of freedom, and this is called the mean square (MS). The significance test for differences between the groups is based on a comparison of the between groups and within groups mean squares. If the observed differences in mean haemoglobin levels for the different types of sickle cell disease were simply due to chance variation, the variation between these group means would be about the same size as the variation between individuals with the same type, while if they were real differences the between groups variation would be larger. The mean squares are compared using the F test, sometimes called the variance-ratio test.
F = (Between-groups MS)/(Within-groups MS), with degrees of freedom (d.f. between groups, d.f. within groups) = (k − 1, N − k),
where N is the total number of observations and k is the number of groups.
F should be about 1 if there are no real differences between the groups and larger than 1 if there are differences. Under the null hypothesis that the differences are simply due to chance, this ratio follows an F distribution which, in contrast to most distributions, is specified by a pair of degrees of freedom: (k − 1) degrees of freedom in the numerator and (N − k) in the denominator. The percentage points of the F distribution are tabulated for various pairs of degrees of freedom in the Appendix. The columns of the tables refer to the numerator degrees of freedom and the blocks of rows to the denominator degrees of freedom. The percentage points are one-sided as the test is based only on extreme values of F larger than one.
F = 49.94/1.00 = 49.9 with degrees of freedom (2, 38). The tables of percentage points have rows for 30 and 40 but not 38 degrees of freedom. We say that the 1% percentage point of F(2, 38) is between 5.18 and 5.39, which are the 1% percentage points for F(2, 40) and F(2, 30) respectively. Clearly 49.9 is greater than either of these. Thus the steady-state hemoglobin levels differ significantly between patients with different types of sickle cell disease (P < 0.001), the mean level being lowest for patients with Hb SS disease, intermediate for patients with Hb S/β-thalassemia, and highest for patients with Hb SC disease.
Table 6.1   One-way analysis of variance: differences in steady-state hemoglobin levels between patients with different types of sickle cell disease
Type of sickle cell disease    No. of patients (ni)    Hemoglobin (g/dl)
                                                       Mean (x̄i)    s.d. (si)    Individual values (x)
Hb SS                          16                      8.7125       0.8445       7.2, 7.7, 8.0, 8.1, 8.3, 8.4, 8.4, 8.5, 8.6, 8.7, 9.1, 9.1, 9.1, 9.8, 10.1, 10.3
Hb S/β-thalassemia             10                      10.6300      1.2841       8.1, 9.2, 10.0, 10.4, 10.6, 10.9, 11.1, 11.9, 12.0, 12.1
Hb SC                          15                      12.3000      0.9419       10.7, 11.3, 11.5, 11.6, 11.7, 11.8, 12.0, 12.1, 12.3, 12.6, 12.6, 13.3, 13.3, 13.8, 13.9
 
Calculations
Between-groups SS = Σ ni(x̄i − x̄)², where x̄ is the overall mean; here this equals 99.89, with k − 1 = 2 degrees of freedom.
Within-groups (residual) SS = Σ (ni − 1)si² = 15 × 0.8445² + 9 × 1.2841² + 14 × 0.9419² = 37.96, with N − k = 38 degrees of freedom.
Total SS = Between-groups SS + Within-groups SS = 137.85, with N − 1 = 40 degrees of freedom.
Analysis of Variance Table
Source of variation    SS        d.f.    MS = SS/d.f.    F = (Between-groups MS)/(Within-groups MS)
Between groups         99.89     2       49.94           49.9, P < 0.001
Within groups          37.96     38      1.00
Total                  137.85    40
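The analysis in Table 6.1 can be reproduced directly from the individual values with SciPy's one-way ANOVA routine; a sketch:

from scipy.stats import f_oneway

hb_ss   = [7.2, 7.7, 8.0, 8.1, 8.3, 8.4, 8.4, 8.5, 8.6, 8.7, 9.1, 9.1, 9.1, 9.8, 10.1, 10.3]
hb_sbth = [8.1, 9.2, 10.0, 10.4, 10.6, 10.9, 11.1, 11.9, 12.0, 12.1]
hb_sc   = [10.7, 11.3, 11.5, 11.6, 11.7, 11.8, 12.0, 12.1, 12.3, 12.6, 12.6, 13.3, 13.3, 13.8, 13.9]

f, p = f_oneway(hb_ss, hb_sbth, hb_sc)
print(round(f, 1), p)    # F close to 50 with (2, 38) df, P < 0.001, as in the analysis of variance table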
 
Assumptions
There are two assumptions underlying the F test. The first is that the data are normally distributed. The second is that the population value for the standard deviation between individuals is the same for each group. This is estimated by the square root of the within-groups mean square. Moderate departures from normality may be safely ignored, but the effect of unequal standard deviations may be serious. In the latter case, transforming the data may help.
 
Relationship with the Two-Sample 't' Test
One-way analysis of variance is an extension of the two-sample t test. When there are only two groups, it gives exactly the same results as the t test. The F value equals the square of the corresponding t value, and the percentage points of the F distribution with (1, N − 2) degrees of freedom are the same as the squares of those of the t distribution with N − 2 degrees of freedom. With unequal ni, the F-tests as well as t-tests are more affected by non-normality and heterogeneity of variances than with equal ni.
 
NON-PARAMETRIC STATISTICAL INFERENCE: THREE OR MORE SAMPLES COMPARISON
We continue with the setup where, the response is quantitative, practically continuous, and the interest is in comparing three or more groups. The underlying distribution of the response variable does not follow a Gaussian pattern. Non-parametric methods, also called distribution free methods, are needed for this setup. The hypothesis in this case is in terms of general location parameter that could be mean or median or any other such parameter. The distribution in each group should be identical except possibly some difference in location.
 
One Way Layout—Kruskal-Wallis Test
Let the number of groups be denoted by J, each containing n subjects. The total number of subjects in all the groups combined is nJ. A variable that is continuous for practical purposes is measured for each of these subjects, and the measurement obtained for the ith subject (i = 1, 2, …, n) of the jth group (j = 1, 2, …, J) is yij. The distribution of yij is similar in each group except possibly for a difference in location. The hypothesis under test is whether or not this difference is present.
Consider an example of cholesterol level in diastolic hypertensives, systolic hypertensives, ‘clear’ hypertensives and controls. All subjects are adult males of medium build (neither obese nor thin) and the groups are matched for age. It is expected that the pattern of distribution of cholesterol level in the different groups would be the same but not Gaussian, and one or more groups may have lower or higher measurements than the others. This setup is the same as in one-way ANOVA, but the interest there was specifically in the mean. In the case of nonparametric tests, we refrain from specifying any such parameter and talk instead in terms of general location.
For the non-parametric test of the null hypothesis of equal location in the J groups, rank all nJ observations jointly from smallest to largest. Denote the rank of yij by Rij. Let the sum of the ranks of the observations in the jth group be R.j (j = 1, 2, …, J). That is, R.j = Σi Rij. If there is no difference in the central value in the groups then R.1, R.2, …, R.J would be nearly equal, and their variance nearly equal to zero. A criterion, called Kruskal-Wallis H, exploits this premise to test the equality of locations. A large value of H is indicative of inequality of the R.j and hence of the locations in the different groups. This is
The distribution of H under H0 is known and P-values corresponding to the value of H calculated for a set of data can be obtained. Again, we leave this part for a statistical package to handle. In case you are interested in consulting a table of probabilities of H, see table O in Sidney Siegel's book. It is tabulated for J = 3 groups only; however, the number of subjects in the different groups can be unequal. The table is for up to five subjects in each group. For n > 5, H is approximately chi-square with (J − 1) df. Then the chi-square distribution can be used to obtain P-values.
Example: The following are the cholesterol levels in females with different types of hypertension and control.
Total plasma cholesterol level (mg/dL):
No hypertension (Dias BP < 90, Sys BP < 140): 223 (8), 207 (2.5), 248 (16), 195 (1), 219 (7); sum of ranks = 34.5
Diastolic hypertension (Dias BP ≥ 95, Sys BP < 140): 217 (5), 258 (17), 225 (9), 215 (4), 228 (11); sum of ranks = 46
Systolic hypertension (Dias BP < 90, Sys BP ≥ 150): 262 (18), 227 (10), 207 (2.5), 245 (15), 230 (12); sum of ranks = 57.5
“Clear” hypertension (Dias BP ≥ 90, Sys BP ≥ 140): 218 (6), 238 (13), 265 (19), 269 (20), 240 (14); sum of ranks = 72
Notes:
  1. The categories are not exhaustive. A person with dias BP = 92 and sys BP = 138 would not be in the study.
  2. Given in parentheses are the joint ranks, and the last column has the sum of the ranks (R.j) for each group.
  3. In this case, J = 4 and n = 5.
Probability tables of H give P(H ≥ 4.41) > 0.10. A computer package yields P = 0.2203. This P-value under H0 is large and not less than 0.05. Thus, H0 of equality of locations in the four groups is quite plausible and cannot be rejected. Note that this conclusion is reached despite sizeable differences in R.j among the four groups; those differences may have arisen from sampling fluctuation while the samples come from populations with the same central value of cholesterol level in the four groups.
  • Note that this non-parametric test is based on ranks and not the actual values. The magnitude of differences between various values is ignored. Cholesterol level of the fourth person in the first group is 195 mg/dl. If it is 206 or 167, the rank would still remain the same and the value of the criterion H would not alter. This is a limitation of almost all nonparametric procedures though some nonparametric enthusiasts consider it a strength.
  • It is because of use of ranks in place of the actual values that non-parametric tests do not require any particular shape of the distribution of the underlying variable. That is the reason that they are called non-parametric or sometimes distribution-free.
  • Whenever two or more observations are equal, they are assigned the same rank, which is the average of the ranks that would otherwise have been assigned. In the example given above, a cholesterol level of 207 mg/dl is observed for two subjects: one in the first group and one in the third group. Both get rank 2.5 in place of 2 and 3. When such ties are present, the formula for H changes slightly. We again refrain from going into such finer details; standard statistical packages automatically take care of such contingencies.
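For readers who prefer to let software do the ranking, the following is a minimal sketch in Python, assuming SciPy is available; the cholesterol values are those of the table above, and the tie at 207 mg/dl is handled automatically.

# Kruskal-Wallis test for the cholesterol example
from scipy import stats

no_htn        = [223, 207, 248, 195, 219]
diastolic_htn = [217, 258, 225, 215, 228]
systolic_htn  = [262, 227, 207, 245, 230]
clear_htn     = [218, 238, 265, 269, 240]

h_value, p_value = stats.kruskal(no_htn, diastolic_htn, systolic_htn, clear_htn)
print(f"H = {h_value:.2f}, P = {p_value:.4f}")  # H about 4.41, P about 0.22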
 
Multiple Comparisons
Once overall significance is indicated by the F-test, the next step is to identify the groups that differ from one or more of the others. This requires pairwise comparisons. If there are four groups, the comparisons are group 1 with group 2, 1 with 3, 1 with 4, 2 with 3, 2 with 4 and 3 with 4. There are a total of six comparisons, and these are called multiple comparisons. You now know that means in two groups are generally compared by Student's t-test. However, repeated application of this test at, say, the 5% level of significance on the same data blows up the total probability of Type I error to an unacceptable level. If there are 15 tests, each done at the 5% level, then the overall (experimentwise) Type I error could be as high as 1 − (1 − 0.05)15 = 0.54. Compare this with the desired 0.05. To keep the probability of Type I error within a specified limit such as 0.05, many procedures for multiple comparisons are available. Each of these is generally known by the name of the scientist who first proposed it. Among them are Bonferroni, Tukey, Scheffe, Newman-Keuls, Duncan and Dunnett. The last is used specifically when each group is to be compared with the control only. Only Bonferroni's procedure is discussed here as it is more popular, easy to apply, and general (in the sense that it can be used in other ‘Multiple Comparison’ situations including non-parametric ones, i.e. applying Mann-Whitney test(s) after the Kruskal-Wallis test).
 
Bonferroni Procedure
This is the simplest method to ensure that the probability of Type I error does not exceed the desired level. Under this procedure, each comparison is done by using Student's t test but a difference is considered significant only if the corresponding P-value is less than α/H where H is the number of comparisons. If there are four groups and all pairwise comparisons are required, then H = 6. Then a difference would be considered significant at the 5% level if P < 0.05/6, that is, if P < 0.0083.
The Bonferroni procedure is conservative in the sense that the actual probability of Type I error will be much less than α. This means that there is an additional chance that some differences which are actually significant are pronounced not significant. This is not a major limitation in an empirical sense, and an advantage of the Bonferroni procedure is that H need be only as large as the number of comparisons of interest. If there are four groups and the interest is only in comparing group 1 with group 2, group 2 with group 3, and group 3 with group 4 (and not, for example, in comparing group 1 with group 3), then H = 3. A small H improves the efficiency of this procedure.
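As an illustration, the following minimal Python sketch (assuming SciPy) applies Bonferroni-adjusted pairwise t tests to the three sickle cell groups of Table 6.1; with three groups there are H = 3 comparisons, so each P-value is judged against 0.05/3.

# Bonferroni-adjusted pairwise comparisons for the hemoglobin data
from itertools import combinations
from scipy import stats

groups = {
    "Hb SS": [7.2, 7.7, 8.0, 8.1, 8.3, 8.4, 8.4, 8.5, 8.6, 8.7,
              9.1, 9.1, 9.1, 9.8, 10.1, 10.3],
    "Hb S/beta-thal": [8.1, 9.2, 10.0, 10.4, 10.6, 10.9, 11.1, 11.9, 12.0, 12.1],
    "Hb SC": [10.7, 11.3, 11.5, 11.6, 11.7, 11.8, 12.0, 12.1, 12.3,
              12.6, 12.6, 13.3, 13.3, 13.8, 13.9],
}

pairs = list(combinations(groups, 2))
alpha_per_test = 0.05 / len(pairs)          # Bonferroni threshold (0.05/3 here)
for name1, name2 in pairs:
    t, p = stats.ttest_ind(groups[name1], groups[name2])
    verdict = "significant" if p < alpha_per_test else "not significant"
    print(f"{name1} vs {name2}: P = {p:.3g} ({verdict} at overall 5% level)")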
Multiple comparisons may sometimes give results at variance with the results of the F-test. It is possible for the F-test to be significant but none of the pairwise comparisons significant. Conversely, the F-test may not show significance but the comparison for a specific pair may still be significant. This happens because both require a Gaussian pattern but the underlying distribution may not be exactly Gaussian. The problem may arise more frequently for small n than for large n, because a large n is insulation against violation of a Gaussian pattern in most cases. All ‘Multiple Comparison’ procedures are valid only for preplanned comparisons. Sometimes the idea of testing statistical significance between two or more groups arises after seeing the observed data. This is sometimes called data snooping.
 
Two-way Analysis of Variance
Two-way analysis of variance is used when the data are classified in two ways; for example, by age-group and sex. The data are said to have a balanced design if there are equal numbers of observations in each group and an unbalanced design if there are not. Balanced designs are of two types, with replication if there is more than one observation in each group and without replication if there is only one.
 
Balanced Design with Replication
Results from an experiment in which five male and five female rats of each of three strains were treated with growth hormone are displayed in Table 6.2. The aims were to find out whether the strains responded to the treatment to the same extent, and whether there was any sex difference. The measure of response was weight gain after seven days.
Table 6.2   Differences in response to growth hormone for five male and five female rats from three different strains
(a) Mean weight gains in grams with standard deviations in parentheses (n = 5 for each group)
Strain: A; B; C
Male: 11.9 (0.9), 12.1 (0.7), 12.2 (0.7)
Female: 12.3 (1.1), 11.8 (0.6), 13.1 (0.9)
(b) Two-way analysis of variance: balanced design with replication
Source of variation; SS; d.f.; MS; F = MS effect/MS residual
Main effect, Strain: SS = 2.63, d.f. = 2, MS = 1.32; F = 1.9, P > 0.10
Main effect, Sex: SS = 1.16, d.f. = 1, MS = 1.16; F = 1.7, P > 0.10
Interaction, Strain × Sex: SS = 1.65, d.f. = 2, MS = 0.83; F = 1.2, P > 0.10
Residual: SS = 16.86, d.f. = 24, MS = 0.70
Total: SS = 22.30, d.f. = 29
These data are classified in two ways, by strain and by sex. The design is balanced with replication because there are five observations in each strain-sex group. Two-way analysis of variance divides the total sum of squares into four components:
  1. The sum of squares due to differences between the strains. This is said to be the main effect of the factor, strain. Its associated degrees of freedom are one less than the number of strains and equal 2.
  2. The sum of squares due to differences between the sexes, that is the main effect of sex. Its degrees of freedom equal 1, one less than the number of sexes.
  3. The sum of squares due to the interaction between strain and sex. An interaction means that the strain differences are not the same for both sexes or, equivalently, that the sex difference is not the same for the three strains. The degrees of freedom equal the product of the degrees of freedom of the two main effects, which is 2 × 1 = 2.
  4. The residual sum of squares due to differences between the rats within each strain-sex group. Its degrees of freedom equal 24, the product of the number of strains (3), the number of sexes (2) and one less than the number of observations in each group (4).
The main effects and interaction are tested for significance using the F test to compare their mean squares with the residual mean square, as described for one-way analysis of variance. No significant results were obtained in this experiment.
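A sketch of how such a two-way analysis could be run in Python with the statsmodels package is given below; the individual weight gains used here are hypothetical, since Table 6.2 reports only group means and standard deviations, so the output illustrates the layout of the analysis rather than reproducing the table.

# Two-way ANOVA, balanced design with replication (hypothetical weight gains)
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

hypothetical = {
    ("A", "Male"):   [11.0, 11.5, 11.9, 12.3, 12.8],
    ("B", "Male"):   [11.3, 11.8, 12.1, 12.4, 12.9],
    ("C", "Male"):   [11.4, 11.9, 12.2, 12.5, 13.0],
    ("A", "Female"): [11.1, 11.7, 12.3, 12.9, 13.5],
    ("B", "Female"): [11.1, 11.5, 11.8, 12.1, 12.5],
    ("C", "Female"): [12.0, 12.6, 13.1, 13.6, 14.2],
}
records = [{"strain": strain, "sex": sex, "gain": g}
           for (strain, sex), gains in hypothetical.items() for g in gains]
df = pd.DataFrame(records)

model = smf.ols("gain ~ C(strain) * C(sex)", data=df).fit()
print(anova_lm(model))  # rows: strain, sex, strain:sex interaction, residual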
 
Balanced Design Without Replication
A physical therapist wished to compare three methods for teaching patients to use a certain prosthetic device. He felt that the rate of learning would be different for patients of different ages and wished to design an experiment in which the influence of age could be taken into account. A balanced design without replication is appropriate here. Three patients in each of five age groups were selected to participate in the experiment, and one patient in each age group was randomly assigned to each of the teaching methods. The methods of instruction constitute the three treatments, and the five age groups are the blocks. The data shown in Table 6.3 were obtained.
Other details, like the sum of squares formulas, are not given as they are complicated and are easily found in many books on the subject.
Table 6.3   Time in days required to learn the use of a certain prosthetic device
Age group; Method A; Method B; Method C; Total; Mean
Under 20: 7, 9, 10; total = 26, mean = 8.67
20 to 29: 8, 9, 10; total = 27, mean = 9.00
30 to 39: 9, 9, 12; total = 30, mean = 10.00
40 to 49: 10, 9, 12; total = 31, mean = 10.33
50 and over: 11, 12, 14; total = 37, mean = 12.33
Total: 45, 48, 58; overall total = 151
Mean: 9.0, 9.6, 11.6; overall mean = 10.07
 
Non-parametric Statistical Inference: Two-Way Layout—Friedman Test
In the earlier example on cholesterol level in different types of hypertension, suppose the interest is in simultaneously considering obesity as well. Another study is carried out and cholesterol level recorded for thin, normal and obese subjects belonging to each of the four hypertension groups. Now factor-1 is hypertension status and factor-2 is obesity. The former has J = 4 levels as before and the latter has K = 3 levels. For simplicity, we now take the case of only one subject in each group. The measurement obtained on the subject with the jth level of factor-1 and the kth level of factor-2 is denoted by yjk. Note that there is no need of subscript i since n = 1. The non-parametric criterion to test H0 of equality of location of factor-2 for this setup is Friedman S. This is calculated as follows:
  1. For each level of factor-1, rank the K observations from smallest to largest. Let Rjk denote the rank received by yjk.
  2. Calculate R.k = ∑j Rjk. This is the sum of the ranks received by J subjects in level k of factor-2. If H0 is true, this sum would be nearly equal for each k (k = 1, 2,…, K). Thus, the variance of R.k would be close to zero. This leads to criterion S.
  3. Calculate
    Friedman S = [12/(JK(K + 1))] × ∑k R.k2 − 3J(K + 1)
If S is too large, the probability of H0 being true is exceedingly small and H0 is thus rejected. S follows a chi-square distribution with (K − 1) degrees of freedom. Thus, obtaining the P-value is easy.
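A minimal sketch in Python, assuming SciPy is available, is given below; the cholesterol values are hypothetical, since the combined hypertension-by-obesity data are not listed in the text, and serve only to illustrate how the Friedman test is applied.

# Friedman test: J = 4 hypertension groups as blocks, K = 3 obesity levels
from scipy import stats

# hypothetical cholesterol levels, one subject per cell, across the 4 blocks
thin   = [205, 220, 228, 231]
normal = [214, 226, 235, 246]
obese  = [230, 241, 252, 266]

s_value, p_value = stats.friedmanchisquare(thin, normal, obese)
print(f"Friedman S = {s_value:.2f}, P = {p_value:.3f}")  # chi-square with K - 1 = 2 d.f.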
 
Fixed and Random Effects
Factors can be divided into two types, fixed effects (the more common) and random effects. Factors such as sex, age-group, and type of sickle cell disease are all fixed effects since their individual levels have specific values; sex is always male or female. In contrast, the individual levels of a random effect are not of intrinsic interest, but are a sample of levels representative of a source of variation. For example, consider a study to investigate the variation in sodium and sucrose concentrations of home-prepared oral rehydration solutions, in which 10 persons were each asked to prepare eight solutions. In this case, the 10 persons are of interest only as representatives of the variation between solutions prepared by different persons. ‘Person’ is a random effect. In this particular example, as well as testing whether the person effect is significant, we would be interested in estimating the size of the variation in concentrations between solutions prepared by the same person, and of the variation between solutions prepared by different persons. These are called components of variation.
The method of significance testing is the same for fixed and random effects in one-way designs and in two-way designs without replication, but it differs in two-way designs with replication (and in higher-level designs). Since these are complex points, readers interested in a more detailed account should refer to special books on this subject.

Inference: Qualitative Variables (Categorical Data Analysis)7

 
COMPARISON OF PROPORTIONS
 
Significance Test for a Single Proportion
The normal test for a single proportion uses the formula: z = (p − π)/√[π(1 − π)/n], where p = r/n is the observed proportion and π is the null hypothesis value.
For example, in the clinical trial to compare two analgesics, the proportion of patients preferring drug A was 9/12 = 0.75 compared to the null hypothesis value π = 0.5.
The value 1.73 lies between the 5% and 10% points of the standard normal distribution. Its exact significance level is 2 × 0.0418 = 0.0836 or 8.36%. This is appreciably smaller than the corresponding value of 14.58% calculated using the binomial probabilities. The reason for this is that we have used a continuous distribution, the normal distribution, to approximate the binomial distribution, which is discrete. This situation is corrected by the introduction of what is called a continuity correction into the formula for the normal test.
The normal test with continuity correction is: z = [|p − π| − 1/(2n)]/√[π(1 − π)/n]
where |(p – π)| means the absolute value of (p – π) or, in other words, the value of (p – π) ignoring the minus sign if it is negative.
The effect of the continuity correction is to reduce the value of z and to improve the agreement with the binomial test. In the example, the z value with the continuity correction is 1.44. The significance level for z = 1.44 is 2 × 0.0749 = 0.1498 or 14.98% which agrees much better with the value of 14.58% from the binomial test.
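The calculations for this example can be verified with a few lines of Python (SciPy is assumed for the normal distribution):

# Normal test for a single proportion, with and without continuity correction
import math
from scipy import stats

n, r, pi = 12, 9, 0.5
p = r / n
se = math.sqrt(pi * (1 - pi) / n)

z_plain = (p - pi) / se
z_corrected = (abs(p - pi) - 1 / (2 * n)) / se

for label, z in [("without correction", z_plain), ("with correction", z_corrected)]:
    two_sided_p = 2 * stats.norm.sf(abs(z))
    print(f"z {label}: {z:.2f}, two-sided P = {two_sided_p:.4f}")
# expected output: z = 1.73 (P = 0.0836) and z = 1.44 (P close to 0.15)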
 
Significance Test for Comparing two Proportions
The normal test to compare two sample proportions, p1= r1/n1 and p2 = r2/n2 is based on:
where the standard error of (p1 – p2), under the null hypothesis that the population proportions are equal (i.e. π1 = π2 = π), is:
p is estimated by the overall proportion in both samples, that is by:
As in the case of a single proportion the test is improved by inclusion of a continuity correction. The formula is then:
where p = (r1 + r2)/(n1 + n2) and |p1 − p2| means the size of the difference ignoring its sign. This normal test is a valid approximation provided that either n1 + n2 is greater than 40, or n1p, n1 − n1p, n2p and n2 − n2p are all 5 or more. The exact test should be used when this condition is not satisfied.
 
The Chi-Squared Test for Contingency Tables
When there are two qualitative variables, the data are arranged in a contingency table. The categories for one variable define the rows, and the categories for the other variable define the columns. Individuals are assigned to the appropriate cell of the contingency table according to their values for the two variables. A contingency table is also used for discrete quantitative variables or for continuous quantitative variables whose values have been grouped.
A chi-squared (χ2) test is used as a ‘goodness of fit’ test. It is also used to test whether there is an association between the row variable and the column variable or, in other words, whether the distribution of individuals among the categories of one variable is independent of their distribution among the categories of the other. When the table has only two rows or two columns this is equivalent to the comparison of proportions.
 
2 × 2 Table
The results of one influenza vaccination trial are arranged in tabular form in Table 7.1(a) for illustration. Of 460 adults who took part, 240 received influenza vaccination and 220 placebo vaccination. Overall 100 people contracted influenza, of whom 20 were in the vaccine group and 80 in the placebo group. Such a table of data is called a 2 × 2 contingency table, since the two variables, namely type of vaccine and presence of influenza, both have two possible values. It is sometimes called a fourfold table since each subject falls into one of four possible categories.
The first step in interpreting contingency table data is to calculate appropriate proportions or percentages. Thus the percentages contracting influenza were 8.3% in the vaccine group, 36.4% in the placebo group, and 21.7% overall. We now need to decide whether this is sufficient evidence that the vaccine was effective or whether the difference could have arisen by chance.
Table 7.1   Results from an influenza vaccine trial
(a) Observed numbers
Influenza; Vaccine; Placebo; Total
Yes: 20 (8.3%), 80 (36.4%), 100 (21.7%)
No: 220, 140, 360
Total: 240, 220, 460
(b) Expected numbers
Influenza; Vaccine; Placebo; Total
Yes: 52.2, 47.8, 100
No: 187.8, 172.2, 360
Total: 240, 220, 460
This is done using the chi-squared test, which compares the observed numbers in each of the four categories in the contingency table with the numbers to be expected if there were no difference in effectiveness between the vaccine and placebo.
Overall 100/460 people contracted influenza and, if the vaccine and the placebo were equally effective, one would expect this same proportion in each of the two groups; that is 100/460 × 240 = 52.2 in the vaccine group and 100/460 × 220 = 47.8 in the placebo group would have contracted influenza. Similarly 360/460 × 240 = 187.8 and 360/460 × 220 = 172.2 would have escaped influenza. These expected numbers are shown in Table 7.1(b). They add up to the same row and column totals as the observed numbers. The chi-squared value is obtained by calculating [(observed-expected)2/expected] for each of the four cells in the contingency table and then summing them. That is
The greater the differences between the observed and expected numbers, the larger the value of χ2 and the less likely it is that the difference is due to chance. The percentage points of the chi-squared distribution are given in the Appendix. The values depend on the degrees of freedom, which equal 1 for a 2 × 2 table. In this example χ2 = 53.09 is greater than 10.83, the 0.1% point for the chi-squared distribution with 1 degree of freedom. Thus the probability is less than 0.1% that such a large observed difference in the percentage contracting influenza could have arisen by chance, if there were no real difference between the vaccine and the placebo. It is therefore concluded that the vaccine is effective.
 
Continuity Correction
Like the normal test, the chi-squared test for a 2×2 table can be improved by using a continuity correction, often called Yates' continuity correction. The formula becomes:
resulting in a smaller value for χ2. |O − E| means the absolute value of (O − E) or, in other words, the value of (O − E) ignoring its sign. In the example the value for χ2 becomes 51.46, P < 0.001.
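A minimal sketch in Python, assuming SciPy is available, reproduces the expected numbers and both versions of the chi-squared value for the influenza trial; the small differences from the figures quoted above arise only from rounding of the expected numbers.

# Chi-squared test for the 2 x 2 influenza table (Table 7.1)
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[20, 80],     # contracted influenza: vaccine, placebo
                     [220, 140]])  # did not contract influenza

chi2_plain, p_plain, dof, expected = chi2_contingency(observed, correction=False)
chi2_yates, p_yates, _, _ = chi2_contingency(observed, correction=True)

print(expected)   # close to 52.2, 47.8, 187.8, 172.2 of Table 7.1(b)
print(f"chi2 = {chi2_plain:.2f} (uncorrected), {chi2_yates:.2f} (Yates), d.f. = {dof}")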
 
Comparison with Normal Test
The normal test for comparing two proportions and the chi-squared test for a 2 × 2 contingency table are in fact mathematically equivalent and χ2 = Z2. This is true whether or not a continuity correction is used, provided it is used or not used in both tests. The normal test has the advantage that confidence intervals are more readily calculated. However, the chi-squared test is simpler to apply and it can also be extended to the comparison of several proportions and to larger contingency tables.
 
Validity
The use of the continuity correction is always advisable although it has most effect when the expected numbers are small. When they are very small the chi-squared test (and also the normal test) is not a good enough approximation, even with a continuity correction, and the alternative exact test for a 2 × 2 table should be used. It is recommended to use the exact test when the overall total of the table is less than 20 or when it is between 20 and 40 but the smallest of the four expected values is less than 5. Thus the chi-squared test is valid when the overall total is more than 40, regardless of the expected values, and when the overall total is between 20 and 40 provided all the expected values are at least 5.
 
Short-cut Formula
Table 7.2   Generalized notation for a 2×2 contingency table
Influenza; Vaccine; Placebo; Total
Yes: a, b, e
No: c, d, f
Total: g, h, n
If the various numbers in the contingency table are represented by letters, as shown in Table 7.2, then a quicker formula for calculating chi-squared for a 2 × 2 table is: χ2 = n(ad − bc)2/(efgh), which apart from rounding error is the same as obtained above. With the continuity correction the quick formula becomes: χ2 = n(|ad − bc| − n/2)2/(efgh), d.f. = 1.
In the earlier example this gives χ2 = 51.37, which again, apart from rounding error, is the same as obtained above.
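A minimal sketch of the short-cut formula in Python, using the notation of Table 7.2 and the influenza data, is given below; the helper name chi2_2x2 is introduced only for illustration.

# Short-cut chi-squared formula for a 2 x 2 table
def chi2_2x2(a, b, c, d, continuity=True):
    e, f = a + b, c + d          # row totals
    g, h = a + c, b + d          # column totals
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if continuity:
        diff = diff - n / 2      # Yates' continuity correction
    return n * diff ** 2 / (e * f * g * h)

print(round(chi2_2x2(20, 80, 220, 140, continuity=False), 2))  # about 53.0
print(round(chi2_2x2(20, 80, 220, 140, continuity=True), 2))   # about 51.4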
 
Larger Tables
The chi-squared test can also be applied to larger tables, generally called r × c tables, where r denotes the number of rows in the table and c the number of columns.
There is no continuity correction for contingency tables larger than 2 × 2. However, an exact test is available (based on multinomial distribution) but it is very complicated and almost impossible to apply manually. It is suggested in the literature that the approximation of the chi-squared test is valid provided less than 20% of the expected values/numbers are under 5 and none is less than 1. This restriction can sometimes be overcome by combining (combinations should be meaningful) rows or columns with low expected values/numbers.
There is no quick formula for a general r × c table, however, for easy computation χ2 = { [∑ (O2/E)] – n } can be used. The expected numbers must be computed for each cell in any case. The reasoning employed is the same as that described above for the 2 × 2 table. The general rule for calculating an expected number is: E = [(column total × row total)/overall total]. It is important to note that the chi-squared test is valid only if applied to the actual numbers/frequencies in the various categories. It must never be applied to tables showing just proportions or percentages. Chi-square test is a test of significance and is not a measure of association. Thus a large value of chi-square denotes low probability of obtaining such result by chance alone. It does not denote greater degree of association between the two characteristics/variables.245
 
Exact Test for 2 × 2 Tables
The approximation of the chi-squared and normal tests for a 2 × 2 table may not be valid if the sample sizes are small. An alternative exact test, sometimes called Fisher's exact test, should be used when:
  1. The overall total of the table is less than 20, or
  2. The overall total is between 20 and 40 and the smallest of the four expected numbers is less than 5.
The test is described in the context of a particular example. Below are the results (Table 7.3) from a study to compare two treatment regimes for controlling bleeding in haemophiliacs undergoing surgery.
Table 7.3   Comparison of two treatment regimes for controlling bleeding in haemophiliacs undergoing surgery – Hypothetical data
Bleeding complication; Regime A; Regime B; Total
Yes: 1, 3, 4
No: 12, 9, 21
Total: 13, 12, 25
Only one (8%) of the 13 haemophiliacs given treatment regime A suffered bleeding complications, compared to three out of 12 (25%) given regime B. These numbers are too small for the chi-squared test to be valid; the overall total, 25, is less than 40, and the smallest expected value, 1.9 (complications with regime B), is less than 5. The exact test is therefore indicated. It is based on calculating the exact probabilities of the observed table and of more ‘extreme’ tables with the same row and column totals, using the following formula.
Exact probability (of a 2 × 2 table) = (e! f! g! h!)/(n! a! b! c! d!)
where the notation is the same as that defined in Table 7.2. The exclamation mark denotes the factorial of the number and means all the integers from the number down to 1 multiplied together (0! is defined to equal 1). The exact probability of Table 7.3 is therefore: (4!21!13!12!)/(25!1!3!12!9!) = 0.2261
In order to test the null hypothesis that there is no difference between the treatment regimes, we need to calculate not only the probability of the observed table but also the probability that a more extreme table could occur by chance. Altogether there are five possible tables which have the same row and column totals as the data. These are shown in Table 7.4 together with their probabilities, which total 1.
The observed case is table (b) with a probability of 0.2261. Defining more extreme as less probable, more extreme tables are (a) and (e) with probabilities 0.0391 and 0.0565 respectively. The total probability needed for the significance level is therefore 0.2261 + 0.0391+ 0.0565 = 0.3217, and so the difference between the regimes is clearly not significant.247
Table 7.4   All possible tables with the same row and column totals as earlier table together with their probabilities
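A minimal sketch in Python, assuming SciPy is available, computes the probability of the observed table from the factorial formula and the two-sided exact P-value for Table 7.3:

# Fisher's exact test for the haemophilia example
from math import factorial
from scipy.stats import fisher_exact

a, b, c, d = 1, 3, 12, 9
e, f, g, h, n = a + b, c + d, a + c, b + d, a + b + c + d

prob_observed = (factorial(e) * factorial(f) * factorial(g) * factorial(h)) / (
    factorial(n) * factorial(a) * factorial(b) * factorial(c) * factorial(d)
)
print(round(prob_observed, 4))        # about 0.2261, as in the text

odds_ratio, p_two_sided = fisher_exact([[a, b], [c, d]])
print(round(p_two_sided, 4))          # about 0.32, as in the text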
 
Comparison of Two Proportions – Paired Case
In some studies we are interested in comparing two proportions calculated from observations which are paired in some way. This arises whenever the two proportions are measured on the same individuals, or in case-control studies and clinical trials in which a matched paired design has been used.
For example, consider the results of an experiment to compare two methods for detecting a specific type of egg in feces, in which two subsamples from each of 315 specimens were analyzed, one by each method. The results are arranged in a 2 × 2 contingency table (Table 7.5a), showing the agreement between the two methods. It is not this agreement, however, that we are concerned with here. We are interested primarily in comparing the proportions of specimens found positive with the two methods. These were 238/315 (75.6%) using method I and 198/315 (62.9%) using method II. Note that it would be incorrect to arrange the data as in Table 7.5b and to apply the standard chi-squared test, as this would take no account of the paired nature of the data, namely that it was the same 315 specimens examined with each method, and not 630 different ones.
Table 7.5   Comparison of two methods—same 315 specimens were examined using each method
(a) Correct layout
Method I; Method II positive; Method II negative; Total
Positive: 184, 54, 238
Negative: 14, 63, 77
Total: 198, 117, 315
(b) Incorrect layout
Result; Method I; Method II; Total
Positive: 238, 198, 436
Negative: 77, 117, 194
Total: 315, 315, 630
The correct approach is as follows. One hundred and eighty-four specimens were positive with both methods and 63 were negative with both. These 247 specimens therefore give us no information about which of the two methods is better at detecting eggs. The information we require is entirely contained in the 68 specimens for which the methods did not agree. Of these, 54 were positive with method I only, compared to 14 positive with method II only. If there was no difference in the abilities of the methods to detect eggs, we would not of course expect complete agreement, since different subsamples were examined, but we would expect on average half the disagreements to be positive with method I only and half to be positive with method II only. Thus, an appropriate significance test is to compare the proportion found positive with method I only, namely 54/68, with the hypothetical value of 0.5. This may be done using the normal test (with continuity correction). This gives:
indicating that method I is significantly better (P < 0.001) than method II at detecting eggs.
(Note that exactly the same result would have been obtained had the proportion positive with method II only, namely 14/68, been compared with 0.5).
 
McNemar's Chi-squared Test
There is also an alternative chi-squared test, called McNemar's chi-squared test, which is based on the numbers of discordant pairs, r and s. The version including a continuity correction is:
In the above example r = 54 and s = 14, giving χ2paired = 22.36, and P < 0.001.
Apart from rounding error, this χ2paired value is the same as the square of the Z value obtained above, indicating that the two tests are mathematically equivalent.
 
Confidence Interval
The proportion of specimens found positive with method I was 0.7556 compared to 0.6286 with method II. The difference between the proportions was therefore 0.1270. This difference, together with its standard error, can also be calculated from the numbers of discordant pairs, r and s, and the total number of pairs, n.
Difference between paired proportions = (r − s)/n
and its standard error = √(r + s)/n
Here this gives a difference, (r-s)/n = 40/315 = 0.1270, the same as calculated directly from the proportions, with the s.e.= 0.0262. An approximate confidence interval may be based on the normal distribution.
(1 – α)% Confidence Interval is
{[(r − s)/n] ± Zα × [√(r + s)/n]}
The 95% confidence interval for the difference between the proportions of specimens found positive with method I and method II is therefore:
This means a positivity rate between 7.6% and 17.8% higher if method I is used to detect eggs than if method II is used.
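The whole paired analysis can be verified with a short Python sketch based only on the discordant pairs (SciPy is assumed for the normal distribution):

# Paired comparison for the egg-detection example
import math
from scipy import stats

r, s, n = 54, 14, 315          # discordant pairs and total number of pairs

# normal test: proportion positive with method I only, compared with 0.5
p = r / (r + s)
se0 = math.sqrt(0.25 / (r + s))
z = (abs(p - 0.5) - 1 / (2 * (r + s))) / se0
print(f"z = {z:.2f}, P = {2 * stats.norm.sf(z):.6f}")

# McNemar's chi-squared with continuity correction (equals z squared)
chi2_paired = (abs(r - s) - 1) ** 2 / (r + s)
print(f"McNemar chi-squared = {chi2_paired:.2f}")          # about 22.4

# difference between paired proportions and its 95% confidence interval
diff = (r - s) / n
se_diff = math.sqrt(r + s) / n
print(f"difference = {diff:.4f}, 95% CI = "
      f"({diff - 1.96 * se_diff:.3f}, {diff + 1.96 * se_diff:.3f})")  # about 0.076 to 0.178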
 
Analyzing Several 2 × 2 Tables
A very important general point to realize when carrying out an analysis is that it may be misleading to pool dissimilar subsets of the data. For example, consider the hypothetical results, shown in Table 7.6, from a survey carried out to compare the prevalence of antibodies to leptospirosis in rural and urban areas, for both males and females. Prevalence is higher in rural areas, but when the sexes are combined there appears to be no difference. This is caused by a combination of two factors:
Table 7.6   Comparisons of prevalence of antibodies to leptospirosis between rural and urban areas
(A) Males
Antibodies; Rural; Urban; Total
Yes: 36 (72%), 50 (50%), 86
No: 14, 50, 64
Total: 50, 100, 150
χ2 = 5.73, d.f. = 1, P < 0.025
(B) Females
Antibodies; Rural; Urban; Total
Yes: 24 (16%), 10 (10%), 34
No: 126, 90, 216
Total: 150, 100, 250
χ2 = 1.36, d.f. = 1, P > 0.10
(C) Males and females combined
Antibodies; Rural; Urban; Total
Yes: 60 (30%), 60 (30%), 120
No: 140, 140, 280
Total: 200, 200, 400
  1. The prevalence of antibodies is not the same for males and females. It is much lower among females in both areas.
  2. The samples from the rural and urban areas have different sex compositions. The proportion of males is 100/200 (50%) in the urban sample but only 50/200 (25%) in the rural sample.
Sex is said to be a confounding variable because it is related both to the variable of interest (prevalence of antibodies) and to the groups being compared (rural and urban). Ignoring sex in the analysis leads to a bias in the results.
It is therefore important to analyze males and females separately. Applying the chi-squared test shows that the difference between the rural and urban areas is significant for males but not for females. In this instance pooling the sexes masked a difference that existed. In other situations pooling dissimilar subsets of the data could suggest a difference or association where none exists, or could even suggest a difference the opposite way around to one which does exist.
 
Mantel-Haenszel Chi-squared Test
When confounding is present, it is important to analyze the relevant subsets of the data separately. It is often useful, however, to apply a summary test which pools the evidence from the individual subsets but which takes account of the confounding factor(s). This is particularly true when a difference is present but not significant in the individual subsets, perhaps because of their small sample sizes. The Mantel-Haenszel chi-squared test is used for this purpose when the data consist of several 2 × 2 tables. The following three values are calculated from each table and then summed over the tables (the notation is as defined earlier):
  1. The observed value, a,
  2. The expected value for a, Ea = eg/n, and
  3. The variance of a, Va = efgh/[n2(n − 1)].
The chi-squared value (with continuity correction) is:
It has just 1 degree of freedom irrespective of how many tables are summarized. The calculations for the data are laid out in Table 7.7(a). A total of 60 persons in the rural area had antibodies to leptospirosis compared with an expected number of 49.1, based on assuming no difference in prevalence between rural and urban areas. The Mantel-Haenszel χ2 value equals 7.08 and is significant at the 1% level.
Table 7.7   Calculation of Mantel-Haenszel χ2 test, applied to data in above table
(a) The test
Subset; a; Ea; Va
Males: a = 36, Ea = 28.7, Va = 86 × 64 × 50 × 100/(22500 × 149) = 8.2088
Females: a = 24, Ea = 20.4, Va = 34 × 216 × 150 × 100/(62500 × 249) = 7.0786
Total: ∑a = 60, ∑Ea = 49.1, ∑Va = 15.2874
χ2 = 7.08, df = 1, P < 0.01
(b) Rule of 5 to check validity
Subset; Min (e, g); g − f; Max (0, g − f)
Males: Min (e, g) = 50, g − f = −14, Max (0, g − f) = 0
Females: Min (e, g) = 34, g − f = −66, Max (0, g − f) = 0
Total: ∑Min (e, g) = 84, ∑Max (0, g − f) = 0
 
Validity
The Mantel-Haenszel χ2 test is an approximation. Its adequacy is assessed by the ‘rule of 5’. Two additional values are calculated for each table and summed over the tables. These are:
  1. Min (e, g), that is the smaller of e and g, and
  2. Max (0, g – f), which equals 0 if g is smaller than or equal to f, and g-f if g is larger.
Both sums must differ from the total of the expected values, ∑Ea, by at least 5 for the test to be valid. The details of these calculations for the leptospirosis data are shown in Table 7.7(b). The two sums, 84 and zero, both differ from 49.1 by 5 or more, validating the use of this test.
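A minimal sketch of the Mantel-Haenszel calculation in Python (SciPy assumed for the P-value) is given below; the result agrees with Table 7.7 apart from rounding of Ea.

# Mantel-Haenszel chi-squared test for the leptospirosis data
from scipy import stats

# (a, b, c, d) for each subset: rows = antibodies yes/no, columns = rural/urban
subsets = [
    (36, 50, 14, 50),     # males
    (24, 10, 126, 90),    # females
]

sum_a = sum_ea = sum_va = 0.0
for a, b, c, d in subsets:
    e, f, g, h = a + b, c + d, a + c, b + d
    n = a + b + c + d
    sum_a += a
    sum_ea += e * g / n
    sum_va += e * f * g * h / (n ** 2 * (n - 1))

chi2_mh = (abs(sum_a - sum_ea) - 0.5) ** 2 / sum_va
print(f"Mantel-Haenszel chi-squared = {chi2_mh:.2f}, "
      f"P = {stats.chi2.sf(chi2_mh, df=1):.4f}")   # about 7.1, P < 0.01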
 
Chi-squared Test for Trend
The standard chi-squared test for a 2 × c table is a general test to assess whether there are differences among the c proportions. When the categories in the columns have a natural order, however, a more sensitive test is to look for an increasing (or decreasing) trend in the proportions over the columns.
Table 7.8   Relationship between triceps skinfold and early menarche
Triceps skinfold group: Small; Intermediate; Large
< 12 years: 15 (8.8%), 29 (12.8%), 36 (19.4%); total = 80
12+ years: 156, 197, 150; total = 503
Total: 171, 226, 186; overall total = 583
Score for trend test: 1, 2, 3
An example is given in Table 7.8, which shows data from a study on obesity in women. It can be seen that the proportion of women who had experienced early menarche increased with triceps skinfold size. This trend can be tested using the chi-squared test for trend, which has 1 degree of freedom.
The first step is to assign scores to the columns to describe the shape of the trend. The usual choice is simply to number the columns 1, 2, 3, etc., as shown here (or equivalently 0, 1, 2, etc.). This represents a trend that goes up (or down) in equal steps from column to column, and is adequate for most circumstances. Another possibility would have been to use the means of the triceps skinfold measurements in each group, reflecting a linear trend with the absolute value of triceps skinfold. The difference between the standard χ2 value and the trend test χ2 value provides a chi-squared value with (c − 2) degrees of freedom to test for departures from the fitted trend.
The next step is to calculate three quantities for each column of the table and to sum the results of each. These are:
  1. rx, the product of the entry, r, in the top row of the column and the score, x,
  2. nx, the product of the total, n, of the column and the score, x, and
  3. nx2 the product of the total, n, of the column and the square of the score, x2.
Using N to denote the overall total and R the total of the top row, the formula for the chi-squared test for trend (with continuity correction) is:
There are various different forms for this test, most of which are algebraically equivalent. The only difference is that in some forms (N − 1) is replaced by N in the calculation of B. This difference is unimportant. The form chosen here is both relatively simple to apply and easily extended to several 2 × c tables. The subtraction of 0.5 in the numerator is the continuity correction. This should be omitted if the scores are not simply the column numbers.
χ2trend = 8.04, df = 1, P < 0.005 for the present data. Thus the trend of an increasing proportion of women who had experienced early menarche with increasing triceps skinfold size is highly significant (P < 0.005).
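A minimal sketch of the trend test in Python (SciPy assumed) is given below; A and B follow one standard form of the statistic, consistent with the remark about (N − 1) above, and reproduce χ2trend = 8.04 for Table 7.8.

# Chi-squared test for trend applied to Table 7.8
from scipy import stats

r_top = [15, 29, 36]       # early menarche (< 12 years) in each skinfold group
n_col = [171, 226, 186]    # column totals
x = [1, 2, 3]              # scores assigned to the columns

N = sum(n_col)             # 583
R = sum(r_top)             # 80
sum_rx = sum(r * xi for r, xi in zip(r_top, x))
sum_nx = sum(n * xi for n, xi in zip(n_col, x))
sum_nx2 = sum(n * xi ** 2 for n, xi in zip(n_col, x))

A = sum_rx - R * sum_nx / N
B = (R * (N - R) / (N * (N - 1))) * (sum_nx2 - sum_nx ** 2 / N)
chi2_trend = (abs(A) - 0.5) ** 2 / B      # 0.5 is the continuity correction

print(f"chi2 trend = {chi2_trend:.2f}, P = {stats.chi2.sf(chi2_trend, df=1):.4f}")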
 
Extension for several 2 × c tables
When the data consist of several subsets which cannot be pooled because of confounding variables, any trend tests should be carried out on each subset separately. A summary test is also available which combines the evidence from the separate trends, taking the confounding into account. This is a chi-squared test which has 1 degree of freedom, irrespective of the number of subsets. A and B are calculated for each of the individual tables as described above and then summed.
Summary χ2trend = (|∑A| − 0.5)2/(∑B), d.f. = 1
 
More Complex Techniques
Two complex (and very powerful) techniques for analyzing proportions and contingency tables are available. The first is logistic regression, a method for analyzing proportions analogous to multiple regression for continuous variables. The analysis of several 2 × 2 (or 2 × c) tables can be considered as a special case of this. For example, the leptospirosis data above show how the prevalence of antibodies varies with respect to two factors, sex and type of area. Logistic regression is more general than the chi-squared methods described above since it allows both the inclusion of continuous explanatory variables and the assessment of interaction between the variables. Logistic regression is so called because it investigates the linear dependence of the logistic transformation of the proportion on several explanatory variables, where the logistic transformation, or logit for short, is defined as:
Logit = loge [(proportion)/(1 – proportion)]
The model is fitted using a mathematical technique called maximum likelihood which also takes account of the fact that the variation of a proportion has a binomial distribution. Conditional logistic regression is the version of logistic regression appropriate for the analysis of matched data, which can be considered as a multivariate extension of McNemar's paired chi-square test.
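As an illustration of the approach (not an analysis reported in the text), the following minimal Python sketch, assuming the pandas and statsmodels packages, fits a logistic regression of antibody status on sex and type of area using the grouped counts of Table 7.6:

# Logistic regression on grouped (binomial) counts from the leptospirosis survey
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "sex":      ["male", "male", "female", "female"],
    "area":     ["rural", "urban", "rural", "urban"],
    "positive": [36, 50, 24, 10],
    "negative": [14, 50, 126, 90],
})

# logit(probability of antibodies) = intercept + sex effect + area effect
model = smf.glm("positive + negative ~ C(sex) + C(area)", data=df,
                family=sm.families.Binomial()).fit()
print(model.summary())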
A second and similar technique is the use of log linear models. A series of different models is fitted, as in multiple regression. The significance of a particular variable is assessed by fitting two models, one including and the other excluding that variable. The test is based on the difference of the deviances of the two models. This difference is distributed as chi-square, with degrees of freedom equal to the number of extra parameters fitted.
  • P = 0.05 is the conventional cut-off. There is a growing feeling that the use of such a cut-off should be discontinued and the exact P-value, which in any case comes automatically from computer packages, be used instead. The reader is then in a better position to decide how much significance to attach to the results. However, a threshold is always helpful in making definitive statements. A P < 0.05 is considered significant, and P < 0.001 even stronger evidence because the chance of Type I error is so much smaller.
  • Let us re-emphasize that inferences from statistical tests are probabilistic rather than definitive. A P < 0.05 can be interpreted to say that if 100 such samples are taken from the same target population, more than 95 are likely to show some difference from the pattern postulated in H0. The chance of error is controlled to less than 5%. This certainly works in the long run but might fail in one particular case.
  • The use of chi-square for categorical data is well established, but χ2 itself is a continuous variable. Theoretically, it is based on an approximation. This approximation works fine when the expected frequency in each cell is not less than 5. This is generally met when n is large and no cell probability is very small. Without Ek ≥ 5 for each k = 1, 2, …, K, the use of chi-square is not considered valid. When the number of categories K is large, a small relaxation can be given. A rule of thumb is that not more than one-fifth of the categories (i.e. K/5) should have Ek < 5, and none should be less than 1. When small frequencies occur in many cells (either because of a small sample or because of very small π in some cells), an exact multinomial test should be used instead. It can only be applied with the help of a computer and is available in a few computer packages.
  • The chi-square is basically a two-tail test. Significance in this case implies only the presence of some difference, and it can seldom be labeled positive or negative. If the observed frequency is less than the expected in one cell, it has to be more in one or more of the other cells, because the total for both the observed and the expected cell frequencies is the same.
  • The χ2 criterion is the sum ∑k (Ok − Ek)2/Ek. This would be large even if only one difference (Ok − Ek) is large. Thus, rejecting H0 only tells us that there is at least one cell where the frequency is substantially different from the expected. It does not say where. On the other hand, if a large difference is present in only one cell, it can be masked by the small differences in the other cells.
  • Although the chi-square test is very versatile and is the most frequently used non-parametric test of significance, in a few situations it:
  1. Has very little power, for example, while testing the effect of seasonality.
  2. Is not appropriate (i.e. the proposed test statistic does not follow the chi-square distribution), for example, while testing the effect of birth order.
  3. Will not lead to any meaningful conclusion, for example, when evaluating the effect of an intervention program with respect to some categorical response variable.
In such situations a test specifically developed for that purpose should be used. Such alternative tests are critically overviewed and compared in “On potentials of PRE” in Statistical Methods and Application in Biology and Medicine, Proceedings of the first joint conference of ISMS and IBSIR, NIMHANS, Bangalore (1999), by the same author. A novel application of Proportionate Reduction in Error (PRE) measures is proposed and illustrated there for the third situation listed above. Although one may not think of using the chi-square test in this situation, no alternative methods are available at present, and this situation is faced very often, especially in the nutrition field. A few possible indirect ways of analysis are also reviewed and compared in the same reference.

Correlation and Regression8

In this chapter we will first discuss the closely related techniques of correlation and linear regression for investigating the linear association between two continuous variables. Correlation measures the closeness of the association, while linear regression gives the equation of the straight line that best describes it and enables the prediction of one variable from the other. We will then discuss the ‘logistic regression’ technique.
 
 
Correlation
Displayed below are body weight and plasma volume of eight healthy men (Table 8.1). A scatter diagram of these data shows that high plasma volume tends to be associated with high weight and vice versa. This association is measured by the correlation coefficient, r.
where x denotes weight, y denotes plasma volume, and
and
are the corresponding means. The correlation coefficient is always a number between − 1 and + 1, and equals zero if the variables are not associated. It is positive if x and y tend to be high or low together, and the larger its value the closer the association. The maximum value of 1 is obtained if the points in the scatter diagram lie exactly on a straight line. Conversely, the correlation coefficient is negative if high values of y tend to go with low values of x and vice versa.
The correlation coefficient is more conveniently calculated by noting:
It is important to note that a correlation between two variables shows that they are associated but does not necessarily imply a ‘cause and effect’ relationship.
Table 8.1   Plasma volume and body weight in eight healthy men
Subject; Body weight (kg); Plasma volume (l)
1: 58.0, 2.75
2: 70.0, 2.86
3: 74.0, 3.37
4: 63.5, 2.76
5: 62.0, 2.62
6: 70.5, 3.49
7: 71.0, 3.05
8: 66.0, 3.12
In this example:
therefore:
and:
 
Significance Test
A t test is used to test whether r is significantly different from zero, or in other words whether the observed correlation could simply be due to chance.
In this example:
This is significant at the 5% level, confirming the significance of the apparent association between plasma volume and body weight.
The significance level is a function of both the size of the correlation coefficient and the number of observations. Note that a weak correlation may therefore be statistically significant if based on a large number of observations, while a strong correlation may fail to achieve significance if there are only a few observations.
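A minimal sketch in Python, assuming SciPy is available, reproduces r and the corresponding two-sided P-value for the data of Table 8.1:

# Pearson correlation between body weight and plasma volume
from scipy import stats

weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma_volume = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

r, p_value = stats.pearsonr(weight, plasma_volume)
print(f"r = {r:.2f}, two-sided P = {p_value:.3f}")   # r about 0.76, P < 0.05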
 
Linear Regression
Linear regression gives the equation of the straight line that describes how the y variable increases (or decreases) with an increase in the x variable. The choice of which variable to call y is important because, unlike for correlation, the two alternatives do not give the same result. y is commonly called the dependent variable and x the independent or explanatory variable. In this example, it is obviously the dependence of plasma volume on body weight that is of interest.
The equation of the regression line is:
where ‘a’ is the intercept and ‘b’ the slope of the line. The values for ‘a’ and ‘b’ are calculated so as to minimize the sum of the squared vertical distances of the points from the line. This is called a least squares fit. The slope ‘b’ is sometimes called the regression coefficient. It has the same sign as the correlation coefficient. When there is no correlation, ‘b’ equals zero, corresponding to a horizontal regression line at the height of the mean of y.
and
In this example,
b = 8.96/205.38 = 0.043615 and
a = 3.0025 − 0.043615 × 66.875 = 0.0857
Thus the dependence of plasma volume on body weight is described by:
Plasma volume = 0.0857 + 0.0436 × weight.265
The regression line is drawn by calculating the coordinates of two points which lie on it. For example:
and
As a check, the line should pass through the point given by the means of x and y.
The calculated values for ‘a’ and ‘b’ are sample estimates of the values of the intercept and slope from the regression line describing the linear association between x and y in the whole population. They are, therefore, subject to sampling variation and their precision is measured by their standard errors.
where
s is the standard deviation of the points about the line. It has (n − 2) degrees of freedom.
and
 
Significance Test
A t test is used to test whether b differs significantly from a specified value, denoted by β (Greek letter beta).
In particular it may be used to test whether b is significantly different from zero. This is exactly equivalent to the t test of r = 0. Testing this for the plasma volume and body weight example gives:
which, apart from rounding error, is the same as the t value of 2.86, obtained above, for the correlation coefficient. Thus plasma volume does increase significantly (P< 0.05) with body weight.
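A minimal sketch in Python, assuming SciPy is available, reproduces the least squares fit for the plasma volume data; the slope, intercept and t value agree with those above apart from rounding.

# Linear regression of plasma volume on body weight
from scipy import stats

weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]
plasma_volume = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

fit = stats.linregress(weight, plasma_volume)
t_for_slope = fit.slope / fit.stderr
print(f"b = {fit.slope:.4f}, a = {fit.intercept:.4f}, "
      f"t = {t_for_slope:.2f}, P = {fit.pvalue:.3f}")

# predicted plasma volume for a man weighing 66 kg (used again below)
print(f"prediction at 66 kg: {fit.intercept + fit.slope * 66:.2f} litres")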
 
Prediction
In some situations it may be useful to use the regression equation to predict the value of y for a particular value of x, say x'. The predicted value is:
and its standard error is:
This standard error is least when x' is close to the mean of x. In general, one should be reluctant to use the regression line for predicting values outside the range of x in the original data, as the linear relationship will not necessarily hold true beyond the range over which it has been fitted.
In the example, the measurement of plasma volume is time-consuming and so, in some circumstances, it may be convenient to predict it from the body weight. For instance, the predicted plasma volume for a man weighing 66 kg is:
0.0857 + 0.0436 × 66 = 2.96 litres
and its standard error equals:
 
Assumptions
There are two assumptions underlying the method of linear regression. The first is that, for any value of x, y is normally distributed. The second is that the magnitude of the scatter of the points about the line is the same throughout the length of the line. This scatter is measured by the standard deviation, s, of the points about the line, as defined above. A change of scale may be appropriate if either of these assumptions does not hold, or if the relationship seems non-linear.
Note: There are situations like the above where both techniques are applicable, but there are a few where only ‘correlation’ is applicable. Consider the example of ‘height of brother and sister’. When the pairs are ‘height of son and father’, one may apply both correlation and regression. But when neither variable in the pair can be regarded as dependent on the other, we can only calculate a correlation coefficient. The following guidelines may be remembered.
  1. Can one of the variables be predetermined, or altered? Use regression.
  2. Do you wish to summarize the strength of an association? Use correlation.
 
NONPARAMETRIC CORRELATION COEFFICIENT: SPEARMAN'S RANK CORRELATION COEFFICIENT
 
Function
Of all the statistics based on ranks, Spearman's rank correlation coefficient was the earliest to be developed and is perhaps the best known today. This statistic, sometimes called rho and represented by rs, is a measure of association which requires that both variables be measured on at least an ordinal scale so that the objects or individuals under study may be ranked in two ordered series.
 
Rationale
Suppose N individuals are ranked according to two variables. For example, we might arrange a group of students in the order of their scores on the college entrance test and again in the order of their scholastic standing at the end of the freshman year. If the ranking on the entrance test is denoted as X1, X2, X3, …, XN and the ranking on scholastic standing is represented by Y1, Y2, Y3, …, YN, we may use a measure of rank correlation to determine the relation between the X's and the Y's.
We can see that the correlation between entrance test ranks and scholastic standing would be perfect if and only if Xi = Yi for all i's. Therefore, it would seem logical to use the various differences di = Xi − Yi as an indication of the disparity between the two sets of rankings. Suppose A received the top score on the entrance examination but places fifth in her class in scholastic standing. Her d would be −4. B placed tenth on the entrance examination but leads the class in grades. His d is 9. The magnitude of these various di's gives us an idea of how close the relation is between entrance examination scores and scholastic standing. If the relation between the two sets of ranks were perfect, every d would be zero. The larger the di's, the less perfect must be the association between the two variables.
Now in computing a correlation coefficient it would be awkward to use the di's directly. One difficulty is that the negative di's would cancel out the positive ones when we try to determine the total magnitude of the discrepancy. However, if di2 is employed rather than di, this difficulty is circumvented. It is clear that the larger the various di's are, the larger will be the value of ∑di2.
If x = X − X̄, where X̄ is the mean of the scores on the X variable, and similarly y = Y − Ȳ, then a general expression for a correlation coefficient is
r = ∑xy/√(∑x2 ∑y2)
in which the sums are over the N values in the sample. Now when the X's and Y's are ranks, r = rs, and this reduces to
rs = 1 − [6∑di2/(N3 − N)]
which is the most convenient formula for computing the Spearman rs.
As part of a study of the effect of group pressure for conformity upon an individual in a situation involving monetary risk, the researchers administered the well-known F scale, a measure of authoritarianism, and a scale designed to measure social status strivings to 12 college students.
Table 8.2   Scores on authoritarianism and social status strivings
Student; Authoritarianism score; Social status strivings score
A: 82, 42
B: 98, 46
C: 87, 39
D: 40, 37
E: 116, 65
F: 113, 88
G: 111, 86
H: 83, 56
I: 85, 62
J: 126, 92
K: 106, 54
L: 117, 81
Information about the correlation between the scores on authoritarianism and those on social status strivings was desired (Table 8.2).
The table gives each of the 12 students' scores on the two scales. In order to compute Spearman's rank correlation between these two sets of scores, it was necessary to rank them in two series. The ranks of the scores are shown in Table 8.3.
Table 8.3   Ranks on authoritarianism and social status strivings
Student; Authoritarianism rank; Social status strivings rank; di; di2
A: 2, 3, −1, 1
B: 6, 4, 2, 4
C: 5, 2, 3, 9
D: 1, 1, 0, 0
E: 10, 8, 2, 4
F: 9, 11, −2, 4
G: 8, 10, −2, 4
H: 3, 6, −3, 9
I: 4, 7, −3, 9
J: 12, 12, 0, 0
K: 7, 5, 2, 4
L: 11, 9, 2, 4
∑di2 = 52
We observe that for these 12 students the correlation between authoritarianism and social status strivings is rs = 0.82.
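A minimal sketch in Python, assuming SciPy is available, reproduces this value directly from the raw scores of Table 8.2 (the ranking is done internally):

# Spearman's rank correlation for the authoritarianism example
from scipy import stats

authoritarianism = [82, 98, 87, 40, 116, 113, 111, 83, 85, 126, 106, 117]
status_strivings = [42, 46, 39, 37, 65, 88, 86, 56, 62, 92, 54, 81]

rs, p_value = stats.spearmanr(authoritarianism, status_strivings)
print(f"rs = {rs:.2f}, P = {p_value:.4f}")   # rs about 0.82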
Tied observations: Occasionally two or more subjects will receive the same score on the same variable. When tied scores occur, each of them is assigned the average of the ranks which would have been assigned had no ties occurred, our usual procedure for assigning ranks to tied observations.
If the proportion of ties is not large, their effect on rs is negligible, and formula may still be used for computation. However, if the proportion of ties is large, then a correction factor must be incorporated in the computation of rs.
When a considerable number of ties are present, one uses the following formula for computing rs:
rs = (Σ x² + Σ y² − Σ di²) / [2 √(Σ x² Σ y²)]
where
Σ x² = (N³ − N)/12 − Σ Tx  and  Σ y² = (N³ − N)/12 − Σ Ty,
with T = (t³ − t)/12 computed for each group of t tied observations on the respective variable.
Multiple Regression
Situations frequently occur in which we are interested in the dependency of a variable on several explanatory variables, not just one. For example, 100 women attending an antenatal clinic took part in a study to identify variables associated with birth weight of the child, with the eventual aim of predicting women ‘at risk’ of having a low birth weight baby. The results showed that birth weight was significantly related to age of mother, height of mother, parity, period of gestation and family income, but that these variables were not independent. For example, height of mother and period of gestation were positively correlated. The joint influence of the variables, taking account of possible correlations among them, may be investigated using multiple regression, which will be discussed in the context of this example.
 
Analysis of Variance Approach to Simple Linear Regression
Consider the relationship of birth weight with height of mother. A scatter diagram suggested that this was linear. The correlation coefficient was 0.26 (P< 0.01).
Linear regression gives the equation of the straight line that best describes the relationship as
Birth weight = a + b × height of mother
where a and b are calculated so as to minimize the sum of squared deviations of the points about this line. The sum of squared deviations about the line which remains after this minimization process has been carried out is called the residual sum of squares. This is less than the total sum of squares of the overall variation in the birth weight by an amount which is called the sum of squares explained by the regression of birth weight on height of mother, or simply the regression sum of squares. This splitting of the total sum of squares of the overall variation in birth weight into two parts can be laid out in an analysis of variance table (Table 8.4). There is 1 degree of freedom for the regression and n − 2 = 98 degrees of freedom for the residual.
If there were no real association between the variables, then the regression mean square would be about the same size as the residual mean square, while if the variables are associated it would be larger. This is tested using an F test:
F = (regression mean square) / (residual mean square)
F should be about 1 if there is no association, and larger if there is. This F test is equivalent to the t tests of b = 0 and r = 0 described earlier. In this example F = 1.4800/0.2081 = 7.11 with (1, 98) degrees of freedom. From the Table in the appendix, the 5% point for F (1, 60) is 4.00 and for F (1, 120) is 3.92. The 5% point for F (1, 98) therefore lies between 3.92 and 4.00, and so F = 7.11 is significant at the 5% level.
Table 8.4   Analysis of variance for the linear regression of birth weight on height of mother (n = 100)
Source of variation               Sum of squares (SS)    Degrees of freedom (d.f.)    Mean square (MS = SS/d.f.)    F
Regression on height of mother    1.48                   1                            1.4800                        7.11, P < 0.01
Residual                          20.39                  98                           0.2081
Total                             21.87                  99
 
Relationship Between Correlation Coefficient and Analysis of Variance Table
The analysis of variance table gives an alternative interpretation of the correlation coefficient: the square of the correlation coefficient, r², equals the regression sum of squares divided by the total sum of squares (0.26² = 0.0676 ≈ 1.48/21.87) and thus is the proportion of the total variation that has been explained by the regression. We can say that height of the mother accounts for 6.76% of the total variation in birth weight. r² is called the coefficient of determination.
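As an aside, the F ratio, its P-value and r² can be recomputed from the sums of squares quoted in Table 8.4; a minimal sketch, assuming scipy is available, is:

```python
# Recompute F, its P-value and r^2 from the sums of squares in Table 8.4.
from scipy.stats import f as f_dist

ss_regression, df_regression = 1.48, 1
ss_residual, df_residual = 20.39, 98
ss_total = ss_regression + ss_residual                 # 21.87

ms_regression = ss_regression / df_regression          # 1.4800
ms_residual = ss_residual / df_residual                # about 0.2081
F = ms_regression / ms_residual                        # about 7.11
p_value = f_dist.sf(F, df_regression, df_residual)     # < 0.01
r_squared = ss_regression / ss_total                   # about 0.068

print(round(F, 2), round(p_value, 4), round(r_squared, 4))
```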
 
Multiple Regression with Two Variables
Consider now the relationship of birth weight with the two variables: height of mother and period of gestation. Scatter diagrams suggested that the relationship of birth weight with each of these was linear. The correlations were 0.26 (P < 0.01) and 0.39 (P < 0.001) respectively. The joint relationship can be represented by the multiple regression equation
Y = a + b1 X1 + b2 X2
which in this example is:
Birth weight = a + b1 height of mother + b2 period of gestation.
This means that for any period of gestation, birth weight is linearly related to height of mother, and also, for any height of mother, birth weight is linearly related to period of gestation. b1 and b2 are called partial regression coefficients and the corresponding correlations are partial correlations. Note that b1 and b2 will be different from the ordinary regression coefficients unless the two explanatory variables are unrelated.
Although it is no longer easy to visualize the regression, the principles involved are the same. Each observed birth weight is compared with (a+b1 height of mother +b2 gestation). a, b1 and b2 are chosen to minimize the sum of squares of these differences or, in other words, the variation about the regression. The results are laid out in an analysis of variance table (Table 8.5). There are now 2 degrees of freedom for the regression as there are two explanatory variables.
Table 8.5   Analysis of variance for the multiple regression of birth weight on height of mother and period of gestation
Source of variation                                        SS       d.f.    MS        F
Regression on height of mother and period of gestation    4.43     2       2.2150    12.32, P < 0.001
Residual                                                   17.44    97      0.1798
Total                                                      21.87    99
The F test for this regression is 12.32 with (2,97) degrees of freedom and is significant. The regression accounts for 20.26% (4.43/21.87) of the total variation. This proportion equals R2, where R = √(0.2026) = 0.45 is defined as the multiple correlation coefficient. R is always positive as no direction can be attached to a correlation based on more than one variable.
The sum of squares due to the regression of birth weight on both height of mother and period of gestation comprises the sum of squares explained by height of mother (as calculated in the simple linear regression) plus the extra sum of squares explained by the period of gestation after allowing for height of mother (Table 8.6). This extra variation can also be tested for significance using an F test. The residual mean square from the multiple regression is used.
Thus the model with both variables is an improvement over the model with just height of mother, because the effect of period of gestation is significant even when height of mother is taken into account.
Table 8.6   Individual contributions of height of mother and period of gestation to the multiple regression including both variables
(a) Height of mother entered into the multiple regression first
Source of variation                          SS      d.f.    MS      F
Height of mother                             1.48    1       1.48    8.23, P < 0.01
Gestation adjusting for height               2.95    1       2.95    16.41, P < 0.001
Height of mother and period of gestation     4.43    2
(b) Period of gestation entered into the multiple regression first
Source of variation                          SS      d.f.    MS      F
Period of gestation                          3.33    1       3.33    18.52, P < 0.001
Height adjusting for gestation               1.10    1       1.10    6.12, P < 0.025
Height of mother and period of gestation     4.43    2
Alternatively, the sum of squares explained by the regression with both variables comprises the sum of squares explained by period of gestation (calculated by simple linear regression) plus the extra sum of squares due to height of mother after allowing for period of gestation. The model with both variables is an improvement over the model with just period of gestation, since the effect of height of mother is significant even when period of gestation is taken into account.
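A hedged sketch of this "extra sum of squares" F test, using only the figures quoted in Tables 8.5 and 8.6 and assuming scipy is available, is given below; it reproduces the F values for both orders of entry.

```python
# "Extra sum of squares" F tests for adding a second explanatory variable.
from scipy.stats import f as f_dist

ss_both = 4.43              # height of mother + period of gestation, 2 d.f.
ss_height = 1.48            # height of mother alone
ss_gestation = 3.33         # period of gestation alone
ms_residual = 0.1798        # residual mean square of the two-variable model
df_residual = 97

def extra_ss_test(ss_full, ss_reduced):
    F = (ss_full - ss_reduced) / ms_residual
    return round(F, 2), f_dist.sf(F, 1, df_residual)

print(extra_ss_test(ss_both, ss_height))     # gestation after height: F about 16.4
print(extra_ss_test(ss_both, ss_gestation))  # height after gestation: F about 6.1
```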
 
Multiple Regression with Several Variables
Multiple regression can be extended to any number of variables, although it is recommended that the number be kept reasonably small, as with larger numbers the interpretation becomes increasingly complex. Some variables, such as age and sex, may be essential for inclusion in the multiple regression equation because it is considered necessary to adjust for their effects before investigating other relationships. The inclusion of other variables may be based on the strengths of their associations with the dependent variable. The object should be to achieve a balance between, on the one hand, including sufficient variables to obtain the best possible agreement between the multiple regression equation and the data and, on the other hand, not including so many variables as to make the relationship too complex to interpret. The selection of variables can be made in one of three ways.
 
Step-up Regression
Simple linear regressions are carried out on each of the explanatory variables. The one accounting for the largest percentage of variation is chosen and kept as the first variable. Multiple regressions with two variables are then carried out, adding each of the other explanatory variables in turn. The two-variable regression accounting for the largest percentage of variation is chosen. The process continues, an additional variable being selected at each stage. It stops either when the extra variation accounted for by adding a further variable is non-significant for all the remaining variables or when a preset limit for the maximum number of variables in the multiple regression has been reached.
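Purely as an illustration (the book itself gives no code), a schematic Python sketch of this step-up procedure might look as follows; it assumes the candidate explanatory variables are the columns of a numpy array X and the dependent variable is a vector y.

```python
# Schematic step-up (forward) selection: greedily add the variable giving the
# largest drop in residual sum of squares, stopping when no addition is
# significant at the 5% level.
import numpy as np
from scipy.stats import f as f_dist

def rss(X, y):
    # residual sum of squares of a least-squares fit with an intercept
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    return float(np.sum((y - Xc @ beta) ** 2))

def step_up(X, y, alpha=0.05):
    n, k = X.shape
    chosen, remaining = [], list(range(k))
    current_rss = float(np.sum((y - y.mean()) ** 2))   # intercept-only model
    while remaining:
        trials = {j: rss(X[:, chosen + [j]], y) for j in remaining}
        best = min(trials, key=trials.get)
        df_res = n - len(chosen) - 2                   # residual d.f. after adding
        F = (current_rss - trials[best]) / (trials[best] / df_res)
        if f_dist.sf(F, 1, df_res) >= alpha:
            break                                      # extra variation not significant
        chosen.append(best)
        remaining.remove(best)
        current_rss = trials[best]
    return chosen                                      # indices of selected columns
```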
 
Step-down Regression
A multiple regression is performed using all the explanatory variables. Variables are then dropped, one at a time. At each stage, the variable chosen for exclusion is that which makes the least contribution to the explained variation. The process continues until all remaining variables are significant, or, if the number of remaining variables is still too many, until a preset limit for the maximum number of variables in the equation has been reached.
 
Optimal Combination Regression
The stepwise regressions of methods 1 and 2 do not necessarily reach the same final choice, even if they end up with the same number of explanatory variables included. Neither of them will necessarily choose the best possible regression for a given number of explanatory variables. A preferable procedure is to find which single variable is best, then which pair of variables is best, then which trio, etc. by performing regressions of all possible combinations. Note that, for example, the best pair will not necessarily include the best single variable, although often it does.
 
Regression for Prediction versus Regression for Control
Suppose that an investigator was studying the variables related to air pollution. He randomly selected 25 locations in metropolitan areas and measured both the noise level and the level of air pollution at each location at a randomly selected time of day. He then fitted a regression equation to the relationship between these two variables, using noise as the independent variable and air pollution as the dependent variable. He got a good fit to the data. If the equation is then used to predict the level of air pollution from the measured noise at a 26th similar location, a valid prediction (within statistical error limits) should result. However, if the investigator tried to use the regression equation to draw conclusions concerning the “effect” of noise upon air pollution, or to take action based upon this assumed effect, he is likely to be wrong.
This example illustrates the difference between regression for prediction and regression for control. It demonstrates that a regression analysis of unplanned data may show correlation but may not show “cause and effect”. Noise is not the cause of air pollution. Rather, the real sources of air pollution, like automobiles and factories, are also the sources of the noise. Since both noise and air pollution have many of the same sources, they are correlated. This correlation (not a cause-and-effect relationship) is the reason for the good fit of the regression equation to the data. Any cause-and-effect conclusions drawn directly from the regression (or correlation) analysis are likely to be nonsensical, even though the regression equation might fit the data well (or the correlation coefficient is significantly large).
Matters are often further complicated when there are close interrelationships among the independent variables, called “multicollinearity”. When a regression equation is to be used for prediction, multicollinearity is generally not an obstacle. It is a serious problem when one wishes to draw conclusions about the causal effects of specific independent variables.
 
Relative Importance of Different X-variables
The quantity βi² σi²/σy² measures the fraction of the variance of Y attributable to its linear regression on Xi. This fraction can be reasonably regarded as a measure of the relative importance of Xi. With a random sample from this population, the quantities bi² Σxi²/Σy² are sample estimates of these fractions. The square roots of these quantities, called the standard partial regression coefficients, have sometimes been used as measures of relative importance.
 
Multiple Regression with Discrete Explanatory Variables
It is often desirable to include discrete as well as continuous explanatory variables in a multiple regression analysis. For example, in the birth weight study, six of the 100 women had a mycoplasma infection during pregnancy, and on average the birth weight of their babies was lower. This factor is included by defining a dummy variable for infection which equals 1 for women who had a mycoplasma infection and zero for women who did not. The regression equation is:
Birth weight = a + b1 height of mother + b2 period of gestation + b3 infection
This is equivalent to a pair of equations:
Birth weight = (a + b3) + b1 height of mother + b2 period of gestation, for women who had a mycoplasma infection, and
Birth weight = a + b1 height of mother + b2 period of gestation, for women who did not have an infection.
The coefficient b3 measures the average difference in birth weight of babies born to mothers who had a mycoplasma infection compared to others of the same height and with the same period of gestation who did not have the infection. The additional sum of squares due to mycoplasma infection is found by exactly the same methods as described above. It is the difference between the sum of squares due to the three-variable multiple regression including the dummy variable representing infection and that due to the regression on just height of mother and period of gestation. It has 1 degree of freedom and is tested for significance using an F test.
Mycoplasma infection is a factor with two levels, presence or absence. Factors with more than two levels, such as age-group, are included by the introduction of a series of dummy variables to describe the differences. If a factor has k levels, k − 1 dummy variables are needed and the associated degrees of freedom equal k − 1.
 
Multiple Regression with Non-linear Explanatory Variables
It is often found that the relationship between the dependent variable and an explanatory variable is non-linear. There are three possible ways of incorporating such an explanatory variable in the multiple regression equation. The first and most versatile method is to redefine the variable into distinct subgroups and include it as a factor with a level corresponding to each subgroup, as described in the section above, rather than as a continuous variable. For example, age could be subdivided into five-year age-groups. The relationship with age would then be based on a comparison of the means of the dependent variable in each age-group and would make no assumption about the exact form of the relationship with age. At the initial stages of an analysis, it is often useful to include an explanatory variable in both forms, as a continuous variable and as a factor. The difference between the two associated sums of squares can then be used to assess whether there is an important non-linear component to the relationship. For most purposes, a subdivision into 3–5 groups, depending on the sample size, is adequate to investigate non-linearity of the relationship.
A second possibility is to find a suitable transformation for the explanatory variable. For example, in the birth weight study, it was found that birth weight was linearly related to the logarithm of family income rather than to family income itself. The third possibility is to find an algebraic description of the relationship. For example, it may be quadratic, in which case both the variable (x) and its square (x2) would be added into the equation.
 
Relationship Between Multiple Regression and Analysis of Variance
There is a large overlap between multiple regression and analysis of variance. A multiple regression where all of the explanatory variables are discrete is, in fact, the same as an analysis of variance with several factors. The two approaches give identical results. In such a situation it is recommended that the choice of method be based on the computer programs available and their relative ease of use.
Another closely related technique, which will not be described here in detail, is analysis of covariance. It is an alternative but equivalent approach to investigating differences between groups when there are also continuous explanatory variables. An example is the problem described above of comparing birth weights of babies between mothers who had mycoplasma infection during pregnancy and those who did not. Height of mother and period of gestation were additional explanatory variables and would be called covariates in this context. Analysis of covariance is a technique that combines the features of analysis of variance and regression. This technique is also used to increase precision in randomized experiments, to adjust for sources of bias in observational studies, to throw light on the nature of treatment effects in randomized experiments, to study regressions in multiple classifications, etc.
 
ASSESSING CAUSALITY
The six criteria listed below can be considered before making any causal inference. Each of these criteria is important in assessing the plausibility of a cause-and-effect hypothesis, but failure to satisfy any one of them, or even all of them, does not by itself refute the hypothesis.
The first criterion is the strength of the association or the magnitude of the effect. The stronger the association, the stronger the evidence. Consider the results from the studies of lung cancer and smoking. The death rate from lung cancer in cigarette smokers is 9 to 10 times the rate for nonsmokers, and in the case of heavy smokers the rate is 20 to 30 times as great. In the case of such a strong association, it is difficult to dismiss the likelihood of causality.
The second criterion is consistency of the association. Has the association been observed by several investigators at different times and places? As an example of a consistent association, observed under varied circumstances, consider the findings of the Advisory Committee's Report to the Surgeon-General of the United States (United States Department of Health, Education and Welfare (1964), Smoking and Health: Report of the Advisory Committee to the Surgeon-General of the Public Health Service, Washington, DC: Public Health Service Publication No. 1103). The Committee found an association between lung cancer and smoking in 29 of the 29 case-control studies and in 7 of the 7 cohort studies.
An association that has been observed under a wide variety of circumstances lends strength to a cause-and-effect hypothesis. After all, the only way to determine whether a statistically significant result is indeed not due to chance alone is by repetition. There are some situations, however, which cannot be repeated, and yet conclusions concerning cause and effect can be made. The large number of cancer deaths in the population of workers in the nickel refineries was far above that expected in the general population. With such a large difference in death rates, it was clear, even without any biological information to support the statistical evidence, that working in these nickel refineries was a grave hazard to health.
The third criterion is specificity of association. Has the association been limited to only certain populations, areas, jobs, and so on? If an association appears only for specific segments of a population, say only in nickel refinery workers, and not in other segments, then causation is easier to infer. However, the lack of specificity does not negate the possibility of cause and effect.
Specificity can be sharpened by refining and subdividing the variables of interest. As an example, consider again the Advisory Committee's report on smoking in Smoking and Health. The overall analysis showed an increase in the risk of death from all causes in smokers. However, specificity was increased and causal inference improved when causes of death such as cancer were subdivided into specific-site cancers. By examining the association between smoking and the various forms of cancer, the specificity of the smoking and lung cancer relationship became apparent. In addition, smoking was refined by considering the method and frequency of smoking. This allowed the various researchers to identify cigarettes as a major causal factor.
The fourth criterion is temporality or time sequence of events. This criterion is most important in diseases or processes which take a long time to develop. In order to assign cause to any variable, it is necessary to know that exposure to the potential causal variable occurred sufficiently far in advance of the effect or response for it to be considered causal. In some situations, it is a case of the chicken or the egg: which came first? Does a particular factor lead to a disease, or do the early stages of the disease lead to that particular factor?
Both cohort and case-control types of studies suffer from problems in determining temporality. In a cohort study based on nonrandom assignment, one can argue that an apparent association may be due to the fact that “pre-existing attributes lead the cases to select themselves for the supposedly causal experience”. Such an argument suggested that persons may be genetically predisposed to lung cancer as well as to smoking. In case-control studies, since the information is obtained in retrospect, the precedence of the various factors may not be ascertainable. For example, unless a patient's medical records are available, the time of a patient's exposure to a suspected causal medication may not be ascertainable. The investigator then has to rely on a patient's potentially faulty recall.
The fifth criterion is the existence of a biological gradient or dose-response relationship. Causal inference can be strengthened if a monotonic relationship is present between the effect or response and the level of the hypothesized causal factor. In the smoking and lung cancer studies the death rate from lung cancer rose linearly with the number of cigarettes smoked daily. Combining this with the higher death rate for smokers than nonsmokers provided causal evidence. While the existence of an ascertainable dose-response relationship will strengthen causal inference, the lack of such a relationship does not by itself lead to rejection of the hypothesis.
The final criterion is coherence of the association. Do the results conflict with known facts concerning the development of the disease or condition being studied? Even if the results do conflict, one need not dismiss the apparent association; the coherence of the result is conditional on the present state of knowledge.
Our final advice to the investigator is to maintain some degree of skepticism throughout the duration of the study, so that one can objectively attempt to eliminate all plausible explanations of a hypothesized result other than one's own. When you have eliminated the impossible, whatever remains, however improbable, must be the truth.
  • Regression problems consider the frequency distributions of one variable when another is held fixed at each of several levels. A correlation problem considers the joint variation of two measurements, neither of which is restricted by the experimenter. Examples of regression problems can be found in the study of the yields of crops grown with different amounts of fertilizer, the length of life of certain animals exposed to different amounts of radiation, the hardness of plastics which are heat-treated for different periods of time. In these problems the variation in one measurement is studied for particular levels of the other variable selected by the experimenter. Examples of correlation problems are found in the study of the relationship between IQ and school grades, blood pressure and metabolism, or height of cornstalk and yield. In these examples both variables are observed as they naturally occur, neither variable being fixed at predetermined levels.
 
LOGISTIC REGRESSION
A problem frequently encountered in epidemiologic research (a typical question of researchers) is: what is the relationship of one or more exposure (or study) variables (let us denote them by E) to a disease or illness outcome (let us denote it by D)? To illustrate, consider a dichotomous disease outcome with 0 representing not diseased and 1 representing diseased. The dichotomous disease outcome might be, for example, coronary heart disease (CHD) status, with subjects being classified as either 0, “without CHD”, or 1, “with CHD”. Suppose, further, that we are interested in a single dichotomous exposure variable, for instance, smoking status, classified as “yes” or “no”. The research question for this example is, therefore, to evaluate the extent to which smoking is associated with CHD status.
To evaluate the extent to which an exposure, like smoking, is associated with a disease, like CHD, it is often necessary to take into account or “control for” additional variables, for example, age, race and/or sex, which are not of primary interest. Let us label these three control variables as c1, c2, c3. In this example, the variables E (the exposure variables), together with c1, c2, c3 (the control variables), represent a collection of “independent” variables which we wish to use to describe or predict the “dependent” variable D. More generally, the independent variables can be denoted as X1, X2, and so on up to Xk, where k is the number of variables being considered.
We have a flexible choice for the X's, which can represent any collection of exposure variables, control variables, or even combinations of such variables of interest. Logistic regression is a mathematical modeling approach that can be used to describe the relationship of several X's to a dichotomous dependent variable such as D. Other modeling approaches are possible, but logistic regression is by far the most popular modeling procedure used in the analysis of epidemiologic data when the illness measure is dichotomous. Logistic regression is so popular because the logistic function, which describes the mathematical form upon which the logistic model is based, f(z) = 1 / (1 + e^−z), ranges between zero and one.
The model is designed to describe a probability, or more specifically, a risk for an individual. It is set up to ensure that whatever estimate of risk we get, it will always be some number between 0 and 1, which is the range for a probability. Thus, for the logistic model, we can never get a risk estimate either above 1 or below 0. This is not always true for other possible models, which is why the logistic model is often the first choice when a probability is to be estimated. Another reason why the logistic model is popular is that the shape of the logistic function is “elongated S” (Fig. 8.1).
As shown in the graph the S shape of f(z) indicates that the effect of z on an individual's risk is minimal for low z's up until some threshold is reached, then rises rapidly over a certain range of intermediate z values, and then remains extremely high around 1 once z gets large enough. This threshold idea is thought by epidemiologists to apply to a variety of disease conditions. In other words, an S-shaped model is considered to be widely applicable for considering the multivariable nature of an epidemiologic research question.
To obtain the logistic model from the logistic function, we write z as the linear sum
z = α + β1 X1 + β2 X2 + … + βk Xk
where the X's are independent variables of interest and α and the βi are constant terms representing unknown parameters.
Fig. 8.1: Logistic Model
In essence, then, z is an index which combines the X's together. Substituting the linear sum expression for z in the formula for f(z) gives
f(z) = 1 / (1 + exp[− (α + Σ βi Xi)]),  the sum being over i = 1 to k.
To view this expression as a mathematical model, we must place it in an epidemiologic context.
Consider the following general epidemiologic study framework: we have observed independent variables X1, X2, up to Xk on a group of subjects, for whom we have also determined disease status, as either 1 if “with disease” or 0 if “without disease”. We wish to use this information to describe the probability that the disease will develop during a defined study period, say T0 to T1, in a disease-free individual with independent variable values X1, X2, up to Xk measured at T0. The probability being modeled can be denoted by the conditional probability statement P (D = 1 | X1, X2, …, Xk). The model is defined to be logistic if the expression for the probability of developing the disease, given the X's, is
P (D = 1 | X1, X2, …, Xk) = 1 / (1 + exp[− (α + Σ βi Xi)])
The terms α and βi in this model represent unknown parameters which we need to estimate based on data obtained on the X's and on D (disease outcome) for a group of subjects.
Thus, if we knew the parameters α and the βi and we had determined the values of X1 through Xk for a particular disease-free individual, we could use this formula to plug in these values and obtain the probability that this individual would develop the disease over some defined follow-up time interval. For notational convenience, we will denote the probability statement P (D = 1 | X1, X2, …, Xk) simply as P(X), where the bold X is a short-cut notation for the collection of variables X1 through Xk. Thus, the logistic model may be written as
P(X) = 1 / (1 + exp[− (α + Σ βi Xi)])
To illustrate the use of the logistic model, suppose the disease of interest is D = CHD. Here CHD is 1 if a person has the disease and 0 if not. We have three independent variables of interest: X1 = CAT, X2 = AGE and X3 = ECG. CAT stands for catecholamine level, which is 1 if high and 0 if low, AGE is continuous, and ECG denotes electrocardiogram status, which is 1 if abnormal and 0 if normal.
Suppose that we have a data set of 609 subjects (males) on whom we have measured CAT, AGE and ECG at the start of the study. These people are then followed for 9 years to determine CHD status. Suppose that in the analysis of this data set, we consider a logistic model given by the expression
P(X) = 1 / (1 + exp[− (α + β1 CAT + β2 AGE + β3 ECG)])
We would like to “fit” this model. By the term “fit”, we mean that we use the data set to come up with estimates of the unknown parameters α, β1, β2, and β3. Using common statistical notation, we distinguish the parameters from their estimators by putting a “hat” symbol on top of a parameter to denote its estimator. Thus the estimators of interest here are α hat, β1 hat, β2 hat, and β3 hat. Suppose the results of our model fitting yield the following estimated parameters: α hat = − 3.911, β1 hat = 0.652, β2 hat = 0.029, and β3 hat = 0.342. Our fitted model thus becomes
P hat (X) = 1 / (1 + exp[− (− 3.911 + 0.652 CAT + 0.029 AGE + 0.342 ECG)])
We have replaced P by P “hat” in the formula because our estimated model will give us an estimated probability, not the exact probability.
Suppose we want to use our fitted model to obtain the predicted risk for a certain individual. To do this, we would need to specify the values of the independent variables (CAT, AGE, ECG) for this individual, and then plug these values into the formula for the fitted model to compute the estimated probability or risk for this individual. To illustrate the calculation of a predicted risk, suppose we consider an individual with CAT = 1, AGE = 40 and ECG = 0. Plugging these values into the fitted model yields the value 0.1090. Thus for a person with CAT = 1, AGE = 40 and ECG = 0, the predicted risk using the fitted model is 0.1090, that is, it is estimated that this person would have about an 11% risk for CHD. For the same fitted model, we compare the predicted risk of a person with CAT = 1, AGE = 40 and ECG = 0 with that of a person with CAT = 0, AGE = 40 and ECG = 0. For the first person the risk value is 0.1090. The second probability is computed the same way (but this time we must replace CAT = 1 by CAT = 0). The predicted risk for this person turns out to be 0.0600. Thus, using the fitted model, the first person with a high catecholamine level has an 11% risk for CHD whereas the second person with a low catecholamine level has a 6% risk for CHD over the period of follow-up of the study.
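A small sketch, assuming nothing beyond the fitted coefficients quoted above, reproduces these two predicted risks (and, anticipating the next paragraph, their ratio):

```python
# Predicted risks from the fitted CHD logistic model
# P(X) = 1 / (1 + exp(-(-3.911 + 0.652*CAT + 0.029*AGE + 0.342*ECG))).
import math

def predicted_risk(cat, age, ecg,
                   a=-3.911, b_cat=0.652, b_age=0.029, b_ecg=0.342):
    z = a + b_cat * cat + b_age * age + b_ecg * ecg
    return 1.0 / (1.0 + math.exp(-z))

high = predicted_risk(cat=1, age=40, ecg=0)   # about 0.109
low = predicted_risk(cat=0, age=40, ecg=0)    # about 0.060
print(round(high, 4), round(low, 4), round(high / low, 2))   # risk ratio about 1.82
```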
Notice that, in this example, if we divide the predicted risk of the person with high catecholamine by that of the person with low catecholamine, we get a risk ratio estimate, denoted by RR, of 1.82. Thus, using the fitted model, the person with high CAT has almost twice the risk of the person with low CAT, assuming both persons are of AGE 40 and with no previous ECG abnormality. We have just seen that it is possible to use a logistic model to obtain a risk ratio estimate that compares two types of individuals. If our study design is not a follow-up study or if some of the X's are not specified, we cannot estimate RR this way. Nevertheless, it may be possible to estimate RR indirectly. To do this, however, we first compute an odds ratio, usually denoted as OR, and we must make some assumptions. In fact, the odds ratio (OR), not the risk ratio (RR), is the only measure of association directly estimated from a logistic model, regardless of whether the study design is follow-up, case-control or cross-sectional. An important feature of the logistic model is that it is defined with a follow-up study orientation. That is, as defined, this model describes the probability of developing a disease of interest expressed as a function of independent variables presumed to have been measured at the start of a fixed follow-up period. Therefore it is natural to wonder whether the model can be applied to other studies, like case-control or cross-sectional. The answer to the question is an emphatic yes. The logistic model can be applied to case-control and cross-sectional data. Although logistic modeling is applicable to case-control and cross-sectional studies, there is one important limitation in the analysis of such studies. Whereas in follow-up studies, as has been demonstrated earlier, it is possible to use a logistic model to predict the risk for an individual with specified independent variables, it is not possible to predict individual risk for case-control or cross-sectional studies. In fact, only estimates of odds ratios can be obtained for case-control and cross-sectional studies.
Any odds ratio, by definition, is a ratio of two odds, written here as odds1 divided by odds0, where the subscripts indicate two individuals or two groups of individuals being compared. Now we give an example of an odds ratio in which we compare two groups, called group 1 and group 0. Using our CHD example involving the independent variables CAT, AGE and ECG, group 1 might denote persons with CAT = 1, AGE = 40 and ECG = 0, whereas group 0 might denote persons with CAT = 0, AGE = 40 and ECG = 0. More generally, when describing an odds ratio, the two groups being compared can be defined in terms of the bold X symbol, which denotes a general collection of X variables from 1 to k. Let X1 denote the collection of X's that specify group 1 and let X0 denote the collection of X's that specify group 0. Then, for the logistic model,
ROR (X1, X0) = exp[ Σ βi (X1i − X0i) ]
where the sum is over i = 1 to k.
We thus have a general exponential formula for the risk odds ratio from a logistic model comparing any two groups of individuals, as specified in terms of X1 and X0. Our example illustrates an important special case of the general odds ratio formula for logistic regression that applies to 0–1 variables. That is, it is possible to obtain an adjusted odds ratio by exponentiating the coefficient of a 0–1 variable in the model. In our example, that variable is CAT, while the other two variables, AGE and ECG, are the ones adjusted for. More generally, if the variable of interest is XL, a 0–1 variable, then e^bL (where bL is the coefficient of XL) gives an adjusted odds ratio involving the effect of XL, adjusted or controlling for the remaining X variables in the model. Suppose, for example, our focus had been on ECG, also a 0–1 variable, instead of on CAT in a logistic model involving the same variables CAT, AGE and ECG. Then e^b3 (where b3 is the coefficient of ECG) would give the adjusted odds ratio for the effect of ECG, controlling for CAT and AGE. Thus we can obtain an adjusted odds ratio for each 0–1 variable in the logistic model by exponentiating the coefficient corresponding to that variable.
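A one-line check of this rule, using the estimated coefficients quoted earlier for CAT and ECG (taken here as illustrative values of bL), is:

```python
# Adjusted odds ratios for 0-1 variables: exponentiate the fitted coefficient.
import math

b_cat, b_ecg = 0.652, 0.342
or_cat = math.exp(b_cat)   # adjusted OR for CAT, controlling for AGE and ECG
or_ecg = math.exp(b_ecg)   # adjusted OR for ECG, controlling for CAT and AGE
print(round(or_cat, 2), round(or_ecg, 2))   # about 1.92 and 1.41
```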
Note: A binary variable is also obtained when a continuous variable is dichotomized, such as diastolic BP ≤ 90 and > 90 mmHg. In all such cases, the dependent variable is actually the proportion or probability of subjects falling into a specified category. Denote this probability by π for the population and by p for the proportion in the sample. It has been observed that the results lend themselves to easy and useful interpretation if the binary dependent variable is transformed to the logit
ln [π / (1 − π)]
[π/(1 – π)] represents the odds for presence of response in the subjects. Logistic transformation also helps to linearize the relationship between dependent π and explanatory variables, whereas the relationship between untransformed π and other factors is generally nonlinear, taking the shape of S in place of a line. Since the logistic model is described in terms of logarithms, what is additive on logarithmic scale is multiplicative on the linear scale. Before using the model for any inferential purpose, however, it is essential to test the adequacy of the fit. Among several methods available to assess the adequacy of the model, ‘Log Likelihood’ and ‘Classification Accuracy’ are often used.
 
Log Linear Models
Log-linear models are sometimes very useful in analyzing multi-way tables. These models are easily understood and appreciated with the help of 2-way tables. Be it the hypothesis of independence or of homogeneity, the expected frequency under H0 is
Erc = (Or. × O.c) / n
This gives
ln (Erc) = µ + αr + βc
where µ = − ln (n), αr = ln (Or.) and βc = ln (O.c). These are the general “mean” and the main effects of rows and columns respectively. They can be redefined to satisfy conditions such as Σr αr = 0 and Σc βc = 0 that sometimes help to make these quantities more interpretable. The model says that the logarithm of the expected cell frequency is a linear combination of an overall effect, the effect of the rth category of the first variable and the cth category of the second attribute. Hence the name log-linear models. If H0 is not true then ln (Erc) will be something else. Let the difference be denoted by θrc. Then,
ln (Erc) = µ + αr + βc + θrc
The component θrc measures interaction. θrc = 0 for all (r, c) implies independence or homogeneity. The test of the hypothesis of no interaction (θrc = 0 for all r, c) is the usual χ². In the case of log-linear models, however, it is more convenient to use another criterion called G². For a two-way table, this is defined as:
G² = 2 Σr Σc Orc ln (Orc / Erc)
This also follows a chi-square distribution with (R − 1) (C − 1) df when each Erc ≥ 5. In other words, χ2 and G2 are asymptotically (i.e. for large n) equivalent.
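The following sketch, with a small made-up 2 × 2 table purely for illustration, computes χ² and G² side by side and shows that they are indeed close:

```python
# Compare the classical chi-square statistic with G^2 for a 2-way table.
import math

observed = [[30, 20],
            [20, 30]]                                  # made-up counts
row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)

chi2, g2 = 0.0, 0.0
for r, row in enumerate(observed):
    for c, o in enumerate(row):
        e = row_totals[r] * col_totals[c] / n          # expected under H0
        chi2 += (o - e) ** 2 / e
        g2 += 2 * o * math.log(o / e)

print(round(chi2, 2), round(g2, 2))   # about 4.00 and 4.03, 1 d.f. for a 2x2 table
```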
A similar explanation can be given for a three-way table. In this case, the (3-way) log-linear model is
ln (Erct) = µ + αr + βc + γt + θrc + θrt + θct + θrct
This is called a saturated model since it contains all possible interactions. The estimates of the parameters of the model depend upon the hypothesis of interest. The adequacy of fit, or its lack, is tested by G². The model can be easily extended to 4 or more variables, though the interpretation quickly becomes complex. Generally, a model is specified by leaving out the interaction or the main effect under test, and the P-value corresponding to the value of G² obtained from the data is examined. A series of models may have to be fitted to come to a focused conclusion. Among the many uses of log-linear models, one is to evaluate the net association between two variables after removing the effect of the others.
  • A log-linear model does not consider any variable dependent on the others. If such dependence is to be considered then a variation of log-linear models, called the logit model, is used. This model examines the logarithm of the ratio of frequencies in those with the characteristic and those without the characteristic. Thus the dependent variable in the logit model should have dichotomous categories. If it is polytomous then it needs to be converted to meaningful dichotomous categories. Another method, particularly useful in the case of a dichotomous dependent variable, is logistic regression.
  • Just as in the case of the classical χ² = Σ (O − E)²/E, log-linear models disregard the order, if any, present in the categories of the variables. Each variable is considered on a nominal scale. If an ordinal scale is present and is to be given due consideration then different methods are required. Ordinal and metric categories allow investigation of the presence or absence of a trend or a gradient in proportions, as well as its nature.

Vital Statistics9

 
MEASURES OF MORBIDITY
Morbidity has been defined as ‘any departure (subjective or objective) from a state of physiological well-being.’ To measure the morbidity in a community two principal morbidity measures, namely, the incidence rate (which answers the question ‘How many new cases?’) and the prevalence rate (which answers ‘How common is the disease?’) are used.
Incidence rate is defined as “the number of new cases occurring in a defined population during a specified period of time.” It is given by the formula:
[(Number of new cases of specific disease during a given time period)/(Population at Risk)] × K
Where K is some constant to avoid small decimal fractions. Generally the period is one year and the value of K is 1000. Incidence measures the rate at which new cases are occurring (coming into being) in a unit (say per 1000) population during a specified (say annual) period. It is not influenced by the duration of the disease.
Prevalence rate is defined as “the number of cases of a particular disease (or having an attribute) existing in a population at a specified point in time, per unit (say per 1000) of the population at that time”. This is sometimes called point prevalence as we consider a point of time. The point prevalence rate is the census type of measure indicating how frequent a disease (or condition) is at that point in time. The point prevalence rate is given by the formula:
[{Number of all current cases (old and new) of a specified disease existing at a given point in time}/{Estimated population at the same point in time}] × 1000
When the term ‘Prevalence rate’ is used without any further qualification, it is taken to mean ‘point prevalence’. Point prevalence can be made specific for age, sex and other relevant factors or attributes.
When a period of time is considered for counting the number of cases, the rate is called the Period Prevalence Rate. It measures the frequency of all current cases (old and new) existing during a defined period of time (e.g. annual prevalence) expressed in relation to a defined population. The period prevalence rate is given by the formula:
[{Number of existing cases (new and old) of a specified disease during a given period of time interval}/{Estimated mid-interval population at risk}] × 1000
The following figure (Fig. 9.1) will help to clarify the differences in the above terminology.
Fig. 9.1: Valid subjects (cases) to be included for calculation of appropriate rates
 
Relationship between Prevalence and Incidence
Prevalence depends on two factors, the incidence and the duration of illness. Assuming that the population is stable and that incidence and duration are unchanging, the relationship between prevalence (P) and incidence (I) can be expressed as:
P = I × D
where D is the mean duration of the disease.
The above equation shows that the longer the duration of the disease, the greater its prevalence.
For example, for a stable population, if the incidence is 10 cases per 1000 population per year and the mean duration of disease is 2 years, then the prevalence is 10 × 2 = 20 per 1000 population. But if the mean duration is 10 years then the prevalence is 10 × 10 = 100 per 1000 population.
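A trivial sketch of this P = I × D calculation:

```python
# Prevalence from incidence and mean duration, under the stable-population assumption.
def prevalence(incidence_per_1000_per_year, mean_duration_years):
    return incidence_per_1000_per_year * mean_duration_years

print(prevalence(10, 2), prevalence(10, 10))   # 20 and 100 per 1000
```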
 
Special Incidence Rates
 
Attack Rate
An attack rate is an incidence rate used only when the population is exposed to risk for a limited period of time, such as during an epidemic. It relates the number of cases to the population at risk and reflects the extent of the epidemic. The attack rate is given by the formula:
[(Number of new cases of a specified disease during a specified time interval)/(Total population at risk during the same interval)] × 100
The attack rate is usually expressed as a percentage.
 
Secondary Attack Rate
It is defined as ‘the number of exposed persons developing the disease within the range of the incubation period, following exposure to the primary case’. It is given by the formula:
[(Number of exposed persons developing the disease within the range of the incubation period)/ (Total number of exposed or “susceptible” contacts)] × 100
The denominator consists of all persons who are exposed to the case, but may be restricted only to ‘susceptible’ contacts if means are available to distinguish the susceptible persons from the immune. Primary case is excluded from both the numerator and denominator.
 
Ascertainment Corrected Rates
Precise incidence or prevalence rates are central to epidemiology but they are also very difficult to achieve. A critical component of disease monitoring is the degree of undercount. One way to improve rates is by correcting them for the level of ascertainment. The ascertainment level can be considered a primary determinant in the calculation of a rate. Incidence estimates from passive surveillance systems are useful for accurate comparison (between different populations, communities or countries) if they are adjusted to reflect the degree of under-ascertainment. Active surveillance systems, such as population-based registries, would also benefit greatly from ascertainment correction, as 100% enumeration is typically too expensive, very labor intensive and almost impossible to achieve for broad monitoring of disease frequency. Capture-recapture methodology can be applied to evaluate the degree of undercount, and this estimate can be used to provide an ascertainment corrected rate.
 
Measures of Mortality
Mortality rates are used for a number of purposes. They may be employed in explaining trends and differentials in overall mortality, in deciding priorities for health action and the allocation of resources, in designing intervention programmes and in the assessment and monitoring of public health problems and programmes. They may even give important clues for epidemiological research. Commonly used measures are described below.
 
Crude Death Rate
It is defined as ‘number of deaths per 1000 estimated mid-year population in one year in a given place’. It is given by the formula:
[(Number of deaths during the year)/(Mid-year population)] × 1000
It measures the rate at which deaths are occurring from various causes in a given population during a specified period. However, it is very important to note that the crude death rate depends heavily on the age distribution of the population and the ratio of males to females. This must be remembered especially while using these rates in comparison with one another or as a measure of the success of some treatment or procedure.
One way to deal with this is to calculate ‘age-specific death rates’, which are defined as the number of deaths in a particular age group in a period per 1000 population in that age group in the period. Separate calculations may be made for males and females to give age-sex specific rates. The specific death rates could be ‘cause’ or ‘disease specific’, e.g. tuberculosis, cancer, accidents. They can also be made specific for many variables such as income, religion, housing, etc. The limitation of the crude death rate is exposed when we compare the age-specific rates between two populations, as shown in the following (hypothetical) example (Table 9.1).
Table 9.1   A comparison of death rates
                      Population A                                      Population B
Age group (years)     Population   Deaths   Age-specific DR per 1000    Population   Deaths   Age-specific DR per 1000
0                     500          2        4                           400          1        2.5
1–9                   2000         8        4                           300          1        3.3
10–19                 2000         12       6                           1000         5        5
20–34                 1000         10       10                          2000         18       9
35–59                 500          20       40                          2000         70       35
60+                   100          15       150                         400          50       125
All ages              6100         67       11.0                        6100         145      23.8
Comparison of the two populations shows that B has, in every age group, a lower (age-specific) death rate than A. Yet its death rate at all ages (the crude death rate) is more than double the rate of A. The seeming contradiction is due to differences in the age composition of populations. The higher crude death rate in population B is due to the older population compared with population A which has a relatively younger population; 72% of B's population is over age 20 and only 26% of A's population. The same problem arises frequently when we need to compare the death rates prevailing in some area at different times.
By moving away from the crude death rate to the more detailed age-specific rates, an important and attractive feature of the crude death rate, namely its ability to summarize the information in a single figure, is lost. Standardization restores this feature of the crude death rate by adjusting for the age differences. Standardization yields a single standardized or adjusted rate (after removing the confounding effect of the different age structures) by which the mortality experience can be compared directly.
 
Standardization
There are two methods of standardization, the direct and the indirect method (the latter is not described in this book). In the direct method of standardization, the mortality rates at ages (age-specific death rates) in the different populations (or at different times) are applied to some common standard population, to find out what the total death rate in that standard population would be if it were exposed first to A's rates and then to B's rates at each age. These rates are comparable with one another and show whether B's rates at ages would lead to a better or worse total rate than A's rates if they had populations of the same age type.
In the table below, a frequently used standard age composition (IARC Scientific Publications, No. 15 and WHO Health for All Series No. 4) is shown, with an example of the computation of the age-standardized death rates for populations A and B (Table 9.2).
Table 9.2   Computation of age standardized death rate: Direct method
                                          Population A                                Population B
Age group     Standard population         Age-specific     Expected number            Age-specific     Expected number
                                          death rate       of deaths                  death rate       of deaths
0             2,400                       4.0              10                         2.5              6
1–9           19,600                      4.0              78                         3.3              65
10–19         18,000                      6.0              108                        5.0              90
20–34         22,000                      10.0             220                        9.0              198
35–59         27,000                      40.0             1,080                      35.0             945
60+           11,000                      150.0            1,650                      125.0            1,375
All ages      1,00,000                                     3,146                                       2,679
Age standardized death rate                                31.46                                       26.79
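The direct standardization in Table 9.2 can be reproduced with a few lines of Python; the rates and standard population below are simply retyped from the table.

```python
# Direct age standardization: apply each population's age-specific death rates
# to the common standard population of Table 9.2.
standard = [2400, 19600, 18000, 22000, 27000, 11000]       # six age groups
rate_A = [4.0, 4.0, 6.0, 10.0, 40.0, 150.0]                # deaths per 1000
rate_B = [2.5, 3.3, 5.0, 9.0, 35.0, 125.0]                 # deaths per 1000

def standardized_rate(rates, standard):
    expected = sum(r * s / 1000 for r, s in zip(rates, standard))
    return 1000 * expected / sum(standard)                  # per 1000

print(round(standardized_rate(rate_A, standard), 2))        # 31.46
print(round(standardized_rate(rate_B, standard), 2))        # 26.79
```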
The age standardized death rates in populations A and B are 31.46 and 26.79 per 1000 respectively. Before standardization they were 11.0 and 23.8 per 1000. Taking a population of the same age distribution thus shows the more favorable mortality experience of B, and the fallacy of the crude rate is avoided. It is usual to use the national population as the standard when inter-regional comparisons are made. The standard population may also be created by combining the two populations, and the method may be used for comparing populations with respect to risk factors other than mortality experience. One example is given below.
In a study of lung cancer and smoking, 42% of cases and 18% of controls were heavy smokers. Age was found to be a confounding variable. Information is displayed in Table 9.3.
Table 9.3   Heavy smokers in lung cancer cases and controls
              Cases                                          Controls
Age           Total number    Heavy smokers                  Total number    Heavy smokers
                              Number    Percentage                           Number    Percentage
40–49         400             200       50                   100             50        50
50–59         100             10        10                   400             40        10
Total         500             210       42                   500             90        18
Because age is a confounding variable, age-adjustment was carried out by creating a standard population from the combined number of subjects in each age group (i.e. 500 in each). Calculations are shown in Table 9.4.
Table 9.4   Age-adjusted proportions
                                       Cases                               Controls
Age        Standard number             Heavy smokers                       Heavy smokers
           of subjects                 Percentage    Expected number       Percentage    Expected number
40–49      500                         50            250                   50            250
50–59      500                         10            50                    10            50
Total      1000                                      300                                 300
Age-adjusted rate (percentage)                       30                                  30
Standardization (i.e. standardized rates) is very sensitive to the choice of the standard population. However, it does not really matter if, by the use of a different standard population, the standardized rates of the areas (or groups) under study are changed, so long as their relative position is not changed materially, if the object is comparison only. And mostly the comparative position between the standardized rates of two areas (or groups), or between two points of time, remains the same when the standard population is changed (though very rarely serious differences may result).
 
Adjusted Mean/Rate
Standardization is a fairly general method and can be used in a large variety of setups. It can be used to standardize means or rates, although then it is called the adjusted mean or rate.
For Example: In a survey of 300 adults, the mean serum folate level (nmol/L) in current, former, and never smokers of different age groups was found as follows (hypothetical data for illustration). The number of subjects is in parentheses.
Smoking status      Age (years)                                      Unadjusted mean      Age-adjusted
                    20–39        40–59        60+                    for all adults       mean
Current smokers     6.5 (80)     7.0 (40)     8.5 (10)               6.8 (130)            7.1
Former smokers      7.5 (10)     8.5 (20)     9.0 (40)               8.6 (70)             8.1
Never smokers       7.0 (50)     8.0 (40)     9.0 (10)               7.6 (100)            7.7
Total               6.8 (140)    7.7 (100)    8.9 (60)               7.5 (300)
The unadjusted mean for all adults is the “crude” mean calculated in the usual manner, although with due consideration to the different numbers of subjects in the different age groups, as is needed for grouped data. This is called the unadjusted mean. For example, the unadjusted mean serum folate level for current smokers is
(80 × 6.5 + 40 × 7.0 + 10 × 8.5) / 130 = 885/130 = 6.8
The age distribution of current smokers is very different from that of former smokers. Of the 70 former smokers, as many as 40 are in the age group 60+ years. This number is only 10 out of 130 in the current smokers. This can affect the mean serum folate level because this level depends on age. The difference in means in persons with different smoking status thus is not necessarily real but could be partially due to the difference in their age structure. The effect of age disparity can be removed by calculating the age-adjusted mean. For this, the age distribution of the total subjects (last row) can serve as the standard. When this “standard” is used on the means in the people with different smoking status, the age-adjusted mean is obtained. The age-adjusted mean serum folate level for current smokers is
(140 × 6.5 + 100 × 7.0 + 60 × 8.5) / 300 = 2120/300 = 7.1
Similar calculations apply for the former smokers and the never smokers. The adjusted means are shown in the last column.
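A small sketch of this age adjustment, using the age distribution of all 300 adults (140, 100, 60) as the standard, reproduces the adjusted means in the last column:

```python
# Age-adjusted mean serum folate by smoking status (data retyped from the table).
standard = [140, 100, 60]                       # subjects aged 20-39, 40-59, 60+
means = {
    'current': [6.5, 7.0, 8.5],
    'former':  [7.5, 8.5, 9.0],
    'never':   [7.0, 8.0, 9.0],
}

def age_adjusted_mean(group_means, standard):
    return sum(m * w for m, w in zip(group_means, standard)) / sum(standard)

for status, m in means.items():
    print(status, round(age_adjusted_mean(m, standard), 1))   # 7.1, 8.1, 7.7
```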
The large difference in the unadjusted means in current and former smokers decreased considerably after age adjustment was done. Much of this difference was due to the differential age structure of the subjects in these two categories. This adjustment brought the groups to a common base with respect to age and made them comparable. Such adjustment can be done with respect to any characteristic and for any measure. Now consider the following hypothetical data, where rates (per 100, that is, percentages) are adjusted for a characteristic (for example, religion, or nutritional/socioeconomic status categories) (Table 9.5).
Table 9.5   Computation of adjusted rate
                                    Categories of characteristic (for which the adjustment is wanted)
Group    Item                       A         B         C         Total     Adjusted %
I        Disease +ves               3         2         3         8
         No. observed               50        80        120       250
         Percentage                 6.0       2.5       2.5       3.2       3.82
II       Disease +ves               2         3         1         6
         No. observed               30        50        80        160
         Percentage                 6.67      6.0       1.25      3.75      4.50
III      Disease +ves               15        6         0         21
         No. observed               150       30        20        200
         Percentage                 10.0      20.0      0.0       10.5      9.01
Total    No. observed               230       160       220       610
         Percentage (of 610)        37.70     26.23     36.07
When the distribution of the total subjects across the categories of the characteristic is used as the “standard” on the rates in the different groups, the rate adjusted with respect to that characteristic is obtained. For example, we expect 37.70 percent of the 250 subjects in group I to be in category A (which is 94.25 subjects); at the observed rate of 6.0 percent this means 5.65 disease +ves. For categories B and C these figures are 1.64 and 2.25 respectively. That makes the total 9.54 and the adjusted rate 9.54/250 = 3.82 percent. Similar calculations for group II and group III yield 4.02, 2.46, 0.72 (sum 7.2) and 7.54, 10.49, 0.0 (sum 18.03) respectively. The adjusted rates are shown in the last column.
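A short sketch of the same adjustment, using the category distribution of all 610 subjects as the standard; small differences from the figures in Table 9.5 are due to rounding in the worked calculation above.

```python
# Adjustment of group rates using the overall category distribution as standard.
standard = [230, 160, 220]                      # categories A, B, C (of 610 subjects)
rates = {                                       # percentage disease +ve by category
    'I':   [6.0, 2.5, 2.5],
    'II':  [6.67, 6.0, 1.25],
    'III': [10.0, 20.0, 0.0],
}

def adjusted_percentage(group_rates, standard):
    weights = [s / sum(standard) for s in standard]     # 0.3770, 0.2623, 0.3607
    return sum(w * r for w, r in zip(weights, group_rates))

for group, r in rates.items():
    print(group, round(adjusted_percentage(r, standard), 2))   # about 3.82, 4.54, 9.02
```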
 
Standardized Mortality Ratio (SMR)
The standardized mortality ratio (SMR) is a ratio (usually expressed as a percentage) of the total number of deaths that occurred in an area (or group) to the number of deaths that would have been expected to occur if the area (or group) had experienced the death rates of a standard population (or any other reference population). The standardized mortality ratio for any year shows the number of deaths registered in that year as a percentage of those which would have occurred had the age-specific mortality of some standard or reference year operated on the age distribution of the population in the year in question. Thus we have a standardized mortality ratio to express the rise or fall in mortality.
This particular index has been extensively used in the studies of occupational mortality. National population death rates are taken as the standard rates. Applying these to the population of a particular occupational group, we find the number of deaths that would have occurred in that group if it had experienced the national mortality rates. It gives a measure of the likely excess risk of mortality due to the occupation.
SMR = [(Observed deaths)/(Expected deaths)] × 100
If the ratio has a value greater than 100, then the occupation would appear to carry a greater mortality risk than that of the whole population. If the ratio has a value less than 100, then the occupation's risk of mortality would seem to be proportionately less than that for the whole population. Calculation of the SMR for a particular occupation, say coal workers, is shown in Table 9.6.
Table 9.6   Calculation of SMR for coal workers

Age group | National age-specific death rate (per 1000) | Coal workers: population | Expected deaths
25–34 | 3.0 | 400 | 1.2
35–44 | 5.0 | 300 | 1.5
45–54 | 8.0 | 200 | 1.6
55–64 | 25.0 | 100 | 2.5
Total | | 1000 | 6.8
With 10 deaths actually observed among the coal workers, SMR = (Observed deaths/Expected deaths) × 100 = (10/6.8) × 100 ≈ 147.
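The expected-deaths arithmetic behind Table 9.6 can be written as a small illustrative sketch (the rates are per 1000, as the table implies):

# Illustrative sketch: SMR for coal workers (Table 9.6).
national_rate_per_1000 = [3.0, 5.0, 8.0, 25.0]   # age groups 25-34, 35-44, 45-54, 55-64
coal_worker_population = [400, 300, 200, 100]
observed_deaths = 10                             # deaths actually recorded among coal workers

expected_deaths = sum(rate * n / 1000 for rate, n in
                      zip(national_rate_per_1000, coal_worker_population))
smr = observed_deaths / expected_deaths * 100
print(expected_deaths, round(smr))               # 6.8 expected deaths, SMR about 147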
As already pointed out, instead of age, the death rate could be specific to a disease. The disease-specific death (mortality) rate is the number of persons dying of the disease in a period per 1000 population within which the deaths occurred. There is also the disease-specific "case fatality rate", which is the number of persons dying of the disease in the period per 100 people who have the disease. This is really a ratio expressed per 100 but is often called a rate. Both have deaths attributable to the disease as the numerator, but their denominators (reference populations) differ.
From the case fatality rate we can obtain a standardized mortality ratio that enables comparison of deaths from a particular disease over a period of time. Suppose we know that the case fatality rate for disease X among men in 1975 was 2.5 per hundred. We use 1975 as the base year and obtain subsequent years' fatality figures relative to 1975. In 1980 the male population who might contract the disease (the at-risk population) was 50,000. Applying the 2.5 per hundred rate found in 1975 to the 1980 male population predicts 1250 deaths. If in fact 1450 deaths were observed, this represents a worse situation than that reflected by the 1975 rate. We obtain a relative picture by computing the standardized mortality ratio as (1450/1250) × 100 = 116.
 
Other Measures of Mortality
It is sometimes useful to know what proportion of total deaths are due to a particular cause (e.g. cancer) or what proportion of deaths are occurring in a particular age group (e.g. under 5 or above the age of 50 years). “Proportional mortality rate” expresses the number of deaths due to a particular cause or a specific age group per 100 or per 1000 total deaths. Thus we have
  1. Proportional mortality from a specific disease
    = [Number of deaths from the specific disease in a year/Total deaths from all causes in that year] * 100
  2. Under-5 Proportional mortality rate
    = [Number of deaths under 5 years of age during a given period/Total deaths during the same period] * 100
  3. Proportional mortality rate for aged 50 years and above
    = [Number of deaths of persons aged 50 years and above/Total deaths of all age groups in that year] * 100
The “infant mortality rate” (IMR) is the number of deaths of infants under 1 year old in a period (generally of 1 year) per 1000 live births in the period.
It is given by the formula:
IMR = [Number of deaths in a year of children less than 1 year of age/Number of live births in the same year] * 1000
IMR is described as a most sensitive index of health and level of living of people. It is often treated separately because infant mortality is the largest single age-category of mortality and deaths at this age are due to a peculiar set of diseases and conditions to which the adult population is less exposed or less vulnerable. IMR is affected rather quickly and directly by specific health programmes and hence may change more rapidly than the general death rate.
Further analysis based on this rate may require information on whether the death is in the first seven days, the first 28 days or later. The "early neonatal mortality rate" for a period (usually of 1 year) is the number of deaths occurring within the first week of life per 1000 live births in the period. The "neonatal mortality rate" is the number of deaths of infants under 28 days old per 1000 live births in the period, and the "post neonatal mortality rate" is the number of deaths of infants aged 28 days and over (up to 1 year) per 1000 live births. Thus the infant mortality rate is the sum of the neonatal and post neonatal mortality rates. The distinction between the two latter rates emphasizes congenital factors in the first 28 days as against environmental factors later.
In all the infant related mortality rates the population at risk is defined as the number of live births, since still births are not at risk of dying subsequently. The period usually used for the calculations is 1 year and as it is usually a calendar year, some of the deaths reported within the calendar year really relate to births of the preceding year. And of course some births during the year will not result in a death until early in the following calendar year. In a stable birth rate—mortality rate situation these marginal inaccuracies will not have much practical significance. If there are rapid changes in the birth rate, or if we are dealing with small numbers then ideally the deaths should be related to the relevant live births.
The definition of “still birth” is: A still born child is a child which has issued forth from the mother after the 28th week of pregnancy and which did not at any time after being completely expelled from the mother breathe or show any other sign of life. The “still birth rate” is defined as the number of still births in a period per 1000 births (live and still) in the period. There is possibly some overlap of the still birth rate with the early neonatal mortality rate. The “perinatal mortality rate” includes both still births and deaths within the first week, thus avoiding problems of distinguishing and recording correctly the time of death. The “perinatal mortality rate” is defined as the number of still births and deaths occurring in the first week of life per 1000 births (live and still).
Figure 9.2 shows the subdivision of the first year of life from conception and so helps to clarify the above terminology.
Fig. 9.2: Division of period indicating appropriate inclusion of period for calculation of different rates
A maternal death is defined as "the death of a woman while pregnant or within 42 days of termination of pregnancy, irrespective of the duration and site of the pregnancy, from any cause related to or aggravated by the pregnancy or its management." The "maternal mortality rate" measures the risk of women dying from "puerperal causes" and is given by the formula:
[Total number of female deaths due to complications of pregnancy, child birth or within 42 days of delivery from “puerperal causes” in an area during a given year/Total number of registered (live and still) births in the same area and year] * 1000.
This rate is sometimes expressed as per 1000 live births. Registered births are used as an approximation to pregnancies and the fact that multiple births can occur is ignored for the purpose of this calculation.
Life expectancy, or expectation of life at a given age, is the average number of years which a person at that age is expected to live under the mortality pattern prevalent in the community or country. Life expectancy at birth is used most frequently. It is a good indicator of socioeconomic development in general. An indicator of long-term survival, it can be considered a positive health indicator. The survival rate is the proportion of survivors in a group (e.g. of patients) studied and followed over a period (e.g. a 5-year period). It is a method of describing prognosis in certain disease conditions.
 
Survival
Survival is complementary to mortality, but the term is used when the concern is with duration instead of percentage. An example is the duration of survival of a cancer patient after detection of malignancy, as in the statement that the 5-year survival probability of a case of leukemia is 10%. The probability is expressed as a percentage but is related to the duration of survival. The most popular measure of survival is the expectation of life. The other measure commonly used in medicine is the survival function.
 
LIFE EXPECTANCY
The average number of years expected to be lived by individuals in a population is called the expectation of life. This can be calculated at birth or at any other age. The expectation of life at birth (ELB) can be crudely interpreted as the average age at death. The ideal method to compute the expectation of life is by observing a large cohort (called the radix) of live births for as long as any individual of the cohort is alive. This may take more than 100 years and thus is impractical. Instead, it is assumed that the individuals at different ages are exposed to the current risks of mortality. Thus, the current age-specific death rates are applied to a presumed cohort of, say, 100,000 persons. The average so obtained is the number of years a newborn is expected to live if the current levels of mortality continue. The method is used in many other applications as well. The expectation of life is a very popular measure of health on the one hand and of socio-economic development on the other. It is considered a very comprehensive indicator because many aspects of development seem to reflect on the longevity of people.
 
Life Table
The computation of the expectation of life at birth, or at any other age, is made convenient by preparing what is called a life table. It utilizes information on the people alive at different ages, the age-specific death rates, and the years lived by segments of the population at different ages. Even though it is desirable to do the computation for each single year of age (36 years, 37 years, etc.), mortality rates are generally available only for age groups such as 35–39 years. When such groups are used, the result is called an abridged life table. When each single year of age is used, it is called a complete life table.
The life table is a fairly general method and is used to compute many other indicators in a variety of situations. The more common of these are potential years of life lost (PYLL), healthy life expectancy, disability-adjusted life years (DALYs), and life expectation in the absence of a cause of death. The method can be used on any arrival-departure process, of which the birth-death process is a particular case. The basic structure is that subjects join the group at different points of time, remain in the group for varying periods, and leave at different points of time. For example, oral contraceptive users start taking the pill at different points of time, continue taking it for different periods, say from 1 to 36 months, and then stop. The life table method can be used to assess the average duration of continuation of intake in this situation.
There are two types of life tables: the cohort or generation life table and the current life table. The cohort life table provides a "longitudinal" perspective in that it follows the mortality experience of a particular cohort. For example, all persons born in the year 1900 are followed from the moment of birth through consecutive ages in successive calendar years, up to the death of the last member of the cohort. The difficulties involved in constructing a cohort life table are apparent. However, cohort life tables do have practical applications in studies of animal populations, and modified cohort life tables (called "clinical life tables") have proved useful in the analysis of patients' survival, or relief from any particular event (or many such events simultaneously, as in multiple decrement tables), in studies of treatment effectiveness. One such example, along with a clinical life table, is given in the chapter on designs. The better-known current life table is characterized as "cross-sectional"; it provides a "snapshot" of current mortality experience.
The life table is a statistical technique for estimating some clinically important parameters of a longitudinal nature, such as the probability of the outcome under study within any specified time interval, and the mean and median length of time until the outcome occurs. The life table is the method of choice for estimating these parameters because it bases its estimates on data from all patients who were followed up. The follow-up may be short (e.g. weeks) or long (e.g. years). It provides more precise estimates than does the "one point at a time" approach. That is, this technique takes care of 'serial intake' and 'serial dropouts', which are a common feature of many longitudinal clinical studies.
 
MEASURES OF FERTILITY AND REPRODUCTION
Human fertility is responsible for the biological replacement and maintenance of the human race. It affects not only the size of a population but also its age and sex structure. The age distribution of a population is more sensitive to changes in fertility behaviour than to changes in the force of mortality.
The crude birth rate (CBR) is defined as the ratio of the total live births in a calendar year in a particular area to the total mid-year population of that area, multiplied by a constant K (usually 1000). Symbolically, CBR = [Number of live births/Mid-year population] * 1000
The middle of the year is used because this is the point at which roughly half of the year's changes in the size of the population will have occurred. If the events that change the population size were distributed evenly throughout the year, the population at mid-year would be exactly the same as the average number of persons present during the year. CBR is a crude measure of childbearing because the denominator contains a large segment of the population that is not exposed to childbearing.
The effect of the age structure of the population on the CBR can be reduced by computing the sex-age adjusted birth rate (SAABR). This is defined as the number of births per 1000 of a weighted aggregate number of women in the various five-year age groups from 15 to 44 (UN Manual III). The UN has recommended a standard set of weights for computing this aggregate number of women. These are 1, 7, 7, 6, 4 and 1, and are nearly proportional to the typical relative fertility rates in the age groups 15-19, 20-24, 25-29, 30-34, 35-39 and 40-44. They have been derived from a study of the data of 52 countries, of which 15 had comparatively high fertility and 37 had low fertility.
In terms of notations,
SAABR = [Number of live births in a year/(1F1 + 7F2 + 7F3 + 6F4 + 4F5 + 1F6)] * 1000
where F1, F2, …………, F6 are the number of women in the age groups 15-19, 20-24, ……., 40-44 respectively.
The ratio of children under age 5 years to women of childbearing age is taken as an index of fertility derived from the age-sex distribution of population. The index is also referred to as general fertility ratio or index of effective fertility. One may define a measure of fertility by using the mid-year population of women in the childbearing age for the denominator of the rate instead of the total mid-year population. The rate so constructed is called “General fertility rate”(GFR). It is defined as the ratio of total number of yearly births to the total number of females (mid-year population) of childbearing ages, i.e.
GFR = [Number of live births/Number of females of reproductive age group]*1000.
This measure of fertility is further refined by relating "legitimate births" to married women. It is then called the general marital fertility rate (GMFR) and is defined as GMFR = [Legitimate births/Married female population of age (15-44) years] * 1000. The frequency of childbearing varies markedly from one age group to another within the population. The study of the age pattern of fertility throws light on some important aspects of reproduction in any community. The age-specific fertility rate (ASFR) is the number of births per year per 1000 women of a specified age. ASFR (nfx) = [Number of live births to mothers of a specified age group/Mid-year female population of the specified age group] * 1000.
The age-specific fertility rates may be computed either for each single year of age or for age intervals, usually five-year age intervals. The form of the age curve of fertility varies among different populations and may change in any population in the course of time. It may be noted that the ASFRs are not affected by variations in the age structure and, therefore, may be considered a better indicator of trend over time. Differences in the age pattern of childbearing may also be measured in terms of the median age of childbearing (median age of mother) or the mean age of childbearing. When a comparison between the reproductive performances of two populations is to be made, the fertility measures discussed so far are not very useful unless they have been standardized. This problem is solved by computing summary measures of fertility. Total fertility is one of them.
The total fertility rate (TFR) is defined as the sum of the age-specific fertility rates of women by single years of age from 15 to 44 (in some cases 15 to 49) and is expressed "per woman", i.e. the sum is divided by 1000. If the ASFRs are given for 5-year age intervals, then the sum is multiplied by 5. The TFR is equivalent to the number of births a woman of reproductive age would have in her complete reproductive lifetime if she experienced the ASFRs of a given year and experienced no mortality. Since the TFR is not affected by the age structure of the women under study, it is an effective summary rate for describing the frequency of childbearing in a year.
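To make the definition concrete, the short sketch below computes a TFR from a set of five-year ASFRs; all of the rates are invented purely for illustration.

# Illustrative sketch: total fertility rate from hypothetical five-year ASFRs.
# ASFRs are births per 1000 women per year in the age groups 15-19, 20-24, ..., 45-49.
asfr_per_1000 = [40, 180, 160, 90, 40, 15, 5]    # hypothetical values

# Each woman spends 5 years in each age group and the rates are per 1000 women,
# so multiply the sum by 5 and divide by 1000 to express the TFR "per woman".
tfr = 5 * sum(asfr_per_1000) / 1000
print(tfr)   # 2.65 children per woman for these hypothetical rates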
 
 
Gross Reproduction Rate (GRR)
The gross reproduction rate (GRR) is an example of a synthetic cohort measure; it is a special case of the TFR. While the TFR measures the total number of children that a cohort of women will have, the GRR measures the number of female children it will have. It is a sum of ASFRs based on female births only. For example, a GRR of 1.72 implies that if 100 mothers follow the current schedule of fertility, they will be replaced by 172 girls, assuming, of course, that no one dies during the reproductive life. In this way, it is a replacement index indicating potential fertility of the future.
 
Net Reproduction Rate (NRR)
The NRR indicates the experience of a hypothetical cohort of females which undergoes the current schedules not only of fertility but also of mortality. Thus, it is an improvement over the GRR in that adjustments are made for mortality. This adjustment is needed because some women die before completing their reproductive phase and thus do not have the chance to contribute fully to the fertility levels. The NRR is always less than the GRR.
The NRR measures the extent to which a cohort of newly born girls will replace their mothers under predetermined schedules of fertility as well as of mortality. An NRR of 1.0 is often referred to as "fertility at the replacement level". Since the value of the NRR depends not only on fertility but also on mortality conditions, the concept must be handled with caution. It may be mentioned that reaching a target of NRR = 1, or replacement-level fertility, is an important goal of the population policy of many countries with high fertility levels. Fertility control is an acknowledged programme in many developing countries. An indicator used to measure the performance of this programme is the couple protection rate (CPR). This is the percentage of couples that are using some method of fertility control. Generally only those couples where the age of the woman is between 15 and 45 years are counted. Usually, the fertility impact of a family planning programme is measured in terms of births averted. This is the number of births in the population that are averted by couples using some family planning method.
 
HOSPITAL STATISTICS
Only a few important and frequently used indices of utilization of health facility services are given here.
 
Average Duration of Stay in Health Facility
This is the total number of inpatient days of care provided to separated patients ("separation" means the termination of the occupation of a health facility bed by a patient through discharge, transfer to another health care institution, or death), inclusive of newborn babies, in a period, divided by the total number of separated patients exclusive of newborn babies. In computing the length of stay, the day of admission is counted but the day of discharge is not. Admission and separation on the same day is counted as one day. The formula for computing the average length of stay of inpatients is: Average duration of stay in health facility = (Total number of inpatient days of care provided to separated patients)/(Total number of separations)
 
Bed Occupancy Ratio
This is the ratio of occupied bed-days to the available bed-days as determined by bed capacity, during any given period of time.
The formula is: Bed occupancy ratio = [{(Actual number of occupied bed-days) × 100}/{Available bed-days}].
 
Turnover Interval
This is the mean number of days that a bed is not occupied between two admissions.
The formula is: Turnover interval = {(Number of vacant bed-days)/(Total number of separations)}.
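A compact sketch of the three indices, using figures invented purely for illustration:

# Illustrative sketch: hospital utilization indices with hypothetical figures.
inpatient_days_of_separated_patients = 4500   # hypothetical
separations = 900                             # hypothetical
available_bed_days = 200 * 30                 # e.g. 200 beds over a 30-day month (hypothetical)
occupied_bed_days = 5100                      # hypothetical
vacant_bed_days = available_bed_days - occupied_bed_days

average_stay = inpatient_days_of_separated_patients / separations
bed_occupancy_ratio = occupied_bed_days * 100 / available_bed_days
turnover_interval = vacant_bed_days / separations

print(average_stay, bed_occupancy_ratio, turnover_interval)
# 5.0 days, 85.0 percent, 1.0 day for these hypothetical figures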
 
Population
One of the important aspects of a population study is its age-sex structure. This is the distribution of people by age (or age group) and sex. Age-sex structure is diagrammatically represented by means of the age-sex pyramid, popularly known as the "population pyramid". In this diagram, two histograms showing the age distribution of the population, separately for males and females, are placed base to base opposite one another. For a developing country, the population pyramid generally has a broad base and a tapering top. For a developed country, it generally shows a bulge in the middle and has a narrower base. These are illustrated in Figure 2.6.
The other important aspect of a population is its growth. This is the difference between births and deaths, in some cases also affected by migration. If we exclude migration from our consideration, the population generally follows a trend described by the "demographic cycle". The various phases of this cycle are as follows:
Phase I:
High Stationary, where both the birth and death rates are high.
Phase II:
Early Expanding, where the birth rate is still high and the death rate is high but declining.
Phase III:
Late Expanding, with slowly falling birth rate and rapidly falling death rates.
Phase IV:
Low Stationary, where a low birth rate is balanced by a low death rate.
Phase V:
Declining, where the population will be declining with low death rates and lower birth rates.
Generally, population growth follows a geometric rate in developing countries and an arithmetic rate in developed countries.
We often need a population size estimate for a current date (estimates following the latest census are called post-censal estimates) or for past years (estimates for years between censuses are called inter-censal estimates). We also need population figures for future dates; these are known as population projections. Population estimates or projections can be made for a whole country or for regions within a country, and for all persons or for a segment of the population. The vital statistics method and mathematical methods (assuming arithmetic increase, geometric increase, or exponential increase) are popularly used for this purpose. Considering the limitations of the logistic curve for population projection, a component method is generally preferred by demographers. Discussion of all the details is beyond the scope of this book.
 
On Health Indexes
In the post-independence era there has been an enormous increase in the demand for health services in our country (India); concurrently, the objectives of the health services have expanded from the extension of life expectancy to the achievement of physical, mental and social well-being, i.e. health as defined by WHO. It has become apparent that meeting these objectives requires comprehensive health planning. The formulation and evaluation of policies, plans and programs, and the efficient and effective implementation of health activities, cannot be carried out without adequate knowledge of the existing health situation of a given population. For this, the decision makers in health services need a comprehensive numerical expression of health status. Thus a fundamental problem confronting health administrators and planners is the measurement of the level of health of their community; and nothing could be more valuable than to have at their command one or more yardsticks to help them in their task.
Indexes based on mortality, such as crude and age-adjusted death rates, infant mortality rates, and the expectation of life, have traditionally been used as measures of levels of health. Control of mortality has always been a paramount goal of health activities, and variations in death rates are direct measures of progress towards that goal. The assumption is often made that changing mortality reflects changes in other aspects of health as well. However, after a considerable decline in mortality rates or increase in expectation of life, these indexes often reach a plateau. But stability of the mortality rates does not imply no change in health status; it merely emphasizes a difficulty inherent in the use of mortality statistics as measures of health status. They say little about the living, while the health of the living has become a very important aspect of health status. They provide an inadequate basis for assessing the need for and success of many health measures. Their limitations have been adequately discussed in the literature. These considerations suggest that more sensitive and informative indexes of levels of health might be obtained from information on the health characteristics of the living, over and above mortality. It is not surprising, therefore, that in recent years many health status indexes have been proposed that are designed to provide a more comprehensive picture of health status at both individual and community levels. Although many such indexes exist, they are not discussed here.

10. Miscellany

 
FEW OTHER ASPECTS OF CLINICAL TRIALS
 
Clinical Life Table
The life table is a statistical technique for estimating such clinically important parameters of the longitudinal course as the probability that a patient will experience the outcome under study within any specified time interval, and the mean and median length of time until the outcome occurs. The life table is the method of choice for estimating these parameters because it bases its estimates on the data from all patients who were followed up, whether for a short time or a long time, and thus provides more precise estimates than does the 'one point at a time' approach. The life table is therefore the technique of choice when there are serial intakes and serial dropouts.
We apply the method to follow-up data on manic-depressive patients maintained with prophylactic lithium carbonate or with a control regimen. Table 10.1 presents the first six rows of the life table for 96 bipolar patients maintained with lithium carbonate.
Table 10.1   Portion of the clinical life table for bipolar patients maintained well with lithium carbonate (four-week intervals)

Column definitions: (a) Nw = number starting the interval well; (b) Lw = number last observed well during the interval; (c) N′w = adjusted number studied in the interval = (a) − (b)/2; (d) Fw = number failing in the interval (including dropouts); (e) qw = interval-specific probability of failure = (d)/(c); (f) pw = interval-specific probability of remaining well = 1 − (e); (g) Pw = cumulative probability of remaining well = product of the (f) values.

w | Interval (weeks) | (a) | (b) | (c) | (d) | (e) | (f) | (g)
0 | 0–4 | 96 | 0 | 96 | 7 | 0.07 | 0.93 | 0.93
1 | 4–8 | 89 | 1 | 88.5 | 7 | 0.08 | 0.92 | 0.86
2 | 8–12 | 81 | 0 | 81 | 9 | 0.11 | 0.89 | 0.76
3 | 12–16 | 72 | 0 | 72 | 7 | 0.10 | 0.90 | 0.69
4 | 16–20 | 65 | 3 | 63.5 | 2 | 0.03 | 0.97 | 0.66
5 | 20–24 | 60 | 4 | 58 | 6 | 0.10 | 0.90 | 0.60
In this table, all dropouts are assumed to have failed at the time of dropout.
Column (a) contains the number of patients who were still well at the start of the indicated four-week interval. Column (b) contains the number of patients who were lost to follow-up in the interval but were well when lost. Column (c) contains the adjusted number of patients studied in each interval. Column (d) contains the number of patients who failed during the interval. Column (e) contains the interval-specific failure rates, denoted by "q". Given that a patient has remained well until the start of a specified interval, "q" is the estimated probability that he fails before the end of the interval. Column (f) contains the estimated probability p (= 1 − q) that a patient who was well until the start of the interval will still be well at the end of the interval. The importance of the conditional probabilities in column (f) resides in their successive multiplication to yield unconditional probabilities, denoted by "P", which are tabulated in column (g). That is, Pw = p0 × p1 × … × pw, the product of the interval-specific probabilities of remaining well up to and including interval w.
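The column-by-column arithmetic can be written out as a short illustrative sketch; the inputs are those of Table 10.1, and, following the table, the interval probabilities are carried to two decimals before multiplying.

# Illustrative sketch: reconstructing columns (c), (e), (f) and (g) of Table 10.1.
starting_well  = [96, 89, 81, 72, 65, 60]   # column (a)
withdrawn_well = [0, 1, 0, 0, 3, 4]         # column (b): last observed well during the interval
failed         = [7, 7, 9, 7, 2, 6]         # column (d): failures, including dropouts

cumulative_p = 1.0
for w, (n, l, f) in enumerate(zip(starting_well, withdrawn_well, failed)):
    adjusted_n = n - l / 2                  # column (c): withdrawals count as half an interval
    q = round(f / adjusted_n, 2)            # column (e), kept to two decimals as in the table
    p = round(1 - q, 2)                     # column (f)
    cumulative_p *= p                       # column (g): product of the column (f) values
    print(f"weeks {4*w}-{4*(w+1)}: N'={adjusted_n:5.1f}  q={q:.2f}  p={p:.2f}  P={cumulative_p:.2f}")
# Reproduces the rows of Table 10.1: P = 0.93, 0.86, 0.76, 0.69, 0.66, 0.60.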
We have not shown the calculations for the control group, but the following figure (Fig. 10.1) presents, separately for the control and lithium groups, curves describing the resulting probabilities of remaining continuously well. These were derived under two assumptions: (1) all dropouts were well at the time of dropping out; (2) all dropouts had failed at the time of dropping out. Quartiles may be estimated as the times at which the curves cross the values P = 0.75, 0.50 and 0.25, the median corresponding to P = 0.50.
Fig. 10.1: Life Table—Comparison of two groups
Statistical comparison between the two curves can be made by the log-rank method, which is briefly described below as a method for comparing all life table probabilities.
 
Statistical Inference
The following presents methods for setting confidence limits about any single life table probability and for comparing probabilities associated with two treatment regimens.
 
Confidence Limits and Hypothesis Tests for Life Table Probabilities
In order to set confidence limits about any of the life table probabilities P, and to compare the values of P associated with two treatment regimens, one requires an estimate of the standard error of P. Consider Pw, the probability of remaining continuously well from the start of follow-up at least until week w. The approximate standard error of Pw (due to Greenwood) is given by the square root of
Var(Pw) = Pw² × Σ [q/(N′ × p)],
where the summation extends from the first study interval through the interval (w − 4) to w. Each ratio in the summation is calculated separately for each interval, with N′ being the adjusted number of patients studied in the interval (column c of the life table), q the conditional probability of failure in the interval (column e), and p = 1 − q (column f).
Consider, for example, the probability P24 = 0.60 for the lithium carbonate series. Its squared standard error is calculated as follows:
Var(P24) = (0.60)² × [0.07/(96 × 0.93) + 0.08/(88.5 × 0.92) + 0.11/(81 × 0.89) + 0.10/(72 × 0.90) + 0.03/(63.5 × 0.97) + 0.10/(58 × 0.90)] = 0.0026,
and thus its standard error is equal to 0.05. A 95% confidence interval for the probability of remaining continuously well with lithium carbonate for the first 24 weeks of follow-up, assuming that all dropouts failed, is therefore 0.60 ± 1.96 × 0.05, or the interval from 0.50 to 0.70. The constant 1.96 was obtained from tables of the standard normal distribution.
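The Greenwood calculation above can be checked with a few lines of code (an illustrative sketch; the inputs come from Table 10.1):

import math

# Illustrative sketch: Greenwood standard error and 95% CI for P(24) from Table 10.1.
adjusted_n = [96, 88.5, 81, 72, 63.5, 58]          # column (c)
q          = [0.07, 0.08, 0.11, 0.10, 0.03, 0.10]  # column (e)
p          = [1 - x for x in q]                    # column (f)
P24 = 0.60                                         # cumulative probability at 24 weeks

var = P24**2 * sum(qi / (ni * pi) for ni, qi, pi in zip(adjusted_n, q, p))
se = math.sqrt(var)
lower, upper = P24 - 1.96 * se, P24 + 1.96 * se
print(round(var, 4), round(se, 2), round(lower, 2), round(upper, 2))
# 0.0026, 0.05, 0.50, 0.70, matching the text.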
 
Comparing Two Life Table Probabilities
If Pw(1) and Pw(2) are probabilities derived from two independent samples, their difference may be tested for significance by referring the critical ratio
z = [Pw(1) − Pw(2)] / √[SE²(Pw(1)) + SE²(Pw(2))]
to the standard normal distribution.
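For instance, the critical ratio can be computed as below; the control figures here are hypothetical, since they are not tabulated in this chapter.

import math

# Illustrative sketch: comparing two life table probabilities at the same week.
P1, se1 = 0.60, 0.05          # lithium figures from the text
P2, se2 = 0.35, 0.06          # hypothetical control figures, for illustration only

z = (P1 - P2) / math.sqrt(se1**2 + se2**2)
print(round(z, 2))            # about 3.20; refer z to the standard normal distribution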
 
Comparing All Life Table Probabilities
The question is whether or not, considering all of the data, the differences between the two sets of probabilities are statistically significant. The comparison of two samples at all time points simultaneously can be effected by means of a chi-square procedure. Within each study interval, let N′(1) and N′(2) be the two adjusted numbers of patients (from column c of the life table), p(1) and p(2) the two conditional probabilities of remaining well (from column f of the life table), and p̄ their weighted average, i.e. p̄ = [N′(1)p(1) + N′(2)p(2)]/[N′(1) + N′(2)]. The statistic
χ² = {|Σ N′(1)[p(1) − p̄]| − ½}² / Σ {N′(1)N′(2)p̄(1 − p̄)/[N′(1) + N′(2)]},
which incorporates a correction for discontinuity, may be referred to tables of chi-square with one degree of freedom to test the significance of the difference between the two series of probabilities. The summation extends from the first study interval through that interval following which no members of either group 1 or group 2 were observed to remain well (i.e. following which either N′(1) or N′(2) became zero). This test is known as the 'log-rank' test. The value of the chi-square statistic for the overall difference between the course with lithium carbonate (assuming that all dropouts failed) and the course with the control regimen (assuming that all dropouts were well) was 13.47 (df = 1, P < 0.001). The overall difference between the lithium carbonate and control samples is therefore highly significant.
 
Survival Functions
Survival data arise when the aim is to study the time elapsed from some particular starting point to the occurrence of an event. The term 'survival' presumes that the event is death but, in practice, it could be any 'failure', such as occurrence of metastasis, toxicity or relapse, or any success, such as recovery, discharge from the hospital or disappearance of a complaint. We use the term survival time generically to mean the time to any event under investigation. The objective of statistical methods for survival is to estimate the expected time to the event. The life table method is applicable when the survival time is in grouped form. In many clinical situations the exact survival time is available for each individual. If so, grouping is inadvisable because it entails loss of information. The method of Kaplan-Meier is used to estimate the probability of survival in this case.
More often than not, the interest is in comparing the survival pattern in one group with that in another, such as test versus placebo groups, medical versus surgical groups, or treatment-1 versus treatment-2 groups. For example, the interest might be in the time to develop metastases after diagnosis of breast cancer, with the treatments under comparison being radical (Halsted mastectomy) and conservative surgery. The comparison is generally done with the help of a log-rank test. The extent-of-exposure approach to the log-rank test is very nicely described in Anderson S, et al. Statistical Methods for Comparative Studies: Techniques for Bias Reduction. New York: John Wiley and Sons, 1980. The time from randomization to the last date the live patient was examined is known as the 'censored survival time'. Censored observations can arise in three ways: (a) the patient is known to be still alive when the trial analysis is carried out; (b) the patient was known to be alive at some past follow-up, but the investigator has since lost trace of him; (c) the patient has died of some cause totally unrelated to the disease in question.
The following points deserve special attention.
  • Define the entry point or the starting point. This could be either the day of diagnosis, the day of starting therapy, or the day of onset.
  • Define the exit point or the end-point. In many cases this could be death. However, in some cases, as mentioned earlier, the interest could be in recurrence, occurrence of a particular complication, etc. Decide whether the intervening period between entry and exit has to be event-free, or whether the occurrence of other events is to be disregarded.
  • Some subjects can always be lost to follow-up. This may happen because the subject has moved out of the area, stopped co-operating, or is just untraceable. There is a built-in mechanism in the survival methods to take care of such losses.
 
Vaccine Trials
Most of the points/concepts discussed so far are also applicable to vaccine trials, with certain modifications where necessary (in view of the fact that the subjects are now normal healthy individuals). The primary goal of preclinical testing of a new vaccine product, of new combination vaccines comprising previously licensed antigen(s), or of vaccines presented in new formulations or new delivery systems should be to demonstrate that the vaccine is suitable for testing in humans. In phase I clinical studies, initial testing of a vaccine is carried out in small numbers (e.g. 20) of healthy adults, to test its properties, its tolerability and, if appropriate, clinical laboratory and pharmacological parameters. Phase I studies are primarily concerned with safety. Phase II studies involve larger numbers of subjects and are intended to obtain preliminary information about a vaccine's ability to produce its desired effect (usually immunogenicity) in the target population, and about its general safety. To fully assess the protective efficacy and safety of a vaccine, extensive phase III trials are required.
Adjuvant(s) may be included in new vaccines to promote immune responses to particular antigens, or to target a particular immune response. It is important that the adjuvant(s) used comply with pharmacopoeia requirements where they exist, and that they do not cause unacceptable reactogenicity (i.e. events considered to have occurred in causal relationship to the vaccination; these reactions may be either local or systemic). Preclinical studies should evaluate the adjuvant/antigen combination as formulated for clinical use. It should be noted that no adjuvant is licensed in its own right but only as a component of a particular vaccine.
Vaccine efficacy is the percentage reduction in the incidence rate of disease in vaccinated compared with unvaccinated individuals. Vaccine (protective) efficacy is calculated according to the following formula:
VE = [(Iu − Iv)/Iu] × 100 = (1 − RR) × 100,
where Iu = incidence in the unvaccinated population, Iv = incidence in the vaccinated population, and RR = relative risk (Iv/Iu). In case-control studies, or other studies in which the incidence of the target disease or adverse event is low, RR is replaced by the odds ratio (OR).
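For example (with attack rates invented purely to illustrate the formula):

# Illustrative sketch: vaccine efficacy from hypothetical incidence (attack) rates.
incidence_unvaccinated = 0.08     # hypothetical: 8 cases per 100 person-years
incidence_vaccinated   = 0.02     # hypothetical: 2 cases per 100 person-years

rr = incidence_vaccinated / incidence_unvaccinated
ve = (1 - rr) * 100               # identical to (Iu - Iv)/Iu * 100
print(ve)                         # 75.0 percent vaccine efficacy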
In addition to intrinsic efficacy, effectiveness depends on the heterogeneity in susceptibility, rates of exposure to infectious agents and protection conferred by the vaccination. Vaccine effectiveness may also be influenced by time-related changes in protection caused by intrinsic properties of the vaccine (waning of efficacy and boosting), changes in vaccination coverage, correlation of vaccine strains with circulating strains, selection of strains not included in the vaccine following introduction of the vaccine, and population characteristics (such as age distribution). That is, vaccine effectiveness measures direct and indirect protection.
For evaluation of common local reactogenicity, approximately 300 subjects are needed for each group. However, depending on the type of vaccine, the disease indication, and the target population, enrolment of more than 5000 may be appropriate in order to provide reasonable assurance of safety pre-licensure in randomized, controlled settings. Randomized studies must have sufficient power to provide reliable rates of common (>1/100 and <1/10) adverse events, and to detect less common events, but not necessarily very rare (<1/10000) adverse events. Seroconversion is a predefined increase in antibody concentration, considered to correlate with the transition from seronegative to seropositive, providing information on the immunogenicity of a vaccine. If there are pre-existing antibodies, seroconversion is defined by a transition from a clinically unprotected to a protected state. Seroprotection is a similar term with a higher cut-off.
Bridging studies are intended to support the extrapolation of efficacy, safety and immunogenicity data from one formulation, population or dose regimen to another. The need for performing bridging studies should be considered carefully and justified in the protocol. The endpoints for clinical bridging studies are usually the relevant immune responses and clinical safety parameters.
 
Evaluation of Complementary/Alternative Therapies
According to a recent report (for the Liverpool Public Health Observatory) on evaluating complementary treatments, 'clinical trials are the best way of assessing the efficacy of treatments but not in their present form'. Clinical trials are known to characterize common side effects but not those occurring more rarely, and a drug's long-term safety also remains unaddressed. In some instances the idea of a common diagnosis for trial participants may not be appropriate, and it becomes essential to distinguish between treatment-sensitive and treatment-insensitive patients to help tailor the treatment better. A more treatment-oriented classification of diseases, taking sensitivity to treatment into account, would result in a much more homogeneous group of trial participants and could be useful.
Though many indigenous treatments/drugs are time-tested, most of them have not been scientifically/statistically evaluated. To bring many such indigenous treatments/drugs into the mainstream, it may be necessary to apply a quick (modified) methodology of clinical trials. Moreover, in view of many medicines working together and the 'treatment of choice' type of problem, some modifications seem to be essential in the existing clinical trials methodology to assess the efficacy of indigenous treatments.
A growing body of evidence suggests that the patient can gain an optimal therapeutic effect with minimal toxicity by administering drugs at carefully selected times of day. If the data are qualitative, tests such as those of Edward, Walter and Elwood, Roger, St. Leger, David and Newell, Hewitt, Pocock, and Freedman are used. If the data are quantitative, either cosinor analysis or Box-Jenkins ARIMA modeling methods are used. Working out the whole methodology for finding the times at which a given drug should be administered to maximize its utility (effectiveness as well as safety) is a very important area of research. A well-versed pharmacoepidemiologist can play an important role in this regard.
 
Adequacy of Trial/Sample Size
The following two tables (Tables 10.2 and 10.3) give the minimum number of patients required in each of the treatment groups for a study with the observed results to be reasonably sure (with a one-sided α of 0.05) that a relative risk reduction of 25% or 50% would not have been missed if it had occurred. The approach discussed here to answer this question (was the trial big enough?) is based on post-trial results: how many patients are needed depends on what is found. The rows of these tables represent the rates of events among patients on the control or placebo treatment and the columns show the rates of events among patients on the experimental treatment. Take the actual study results (event rates) and plug them into these tables, that is, identify the cell where they intersect.
Example: For a trial with the following findings:

Item | Control (Placebo) group | Experimental (Treatment) group
Number | nC = 100 | nE = 100
Rate | pC = 0.45 | pE = 0.40
If we trace these results in the tables, we see that the trial needed 254 patients per group to be confident that it had not missed a risk reduction of 25%, but it needed only 22 patients per group to be able to detect a 50% risk reduction. Thus the trial was too small to reject a 25% improvement but large enough to reject a 50% risk reduction in adverse event (say, case-fatality) rates.
By a relative risk reduction of 50% we mean a halving of an event rate. Thus a reduction from 0.40 to 0.20 is a 50% relative risk reduction and a reduction from 0.40 to 0.30 is a 25% relative risk reduction. If the size of the treatment groups in the published (or unpublished) report exceeds those in these tables, you can be pretty sure that the trial was, indeed, big enough to detect clinically important differences if they had occurred. Tables are also available for finding the power of a trial for a given sample size.
 
Measures of Clinical Significance
Clinical significance goes beyond arithmetic and is determined by clinical judgment. Nevertheless, the following measures can help sort out whether the benefits of a particular treatment are big enough.
Table 10.2   Was the trial big enough to show risk reduction of 25% if it occurred?
Observed rate of events in the experimental group
0.95
0.90
0.85
0.80
0.75
0.70
0.65
0.60
0.55
0.50
0.45
0.40
0.35
0.30
0.25
0.20
0.15
0.10
0.05
Observed rate of events in the control group
0.95
14
27
68
391
0.90
11
18
38
110
1057
0.85
14
25
54
185
4889
Trials up here have risk reductions of 25 % or more
0.80
11
18
33
78
326
0.75
13
22
44
112
635
0.70
11
16
28
57
165
1524
0.65
13
20
35
75
250
6349
0.60
10
15
24
43
99
402
0.55
12
17
28
53
132
722
0.50
13
20
33
65
180
1607
0.45
10
15
22
38
79
254
0.40
11
16
25
44
98
381
0.35
12
18
28
50
121
634
0.30
13
19
30
57
1296
0.25
10
13
20
33
64
196
4537
0.20
Trials down here needed fewer than 10 patients per group
10
14
20
34
71
261
0.15
10
14
20
35
78
371
0.10
10
13
20
34
80
589
0.05
12
17
30
74
1245
Table 10.3   Was the trial big enough to show risk reduction of 50% if it occurred?
Observed rate of events in the experimental group
0.70
0.65
0.60
0.55
0.50
0.45
0.40
0.35
0.30
0.25
0.20
0.15
0.10
0.08
0.06
0.04
0.02
Observed rate of events in the control group
0.98
14
24
50
165
5803
0.95
12
19
37
102
921
0.90
14
26
58
236
Trials up here have risk reductions of 50 % or more
0.85
12
19
38
108
995
0.80
10
15
27
63
256
0.75
12
21
41
116
1059
0.70
16
29
66
268
0.65
13
22
43
120
1082
0.60
11
17
30
68
270
0.55
13
22
42
119
1059
0.50
11
17
30
66
260
0.45
13
22
42
113
987
0.40
11
16
28
62
239
0.35
13
20
38
102
867
0.30
10
15
26
55
205
0.25
12
18
33
86
699
0.20
13
22
45
160
0.15
Trials down here needed fewer than 10 patients per group
10
15
26
64
482
0.10
11
17
32
102
254
2017
0.08
14
25
66
131
453
0.06
12
20
44
76
179
1313
0.04
10
16
31
47
87
274
0.02
12
22
30
47
97
561
Consider the following results of a trial in which two groups (two types of patients with the same disorder) are given either placebo or active treatment, and the occurrence of death or other major complications is recorded (Table 10.4).
Table 10.4   Rate of occurrence of death or other major complication (adverse event rate)

Group | Placebo (P) | Treatment (T)
Group I | 0.20 | 0.06
Group II | 0.12 | 0.04
We calculate the percent reduction in the risk of adverse events achieved through active treatment by comparing the difference in the complication (adverse event) rates between placebo and actively treated patients with what would have occurred with no treatment (i.e. the complication rate among placebo patients). This is called the relative risk reduction (RRR), which is equal to [(P − T)/P] × 100. For group I the RRR is [(0.20 − 0.06)/0.20] × 100 = 70%, and for group II it is [(0.12 − 0.04)/0.12] × 100 = 67%.
These relative risk reductions mean that the risk of death or other major complications was reduced by roughly two-thirds through active treatment in patients of both groups. Generally, relative risk reductions of 50% or more are almost always, and of 25% or more are often, considered clinically significant. The relative risk reduction is a quick and useful measure of clinical significance. It usually gives the right impression as long as there is a placebo or "no treatment" group and as long as specific treatment targets (such as death, stroke or some given level of disability) are being counted rather than some continuous measure like blood pressure or antibody level. Of course, even with such a continuous measure it is possible to form two categories, below and above (or equal to) some meaningful cut-off point.
Patients in both groups benefit almost equally, in terms of relative risk reduction, from this particular treatment, although the former (i.e. group I) are much more likely to go on to death or other major complications with or without treatment. The higher priority for detecting and treating patients like those in group I is not captured in the relative risk reduction, and this measure therefore lacks an important element of clinical significance. We can capture this missing element by sticking to the absolute risks of bad outcomes in the absence and presence of treatment. This absolute difference in risks of death or other major complications is 0.20 − 0.06 = 0.14 for patients of group I but only 0.12 − 0.04 = 0.08 for patients of group II. The absolute risk reduction thus shows the greater absolute gain from treating group I patients and captures the clinical significance of this difference.
The reciprocal of the absolute risk reduction (ARR) turns out to be the number of patients we need to treat in order to prevent one complication of their disease. Thus we need to treat only 1/0.14 ≈ 7 group I patients in order to prevent one of them from going on to death or other major complication, whereas we need to treat 1/0.08 ≈ 12 group II patients to prevent one of them from suffering such a complication. This measure of clinical significance, called the "number needed to treat (NNT)" and equal to 1/(P − T), nicely emphasizes the effort you and your patients will have to expend in order to accomplish a very tangible treatment target.
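The three measures for Table 10.4 can be obtained as in the sketch below (illustrative only; the rates are those of the table):

# Illustrative sketch: RRR, ARR and NNT for the two groups of Table 10.4.
rates = {"Group I": (0.20, 0.06), "Group II": (0.12, 0.04)}   # (placebo rate P, treatment rate T)

for group, (P, T) in rates.items():
    rrr = (P - T) / P * 100       # relative risk reduction, in percent
    arr = P - T                   # absolute risk reduction
    nnt = 1 / arr                 # number needed to treat to prevent one adverse event
    print(f"{group}: RRR = {rrr:.0f}%  ARR = {arr:.2f}  NNT = {nnt:.1f}")
# Group I: RRR 70%, ARR 0.14, NNT 7.1 (the text rounds to 7);
# Group II: RRR 67%, ARR 0.08, NNT 12.5 (the text rounds to 12).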
NNT can be very useful in analyzing any clinical data; we expect NNTs for effective treatments to be in the range 2-4. NNTs for prophylaxis will be larger (for example, a study found that use of aspirin to prevent one death at five weeks after myocardial infarction has an NNT of 40). Suppose the results of a trial to compare two treatments are summarized as in the following table (Table 10.5).
Table 10.5   Notation for a study comparing two treatments

Treatment | Success | Failure | Total
Control | a | c | a + c
Test | b | d | b + d
The probability, or risk, of success under the control treatment is Pcontrol = a/(a + c) and under the test/active treatment it is Ptest = b/(b + d). The difference in risks, known as the absolute risk reduction, is ARR = Ptest − Pcontrol = [b/(b + d)] − [a/(a + c)]. The risk ratio, or relative risk, is RR = Ptest ÷ Pcontrol = [b/(b + d)] ÷ [a/(a + c)]. The relative risk reduction is RRR = (Ptest − Pcontrol)/Pcontrol = RR − 1. NNT is the inverse of the absolute value of the risk difference, i.e. NNT = 1/|ARR|. If Ptest > Pcontrol, this is the number of people one would expect to have to treat in each group so that one extra person benefits from the test treatment. This can be seen since, if we give the test treatment and the control treatment to n patients each, the difference in the number cured under the two treatments is n × ARR; if n = 1/ARR then the difference is one.
A further method of summarizing the results is to use the odds of an event rather than the probability. The odds of an event ODDStest = b/d and ODDScontrol = a/c. A probability ranges from 0 to 1, whereas an odds ranges from 0 to infinity. The odds ratio is OR = ODDStest ÷ ODDScontrol = bc/ad. This approximates the RR when the successes are rare (say with incidence less than 20%). The odds ratio for Failure is given by OR = ad/bc which is just the inverse of the OR for success. This is not true for the relative risk.
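A short sketch in the a/b/c/d notation of Table 10.5, with counts invented for illustration, shows these quantities and how the odds ratio approximates the relative risk when events are uncommon:

# Illustrative sketch: risk and odds measures from a 2x2 table in the notation of Table 10.5.
a, c = 10, 190     # control: 10 successes, 190 failures (success uncommon) - invented counts
b, d = 18, 182     # test:    18 successes, 182 failures - invented counts

p_control = a / (a + c)
p_test = b / (b + d)

arr = p_test - p_control                     # absolute risk difference
rr = p_test / p_control                      # relative risk
rrr = rr - 1                                 # relative risk reduction = (Ptest - Pcontrol)/Pcontrol
nnt = 1 / abs(arr)                           # number needed to treat
odds_ratio = (b / d) / (a / c)               # = bc/(ad), the odds ratio for success

print(round(rr, 2), round(rrr, 2), round(odds_ratio, 2))   # 1.8, 0.8, 1.88: OR close to RR here
print(round(arr, 3), round(nnt))                           # 0.04 and 25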
NNT is a very useful measure because, when the absolute prevalence of events is low, the relative risk reduction is not a good guide to clinical significance. When the absolute baseline risk of the bad clinical outcome is high, even modest relative risk reductions generate gratifyingly small NNTs. Thus a treatment that produces only a 10% relative risk reduction in a baseline risk of 0.9 is as clinically significant as one that produces a 30% relative risk reduction in a baseline risk of 0.3. Rather small changes in the absolute baseline risk of a rare clinical event lead to big changes in the number of patients we need to treat in order to prevent one. So a powerful but risky treatment that produces a relative risk reduction of 50% for a bad outcome, but also produces an equally serious side effect in 1% of those who receive it, is not worth giving to patients with a 0.01 risk of the bad outcome.
 
Confidence Interval for RRR and NNT
Instead of reporting point estimates of RRR and NNT, reporting confidence intervals is more informative, especially so for negative trials. There has been a lot of debate regarding the relative merits of tests of significance and confidence intervals, and the controversy still persists. However, in my opinion both are important; they supply complementary (and supplementary), very useful information. Nevertheless, in a few particular situations one has a definite advantage over the other; for example, when dealing with risks one should always calculate a confidence interval. The following formulas do so, and the example given illustrates the calculations.
The 95% confidence interval for a relative risk reduction is
RRR ± 1.96 × √[P(1 − P)/nP + T(1 − T)/nT]/P,
and that for the number needed to treat is given by the reciprocals of the corresponding limits for the risk difference,
1/{(P − T) ± 1.96 × √[P(1 − P)/nP + T(1 − T)/nT]}.
Example: Consider the following results of a trial:

 | Control (Placebo) | Experimental (Treatment)
Number of patients | nP = 200 | nT = 200
Adverse event rate | P = 0.20 | T = 0.05
The 95% CI for the RRR is 45% to 105%, and the 95% CI for the NNT is 5 to 11.
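A sketch of the calculation, under the usual normal approximation for the difference of two proportions; because the text rounds the standard error, its limits differ slightly from the exact ones obtained here.

import math

# Illustrative sketch: approximate 95% CIs for RRR and NNT for the example above.
nP, P = 200, 0.20      # control (placebo) group
nT, T = 200, 0.05      # experimental (treatment) group

arr = P - T                                               # absolute risk reduction = 0.15
se_arr = math.sqrt(P * (1 - P) / nP + T * (1 - T) / nT)   # SE of the risk difference
arr_low, arr_high = arr - 1.96 * se_arr, arr + 1.96 * se_arr

rrr = arr / P * 100
rrr_low, rrr_high = arr_low / P * 100, arr_high / P * 100
nnt = 1 / arr
nnt_low, nnt_high = 1 / arr_high, 1 / arr_low             # reciprocals of the ARR limits

print(round(rrr), round(rrr_low), round(rrr_high))        # 75, about 43 to 107 percent
print(round(nnt, 1), round(nnt_low, 1), round(nnt_high, 1))   # 6.7, about 4.7 to 11.5
# With the standard error rounded to 0.03, as in the text, these become roughly 45-105% and 5-11.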
Recently the number-needed-to-treat concept has been extended to compare strategies for disease screening, yielding a new statistic, the number needed to screen, defined as the number of people that need to be screened to prevent one death or one adverse event; this could form the basis of a strategy for disease screening.
 
FEW QUANTITATIVE ASPECTS OF CLINICAL REASONING
 
Sensitivity, Specificity, etc.
The process of diagnosis, i.e. the diagnostic journey, requires two essential steps. The first is the establishment of diagnostic hypotheses, followed by attempts to reduce their number by progressively ruling out specific diseases. This process requires a very sensitive test; such a test, when normal (negative), permits the physician to confidently exclude the disease. The next step is the pursuit of a strong clinical suspicion. This process requires a very specific test; such a test, when abnormal (positive), essentially confirms the presence of the disease.
Test result | Disease present | No disease
Positive | True positives (TP) | False positives (FP)
Negative | False negatives (FN) | True negatives (TN)
Sensitivity of a test is its ability to identify correctly those who have the disease. It is determined as the proportion of patients with the disease in whom the test is positive: sensitivity = TP/(TP + FN).
Specificity of a test is its ability to identify correctly those who do not have the disease. It is determined as the proportion of patients without the disease in whom the test is negative: specificity = TN/(TN + FP).
When a test is used either for screening or to exclude a diagnostic possibility, it must be sensitive. When two or more such tests are available, that with the highest sensitivity is preferred. The process of confirming a disease requires a test whose specificity is high. When two or more tests are available for this purpose, that with the highest specificity is preferred. The intelligent selection of test depends upon the purpose intended.
Since very few tests are both highly sensitive and specific, two or more tests are often used in combination to enhance sensitivity or specificity of the diagnostic process. There are two principal forms of combination—tests-in-series and tests-in-parallel. With tests-in-series the person is called positive only if he tests positive to all of a series of tests and negative if he tests negative to any. With tests-in-parallel the person is called positive if he tests positive to any of the tests and negative if he tests negative to all. Series testing results in the lowest sensitivity but the highest specificity whereas parallel testing results in the highest sensitivity but the lowest specificity. Multiple tests are most helpful when all tests are normal (negative), thus tending to exclude disease, and/or when all tests are abnormal (positive), thus tending to confirm disease. Multiple tests are least helpful when one test is positive and the others normal (negative). Nevertheless, use of the appropriate type of combination testing (according to requirement) enhances sensitivity or specificity.
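Assuming, for simplicity, that the two tests err independently of one another, the combined sensitivity and specificity behave as in the sketch below (the individual test values are invented for illustration):

# Illustrative sketch: two tests combined in series and in parallel,
# assuming the tests are conditionally independent given disease status.
sens1, spec1 = 0.90, 0.80      # hypothetical test 1
sens2, spec2 = 0.85, 0.90      # hypothetical test 2

# Series: positive only if BOTH tests are positive.
sens_series = sens1 * sens2
spec_series = 1 - (1 - spec1) * (1 - spec2)

# Parallel: positive if EITHER test is positive.
sens_parallel = 1 - (1 - sens1) * (1 - sens2)
spec_parallel = spec1 * spec2

print(round(sens_series, 3), round(spec_series, 3))      # 0.765 and 0.98: lower sensitivity, higher specificity
print(round(sens_parallel, 3), round(spec_parallel, 3))  # 0.985 and 0.72: higher sensitivity, lower specificity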
Results of a test, even one with high sensitivity and specificity, are affected by prevalence. When prevalence is low, even a highly specific test will give a relatively large number of false positives. This makes it necessary to define two more terms: the predictive value of a positive test, which is the likelihood that an individual with a positive test has the disease, PPV = TP/(TP + FP);
and the predictive value of a negative test, which is the likelihood that a person with a negative test does not have the disease, NPV = TN/(TN + FN).
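A minimal sketch of these definitions (Python, not part of the original text; the counts are hypothetical) computes sensitivity, specificity and the two predictive values from a 2 × 2 table, and the second call shows how the predictive value of a positive test falls when the same test is used in a population with lower prevalence.

```python
def test_characteristics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from a 2 x 2 diagnostic table."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)      # predictive value of a positive test
    npv = tn / (tn + fn)      # predictive value of a negative test
    return sensitivity, specificity, ppv, npv

# Hypothetical counts: 1000 people screened, prevalence 9%, with a test that is
# about 90% sensitive and 95% specific; PPV comes out around 0.64.
print(test_characteristics(tp=81, fp=46, fn=9, tn=864))
# The same test in a population with prevalence below 1%: far more false positives
# relative to true positives, so PPV drops to about 0.14 while NPV stays high.
print(test_characteristics(tp=8, fp=50, fn=1, tn=941))
```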
 
Validity and Reliability
Validity and reliability are the two basic components of any measurement or evaluation procedure. Validity refers to the degree to which the process measures or evaluates what it is intended to measure or evaluate. Sensitivity and specificity are two components of validity. Knowledge of the true disease state, i.e. the true diagnosis, may be based on methods more refined than those used in the test, or on evidence which emerges after the passage of time, for instance at autopsy. Since this serves as a criterion against which validity is judged, it is called criterion validity. The efficiency (or accuracy) of a diagnostic test is its ability to indicate the presence or absence of disease correctly, i.e. the percentage of all results, positive or negative, that are correct: efficiency = (TP + TN)/(TP + FP + FN + TN). It thus combines sensitivity and specificity into a single figure, weighted by the proportions of diseased and non-diseased subjects.
Reliability (or repeatability) refers to the degree of consistency: the level of agreement between replicate (repeated) measurements. It covers agreement between two or more clinicians, and agreement between two or more examinations by the same clinician on different occasions; these two elements are termed inter-observer and intra-observer consistency. Agreement is quantified with the “kappa” statistic for qualitative outcomes, while for quantitative responses the intraclass correlation coefficient (denoted here by η, eta) is generally used.
 
Clinical Disagreement
Clinicians who examine the same patient often disagree. In fact, the clinician who examines the same patient twice often disagrees with his or her own earlier findings. For example, examination of the optic fundus is a universally accepted component of the physical examination of a patient suspected of having cardiovascular disease or diabetes, and in hypertensive patients it may provide a better index of prognosis than measurement of the blood pressure. However, when two clinicians carefully examined the same set of 100 fundus photographs, the disagreements documented in the following table were generated (hypothetical data) (Table 10.6).
Table 10.6   Agreement between two clinicians examining the same set of 100 fundus photographs

                                                    Second clinician
                                       Little or no retinopathy   Moderate or severe retinopathy
First clinician
  Little or no retinopathy                        46                            10
  Moderate or severe retinopathy                  12                            32
They agreed (using the K-W system of classification) that 46 of these 100 patients had little or no retinopathy and that a further 32 had moderate or severe retinopathy. Thus they agreed with each other about three-quarters of the time and disagreed about whether patients had retinopathy about one-quarter of the time; the overall agreement was (46 + 32)/100 or 78%. However, this description of their agreement is rather superficial, because even if the second clinician had simply tossed a coin rather than studied each fundus photograph, he would have agreed with the first clinician part of the time by chance alone (calling it little or no retinopathy if the coin landed heads and moderate or severe retinopathy if it landed tails).
We can determine the chance agreement by assuming that the second clinician was tossing a coin and obtained 58 heads and 42 tails (equal to his marginal totals). Thus he would call a photograph little or no retinopathy 58% of the time and moderate or severe retinopathy 42% of the time. That is, if chance alone were operating, we would expect 58% of the 56 (i.e. 32.5) photographs judged by the first clinician to show little or no retinopathy to be called little or no retinopathy by the second clinician as well, and 42% of the 44 (i.e. 18.5) photographs judged by the first clinician to show moderate or severe retinopathy to be called moderate or severe retinopathy by the second clinician. Therefore the agreement between these two clinicians that we would expect on the basis of chance alone is (32.5 + 18.5)/100 or 51%. Since the observed agreement was 78% and the expected agreement on the basis of chance alone was 51%, the actual agreement beyond chance is the remainder, 78% − 51% = 27%. The potential agreement beyond chance is 100% − 51% = 49%. A way of combining these into a clinically useful index is to calculate the ratio of the actual to the potential agreement beyond chance. This index, known as Cohen's “kappa coefficient” (27%/49% = 0.55 in our example), represents the proportion of potential agreement beyond chance that was actually achieved.
One can use kappa to look at multiple levels of agreement (low, normal, high, or grade I, II, III, IV) rather than just two levels. Further, if the extent of disagreement is important, we can weight the kappa for the degree of disagreement.
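A minimal sketch of the calculation (Python, not part of the original text): observed agreement, chance-expected agreement and Cohen's kappa computed directly from the counts of Table 10.6.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square agreement table (list of rows of counts)."""
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return (observed - expected) / (1 - expected), observed, expected

# Table 10.6: rows = first clinician, columns = second clinician
kappa, po, pe = cohen_kappa([[46, 10], [12, 32]])
print(f"observed = {po:.2f}, expected by chance = {pe:.2f}, kappa = {kappa:.2f}")
# observed = 0.78, expected by chance = 0.51, kappa ~ 0.55
```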
 
Tests Yielding a Quantitative Result
In dealing with diagnostic tests that yield a quantitative result (e.g. blood sugar, blood pressure), the situation is different, in the sense that a clear-cut presence or absence of disease cannot be declared from the test value alone. There will be overlap of the distributions of the attribute for diseased and non-diseased persons. The ideal test is one in which there is no overlap in the range of results among patients with and without the disease in question. Figure 10.2 shows a hypothetical distribution of results of such a test.
All subjects without the disease have test values lower than those observed among patients with the disease. Line ‘A’ defines the test cut-off point, above which the result is 100% sensitive and below which it is 100% specific. There are no false-negative or false-positive test results. However, for most tests that have a range, some overlap of findings is seen among those with and without the disease, as indicated in Figure 10.3.
All values that fall to the right of a line are considered to be positive, those to the left are negative. The choice of cut-off point ‘A’ for this hypothetical test results in high specificity of 98% but limited sensitivity of 60%.
Fig. 10.2: Hypothetical distribution of results of an ideal test
Fig. 10.3: Hypothetical distribution of results of most tests used in clinical medicine
Such a cut-off point may be appropriate to confirm a suspected diagnosis, but it cannot be used to screen for or to exclude the disease because of its low sensitivity. The choice of cut-off point ‘C’ would be appropriate for those purposes, since almost all patients with disease are identified (sensitivity of 95%, but specificity falls to 60%). Point ‘B’ is intermediate between the two (sensitivity of 80%, specificity of 85%). Each cut-off point thus defines a set of operating characteristics for the test in question. A change in cut-off value alters sensitivity and specificity: as sensitivity is increased, specificity falls, and vice versa.
 
Receiver Operating Characteristic (ROC) Curves
Because no single value or cut-off point of an individual test can be expected to have both perfect sensitivity and perfect specificity, it is often necessary to determine which cut-off point is the most appropriate for a given purpose. A graph can be constructed that relates the true and false positive rates, i.e. sensitivity and (1 − specificity). Such a graph, known as the receiver operating characteristic (ROC) curve of a test, demonstrates that different definitions of normal versus abnormal may be appropriate depending on whether one wishes to confirm the presence of disease via a positive result on a test with high specificity, or to exclude the presence of disease via a negative result on a test with high sensitivity. Such a graph is useful for deciding the optimum cut-off point according to the purpose for which the test is administered.
Usually a scale or questionnaire is used with a given cut-off point, which separates, among the diseased and the non-diseased, those individuals who score above the cut-off from those who score below it. The result is a classical 2 × 2 contingency table. The ROC curve can be described as summarizing all possible 2 × 2 decision matrices that result when the cut-off is varied from the largest to the smallest possible value. It is generally obtained by plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity) for all possible cut-off points of the screening instrument. The advantages of using ROC analysis, compared with giving a single value of sensitivity and specificity at one cut-off point, are:
  1. assessment of the discriminating ability of the instrument across the total spectrum of morbidity,
  2. comprehensive comparative assessment of the performance of two or more screening tests, and
  3. assessment of the effect of varying the threshold score.
Cost-benefit and decision-tree approaches to this issue are important, but in many cases we do not have accurate estimates of the additional health or financial costs (risks and benefits of the treatment, discomforts, etc.) associated with errors in diagnosis. One generally followed approach is to choose a cut-off point that minimizes our mistakes. To find such an optimum cut-off point, a simple and convenient alternative way of plotting the ROC information is as follows. On the horizontal (X) axis all possible cut-off points or scores are drawn in sequence. On the first vertical (Y1) axis sensitivity (%) and on the second vertical (Y2) axis specificity (%) are drawn. For each cut-off point or score there is a pair of sensitivity and specificity values, so two points are plotted against each cut-off. When the points corresponding to sensitivity and the points corresponding to specificity are joined separately, we get two curves, which can be called the sensitivity and specificity curves respectively. The point where these two curves cross is an optimal cut-off point. A perpendicular dropped from this point to the horizontal axis facilitates identification of the optimal cut-off score.
To illustrate the use of this method, a real example is given in Figure 10.4, using the results of a pilot study. The purpose of this pilot study was to establish an optimum cut-off point for a psychological screening tool to be used in the main survey, which involved psychological screening on a large scale. The screening tool had a score range of 0 to 20.
Fig. 10.4: Example of alternative way of drawing an ROC curve
An individual is labeled screen negative if his score is equal to or below the chosen cut-off score, and screen positive if his score is above it.
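A sketch of this alternative plot (Python with numpy and matplotlib, not part of the original text; the score data are invented for illustration): sensitivity and specificity are computed at every possible cut-off of a 0 to 20 score and plotted against the cut-off, and the point where the two curves cross suggests the optimal cut-off.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical screening scores (0-20) for non-diseased and diseased subjects
rng = np.random.default_rng(1)
scores_healthy = np.clip(rng.normal(6, 3, 300), 0, 20)
scores_diseased = np.clip(rng.normal(13, 3, 100), 0, 20)

cutoffs = np.arange(0, 21)
# "Screen positive" means a score above the cut-off
sensitivity = [(scores_diseased > c).mean() * 100 for c in cutoffs]
specificity = [(scores_healthy <= c).mean() * 100 for c in cutoffs]

plt.plot(cutoffs, sensitivity, label="Sensitivity (%)")
plt.plot(cutoffs, specificity, label="Specificity (%)")
plt.xlabel("Cut-off score")
plt.ylabel("Percent")
plt.legend()
plt.show()  # the crossing point of the two curves marks the optimal cut-off
```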
 
Clinical Versus Statistical Significance
Nearly all information in medicine is empirical in nature and is gathered from samples of subjects studied from time to time. Besides all other sources of uncertainty, the samples themselves tend to differ from one another. For instance, there is no reason why the 10-year survival rate of carcinoma of the breast should differ between two groups of 120 women each, the first group born on odd days of the month and the second on even days, yet it is very likely that the observed rates will differ. This happens because of sampling error, or sampling fluctuation, which depends on two things:
  1. The sample size n, and
  2. The intrinsic interindividual variability in the subjects.
The former is fully under the control of the investigator. The latter is not under human control, yet its influence on medical decisions can be minimized by choosing an appropriate design and by using appropriate methods of sampling. The sources of uncertainty other than those intrinsic to the subjects, such as observer variation and measurement error, are minimized by adopting suitable tools for data collection.
Sample size n plays a dominant role in statistical inference. The standard error (SE) can be substantially reduced by increasing n. This helps to increase the reliability of the results: a narrow confidence interval (CI) is then obtained that can really help in drawing a focused conclusion. At the same time, a side effect of a large n is that a very small difference can become statistically significant. This may or may not be clinically/medically significant. We know that clinical significance goes beyond arithmetic and is determined by clinical judgment. A few important measures of clinical significance are described in the chapter on designs.
Looking for clinical significance even when the results are statistically significant is very important. There are situations where a result could be clinically important but is not statistically significant. Consideration of these two possibilities leads to two very useful yardsticks for interpreting an article on a clinical trial. These yardsticks are:
  1. If the difference is statistically significant is it clinically significant as well?
  2. If the difference is not statistically significant, was the trial big enough to show a clinically important difference if it had occurred?
It is possible to determine ahead of time how big the study should be. However, most trials that reach negative conclusions either could not or would not put enough patients in their trials to detect clinically significant differences. That is, the β errors of such trials are very large and their power (or sensitivity) is very low. When Freiman et al. reviewed a long list of trials that had reached “negative” conclusions they found that most of them had too few patients to show risk reductions of 25% or even 50%. Tables to find out if the sample size was adequate to detect 25% or 50% risk reduction are given earlier in this chapter.
A distinction must be made between ‘not significant’ and ‘insignificant’. Statistical tests address the former, not the latter: a statistically ‘not significant’ difference is not necessarily ‘insignificant’. With statistical inference the results can seldom, if ever, be absolutely conclusive, as the P-value never becomes zero; there is always a possibility, however small, that the observed difference arose by chance alone. Whenever statistical significance is not reached, the evidence is not considered to be in favor of H0; it is only not sufficiently against it. The word significant in common parlance is understood to mean noteworthy or important. Statistical significance has the same connotation, but it can sometimes be at variance with medical significance. A statistically significant result can be of no consequence in the practice of medicine, and a medically significant finding may occasionally fail the test of statistical significance. The SEs depend heavily on the sample size: a result based on a large sample is much more reliable than a similar result based on a small sample, and this is reflected in the width of the CI on the one hand and in the P-value on the other. A small and clinically unimportant difference can become statistically significant if the sample size is large.
Suppose it is known that 70% of those with sore throat recover within a week without treatment, owing to the body's self-regulating mechanisms. A drug was tried on 800 patients and 584 (73%) were cured within a week. Since P = 0.0322 is small, the null hypothesis (that the cure rate is still 70%) is rejected. Statistical significance is achieved, and it is concluded that the 73% cure rate observed in the sample is genuinely higher than the 70% seen otherwise. But is this difference of 3% worth pursuing the drug for? Is it medically important to increase the chance of relief from 70% to 73%? Perhaps not. Thus a statistically significant result can be medically insignificant.
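The quoted P-value can be reproduced with a one-sample test of a proportion (normal approximation). A minimal sketch (Python with scipy, not part of the original text) shows that the one-sided P-value is about 0.032 even though the absolute improvement is only 3 percentage points.

```python
from math import sqrt
from scipy.stats import norm

p0, n, cured = 0.70, 800, 584
p_hat = cured / n                                  # 0.73
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)         # z is about 1.85
p_one_sided = norm.sf(z)                           # upper-tail P, about 0.032
print(f"z = {z:.2f}, one-sided P = {p_one_sided:.4f}")
```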
Some caution is required in interpreting statistical non-significance also. Consider the following results of a trial in which patients on regular tranquilizers were randomly assigned either to continued conventional management or to a tranquilizer support group.
                                            Tranquilizer support group    Conventional management group
Still taking tranquilizer after 16 weeks                 5                              10
Stopped taking tranquilizer by 16 weeks                 10                               5
Total                                                   15                              15
Though the number of patients who stopped taking the tranquilizer is double in the support group compared with the conventional group, the difference is not statistically significant [χ2 (with Yates' correction) = 2.13, P = 0.1441; Fisher's exact P = 0.1431]. This is a clear case for a trial with a larger n: if the same pattern of results were found with n = 30 in each group, the difference would become statistically significant. The conclusion that the evidence is not sufficient to demonstrate a difference remains scientifically valid so long as n remains 15 in each group.
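A minimal sketch of the check (Python with scipy, not part of the original text): the 2 × 2 table above is tested with the Yates-corrected chi-squared test and Fisher's exact test, and the counts are then doubled to show how the same proportions become statistically significant with a larger n.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# rows: still taking / stopped; columns: support group / conventional group
table = np.array([[5, 10],
                  [10, 5]])

chi2, p, _, _ = chi2_contingency(table, correction=True)   # Yates' correction
_, p_fisher = fisher_exact(table)
print(f"n = 15 per group: chi2 = {chi2:.2f}, P = {p:.4f}, Fisher P = {p_fisher:.4f}")

chi2, p, _, _ = chi2_contingency(table * 2, correction=True)
print(f"n = 30 per group: chi2 = {chi2:.2f}, P = {p:.4f}")   # now significant (P about 0.02)
```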
· In a very good book on his experience of heart problems, an author who is himself a good physician states that much of the trouble he faced was due to a missed diagnosis. This happened because the treadmill test and ECG results were not interpreted in the light of the fact that the treadmill test's sensitivity is about 75% (found to be as low as 55% in one recent well-planned study) and the ECG's sensitivity for detecting heart disease is only about 30%. Test results must always be interpreted in the light of prior information: the patient's cholesterol level at that time was 244 mg. These things naturally led to the missed diagnosis. The moral of the story is that a test result should be interpreted carefully, considering the test's characteristics (sensitivity and specificity) and a correct estimate of the ‘prior probability’ based on all background factors, history, presence of signs and symptoms, physical/clinical examination findings, and so on.
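As a sketch of how prior probability and test characteristics combine (Python, not part of the original text; the prior of 0.50 and the specificity of 0.80 are assumed purely for illustration), Bayes' rule gives the probability of disease remaining after a negative treadmill test.

```python
def post_test_prob_negative(prior, sensitivity, specificity):
    """Probability of disease after a negative test, by Bayes' rule."""
    p_neg_given_disease = 1 - sensitivity          # false negative rate
    p_neg_given_no_disease = specificity           # true negative rate
    numerator = p_neg_given_disease * prior
    return numerator / (numerator + p_neg_given_no_disease * (1 - prior))

# Assumed figures: prior probability 0.50 from history and risk factors,
# treadmill sensitivity 0.75 (or 0.55), specificity assumed to be 0.80.
for sens in (0.75, 0.55):
    print(sens, round(post_test_prob_negative(0.50, sens, 0.80), 2))
# Even after a negative test the probability of disease remains roughly 0.2 to 0.4,
# so a negative result alone cannot rule out heart disease in such a patient.
```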
 
QUALITY CONTROL IN MEDICAL AND HEALTH CARE
While variabilities are endogenous to medical care systems, errors are exogenous. Blood pressure readings vary from individual to individual and in the same individual from time to time. Different observers can report different readings in the same person at the same time, and even a standardized manometer can give variable readings. These intra- and inter-individual variations and observer, instrument and technique differences are endogenous to the system and occur despite the best of care. They can only be minimized, not eliminated. Statistical methods help in detecting trends despite such variations, and the conclusions so drawn work in a large percentage of future cases.
Errors are distinct from variabilities. Clinicians may genuinely differ on the dosage of a drug to be given to a specific patient, but giving it two times a day when the prescription is three times a day is an error. Admitting a patient who could be managed in the OPD, or vice versa, is an error. Infection developing in a patient after admission to a hospital indicates a lapse on the part of the hospital system. Quality control relates to the control of such errors. These are exogenous to the system and can in theory be eliminated. Experience, however, suggests that in practice these too can only be controlled and can seldom be completely eliminated.
Quality control is a welcome term in the industrial sector, but the health sector, particularly in India, is yet to be sensitized to this concept. Medical care is a big industry in many parts of the world, and the same standards apply in this industry as in others. Providing good care is appreciated as a prerequisite for maintaining and increasing clientele, and perhaps profits too. In an activity which deals with the life and health of people, the concern, however, is with the humanitarian aspects of doing one's best to prolong life and reduce suffering. In this context, the quality of medical care assumes much greater importance than in other industries. In India too, medical care is slowly but surely moving into private hands and quality concerns are becoming prominent. This has been accentuated by the Consumer Protection Act, which now brings paid medical care under its purview. It is sometimes felt, wrongly, that ‘commercialization of this noble profession’ is due to this fact; in practice it mainly leads to greater quality consciousness. In the public sector also, where the services are mostly free, the awareness of the management is increasing and patients are becoming more demanding. Quality is thus gaining importance in all spheres of medical care.
Emphasis in industry is on the ‘process’; the product is merely an outcome of that process. By analogy, a patient's condition is largely an outcome of the process of patient care, and so statistical quality control techniques should be useful in improving patient care. The fourteen principles affecting quality (along with the ‘deadly diseases’ he described), given by Dr W Edwards Deming, an American physicist who began teaching the science of Statistical Quality Control (SQC) to the top management of Japanese industry in the early fifties, and whose teaching resulted in high-quality, low-cost Japanese industrial products, can easily be adopted here.
 
ADVERSE PATIENT OUTCOMES
A patient generally has to pass through a web of facilities to get treatment. He has to describe complaints and respond to questions on the history of the problem, on environmental exposures, on the extent and type of disabilities, etc. The attending clinician asks questions geared to the requirements of the situation in terms of the condition of the patient, his intellectual level and the type of complaints. He also examines the patient as required to detect signs, and obtains measurements such as body temperature, pulse rate, blood pressure and weight. An interim therapy is sometimes started. Then laboratory and radiological investigations are ordered if considered desirable by the clinician. The laboratory and the radiological units carry out the investigations and report the findings. The clinician reassesses, and sometimes the cycle restarts. Outdoor patients get their supply of medicines from a pharmacy as per the prescription and administer them without supervision; indoor patients are mostly administered the prescribed dosages by the nursing staff. A hospital does all this for a large number of patients day after day. It is unrealistic to expect that all steps will be done correctly in all cases all the time. Errors do occur, and they result in unexpected adverse outcomes in some patients. Some adverse outcomes go unnoticed and not much can be done about them. Nevertheless, some outcomes can be monitored. The following is a list of such outcomes; each indicates that an error has occurred somewhere in patient management, including in the clinician's judgment of the patient's condition.
  1. Admission due to adverse results of outpatient management.
  2. Admission for complications of problem on previous hospital admission.
  3. Operations for perforation, laceration or injury of an organ incurred during invasive procedure.
  4. Adverse reaction of drug or of transfusion.
  5. Unplanned return to the operation theatre.
  6. Infection developing subsequent to admission.
  7. Transfer from general care to special care unit.
  8. Other complications occurring after the start of the therapy.
It is preferable to keep a separate track of each type of adverse outcome so that corrective steps can be easily taken. A hospital has to set a limit within which these will be tolerated. It is reasonable to distinguish adverse patient outcomes which arise from incomplete medical knowledge or from unavailability of equipment from those which arise from carelessness or incompetence. There may be a tendency to classify most adverse outcomes as the former. A responsible management would check such tendencies and ensure that such misclassification remains minimal. As a matter of fact, any adverse outcome is undesirable and everything must be done to avoid a recurrence; practically, though, these can only be minimized and not eliminated. This requires that all outcomes be monitored, but resources may not allow monitoring of all patients. A sampling scheme can be used so that a truly representative group of patients is monitored with regard to the outcome. Statistical monitoring can then be done, in which the system is declared to have lost quality if the level of adverse outcomes exceeds a statistically determined limit. This limit is set in relation to a realistic target.
If the target is that not more than 1% (p = 0.01) of admitted patients have an adverse outcome, and a random sample of n = 200 admitted patients is examined, then the upper tolerance limit is {np + 1.645 √[np(1 − p)]} = {200 × 0.01 + 1.645 √[200 × 0.01 × 0.99]} = 4.3 in this case. Thus an adverse outcome in 5 or more patients out of the 200 sampled would indicate that the patient management system is not under control for adverse outcomes with respect to the target of 1%. Note that this limit corresponds to a one-sided 95% confidence interval; a lower limit is not required in this case.
It is customary in industry to monitor quality with the help of a control chart. This chart contains lines corresponding to the values of np and np + 1.645 √[np(1 − p)]. The number of adverse outcomes in a random sample of n patients is plotted on this chart periodically. For the hypothetical data, this chart would look as shown in Figure 10.5. Any point above the upper limit (sample 4) indicates a loss of quality against the fixed target. Investigation is required at that stage to find out why such a high number of patients had adverse outcomes. More often than not, an assignable cause will be found which can be eliminated or minimized.
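A minimal sketch of the limit calculation (Python, not part of the original text), applied to the two hypothetical targets used in this chapter: adverse outcomes with p = 0.01 in samples of 200 admitted patients, and medication errors with p = 0.001 in 1600 daily opportunities (discussed in the next section).

```python
from math import sqrt

def np_chart_limits(n, p, z=1.645):
    """Centre line and one-sided upper limit for an np control chart."""
    centre = n * p
    upper = centre + z * sqrt(n * p * (1 - p))
    return centre, upper

print(np_chart_limits(200, 0.01))     # adverse outcomes: centre 2.0, upper about 4.3
print(np_chart_limits(1600, 0.001))   # medication errors: centre 1.6, upper about 3.7
# A sampled count above the upper limit signals loss of quality against the target.
```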
 
Medication Errors
A patient suffering from a disease whose therapy is known should ultimately be cured. This may, however, not happen in some cases because of carelessness on the part of the patient in following the prescribed regimen. But if a patient remains uncured despite fully following the advice, this must be attributed to an error somewhere in patient management.
Fig. 10.5: Typical control chart
Either the attending clinician lacks the required expertise, or the tools, equipment, drugs or surgical procedures are faulty. Some of these faults may result in adverse outcomes, which can be controlled by the methods described earlier, but some may not. There may also be deviations from the prescription, called medication errors, which may or may not result in a known adverse outcome. In quality control this type of error is known as non-conformance. In a hospital it could take the form of administration of doses at times other than those prescribed, in the wrong amount, omission of a dose, giving an extra dose, or administration of an unprescribed drug. It is the responsibility of the management to develop and install a system to monitor all such errors.
Since the incidence of medication errors varies from hospital to hospital, and within a hospital from ward to ward and from time to time, there is a need to account for these variations. Statistical methods are equipped precisely to handle such variations. The incidence of errors is generally computed in terms of a rate per 100 opportunities. Depending upon the load of patients, the availability of staff and the keenness of the hospital to provide quality care, an average proportion p of errors is fixed. If this rate is fixed at 1 per 1000 opportunities, then p = 0.001. If there are on average n such opportunities per day in a hospital, then the upper limit of tolerance is again {np + 1.645 √[np(1 − p)]}. In a 500-bed hospital with 80% occupancy, at the rate of 4 opportunities per patient per day, the number of opportunities for medication errors per day is n = 1600. If p is 0.001, the upper limit of tolerance is {1600 × 0.001 + 1.645 √[1600 × 0.001 × 0.999]}, which equals 3.7. If the errors found on any particular day number 4 or more, then the cause must be investigated and remedial steps taken. It is likely that one kind of error (e.g. medication not given at the specified time intervals) predominates; if so, remedial steps on this alone can substantially reduce the errors.
Regular and full monitoring of medication errors could be difficult. In that case, patients in randomly selected wards could be surveyed on random days to find medication errors, if any. The tolerance limit will change depending upon the number of opportunities (n) surveyed. A control chart of the type shown in Figure 10.5 should be drawn for medication errors also. These could be of immense help in monitoring the quality. Similar charts are used in the medical laboratory also to maintain the quality of services.
 
OTHER AIDS TO MINIMIZE CLINICAL ERRORS
Patient management is admittedly an art in which the uniqueness of the individual, the skill of the clinician and the interaction of the two play an extremely important role. Yet help is regularly taken from tools such as laboratory investigations and imaging to narrow down the spectrum of possible diagnoses and to monitor improvement, so much so that the gold standard, or foolproof evidence, wherever it exists, is mostly in terms of investigation results rather than signs and symptoms. In the context of quality of medical care, these tools are now essential aids to minimizing errors. In an age of increased emphasis on quantitative thinking, some new tools can be devised and proposed to assist the clinician in minimizing errors. Following is a brief discussion of two such tools: the first is the etiology diagram and the second is the computer-based expert system. Each can be an effective tool for improving the quality of medical care.
The Ishikawa diagram, named after its original proposer, presents a visual picture of the interaction of causes producing an effect. In the context of medicine, the causes could be various etiological factors, and a simple version could be called an etiology diagram. One such diagram is Figure 10.6, which is for myocardial infarction (MI). The process leading to MI is intricate and its representation in this figure is necessarily simplified; it should be considered only a model, because a model is by definition a simplified version of a complex process. The figure tries to show the direction in which predisposing factors influence the outcome. It is essentially a hypothesis, since the exact web of causation is perhaps not fully known. We are tracing the path to MI in this figure, and any reverse relationship, though valid, is not important for this purpose. Despite its simplicity, the diagram may alert an unsuspecting clinician to look specifically for the underlying factors before assessing risk or giving advice.
Computer-based expert systems can sometimes provide considerable help in reducing errors in medical decisions. These systems, as a first step, yield the most probable diagnosis or the spectrum of diagnoses on the basis of signs, symptoms, history and investigation reports.
Fig. 10.6: Suggested etiology diagram for myocardial infarction
Note that expert systems are thus statistical in nature. Sometimes the expert system will alert the clinician to a missing component of the history, and sometimes it may prompt for additional investigations. It may also suggest the alternatives available for managing a patient. The advantage of an expert system is that it can process a large amount of information without error, as per the programme and knowledge base given to it by a group of experts after considerable consultation among themselves. An individual clinician may not know as much, and his memory may fail to recall all that is known at the moment the patient is in front of him. But such systems have limitations too. First, a system is only as good as the ‘experts’ whose knowledge has gone into it; real experts are rare, particularly those willing to collaborate in developing a computer-based system, so many such systems remain of doubtful value. Second, irrespective of the ‘expertise’ of the system, it is seldom able to think as critically as the human mind can, and it can never be helpful in a situation that was not anticipated and incorporated at the time of its preparation. Thus expert systems should be used with extreme caution. They must be considered only as an aid and not as a guide: the decision at each step has to be entirely that of the attending clinician, and it is up to him to agree or disagree with the suggestions of the system. Such a system can act at most as a supplement.
 
Quality of Measurements
Even the most sophisticated statistical treatment of data cannot lead to correct conclusions if the data are basically wrong. Before analysis is done, everything possible should be done to ensure that the information is correct. This mostly depends on the validity and reliability of the tools and instruments used for collection of information and on the quality of data actually obtained through these instruments.
In a clinical setup, health is assessed by eliciting the history, conducting a physical examination, carrying out laboratory or radiological investigations, and sometimes longitudinally monitoring the progress of the subject. This requires the use of a questionnaire or schedule for systematically eliciting and recording information; medical instruments such as the stethoscope, sphygmomanometer and weighing machine; laboratory chemicals, reagents and methods, including machines such as analyzers; X-ray and other imaging machines and all that goes along with them; diagnostic criteria and definitions, including scoring systems if any; and prognostic indicators. All of these are generically instruments in a statistical sense, because it is through them that the data are obtained. Various procedures for patient management, such as surgery and treatment regimens, also come under this general term. Under certain conditions these can also be called “tests” rather than instruments. In a practical sense no instrument is perfect, because the performance of the instrument also depends, to a degree, on the human input. The quality of medical decisions depends very substantially on their dependability in actual use. This is assessed in terms of the validity and reliability of the instruments.
  • The most elegant design of a clinical study will not overcome the damage caused by unreliable or imprecise measurement. Measurement error is one of the major sources of bias in epidemiological studies. Important sources of measurement error are: errors in the design of the instrument (for example, phrasing of questions that leads to misunderstanding or bias), omissions in the protocol for the use of the instrument (for example, failure to specify a method to handle unanticipated situations consistently), poor execution of the study protocol (for example, failure of data collectors to follow the protocol in the same manner for all subjects), limitations due to subject characteristics (for example, poor recall, or the tendency of subjects to over-report socially desirable behaviors and under-report socially undesirable behaviors), and errors during data entry or analysis.
  • Systematic quality control procedures before, during, and after data collection are necessary to identify and correct measurement errors. Careful design of the data collection forms, complete documentation of study procedures, and pre-testing of the data collection instrument will eliminate some sources of error. The training and monitoring of data collectors should emphasize uniform execution of the study protocol. Review of the completed data forms and checks on the computerized data will uncover data items that need clarification or correction.
 
Quality of Data
Valid and reliable instruments can give erroneous results when not used with sufficient care. Reliability is concerned with whether the instrument (any data collection procedure) will produce the same result when administered repeatedly to an individual. Validity is concerned with whether the instrument (any data collection procedure) is actually measuring what it purports to measure. The following simple diagram should make the idea clear (Fig. 10.7).
Fig. 10.7: Target value
O: Values indicating reliability but no validity
X: Values indicating validity but no reliability
Sometimes there exists a so-called ‘gold standard’ against which the new instrument (any data collection procedure) is to be compared. The new instrument may be simpler to apply, or cheaper, and in a particular application one might wish to use it in place of the gold standard for these reasons. The validation procedure simply involves comparing the results from the two methods, and has been called ‘correlational or criterion’ validity. For example, patients could be classified as depressed by an expert psychiatrist, and the results compared with those predicted by a self-administered questionnaire. Often, however, a standard method does not exist and evidence is accumulated from a number of procedures, in what is known as ‘construct’ validation. ‘Content or face’ validity basically means that the instrument defines the condition, and that the questions it asks are reasonable within the context and adequately cover the area they are supposed to measure.
In addition, some errors creep in inadvertently through ignorance. Sometimes data are deliberately manipulated to support a particular viewpoint. Errors in measurement can arise from several factors, a few of which are listed below.
 
Lack of Standardization in Definitions
If it is not decided beforehand that an eye will be called blind when VA<3/60 or VA<1/60, then different observers may use different definitions. When such inconsistent data are merged, an otherwise clear signal from the data may fail to emerge, leading to a wrong conclusion.
 
Lack of Care in Obtaining or Recording Information
This can happen when, for example, sufficient attention is not paid to the appearance of the Korotkoff sounds while measuring BP, or to the waves appearing on a monitor for a patient in a critical condition. It can also happen when responses from patients are accepted without probing, even though some of them may not be consistent with the responses obtained on other items.
 
Inability of the Observer to Gain the Confidence of the Respondent
This can be due to language or intellectual barriers if the subject and observer come from widely different backgrounds. They may then not understand each other and generate wrong data. Correct information can be obtained only when the observer enjoys full confidence of the respondent.
 
Bias of the Observer
Some agencies deliberately under-report starvation deaths as a face-saving device, and over-report deaths due to calamities such as floods, cyclones and earthquakes to attract funds and sympathy. Improvement in the condition of a patient for reasons other than therapy can be wrongly ascribed to the therapy.
 
Variable Competence of the Observer
More often than not, an investigation is a collaborative effort involving several observers. Not all observers have the same competence or the same skill. Even assuming that each one works to his fullest capability, faithfully following the definitions and protocol, variation can still occur in measurement and in the assessment of diagnosis and prognosis, because one observer may collate the spectrum of available evidence with different acumen from the others.
Many inadvertent errors can be avoided by giving the observers adequate training in the standard methodology proposed for collection of the data and by adhering to the protocol as outlined in the instruction sheet. Intentional errors are, however, nearly impossible to handle and can remain unknown until they expose themselves. One approach is to be vigilant regarding the possibility of such errors and to deal with them when they come to notice. If these errors are noticed before the publication stage, steps can sometimes be taken to correct the data. If correction is not possible, the biased data may have to be excluded altogether from the analysis and conclusions. It is sometimes believed that bad data are better than none at all; this can be true only if sufficient care is exercised to ensure that the effect of the bias in such data on the conclusions has been minimized, if not eliminated.
 
Statistical Fallacies
Statistics show that the risk of dying from an illness while in hospital is many times greater than the risk of dying from an illness while at home. By the same token, there is a strong association between dying and being in bed. While most people may not be able to identify and explain precisely the logical fallacies implied by this type of statistical reasoning, the nonsense of it is sufficiently apparent that no one is likely to avoid beds or hospitals in order to prolong his life. Yet the same sort of statistics too often finds its way into the general health literature and the mass media, where the errors of logic are less apparent and so pass unrecognized.
It is rightly said, “He who accepts statistics indiscriminately will often be duped unnecessarily. But he who distrusts statistics indiscriminately will often be ignorant unnecessarily.” The science of statistics should not be condemned because it can be abused (fallacies designed to mislead) or misused (fallacies committed unintentionally), for the fault lies not with statistics as such but with the user of the subject. For example, if a child cuts his finger with a sharp knife, it is not the knife that we should blame. If a person takes the wrong medicine, or an excessive dose of a medicine, and dies, we cannot blame the medicine as such. Statistics are very much like clay, of which one can make a god or a devil as one pleases.
Every day we see misuses of statistics which affect the outcomes of elections, change public policy, win arguments, get readers for newspapers, impress readers, support prejudices, inflame hatreds, provoke fears, sell products, and so on. It is common to sneer at the subject and say, “You can make statistics say anything” (lies, damned lies and statistics). It is only through abuse that you can make statistics say “anything”; good statistics tell only the truth. Many examples of misuse involve several offences at once; such multiple misuses can be called mega-misuses. In fact there is no limit to the kinds of misuse, and new ones are being committed daily. A few of them are discussed below.
 
Volunteer Samples or Self-selection of Patients
A very frequent source of error in the assessment of a vaccine (or even a therapy) lies in the comparison of persons who volunteer or choose to be inoculated (or volunteer on behalf of their children) with those who do not. The volunteers may tend to come from different age groups, they may include more or fewer males than females, and they may be drawn more from one social class than another.
The volunteers for inoculation may be persons more careful of their health, more aware of the presence of an epidemic in the community, and they may on such occasions take other steps to avoid infection. One survey in the early days of vaccination against poliomyelitis showed that the vaccinated children came more often from mothers with a high standard of education. Such mothers may also endeavor to protect their children from infection in other ways. They may have smaller families, so that the inoculated child is automatically less exposed to infection from siblings. In one pioneering trial of vaccines against whooping cough it was found that 47% of the inoculated children were only children (i.e. with no siblings), compared with 20% of the control children.
An interesting example of the effects of self-selection by patients is found in the statistics of migraine. It has often been stated that migraine patients are more intelligent than average, and a study of patients seen in general practice showed that the prevalence was relatively high in the professional class. On the other hand, an epidemiological study of a random sample of a whole population showed no such relationship with intelligence or social class. The clinical evidence may well be true: migraine patients who attend their doctors are more intelligent and of higher social class. But this is the result of their behaviour, their own self-selection in seeking medical aid. We cannot safely project observations made on the patients seen to the population of patients as a whole, seen or not seen.
 
Absence of Exposed to Risk or Standard of Comparison
From clinical records compiled at a hospital (or in general practice) many characteristics of sick persons can be studied: age, sex, religion or caste, occupation, family and personal history, etc. From these records much can be learned about the symptoms, the pathology, the course and the prognosis of the disease, either in the total group of sick persons or in subgroups. But in the absence of knowledge of the population exposed to risk, very little can be learned of its epidemiology.
We have the numerators of the fractions but no denominators, and proportional rates are usually quite inadequate substitutes. For example, the fact that 30% of certain cases are aged, say, between 40 and 50 years may mean that this age group is particularly vulnerable to the disease in question; but it may mean merely that 30% of the population from which the hospital draws its patients is aged 40 to 50. Suppose an enquiry is made into the home conditions of each infant dying in the first year of life in a certain area over a selected period of time, and it is found that 20% of these infants lived under unsatisfactory housing conditions. Do such conditions, or factors associated with them, lead to a high rate of infant mortality? We need information on the proportion of all infants born in that area over that period of observation who live under unsatisfactory housing conditions. If 20% of all infants live under such conditions, then 20% of the deaths may reasonably be expected from those houses, and unsatisfactory housing appears unimportant. In a study of road accidents, it was found that, of the truck drivers involved in accidents, 75% had consumed alcohol during the hours preceding the accident and 25% had not. Before this ratio of three drinkers to one non-drinker amongst the accident cases can be interpreted, information is also required on the comparable ratio amongst drivers not involved in accidents.
For example, suppose there were 1000 drivers on the road and 48 accidents were recorded. Of the 48 drivers involved in these accidents, three-quarters (i.e. 36) were found to have consumed alcohol and one-quarter (i.e. 12) had not. If three-quarters of all 1000 drivers had consumed alcohol within a few hours of driving and one-quarter had not, the populations exposed to the risk of accident were 750 and 250. The accident rates were then identical, namely 36 in 750 and 12 in 250, or 4.8% in each group.
Knowledge of the population exposed to risk, or at least of the ratio of alcohol consumers to non-consumers in a random sample of all truck drivers, is essential before conclusions can be drawn from the ratio in the accident cases. Similarly, from a finding that only 20% of accidents were caused by female drivers, can it be concluded that females drive more carefully?
In a study of peptic ulcer patients, it was found that 55% belonged to blood group ‘O’, 25% to blood group ‘A’, 15% to ‘B’ and 5% to blood group ‘AB’. Then, will it be correct to say that “The person belonging to blood group ‘O’ is more likely to develop peptic ulcer than the person belonging to any other blood group”?
 
MIXING OF NON COMPARABLE GROUPS OR RECORDS
In a clinical trial, the number of deaths observed among the 120 individuals (80 males + 40 females) given a new treatment was 32, while the number of deaths observed amongst the 90 individuals (30 males + 60 females) taken as controls was 30. It is known that for this particular disease the fatality rate is twice as high among females as among males. The fatality rate in the treatment group is 26.7% (32/120 × 100) and in the control group 33.3% (30/90 × 100). This comparison suggests that the new treatment is of some value. But the fatality rates for the total numbers of individuals must be influenced by the proportions of the two sexes present in each sample, since the fatality rates are so different for males and females. We must take into account the fact that males and females are not equally represented in the sample treated and in the sample taken as control.
If the fatality rates among males and females are 20% and 40% respectively (suppose this is known from the literature), then the calculations displayed in Table 10.7 make the fallacy clear.
Table 10.7   Fatality rates for males and females (hypothetical data)

Group              Item              Males     Females    Combined
Treatment group    Number              80         40         120
                   No. of deaths       16 (E)     16 (E)      32
                   Fatality rate       20%        40%        26.7%
Control group      Number              30         60          90
                   No. of deaths        6 (E)     24 (E)      30
                   Fatality rate       20%        40%        33.3%
E = Expected
The comparison of like with like, i.e. males with males and females with females, shows that the treatment was of no value, since the fatality rates of the treated and untreated sex groups are identical (and in fact equal to the usual rates).
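A minimal sketch of the same comparison (Python, not part of the original text): crude fatality rates are contrasted with rates directly standardized to a common sex distribution, which removes the spurious advantage created by the differing sex composition of the two groups.

```python
# Numbers and deaths by sex in each group (Table 10.7)
treatment = {"male": (80, 16), "female": (40, 16)}   # (number, deaths)
control   = {"male": (30, 6),  "female": (60, 24)}

def crude_rate(group):
    deaths = sum(d for _, d in group.values())
    n = sum(n for n, _ in group.values())
    return deaths / n

def standardized_rate(group, standard):
    """Directly standardize the group's sex-specific rates to a standard population."""
    total = sum(standard.values())
    return sum((d / n) * standard[sex] / total for sex, (n, d) in group.items())

standard = {"male": 110, "female": 100}   # combined sex distribution of both groups
for name, group in [("treatment", treatment), ("control", control)]:
    print(name, f"crude {crude_rate(group):.1%}",
          f"standardized {standardized_rate(group, standard):.1%}")
# Crude rates: 26.7% vs 33.3%; standardized rates: both about 29.5%,
# showing the apparent benefit of treatment was due only to the sex mix.
```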
 
Ecological Fallacy
The ecological fallacy can be defined as a logical fallacy inherent in making causal inferences about individual behaviour from group data. For example, suppose that a comparison between countries A and B reveals that there is a higher frequency of myopia (near-sightedness) in A than in B, and that a larger proportion of the population of A than of B watches television. This leads to a suspicion that television viewing and myopia are associated. However, it is possible that this suspected relationship between watching television and myopia is actually the result of another characteristic that is more prevalent in country A than in country B. The greater percentage of people who watch television in country A may reflect, say, the higher economic status of that country.
This in turn may affect another characteristic of the population, such as the education system (involving a lot of reading), which is causally related to myopia. Treating the non-causal relationship between television viewing and myopia as causal is the ecological fallacy. Another very often quoted example of the ecological fallacy is this: it was found that, on average, provinces with greater proportions of Protestants had higher suicide rates and provinces with greater proportions of Catholics had lower suicide rates, and it was concluded from these data that Protestants are more likely to commit suicide than Catholics.
The causal inference is not logically correct, because it may have been Catholics in predominantly Protestant provinces who were taking their own lives. (It may be that the suicide rate was higher among Catholics because of their minority-group status in such places; but even such an alternative hypothesis, drawn from the same group data, constitutes an ecological fallacy.) This logical flaw, called the ecological fallacy, results from making a causal inference about an individual phenomenon or process (e.g. suicide) on the basis of observations of groups.
It is seen that the correlation coefficient between two individual-level variables is generally not the same as that between the same variables for the aggregates into which the individuals are grouped. As a result of the grouping operation, one may have controlled for the effects of other variables, making the ecological estimate less biased than the individual estimate; or one may have included various confounding variables, making the ecological-level correlation more biased.
Confounding is considered a particularly egregious fault in ecological studies; indeed, the ecological fallacy is often defined as a problem of confounding. But sometimes the ecological fallacy arises because the aggregated variable measures a different construct from its namesake at the individual level. For example, a hung jury is a jury that is indecisive: it cannot decide whether the accused is guilty or innocent. However, to deduce that the members of such a jury are indecisive would be absurd. Members of a hung jury are very decisive, so much so that they cannot be persuaded to change their minds. Attributing to the members of the group a characteristic of the group (indecisiveness) is thus a case of the ecological fallacy.
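A small simulation (Python with numpy, not part of the original text; all numbers are invented) illustrates how group-level and individual-level correlations can differ: within every group the two variables are unrelated, yet the correlation computed on the group means is close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
group_levels = np.linspace(0, 9, 10)       # 10 groups (say, provinces) with different baselines

within_r, x_means, y_means = [], [], []
for m in group_levels:
    # Within each group the two variables are generated independently,
    # so the individual-level association inside any group is essentially zero.
    x = m + rng.normal(0, 1, 200)
    y = m + rng.normal(0, 1, 200)
    within_r.append(np.corrcoef(x, y)[0, 1])
    x_means.append(x.mean())
    y_means.append(y.mean())

print(f"average within-group r = {np.mean(within_r):.2f}")                       # about 0
print(f"ecological r between group means = {np.corrcoef(x_means, y_means)[0, 1]:.2f}")  # about 1
# Inferring from the group-level correlation that individuals with higher x
# also have higher y would be an ecological fallacy.
```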

Appendix

A-1: Areas for the Standard Normal Distribution
(The tabulated value is the upper-tail area P(Z ≥ z), where z is the row value plus the column heading.)

Z      0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0    0.5000  0.4960  0.4920  0.4880  0.4840  0.4801  0.4761  0.4721  0.4681  0.4641
0.1    0.4602  0.4562  0.4522  0.4483  0.4443  0.4404  0.4364  0.4325  0.4286  0.4247
0.2    0.4207  0.4168  0.4129  0.4090  0.4052  0.4013  0.3974  0.3936  0.3897  0.3859
0.3    0.3821  0.3783  0.3745  0.3707  0.3669  0.3632  0.3594  0.3557  0.3520  0.3483
0.4    0.3446  0.3409  0.3372  0.3336  0.3300  0.3264  0.3228  0.3192  0.3156  0.3121
0.5    0.3085  0.3050  0.3015  0.2981  0.2946  0.2912  0.2877  0.2843  0.2810  0.2776
0.6    0.2743  0.2709  0.2676  0.2643  0.2611  0.2578  0.2546  0.2514  0.2483  0.2451
0.7    0.2420  0.2389  0.2358  0.2327  0.2296  0.2266  0.2236  0.2206  0.2177  0.2148
0.8    0.2119  0.2090  0.2061  0.2033  0.2005  0.1977  0.1949  0.1922  0.1894  0.1867
0.9    0.1841  0.1814  0.1788  0.1762  0.1736  0.1711  0.1685  0.1660  0.1635  0.1611
1.0    0.1587  0.1562  0.1539  0.1515  0.1492  0.1469  0.1445  0.1423  0.1401  0.1379
1.1    0.1357  0.1335  0.1314  0.1292  0.1271  0.1251  0.1230  0.1210  0.1190  0.1170
1.2    0.1151  0.1131  0.1112  0.1093  0.1075  0.1056  0.1038  0.1020  0.1003  0.0985
1.3    0.0968  0.0951  0.0934  0.0918  0.0901  0.0885  0.0869  0.0853  0.0838  0.0823
1.4    0.0808  0.0793  0.0778  0.0764  0.0749  0.0735  0.0721  0.0708  0.0694  0.0681
1.5    0.0668  0.0655  0.0643  0.0630  0.0618  0.0606  0.0594  0.0582  0.0571  0.0559
1.6    0.0548  0.0537  0.0526  0.0516  0.0505  0.0495  0.0485  0.0475  0.0465  0.0455
1.7    0.0446  0.0436  0.0427  0.0418  0.0409  0.0401  0.0392  0.0384  0.0375  0.0367
1.8    0.0359  0.0351  0.0344  0.0336  0.0329  0.0322  0.0314  0.0307  0.0301  0.0294
1.9    0.0287  0.0281  0.0274  0.0268  0.0262  0.0256  0.0250  0.0244  0.0239  0.0233
2.0    0.0228  0.0222  0.0217  0.0212  0.0207  0.0202  0.0197  0.0192  0.0188  0.0183
2.1    0.0179  0.0174  0.0170  0.0166  0.0162  0.0158  0.0154  0.0150  0.0146  0.0143
2.2    0.0139  0.0136  0.0132  0.0129  0.0125  0.0122  0.0119  0.0116  0.0113  0.0110
2.3    0.0107  0.0104  0.0102  0.0099  0.0096  0.0094  0.0091  0.0089  0.0087  0.0084
2.4    0.0082  0.0080  0.0078  0.0075  0.0073  0.0071  0.0069  0.0068  0.0066  0.0064
2.5    0.0062  0.0060  0.0059  0.0057  0.0055  0.0054  0.0052  0.0051  0.0049  0.0048
2.6    0.0047  0.0045  0.0044  0.0043  0.0041  0.0040  0.0039  0.0038  0.0037  0.0036
2.7    0.0035  0.0034  0.0033  0.0032  0.0031  0.0030  0.0029  0.0028  0.0027  0.0026
2.8    0.0026  0.0025  0.0024  0.0023  0.0023  0.0022  0.0021  0.0021  0.0020  0.0019
2.9    0.0019  0.0018  0.0018  0.0017  0.0016  0.0016  0.0015  0.0015  0.0014  0.0014
3.0    0.0013  0.0013  0.0013  0.0012  0.0012  0.0011  0.0011  0.0011  0.0010  0.0010
For z ≥ 3.10, the probability P(Z ≥ z) is less than one in 1000 and can be taken as almost zero for most practical applications
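For readers who work with software rather than printed tables, the tabulated areas can be reproduced directly; a minimal sketch (Python with scipy, not part of the original text):

```python
from scipy.stats import norm

for z in (0.0, 1.0, 1.96, 3.1):
    print(f"P(Z >= {z}) = {norm.sf(z):.4f}")   # e.g. P(Z >= 1.96) = 0.0250
```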
A-2: Critical Values of Student's ‘t’
(The tabulated value is t such that the upper-tail probability P(T ≥ t) equals the column heading, for the given df.)

df      0.1      0.05     0.025     0.01      0.005
1       3.078    6.314    12.706    31.821    63.656
2       1.886    2.920     4.303     6.965     9.925
3       1.638    2.353     3.182     4.541     5.841
4       1.533    2.132     2.776     3.747     4.604
5       1.476    2.015     2.571     3.365     4.032
6       1.440    1.943     2.447     3.143     3.707
7       1.415    1.895     2.365     2.998     3.499
8       1.397    1.860     2.306     2.896     3.355
9       1.383    1.833     2.262     2.821     3.250
10      1.372    1.812     2.228     2.764     3.189
11      1.363    1.796     2.201     2.718     3.106
12      1.356    1.782     2.179     2.681     3.055
13      1.350    1.771     2.160     2.650     3.012
14      1.345    1.761     2.145     2.624     2.977
15      1.341    1.753     2.131     2.602     2.947
16      1.337    1.746     2.120     2.583     2.921
17      1.333    1.740     2.110     2.567     2.898
18      1.330    1.734     2.101     2.552     2.878
19      1.328    1.729     2.093     2.539     2.861
20      1.325    1.725     2.086     2.528     2.845
21      1.323    1.721     2.080     2.518     2.831
22      1.321    1.717     2.074     2.508     2.819
23      1.319    1.714     2.069     2.500     2.807
24      1.318    1.711     2.064     2.492     2.797
25      1.316    1.708     2.060     2.485     2.787
26      1.315    1.706     2.056     2.479     2.779
27      1.314    1.703     2.052     2.473     2.771
28      1.313    1.701     2.048     2.467     2.763
29      1.311    1.699     2.045     2.462     2.756
30      1.310    1.697     2.042     2.457     2.750
40      1.303    1.684     2.021     2.423     2.704
50      1.299    1.676     2.009     2.403     2.678
60      1.296    1.671     2.000     2.390     2.660
120     1.289    1.658     1.980     2.358     2.617
∞       1.282    1.645     1.960     2.327     2.576
A-3: Critical Values of Chi-square
df      α = 0.10    α = 0.05    α = 0.01
1         2.706       3.841       6.635
2         4.605       5.991       9.210
3         6.251       7.815      11.345
4         7.779       9.488      13.277
5         9.236      11.070      15.086
6        10.645      12.592      16.812
7        12.017      14.067      18.475
8        13.362      15.507      20.090
9        14.684      16.919      21.666
10       15.987      18.307      23.209
11       17.275      19.675      24.725
12       18.549      21.026      26.217
13       19.812      22.362      27.688
14       21.064      23.685      29.141
15       22.307      24.996      30.578
16       23.542      26.296      32.000
17       24.769      27.587      33.409
18       25.989      28.869      34.805
19       27.204      30.144      36.191
20       28.412      31.410      37.566
21       29.615      32.671      38.932
22       30.813      33.924      40.289
23       32.007      35.172      41.638
24       33.196      36.415      42.980
25       34.382      37.652      44.314
30       40.256      43.773      50.892
35       46.059      49.802      57.342
40       51.805      55.758      63.691
45       57.505      61.656      69.957
50       63.167      67.505      76.154
60       74.397      79.082      88.379
70       85.527      90.531     100.425
80       96.578     101.879     112.329
90      107.565     113.145     124.116
100     118.498     124.432     135.807
The value tabulated is c such that P(χ2 ≥ c) = α
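The same critical values can be generated in software when intermediate degrees of freedom are required. A minimal illustrative sketch, assuming SciPy is available (not part of the original appendix), is:

    from scipy.stats import chi2  # SciPy assumed to be installed

    # Critical value c such that P(chi-square >= c) = alpha, as in the table above.
    for df in (1, 10, 100):
        for alpha in (0.10, 0.05, 0.01):
            print(f"df={df}, alpha={alpha}: c = {chi2.isf(alpha, df):.3f}")
    # e.g. df=1, alpha=0.05 gives c = 3.841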
A-4: Critical Values of ‘F’

                     Numerator df (ν1)
Denominator
df (ν2)     1       2       3       4       5       6       7       8       9       10
1        161.45  199.50  215.71  224.58  230.16  233.99  236.77  238.88  240.54  241.88
2         18.51   19.00   19.16   19.25   19.30   19.33   19.35   19.37   19.38   19.40
3         10.13    9.55    9.28    9.12    9.01    8.94    8.89    8.85    8.81    8.79
4          7.71    6.94    6.59    6.39    6.26    6.16    6.09    6.04    6.00    5.96
5          6.61    5.79    5.41    5.19    5.05    4.95    4.88    4.82    4.77    4.74
6          5.99    5.14    4.76    4.53    4.39    4.28    4.21    4.15    4.10    4.06
7          5.59    4.74    4.35    4.12    3.97    3.87    3.79    3.73    3.68    3.64
8          5.32    4.46    4.07    3.84    3.69    3.58    3.50    3.44    3.39    3.35
9          5.12    4.26    3.86    3.63    3.48    3.37    3.29    3.23    3.18    3.14
10         4.96    4.10    3.71    3.48    3.33    3.22    3.14    3.07    3.02    2.98
11         4.84    3.98    3.59    3.36    3.20    3.09    3.01    2.95    2.90    2.85
12         4.75    3.89    3.49    3.26    3.11    3.00    2.91    2.85    2.80    2.75
13         4.67    3.81    3.41    3.18    3.03    2.92    2.83    2.77    2.71    2.67
14         4.60    3.74    3.34    3.11    2.96    2.85    2.76    2.70    2.65    2.60
15         4.54    3.68    3.29    3.06    2.90    2.79    2.71    2.64    2.59    2.54
16         4.49    3.63    3.24    3.01    2.85    2.74    2.66    2.59    2.54    2.49
17         4.45    3.59    3.20    2.96    2.81    2.70    2.61    2.55    2.49    2.45
18         4.41    3.55    3.16    2.93    2.77    2.66    2.58    2.51    2.46    2.41
19         4.38    3.52    3.13    2.90    2.74    2.63    2.54    2.48    2.42    2.38
20         4.35    3.49    3.10    2.87    2.71    2.60    2.51    2.45    2.39    2.35
21         4.32    3.47    3.07    2.84    2.68    2.57    2.49    2.42    2.37    2.32
22         4.30    3.44    3.05    2.82    2.66    2.55    2.46    2.40    2.34    2.30
23         4.28    3.42    3.03    2.80    2.64    2.53    2.44    2.37    2.32    2.27
24         4.26    3.40    3.01    2.78    2.62    2.51    2.42    2.36    2.30    2.25
25         4.24    3.39    2.99    2.76    2.60    2.49    2.40    2.34    2.28    2.24
26         4.23    3.37    2.98    2.74    2.59    2.47    2.39    2.32    2.27    2.22
27         4.21    3.35    2.96    2.73    2.57    2.46    2.37    2.31    2.25    2.20
28         4.20    3.34    2.95    2.71    2.56    2.45    2.36    2.29    2.24    2.19
29         4.18    3.33    2.93    2.70    2.55    2.43    2.35    2.28    2.22    2.18
30         4.17    3.32    2.92    2.69    2.53    2.42    2.33    2.27    2.21    2.16
40         4.08    3.23    2.84    2.61    2.45    2.34    2.25    2.18    2.12    2.08
60         4.00    3.15    2.76    2.53    2.37    2.25    2.17    2.10    2.04    1.99
120        3.92    3.07    2.68    2.45    2.29    2.18    2.09    2.02    1.96    1.91
∞          3.84    3.00    2.60    2.37    2.21    2.10    2.01    1.94    1.88    1.83
The value tabulated is c such that P(F ≥ c) = 0.05
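For numerator or denominator degrees of freedom outside this table, or for significance levels other than 0.05, the critical values can be computed directly. A minimal illustrative sketch, assuming SciPy is available (not part of the original appendix), is:

    from scipy.stats import f  # SciPy assumed to be installed

    # Critical value c such that P(F >= c) = 0.05, for numerator df (nu1) and denominator df (nu2).
    for nu2 in (2, 10, 30):
        for nu1 in (1, 5, 10):
            print(f"nu1={nu1}, nu2={nu2}: c = {f.isf(0.05, nu1, nu2):.2f}")
    # e.g. nu1=5, nu2=10 gives c = 3.33, matching the table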
Index

A
Adjusted mean/rate
Adverse patient outcomes
Age-specific fertility rates
Alternative hypothesis
Analysis of variance: one-way analysis; assumptions; calculations; variance table
Assessing causality

B
Balanced designs: with replication; without replication
Bayes’ rule
Biometry

C
Central limit theorem
Clinical disagreement
Clinical errors
Clinical life table
Clinical significance
Clinical trial
Community trials
Comparison of proportions: Chi-squared test for contingency tables (2 × 2 table, comparison with normal test, continuity correction, exact test for 2 × 2 tables, larger tables, validity); Chi-squared test for trend; Mantel-Haenszel Chi-squared test; McNemar's Chi-squared test; more complex techniques; significance (single proportion, two proportions, two proportions-paired case)
Complementary/alternative therapies
Compliance
Conditional probability: counting methods (combinations, factorials, permutations)
Confidence interval: difference; rates and proportions; π and µ
Confidence limits
Correlation
Crude birth rate
Crude death rate

E
Ecological fallacy
Epidemiology: analysis of epidemiologic studies; analytic study; blinding or masking (double, single, triple); descriptive studies (case report, case series, correlation, cross-sectional); experimental studies; measurement of risk; observational studies (case-control study, cohort study, intervention studies); placebo; randomization (block, simple, stratified)
Equivalence trials
Exploratory study

F
Fertility and reproduction

G
General fertility rate
Gross reproduction rate

H
Hospital statistics: bed occupancy ratio; duration of stay in health facility; turnover interval
Household survey
Hypothesis testing procedure
Hypothesis tests

I
Incidence

K
KAP studies

L
Law of probability: addition rule; multiplication rule
Legitimate births
Life expectancy
Life table
Linear regression: assumptions; prediction; significance test
Log linear models
Logistic regression

M
Measurement of risk
Measures of central tendency: mean; measure(s) of central tendency for grouped data; median; mode; other (geometric mean, harmonic mean, percentile, quartiles, quantiles, weighted mean/average)
Measures of variability: coefficient of variation; probability (classical or a-priori probability, equally likely, mutually exclusive); range; standard deviation
Medical biostatistics: uses
Medication errors
Morbidity
Mortality
Multicenter trials
Multiple regression: with discrete explanatory variables; with non-linear explanatory variables; with several variables (optimal combination regression, step-down regression, step-up regression); with two variables

N
Natural experiments
Net reproduction rate
Non-comparable group or records
Non-Gaussian distributions: advantages; disadvantages
Nonparametric statistical inference: Kruskal-Wallis test; multiple comparisons
Null hypothesis

O
Odds ratio

P
Political state
Population study
Power and OC curves
Presentation of data: graphical presentation (bar diagram, cumulative frequency polygon or ogive, frequency polygon, growth chart or road to health chart, histogram, line graph, pictogram, pie diagram, population pyramid, scatter diagram, spot or contour map); tabular presentation (frequency distribution table)
Prevalence
Preventive trials
Probability distributions: binomial; Gaussian or normal; Poisson
Program trials
Proportional mortality rate
P-value

Q
Quality control in medical and health care
Quality control methods
Quality of data
Quality of measurements

R
Receiver operating characteristic curves
Reliability

S
Sample set up: one sample; two sample
Sample size determination
Sampling: advantages; data collection tools (interview or self-administered questionnaire, observation); publication errors; design effect; methods (nonprobability, probability); pretests and pilot surveys; principal steps in a sample survey; probability sampling designs (cluster sampling, disadvantages, interpenetrating samples, inverse sampling, multiphase sampling, multistage sampling, quota sampling, simple random sampling, stratified sampling, systematic sampling); sample size (assumptions, general procedure of determining); sampling and nonsampling errors (coverage errors, observational errors, processing errors, response errors, selection bias); terminology
Sampling distributions
Sensitivity
Sequential trials
Significance test
Simple linear regression
Spearman's rank correlation coefficient: function; rationale
Special incidence rates: ascertainment corrected rates; attack rate; secondary attack rate
Specificity
Standard deviation
Standardization
Standardized mortality ratio
Statistical fallacies
Statistical inference
Statistical significance
Statistical testing
Statistics
Student's ‘t’ test
Survival
Survival functions

T
Tolerance interval
Total fertility rate
Two samples comparison: paired (large samples, ties, Wilcoxon's signed-rank test); unpaired (large samples, Mann-Whitney U test)
Two-samples ‘t’ test
Two-way analysis
Types of data: qualitative (nominal, ordinal); quantitative variables (interval, ratio)
Types of study designs: confounding factors; cross-over studies; matching; non-randomized (concurrent control studies, historical control studies); randomized

V
Vaccine trials
Validity

X
X-variables