Principles of Assessment in Medical Education Tejinder Singh, Anshu
Basics of Assessment

Tejinder Singh
It is said that ‘assessment is the tail that wags the curriculum dog’. While this statement amply underscores the importance of assessment in any system of education, it also cautions us about the pitfalls that can occur when assessment is improperly used. Students focus on learning what is asked in the examination. As teachers, we can exploit this potential of assessment to give a particular direction to student learning. Simply stated, it means that if we ask them questions related to factual recall, they will try to memorize facts; but if we frame questions requiring application of knowledge, they will learn to apply their knowledge. In this chapter, we are going to introduce the basic concepts related to assessment of students and how we can maximize the effect of assessment in giving a desired shape to learning. This chapter has been deliberately placed at the beginning of the book so that assessment methodologies described later can be understood better.
 
 
Basic terminology in assessment
Let us first clarify the terminology. You may have heard related terms like measurement, assessment, evaluation, etc. and seen them used interchangeably. There are subtle differences in these terms, and they need to be used in the right context with the right purpose, so that they convey the same meaning to everyone. Interestingly, these terms also tell a story about how educational testing has evolved over the years.
Measurement was the earliest technique used in educational testing. It meant assigning numbers to the competence exhibited by the students, e.g. marking a multiple choice question paper is a form of measurement. Since measurement is a physical term, it was presumed that it should be as precise and as accurate as possible. As a corollary, it also implied that anything which could not be measured should not form part of the assessment package. The entire emphasis was on objectivity and providing standard conditions so that the results represented only student learning (also called true score) and nothing else.
While such an approach may have been appropriate to measure physical properties (weight, length, temperature, etc.), it certainly did not capture the essence of educational attainments. There are a number of qualities which we want our students to develop but which are not amenable to accurate measurement. Can you think of some of them? You may have rightly thought of communication, ethics, professionalism, etc. which are as important as other skills and competencies, but which cannot be precisely measured.
 
 
What do we mean by assessment?
The term ‘assessment’ has come to represent a much broader picture. It includes some attributes which can be measured precisely and also others which cannot be measured precisely (Linn and Miller, 2005). So you objectively measure some aspects, subjectively interpret others and then form a judgment about the level of student achievement. In a way, all assessment can be said to be measurement if we consider that even qualitative data is ‘measured’ using nominal and ordinal scales. However, viewing assessment as a combination of measurement and non-measurement gives a better perspective from the teachers’ point of view. A number of experts favor this approach, defining assessment ‘as any formal or purported action to obtain information about the competence and performance of a student’ (van der Vleuten and Schuwirth, 2010). We would like to clarify here that though both the terms assessment and evaluation involve passing a value judgment on learning, traditionally the term ‘assessment’ is always used in the context of student learning. Evaluation, on the other hand, is used in the context of the educational programs. Assessment of students is a very important input (though not the only one) to judge the value of an educational program.
Let us also clarify some more terms that are often loosely used in the context of student assessment. ‘Test’ and ‘tool’ are two such terms. Conventionally, a ‘test’ generally refers to a written instrument which is used to assess learning. Tests can be paper-pencil based or computer based. On the other hand, a ‘tool’ refers to something used to observe skills or behaviour to assess the extent of learning. Objective structured clinical examination (OSCE) checklists and rating scales are examples of assessment tools.
 
Why do we need to assess students?
The conventional answer to this question is: so that we can categorize them as ‘pass’ or ‘fail’. But more than making this decision, a number of other advantages accrue from assessment. Rank ordering the students (e.g. for selection), measuring improvement over a period of time, providing feedback to students and teachers about areas learnt well and areas requiring further attention, and maintaining the quality of educational programs are some of the other important reasons for assessment.
Assessment in medical education is especially important because we are certifying students as fit to deal with human lives. Many a time, the actions of doctors can make a difference between life and death. This makes it all the more important to use the most appropriate tools to assess their learning. You will also appreciate that medical students are required to learn a number of practical skills, many of which can be life saving. Assessment is a means to ensure that all students learn these skills.
 
Types of assessment
Assessment can be classified in many ways, depending on the purpose for which it is being done. As discussed in the preceding paragraphs, assessment can be used not only for certification, but also to provide feedback to teachers and students. Based on this perspective, assessment can be classified as formative or summative.
  1. Formative or summative assessment: Formative assessment refers to assessment conducted with the primary purpose of providing feedback to students and teachers. To be useful, formative assessment should happen as often as possible-in fact, experts suggest that it should be almost continuous. And since the purpose is diagnostic (and remedial), it should be able to bring out the strengths and weaknesses of students. If students disguise their weaknesses and try to bluff the teacher, the purpose of formative assessment is lost. This feature has important implications in designing assessment for formative purposes.
    Formative assessment should not be used for final certification. It implies designating certain assessment opportunities as formative only so that teachers can identify deficiencies of the students and undertake remedial action. A corollary of this statement is that all assignments need not be graded-or all grades need not be considered for calculation of final scores. From this perspective, all assessments are summative-they become formative only when they are used to provide feedback to students to make learning better.
    Summative assessment, on the other hand, implies testing at the end of the unit/semester/course. Please note that summative does not refer to the end-of-year University examinations only. Assessment becomes summative if the results are going to be used to make educational decisions. Summative assessment is intended to test if the students have attained the objectives laid down for the particular unit of activity. It is also used for certification and registration purposes (e.g. giving a license to practise medicine). Did you notice that we said attainment of listed objectives? The implication is to inform students well in advance, right at the beginning of the course, about what is expected from them when they finish the course, so that they can shape their learning accordingly. Most institutions ignore this part, leaving it to students to make their own interpretations based on inputs from various sources, mainly from senior students. No wonder then, that we are often frustrated by the way students learn.
    Of late, there has been a trend towards blurring the boundary between formative and summative assessment. A purely formative assessment without any consequences is not taken seriously by anyone. On the other hand, a purely summative assessment has no learning value. There is no reason why the same assessment cannot be used for providing feedback as well as for calculation of the final score. A middle path may be to take ‘three best scores out of five’ or a similar figure. However, whatever pattern is decided must be clearly conveyed to students at the beginning of the course.
    We strongly believe that every teacher can play a significant role in improving student learning by judicious use of assessment for learning. Every teacher may not be involved with setting high-stakes question papers-but every teacher is involved with developing locally made assessments to provide feedback to the students. Throughout this book, you will find a tilt towards the formative function of assessment.
  2. Criterion- or norm-referenced assessment: Yet another purpose of assessment that we listed above was to rank order the students (e.g. for selection purposes). From this perspective, it is possible to classify assessment as criterion referenced testing (CRT) and norm referenced testing (NRT).
    As the names indicate, CRT involves comparing the performance of students against fixed criteria. This is particularly useful for term-end examinations where we want to ensure that students have attained the competencies desired for that course or unit of the course. Results of CRT can only be a pass or a fail. Let's see an example. If the objective is that the student should be able to perform cardiopulmonary resuscitation, then he must perform all the essential steps to be classified as a pass. The student cannot pass if he performed only 60% of the steps! CRT, however, requires the establishment of an absolute standard before starting the examination.
    NRT, on the other hand, implies rank ordering the students. NRT only tells us how students did in relation to each other-it does not tell us ‘what’ they did. There is no fixed standard and ranking can happen only after the examination.
    Again, there can be variations, and one of the means commonly employed is a two-stage approach, i.e. first use CRT to decide who should pass and then use NRT to rank order them (a minimal sketch of this two-stage approach is given below). Traditionally in India we have been following this mixed approach-however, we do not seem to have established defensible standards of performance so far and seem to arbitrarily take 50% as the cut-off for pass/fail. The issue of standard setting has been discussed in Chapter 20.
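To make the two-stage approach concrete, here is a minimal sketch in Python. The names, scores and the 50% cut-off are entirely hypothetical; it illustrates the logic only and is not a prescription for standard setting.

```python
# Hypothetical scores and cut-off, for illustration only.
scores = {"Asha": 72, "Bala": 48, "Chitra": 65, "Dev": 81, "Esha": 50}
PASS_MARK = 50  # an absolute standard fixed before the examination (CRT)

# Criterion-referenced decision: compare each student against the fixed standard.
crt_result = {name: ("pass" if score >= PASS_MARK else "fail") for name, score in scores.items()}

# Norm-referenced decision: rank the students who passed relative to one another.
passed = {name: score for name, score in scores.items() if crt_result[name] == "pass"}
nrt_ranking = sorted(passed, key=passed.get, reverse=True)

print(crt_result)   # {'Asha': 'pass', 'Bala': 'fail', 'Chitra': 'pass', 'Dev': 'pass', 'Esha': 'pass'}
print(nrt_ranking)  # ['Dev', 'Asha', 'Chitra', 'Esha']
```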
 
Attributes of good assessment
We have argued for the importance of assessment as an aid to learning, and this link rests on several factors. The provision of feedback (e.g. through formative assessment) definitely improves learning (Burch et al, 2006; Rushton, 2005). Similarly, the test-driven nature of learning again speaks for the importance of assessment (Dochy and McDowell, 1997). What we would like to emphasize here is that the reverse is also true, i.e. when improperly used, assessment can distort learning. We are all aware of the adverse consequences for the learning of interns when selection into postgraduate courses is based only on the results of a single MCQ test.
There are a number of qualities that good assessment should possess. Rather than going into the plethora of attributes available in the literature, we will restrict ourselves to the four most important attributes of good assessment as listed by van der Vleuten and Schuwirth (2005). These include validity, reliability, acceptability and the consequences of assessment.
 
Validity
Validity is the most important characteristic of good assessment. Traditionally, it is defined as ‘measuring what it intends to measure’. While this definition is correct, it requires, as Downing and Yudkowsky (2009) say, a lot of elaboration. Let us try to understand validity better.
Until a few years ago, validity was viewed as being of various types (content validity, concurrent validity, predictive validity, construct validity, etc.) (Fig. 1.1). This concept had the drawback of seeing assessment as being valid in one situation but not in another. Let us draw a parallel between validity and honesty as an attribute. Just as it is not possible for a person to be honest in one situation and dishonest in another (then he would not be called honest!), the same is true of validity.
Validity is now seen as a unitary concept, which has to be inferred from various sources of evidence (Fig. 1.2). Let's come back to the ‘honesty’ example. One would look at a person's behavior at work, at home, in a situation when he finds something expensive lying on the roadside or how he pays his taxes, and then make an inference about his honesty.
Fig. 1.1: Earlier concept of validity
Fig. 1.2: Contemporary concept of validity
In the same way, validity is a matter of inference. Validity refers to the interpretation that we make out of assessment data. Implied within this is the fact that validity does not refer to the tool or results-rather, it refers to what interpretation we make from the results obtained by use of that tool. From this viewpoint, no test or tool is inherently valid or invalid.
Inferring validity requires empirical evidence. Generally, it requires content-related evidence to see if the test relates to the content of the objectives set for that particular course. If performing basic skills (say giving an intramuscular injection or draining an abscess) is part of the MBBS course and if these skills are not assessed in the examination, then content-related validity evidence is lacking. Similarly, if the number of questions is not proportional to the content (e.g. if 50% weightage is given to CNS questions at the cost of anemia, which is a much more common problem), the interpretation may not be valid.
Construct: Validity also requires construct-related evidence. Let us discuss ‘construct’ in a little more detail. A construct is a collection of inter-related components, which together give a meaning. Physique, complexion, poise, confidence and many other attributes are considered to decide if a person is beautiful. Here beauty is a construct. Similarly, in educational settings, subject knowledge, its application, data gathering, interpretation of data and many other things go into deciding clinical competence. In this case, clinical competence is a construct. In medicine, educational attainment, problem solving, professionalism and ethics are some other examples of constructs.
All assessment in education aims at assessing a construct. We are not interested in knowing if a student can enumerate five causes of hepatomegaly. But we are interested in knowing if he can take a relevant history based on those causes. In this context, construct-related evidence becomes the most important way to infer validity. Simply stated, results of assessment will be more valid if they tell us about the problem-solving ability of a student rather than about his ability to list five causes of each of the symptoms shown by the patient. As a corollary, it can be said that if the construct is not fully represented (e.g. testing only presentation skills but not physical examination skills during a case presentation), validity is threatened. Messick (1989) has called this construct underrepresentation (CU).
Other influences: While content and construct seem to be directly related to the course, the way a test is conducted can also influence validity. You may give a question to test understanding of concepts, but if you start marking the papers on the basis of handwriting, validity is threatened. If a test is taken in a hot, humid and noisy room, its validity becomes low (you've guessed right-the test becomes more a test of a candidate's ability to concentrate in the presence of distractions rather than of his educational attainment-the construct changes). If a student spends more time in understanding the complex language of an MCQ rather than on its content, then validity is threatened. Similarly, leaked question papers, an incorrect key, equipment failure, etc. can have a bearing on validity. Messick (1989) calls this construct-irrelevant variance (CIV).
Let us try to explain this concept in a different way. Let's say you conduct an essay type test and try to assess knowledge, skills and professionalism from the same. We would expect that there would be low correlation between the scores on the three domains. On the other hand, if we give three different tests, say essays, MCQs and oral examination, to assess knowledge, we would expect a high correlation between scores. If we were to get just the opposite situation, i.e. high correlation in the first setting and low in the second, construct-irrelevant variance would be said to exist.
You can think of many common examples from your own settings which induce CIV in our assessment. Too difficult or too complicated questions, language which is not understood by students, words which confuse the students and ‘teaching to the test’ are some of the factors which will induce CIV. Making OSCE stations which test only analytical skills will result in invalid interpretation about practical skills of a student by inducing CIV. A minimal sketch of the correlation check described above is shown below.
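As a rough illustration of this correlation check, the sketch below uses entirely hypothetical scores for five students. With these numbers, the first two correlations (same construct, different tools) come out high and the last two (different domains, same essay) come out low; the reverse pattern would point to construct-irrelevant variance.

```python
# Hypothetical scores for five students, for illustration only.
from statistics import correlation  # Pearson's r; requires Python 3.10+

# Three different tools, all intended to assess the same construct (knowledge):
essay_knowledge = [62, 70, 55, 80, 66]
mcq_knowledge   = [60, 74, 50, 78, 68]
oral_knowledge  = [58, 72, 53, 82, 64]

# Two other domains scored from the same essay paper:
essay_skills          = [71, 52, 68, 60, 75]
essay_professionalism = [58, 64, 70, 61, 80]

print(correlation(essay_knowledge, mcq_knowledge))          # high (~0.96): same construct
print(correlation(essay_knowledge, oral_knowledge))         # high (~0.99): same construct
print(correlation(essay_knowledge, essay_skills))           # low (negative): different domains
print(correlation(essay_knowledge, essay_professionalism))  # low (negative): different domains
```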
 
How can we build in validity?
Assessment should match the contents of the course and provide proportional weightage to each of the contents. Blueprinting with good sampling of content is a very helpful tool to ensure content representation and is dealt with in Chapter 19. Also implied is the need to let students know right in the beginning what is expected of them at the end of the course. Using questions which are neither too difficult nor too easy and which are worded in a way appropriate to the level of the students, and maintaining the confidentiality of examinations, are some methods of building validity (Table 1.1).
Similarly, the tools should aim to test accepted constructs rather than individual competencies like recall of knowledge. Often, it is better to use multiple tools to get different pieces of information on which a judgment of student attainment can be made. It is also important to select tools which can test more than one competency at a time. There is no use having one OSCE station to test history taking, another for skills and yet another for professionalism. Each station should be able to test more than one competency. This not only provides an opportunity for wider sampling by having more competencies tested at each station but also brings in the concept of integrated assessment. A minimal sketch of proportional item allocation through blueprinting follows.
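As an example of proportional weightage, here is a minimal blueprinting sketch. The content areas, weightages and total number of items are hypothetical; blueprinting proper is discussed in Chapter 19.

```python
# Hypothetical content areas and weightages, for illustration only.
content_weightage = {        # agreed relative importance of each content area (%)
    "Anemia": 30,
    "Respiratory infections": 25,
    "Cardiology": 20,
    "CNS": 15,
    "Endocrine": 10,
}
TOTAL_ITEMS = 60

# Allocate items in proportion to the weightage of each content area.
blueprint = {area: round(TOTAL_ITEMS * weight / 100) for area, weight in content_weightage.items()}
print(blueprint)
# {'Anemia': 18, 'Respiratory infections': 15, 'Cardiology': 12, 'CNS': 9, 'Endocrine': 6}
```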
 
Reliability
Let us now move to the second important attribute of assessment, that is, reliability. Commonly, reliability refers to ‘reproducibility of the scores’ or, as some others put it, ‘getting the same results under the same conditions’. Again, like validity, while this definition is correct, it requires a lot of elaboration.
Table 1.1   Factors threatening validity

| Factor | Remediation |
| --- | --- |
| Too few items or cases | Increase the number of items/cases; increase the frequency of testing |
| Unrepresentative content sampling | Blueprinting; subject experts’ feedback |
| Too easy/too difficult questions | Test and item analysis |
| Items violating standard writing guidelines | Screening of items; faculty training |
| Problems with test construction/administration/scoring | Appropriate administrative measures; faculty training; monitoring mechanisms |
People often tend to confuse the terms ‘objectivity’ and ‘reliability’. Objectivity refers to reproducibility of the scores, so that anyone marking the test would mark it the same way. There are certain problems in equating reliability with objectivity. For example, if a key is wrongly marked in a test, everyone would mark the test similarly and generate identical scores. But are we happy with this situation? No. Because of the wrong key, we cannot interpret the scores correctly. Let us add some more examples. Suppose at the end of the final professional MBBS, we were to give the students a test paper containing only 10 MCQs. The results will be very objective, but they will not be a reliable measure of students’ knowledge. There is no doubt that objectivity is a useful attribute of any tool, but it is more important to have items (or questions) which are fairly representative of the universe of items possible in a subject area and, at the same time, enough in number so that the results are generalizable. In other words, in addition to objectivity we also need an appropriate and adequate sample to get reliable results.
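The chapter does not give a formula for this, but a standard psychometric result, the Spearman-Brown prophecy formula, makes the point about test length: if a test is lengthened $k$-fold with comparable items, its predicted reliability $\rho_{kk}$ rises from the original reliability $\rho_{11}$ as

$$\rho_{kk} = \frac{k\,\rho_{11}}{1 + (k - 1)\,\rho_{11}}$$

So, for example, a 10-item paper with reliability 0.50, lengthened to 30 comparable items (k = 3), would be predicted to reach (3 × 0.50) / (1 + 2 × 0.50) = 0.75.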
We are not very comfortable with the other definition of reliability, i.e. ‘obtaining the same results under similar conditions’, either. While this may be true of, say, a biochemical test, it is not completely true of an educational test. Let us say, during the final professional MBBS examination, we give a long case to a student in a very conducive setting, where there is no noise or commotion and the patient is very cooperative. But we know that in actual practice this seldom happens and there are no completely similar conditions under which tests are administered. Similarly, no two patients with similar diagnoses will have the same presentation. In the past, educationists have tried to make examinations more and more controlled and standardized (OSCE and standardized patients, for example) so that the results represent only student attainment and nothing else. We argue that it might be better to work in reverse, i.e. conduct examinations in settings as close to actual ones as possible so that reproducibility can be ensured. This is the concept of authentic assessment.
We would also like to argue that objectivity is not the sine qua non of reliability. A subjective assessment can be very reliable if based on adequate content expertise. We all make predictions-subjective ones-about the potential of our students and we rarely go wrong! The point that we are trying to make is that in educational testing there is always a degree of prediction involved. Will the student whom we have certified as being able to handle a case of mitral stenosis in the medical college actually be able to do so in practice? To us, reliability is therefore the degree of confidence that we can place in our results (try viewing reliability as ‘rely-ability’).
A common reason for low reliability is the content specificity of the case. Many examiners will prefer to have a neurological case in the final examination in medicine. It is presumed that a student who can handle this case can also handle a patient with anemia or malnutrition. Nothing can be farther from the truth. Herein lies the importance of including a variety of cases in the examination to make it representative of what the student is actually going to see in real life. You will recall what we said earlier: representative and adequate sampling is also important to build validity.
Viewing reliability of educational assessment differently from that of other tests has important implications. Let us presume that we have to give a test in clinical skills to a final year student. If we look at reliability merely as reproducibility-or in other words, getting the same results if the same case is given again to the student under same conditions-then we would try to focus on precision of scores. However, if we conceptualize reliability as confidence in our scores, then we would like to examine the student under different conditions and on different patients so that we can generalize our results. We might even like to add feedback from peers, patients and other teachers to come to a conclusion about the performance of the student.
Often we go by the idea that examiner variability can induce a lot of unreliability in the results. To some extent this may be true. While examiner training is one solution, it is equally useful to have multiple examiners. We have already discussed including a variety of content in the assessment. This may not be possible on one occasion but can happen when we carry out assessment on multiple occasions. The general agreement in educational assessment is that a single measurement-howsoever perfect-is flawed for making educational decisions. Therefore it is important to collect information on a number of occasions using a variety of tools. The key dictum to build reliability (and validity) for any assessment is to have multiple tests on multiple content areas by multiple examiners using multiple tools in multiple settings.
Validity and reliability of a test are very intricately related. To be valid, a test has to be reliable. A judge cannot form a valid inference if the witness who is being examined is unreliable. Thus reliability is a precondition for validity. But let us caution you that it is not the only condition.
 
Acceptability
The third important attribute of assessment is acceptability. A number of assessment tools are available to us and sometimes we can have a variety of methods for the same objective. Portfolios, for example, can provide as much information as can be provided by rating scales. MCQs can provide as much information about knowledge as can be obtained by oral examinations. However, acceptability by students, raters, institutions and society at large can play a significant role in accepting or rejecting a tool. MCQs, despite all their problems, are accepted as a tool for selecting students for postgraduate courses, while more valid and reliable methods like portfolios may not be. This is not to suggest that we should sacrifice good tools based on likes or dislikes, but to suggest that all stakeholders need to be involved in the decision-making process about the use of assessment tools.
Linked to the concept of acceptability is also the issue of practicability. While we may have developed very good tools for assessing communication skills of our students, resource crunches may not allow us to use these tools on a large scale.
 
Educational impact
The consequences of assessment are a very significant issue. The consequences can be in terms of student learning, consequences for individual students and consequences for society. We have already referred to the impact of MCQ-based selection tests on student learning. For the student, a wrong assessment decision can act as a double-edged sword. A student who has wrongly been failed has to face consequences in terms of time and money. On the other hand, if a student is wrongly passed, society has to deal with the consequences of having an incompetent physician.
Assessments do not happen in a vacuum. They happen within the context of certain objectives. For each assessment there is an expected use-it could be making pass/fail judgments, selecting students for an award or simply providing feedback to teachers. Asking four questions, viz. why are the students being assessed, who is going to use the data, at what time, and for what purpose, can bring a lot of clarity to the process and help in selecting appropriate tools.
 
Utility of assessment
Before we end this chapter, let us introduce the concept of utility of assessment. van der Vleuten and Schuwirth (2005) have suggested a conceptual model for the utility of any assessment.
Utility = Validity × Reliability × Acceptability × Educational impact × Feasibility
This concept is especially important because it helps us to see how the strengths of an assessment tool can compensate for its deficiencies. Some tools may be low on reliability, but can still be useful if they are high on educational impact. For example, an OSCE has high reliability, but little educational value. The mini-CEX, on the other hand, may be low on reliability, but has a higher educational value due to its feedback component. Still, both are equally useful to assess students. Similarly, if a certain assessment has a negative value for any of the parameters (e.g. if an assessment promotes unsound learning habits), then its utility may be negative.
Quality assurance: It is of paramount importance to maintain good quality control over assessment tools and processes. As we go through various tools in the subsequent chapters, we will be discussing them individually. However, as a general rule, the following principles should be followed:
| Tool | Quality assurance |
| --- | --- |
| MCQs, EMQs, KFT | Presubmission checklists; content review by experts; matching with ILOs; test and item analysis; internal consistency of the test using Cronbach's alpha; point-biserial correlation |
| Written assignments | Double marking with inter-rater correlations |
| OSCE stations | Peer review; weighted checklists; reliability; station-total (minus that station) correlation |
| Ratings | Multiple observers; inter-observer correlation |
| Portfolios | Qualitative methods; structuring; including tools with good validity and reliability |

MCQs: Multiple Choice Questions; EMQs: Extended Matching Questions; KFT: Key Feature Test; ILOs: Intended Learning Outcomes; OSCE: Objective Structured Clinical Examination.
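As a rough sketch of what ‘test and item analysis’ involves in practice, the following Python fragment computes an item difficulty index, a corrected point-biserial discrimination and Cronbach's alpha from a small, entirely hypothetical response matrix. It illustrates the calculations only; a real analysis would use the full dataset and dedicated software.

```python
# Hypothetical response matrix, for illustration only: rows = students, columns = items, 1 = correct.
import statistics

responses = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
]

n_students = len(responses)
n_items = len(responses[0])
totals = [sum(row) for row in responses]  # total score of each student

def difficulty(item):
    """Proportion of students answering the item correctly (p value)."""
    return sum(row[item] for row in responses) / n_students

def point_biserial(item):
    """Correlation of the item (0/1) with the rest-of-test score (corrected discrimination)."""
    item_scores = [row[item] for row in responses]
    rest = [t - i for t, i in zip(totals, item_scores)]
    p = statistics.mean(item_scores)
    sd_rest = statistics.pstdev(rest)
    if sd_rest == 0 or p in (0, 1):
        return 0.0
    mean_rest_correct = statistics.mean(r for r, i in zip(rest, item_scores) if i == 1)
    return (mean_rest_correct - statistics.mean(rest)) / sd_rest * (p / (1 - p)) ** 0.5

def cronbach_alpha():
    """Internal consistency of the whole test."""
    item_variances = [statistics.pvariance([row[i] for row in responses]) for i in range(n_items)]
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / statistics.pvariance(totals))

for i in range(n_items):
    print(f"Item {i + 1}: difficulty {difficulty(i):.2f}, point-biserial {point_biserial(i):.2f}")
print(f"Cronbach's alpha: {cronbach_alpha():.2f}")
```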
In the chapters that follow, we will elaborate on the implications of this knowledge for assessment design.
REFERENCES
  1. Burch, V. C., Saggie, J. C., & Gary, N. (2006). Formative assessment promotes learning in undergraduate clinical clerkships. South African Med J, 96, 430–433.
  2. Dochy, F. J. R. C., & McDowell, L. (1997). Assessment as a tool for learning. Studies in Educational Evaluation, 23, 279–298.
  3. Downing, S. M., & Yudkowsky, R. (2009). Assessment in health professions education (1st edn.). New York: Routledge.
  4. Linn, R. L., & Miller, M. D. (2005). Measurement and assessment in teaching (9th edn.). Upper Saddle River, NJ: Prentice Hall.
  5. Messick, S. (1989). Validity. In R. L. Linn (ed.), Educational measurement (3rd edn., pp 13–104). New York: American Council on Education & Macmillan Publishing Co.
  6. Rushton, A. (2005). Formative assessment: a key to deep learning? Med Teacher, 27, 509–513.
  7. van der Vleuten, C. P. M., & Schuwirth, L. W. T. (2005). Assessing professional competence: from methods to programmes. Med Educ, 39, 309–317.
  8. van der Vleuten, C. P. M., & Schuwirth, L. W. T. (2010). How to design a useful test: the principles of assessment. In T. Swanwick (ed.), Understanding medical education: Evidence, theory and practice (1st edn.). West Sussex: Wiley-Blackwell & ASME.
FURTHER READING
  1. Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education, 5, 7–74.
  2. Dent, J. A., & Harden, R. M. (2005). A practical guide for medical teachers (3rd edn.). Edinburgh: Churchill Livingstone-Elsevier.
  3. Epstein, R. M., & Hundert, E. M. (2002). Defining and assessing professional competence. JAMA, 287, 226–235.
  4. Frederiksen, N. (1984). Influences of testing on teaching and learning. American Psychologist, 39, 193–202.
  5. Gibbs, G., & Simpson, C. (n.d.). Does your assessment support your students' learning? [Electronic Version]. Retrieved September 21, 2011 from http://artsonline2.tki.org.nz/documents/GrahamGibbAssessmentLearning.pdf
  6. Gronlund, N. E. (2003). Assessment of student achievement (7th edn.). Boston: Allyn and Bacon.
  7. Holmboe, E. S., & Hawkins, R. E. (2008). Practical guide to the evaluation of clinical competence (1st edn.). Philadelphia: Mosby-Elsevier.
  8. Jackson, N., Jamieson, A., & Khan, A. (2007). Assessment in medical education and training: A practical guide (1st edn.). Oxford: Radcliffe Publishing.
  9. Miller, G. E. (1976). Continuous assessment. Med Educ, 10, 611–621.
  10. Norcini, J. (2003). Setting standards on educational tests. Med Educ, 37, 464–469.
  11. Singh, T., Gupta, P., & Singh, D. (2009). Principles of Medical Education (3rd edn.). New Delhi: Jaypee Bros.
  12. Swanwick, T. (ed.) (2010). Understanding medical education: Evidence, theory and practice (1st edn.). West Sussex: Wiley-Blackwell & ASME.
  13. Wass, V., Bowden, R., & Jackson, N. (2007). Principles of assessment design. In N. Jackson, A. Jamieson, & A. Khan (eds.), Assessment in medical education and training: A practical guide (1st edn.). Oxford: Radcliffe Publishing.