Standard Setting Panels: How it feels to be a panelist
— Lauren Barrows, December 29, 2017
You’ve probably read the title of this blog post and asked yourself: “what is standard setting and why do we need it?” That’s a really good question! Standards actually surround us. We are told to maintain the air pressure in our car tires at a specific amount, e.g. 30 pounds per square inch (30 psi), because that is the optimum pressure for the performance of the tires. In an assessment context, this concept is applied when decisions need to be made. For instance, how do you decide whether someone has passed a course, knows enough to drive a car safely, or is ready to practice as an electrician or medical professional? What if you need to describe several levels of performance, such as grades A–E on a course or bands 1–12 on a language proficiency test like Paragon’s CELPIP test? Rather than make these decisions arbitrarily (which wouldn’t be very fair), assessment professionals use a process called standard setting. It involves bringing together a group of people who are experts in the subject being assessed (in Paragon’s case, language teaching and language assessment experts). Facilitators then lead this group through a series of activities that establish the minimum score a test taker needs to achieve in order to be placed at a particular performance band.
Standard setting meetings can be carried out in a number of different ways but they usually follow a structure with at least five main stages:
- Panelists define the target test taker (also referred to as the minimally competent candidate).
- The standard setting methods are explained and practiced.
- In a first judgement round, panelists either examine test items and estimate, for each item, the probability that a minimally competent candidate at a particular performance level would answer it correctly, or they review test performances and decide which performance that minimally competent candidate would most likely have produced.
- The panelists review the outcomes of their individual judgements and discuss the minimally competent candidate in relation to specific items or performances.
- There is a second judgement round in which the panelists confirm their individual judgements.
Core to the standard setting meeting is the definition of a minimally competent test taker at a particular performance level (e.g. CAEL CE band 60). This is the hypothetical test taker who possesses the minimum level of knowledge and skills necessary to perform at that level. Before you can make any judgements, you need to discuss and agree on what the minimally competent candidate at each performance level can or cannot do.
In the rest of this blog post I describe my experience as a standard setting panelist tasked with setting the Reading, Writing, Listening, and Speaking cut scores for Paragon’s CAEL Computer Edition (CAEL CE).
The experience of being a panelist was simultaneously interesting, inspiring, and intense. About 15 other teachers, researchers, educators, and test developers and I spent four eight-hour days discussing the characteristics of the minimally competent test taker at each performance level and reviewing test questions and test taker responses using two different standard setting methods: a modified-Angoff procedure (Reading and Listening) and a Bookmark procedure (Writing and Speaking).
From the outset, the facilitators made me feel able to represent my stakeholder group while remaining an independent professional. I was also continually mindful of the impact our task would have on test takers. My awareness of the importance and implications of the cut scores we were setting certainly influenced the attention and care with which I undertook my tasks as a panelist.
As previously mentioned, CAEL CE cut scores were set using two standard setting methods, a modified-Angoff procedure and the Bookmark Method. The modified-Angoff procedure asks panelists to look at each test question and estimate the probability (0 to 100%) of a minimally competent test taker at a particular performance level answering a particular test question correctly. The Bookmark Method asks panelists to review an ordered set of test taker performances (e.g. a writing response) and to place a “bookmark” on the performance that a minimally competent test taker at a particular performance level would have had the skills to produce.
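To make the modified-Angoff procedure concrete, here is a minimal sketch of how panelists’ item-level probability estimates are commonly turned into a recommended cut score: each panelist’s probabilities are summed into an expected raw score for the minimally competent candidate, and those panelist-level scores are averaged across the panel. The panelist names and ratings below are invented for illustration and are not CAEL CE data; real studies involve many more items, rounds, and checks.

```python
# Illustrative sketch of a modified-Angoff cut score calculation.
# All ratings below are invented example data, not real CAEL CE ratings.

# Each panelist estimates, per item, the probability (0-100%) that a
# minimally competent candidate would answer that item correctly.
ratings = {
    "panelist_1": [80, 65, 40, 90, 55],
    "panelist_2": [75, 70, 35, 85, 60],
    "panelist_3": [85, 60, 45, 95, 50],
}

def expected_score(item_probs):
    """Sum of item probabilities = that panelist's expected raw score
    for the minimally competent candidate (in score points)."""
    return sum(p / 100 for p in item_probs)

# Each panelist's implied cut score on this 5-item set.
panelist_cuts = {name: expected_score(p) for name, p in ratings.items()}

# The panel's recommended cut score is commonly the mean across panelists.
recommended_cut = sum(panelist_cuts.values()) / len(panelist_cuts)
print(round(recommended_cut, 2))  # prints 3.3 for this invented data
```

In a real study the Round 1 and Round 2 ratings would each be summarized this way, which is one reason facilitators can show panelists how much the group converged between rounds.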
As a panelist, my preferred method was the Bookmark Method because it was so concrete: I could see (or hear) the content and features of test takers’ responses. One difficulty I found with this method, however, was that in some instances I wanted to place my bookmark on a non-existent response that fell between two responses in the ordered response booklet, yet the task forced me to choose only one. With the modified-Angoff method I did not experience this difficulty, as I could choose any probability between 0 and 100. However, when following the modified-Angoff procedure I internally wondered what criteria I was using to differentiate, for example, a 66 percent chance of a test taker answering a question correctly from a 67 percent chance for the same test taker.
A key feature of both standard setting methods was the facilitated panel discussion between our Round 1 and Round 2 judgements. Without exception, our discussions were lively and intense as we sought to explain our ratings to each other and debated the skills needed to answer or respond appropriately to a particular question or prompt. During these discussions, I realized that other panelists had struggled over similar ratings and responses as I had. It was interesting to hear everyone’s different perspectives, especially how we each attributed more or less significance to particular characteristics in the questions and the responses.
The prevailing ethos in the panel was to reach a common understanding by the final cut score recommendation. Yet there remained room for divergence, and I was glad of the opportunity afforded by the standard setting methods to revise or keep my initial judgements in the Round 2 rating. During Round 2 I made deliberative judgements that considered my professional training, my experience as a test taker, the panel discussions, and test taker impact data. Whereas in Round 1 I felt more like an appraiser, looking at the test from only one angle, in Round 2 I felt more like a jurist, making the best possible judgement after considering and weighing a wide variety of evidence.
When we were at last shown our final cut score recommendations it was obvious that as a group we were more consistent in our ratings after Round 2. We had not achieved consensus, which was never the goal, but we had come together and reached a level of agreement. When I looked at the final recommendations I was able to see my own perspectives reflected and maintained in them.
The entire standard setting process allowed us panelists to deliberate the match between what the test developers intended to do and what they actually did, and it drew out our diverse professional judgements about the test. Throughout the CAEL CE standard setting process I gained a renewed appreciation of the level of expertise and professionalism within the Test Research and Development Division and the amount of behind-the-scenes work that goes into creating a high-stakes language proficiency test like CAEL CE. I enjoyed my participation on the panel and would readily recommend the experience to anyone.
“We all have high standards and some day we hope to live up to them.”
— Attributed to Groucho Marx