The abstracts below are from Paragon’s most recent research.

Latent Class Structural Equation Modeling as a Tool for Developing Validity Arguments

Jake E. Stone, Yan Liu, Amery D. Wu

Conference Presentation
The 9th Conference of the International Test Commission (ITC), July 2 to 5, 2014, San Sebastian, Spain

Research Report Number: CELPIP-2014-02-01

Purpose

The Canadian English Language Proficiency Index Program General (CELPIP-G) Test is a standardized assessment of functional English ability in work and community settings. The interpretation of CELPIP-G test scores is criterion-referenced to the 12-level Canadian Language Benchmarks (CLB), and the scores are used for Canadian immigration and citizenship purposes. Validity is vital to score interpretation and use when the CELPIP-G is employed for such high-stakes decisions. The purpose of this study is to examine the intended claims of the CELPIP-G, namely that (1) CELPIP-G scores reflect individuals’ English functional ability and (2) higher-functioning participants are classified at higher CLB levels as assessed by the CELPIP-G.

Method

The revised CELPIP-G Test was pilot tested on a sample of 350 volunteer participants who were living in Canada on various types of visas or as permanent residents or citizens. Participants were surveyed on their English language background (e.g., years of studying English) as well as their current engagement in English in the workplace and in the community (e.g., shopping and reading work reports). Latent class analysis (LCA) was conducted within a structural equation modeling (SEM) framework to identify groups of participants who differed in the ways they engaged with English. Test takers’ English language backgrounds were then modeled as predictors of the latent classes, which, in turn, were modeled as predictors of CELPIP-G-assessed CLB levels.
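The class-identification step described above can be sketched in code. The following is an illustrative toy example only: the study used a full SEM framework with covariate predictors, whereas this sketch shows just the core EM estimation of a latent class model on binary engagement indicators, with synthetic data and hypothetical item profiles standing in for the survey items.

```python
# Minimal EM algorithm for latent class analysis (LCA) on binary items.
# Synthetic data and profiles are hypothetical, not from the study.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 350 respondents x 4 yes/no engagement items
# (hypothetical items, e.g., "shops in English", "reads work reports").
true_profiles = np.array([[0.1, 0.1, 0.1, 0.1],   # little engagement
                          [0.8, 0.8, 0.2, 0.2],   # social, non-office work
                          [0.8, 0.8, 0.9, 0.9]])  # social + office
membership = rng.choice(3, size=350, p=[0.3, 0.4, 0.3])
X = (rng.random((350, 4)) < true_profiles[membership]).astype(float)

def lca_em(X, n_classes=3, n_iter=200, seed=1):
    """Fit a latent class model to binary items via EM."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pi = np.full(n_classes, 1.0 / n_classes)         # class proportions
    theta = rng.uniform(0.25, 0.75, (n_classes, p))  # item endorsement probs
    for _ in range(n_iter):
        # E-step: posterior class probabilities for each respondent.
        log_lik = (X @ np.log(theta).T
                   + (1 - X) @ np.log(1 - theta).T
                   + np.log(pi))
        log_lik -= log_lik.max(axis=1, keepdims=True)
        post = np.exp(log_lik)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: update proportions and item probabilities.
        pi = post.mean(axis=0)
        theta = (post.T @ X) / post.sum(axis=0)[:, None]
        theta = theta.clip(1e-6, 1 - 1e-6)
    return pi, theta, post

pi, theta, post = lca_em(X)
print("estimated class sizes:", np.round(np.sort(pi), 2))
```

In the full SEM specification, language-background variables would additionally predict the class membership probabilities, and class membership would predict CLB level; dedicated software (e.g., a mixture-SEM package) would be used rather than hand-rolled EM.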

Results

Three latent classes were identified. The first class had little regular social engagement and no work engagement. The second class engaged socially and in work settings other than office environments. The third class engaged both socially and in office environments. English language background predicted latent class membership, and latent class membership in turn predicted CLB levels. The study concludes that SEM-based LCA is a strong method for developing warrants that support a validity argument.

Comparing the Rating Effectiveness of Personalized vs. Non-personalized Feedback to the On-Line Raters of English Speaking and Writing Assessment

Alex Volkov, Kristina Chang, Jake E. Stone, Michelle Y. Chen, Amery D. Wu

Conference Presentation
The 9th Conference of the International Test Commission (ITC), July 2 to 5, 2014, San Sebastian, Spain

Research Report Number: CELPIP-2013-11-01

Purpose

This study investigates the effects of feedback type (personalized, non-personalized, and a no-feedback control) on rater performance in scoring the constructed responses of a speaking and writing assessment. Although there is plenty of research on initial rater training (e.g., Wang, 2010), methods for ongoing feedback and calibration have not been sufficiently studied. Personalized rater performance reports are costly and have yielded mixed results as to their effectiveness (Elder, Knoch, Barkhuizen, & von Randow, 2005).

Method

The speaking and writing components of the Canadian English Language Proficiency Index Program General (CELPIP-G) are part of a large-scale, standardized assessment for high-stakes immigration and citizenship purposes. Test takers construct their own written and spoken responses to the task requirements. Responses to each task are rated by at least two independent raters on a scale of 1–5 on four dimensions of proficiency. Rating assignments are managed through a centralized online rating system. A Many-Facet Rasch Measurement (MFRM) model was used to identify underperforming raters. The identified raters were randomly assigned to one of two feedback methods. With the personalized method, an underperforming rater's ratings on all four dimensions, along with the original response, were juxtaposed with the ratings given to the same response by benchmark (calibration) raters who had shown strong validity and reliability in rating. With the non-personalized method, only exemplar ratings from benchmark raters were shown to the underperforming raters. The remaining raters functioned as the control group.
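The rater-severity idea behind MFRM can be sketched as follows. This is an illustrative example only, not the operational analysis: it computes category probabilities for a single 1–5 rating under a rating-scale Rasch model with a rater-severity facet, using hypothetical parameter values (in practice, abilities, difficulties, severities, and thresholds are all estimated from the rating data).

```python
# Category probabilities under a many-facet Rasch (rating scale) model
# with a rater-severity facet. All parameter values are hypothetical.
import math

def mfrm_probs(ability, task_difficulty, rater_severity, thresholds):
    """Probabilities of score categories 1..5 for one rating.

    thresholds: step parameters for categories 2..5 (the step into the
    lowest category is fixed at 0 by convention).
    """
    logit = ability - task_difficulty - rater_severity
    # Cumulative sums over steps give unnormalized log-probabilities.
    psi, total = [0.0], 0.0
    for tau in thresholds:
        total += logit - tau
        psi.append(total)
    exps = [math.exp(v) for v in psi]
    z = sum(exps)
    return [e / z for e in exps]

def expected_score(probs):
    return sum((k + 1) * p for k, p in enumerate(probs))

thresholds = [-1.5, -0.5, 0.5, 1.5]  # hypothetical step values
# Same examinee and task, two raters differing only in severity:
lenient = mfrm_probs(0.5, 0.0, -0.5, thresholds)
severe = mfrm_probs(0.5, 0.0, 0.8, thresholds)
print(round(expected_score(lenient), 2), round(expected_score(severe), 2))
```

A rater whose estimated severity (or fit statistics) departs from the benchmark group by more than a pre-set tolerance would be flagged as underperforming; this is the kind of comparison the MFRM analysis above supports.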

Results

Ratings under the different experimental conditions will be recorded weekly, and feedback will be provided until an underperforming rater has met pre-specified calibration criteria or until the end of February. The effect of the feedback method will be evaluated by the number of feedback cycles required to reach calibration and by weekly changes in performance (MFRM analyses and exact and adjacent agreement).

References:

Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback to enhance rater training: does it work? Language Assessment Quarterly, 2(3), 175–196.

Wang, B. (2010). On rater agreement and rater training. English Language Teaching, 3(1), 108–112.