Below are abstracts from Paragon’s most recent research.

Latent Class Structural Equation Modeling as a Tool for Developing Validity Arguments

Jake E. Stone, Yan Liu, Amery D. Wu

Conference Presentation
The 9th Conference of the International Test Commission (ITC), July 2 to 5, 2014, San Sebastian, Spain

Research Report Number: CELPIP-2014-02-01


The Canadian English Language Proficiency Index Program General (CELPIP-G) Test is a standardized assessment of functional English ability in workplace and community settings. The interpretation of CELPIP-G test scores is criterion-referenced to the 12-level Canadian Language Benchmarks (CLB) and used for Canadian immigration and citizenship purposes. Validity is vital to score interpretation and use when the CELPIP-G is applied to such high-stakes decisions. The purpose of this study is to examine the intended claims of the CELPIP-G, namely that (1) CELPIP-G scores reflect individuals’ English functional ability and (2) higher-functioning participants would be classified at higher CLB levels as assessed by the CELPIP-G.


The revised CELPIP-G Test was pilot tested on a sample of 350 volunteer participants who were living in Canada on various types of visa or as permanent residents and citizens. Participants were surveyed on their English language background (e.g., years of studying English) as well as their current engagement with English in the workplace and in the community (e.g., going shopping and reading work reports). Latent class analysis (LCA), conducted within a structural equation modeling (SEM) framework, was used to identify groups of participants who differed in the ways they engaged with English. Test takers’ English language backgrounds were then modeled as predictors of the latent classes, which, in turn, were modeled as predictors of CELPIP-G-assessed CLB levels.
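The measurement part of this approach can be illustrated with a minimal sketch. The code below is not the study's actual model (the abstract does not specify software or estimation details); it is a basic expectation-maximization (EM) fit of a latent class model with binary engagement indicators, using invented data for two hypothetical classes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate binary engagement indicators (e.g., "reads work reports") for two
# hypothetical latent classes; the endorsement probabilities are illustrative.
n_per_class = 100
true_p = np.array([[0.9, 0.8, 0.9, 0.2],   # class 0: high engagement
                   [0.1, 0.2, 0.1, 0.7]])  # class 1: low engagement
X = np.vstack([rng.random((n_per_class, 4)) < true_p[k]
               for k in range(2)]).astype(float)

def fit_lca(X, n_classes=2, n_iter=200, seed=0):
    """Basic EM for a latent class model with binary indicators."""
    rng = np.random.default_rng(seed)
    n, j = X.shape
    pi = np.full(n_classes, 1.0 / n_classes)        # class proportions
    p = rng.uniform(0.3, 0.7, size=(n_classes, j))  # item endorsement probs
    for _ in range(n_iter):
        # E-step: posterior class membership for each respondent
        log_post = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T + np.log(pi)
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update parameters from the posterior weights
        pi = resp.mean(axis=0)
        p = (resp.T @ X) / resp.sum(axis=0)[:, None]
        p = p.clip(1e-6, 1 - 1e-6)
    return pi, p, resp

pi, p, resp = fit_lca(X)
```

In the full SEM specification described above, the class-membership probabilities would additionally be regressed on background covariates, and class membership would predict the CLB outcome; the sketch covers only the unconditional measurement model.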


Three latent classes were identified. The first class had little regular social engagement and no work engagement. The second class engaged socially and in work settings other than office environments. The third class engaged both socially and in office environments. English language background predicted latent class membership, and latent class membership, in turn, predicted CLB levels. The study concludes that SEM-based LCA is a strong method for developing warrants that support a validity argument.

Comparing the Rating Effectiveness of Personalized vs. Non-personalized Feedback to the On-Line Raters of English Speaking and Writing Assessment

Alex Volkov, Kristina Chang, Jake E. Stone, Michelle Y. Chen, Amery D. Wu

Conference Presentation
The 9th Conference of the International Test Commission (ITC), July 2 to 5, 2014, San Sebastian, Spain

Research Report Number: CELPIP-2013-11-01


This study investigates the effects of the type of feedback given to raters (personalized, non-personalized, or control) on their performance in rating the constructed responses of a speaking and writing assessment. Although there is plenty of research on initial rater training (e.g., Wang, 2010), methods for ongoing feedback and calibration have not been sufficiently studied. Personalized rater performance reports are costly and have yielded mixed results as to their effectiveness (Elder, Knoch, Barkhuizen, & von Randow, 2005).


The speaking and writing components of the Canadian English Language Proficiency Index Program General (CELPIP-G) are part of a large-scale, standardized assessment for high-stakes immigration and citizenship purposes. Test takers construct their own written and verbal responses to the task requirements. Responses to each task are rated by at least two independent raters on a scale of 1–5 on four dimensions of proficiency. Rating assignments are managed through a centralized online rating system. A many-facets Rasch measurement (MFRM) model was used to identify underperforming raters. The identified raters were randomly assigned to two feedback methods. With the personalized method, a specific underperforming rater’s ratings on all four dimensions, along with the original response, are juxtaposed with the ratings given to the same response by benchmark (calibration) raters who have shown strong validity and reliability in rating. With the non-personalized method, only exemplar ratings from benchmark raters are shown to the underperforming raters. The remaining raters function as the control group.


Ratings under the different experimental conditions will be recorded weekly, and feedback will be provided until the underperforming rater has met pre-specified calibration criteria or until the end of February. The effectiveness of each feedback method will be evaluated by the number of feedback cycles required until calibration and by the weekly changes in performance (MFRM analyses and exact and adjacent agreement).
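The exact and adjacent agreement statistics mentioned above are straightforward to compute. The sketch below assumes the 1–5 rating scale from the abstract; the rating data are invented for illustration, and the function is not drawn from the study itself.

```python
def agreement_rates(ratings_a, ratings_b):
    """Return (exact, adjacent) agreement rates between two raters.

    Exact agreement: identical ratings on an item.
    Adjacent agreement: ratings within one scale point of each other.
    """
    pairs = list(zip(ratings_a, ratings_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, adjacent

# Hypothetical ratings on a 1-5 scale from two raters scoring five responses.
exact, adjacent = agreement_rates([3, 4, 2, 5, 1], [3, 5, 2, 3, 2])
# exact -> 0.4 (2 of 5 identical); adjacent -> 0.8 (4 of 5 within one point)
```

In practice such rates would be tracked weekly per rater, alongside the MFRM fit statistics, to judge whether an underperforming rater has reached the calibration criteria.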


Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback to enhance rater training: does it work? Language Assessment Quarterly, 2(3), 175–196.

Wang, B. (2010). On rater agreement and rater training. English Language Teaching, 3(1), 108–112.

Mapping Language Use and Communication Challenges to the Canadian Language Benchmarks and CELPIP-General LS within Workplace Contexts for Canadian New Immigrants

Christine Doe, Scott Douglas, Liying Cheng

Research Report (PDF)

Overview of the Project

The study examined language use and communication challenges among Canadian immigrants working in typical workplace settings for newcomers. The participants included in the analysis for this report are new immigrants, i.e., they came to Canada within the past five years. In this study, we examined how new Canadian immigrants’ perceived language use and challenges mapped onto the Canadian Language Benchmarks (CLB) and CELPIP-General LS levels of test performance. This mapping provided concrete indicators of language use and communication challenges in relation to what new Canadian immigrants working in entry-level workplace positions can do, and how well they do it, with reference to two commonly used criteria in Canada: the CLB benchmarks and the CELPIP-General LS proficiency levels.

The study focused on entry-level workplace positions because there is very limited empirical research examining language use and communicative challenges among new Canadian immigrants who work in positions that are not regulated by professional organizations (Derwing & Munro, 2009). In this study, we identified key competencies, perceived and actual, associated with language use and communication challenges in the workplace. Such findings can directly inform test design as the basis for measuring language proficiency within workplace contexts.

Goal of the Study

In order to examine the language use and communication challenges of new Canadian immigrants working in positions typically filled by newcomers, we established a two-fold goal for the study:

  • To investigate the types of interactions in entry-level workplace contexts; and
  • To examine the meaning of participants’ scores as they relate to the CLB.