Studies on the comparability of assessment conducted by the Educational Research Institute are consistent with a range of activities undertaken by the Central Examination Board (Centralna Komisja Egzaminacyjna - CKE) to ensure reliable assessment of responses to constructed-response exam tasks. Assessment of responses to such open questions is inherently burdened with the rater effect, which results from the fact that assessment is conducted by humans, with all their knowledge, experience, beliefs, empathy and other individual characteristics. Therefore, differences in exam results depend not only on the varying level of the skills examined, but also on the rater. In humanities exams, where an essay is required, a significant portion of the variation in exam results is generated by effects associated with the personal characteristics of the rater (in this report generally referred to as the rater effect). In the presented studies, the effect was analysed for two major Matura exam subjects - Polish language and mathematics, which are susceptible to these factors in completely different ways.
The respective chapters of this report present the studies in this area conducted in Poland to date, the main methodological assumptions, the specification of the hierarchical rater model used in the analyses, and the study results. The main value of the report lies in its search for solutions designed to minimise the rater effect on the results obtained by students sitting the exam.
The years 2011 and 2012 were selected for the rater effect study. For practical reasons, exam papers were drawn from two Regional Examination Boards (Okręgowa Komisja Egzaminacyjna - OKE): Jaworzno and Kraków. It was important to collect a set of papers with the following characteristics:
1. diversity in the level of the examined skills across the entire scale used in the study;
2. for the Polish language - diversity of essay topics, to allow analysis of the interaction between rater and essay topic.
The studies were conducted on a representative sample of raters involved in the assessment of Matura exams, stratified by Regional Examination Board (and supplemented with teachers who are not raters and, in the case of the Polish language Matura exam, with students). A complex model was used to link all the papers and raters from the eight OKE districts, as a result of which each essay from the Polish language exam was assessed by one rater from each OKE. In the case of the Matura exam in mathematics, for which papers at the basic and extended levels were selected, each paper was assessed by four raters (each from a different OKE).
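The allocation rule described above (one rater from every OKE for each Polish essay, four raters from four different OKEs for each mathematics paper) can be illustrated with a minimal sketch. It is not the linking design used in the study, which is not specified here; the function `assign_raters`, the board labels and the rater identifiers are all hypothetical, and a simple rotation is assumed in place of the actual sampling scheme.

```python
def assign_raters(paper_ids, raters_by_board, boards_per_paper=None):
    """Give each paper one rater from each of several different boards.

    Hypothetical rotation scheme: boards are taken in a rotating order and
    raters within a board are used in turn, so workloads stay balanced.
    """
    boards = sorted(raters_by_board)
    n = len(boards) if boards_per_paper is None else boards_per_paper
    if not 1 <= n <= len(boards):
        raise ValueError("boards_per_paper must be between 1 and the number of boards")
    if any(len(raters_by_board[b]) == 0 for b in boards):
        raise ValueError("every board needs at least one rater")

    next_rater = {b: 0 for b in boards}   # position of the next rater to use per board
    plan = {}
    for i, paper in enumerate(paper_ids):
        # Rotate the starting board so that all boards are used equally often.
        chosen = [boards[(i + j) % len(boards)] for j in range(n)]
        plan[paper] = []
        for b in chosen:
            pool = raters_by_board[b]
            plan[paper].append((b, pool[next_rater[b] % len(pool)]))
            next_rater[b] += 1
    return plan


# Hypothetical data: eight boards with two raters each.
raters_by_board = {
    f"OKE-{k}": [f"OKE-{k}/rater-{r}" for r in (1, 2)] for k in range(1, 9)
}
polish_plan = assign_raters(["essay-A", "essay-B", "essay-C"], raters_by_board)
maths_plan = assign_raters(["maths-1", "maths-2"], raters_by_board, boards_per_paper=4)
```

Because every paper is marked by raters from several boards, differences between raters and between boards can be separated from differences between papers, which is the purpose of linking papers and raters.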
Assessment was conducted in conditions as close as possible to those of the main Matura exam session (held in May). An analogous assessment coordination structure was applied, the teams were located in the same cities, and the assessment criteria and schemes were left unchanged.
Results of the analyses
The studies have shown that raters have a significant influence on assessment results in both the Polish language and the mathematics Matura exams. In the case of the Polish language, this influence is clearly stronger. The difference in leniency at the test level between the 25% most lenient raters and the 25% most stringent raters for the Polish language ranges (depending on the year and essay topic) from 3.1 to 3.7 percentage points of the exam result. For mathematics, these differences range from 0.87 to 1.36 percentage points of the exam result².  Such a result is not surprising. As research shows, rating essays will always be subjective to some extent. Reaching full rater agreement is utopian. Demanding 100% rater agreement leads to the petrification of criterion-referenced assessment. As a consequence, it can have a negative influence on teaching the important skill of essay writing.
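The quartile comparison reported above can be illustrated with a minimal sketch. It takes one possible reading of the statistic, namely the difference between the mean leniency of the most lenient quarter of raters and that of the most stringent quarter. The function name `leniency_gap` and the sample values are hypothetical and are not taken from the study, in which leniency was estimated with a hierarchical rater model rather than supplied directly.

```python
from statistics import mean


def leniency_gap(rater_leniency, share=0.25):
    """Mean leniency of the most lenient `share` of raters minus the mean
    leniency of the most stringent `share` (in the units of the input)."""
    if not rater_leniency:
        raise ValueError("at least one rater is required")
    ordered = sorted(rater_leniency)            # most stringent raters first
    k = max(1, int(len(ordered) * share))       # number of raters in each extreme group
    return mean(ordered[-k:]) - mean(ordered[:k])


# Hypothetical leniency estimates for eight raters, in percentage points
# of the exam result (positive values mean a more lenient rater).
sample = [-2.0, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 2.0]
gap = leniency_gap(sample)
print(gap)
```

On this reading, a gap of about 3 percentage points means that the same paper would on average score about 3 points higher with a rater from the lenient group than with one from the stringent group.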
Differences in the leniency of assessment occur not only at the individual level, but also at the level of Regional Examination Boards. In the case of the Polish language, in some OKEs these differences are systematic and are present in all the essay topics examined. The average difference between the most stringent and the most lenient board was 1.6 percentage points of the exam result. An analysis of the differences in students' exam results at the OKE level shows that rater leniency explains a large portion of this variation (on average across all topics, 11% of the variance in results). In the case of mathematics, differences at the OKE level also occur, but
Result for the entire exam - including short-answer tasks and an essay for the Polish language as well as closed and open tasks for mathematics.