Cohen's kappa is a statistical coefficient that expresses the degree of accuracy and reliability of a statistical classification. It measures the agreement between two raters (judges) who each classify items into mutually exclusive categories. The statistic was introduced in 1960 by Jacob Cohen in the journal Educational and Psychological Measurement. It is defined as

κ = (p_o − p_e) / (1 − p_e),

where p_o is the relative observed agreement among the raters and p_e is the hypothetical probability of chance agreement.

When comparing two methods of measurement, it is of interest not only to estimate the bias and the limits of agreement between the two methods (inter-method agreement), but also to evaluate these characteristics for each method in itself. It is quite possible that the agreement between two methods is poor simply because one method has wide limits of agreement while the other's are narrow. In that case, the method with the narrow limits of agreement would be statistically superior, although practical or other considerations could alter that assessment. In any event, what constitutes narrow or wide limits of agreement, or a large or small bias, is a matter of practical judgment.

Kappa is thus a way to measure agreement or reliability while correcting for how often the raters might agree by chance. Cohen's kappa,[5] which works for two raters, and Fleiss' kappa,[6] an adaptation that works for any fixed number of raters, improve upon the joint probability of agreement by taking into account the amount of agreement that could be expected by chance. The original versions suffered from the same problem as the joint probability, in that they treat the data as nominal and assume the ratings have no natural ordering; if the data do have an order (ordinal level of measurement), that information is not fully used by these measures.
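The definition above can be sketched in code. The following is a minimal illustration (the function name and the example matrix are my own, not from the source): it computes p_o from the diagonal of a square agreement matrix, p_e from the product of the raters' marginal proportions, and then applies κ = (p_o − p_e) / (1 − p_e).

```python
import numpy as np

def cohens_kappa(confusion):
    """Cohen's kappa from a square agreement matrix.

    Rows are rater A's categories, columns are rater B's categories;
    cell (i, j) counts items rated i by A and j by B.
    """
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    # p_o: relative observed agreement = proportion on the diagonal
    p_o = np.trace(confusion) / total
    # p_e: chance agreement = sum over categories of (A's marginal * B's marginal)
    p_e = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / total**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 2x2 example: 20 pass/pass, 15 fail/fail, 15 disagreements
print(cohens_kappa([[20, 5], [10, 15]]))  # 0.4 (p_o = 0.70, p_e = 0.50)
```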
On the surface, these data appear amenable to analysis using the methods for 2 × 2 tables (if the variable is categorical) or correlation (if it is numerical) that we have described earlier in this series.

However, closer examination shows that this is not so. In those methods, the two measurements on each individual relate to different variables (e.g., exposure and outcome, or height and weight), whereas in "agreement studies" the two measurements relate to the same variable (e.g., chest X-rays read by two radiologists, or hemoglobin measured by two methods).

Now consider a hypothetical situation in which the examiners assign grades simply by tossing a coin: heads = pass, tails = fail (Table 1, situation 2). In that case, one would expect 25% (= 0.50 × 0.50) of the students to receive a "pass" from both examiners, and another 25% to receive a "fail" from both — an overall "expected" agreement rate of 50% (= 0.25 + 0.25). The observed agreement rate (80% in situation 1) must therefore be interpreted keeping in mind that 50% agreement was expected purely by chance. The examiners could have improved on chance by at most 50% (best possible agreement minus chance-expected agreement = 100% − 50% = 50%), but actually achieved only 30% (observed agreement minus chance-expected agreement = 80% − 50% = 30%).
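The arithmetic of this worked example can be checked directly. A minimal sketch, using the figures stated in the text (80% observed agreement, 50% chance agreement); the variable names are illustrative:

```python
# Hypothetical examiner example (Table 1, situation 1 vs. coin-toss chance):
p_o = 0.80  # observed agreement between the two examiners
p_e = 0.50  # chance agreement: 0.5*0.5 pass-pass + 0.5*0.5 fail-fail

achieved_above_chance = p_o - p_e   # 30% actually achieved beyond chance
possible_above_chance = 1.0 - p_e   # 50% achievable beyond chance at best

kappa = achieved_above_chance / possible_above_chance
print(round(kappa, 2))  # 0.6
```

Kappa is thus the achieved excess over chance as a fraction of the maximum possible excess over chance, which is why 80% raw agreement translates into a kappa of only 0.60 here.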