Sunday, October 11, 2009

The Ross K. Smith scale and standards

(Click here for more on speaker point scale changes)

Judge scale variation

In a post on edebate (mirror, mirror), Brian DeLong suggests that tournaments adopting the 100-point (RKS) scale provide an interpretation of the scale to judges to make sure points are allocated fairly. His motivation is based partially on a discrepancy he saw between judges in the Kentucky results from this year (mirror): "Some people are on this 87=average boat, while others place average at around 78-80ish". Here is a chart of the average points given out by judges at Kentucky this year:

Some of these discrepancies could be between judges who simply saw teams of different abilities. One way to correct for this is to compare the points given by each judge to the points given to the same competitors by other judges. From this, we can see how much a judge skews high or low. Applying this skew to the actual average points (86.37), we get an estimate of each judge's perception of the average:

As in the first chart, the red line shows the actual average point distribution; the blue line shows the distribution of estimates of judge perceptions of the average. To get a feel for how the estimate works, here are two examples:

  • Alex Lamballe gave an average of 90.83 points, but other judges judging the same debaters gave them an average of 83.00 points. His skew is 7.83 points, so we estimate that he perceives the average points to be 7.83 points higher than the true average of 86.37. His estimated perceived average is 94.20.
  • Justin Kirk gave an average of 79.50 points, but other judges judging the same debaters gave them an average of 90.00 points. His skew is -10.50 points, so we estimate that he perceives the average points to be 10.50 points lower than the true average of 86.37. His estimated perceived average is 75.87.

From these two extreme examples, it should be clear that this method of estimating judge-perceived averages is quite inexact. I think it is mostly useful as a check to ensure that the point skew in the first graph is not solely due to the differing quality of teams seen by different judges. Clearly, there is some variation in what judges think "average" is.
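The skew estimate described above can be sketched in a few lines. This is a minimal illustration, assuming ballots are stored as (judge, debater, points) records; the data structure and the ballot values below are hypothetical, not actual Kentucky results:

```python
from collections import defaultdict

TRUE_AVERAGE = 86.37  # tournament-wide average points

# Hypothetical ballots: (judge, debater, points awarded)
ballots = [
    ("Lamballe", "A", 91), ("Lamballe", "B", 90),
    ("Other1",   "A", 83), ("Other2",   "B", 83),
]

def perceived_averages(ballots, true_average):
    """Estimate each judge's perceived 'average' by adding their skew
    (points they gave minus points other judges gave the same debaters)
    to the actual tournament-wide average."""
    by_judge = defaultdict(list)    # judge -> [(debater, points), ...]
    by_debater = defaultdict(list)  # debater -> [(judge, points), ...]
    for judge, debater, pts in ballots:
        by_judge[judge].append((debater, pts))
        by_debater[debater].append((judge, pts))

    perceived = {}
    for judge, rows in by_judge.items():
        own, others = [], []
        for debater, pts in rows:
            peers = [p for j, p in by_debater[debater] if j != judge]
            if peers:  # only compare debaters other judges also saw
                own.append(pts)
                others.extend(peers)
        if own:
            skew = sum(own) / len(own) - sum(others) / len(others)
            perceived[judge] = true_average + skew
    return perceived
```

With these made-up ballots, "Lamballe" averages 90.5 against a peer average of 83, so his skew is +7.5 and his estimated perceived average is 93.87.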

But how can we check if Kentucky showed more variation in judge point scale than other tournaments? One way is to measure judge scale variation with a test that has a scale-invariant expected distribution. The Mann-Whitney U test is approximately normal for reasonably-sized samples, so we can use that to find the Z-score for each judge at a tournament. The larger the variance of judge Z-scores, the more variation there is in judge point scale. (As above, the two samples are the points given by a certain judge and the points given to the same debaters by the rest of the pool.)
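The per-judge Z-score can be computed directly from the normal approximation of the Mann-Whitney U statistic. This is a sketch under that approximation, ignoring the tie correction; the function name and sample point lists are mine, not from the post:

```python
import math

def mann_whitney_z(judge_points, pool_points):
    """Z-score of the Mann-Whitney U statistic under the normal
    approximation (no tie correction), comparing the points one judge
    gave to the points the rest of the pool gave the same debaters."""
    n1, n2 = len(judge_points), len(pool_points)
    # U: over all pairs, count the judge's point beating a pool point
    # (ties count as one half)
    u = sum((a > b) + 0.5 * (a == b)
            for a in judge_points for b in pool_points)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mean) / sd
```

A high-pointing judge gets a positive Z, a low-pointing judge a negative one, and the variance of these Z-scores across a tournament's judges is the scale-variation measure. The scale invariance comes from U depending only on rank order, never on the raw point values.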

The 2009 tournament showed more judge scale variance than any of the other Kentucky tournaments in the database. If we compare the distribution of judge scale Z-scores from 2009 to the combined distribution of judge scale Z-scores from Kentucky in 30-point-scale years:

There is clearly more judge scale variation under the 100-point scale.

The Wake Forest tournament was the first tournament to change to the 100-point scale, in 2007. There was no corresponding jump in judge scale variance at Wake that year, but Wake provided a reference scale with extensive documentation.

Reference point scales

DeLong also suggested a translation of the 30-point scale to the 100-point scale. The Kentucky judges from 2008 and 2009 also implicitly suggest a translation between the scales by the distribution of their votes. For instance, a 27 was at the 10th percentile of points given in 2008; this value is closest to an 80 in 2009, since 80 was at the 11th percentile in 2009.
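That percentile-matching construction can be made precise with a short sketch: take the old score's percentile in the old distribution, then return the new-scale score whose percentile is closest. The two point lists here are made-up stand-ins, not the actual 2008 and 2009 Kentucky distributions:

```python
from bisect import bisect_right

def percentile(points_sorted, score):
    """Fraction of scores at or below `score` in a sorted list."""
    return bisect_right(points_sorted, score) / len(points_sorted)

def implicit_translation(old_points, new_points, old_score):
    """Map an old-scale score to the new-scale score at the closest percentile."""
    old_sorted, new_sorted = sorted(old_points), sorted(new_points)
    target = percentile(old_sorted, old_score)
    return min(sorted(set(new_sorted)),
               key=lambda s: abs(percentile(new_sorted, s) - target))

# Hypothetical distributions standing in for 2008 (30-pt) and 2009 (100-pt):
points_2008 = [26, 27, 27, 28, 28, 28, 29]
points_2009 = [75, 80, 85, 88, 88, 90, 92]
```

In this toy data a 27 sits at the 3/7 percentile of the old distribution, and an 85 sits at the same percentile of the new one, so the implicit translation sends 27 to 85.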

The chart below compares DeLong's translation with the implicit translation by the Kentucky and Wake judges. It also charts the translations proposed by Michael Hester (mirror, mirror) and Roy Levkovitz, as well as a very simple "last digit" translation, calculated by subtracting 20 from a score under the 30-point system and then multiplying by 10: 27.5 becomes 75, 29.0 becomes 90, and so on.
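The "last digit" rule is just an affine map, which can be written in one line; the examples from the text follow directly:

```python
def last_digit(score_30):
    """'Last digit' translation: subtract 20, then multiply by 10."""
    return (score_30 - 20) * 10

assert last_digit(27.5) == 75
assert last_digit(29.0) == 90
```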

The implicit translation induced by the Kentucky judging from 2008 is:

  30-pt scale    100-pt scale
  28             88
  27             80

The other translations (except Wake) are lower for most of the scale.

Another way to compare translations is to see how they would affect the cumulative point distributions. For example, at Kentucky in 2009, a 30th percentile performance would earn an 85. At Wake in 2007, a 70th percentile performance would earn a 90.

(Included in this chart is the point scale of Brian Rubaie, which he gives in terms of position, rather than a translation from the 30-point scale.)

In the cumulative distribution chart, as in the translation chart, the "last digit" translation is close to the proposal of DeLong. The comparatively larger point range Hester gives to the 40%-85% area is alluded to in his proposal: "these two have large ranges b/c they are the areas i want to distinguish the most". Levkovitz has a similar reasoning:

The difference to me between a 29.5 and 30 is quite negligible; both displayed either perfection or near perfection so their range should be smaller. But it is within what we call now 28.5, 28, and 27.5s that I want to have a larger range of points to choose from. The reasoning for this is that I want to be able to differentiate more between a solid 28 and a low 28, or a strong 28.5 and low 28.