Sunday, October 11, 2009

The Ross K. Smith scale and standards

(Click here for more on speaker point scale changes)

Judge scale variation

In a post on edebate (mirror, mirror), Brian DeLong suggests that tournaments adopting the 100-point (RKS) scale provide an interpretation of the scale to judges to make sure points are allocated fairly. His motivation is based partially on a discrepancy he saw between judges in the Kentucky results from this year (mirror): "Some people are on this 87=average boat, while others place average at around 78-80ish". Here is a chart of the average points given out by judges at Kentucky this year:

Some of these discrepancies could be between judges who simply saw teams of different abilities. One way to correct for this is to compare the points given by each judge to the points given to the same competitors by other judges. From this, we can see how much a judge skews high or low. Applying this skew to the actual average points (86.37), we get an estimate of each judge's perception of the average:

As in the first chart, the red line shows the actual average point distribution; the blue line shows the distribution of estimates of judge perceptions of the average. To get a feel for how the estimate works, here are two examples:

  • Alex Lamballe gave an average of 90.83 points, but other judges judging the same debaters gave them an average of 83.00 points. His skew is 7.83 points, so we estimate that he perceives the average points to be 7.83 points higher than the true average of 86.37. His estimated perceived average is 94.20.
  • Justin Kirk gave an average of 79.50 points, but other judges judging the same debaters gave them an average of 90.00 points. His skew is -10.50 points, so we estimate that he perceives the average points to be 10.50 points lower than the true average of 86.37. His estimated perceived average is 75.87.

From these two extreme examples, it should be clear that this method of estimating judge-perceived averages is quite inexact. I think it is mostly useful as a check to ensure that the point skew in the first graph is not solely due to the differing quality of teams seen by different judges. Clearly, there is some variation in what judges think "average" is.
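For concreteness, here is a minimal Python sketch of the skew calculation; the (judge, debater, points) record format and the function name are illustrative, not the format the tournament data actually comes in:

    from collections import defaultdict

    def estimate_perceived_averages(records, overall_avg=86.37):
        # records: (judge, debater, points) tuples, one per speech
        by_judge = defaultdict(list)    # judge -> [(debater, points)]
        by_debater = defaultdict(list)  # debater -> [(judge, points)]
        for judge, debater, points in records:
            by_judge[judge].append((debater, points))
            by_debater[debater].append((judge, points))

        estimates = {}
        for judge, rounds in by_judge.items():
            own = [pts for _, pts in rounds]
            # Points the same debaters received from the rest of the pool.
            others = [pts for debater, _ in rounds
                      for other, pts in by_debater[debater]
                      if other != judge]
            if not others:
                continue  # no overlap with other judges
            skew = sum(own) / len(own) - sum(others) / len(others)
            estimates[judge] = overall_avg + skew
        return estimates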

But how can we check whether Kentucky showed more variation in judge point scale than other tournaments? One way is to measure judge scale variation with a test whose expected distribution is scale-invariant. The Mann-Whitney U statistic is approximately normally distributed for reasonably sized samples, so we can use it to find a Z-score for each judge at a tournament. The larger the variance of judge Z-scores, the more variation there is in judge point scale. (As above, the two samples are the points given by a certain judge and the points given to the same debaters by the rest of the pool.)
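Here is a sketch of the Z-score computation under the usual normal approximation (midranks handle ties, though the tie correction to the variance is omitted for simplicity; the function name is mine):

    import numpy as np
    from scipy.stats import rankdata

    def judge_scale_z(judge_points, pool_points):
        # judge_points: points given by one judge
        # pool_points:  points given to the same debaters by other judges
        x = np.asarray(judge_points, dtype=float)
        y = np.asarray(pool_points, dtype=float)
        n1, n2 = len(x), len(y)
        ranks = rankdata(np.concatenate([x, y]))   # midranks for ties
        u = ranks[:n1].sum() - n1 * (n1 + 1) / 2   # Mann-Whitney U for x
        mean_u = n1 * n2 / 2.0
        sd_u = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
        return (u - mean_u) / sd_u  # positive: the judge's points rank high

The variance of these Z-scores across a tournament's judges is the scale-variation statistic compared below.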

The 2009 tournament showed more judge scale variance than any of the other Kentucky tournaments in the debateresults.com database. If we compare the distribution of judge scale Z-scores from 2009 to the combined distribution of judge scale Z-scores from Kentucky in 30-point-scale years:

There is clearly more judge scale variation under the 100-point scale.

The Wake Forest tournament was the first to change to the 100-point scale, in 2007. There was no corresponding jump in judge scale variance at Wake that year, but Wake provided a reference scale with extensive documentation.

Reference point scales

DeLong also suggested a translation of the 30-point scale to the 100-point scale. The Kentucky judges from 2008 and 2009 also implicitly suggest a translation between the scales by the distribution of their points. For instance, a 27 was at the 10th percentile of points given in 2008; this value is closest to an 80 in 2009, since 80 was at the 11th percentile in 2009.
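A sketch of this percentile matching; note that np.percentile interpolates rather than snapping to the nearest score actually awarded, so its output can differ slightly from the table below:

    import numpy as np
    from scipy.stats import percentileofscore

    def implicit_translation(points_2008, points_2009, old_values):
        # Map each old-scale value to the new-scale score at the same
        # percentile, e.g. 27 (10th percentile in 2008) -> about 80.
        return {v: np.percentile(points_2009,
                                 percentileofscore(points_2008, v))
                for v in old_values}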

The chart below compares DeLong's translation with the implicit translations by the Kentucky and Wake judges. It also charts the translations proposed by Michael Hester (mirror, mirror) and Roy Levkovitz, as well as a very simple "last digit" translation, calculated by subtracting 20 from a score under the 30-point system and then multiplying by 10: 27.5 becomes 75, 29.0 becomes 90, and so on.

The implicit translation induced by the Kentucky judging from 2008 is:

30-pt scale    100-pt scale
29.5           98
29.0           94
28.5           91
28.0           88
27.5           84.5
27.0           80
26.5           75
26.0           72

The other translations (except Wake) are lower for most of the scale.

Another way to compare translations is to see how they would affect the cumulative point distributions. For example, at Kentucky in 2009, a 30th percentile performance would earn an 85. At Wake in 2007, a 70th percentile performance would earn a 90.

(Included in this chart is the point scale of Brian Rubaie, which he gives in terms of position, rather than a translation from the 30-point scale.)
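Each curve in that chart is a percentile-to-score lookup; a sketch (the function name is mine, and the points are assumed to be loaded into an array):

    import numpy as np

    def cumulative_curve(points, step=5):
        # Score earned at each percentile of a tournament's point
        # distribution; e.g. the Kentucky 2009 curve should pass near
        # (30, 85) and the Wake 2007 curve near (70, 90).
        return {q: float(np.percentile(points, q))
                for q in range(0, 101, step)}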

In the cumulative distribution chart, as in the translation chart, the "last digit" translation is close to DeLong's proposal. The comparatively larger point range Hester gives to the 40%-85% region is alluded to in his proposal: "these two have large ranges b/c they are the areas i want to distinguish the most". Levkovitz gives similar reasoning:

The difference to me between a 29.5 and 30 is quite negligible; both displayed either perfection or near perfection so their range should be smaller. But it is within what we call now 28.5, 28, and 27.5s that I want to have a larger range of points to choose from. The reasoning for this is that I want to be able to differentiate more between a solid 28 and a low 28, or a strong 28.5 and low 28.

2 comments:

  1. The Harvard tournament this year is recommending a direct translation of the 30pt scale into a 100pt scale (so that 27.5 = 75, 29 = 90, etc). What this functionally does is lop off the first digit and move the decimal one place to the right. It seems to me that this defeats the purpose. Sure, in a strict sense there will be larger differentiation (5 points is more than .5 points), but there is no *relative* difference between them. Creating a ratio between the two scales preserves the exact same problem of differentiation. It may make us feel better to say that there is a "3 point difference between the 1st and 2nd speaker," but we're functionally saying the same thing when we say that there is a ".3 point difference" on the 30 point scale.

    Changing the numbers alone can't solve the problem (as long as the ratio between those numbers stays the same). Speaker points have become too self-referential. What is a 28 speech? What is a 29.5 speech? What is a 20 speech? These numbers have no meaning on their own. Changing them merely replicates the problem. What is needed is a better rubric for people to evaluate across the numbers. It's the job of the rubric to create differentiation.

  2. Anonymous:

    What problem should a new speaker point scale attempt to solve? I think there are at least two:

    1. Judges do not use the same scale, so seeding and awards are dependent on who judged a team rather than how they performed.
    2. Judges do not have enough granularity in their scale, so small differences in performance lead to large differences in points awarded: a 27.8 looks 0.5 points better than a 27.7, when it should only look 0.1 points better.

    I think the RKS scale solves the latter, but only until point inflation ruins it like it ruined the 30-point scale.

    As for problem #1, I agree that having a 77 denote something other than 27.7 might help. One idea is that the point scale could be 0-100, with judges expected to award each speaker his or her expected percentile out of all debaters at that tournament. This would anchor the points to the judging pool as a whole, as well as provide a way for judges to see if they were giving high or low points.
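    Concretely, the anchoring might look something like this sketch, with scipy's percentileofscore standing in for however the percentile would actually be computed:

        from scipy.stats import percentileofscore

        def anchor_to_pool(points, pool_points):
            # A speech's anchored score is the percentile its raw points
            # occupy among all points given at the tournament, pinning
            # the scale to the pool rather than to each judge's habits.
            return [percentileofscore(pool_points, p) for p in points]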

    Unfortunately, this still doesn't provide an absolute anchor. It might combat point inflation, but only if tournaments had a way to normalize the points of judges who consistently give high points. I'm not sure what an absolute reference scale would even look like. Another problem would be that judges don't know the strength of the teams at a tournament before it occurs, especially at large, diverse tournaments and early in the season.
