Sunday, November 8, 2009

RKS scale not dramatically increasing predictive accuracy

If moving to the Ross K. Smith 100-point scale increases the accuracy of speaker points, and if speaker points accurately reflect a team's abilities, we might expect that the scale change would help speaker points predict future winners. To my surprise, there is not a dramatic increase in the predictive accuracy of speaker points in the three large tournaments that moved to the new scale while maintaining the same number of teams and preliminary rounds.

I used the following model: Predictions are made for rounds five and later, using at least the first four preliminary rounds. The model makes no prediction if the two teams have different win-loss records. If they have the same win-loss record and the same total points, the model also does not make a prediction. Otherwise, it predicts that whichever team had the higher points for the predicting rounds will win the debate.

If two teams have the same record for rounds one through four, but they don't meet until round eight, the model still predicts the winner based on their points in rounds one through four, since the teams might well have debated in round five. It also makes predictions based on their points in rounds one through five, one through six, and one through seven, assuming their win-loss records are the same after rounds five, six, and seven. All predictions are based on rounds one through four, one through five, one through six, or one through seven; no prediction ever discards the early prelims.
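For the curious, here is a rough sketch of the prediction rule in code. The team record format is made up for illustration; this is not the actual script used to produce the tables below.

```python
def predict_winner(team_a, team_b, through_round):
    """Predict a debate between team_a and team_b using results from
    prelim rounds 1..through_round (through_round is 4, 5, 6, or 7).

    Each team is assumed to be a dict like {"wins": [...], "points": [...]},
    where index i holds that team's result in prelim round i+1.  Returns
    the predicted winner, or None when the model declines to predict.
    """
    wins_a = sum(team_a["wins"][:through_round])
    wins_b = sum(team_b["wins"][:through_round])
    if wins_a != wins_b:
        return None  # different win-loss records: no prediction

    points_a = sum(team_a["points"][:through_round])
    points_b = sum(team_b["points"][:through_round])
    if points_a == points_b:
        return None  # same record, tied points: no prediction

    return team_a if points_a > points_b else team_b
```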

Wake Forest, 8 prelims

Year        Teams   Tied points (no prediction)   Correct   Incorrect   Accuracy
2006-2007   138     11                            89        50          63.0%
2007-2008   134     1                             101       79          56.1%

UNLV open division, 7 prelims

Year        Teams   Tied points (no prediction)   Correct   Incorrect   Accuracy
2008-2009   56      0                             32        23          58.2%
2009-2010   54      0                             20        11          64.5%

Harvard, 8 prelims

Year        Teams   Tied points (no prediction)   Correct   Incorrect   Accuracy
2008-2009   80      9                             57        27          66.1%
2009-2010   87      4                             68        33          66.7%

Wednesday, November 4, 2009

RKS translation at Harvard

The Harvard tournament this year had a very explicit and very simple translation from the 30-point scale to the 100-point (or Ross K. Smith) scale. Using the results packet, we can see how the judging pool interpreted the RKS scale:

If we apply the Harvard translation to historical tournaments, the switch to RKS this year was accompanied by more point inflation than usual. The older data (drawn from debateresults.com) shows that the median speaker used to earn between 27.8 and 28.0 points (78 to 80, under RKS). This year, the median was 28.25 points (82.5, under RKS). A 28.0 (80, under RKS) used to mean 55th to 65th percentile. This year, it meant about 35th percentile.

It will be interesting to see how many judges circled the indicator on the front of their ballots to show that they were conforming to the suggested scale translation.

It should also be noted that Harvard's points this year were substantially lower than Kentucky's points this year:

Sunday, October 11, 2009

The Ross K. Smith scale and standards

(Click here for more on speaker point scale changes)

Judge scale variation

In a post on edebate (mirror, mirror), Brian DeLong suggests that tournaments adopting the 100-point (RKS) scale provide an interpretation of the scale to judges to make sure points are allocated fairly. His motivation is based partially on a discrepancy he saw between judges in the Kentucky results from this year (mirror): "Some people are on this 87=average boat, while others place average at around 78-80ish". Here is a chart of the average points given out by judges at Kentucky this year:

Some of these discrepancies could be between judges who simply saw teams of different abilities. One way to correct for this is to compare the points given by each judge to the points given to the same competitors by other judges. From this, we can see how much a judge skews high or low. Applying this skew to the actual average points (86.37), we get an estimate of each judge's perception of the average:

As in the first chart, the red line shows the actual average point distribution; the blue line shows the distribution of estimates of judge perceptions of the average. To get a feel for how the estimate works, here are two examples:

  • Alex Lamballe gave an average of 90.83 points, but other judges judging the same debaters gave them an average of 83.00 points. His skew is 7.83 points, so we estimate that he perceives the average points to be 7.83 points higher than the true average of 86.37. His estimated perceived average is 94.20.
  • Justin Kirk gave an average of 79.50 points, but other judges judging the same debaters gave them an average of 90.00 points. His skew is -10.50 points, so we estimate that he perceives the average points to be 10.50 points lower than the true average of 86.37. His estimated perceived average is 75.87.

From these two extreme examples, it should be clear that this method of estimating judge-perceived averages is quite inexact. I think it is mostly useful as a check to ensure that the point skew in the first graph is not solely due to the differing quality of teams seen by different judges. Clearly, there is some variation in what judges think "average" is.
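Here is a rough sketch of that skew calculation, under my reading of the method described above. The ballot format is a placeholder, not the actual data layout.

```python
from collections import defaultdict

def estimated_perceived_averages(ballots, overall_average):
    """ballots: list of (judge, debater, points) tuples.
    For each judge, compare the points they gave to the points the same
    debaters received from everyone else, then shift the overall average
    by that skew to estimate the judge's perceived average."""
    by_judge = defaultdict(list)    # judge -> [(debater, points)]
    by_debater = defaultdict(list)  # debater -> [(judge, points)]
    for judge, debater, points in ballots:
        by_judge[judge].append((debater, points))
        by_debater[debater].append((judge, points))

    perceived = {}
    for judge, given in by_judge.items():
        own, others = [], []
        for debater, points in given:
            other_points = [p for j, p in by_debater[debater] if j != judge]
            if other_points:  # only count debaters also seen by other judges
                own.append(points)
                others.append(sum(other_points) / len(other_points))
        if own:
            skew = sum(own) / len(own) - sum(others) / len(others)
            perceived[judge] = overall_average + skew
    return perceived
```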

But how can we check if Kentucky showed more variation in judge point scale than other tournaments? One way is to measure judge scale variation with a test that has a scale-invariant expected distribution. The Mann-Whitney U statistic is approximately normally distributed for reasonably sized samples, so we can use it to find the Z-score for each judge at a tournament. The larger the variance of judge Z-scores, the more variation there is in judge point scale. (As above, the two samples are the points given by a certain judge and the points given to the same debaters by the rest of the pool.)
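One way to compute that per-judge Z-score, using the normal approximation and ignoring the tie correction (the function and variable names here are my own):

```python
from math import sqrt

def mann_whitney_z(judge_points, other_points):
    """Z-score of the Mann-Whitney U statistic comparing the points a
    judge gave with the points the same debaters got from other judges,
    using the normal approximation (no tie correction)."""
    n1, n2 = len(judge_points), len(other_points)
    # U = number of (judge, other) pairs where the judge's value is
    # higher, counting exact ties as one half.
    u = sum((a > b) + 0.5 * (a == b)
            for a in judge_points for b in other_points)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mean_u) / sd_u
```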

The 2009 tournament showed more judge scale variance than any of the other Kentucky tournaments in the debateresults.com database. If we compare the distribution of judge scale Z-scores from 2009 to the combined distribution of judge scale Z-scores from Kentucky in 30-point-scale years:

There is clearly more judge scale variation under the 100-point scale.

The Wake Forest tournament was the first to change to the 100-point scale, in 2007. There was no corresponding jump in judge scale variance at Wake that year, but Wake provided a reference scale with extensive documentation.

Reference point scales

DeLong also suggested a translation of the 30-point scale to the 100-point scale. The Kentucky judges from 2008 and 2009 also implicitly suggest a translation between the scales by the distribution of their votes. For instance, a 27 was at the 10th percentile of points given in 2008; this value is closest to an 80 in 2009, since 80 was at the 11th percentile in 2009.
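A sketch of how that implicit translation can be computed by percentile matching. This is my reading of the method, and the data variables in the example comment are placeholders.

```python
from bisect import bisect_left

def percentile(scores, value):
    """Fraction of scores strictly below value (one simple convention)."""
    ranked = sorted(scores)
    return bisect_left(ranked, value) / len(ranked)

def implicit_translation(old_scores, new_scores, old_values, new_values):
    """Map each value on the old scale to the new-scale value whose
    percentile in new_scores is closest to the old value's percentile
    in old_scores."""
    translation = {}
    for old in old_values:
        p = percentile(old_scores, old)
        translation[old] = min(
            new_values,
            key=lambda new: abs(percentile(new_scores, new) - p))
    return translation

# e.g. implicit_translation(kentucky_2008_points, kentucky_2009_points,
#                           [26.0, 26.5, 27.0, 27.5, 28.0, 28.5, 29.0, 29.5],
#                           range(60, 101))
```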

The chart below compares DeLong's translation with the implicit translation by the Kentucky and Wake judges. It also charts the translations proposed by Michael Hester (mirror, mirror) and Roy Levkovitz, as well as a very simple "last digit" translation, calculated by subtracting 20 from a score under the 30-point system and then multiplying by 10: 27.5 becomes 75, 29.0 becomes 90, and so on.

The implicit translation induced by the Kentucky judging from 2008 is:

30-point scale   100-point scale
29.5             98
29.0             94
28.5             91
28.0             88
27.5             84.5
27.0             80
26.5             75
26.0             72

The other translations (except Wake) are lower for most of the scale.

Another way to compare translations is to see how they would affect the cumulative point distributions. For example, at Kentucky in 2009, a 30th percentile performance would earn an 85. At Wake in 2007, a 70th percentile performance would earn a 90.

(Included in this chart is the point scale of Brian Rubaie, which he gives in terms of position, rather than a translation from the 30-point scale.)
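One way to produce that kind of cumulative-distribution comparison, as a rough sketch; the translation tables and point lists are placeholders.

```python
def score_at_percentile(scores, pct):
    """Score earned by a performance at the given percentile (0-100),
    using a simple rank-index convention."""
    ranked = sorted(scores)
    index = min(len(ranked) - 1, int(pct / 100 * len(ranked)))
    return ranked[index]

def translated_distribution(old_scores, translation):
    """Apply a 30-point-to-100-point translation table to a list of
    old-scale scores, so the result can be compared against an actual
    100-point-scale tournament."""
    return [translation[s] for s in old_scores]

# e.g. score_at_percentile(kentucky_2009_points, 30)
#      would be 85 under the example in the text above.
```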

In the cumulative distribution chart, as in the translation chart, the "last digit" translation is close to the proposal of DeLong. The comparatively larger point range Hester gives to the 40%-85% area is alluded to in his proposal: "these two have large ranges b/c they are the areas i want to distinguish the most". Levkovitz has a similar reasoning:

The difference to me between a 29.5 and 30 is quite negligible; both displayed either perfection or near perfection so their range should be smaller. But it is within what we call now 28.5, 28, and 27.5s that I want to have a larger range of points to choose from. The reasoning for this is that I want to be able to differentiate more between a solid 28 and a low 28, or a strong 28.5 and low 28.

Friday, September 11, 2009

Topic side bias

Every summer there is discussion about the side bias of prospective controversy areas and resolutions. Using the data from DebateResults.com, we can see what side bias past resolutions have exhibited.

Under the Bradley-Terry model, the energy, China, and security guarantee topics had a small but highly statistically significant (p < .001) bias toward the negative. The other three topics show no statistically significant bias either way.

The BT model suggests that the largest topic side bias in the past six years was 12.4%, on the energy topic. With a 12.4% neg bias, in an otherwise evenly matched round, the neg has a 52.9% chance of winning. For more details, see the description of how the bias is measured.
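As a worked example of that last conversion, assuming the bias enters the Bradley-Terry model as a multiplicative advantage on the negative's strength (my reading, not necessarily the exact parameterization used):

```python
def neg_win_probability(bias, aff_strength=1.0, neg_strength=1.0):
    """Bradley-Terry win probability for the negative when its strength
    is scaled up by (1 + bias)."""
    boosted = (1 + bias) * neg_strength
    return boosted / (boosted + aff_strength)

print(neg_win_probability(0.124))  # ~0.529 for evenly matched teams
```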

Tuesday, July 28, 2009

Update on speaker point scales

In response to a question from cross-x.com, I checked to see if the point distributions are any smoother for teams that might clear. It turns out that they aren't, except to the extent that the distributions have a narrower range:

Thursday, July 23, 2009

Changing the speaker point scale

The Georgia State tournament is moving to a 100-point speaker point scale next year. This will make it the third large [N.B.] tournament to use a non-standard scale. I have put up some charts of how changing the scale affected point distributions at Wake and USC.

As you might expect, the numbers of distinct speaker points per round and per speaker both increased. On the other hand, judges didn't use the whole range of points. At Wake (100-point scale), points clustered around multiples of 5. At USC (30-point scale, finer granularity), points clustered around the old half-point scale.
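A quick sketch of how those two observations can be checked, given a list of the points awarded at a tournament; the variable names are placeholders.

```python
def distinct_point_values(points):
    """How many distinct speaker point values were awarded."""
    return len(set(points))

def fraction_on_grid(points, step):
    """Fraction of points landing on a multiple of `step`
    (e.g. step=5 on the 100-point scale, step=0.5 for the old
    half-point granularity)."""
    hits = sum(1 for p in points if abs(p / step - round(p / step)) < 1e-9)
    return hits / len(points)
```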

Are any other tournament directors planning on switching point scales? If you considered doing so, but decided not to, what stopped you? If you are switching, why did you choose the scale you chose?

N.B. Two other (smaller) tournaments changed speaker point granularity since '03-'04: the Weber State round robin used a granularity of 0.1 at their '08-'09 tournament, and the Northern Illinois tournament seems to have changed from the usual half-point granularity to a full-point granularity for their '05-'06 tournament.

Wednesday, May 13, 2009

CEDA Nationals vs NDT attendance

I have put up some charts about CEDA Nats vs. NDT attendance.

The summary:

  • There was a sharp drop off in CEDA Nats attendance in 2009. Some coaches have suggested this was due to its placement in Pocatello, Idaho.
  • From 2004-2008, 57% of CEDA Nats competitors debated almost exclusively in the open division at other tournaments the year in question. CEDA Nats 2009 had a novice breakout, but we won't be able to see if or how attendance changed until Bruschke's 2008-2009 database dump comes out. Even then, any analysis will have to contend with the decreased total attendance.
  • Fewer than half of NDT teams skip CEDA Nats. There is very little correlation between NDT success and skipping CEDA.

The raw numbers and the code used to generate them are available.