Using Jon Bruschke's databases on debateresults.com, I have calculated some historical judge side bias.

For instance, for the 2006-2007 season (the courts topic), the aff won 50.10% of all ballots. The judges below are those most likely to differ from the general judge population that year:

Aff bias:

- Joe Koehle, Aff:56, Neg:30
- Kirk Evans, Aff:26, Neg:10
- Austin Carson, Aff:10, Neg:1

Neg bias:

- Joe Patrice, Aff:22, Neg:45
- Greta Stahl, Aff:26, Neg:50
- Erik Holland, Aff:1, Neg:10
- Brian DeLong, Aff:23, Neg:44
- John Nagy, Aff:15, Neg:32

Other years and more statistical details (as well as the code used to calculate them) are available.
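As a sketch of the kind of test involved, here is how one judge's record might be compared against the 50.10% season baseline in plain Python (using Joe Koehle's 56-30 split from the table above; the exact method behind my numbers may differ):

```python
from math import comb

def binom_two_sided_p(k, n, p0):
    """Exact two-sided binomial p-value: total probability of all
    outcomes no more likely than the observed count k."""
    pmf = lambda i: comb(n, i) * p0 ** i * (1 - p0) ** (n - i)
    obs = pmf(k)
    return min(1.0, sum(pmf(i) for i in range(n + 1)
                        if pmf(i) <= obs * (1 + 1e-12)))

# Joe Koehle: 56 aff ballots out of 86, tested against the
# season-wide aff win rate of 50.10%
p_value = binom_two_sided_p(56, 86, 0.501)
```

A small p-value here says the record would be unusual for a judge drawn from the general population, not that the judge is necessarily biased.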

You need to develop a system to weight wins and losses. It could be that 'stronger' teams tend to be negative in front of certain judges due to the random number generator.

Does your formula control for any covariates, such as the overall win percentages of aff versus neg in a given division? For example, novice debates may be more likely to skew negative than open debates. There seem to be a lot of potential confounding variables behind these decisions. Also, as Anonymous commented, how do you correct for the strength of teams? A judge might just see several presets that skew the sample. Given the low power of a single judge's sample size (at most 70 or 80 ballots, and much lower for many of the judges you've flagged as biased outside a given confidence interval), even a few highly imbalanced presets can wreak havoc on your findings.

To follow up on my previous comment, I asked a few statistics people for suggestions on dealing with weighting. Here is the response:

The main problem is that you have a tiny sample size. With 50-100 debates per year, you could potentially just eliminate all cases where a team that ended up higher in the rankings beat a team below them, applied retroactively once the final rankings are known.

I would suggest something like excluding rounds where one team finished ahead of the other by two or more wins (e.g., exclude a round between a 6-2 team and a 4-4 team). So once you have the final rankings, if #1 beat #4 in the first round, just eliminate that round from consideration; presumably they were the better team if they won. Do this for all 50-100 matches and see what you have left.

Potentially, only eliminate ones where the final rankings differ by more than 2-3 or some arbitrary margin. Or repeat the analysis with various sets of criteria.

Then take whatever is left and test it. You could try something like a sign test, where + is an affirmative win and - is a negative win, and see whether there is any evidence of bias at the 90% level. It won't be hard and fast, but it might be interesting.
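The suggested procedure could be sketched roughly as follows; the round records and the ranking-gap margin below are illustrative assumptions, not the actual debateresults.com data layout:

```python
from math import comb

def sign_test_p(aff_wins, neg_wins):
    """Two-sided sign test: under the null, each ballot is a fair coin flip."""
    n = aff_wins + neg_wins
    k = min(aff_wins, neg_wins)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical records: (aff team's final rank, neg team's final rank, winner)
rounds = [(1, 4, "aff"), (3, 2, "neg"), (5, 6, "aff"), (2, 7, "aff")]

MARGIN = 2  # drop rounds where the final rankings differed by more than this
kept = [r for r in rounds if abs(r[0] - r[1]) <= MARGIN]
aff = sum(1 for _, _, w in kept if w == "aff")
p_value = sign_test_p(aff, len(kept) - aff)
```

At the suggested 90% level, a judge would be flagged whenever the resulting p-value fell below 0.10.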

Bias from the PRNG is something I hadn't considered yet, Anonymous. That's an intriguing idea.

I do plan on adjusting the numbers based on team strength, but your idea is much more interesting than that. I think a source code review of the common software would be revealing.

Gonzo, the formula has no controls for division. You make a good point.

I suspect (though I have not checked) that there is a slight negative bias in JV, as some teams may be made up of one debater who should be in the open division and one who should be in the novice division. The better debater, as a 2N, can often win the debate without much help from his/her partner.

The confidence interval takes into account the number of ballots a judge has produced, and creates larger confidence intervals for judges with fewer ballots.
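One way to get that behavior is a Wilson score interval, shown here as a sketch (an assumption about the interval actually used, not a description of it); with fewer ballots the interval widens:

```python
from math import sqrt

def wilson_ci(aff, n, z=1.96):
    """Approximate 95% Wilson score interval for a judge's aff-ballot rate."""
    p = aff / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Fewer ballots -> wider interval: Erik Holland (1 aff of 11)
# versus Joe Koehle (56 aff of 86)
narrow = wilson_ci(56, 86)
wide = wilson_ci(1, 11)
```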

I think the main problem with the numbers is the total number of judges. With so many judges tested, finding a few with skewed ballot counts may simply mean identifying the tails of a distribution with no greater variance than if each round had been decided by a coin flip.
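That concern can be checked by simulation. A minimal sketch, assuming a naive z-test as the flagging rule (the ballot totals are those of the eight judges listed above):

```python
import random

random.seed(0)

def false_flag_rate(ballot_counts, trials=500, z=1.96):
    """Simulate judges whose every ballot is a fair coin flip, and count
    how often a naive 95% z-test still flags them as 'biased'."""
    flagged = total = 0
    for _ in range(trials):
        for n in ballot_counts:
            k = sum(random.random() < 0.5 for _ in range(n))
            if abs(k - 0.5 * n) / (0.25 * n) ** 0.5 > z:
                flagged += 1
            total += 1
    return flagged / total

# Ballot totals for the eight judges in the tables above
rate = false_flag_rate([86, 36, 11, 67, 76, 11, 67, 47])
```

Even with zero real bias, roughly 5% of coin-flip judges land outside the interval, so a few "biased" judges are expected from chance alone.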

Thanks for asking around, Rob. I do plan on adjusting for team strength, but I'll probably first look at resolution bias.
