Scoring fairness - Robotically Manipulated Payload Challenge

A minimum of five judges will be assigned to score each eligible submission. Those judges will offer both scores and comments against each of four distinct criteria. Each criterion will be scored on a 1-5 scale. Those scores will combine to produce a total normalized score.

The most straightforward way to ensure that everyone is held to the same standards would be to have the same judges score every submission; unfortunately, due to the number of submissions that we may receive, that is not possible.

Since the same judges will not score every submission, the question of fairness needs to be explained carefully. One judge may take a more critical view and score every assigned submission only between 1.0 and 2.0, for example; meanwhile, another judge may be more generous and score every submission between 4.0 and 5.0.

For illustrative purposes, let’s look at the scores from two hypothetical judges:

[Figure: scores from two hypothetical judges]

We have a way to address this issue, so that no matter which judges are assigned to you, each submission will be treated fairly. To do this, we use a widely accepted technique called Z-score normalization, which relies on two summary measures of each judge’s score distribution: the mean and the standard deviation.

The mean is simply a judge’s average score. We calculate it by adding up every score that judge gave and dividing by the number of scores. Formally:
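In standard notation, the mean can be written as follows. The symbols are chosen here for illustration and are an assumption: judge j gave n_j scores, and x_{ij} is the i-th of them.

```latex
\bar{x}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_{ij}
```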

The standard deviation measures the “spread” of a judge’s scores — whether they cluster tightly around their average or vary widely. As an example, imagine that two judges both give the same average score, but one gives many ones and fives, while the other gives mostly threes. It wouldn’t be fair if we didn’t account for that difference. Formally:
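A common way to write the standard deviation is shown below. The notation is an assumption (judge j gave n_j scores x_{ij} with mean \bar{x}_j), and so is the choice of the population form, which divides by n_j; the sample form divides by n_j - 1 instead.

```latex
\sigma_j = \sqrt{\frac{1}{n_j} \sum_{i=1}^{n_j} \left( x_{ij} - \bar{x}_j \right)^2}
```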

The first step in normalization is to convert each score into a Z-score: the number of standard deviations that score sits above or below the judge’s own average. Formally:
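Using illustrative notation (an assumption, not taken from the challenge rules), where x_{ij} is the i-th score from judge j, \bar{x}_j is that judge's mean, and \sigma_j is that judge's standard deviation, the Z-score is:

```latex
z_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j}
```

For example, a score one standard deviation above a judge's own average has a Z-score of 1, whether that judge tends to score low or high.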

A Z-score measures every score against its own judge’s pattern, so scores from a harsh judge and scores from a generous judge become directly comparable.

To make these standardized scores easier to read, we then map them back onto the original 1-5 scale, using the average and standard deviation of all scores across all judges. Formally:
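One way to write this mapping is shown below. The symbols are illustrative assumptions: z_{ij} is the Z-score of the i-th score from judge j, while \bar{x} and \sigma are the mean and standard deviation of all scores from all judges pooled together.

```latex
x'_{ij} = \bar{x} + z_{ij} \, \sigma
```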

The result is a normalized score on the same 1-5 scale applicants were judged on, but adjusted so that every applicant is treated fairly regardless of which judges happened to be assigned to their submission.

If we apply this process to the same two judges in the example above, their final normalized scores appear more similar, because they are now aligned with the typical distribution of scores across the whole judging population.
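The whole procedure can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the challenge's actual scoring code: the function name `normalize`, the judge labels, and the sample scores are hypothetical, it uses the population standard deviation, and it treats a judge whose scores are all identical as having Z-scores of 0.

```python
from statistics import mean, pstdev


def normalize(scores_by_judge):
    """Z-score each judge's scores, then rescale with the pooled mean and SD."""
    # Pool every score from every judge to get the overall mean and spread.
    all_scores = [s for scores in scores_by_judge.values() for s in scores]
    pooled_mean = mean(all_scores)
    pooled_sd = pstdev(all_scores)  # assumption: population standard deviation

    normalized = {}
    for judge, scores in scores_by_judge.items():
        judge_mean = mean(scores)
        judge_sd = pstdev(scores)
        if judge_sd == 0:
            # Assumption: a judge with no spread gets z = 0 for every score.
            z_scores = [0.0 for _ in scores]
        else:
            z_scores = [(s - judge_mean) / judge_sd for s in scores]
        # Map each Z-score back onto the original scale.
        normalized[judge] = [pooled_mean + z * pooled_sd for z in z_scores]
    return normalized


# Hypothetical scores: one harsh judge and one generous judge.
raw = {
    "harsh_judge": [1.0, 1.5, 2.0],
    "generous_judge": [4.0, 4.5, 5.0],
}
result = normalize(raw)
for judge, scores in result.items():
    print(judge, [round(s, 2) for s in scores])
```

With these hypothetical inputs, both judges end up with the same normalized scores (roughly 1.10, 3.00, and 4.90), because each judge's lowest, middle, and highest scores sit at the same position relative to that judge's own average.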

We are pleased to answer any questions you have about the scoring process. You can find answers to common scoring questions in our FAQ, or join our virtual information session on June 18.