Judge¶
In ProgressGym, we evaluate an Examinee (an algorithm) through tasks. In the code, those tasks are instantiated as classes called Judges, which handle queries from Examinees and update the human proxy models. This page documents the base Judge class as well as the three Judges we’ve implemented.
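As a quick orientation, the typical call pattern looks roughly like the sketch below. This is a hedged sketch rather than verbatim ProgressGym code: the constructor keyword arguments are assumed to be optional here, and my_examinee stands for an instance of your own Examinee class.

```python
from challenges.follow import FollowJudge      # one of the three built-in Judges

my_examinee = ...                              # an instance of your ExamineeBase subclass (construction omitted)
judge = FollowJudge()                          # constructor kwargs (e.g. dataset/model paths) assumed optional here

metrics = judge.test(my_examinee)              # runs the full timestep loop and returns a metrics dict
score = FollowJudge.interpret_result(metrics)  # single higher-is-better score used by the leaderboard
```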
- class benchmark.framework.JudgeBase(**kwargs)¶
JudgeBase is the base class for all judges. A judge is the benchmarking algorithm that evaluates the performance of an examinee. Each judge class corresponds to a challenge.
- __init__(**kwargs)¶
- abstract eval_snapshot(examinee: ExamineeBase) None ¶
Evaluate the examinee’s performance at the current snapshot. This method is called by the judge at every iteration. The base class implementation only does logging. It is recommended to run your own evaluation first and then call the base class implementation to perform logging.
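For illustration, the sketches on this page build up a single hypothetical subclass, MyJudge, one method at a time; none of it is verbatim ProgressGym code, and helpers such as score_snapshot are assumptions. The eval_snapshot part follows the recommended pattern: do your own evaluation, then call the base class for logging.

```python
from typing import Any, Dict

from benchmark.framework import JudgeBase, ExamineeBase  # assumed: ExamineeBase lives alongside JudgeBase


class MyJudge(JudgeBase):
    """Hypothetical judge; the remaining abstract methods are sketched under the other entries below."""

    def eval_snapshot(self, examinee: ExamineeBase) -> None:
        # Run our own per-snapshot evaluation first (score_snapshot and
        # self.scores are hypothetical; self.scores is set up in reset() below)...
        self.scores.append(self.score_snapshot(examinee))
        # ...then let the base class implementation handle logging.
        super().eval_snapshot(examinee)
```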
- abstract interpret_result(result: Dict[str, Any]) float ¶
Given a benchmark result dictionary, calculate a single score that represents the overall performance of the examinee. HIGHER scores must mean better performance. This method is called by the leaderboard to rank the examinees.
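Continuing the MyJudge sketch, a typical implementation simply extracts one key from the result dictionary; the key name "mean_score" is hypothetical.

```python
    @classmethod
    def interpret_result(cls, result: Dict[str, Any]) -> float:
        # Higher must mean better; "mean_score" is whatever key our own
        # produce_final_result() writes (hypothetical name).
        return float(result["mean_score"])
```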
- abstract produce_final_result() Dict[str, Any] ¶
Return the final result of the evaluation. This method is called at the end of test() to get the final evaluation metrics. A reference score may be calculated here, but it will not be used by the leaderboard, in order to prevent manual score manipulation. The base class implementation only performs logging. You should override this method in your subclass to fill in the evaluation metrics, while preserving the logging-related dict fields returned by the base class implementation.
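Continuing the MyJudge sketch: start from the base-class dictionary so its logging-related fields survive, then add your own metrics (the field name is hypothetical).

```python
    def produce_final_result(self) -> Dict[str, Any]:
        result = super().produce_final_result()  # keep the logging-related fields
        result["mean_score"] = sum(self.scores) / max(len(self.scores), 1)
        return result
```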
- abstract query_from_examinee(prompt: str | Data | List[Dict], model: Model | None = None) str | Data | List[Dict] ¶
This method is called by the examinee to query the judge, which the judge will answer according to human preferences at the current timestep. The examinee will use this information to learn about the latest human preference, and update its language model accordingly. The base class implementation answers the prompt by directly querying self.current_model. You could either call the base class implementation in your subclass’s implementation (possibly supplying a different model), or override it if necessary.
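Continuing the MyJudge sketch: delegate to the base implementation, optionally supplying a different model (self.proxy_model is a hypothetical attribute standing in for whichever model should answer).

```python
    def query_from_examinee(self, prompt, model=None):
        # Answer with a specific human-proxy model instead of self.current_model.
        return super().query_from_examinee(prompt, model=model or self.proxy_model)
```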
- abstract reset(**kwargs) None ¶
Reset the internal state of the judge to start a new evaluation. This method is called before each test. The base class implementation resets the internal state of the judge to the initial state. Normally, your subclass implementation should first call the base class implementation (though this is optional) and then add any additional reset logic that you need.
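Continuing the MyJudge sketch: restore the base state first, then reinitialize the subclass-specific state.

```python
    def reset(self, **kwargs) -> None:
        super().reset(**kwargs)  # restore the base judge state
        self.scores = []         # subclass-specific state used by eval_snapshot above (hypothetical)
```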
- test(examinee: ExamineeBase, **kwargs) Dict[str, Any] ¶
Run the examinee and evaluate its performance. This method is called by the user to evaluate the examinee, and it returns a dictionary of evaluation metrics mapping metric names to their values. The method operates by moving the examinee and the judge through a series of timesteps, with the judge evaluating the examinee at every timestep; every iteration of examinee_iter corresponds to the passing of one timestep. Normally, you should not override this method in your subclass. Instead, implement the reset, eval_snapshot, tick, query_from_examinee, and produce_final_result methods (see the MyJudge sketches under those entries).
- abstract tick() None ¶
Move the internal state of the judge to the next timestep. This method is called by the judge at every iteration. The base class implementation moves the judge to the next timestep by incrementing current_timestep by 1 (or more if necessary). Normally, your subclass implementation should first call the base class implementation (though this is optional) and then add any additional logic that you need.
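Continuing (and completing) the MyJudge sketch: advance the timestep via the base class, then perform any per-timestep bookkeeping. test() itself is inherited unchanged.

```python
    def tick(self) -> None:
        super().tick()  # increments current_timestep
        # Any additional per-timestep bookkeeping would go here.
```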
- class challenges.coevolve.CoevolveJudge(**kwargs)¶
CoevolveJudge is a judge that evaluates the performance of an examinee by comparing the actual and simulated history of human preferences. It takes into account the bidirectional influence between the AI system (i.e. the examinee) and human preferences, and evaluates how strongly the AI system influences the evolutionary trajectory of human preferences. An illustrative sketch of this trajectory comparison follows the class reference below.
- eval_snapshot(examinee: ExamineeBase) None ¶
Move the simulated history one timestep forward, and evaluate the distance between the actual and simulated history.
- classmethod interpret_result(result: Dict[str, Any]) float ¶
Given a benchmark result dictionary, calculate a single score that represents the overall performance of the examinee. HIGHER scores must mean better performance. This method is called by the leaderboard to rank the examinees.
- produce_final_result() Dict[str, Any] ¶
Produce the final result of the evaluation from the supplementary_data dict. A reference score may be calculated here, but it will not be used by the leaderboard, in order to prevent manual score manipulation.
- query_from_examinee(prompt: str | Data | List[Dict]) str | Data | List[Dict] ¶
Answer a query from the Examinee, using the simulated model.
- reset(**kwargs) None ¶
In addition to the base class implementation, reset the simulated model and the queries (used for moving the simulated history forward).
- tick() None ¶
Let the actual history move one timestep forward, without changing the simulated history.
- update_human_proxy(influencer: Model, epochs: float, comment: str) None ¶
Update the human proxy model with the influence of the influencer model. The implementation here supports resuming from a previous checkpoint in case of failure, which leads to its relatively complex structure. CoevolveJudge requires special treatment to handle checkpoints since it contains more than one model, and the base class only implements checkpoint loading for the judge model. A plain implementation without checkpoint loading would be much simpler.
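For intuition only, the comparison described above can be pictured as a per-timestep distance between the actual and simulated trajectories of human preferences. Everything in the sketch below (the embed_preferences helper, the trajectory lists, the averaging) is a hypothetical illustration, not CoevolveJudge’s actual metric.

```python
import numpy as np


def trajectory_distance(actual_history, simulated_history, embed_preferences) -> float:
    """Average per-timestep distance between two preference trajectories (illustration only).

    embed_preferences is a hypothetical callable mapping a human-proxy model
    (or its preference data) to a vector; CoevolveJudge's real metric may differ.
    """
    distances = [
        np.linalg.norm(embed_preferences(actual) - embed_preferences(simulated))
        for actual, simulated in zip(actual_history, simulated_history)
    ]
    return float(np.mean(distances))
```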
- class challenges.follow.FollowJudge(**kwargs)¶
FollowJudge is a judge that evaluates the updating performance of an examinee. It measures whether the examinee can update itself to match human preferences when it is given new human preference data at a new timepoint. A purely illustrative sketch of this comparison follows the class reference below.
- eval_snapshot(examinee: ExamineeBase, ground_truth_model: Model | None = None) None ¶
Evaluate the examinee’s performance at the current snapshot.
- classmethod interpret_result(result: Dict[str, Any]) float ¶
Given a benchmark result dictionary, calculate a single score that represents the overall performance of the examinee. HIGHER scores must mean better performance. This method is called by the leaderboard to rank the examinees.
- produce_final_result() Dict[str, Any] ¶
Produce the final result of the evaluation from the supplementary_data dict. A reference score may be calculated here, but it will not be used by the leaderboard, in order to prevent manual score manipulation.
- query_from_examinee(prompt: str | Data | List[Dict]) str | Data | List[Dict] ¶
This method is called by the examinee to query the judge, which the judge will answer according to human preferences at the current timestep. The examinee will use this information to learn about the latest human preference, and update its language model accordingly. The base class implementation answers the prompt by directly querying self.current_model. You could either call the base class implementation in your subclass’s implementation (possibly supplying a different model), or override it if necessary.
- reset(**kwargs) None ¶
Reset the internal state of the judge to start a new evaluation. This method is called before each test. The base class implementation resets the internal state of the judge to the initial state. Normally, your subclass implementation should first call the base class implementation (though this is optional) and then add any additional reset logic that you need.
- tick() None ¶
Move the internal state of the judge to the next timestep. This method is called by the judge at every iteration. The base class implementation moves the judge to the next timestep by incrementing current_timestep by 1 (or more if necessary). Normally, your subclass implementation should first call the base class implementation (though this is optional) and then add any additional logic that you need.
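Conceptually, the per-snapshot evaluation asks how closely the examinee’s updated model matches the ground-truth human proxy for the new timestep. The sketch below is purely illustrative: generate and preference_similarity are hypothetical helpers, and FollowJudge’s actual comparison may be computed quite differently.

```python
def follow_score(examinee_model, ground_truth_model, eval_prompts,
                 generate, preference_similarity) -> float:
    """Hypothetical higher-is-better 'follow' score (illustration only)."""
    scores = [
        preference_similarity(
            generate(examinee_model, prompt),      # the examinee's updated behaviour
            generate(ground_truth_model, prompt),  # the human proxy at the new timestep
        )
        for prompt in eval_prompts
    ]
    return sum(scores) / max(len(scores), 1)
```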
- class challenges.predict.PredictJudge(**kwargs)¶
- eval_snapshot(examinee: ExamineeBase) None ¶
Evaluate the examinee’s performance at the current snapshot.
- query_from_examinee(prompt: str | Data | List[Dict]) str | Data | List[Dict] ¶
This method is called by the examinee to query the judge, which the judge will answer according to human preferences at the current timestep. The examinee will use this information to learn about the latest human preference, and update its language model accordingly. The base class implementation answers the prompt by directly querying self.current_model. You could either call the base class implementation in your subclass’s implementation (possibly supplying a different model), or override it if necessary.
- reset(**kwargs) None ¶
Reset the internal state of the judge to start a new evaluation. This method is called before each test. The base class implementation resets the internal state of the judge to the initial state. Normally, your subclass implementation should first call the base class implementation (though this is optional) and then add any additional reset logic that you need.
- tick() None ¶
Move the internal state of the judge to the next timestep. This method is called by the judge at every iteration. The base class implementation moves the judge to the next timestep by incrementing current_timestep by 1 (or more if necessary). Normally, your subclass implementation should first call the base class implementation (though this is optional) and then add any additional logic that you need.