Quickstart: Implementing your own experiment

General Instructions

To run the benchmark for your own alignment algorithm and assess its temporal alignment abilities, you have to implement your own algorithm as a subclass of benchmark.framework.ExamineeBase.

You can either implement it in a new .py file in ./algorithm and use the command-line interface to run it, or directly implement a subclass of benchmark.framework.ExamineeBase in your own script and use the Python API to run it. Refer to Quickstart: Running the benchmark for guidance on how to run the benchmark on the examinee you implemented.

Implementation Instructions

Feeling daunted? Don’t worry. Read through the following descriptions of benchmark.framework.ExamineeBase, and you’ll understand the overall workings of the examinee class in our framework.

class benchmark.framework.ExamineeBase(**kwargs)

ExamineeBase is the base class for all examinees. An examinee is an alignment algorithm (in combination with the language model the algorithm operates on) that is benchmarked by a judge. You are free to implement the benchmarked examinee in any way you like, as long as it follows the ExamineeBase interface. In most cases, you need to re-implement most or all of the methods in your subclass. The base implementations are only provided as examples.

__init__(**kwargs)
abstract get_current_model() → Model

Return the current model that the examinee is using at this timestep. The base class implementation returns the current_model attribute. You should not need to override this method in your subclass unless the model is not stored in the current_model attribute.

abstract query_from_judge(prompt: str | Data | List[Dict], model: Model | None = None) → str | Data | List[Dict]

This method is called by the judge to query the examinee for a response to a prompt. In most cases, you only need to call the base class implementation in your subclass’s implementation.

abstract reset(**kwargs) → None

Initialize the examinee, including endowing it with a language model. When examinee_model_size is not specified, the model will be initialized as a copy of the Judge’s initial model. In that case, the examinee will be able to start from the same initial state as the judge. Normally, you should implement this method in your subclass to initialize the examinee as needed, after calling the base class implementation for basic setup.

abstract run(judge: JudgeBase) → Iterable

This method is called by the judge to start the examinee. It returns an iterable that the judge iterates over to run the examinee. Every iteration corresponds to the passing of one timestep; in this way, the examinee controls when it pauses and resumes. At every iteration:

1. The examinee learns about the latest human preference by calling the judge’s query_from_examinee method.
2. After updating its language model, it yields control back to the judge, allowing the judge to evaluate the updated model (by calling query_from_judge).

Unless you are sure that you need to completely override this method, you should not do so. Instead, call the base class implementation at the beginning of your subclass’s implementation.
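Putting these pieces together, here is a minimal sketch of what a subclass might look like. Only the ExamineeBase interface (reset, run, query_from_judge, get_current_model, the current_model attribute, and the judge’s query_from_examinee) comes from the framework; the update helper and the example prompt are hypothetical placeholders for your own algorithm.

from typing import Iterable

from benchmark.framework import ExamineeBase, JudgeBase


def my_update_model(model, feedback):
    """Hypothetical placeholder: apply your alignment algorithm to the model."""
    return model


class MyExaminee(ExamineeBase):
    def reset(self, **kwargs) -> None:
        super().reset(**kwargs)      # basic setup, including the initial current_model
        self.feedback_history = []   # any extra state your algorithm needs

    def run(self, judge: JudgeBase) -> Iterable:
        super().run(judge)  # per the guidance above; base-class bookkeeping
        while True:         # the judge stops iterating when the evaluation ends
            # 1. Learn about the latest human preferences from the judge.
            feedback = judge.query_from_examinee("What do people value at this point in time?")
            self.feedback_history.append(feedback)

            # 2. Update the language model (hypothetical helper).
            self.current_model = my_update_model(self.current_model, feedback)

            # 3. Yield control back to the judge so it can evaluate this timestep.
            yield

    def query_from_judge(self, prompt, model=None):
        # The base implementation already answers using the current model.
        return super().query_from_judge(prompt, model)

    def get_current_model(self):
        return self.current_model  # same behavior as the base implementation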

We have implemented the four baseline examinees described in ProgressGym: The Progress Alignment Framework in ./algorithm. You can refer to those implementations when writing your own.

Once your examinee is implemented, use our pre-implemented dummy challenge to verify and debug it.

$ python run_benchmark.py \
  --algorithms=your_algorithm \
  --challenges=Dummy \
  --output_filename=dummy_debugging_run

After that, use run_benchmark.py as described in ProgressGym: The Progress Alignment Framework to start your experiment.

The benchmark.framework.JudgeBase class is the base class for all Judges (i.e., benchmark tasks). You can refer to the documentation of this class to better understand the interplay between Judge and Examinee. Of course, you can implement your own Judge class as well. Our implementations of the three major tasks are in ./challenges, for your reference.

class benchmark.framework.JudgeBase(**kwargs)

JudgeBase is the base class for all judges. A judge is the benchmarking algorithm that evaluates the performance of an examinee. Each judge class corresponds to a challenge.

__init__(**kwargs)
abstract eval_snapshot(examinee: ExamineeBase) → None

Evaluate the examinee’s performance at the current snapshot. This method is called by the judge at every iteration. The base class implementation only does logging. It is recommended to perform your own evaluation first, then call the base class implementation to handle logging.

abstract interpret_result(result: Dict[str, Any]) → float

Given a benchmark result dictionary, calculate a single score that represents the overall performance of the examinee. HIGHER scores must mean better performance. This method is called by the leaderboard to rank the examinees.

abstract produce_final_result() → Dict[str, Any]

Return the final result of the evaluation. This method is called at the end of test() to get the final evaluation metrics. A reference score may be calculated here, but it will not be used by the leaderboard, in order to prevent manual score manipulation. The base class implementation only performs logging. You should override this method in your subclass to fill in the evaluation metrics, while preserving the logging-related dict fields returned by the base class implementation.

abstract query_from_examinee(prompt: str | Data | List[Dict], model: Model | None = None) → str | Data | List[Dict]

This method is called by the examinee to query the judge, which the judge will answer according to human preferences at the current timestep. The examinee will use this information to learn about the latest human preference, and update its language model accordingly. The base class implementation answers the prompt by directly querying self.current_model. You could either call the base class implementation in your subclass’s implementation (possibly supplying a different model), or override it if necessary.

abstract reset(**kwargs) → None

Reset the internal state of the judge to start a new evaluation. This method is called before each test. The base class implementation resets the internal state of the judge to the initial state. Normally, your subclass’s implementation calls the base class implementation (though this is optional), and then adds any additional reset logic that you need.

test(examinee: ExamineeBase, **kwargs) → Dict[str, Any]

Run the examinee and evaluate its performance. This method is called by the user to evaluate the examinee. The method returns a dictionary of evaluation metrics, where the keys are the names of the metrics and the values are the corresponding metric values. The method operates by moving the examinee and the judge through a series of timesteps, with the judge evaluating the examinee at every timestep. Every iteration of the iterable returned by the examinee’s run method corresponds to the passing of a timestep. Normally, you should not override this method in your subclass. Instead, you should implement the reset, eval_snapshot, tick, query_from_examinee, and produce_final_result methods in your subclass.

abstract tick() → None

Move the internal state of the judge to the next timestep. This method is called by the judge at every iteration. The base class implementation moves the judge to the next timestep by incrementing current_timestep by 1 (or more if necessary). Normally, your subclass’s implementation calls the base class implementation (though this is optional), and then adds any additional logic that you need.
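For reference, a similarly minimal sketch of a custom judge might look like the following. Only the JudgeBase interface (reset, eval_snapshot, tick, produce_final_result, interpret_result) and the examinee’s query_from_judge come from the framework; the evaluation prompt and the scoring helper are hypothetical placeholders for your own task.

from typing import Any, Dict

from benchmark.framework import ExamineeBase, JudgeBase


def my_score(response) -> float:
    """Hypothetical placeholder: score a response against current human preferences."""
    return 0.0


class MyJudge(JudgeBase):
    def reset(self, **kwargs) -> None:
        super().reset(**kwargs)     # restore the judge's initial state
        self.snapshot_scores = []   # per-timestep scores collected by eval_snapshot

    def eval_snapshot(self, examinee: ExamineeBase) -> None:
        # Query the examinee's current model and score the response (hypothetical scoring).
        response = examinee.query_from_judge("How should people of other faiths be treated?")
        self.snapshot_scores.append(my_score(response))
        super().eval_snapshot(examinee)  # base implementation handles logging

    def tick(self) -> None:
        super().tick()  # advances current_timestep

    def produce_final_result(self) -> Dict[str, Any]:
        result = super().produce_final_result()  # keep the logging-related fields
        result["average_score"] = sum(self.snapshot_scores) / max(len(self.snapshot_scores), 1)
        return result

    def interpret_result(self, result: Dict[str, Any]) -> float:
        # HIGHER must mean better; this is what the leaderboard uses for ranking.
        return result["average_score"]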

Similarly, use the dummy examinee to verify your implementation.

$ python run_benchmark.py \
  --algorithms=Dummy \
  --challenges=your_task \
  --output_filename=dummy_debugging_run
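If you prefer the Python API mentioned earlier over the command line, a run can also be driven directly through JudgeBase.test. The import paths and class names in the sketch below are hypothetical; adapt them to wherever you placed your implementations.

# Hypothetical import paths; replace them with your own modules.
from challenges.my_task import MyJudge
from algorithm.my_algorithm import MyExaminee

judge = MyJudge()
examinee = MyExaminee()
# Depending on your setup, you may need to call judge.reset(...) and
# examinee.reset(...) with the appropriate keyword arguments first.

result = judge.test(examinee)            # runs the full timestep loop
score = judge.interpret_result(result)   # single scalar; higher is better
print(result)
print(score)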