Quickstart: Running the benchmark¶
Requirements¶
ProgressGym is developed and tested with Python 3.8 on Linux. Please also make sure you have conda installed on your machine.
Installation¶
You can install ProgressGym by cloning the codebase:
$ git clone https://github.com/PKU-Alignment/ProgressGym.git
$ cd ./ProgressGym
$ . ./initialize.sh
or
$ git clone https://github.com/PKU-Alignment/ProgressGym.git
$ cd ./ProgressGym
$ . ./initialize.sh --multinode
…if you intend to run ProgressGym in a multi-node environment.
This will create two new conda environments, pa38 and pa38-lf, where the former is for the normal operations of ProgressGym, and the latter is used by ProgressGym when calling LLaMA-Factory internally.
When calling entry-point scripts of ProgressGym (e.g., run_benchmark.py and build_dataset.py), please make sure you are in the pa38 environment; ProgressGym will automatically handle any further environment switching.
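For example, you can activate the environment and check the benchmark script's options before launching a full run:

$ conda activate pa38
$ python run_benchmark.py --help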
Alternatively, once the ProgressGym PyPI package is published, you will be able to install ProgressGym via pip:
$ pip install progressgym
The PyPI package is not yet available; it is expected to be released soon.
Basic Usage¶
Building the Dataset and Training Models
Note that you do not need to run this part yourself unless you want to replicate our findings: the HistText dataset has already been used to train the HistLlama model sequence and is available on Hugging Face, and the benchmark script will automatically download the HistLlama models when needed.
Running
$ python build_dataset.py
will automatically download the relevant files and compile them into the desired dataset. Specifically, it will:
1. Download and decompress raw data files from multiple sources, placing them in ./dataset/raw_downloads
2. Organize the downloaded data into a unified format, saving the resulting historical text dataset as a series of files ./dataset/dataset_textseq/[century_num]/[year_num].json
3. Train a series of historical language models, saved under ./dataset/dataset_modelseq/[century_num], using the historical text dataset.
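After the build completes, you may want to sanity-check the generated text dataset. The snippet below is a minimal sketch, not part of ProgressGym itself: it assumes the default output paths listed above and makes no assumption about the JSON schema, only peeking at the top-level structure of one per-year file.

import glob
import json

# Locate the per-year files produced by build_dataset.py (default output path from the steps above).
files = sorted(glob.glob("./dataset/dataset_textseq/*/*.json"))
print(f"found {len(files)} yearly files")

if files:
    with open(files[0], encoding="utf-8") as f:
        data = json.load(f)
    # The JSON schema is not documented here, so only inspect the top-level structure.
    print(type(data).__name__)
    print(list(data)[:3] if isinstance(data, dict) else data[:1])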
Running the Benchmarks on the Command Line
Running
$ python run_benchmark.py
--algorithms=ALGO1[,ALGO2[,ALGO3[,...]]]
--challenges=SUBTASK1[,SUBTASK2[,SUBTASK3[,...]]]
--output_filename=OUTPUT_FILENAME
[--output_dir=OUTPUT_DIR] (defaults to ./output/benchmark_results)
[-h | --help]
[...] (additional arguments will be supplied to the algorithms when they are instantiated; only string values are supported)
will produce the benchmark results. Omitting --challenges will default to testing all challenges. Results are saved to OUTPUT_DIR/OUTPUT_FILENAME.json, with OUTPUT_DIR defaulting to ./output/benchmark_results. For each test, results are saved as a dictionary; see the score item for a comprehensive score of each examinee's performance on one subtask.
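For example, you can load a result file directly. This is a minimal sketch, assuming the default output directory and a hypothetical OUTPUT_FILENAME of my_run; only the presence of the score item described above is assumed.

import json

# Load benchmark results saved by run_benchmark.py.
# "./output/benchmark_results" is the default output directory; "my_run" is a hypothetical
# value of --output_filename.
with open("./output/benchmark_results/my_run.json", encoding="utf-8") as f:
    results = json.load(f)

# Results are saved as a dictionary; the "score" item (see above) summarizes an
# examinee's performance on one subtask. Print the top-level entries to explore.
for key, value in results.items():
    print(key, type(value).__name__)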
See the comments at the beginning of run_benchmark.py for detailed usage of the command-line arguments:
$ python run_benchmark.py
--algorithms=ALGO1[,ALGO2[,ALGO3[,...]]]
--challenges=SUBTASK1[,SUBTASK2[,SUBTASK3[,...]]]
--output_filename=OUTPUT_FILENAME
[--output_dir=OUTPUT_DIR] (defaults to ./output/benchmark_results)
[--judge_model_size=JUDGE_MODEL_SIZE] (70B/8B, defaults to 8B)
[--examinee_model_size=EXAMINEE_MODEL_SIZE] (70B/8B, defaults to 8B)
[-h | --help]
[...] #additional arguments will be supplied to the algorithms and the challenges when they are instantiated; only string values are supported
Running the Benchmarks with the Python API
You can also run the benchmark with the Python API, using the run_benchmark function.
- run_benchmark.run_benchmark(ExamineeClass: Type[ExamineeBase], JudgeClass: Type[JudgeBase], **kwargs) → Dict[str, Any] ¶
Run a single benchmarking test on a single examinee and a single judge, and return the results.
- Parameters:
ExamineeClass (Type[ExamineeBase]) – Necessary, examinee class object representing the algorithm to be evaluated. Can be any subclass of ExamineeBase, including user-implemented ones. Note that this is the class object, not an instance of the class.
JudgeClass (Type[JudgeBase]) – Necessary, judge class object representing the challenge to be evaluated. Can be any subclass of JudgeBase, including user-implemented ones. Note that this is the class object, not an instance of the class.
kwargs (Dict[str, str]) – Optional, additional arguments to be passed to the examinee and the judge. Pass the same str-typed arguments as you would in the command line.
- Returns:
A dictionary containing the results of the benchmarking test. The dictionary is in the exact same format as the results of command-line benchmarking.
- Return type:
Dict[str, Any]
- Example:
from progressgym import run_benchmark, CoevolveJudge, LifelongDPOExaminee # if using PyPI package
results = run_benchmark(LifelongDPOExaminee, CoevolveJudge)
It’s worth noting that the Python API is more flexible than the command-line interface, as you can implement your own Judge and Examinee classes (inheriting from JudgeBase and ExamineeBase respectively) and pass them to the run_benchmark function.
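For instance, a custom examinee might look like the sketch below. This is illustrative only: MyExaminee is a hypothetical class, the actual methods to override are defined by ExamineeBase (see its documentation), and importing ExamineeBase from the top-level progressgym package is assumed here.

from progressgym import run_benchmark, CoevolveJudge, ExamineeBase # if using PyPI package

class MyExaminee(ExamineeBase):
    """A hypothetical custom algorithm; override the methods defined by ExamineeBase here."""
    pass  # placeholder only; a real examinee implements ExamineeBase's interface

# Pass the class object itself (not an instance), exactly as with the built-in examinees.
results = run_benchmark(MyExaminee, CoevolveJudge)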