ProgressGym: The Progress Alignment Framework

Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce progress alignment as a technical solution to mitigate this imminent risk.

To empower research in progress alignment, we introduce ProgressGym, an experimental framework that codifies real-world progress alignment challenges into concrete benchmarks. Specifically, we introduce three core challenges: tracking evolving values (PG-Follow), preemptively anticipating moral progress (PG-Predict), and regulating the feedback loop between human and AI value shifts (PG-Coevolve).

In addition to reference alignment algorithms (Lifelong RLHF, Lifelong DPO, Extrapolative RLHF, and Extrapolative DPO), we provide APIs for user-implemented examinees and judges, so that you can evaluate your own algorithms on your own custom progress alignment tasks; a minimal sketch of this pattern follows below.
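
To make the examinee/judge split concrete, here is a minimal, self-contained sketch of that pattern. The class and method names used here (ExamineeBase, JudgeBase, update, respond, test) are hypothetical placeholders chosen for illustration, not ProgressGym's actual interfaces; consult the framework's abstract base classes for the real API.

```python
# A minimal sketch of the examinee/judge pattern.
# NOTE: ExamineeBase, JudgeBase, and all method names below are hypothetical
# illustrations, not ProgressGym's actual API.
from abc import ABC, abstractmethod


class ExamineeBase(ABC):
    """An examinee wraps the alignment algorithm (and model) under evaluation."""

    @abstractmethod
    def update(self, preference_data: list[dict]) -> None:
        """Consume one timestep of time-indexed human preference data."""

    @abstractmethod
    def respond(self, query: str) -> str:
        """Answer a value-laden query with the current model."""


class JudgeBase(ABC):
    """A judge defines a progress alignment task and scores an examinee."""

    @abstractmethod
    def test(self, examinee: ExamineeBase) -> float:
        """Run the task against the examinee and return a scalar score."""


class EchoExaminee(ExamineeBase):
    """Trivial examinee: stores the preference data it sees, echoes queries."""

    def __init__(self) -> None:
        self.history: list[dict] = []

    def update(self, preference_data: list[dict]) -> None:
        self.history.extend(preference_data)

    def respond(self, query: str) -> str:
        return f"(placeholder response to: {query})"


class FollowStyleJudge(JudgeBase):
    """Toy stand-in for a PG-Follow-style task: feed per-timestep data, then query."""

    def test(self, examinee: ExamineeBase) -> float:
        # One fabricated timestep of preference data, purely for illustration.
        examinee.update([{"prompt": "p1", "chosen": "a", "rejected": "b"}])
        answer = examinee.respond("Is practice X morally acceptable?")
        # A real judge would compute a value-alignment metric here.
        return float(bool(answer))


if __name__ == "__main__":
    print(f"score = {FollowStyleJudge().test(EchoExaminee())}")
```

With this structure, evaluating a new algorithm amounts to subclassing the examinee interface, while defining a new progress alignment task amounts to subclassing the judge interface.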

Check out the "Quickstart: Implementing your own experiment" section for basic usage.

Contents