‘Pretraining Language Models with Human Preferences’

“Language models (LMs) are pretrained to imitate internet text, including content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more. Here, we explore alternative objectives for pretraining LMs in a way that also guides them to generate text aligned with human preferences. We benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the trade-off between alignment and capabilities of pretrained LMs.”

Read the paper and see the full list of authors on arXiv.
