Mattral/Improving-LLM-Models-with-RLHF-PPO-DPO
A modular, production-grade framework for Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO).
GitHub repository with 23 stars and 5 forks.
Language: Python
Topics: dpo, large-language-models, llm-alignment, machine-learning, policy-optimization, ppo, reinforcement-learning-from-human-feedback, reward-modeling, rlhf