aerosta/rewardhackwatch
Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).
GitHub repository with 12 stars and 0 forks.
Language: Python
Topics: agent-safety, ai-safety, distilbert, fastapi, huggingface, llm-agents, machine-learning, misalignment, nlp, pytorch