hyeonsangjeon/gdpval-realworks
Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline and live React dashboard for the GDPVal Gold Subset (220 tasks across 11 industries).
GitHub repository with 14 stars and 2 forks.
Language: Python
Topics: ai-evaluation, anthropic, azure-openai, benchmark-automation, code-interpreter, dashboard, evaluation, github-actions, gpt-5, huggingface