hyeonsangjeon/gdpval-realworks

Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline + live React dashboard for GDPVal Gold Subset (220 tasks across 11 industries).

GitHub repository with 14 stars and 2 forks.

Language: Python

Topics: ai-evaluation, anthropic, azure-openai, benchmark-automation, code-interpreter, dashboard, evaluation, github-actions, gpt-5, huggingface

24h trend summary

Trending score 0.09, activity score 0.38, stars gained +0, forks gained +0.

Latest metric snapshot

2026-06-13: 14 stars and 2 forks.

Similar repositories

  1. huggingface/cadgenbench

    A benchmark for AI-driven CAD generation and editing

    GitHub repository with 62 stars and 5 forks.

    Trending score: 0.94; stars gained: +8; forks gained: +2.

    Language: Python

    Topics: 3d, ai-evaluation, benchmark, cad, huggingface, image-to-3d

  2. Neal006/memorylens

    The open-source benchmark for LLM memory decay. Measure how Naive, RAG, Chunked RAG, Cascading, and SummaryMemory degrade over 100 conversation turns. Ebbinghaus forgetting curves, 5-provider LLM eval, multi-seed CI. No API key needed.

    GitHub repository with 7 stars and 2 forks.

    Trending score: 0.15; stars gained: +0; forks gained: +0.

    Language: Python

    Topics: ai-evaluation, benchmarking, chatbot, conversation-memory, ebbinghaus, evaluation

  3. hyeonsangjeon/gdpval-realworks

    Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline + live React dashboard for GDPVal Gold Subset (220 tasks across 11 industries).

    GitHub repository with 14 stars and 2 forks.

    Trending score: 0.09; stars gained: +0; forks gained: +0.

    Language: Python

    Topics: ai-evaluation, anthropic, azure-openai, benchmark-automation, code-interpreter, dashboard

  4. NoesisVision/nasde-toolkit

    CLI for benchmarks & evals of AI coding agents - on tasks you already understand, using your Claude / Codex / Gemini individual subscriptions or API keys.

    GitHub repository with 10 stars and 0 forks.

    Trending score: 0.04; stars gained: +0; forks gained: +0.

    Language: Python

    Topics: agent-benchmark, agent-evaluation, ai-coding-agents, ai-evaluation, claude-code, claude-skills

  5. vishwanathakuthota/openvals

    Open-source AI model evaluation and benchmarking framework for LLMs (OpenAI, Ollama, Claude, Gemini)

    GitHub repository with 7 stars and 6 forks.

    Trending score: 0.03; stars gained: -1; forks gained: +0.

    Language: Python

    Topics: ai-agents, ai-evaluation, ai-evaluation-framework, ai-quality, ai-reliability, ai-safety

Trending in Python

  1. mvanhorn/last30days-skill

    AI agent skill that researches any topic across Reddit, X, YouTube, HN, Polymarket, and the web - then synthesizes a grounded summary

    GitHub repository with 40,614 stars and 3,271 forks.

    Trending score: 5.82; stars gained: +1,312; forks gained: +87.

    Language: Python

  2. chopratejas/headroom

    Compress tool outputs, logs, files, and RAG chunks before they reach the LLM. 60-95% fewer tokens, same answers. Library, proxy, MCP server.

    GitHub repository with 24,986 stars and 1,636 forks.

    Trending score: 5.73; stars gained: +2,844; forks gained: +202.

    Language: Python

    Topics: agent, ai, anthropic, claude-code, compression, context-engineering

  3. pewdiepie-archdaemon/odysseus

    Self-hosted AI workspace.

    GitHub repository with 69,622 stars and 8,812 forks.

    Trending score: 5.70; stars gained: +951; forks gained: +165.

    Language: Python

  4. NousResearch/hermes-agent

    The agent that grows with you

    GitHub repository with 192,291 stars and 33,524 forks.

    Trending score: 5.48; stars gained: +990; forks gained: +282.

    Language: Python

    Topics: ai, ai-agent, ai-agents, anthropic, chatgpt, claude

  5. safishamsi/graphify

    AI coding assistant skill (Claude Code, Codex, OpenCode, Cursor, Gemini CLI, and more). Turn any folder of code, SQL schemas, R scripts, shell scripts, docs, papers, images, or videos into a queryable knowledge graph. App code + database schema + infrastructure in one graph.

    GitHub repository with 66,406 stars and 6,716 forks.

    Trending score: 5.25; stars gained: +1,314; forks gained: +109.

    Language: Python

    Topics: claude-code, graphrag, knowledge-graph, codex, openclaw, skills

  6. hugohe3/ppt-master

    AI generates a real, editable PowerPoint from any document - native shapes & animations, speaker notes voiced as audio narration, and the option to follow your own .pptx template, not slide images · by Hugo He

    GitHub repository with 27,093 stars and 2,416 forks.

    Trending score: 5.10; stars gained: +903; forks gained: +61.

    Language: Python

    Topics: ai-agent, powerpoint, pptx, presentation, office, slides

Trending topic: ai-evaluation

  1. huggingface/cadgenbench

    A benchmark for AI-driven CAD generation and editing

    GitHub repository with 62 stars and 5 forks.

    Trending score: 0.94; stars gained: +8; forks gained: +2.

    Language: Python

    Topics: 3d, ai-evaluation, benchmark, cad, huggingface, image-to-3d

  2. Neal006/memorylens

    The open-source benchmark for LLM memory decay. Measure how Naive, RAG, Chunked RAG, Cascading, and SummaryMemory degrade over 100 conversation turns. Ebbinghaus forgetting curves, 5-provider LLM eval, multi-seed CI. No API key needed.

    GitHub repository with 7 stars and 2 forks.

    Trending score: 0.15; stars gained: +0; forks gained: +0.

    Language: Python

    Topics: ai-evaluation, benchmarking, chatbot, conversation-memory, ebbinghaus, evaluation

  3. hyeonsangjeon/gdpval-realworks

    Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline + live React dashboard for GDPVal Gold Subset (220 tasks across 11 industries).

    GitHub repository with 14 stars and 2 forks.

    Trending score: 0.09; stars gained: +0; forks gained: +0.

    Language: Python

    Topics: ai-evaluation, anthropic, azure-openai, benchmark-automation, code-interpreter, dashboard

  4. NoesisVision/nasde-toolkit

    CLI for benchmarks & evals of AI coding agents - on tasks you already understand, using your Claude / Codex / Gemini individual subscriptions or API keys.

    GitHub repository with 10 stars and 0 forks.

    Trending score: 0.04; stars gained: +0; forks gained: +0.

    Language: Python

    Topics: agent-benchmark, agent-evaluation, ai-coding-agents, ai-evaluation, claude-code, claude-skills

  5. vishwanathakuthota/openvals

    Open-source AI model evaluation and benchmarking framework for LLMs (OpenAI, Ollama, Claude, Gemini)

    GitHub repository with 7 stars and 6 forks.

    Trending score: 0.03; stars gained: -1; forks gained: +0.

    Language: Python

    Topics: ai-agents, ai-evaluation, ai-evaluation-framework, ai-quality, ai-reliability, ai-safety