confident-ai/deepeval

The LLM Evaluation Framework

GitHub repository with 16,133 stars and 1,526 forks.

Language: Python

Topics: evaluation-framework, evaluation-metrics, llm-evaluation, llm-evaluation-framework, llm-evaluation-metrics, python

Latest metric snapshot

2026-06-13: 16,133 stars and 1,526 forks.

Similar repositories

  1. huggingface/lighteval

    Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

    GitHub repository with 2,444 stars and 486 forks.

    Trending score: 0.66; stars gained: +3; forks gained: +2.

    Language: Python

    Topics: evaluation, evaluation-framework, evaluation-metrics, huggingface

  2. SJTU-DENG-Lab/Diffulex

    Flexible and Pluggable Serving Engine for Diffusion LLMs

    GitHub repository with 71 stars and 15 forks.

    Trending score: 0.04; stars gained: +0; forks gained: +0.

    Language: Python

    Topics: dllm, evaluation-framework, inference-engine

Trending in Python

  1. mvanhorn/last30days-skill

    AI agent skill that researches any topic across Reddit, X, YouTube, HN, Polymarket, and the web - then synthesizes a grounded summary

    GitHub repository with 40,614 stars and 3,271 forks.

    Trending score: 5.82; stars gained: +1,312; forks gained: +87.

    Language: Python

  2. chopratejas/headroom

    Compress tool outputs, logs, files, and RAG chunks before they reach the LLM. 60-95% fewer tokens, same answers. Library, proxy, MCP server.

    GitHub repository with 25,425 stars and 1,676 forks.

    Trending score: 5.73; stars gained: +2,844; forks gained: +202.

    Language: Python

    Topics: agent, ai, anthropic, compression, context-engineering, context-window

  3. pewdiepie-archdaemon/odysseus

    Self-hosted AI workspace.

    GitHub repository with 69,712 stars and 8,823 forks.

    Trending score: 5.70; stars gained: +951; forks gained: +165.

    Language: Python

  4. NousResearch/hermes-agent

    The agent that grows with you

    GitHub repository with 192,363 stars and 33,537 forks.

    Trending score: 5.48; stars gained: +990; forks gained: +282.

    Language: Python

    Topics: ai, ai-agent, ai-agents, anthropic, chatgpt, claude

  5. safishamsi/graphify

    AI coding assistant skill (Claude Code, Codex, OpenCode, Cursor, Gemini CLI, and more). Turn any folder of code, SQL schemas, R scripts, shell scripts, docs, papers, images, or videos into a queryable knowledge graph. App code + database schema + infrastructure in one graph.

    GitHub repository with 66,467 stars and 6,719 forks.

    Trending score: 5.25; stars gained: +1,314; forks gained: +109.

    Language: Python

    Topics: antigravity, claude-code, codex, gemini, graphrag, knowledge-graph

  6. hugohe3/ppt-master

    AI generates a real, editable PowerPoint from any document — native shapes & animations, speaker notes voiced as audio narration, and the option to follow your own .pptx template, not slide images · by Hugo He

    GitHub repository with 27,112 stars and 2,418 forks.

    Trending score: 5.10; stars gained: +903; forks gained: +61.

    Language: Python

    Topics: ai-agent, aippt, office, powerpoint, powerpoint-generation, ppt

Trending topic: evaluation-framework

  1. dokimos-dev/dokimos

    LLM and agent evaluation for Java & Kotlin. Runs in JUnit and CI. Spring AI, LangChain4j, Koog, Embabel, and any LLM client.

    GitHub repository with 39 stars and 3 forks.

    Trending score: 0.80; stars gained: +1; forks gained: +0.

    Language: Java

    Topics: agent-evaluation, evaluation-framework, evaluation-metrics, java, langchain4j, llm-evaluation

  2. lizhiyao/oh-my-knowledge

    Evaluation framework for LLM knowledge inputs — prompts, RAG corpora, skills, agent workflows. Fix the model, vary the artifact. Built-in statistical rigor: bootstrap CI, Krippendorff α, length-debias, saturation curves.

    GitHub repository with 11 stars and 2 forks.

    Trending score: 0.77; stars gained: +2; forks gained: +0.

    Language: TypeScript

    Topics: ai, benchmark, claude, knowledge-engineering, llm, prompt-engineering

  3. huggingface/lighteval

    Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

    GitHub repository with 2,444 stars and 486 forks.

    Trending score: 0.66; stars gained: +3; forks gained: +2.

    Language: Python

    Topics: evaluation, evaluation-framework, evaluation-metrics, huggingface

  4. Tuesdaythe13th/benchmarkswithoutborders

    BENCHMARKS WITHOUT BORDERS

    GitHub repository with 6 stars and 0 forks.

    Trending score: 0.04; stars gained: +0; forks gained: +0.

    Language: Jupyter Notebook

    Topics: agentic, ai, benchmarking, benchmarks, evals, evaluation-framework

  5. SJTU-DENG-Lab/Diffulex

    Flexible and Pluggable Serving Engine for Diffusion LLMs

    GitHub repository with 71 stars and 15 forks.

    Trending score: 0.04; stars gained: +0; forks gained: +0.

    Language: Python

    Topics: dllm, evaluation-framework, inference-engine