huggingface/lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
GitHub repository with 2,451 stars and 489 forks.
Language: Python
Topics: evaluation, evaluation-framework, evaluation-metrics, huggingface
Trending score: 0.78; freshness score: 0.18; stars gained: +4; forks gained: +3.
2026-06-15: 2,451 stars and 489 forks.
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
GitHub repository with 19,653 stars and 1,522 forks.
Trending score: 3.44; stars gained: +58; forks gained: +4.
Language: Python
Topics: evaluation, hacktoberfest, hacktoberfest2025, langchain, llama-index, llm
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
GitHub repository with 26,532 stars and 5,845 forks.
Trending score: 2.94; stars gained: +20; forks gained: +9.
Language: Python
Topics: agentops, agents, ai, ai-governance, apache-spark, evaluation
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.
GitHub repository with 4,220 stars and 722 forks.
Trending score: 2.06; stars gained: +6; forks gained: +3.
Language: Python
Topics: chatgpt, claude, clip, computer-vision, evaluation, gemini
Evaluate and improve models and agents using environments
GitHub repository with 982 stars and 179 forks.
Trending score: 1.69; stars gained: +4; forks gained: +1.
Language: Python
Topics: reinforcement-learning, reinforcement-learning-environments, rl-environment, rl-training, gym, agents
Domain-neutral AI red-team framework for recursive self-improvement governance
GitHub repository with 44 stars and 2 forks.
Trending score: 1.39; stars gained: +2; forks gained: +1.
Language: Python
Topics: agentic-ai, ai-safety, evaluation, governance, python, recursive-self-improvement
Open-source benchmark for browser AI agents on daily tasks.
GitHub repository with 393 stars and 22 forks.
Trending score: 1.37; stars gained: +2; forks gained: +0.
Language: Python
Topics: ai-agents, benchmark, browser-automation, browser-use, dataset, evaluation
Compress tool outputs, logs, files, and RAG chunks before they reach the LLM. 60-95% fewer tokens, same answers. Library, proxy, MCP server.
GitHub repository with 27,902 stars and 1,891 forks.
Trending score: 6.49; stars gained: +2,776; forks gained: +250.
Language: Python
Topics: agent, ai, anthropic, claude-code, compression, context-engineering
Generate high-definition short videos with one click using large AI models.
GitHub repository with 88,031 stars and 12,625 forks.
Trending score: 6.02; stars gained: +1,097; forks gained: +218.
Language: Python
Topics: ai, automation, chatgpt, moviepy, python, shortvideo
Self-hosted AI workspace.
GitHub repository with 71,392 stars and 9,098 forks.
Trending score: 5.98; stars gained: +834; forks gained: +140.
Language: Python
The agent that grows with you
GitHub repository with 194,052 stars and 33,977 forks.
Trending score: 5.92; stars gained: +753; forks gained: +209.
Language: Python
Topics: ai, ai-agent, ai-agents, anthropic, chatgpt, claude
Security scanner for AI agent skills. Detect vulnerabilities, malicious patterns, and security risks.
GitHub repository with 5,654 stars and 427 forks.
Trending score: 5.61; stars gained: +874; forks gained: +76.
Language: Python
Learn it. Build it. Ship it for others.
GitHub repository with 32,676 stars and 5,366 forks.
Trending score: 5.59; stars gained: +762; forks gained: +135.
Language: Python
Topics: agents, ai, ai-agents, ai-engineering, computer-vision, course
🪢 Open source AI engineering platform: LLM evals, observability, metrics, prompt management, playground, datasets. Integrates with OpenTelemetry, LangChain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
GitHub repository with 29,108 stars and 3,015 forks.
Trending score: 3.83; stars gained: +75; forks gained: +8.
Language: TypeScript
Topics: analytics, autogen, evaluation, langchain, large-language-models, llama-index
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.
GitHub repository with 22,220 stars and 1,979 forks.
Trending score: 3.36; stars gained: +39; forks gained: +11.
Language: TypeScript
Topics: ci, ci-cd, cicd, evaluation, evaluation-framework, llm
Open-source LLM knowledge platform: turn raw documents into a queryable RAG, an autonomous reasoning agent, and a self-maintaining Wiki.
GitHub repository with 16,292 stars and 2,106 forks.
Trending score: 3.22; stars gained: +32; forks gained: +8.
Language: Go
Topics: agent, agentic, ai, chatbot, embeddings, evaluation
A Go framework for building production agent systems with graph workflows, tools, memory, A2A, AG-UI, MCP, evaluation, and observability.
GitHub repository with 1,352 stars and 165 forks.
Trending score: 2.59; stars gained: +13; forks gained: +2.
Language: Go
Topics: a2a, a2a-protocol, ag-ui, agent, agent-framework, ai