SalesforceAIResearch/SCUBA
SCUBA: Salesforce Computer Use Benchmark
GitHub repository with 9 stars and 1 fork.
Language: Python
Topics: benchmark, browser-use-agent, computer-use-agent, crm
2026-06-04: 9 stars and 1 fork.
🔍 The hardest search benchmark in the wild — vague, multi-turn, proactive. 200 long-horizon tasks with persona-driven progressive disclosure, scored by verifiable schema-free knowledge-graph evaluation. No vibes, just triplet F1.
GitHub repository with 774 stars and 2 forks.
Trending score: 1.88; stars gained: +100; forks gained: +0.
Language: Python
Topics: agentic-ai, benchmark, llm, proactive-agent, search, search-agent
BEHAVIOR-1K: a platform for accelerating Embodied AI research. Join our Discord for support: https://discord.gg/bccR5vGFEx
GitHub repository with 1,502 stars and 205 forks.
Trending score: 1.17; stars gained: +14; forks gained: +0.
Language: Python
Topics: benchmark, embodied-ai, robotics, simulation
Benchmark for the quality of LLM-generated test suites — anti-fragility, rigor, mocking discipline, reuse — scored against human baselines, not coverage. Python, JS/TS, Go.
GitHub repository with 18 stars and 1 fork.
Trending score: 1.04; stars gained: +9; forks gained: +0.
Language: Python
Topics: benchmark, claude, code-quality, llm, mocha, pytest
τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
GitHub repository with 1,273 stars and 328 forks.
Trending score: 0.92; stars gained: +7; forks gained: +1.
Language: Python
Topics: benchmark, llm, ai, language-model-agent, conversational-agents
MTEB: Massive Text Embedding Benchmark
GitHub repository with 3,288 stars and 617 forks.
Trending score: 0.69; stars gained: +4; forks gained: +1.
Language: Python
Topics: benchmark, clustering, information-retrieval, sentence-transformers, sts, text-embedding
Data for the MTEB leaderboard
GitHub repository with 58 stars and 159 forks.
Trending score: 0.57; stars gained: +2; forks gained: +0.
Language: Python
Topics: benchmark, benchmarkresults, clustering, information-retrieval, retrieval, semantic-search
The agent that grows with you
GitHub repository with 180,881 stars and 31,021 forks.
Trending score: 5.79; stars gained: +1,360; forks gained: +322.
Language: Python
Topics: ai, ai-agent, ai-agents, anthropic, chatgpt, claude
SkillOpt is a text-space optimizer that trains reusable natural-language skills for frozen LLM agents through trajectory-driven edits, validation-gated updates, and deployable best_skill.md artifacts.
GitHub repository with 4,892 stars and 487 forks.
Trending score: 4.55; stars gained: +340; forks gained: +27.
Language: Python
Topics: agent-skills, self-evolving-agents
754 structured cybersecurity skills for AI agents · Mapped to 5 frameworks: MITRE ATT&CK, NIST CSF 2.0, MITRE ATLAS, D3FEND & NIST AI RMF · agentskills.io standard · Works with Claude Code, GitHub Copilot, Codex CLI, Cursor, Gemini CLI & 20+ platforms · 26 security domains · Apache 2.0
GitHub repository with 13,233 stars and 1,551 forks.
Trending score: 4.53; stars gained: +301; forks gained: +38.
Language: Python
Topics: ai-agents, claude-code, cybersecurity, incident-response, mitre-attack, penetration-testing
Turn any technical book PDF into a Claude Code skill — ready to study, reference, and use while you work.
GitHub repository with 4,166 stars and 523 forks.
Trending score: 4.43; stars gained: +415; forks gained: +37.
Language: Python
Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflows - all through natural language commands.
GitHub repository with 130,154 stars and 21,149 forks.
Trending score: 4.42; stars gained: +277; forks gained: +38.
Language: Python
Stealth Chromium that passes every bot detection test. Drop-in Playwright replacement with source-level fingerprint patches. 30/30 tests passed.
GitHub repository with 23,119 stars and 1,836 forks.
Trending score: 4.24; stars gained: +250; forks gained: +17.
Language: Python
Topics: anti-detect, bot-detection, browser-automation, chromium, cloudflare, fingerprint
MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research · Browser-hosted Android Simulator · Verifiable Evaluation · Scalable Online RL Training
GitHub repository with 499 stars and 79 forks.
Trending score: 3.43; stars gained: +84; forks gained: +10.
Language: TypeScript
Topics: agent, agents, ai, android, automation, benchmark
Minecraft-style voxel benchmark for comparing AI models (Arena + Sandbox)
GitHub repository with 244 stars and 17 forks.
Trending score: 1.23; stars gained: +19; forks gained: +1.
Language: TypeScript
Topics: ai, benchmark, llm, nlp, voxel, comparison-benchmarks