SalesforceAIResearch/SCUBA

SCUBA: Salesforce Computer Use Benchmark

GitHub repository with 9 stars and 1 fork.

Language: Python

Topics: benchmark, browser-use-agent, computer-use-agent, crm

Latest metric snapshot

2026-06-04: 9 stars and 1 fork.

Similar repositories

  1. VibeBench/VibeSearchBench

    🔍 The hardest search benchmark in the wild — vague, multi-turn, proactive. 200 long-horizon tasks with persona-driven progressive disclosure, scored by verifiable schema-free knowledge-graph evaluation. No vibes, just triplet F1.

    GitHub repository with 774 stars and 2 forks.

    Trending score: 1.88; stars gained: +100; forks gained: +0.

    Language: Python

    Topics: agentic-ai, benchmark, llm, proactive-agent, search, search-agent

  2. StanfordVL/BEHAVIOR-1K

    BEHAVIOR-1K: a platform for accelerating Embodied AI research. Join our Discord for support: https://discord.gg/bccR5vGFEx

    GitHub repository with 1,502 stars and 205 forks.

    Trending score: 1.17; stars gained: +14; forks gained: +0.

    Language: Python

    Topics: benchmark, embodied-ai, robotics, simulation

  3. rollinsio/beyond-test-coverage

    Benchmark for the quality of LLM-generated test suites — anti-fragility, rigor, mocking discipline, reuse — scored against human baselines, not coverage. Python, JS/TS, Go.

    GitHub repository with 18 stars and 1 fork.

    Trending score: 1.04; stars gained: +9; forks gained: +0.

    Language: Python

    Topics: benchmark, claude, code-quality, llm, mocha, pytest

  4. sierra-research/tau2-bench

    τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    GitHub repository with 1,273 stars and 328 forks.

    Trending score: 0.92; stars gained: +7; forks gained: +1.

    Language: Python

    Topics: benchmark, llm, ai, language-model-agent, conversational-agents

  5. embeddings-benchmark/mteb

    MTEB: Massive Text Embedding Benchmark

    GitHub repository with 3,288 stars and 617 forks.

    Trending score: 0.69; stars gained: +4; forks gained: +1.

    Language: Python

    Topics: benchmark, clustering, information-retrieval, sentence-transformers, sts, text-embedding

  6. embeddings-benchmark/results

    Data for the MTEB leaderboard

    GitHub repository with 58 stars and 159 forks.

    Trending score: 0.57; stars gained: +2; forks gained: +0.

    Language: Python

    Topics: benchmark, benchmarkresults, clustering, information-retrieval, retrieval, semantic-search

Trending in Python

  1. NousResearch/hermes-agent

    The agent that grows with you

    GitHub repository with 180,881 stars and 31,021 forks.

    Trending score: 5.79; stars gained: +1,360; forks gained: +322.

    Language: Python

    Topics: ai, ai-agent, ai-agents, anthropic, chatgpt, claude

  2. microsoft/SkillOpt

    SkillOpt is a text-space optimizer that trains reusable natural-language skills for frozen LLM agents through trajectory-driven edits, validation-gated updates, and deployable best_skill.md artifacts.

    GitHub repository with 4,892 stars and 487 forks.

    Trending score: 4.55; stars gained: +340; forks gained: +27.

    Language: Python

    Topics: agent-skills, self-evolving-agents

  3. mukul975/Anthropic-Cybersecurity-Skills

    754 structured cybersecurity skills for AI agents · Mapped to 5 frameworks: MITRE ATT&CK, NIST CSF 2.0, MITRE ATLAS, D3FEND & NIST AI RMF · agentskills.io standard · Works with Claude Code, GitHub Copilot, Codex CLI, Cursor, Gemini CLI & 20+ platforms · 26 security domains · Apache 2.0

    GitHub repository with 13,233 stars and 1,551 forks.

    Trending score: 4.53; stars gained: +301; forks gained: +38.

    Language: Python

    Topics: ai-agents, claude-code, cybersecurity, incident-response, mitre-attack, penetration-testing

  4. virgiliojr94/book-to-skill

    Turn any technical book PDF into a Claude Code skill — ready to study, reference, and use while you work.

    GitHub repository with 4,166 stars and 523 forks.

    Trending score: 4.43; stars gained: +415; forks gained: +37.

    Language: Python

  5. anthropics/claude-code

    Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflows - all through natural language commands.

    GitHub repository with 130,154 stars and 21,149 forks.

    Trending score: 4.42; stars gained: +277; forks gained: +38.

    Language: Python

  6. CloakHQ/CloakBrowser

    Stealth Chromium that passes every bot detection test. Drop-in Playwright replacement with source-level fingerprint patches. 30/30 tests passed.

    GitHub repository with 23,119 stars and 1,836 forks.

    Trending score: 4.24; stars gained: +250; forks gained: +17.

    Language: Python

    Topics: anti-detect, bot-detection, browser-automation, chromium, cloudflare, fingerprint

Trending topic: benchmark

  1. Purewhiter/mobilegym

    MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research · Browser-hosted Android Simulator · Verifiable Evaluation · Scalable Online RL Training

    GitHub repository with 499 stars and 79 forks.

    Trending score: 3.43; stars gained: +84; forks gained: +10.

    Language: TypeScript

    Topics: agent, agents, ai, android, automation, benchmark

  2. VibeBench/VibeSearchBench

    🔍 The hardest search benchmark in the wild — vague, multi-turn, proactive. 200 long-horizon tasks with persona-driven progressive disclosure, scored by verifiable schema-free knowledge-graph evaluation. No vibes, just triplet F1.

    GitHub repository with 774 stars and 2 forks.

    Trending score: 1.88; stars gained: +100; forks gained: +0.

    Language: Python

    Topics: agentic-ai, benchmark, llm, proactive-agent, search, search-agent

  3. Ammaar-Alam/minebench

    Minecraft-style voxel benchmark for comparing AI models (Arena + Sandbox)

    GitHub repository with 244 stars and 17 forks.

    Trending score: 1.23; stars gained: +19; forks gained: +1.

    Language: TypeScript

    Topics: ai, benchmark, llm, nlp, voxel, comparison-benchmarks

  4. StanfordVL/BEHAVIOR-1K

    BEHAVIOR-1K: a platform for accelerating Embodied AI research. Join our Discord for support: https://discord.gg/bccR5vGFEx

    GitHub repository with 1,502 stars and 205 forks.

    Trending score: 1.17; stars gained: +14; forks gained: +0.

    Language: Python

    Topics: benchmark, embodied-ai, robotics, simulation

  5. rollinsio/beyond-test-coverage

    Benchmark for the quality of LLM-generated test suites — anti-fragility, rigor, mocking discipline, reuse — scored against human baselines, not coverage. Python, JS/TS, Go.

    GitHub repository with 18 stars and 1 fork.

    Trending score: 1.04; stars gained: +9; forks gained: +0.

    Language: Python

    Topics: benchmark, claude, code-quality, llm, mocha, pytest

  6. sierra-research/tau2-bench

    τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    GitHub repository with 1,273 stars and 328 forks.

    Trending score: 0.92; stars gained: +7; forks gained: +1.

    Language: Python

    Topics: benchmark, llm, ai, language-model-agent, conversational-agents