sierra-research/tau2-bench

τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

GitHub repository with 1,335 stars and 343 forks.

Language: Python

Topics: benchmark, llm, ai, language-model-agent, conversational-agents

Latest metric snapshot

2026-06-11: 1,335 stars and 343 forks.

Similar repositories

  1. VibeBench/VibeSearchBench

    🔍 The hardest search benchmark in the wild — vague, multi-turn, proactive. 200 long-horizon tasks with persona-driven progressive disclosure, scored by verifiable schema-free knowledge-graph evaluation. No vibes, just triplet F1.

    GitHub repository with 1,008 stars and 63 forks.

    Trending score: 3.33; stars gained: +50; forks gained: +37.

    Language: Python

    Topics: agentic-ai, benchmark, llm, proactive-agent, search, search-agent

  2. hogeheer499-commits/strix-halo-guide

    Complete guide to running large language models locally on AMD Strix Halo / Ryzen AI MAX+ 395 with Radeon 8060S (gfx1151) and 96GB/128GB unified memory. Covers BIOS config, Ubuntu/kernel setup, Ollama, llama.cpp Vulkan/RADV, ROCm/HIP, vLLM, and 70B/120B GGUF evidence.

    GitHub repository with 142 stars and 6 forks.

    Trending score: 1.97; stars gained: +9; forks gained: +0.

    Language: Python

    Topics: amd, benchmark, gfx1151, llama-cpp, llm, local-llm

  3. TIGER-AI-Lab/ClawBench

    Open-source benchmark for browser AI agents on daily tasks.

    GitHub repository with 393 stars and 22 forks.

    Trending score: 1.37; stars gained: +2; forks gained: +0.

    Language: Python

    Topics: ai-agents, benchmark, browser-automation, browser-use, dataset, evaluation

  4. cxcscmu/SkillLearnBench

    SkillLearnBench is the first benchmark for evaluating continual learning methods that automatically generate agent skills.

    GitHub repository with 47 stars and 3 forks.

    Trending score: 1.26; stars gained: +12; forks gained: +0.

    Language: Python

    Topics: agent-skills, automatic, benchmark, continual-learning, skill-generation

  5. StanfordVL/BEHAVIOR-1K

    BEHAVIOR-1K: a platform for accelerating Embodied AI research. Join our Discord for support: https://discord.gg/bccR5vGFEx

    GitHub repository with 1,518 stars and 205 forks.

    Trending score: 1.26; stars gained: +3; forks gained: +0.

    Language: Python

    Topics: robotics, simulation, benchmark, embodied-ai

  6. xlang-ai/FineVLA

    Scalable annotation pipeline for action-aligned, fine-grained instructions for vision-language-action models.

    GitHub repository with 19 stars and 0 forks.

    Trending score: 1.24; stars gained: +10; forks gained: +0.

    Language: Python

    Topics: benchmark, caption, caption-generation, fine-grained, roboitcs, vision-language-action-model

Trending in Python

  1. chopratejas/headroom

    Compress tool outputs, logs, files, and RAG chunks before they reach the LLM. 60-95% fewer tokens, same answers. Library, proxy, MCP server.

    GitHub repository with 27,902 stars and 1,891 forks.

    Trending score: 6.49; stars gained: +2,776; forks gained: +250.

    Language: Python

    Topics: agent, ai, anthropic, claude-code, compression, context-engineering

  2. harry0703/MoneyPrinterTurbo

    Generate high-definition short videos with one click using large AI models.

    GitHub repository with 88,031 stars and 12,625 forks.

    Trending score: 6.02; stars gained: +1,097; forks gained: +218.

    Language: Python

    Topics: ai, automation, chatgpt, moviepy, python, shortvideo

  3. pewdiepie-archdaemon/odysseus

    Self-hosted AI workspace.

    GitHub repository with 71,308 stars and 9,089 forks.

    Trending score: 5.98; stars gained: +834; forks gained: +140.

    Language: Python

  4. NousResearch/hermes-agent

    The agent that grows with you

    GitHub repository with 193,883 stars and 33,934 forks.

    Trending score: 5.92; stars gained: +753; forks gained: +209.

    Language: Python

    Topics: ai, ai-agent, ai-agents, anthropic, chatgpt, claude

  5. NVIDIA/SkillSpector

    Security scanner for AI agent skills. Detect vulnerabilities, malicious patterns, and security risks.

    GitHub repository with 5,654 stars and 427 forks.

    Trending score: 5.61; stars gained: +874; forks gained: +76.

    Language: Python

  6. rohitg00/ai-engineering-from-scratch

    Learn it. Build it. Ship it for others.

    GitHub repository with 32,527 stars and 5,342 forks.

    Trending score: 5.59; stars gained: +762; forks gained: +135.

    Language: Python

    Topics: agents, ai, ai-agents, ai-engineering, computer-vision, course

Trending topic: benchmark

  1. VibeBench/VibeSearchBench

    🔍 The hardest search benchmark in the wild — vague, multi-turn, proactive. 200 long-horizon tasks with persona-driven progressive disclosure, scored by verifiable schema-free knowledge-graph evaluation. No vibes, just triplet F1.

    GitHub repository with 1,008 stars and 63 forks.

    Trending score: 3.33; stars gained: +50; forks gained: +37.

    Language: Python

    Topics: agentic-ai, benchmark, llm, proactive-agent, search, search-agent

  2. Purewhiter/mobilegym

    MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research · Browser-hosted Android Simulator · Verifiable Evaluation · Scalable Online RL Training

    GitHub repository with 618 stars and 98 forks.

    Trending score: 2.50; stars gained: +12; forks gained: +1.

    Language: TypeScript

    Topics: benchmark, mobile-agent, reinforcement-learning, vlm, agents, gym

  3. hogeheer499-commits/strix-halo-guide

    Complete guide to running large language models locally on AMD Strix Halo / Ryzen AI MAX+ 395 with Radeon 8060S (gfx1151) and 96GB/128GB unified memory. Covers BIOS config, Ubuntu/kernel setup, Ollama, llama.cpp Vulkan/RADV, ROCm/HIP, vLLM, and 70B/120B GGUF evidence.

    GitHub repository with 142 stars and 6 forks.

    Trending score: 1.97; stars gained: +9; forks gained: +0.

    Language: Python

    Topics: amd, benchmark, gfx1151, llama-cpp, llm, local-llm

  4. SemiAnalysisAI/InferenceX

    Open Source Continuous Inference Benchmark Research Platform Kimi K2.6, DeepSeekv4, GLM5 - GB200 NVL72 vs MI355X vs B200 vs GB300 NVL72 & soon™ TPUv6e/v7/Trainium2/3

    GitHub repository with 1,098 stars and 194 forks.

    Trending score: 1.96; stars gained: +4; forks gained: +1.

    Language: Shell

    Topics: ai, amd, benchmark, cuda, gb200, llm

  5. TIGER-AI-Lab/ClawBench

    Open-source benchmark for browser AI agents on daily tasks.

    GitHub repository with 393 stars and 22 forks.

    Trending score: 1.37; stars gained: +2; forks gained: +0.

    Language: Python

    Topics: ai-agents, benchmark, browser-automation, browser-use, dataset, evaluation

  6. Shiyao-Huang/awesome-agent-evolution

    Open survey and evidence map for AI agent evolution, self-evolving agents, memory, skills, harnesses, benchmarks, and agent-swarm systems.

    GitHub repository with 228 stars and 10 forks.

    Trending score: 1.27; stars gained: +1; forks gained: +1.

    Language: JavaScript

    Topics: agent-evolution, agent-framework, agent-swarm, ai-agent, ai-agents, ai-research