VibeBench/VibeSearchBench
🔍 The hardest search benchmark in the wild — vague, multi-turn, proactive. 200 long-horizon tasks with persona-driven progressive disclosure, scored by verifiable schema-free knowledge-graph evaluation. No vibes, just triplet F1.
GitHub repository with 774 stars and 2 forks.
Language: Python
Topics: agentic-ai, benchmark, llm, proactive-agent, search, search-agent
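
The description says tasks are scored by triplet F1 over a schema-free knowledge graph. As a rough illustration of that metric only, here is a minimal sketch. It assumes that facts are (subject, relation, object) string triplets and that two triplets match when they are equal after lowercasing and whitespace normalization. The names `normalize` and `triplet_f1` are hypothetical and are not taken from the repository. The benchmark's actual matching procedure is not described here and may be more lenient than exact matching.

```python
from typing import Iterable, Tuple

Triplet = Tuple[str, str, str]


def normalize(triplet: Triplet) -> Triplet:
    # Assumption: matching is case-insensitive and ignores extra whitespace.
    subject, relation, obj = (" ".join(part.lower().split()) for part in triplet)
    return (subject, relation, obj)


def triplet_f1(
    predicted: Iterable[Triplet], gold: Iterable[Triplet]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over sets of normalized triplets."""
    pred_set = {normalize(t) for t in predicted}
    gold_set = {normalize(t) for t in gold}

    # With an empty side, precision or recall is undefined; report zeros.
    if not pred_set or not gold_set:
        return 0.0, 0.0, 0.0

    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0, 0.0, 0.0

    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    gold = [
        ("Ada Lovelace", "born in", "London"),
        ("Ada Lovelace", "worked with", "Charles Babbage"),
        ("Charles Babbage", "designed", "Analytical Engine"),
    ]
    predicted = [
        ("ada lovelace", "born in", "london"),
        ("Ada Lovelace", "born in", "Paris"),
    ]
    print(triplet_f1(predicted, gold))
```

In this toy case one of the two predicted triplets matches one of the three gold triplets, so precision is 1/2 and recall is 1/3. Using sets means duplicate predictions are counted once, which is a design choice of this sketch and not a documented property of the benchmark.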