https://arxiv.org/pdf/2308.03688.pdf
https://github.com/THUDM/AgentBench
AgentBench is a comprehensive benchmark for evaluating LLMs as agents in interactive, practical environments. It currently comprises eight distinct environments. Five of these are newly created for this work: Operating System, Database, Knowledge Graph, Digital Card Game, and Lateral Thinking Puzzles. The remaining three - House-Holding, Web Shopping, and Web Browsing - are adapted from previously published datasets.
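To make "interactive environments" concrete, the sketch below shows the kind of multi-turn agent-environment loop such benchmarks score: the model emits an action, the environment returns feedback, and the exchange repeats until the task is solved or a turn budget runs out. This is a minimal illustration only; the `Env` class and `query_llm` helper are hypothetical stand-ins, not AgentBench's actual API.

```python
# Hypothetical sketch of an interactive evaluation loop; not AgentBench's
# real interface. A toy Env stands in for the OS/DB/game environments.
from dataclasses import dataclass


@dataclass
class Env:
    """Toy environment: the agent must eventually output 'submit 42'."""
    done: bool = False

    def observe(self) -> str:
        return "Task: find the answer, then reply 'submit <answer>'."

    def step(self, action: str) -> str:
        if action.strip() == "submit 42":
            self.done = True
            return "Correct."
        return "Observation: the answer is 42."


def query_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call."""
    return "submit 42" if "42" in prompt else "look around"


def run_episode(env: Env, max_turns: int = 10) -> bool:
    prompt = env.observe()
    for _ in range(max_turns):
        action = query_llm(prompt)             # model proposes an action
        feedback = env.step(action)            # environment responds
        if env.done:
            return True                        # solved within the turn budget
        prompt += f"\n> {action}\n{feedback}"  # accumulate multi-turn context
    return False


if __name__ == "__main__":
    print("solved:", run_episode(Env()))
```

Real environments swap the toy `Env` for an OS shell, a SQL interface, a knowledge-graph query engine, and so on, but the scored loop has the same shape.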
GPT-4 and other top commercial models handled a broad range of agent tasks well. Open-source models, despite competitive results on conventional benchmarks, performed markedly worse on AgentBench's multi-turn interactive tasks, revealing a substantial gap between the two groups.
The toolkit, datasets, and environments are publicly accessible.