AI Model Benchmarks

MagicArena

MagicArena is a competitive benchmarking platform that evaluates and ranks visual generative AI models through side-by-side human comparisons.

AGI-Eval

AGI-Eval is a dedicated evaluation community for benchmarking the capabilities and performance of AI large language models.

H2O EvalGPT

An evaluation system by H2O.ai that uses Elo ratings to benchmark and rank Large Language Models (LLMs).
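For readers unfamiliar with Elo ratings, the following is a minimal sketch of the textbook Elo update applied to a single head-to-head model comparison. It is not H2O EvalGPT's actual implementation: the K-factor of 32, the 400-point scale, the starting rating of 1000, and the model names are all illustrative assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    k (assumed 32 here) controls how far a single result moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical models that both start at an assumed rating of 1000.
ratings = {"model_x": 1000.0, "model_y": 1000.0}

# model_x wins one comparison against model_y.
ratings["model_x"], ratings["model_y"] = update_elo(
    ratings["model_x"], ratings["model_y"], score_a=1.0
)
print(ratings)
```

Because the winner gains exactly what the loser gives up, the total rating across both models stays constant. Repeating the update over many comparisons yields a ranking.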

LLMEval3

A professional evaluation benchmark from Fudan University’s NLP Lab designed to measure the performance and reliability of large language models.

MMBench

MMBench is a comprehensive evaluation framework designed to measure the capabilities of multimodal large language models across a wide array of visual and textual tasks.

HELM

A standardized, holistic evaluation framework from Stanford University designed to measure the performance and safety of large language models.

OpenCompass

OpenCompass is an open-source evaluation framework developed by the Shanghai AI Lab to provide standardized, comprehensive benchmarking for large language models.

FlagEval

An open-source evaluation framework developed by the Beijing Academy of Artificial Intelligence (BAAI) to standardize and scale LLM benchmarking.

LMArena

A crowdsourced benchmarking platform where users battle-test Large Language Models through blind side-by-side comparisons.

MMLU

MMLU is a comprehensive benchmark designed to evaluate the general knowledge and problem-solving capabilities of large language models across a vast array of disciplines.