AI Model Benchmarks

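AI Model Benchmarks MagicArena

MagicArena is a competitive benchmarking platform that evaluates and ranks visual generative AI models through side-by-side human comparisons.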

AI Model Benchmarks AGI-Eval

AGI-Eval is a dedicated evaluation community that benchmarks the capabilities and performance of AI large language models.

AI Model Benchmarks H2O EvalGPT

H2O EvalGPT is an evaluation system from H2O.ai that uses Elo ratings to benchmark and rank large language models (LLMs).
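For readers unfamiliar with Elo ratings, the following is a minimal sketch of the standard Elo update applied to one head-to-head comparison between two models. It is not taken from H2O.ai's code; the function names, the K-factor of 32, and the 400-point scale are conventional chess defaults assumed here for illustration only.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if model A was preferred, 0.0 if model B was preferred,
    and 0.5 for a tie. k (assumed value) controls how fast ratings move.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Hypothetical example: two models start at 1000 and model A wins once.
new_a, new_b = update_elo(1000.0, 1000.0, score_a=1.0)
print(new_a, new_b)  # 1016.0 984.0
```

Because each update is zero-sum, the total rating across the two models is preserved, and an upset win over a higher-rated model moves ratings more than an expected win does.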

AI Model Benchmarks LLMEval3

LLMEval3 is an evaluation benchmark from the Fudan University NLP Lab that measures the performance and reliability of large language models.

AI Model Benchmarks MMBench

MMBench is a comprehensive evaluation framework designed to measure the capabilities of multimodal large language models across a wide array of visual and textual tasks.

AI Model Benchmarks HELM

HELM (Holistic Evaluation of Language Models) is a standardized evaluation framework from Stanford University that measures the performance and safety of large language models.

AI Model Benchmarks OpenCompass

OpenCompass is an open-source evaluation framework developed by the Shanghai AI Lab to provide standardized, comprehensive benchmarking for large language models.

AI Model Benchmarks FlagEval

FlagEval is an open-source evaluation framework developed by the Beijing Academy of Artificial Intelligence (BAAI) to standardize and scale LLM benchmarking.

AI Model Benchmarks LMArena

LMArena is a crowdsourced benchmarking platform where users battle-test large language models through blind side-by-side comparisons.

AI Model Benchmarks MMLU

MMLU (Massive Multitask Language Understanding) is a multiple-choice benchmark that evaluates the general knowledge and problem-solving capabilities of large language models across 57 subjects, spanning STEM, the humanities, and the social sciences.