AI Model Benchmarks

AI Model Benchmarks MagicArena

MagicArena is a competitive benchmark platform designed to evaluate and rank visual generative AI models through human comparisons.

AI Model Benchmarks AGI-Eval

AGI-Eval is a specialized evaluation community designed to benchmark the capabilities and performance of a range of large language models.

AI Model Benchmarks H2O EvalGPT

An advanced evaluation system by H2O.ai that utilizes Elo rating methodologies to benchmark and rank Large Language Models (LLMs).
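For readers unfamiliar with Elo ratings, the following is a minimal sketch of the standard Elo update applied to a single head-to-head comparison between two models. It is not H2O.ai's implementation: the function names, the K-factor of 32, and the 400-point scale are conventional choices assumed here for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    k (assumed 32 here) controls how far a single result moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - exp_b)
    return new_a, new_b


# Two hypothetical models start at 1000; model A wins one comparison.
a, b = update_elo(1000.0, 1000.0, score_a=1.0)
print(a, b)
```

Because the two adjustments are equal and opposite, the sum of the ratings is preserved; an upset win against a higher-rated model moves both ratings further than an expected win does.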

AI Model Benchmarks LLMEval3

A professional evaluation benchmark from Fudan University’s NLP Lab designed to measure the performance and reliability of large language models.

AI Model Benchmarks MMBench

MMBench is a comprehensive evaluation framework designed to measure the capabilities of multimodal large language models across a wide array of visual and textual tasks.

AI Model Benchmarks HELM

A standardized, holistic evaluation framework from Stanford University designed to measure the performance and safety of large language models.

AI Model Benchmarks OpenCompass

OpenCompass is an open-source evaluation framework developed by the Shanghai AI Lab to provide standardized, comprehensive benchmarking for large language models.

AI Model Benchmarks FlagEval

An open-source evaluation framework developed by the Beijing Academy of Artificial Intelligence (BAAI) to standardize and scale LLM benchmarking.

AI Model Benchmarks LMArena

A crowdsourced benchmarking platform where users battle-test Large Language Models through blind side-by-side comparisons.

AI Model Benchmarks MMLU

MMLU (Massive Multitask Language Understanding) is a multiple-choice benchmark designed to evaluate the general knowledge and problem-solving capabilities of large language models across 57 subjects, ranging from elementary mathematics to law and medicine.