OpenCompass is an open-source evaluation framework developed by the Shanghai AI Lab to provide standardized, comprehensive benchmarking for large language models.
MMLU is a comprehensive benchmark designed to evaluate the general knowledge and problem-solving capabilities of large language models across a wide range of disciplines.
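MMLU items are multiple-choice questions with four options and a single gold answer letter, and a model's score is simply its accuracy over the items. As a minimal sketch of that scoring scheme, here is a toy evaluator; the questions below are illustrative stand-ins, not actual MMLU items.

```python
# Toy MMLU-style items: question, four labeled choices, and a gold answer letter.
# These are illustrative stand-ins, NOT real MMLU questions.
TOY_ITEMS = [
    {"question": "What is 2 + 2?",
     "choices": {"A": "3", "B": "4", "C": "5", "D": "6"},
     "answer": "B"},
    {"question": "Which planet is closest to the Sun?",
     "choices": {"A": "Venus", "B": "Earth", "C": "Mercury", "D": "Mars"},
     "answer": "C"},
]

def accuracy(predictions, items):
    """Fraction of items where the predicted letter matches the gold answer."""
    correct = sum(pred == item["answer"] for pred, item in zip(predictions, items))
    return correct / len(items)

if __name__ == "__main__":
    # Suppose a model answered "B" for the first item and "D" for the second:
    print(accuracy(["B", "D"], TOY_ITEMS))  # 1 of 2 correct -> 0.5
```

Real harnesses differ mainly in how they extract the predicted letter (e.g. by comparing per-choice log-likelihoods or parsing generated text), but the final metric reduces to this per-item accuracy, often averaged per subject.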
A comprehensive evaluation suite designed to assess the knowledge and capabilities of large language models (LLMs) specifically in Chinese.