Benchmark	Details	Gemini 2.5 Flash-Lite Preview 06-17 (non-thinking)	Gemini 2.5 Flash-Lite Preview 06-17 (thinking)	Gemini 2.5 Flash (non-thinking)	Gemini 2.5 Flash (thinking)	Gemini 2.5 Pro (thinking)
Input price	$/1M tokens (no caching)	$0.10	$0.10	$0.30	$0.30	$1.25 ($2.50 > 200k tokens)
Output price	$/1M tokens	$0.40	$0.40	$2.50	$2.50	$10.00 ($15.00 > 200k tokens)
Reasoning & knowledge: Humanity's Last Exam (no tools)		5.1%	6.9%	8.4%	11.0%	21.6%
Science: GPQA diamond		64.6%	66.7%	78.3%	82.8%	86.4%
Mathematics: AIME 2025		49.8%	63.1%	61.6%	72.0%	88.0%
Code generation: LiveCodeBench (UI: 1/1/2025-5/1/2025)		33.7%	34.3%	41.1%	55.4%	69.0%
Code editing: Aider Polyglot		26.7%	27.1%	44.0%	56.7%	82.2%
Agentic coding: SWE-bench Verified	single attempt	31.6%	27.6%	50.0%	48.9%	59.6%
Agentic coding: SWE-bench Verified	multiple attempts	42.6%	44.9%	60.0%	60.3%	67.2%
Factuality: SimpleQA		10.7%	13.0%	25.8%	26.9%	54.0%
Factuality: FACTS grounding		84.1%	86.8%	83.4%	85.3%	87.8%
Visual reasoning: MMMU		72.9%	72.9%	76.9%	79.7%	82.0%
Image understanding: Vibe-Eval (Reka)		51.3%	57.5%	66.2%	65.4%	67.2%
Long context: MRCR v2 (8-needle)	128k (average)	16.6%	30.6%	34.1%	54.3%	58.0%
Long context: MRCR v2 (8-needle)	1M (pointwise)	4.1%	5.4%	16.8%	21.0%	16.4%
Multilingual performance: Global MMLU (Lite)		81.1%	84.5%	85.8%	88.4%	89.2%

Methodology

Gemini results: All Gemini scores are pass@1. "Single attempt" settings allow no majority voting or parallel test-time compute; "multiple attempts" settings allow test-time selection of the candidate answer. All evaluations are run with the AI Studio API using default sampling settings. To reduce variance, we average over multiple trials for smaller benchmarks. The Aider Polyglot score is the average pass rate over 3 trials. Vibe-Eval results are reported using Gemini as a judge. Google's scaffolding for "multiple attempts" on SWE-bench includes drawing multiple trajectories and re-scoring them using the model's own judgement. Aider results differ from the official leaderboard due to a difference in the evaluation settings used (non-default).
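As a rough illustration of the scoring described above, the following Python sketch averages a pass@1 rate over several trials and picks one candidate from several attempts. The trial data, the function names, and the `judge_score` callable are all hypothetical; the actual evaluation harness and re-scoring scaffold are not described in enough detail here to reproduce.

```python
from statistics import mean
from typing import Callable


def pass_rate(results: list[bool]) -> float:
    """Fraction of tasks solved in one trial, with one attempt per task (pass@1)."""
    if not results:
        raise ValueError("a trial must contain at least one task result")
    return sum(results) / len(results)


def averaged_pass_rate(trials: list[list[bool]]) -> float:
    """Mean of the per-trial pass rates, used to reduce run-to-run variance."""
    return mean(pass_rate(trial) for trial in trials)


def select_candidate(candidates: list[str], judge_score: Callable[[str], float]) -> str:
    """'Multiple attempts' selection: keep the candidate the judge scores highest.

    judge_score is a stand-in (assumption) for re-scoring by the model itself.
    """
    return max(candidates, key=judge_score)


# Hypothetical data: 3 trials over the same 4 tasks.
trials = [
    [True, False, True, True],
    [True, False, False, True],
    [True, True, True, False],
]
score = averaged_pass_rate(trials)
print(f"{score:.3f}")  # 0.667
```

Averaging per-trial rates, rather than pooling all attempts, keeps each trial a true single-attempt measurement, which matches the "pass rate average of 3 trials" wording used for Aider Polyglot.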

Result sources: Where provider numbers are not available, we report numbers from leaderboards that publish results on these benchmarks. Humanity's Last Exam results are sourced from https://agi.safe.ai/ and https://scale.com/leaderboard/humanitys_last_exam. LiveCodeBench results are from https://livecodebench.github.io/leaderboard.html (1/1/2025 - 5/1/2025 in the UI). Aider Polyglot numbers come from https://aider.chat/docs/leaderboards/. FACTS grounding results come from https://www.kaggle.com/benchmarks/google/facts-grounding. For MRCR v2, which is not publicly available yet, we include 128k results as a cumulative score so that they are comparable with other models, and a pointwise value for the 1M context window to show the capability of the model at full length. The methodology in this table has changed from previously published MRCR v2 results, as we have decided to focus on a harder, 8-needle version of the benchmark going forward.

Input and output prices reflect text, image and video modalities.
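The listed prices can be turned into a per-request cost estimate. The Python sketch below is a minimal example under stated assumptions: the dictionary keys are illustrative labels (not API model IDs), caching is ignored, and the higher Gemini 2.5 Pro rates are assumed to apply to both input and output whenever the prompt exceeds 200k tokens, which is one reading of the "> 200k tokens" note in the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Price:
    """USD per 1M tokens; the long_* rates apply above the prompt threshold."""
    input_per_m: float
    output_per_m: float
    long_input_per_m: Optional[float] = None
    long_output_per_m: Optional[float] = None


# Assumption: the threshold is measured on prompt (input) tokens.
LONG_PROMPT_THRESHOLD = 200_000

# Keys are hypothetical labels for this sketch, not official model identifiers.
PRICES = {
    "flash-lite": Price(0.10, 0.40),
    "flash": Price(0.30, 2.50),
    "pro": Price(1.25, 10.00, 2.50, 15.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request, without caching discounts."""
    price = PRICES[model]
    is_long = (
        price.long_input_per_m is not None
        and price.long_output_per_m is not None
        and input_tokens > LONG_PROMPT_THRESHOLD
    )
    in_rate = price.long_input_per_m if is_long else price.input_per_m
    out_rate = price.long_output_per_m if is_long else price.output_per_m
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Example: 1M input tokens and 1M output tokens on the Flash-Lite tier.
lite_cost = estimate_cost("flash-lite", 1_000_000, 1_000_000)
```

Current billing rules should be checked against the official pricing page before relying on such an estimate.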

Gemini 2.5 is state-of-the-art across a wide range of benchmarks.

Benchmarks

As we develop these new technologies, we recognize the responsibility it entails, and aim to prioritize safety and security in all our efforts.

Gemini’s advanced thinking, native multimodality and massive context window empower developers to build next-generation experiences.

Developer ecosystem