Model Comparison and Evaluation in HagiCode
Scope and method
- Goal: Provide model-selection guidance based on real integration experience in HagiCode.
- Task types: Frontend component implementation, backend API refactoring, test completion, and documentation generation.
- Evaluation axes: delivery effectiveness (whether the model reliably finishes real tasks) and cost-effectiveness (cost + domestic availability).
Test-time and scenario notes
- Latest test date: 2026-03-08
- Test period: 2026-03-01 to 2026-03-08
- Sample basis: Subjective evaluation from real HagiCode engineering workflows, not vendor benchmark numbers.
- Applicability: Conclusions are scoped to this project’s current workflow and constraints.
This page only lists tested models
The following models were actually integrated and used by our team:
- GLM 4.7
- GLM 5
- Qwen 3.5
- Qwen Code Next
- GPT 5.3 Codex
- GPT 5.4
- Minimax M2.5
Comparison snapshot (tested models)
| Model | Test Date | Delivery effectiveness | Cost-effectiveness | Primary experience |
|---|---|---|---|---|
| GPT 5.4 | 2026-03-08 | Very high | Medium-high | Frequently exceeds baseline requirements with strong engineering quality |
| GPT 5.3 Codex | 2026-03-08 | Very high | Medium-high | High completion quality within scope, strong engineering output |
| GLM 5 | 2026-03-08 | High | High | Stable overall performance for our requirements |
| GLM 4.7 | 2026-03-08 | High | Very high | Reliable delivery with better cost control |
| Minimax M2.5 | 2026-03-08 | Medium-high | Highest | Can achieve most goals, but code-closing errors happen more often |
| Qwen 3.5 / Code Next | 2026-03-08 | Medium | Medium-high | Lower completion ranking in our scenarios |
Delivery-effectiveness ranking (author recommendation)
Ranked by task completion quality and engineering practice quality:
- GPT 5.4
- GPT 5.3 Codex
- GLM 5
- GLM 4.7
- Minimax M2.5
- Qwen (3.5 / Code Next)
Key findings
- Except for Qwen, all other tested models can achieve our target outcomes to some degree.
- GLM 4.7 and later (GLM 4.7 and GLM 5) generally complete our requirements smoothly.
- GPT 5.3 Codex and GPT 5.4 not only complete requirements but also produce better engineering practices and implementation quality.
- Minimax M2.5 has a recurring weakness: code-closing errors (e.g., incomplete bracket/block closure), so extra review is needed; a minimal pre-review check is sketched below.
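The code-closing weakness is cheap to screen for before human review. Below is a minimal sketch of a bracket-balance check in TypeScript; the function name `bracketsBalanced` and the truncated sample are hypothetical, and since this naive scan ignores strings and comments, a real pipeline should prefer the compiler (e.g. `tsc --noEmit`) or a linter.

```ts
// Minimal bracket-balance scan. Assumption: plain source text; this naive
// check ignores strings and comments, so prefer the compiler or a linter
// in a real pipeline.
function bracketsBalanced(source: string): boolean {
  const open = "([{";
  const close = ")]}";
  const stack: string[] = [];
  for (const ch of source) {
    const idx = open.indexOf(ch);
    if (idx !== -1) {
      stack.push(close[idx]); // remember which closer we expect
    } else if (close.includes(ch)) {
      if (stack.pop() !== ch) return false; // mismatched or premature closer
    }
  }
  return stack.length === 0; // leftover openers mean an unclosed block
}

// Illustrative truncated output: the map() call and function body never close.
const truncated =
  "function renderList(items: string[]) {\n" +
  "  return items.map(i => '<li>' + i + '</li>'";
console.log(bracketsBalanced(truncated)); // false -> flag for manual review
```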
Cost-effectiveness ranking (cost + domestic availability)
Ranked by economic cost and practical availability in China:
- Minimax M2.5
- GLM 4.7
- GLM 5
- Qwen 3.5 / Code Next
- GPT 5.3 Codex
- GPT 5.4
Note: this ranking intentionally differs from the delivery-effectiveness ranking above.
Selection guidance
- Quality-first: choose GPT 5.4 / GPT 5.3 Codex.
- Balanced strategy: choose GLM 5 / GLM 4.7.
- Cost-first: choose Minimax M2.5 (with stricter code-closure checks).
- Practical routing: use premium models for critical tasks and cost-efficient models for routine tasks; a routing sketch follows this list.
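To make the routing item concrete, here is a minimal sketch, assuming a hypothetical `Task` shape, model identifier strings, and `routeModel()` helper; none of these are HagiCode's actual integration API.

```ts
// Hypothetical routing layer. Assumptions: the Task shape, the model ID
// strings, and routeModel() are illustrative, not HagiCode's real API.
type Criticality = "critical" | "routine";

interface Task {
  name: string;
  criticality: Criticality;
}

// Ranked pools mirroring the delivery-effectiveness and cost rankings above.
const PREMIUM_MODELS = ["gpt-5.4", "gpt-5.3-codex"];
const COST_EFFICIENT_MODELS = ["minimax-m2.5", "glm-4.7"];

function routeModel(task: Task): string {
  // Simplest policy: take the top choice from the appropriate ranked pool.
  const pool = task.criticality === "critical" ? PREMIUM_MODELS : COST_EFFICIENT_MODELS;
  return pool[0];
}

console.log(routeModel({ name: "payment API refactor", criticality: "critical" })); // "gpt-5.4"
console.log(routeModel({ name: "changelog update", criticality: "routine" }));      // "minimax-m2.5"
```

Picking the first entry of a ranked pool keeps the policy trivially auditable; a real router might add fallbacks or per-task overrides.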
Untested-model statement
For models not listed here, we currently have no test data and no hands-on experience, so we do not provide evaluations.
If sponsors provide access to additional models, we will run experience-based evaluations in our real workflow and update this page.