Model Comparison and Evaluation in HagiCode
Scope and method
- Goal: Provide model-selection guidance based on real integration experience in HagiCode.
- Task types: Frontend component implementation, backend API refactoring, test completion, and documentation generation.
- Evaluation axes: delivery effectiveness (whether the model reliably finishes real tasks) and cost-effectiveness (cost + domestic availability).
Test-time and scenario notes
- Latest test date: 2026-03-08
- Test period: 2026-03-01 to 2026-03-08
- Sample basis: Subjective evaluation from real HagiCode engineering workflows, not vendor benchmark numbers.
- Applicability: Conclusions are scoped to this project’s current workflow and constraints.
This page only lists tested models
The following models were actually integrated and used by our team:
- GLM 4.7
- GLM 5
- Qwen 3.5
- Qwen Code Next
- GPT 5.3 Codex
- GPT 5.4
- Minimax M2.5
Comparison snapshot (tested models)
| Model | Test Date | Delivery effectiveness | Cost-effectiveness | Primary experience |
|---|---|---|---|---|
| GPT 5.4 | 2026-03-08 | Very high | Medium-high | Frequently exceeds baseline requirements with strong engineering quality |
| GPT 5.3 Codex | 2026-03-08 | Very high | Medium-high | High completion quality within scope, strong engineering output |
| GLM 5 | 2026-03-08 | High | High | Stable overall performance for our requirements |
| GLM 4.7 | 2026-03-08 | High | Very high | Reliable delivery with better cost control |
| Minimax M2.5 | 2026-03-08 | Medium-high | Highest | Can achieve most goals, but code-closing errors happen more often |
| Qwen 3.5 / Code Next | 2026-03-08 | Medium | Medium-high | Lower completion ranking in our scenarios |
Delivery-effectiveness ranking (author recommendation)
Ranked by task completion quality and engineering practice quality:
- GPT 5.4
- GPT 5.3 Codex
- GLM 5
- GLM 4.7
- Minimax M2.5
- Qwen (3.5 / Code Next)
Key findings
- Except for Qwen, all other tested models can achieve our target outcomes to some degree.
- GLM 4.7 and later (GLM 4.7 and GLM 5) generally complete our requirements smoothly.
- GPT 5.3 Codex and GPT 5.4 not only complete requirements but also produce better engineering practices and implementation quality.
- Minimax M2.5 has a recurring weakness: code-closing errors (e.g., incomplete bracket/block closure), so extra review is needed; a minimal pre-review check is sketched below.
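The code-closing weakness is cheap to screen for before human review. Below is a minimal sketch of a bracket-balance check in TypeScript; the function name `bracketsBalanced` and the truncated sample are hypothetical, and since this naive scan ignores strings and comments, a real pipeline should prefer the compiler (e.g. `tsc --noEmit`) or a linter.

```ts
// Minimal bracket-balance scan. Assumption: plain source text; this naive
// check ignores strings and comments, so prefer the compiler or a linter
// in a real pipeline.
function bracketsBalanced(source: string): boolean {
  const open = "([{";
  const close = ")]}";
  const stack: string[] = [];
  for (const ch of source) {
    const idx = open.indexOf(ch);
    if (idx !== -1) {
      stack.push(close[idx]); // remember which closer we expect
    } else if (close.includes(ch)) {
      if (stack.pop() !== ch) return false; // mismatched or premature closer
    }
  }
  return stack.length === 0; // leftover openers mean an unclosed block
}

// Illustrative truncated output: the map() call and function body never close.
const truncated =
  "function renderList(items: string[]) {\n" +
  "  return items.map(i => '<li>' + i + '</li>'";
console.log(bracketsBalanced(truncated)); // false -> flag for manual review
```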
Cost-effectiveness ranking (cost + domestic availability)
Ranked by economic cost and practical availability in China:
- Minimax M2.5
- GLM 4.7
- GLM 5
- Qwen 3.5 / Code Next
- GPT 5.3 Codex
- GPT 5.4
Note: this ranking intentionally differs from the delivery-effectiveness ranking above.
Selection guidance
- Quality-first: choose GPT 5.4 / GPT 5.3 Codex.
- Balanced strategy: choose GLM 5 / GLM 4.7.
- Cost-first: choose Minimax M2.5 (with stricter code-closure checks).
- Practical routing: use premium models for critical tasks and cost-efficient models for routine tasks; a routing sketch follows this list.
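To make the routing item concrete, here is a minimal sketch, assuming a hypothetical `Task` shape, model identifier strings, and `routeModel()` helper; none of these are HagiCode's actual integration API.

```ts
// Hypothetical routing layer. Assumptions: the Task shape, the model ID
// strings, and routeModel() are illustrative, not HagiCode's real API.
type Criticality = "critical" | "routine";

interface Task {
  name: string;
  criticality: Criticality;
}

// Ranked pools mirroring the delivery-effectiveness and cost rankings above.
const PREMIUM_MODELS = ["gpt-5.4", "gpt-5.3-codex"];
const COST_EFFICIENT_MODELS = ["minimax-m2.5", "glm-4.7"];

function routeModel(task: Task): string {
  // Simplest policy: take the top choice from the appropriate ranked pool.
  const pool = task.criticality === "critical" ? PREMIUM_MODELS : COST_EFFICIENT_MODELS;
  return pool[0];
}

console.log(routeModel({ name: "payment API refactor", criticality: "critical" })); // "gpt-5.4"
console.log(routeModel({ name: "changelog update", criticality: "routine" }));      // "minimax-m2.5"
```

Picking the first entry of a ranked pool keeps the policy trivially auditable; a real router might add fallbacks or per-task overrides.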
Untested-model statement
For models not listed here, we currently have no test data and no hands-on experience, so we do not provide evaluations.
If sponsors provide access to additional models, we will run experience-based evaluations in our real workflow and update this page.