Model Comparison and Evaluation in HagiCode

  • Goal: Provide model-selection guidance based on real integration experience in HagiCode.
  • Task types: Frontend component implementation, backend API refactoring, test completion, and documentation generation.
  • Evaluation axes: delivery effectiveness (whether the model can reliably finish real tasks) and cost-effectiveness (cost plus practical availability in China).
  • Latest test date: 2026-03-08
  • Test period: 2026-03-01 to 2026-03-08
  • Sample basis: Subjective evaluation from real HagiCode engineering workflows, not vendor benchmark numbers.
  • Applicability: Conclusions are scoped to this project’s current workflow and constraints.

The following models were actually integrated and used by our team:

  • GLM 4.7
  • GLM 5
  • Qwen 3.5
  • Qwen Code Next
  • GPT 5.3 Codex
  • GPT 5.4
  • Minimax M2.5

| Model | Test date | Delivery effectiveness | Cost-effectiveness | Primary experience |
| --- | --- | --- | --- | --- |
| GPT 5.4 | 2026-03-08 | Very high | Medium-high | Frequently exceeds baseline requirements with strong engineering quality |
| GPT 5.3 Codex | 2026-03-08 | Very high | Medium-high | High completion quality within scope, strong engineering output |
| GLM 5 | 2026-03-08 | High | High | Stable overall performance for our requirements |
| GLM 4.7 | 2026-03-08 | High | Very high | Reliable delivery with better cost control |
| Minimax M2.5 | 2026-03-08 | Medium-high | Highest | Can achieve most goals, but code-closing errors happen more often |
| Qwen 3.5 / Code Next | 2026-03-08 | Medium | Medium-high | Lower completion ranking in our scenarios |

Delivery-effectiveness ranking (author recommendation)

Ranked by task completion quality and engineering practice quality:

  1. GPT 5.4
  2. GPT 5.3 Codex
  3. GLM 5
  4. GLM 4.7
  5. Minimax M2.5
  6. Qwen (3.5 / Code Next)
  • Except for Qwen, all of the other tested models can achieve our target outcomes to some degree.
  • GLM 4.7+ (GLM 4.7 and GLM 5) generally completes our requirements smoothly.
  • GPT 5.3 Codex and GPT 5.4 not only complete requirements but also produce better engineering practices and implementation quality.
  • Minimax M2.5 has a recurring weakness: code-closing errors (e.g., incomplete bracket/block closure), so extra review is needed.
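The "code-closing" failure mode above can be caught cheaply before human review with a naive bracket-balance scan. The following is an illustrative sketch only, not a tool the HagiCode team actually uses, and it deliberately ignores brackets inside strings and comments:

```python
# Hypothetical sanity check for unbalanced brackets in model-generated code.
# Naive by design: does not skip brackets inside string literals or comments.

PAIRS = {")": "(", "]": "[", "}": "{"}

def unbalanced_brackets(source: str) -> list[str]:
    """Return a list of human-readable bracket problems found in `source`."""
    stack: list[tuple[str, int]] = []   # (opening bracket, line number)
    problems: list[str] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for ch in line:
            if ch in "([{":
                stack.append((ch, lineno))
            elif ch in PAIRS:
                if not stack or stack[-1][0] != PAIRS[ch]:
                    problems.append(f"line {lineno}: unexpected '{ch}'")
                else:
                    stack.pop()
    # Anything still on the stack was opened but never closed.
    problems.extend(f"line {lineno}: unclosed '{ch}'" for ch, lineno in stack)
    return problems
```

A check like this could gate Minimax M2.5 output in CI so that only syntactically closed code reaches reviewers; for real use, a language-aware linter or the compiler itself is the better gate.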

Cost-effectiveness ranking (cost + domestic availability)

Ranked by economic cost and practical availability in China:

  1. Minimax M2.5
  2. GLM 4.7
  3. GLM 5
  4. Qwen 3.5 / Code Next
  5. GPT 5.3 Codex
  6. GPT 5.4

Note: this ranking intentionally differs from the delivery-effectiveness ranking above.

  • Quality-first: choose GPT 5.4 / GPT 5.3 Codex.
  • Balanced strategy: choose GLM 5 / GLM 4.7.
  • Cost-first: choose Minimax M2.5 (with stricter code-closure checks).
  • Practical routing: use premium models for critical tasks and cost-efficient models for routine tasks.
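The routing strategy above can be sketched as a small lookup table. Everything here is hypothetical: the tier names, the `pick_model` function, and the mapping are illustrative, not part of any real HagiCode configuration; only the model names come from this page.

```python
# Illustrative sketch of premium-for-critical / cost-efficient-for-routine routing.
# Tier keys and the fallback choice are assumptions made for this example.

ROUTES = {
    "critical": ["GPT 5.4", "GPT 5.3 Codex"],   # quality-first tier
    "routine":  ["GLM 5", "GLM 4.7"],           # balanced tier
    "bulk":     ["Minimax M2.5"],               # cost-first tier (pair with code-closure checks)
}

def pick_model(task_tier: str, fallback: str = "GLM 4.7") -> str:
    """Return the preferred model for a tier, or a balanced fallback for unknown tiers."""
    candidates = ROUTES.get(task_tier, [])
    return candidates[0] if candidates else fallback
```

For example, `pick_model("critical")` returns `"GPT 5.4"`, while an unrecognized tier falls back to `"GLM 4.7"`. In practice, the tier assignment itself (what counts as "critical") is the judgment call this page leaves to each team.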

For models not listed here, we currently have no test data and no hands-on experience, so we do not provide evaluations.

If sponsors provide access to additional models, we will run experience-based evaluations in our real workflow and update this page.