We evaluated Devstral 2 against DeepSeek V3.2 and Claude Sonnet 4.5 using human evaluations conducted by an independent annotation provider, with tasks scaffolded through Cline. Devstral 2 shows a clear advantage over DeepSeek V3.2, with a 42.8% win rate versus a 28.6% loss rate. However, Claude Sonnet 4.5 remains significantly preferred, indicating that a gap with closed-source models persists.
To be fair, not all of an LLM's knowledge comes from its training material. The other way is to provide context alongside the instructions.
I can imagine someone someday developing a decent way for LLMs to write down their mistakes in a database, plus some clever way to recall the most relevant memories when needed.
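A minimal sketch of what that could look like, purely as an illustration of the idea: the `MistakeMemory` class and the token-overlap similarity below are hypothetical stand-ins (a real system would presumably use embeddings and a vector index), but the basic shape is "record a lesson, then rank stored lessons by how similar their original task is to the new one."

```python
# Toy sketch of a "mistake memory": store past mistakes in a small database
# and recall the most relevant ones when starting a new task.
# The similarity function is a naive token-overlap score -- a stand-in for
# embedding-based retrieval in a real system.
import sqlite3


def _tokens(text: str) -> set[str]:
    return set(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Jaccard overlap between two texts (proxy for semantic similarity)."""
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


class MistakeMemory:
    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS mistakes (task TEXT, lesson TEXT)")

    def record(self, task: str, lesson: str) -> None:
        """Write down what went wrong and what to do differently next time."""
        self.db.execute("INSERT INTO mistakes VALUES (?, ?)", (task, lesson))
        self.db.commit()

    def recall(self, new_task: str, k: int = 3) -> list[str]:
        """Return the k lessons whose original task looks most like the new one."""
        rows = self.db.execute("SELECT task, lesson FROM mistakes").fetchall()
        ranked = sorted(rows, key=lambda r: similarity(r[0], new_task), reverse=True)
        return [lesson for _, lesson in ranked[:k]]


memory = MistakeMemory()
memory.record("parse nested JSON config", "handle missing keys before indexing")
memory.record("write SQL migration", "wrap schema changes in a transaction")
print(memory.recall("fix JSON parsing bug in config loader"))
```

The recalled lessons would then be injected into the model's context for the new task, which is exactly the "provide context alongside the instructions" route mentioned above.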
Manufacturers will be required to offer spare parts and publish security updates for an extended period. Energy labels will show a repairability index as well as energy efficiency.
Thank you for being honest about performance