• 7 Posts
  • 5 Comments
Joined 9M ago
Cake day: Mar 17, 2025


We evaluated Devstral 2 against DeepSeek V3.2 and Claude Sonnet 4.5 using human evaluations conducted by an independent annotation provider, with tasks scaffolded through Cline. Devstral 2 shows a clear advantage over DeepSeek V3.2, with a 42.8% win rate versus 28.6% loss rate. However, Claude Sonnet 4.5 remains significantly preferred, indicating a gap with closed-source models persists.

Thank you for being honest about performance


Blue Origin lands its New Glenn rocket on landing platform
cross-posted from: https://lemmy.ml/post/38941578 > The second company to manage that after SpaceX.





State-of-the-art LLM agents do not perform calculations themselves; they call external tools to do that.
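To make that concrete, here is a rough sketch of what "calling a tool" can look like on the application side. Everything in it is made up for illustration: the `calculator` tool, the `TOOLS` registry, the `run_tool_call` helper and the dict format of the tool call are assumptions, not any particular vendor's API. The point is only that the model emits a request and ordinary code does the arithmetic.

```python
import ast
import operator

# Hypothetical tool: a small arithmetic evaluator the agent can delegate to.
# It walks the parsed expression instead of using eval().
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}


def calculator(expression: str):
    """Evaluate a plain arithmetic expression such as '2 * (3 + 4)'."""

    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


# Hypothetical registry mapping tool names to Python functions.
TOOLS = {"calculator": calculator}


def run_tool_call(call: dict):
    """Dispatch a model-emitted tool call (assumed format) to the matching function."""
    return TOOLS[call["name"]](**call["arguments"])


# Stand-in for what a model might emit instead of doing the math itself.
model_output = {"name": "calculator", "arguments": {"expression": "1234 * 5678"}}
result = run_tool_call(model_output)
print(result)
```

In a real agent the result would be fed back to the model as another message, so the final answer quotes a computed number rather than a guessed one.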


To be fair, not all of an LLM's knowledge comes from training material. The other way is to provide context alongside the instructions.

I can imagine someone someday developing a decent way for LLMs to write down their mistakes in a database, plus some clever way to recall the most relevant memories when needed.



Manufacturers will be required to offer spare parts and publish security updates for an extended period. Energy labels will show a repairability index as well as energy efficiency.


It is literally an algorithm made to hallucinate. The fact that it outputs accurate facts is more of a side effect.