Blue Origin lands its New Glenn rocket on landing platform

@[email protected]

We evaluated Devstral 2 against DeepSeek V3.2 and Claude Sonnet 4.5 using human evaluations conducted by an independent annotation provider, with tasks scaffolded through Cline. Devstral 2 shows a clear advantage over DeepSeek V3.2, with a 42.8% win rate versus 28.6% loss rate. However, Claude Sonnet 4.5 remains significantly preferred, indicating a gap with closed-source models persists.

Thank you for being honest about performance

@[email protected]

Can I use it? And if not: when can I use it?

@[email protected]

State-of-the-art LLM agents do not perform calculations themselves; they call external tools to do that.
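
For anyone wondering what "calling a tool" looks like in practice, here is a rough sketch. It assumes the model emits a structured request such as `{"name": ..., "arguments": {...}}` and that the host program does the actual arithmetic. The `calculator` tool, the `TOOLS` registry, and `handle_tool_call` are names made up for this illustration, not any particular vendor's API.

```python
import ast
import operator

# Hypothetical tool: evaluates basic arithmetic without using eval().
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str) -> float:
    """Evaluate an expression made of numbers and + - * / only."""

    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported expression")

    return ev(ast.parse(expression, mode="eval").body)


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"calculator": calculator}


def handle_tool_call(call: dict):
    """Run the tool named in a model-emitted request and return its result."""
    return TOOLS[call["name"]](**call["arguments"])


# The model only produces the request; the host computes the answer.
request = {"name": "calculator", "arguments": {"expression": "428 - 286"}}
result = handle_tool_call(request)
```

The result would then be fed back to the model as context for its next reply, so the number in the final answer comes from the tool rather than from token prediction.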

@[email protected]

To be fair, not all of an LLM's knowledge comes from training material. The other source is context provided alongside the instructions.

I can imagine someone someday developing a decent way for LLMs to write down their mistakes in a database, along with some clever way to recall the most relevant memories when needed.
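
A toy version of that idea, purely as a sketch: notes are kept in a plain list and "relevance" is just the number of words shared with the query. The `MistakeMemory` class and its methods are hypothetical names for this example; a real system would more likely use embeddings and a proper database.

```python
import re


def _tokens(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class MistakeMemory:
    """Hypothetical in-memory store of past mistakes with naive recall."""

    def __init__(self):
        self._notes = []

    def record(self, note: str) -> None:
        """Write down one mistake."""
        self._notes.append(note)

    def recall(self, query: str, k: int = 2) -> list:
        """Return up to k notes sharing the most words with the query."""
        q = _tokens(query)
        scored = [(len(q & _tokens(note)), note) for note in self._notes]
        # Keep only notes with some overlap, highest score first.
        relevant = [item for item in scored if item[0] > 0]
        relevant.sort(key=lambda item: item[0], reverse=True)
        return [note for _, note in relevant[:k]]


memory = MistakeMemory()
memory.record("Forgot to convert EUR to USD before summing invoices")
memory.record("Used the wrong timezone for the cron schedule")

# Recalled notes would be prepended to the prompt as extra context.
hints = memory.recall("summing invoices in USD", k=1)
```

Whatever `recall` returns gets pasted into the prompt, which is the "provide context" route rather than retraining.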

@[email protected]

It is literally an algorithm made to hallucinate. The fact that it outputs accurate facts is more of a side effect.
