


Welcome to the largest gaming community on Lemmy! Discussion for all kinds of games: video games, tabletop games, card games, etc.
Rules:

1. Submissions have to be related to games. Video games, tabletop, or otherwise. Posts not related to games will be deleted. This community is focused on games of all kinds; any news item or discussion should be related to gaming in some way.

2. No bigotry. This is a hardline stance. Try not to get too heated when entering into a discussion or debate; we are here to discuss one of our passions, not to fight or be exposed to hate. Hateful posts or responses will be deleted to keep the atmosphere good. On repeat violations, not only will the comment be deleted but a ban will be handed out as well. We judge each case individually.

3. No excessive self-promotion. Try to keep your post history to roughly 10% self-promotion and 90% other content. This is to prevent people from posting for the sole purpose of promoting their own website or social media account.

4. Stay on topic. This community is mostly for discussion and news, and we want to keep the quality of posts high. Search for whatever you're submitting before posting, to see if it has already been posted; memes, funny videos, low-effort posts, and reposts are not allowed. Giveaways are also prohibited, because we cannot be sure that the person holding the giveaway will actually do what they promise.

5. Mark spoilers and NSFW. Mark your posts or they may be removed. No one wants to be spoiled, so always mark spoilers; likewise, mark NSFW content in case anyone is browsing in a public space or at work.

6. No links to piracy. Don't share piracy links here; there are other places to find them. Discussion of piracy is fine, but we don't want the moderators or the lemmy.world admins to get in trouble for linking to it, so any such link will be removed.
If the model collapse theory weren't true, then why do LLMs need to scrape so much data from the internet for training?

According to you, they should be able to just generate synthetic training data purely with the previous model, and then use that to train the next generation.

So why is there even a need for human input at all, then? Why are all the LLM companies fighting tooth and nail against restrictions on their data scraping, if real human data is in fact so unnecessary for model training and they could just generate their own synthetic training data instead?
You can stop models from deteriorating without new data, and you can even train them on synthetic data, but that still requires the synthetic data to be either curated or filtered by humans to ensure its quality.

If you take a million random ChatGPT outputs with no human filtering whatsoever, use them to retrain the model, and then repeat that over and over, the model will eventually turn to shit. Each iteration, some of the random variation in ChatGPT's outputs produces low-quality text, which is then presented to the next model as a target to achieve. The new model learns that this kind of bad output is actually desirable, which makes it more likely to reappear in the next batch of synthetic data.

And if you turn off the randomness, the model may not deteriorate, but it also won't improve, because effectively no new data is being generated.
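To make that feedback loop concrete, here's a toy simulation. This is my own sketch, not anything from an actual LLM pipeline: a Gaussian stands in for the model, "training" is fitting its mean and standard deviation, and "generation" is sampling a fresh dataset from the fit, with zero filtering in between:

```python
import random
import statistics

random.seed(0)
N = 20  # tiny "dataset" per generation, to exaggerate the effect

# Generation 0: real, human-made data (a standard normal stands in for it).
data = [random.gauss(0.0, 1.0) for _ in range(N)]

for gen in range(201):
    # "Training": fit the model (here just a Gaussian) to the current data.
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    if gen % 20 == 0:
        print(f"gen {gen:3d}: mu={mu:+.3f} sigma={sigma:.5f}")
    # "Generation": sample a fresh synthetic dataset from the fitted model,
    # with no human filtering, and make it the next training set.
    data = [random.gauss(mu, sigma) for _ in range(N)]
```

The exact numbers depend on the seed and sample size, but on a typical run sigma shrinks by orders of magnitude over the 200 generations while mu wanders away from zero: each refit carries a bit of estimation noise, and with nothing to correct it, the model keeps narrowing around its own outputs. The rare-but-real tails of the original data are the first thing to disappear. That's model collapse in miniature.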
I stopped reading when you said "according to you" and then produced a wall of text of shit I never said.
Synthetic data is massively helpful for training; you can look it up. Model collapse is a myth.
That is enormously ironic, since I literally never claimed you said anything except what you actually did say: namely, that synthetic data is enough to train models.

Literally the very next sentence starts with the words "then why", which clearly and explicitly means I'm no longer paraphrasing you. Everything else in my comment is quite explicitly my own thoughts on the matter, and why I disagree with that statement. So in actual fact, you're the one making up shit I never said.