OpenAI Pleads That It Can’t Make Money Without Using Copyrighted Materials for Free

flop_leash_973@lemmy.world · 2 months ago


General_Effort@lemmy.world · 2 months ago

Scaling laws are disputed

Not in general.

There is not enough permissively licensed text to train models of any meaningful size, and what there is lacks diversity: Wikipedia, government documents, Stack Overflow, century-old works, … An LLM trained on that is unlikely to be called “general purpose”, because scaling laws tie a model's capability to how much data it is trained on. Such small models are sometimes trained for research purposes, though I don’t have a link ready. They are not something you’d actually use. You could look at Microsoft’s Phi series of models. They are trained on synthetic data, though, so they’re probably not what you are looking for.

mm_maybe@sh.itjust.works · 2 months ago

Yes, I’ve written extensively about Phi and related issues in a blog post, which I’ll share here: https://medium.com/@matthewmaybe/data-dignity-is-difficult-64ba41ee9150