What happened to techbros from the 90s to now?

_number8_@lemmy.world · 2 months ago


Snot Flickerman@lemmy.blahaj.zone · edit-2 2 months ago

https://huggingface.co/datasets/defunct-datasets/the_pile_books3

This dataset is Shawn Presser’s work and is part of EleutherAI’s The Pile dataset.

This dataset contains all of bibliotik in plain .txt form, aka 197,000 books processed in exactly the same way as was done for bookcorpusopen (a.k.a. books1). It seems to be similar to OpenAI’s mysterious “books2” dataset referenced in their papers. Unfortunately OpenAI will not give details, so we know very little about any differences. People suspect it’s “all of libgen”, but it’s purely conjecture.

https://web.archive.org/web/20220522050247/https://huggingface.co/datasets/the_pile_books3

I emphasize “well known” because the source was literally in the description when the dataset was first uploaded to the internet. It was always right out in front that this was all the ebooks from the private torrent tracker Bibliotik. Shawn Presser/books3 never lied about where it came from. As you can see from the archive.org link, that description of its sourcing was on the page in May 2022.

Bibliotik is a well-known private tracker for ebooks and even peddles tools for removing DRM from ebooks. So, arguably, not only are the books pirated, but at some point a criminal DMCA violation occurred when the DRM was stripped from them. So OpenAI’s willingness to use it without question to get their company started should be evidence that they’re not concerned about where the data came from or about getting it in more legal ways.

BaroqueInMind@lemmy.one · 2 months ago

Thank you for the links and reading!