The biggest problem to me is what I just saw you post in another reply: that these models, built on our knowledge, exist almost solely within proprietary ecosystems.
and maybe even our Mastodon or Lemmy posts!
The Washington Post published a great piece that lets you search which websites were included in the "C4" dataset published in 2019. I searched for my personal blog jonaharagon.com,
and sure enough it was included. The C4 dataset is practically minuscule compared to what is being compiled for larger models like ChatGPT. If my tiny website was included, Mastodon and Lemmy posts (which are actually very visible and SEO-optimized tbh) are 100% being scraped as well, there's no maybe about it.
Hi @[email protected]~ I'm the admin of lemmy.one (I know you also messaged me on Reddit). This server has roughly double the specs of lemmy.ml's server, but we're currently about 10x smaller than lemmy.ml, so we have lots of room for growth. I also run mstdn.party, one of the top 40 largest Mastodon servers according to the-federation.info, and I'm prepared to scale this community as well. If you want your community hosted on this instance, I'm happy to get that set up for you; we can talk further on Reddit.
If you're considering running your own Lemmy server instead, it's not particularly resource-intensive: I would imagine you could host up to ~1,000 monthly active users on a server that costs no more than $30/month. I'm happy to invite you to some Lemmy admin communities that can provide assistance if you're interested in going that route.
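For a sense of what "running your own Lemmy server" actually involves, here's a rough sketch of the kind of docker-compose setup the Lemmy admin docs describe. The image tags, the example.com domain, and the password here are all placeholders I've filled in for illustration, so check the current official docs before deploying anything like this:

```yaml
version: "3.7"

services:
  lemmy:
    image: dessalines/lemmy:latest    # pin a specific version in production
    restart: always
    environment:
      - RUST_LOG=warn
    volumes:
      - ./lemmy.hjson:/config/config.hjson:ro  # hostname, DB creds, etc.
    depends_on:
      - postgres
      - pictrs

  lemmy-ui:
    image: dessalines/lemmy-ui:latest
    restart: always
    environment:
      - LEMMY_UI_LEMMY_INTERNAL_HOST=lemmy:8536
      - LEMMY_UI_LEMMY_EXTERNAL_HOST=example.com  # your domain (placeholder)
    depends_on:
      - lemmy

  postgres:
    image: postgres:15-alpine
    restart: always
    environment:
      - POSTGRES_USER=lemmy
      - POSTGRES_PASSWORD=changeme    # placeholder, use a real secret
      - POSTGRES_DB=lemmy
    volumes:
      - ./volumes/postgres:/var/lib/postgresql/data

  pictrs:
    image: asonix/pictrs:latest       # image-hosting service Lemmy uses
    restart: always
    volumes:
      - ./volumes/pictrs:/mnt
```

A stack like this plus a reverse proxy fits comfortably on the kind of $30/month VPS mentioned above, which is why I say it's not particularly resource-intensive.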