• 1 Post
  • 142 Comments
Joined 6 years ago
Cake day: July 17th, 2018



  • @ReakDuck Yup, and that’s a much better avenue to fight against the AI companies, because fundamentally this is almost impossible to avoid in ML models. We should stop complaining about how they scraped copyrighted content; that complaint won’t succeed until the legal loophole is removed. But when they reproduce copyrighted content, that could be fatal. This also applies to Copilot reproducing GPL code samples, for example.


  • @dandi8 The license of Adobe Photoshop is not open-source because it specifically restricts reverse-engineering and modifications, and a lot of other things. The license of Mistral Nemo IS open-source, because it’s Apache 2.0: you are free to use it, study it, redistribute it, … Open-source doesn’t say anything about giving you all the tools to re-create it, because that would mean they would need to give you the GPU time. “Open-source” simply means something other than what you think.



  • @dandi8 I’m not changing the definition of open-source. And I’m not saying models are magic. Please take your strawmen back. You are the one saying that the dataset is source code, and you have no backing for this argument. I agree that the dataset is the “source for training”, but that doesn’t make it “source code” as per the open-source licenses. And the tools are not the compiler. Just because something was created from something else, that doesn’t turn it into “source code”.


  • @dandi8 Surprise surprise, LLMs are not classic compiled software, in case you haven’t noticed yet. You can’t just transfer the same notions between the two. That’s like wondering why quantum physics doesn’t work the same as agriculture.

    Think of it as a database. If you have an open-source social network, all the tools and code are published and free to use, but the value of the network is in the posts, the accounts, the people who keep coming back. The data in the database is not the source code.


  • @dandi8 But the proof is in your quote. Open source is a license which allows people to study the source code. The source code of a model is a bunch of float numbers, and you can study it as much as you want in Mixtral and others. Clearly a model can be published without the dataset (Mixtral), and also a model can be closed, hosted, unavailable for study (OpenAI). I think you need to find some argument showing how “source code” of a model = the dataset. It just isn’t so.





  • @sunstoned Please don’t assume anything, it’s not healthy.

    To answer your question: it depends on the license of that binary. You can’t just automatically consider something open-source. Look at the license. Meta, Microsoft and Google routinely misrepresent their licenses, calling them “open-source” even when they aren’t.

    But the main point is that you can put a closed-source license on a model trained on open-source data. Unfortunately. You are barking up the wrong tree.


  • @sunstoned @Ephera That’s nonsense. You could write the scripts, collect the data, and publish it all, but without the months of GPU training you wouldn’t have the trained model, so it would all be worthless. The code used to train all the proprietary models is already open-source; it’s things like PyTorch, TensorFlow etc. For a model to be open-source means you can download the weights and you are allowed to use them as you please, including modifying them and publishing them again. It’s not about the dataset.


  • @astro_ray @marvelous_coyote It seems you have an incorrect idea of what open-source means, which is quite sad here in the open-source Lemmy community. Being trained on public domain material does NOT make the model open-source. It’s about the license, meaning what the recipients of the model are allowed to do with it. Open-source must allow derivative works and commercial use, on top of seeing the code, but for LLMs the “code” is just a bunch of float numbers, nothing interesting to see.


  • @cmnybo @marvelous_coyote That’s… not how it works. You wouldn’t see any copyrighted works in the model. We are already pretty sure even the closed models were trained on copyrighted works, based on what they sometimes produce. But even then, the AI companies aren’t denying it. They are just saying it was all “fair use”, they are using a legal loophole, and they might win this. Basically the only way they could be punished on copyright is if the models produce some copyrighted content verbatim.