Ethical Sourcing of Datasets for LLMs

One of the key issues with LLMs (Large Language Models) is the ethical sourcing of the datasets used for training. The topic is especially fraught given how Bay Area VC-funded startups operate via Regulatory Entrepreneurship: find a lucrative market, break the law, then get the law changed afterward. We have seen this with Uber, Airbnb, and other startups, and now with OpenAI.

The Authors Guild has published a letter calling on industry leaders to protect writers. There is mounting evidence that both Meta and OpenAI have trained on pirated datasets.

Microsoft's responsible AI guide, published in June 2022, offers no meaningful guidance on using pirated datasets, or on datasets that authors and creators did not consent to being used for AI models. Common-sense recommendations for the ethical sourcing of datasets could start with the following:

  • It is never acceptable to use pirated datasets (even in the rare case where the company supposedly also purchased the book). Profiting from criminal activity is unethical, full stop.
  • Consent takes precedence over profit. Authors and creators should be asked to opt in to training, rather than being forced to opt out or given no choice at all.
  • Transparency is a requirement for LLMs. Commercial vendors should be required to disclose their training sources.
  • Creative professionals and content creators who opt in should be compensated.
  • Exploiting vulnerable people worldwide for “human in the loop” training and exposing them to harmful content for poverty wages is deeply unethical and should not be allowed.
