An Interview With CCC’s Babis Marmanis and Catherine Zaller Rowland
In a recent town hall webinar, The Heart of the Matter: Copyright, AI Training and LLMs, CCC (Copyright Clearance Center) made the case that current artificial intelligence (AI) technology infringes copyrights, both in the way its underlying large language models (LLMs) are developed, or trained, and in the outputs it enables users to create. The presentation wove together technical principles and legal arguments. Technical principles were presented by Babis Marmanis, CCC’s EVP and CTO, while Noam Shemtov, professor at Queen Mary University of London’s School of Law, and Daniel Gervais, Vanderbilt Law School professor, focused on the relevant legal issues. Catherine Zaller Rowland, CCC’s VP and general counsel, moderated. You can read a detailed account of the webinar in the May 2024 issue of Information Today.
A few days after the webinar, I met via Zoom with Marmanis and Rowland. We explored the webinar topics in more depth, and the following is our edited and abridged conversation.
Dave Shumaker: It’s a pleasure to be speaking with you. I’d like to start the conversation by following up on the technology concepts covered in your webinar. What is it about the technology involved in LLMs that creates a copyright concern?
Babis Marmanis: LLMs would not work without word embeddings. Word embeddings can be used outside LLMs, and have been for a number of years. But LLMs use contextual word embeddings, preserving words that come before and after the given word. That’s important for LLMs because their task is to predict the next word. The sequencing was not as dominant and pronounced before LLMs. Indexes use word embeddings with the goal of retrieving the most relevant documents. In the generative AI context, you get the ability to reproduce the text of the document.
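To make the point about sequence and next-word prediction concrete, here is a minimal sketch. It is not an LLM and not anything CCC or the interviewees described: it is a toy next-word predictor that keys on the two preceding words, and the function names (`train_next_word_model`, `generate`) and the training sentence are hypothetical, chosen only for illustration. It shows how a model that learns which word follows a given context can end up reproducing its training text when prompted with the opening words.

```python
from collections import Counter, defaultdict


def train_next_word_model(text, context_size=2):
    """Count which word follows each run of `context_size` words."""
    words = text.split()
    model = defaultdict(Counter)
    for i in range(len(words) - context_size):
        context = tuple(words[i:i + context_size])
        model[context][words[i + context_size]] += 1
    return model


def generate(model, prompt, max_words=20):
    """Greedily append the most frequent next word for the current context."""
    words = prompt.split()
    context_size = len(next(iter(model)))  # length of the stored context keys
    for _ in range(max_words):
        context = tuple(words[-context_size:])
        if context not in model:
            break  # unseen context: the toy model has nothing to predict
        words.append(model[context].most_common(1)[0][0])
    return " ".join(words)


# Hypothetical one-sentence "training corpus" (an assumption for illustration).
training_text = "the quick brown fox jumps over the lazy dog near the river bank"
model = train_next_word_model(training_text)
output = generate(model, "the quick")
print(output)
```

Because every two-word context in this tiny corpus has exactly one continuation, prompting with the first two words regenerates the whole sentence. Real LLMs use learned contextual embeddings over far larger contexts and corpora, so this is only an analogy for why preserving word order makes reproduction possible.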
Shumaker: How does CCC’s business model relate to the copyright questions about AI? What role might CCC play in licensing for AI applications?
Catherine Zaller Rowland: CCC was established in 1978, coinciding with the effective date of the 1976 Copyright Act. Since our founding, we’ve been putting users and rightsholders together, so people can find good ways to license the works that they need. This started with photocopies in our early days, and over time, as technology advanced, we introduced new licensing, like licensing for digital rights and for text and data mining. We offer various licensing models. We have a collective, or blanket, license covering thousands of publishers and millions of items. These are voluntary and nonexclusive, so that the rightsholders and users can also enter into direct licenses. We also have transactional licenses that cover pay-per-use reprints and similar uses, and we offer software. Our software suites help people manage their rights and permissions and help rightsholders manage their works, as well.
As for AI, we see that there’s a need that can be addressed by both collective and direct licenses. We’re developing our approach, so we don’t have anything to announce right now. We’re listening to rightsholders and users, learning about their needs and determining how we can fit in. More news will be coming soon.
Shumaker: Does the AI environment require changes in CCC’s current licensing approach?
Rowland: I’d put it differently. Our licenses evolve as technology evolves. Throughout our history, we’ve kept pace with changing technologies and needs. That’s what we’re doing here.
Shumaker: Are there really two distinct needs here: one being the need for licensing to enable training the LLMs and another to enable the use and various outputs of the AI systems?
Rowland: Yes, different rights and issues might arise with the inputs used to train the models versus the outputs. We’re developing solutions, but we’re not in a position to say more right now.
Shumaker: I’m also curious about CCC’s role in any of the current litigation. Are any of the plaintiffs in the recent lawsuits also participants in CCC licensing? Does CCC ever file briefs in lawsuits such as these?
Rowland: Yes. For example, The New York Times stated in its complaint that it licenses some of its content through us. But we don’t keep track of this, and we don’t get involved in their litigation decisions. As for filing briefs, we’ve done this on occasion in the past. Most of the time, the filing of amicus [“friend of the court”] briefs doesn’t happen until later stages of the litigation, so we’re not at the stage of considering that right now.
Shumaker: What’s the status of social media content or content made available without charge by policy think tanks, public interest groups, and the like? Does CCC license these types of content now? Would they be included in licensing for AI uses, as we’ve been discussing?
Rowland: You may have seen in the news that social media companies are determining how they want to license their content, whether it involves their own rights or rights they may have gathered through user agreements. That’s something done through them.
Marmanis: There’s also a long tail of bloggers, aggregators, and so on. Some of them license through us. There are about 12,000 so-called publishers—people who provide content and license it through CCC. But our licensing is dominated by scholarly publishing.
Shumaker: What about Creative Commons licensing? I recall seeing a claim that if a for-profit LLM developer ingests content without permission that is covered by a Creative Commons Non-Commercial license, that would be a violation. Do issues around Creative Commons licensing affect your work?
Marmanis: One of our products is a system to support the honoring of agreements between publishers and institutions. Called RightsLink for Scientific Communications, it covers a wide range of publishers and tens of thousands of articles every year. We facilitate the processing of content and licenses, even though the license in this case is not our license, but a Creative Commons license. In the case of AI, I think both NC (noncommercial) and ND (no derivatives) provisions may be involved.
Rowland: Creative Commons licenses are at issue in some current litigation. I’d add that there’s an attribution argument: There’s concern over licenses that include an attribution requirement. Moreover, outside the U.S., there are issues relating to moral rights and other considerations.
Shumaker: While CCC is understandably focused on copyright issues, there are other issues of concern related to AI—such as perpetuating biases, spreading disinformation, “hallucinating,” or providing false information. How do CCC and copyright intersect with these other issues?
Marmanis: Responsible AI includes licensing copyrighted materials when used and also transparency with regard to the content used in training. Our position is that this is essential to ensure accurate, ethical models. If content is licensed properly and training is transparent, a lot of those issues can be, if not resolved, at least attributed to their source. So, these issues can be tackled simply by being respectful toward copyright.
Rowland: It’s part of the solution.
Shumaker: Before we close, is there anything else you’d like to discuss?
Marmanis: I’d simply say that we look at copyright the way it was expressed in the U.S. Constitution. Its purpose is the promotion of science and useful arts. Respect for copyright advances both science and art. Technology will continue evolving, and in 10 years, we may be dealing with other new developments. But copyright has been essential and will continue to be. It incentivizes authors to create, and if it were taken away, we would have a problem. It’s my personal view that copyright is essential to moving ahead with technology.
Rowland: I agree, and I’d add that copyright isn’t an impediment to progress; it’s a vehicle for progress. And it’s really important for us to remember that as we develop new solutions.
Shumaker: And that’s a great note to end our conversation. I’ll look out for future announcements from CCC.