DSC #14: Hello, DMCA Exemption#

by Quinn Dombrowski, Rachael Samberg, and Erik Stallman

May 17, 2022

DOI: https://doi.org/10.25740/rx261rg1892

Returning to the cliffhanger#

DSC 7: The DSC and Mean Copyright Law was published in September 2020. It’s still one of my favorite pieces ever on working through fair use. Unlike the Sweet Valley book series that stuck a dramatic cliffhanger at the end of each volume, the DSC tries to wrap up most of the loose ends by the end of each book. But DSC 7 was different: we mentioned this 1998 law, the Digital Millennium Copyright Act (DMCA), that put additional restrictions on the use of ebooks (and other “digital” things, like DVDs) beyond fair use – such as making it illegal to crack the encryption on ebooks so you can use the text for solidly fair use things like research. And every three years, there’s a window where people can petition for exemptions to this rule. Some of them have actually been granted, like one that allows blind people who use screen readers to crack the encryption if necessary to make the books readable by that accessibility device. 2021 was a petition year, and a bunch of DH people along with some friendly lawyers had a plan.

Meet Erik and Rachael#

But I’m getting ahead of myself, as I often do. I’m Quinn, founder of the Data-Sitters Club, which you can read more about in chapter 2. This book is about a brand new (as of late 2021) exemption to the DMCA and what it does, and doesn’t, mean for DH practitioners. We’ll start with what you need to know about what new things you can do thanks to this DMCA exemption. But then we’ll give you a peek inside how the DMCA exemption came about. I had no clue how this actually worked before I got involved with it. Processes like this always feel shrouded in mystery and jargon, leaving one with the impression that it’s not the sort of thing that “regular people” can do anything about. Which is not the case at all! If we want to improve access to doing work like this, we need more “regular people” speaking up about how bad the situation currently is for them. You (yes, you!) can help with this work, and we’ll wrap up with how to take action.

I’m lucky to be joined once again by some amazing lawyers who have been deeply involved in advocating for the DH community: Rachael Samberg and Erik Stallman.

Rachael (UC Berkeley Library) has brown hair, some of which she occasionally bleaches and dips in her favorite color (which is purple). She can have a basic conversation in Finnish. And she is perpetually covered in cat fur because she has the sweetest and funniest kitty, and Rachael and her kitty practice nightly gene-splicing…I mean, er… are best friends. Rachael uses her law degree to promote justice (!!!) and innovation (!!!) as the director of UC Berkeley Library’s Office of Scholarly Communication Services. Confused? Basically, if anyone on the UC Berkeley campus has questions about copyright, licenses, privacy, or ethics in their research, scholarship, or teaching, Rachael’s office can help.

Erik (Berkeley Law) has brown hair that is turning gray and brown eyes that are mostly staying brown. He wears an assortment of drugstore reading glasses because he loses his glasses a lot. He is the Associate Director of the Samuelson Law, Technology & Public Policy Clinic at Berkeley Law. He works with law students to help researchers and other folks because copyright law sometimes solves a problem for some people by creating problems for other people.

“Text Data Mining”, Fair Use & the DMCA#

We describe the Data-Sitters Club as a friendly, colloquial guide to “computational text analysis”, but the terminology that’s caught on in legal contexts is a little different: “text data mining”, or TDM. Rachael defines TDM – mostly for other lawyers and people who aren’t doing this kind of work – as “a research approach where scholars are using automated methods to identify patterns, or extract and describe trends in large volumes of unstructured or thinly structured digital content.” If you’ve never thought about your literary corpus as a “large volume of unstructured digital content”, you’re not alone. For legal advocacy work to be effective, you have to invest a lot of effort into translation – in all directions.

As a quick refresher for anyone who hasn’t revisited DSC #7, copyright law grants the authors of creative works a certain set of exclusive rights (such as adapting the work to another medium – RIP amazing two-season Baby-Sitters Club Netflix series 😭). But there are some limited contexts where other people can do things (without getting any permission!) that usually only the copyright holder could do. One of the most important such contexts is called “fair use”, and crucially for us, “fair use” covers a lot of situations that researchers often find themselves in. There’s no hard and fast set of rules for what is, or isn’t fair use: it’s a gray zone, but sometimes that can work in your favor as a researcher. By the end of DSC #7, we’d worked through the factors that go into determining whether something is fair use, and concluded that the Data-Sitters Club should be covered by it, though we can’t publish our corpus, not even marked-up TEI versions of the graphic novels.

There’s one catch, though, and it comes down to some details from DSC #2: Katia and the Phantom Corpus. The Data-Sitters Club project isn’t going to land us in copyright trouble because we built our corpus by scanning and OCRing the Baby-Sitters Club book series. Making our corpus that way was annoying and time-consuming in a way that wouldn’t work for large-scale projects, but it was also legal. It would not have been legal (that is, it would not have been “fair use”) to purchase ebook copies of the entire series, break the encryption, and use the resulting text files as our corpus, all because of a law called the Digital Millennium Copyright Act (DMCA). Even though creating the Data-Sitters Club corpus itself would have been legal (or fair), the DMCA makes it additionally illegal to break the encryption on encrypted ebooks in order to extract the text (or do anything else), even if you’re doing it for something that would otherwise be covered by fair use.

Now, the DMCA isn’t absolute: there’s a system in place, every three years, for people to request exemptions for certain kinds of situations. 2021 was one of those years. We’ll get to the story in a minute about how the new exemption came to be, but the tl;dr version is that we actually did succeed in getting an exemption granted to break encryption for “text data mining”! 🥳

Except it doesn’t exactly fulfill all our hopes and wishes. 😭

I don’t know anyone better than Rachael for explaining this exemption as it was written, which you can find in the rule from the Copyright Office of the Library of Congress in the Federal Register on October 28, 2021. (There’s a lot in there, because the rule covers multiple different petitions; the good stuff for our purposes is “Proposed classes 7a/7b”.)

“There are now two types of materials that the new exception allows you to break encryption on to conduct TDM: motion pictures (think DVDs) and literary works (think ebooks),” explained Rachael. “The short of all this is that the Copyright Office recommended that researchers be permitted to break encryption for purposes of text data mining on those two categories of materials.”

Sounds great so far, right? But then we get to the fine print.

“Researchers need to meet several conditions, though,” Rachael continued. “First, the encryption can only be broken by a researcher at a nonprofit institution of higher education.”

Independent scholar? Public librarian? Alt-ac working at a humanities think tank? No soup for you with this DMCA exemption!

“Second, the copy of the motion picture or ebook has to have been lawfully acquired, and owned or licensed by the institution that acquired it without any time limitations on access. The ‘no time limitations’ bit typically rules out anything licensed via a streaming service.”

“Tell me more about this ‘owned or licensed by an institution’ bit,” I said. “Does that mean researchers can only use things purchased by the library this way?”

“Not necessarily,” Rachael said. “The rule doesn’t get into the details of what it means to be owned ‘by an institution’. I think reasonable minds may differ here, but one reasonable interpretation comes down to money. If the researcher buys a DVD using their own credit card, and keeps the DVD at their house, and brings the DVD with them if they leave the institution, it’d be hard to argue that the DVD was owned by the institution. But if the researcher buys the DVD using university funds, or grant funds managed through the university, and that DVD is treated like the property of the institution rather than the researcher’s personal property, then it’s reasonable that the DVD is ‘owned or licensed by the institution.’”

Okay, so you have to be a researcher at a non-profit university, and you can’t go cracking your way through your personal DVD or ebook library. What else was in the fine print?

“The third requirement is about what you can do with the content that you’ve decrypted. You can break the encryption only for purposes of doing and verifying your research results. So, you can take the text file you got out of the ebook, run it through an algorithm, and then look at the text file with your eyeballs just to double-check that the algorithm did what you were expecting. But let’s say you also or maybe later on want to closely analyze the text file by reading it to understand it yourself. Reading the book in the text file would be violating the terms of this rule, and you’d once again be running afoul of the DMCA,” Rachael explained.

Frankly, reading ebooks via plain text files is no fun at all – trust me, I’ve tried this with some of the books I’ve scanned, and I wouldn’t recommend it to anyone. But there was one last condition that posed the biggest challenge yet.

“The last and final requirement before a researcher can rely on the new exception,” said Rachael, “is that the institution has to use ‘effective security measures to prevent further dissemination or downloading of the literary works in the corpus’, and limit access only to the persons who are identified as being able to have access for them.”

“‘Effective security measures’, huh?” I raised an eyebrow. “Who decides what counts as ‘effective’?”

“They do actually define that,” noted Rachael. “In this rule, ‘effective security measures’ are measures that have been agreed to by institutions and interested copyright owners. Or, in the absence of measures that have been agreed to by both the copyright owners and the institution, then whatever measures the institution would ordinarily use to keep its own highly confidential information secure. What this means is that, if a researcher gets through the first three hurdles to use this exception, they’ll likely need to work with an information security unit at their library or the IT department at their institution to operate in an environment that complies with those practices around highly confidential information.”

I grimaced. “Highly confidential” in my world means things like social security numbers. We have infrastructure in place that supports moderately-confidential information. It’s a hassle to deal with: you can’t download it to your own laptop, so no working offline. And you have to navigate the high-performance computing (HPC) infrastructure to run your analysis, which takes more than a little getting used to (starting with working from the command line). Currently, the folks running our HPC system at Stanford are very clear that it is not to be used for highly confidential data, though that could change over time. As for alternatives, it’s one thing to set up and run a private, secure computing system for your lab that could accommodate these requirements, if you’re funded by the NSF to the tune of millions of dollars. Humanists don’t generally have that kind of money. At least today, I don’t think I’d be able to make the security requirement work for this exemption… and I work at Stanford, of all places!

Rachael shared my frustration. “If it’s okay to editorialize for a moment…”

I laughed. “I mean, that’s like half of what we do here in the Data-Sitters Club.”

“There are security measures that global research experts agree are sufficient to protect genetic data, health data, all sorts of highly confidential things, but in the pursuit of this DMCA exemption, the publishers who opposed the rule suggested that none of these measures would sufficiently protect their copyrighted text. And that’s how we ended up with a rule that asks institutions to negotiate with publishers when a researcher wants to break DRM for TDM, or to apply measures used for other ‘highly confidential’ information. As a practical matter, negotiating with publishers each time a researcher has a TDM project they want to break DRM on isn’t a likely outcome – and not only because the publishers won’t agree, but also institutions don’t have time to do all the negotiations that would be involved. So these projects would just get shunted into the second category of being forced into a research environment used for ‘highly confidential’ data.”

That was depressing. But the bad news kept coming.

“Let’s imagine that you can figure out all four things, including the security requirement. So you’re good to go with breaking the encryption, right?”

I got a feeling this was going nowhere good.

“There is at least one major practical constraint in relying on the exemption at least for ebooks. When you ‘purchase’ an ebook, you’re not always actually purchasing it. You’re usually just licensing it. And when anyone licenses an ebook, they agree to the terms of the license, which governs what they can do with that ebook. By giving the licensor like Amazon money, you’re entering into an agreement with them, and whether it’s you or the institution making that agreement, the terms and conditions will apply to your use. And the reality is that, whether we’re talking about your license of an individual book from Amazon or the library’s licensing of thousands of DRM-protected eBooks from a vendor, many of these ebook licenses explicitly prohibit breaking the encryption to access the underlying content.” Rachael sighed. “So we went through this whole thing to have the Copyright Office issue an exception to the Digital Millennium Copyright Act that would allow for breaking DRM… and yet we now face a contractual override scenario where, fine, the copyright law says we can do something, but the institution or a researcher using institutional funds has entered into a contract that effectively squashes a right that they would now have had under copyright law. I can tell you that the extent of our ability to negotiate with publishers to remove those contractual restrictions is still TBD, but you’d imagine that it wouldn’t look really promising to undertake those negotiations with powerful publishers who don’t want us to be able to do this at all.”

The legal landscape for DVDs, Rachael noted, looked a lot better after this DMCA exemption – DVDs don’t typically come with additional license terms – but for ebooks? Or at least for the “popular titles” encrypted eBooks that most researchers want to work with?

“Look, to be clear,” Rachael tried to reassure me, “researchers can absolutely do TDM with ebooks, DVDs, or anything else they want to study. The ‘doing of the TDM’ part was and remains fair use. But if those ebooks are also DRM-protected, this new exemption permitting breaking encryption to do the TDM has both formal and practical constraints that can leave researchers at an impasse.”

It was looking like all our work had led us to a dead end.

Mommy, where do DMCA exemptions come from?#

What is the work that goes into a DMCA exemption? How did we end up here, and what does it all mean? For another perspective, I turned to Erik Stallman.

“My first adventure with DMCA exemptions involved security research,” Erik mentioned. “There was this perverse situation, where security researchers can wind up in trouble for disclosing security vulnerabilities in software, because the only way they’d have been able to find them is by breaking DRM. I worked with a group of security researchers to support an effort to make their work a little safer. When I came to the Samuelson Clinic, I got connected with Rachael who’d been looking at the impact of the DMCA on digital humanities researchers. The Clinic’s client, Authors Alliance, was interested in this issue because it represents the interests of authors – particularly academic authors – who want to write articles and books based on TDM research. Authors Alliance occupies a unique space in copyright debates because its members both rely on exclusive rights to protect their work and rely on copyright limitations and exceptions to build on existing works to create new ones. It’s easy to forget, but the whole point of copyright law is ‘to promote the progress of science and useful arts’. For academic authors, the practical restrictions on text data mining posed by the DMCA were inhibiting the progress of research and new creative works. So Authors Alliance sought an exemption. The Library Copyright Alliance and the American Association of University Professors also joined the petition.”

“How do you go about putting a proposal like this together?” I asked.

“We spent a whole semester just talking to researchers,” Erik replied. “We weren’t even thinking about it from a legal frame, we were just focused on identifying people who were impacted by this law. And that’s where we met you! And we encountered this whole community of researchers who were familiar with these issues and frustrated by the constraints on what kinds of research questions they could ask and how they could answer them.”

I still remember that first meeting with Erik and the UC Berkeley law students as if it were yesterday. One of the first meetings that was rerouted to Zoom in March 2020.

“Initially, we asked for a very broad exemption to cover all text and data mining for research purposes,” Erik explained. “That request was accompanied by letters from many different kinds of academics – faculty, grad students, librarians, at research institutions, at smaller institutions. So we filed that, and the entire content industry came out and said ‘Text data mining isn’t covered by fair use and there’s no basis in law for an exemption this broad.’”

I remembered how angry the content industry’s response made me, especially when they called out the Data-Sitters Club use case specifically as one where we shouldn’t be allowed to look at the full text in order to check our results.

“Then we filed the reply comment,” Erik continued. “In which we agreed to narrow the exemption to only cover non-commercial research by academic institutions, libraries, and museums, so no one can use it to build commercial apps or whatever. We tried to address all of the legitimate concerns that the rights holders had, and contour our exemption to the types of projects that were in the letters that scholars had submitted along with our original petition, because we knew the Copyright Office would want to see evidence in those documents.”

I nodded. “So that’s why we had to give away the opportunity for this to apply to independent scholars?”

“That was definitely part of it,” Erik said.

I thought about the back-and-forth of the reply period, the emails from Erik asking those of us who wrote letters if this or that concession would be okay. I gritted my teeth in agreeing to this; it didn’t feel good to write out independent scholars or anyone else in our broader community. But we knew it’d be hard to get anything for anyone out of this, and if the Copyright Office granted the exemption, there could be future opportunities to expand it.

“We really tried to address concerns without undermining the core of the exemption, and we also wanted to be reasonable and act in good faith. Next there was a hearing, where the Clinic students and I appeared on behalf of Authors Alliance and its co-petitioners, and on the other side were lawyers representing the Association of American Publishers, the Recording Industry Association of America, the Motion Picture Association, the music publishers, the video game publishers, the people who make the encryption for DVDs! Even after we narrowed the exemption, they remained uniformly opposed.”

That hearing with the Copyright Office was just like in BSC #14: Hello, Mallory, where all the babysitters ganged up on poor Mallory, refusing to accept her as a knowledgeable babysitter and legitimate member of the club until she passed a series of implausibly hard tasks, like drawing an accurate diagram of the digestive system.

“The rights holders said that the security measures for decrypted literature should be the same ones that the federal government makes its contractors use when handling non-classified information,” Erik added.

I rolled my eyes. “Why not just go for the security standards around nuclear launch codes while they’re at it?”

“We pointed out that it didn’t make sense to be that prescriptive in this case, that’s not how security practices work across universities. Each institution develops and implements its own security controls, particular to the type of data being collected and the risk they’re exposed to,” said Erik.

I remembered this well from my time working on a secure computing environment at UC Berkeley. There are guidelines about the sorts of things that should and shouldn’t be allowed (e.g. no downloading sensitive data to laptops that could be stolen or lost), but the specific hardware and software configurations were left to Research IT to determine.

“So what happened after the hearing?” I asked. “It seemed like things were quiet for a long time.”

“There were the post-hearing letters, there were some ex parte meetings – where everyone had these side-meetings with just them and the Copyright Office to make their case one last time. And then we waited.”

“How did you get the news?” I asked.

“David Bamman was the person who told me about it, and my first reaction was that surely he had misread the Copyright Office’s recommendation. I tend not to be optimistic about these things. And then I read it and – wow, it’s true! And then I read the details, and the whole thing was an emotional roller coaster: surely we lost! No, we won! Wait, maybe we lost…” Erik paused to reflect on it. “I do think it’s really great that we got the exemption, and they didn’t impose a flat ban on viewing corpus content, which was a real risk. But unaffiliated libraries and archives were cut out, and the hardest part of the exemption is the security language.”

“Rachael explained the two routes: either there has to be an agreement between the scholar or university and rights holders, or whatever the university uses to secure its own ‘highly confidential’ data, as if lives were at stake over unauthorized people getting access to literature,” I sighed.

“It’s frustrating,” Erik replied. “Because there are some security controls that experts told us make sense and that people can agree on, like encryption, keeping materials on a secure server, requiring authentication, logging access, things like that. But this goes so far beyond that.”

But there was more. “If you go with your own university’s practices for highly confidential data, the rights holders can request information about the security measures being used. I have never seen an exemption that gives rights holders investigative rights.”

When I read that part of the rule, it definitely led to some texts and Slack messages involving language arguably not suitable for kids reading at the level of the Baby-Sitters Club. Even if we did everything right – even if we had access to the infrastructure to do everything right – what would stop rights holders from making our lives difficult with more questions than Mallory’s chatty youngest sister, Claire Pike? “Give us the height and width of the machine you’re storing our content on! How many blinky lights does it have? When was the last time you updated the operating system? How many people have logged in in the last 29 minutes and 12 seconds? The Copyright Office says you have to tell me, silly-billy-goo-goo!”

And all this would be the best case scenario, if somehow you were able to get around the license restrictions.

Erik sighed. “You have to take a step back and remind yourself what it means to have gotten this exemption at all. The US Copyright Office is acknowledging that computational digital humanities research – text data mining – is fair use! That’s big. And you have to take the long view here: we now have a legal beachhead for this kind of work.”

Where do we go from here?#

Good news! We got a DMCA exemption granted that legitimizes breaking encryption so we can do TDM (which we know is a fair use). Bad news! Almost everywhere that sells the ebooks currently includes contractual language that prohibits us from making use of the DMCA exemption to break the DRM. Even worse news! Even if we did get a better contract, there are some super-difficult security requirements that I’m not even sure I could make happen at Stanford. Also, independent scholars and unaffiliated libraries and archives are shut out altogether. It feels like we’ve landed in the copyright law equivalent of middle school: in theory you have more freedom than before, but in practice, the experience of being there is pretty awkward and miserable (if you’re not a fictional character in the Baby-Sitters Club series… and sometimes even if you are).

But we can fix it, right? The exemption will be up for renewal in 2024 and that will create the opportunity to address the parts that aren’t working or even expand it. Could we address the contracts issue?

“Not so fast,” said Rachael. “Setting permissible terms for contracts is not within the domain of the Copyright Office. They only have the authority to grant or deny exemptions to allow the breaking of encryption – they have no power over the agreements businesses make with their customers. The way to get there is to either negotiate with individual publishers, or try legislation at a state or federal level. That’s what’s happened in Europe – there’s been legislation at the EU level that will be implemented in the laws of individual countries, prohibiting contract terms that interfere with their equivalent of fair use for researchers.”

“Oh good, so we just need to get federal legislation passed. That’s so not happening in the current political climate,” I moaned.

“The thing is, people have tried it at the state level. Maryland got a law passed about ebook contract terms for libraries, but then a court found it to be unconstitutional, and so they’re not pursuing it further,” said Rachael.

But her mention of the situation in Europe gave me an idea. “Could we collaborate with researchers working in more favorable legal environments? More international collaboration is a good thing, right? Now every project in the US that wants to work with ebooks has a great excuse to go find a colleague in the EU to partner with! Would that work?”

“All of this is so new,” said Rachael. “First, EU rules prohibiting contractual override of fair use rights aren’t codified in the laws of all the individual countries yet, and not everywhere also allows breaking DRM as part of those rights. Plus, in terms of EU researchers sharing content with US researchers, what matters is their fine print. If you want to rely on someone else’s DRM exception, that other researcher would need to make sure their own contracts include rights to share materials with collaborators at other institutions, in other countries (which is unlikely). Cross-border text data mining issues are pervasive, and we’re just learning how to navigate them. No one has it all figured out yet; I’m speaking to the World Intellectual Property Organization next week about this, planting the seeds for further discussion about text data mining challenges, both from a copyright perspective and a contractual override perspective. This is a learning process for policymakers, too. And there’s very little that any single scholar or institution can do about any of this. What we need is coordinated action.”

Erik agreed. “There’s not a single, clear path forward,” he admitted. “But there are many different routes, and some of them are mutually-reinforcing. Even a little success in one area may unlock opportunities in another.”

You can help#

We’re not where we hoped to be with this DMCA exemption. It’s a beachhead, not a victory, and our best chance at renewing or improving the exemption is if we manage to use it before 2024 – or demonstrate why it’s currently unusable. But if we’re in the copyright law equivalent of middle school right now, remember what we’ve learned from the Baby-Sitters Club: the best way to survive middle school is with friends.

The Association for Computers and the Humanities (ACH), where Roopsi and I will be co-Presidents starting this summer, has set up an info page along with a mailing list that you can join to connect with other people who care about the intersection of computational text analysis, copyright law, and vendor contracts. That’s where we’ll be organizing, helped by friendly lawyers like Erik who are working on next steps.

Everyone who’s impacted by these issues is welcome – especially if you’re part of one of the groups shut out of the current exemption, either through your affiliation (or lack thereof), or by working at an institution of higher education that cannot provide you with the necessary infrastructure to fulfill the security requirements, even if you had a workaround for the license issue. And if you’re outside the US, we’d love to hear from you, too! Developments in Europe are going to be important for us to follow, both as precedent and as a way to cultivate FOMO among lawmakers – after all, no one wants to hear that US scholars are at a global disadvantage when it comes to cutting-edge research.

Whether you’re a student or librarian, an adjunct or an independent scholar, a technologist or an administrator, there’s a place for you in this work. Together we can make it out of the middle school of copyright rulings, and on to better things!

Suggested citation#

Dombrowski, Quinn, Rachael Samberg, and Erik Stallman. “DSC #14: Hello, DMCA Exemption.” _The Data-Sitters Club_. May 17, 2022. https://doi.org/10.25740/rx261rg1892.