Computer Vision and Early Copyright History

– Hi all, thanks very much. I’ll be talking today
about, as Peter said, a bit of my dissertation research, which looks at the
introduction of copyright in the 18th century, and
copyright’s influence on the book market at large. Today, I’ll be looking at patterns in derivative artistic practices, so text reuse and image reuse, and how those patterns were shaped by changes in copyright law during the period. So, this is computational
research in the humanities that sort of builds on a lot of prior computational research in the humanities, the earliest example of which I could find is work by Patricia Margaret Took, who has this beautiful
hand-drawn distribution of the number of printers and the number of books published in
the early modern period. So, this is back in the seventies, this is in her dissertation: pioneering work for computational studies of the early book market. Later work by folks like John Feather analyzed the distribution of topics within books published in this period. Subsequent work by Alain Veylit started to compute some of the basic descriptive statistics of books published, let’s say before 1800, in
the early modern period. And then, more recent work
by folks like Michael Suarez, who, again, here he’s
looking at distribution of book lengths over history. But there’s been very little
analysis of the ways in which copyright law may have
influenced the book market. In fact, there’s been
very little quantitative work done generally in the humanities. Humanities papers tend to be
presented in tweed jackets and read from a sheet of paper. There’s very little computing going on. So, to set the stage
for what we’ll discuss, I’m just going to sort of
walk through the relevant historical moments very briefly. Copyright’s invented in 1710,
statutory copyright anyway, which is sort of written law copyright. Before that, it’s all royal proclamations. And so, after 1710, there
are a number of cases that are fought over derivative works, where an author is
abridging, plagiarizing, adapting an earlier existing work. And the plaintiff wins
in all of these cases. The defendant is unable to
defend a derivative publication. Fast forward to 1741, and there’s a very prominent
case, Gyles v. Wilcox, which is fought over a publication by Fletcher Gyles, that we just saw. So, John Wilcox had basically abridged Gyles’ text, and published it. And, Gyles was understandably upset. He took Wilcox to court, and Wilcox defended his
publication on the basis that he had highly
manipulated his source text. He had transformed it in a number of ways, it wasn’t going to influence the marketability of the original text, et cetera. And so, his derivative
publication was identified as a distinct copyrightable work and was found non-infringing. And so, this precedent in 1741 opens up the opportunity for
a number of subsequent cases, all of which involve derivative artistic text reuse that is defensible, that is found non-infringing. So, this is a really sort of
watershed moment in history. I mean, this is the creation of fair use copyright law in 1741. At least for texts. In the image domain, there
was no fair use law until a sort of breakthrough 1785 case, Sayer v. Moore. And, these are the actual images that were disputed in that case. On the left, an image
of this sort of eastern sea coast by Robert Sayer. And, on the right, looking at just the left half of this image, the same coast by John Hamilton Moore. And, if you kind of squint at it, you can see the coloration’s
very different, obviously, but the figuration is highly similar. And, in fact, Moore’s maps
were based on Sayer’s. And this was proven in the case. It’s discovered that
certain misidentified labels in the Sayer map were
actually also transcribed into the Moore map, and that
becomes like an artistic trick. Artists would deliberately encode false information into their pictorial works so that if they were reprinted, they could identify those telltale marks, which is kind of clever. But, in any event, because
this was a sea chart, and because Moore’s
map, the derivative map, made a number of improvements
on the plaintiff’s map, his map was actually found non-infringing. It was found to be a defensible work. And so this creates precedent for a number of subsequent cases, all of which revolve around
derivative image production, where a derivative image is
found to be non-infringing. And so the research question
that this part of my work sort of addresses is, “Does the rise of fair use precedent incentivize, increase, rates of derivative artistic practice?” So, once you get a
precedent for text reuse that says you can infringe
on other people’s works, and your work will be defensible,
it will be monetizable, do we see rises in the frequency of text and image reuse? And, a number of prominent scholars have contributed to this debate. Ronan Deazley, a professor of law, has said essentially yes. And William St. Clair, who’s one of the few who’s tried to actually grab some numbers to analyze this question, has said no, he doesn’t think so. But St. Clair is operating on the order of just about a dozen cases
that he’s kind of analyzing. So, what I want to do is
to address this question sort of at the level of collections. And the data for this comes from, actually, two distinct repositories. There’s a great collection
of 18th century books, called Eighteenth Century Collections Online, or ECCO; those are the red dots down here. And, we can measure the degree to which that sample represents the full corpus size, the full population, by comparing the ECCO publication stats to the English Short Title Catalog. The latter, the ESTC, is a bibliography of all texts known to have
been published in English from 1472, the introduction
of printing, to 1800. So, it’s an absolute sort
of historical backbone against which we can measure
representativeness of a corpus, which is very rare. In more modern times, we have
no idea of the full print run of a given historical
sweep, it’s just too much. But we can see here that ECCO publication rates are highly correlated with the ESTC, so this is a highly representative corpus sample.
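A minimal sketch of that representativeness check, assuming we have yearly publication counts from both catalogs (the counts below are invented placeholders, not the real ECCO or ESTC figures):

```python
from math import sqrt

# Sketch of the representativeness check: correlate yearly publication
# counts in the sample (ECCO) against the full bibliography (ESTC).
# These counts are invented placeholders, not the real figures.
ecco_counts = [120, 150, 180, 260, 310, 400]
estc_counts = [900, 1100, 1400, 2000, 2300, 3100]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(ecco_counts, estc_counts)
print(f"Pearson r = {r:.3f}")  # near 1.0: the sample tracks the population
```

A correlation near 1 across years is what licenses treating the sample as a stand-in for the full population.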
So, here’s a sample page image from ECCO: a page scan that has an image, obviously, and some text. This is kind of a sample ballad, a little poem form. And this is one of roughly 35 million page scans in the ECCO corpus, all of which will be analyzed
in the work that follows. And, there are a number of problems with this data. One of which is that we can
measure representativeness of the ECHO corpus versus the
English Short Title Catalog, but we don’t know what’s missing from the English Short Title Catalog. We don’t know the survival rates of books. A lot of books have kind of
just disappeared from history. And, there’s actually very little work on measuring the probability that a given work would have disappeared, that is, on measuring survival rates of books. And so, if we look at the records of the English Short Title Catalog, which are gathered from over 2,200 libraries worldwide, we can look at the sort
of percent of books over time that are held in just a single library, books that were basically on the verge of becoming extinct. And, you can see it’s up to 60% in the very early period, and things kind of dip down.
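That single-library measure can be sketched as a simple aggregation; the records below are invented placeholders, not real ESTC data:

```python
from collections import defaultdict

# Sketch of the "held in only one library" measure: for each period,
# the share of ESTC-style records whose holdings count is 1.
# These records are invented placeholders, not real ESTC data.
records = [
    {"year": 1500, "holdings": 1},
    {"year": 1505, "holdings": 1},
    {"year": 1510, "holdings": 4},
    {"year": 1700, "holdings": 1},
    {"year": 1705, "holdings": 12},
    {"year": 1710, "holdings": 7},
    {"year": 1715, "holdings": 3},
]

def single_copy_share_by_period(records, period=50):
    """Share of records held in a single library, per `period`-year bin."""
    totals, singles = defaultdict(int), defaultdict(int)
    for r in records:
        bin_start = (r["year"] // period) * period
        totals[bin_start] += 1
        if r["holdings"] == 1:
            singles[bin_start] += 1
    return {b: singles[b] / totals[b] for b in sorted(totals)}

print(single_copy_share_by_period(records))
```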
So, a significant portion of book history is just absolutely missing and can’t be part of the study, but we’ve got decent numbers that have survived. Geographically, there are problems, too. The vast majority of publishing
was happening in London, that’s this sort of
bleeding heart down here. Whereas later in the century,
the later the period, we get a lot of publishing
in Edinburgh and Dublin, which are kind of hotbeds of reprinting, because they weren’t subject to English copyright law permissions. So, printers in Edinburgh and Dublin can reprint sort of wholesale, they can sell records in London. They deliberately misidentify title pages, they lie and say that a record
was published in London, because London publications
were sort of hotter commodities. And then there are kind of intricacies within these different
geographic locations. So, for instance, if we look at optical character recognition quality over the 18th century, in
these different cities, we can see that London had nicer type, it tended to be preserved
a bit better on the page, and so the OCR rates were higher. Whereas a lot of Dublin printers are working with secondhand presses, they have old typefaces, and the imprint quality, the actual ink pressed on the page, is much lower. So, although Edinburgh and
Dublin are kind of, again, these hotbeds of reprinting,
we can’t capture as much from them because the
signal is less strong. So, a lot of problems
with the actual data. That said, we’ll begin
by looking at patterns in derivative text reuse. And, as we’ve observed, this is all optical
character recognition data, so these are page scans from which we’ve inferred
a character sequence. And, OCR rates are
variable over the period. We can see that sort of mean line there, drawn roughly in the 80% region. So, this is to say that
the text reuse algorithm needs to be sufficiently
fuzzy in order to work with these character irregularities. We can’t count on perfect sort of word level transcription rates. And so, the technique used
for this part of the study is called the MinHash algorithm, which allows us to analyze text sequences and derive these hashes, these sort of characteristic fingerprints of the text, such that we can compare the fingerprints of two passages and identify, with a certain probability, whether or not they match. And so, the basic notion here is that we’re going to take each of those 35 million page scans, and we’re going to break it into a sequence of multi-word windows. For each of those multi-word windows, we’re going to break that into a number of three-character sequences. So, for a word like “REPORT”: REP, the first three letters, then EPO, then POR; we have a sliding window going over the characters. We’re going to hash each of these character sequences, and then we’re going to take that hashed representation of the character sequence, and we’re going to use it to update a global signature vector for the window. This vector is just a list of n numbers, and n is a parameter of the MinHash algorithm. So, this might be 128 values, one per hash function, and each hash function works modulo some fixed number, which is just to say kind of clock time: if you hit 12, you go back to one. So, we’re going to be cooking these numbers down, just like a clock, essentially. Each trigram hash is going to update this little vector, keeping the minimum value seen for each hash function, and at the end of the day, we’re going to pass a sliding window over this signature vector, take each of those sub-sequences of values, and add this record to the list of records in which that sub-sequence occurs. So, this lets us analyze text reuse in roughly linear time, which is much better than comparing 35 million page images against 35 million page images. It’s nice to have that optimization.
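A minimal sketch of the MinHash idea described above, simplified: 16 hash functions instead of 128, whole passages instead of multi-word windows, and no banded sliding-window lookup. The toy passages are invented.

```python
import hashlib

NUM_HASHES = 16  # the talk uses 128; fewer here to keep the sketch small

def trigrams(text):
    """Character trigrams, e.g. 'REPORT' -> {REP, EPO, POR, ORT}."""
    text = "".join(text.upper().split())
    return {text[i:i + 3] for i in range(len(text) - 2)}

def hash_fn(seed, gram):
    """Family of hash functions, one per seed."""
    return int(hashlib.md5(f"{seed}:{gram}".encode()).hexdigest(), 16)

def minhash_signature(text):
    """For each hash function, keep the minimum hash over all trigrams."""
    grams = trigrams(text)
    return [min(hash_fn(seed, g) for g in grams) for seed in range(NUM_HASHES)]

def estimated_similarity(sig_a, sig_b):
    """Share of matching slots estimates Jaccard similarity of trigram sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Invented toy passages: b is an OCR-style variant of a, c is unrelated.
a = "Take two pounds of flour and a pint of milk"
b = "Take two pounds of flour and a quart of milk"
c = "An act for the encouragement of learning"

sig_a, sig_b, sig_c = (minhash_signature(t) for t in (a, b, c))
print(estimated_similarity(sig_a, sig_b))  # high: near-duplicate passages
print(estimated_similarity(sig_a, sig_c))  # low: unrelated passages
```

Because two passages only need to share most of their trigrams, not match exactly, this tolerates the OCR character irregularities discussed above.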
And so, with this sort of MinHash technique, we can identify fuzzy text reuse, such as the passages here. We have a recipe. It’s interesting to look at the kinds of things that are plagiarized. Recipe books, song books,
are often plagiarized. Legal books, often plagiarized. It’s highly ironic. And so, we’ll get a chance
to see some of the actual OCR underlying some of this in a second. But, this is the kind of fuzzy text reuse we’re trying to identify. Then, of course, the vast
majority of the aligned passages, the identified instances of text reuse, are solely a function of edition reprints. So, we have three different editions of Robinson Crusoe here; there are 120-something odd… And, in fact, if we look at reprints over the course of the century, almost half of the publishing that was happening during the period was just reprints of a given successful volume. So, we have to use elaborate measures to construct this big graph and connect each of the edition reprints into connected components, which allow us to identify that this region is all Robinson Crusoe, whereas this is an actual plagiarism of Crusoe.
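The reprint-clustering step can be sketched with a union-find structure over matched record pairs; the record IDs and matches below are invented placeholders, not the real alignment output:

```python
from collections import defaultdict

# Sketch of grouping aligned records into connected components with
# union-find, so big clusters of edition reprints (all those Robinson
# Crusoe editions) can be separated from genuine plagiarisms.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

# Invented record IDs standing in for aligned page-scan pairs
matches = [
    ("crusoe_ed1", "crusoe_ed2"),
    ("crusoe_ed2", "crusoe_ed3"),
    ("songbook_a", "songbook_b"),
]

uf = UnionFind()
for a, b in matches:
    uf.union(a, b)

clusters = defaultdict(set)
for record in list(uf.parent):
    clusters[uf.find(record)].add(record)
print(sorted(map(sorted, clusters.values())))
```

Each resulting component is a candidate “same work” cluster whose internal matches can then be discarded as reprints.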
So, if we filter all of this out, we remove those edition reprints, and we analyze the actual rates of derivative publication over the period, we get these pretty smooth curves. We can see that sort
of reprinting behavior grows throughout the century but then kind of reaches
a saturation point. It’s like almost a market saturation, and then there’s no subsequent growth. But, we don’t see a big spike after 1741, after Gyles v. Wilcox,
which we would expect to see had Gyles v. Wilcox influenced
derivative text publication. So, so far it looks like the rise of fair use in the text domain has not influenced text
reprinting practices. What about images? Well, here’s just the
sort of image corpus, how many images are
being printed over time. Several thousand annually, and these are the different image types. Okay, in the sort of processing pipeline, there are a number of challenges posed, one of which is the variance in exposure rates. So, these are two different images, two different page scans,
of the same edition. And, you can see on the left, obviously, we have far less signal
than the one on the right. There are these strange eccentricities of the corpus, as well,
where sometimes two pages will be kind of consolidated
into a single page image. So, these are distinct pages, this right-side page
is one composite image that somehow represents the two images. It’s almost like an automatic
page flipper got stuck. Not sure. Other strange sort of figures
where it looks like someone’s gone in with scissors, and they particularly liked this woman, so they just cut her out and pasted her on some 18th century wall somewhere. So, this is a sort of
reprinting of the image but there’s very little signal
there to capture that with. And then, there’s the more
sort of philosophical problem, which is that you can have two images which are highly similar, and then it becomes
almost a probability game. What is the probability of simultaneous co-discovery of these two images? What’s the probability that
these images were generated independently of each other? Just by virtue of the fact that this is a large space, and there will be random occurrences, where two authors will
generate the same image. And so, even deciding in a number of cases whether or not two images are derivative can be difficult, I think, for humans. So, when we run this image computation, and I’ll say really
briefly that this is… I had the most baroque
labyrinthine pipeline for this. This involved using sort of
vectorized image representations from Inception, keypoint-based image analysis techniques like SIFT and SURF, and image hashing techniques, like a perceptual hash and an average hash, combining all of those features together, and then actually just using an old school support vector machine classifier, given that feature vector. And that was trained on 300,000 observations, image pairs that were binary classified as matching or not matching. I’m happy to share that training data; it was very painful to collect, with the jankiest web app you’ve ever seen.
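As a rough illustration of one of those hashing features, here’s a minimal average-hash sketch in pure Python. The 4x4 pixel grids are invented toy data; the pipeline described above combines this kind of feature with Inception vectors and SIFT/SURF keypoints before the SVM.

```python
# Minimal average-hash sketch: downsample to grayscale, threshold each
# pixel at the mean brightness, and compare hashes by Hamming distance.
# The 4x4 "images" below are invented toy grids, not real page scans.

def average_hash(pixels):
    """pixels: 2D list of grayscale values; returns a bit string."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if v > mean else "0" for v in flat)

def hamming(h1, h2):
    """Number of differing bits; small distance = likely derivative image."""
    return sum(a != b for a, b in zip(h1, h2))

original = [
    [200, 200, 30, 30],
    [200, 200, 30, 30],
    [30, 30, 200, 200],
    [30, 30, 200, 200],
]
# a re-engraved copy: same figuration, slightly different tones
copy = [
    [180, 190, 40, 35],
    [185, 195, 45, 30],
    [35, 40, 190, 180],
    [40, 35, 185, 190],
]

print(hamming(average_hash(original), average_hash(copy)))  # 0: same layout
```

Because the hash keeps only the coarse light/dark layout, it is robust to the exposure and inking differences between scans discussed above.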
Using that data, we can sort of analyze all these images, and we can do a number of things. In the first place, we can identify kind of editorial practices. So, these are three different page samples from the Champions of Christendom, and you can see that the
author kind of changed his mind about the skeleton, he didn’t
quite like this skeleton, so he just keeps upgrading this skeleton in each of the editions. Kind of interesting. But we can also identify a number of very subtle plagiarisms, where an engraver was actually referring to a source image, and
sort of re-crafting it. And so, I spent weeks analyzing
these kinds of image pairs, and playing spot the difference, trying to find exactly
where the subtleties are. But if you look at the
different groups of people, there are two people sitting here, and they’re slightly different
when moving over here. This was a kind of regular occurrence, where an author, an engraver,
would hire someone else to sort of re-craft, re-imagine an image. So, this is a perfectly derivative image, published by a different
author, a different publisher than the original source image. And we can identify these on that scale. On the left, this is an
image by Thomas Jeffreys, and on the right, this is actually a publication by Edmund Burke. As you can see, it’s kind of fleshed out the Jeffreys image. It’s ironic because
Jeffreys was the defendant in a number of image dispute cases where he was actually plagiarizing others. So, there may be a sort of history here, where Jeffreys is plagiarizing and Burke is plagiarizing on top. But if we look at the rates of image reuse over the full sweep of the century, we can see it’s very
minimal in the first place. In fact, only about half of 1% of book publications contain derivative images, plagiaristic images, over the sweep of the century. And there’s no noticeable spike after the Sayer v. Moore case in 1785. So, to return to the original question, “Does the rise of fair use precedent increase derivative artistic practices?” It seems William St. Clair
was right to think not. Thanks very much. (audience applause) (soft music)
