What “Big Data” doesn’t understand about literature could fill a book that it would never read

Hey, remember how the internet was going to end racism?  How the digital revolution would close the gaps between the haves and have-nots?  Maybe eliminate money altogether?

Remember that?

It’s cute when little children assign their toys superpowers.  It’s nothing but trouble when grown-ups do it.

Today we’re told that digital technology will change everything about the study of literature:  quantifying it, taking out all the messy subjectivity, and revealing stunning new insights.

Promises, promises.

The case is made, most recently, by Marc Egnal writing in the New York Times.

“Can the technologies of Big Data, which are transforming so many areas of life, change our understanding of American novels?” he asks.

Notice how no one who asks that question ever says “no.”  It’s a giveaway that we’re playing games rather than engaged in serious scholarship.  Serious scholars do not ask questions to which they are already messianically convinced of the answer, unless it involves ordering off a menu or tenure.

Sure enough:  “After conducting research with Google’s Ngram database, which tabulates the frequency of words used in more than five million books, I believe the answer is yes.”

By tabulating the frequency of the use of words like “submissive, pious, domestic and pure” to describe women, along with “women’s rights,” Egnal claims to have made new discoveries about 19th century American literature.  I won’t go into this too deeply – by all means read it yourself if you want to understand the claim.  (Read the article he’s summarizing if you want an even better understanding: though it’s notable how much less bold his claims are in the academic publication than in the mainstream one.  Another sign of an academic who, deep down, believes in the hype more than the subject.)

Egnal has two problems with traditional literary scholarship.  First, there are just too many books.  It is impossible for one person to read them all.  Second, subjective bias is always a threat.

And certainly, both of those are issues.  But what’s remarkable is that the proposed method – applying “Big Data” to the study of literature – solves neither of them.  In fact, it makes them both worse.

Consider the problem that there are too many books for any person to read.  Absolutely right.  But the solution Egnal proposes is to have literary scholars read fewer of them.  Instead of people reading books and accompanying scholarly studies, he’s having Google search them – which is the equivalent of reading an index, to the extent it can be considered “reading” at all.  The end result is not a single additional book read.

Far from being an improvement to literary scholarship, this is a step down from CliffsNotes.  Go ahead and do a keyword search of the Bible.  Now – can you tell me what it’s about on the basis of that data?  Do a search of The Great Gatsby … for anything you want.  Can you summarize the story?  Do you have a sense of Fitzgerald’s tone?  His sense of humor?  His opinion about human nature?

In fact, you are far more likely to colossally misunderstand a book you haven’t read after drawing conclusions from an Ngram search.

Thus the “Big Data” approach doesn’t solve the problem that not every scholar can read every book;  instead, it suggests that scholars need to draw more conclusions on the basis of a larger number of books they haven’t read.

Are we sure that’s progress?

More to the point:  the more literary scholarship shifts to database expertise, the less time literary scholars will have to read the actual books in question.  If the whole problem is that there’s only so much time to read, then asking scholars to spend more time in databases doesn’t help.  It’s counter-productive.

Once again technophiles, having failed to actually use computers to improve a task, have redefined the task so that it’s limited to something computers can do.  It’s sleight-of-hand, not progress.

The issue of subjectivity in the study of literature is an even more absurd example.  The idea that not reading a book will enhance our objective knowledge of it is a satire worthy of Mencken.

Consider Egnal’s data from the 19th-century novel regarding the role of women.  Based on Google Ngram searches, we know that the use of words like “submissive, pious, domestic and pure” to describe women peaked in the early to mid-1800s, while the term “women’s rights” emerged in 1848 and did not peak until 1884.
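For readers curious what this kind of tabulation actually involves, here is a minimal sketch in Python – an invented toy corpus and a plain word count, not Google’s actual Ngram pipeline – showing how an Ngram-style query aggregates the relative frequency of target words by year.  Note how little of any book survives the process:

```python
from collections import Counter

# Hypothetical toy corpus: (year, text) pairs standing in for dated books.
# These snippets are invented for illustration only.
CORPUS = [
    (1830, "she was pious and submissive, a pure and domestic woman"),
    (1850, "a domestic life, pious and pure"),
    (1884, "the cause of women's rights, women's rights for all"),
]

# The descriptor words Egnal tracks.
TARGETS = ["submissive", "pious", "domestic", "pure"]

def frequency_by_year(corpus, targets):
    """Return {year: {word: occurrences per 1,000 tokens}}."""
    result = {}
    for year, text in corpus:
        tokens = text.lower().replace(",", "").split()
        counts = Counter(tokens)
        total = len(tokens)
        result[year] = {w: 1000 * counts[w] / total for w in targets}
    return result

freqs = frequency_by_year(CORPUS, TARGETS)
```

Everything else about the texts – tone, plot, irony, argument – is discarded before the “data” even exists.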

So … what does that tell us?

Go ahead.  Come up with your theory.  Whatever it is.

Now … prove it.

I mean, this is “objective,” right?  So there has to be some way to demonstrate that your theory for what this data means is better than my theory, or vice versa.  Doesn’t there?  Because that’s what “objective” means – that it’s empirically verifiable by anyone.

But while the list of facts presented … word usage … may very well be objective, none of the conclusions are.  Nor can they possibly be because, and I hate to keep emphasizing this, WE HAVEN’T READ THE BOOKS.

Far from eliminating subjectivity from literary scholarship, Big Data is creating more of it by asking more people to draw more sweeping, unverifiable conclusions about books they haven’t read.

Look, this isn’t complicated:  literary scholarship is what happens when people who have studied the history and context in which books were written read those books deeply and propose ideas about them.  Over time a body of knowledge develops as ideas are proposed, argued over, refined, discarded and … in a few cases … accepted, and then stand the test of time.

At the heart of this is the reading of books by actual people.  Put as many books into databases as you want;  come up with as many grocery lists of words as you like – people reading books deeply will still be the heart of the endeavor.

That techno-utopians can’t see this … can see everything but this … is suggestive of just how little they think about the actual subjects they’re trying to improve.