
Introduction

An exercise in interdisciplinary explication, this section is designed to explain the concepts and processes that underpin natural language processing (NLP) to humanities scholars.

An Example from Henry James

In Youth, J. M. Coetzee’s third fictional autobiography, the narrator comments:

People in James do not have to pay the rent; they certainly do not have to hold down jobs; all they are required to do is to have supersubtle conversations whose effect is to bring about tiny shifts of power, shifts so minute as to be invisible to all but the practiced eye. . . . James wants one to believe that conversations, exchanges of words, are all that matters. Though it is a credo he is ready to accept, he cannot follow it.1


Let’s take one of these supersubtle conversations as the basis of our analysis. Conducting NLP on literary texts is of course notoriously difficult. My purpose here is to set out the general process and logic at play in this activity – both as a starting point for supporting literary critique of NLP more broadly, but also as a means to think through the challenges that literary text poses for NLP and the insights that studying such examples might bring.

As a starting point, I note that I am emphatically not an expert in computational linguistics or NLP; what follows is the result of a couple of years of software coaching, self-directed learning (drawing on some wonderful online resources and communities), and significant help from tolerant colleagues. What I hope to do here is introduce some of the process and conceptual points at stake for readers unfamiliar with NLP. For those readers with a good working knowledge of this material, I would recommend skipping to How to Read a Chatbot.

I have chosen a passage from the end of James’s novel The Wings of the Dove (first published 1902). While the plot of this novel does turn on financial motivations, it is generally regarded as evincing precisely the supersubtle conversations that Coetzee’s narrator attempts to emulate. The novel as a whole is representative of James’s so-called late style (complex sentences, nuanced dialogue, and famously dictated2).3 The conclusion to the novel turns on the reader’s interpretation of the dialogue. It is therefore a semantically significant (if not necessarily representative) passage.

This is the extract as it appears in the 1909 Charles Scribner edition, as presented in Project Gutenberg’s HTML file:

He hesitated – as if there had been many things. But he remembered one of them. “Stupendous?”

“Stupendous.” A faint smile for it – ever so small – had flickered in her face, but had vanished before the omen of tears, a little less uncertain, had shown themselves in his own. His eyes filled – but that made her continue. She continued gently. “I think that what it really is must be that you’re afraid. I mean,” she explained, “that you’re afraid of all the truth. If you’re in love with her without it, what indeed can you be more? And you’re afraid – it’s wonderful! – to be in love with her.”

“I never was in love with her,” said Densher.

She took it, but after a little she met it. “I believe that now – for the time she lived. I believe it at least for the time you were there. But your change came – as it might well – the day you last saw her; she died for you then that you might understand her. From that hour you did.” With which Kate slowly rose. “And I do now. She did it for us.” Densher rose to face her, and she went on with her thought. “I used to call her, in my stupidity – for want of anything better – a dove. Well she stretched out her wings, and it was to that they reached. They cover us.”

“They cover us,” Densher said.

“That’s what I give you,” Kate gravely wound up. “That’s what I’ve done for you.”

His look at her had a slow strangeness that had dried, on the moment, his tears. “Do I understand then—?”

“That I do consent?” She gravely shook her head. “No – for I see. You’ll marry me without the money; you won’t marry me with it. If I don’t consent you don’t.”

“You lose me?” He showed, though naming it frankly, a sort of awe of her high grasp. “Well, you lose nothing else. I make over to you every penny.”

Prompt was his own clearness, but she had no smile this time to spare. “Precisely – so that I must choose.”

“You must choose.”

Strange it was for him then that she stood in his own rooms doing it, while, with an intensity now beyond any that had ever made his breath come slow, he waited for her act. “There’s but one thing that can save you from my choice.”

“From your choice of my surrender to you?”

“Yes”—and she gave a nod at the long envelope on the table – “your surrender of that.”

“What is it then?”

“Your word of honour that you’re not in love with her memory.”

“Oh – her memory!”

“Ah” – she made a high gesture – “don’t speak of it as if you couldn’t be. I could in your place; and you’re one for whom it will do. Her memory’s your love. You want no other.”

He heard her out in stillness, watching her face but not moving. Then he only said: “I’ll marry you, mind you, in an hour.”

“As we were?”

“As we were.”

But she turned to the door, and her headshake was now the end. “We shall never be again as we were!”4


In order to conduct NLP on this extract, we need to go through a number of steps. These steps should not necessarily be considered as temporally discrete, but rather as entailing certain types of activity.

Step 0.5

Before we begin, I want to flag that we have already made a number of interpretative decisions. Above, I gave you a rationale for my choice of passage. I have selected what I think to be an example of James’s “supersubtle conversations.” I have also selected the edition and the length of the passage, selections that you may or may not agree with. These decisions will affect the analysis that follows.

Step 1

This stage, otherwise known as preprocessing, involves preparing the text, or raw data, so that it is machine-readable. The activities can include:

  • Removing “noise”: Depending on your interpretation, noise might include whitespace, line breaks, and blank lines; text file headers and footers (which I have already removed in the above passage); markup and metadata, etc. We might even use an expansive definition of noise and remove words which we regard as low value in this context, for example prepositions.
  • Normalizing text: This is, again, quite task- and goal-specific and might entail changing all letters to the same case (in this example from upper- to lowercase) in order to avoid the program treating, for example, T and t as separate characters. It can also involve the removal of nonalphanumeric characters and diacritics; italics (which appear frequently in the above) and bold; or contractions (of which James is very fond in our passage). We might also want to remove “stop words,” common vocabulary that could be regarded as carrying less meaning within the data set—in the case of English: prepositions, pronouns, etc.
  • Tokenization: “the task of cutting a string into identifiable linguistic units that constitute a piece of language data.”5 Tokenization produces a list of words and punctuation. This can be done at a word or sentence level. For some texts, tokenization is tricky – for example, raw text data from spoken language might not demarcate word boundaries, or written text might include punctuation that confuses the tokenizer, e.g., the full stops in U.K.
  • Punctuation removal: Depending upon our analytical aims, at some point we might decide that having tokenized our string into words, we no longer need punctuation.
  • Stemming or lemmatization: The former involves stripping affixes from words; the latter reduces words to their dictionary forms, or lemmas. These processes are not standardized, and the numerous programs available for performing them produce varying results.
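
Sketched in code, step one might look like the following. NLTK (the toolkit cited in the endnotes) supplies robust versions of each operation; this standard-library sketch, with a deliberately tiny, hand-rolled stop-word list, is only meant to make the logic visible:

```python
import re

# A deliberately tiny, hand-rolled stop-word list (illustrative only;
# NLTK ships a much fuller one as nltk.corpus.stopwords).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "he", "she",
              "it", "his", "her", "that", "for", "but", "had", "as"}

def preprocess(text):
    """Normalize, tokenize, and filter a raw string into word tokens."""
    text = text.lower()                    # normalization: case-folding
    tokens = re.findall(r"[a-z']+", text)  # tokenization: crude word split
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal

print(preprocess("He hesitated—as if there had been many things."))
# ['hesitated', 'if', 'there', 'been', 'many', 'things']
```

Swapping in NLTK's tokenizers, stop-word corpus, and stemmers would change the output considerably; even these apparently mechanical steps embed interpretative choices.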

At the end of step one, the extract might look something like this:

hesitated many thing remembered one stupendous stupendous faint smile ever small flickered face vanished omen tear little less uncertain shown eye filled made continue continued gently think really must afraid mean explained afraid truth love without indeed afraid wonderful love never love said densher took little met believe now time lived believe least time change came might well day last saw died might understand hour kate slowly rose densher rose face went thought used call stupidity want anything better dove well stretched wing reached cover cover densher said give kate gravely wound done look slow strangeness dried moment tear understand then consent gravely shook head no see marry without money will not marry not consent not lose showed though naming frankly sort awe high grasp well lose nothing else make every penny prompt clearness smile time spare precisely must choose must choose strange stood room intensity beyond ever made breath come slow waited act one thing save choice choice surrender yes gave nod long envelope table surrender word honour love memory oh memory ah made high gesture not speak could not could place one memory love want heard stillness watching face moving said marry mind hour turned door headshake end shall never

As you can see, the preprocessing stage entailed making numerous decisions and assumptions on the basis of the research question we are trying to answer. It also entailed the loss of a number of elements that the literary scholar might consider vitally important. James’s use of italics to denote inflection, his precise punctuation, his tenses, and his frequent use of contractions will all likely affect a scholar’s interpretation of the meaning of the passage as they conduct close reading. However, the process might also suggest new ways of analyzing the text data.

Step 2

At this stage, we might decide that we want to categorize our tokens in order that we can better manipulate them. In the above example the removal of stop words has resulted in the loss of pronouns, which in the case of this conversation-based passage, has significantly reduced the visibility of the subjects. We have also lost sentence structure and stems. As we move to step two, I will therefore reinsert the stop words and punctuation, reverse the lemmatization, and tokenize at the level of sentence, rather than word, in order to explore different processing possibilities.

In this step we might use:

  • Part-of-speech (POS) tagging: This process enables us to process a sequence of words according to their syntactical status; it sets us up to conduct analysis based upon the syntactical features of the text. For example, the first few words will print like this:
('He', 'PRP' [personal pronoun]), ('hesitated—as', 'VBZ' [incorrectly marked as verb, third person singular present thanks to punctuation reading error]), ('if', 'IN' [preposition/subordinating conjunction]), ('there', 'EX' [existential there])
  • Chunking/chinking: The former segments and labels multitoken sequences. We can do this in a number of ways: for example, we could chunk the text into noun phrases, proper nouns, or via other regular expressions. Chinking involves the removal of multitoken sequences from a chunk.
  • Bag-of-words model: Creating such a model involves the casting aside of word order in order to produce a count of all the word tokens in the text, for example:
he 6
hesitated 1
as 6
...
stupendous 2
love 5
she 15
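
In code, a bag-of-words model is little more than a frequency table. Assuming tokens from one sentence of the passage, Python's Counter produces it directly:

```python
from collections import Counter

# Tokens from one sentence of the passage, word order still intact.
tokens = ["she", "took", "it", "but", "after", "a", "little",
          "she", "met", "it"]

# The bag-of-words model: discard order, keep only frequencies.
bow = Counter(tokens)

print(bow["she"])  # 2
print(bow["it"])   # 2

# For the POS example above, NLTK's nltk.pos_tag(tokens) would
# return the corresponding (word, tag) pairs.
```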

This step opens up new ways of approaching the text. In applying the first two processes, it is now much easier to examine certain aspects of James’s lexical choice and phrase formation. We can also conduct statistical comparison of syntactical features of this passage against other texts – whether James’s earlier novels; modernist texts often characterized by their supposedly innovative and spare prose; that other conversational novelist, Edith Wharton; or indeed Coetzee’s writing in Youth.

If we are using a bag-of-words model that disregards word order entirely, we can use the table to examine, for example, word frequency. The table enables us to render word frequency numerically and then easily compare this data against other linguistic corpora. We might note in the above, the frequency of the female pronoun in comparison to the male – indicative of the syntactically confusing, but thematically significant, conflation of Kate and Milly in this passage. Alternatively, we might choose to compare Henry James’s representation of conversation with the spoken dialogue contained within the British National Corpus (BNC), or another similar database. Such a comparison could potentially enable us to develop a quantitative definition of the “supersubtle” features that mark James’s conversation.

The bag-of-words model is also commonly used to train machine learning algorithms and forms the basis of Bayesian email spam filters, which compare the frequency of various words in the email (e.g., viagra) with a comparison corpus (incidentally, this is why nineteenth-century novel data sets – freely available and rarely containing spam-filter keyword frequencies – were often included in “litspam” emails in the 1990s by spammers trying to circumvent this issue6). Henry James as spam.
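
A minimal sketch of how such a filter works, using toy word lists as stand-ins for real training corpora: a naive Bayes score over bag-of-words counts, with Laplace smoothing so that unseen words do not zero out the result.

```python
import math
from collections import Counter

# Toy "training corpora" (hypothetical, for illustration only).
spam_tokens = ["viagra", "free", "offer", "viagra", "money", "free"]
ham_tokens  = ["meeting", "report", "dove", "memory", "love", "marry"]

spam_counts, ham_counts = Counter(spam_tokens), Counter(ham_tokens)

def spam_score(tokens, alpha=1.0):
    """Sum of log-likelihood ratios under a naive Bayes bag-of-words
    model; positive means the tokens look more like the spam corpus."""
    vocab = set(spam_counts) | set(ham_counts) | set(tokens)
    n_spam, n_ham = sum(spam_counts.values()), sum(ham_counts.values())
    score = 0.0
    for t in tokens:
        p_spam = (spam_counts[t] + alpha) / (n_spam + alpha * len(vocab))
        p_ham = (ham_counts[t] + alpha) / (n_ham + alpha * len(vocab))
        score += math.log(p_spam / p_ham)
    return score

print(spam_score(["free", "viagra"]) > 0)   # True: leans spam
print(spam_score(["dove", "memory"]) > 0)   # False: leans ham
```

A real filter would be trained on millions of messages, but the mechanism is the same, which is why padding a message with novel text shifts the score.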

Step 3

To (I hope) state the obvious: although we have now prepared the text, we still need to analyze the data and interpret findings (as I began to do with the pronouns above). The decisions entailed in stages one and two will likely be determined by the data, the aims of the researcher, and the kind of question she is attempting to answer. The number of these decisions, and – for literary scholars – the degree to which they can seem to disregard or lose aspects of the text we consider fundamental to analysis, should indicate that NLP is itself an interpretative process. The rendering of text into data (numerical or otherwise) entails translation and modeling.

Crucially, our analysis here is further complicated by our interest in conversation. This passage has more than one speaker; it contains within it a representation of dialogue. We have not, in the above, captured that. We need therefore to incorporate a model of conversation into our text processing. We might want to distinguish between dialogic utterances and surrounding descriptive comment (“Densher said,” “With which Kate slowly rose.”). We could tag lines of dialogue according to their speakers, conducting sentiment analysis on each character’s speech separately. We might, alternatively, want to analyze each line of dialogue as a response to a specific input (turn-taking). So much of the power of this passage comes from the verbal echoes and the shifting semantics created through repetitions: “As we were?” / “As we were.” Our processing might therefore want to flag this modeling when analyzing the passage.
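
As a first, crude pass at such a model, one might extract quoted utterances and guess speakers from the name that follows the closing quotation mark. The pattern below is an illustrative assumption, not a robust parser, and it fails exactly where James withholds speech tags:

```python
import re

# Two lines from the passage that do carry speech tags.
passage = (
    "“They cover us,” Densher said.\n"
    "“That’s what I give you,” Kate gravely wound up."
)

# Capture the quoted utterance and, where one follows, the next word
# as a crude speaker guess.
pattern = re.compile(r"“([^”]+)”(?:,?\s+(\w+))?")

for utterance, speaker in pattern.findall(passage):
    print(speaker or "UNKNOWN", ":", utterance)
```

Run on the second half of the extract, where names give way to pronouns, almost every utterance would come back tagged UNKNOWN, which is precisely the modeling problem the passage poses.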

Notably, in this passage, there is a significant lack of speech tags, and in the second half of the extract we don’t even have names, only pronouns. The reader must therefore follow the order of the passage closely so as to keep track of who is speaking. The effect of this is to hugely increase the mental burden on the reader, adding to the intensity of the passage. Capturing these features in our NLP analysis requires modeling of state change or context. In a chatbot, for example, we need to incorporate both short-term memory and context switching functions in order that our chatbot can respond to implicit contexts (e.g., the temporal references to “now” in the passage above) and also reset as the implicit contexts change. The mental burden on the reader in the passage is the result of having to maintain and repeatedly update their short-term memory across numerous lines of dialogue. Easy for a computer, slightly harder for a human reader.
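
The short-term memory and context-switching functions described above can be sketched as a small state object; the class and its methods here are hypothetical, for illustration only:

```python
class DialogueContext:
    """Minimal short-term memory for a turn-taking system: remember
    the last few referents, and reset when the implicit context shifts."""

    def __init__(self, window=3):
        self.window = window
        self.referents = []  # most recently mentioned entities

    def update(self, entity):
        self.referents.append(entity)
        self.referents = self.referents[-self.window:]  # forget older turns

    def resolve(self, pronoun):
        # Crude resolution: a pronoun points at the most recent referent.
        return self.referents[-1] if self.referents else None

    def reset(self):
        self.referents = []  # context switch: start afresh

ctx = DialogueContext()
ctx.update("Kate")
ctx.update("Milly")
print(ctx.resolve("she"))  # Milly
```

Even this toy version shows why the passage taxes a human reader: resolving each “she” requires an up-to-date memory of who was last in view, and James keeps moving the referent.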

Modeling conversation as part of our NLP of this passage, although posing lots of questions, is relatively straightforward. This is thanks to a history of intellectual cross-fertilization between NLP and linguistic analysis of conversation. Since the 1980s in particular, human-computer interaction (HCI), which studies the design of computer technologies and the ways in which humans interact with them, has drawn heavily on the work of scholars working in conversation analysis and discourse analysis (and vice versa). Such work on interactive communication has been crucial for the development of today’s chatbots and NLP in general.

By contrast, in literary studies, conversation has often taken a backseat in scholars’ analysis (Bakhtin’s dialogic imagination notwithstanding). Though literary texts offer a written representation of conversation, rather than its oral (or computer-mediated) embodiment, they provide models of dialogue that can be instructive when examining our cultural assumptions around what literature should do in the world. They deserve their own formal analysis, and NLP can offer important insights.

Much of the pushback against literary computing in recent years has focused on the shortcomings of such models and, indeed, on the notion that an activity entailing such lengthy preprocessing might usefully supplant close reading.7 I would instead argue that, rather than being put off by an illusion of objectivity, neutrality, or quantification, literary and cultural scholars might embrace the critique and interpretation that such a process should properly entail. We need not abandon traditional methods of reading within the discipline, but in a world in which NLP is ever expanding, our expertise is valuable and sorely needed.

Endnotes

  1. J. M. Coetzee, Youth (London: Vintage, 2003), 64–65.
  2. James employed an amanuensis, Theodora Bosanquet, from 1907 until his death in 1916; many scholars have commented on the effects of this upon his style. See Theodora Bosanquet, Henry James at Work (London: Hogarth Press, 1924).
  3. Shawna Ross makes an interesting case for parallels to be drawn between social media speech and Henry James’s late style, both of which utilize compression. Shawna Ross, “Hashtags, Algorithmic Compression, and Henry James’s Late Style,” Henry James Review 36, no. 1 (February 3, 2015): 24–44, https://doi.org/10.1353/hjr.2015.0005.
  4. Henry James, The Wings of the Dove, Project Gutenberg (New York: Charles Scribner’s Sons, 1909), http://www.gutenberg.org/files/30059/30059-h/30059-h.htm.
  5. Chapter 3 in Steven Bird, Ewan Klein, and Edward Loper, “Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit,” accessed June 12, 2019, http://www.nltk.org/book/.
  6. Finn Brunton, Spam: A Shadow History of the Internet, Infrastructures (Cambridge, Mass.: MIT Press, 2013), 143–51.
  7. For a recent overview of these discussions, see Nan Z. Da, “The Computational Case against Computational Literary Studies,” Critical Inquiry 45, no. 3 (March 1, 2019): 601–39, https://doi.org/10.1086/702594.

Bibliography

  • Bird, Steven, Ewan Klein, and Edward Loper. “Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit.” Accessed June 12, 2019. http://www.nltk.org/book/.
  • Bosanquet, Theodora. Henry James at Work. London: Hogarth Press, 1924.
  • Brunton, Finn. Spam: A Shadow History of the Internet. Infrastructures. Cambridge, Mass.: MIT Press, 2013.
  • Coetzee, J. M. Youth. London: Vintage, 2003.
  • Da, Nan Z. “The Computational Case against Computational Literary Studies.” Critical Inquiry 45, no. 3 (March 1, 2019): 601–39. https://doi.org/10.1086/702594.
  • James, Henry. The Wings of the Dove. Project Gutenberg. New York: Charles Scribner’s Sons, 1909. http://www.gutenberg.org/files/30059/30059-h/30059-h.htm.
  • Ross, Shawna. “Hashtags, Algorithmic Compression, and Henry James’s Late Style.” Henry James Review 36, no. 1 (February 3, 2015): 24–44. https://doi.org/10.1353/hjr.2015.0005.