(This post was originally published on L4digital.com in November of 2017.)
TALKING LIKE HUMANS ABOUT HOW COMPUTERS “TALK”
Language is beautiful. Consider the word ‘palimpsest’—the faint ghosting that remains after a previously-written thing has been erased. ‘Palimpsest’ gives me one word to describe an idea that would otherwise take thirteen.
Language is also hard, especially when it comes to describing abstract linear concepts. This is where math is really useful. In the field of Natural Language Processing (NLP), in which computers are taught to understand and generate words in order to communicate in more human ways, math is the lingua franca.
I recently attended Global Data Geeks’ Data Day Seattle specifically to learn more about NLP, but the one-day conference covered that and more across the spectrum of data science, artificial intelligence (AI), and machine learning (ML). (For a more thorough introduction to AI and ML, check out my colleagues’ “AI for Dummies”.)
Discussions of NLP, AI, and ML go hand in hand because a computer has to learn in order to process language. That happens when we provide a training set of data to the models or algorithms in use and then iteratively refine them until they generally do what we want them to do.
Given such highly technical subject matter, I wasn’t surprised that the presentations at Data Day Seattle were more mathematical than I’m used to. Words popped up repeatedly in talks and conversations that everyone around me seemed to understand, but which threw me for a bit of a loop. By using context clues—something I’m currently better at than a computer, but won’t be for long—I walked out of each session both more knowledgeable and more excited.
My experience at Data Day Seattle prompted me to create a glossary for others like me who are just starting to dive into NLP. This glossary introduces key concepts and terms in a way that (hopefully) even their grandmothers could understand.
As is the case any time you simplify something, there are nuances—many nuances—that have necessarily been left out in favor of accessibility.
COMMONLY USED NLP TERMS DEFINED IN A NON-MATHEMATICAL BUT HELPFUL WAY
Since most of the terms listed here are borrowed from general ML or linguistics, they have general meanings that you can use to help tease out their specialized meanings in the context of NLP. Rather than list the terms alphabetically, as most glossaries do, I’ve listed them by term frequency (TF), with the most frequently used words, based on the corpus of Data Day Seattle talks, defined first.
Machine learning: When given enough information (data), a computer uses algorithms and models to ‘learn’ to make predictions, rather than being told specifically what to do via a program.
Neural network: Inspired by what we know about how neurons work in the human brain, this type of machine learning model is organized in layers. Nodes of data, or ‘neurons’, serve as inputs, and layers of mathematical computations applied to these nodes produce the output layer. The output, or ‘prediction’, is based on the strength of the connections between different nodes. Neural networks with many layers are capable of deep learning, and they are what often makes the news. AlphaGo Zero, for example, recently made headlines for learning the game ‘Go’ purely by playing against itself, without any human game data. After just three days of training, AlphaGo Zero beat its older sibling, AlphaGo, 100 games to 0.
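To make the idea of layers and connection strengths a little more concrete, here is a minimal sketch of a network with two inputs, two hidden neurons, and one output. The weights and biases are hypothetical numbers I picked by hand for illustration; a real network would learn them from training data.

```python
import math


def sigmoid(x):
    # Squash any number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    # Each neuron: weighted sum of its inputs plus a bias, then sigmoid
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical connection strengths (not learned, chosen for illustration)
hidden_weights = [[0.5, -0.6], [0.1, 0.8]]
hidden_biases = [0.0, -0.1]
output_weights = [[1.2, -0.7]]
output_biases = [0.05]


def predict(inputs):
    hidden = layer(inputs, hidden_weights, hidden_biases)
    return layer(hidden, output_weights, output_biases)[0]


prediction = predict([1.0, 0.0])
print(prediction)
```

Changing any weight changes the prediction, which is all that "learning" adjusts in this picture.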
A specialized form of neural network used most often in NLP is the recurrent neural network (RNN). An RNN is a network that can process sequences. What does that mean? Well, traditional neural networks see a sentence simply as a collection of words because they don’t consider the order in which the words appear. But in RNNs, the sequence of the words is retained, allowing the network to have more context and therefore a greater chance at “understanding” the sentence as a whole.
(I have put “understanding” in quotes as a nod to a point that Jonathan Mugan made during Data Day Seattle. While these amazingly powerful networks can identify information and execute actions based on data inputs, Mugan says, there is no indication that they actually understand what they’re doing. Computers can’t pun.)
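Here is a rough sketch of why word order matters to an RNN. It assumes a toy setup of my own: each word is a single hypothetical number, and the "network" is one recurrent step with two fixed weights. An order-blind sum cannot tell the two sentences apart, while the recurrent version ends in a different state for each.

```python
import math

# Hypothetical one-number stand-ins for word embeddings
word_values = {"dog": 1.0, "bites": 0.5, "man": -1.0}


def order_blind_score(words):
    # Ignores order: just adds up the word values
    return sum(word_values[w] for w in words)


def rnn_final_state(words, w_input=0.8, w_hidden=0.5):
    # The hidden state carries information from earlier words forward
    hidden = 0.0
    for word in words:
        hidden = math.tanh(w_input * word_values[word] + w_hidden * hidden)
    return hidden


sentence_a = ["dog", "bites", "man"]
sentence_b = ["man", "bites", "dog"]

print(order_blind_score(sentence_a) == order_blind_score(sentence_b))  # True
print(rnn_final_state(sentence_a) == rnn_final_state(sentence_b))      # False
```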
Embedded/embedding: Embedding is the process of taking words or phrases from a body of work and mapping them to vectors (see next entry for a description of vectors).
Vector: Simply put, word vectors are a shortcut to providing context or additional information about one particular word or phrase. When given context or additional information, a model is able to learn more precisely and more efficiently. What that additional information is can vary.
Here’s one example.
The idea that words used in the same context often mean the same thing was popularized sixty years ago by linguist J. R. Firth in his pithy statement, “You shall know a word by the company it keeps.”
So if our training set contains both “The quick brown fox jumped over the lazy dogs” and “The quick brown vulpine jumped over the lazy dogs”, the model might determine that “fox” and “vulpine” probably mean the same thing. “Fox” and “vulpine” might then be embedded as nearby vectors in order to help the model more quickly process a new data set that contains either of these words.
More technically, a word vector is a point in a mathematical space where words or phrases are mapped in relation to semantically similar words or, put another way, where similar words “are embedded nearby each other”.
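One common way to measure "nearby" is cosine similarity, which compares the directions of two vectors. The sketch below uses hypothetical three-number vectors that I made up by hand; real embeddings are learned from text and usually have hundreds of dimensions.

```python
import math

# Hypothetical hand-made vectors, for illustration only
vectors = {
    "fox": [0.9, 0.8, 0.1],
    "vulpine": [0.85, 0.75, 0.2],
    "lazy": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word):
    # The other word whose vector points in the most similar direction
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


print(nearest("fox"))  # vulpine
```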
Feature: As a product manager, I use the word ‘feature’ quite regularly, but ‘feature’ in the context of NLP requires a slight reframing.
Rather than being a stand-alone piece of functionality, a ‘feature’ in NLP is more closely related to the idea of a distinctive attribute; it’s a value of some parameter within the text that is determined and then used later. An example of an NLP-specific feature would be the number of times a particular word is used in a body of text, or how many proper nouns a text contains. ‘Feature extraction’ is the process by which that feature is determined and made available to the model for use.
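As a rough illustration of feature extraction, the sketch below pulls two such features out of a short text. The function and feature names are my own, and counting capitalized words that do not start a sentence is only a crude stand-in for real proper-noun detection (it would also count the pronoun "I", for instance).

```python
import re


def extract_features(text, target_word):
    words = re.findall(r"[A-Za-z']+", text)
    capitalized_mid_sentence = 0
    for sentence in re.split(r"[.!?]+", text):
        sentence_words = re.findall(r"[A-Za-z']+", sentence)
        # Skip the first word, which is capitalized regardless
        capitalized_mid_sentence += sum(
            1 for w in sentence_words[1:] if w[0].isupper()
        )
    return {
        "target_word_count": sum(
            1 for w in words if w.lower() == target_word.lower()
        ),
        "capitalized_mid_sentence": capitalized_mid_sentence,
    }


features = extract_features(
    "Yesterday Emma walked to Highbury. The walk pleased Emma.", "Emma"
)
print(features)
# {'target_word_count': 2, 'capitalized_mid_sentence': 3}
```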
Bag of words: This is pretty much what you’d think it is: a model wherein all the words in the document or corpus being examined are considered without context, similar to the bag of letters in a Scrabble game. It’s also fun to say.
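A minimal sketch of a bag of words, assuming simple whitespace tokenization: the two sentences below mean different things, but once the words are dumped into the bag they are indistinguishable.

```python
from collections import Counter


def bag_of_words(text):
    # Lowercase, split on whitespace, and count each word; order is discarded
    return Counter(text.lower().split())


bag_a = bag_of_words("The dog chased the fox")
bag_b = bag_of_words("The fox chased the dog")

print(bag_a["the"])    # 2
print(bag_a == bag_b)  # True
```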
Term Frequency – Inverse Document Frequency (TF-IDF): TF-IDF is a method of determining the probable subject of a document by weighing how often a word is used in that document against how common the word is across an entire body of work that includes the document. If a word is used often but only in one document, this indicates a high likelihood that it is meaningful to that particular document (as opposed to words which occur often across all documents, like ‘the’ and ‘and’).
My favorite example of TF-IDF comes from Julia Silge. Silge examined the entire corpus of Jane Austen’s work and determined that the words with the highest TF-IDF scores in each of her books were proper names, indicating, as most Jane Austen fans would agree, that the core meaning of her books comes from her characters.
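The arithmetic behind TF-IDF is small enough to sketch. This uses one common formulation (a word's share of the document, multiplied by the natural log of total documents over documents containing the word); other variants exist, and I am not claiming this is the exact one Silge used. The three-document corpus is a toy one I made up.

```python
import math
from collections import Counter

# Hypothetical toy corpus
documents = {
    "doc_a": "the fox and the hound and the fox",
    "doc_b": "the cat and the mouse",
    "doc_c": "the dog and the bone",
}


def tf_idf(documents):
    tokenized = {name: text.split() for name, text in documents.items()}
    n_docs = len(tokenized)
    # Number of documents each word appears in
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))
    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        scores[name] = {
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


scores = tf_idf(documents)
print(scores["doc_a"]["the"])                          # 0.0
print(max(scores["doc_a"], key=scores["doc_a"].get))   # fox
```

Even though 'the' is the most frequent word in the first document, it appears in every document, so its score drops to zero and 'fox' rises to the top.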
Attention: In the context of NLP, ‘attention’ refers to a model looking at previous steps or bits of information, and using that information to refine the outcome for the current step. The idea is based on human attention mechanisms, whereby a person can focus strictly on one part of an image and more loosely on the parts of the image surrounding that focal point, providing additional context about what they’re looking at. Attention is another key attribute of neural networks that helps them learn faster, and thereby shortens iteration and testing time.
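Here is a minimal sketch of one specific formulation, dot-product attention, which may differ from what any given talk had in mind. The current step (the "query") scores each previous step by similarity, turns the scores into weights that sum to 1, and blends the previous steps accordingly. The vectors are hypothetical.

```python
import math


def softmax(scores):
    # Turn raw scores into positive weights that sum to 1
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, previous_states):
    # Score each previous step by its dot product with the query
    scores = [sum(q * s for q, s in zip(query, state)) for state in previous_states]
    weights = softmax(scores)
    dim = len(previous_states[0])
    # Weighted blend of the previous steps: the "context" for the current step
    context = [
        sum(w * state[i] for w, state in zip(weights, previous_states))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical previous steps and current query
previous_states = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query = [1.0, 0.0]

weights, context = attend(query, previous_states)
print(weights)
print(context)
```

The first and third previous steps resemble the query most, so they receive most of the weight; the second is attended to only loosely.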
N-gram: ‘N-gram’ refers to a contiguous sequence of some number (n) of items from a given document or corpus. Items can be things like letters, phonemes, or words. At the word level, for example, trigram sequences contained in this very sentence include “trigram, sequences, contained”, “sequences, contained, in”, and “contained, in, this”. By storing text in this way, the model can then compare the n-grams and better predict the sequences that it’s parsing.
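Extracting n-grams is a sliding window over the items. A minimal sketch, with a helper name of my own choosing, that works for words and letters alike:

```python
def ngrams(items, n):
    # Slide a window of size n across the sequence
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


words = "trigram sequences contained in this very sentence".split()
trigrams = ngrams(words, 3)

print(trigrams[:3])
# [('trigram', 'sequences', 'contained'), ('sequences', 'contained', 'in'), ('contained', 'in', 'this')]

print(ngrams("fox", 2))
# [('f', 'o'), ('o', 'x')]
```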
SO, NOW WHAT?
Now that you’re on speaking terms with NLP, let’s consider these glossary words and concepts in context.
Almost all of these concepts focus on how to better train the model (“better” can mean making it more accurate, or reducing its training time), which is obviously a key issue for those in the field. Humans use language as a code. (Remember “palimpsest”? One word covers an idea that takes thirteen words to describe.) With NLP, however, language is code. And it’s an exciting field because it’s in its infancy; its capabilities and influence are only going to grow.
What we shouldn’t forget is that our language, the human code, is more than just words. Our language represents underlying images and ideas, and those images and ideas are steeped in our own psychological biases, whether we choose to recognize them or not.
Computers can’t understand that, but they can parrot and reflect it. Microsoft’s Tay.ai made a splash last year when it was removed from Twitter twenty-four hours after it first appeared because it had so quickly picked up racist, sexist, and otherwise offensive language.
Many data scientists and technologists are working hard to recognize and remove inherent bias in the training data we feed these models, and are advocating that the rest of the field do the same. In “She Giggles, He Gallops”, Silge shines a light on how movie scripts reflect our assumptions about gender roles.
All of this makes me wonder: Since NLP reflects our own language back to us so that we can actually see what we’re saying, how might we learn from the very machines we’re trying to teach?
FURTHER LEARNING
For a truer, more detailed understanding of the field, I highly recommend digging into the resources available on the Internet. The NLP community is very open and collaborative, so there’s a lot of information that is freely available. Below are a few that I’ve used, but I’m sure that there are countless more. If there are any resources that you’d like to share, please do so in the comments below.
Attention and Memory in Deep Learning by Denny Britz
arXiv preprint repository
Fast Forward blog
MY DATA DAY AGENDA
Below is the list of presentations that I attended at Data Day Seattle. You can find many of the speakers on Twitter (and I encourage you to do so).
Jonathon Morgan (New Knowledge / Data For Democracy): This is Our Fight: Technology for Defending Public Discourse
Stefan Krawczyk (Stitch Fix): Scaling Data Science at Stitch Fix
Zornitsa Kozareva (Amazon): Conversational Assistants with Deep Learning
Rob McDaniel (Lingistic): Detecting Bias in News Articles
Sanghamitra Deb (Chegg): Evolution of Natural Language Comprehension with Human Machine Collaboration
Julia Silge (Stack Overflow): Text Mining Using Tidy Data Principles
Jonathan Mugan (Deep Grammar): From Natural Language Processing to Artificial Intelligence (slides: https://www.slideshare.net/jmugan/data-day-seattle-from-nlp-to-ai)
Garrett Eastham (Data Exhaust): Bootstrapping Knowledge-Bases from Text