Language, practically probabilistic!

Natural Language Processing (NLP), as a concept and subject of study, has been around for a century at least. Early on, interest in the field came primarily with a view towards developing systems for automatic translation. The United States Government attributed a great deal of money—more than $20 million—to artificial intelligence (AI) research in NLP up until 1966, when frustration with the lack of progress led to widespread cancellation of funding. A report presented to the British Science Research Council attributed AI’s failure to meet expectations to a “combinatorial explosion” problem. In rough mathematical terms, combinatorial explosion refers to a rapid increase in the complexity of a problem with increased scale. Basically, early AI was only able to work within the scope of simple problem sets and couldn’t be scaled up to deal with more complex, real-world systems like language.

Technological advances in the 1980s drew interest back to AI, but the real breakthrough for NLP came in the 1990s. Computer systems were finally powerful enough to store and process large amounts of data, making the collection and analysis of digitised texts (electronic corpora) possible, and researchers could bring statistics to bear on the subject. Since then, probabilistic analysis and modelling have come to dominate NLP applications.

Why? Because they work so well! Formal linguistic theory is too categorical to provide a realistic model for language, which is naturally fluid and tied to discursive context; as a result, it’s often ambiguous. A translator must use reasoning to determine meaning by drawing on learned world knowledge not directly available in the source text.

Here’s the crux: reasoning is probabilistic analysis. And learning is probabilistic modelling

Consider irregular verbs in English and ask yourself, just how do you know that the past-tense form of the verb bring is brought and not bringed? You know it mostly because, since you were born, the majority of English sentences you encountered used brought to communicate a bringing action that occurred in the past—not bringed. Every time you construct a sentence using brought, you are actually making a prediction that it will be understood in the context you choose to use it as the simple past tense of bring. Young children and second-language learners regularise irregular verbs at first because they too make a prediction based on prior experience. However, most of the verb forms they’ve encountered are regular, and they haven’t yet compiled enough data about irregular verbs to form an accurate cognitive language model.

Bayesian statistics frame probability as a measure of the weight of evidence supporting a proposition (e.g., brought is the past tense of bring) and allow for change when new data are presented. These mathematical models represent the fundamental basis of machine learning and AI; they can be used to “teach” computer programs to predict things (like word order, synonyms, etc.) based on prior experience—the same way we do!

Word embeddings, the newest darlings of NLP, are mathematical representations of words. They contain, within a high-dimensional linear vector, information derived from the cumulative textual context of a word. Essentially, a word is “embedded” in continuous vector space according to probabilities associated with the likelihood it occurs in similar contexts as other words. This approach is based on the assumption that words appearing in the same types of contexts also share certain dimensions of meaning. A good embedding will encode semantic information within a word’s position (distance and direction) in vector space; semantically similar words will cluster together, yielding a probabilistic model for language processing.

Multidimensional vector embeddings can even be scaled up to the sentence or document level (to identify phrases sharing similar meanings or texts covering similar topics) or down to the character level (to better model meaning differences encoded by linguistic morphology). The information these vectors comprise also appears to be universal; embeddings can be calculated for massive, multilingual corpora and AI trained to identify similar words and sentences across languages.

But wait, there’s more! Remember that reasoning and learning are probabilistic processes? “There is strong behavioral and physiological evidence that the brain both represents probability distributions and performs probabilistic inference.”

Word embeddings have even piqued the interest of researchers in neurolinguistics, who have evaluated word embeddings by comparing them to physiological cognitive data. These representations appear to predict the neurological response to word meaning quite well, and word embeddings derived from neurophysiological data also appear to perform as well or better on downstream NLP tasks than embeddings produced from textual data.