Episode 29: Learning About Natural Language Processing (NLP)

Today we're taking a deep dive into Natural Language Processing (NLP).

Natural Language Processing (NLP)

What is NLP? It is essentially all of the technology around processing human language, both speech and writing. It's about understanding language. NLP is not vision. It's an enormously broad area, covering things like:

  • Speech synthesis
  • Speech recognition
  • Siri
  • Amazon's Alexa
  • Google Assistant
  • Question and Answer
  • Google Search
  • Search-related applications

Anything language-related is NLP; anything not related to speech or language is not.

NLP and AI

NLP is more machine learning than intelligence-related, but certainly, some of NLP falls under AI. Breaking down language requires things associated with AI. The AI umbrella isn't always useful because intelligence is intelligence regardless.

NLP and Compilers

People look at compilers and see that they can understand computer programming languages but not English. Computer languages fall into a category of artificial languages that were designed to be understood by another program. The compiler wasn't designed to understand programs; the language used to specify programs was designed to be understood by the compiler. That's why natural language is harder to work with: it wasn't designed at all, it mostly evolved.

In an artificial language you can specify exactly what the syntax is, so the syntax can be fully known. You can specify the grammar so it's unambiguous. A compiler will never be able to understand spoken English because English is too ambiguous and too messy. Natural languages can't necessarily be described by context-free grammars.
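To make the ambiguity point concrete, here is a minimal sketch that counts how many parse trees a tiny grammar gives a sentence. The grammar, the word list, and the function name count_parses are all made up for illustration (they are not from the episode), and the counting uses the standard CYK chart technique for grammars whose rules have exactly two symbols on the right-hand side.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word -> possible categories.
LEXICON = {
    "I": ["NP"],
    "saw": ["V"],
    "the": ["Det"],
    "man": ["N"],
    "telescope": ["N"],
    "with": ["P"],
}

# Hypothetical toy grammar: (parent, left child, right child).
RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # "saw ... with the telescope" (the seeing used it)
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # "the man with the telescope" (the man has it)
    ("PP", "P", "NP"),
]


def count_parses(words, start="S"):
    """Count distinct parse trees for `words` with a CYK-style chart."""
    n = len(words)
    chart = defaultdict(int)  # (start index, end index, symbol) -> tree count
    for i, word in enumerate(words):
        for symbol in LEXICON.get(word, []):
            chart[(i, i + 1, symbol)] += 1
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in RULES:
                    chart[(i, j, parent)] += (
                        chart[(i, k, left)] * chart[(k, j, right)]
                    )
    return chart[(0, n, start)]


print(count_parses("I saw the man".split()))                     # 1
print(count_parses("I saw the man with the telescope".split()))  # 2
```

The second sentence gets two trees even under this tiny grammar. A programming language grammar is usually designed so that every valid program has exactly one.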

Bounded context doesn't work. Computers can learn grammar and interpret what people want to search for. You can take a search query and filter out things that aren't useful, restricting the vocabulary. NLP toolkits ship with vocabularies of stop-words like "the."
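As a rough illustration of filtering a search query, here is a minimal sketch. The stop-word set and the function name filter_query are assumptions made for this example; real toolkits ship much longer lists and handle punctuation and tokenization more carefully.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "for", "to", "is", "and"}


def filter_query(query):
    """Lowercase the query, split on whitespace, and drop stop-words."""
    tokens = query.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]


print(filter_query("The history of the compiler"))  # ['history', 'compiler']
```

What remains is a smaller vocabulary of terms that are more likely to matter for matching documents.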

The interesting thing about data sets is how you clean them up. We've been cleaning up numeric data sets for millennia. When it comes to words, stop-words are one thing, but there have to be other words, like adjectives, that are irrelevant to understanding a sentence.

Research Models for NLP and Decision Trees

ALBERT is a Google model based on BERT. It took on SQuAD (the Stanford Question Answering Dataset) and beat human-level performance for question answering within that data set. ALBERT doesn't do simple stop-word removal and term identification. BERT models are enormous and ALBERT is based on BERT.

As NLP improves, stop-word removal and grammar constraints tend not to be the best solution once neural models are in use. Like machine vision models before them, these are big models.

A lot of enterprise NLP uses decision trees. Google has moved towards a statistical model. Statistics have carried us far in computer intelligence, but deep learning neural networks are winning the game right now. It looks like neural methods are going to take over.

Most people aren't trying to solve the general problem. They're trying to do a specific thing. Document categorization is an NLP problem. Some neural methods are starting to show up in document classification, but a lot of it is still done with decision trees and statistical methods. They're fast, easy to understand, and easy to reason about.
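To show why statistical methods are easy to reason about, here is a minimal sketch of document categorization with a word-count (naive Bayes) classifier. This is one common statistical approach, not necessarily what any particular enterprise uses. The training documents, labels, and function names are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per label. labeled_docs is a list of (text, label)."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            # Add-one smoothing so unseen words don't zero out the score.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data.
DOCS = [
    ("invoice payment due amount", "finance"),
    ("quarterly budget payment report", "finance"),
    ("server outage incident report", "engineering"),
    ("deploy server build failed", "engineering"),
]

model = train(DOCS)
print(classify("payment overdue invoice", model))  # finance
print(classify("server build outage", model))      # engineering
```

Every decision here comes down to counts you can print and inspect, which is a large part of the appeal.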

Data Bias

It's important to look at both the inputs and the outcomes when you're choosing the next decision inside a decision tree or a neural model. If your data is biased, your outcome is going to be biased.

It's the user's responsibility to go through the data and weight it. To help remove bias from your data, you can:

  1. Weight data sources differently
  2. Look at and understand your data
  3. Understand the source of the data
  4. Have a diverse team look at the data
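As one possible reading of the first item, here is a minimal sketch of weighting data sources so that an over-represented source doesn't dominate a statistic. The source names, labels, and both function names are hypothetical. Weighting each source inversely to its size is only one of many possible schemes.

```python
from collections import Counter


def balanced_source_weights(records):
    """Give each source a weight so all sources contribute equally overall.

    records is a list of (source, label) pairs.
    """
    counts = Counter(source for source, _ in records)
    total = len(records)
    return {
        source: total / (len(counts) * count)
        for source, count in counts.items()
    }


def weighted_positive_rate(records, weights):
    """Weighted share of records whose label is 1."""
    weighted_sum = sum(weights[source] * label for source, label in records)
    total_weight = sum(weights[source] for source, _ in records)
    return weighted_sum / total_weight


# Hypothetical data: the forum contributes 8 records, the survey only 2.
RECORDS = [("forum", 1)] * 8 + [("survey", 0)] * 2

unweighted = sum(label for _, label in RECORDS) / len(RECORDS)
weights = balanced_source_weights(RECORDS)
print(unweighted)                                # 0.8
print(weighted_positive_rate(RECORDS, weights))  # 0.5
```

The numbers don't tell you which weighting is right. That still depends on looking at the data and understanding where it came from.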

It's important to have peer reviews, studies, and models because we're getting data from all of these different sources and gluing it together.

Instead of relying on data cleansing to remove biased data, we have to work toward understanding how we can mitigate the harm of decisions that are made by algorithms within these systems and figure out how to fix them once they've been identified. In NLP, you can do word golf, or language algebra, with word vectors.
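Here is a minimal sketch of what language algebra with word vectors looks like. The three-dimensional vectors below are hand-made for illustration and are not from any trained model. Real word vectors have hundreds of dimensions learned from text, which is also how they pick up whatever bias that text contains.

```python
import math

# Hypothetical hand-made vectors: (royalty, male, female).
VECTORS = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "prince": [0.7, 0.9, 0.1],
    "princess": [0.7, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [
        x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])
    ]
    candidates = [word for word in VECTORS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(target, VECTORS[word]))


print(analogy("king", "man", "woman"))  # queen
```

The same arithmetic run on real embeddings can surface biased associations, which is one way to identify a problem before trying to fix it.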

Facebook is especially guilty of sending biased news based on your search preferences. The more you click on articles, the more you'll be shown articles from that same view or on that same topic. The more you read, the more you slide down the trail into a sub-group of people who hold the same opinions and assume the same facts.

Facebook, Google, and Apple know what you're interested in based on your search preferences and clicks, so they keep sending you information on those topics. They continue to show you things you are "interested" in and create echo chambers. Showing diverse, non-echo-chamber content is a risk because people are less likely to click on things that don't interest them.

Algorithms of Oppression by Noble and Weapons of Math Destruction by O'Neil cover this topic really well. The filter bubbles are less dangerous in that sense than some of the other stuff that comes out of AI models, such as creditworthiness, mortgage rates, and availability of parole. That stuff is handled by black-box ML algorithms and that's a problem.

NLP and Enterprises

Enterprises can get started by turning NLP on their documents. A problem in large organizations is that they can't find what they're looking for. If you're going to apply NLP in your own organization, pick carefully.

Search is one place where you can apply NLP. By enhancing documents with more sophisticated content analysis, you allow people to search in a much more targeted way. You can build a knowledge graph for your company. Most companies don't have that. They're basically using a souped-up version of Lucene for text search. Once you build your knowledge graph, you can reason about the graph in ways that you couldn't before. That's one of the things that makes Google search so powerful. It's able to reason not only about the things you're searching for, but also about their relationships to other things.
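To give a feel for reasoning about a graph, here is a minimal sketch that stores facts as (subject, relation, object) triples and answers a question by following relations. The people, services, documents, relation names, and function names are all hypothetical. A real knowledge graph would be extracted from your documents and stored in a proper graph database.

```python
# Hypothetical facts a company might extract from its documents.
TRIPLES = [
    ("Alice", "works_on", "Billing Service"),
    ("Alice", "works_on", "Search Service"),
    ("Bob", "works_on", "Search Service"),
    ("Billing Service", "documented_in", "Billing Runbook"),
    ("Search Service", "documented_in", "Search Design Doc"),
]


def build_graph(triples):
    """Index triples as subject -> relation -> set of objects."""
    graph = {}
    for subject, relation, obj in triples:
        graph.setdefault(subject, {}).setdefault(relation, set()).add(obj)
    return graph


def follow(graph, start, relations):
    """Start at one node and follow a chain of relations, hop by hop."""
    frontier = {start}
    for relation in relations:
        frontier = {
            obj
            for node in frontier
            for obj in graph.get(node, {}).get(relation, set())
        }
    return frontier


graph = build_graph(TRIPLES)
# "Which documents cover the things Alice works on?"
print(sorted(follow(graph, "Alice", ["works_on", "documented_in"])))
# ['Billing Runbook', 'Search Design Doc']
```

A plain text index can find documents that mention Alice. The graph can find documents related to her work even when her name never appears in them.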

If a company rolls out this feature, how can they maintain it? This can be done with a relatively small team, as long as it has the right expertise. In order to effectively apply NLP, you need data. You need the ability to access the data in a readable format. If your department has 5 years' worth of memos in PDFs, that's going to be difficult. You have to be able to extract the text. It mostly takes an institutional investment in technology.

IBM, Google, and Amazon all have cloud-based NLP tools that they've been aggressively marketing within organizations. There's no winner yet though.

A Seamless NLP Experience for Customers

In some areas, it's pretty close. Look at the Amazon Echo. Amazon's home products are close as long as you stay in the standard question/command lane. Google Assistant and Siri also handle those standard things well. We expect to see a sea change in the next 5 years in consumer electronics and voice interface systems, primarily led by Google, Amazon, and Apple.

Written by Fahad Shoukat

Fahad has a B.S. in Electrical Engineering and an MBA. He brings more than 15 years of experience in Business Development, Strategy, Sales, Product, and Marketing in various industries such as software development and Internet of Things (IoT). His experiences have led him on an unwavering pursuit to meet thoughtful people and build thoughtful software.