Data Bias and Machine Learning

I have often heard people say, “the data speaks for itself.” This sentiment is not only naive but also dangerous, especially in a world of big data and machine learning. All data is seen through a lens, and the conclusions drawn from the data will change with the perspective of the interpreter.

We know that if we gave an article to one of our right-leaning friends, they could give us a very different interpretation than if we gave the same article to one of our left-leaning friends. The lens that they use to make meaning of the article determines the conclusions they will draw from it. This is no different when raw data is algorithmically interpreted.

When we talk about data science, there can be many types of biases. Most data scientists have some sort of statistics background, so they will think of statistical biases. Those types of biases are well understood and can be controlled for.

However, those techniques really only control for biases which could be contained within the data itself, not within the analytical structure around the data.

Biased Machine Learning Software

A fairly well-known example of potentially biased analytical software is the software which Florida uses to assign risk scores to defendants. Independent analysis suggested that, all other things being equal, black defendants were 77 percent more likely to be marked as high risk. The founder of the software admitted that it is difficult to create a risk score that does not correlate with race in some way; controlling for race reduces the accuracy of the model, which could suggest that there are fundamental problems with the underlying framework.
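As a rough illustration of what "more likely to be marked as high risk" means numerically, the sketch below compares how often two groups are flagged. The records, the group labels, and the function names are all made up for illustration. It is also a much cruder check than the analysis described above, because a raw rate comparison does not hold "all other things equal"; doing that would require a model that controls for the other factors.

```python
from collections import defaultdict

# Entirely hypothetical records: (group label, flagged as high risk).
# These numbers are invented for the example and describe no real system.
records = (
    [("A", True)] * 6 + [("A", False)] * 4 +
    [("B", True)] * 4 + [("B", False)] * 6
)

def high_risk_rates(records):
    """Return the fraction of each group that was flagged as high risk."""
    counts = defaultdict(lambda: [0, 0])  # group -> [flagged, total]
    for group, flagged in records:
        counts[group][0] += int(flagged)
        counts[group][1] += 1
    return {group: flagged / total for group, (flagged, total) in counts.items()}

def rate_ratio(rates, group, reference):
    """How many times more often `group` is flagged than `reference`."""
    if rates[reference] == 0:
        raise ValueError("reference group was never flagged; ratio is undefined")
    return rates[group] / rates[reference]

rates = high_risk_rates(records)
ratio = rate_ratio(rates, "A", "B")
print(f"flag rates by group: {rates}")
print(f"group A is flagged {ratio:.2f} times as often as group B")
```

A ratio well away from 1.0 does not prove the software is biased, but it is the kind of signal that should prompt a closer look at the framework producing the scores.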

Those are the more pernicious types of biases: the biases in the framework which makes meaning from the data. As the use of algorithms spreads, the risks created by these implicit biases increase.

Most companies selling algorithmic software do not reveal the internal workings of their algorithms. This seems to make sense: the exact mechanisms their software uses are probably a large part of their intellectual property.

Furthermore, with the rise of neural networks, these companies may not even understand how their software reaches its conclusions. If they do not know how their software makes decisions, it is unlikely that they will understand what types of biases it may have.

Machine Learning IP

Realistically, companies can release either their data or their algorithms with little loss of intellectual property. What is valuable is the combination and the process which feeds the data into the algorithms.

The theoretical foundation behind Google's PageRank algorithm is publicly known and has been for some time. Google's most valuable asset at this point, besides its people, is the data which it has collected. It is possible to inspect PageRank for possible biases. In this case, any biases would be exploited for a better ranking in Google's search results. That has happened, and Google has had to improve its technology in response.

A dramatically simplified example of a biased algorithm is a weather prediction app which simply predicts rain every day. It catches every rainy day, so its recall for rain is 100%. And if the algorithm was trained using a dataset for a region which receives rain 183 days a year, then its predictions would still be about 50% accurate overall (183 of 365 days). In this case, it is easy to look at the precision and recall figures and identify the problem; as the complexity of algorithms grows, such problems become more difficult to identify.
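The always-rain example can be worked through in a few lines of Python. This is only a sketch: the 183-of-365 split comes from the text, while the function names and the extra "dry-day recall" metric are assumptions added for illustration. It shows how a model can catch every rainy day and be right just over half the time overall while being useless for dry days.

```python
# Toy illustration: a "forecaster" that predicts rain every single day.
# The 183-of-365 split is from the text; the names here are hypothetical.
RAINY_DAYS = 183
TOTAL_DAYS = 365

# Ground truth: True means it rained that day (order does not matter here).
actual = [True] * RAINY_DAYS + [False] * (TOTAL_DAYS - RAINY_DAYS)

def always_rain(_day_index):
    """Biased model: ignores its input and always predicts rain."""
    return True

predicted = [always_rain(i) for i in range(TOTAL_DAYS)]

def evaluate(actual, predicted):
    """Compute basic classification metrics, treating rain as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)              # rain predicted, rain occurred
    fp = sum((not a) and p for a, p in pairs)        # rain predicted, day was dry
    fn = sum(a and (not p) for a, p in pairs)        # dry predicted, rain occurred
    tn = sum((not a) and (not p) for a, p in pairs)  # dry predicted, day was dry
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Recall for the negative class: how many dry days were caught.
        "dry_day_recall": tn / (tn + fp) if (tn + fp) else 0.0,
    }

metrics = evaluate(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Looking at a single number, such as recall for rain, makes the model look perfect; the dry-day recall of zero is what exposes the bias.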

Now, with this simple biased situation in mind, reconsider the criminal risk scoring algorithm we mentioned above. It may not be a huge burden to grab an umbrella every day, but there are other cases in which a flaw like this could fundamentally change a person’s life.

As a whole, the technology industry may not really understand the implicit biases in the analytics frameworks they are creating. Searching the ACM Digital Library for “unconscious bias” yielded six results. One of the results mentioned avoiding unconscious biases when selecting papers for publication, another discussed biases in hiring processes, and the remainder concerned biases in computer science education. Searching for “implicit bias” instead returned eight results. These results were more relevant; one of the articles was titled “How to make decisions with algorithms.” Searching the IEEE Xplore digital library yielded slightly worse results.

Finding roughly a dozen results in a datastore of half a million articles seems to support the argument that this is a blind spot for the tech industry. It is, however, possible that these are not the correct search terms, or perhaps no industry is concerned with hidden biases.

To validate the search terms and get an idea of how these topics are being treated in other industries, a general repository for academic papers, JSTOR, was also searched. There, a search for “unconscious bias” yielded 26,505 results. Searching for “implicit bias” returned 93,165 results. The JSTOR articles were relevant and represented the fields of law and medicine as well as psychology and sociology. The stark difference implies that this topic may not be well covered in technical fields.

Liability with Machine Learning

This topic has started to enter public discourse, especially since the algorithms used by Facebook were questioned after the 2016 US presidential election. Consumers are also starting to become aware of how algorithms are being used to influence them. But businesses could also be putting themselves at risk by directly outsourcing their decision making to algorithms which they don't fully understand. It is very easy to grab some data and run it through Google's TensorFlow.

Who is liable when the model breaks down due to unknown biases?

Our first step in mitigating bias issues is hiring and training people who are on guard for subtle biases that can impact machine learning results. We also advise clients when the features they are requesting may carry unquantifiable risks or yield meaningless results.

Further, we are open with customers about how the learning algorithms we create come to their conclusions.

Even if we cannot help our clients directly, we can provide a second opinion which, hopefully, validates their algorithmic approach. With so many potential pitfalls in this space, it just makes sense to exercise a little caution.

September 4, 2018
Written by Chris Stoll

Chris has a Master of Science in Computer Science from The University of Akron; his thesis explored novel applications of well-understood computer vision algorithms. Chris blends deep computer science knowledge with over twelve years of software engineering experience. He architects quality software solutions which are technically sound and economically efficient.