Understanding Bias in Machine Learning

We’ve talked a bit about machine learning best practices, including a few issues you might run into when implementing a model.  Of particular interest is the concept of bias.  Machines have no inherent bias, but what about the people programming them?  We’re not talking about the kind of bias in your decision between In-N-Out and Five Guys (though if we were, I’d tell you to go to Shake Shack instead).  We’re talking about data bias, sampling bias, and the other kinds of bias that affect the data your software learns from and trains on.


Bias in Action

Let’s say that you’ve designed a system that predicts where crime will occur, so police patrols concentrate in those areas to prevent it.  Since the police make arrests where they patrol, the new arrest data reinforces the original prediction and lends the model credibility, whether it deserves it or not.  This is a feedback loop: the model’s output shapes the very data used to evaluate and retrain it.
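
To make that feedback loop concrete, here’s a minimal simulation.  The districts, rates, and patrol split below are made-up numbers for illustration, not a real policing dataset:

```python
import random

random.seed(0)

# Two districts with the SAME underlying crime rate -- any difference the
# model "finds" comes from where it sends patrols, not from reality.
# (All numbers here are hypothetical.)
true_rate = {"district_a": 0.10, "district_b": 0.10}

# The model's initial prediction sends 80% of patrols to district A.
patrol_share_a = 0.8

recorded = {"district_a": 0, "district_b": 0}
for _ in range(10_000):
    district = "district_a" if random.random() < patrol_share_a else "district_b"
    if random.random() < true_rate[district]:
        recorded[district] += 1

# Arrests pile up where the patrols were, so retraining on this data
# "confirms" the original prediction.
print(recorded)
```

District A ends up with roughly four times the recorded incidents of district B, even though crime is identical in both.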

The Harvard Business Review presents the case of Boston’s StreetBump app, which pulled data from a smartphone’s GPS and accelerometer to automatically report where potholes are in the city.  It’s a fantastic example of data with no inherent data bias that is nonetheless hampered by sampling bias: the app collected little data from less affluent areas or areas with an aging population, because those groups are significantly less likely to own a smartphone or have an unlimited data plan.
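
A toy simulation shows how sampling bias alone can skew the picture.  The neighborhood labels and ownership rates below are assumptions for illustration, not Street Bump’s actual figures:

```python
import random

random.seed(1)

# Two neighborhoods with the SAME number of potholes, but different
# smartphone-ownership rates (hypothetical numbers for illustration).
potholes = {"affluent": 100, "less_affluent": 100}
smartphone_rate = {"affluent": 0.9, "less_affluent": 0.4}

# A pothole only gets reported if a resident driving over it carries the
# app -- which requires a smartphone.
reports = {
    n: sum(1 for _ in range(potholes[n]) if random.random() < smartphone_rate[n])
    for n in potholes
}

# The raw counts suggest the less affluent neighborhood has fewer potholes;
# it actually just has fewer sensors.
print(reports)
```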


How Can We Help Control Bias?

Machine learning necessarily uses historical data.  This may be all well and good for a period of time, but as we know, the only constant in life is change (although the case could be made that a 2,500-year-old quote that’s still in use is having a pretty good run).  What this means is that algorithms are not prepared to account for things like new legislation on their own.  A human element is key here, and programmers must be ready to anticipate changes and test how their model would handle them.

In the end, everything is a judgment call.  After all, humans are the ones programming these applications.

  • Inclusivity – Whether it’s including every possible document type you have or making sure you have a diversity of age, race, and gender, account for the variability that exists in the world.
  • Review the data – Is your algorithm getting a little too good at its job?  Make sure your predictive model isn’t reinforcing itself and periodically review the results you’re receiving.
  • Carefully test artificial data points – Since machines have difficulty recognizing things absent from their training data, we can craft artificial data points to gauge how an algorithm will respond to a known upcoming change, or to stress-test it against potential ones.
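
Here’s a minimal sketch of that last idea.  The stand-in “model” and the fee scenario are deliberately simple and entirely hypothetical, just to show the probing pattern:

```python
# Toy stand-in for a trained model: it flags any value above the historical
# mean as anomalous.  (Hypothetical; not any real production system.)
def train_mean_threshold(samples):
    mean = sum(samples) / len(samples)
    return lambda value: value > mean

# Historical data from before a (hypothetical) legislative change.
historical_fees = [100, 102, 98, 101, 99]
model = train_mean_threshold(historical_fees)

# Craft an artificial data point representing the known upcoming change --
# say, a new statutory fee of 250 -- and check the model's response before
# the change goes live.
print(model(250))  # True: the crafted out-of-range point is flagged
print(model(100))  # False: a typical historical value is not
```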

At Extract, we use rulesets and machine learning to ensure that when we redact a document, we’ve identified all of its sensitive information.  When we test our software and algorithms, we know it’s not enough to test a subset of documents.  Testing the accuracy of the files the system selected for review is only the beginning; we also include documents we don’t think contain any sensitive information, because every file needs to be carefully reviewed.  This keeps our software as smart as it can be and on the lookout for anomalies.  No data bias, no sampling bias, just an ever-improving algorithm to extract, redact, and index your documents fast.

If you'd like to learn more about how we control bias at Extract, send us a note and we'd be happy to talk with you.


Chris is a Marketing Manager at Extract with experience in product development, data analysis, and both traditional and digital marketing.  Chris received his bachelor’s degree in English from Bucknell University and has an MBA from the University of Notre Dame.  A passionate marketer, Chris strives to make complex ideas more accessible to those around him in a compelling way.