Episode 3: Machine Learning and Data Bias
In this episode, Chris Stoll, a Sr. Software Architect at Skiplist, joins Andrew and me to discuss machine learning. Both have master's degrees in Computer Science and tons of experience in this space.
Machine learning is not just an algorithm or a black box. There are biases in data that all companies need to be aware of. Some can be costly if not properly addressed. This is also why diversity is so important to any organization. Maybe you don’t need machine learning in the beginning. Find out when the right time is.
We also get into discussing Pokemon Go and why Facebook is on the hot seat again. - Fahad
Where Does this Tech Fit Best?
Systems for collecting data began to be installed en masse a few years ago.
As these systems were being built, companies began building data teams as well - data science teams (though they weren't necessarily called data science teams).
Machine learning to deep learning.
Best Approaches to Starting a Machine Learning Project (2:41)
2:41- Starting well
Need to perform a thorough data inventory
Is it worth having a team for your data? Where does the data reside?
Data needs transformation to become information
Need to ask the right questions of your data to find the categories of answers you need.
Finally, bring in the tech - MapReduce etc.
Can also start with a simple hypothesis to test and run it like an agile software project.
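As a rough illustration of the MapReduce idea mentioned above, here is a minimal single-machine sketch in Python. The records, the field names (region, amount), and the function names are all hypothetical, invented for this example; a real project would run the same map and reduce steps on a distributed framework rather than in one process.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical records; in practice these come from wherever the data resides.
RECORDS = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 80.0},
    {"region": "north", "amount": 30.0},
    {"region": "east", "amount": 50.0},
]


def map_record(record):
    """Map step: turn one raw record into a (key, value) pair."""
    return (record["region"], record["amount"])


def reduce_pairs(totals, pair):
    """Reduce step: fold one (key, value) pair into the running totals."""
    key, value = pair
    totals[key] += value
    return totals


def total_by_key(records):
    """Run the map step over every record, then reduce the pairs to totals."""
    pairs = map(map_record, records)
    return dict(reduce(reduce_pairs, pairs, defaultdict(float)))


if __name__ == "__main__":
    print(total_by_key(RECORDS))
```

The question being asked of the data ("how much per region?") decides what the map step emits, which is why the questions come before the tech.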
Distinct Advantages of ML Algorithms (5:30)
5:30- Especially versus traditional statistical models
AI started in the 80s out of the expert systems efforts.
2nd AI/ML revolution = statistical techniques being applied to organized data.
Data Bias Awareness (6:26)
6:26- Externalities and unexpected problems
Pokemon Go: fewer pokemon in certain neighborhoods, etc.
Need to just ask: what bias might exist in the data? Brainstorm and put in the effort.
There are models in stats already for eliminating biases - apply them to ML!
Acknowledge that some biases will be missed, at least initially.
Expose methodology so that peers can review for potential biases.
Sample to population comparisons (central limit theorem)
Need more study in this area, especially in the tech industry.
The HUD complaint against Facebook for ad discrimination is a perfect example.
People straight up don't understand how important diversity is at a company.
Data doesn't "speak for itself." Someone is making meaning out of it.
How to Lie with Statistics, by Darrell Huff, makes this point (this is one of Andrew's favorite books; watch out, world)
Also: bring in domain experts. Can't rely just on data scientists to address bias.
Too few developers and engineers have ever worked in/experienced other industries.
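One way to make the sample-to-population comparison mentioned above concrete is a quick check of whether each group's share of the training sample is far from its share of the real population. The sketch below is an assumption-laden illustration, not a method from the episode: the group names, the counts, and the 1.96 cutoff (roughly a 95% two-sided band under the normal approximation that the central limit theorem justifies for large samples) are all chosen for the example.

```python
import math


def proportion_z_score(sample_count, sample_size, population_share):
    """Z-score of a sample proportion against a known population share.

    Uses the normal approximation, so it is only reasonable for large samples.
    """
    if not 0 < population_share < 1:
        raise ValueError("population_share must be strictly between 0 and 1")
    sample_share = sample_count / sample_size
    standard_error = math.sqrt(
        population_share * (1 - population_share) / sample_size
    )
    return (sample_share - population_share) / standard_error


def flag_skewed_groups(sample_counts, population_shares, threshold=1.96):
    """Return groups whose sample share is unusually far from the population."""
    sample_size = sum(sample_counts.values())
    flagged = {}
    for group, share in population_shares.items():
        z = proportion_z_score(sample_counts.get(group, 0), sample_size, share)
        if abs(z) > threshold:
            flagged[group] = round(z, 2)
    return flagged


if __name__ == "__main__":
    # Hypothetical data: who is in the sample vs. who is in the population.
    counts = {"urban": 700, "suburban": 250, "rural": 50}
    shares = {"urban": 0.5, "suburban": 0.3, "rural": 0.2}
    print(flag_skewed_groups(counts, shares))
```

A positive score means the group is over-represented in the sample and a negative score means it is under-represented. A check like this only catches bias along dimensions someone thought to measure, which is where domain experts and diverse teams come in.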
Tradeoffs with University Data Scientists (19:02)
19:02- Relied on too heavily?
Being a great data scientist requires an entrepreneurial skill set.
Need to be able to hunt and observe and extrapolate so that the science is accurate
How the user experiences/interacts with the system determines the output of the system. Understanding what's happening in the domain is critical for performing excellent data science.
Classification problems quickly become search problems, or even wholly different types of problems, in the process of interrogating data.
ML from Cloud Services (22:00)
22:00- How good are the off-the-shelf tools and the results we get from them?
Tools are great
But they still need trained hands to use them for maximum value.
We need to make systems that help our users and don't harm them. Sometimes, these tools become hammers that transform all problems into nails. Not good.
Precision vs recall is important.
Need to take time to interpret data well within thoughtful parameters.
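For readers less familiar with the precision vs recall tradeoff mentioned above, here is a small sketch of how the two numbers are computed for a binary classifier. The labels are made up for illustration and the function name is hypothetical.

```python
def precision_recall(actual, predicted, positive=1):
    """Compute (precision, recall) for one positive class.

    precision: of everything flagged positive, how much really was positive.
    recall: of everything really positive, how much was flagged.
    """
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == positive and p == positive)
    false_pos = sum(1 for a, p in pairs if a != positive and p == positive)
    false_neg = sum(1 for a, p in pairs if a == positive and p != positive)

    flagged = true_pos + false_pos
    really_positive = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / really_positive if really_positive else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical labels: 4 real positives; the model flags 3 items.
    actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
    precision, recall = precision_recall(actual, predicted)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case the model is right about two of the three items it flags, but it finds only two of the four real positives. Which of those matters more depends on whether a false alarm or a miss does more harm to the user.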
How Does an Enterprise Start? (25:45)
25:45- What's the starting place for an ML POC?
Proving out ML starts with an agile technique - get a minimum viable solution.
Next step - does the POC work for the larger population or problem set?
Iterating with more data.
Take smaller risks with money. Start smaller and expand thoughtfully.
Ultimately, experience counts for a lot here. Getting a gut sense of data.
PCA helps create visual correlations for quick, simple analysis of data variables.
Need mathematical intuition as well.
As is the theme: starting must include a group of cross-functional professionals. Get them in a room with whiteboards!
Not every problem is a machine learning problem. There are some great statistical methods out there for achieving excellent results without ML.
The initial solutions to the Netflix Prize, for instance, were based on complex matrix classification and analysis.
Netflix is now using ML for recommendations, but they've earned this complexity.
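To give a feel for the PCA point made earlier in this section, here is a dependency-free sketch that finds the first principal component of a small dataset using power iteration on the covariance matrix. This is one simple way to compute it, chosen for readability; it is an assumption of this example, not something described in the episode, and in practice a numerical library would normally do this. The data and function names are hypothetical.

```python
import math


def center(rows):
    """Subtract each column's mean so the data is centered at the origin."""
    n, dims = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    return [[row[j] - means[j] for j in range(dims)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance matrix of already-centered rows (needs >= 2 rows)."""
    n, dims = len(centered), len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]


def first_principal_component(cov, iterations=200):
    """Approximate the dominant eigenvector of cov by power iteration.

    Caveat: the fixed start vector can fail in contrived cases where it is
    exactly orthogonal to the dominant direction.
    """
    dims = len(cov)
    vec = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            return vec
        vec = [x / norm for x in nxt]
    return vec


def project(rows, component):
    """Project centered rows onto the component, giving one score per row."""
    return [sum(x * c for x, c in zip(row, component)) for row in center(rows)]


if __name__ == "__main__":
    # Hypothetical two-variable data where the second column tracks the first.
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
    component = first_principal_component(covariance_matrix(center(data)))
    print(component)
    print(project(data, component))
```

Plotting the projected scores (or the first two components for wider data) is the kind of quick visual check the notes refer to: variables that move together collapse onto the same direction.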