The ability for a computer to turn data into analytical insights by automating data preparation, insight discovery, and the sharing of those insights with the appropriate people has amazing potential.
Many firms that study technology trends (Gartner, BBVA, and Forbes, to name a few) have recently been discussing a new analytics trend: augmented analytics.
The idea of augmented analytics is incredibly exciting.
But how far along is this approach? Is it worth investing in now, or is it better to wait for something more substantive to come along? Is it all hype, or is it real?
Let’s look at each component of augmented analytics and find out.
Looking at automating data preparation first, a number of tools on the market attempt to automate it, including IBM SPSS, ClearStory Data, and DataRobot. However, outside of IBM SPSS, none attempt to be general-purpose tools, and even IBM SPSS requires training and a lot of setup by a human before it runs automatically.
But first it’s important to understand what exactly data preparation is. Data preparation can be broken down into three stages:
- Data Collection
- Data Labeling
- Data Cleaning
Data collection, the first step, is infamously hard for many reasons. As I wrote about in my NewSQL article, data comes in many forms and shapes, which makes parsing it and deriving meaning from it difficult. For example, parsing an Excel spreadsheet laid out as a report rather than as a clean CSV-style table is a non-trivial task even for a human, but for a computer with no mapping (something telling it where the data is within the spreadsheet), it’s downright impossible at the moment.
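To make the idea of a mapping concrete, here is a minimal sketch in Python. The sheet contents, the `MAPPING` keys, and the `extract_table` helper are all made up for illustration; they are not taken from any particular tool. The point is only that once a human says where the table starts and where it ends, pulling the rows out becomes mechanical.

```python
import csv
import io

# Hypothetical export of a report-style sheet: title rows, a blank row,
# then the real table starting at row index 3, column index 1.
RAW = """Quarterly Sales Report,,,
Generated 2019-01-05,,,
,,,
,Region,Units,Revenue
,East,10,1500.50
,West,7,980.00
,Total,17,2480.50
"""

# The human-supplied mapping (an assumption of this sketch): where the
# table lives inside the sheet and which label marks the end of the data.
MAPPING = {"header_row": 3, "first_col": 1, "stop_label": "Total"}


def extract_table(raw_text, mapping):
    """Return the data rows as dictionaries keyed by the header cells."""
    rows = list(csv.reader(io.StringIO(raw_text)))
    start = mapping["first_col"]
    header = rows[mapping["header_row"]][start:]
    records = []
    for row in rows[mapping["header_row"] + 1:]:
        cells = row[start:]
        # Stop at the summary row so totals are not mistaken for data.
        if not cells or cells[0] == mapping["stop_label"]:
            break
        records.append(dict(zip(header, cells)))
    return records


records = extract_table(RAW, MAPPING)
print(records)
```

Without `MAPPING`, the code has no way to know that the first three rows are decoration or that the "Total" row is a summary, which is exactly the knowledge a human brings.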
Data labeling may be the thing the computer is best at in the simple cases: the computer knows when something is a string of characters or a number. However, in every case but the rudimentary ones, the computer mostly falls flat. When there is an encoding issue (e.g. a number looks like a string of characters to the computer), or when the data is misleading (e.g. it looks like a number but it’s actually a date-stamp), labeling becomes incredibly hard for a computer to do without human intervention.
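Both failure modes are easy to reproduce. The sketch below uses a deliberately naive labeler; `infer_label` and `could_be_date` are hypothetical helpers written for this example, and the `%Y%m%d` date format is an assumption about how the date-stamps are stored.

```python
from datetime import datetime


def infer_label(values):
    """Naively label a column of strings as 'integer', 'float', or 'string'."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "string"


def could_be_date(values, fmt="%Y%m%d"):
    """Check whether every value also parses as a date in the assumed format."""
    try:
        for value in values:
            datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


stamps = ["20190105", "20190212"]   # date-stamps that look like numbers
amounts = ["1,200", "3,450"]        # numbers that look like strings

print(infer_label(stamps), could_be_date(stamps))
print(infer_label(amounts))
```

The labeler calls the date-stamps integers and the comma-formatted amounts strings. Both answers are defensible from the characters alone and wrong for the analysis, and deciding between the competing readings is where a human usually has to step in.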
The final step is to clean the data so that it can be used in an analytics pipeline. There are some great techniques that help with this step: one-hot encoding, missing value imputation, value aggregation, text mining, standardization, and feature engineering or selection using things like component analysis or SelectKBest. Using machine learning modeling, a computer can test these techniques, determine which one or which combination should be used to prepare the dataset for training, and iterate until it finds the best combination for a certain model. This may not produce the best possible model (a global maximum), but it will produce the best model that the computer can generate given its own feedback loop (a local maximum).
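For readers who haven’t met these techniques, here is a rough, standard-library-only sketch of three of them: mean imputation, standardization, and one-hot encoding. The function names and the tiny `ages` and `colors` columns are invented for the example; in practice a library such as scikit-learn would normally do this work.

```python
from statistics import mean, pstdev


def impute_mean(column):
    """Replace missing values (None) with the mean of the known values."""
    known = [x for x in column if x is not None]
    fill = mean(known)
    return [fill if x is None else x for x in column]


def standardize(column):
    """Rescale to zero mean and unit variance (assumes the values are not all equal)."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]


def one_hot(column):
    """Turn each category into a 0/1 vector, one position per sorted category."""
    categories = sorted(set(column))
    return [[1 if value == c else 0 for c in categories] for value in column]


ages = [20, None, 40]             # made-up numeric column with a gap
colors = ["red", "blue", "red"]   # made-up categorical column

filled = impute_mean(ages)
scaled = standardize(filled)
encoded = one_hot(colors)         # vector positions: blue, red

print(filled, scaled, encoded)
```

Each function is simple on its own. The hard part is choosing which ones to apply, to which columns, and in what order.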
At this point, the computer has generated the best version of the data it can for future analytics. However, it may still be sub-optimal, as the computer can’t guarantee a global maximum outcome, and therefore a human may need to assist and tell the machine which types of data cleaning need to be done.
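The gap between a local and a global maximum can be shown with a toy search. In the sketch below, the scores in `SCORES` are entirely made up to illustrate the effect and do not come from any real model or dataset. The greedy search stands in for a feedback loop that adds one cleaning step at a time and stops when nothing improves.

```python
from itertools import combinations

STEPS = ("impute", "one_hot", "standardize")

# Made-up validation scores for every combination of steps (illustrative only).
SCORES = {
    frozenset(): 0.60,
    frozenset({"impute"}): 0.70,
    frozenset({"one_hot"}): 0.65,
    frozenset({"standardize"}): 0.62,
    frozenset({"impute", "one_hot"}): 0.68,
    frozenset({"impute", "standardize"}): 0.69,
    frozenset({"one_hot", "standardize"}): 0.80,
    frozenset(STEPS): 0.75,
}


def greedy_search(steps, score):
    """Add the single best step each round; stop when no addition helps."""
    chosen = frozenset()
    while True:
        candidates = [chosen | {s} for s in steps if s not in chosen]
        best = max(candidates, key=score, default=chosen)
        if score(best) <= score(chosen):
            return chosen
        chosen = best


def exhaustive_search(steps, score):
    """Score every subset of steps; only feasible for small step lists."""
    subsets = (frozenset(c) for r in range(len(steps) + 1)
               for c in combinations(steps, r))
    return max(subsets, key=score)


greedy = greedy_search(STEPS, SCORES.get)
best = exhaustive_search(STEPS, SCORES.get)
print(sorted(greedy), SCORES[greedy])
print(sorted(best), SCORES[best])
```

With these invented numbers, the greedy loop settles on a single step and never finds the better two-step combination, because reaching it would mean passing up the step that looked best in the first round. Exhaustive search finds it, but the number of combinations doubles with every step added, which is why a human hint about which cleaning to try can matter.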
Computers are really great at finding patterns in data, surfacing statistical relationships, and fitting functions to them. This type of signal detection is essentially what machine learning is based on, and it works really well in the real world.
However, computers don’t know how to take these discoveries and apply them to business situations. Linking the results of the discovery to a valuable insight still requires the help of a domain expert or data scientist.
Once those insights are discovered, they must be sent to the right people. Often, systems require a human to tag the insight and then send it to everyone subscribed to that tag. At the moment, what happens to those insights once they reach the appropriate crowd is lost to the void. However, in the future, more intelligent systems will track who the insights were sent to, whether they were implemented, and the effect of those implementations.
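The tag-and-subscribe pattern described here is simple to sketch. The `InsightRouter` class below is hypothetical, as are the people and tags; it is not modeled on any specific product. The delivery log is one assumed way a system could start recording who received what.

```python
from collections import defaultdict


class InsightRouter:
    """Route human-tagged insights to everyone subscribed to those tags."""

    def __init__(self):
        self._subscribers = defaultdict(set)  # tag -> set of people
        self.delivery_log = []                # (insight, recipient) pairs

    def subscribe(self, person, tag):
        self._subscribers[tag].add(person)

    def publish(self, insight, tags):
        """Deliver the insight once per person, even if several tags match."""
        recipients = set()
        for tag in tags:
            recipients |= self._subscribers.get(tag, set())
        for person in sorted(recipients):
            self.delivery_log.append((insight, person))
        return sorted(recipients)


router = InsightRouter()
router.subscribe("ana", "churn")
router.subscribe("ana", "pricing")
router.subscribe("ben", "pricing")

sent_to = router.publish("Churn rises after price changes", ["churn", "pricing"])
print(sent_to)
print(router.delivery_log)
```

Note what the sketch cannot do: it records that an insight was delivered, but nothing about whether anyone acted on it or what happened afterward. That follow-up data has to come from somewhere outside the router.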
Systems may even look at the database of insights they collect over time and suggest particular actions to take. This step of augmented analytics is still pretty far from reality, as it requires a level of intelligence from systems that isn’t quite available yet.
Augmented analytics is a reality that many businesses will face in the coming years as new systems become available to tackle the various issues of analytics, such as data preparation, insight discovery, and insight delivery.
It will be extra important to separate the facts from the fiction in these systems and to know what responsibilities and roles your organization will need in order to run your analytics. These systems stand to truly disrupt the types of roles required to build out a true analytics capacity as a business, but they also stand to waste an incredible amount of money and time for organizations that don’t understand the current capabilities and limitations of augmented analytics.
Andrew has an M.S. in Computer Science from Georgia Tech. He also prides himself on over 10 years of experience working in the software industry for well-known companies such as Diebold, Tableau, Explorys, and Onshift. After years in the corporate and startup worlds, as well as running his own consulting firm, Andrew realized he had to do more to improve software products and practices. From that, Skiplist was born. Skiplist is the opportunity to focus on thoughtful, quality software and change the software consulting industry.