Azure ML and the future of data science
I attended a demo/sales pitch for Azure Machine Learning Studio (AMLS) at the Pittsburgh useR meetup this week. Unfortunately, I, along with most of the audience, was profoundly bored. However, upon further reflection, the AMLS brings up a fairly compelling issue: what is the future of data science software?
What's the problem?
On a technical level, the AMLS is fairly neat. It's a web-based GUI that builds predictive "models". You add a dataset, specify some data-cleaning steps, pick an off-the-shelf classifier, and then press the big "run" button, which executes everything on the cloud(!). You can get a predictive model running in just a couple of minutes.
It's a neat tool, so what's so boring? I'm fairly confident that the reason is that the intended audience for the AMLS (and certainly for the sales pitch) isn't researchers like me. The key selling point is simplicity, not power. They didn't flat out say it, but the underlying message was "it's so simple, anybody can do it". To make things simple, you generally have to sacrifice performance and flexibility. For statistical/ML research, this isn't a good tradeoff.
However, I suspect that Microsoft is using the AMLS to break into the data science market for small- to medium-sized businesses. This is potentially an interesting and lucrative market; these are businesses that could benefit from predictive models, but don't have the resources or expertise to recruit and support a data scientist.
Within this context, you start to see the allure of the "it's so simple, anybody can do it" subtext: you don't need to hire an expensive data scientist; you can do it yourself.
This is the aspect that I find troubling; I don't trust the models of just "anybody". I have to do a lot of work just to trust my own models.
Modeling is really hard.
Case in point: the classifier used for their model was a Decision Jungle. None of the statisticians in the room had ever heard of this method before. A quick Google Scholar search finds the paper, but I guarantee it's incomprehensible to the intended user of the AMLS. With no understanding of what's going on, I have to assume that the main reason for using the technique is that it sounds cool. This is problematic.
Do we have to know how a model works to use it to make good predictions? One argument is that we can take the Kaggle competition approach: trust that validation will act as a safety valve. Who cares if they make a hideous model, so long as it scores well on the validation data?
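To make the "safety valve" idea concrete, here is a minimal hold-out validation sketch in R; the built-in mtcars data and the formula are just stand-ins for illustration. The point is only that the model gets judged on data it never saw, not on how it was built.

```r
# Minimal hold-out validation: fit on a random 80%, score on the rest.
# mtcars is a stand-in dataset; the formula is arbitrary.
set.seed(2)
idx   <- sample(seq_len(nrow(mtcars)), size = floor(0.8 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit <- lm(mpg ~ wt + hp, data = train)

# Held-out mean squared error -- the number a Kaggle-style leaderboard cares about.
mean((test$mpg - predict(fit, newdata = test))^2)
```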
My counterargument: I doubt that any realistic model created using the AMLS will predict well. This isn't a cheap shot; I wouldn't do any better. The truth is that most of the interesting models for businesses have to do with people, and as any social scientist can tell you, models of human behavior typically don't predict well.
Now, given that our models aren't going to predict well, we actually need to know how things work to know how they will fail. Do I think there are interactions between variables? Then straightforward linear regression isn't going to pick that up. Is there class imbalance? I need to reweight my samples. Do I need to explain my model to someone? Don't use deep learning.
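As a sketch of the first point, here is what asking for the interaction looks like in R on simulated data (the variable names and effect sizes are made up for illustration):

```r
# Simulate data where the true signal is an interaction between x1 and x2.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + 2 * x1 * x2 + rnorm(n)

fit_additive    <- lm(y ~ x1 + x2)   # never sees the interaction
fit_interaction <- lm(y ~ x1 * x2)   # expands to x1 + x2 + x1:x2

# The additive model leaves most of the signal on the table -- but we only
# thought to ask for the interaction because we suspected it was there.
summary(fit_additive)$r.squared
summary(fit_interaction)$r.squared
```

The software will happily fit either model; knowing which one to ask for is the part that requires understanding the problem.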
Alternative Approaches
Contrast this with the current system, in which the opacity of statistical software serves as a barrier to entry for serious work.
Specifically, you have to know what to ask for before you can fit your model. You don't get a drop-down menu of models that you can choose from. You must put in the time to learn how to use R or python or julia or whatever in order to run your analysis. The fancy stuff isn't included by default; you have to seek it out and install it yourself. Hopefully, by the time you're ready to fit your model you've also picked up sufficient statistical background to actually know what you're doing.
It should be immediately obvious that the status quo is a terrible system. Technical skills and modeling ability are weakly correlated at best. Imagine the collective amount of suffering inflicted on everyone who just wants to run a simple regression but can't figure out how to read an Excel file into R (speaking from personal experience, sadly).
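For what it's worth, one way to do that today is the readxl package; the file and column names below are hypothetical:

```r
# install.packages("readxl")  # not part of base R
library(readxl)

# Hypothetical spreadsheet and columns -- substitute your own.
sales <- read_excel("sales_2015.xlsx", sheet = 1)

# The "simple regression" the poor user actually wanted.
fit <- lm(revenue ~ ad_spend, data = sales)
summary(fit)
```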
It doesn't even serve its purpose well: people just learn how to use the "lm" function in R and then look at the stars. Or they ask questions on Cross Validated or R-help about how to fit models that fundamentally don't make sense.
Moving Forward
If Microsoft's new tool is scary and the status quo isn't much better, what should we be doing instead?
One potential solution is more education. If only people had a better grasp of statistical issues, modeling for the masses wouldn't be a huge problem. I could trust that, with a competent user behind the wheel, the AMLS would save a lot of time and end up with a good, defensible model.
So maybe we should embrace it. Instead of teaching the mechanics of t-tests, what if we started teaching more interactive data analysis using tools like the AMLS or a more beginner-friendly R?
This brings to mind freestyle chess, in which human competitors play with the help of the most advanced computer chess engines: the humans provide intuition, and the computers provide raw power. The same thing could work for data analysis.