Recently, I’ve been seeing a lot of services and products advertising automation of machine learning. Data Robot and H2O.ai offer platforms that allow the creation of machine learning algorithms in point-and-click interfaces. They’ll even do the feature engineering for you! This functionality, or something like it, is slowly being built into various tools and programs. They promise to automate the creation of the whole machine learning pipeline - from feature transformations, hyperparameter tuning, to model selection. There are open-source tools that do much the same things (like TPOT, a cool module I love the idea of but can never get to actually work on a data set that isn’t trivially small).
Right now, these tools mostly aren’t great and/or are absurdly expensive (for the cost of a subscription to Data Robot, you can employ a full-time data scientist). But I have no doubt that soon tools will exist that will completely take care of the model/hyperparameter/feature-transformation process.
I’ve had people ask me if I’m worried about my job security as a data scientist. No, I am not. I can’t wait until these tools are there and open source so I can just type “import machinelearn” and just have it do the stupid hyperparameter optimization and I can get on with the hard part of the job.
When I get data to the point where it could conceivably be ingested by one of these tools, the problem is basically done. At that point I need to run a bit of code to do the grid search and find a reasonably decent model and tune the hyperparameters. Hell, if I just ran XGBoost with the default parameters at this point it would usually be almost as good as I am ever going to get it anyways. Doing the extra work of tuning things a bit more is only worth it because it’s relatively easy, and you very quickly get to the point of diminishing returns (unless you’re in a Kaggle competition, where even diminished returns might take you from 10th place to 1st so you milk every tiny incremental increase in accuracy you can).
Once you have your data in the format where you could make a Kaggle competition out of it, you’ve done the hard part. I would love it if at that point I just ran a single function that did a well optimized search that was way more thorough than my typical grid searches, and also explored some different feature transformations. Maybe my models would do marginally better, and I would save myself a few minutes writing the code. It would be nice. But if it would put you out of a job, maybe you should be seriously thinking about what skills you bring to the table.
In most data science positions I’ve heard of, the hard part isn’t building a model once the problem has been framed, data collected, samples chosen, and data is in a neat one-row-per-sample format. The hard part is getting to that point. While I don’t doubt some of these steps will be made simpler in the future as tools evolve, I can’t see anytime in the near future where the whole process could be easily automated. Translating a business problem into a prediction problem is hard and requires a lot of business knowledge coupled with abstract, quantitative thinking. Figuring out what data to use and how to get it is hard - businesses evolve and the data infrastructure isn’t always so clean, so there aren’t ready solutions here. Choosing an unbiased sample set for training can be extremely difficult and there isn’t a cookie-cutter solution to this. Most often, some structure needs to be imposed on the data from knowledge about the particulars of the problem.
I have no doubt that in the next few years, we’ll have some nice tools for automating the building of a machine learning pipeline. Hopefully once that problem is rendered trivial, fewer aspiring data scientists will try to prove their skills by showing off how accurate their model is on the Iris data set. I don’t see much impact on the field beyond that.
comments