Launching a data science project end to end is daunting, and many enterprise projects never make it to production. The process is similar in most organizations and consists of data collection, last-mile ETL, feature engineering, and machine learning. Although most teams understand the process, executing it is complex and carries a high level of operational risk.
We recently published a complete guide to operationalizing data science. In it, we identify five complex issues that a business must address to derive value from operationalizing data science.
Highlights from the paper:
Issue 1: Quality
Two groups in the data science process are not operationally aligned:
1) Data engineers build data pipelines with SQL or GUI-based tools.
2) Data scientists build machine learning scoring pipelines using Python or R.
Software engineers must often reimplement much of the work from both groups before production can start.
Issue 2: Integrability
Data and scoring pipelines may have been developed and implemented on different technology platforms and are difficult – or impossible – to integrate.
Issue 3: Maintainability
Data science pipelines must be maintained. The traditional approach is to manually re-create the entire data science process whenever something changes, which increases the maintenance effort.
Issue 4: Scalability
Limited computational resources force data scientists to work with small sample data sets that do not represent the larger data sets needed for scoring, so the resulting process may not scale.
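One common way to keep a scoring step usable on both a small sample and a much larger data set is to stream rows through it in fixed-size batches, so memory use depends on the batch size rather than on the total number of rows. The following is a minimal sketch of that idea, not the approach described in the white paper. The names `batched`, `score_in_batches`, and `toy_score_batch`, as well as the `visits` and `purchases` fields, are hypothetical and exist only for illustration.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(rows: Iterable, batch_size: int) -> Iterator[List]:
    """Yield successive lists of at most batch_size items from rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def score_in_batches(
    rows: Iterable[dict],
    score_batch: Callable[[List[dict]], List[float]],
    batch_size: int = 1000,
) -> Iterator[float]:
    """Score rows lazily, holding only one batch in memory at a time."""
    for batch in batched(rows, batch_size):
        yield from score_batch(batch)


def toy_score_batch(batch: List[dict]) -> List[float]:
    # Hypothetical stand-in for a real model's batch prediction call.
    return [0.5 * row["visits"] + 2.0 * row["purchases"] for row in batch]


if __name__ == "__main__":
    # A generator stands in for a data source too large to load at once.
    rows = ({"visits": i, "purchases": 1} for i in range(5))
    for score in score_in_batches(rows, toy_score_batch, batch_size=2):
        print(score)
```

Because `rows` can be any iterable, the same function accepts a short list during development and a lazily read file or query cursor in production. Batching addresses memory only; it does not by itself make a small sample representative of the full data.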
Issue 5: Portability
Developing one data science process that works well for two different environments – development vs. production – is a nontrivial task.
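One widely used way to reduce the gap between environments is to keep the pipeline code identical and move everything that differs into per-environment settings. The sketch below assumes a hypothetical `PIPELINE_ENV` environment variable and hypothetical paths, sample fractions, and worker counts; none of these come from the white paper.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PipelineSettings:
    input_path: str         # where the pipeline reads its data
    sample_fraction: float  # share of the data to process
    workers: int            # degree of parallelism


# All values below are hypothetical placeholders.
SETTINGS = {
    "development": PipelineSettings("data/sample.csv", 0.01, 1),
    "production": PipelineSettings("data/full.csv", 1.0, 16),
}


def load_settings(env: Optional[str] = None) -> PipelineSettings:
    """Pick settings by name, falling back to the PIPELINE_ENV variable."""
    name = env or os.environ.get("PIPELINE_ENV", "development")
    try:
        return SETTINGS[name]
    except KeyError:
        raise ValueError(f"unknown environment: {name!r}") from None


def describe_run(settings: PipelineSettings) -> str:
    """Summarize what a run with these settings would do."""
    return (
        f"read {settings.input_path} at {settings.sample_fraction:.0%} "
        f"using {settings.workers} worker(s)"
    )


if __name__ == "__main__":
    print(describe_run(load_settings()))
```

The pipeline logic never checks which environment it is in; it only reads the settings object, so moving from development to production changes configuration rather than code.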
Download the Paper
This white paper describes a holistic, platform-level approach to the problem of data science automation. To learn more, please check out the complete white paper here.