Data Science: Complex and Time-Consuming
Data science is at the heart of what many are calling the fourth industrial revolution. Businesses across industries leverage Artificial Intelligence (AI) and Machine Learning (ML) in a wide range of use cases to make more intelligent decisions and to accelerate decision-making. Data scientists play a central role in this revolution. However, according to a 2018 study published by LinkedIn, there is a national shortage of more than 150,000 workers with data science skills. This severe shortage means that the race to improve the productivity of data scientists is driving some exciting new technologies.
One of the primary challenges is the sheer complexity and the iterative, highly manual nature of the data science process. Data scientists must sift through vast amounts of raw data, typically spread across highly complex systems with hundreds of tables. Integrating and transforming those tables to create “feature tables” is at the heart of the entire process. Perhaps not surprisingly, it is also the most time-consuming, tedious, and iterative part, often requiring in-depth knowledge of the underlying data and, more importantly, of the business domain to form multiple “hypotheses” to test and refine before data scientists can even begin to build ML models. Building ML models is itself highly technical, requiring in-depth knowledge of machine learning and statistics. Data scientists have to choose a proper ML algorithm and carefully tune the model based on the nature of a given use case and business requirement (e.g., black-box vs. white-box). Once again, they must resort to an iterative “try, rinse, and repeat” approach that is time-consuming and error-prone.
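As a deliberately tiny illustration of what feature-table construction looks like, the sketch below joins a hypothetical customers table with a hypothetical transactions table and aggregates the transactions into per-customer features with Pandas. The table and column names are invented for the example; real feature engineering spans hundreds of tables and many such hypotheses.

```python
import pandas as pd

# Hypothetical raw tables (names invented for this sketch).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["east", "west", "east"],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [20.0, 35.0, 10.0, 5.0, 12.0, 8.0],
})

# One "hypothesis": per-customer transaction counts and totals
# might be predictive. Aggregate, then join back to the customers.
features = (
    transactions.groupby("customer_id")["amount"]
    .agg(txn_count="count", txn_total="sum", txn_mean="mean")
    .reset_index()
    .merge(customers, on="customer_id", how="right")
)
print(features)
```

Each alternative hypothesis (different aggregations, different joins, different time windows) means another round of this kind of integration work, which is why the step dominates the overall timeline.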
Python, The Platform of Choice for Data Scientists
Over the past decade, Python has become the most popular and powerful platform for data scientists. Python is relatively easy to learn and offers a vast array of advanced ML libraries, two factors that have been critical to the platform’s rapid rise in popularity. Python also has a vibrant ecosystem of tools such as Pandas for data manipulation, NumPy for numeric computation, PySpark for distributed computing, Matplotlib for data visualization, and Jupyter Notebook for rapid prototyping. This broad ecosystem lets data scientists manage their entire workflow in a single environment. Python is also more flexible, sophisticated, and open than more traditional frameworks such as R or MATLAB when it comes to integrating ML models into production environments. Coupled with the vast library of free, readily available learning material, the choice of Python as the “de facto” platform for data science becomes apparent.
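A minimal sketch of how two of these pieces compose in one environment, using Pandas for data manipulation and NumPy for numeric computation on an invented toy data set:

```python
import numpy as np
import pandas as pd

# Toy data set (values invented for illustration).
df = pd.DataFrame({
    "height_cm": [170, 160, 180],
    "weight_kg": [70.0, 55.0, 90.0],
})

# NumPy functions apply element-wise to Pandas columns, so the
# libraries interoperate without any glue code: derive BMI directly.
df["bmi"] = df["weight_kg"] / np.square(df["height_cm"] / 100)
print(df.round(1))
```

The same DataFrame could then be handed to Matplotlib for a plot or to an ML library for modeling, which is exactly the single-environment workflow the ecosystem enables.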
AutoML: Replace or Accelerate Data Scientists?
Recently, automated machine learning (AutoML) has become one of the fastest-growing enabling technologies for data science. AutoML platforms attempt to address one of the most significant problem areas for data scientists: developing predictive models with machine learning algorithms. The sheer multitude of ML algorithms and models, each with unique characteristics, means that selecting and manually tuning the proper algorithm for a specific use case is time-consuming and error-prone. AutoML has proven to be a significant time-saver in these situations.
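A rough sketch of the kind of search an AutoML tool automates, here hand-rolled with scikit-learn’s GridSearchCV over two candidate algorithms on a built-in data set (a real AutoML platform searches far larger algorithm and hyperparameter spaces and automates pipeline choices as well):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate algorithms with small hyperparameter grids; an AutoML
# tool would enumerate many more of each.
candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, 8]}),
]

# Cross-validated search over every algorithm/hyperparameter combo,
# keeping whichever scores best.
best = None
for estimator, grid in candidates:
    search = GridSearchCV(estimator, grid, cv=5)
    search.fit(X, y)
    if best is None or search.best_score_ > best.best_score_:
        best = search

print(type(best.best_estimator_).__name__, round(best.best_score_, 3))
```

Even this toy loop captures the “try, rinse, and repeat” cycle; automating it is precisely the time-saving that AutoML delivers.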
According to Gartner, more than 40% of data science tasks will be automated by 2020. AutoML, however, is not replacing the data scientist any time soon. The primary aim of AutoML tools is to make data scientists more productive. Traditional data science processes often follow “waterfall” approaches that require significant, time-consuming manual effort at each stage. That highly manual nature makes data science an ideal target for automation: it becomes easier to try new ideas, and data scientists can explore more use cases, and higher-impact ones, faster.