Demystifying Feature Engineering for Machine Learning

  • Sachin Andhare
  • Blog
  • June 4, 2020

What is Feature Engineering

Let’s say you are using applied machine learning to address a complex business problem, such as predicting customer churn or forecasting product demand. Assuming a team is in place and the business case has been identified, where do you start? The first step is to collect the relevant data to train the machine learning (ML) algorithms. This is usually followed by selecting an appropriate algorithm or ensemble of algorithms. The right choice depends on the business goals (accuracy vs. interpretability), the category of the problem (regression or classification), the nature of the data (categorical or numerical), the desired outcome, and constraints such as computational resources, training time, and latency. Irrespective of the algorithm chosen, whether logistic regression, decision trees, boosting, or neural networks, there is one fundamental requirement: high-quality input data that captures relevant business hypotheses and historical patterns. Producing that input is the job of feature engineering (FE). Algorithms often get all the limelight, and many people believe they are the secret weapon in the AI battle. But it is FE that performs the magic behind machine learning.

FE is the process of applying domain knowledge to extract analytical representations from raw data, making it ready for machine learning. It involves the application of business knowledge, mathematics, and statistics to transform data into a format that can be directly consumed by machine learning models. It starts from many tables spread across disparate databases that are then joined, aggregated, and combined into a single flat table using statistical transformations and/or relational operations.
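
As a rough illustration of the join-and-aggregate step described above, the sketch below uses Python's built-in `sqlite3` module to turn two small tables into one flat table with one row per customer. The schema, the column names (`customers`, `orders`, `order_count`, `total_spend`, `last_order_date`), and the data are all invented for this example; a real project would start from many more tables.

```python
import sqlite3

# Hypothetical schema and data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, plan TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL, order_date TEXT);
    INSERT INTO customers VALUES (1, 'basic'), (2, 'premium'), (3, 'basic');
    INSERT INTO orders VALUES
        (1, 20.0, '2020-05-01'),
        (1, 35.0, '2020-05-20'),
        (2, 120.0, '2020-04-11');
""")

# Join the two tables and aggregate the orders so that each customer
# becomes exactly one row: the single flat table that a model consumes.
FLAT_TABLE_SQL = """
    SELECT c.customer_id,
           c.plan,
           COUNT(o.amount)            AS order_count,
           COALESCE(SUM(o.amount), 0) AS total_spend,
           MAX(o.order_date)          AS last_order_date
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.plan
    ORDER BY c.customer_id
"""
flat_table = conn.execute(FLAT_TABLE_SQL).fetchall()

for row in flat_table:
    print(row)
```

The `LEFT JOIN` keeps customers who have no orders, so their aggregated columns come back as zero or `NULL` rather than the row disappearing. Whether a missing value should be zero, `NULL`, or something else is itself a feature engineering decision.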

[Figure: Enterprise data to ML-ready data using AI-powered feature engineering]

Practical FE is far more complicated than simple transformation exercises such as One-Hot Encoding. To implement FE, you need to write hundreds or even thousands of SQL-like queries, performing a lot of data manipulation, as well as a multitude of statistical transformations.
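
For contrast, one-hot encoding itself fits in a few lines. The sketch below is a minimal plain-Python illustration; the function name `one_hot_encode` and the sample `plans` values are made up for this example.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row is a list of 0/1 indicators."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column that matches this value
        rows.append(row)
    return categories, rows


# Hypothetical categorical column.
plans = ["basic", "premium", "basic", "trial"]
categories, encoded = one_hot_encode(plans)

print(categories)
for row in encoded:
    print(row)
```

This transformation looks at one column of one table and needs no business context, which is why it is the easy part of FE.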

The Significance of Feature Engineering

ML is driven by algorithms, and algorithms depend on data. By studying historical data, you can detect patterns. Once you uncover a pattern, you can build a hypothesis. Based on the hypothesis, you can predict a likely outcome, such as which customers are likely to churn in a given time period. FE is all about finding the optimal combination of hypotheses.

FE is critical because if you provide the wrong hypotheses as input, ML cannot make accurate predictions. The quality of the features therefore matters for both accuracy and interpretability. FE is also the most iterative, time-consuming, and resource-intensive part of the process, and it calls for interdisciplinary expertise: technical knowledge but, more importantly, domain knowledge. The data science team builds features by working with domain experts, testing hypotheses, building and evaluating ML models, and repeating the process until the results are acceptable to the business.

Feature Engineering Automation

FE automation has vast potential to change the traditional data science process. It lowers skill barriers well beyond what ML automation alone achieves, eliminates hundreds or even thousands of manually crafted SQL queries, and speeds up data science projects even when the team lacks full domain knowledge. It also augments our data insights and surfaces “unknown unknowns” by exploring millions of feature hypotheses in just hours.
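
To give a feel for what exploring feature hypotheses can mean, the toy sketch below enumerates every combination of a few aggregation functions and time windows, computes each candidate feature, and ranks the candidates by absolute Pearson correlation with a churn label. It is an assumption-laden illustration, not a description of how any particular product works: the data, the names (`transactions`, `churned`, `AGGREGATIONS`, `WINDOWS_DAYS`), and the scoring rule are all invented here.

```python
from itertools import product
from math import sqrt

# Hypothetical toy data: (days_before_cutoff, amount) per customer.
transactions = {
    "a": [(2, 30.0), (10, 25.0), (40, 20.0)],
    "b": [(60, 50.0), (85, 45.0)],
    "c": [(1, 10.0), (5, 12.0), (20, 15.0), (70, 11.0)],
    "d": [(80, 200.0)],
}
churned = {"a": 0, "b": 1, "c": 0, "d": 1}  # 1 = customer churned

# The hypothesis space: every aggregation crossed with every window.
AGGREGATIONS = {
    "count": len,
    "sum": sum,
    "max": lambda amounts: max(amounts, default=0.0),
}
WINDOWS_DAYS = [7, 30, 90]


def feature_value(customer, agg_name, window):
    """Aggregate a customer's amounts over the last `window` days."""
    amounts = [amt for days, amt in transactions[customer] if days <= window]
    return float(AGGREGATIONS[agg_name](amounts))


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


customers = sorted(transactions)
labels = [churned[c] for c in customers]

ranked = []
for agg_name, window in product(AGGREGATIONS, WINDOWS_DAYS):
    values = [feature_value(c, agg_name, window) for c in customers]
    score = abs(pearson(values, labels))
    ranked.append((score, f"{agg_name}_amount_last_{window}d"))
ranked.sort(reverse=True)

for score, name in ranked:
    print(f"{score:.3f}  {name}")
```

Three aggregations and three windows already yield nine candidates. Adding more tables, columns, filters, and windows multiplies that count quickly, which is why manual enumeration does not scale. With only four customers the scores here mean nothing statistically; a real search would need far more data and a held-out validation step to avoid spurious features.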

Automated machine learning (AutoML) is attracting a lot of attention these days. AutoML tackles two of the critical challenges that organizations struggle with: the sheer length of AI and ML projects, which usually take months to complete, and the severe shortage of qualified talent to run them. While current AutoML products have undoubtedly made significant inroads in accelerating the AI and machine learning process, they fail to address the most significant step: preparing the input for machine learning from raw business data. In other words, feature engineering.

To create a genuine shift in how modern organizations leverage AI and machine learning, automation must cover the full cycle of data science development. If the problems that data science automation sets out to solve are a lack of data scientists, business users’ poor understanding of ML, and difficulties in migrating to production environments, then these are the challenges that AutoML must also resolve.

AutoML 2.0, which also automates data and feature engineering, combines FE automation and ML automation into a single pipeline, a one-stop shop. With AutoML 2.0, the full cycle from raw data through data and feature engineering to ML model development takes days, not months, and a team can deliver 10x more projects.

Summary

Contrary to popular belief, algorithms are not the most distinguishing element of applied machine learning. FE influences the performance and accuracy of ML models more than anything else. It helps reveal the hidden patterns in the data and increases the predictive power of machine learning. For ML algorithms to work properly, you need to provide input data that the algorithms can understand, which often involves complex mathematical transformations of raw data. FE delivers that input data in a single aggregated format optimized for ML. It is the secret sauce that enables AI/ML to do the magic. Whether the task is preventing fraud in financial services, detecting anomalies in manufacturing, or predicting customer churn for insurance companies, feature engineering is the most decisive factor for AI/ML success or failure.

  • Carl Bowen
  • Blog
  • December 7, 2018

A Vision of Rapid, High-Quality Data Analysis for All Businesses

Leveraging Data to be Competitive

It is becoming increasingly important for enterprises to leverage data to be competitive. Yet all businesses share three challenges in embracing data utilization. It requires:

  1. time,
  2. advanced skills, and
  3. expertise.

Together, these challenges make it difficult for enterprises to fully leverage their data for business growth.  Data analytics is not simply prediction by machine learning. Rather, it is a process involving many different steps, including:

  1. data preparation,
  2. feature engineering,
  3. machine learning,
  4. visualization, and
  5. model operationalization.

Until now, completing this process for just a single project would have taken months. Moreover, a wide variety of highly-skilled personnel are needed for each step – such as domain experts, data scientists, data engineers, and BI engineers.  Additionally, processes and outcomes have tended to be highly dependent on the experience and intuition of each individual.

Feature Engineering Made Easy

For feature engineering in particular, it has long been thought that this step can only be done by experts, as it requires deep domain knowledge. The results derived from machine learning have also tended to be a “black box”, so businesses often could not fully leverage them. For enterprises to benefit from the full utilization of their data, it is necessary to resolve these challenges and streamline data analysis and application.

dotData’s approach to data science solves these problems through AI and automation. The development of the dotData Platform stemmed from my experience in leading more than 100 data analysis projects at NEC, across a variety of industries.  What I found is that, no matter the industry, a common thought process could be applied on how to build the data analytics process.  From that experience, I was able to invent automated feature engineering.  This was previously the most time-consuming and manual step, requiring high levels of skill and domain knowledge.

The automation of feature engineering is core to dotData: we use AI to design hypotheses for features and to automate analytical processes that apply across industries, businesses, and data. Because we can automatically execute the data analysis process from data preparation through feature engineering and machine learning to model operationalization, this approach solves the time and skill-set challenges that have constrained data analytics until now. For example, a data analytics use case for a financial services customer, which previously required two or three months of work by data scientists, can now be done in less than a day, with equal or better accuracy.

Data Project Completions Increase

As it becomes possible to complete projects significantly faster, there will be an exponential increase in the number of experiments and the discoveries of new use cases.  In addition, our approach provides full transparency and interpretability where the basis for the derived results is apparent.   Therefore, it can easily be implemented in business operations with high confidence and accountability.

As data analytics becomes more efficient, enterprises can operationalize it as part of their everyday processes and accelerate their data-driven initiatives.  We have made it possible for all businesses to utilize AI and machine learning, and have in fact already achieved major results across a number of industries.

As data science automation is adopted, processes that once relied on people’s experience and intuition will instead be executed efficiently using data. As a result, enterprises of all types will be able to analyze data more efficiently. They can create better products and services and be more productive overall, while ultimately benefiting society as a whole.
