Demystifying Feature Engineering for Machine Learning

  • Sachin Andhare
  • Blog
  • June 4, 2020

What is Feature Engineering

Suppose you are using applied machine learning to address a complex business problem, such as predicting customer churn or forecasting product demand. Assuming a team is in place and the business case has been identified, where do you start? The first step is to collect the relevant data to train the machine learning (ML) algorithms. This is usually followed by selecting the appropriate algorithm or ensemble of algorithms. The right choice depends on the business goals (accuracy vs. interpretability), the category of the problem (regression or classification), the nature of the data (categorical or numerical), the desired outcome, and constraints such as computational resources, training time, and latency.

Irrespective of the algorithm chosen, whether logistic regression, decision trees, boosting, or neural networks, there is one fundamental requirement: high-quality input data that captures relevant business hypotheses and historical patterns. Producing that input is the job of Feature Engineering (FE). Algorithms often get all the limelight, and many people believe they are the secret weapon in the AI battle. But it is FE that performs the magic behind machine learning.

FE is the process of applying domain knowledge to extract analytical representations from raw data, making it ready for machine learning. It involves the application of business knowledge, mathematics, and statistics to transform data into a format that can be directly consumed by machine learning models. It starts from many tables spread across disparate databases that are then joined, aggregated, and combined into a single flat table using statistical transformations and/or relational operations.
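
To make the "many tables into a single flat table" idea concrete, here is a minimal sketch in plain Python. The table names, columns, and values are hypothetical and exist only for illustration; a real project would typically do this with SQL or a dataframe library over far more tables.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw tables, represented as lists of dicts (illustrative data only).
customers = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "business"},
    {"customer_id": 3, "segment": "retail"},
]
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 100.0},
]


def build_flat_table(customers, transactions):
    """Join transactions to customers and aggregate to one row per customer."""
    # Group transaction amounts by customer (the relational "join" key).
    amounts = defaultdict(list)
    for tx in transactions:
        amounts[tx["customer_id"]].append(tx["amount"])

    flat = []
    for customer in customers:
        values = amounts.get(customer["customer_id"], [])
        flat.append({
            "customer_id": customer["customer_id"],
            "segment": customer["segment"],
            # Aggregated features; customers with no transactions get zeros.
            "tx_count": len(values),
            "tx_total": sum(values),
            "tx_mean": mean(values) if values else 0.0,
        })
    return flat


flat_table = build_flat_table(customers, transactions)
for row in flat_table:
    print(row)
```

Each row of `flat_table` describes one customer, which is the shape most ML algorithms expect as input.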

[Figure: Enterprise data to ML-ready data using AI-powered Feature Engineering]

Practical FE is far more complicated than simple transformation exercises such as One-Hot Encoding. Implementing FE can require writing hundreds or even thousands of SQL-like queries that perform extensive data manipulation, along with a multitude of statistical transformations.
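
For reference, the kind of simple transformation mentioned above can be sketched in a few lines. This is an illustrative one-hot encoder in plain Python; the function name and the sample rows are assumptions, not part of any particular product or library.

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical input rows (illustrative data only).
rows = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "business"},
]
encoded_rows = one_hot_encode(rows, "segment")
for row in encoded_rows:
    print(row)
```

A transformation like this touches a single column of a single table, whereas practical FE has to combine and aggregate data across many related tables.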

The Significance of Feature Engineering

ML is driven by algorithms, and algorithms depend on data. By studying historical data, you can detect patterns. Once you uncover a pattern, you can build a hypothesis, and from that hypothesis you can predict a likely outcome, such as which customers are likely to churn in a given time period. FE is all about finding the optimal combination of hypotheses.

FE is critical because if you provide the wrong hypotheses as input, ML cannot make accurate predictions. The quality of the hypotheses provided is vital to the success of an ML model, and feature quality matters for both accuracy and interpretability. FE is the most iterative, time-consuming, and resource-intensive part of the process, and it involves interdisciplinary expertise. It requires technical knowledge but, more importantly, domain knowledge. The data science team builds features by working with domain experts, testing hypotheses, building and evaluating ML models, and repeating the process until the results are acceptable to the business.

Feature Engineering Automation

FE automation has vast potential to change the traditional data science process. It lowers skill barriers significantly more than ML automation alone, eliminates hundreds or even thousands of manually crafted SQL queries, and speeds up data science projects even when full domain knowledge is not available. It also augments our data insights and surfaces “unknown unknowns” through its ability to explore millions of feature hypotheses in just hours.

These days, automated machine learning (AutoML) is attracting a lot of attention. AutoML tackles two of the critical challenges that organizations struggle with: the sheer length of AI and ML projects, which usually take months to complete, and the severe lack of qualified talent available to handle them. While current AutoML products have undoubtedly made significant inroads in accelerating the AI and machine learning process, they fail to address the most significant step: preparing the input for machine learning from raw business data, in other words, feature engineering.

To create a genuine shift in how modern organizations leverage AI and machine learning, automation must cover the full cycle of data science development. If the problems at the heart of data science automation stem from a lack of data scientists, business users' poor understanding of ML, and difficulties in migrating to production environments, then these are the challenges that AutoML must also resolve.

AutoML 2.0, which automates data and feature engineering, combines FE automation and ML automation into a single, one-stop pipeline. With AutoML 2.0, the full cycle from raw data through data and feature engineering to ML model development takes days, not months, and a team can deliver 10x more projects.

Summary

Contrary to popular belief, algorithms are not the most distinguishing element of applied machine learning. FE influences the performance and accuracy of ML models more than anything else. It helps reveal the hidden patterns in the data and increases the predictive power of machine learning. For ML algorithms to work properly, you need to provide the right input data in a form that algorithms can understand. Often this involves complex mathematical transformations of raw data. FE delivers that input data in a single aggregated format optimized for ML. It is the secret sauce that enables AI/ML to do the magic. Whether it is preventing fraud in financial services, detecting anomalies in manufacturing, or predicting customer churn for insurance companies, feature engineering is the most decisive factor in AI/ML success or failure.

  • Carl Bowen
  • Blog
  • October 9, 2019

What Is Feature Engineering? (And Why Do We Need to Automate It?)

The past few years have seen a rapid rise in the adoption of Artificial Intelligence (AI) and Machine Learning (ML) for a multitude of commercial use cases. Beyond the “cute” factor of AI that can pick a cat out of a photo array, AI and machine learning are being deployed to model and predict lending risk, understand and manage customer churn, provide product recommendations, help with programmatic advertising, and much more.

The challenge for the business community is that the practice at the heart of AI and machine learning, data science, is rooted in a complex world of statistical analysis, data manipulation, programming, and more. Most businesses don’t have enough data scientists, a fact illustrated by 2018 research from LinkedIn showing a shortfall of over 150,000 people with data science skills in the US alone. The data science process is complex and involves multiple distinct phases, as illustrated below. A typical data science project can take months to complete, with the most complex part being feature engineering.

[Figure: Traditional feature engineering]

What IS Feature Engineering?

Surprisingly, even in our daily conversations with clients, we find that there is often some confusion about what the term “feature engineering” actually means. What exactly is feature engineering? What are the steps of the process, and why does it take so long? What can we do to accelerate it? At the most basic level, feature engineering consists of three distinct steps:

  1. Feature ideation 
  2. Feature selection
  3. Feature creation

The first two steps in the process, feature ideation and feature selection, often require a high degree of “domain knowledge.” Domain knowledge refers to knowledge of the underlying business requirements that must be addressed. For example, a bank might employ a team of business analysts and data analysts to work with the data science team to consider “features” that might be useful in predicting whether a client is likely to convert on a “zero balance” transfer offer for a new credit card. During this phase, extensive data analysis is required to understand which data sources, tables, and columns might be used to create the “features” that will then be tested in the next phase.

Feature creation and testing are the next part of the process. During this phase, data scientists collaborate with business analysts and data engineers to create flat tables that combine data from multiple related tables into a single “feature table.” For example, the bank in our previous example might take data from its web tracking system, its customer records, and other data sources to create a single table describing individual prospective clients, which a machine learning model could then use to predict the likelihood of each consumer accepting an offer. Each feature that is created must then be evaluated against machine learning models to identify which feature/model combinations provide the best possible outcome.

Why Automate Feature Engineering?

Clearly, the process of feature engineering can be lengthy and resource-intensive. Most organizations simply don’t have enough talent or time to evaluate all possible use cases and all possible permutations and combinations of tables and columns of data. Automated Feature Engineering can provide a huge benefit to businesses that aim to leverage AI and ML models. The term “automated feature engineering,” however, can mean different things depending on which vendor you are evaluating. For most providers of Automated Machine Learning (AutoML) software, “automated feature engineering” describes the process of evaluating which features, built manually using the process described above, will be most beneficial for any given machine learning model. True Automated Feature Engineering, however, leverages Artificial Intelligence (AI) to create and evaluate features automatically. This is why at dotData we talk about discovering the “unknown unknowns” using Automated Feature Engineering. By automating the entire feature-building process, you can build and evaluate hundreds of thousands, potentially even millions, of features automatically, expose only the ones that pass a specific threshold, and provide data scientists with a wealth of additional features that they may never have considered.

To be specific, Automated Feature Engineering is not a replacement for manual feature creation and evaluation. Instead, it can provide two significant benefits: rapid prototyping and feature augmentation. With rapid prototyping, data scientists use Automated Feature Engineering to accelerate the trial and error that is often associated with feature engineering. Feature augmentation, on the other hand, is the process of using Automated Feature Engineering to create additional features that the data scientists, business analysts, and data engineers might never have considered.
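
The generate, score, and filter loop described in this section can be sketched roughly as follows. This is a toy illustration under stated assumptions, not dotData's method: the data is made up, candidate features are simple aggregations of each column, and the score is the absolute Pearson correlation with the label, chosen here only because it is easy to compute.

```python
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Candidate aggregations (an assumed, deliberately small search space).
AGGREGATIONS = {
    "sum": sum,
    "mean": lambda values: sum(values) / len(values),
    "max": max,
    "min": min,
    "count": len,
}

# Hypothetical per-customer event history and churn labels (illustrative only).
history = {
    "c1": {"amount": [10, 12, 11], "logins": [5, 6, 7]},
    "c2": {"amount": [50, 55], "logins": [1, 0]},
    "c3": {"amount": [9, 8, 10, 9], "logins": [6, 5, 7, 6]},
    "c4": {"amount": [60, 58, 61], "logins": [0, 1, 0]},
}
churned = {"c1": 0, "c2": 1, "c3": 0, "c4": 1}


def generate_and_filter(history, target, threshold):
    """Build every aggregation-of-column feature and keep those scoring >= threshold."""
    ids = sorted(history)
    columns = sorted(next(iter(history.values())))
    labels = [target[i] for i in ids]
    kept = {}
    for column, (agg_name, agg) in product(columns, AGGREGATIONS.items()):
        values = [agg(history[i][column]) for i in ids]
        score = abs(pearson(values, labels))
        if score >= threshold:
            kept[f"{agg_name}({column})"] = round(score, 3)
    return kept


selected = generate_and_filter(history, churned, threshold=0.95)
for name, score in selected.items():
    print(name, score)
```

Only candidates that clear the threshold are surfaced, which mirrors the idea of exposing a small set of promising features out of a much larger automatically generated pool. A real system would search a far larger space and would score features against actual models rather than a single correlation.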

From Months to Days

What are the benefits of Automated Feature Engineering? By far the most valuable is speed. Many dotData clients have used the Automated Feature Engineering capabilities of our dotData Enterprise or dotData Py platforms to accelerate their data science processes, often delivering in days what traditionally took five months or longer. With the exponential growth in demand for AI and ML use cases and the limited availability of data science resources, Automated Feature Engineering, as part of an effective AutoML platform, can help businesses dramatically increase the number of AI and ML projects that are executed and successfully brought into production.

Learn more about our platform and about Automated Feature Engineering by visiting our website.
