5 Common ML Challenges Data Scientists Face
E-SPIN
Friday, 19 January 2018 / Published in Global Themes and Feature Topics


1) Communication: unclear questions and outcome metrics
A fundamental challenge facing data scientists has nothing to do with ensemble algorithms, optimization methods, or computing power. Communication – prior to any analysis or data engineering – is crucial to solving an ML problem quickly and painlessly.

ML can answer a wide range of questions: it is an incredibly powerful tool for making sense of the world around us. However, these questions have to be specific and formally stated, in a way that may be unfamiliar to the people responsible for identifying the problem, such as management or marketing.

Questions as posed in a ‘real-world’ environment, while substantively useful for framing and approaching a business problem, are often too vague to translate directly into ML modeling. Because of this, it is crucial to communicate effectively between different branches within the organization: the ‘small’ question being solved by ML modeling has to match the ‘big’ question that constitutes the business problem itself.

2) Feature engineering: getting more information out of a data set
Feature engineering and feature selection are important parts of any ML task. Even with highly sophisticated estimation algorithms and powerful, cheap computing capabilities, the data scientist plays an important role in creating a model that is both accurate and efficient.

Significant time and energy can (and should!) be spent looking over the data itself to identify additional information that may be ‘hiding’ in the features already included. It may be, for example, that the difference between two values (such as the length of time between a customer’s most recent transaction and today) matters more to predictive accuracy than either value on its own.

This means that feature engineering is a combination of subject matter expertise and general intuition: skilled feature engineers can pull the maximum amount of useful information out of a given set of input data, giving an ML model the most informative data set possible to work with.
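
As a rough illustration of the ‘difference between two values’ idea, the sketch below derives a recency feature and a tenure feature from two raw dates. The field names (`first_purchase`, `last_purchase`), the sample customers, and the reference date are all hypothetical and chosen only for the example.

```python
from datetime import date


def add_recency_features(customers, as_of):
    """Return copies of the customer records with two derived features.

    The raw dates are rarely useful to a model on their own; the gaps
    between them (recency and tenure) often carry more signal.
    """
    enriched = []
    for row in customers:
        enriched.append({
            **row,
            # Days between the most recent transaction and the reference date.
            "days_since_last_purchase": (as_of - row["last_purchase"]).days,
            # Days between the first transaction and the reference date.
            "tenure_days": (as_of - row["first_purchase"]).days,
        })
    return enriched


# Hypothetical sample records, for illustration only.
customers = [
    {"id": 1, "first_purchase": date(2017, 1, 10), "last_purchase": date(2017, 12, 30)},
    {"id": 2, "first_purchase": date(2016, 6, 1), "last_purchase": date(2017, 3, 15)},
]

features = add_recency_features(customers, as_of=date(2018, 1, 19))
for row in features:
    print(row["id"], row["days_since_last_purchase"], row["tenure_days"])
```

Whether these particular features help depends on the problem; the point is only that the derived columns are computed from information already present in the data.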

3) Logistics: budgeting computational resources
Few things are more frustrating than putting in hours, days, or weeks of work on cleaning and preparing a data set for analysis, only to hit an ‘out of memory’ error when trying to build the finalized model. Budgeting computational resources for ML estimation can be tricky: over-budgeting on powerful computing systems can waste significant money, but under-budgeting can produce severe bottlenecks in model construction and deployment.

However, cloud computing has made computational pipelines far easier to scale. A platform like Amazon Web Services (AWS) allows larger virtual machines (or greater numbers of machines, if working in parallel) to be deployed quickly and at relatively low cost. This type of elastic-computing framework makes it much easier to budget appropriately when setting up an ML system, especially when working with very large data sets.
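
One simple way to avoid the ‘out of memory’ surprise is a back-of-the-envelope estimate before choosing a machine size. The sketch below assumes a dense numeric table stored as 8-byte floats, and the overhead factor of 3 (to allow for intermediate copies during model fitting) is an assumption rather than a measured value; real requirements vary by library and algorithm.

```python
def estimate_dense_memory_gib(n_rows, n_cols, bytes_per_value=8, overhead_factor=3.0):
    """Rough memory estimate, in GiB, for a dense numeric table.

    bytes_per_value=8 corresponds to 64-bit floats. overhead_factor is an
    assumed multiplier for working copies made during fitting; tune it to
    what is actually observed for a given toolchain.
    """
    raw_bytes = n_rows * n_cols * bytes_per_value
    return raw_bytes * overhead_factor / 2**30


def fits_in_memory(n_rows, n_cols, available_gib, **kwargs):
    """True if the estimated footprint fits within the available memory."""
    return estimate_dense_memory_gib(n_rows, n_cols, **kwargs) <= available_gib


# Hypothetical workload: 50 million rows, 40 numeric features.
needed = estimate_dense_memory_gib(50_000_000, 40)
print(f"Estimated footprint: {needed:.1f} GiB")
print("Fits on a 16 GiB machine:", fits_in_memory(50_000_000, 40, available_gib=16))
```

An estimate like this is only a starting point, but it is usually enough to tell whether a job belongs on a laptop or on a larger cloud instance.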

4) Generalizability: conflation of training and testing data sets
Particularly for those who are first getting into data science, this can be an easy step to miss, but it is incredibly important. ML models are built for estimation: their purpose is to take in new data and generate values that can be used to guide future decisions. Because of this, it is absolutely crucial to separate the ‘training’ data used to fit an ML model from the ‘testing’ data used to assess the model’s accuracy.

Failure to do some type of out-of-sample testing can result in a model that looks fantastic in terms of accuracy and fit statistics… and then fails miserably when faced with new, unfamiliar data. Generalizability is key to creating usable long-term ML solutions, and as such, models need to be tested on independent, out-of-sample data before being put into regular use. A solid rule of thumb is to hold back 20-25% of the original data set: this is testing data, and should be kept entirely separate from the 75-80% of data used to build the ML model itself.
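
The hold-out rule of thumb can be sketched in a few lines of plain Python. The function name and the `seed` parameter below are illustrative choices, not part of any particular library; most ML toolkits ship an equivalent helper.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold back a fraction of them as testing data."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                      # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows become the test set; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))                        # stand-in for 100 data records
train, test = train_test_split(rows, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

The model should be fit on `train` only; `test` is touched once, at the end, to estimate how the model will behave on unfamiliar data.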

5) Focusing on the little things: algorithm choice
The range of algorithms available for ML problem solving is astounding. Random forests, support vector machines, neural networks, Bayesian estimation methods: the list goes on (and on, and on). The question of which algorithm is best for a given ML problem, however, is often less impactful than we might think.

It’s true that some approaches, on some questions, will work better than others. In some cases, this difference can even be quite distinct. However, in my experience it’s been quite rare that one modeling approach will strictly dominate all other options in answering a given ML question.

A useful middle ground in selecting an algorithm, in my opinion, is to build a ‘stable’ of robust modeling approaches that can be built quickly and easily for day-to-day use. Running a battery of models on a given data set allows the data scientist to pick whatever approach has the greatest marginal gain on that particular data set.
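
The ‘battery of models’ idea might look something like the sketch below. To keep it self-contained, three deliberately simple models (a mean predictor, a median predictor, and a one-variable least-squares line) stand in for a real stable such as random forests or support vector machines, and the data set is synthetic. All function names here are hypothetical.

```python
import random
from statistics import mean, median


def fit_mean(xs, ys):
    m = mean(ys)
    return lambda x: m                         # always predicts the training mean


def fit_median(xs, ys):
    m = median(ys)
    return lambda x: m                         # always predicts the training median


def fit_linear(xs, ys):
    # Ordinary least squares for a single input variable.
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def mean_absolute_error(model, xs, ys):
    return mean(abs(model(x) - y) for x, y in zip(xs, ys))


def compare_models(candidates, train, test):
    """Fit every candidate on the training data and score it on the test data."""
    train_x, train_y = zip(*train)
    test_x, test_y = zip(*test)
    return {
        name: mean_absolute_error(fit(train_x, train_y), test_x, test_y)
        for name, fit in candidates.items()
    }


# Synthetic data: a noisy straight line, shuffled and split 80/20.
rng = random.Random(0)
data = [(x, 3.0 * x + 2.0 + rng.gauss(0, 1)) for x in range(100)]
rng.shuffle(data)
train, test = data[:80], data[80:]

candidates = {"mean": fit_mean, "median": fit_median, "linear": fit_linear}
scores = compare_models(candidates, train, test)
best = min(scores, key=scores.get)             # lowest out-of-sample error wins
print(best, scores)
```

Because every candidate shares the same fit-and-score interface, adding another model to the stable is a one-line change to the `candidates` dictionary.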

Going far afield for exotic new algorithms, or adopting a different programming language, is in my opinion rarely necessary or even worth the time.

Feel free to contact E-SPIN for solutions covering machine learning infrastructure and application security, and infrastructure availability and performance monitoring.

To learn more about machine learning, see the related articles below:

  1. Machine learning use cases for security
  2. How business can be benefit with machine learning
  3. Typical explanation between AI and ML
  4. Machine learning what it is & why it matters
Tagged under: machine learning

