SIGN IN YOUR ACCOUNT TO HAVE ACCESS TO DIFFERENT FEATURES

FORGOT YOUR PASSWORD?

FORGOT YOUR DETAILS?

AAH, WAIT, I REMEMBER NOW!
Need Help? Email [email protected]
  • LOGIN

E-SPIN Group

CONTACT US / GET A QUOTE
  • No products in cart.
  • HOME
  • PROFILE
    • Corporate Profile
    • About us
    • Customer Overview
    • Case Studies
    • Investor Relations
    • Procurement
  • GLOBAL THEMES
    • Artificial Intelligence (AI)
    • Big Data
    • Blockchain
    • Cloud Computing
    • Cognitive Computing
    • Cyber Security
    • DevSecOps
    • Digital Transformation (DT)
    • Modern Workplace
    • Internet of Things (IoT)
    • Quantum Computing
    • More theme and feature topics
  • SOLUTIONS
    • Application Lifecycle Management (ALM), DevSecOps/VSM, Application Security
      • Application Security
      • DevSecOps
      • Digital Forensics
      • Secure Development
    • Cybersecurity, Governance Risk Compliance (GRC) and Resiliency
      • Governance, Risk Management and Compliance (GRC)
      • Malware Analysis and Reverse Engineering
      • Security Information & Event Management (SIEM)
      • Security Configuration Management (SCM)
      • Threat and Vulnerability Management
      • Penetration Testing and Ethical Hacking
    • Modern Infrastructure, NetOps
      • Network Performance Monitoring and Diagnostics (NPMD)
      • IT Operations Management (ITOM)
      • Network Operation (NetOps)
      • Network Management System (NMS)
    • Modern Workspace & Future of Work
      • Digital Workspace
      • End User Computing (EUC)
      • User Activity Monitoring (UAM)
  • INDUSTRIES
    • Aerospace & Defense
    • Automotive
    • Banking & Financial Markets
    • Chemical & Petroleum
    • Commercial and Professional Services
    • Construction & Real Estate
    • Consumer Products
    • Education
    • Electronics
    • Energy & Utilities
    • Food & Beverage
    • Information Technology
    • Insurance
    • Healthcare
    • Goverment
    • Telecommunications
    • Transportation
    • Travel
    • Manufacturing
    • Media & Entertainment
    • Mining & Natural Resources
    • Life Sciences
    • Retail
  • PRODUCTS
    • Hidden Menu
      • Brand Overview
      • Services Overview
      • E-SPIN Product Line Card
      • E-SPIN Ecosystem World Solution Portfolio Overview
      • GitLab (DevOps, DevSecOps, VSM)
      • Hex-Rays (IDA Pro, Hex-Rays Decompiler)
      • Immunity (Canvas, Silica, Innuendo)
      • Parasoft (automated software testing, AppSec)
      • Tenable (Enterprise Vulnerability Management)
      • Veracode (Application Security Testing)
    • Cybersecurity, App Lifecycle, AppSec Management
      • Cerbero Labs (Cerbero Suite)
      • Core Security (Core Impact, Cobalt Strike)
      • HCL (AppScan, BigFix)
      • Invicti (Acunetix, Netsparker)
      • ImmuniWeb
      • Portswigger (Burp Suite Pro, Burp Suite Enterprise)
      • Titania (Nipper Studio)
      • TSFactory (User Activity Monitoring)
    • Infrastructure, Network, Wireless, Cloud Management
      • Metageek (Wi-Spy, Chanalyzer, Eye P.A.)
      • Progress (WhatsUp Gold, WS_FTP, MOVEit MFT)
      • Paessler
      • Solarwinds (IT Management)
      • TamoSoft (wireless site survey)
      • Visiwave (wireless site survey, traffic analysis)
      • VMware (Virtualization, cloud mgt, Digital Workspace)
    • Platform products
      • Adobe (Digital Media Creation)
      • McAfee
      • Micro Focus
      • Microsoft
      • Red Hat (Enterprise Linux, OpenStack, OpenShift, Ansible,JBoss)
      • SecHard
      • SUSE (Enterprise Linux, Rancher)
      • Show All The Brands and Products (Full)
  • e-STORE
    • e-STORE
    • eSTORE Guide
    • SUPPORT
  • CAREERS
    • Culture, Values and CSR
    • How We Hire
    • Job Openings
  • BLOG / NEWS
    • Blogs and News
    • Resources Library
    • Calendar of Events
  • CONTACT
  • Home
  • Global Themes and Feature Topics
  • 5 Common ML Challenges Data Scientists Face
5 Common ML Challenges Data Scientists Face
0
E-SPIN
Friday, 19 January 2018 / Published in Global Themes and Feature Topics

5 Common ML Challenges Data Scientists Face

1) Communication: Unclear questions and outcome metrics
A fundamental challenge facing data scientists has nothing to do with ensemble algorithms, optimization methods, or computing power. Communication – prior to any analysis or data engineering – is crucial to solving an ML problem quickly and painlessly.

There are many, many questions ML can solve: this is an incredibly powerful tool for making sense of the world around us. However, these questions have to be specific and formulaic in a way that the people responsible for identifying the problem, such as management or marketing, might be unfamiliar with.

Questions as posed in a ‘real-world’ environment, while substantively useful for framing and approaching a business problem, are often too vague to translate directly into ML modeling. Because of this, it is crucial to communicate effectively between different branches within the organization: the ‘small’ question being solved by ML modeling has to match the ‘big’ question that constitutes the business problem itself.

2) Feature engineering: getting more information out of a data set
Feature engineering and feature selection are important parts of any ML task. Even with highly sophisticated estimation algorithms and powerful, cheap computing capabilities, the data scientist plays an important role in creating a model that is both accurate and efficient.

Significant time and energy can (and should!) be spent on looking over the data itself to try and identify additional information that may be ‘hiding’ in the features already included. It may be, for example, that the difference between two values (for example, length of time since a customer’s most recent transaction) matters more to predictive accuracy than either of the values themselves.

This means that feature engineering is a combination of subject matter expertise and general intuition: skilled feature engineers can pull the maximum amount of useful information out of a given set of input data, giving an ML model the most informative data set possible to work with.

3) Logistics: budgeting computational resources
Few things are more frustrating than putting in hours, days, or weeks of work on cleaning and preparing a data set for analysis, only to hit an ‘out of memory’ error when trying to build the finalized model. Budgeting computational resources for ML estimation can be tricky: over-budgeting on powerful computing systems can waste significant money, but under-budgeting can produce severe bottlenecks in model construction and deployment.

However, cloud computing has taken dramatic steps towards making computational pipelines more expandable. Using a system like Amazon’s AWS allows for the deployment of larger virtual machines (or greater numbers of machines, if working in parallel) with relatively low cost and high speed. This type of elastic-computing framework makes it much, much easier to budget appropriately when setting up an ML system, especially when working with very large data sets.

4) Generalizability: Conflation of training and testing data sets
a. Particularly for those who are first getting into data science, this can be an easy step to miss, but it is incredibly important. ML models are built for estimation: their purpose is to intake new data and generate values that can be used to guide future decisions. Because of this, it is absolutely crucial to separate ‘training’ data that is used to fit an original ML model from ‘testing’ data that is used to assess the model’s accuracy.

Failure to do some type of out-of-sample testing can result in a model that looks fantastic in terms of accuracy and fit statistics… and then fails miserably when faced with new, unfamiliar data. Generalizability is key to creating usable long-term ML solutions, and as such, models need to be tested on independent, out-of-sample data before being put into regular use. A solid rule of thumb is to hold back 20-25% of the original data set: this is testing data, and should be kept entirely separate from the 75-80% of data used to build the ML model itself.

5) Focusing on the little things: algorithm choice
a. The range of algorithms available for ML problem solving is astounding. Random forests, support vector machines, neural networks, Bayesian estimation methods – the list goes on (and on, and on). The question of what algorithm is best for a given ML problem, however, is often less impactful than we might think.

It’s true that some approaches, on some questions, will work better than others. In some cases, this difference can even be quite distinct. However, in my experience it’s been quite rare that one modeling approach will strictly dominate all other options in answering a given ML question.

A useful middle ground in selecting an algorithm, in my opinion, is to build a ‘stable’ of robust modeling approaches that can be built quickly and easily for day-to-day use. Running a battery of models on a given data set allows the data scientist to pick whatever approach has the greatest marginal gain on that particular data set.

However, going far afield for exotic new algorithms, or adopting different programming languages, in my opinion, is rarely necessary or even worth the time.

Feel free to contact E-SPIN for machine learning infrastructure and application security, infrastructure availability and performance monitoring solution.

To know more about Machine Learning, please click on the link below:

  1. Machine learning use cases for security
  2. How business can be benefit with machine learning
  3. Typical explanation between AI and ML
  4. Machine learning what it is & why it matters
Tagged under: Machine Learning (ML)

What you can read next

Future of Work from next normal onward
What is low-code development platform (LCDP)
If can not dominate cryptocurrencies ban it

1 Comment to “ 5 Common ML Challenges Data Scientists Face”

  1. Dorai M says :Reply
    May 3, 2020 at 4:41 pm

    good into about 5 Common ML Challenges

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Recent Posts

  • Next normal outlook from global competitiveness report

    The World Economic Forum publishes a yearly Glo...
  • Rise of home robots for household needs

    Household needs is one area where it consumes a...
  • VMware SASE Product Overview Webinar

    VMware SASE Product Overview Webinar is a routi...
  • From food crisis to export ban in next normal

    From food crisis to export ban in next normal u...
  • Innovations Shaping the Future of Transportation Industry

    Post-pandemic restructuring to survive in Next normal

    If you observe the global news, beside the COVI...

Recent Comments

  • JEAN ARIANE H. EVANGELISTA on E-SPIN Wishes all Filipino Araw ng Kagitingan 2022
  • Ira Camille Arellano on E-SPIN Wishes all Filipino Araw ng Kagitingan 2022
  • NKIRU OKEKE on Top 5 Challenges in the Consumer Products Industry
  • Md Abul Quashem on Types of Online Banking or E-Banking
  • Atalay marie on What is Cybersecurity Mesh ?

Archives

  • May 2022
  • April 2022
  • March 2022
  • February 2022
  • January 2022
  • December 2021
  • November 2021
  • October 2021
  • September 2021
  • August 2021
  • July 2021
  • June 2021
  • May 2021
  • April 2021
  • March 2021
  • February 2021
  • January 2021
  • December 2020
  • November 2020
  • October 2020
  • September 2020
  • August 2020
  • July 2020
  • June 2020
  • May 2020
  • April 2020
  • March 2020
  • February 2020
  • January 2020
  • December 2019
  • November 2019
  • October 2019
  • September 2019
  • August 2019
  • July 2019
  • June 2019
  • May 2019
  • April 2019
  • March 2019
  • February 2019
  • January 2019
  • December 2018
  • November 2018
  • October 2018
  • September 2018
  • August 2018
  • July 2018
  • June 2018
  • May 2018
  • April 2018
  • March 2018
  • February 2018
  • January 2018
  • December 2017
  • November 2017
  • October 2017
  • September 2017
  • August 2017
  • July 2017
  • June 2017
  • May 2017
  • March 2017
  • January 2017
  • December 2016
  • November 2016
  • October 2016
  • September 2016
  • August 2016
  • July 2016
  • June 2016
  • May 2016
  • April 2016
  • March 2016
  • February 2016
  • January 2016
  • December 2015
  • November 2015
  • October 2015
  • September 2015
  • August 2015
  • July 2015
  • June 2015
  • January 2015
  • December 2014
  • October 2014
  • September 2014
  • July 2014
  • June 2014
  • May 2014
  • April 2014
  • March 2014
  • February 2014
  • January 2014
  • December 2013
  • November 2013
  • October 2013
  • September 2013
  • July 2013
  • June 2013
  • May 2013
  • April 2013
  • March 2013
  • February 2013
  • January 2013
  • December 2012
  • November 2012
  • October 2012
  • September 2012
  • August 2012
  • July 2012
  • June 2012
  • May 2012
  • February 2012
  • July 2011
  • June 2011
  • February 2009
  • July 2008

Categories

  • Acunetix
  • Adobe
  • Aerospace and Defence
  • AppSec Labs
  • Automotive
  • Banking and Financial Markets
  • BeyondTrust
  • Brand
  • Case Studies
  • Cerbero Labs
  • Chemical and petroleum
  • Codified Security
  • Commercial and Professional Services
  • Construction and Real Estate
  • Consumer products
  • Contact Us
  • Core Impact
  • Core Security
  • DBeaver
  • DefenseCode
  • DSquare Security
  • DSquare Security
  • E-Lock
  • Education
  • Electronics
  • Energy and utilities
  • FAQ
  • Food and Beverage (F&B)
  • GFI
  • GitLab
  • Global Themes and Feature Topics
  • Government
  • HCL
  • Healthcare
  • Hex-Rays
  • IBM
  • Immunity
  • ImmuniWeb
  • Industries
  • Information Technology
  • Insurance
  • Invicti
  • Ipswitch
  • Job
  • Life Science
  • LiveAction
  • Logpoint
  • Magnet forensics
  • Manufacturing
  • McAfee
  • Media and Entertainment
  • Metageek
  • Micro Focus
  • Microsoft
  • Mining and Natural Resources
  • Nessus
  • Netsparker
  • News
  • Nutanix
  • Paessler
  • Parasoft
  • PECB
  • PortSwigger
  • Pradeo
  • Product
  • Progress
  • Qualys
  • Rapid7
  • RedHat
  • Retail
  • Retina
  • Riverbed
  • RSA
  • SecHard
  • Security Innovation
  • Security Roots
  • Services
  • SILICA
  • Soft Activity
  • SolarWinds
  • Solution
  • SUSE
  • Symantec
  • TamoSoft
  • Telecommunications
  • Tenable
  • Titania
  • Transportation
  • Travel
  • Trend Micro
  • Trustwave
  • TSFactory
  • Uncategorized
  • Vandyke
  • Veracode
  • Videos
  • VisiWave
  • VMware
  • Webinar Archive

Meta

  • Log in
  • Entries feed
  • Comments feed
  • WordPress.org

CORPORATE

  • Profile
  • About us
  • Investor Relations
  • Procurement

SOLUTIONS & PRODUCTS

  • Industries
  • Solutions
  • Products
  • Brand Overview
  • Services
  • Case Studies

STORE & SUPPORT

  • Shop
  • Cart
  • Checkout
  • My Account
  • Support

PRODUCTS & SERVICES

  • Industries
  • Solutions
  • Products
  • Brand Overview
  • Services
  • Case Studies

FOLLOW US

  • Facebook
  • Twitter
  • Pinterest
  • LinkedIn
  • YouTube
  • WordPress Blog
© 2005 - 2022 E-SPIN Group of Companies | All rights reserved.
E-SPIN refers to the global organisation, and may refer to one or more of the member firms of E-SPIN Group of Companies, each of which is a separate legal entity.
  • Contact
  • Privacy
  • Terms of use
TOP