Number of Instances for Reliable Feature Ranking in a Given Problem

Open Access | Jul 2018

Abstract

Background: In practical use of machine learning models, users may add new features to an existing classification model, reflecting their (changed) empirical understanding of a field. New features can potentially increase the classification accuracy of the model or improve its interpretability.

Objectives: We introduce a guideline for determining the sample size needed to reliably estimate the impact of a new feature.

Methods/Approach: Our approach is based on the feature evaluation measure ReliefF and bootstrap-based estimation of confidence intervals for feature ranks.

Results: We test our approach on real-world qualitative business-to-business sales forecasting data and two UCI data sets, one of which contains missing values. The results show that new features with a high or a low rank can be detected using a relatively small number of instances, but features ranked near the border of useful features need larger samples to determine their impact.

Conclusions: A combination of the feature evaluation measure ReliefF and bootstrap-based estimation of confidence intervals can be used to reliably estimate the impact of a new feature in a given problem.
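The abstract only outlines the method, so the following Python sketch illustrates the general idea: compute ReliefF feature scores on bootstrap resamples of the instances, rank the features in each resample, and take percentile confidence intervals over the ranks. It is a simplified sketch, not the authors' implementation: the ReliefF here handles only numeric features without missing values (a full implementation, e.g. one like scikit-rebate's, also handles nominal and missing values), and the function names `relieff_scores` and `rank_ci` and all parameter defaults are illustrative assumptions.

```python
import numpy as np

def relieff_scores(X, y, n_neighbors=10):
    # Simplified ReliefF for numeric features without missing values.
    # Sketch only; the paper's experiments use the full ReliefF measure.
    n, p = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xn = (X - X.min(axis=0)) / span            # scale features to [0, 1]
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    W = np.zeros(p)
    for i in range(n):                          # evaluate every instance (m = n)
        dist = np.abs(Xn - Xn[i]).sum(axis=1)   # Manhattan distance to all others
        for c in classes:
            idx = np.where(y == c)[0]
            idx = idx[idx != i]                 # never pick the instance itself
            if idx.size == 0:
                continue
            nearest = idx[np.argsort(dist[idx])[:n_neighbors]]
            diff = np.abs(Xn[nearest] - Xn[i]).mean(axis=0)
            if c == y[i]:
                W -= diff / n                   # near hits lower the weight
            else:                               # near misses raise it, weighted
                W += prior[c] / (1 - prior[y[i]]) * diff / n  # by class prior
    return W

def rank_ci(X, y, n_boot=200, alpha=0.05, seed=0):
    # Percentile bootstrap confidence intervals for feature ranks.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    ranks = np.empty((n_boot, p), dtype=int)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample instances with replacement
        scores = relieff_scores(X[idx], y[idx])
        order = np.argsort(-scores)             # rank 1 = highest ReliefF score
        r = np.empty(p, dtype=int)
        r[order] = np.arange(1, p + 1)
        ranks[b] = r
    lo = np.percentile(ranks, 100 * alpha / 2, axis=0)
    hi = np.percentile(ranks, 100 * (1 - alpha / 2), axis=0)
    return lo, hi

# Hypothetical usage on synthetic data (features 0 and 1 are informative):
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
lo, hi = rank_ci(X, y, n_boot=100)
for f in range(X.shape[1]):
    print(f"feature {f}: rank CI [{lo[f]:.0f}, {hi[f]:.0f}]")
```

In this setup, a feature whose rank interval is already narrow at a small sample size can be judged reliably, while a wide interval near the border of useful features signals that more instances are needed, which is the pattern the Results section describes.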

DOI: https://doi.org/10.2478/bsrj-2018-0017 | Journal eISSN: 1847-9375 | Journal ISSN: 1847-8344
Language: English
Page range: 35 - 44
Submitted on: Jan 31, 2018
Accepted on: Apr 21, 2018
Published on: Jul 28, 2018
Published by: IRENET - Society for Advancing Innovation and Research in Economy
In partnership with: Paradigm Publishing Services
Publication frequency: 2 issues per year

© 2018 Marko Bohanec, Mirjana Kljajić Borštnar, Marko Robnik-Šikonja, published by IRENET - Society for Advancing Innovation and Research in Economy
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.