Return on Investment in Machine Learning: Crossing the Chasm between Academia and Business

Open Access | Dec 2020

References

  1. [1] Algorithmia. algorithmia.com/product. Accessed: 2020-11-30.
  2. [2] Amazon ground truth. https://amzn.to/3g0AGqf. Accessed: 2020-11-30.
  3. [3] Amazon mechanical turk. www.mturk.com. Accessed: 2020-11-30.
  4. [4] Amazon open data. registry.opendata.aws. Accessed: 2020-11-30.
  5. [5] Breaking the 80/20 rule: How data catalogs transform data scientists’ productivity. ibm.co/2QhuGxK. Accessed: 2020-11-30.
  6. [6] Cleaning big data: Most time-consuming, least enjoyable data science task, survey says. bit.ly/3d3M0zQ. Accessed: 2020-11-30.
  7. [7] Cookiecutter data science. https://bit.ly/3msmp8j. Accessed: 2020-11-30.
  8. [8] Data gov. www.data.gov. Accessed: 2020-11-30.
  9. [9] Data usa. datausa.io. Accessed: 2020-11-30.
  10. [10] Data version control. dvc.org. Accessed: 2020-11-30.
  11. [11] Figure eight. www.figure-eight.com. Accessed: 2020-11-30.
  12. [12] Flair. github.com/flairNLP/flair. Accessed: 2020-11-30.
  13. [13] Ganlab. poloclub.github.io/ganlab. Accessed: 2020-11-30.
  14. [14] Google dataset search. https://bit.ly/2JucsbU. Accessed: 2020-11-30.
  15. [15] Huggingface. https://bit.ly/39vUDE4. Accessed: 2020-11-30.
  16. [16] Jupyanno. github.com/chestrays/jupyanno. Accessed: 2020-11-30.
  17. [17] Kaggle datasets. www.kaggle.com/datasets. Accessed: 2020-11-30.
  18. [18] Kedro. github.com/quantumblacklabs/kedro. Accessed: 2020-11-30.
  19. [19] Metaflow. metaflow.org. Accessed: 2020-11-30.
  20. [20] Mlflow. mlflow.org. Accessed: 2020-11-30.
  21. [21] Neptune. neptune.ml. Accessed: 2020-11-30.
  22. [22] Network repository. networkrepository.com. Accessed: 2020-11-30.
  23. [23] Notion: All-in-one workplace. www.notion.so. Accessed: 2020-11-30.
  24. [24] Pigeon. github.com/agermanidis/pigeon. Accessed: 2020-11-30.
  25. [25] Sagemaker. aws.amazon.com/sagemaker. Accessed: 2020-11-30.
  26. [26] The staggering cost of training sota ai models. bit.ly/39O20nL. Accessed: 2020-11-30.
  27. [27] Tensorboard. www.tensorflow.org/tensorboard. Accessed: 2020-11-30.
  28. [28] Tf serving. github.com/tensorflow/serving. Accessed: 2020-11-30. doi:10.1002/ntlf.30266
  29. [29] Tldrlegal - software licenses explained in plain english. tldrlegal.com/. Accessed: 2020-11-30.
  30. [30] Uci ml repository. archive.ics.uci.edu/ml. Accessed: 2020-11-30.
  31. [31] Visdom. github.com/facebookresearch/visdom. Accessed: 2020-11-30.
  32. [32] Weights & biases. www.wandb.com. Accessed: 2020-11-30.
  33. [33] Yellowbrick. www.scikit-yb.org. Accessed: 2020-11-30.
  34. [34] Fair crowd work: Shedding light on the real work of crowd-, platform-, and app-based work. http://faircrowd.work/platform-reviews, 2018. Accessed: 2020-11-30.
  35. [35] Amershi S., Begel A., Bird C., DeLine R., Gall H., Kamar E., Nagappan N., Nushi B., and Zimmermann T. Software engineering for machine learning: A case study. In 41st IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice, pages 291–300. IEEE, 2019. doi:10.1109/ICSE-SEIP.2019.00042
  36. [36] Badene S., Thompson K., Lorré J.-P., and Asher N. Weak supervision for learning discourse structure. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2296–2305, 2019. doi:10.18653/v1/D19-1234
  37. [37] Bernardi L., Mavridis T., and Estevez P. 150 successful machine learning models: 6 lessons learned at booking.com. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1743–1751, 2019. doi:10.1145/3292500.3330744
  38. [38] Bloice M. D., Roth P. M., and Holzinger A. Biomedical image augmentation using augmentor. Bioinformatics, 35(21):4522–4524, 2019. doi:10.1093/bioinformatics/btz259. PMID: 30989173
  39. [39] Bolukbasi T., Chang K.-W., Zou J., Saligrama V., and Kalai A. Man is to computer programmer as woman is to homemaker? debiasing word embeddings, 2016.
  40. [40] Bosch J., Crnkovic I., and Olsson H. H. Engineering ai systems: A research agenda, 2020. doi:10.4018/978-1-7998-5101-1.ch001
  41. [41] Breck E., Cai S., Nielsen E., Salib M., and Sculley D. What's your ml test score? a rubric for ml production systems. In Reliable Machine Learning in the Wild - NIPS 2016 Workshop, 2016.
  42. [42] Breiman L. Random forests. Machine Learning, 45(1):5–32, 2001. doi:10.1023/A:1010933404324
  43. [43] Brown T. B., Mann B., Ryder N., Subbiah M., Kaplan J., Dhariwal P., Neelakantan A., Shyam P., Sastry G., Askell A., Agarwal S., Herbert-Voss A., Krueger G., Henighan T., Child R., Ramesh A., Ziegler D. M., Wu J., Winter C., Hesse C., Chen M., Sigler E., Litwin M., Gray S., Chess B., Clark J., Berner C., McCandlish S., Radford A., Sutskever I., and Amodei D. Language models are few-shot learners, 2020.
  44. [44] Buslaev A., Iglovikov V. I., Khvedchenya E., Parinov A., Druzhinin M., and Kalinin A. A. Albumentations: fast and flexible image augmentations. Information, 11(2):125, 2020. doi:10.3390/info11020125
  45. [45] Byun T. and Rayadurgam S. Manifold for machine learning assurance. Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 06 2020. doi:10.1145/3377816.3381734
  46. [46] Dehghani M., Severyn A., Rothe S., and Kamps J. Learning to learn from weak supervision by full supervision, 2017.
  47. [47] Demir S., Eniser H. F., and Sen A. Deepsmartfuzzer: Reward guided test generation for deep learning, 2019.
  48. [48] Devlin J., Chang M.-W., Lee K., and Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
  49. [49] Dignum V. Responsible artificial intelligence: designing ai for human values. The ITU Journal on Future and Evolving Technologies, 2017.
  50. [50] Dingwall N. and Potts C. Mittens: An extension of glove for learning domain-specialized representations, 2018. doi:10.18653/v1/N18-2034
  51. [51] Doersch C. and Zisserman A. Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2051–2060, 2017. doi:10.1109/ICCV.2017.226
  52. [52] Drozdowski M., Kowalski D., Mizgajski J., Mokwa D., and Pawlak G. Mind the gap: a heuristic study of subway tours. Journal of Heuristics, 20(5):561–587, 2014. doi:10.1007/s10732-014-9252-3
  53. [53] Fayyad U., Piatetsky-Shapiro G., and Smyth P. From data mining to knowledge discovery in databases. AI magazine, 17(3):37–54, 1996.
  54. [54] Gagolewski M. et al. Benchmark suite for clustering algorithms – version 1, 2020.
  55. [55] Garcia M. Racist in the machine: The disturbing implications of algorithmic bias. World Policy Journal, 33(4):111–117, 2016. doi:10.1215/07402775-3813015
  56. [56] Gerasimou S., Eniser H. F., Sen A., and Cakan A. Importance-driven deep learning system testing, 2020. doi:10.1145/3377811.3380391
  57. [57] Gilyazev R. and Turdakov D. Y. Active learning and crowdsourcing: A survey of optimization methods for data labeling. Programming and Computer Software, 44(6):476–491, 2018. doi:10.1134/S0361768818060142
  58. [58] Gofman M. and Jin Z. Artificial intelligence, human capital, and innovation. Human Capital, and Innovation (August 20, 2019), 2019. doi:10.2139/ssrn.3448116
  59. [59] Gong C., Zhang H., Yang J., and Tao D. Learning with inadequate and incorrect supervision. In 2017 IEEE International Conference on Data Mining (ICDM), pages 889–894. IEEE, 2017. doi:10.1109/ICDM.2017.110
  60. [60] Gottgtroy P. Ontology driven knowledge discovery process: a proposal to integrate ontology engineering and kdd. PACIS 2007 Proceedings, page 88, 2007.
  61. [61] Hadiwinoto C. and Ng H. T. Upping the ante: Towards a better benchmark for chinese-to-english machine translation, 2018.
  62. [62] Hager G. D., Drobnis A., Fang F., Ghani R., Greenwald A., Lyons T., Parkes D. C., Schultz J., Saria S., Smith S. F., and Tambe M. Artificial intelligence for social good, 2019.
  63. [63] Hand D. J. and Khan S. Validating and verifying ai systems. Patterns, 1(3):100037, 2020. doi:10.1016/j.patter.2020.100037. PMCID: PMC7660449. PMID: 33205105
  64. [64] Hao D., Zhang L., Sumkin J., Mohamed A., and Wu S. Inaccurate labels in weakly supervised deep learning: Automatic identification and correction and their impact on classification performance. IEEE Journal of Biomedical and Health Informatics, 2020. doi:10.1109/JBHI.2020.2974425. PMCID: PMC7429345. PMID: 32078570
  65. [65] Hendrycks D., Mazeika M., Kadavath S., and Song D. Using self-supervised learning can improve model robustness and uncertainty. In Advances in Neural Information Processing Systems, pages 15663–15674, 2019.
  66. [66] Ho D., Liang E., Chen X., Stoica I., and Abbeel P. Population based augmentation: Efficient learning of augmentation policy schedules. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2731–2741, Long Beach, California, USA, 06 2019. PMLR.
  67. [67] Honnibal M. and Montani I. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. 2017.
  68. [68] Howard A. and Borenstein J. The ugly truth about ourselves and our robot creations: The problem of bias and social inequity. Science and engineering ethics, 24(5):1521–1536, 2018. doi:10.1007/s11948-017-9975-2
  69. [69] Howard A. G., Zhu M., Chen B., Kalenichenko D., Wang W., Weyand T., Andreetto M., and Adam H. Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017.
  70. [70] Howard J. and Ruder S. Universal language model fine-tuning for text classification, 2018. doi:10.18653/v1/P18-1031
  71. [71] Inmon B. Data Lake Architecture: Designing the Data Lake and avoiding the garbage dump. Technics Publications, 2016.
  72. [72] Inmon W. H. Building the data warehouse. John Wiley & Sons, 2005.
  73. [73] Kessler J. S. Scattertext: a browser-based tool for visualizing how corpora differ, 2017. doi:10.18653/v1/P17-4015
  74. [74] Kim M., Zimmermann T., DeLine R., and Begel A. Data scientists in software teams: State of the art and challenges. IEEE Transactions on Software Engineering, 44(11):1024–1038, 2017. doi:10.1109/TSE.2017.2754374
  75. [75] Krizhevsky A., Sutskever I., and Hinton G. E. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017. doi:10.1145/3065386
  76. [76] Lacoste A., Luccioni A., Schmidt V., and Dandres T. Quantifying the carbon emissions of machine learning, 2019.
  77. [77] Lazer D., Kennedy R., King G., and Vespignani A. The parable of google flu: traps in big data analysis. Science, 343(6176):1203–1205, 2014. doi:10.1126/science.1248506
  78. [78] Ledwich M. and Zaitsev A. Algorithmic extremism: Examining youtube's rabbit hole of radicalization, 2019. doi:10.5210/fm.v25i3.10419
  79. [79] Lee S. W. and Kerschberg L. A methodology and life cycle model for data mining and knowledge discovery in precision agriculture. In SMC’98 Conference Proceedings. 1998 IEEE International Conference on Systems, Man, and Cybernetics, volume 3, pages 2882–2887, 1998. doi:10.1109/ICSMC.1998.725100
  80. [80] Ma E. NLP augmentation. github.com/makcedward/nlpaug, 2019.
  81. [81] Matignon R. Data mining using SAS enterprise miner, volume 638. John Wiley & Sons, 2007. doi:10.1002/9780470171431
  82. [82] Mikolov T., Chen K., Corrado G., and Dean J. Efficient estimation of word representations in vector space, 2013.
  83. [83] Mizgajski J., Szymczak A., Żelasko P., Morzy M., Augustyniak Ł., and Szymański P. Return of investment in machine learning: Crossing the chasm between academia and business. In Proceedings of the PP-RAI’2019 Conference: Polskie Porozumienie na Rzecz Rozwoju Sztucznej Inteligencji, pages 285–291. Wroclaw University of Science and Technology, 2019.
  84. [84] Montani I. and Honnibal M. Prodigy: A new annotation tool for radically efficient machine teaching. prodi.gy/, 2018. Accessed: 2020-11-30.
  85. [85] Nakayama H., Kubo T., Kamura J., Taniguchi Y., and Liang X. Doccano: Text annotation tool for human. github.com/doccano/doccano, 2018. Accessed: 2020-11-30.
  86. [86] Obermeyer Z. and Mullainathan S. Dissecting racial bias in an algorithm that guides health decisions for 70 million people. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 89–89, 2019. doi:10.1145/3287560.3287593
  87. [87] Paszke A., Gross S., Massa F., Lerer A., Bradbury J., Chanan G., Killeen T., Lin Z., Gimelshein N., Antiga L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32:8026–8037, 2019.
  88. [88] Pennington J., Socher R., and Manning C. D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014. doi:10.3115/v1/D14-1162
  89. [89] Purdy M. and Daugherty P. Why artificial intelligence is the future of growth, 2016.
  90. [90] Pustejovsky J. and Stubbs A. Natural Language Annotation for Machine Learning: A Guide to Corpus-Building for Applications. O’Reilly Media, 2012.
  91. [91] Ratner A., Bach S. H., Ehrenberg H., Fries J., Wu S., and Ré C. Snorkel: rapid training data creation with weak supervision. Proceedings of the VLDB Endowment, 11(3):269–282, 11 2017. doi:10.14778/3157794.3157797. PMCID: PMC5951191. PMID: 29770249
  92. [92] Ribeiro M. T., Wu T., Guestrin C., and Singh S. Beyond accuracy: Behavioral testing of nlp models with checklist, 2020. doi:10.24963/ijcai.2021/659
  93. [93] Roh Y., Heo G., and Whang S. E. A survey on data collection for machine learning: a big data – ai integration perspective, 2019.
  94. [94] Ruder S., Peters M. E., Swayamdipta S., and Wolf T. Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pages 15–18, 2019. doi:10.18653/v1/N19-5004
  95. [95] Schwartz R., Dodge J., Smith N. A., and Etzioni O. Green ai, 2019.
  96. [96] Sculley D., Holt G., Golovin D., Davydov E., Phillips T., Ebner D., Chaudhary V., Young M., Crespo J.-F., and Dennison D. Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems, pages 2503–2511, 2015.
  97. [97] Settles B. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.
  98. [98] Shane J. You Look Like a Thing and I Love You: How Artificial Intelligence Works and why It's Making the World a Weirder Place. Voracious, 2019.
  99. [99] Shi Z. R., Wang C., and Fang F. Artificial intelligence for social good: A survey, 2020.
  100. [100] Simonyan K. and Zisserman A. Very deep convolutional networks for large-scale image recognition, 2015.
  101. [101] Strobelt H., Gehrmann S., Behrisch M., Perer A., Pfister H., and Rush A. M. Seq2seq-vis: A visual debugging tool for sequence-to-sequence models. IEEE Transactions on Visualization and Computer Graphics, 25(1):353–363, 2018. doi:10.1109/TVCG.2018.2865044
  102. [102] Sun Z., Zhang J. M., Harman M., Papadakis M., and Zhang L. Automatic testing and improvement of machine translation, 2019. doi:10.1145/3377811.3380420
  103. [103] Tatman R. Gender and dialect bias in youtube's automatic captions. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 53–59, 2017. doi:10.18653/v1/W17-1606
  104. [104] Tufekci Z. Youtube's recommendation algorithm has a dark side. bit.ly/2m09tvZ. Accessed: 2020-11-30.
  105. [105] Van Engelen J. E. and Hoos H. H. A survey on semi-supervised learning. Machine Learning, 109(2):373–440, 2020. doi:10.1007/s10994-019-05855-6
  106. [106] VanderPlas J. The big data brain drain: Why science is in trouble. Accessed: 2020-11-30.
  107. [107] VanderPlas J. Hacking academia: Data science and the university. Accessed: 2020-11-30.
  108. [108] Vassiliadis P. A survey of extract–transform–load technology. International Journal of Data Warehousing and Mining, 5(3):1–27, 2009. doi:10.4018/jdwm.2009070101
  109. [109] Wagstaff K. Machine learning that matters, 2012.
  110. [110] Wan Z., Xia X., Lo D., and Murphy G. C. How does machine learning change software development practices? IEEE Transactions on Software Engineering, pages 1–14, 2019. doi:10.1109/TSE.2019.2937083
  111. [111] Wang J., Gou L., Shen H.-W., and Yang H. Dqnviz: A visual analytics approach to understand deep q-networks. IEEE Transactions on Visualization and Computer Graphics, 25(1):288–298, 2018. doi:10.1109/TVCG.2018.2864504
  112. [112] Weiss K., Khoshgoftaar T. M., and Wang D. A survey of transfer learning. Journal of Big data, 3(1):9, 2016. doi:10.1186/s40537-016-0043-6
  113. [113] Wirth R. and Hipp J. Crisp-dm: Towards a standard process model for data mining. In Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, pages 29–39. Springer-Verlag, 2000.
  114. [114] Yan M., Wang L., and Fei A. Artdl: Adaptive random testing for deep learning systems. IEEE Access, 2019. doi:10.1109/ACCESS.2019.2962695
  115. [115] Yosinski J., Clune J., Nguyen A., Fuchs T., and Lipson H. Understanding neural networks through deep visualization, 2015.
  116. [116] Zafrir O., Boudoukh G., Izsak P., and Wasserblat M. Q8bert: Quantized 8bit bert, 2019. doi:10.1109/EMC2-NIPS53020.2019.00016
  117. [117] Zeiler M. D. and Fergus R. Visualizing and understanding convolutional networks. In 13th European Conference on Computer Vision, ECCV 2014, pages 818–833. Springer Verlag, 2014. doi:10.1007/978-3-319-10590-1_53
  118. [118] Zhang J. M., Harman M., Ma L., and Liu Y. Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering, 2020.
  119. [119] Zinkevich M. Rules of machine learning: Best practices for ml engineering. developers.google.com/machine-learning/guides/rules-of-ml. Accessed: 2020-11-30.
DOI: https://doi.org/10.2478/fcds-2020-0015 | Journal eISSN: 2300-3405 | Journal ISSN: 0867-6356
Language: English
Page range: 281 - 304
Submitted on: Mar 16, 2020 | Accepted on: Nov 30, 2020 | Published on: Dec 16, 2020
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2020 Jan Mizgajski, Adrian Szymczak, Mikołaj Morzy, Łukasz Augustyniak, Piotr Szymański, Piotr Żelasko, published by Poznan University of Technology
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.