Abstract
In the age of digital epidemiology, epidemiologists are faced with an increasing amount of data of growing complexity and dimensionality. Machine learning provides a set of powerful tools to analyze such enormous amounts of data. This chapter lays the methodological foundations for successfully applying machine learning in epidemiology. It covers the principles of supervised and unsupervised learning and discusses the most important machine learning methods. Strategies for model evaluation and hyperparameter optimization are developed, and interpretable machine learning is introduced. All theoretical parts are accompanied by code examples in R, with an example dataset on heart disease used throughout the chapter.
Machine learning has become an integral part of almost all businesses and scientific fields alike, including epidemiology. With the rise of deep learning, machine learning has revolutionized various applications, from image and speech recognition to natural language processing. Alongside this hype, epidemiologists are faced with an ever-increasing amount of data of growing complexity and dimensionality, including data from electronic health records, wearable devices, social media and genetics. Machine learning methods are able to efficiently analyze such enormous amounts of data. Thus, in the age of digital epidemiology, machine learning is an essential tool that every modern epidemiologist should know about.
One of the major advantages of machine learning is that it does not require exact model specifications. Instead, one simply indicates which variables or features to include and relies on the machine learning method to find all the interactions and other important factors. However, this increased model flexibility comes at a price: high computational cost, a loss of interpretability, and the risk of overfitting, i.e., fitting the training data too closely, which leads to poor generalization performance and makes proper model evaluation essential. Further, most machine learning methods have to be configured by setting so-called hyperparameters, which heavily influence performance and therefore have to be chosen carefully or tested systematically.
With this book chapter, we aim to give epidemiologists the foundation for successfully applying machine learning to their research. Notably, this does not require knowing all the details of all the different machine learning methods. Instead, we focus on general principles such as supervised learning (Sec. 2), model evaluation (Sec. 3) and hyperparameter optimization (Sec. 4). Nevertheless, we cover two of the most important machine learning methods in Sec. 2. These methods focus on making predictions, which is useful for many epidemiological tasks but does not by itself help in understanding diseases, identifying risk factors or generating synthetic data. In this regard, Sec. 5 introduces the basics of interpretable machine learning and Sec. 6 covers unsupervised learning and generative modeling. Throughout the chapter, we use an example dataset on heart disease and show how to apply the covered methods in R using the mlr3 framework (Lang et al, 2019).
The heart disease data (Janosi et al, 1988) are available from OpenML (Vanschoren et al, 2013). The labeled dataset contains 𝑛 = 270 instances of patients with 𝑝 = 13 features. The features include a patient’s age (age), the result of a thallium stress test (thal) and the type of chest pain (chest pain, with four categories), among others. We aim to predict whether the target heart disease 𝑦 ∈ {1, 2} is absent (1) or present (2). Since 𝑦 is categorical with two classes, this is a binary classification task. For details on the dataset, preprocessing, software and code examples, we refer to the appendix, our GitHub page https://github.com/bips-hb/epi-handbook-ml, and the mlr3 book (Bischl et al, 2024).
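Such a binary classification task can be set up in mlr3 along the following lines. This is a minimal sketch, assuming the mlr3 and rpart packages are installed; the small simulated data frame below is a hypothetical stand-in for the preprocessed heart disease data, not the actual OpenML dataset.

```r
# Sketch: wrapping a data frame into an mlr3 classification task.
# The simulated columns stand in for the real heart disease features.
library(mlr3)

set.seed(1)
heart <- data.frame(
  age           = sample(30:75, 100, replace = TRUE),
  chest_pain    = factor(sample(1:4, 100, replace = TRUE)),
  heart_disease = factor(sample(c("absent", "present"), 100, replace = TRUE))
)

# Create the classification task with heart_disease as the target
task <- as_task_classif(heart, target = "heart_disease", positive = "present")

# Train a classification tree and predict (here on the training data,
# for illustration only; see Sec. 3 for proper model evaluation)
learner <- lrn("classif.rpart")
learner$train(task)
pred <- learner$predict(task)
head(as.data.table(pred))
```

Note that predicting on the training data, as done here, is only for illustration; generalization performance must be estimated on held-out data.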
Supervised learning refers to learning a functional relationship, or model, 𝑓 : 𝑋 → 𝑌 between a set of 𝑝 features 𝒙 ∈ 𝑋 ⊆ ℝᵖ and a target 𝑦 ∈ 𝑌 ⊆ ℝ from data D = {(𝒙ᵢ, 𝑦ᵢ)}ᵢ₌₁ⁿ with 𝑛 ∈ ℕ instances. It is uniquely characterized by the use of labeled training data, i.e., data in which both features and targets are observed, for learning the underlying relationship 𝑓 : 𝑋 → 𝑌. The model 𝑓 is then used to make predictions ŷ = 𝑓(𝒙) for new data, where the features but not the targets are available. An example prediction task in epidemiology is predicting the risk of a specific disease based on genetic and lifestyle features.
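This setup can be illustrated in base R with a logistic regression as the learner. The code is a minimal sketch on synthetic data: the two features are hypothetical stand-ins for a lifestyle score and a genetic risk score, not variables from the heart disease dataset.

```r
# Minimal base-R sketch of supervised learning: learn a model f from
# labeled data (x_i, y_i), then predict y-hat for a new instance.
set.seed(42)
n  <- 200
x1 <- rnorm(n)   # hypothetical standardized lifestyle score
x2 <- rnorm(n)   # hypothetical genetic risk score
y  <- rbinom(n, 1, plogis(0.5 * x1 + 1.0 * x2))  # disease yes/no

train <- data.frame(x1, x2, y)

# The inducer (here: logistic regression via glm) learns the model f from D
model <- glm(y ~ x1 + x2, data = train, family = binomial())

# Predict the disease risk for a new, unlabeled instance
new_patient <- data.frame(x1 = 0.2, x2 = 1.5)
risk <- predict(model, newdata = new_patient, type = "response")
risk  # a probability in [0, 1]
```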
It is important to note that the term model is used for many different concepts in science, which may lead to confusion. In this chapter, the term model is used to refer only to the functional relationship f between 𝒙 and 𝑦. The algorithm that is used to find the model is termed inducer or learner.
Generally, supervised learning with a continuous target 𝑦 ∈ ℝ is referred to as a regression task. When 𝑦 is categorical, i.e., 𝑦 ∈ {1, . . . , 𝐶} with 𝐶 ∈ ℕ classes, it is called a classification task. In this case, the prediction can either be categorical, i.e., ŷ ∈ {1, . . . , 𝐶}, or probabilistic with π̂_𝑐 = 𝑃(𝑦 = 𝑐 | 𝒙) for each class 𝑐 ∈ {1, . . . , 𝐶}. For only two classes (𝐶 = 2), typically a {0, 1} target is used and the task is referred to as binary classification, while 𝐶 > 2 is called multiclass classification.
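The two prediction types are closely related: a probabilistic prediction can be turned into a categorical one by thresholding. The following base-R sketch uses hypothetical predicted probabilities and the common (but not mandatory) threshold of 0.5.

```r
# Sketch: categorical vs. probabilistic predictions in binary classification.
# Hypothetical predicted probabilities P(y = 1 | x) for five instances:
pi_hat <- c(0.10, 0.65, 0.48, 0.90, 0.51)

# Probabilistic prediction: report pi_hat directly.
# Categorical prediction: threshold the probabilities (commonly at 0.5).
y_hat <- as.integer(pi_hat > 0.5)
y_hat  # 0 1 0 1 1
```

The choice of threshold matters in practice, e.g., when false negatives are costlier than false positives, a lower threshold may be preferred.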
For evaluation purposes, the difference between the predicted values ŷ and the actual values 𝑦 is usually quantified by a loss function 𝐿(ŷ, 𝑦). It measures the performance of a model, and the goal during training is often to minimize this function to improve the model’s accuracy; see Sec. 3 for further details.
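Two standard examples of such loss functions, written as plain base-R functions, are the squared error loss for regression and the 0-1 (misclassification) loss for classification; averaging the pointwise losses over the data yields a single performance measure.

```r
# Squared error loss for regression:
l2_loss <- function(y_hat, y) (y_hat - y)^2

# 0-1 (misclassification) loss for classification:
zero_one_loss <- function(y_hat, y) as.numeric(y_hat != y)

# Toy example: four binary predictions, two of which are wrong
y     <- c(1, 0, 1, 1)
y_hat <- c(1, 1, 1, 0)

# The average loss quantifies the model's performance on these data
mean(zero_one_loss(y_hat, y))  # 0.5: two of four predictions are wrong
```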