### Introduction

Much of the recent enthusiasm for artificial intelligence (AI) has been driven by *deep learning*. To a newcomer in this field, it might seem that AI and deep learning are equivalent terms. However, AI, as an academic field, has existed for a long time, and deep learning (a.k.a. the multi-layer perceptron, in old-fashioned terms) is only one of its many research topics. Most of its important theoretical development took place in the mid-twentieth century, but its enormous potential went unrecognized for a long time.

ML is traditionally divided into three broad categories: *supervised learning*, *unsupervised learning*, and *reinforcement learning* (Fig. 1). Algorithms that constitute each of these categories are diverse, and deep learning is merely one of them. Deep learning is special in certain respects and outperforms all other algorithms in tasks involving images, videos, sounds, and machine translation. However, its capability is occasionally overstated, and it is sometimes misconceived as a magic wand. It is necessary to gain a clear overview of the entire field of ML before digging deeper into any specific algorithm.

### Basic concepts of ML

ML algorithms learn from *training data* that consist of *features*. A good understanding of these terms is crucial in acquiring a firm grasp of ML. Hereafter, important terms will be highlighted in *italics*.

### Data

The *i*^{th} row of X is called the *i*^{th} *instance* of D. The *j*^{th} column of X, on the other hand, is called the *j*^{th} *feature* of D. Throughout this review, we will use a well-known dataset in the ML community (the Pima Indian diabetes dataset), originally from the National Institute of Diabetes and Digestive and Kidney Diseases (https://www.niddk.nih.gov/), which consists of 768 subjects and 8 features. These features are as follows: number of prior pregnancies, plasma glucose concentration (glucose tolerance test), diastolic blood pressure (mmHg), triceps skin fold thickness (mm), 2-h serum insulin (μU/ml), body mass index (BMI), diabetes pedigree function, and age (Table 1). The objective is to predict whether the patient is diabetic, based on the values of these features.

### Task

Tasks in supervised learning are largely divided into *classification* and *regression*.

In a classification task, the *target* variable (or *label*) is an element of a countable set S = {*C*_{1}, …, *C*_{k}}. In our running example, S has two elements: {diabetes mellitus (DM) absent, DM present}. For notational convenience, we assign 0 and 1 to each element, respectively. The role of the ML algorithm is to discover an appropriate mapping from the features to {0, 1}. Such a mapping is called a *model*.

### Model and parameters

The adjustable quantities inside a model are called the *parameters* of the model. Given the model, optimal parameter values must be found such that the outputs are as close as possible to the actual target values. For various reasons, ML algorithms instead try to find parameters that yield outputs least dissimilar to the target variables. The metric of dissimilarity is called a *loss function*. Depending on the specific task, different loss functions are chosen. For our classification task, we could imagine assigning a value of +1 if there is a mismatch between the output and the target variable, and 0 otherwise. The sum of the assigned values over all instances, divided by the number of instances, is then defined as the *expected loss*. The parameters that yield the smallest value of this quantity become the solution of the task.
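
The mismatch-counting loss described above can be sketched in a few lines of Python; the targets and model outputs below are hypothetical toy values, not the Pima data.

```python
# Zero-one loss: assign +1 for each mismatch between output and target,
# 0 otherwise; the expected loss is the mean over all instances.

def zero_one_expected_loss(targets, outputs):
    """Fraction of instances where the model output mismatches the target."""
    mismatches = sum(1 for y, f in zip(targets, outputs) if y != f)
    return mismatches / len(targets)

targets = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = DM present, 0 = DM absent
outputs = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical model predictions

loss = zero_one_expected_loss(targets, outputs)
print(loss)   # 2 mismatches out of 8 instances -> 0.25
```

Parameters that minimize this quantity over the training data are the solution the learning algorithm seeks.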

The *decision tree* algorithm is a popular ML algorithm that provides an intuitive framework for making classifications and decisions. Often, multiple trees are generated on random subsets of the original data. The decisions of the individual trees are then combined to generate the final prediction. This algorithm is known as the *random forest* algorithm [5].
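
To illustrate the random forest idea, the sketch below trains several one-split "stumps" (a minimal stand-in for full decision trees) on bootstrap samples of the data and combines their majority vote; the six instances and their two features (glucose, BMI) are hypothetical toy values.

```python
import random
from collections import Counter

random.seed(0)

# Each row: ((glucose, bmi), label), where 1 = DM present (hypothetical data).
data = [((85, 26.6), 0), ((183, 23.3), 1), ((89, 28.1), 0),
        ((137, 43.1), 1), ((116, 25.6), 0), ((168, 38.0), 1)]

def train_stump(sample):
    """Pick the single-feature threshold with the fewest training errors."""
    best = None
    for j in range(2):                       # feature index
        for (x, _) in sample:                # candidate thresholds
            t = x[j]
            errors = sum(1 for (xx, y) in sample if (xx[j] >= t) != (y == 1))
            if best is None or errors < best[0]:
                best = (errors, j, t)
    _, j, t = best
    return lambda x: 1 if x[j] >= t else 0

def random_forest(data, n_trees=5):
    stumps = []
    for _ in range(n_trees):
        sample = [random.choice(data) for _ in data]   # bootstrap sample
        stumps.append(train_stump(sample))
    def predict(x):
        votes = Counter(s(x) for s in stumps)
        return votes.most_common(1)[0][0]              # majority vote
    return predict

predict = random_forest(data)
print(predict((150, 35.0)))   # the ensemble's vote for a new instance
```

A real random forest additionally grows full trees and samples random feature subsets at each split, but the bootstrap-and-vote structure is the same.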

No single algorithm is superior for every task; this fact is known as the *no free lunch theorem* of ML [6]. It is usually necessary to try multiple models and find the one that works best for a given task. This trial-and-error process lies at the heart of ML artistry. Commonly used ML algorithms include the following:

- Regularized linear regression

- Logistic regression

- Discriminant analysis

- Support vector machine

- Naïve Bayes

- K-Nearest Neighbor

- Decision tree

- Ensemble method

- Neural network

Among these, the *ensemble method* is noteworthy. As the term implies, this method combines several ML algorithms into a single model and can be roughly divided into two types: *sequential* and *parallel* methods. In the former, base learners are trained sequentially, each time assigning more weight to previously misclassified instances. Prototypical algorithms are *AdaBoost* and *Gradient Boost* [7,8]. Parallel methods, on the other hand, combine learners that have been trained independently. The above-mentioned random forest algorithm is the prototype.
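
The sequential idea can be illustrated with the AdaBoost weight update: after each round, misclassified instances receive larger weights, so the next learner focuses on them. The targets and round-1 predictions below are hypothetical.

```python
import math

targets     = [1, 0, 1, 1, 0]
predictions = [1, 0, 0, 1, 1]      # round-1 learner misses instances 2 and 4
weights     = [1 / 5] * 5          # start with uniform instance weights

# Weighted error of the round-1 learner
err = sum(w for w, y, p in zip(weights, targets, predictions) if y != p)

# The learner's "say" in the final vote (AdaBoost update rule)
alpha = 0.5 * math.log((1 - err) / err)

# Up-weight misclassified instances, down-weight correct ones, renormalize
weights = [w * math.exp(alpha if y != p else -alpha)
           for w, y, p in zip(weights, targets, predictions)]
total = sum(weights)
weights = [w / total for w in weights]

print(weights)   # misclassified instances now carry more weight
```

A characteristic property of this update is that the misclassified instances end up holding exactly half of the total weight, forcing the next learner to attend to them.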

### Loss functions

For regression tasks, the quantity to be minimized is the *expected error*. The most commonly used loss function is the *mean squared error* (MSE), defined as MSE = (1/n) Σ_{i=1}^{n} (*y*_{i} − *f*(*X*_{i}))^{2}, where *y*_{i} and *f*(*X*_{i}) are the target variable and the model output of the *i*^{th} instance, respectively. A closely related metric is the *root-mean-square error* (RMSE), which is defined simply as the square root of the MSE. The RMSE can also be expressed as a percentage error *ε* (%), where *ε* = 100 (%) × (RMSE / *mean prediction*).

For binary classification, the model output can be interpreted as the probability of *y*_{i} being positive. *Binary cross entropy*, defined as −[*y*_{i} log *p*_{i} + (1 − *y*_{i}) log(1 − *p*_{i})], where *p*_{i} is the model output for the *i*^{th} instance, is widely used as the loss function.
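
These loss functions translate directly into code. A minimal sketch with hypothetical toy data (the cross entropy here is averaged over instances):

```python
import math

def mse(y, f):
    """Mean squared error between targets y and model outputs f."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / len(y)

def rmse(y, f):
    """Root-mean-square error: the square root of the MSE."""
    return math.sqrt(mse(y, f))

def binary_cross_entropy(y, p):
    """-[y*log(p) + (1-y)*log(1-p)], averaged over instances."""
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / len(y)

# Regression example (hypothetical values)
y_true = [2.0, 3.5, 4.0]
y_pred = [2.5, 3.0, 4.0]
print(mse(y_true, y_pred))    # (0.5^2 + 0.5^2 + 0) / 3 ≈ 0.1667
print(rmse(y_true, y_pred))

# Classification example: p_i is the predicted probability of y_i = 1
y = [1, 0, 1]
p = [0.9, 0.2, 0.8]
print(binary_cross_entropy(y, p))
```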

*Optimization algorithms* are used to search the parameter space of a given model so that the loss function is minimized. There are various algorithms, the most common ones being based on calculating the *gradient* of the loss function. Gradient-based search is an iterative algorithm in which the parameter values are updated at each step based on the gradient calculated at the current parameter estimates. A gradient is a multivariate extension of a *derivative*. Since the loss function, L, is a function of its parameters p, changing the values of p results in changes in L. Intuitively, changing p in the direction that most sensitively affects L is likely to be an effective search strategy.

To illustrate this idea more vividly, imagine a landscape where the value of L is plotted along the vertical axis and the parameters (restricted to two dimensions for demonstrative purposes) on the horizontal plane (Fig. 3). Suppose that an agent is riding along the ridges of the loss function surface, searching for its deepest valley. Starting from an arbitrary point on the surface, a plausible strategy would be to look around and descend along the path with the steepest slope. The tangent with the steepest slope, which corresponds to the derivative in a univariate problem, is the gradient. Repeating such a search often results in the agent ending up in one of the valleys, which is a potential solution to the search task. Unfortunately, the gradient-based search strategy does not guarantee a globally optimal solution when applied to a loss function surface with multiple peaks and valleys, as shown in Fig. 3. Methods to circumvent this limitation do exist (such as introducing momentum, starting the search from multiple initial parameter values, adopting a stochastic component, and so on), but none of them definitively solves the problem (other than the brute-force strategy of searching all possible parameter combinations). For more information, refer to [9].
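
The descent strategy can be sketched on a one-parameter loss L(p) = (p − 3)², whose gradient is dL/dp = 2(p − 3); the learning rate and starting point below are arbitrary choices.

```python
# Gradient descent on L(p) = (p - 3)^2, which has its minimum at p = 3.

def gradient(p):
    """Derivative of the loss L(p) = (p - 3)^2."""
    return 2 * (p - 3)

p = 0.0                  # arbitrary starting point on the loss surface
learning_rate = 0.1      # step size (a hyperparameter)

for _ in range(100):
    p -= learning_rate * gradient(p)   # step against the gradient

print(p)   # converges toward the minimum at p = 3
```

On this single-valley loss the search always finds the minimum; with multiple valleys, as the text notes, it may stop at a local one.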

### Overfitting and underfitting

A model that fits the training data so closely that it captures noise rather than the underlying pattern, and consequently generalizes poorly to new data, is said to exhibit *overfitting* in the ML literature. Conversely, a model too simple to capture the underlying pattern is said to *underfit* the data.

### Training, validation, and test datasets

The dataset used to fit the parameters of a model is called the *training dataset*, and a separate dataset used for detecting overfitting is called the *test dataset*. Hereafter, the training and test datasets will be denoted as Tr and Te, respectively. It is customary in ML practice to randomly split D into Tr and Te. The split ratio is often chosen such that the size of Tr is greater than that of Te.
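
The random split of D into Tr and Te can be sketched as follows; the 80:20 ratio is one common choice, not a prescription from the text, and the indices stand in for the 768 instances of the Pima dataset.

```python
import random

random.seed(42)                        # for reproducibility

indices = list(range(768))             # one index per instance of D
random.shuffle(indices)                # random permutation

split = int(0.8 * len(indices))        # 80:20 split -> 614 training instances
train_idx, test_idx = indices[:split], indices[split:]

print(len(train_idx), len(test_idx))   # 614 154
```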

For model selection and tuning, Tr is often further split into training and *validation* (V) *datasets* (Fig. 4).

An alternative strategy, *k-fold cross-validation*, is also widely used. First, Tr is partitioned into k chunks (often of equal or similar sizes). One of the k chunks is designated as V and the rest as Tr. The predictive performance on V is then assessed, and this procedure is repeated over all possible allocations of V and Tr. Finally, the k assessment scores from the k validation runs are averaged to yield the *mean performance index*. Models that improve the mean performance index are chosen.
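
The bookkeeping of k-fold cross-validation can be sketched as below; `evaluate` is a hypothetical placeholder for fitting a model on the training chunks and scoring it on V.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n
        val = indices[start:stop]              # this chunk plays the role of V
        train = indices[:start] + indices[stop:]
        yield train, val

def evaluate(train, val):
    # Placeholder: a real run would fit the model on `train`
    # and return its score on `val`.
    return len(val) / (len(train) + len(val))

scores = [evaluate(tr, v) for tr, v in k_fold_indices(100, 5)]
mean_performance = sum(scores) / len(scores)   # the mean performance index
print(mean_performance)
```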

It is advisable to choose a model that reflects the *data generating mechanism* if possible. For example, to predict the concentration of an anesthetic drug that is known to be eliminated by the kidney via first-order kinetics, it would be a better choice to use the pharmacokinetic model C(t) = C(0)e^{−kt} (k: elimination rate constant) than a high-degree polynomial C(t) = C(0) + *β*_{1}t + *β*_{2}t^{2} + … + *β*_{p}t^{p}. The latter function can undoubtedly generate outputs that match the observed concentrations given a sufficiently high degree p. However, this model would be prone to overfitting the data.
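
The first-order elimination model is straightforward to evaluate directly; C(0) = 10 and k = 0.5 per hour below are hypothetical values chosen for illustration.

```python
import math

def concentration(t, c0=10.0, k=0.5):
    """First-order elimination: C(t) = C(0) * e^(-kt). Values are hypothetical."""
    return c0 * math.exp(-k * t)

for t in (0, 1, 2, 4):
    print(t, round(concentration(t), 3))   # concentration decays exponentially
```

Because the model has only two parameters, both physiologically interpretable, it is far less prone to overfitting than a high-degree polynomial.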

### Feature selection

When a dataset contains many features, some of which may be irrelevant or redundant, *feature selection algorithms* can be applied. There are basically three classes of such algorithms: (1) *stepwise selection* methods that iteratively incorporate the “best” feature at each step, (2) *dimensionality reduction* techniques that extract the most important components to be used as new features, and (3) *regularization* or *shrinkage* methods that penalize large parameter values during the optimization process.
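
The penalization idea can be sketched by adding the Ridge (squared) or LASSO (absolute-value) penalty to an ordinary MSE loss; the data, parameters, and penalty strength `lam` below are all hypothetical.

```python
def mse_loss(params, data):
    """MSE of a linear model with the given parameters (toy data)."""
    return sum((y - sum(p * x for p, x in zip(params, xs))) ** 2
               for xs, y in data) / len(data)

def ridge_loss(params, data, lam):
    """MSE plus the Ridge penalty: lam * sum of squared parameters."""
    return mse_loss(params, data) + lam * sum(p ** 2 for p in params)

def lasso_loss(params, data, lam):
    """MSE plus the LASSO penalty: lam * sum of absolute parameters."""
    return mse_loss(params, data) + lam * sum(abs(p) for p in params)

data = [((1.0, 2.0), 5.0), ((2.0, 1.0), 4.0)]   # ((features), target)
params = (1.0, 2.0)

print(mse_loss(params, data))
print(ridge_loss(params, data, lam=0.1))   # MSE + 0.1 * (1^2 + 2^2)
print(lasso_loss(params, data, lam=0.1))   # MSE + 0.1 * (|1| + |2|)
```

During optimization the penalty pushes parameter values toward zero; with LASSO some parameters become exactly zero, which is why it doubles as a feature selector.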

The most widely used dimensionality reduction technique is *principal component analysis*. Given multiple features, this method identifies the principal components that retain as much of the original variance as possible. For more information on this particular method, refer to [10].

Representative regularization methods include *LASSO*, *Ridge*, and *ElasticNet* [11]. Readers seeking more detail are referred to a review article that specifically deals with various feature selection methods [12].

Some algorithms, most notably *deep learning*, can automatically extract useful features from the raw input. This is one of the great advantages of the algorithm and has contributed to its high popularity.

### Assessment of predictive performance

A *confusion matrix* is a table showing the frequencies of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) (see Table 2).

- *Sensitivity* (a.k.a. recall, true positive rate) = TP / (TP + FN)

- *Specificity* (a.k.a. selectivity, true negative rate) = TN / (TN + FP)

- *Precision* (a.k.a. positive predictive value) = TP / (TP + FP)

- *Negative predictive value* = TN / (TN + FN)
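
These four metrics follow directly from the confusion-matrix counts; the counts below are hypothetical.

```python
# Hypothetical confusion-matrix counts for a binary classifier
TP, FP, TN, FN = 40, 10, 90, 20

sensitivity = TP / (TP + FN)   # recall, true positive rate
specificity = TN / (TN + FP)   # selectivity, true negative rate
precision   = TP / (TP + FP)   # positive predictive value
npv         = TN / (TN + FN)   # negative predictive value

print(sensitivity, specificity, precision, npv)
```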

One can draw the *receiver operating characteristic* (ROC) curve, which is created by plotting sensitivity against 1 − specificity, and then calculate the *area under the curve* (AUC). An ideal algorithm that achieves 100% sensitivity and 100% specificity would be associated with an AUC of 100%, the maximum score achievable. A random guess, due to the tradeoff between sensitivity and specificity, would satisfy the equality sensitivity = 1 − specificity, yielding an AUC of 50%.
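
The AUC can be approximated with the trapezoidal rule from a handful of ROC points; the curve below is hypothetical, and the diagonal illustrates the random-guess case.

```python
# Hypothetical ROC curve: points of (1 - specificity, sensitivity)
roc_points = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.8), (0.6, 0.95), (1.0, 1.0)]

# Trapezoidal rule: sum the area of each segment under the curve
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(roc_points, roc_points[1:]))
print(auc)

# A random guess lies on the diagonal sensitivity = 1 - specificity
diagonal = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
auc_random = sum((x2 - x1) * (y1 + y2) / 2
                 for (x1, y1), (x2, y2) in zip(diagonal, diagonal[1:]))
print(auc_random)   # 0.5, i.e., an AUC of 50%
```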

### Summary

In supervised learning, the target variable guides the learning process, playing the role of a *supervisor*.

### Tools and software

### R

### Python

*TensorFlow*, on which it runs.

When Python is installed via the Anaconda distribution, widely used packages (*libraries*), such as *Scikit-learn*, *NumPy*, *Pandas*, and *SciPy*, come pre-installed. The need for manually installing the required libraries is thus minimized. Anaconda also includes one of the most revolutionary projects to change the practice of ML research, the *Jupyter* notebook (https://jupyter.org). Project Jupyter is a non-profit, open-source project that started in 2014. It evolved to support interactive data science and scientific computing, and supports Python, R, and other programming languages. It operates in a web browser, where code can be embedded as cells, greatly facilitating the presentation, sharing, and collaboration of ML research. It is currently the primary presentation format used by Kaggle (https://kaggle.com), a popular platform for launching public ML competitions and sharing public data.

### Others

MATLAB has long been the *de facto* standard programming language for engineers. It is commercial software developed by MathWorks (https://mathworks.com), and offers fast and reliable tools for all areas of scientific computing. Third-party tools can also be shared through MATLAB Central (https://mathworks.com/matlabcentral), an open exchange for the MATLAB community.