Detecting Edges; Detecting Corners; Creating Features for Machine Learning; Encoding Mean Color as a Feature; Encoding Color Histograms as Features
Dimensionality Reduction Using Feature Extraction: Introduction; Reducing Features Using Principal Components; Reducing Features by Maximizing Class Separability; Reducing Features Using Matrix Factorization; Reducing Features on Sparse Data
Dimensionality Reduction Using Feature Selection: Introduction; Thresholding Numerical Feature Variance; Thresholding Binary Feature Variance; Handling Highly Correlated Features; Removing Irrelevant Features for Classification; Recursively Eliminating Features
Model Evaluation: Cross-Validating Models; Creating a Baseline Regression Model; Creating a Baseline Classification Model; Evaluating Binary Classifier Predictions; Evaluating Binary Classifier Thresholds; Evaluating Multiclass Classifier Predictions; Evaluating Regression Models; Evaluating Clustering Models; Creating a Custom Evaluation Metric; Visualizing the Effect of Training Set Size; Creating a Text Report of Evaluation Metrics; Visualizing the Effect of Hyperparameter Values
Model Selection: Selecting Best Models When Preprocessing; Speeding Up Model Selection with Parallelization; Evaluating Performance After Model Selection
Linear Regression: Fitting a Line; Handling Interactive Effects; Fitting a Nonlinear Relationship; Reducing Variance with Regularization; Reducing Features with Lasso Regression
Trees and Forests: Training a Decision Tree Classifier; Training a Decision Tree Regressor; Visualizing a Decision Tree Model; Training a Random Forest Classifier; Training a Random Forest Regressor; Identifying Important Features in Random Forests; Selecting Important Features in Random Forests; Handling Imbalanced Classes; Controlling Tree Size; Improving Performance Through Boosting
K-Nearest Neighbors: Creating a K-Nearest Neighbor Classifier; Identifying the Best Neighborhood Size
Logistic Regression: Training a Binary Classifier; Training a Multiclass Classifier; Reducing Variance Through Regularization; Training a Classifier on Very Large Data
Support Vector Machines: Training a Linear Classifier; Creating Predicted Probabilities; Identifying Support Vectors
Naive Bayes: Training a Classifier for Continuous Features; Training a Classifier for Discrete and Count Features; Calibrating Predicted Probabilities
Clustering: Clustering Using K-Means

The corpus we are using is the Brown Corpus, one of the most popular sources of tagged text.
To examine the accuracy of our tagger, we split our text data into two parts, train our tagger (using NLTK) on the first part, and test how well it predicts the tags of the second part. Bag-of-words models output a feature for every unique word in text data, with each feature containing a count of occurrences in observations. For example, in our solution the sentence "I love Brazil. Brazil!" contains the word brazil twice, so its brazil feature has a count of 2. The text data in our solution was purposely small. Since our bag-of-words model creates a feature for every unique word in the data, the resulting matrix can contain thousands of features.
This means that the size of the matrix can sometimes become very large in memory. Luckily, one of the nice features of CountVectorizer is that the output is a sparse matrix by default, which will save us memory when we have large feature matrices.
CountVectorizer comes with a number of useful parameters to make creating bag-of-words feature matrices easy. First, while by default every feature is a word, that does not have to be the case. Instead we can set every feature to be the combination of two words (called a 2-gram) or even three words (a 3-gram); the ngram_range parameter sets the minimum and maximum size of our n-grams.
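Here is a minimal sketch of these options; the example sentences are made up for illustration and are not part of the book's data files:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

text_data = np.array(['I love Brazil. Brazil!', 'Sweden is best', 'Germany beats both'])

# Default: one feature per unique word, returned as a sparse matrix
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)
print(bag_of_words.toarray())            # dense view of the word counts
print(count.get_feature_names_out())     # the learned vocabulary

# 1-grams and 2-grams, restricted to a fixed vocabulary of interest
count_restricted = CountVectorizer(ngram_range=(1, 2), vocabulary=['brazil'])
print(count_restricted.fit_transform(text_data).toarray())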
For example, ngram_range=(2, 3) will return all 2-grams and 3-grams. Finally, we can restrict the words or phrases we want to consider to a certain list of words using vocabulary. (If we ever want to view the sparse output as a dense matrix, we can use .toarray().) Solution Compare the frequency of a word in a document (a tweet, movie review, speech transcript, etc.) with its frequency in all the other documents using tf-idf. For example, if the word economy appears frequently in a document, it is evidence that the document might be about economics. We call this term frequency (tf).
In contrast, if a word appears in many documents, it is likely less important to any individual document. For example, if every document in some text data contains the word after, then it is probably an unimportant word. We call this document frequency (df). By combining these two statistics, we can assign a score to every word representing how important that word is in a document.
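A minimal sketch of tf-idf in scikit-learn, reusing the same made-up sentences as above:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text_data = np.array(['I love Brazil. Brazil!', 'Sweden is best', 'Germany beats both'])

tfidf = TfidfVectorizer()                        # tf-idf with L2 normalization by default
feature_matrix = tfidf.fit_transform(text_data)  # sparse matrix of tf-idf scores
print(feature_matrix.toarray())                  # dense view
print(tfidf.vocabulary_)                         # word -> column index mapping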
There are a number of variations in how tf and idf are calculated. In scikit-learn, tf is simply the number of times a word appears in the document, and idf (with the default smoothing) is calculated as idf(t) = ln[(1 + n) / (1 + df(t))] + 1, where n is the number of documents and df(t) is the number of documents containing term t. By default, scikit-learn then normalizes the tf-idf vectors using the Euclidean norm (L2 norm). In this chapter, we will build a toolbox of strategies for handling time series data, including tackling time zones and creating lagged time features.
Discussion When dates and times come as strings, we need to convert them into a data type Python can understand. One obstacle to strings representing dates and times is that the format of the strings can vary significantly between data sources. We can use the format parameter to specify the exact format of the string.
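A short sketch of converting date strings; the strings and the format codes below are illustrative assumptions, not the book's data:

import numpy as np
import pandas as pd

date_strings = np.array(['03-04-2005 11:35 PM', '23-05-2010 12:01 AM', '04-09-2009 09:09 PM'])

# format describes the string layout; errors="coerce" turns problem values into NaT
dates = [pd.to_datetime(s, format='%d-%m-%Y %I:%M %p', errors='coerce') for s in date_strings]
print(dates)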
Solution If not specified, pandas objects have no time zone. However, we can add a time zone using tz during creation. If we wanted to do some complex time series manipulation, it might be worth the overhead of setting the date column as the index of the DataFrame, but if we wanted to do some simple data wrangling, the boolean conditions might be easier.
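A small sketch of the time zone tools; the particular time zones and dates are arbitrary choices:

import pandas as pd

# Specify a time zone at creation
date = pd.Timestamp('2017-05-01 06:00:00', tz='Europe/London')

# Or localize a naive datetime afterward, then convert it to another zone
naive = pd.Timestamp('2017-05-01 06:00:00')
london = naive.tz_localize('Europe/London')
print(london.tz_convert('Africa/Abidjan'))

# The same methods work element-wise on a Series of dates via .dt
dates = pd.Series(pd.date_range('2/2/2002', periods=3, freq='D'))
print(dates.dt.tz_localize('Africa/Abidjan'))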
Solution Use the time properties of pandas Series.dt. For example, we might want a feature that just includes the year of the observation, or we might want only to consider the month of some observation so we can compare them regardless of year.
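A minimal sketch of pulling date components out of a datetime column (the toy date range is an assumption):

import pandas as pd

dataframe = pd.DataFrame()
dataframe['date'] = pd.date_range('1/1/2001', periods=5, freq='W')
dataframe['year'] = dataframe['date'].dt.year
dataframe['month'] = dataframe['date'].dt.month
dataframe['day'] = dataframe['date'].dt.day
dataframe['hour'] = dataframe['date'].dt.hour
dataframe['minute'] = dataframe['date'].dt.minute
print(dataframe.head())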
Subtracting two datetime features produces the duration between them as a timedelta (e.g., 0 days, 2 days). Often we will want to remove the days output and keep only the numerical value, which we can do with the .dt.days accessor of the resulting Series.
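A sketch of the duration calculation; the check-in and check-out dates below are hypothetical, not the book's values:

import pandas as pd

dataframe = pd.DataFrame()
dataframe['Arrived'] = [pd.Timestamp('01-01-2017'), pd.Timestamp('01-04-2017')]
dataframe['Left'] = [pd.Timestamp('01-01-2017'), pd.Timestamp('01-06-2017')]

stay = dataframe['Left'] - dataframe['Arrived']   # timedelta64 values such as "2 days"
print(stay)
print(stay.dt.days)                               # keep only the numeric day count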
For example, we might have the dates a customer checks in and checks out of a hotel, but the feature we want is the duration of his stay. With pandas we can use shift to lag values by one row, creating a new feature containing past values. It is often useful to have a time window of a certain number of months and then move over the observations, calculating a statistic for each window. For example, if we have a time window of three months and we want a rolling mean, we would calculate the mean of months one, two, and three; then the mean of months two, three, and four; and so on (a short sketch follows).
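A minimal sketch of lagged and rolling features, using a made-up price series:

import pandas as pd

dataframe = pd.DataFrame()
dataframe['dates'] = pd.date_range('1/1/2001', periods=5, freq='D')
dataframe['price'] = [1.1, 2.2, 3.3, 4.4, 5.5]

# Lagged feature: the previous observation's price (first row becomes NaN)
dataframe['previous_price'] = dataframe['price'].shift(1)

# Rolling mean over a three-observation window
dataframe['rolling_mean'] = dataframe['price'].rolling(window=3).mean()
print(dataframe)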
Rolling means are often used to smooth out time series data because using the mean of the entire time window dampens the effect of short-term fluctuations. Interpolation can be particularly useful when the gaps caused by missing values are small and bordered by known values.
For example, in our solution a gap of two missing values was bordered by known values on either side, starting at 2.0. By fitting a line from 2.0 to the next known value, we can make reasonable guesses for the missing values in between. One minor advantage back- and forward-filling have over interpolation is that they do not require known values on both sides of the missing value(s).
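A small sketch of the three strategies on a toy series with a two-value gap (the values and dates are assumptions):

import numpy as np
import pandas as pd

prices = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0],
                   index=pd.date_range('01/01/2010', periods=5, freq='D'))

print(prices.interpolate())   # fits a line between 2.0 and 5.0 -> 3.0 and 4.0
print(prices.ffill())         # forward fill: repeat the last known value
print(prices.bfill())         # back fill: use the next known value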
The ability of computers to recognize patterns and objects from images is an incredibly powerful tool in our toolkit. While there are a number of good libraries out there, OpenCV is the most popular and documented library for handling images.
One hurdle is installation: at the time of publication, OpenCV's support for the newest Python 3 releases was limited, so check which Python versions your OpenCV build supports. Finally, throughout this chapter we will use a set of images as examples, which are available to download on GitHub. Once an image is loaded, we can even take a look at the actual values of the matrix. In grayscale images, the value of an individual element is the pixel intensity.
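A minimal sketch of loading an image as a grayscale matrix; the file path is hypothetical and should point at one of the downloaded example images:

import cv2

image = cv2.imread('images/plane.jpg', cv2.IMREAD_GRAYSCALE)   # 2D NumPy array of intensities
if image is None:
    raise FileNotFoundError('Check the image path')
print(type(image), image.shape, image.dtype)   # ndarray, (height, width), uint8
print(image[0, 0])                             # intensity of the top-left pixel, 0-255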
Intensity values range from black (0) to white (255). Resizing images matters for two reasons. First, images come in all shapes and sizes, and to be usable as features, images must have the same dimensions. Second, machine learning can require thousands or hundreds of thousands of images. When those images are very large they can take up a lot of memory, and by resizing them we can dramatically reduce memory usage. Solution To blur an image, each pixel is transformed to be the average value of its neighbors.
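A sketch of resizing and blurring; the path and the 50x50 / 5x5 sizes are arbitrary choices:

import cv2
import numpy as np

image = cv2.imread('images/plane.jpg', cv2.IMREAD_GRAYSCALE)   # hypothetical path

image_50x50 = cv2.resize(image, (50, 50))   # force a common size for all images

# Blur: each pixel becomes the mean of a 5x5 neighborhood
image_blurry = cv2.blur(image, (5, 5))

# The same blur stated explicitly: a 5x5 averaging kernel applied with filter2D
kernel = np.ones((5, 5)) / 25.0
image_kernel_blurry = cv2.filter2D(image, -1, kernel)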
Since all elements of the kernel have the same value (normalized to add up to 1), each has an equal say in the resulting value of the pixel of interest. Solution Create a kernel that highlights the target pixel, then apply it to the image using filter2D.
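A minimal sharpening sketch; the kernel weights are a common choice, and the path is hypothetical:

import cv2
import numpy as np

image = cv2.imread('images/plane.jpg', cv2.IMREAD_GRAYSCALE)   # hypothetical path

kernel = np.array([[ 0, -1,  0],
                   [-1,  5, -1],
                   [ 0, -1,  0]])          # emphasize the center pixel, subtract the neighbors
image_sharp = cv2.filter2D(image, -1, kernel)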
The resulting effect makes contrasts in edges stand out more in the image. Solution Histogram equalization is a tool for image processing that can make objects and shapes stand out.
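A sketch of histogram equalization; for color images a common approach (used here) is to equalize only the Y channel of a YUV conversion. The path is hypothetical:

import cv2

image_gray = cv2.imread('images/plane.jpg', cv2.IMREAD_GRAYSCALE)    # hypothetical path
image_enhanced = cv2.equalizeHist(image_gray)

image_bgr = cv2.imread('images/plane.jpg')                           # color (BGR) version
image_yuv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YUV)
image_yuv[:, :, 0] = cv2.equalizeHist(image_yuv[:, :, 0])            # equalize brightness only
image_color_enhanced = cv2.cvtColor(image_yuv, cv2.COLOR_YUV2BGR)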
The Y is the luma, or brightness, and U and V denote the color. Solution Define a range of colors and then apply a mask to the image. First we convert the image into HSV (hue, saturation, and value). Second, we define a range of values we want to isolate, which is probably the most difficult and time-consuming part. Third, we create a mask for the image (we will only keep the white areas of the mask).
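A sketch of isolating a color with an HSV mask; the blue range below is an illustrative assumption that usually needs tuning per image:

import cv2
import numpy as np

image_bgr = cv2.imread('images/plane.jpg')                    # hypothetical path
image_hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)

lower_blue = np.array([50, 100, 50])
upper_blue = np.array([130, 255, 255])
mask = cv2.inRange(image_hsv, lower_blue, upper_blue)         # white where the color is in range

image_masked = cv2.bitwise_and(image_bgr, image_bgr, mask=mask)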
The weights are determined by a Gaussian window. Alternatively, we could set the threshold to simply the mean of the neighboring pixels with cv2.ADAPTIVE_THRESH_MEAN_C.
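A minimal adaptive thresholding sketch; the block size and the constant subtracted from the mean are assumptions to tune:

import cv2

image_gray = cv2.imread('images/page.jpg', cv2.IMREAD_GRAYSCALE)   # hypothetical path
max_output_value = 255
neighborhood_size = 99
subtract_from_mean = 10

image_binarized = cv2.adaptiveThreshold(image_gray, max_output_value,
                                        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,  # Gaussian-weighted neighbors
                                        cv2.THRESH_BINARY,
                                        neighborhood_size, subtract_from_mean)
# Swap in cv2.ADAPTIVE_THRESH_MEAN_C to use the plain mean of the neighborhood instead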
For example, thresholding is often applied to photos of printed text to isolate the letters from the page. In the GrabCut output, some patches of background can remain marked as foreground. We could go back and manually mark those areas as background, but in the real world we have thousands of images and manually fixing them individually is not feasible. Therefore, we would do well by simply accepting that the image data will still contain some background noise. In our solution, we start out by marking a rectangle around the area that contains the foreground.
GrabCut assumes everything outside this rectangle to be background and uses that information to figure out what is likely background inside the rectangle (to learn how the algorithm does this, check out the external resources at the end of this solution).
The gray area is what GrabCut considered likely background, while the white area is likely foreground. This mask is then used to create a second mask that merges the black and gray regions, which is applied to the image so that only the likely foreground remains.
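A rough GrabCut sketch; the rectangle coordinates and iteration count are assumptions chosen for illustration:

import cv2
import numpy as np

image_bgr = cv2.imread('images/plane.jpg')                    # hypothetical path
h, w = image_bgr.shape[:2]
rectangle = (int(w * 0.1), int(h * 0.1), int(w * 0.8), int(h * 0.8))   # x, y, width, height

mask = np.zeros((h, w), np.uint8)
bgd_model = np.zeros((1, 65), np.float64)   # temporary arrays GrabCut uses internally
fgd_model = np.zeros((1, 65), np.float64)

cv2.grabCut(image_bgr, mask, rectangle, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Merge "definite background" (0) and "likely background" (2) into one background mask
mask_2 = np.where((mask == 2) | (mask == 0), 0, 1).astype('uint8')
image_foreground = image_bgr * mask_2[:, :, np.newaxis]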
Edges are important because they are areas of high information. For example, in our image one patch of sky looks very much like another and is unlikely to contain unique or interesting information.
However, patches where the background sky meets the airplane contain a lot of information (e.g., the object's shape). Edge detection allows us to remove low-information areas and isolate the areas of images containing the most information.
There are many edge detection techniques (Sobel filters, the Laplacian edge detector, etc.). However, our solution uses the commonly used Canny edge detector. How the Canny detector works is too detailed for this book, but there is one point that we need to address: the detector requires low and high gradient threshold values. Pixels with gradients above the high threshold are treated as definite edges, pixels below the low threshold are rejected, and potential edge pixels between the low and high thresholds are kept only if they connect to definite edge pixels. A common default is to derive the two thresholds from the image's median intensity, which is often good enough for a first pass.
However, there are often cases when we might get better results if we used a good pair of low and high threshold values through manual trial and error using a few images before running Canny on our entire collection of images.
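A minimal Canny sketch using the median-based heuristic mentioned above; the 0.67/1.33 multipliers and the path are assumptions:

import cv2
import numpy as np

image_gray = cv2.imread('images/plane.jpg', cv2.IMREAD_GRAYSCALE)   # hypothetical path

median_intensity = np.median(image_gray)
lower_threshold = int(max(0, 0.67 * median_intensity))
upper_threshold = int(min(255, 1.33 * median_intensity))

image_canny = cv2.Canny(image_gray, lower_threshold, upper_threshold)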
Our interest in detecting corners is motivated by the same reason as for detecting edges: corners are points of high information. The Harris detector's output is a grayscale image depicting potential corners. Alternatively, we can use a similar detector, the Shi-Tomasi corner detector (goodFeaturesToTrack), which works in a similar way to the Harris detector but identifies a fixed number of strong corners.
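A sketch of both detectors; the parameter values are assumptions that typically need tuning:

import cv2
import numpy as np

image_bgr = cv2.imread('images/plane.jpg')                     # hypothetical path
image_gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)

# Harris detector: a grayscale "corner response" map
harris = cv2.cornerHarris(np.float32(image_gray), blockSize=2, ksize=3, k=0.04)

# Shi-Tomasi detector: a fixed number (here 10) of strong corners as coordinates
corners = cv2.goodFeaturesToTrack(image_gray, maxCorners=10, qualityLevel=0.05, minDistance=25)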
This problem will motivate the dimensionality reduction strategies discussed in a later chapter, which attempt to reduce the number of features while not losing an excessive amount of information contained in the data. Solution Each pixel in an image is represented by the combination of multiple color channels (often three: red, green, and blue). These features can be used like any other features in learning algorithms to classify images according to their colors.
In turn, each channel can take on one of 256 values, represented by an integer between 0 and 255. To build intuition, imagine a histogram of a small pandas Series such as [1, 1, 2, 2, 3, 3, 3, 4, 5]: each bar represents the number of times each value (1, 2, etc.) appears. We can apply this same technique to each of the color channels, but instead of five possible values, we have the 256 possible values a channel can take. This distribution of channel values can be shown for all three channels.
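A sketch of two simple color features: the mean of each channel and a 256-bin histogram per channel. The path is hypothetical; note OpenCV orders channels blue, green, red:

import cv2
import numpy as np

image_bgr = cv2.imread('images/plane.jpg')                     # hypothetical path

# Mean color: one value per channel
channel_means = np.array(image_bgr.mean(axis=(0, 1)))
print(channel_means)

# Histogram of each channel over its 256 possible values, concatenated into one vector
features = []
for channel_index in range(3):
    histogram = cv2.calcHist([image_bgr], [channel_index], None, [256], [0, 256])
    features.append(histogram.flatten())
observation = np.concatenate(features)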
This is problematic because we will practically never be able to collect enough observations to cover even a small fraction of those configurations, and our learning algorithms do not have enough data to operate correctly. In this chapter, we will cover a number of feature extraction techniques for reducing the number of features while preserving as much of the information in the data as possible.
One downside of the feature extraction techniques we discuss is that the new features we generate will not be interpretable by humans. They will contain as much or nearly as much ability to train our models, but will appear to the human eye as a collection of random numbers. If we wanted to maintain our ability to interpret our models, dimensionality reduction through feature selection is a better option. PCA is an unsupervised technique, meaning that it does not use the information from the target vector and instead only considers the feature matrix.
For a mathematical description of how PCA works, see the external resources listed at the end of this recipe. However, we can understand the intuition behind PCA using a simple example. In the following figure, our data contains two features, x1 and x2. Looking at the visualization, it should be clear that observations are spread out like a cigar, with a lot of length and very little height. If we wanted to reduce our features, one strategy would be to project all observations in our 2D space onto the 1D principal component.
We would lose the information captured in the second principal component, but in some situations that would be an acceptable trade-off. This is PCA. PCA is implemented in scikit-learn with the PCA class. This leads to the question of how to select the optimal number of features to keep. It is common to set n_components to a value such as 0.95 or 0.99, meaning that 95% or 99% of the variance of the original features is retained. Solution Use an extension of principal component analysis that uses kernels to allow for non-linear dimensionality reduction.
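A minimal PCA sketch; the digits dataset and the 0.99 variance target are illustrative choices:

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()
features = StandardScaler().fit_transform(digits.data)

pca = PCA(n_components=0.99, whiten=True)   # keep enough components for 99% of the variance
features_pca = pca.fit_transform(features)
print('Original number of features:', features.shape[1])
print('Reduced number of features:', features_pca.shape[1])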
Standard PCA uses linear projection to reduce the features. If the data is linearly separable (i.e., you can draw a straight line or hyperplane between the classes), then PCA works well. However, if your data is not linearly separable (e.g., you can only separate the classes using a curved decision boundary), the linear transformation will not work as well. Ideally, we would want a transformation that would both reduce the dimensions and also make the data linearly separable.
Kernel PCA can do both. Kernels allow us to project the linearly inseparable data into a higher dimension where it is linearly separable; this is called the kernel trick. We can even specify a linear projection (kernel='linear'), which will produce the same results as standard PCA.
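A minimal kernel PCA sketch on data that is not linearly separable; the gamma value and circle-shaped data are assumptions:

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

features, _ = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

kpca = KernelPCA(kernel='rbf', gamma=15, n_components=1)
features_kpca = kpca.fit_transform(features)
print('Reduced number of features:', features_kpca.shape[1])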
One downside of kernel PCA is that there are more parameters to specify. For example, with standard PCA we could set n_components to a fraction of the variance to retain and let PCA choose the number of components; there is no such option in kernel PCA, so we have to define the number of components outright (e.g., n_components=1). Furthermore, kernels come with their own hyperparameters that we will have to set; for example, the radial basis function requires a gamma value. So how do we know which values to use? Through trial and error. Specifically, we can train our machine learning model multiple times, each time with a different kernel or a different value of the parameter.
Solution Try linear discriminant analysis (LDA) to project the features onto component axes that maximize the separation of classes. LDA works similarly to principal component analysis (PCA) in that it projects our feature space onto a lower-dimensional space.
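A minimal LDA sketch on the iris data; reducing to a single discriminant axis is an illustrative choice:

from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
lda = LinearDiscriminantAnalysis(n_components=1)
features_lda = lda.fit(iris.data, iris.target).transform(iris.data)
print('Reduced number of features:', features_lda.shape[1])
print('Variance explained:', lda.explained_variance_ratio_)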
If we project the data onto the y-axis, the two classes are not easily separable (i.e., they overlap), whereas the axis LDA finds separates them well. In the real world, of course, the relationship between the classes will be more complex and the dimensionality will be higher, but the concept remains the same. For example, we can inspect lda.explained_variance_ratio_ to see the amount of variance explained by each component. Formally, given a desired number of returned features, r, NMF factorizes our feature matrix such that V ≈ W H, where V is our n × d feature matrix (n observations, d features), W is an n × r matrix, and H is an r × d matrix.
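A minimal NMF sketch; it requires a non-negative feature matrix, so the digits pixel intensities are used here as an illustrative choice:

from sklearn import datasets
from sklearn.decomposition import NMF

digits = datasets.load_digits()
nmf = NMF(n_components=10, random_state=1, max_iter=1000)
features_nmf = nmf.fit_transform(digits.data)    # this is W; nmf.components_ is H
print('Reduced number of features:', features_nmf.shape[1])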
Additionally, unlike PCA and other techniques we have examined, NMF does not provide us with the explained variance of the outputted features. Truncated singular value decomposition (TSVD) is similar to PCA, but it can operate on sparse feature matrices. One issue with TSVD is that because of how it uses a random number generator, the signs of the output can flip between fittings. An easy workaround is to use fit only once per preprocessing pipeline, then use transform multiple times.
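A minimal TSVD sketch on a sparse matrix; the component count is an arbitrary choice:

from scipy.sparse import csr_matrix
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()
features_sparse = csr_matrix(StandardScaler().fit_transform(digits.data))

tsvd = TruncatedSVD(n_components=10)
features_tsvd = tsvd.fit(features_sparse).transform(features_sparse)
print('Reduced number of features:', features_tsvd.shape[1])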
In the previous chapter, we reduced the dimensionality of our feature matrix by creating new features with (ideally) similar ability to train quality models but with significantly fewer dimensions. This is called feature extraction. In this chapter we will cover an alternative approach: selecting high-quality, informative features and dropping less useful features. This is called feature selection. There are three types of feature selection methods: filter, wrapper, and embedded.
Filter methods select the best features by examining their statistical properties. Wrapper methods use trial and error to find the subset of features that produce models with the highest quality predictions. Embedded methods select the best feature subset as part of, or as an extension of, a learning algorithm's own training process.
However, since embedded methods are closely intertwined with specific learning algorithms, they are difficult to explain prior to a deeper dive into the algorithms themselves. Solution Select a subset of features with variances above a given threshold. Variance thresholding (VT) is one of the most basic approaches to feature selection. It is motivated by the idea that features with low variance are likely less interesting and useful than features with high variance.
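A minimal variance-thresholding sketch on the iris data; the 0.5 threshold is an assumption:

from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold

iris = datasets.load_iris()
thresholder = VarianceThreshold(threshold=0.5)
features_high_variance = thresholder.fit_transform(iris.data)
print(thresholder.variances_)            # the variance of every feature
print(features_high_variance.shape)      # only higher-variance features remain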
VT first calculates the variance of each feature; next, it drops all features whose variance does not meet the chosen threshold. There are two things to keep in mind when using this technique. First, the variance is not centered; that is, it is in the squared unit of the feature itself.
Therefore, VT will not work when feature sets contain features in different units (e.g., one feature in years and another in dollars). Second, the variance threshold is selected manually, so we have to use our own judgment to pick a good value (or use a model selection technique described in a later chapter). Solution Select a subset of features with a Bernoulli random variable variance above a given threshold.
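A minimal sketch for binary features, thresholding on p(1 − p); the toy feature matrix and p = 0.75 are assumptions:

from sklearn.feature_selection import VarianceThreshold

# Toy binary features; in each column, p is the proportion of ones
features = [[0, 1, 0],
            [0, 1, 1],
            [0, 1, 0],
            [0, 1, 1],
            [1, 0, 0]]

thresholder = VarianceThreshold(threshold=0.75 * (1 - 0.75))   # keep variance above p(1 - p)
print(thresholder.fit_transform(features))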
In binary features i. Therefore, by setting p, we can remove features where the vast majority of observations are one class. Solution Use a correlation matrix to check for highly correlated features. If two features are highly correlated, then the information they contain is very similar, and it is likely redundant to include both features. The solution to highly correlated features is simple: remove one of them from the feature set. In our solution, first we create a correlation matrix of all features: Correlation matrix dataframe.
By calculating the chi-squared statistic between a feature and the target vector, we obtain a measurement of the independence between the two. If the target is independent of the feature variable, then it is irrelevant for our purposes because it contains no information we can use for classification.
On the other hand, if the feature and the target are highly dependent, the feature is likely very informative for training our model. In scikit-learn, we can use SelectKBest to select the features with the best statistics. The parameter k determines the number of features we want to keep. It is important to note that chi-square statistics can only be calculated between two categorical vectors. For this reason, chi-squared for feature selection requires that both the target vector and the features are categorical.
However, if we have a numerical feature we can use f_classif to calculate the ANOVA F-value statistic between each feature and the target vector. F-value scores examine whether, when we group the numerical feature by the target vector, the means for each group are significantly different. For example, if we had a binary target vector, gender, and a quantitative feature, test scores, the F-value score would tell us if the mean test score for men is different than the mean test score for women.
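A minimal SelectKBest sketch showing both statistics; converting the iris features to integers for chi-squared is an illustrative simplification:

from sklearn import datasets
from sklearn.feature_selection import SelectKBest, chi2, f_classif

iris = datasets.load_iris()
features, target = iris.data, iris.target

# Chi-squared expects non-negative categorical/count features
features_kbest = SelectKBest(chi2, k=2).fit_transform(features.astype(int), target)
print(features_kbest.shape)

# ANOVA F-value works with numerical features
features_kbest_f = SelectKBest(f_classif, k=2).fit_transform(features, target)
print(features_kbest_f.shape)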
That is, repeatedly train a model, each time removing a feature, until model performance (e.g., accuracy) becomes worse; the remaining features are the best. Scikit-learn's RFECV implements this recursive feature elimination with cross-validation. The first time we train the model, we include all the features. Then, we find the feature with the smallest parameter (notice that this assumes the features are either rescaled or standardized), meaning it is less important, and remove that feature from the feature set.
The obvious question then is: how many features should we keep? We could (hypothetically) repeat this loop until only some arbitrary number of features is left, but a better approach requires a concept called cross-validation (CV). We will discuss cross-validation in detail in the next chapter, but here is the general idea. Given data containing (1) a target we want to predict and (2) a feature matrix, first we split the data into two groups: a training set and a test set.
Second, we train our model using the training set and use it to predict the target values of the test set. Finally, we compare those predicted target values with the true target values to evaluate our model. By running CV after every round of feature elimination, we can tell when removing a feature starts to hurt. If CV shows that our model improved after we eliminated a feature, then we continue on to the next loop. However, if CV shows that our model got worse after we eliminated a feature, we put that feature back into the feature set and select those features as the best. The estimator parameter determines the type of model we want to train (e.g., linear regression); a sketch of the full pattern follows.
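A minimal RFECV sketch; the simulated regression data and the scoring choice are assumptions:

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

features, target = make_regression(n_samples=100, n_features=20, n_informative=3, random_state=1)

rfecv = RFECV(estimator=LinearRegression(), step=1, scoring='neg_mean_squared_error')
rfecv.fit(features, target)
print('Number of best features:', rfecv.n_features_)
print('Selected feature mask:', rfecv.support_)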
The scoring parameter sets the metric of quality we use to evaluate our model during cross-validation. It might appear strange to discuss model evaluation before discussing how to create models, but there is a method to our madness. Models are only as useful as the quality of their predictions, and thus fundamentally our goal is not to create models (which is easy) but to create high-quality models (which is hard).
Solution Create a pipeline that preprocesses the data, trains the model, and then evaluates it using cross-validation.
However, this approach is fundamentally flawed. If we train a model using our data, and then evaluate how well it did on that data, we are not achieving our desired goal. Our goal is not to evaluate how well the model does on our training data, but how well it does on data it has never seen before e. For this reason, our method of evaluation should help us understand how well models are able to make predictions from data they have never seen before.
One strategy might be to hold off a slice of data for testing. This is called validation or hold-out. In validation our observations features and targets are split into two sets, traditionally called the training set and the test set. We take the test set and put it off to the side, pretending that we have never seen it before. Finally, we simulate having never before seen external data by evaluating how our model trained on our training set performs on our test set.
However, the validation approach has two major weaknesses. First, the performance of the model can be highly dependent on which few observations were selected for the test set. Second, the model is not being trained using all the available data, and not being evaluated on all the available data.
A better strategy, which overcomes these weaknesses, is called k-fold cross-validation (KFCV). In KFCV, we split the data into k parts called folds; the model is trained using k − 1 of the folds (combined into one training set) and evaluated on the remaining fold, which acts as the test set. We repeat this k times, each time using a different fold as the test set. The performance of the model for each of the k iterations is then averaged to produce an overall measurement.
First, KFCV assumes that each observation was created independently of the others (i.e., the data is independent and identically distributed, IID). If the data is IID, it is a good idea to shuffle observations when assigning them to folds. In scikit-learn, we can conduct stratified k-fold cross-validation by replacing the KFold class with StratifiedKFold. Second, when we use validation sets or cross-validation, it is important to preprocess data based on the training set alone and then apply those transformations to both sets. For example, when we fit our standardization object, standardizer, we calculate the mean and variance of only the training set.
Then we apply that transformation using transform to both the training and test sets. If we fit our preprocessors using observations from both the training and test sets, some of the information from the test set leaks into our training set.
This rule applies for any preprocessing step, such as feature selection. Using scikit-learn's Pipeline, we first create a pipeline that preprocesses the data (e.g., standardizes it) and then trains a model; the whole pipeline is then cross-validated with cross_val_score. The cv parameter determines our cross-validation strategy, and the scoring parameter defines our metric for success, a number of which are discussed in other recipes in this chapter. Finally, n_jobs=-1 tells scikit-learn to use every available core. For example, if your computer has four cores (a common number for laptops), then scikit-learn will use all four cores at once to speed up the operation.
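A minimal sketch of the pattern; the digits dataset, 10 folds, and logistic regression are illustrative choices:

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

kf = KFold(n_splits=10, shuffle=True, random_state=1)
cv_results = cross_val_score(pipeline, digits.data, digits.target,
                             cv=kf, scoring='accuracy', n_jobs=-1)
print(cv_results.mean())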
If we encode that assumption into a baseline model, we are able to concretely state the benefits of using a machine learning approach. For a regression baseline (scikit-learn's DummyRegressor), performance is commonly reported as R²: the closer R² is to 1, the more of the variance in the target vector is explained by the features. For a classification baseline (DummyClassifier), the strategy parameter gives us a number of options for generating values; for example, stratified makes predictions proportional to the training set's class proportions, while uniform generates predictions uniformly at random between the different classes. Accuracy is a common performance metric.
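A minimal baseline sketch; the wine dataset and the uniform strategy are illustrative choices:

from sklearn import datasets
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, random_state=0)

dummy = DummyClassifier(strategy='uniform', random_state=1)
dummy.fit(X_train, y_train)
print('Baseline accuracy:', dummy.score(X_test, y_test))

# For regression, DummyRegressor(strategy='mean') predicts the training mean;
# its score method reports R^2, so a baseline near 0.0 is expected.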
Observations that are part of the positive class (has the disease, purchased the product, etc.) and that we correctly predict are true positives (TP). Observations that are part of the negative class (does not have the disease, did not purchase the product, etc.) and that we correctly predict are true negatives (TN). A negative observation that we predict to be positive is a false positive (FP), also called a Type I error. A positive observation that we predict to be negative is a false negative (FN), also called a Type II error. However, in the real world our data often has imbalanced classes (e.g., the vast majority of observations belong to one class); in that setting a model can be highly accurate simply by predicting the majority class while having little real predictive power. For this reason, we are often motivated to use other metrics like precision, recall, and the F1 score.
Precision is the proportion of every observation predicted to be positive that is actually positive. We can think about it as a measurement of noise in our predictions—that is, when we predict something is positive, how likely we are to be right.
Models with high precision are pessimistic in that they only predict an observation is of the positive class when they are very certain about it. Models with high recall, in contrast, are optimistic: they have a low bar for predicting that an observation is in the positive class. (If that distinction takes a moment to keep straight, it illustrates one downside of precision and recall: they are less intuitive than accuracy.)
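A minimal sketch of computing these metrics through cross-validation; the simulated data is an assumption:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

features, target = make_classification(n_samples=1000, n_features=10, random_state=1)
model = LogisticRegression()

for metric in ['accuracy', 'precision', 'recall', 'f1']:
    scores = cross_val_score(model, features, target, scoring=metric, cv=5)
    print(metric, scores.mean())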
Almost always we want some kind of balance between precision and recall, and this role is filled by the F1 score, the harmonic mean of precision and recall. However, better metrics often involve using some balance of precision and recall—that is, a trade-off between the optimism and pessimism of our model. The Receiver Operating Characteristic (ROC) curve is a common way to evaluate binary classifiers: ROC compares the presence of true positives and false positives at every probability threshold (i.e., the probability at which an observation is predicted to belong to a class). By plotting the ROC curve, we can see how the model performs.
A classifier that predicts at random will appear as the diagonal line. The better the model, the closer it is to the solid line. Until now we have only examined models based on the classes they predict; however, many classifiers can also output predicted probabilities (e.g., with predict_proba). That is, each observation is given an explicit probability of belonging to each class. By default, scikit-learn predicts an observation is part of the positive class if that probability is greater than 0.5 (often called the threshold).
However, instead of a middle ground, we will often want to explicitly bias our model to use a different threshold for substantive reasons. For example, if a false positive is very costly to our company, we might prefer a model that has a high probability threshold. We fail to predict some positives, but when an observation is predicted to be positive, we can be very confident that the prediction is correct.
For example, in our solution we can read off the true positive rate and false positive rate produced at any particular threshold. For this reason the entire curve is often summarized as the area under the ROC curve (AUC): the better a model is, the higher the curve and thus the greater the area under the curve. Many of these metrics can be extended for use when we have more than two classes; precision, recall, and F1 scores, which we have already covered in detail in previous recipes, are examples. Solution Use a confusion matrix, which compares predicted classes and true classes.
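A minimal confusion matrix sketch; seaborn is assumed to be available for the heatmap, and the train/test split is an illustrative choice:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=1)

predictions = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)
matrix = confusion_matrix(y_test, predictions)

sns.heatmap(pd.DataFrame(matrix), annot=True, cbar=None, cmap='Blues')
plt.xlabel('Predicted class')
plt.ylabel('True class')
plt.show()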
One of the major benefits of confusion matrices is their interpretability. Each column of the matrix (often visualized as a heatmap) represents predicted classes, while every row shows true classes.
The end result is that every cell is one possible combination of predicted and true classes. This is probably best explained using an example.
In the top-left cell of the solution's matrix, every flower predicted to be Iris setosa actually was Iris setosa; this means the model accurately predicted all Iris setosa flowers. However, the model does not do as well at predicting Iris virginica. The bottom-right cell indicates that the model successfully predicted nine observations of Iris virginica, but (looking one cell up) it also predicted six flowers to be virginica that were actually Iris versicolor.
In a bad model, the observation counts will be spread more evenly around the cells. Second, a confusion matrix lets us see not only where the model was wrong, but also how it was wrong. That is, we can look at patterns of misclassification. For example, our model had an easy time differentiating Iris virginica and Iris setosa, but a much more difficult time classifying Iris virginica and Iris versicolor.
Finally, confusion matrices work with any number of classes (although if we had one million classes in our target vector, the visualization might be difficult to read). Mean squared error (MSE) is one of the most common evaluation metrics for regression: it is the average of the squared distances between the predicted and true values, MSE = (1/n) Σ (ŷᵢ − yᵢ)².
The higher the value of MSE, the greater the total squared error and thus the worse the model. Squaring means that a few large errors are penalized more heavily than many small ones; in practice this implication is rarely an issue (and indeed can be theoretically beneficial), and MSE works perfectly fine as an evaluation metric. One important note: by default in scikit-learn, arguments of the scoring parameter assume that higher values are better than lower values.
However, this is not the case for MSE, where higher values mean a worse model; scikit-learn therefore scores with the negated value, neg_mean_squared_error. A related regression metric is R², which measures the amount of variance in the target vector explained by the model: the closer to 1.0, the better the model. Evaluating clustering is trickier—you have used an unsupervised learning algorithm to cluster your data and now want to know how well it did, but there is no ground truth to compare against. That said, one option is to evaluate clustering using silhouette coefficients, which measure the quality of the clusters: good clusters have small distances between observations in the same cluster and large distances between different clusters, and silhouette coefficients provide a single value measuring both traits. Finally, if we want to evaluate models with our own custom metric, we define a function that takes in two arguments—the ground truth target vector and our predicted values—and outputs some score.
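A minimal custom-metric sketch with make_scorer; the simulated data and the reuse of R² inside the function are assumptions for illustration:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import cross_val_score

features, target = make_regression(n_samples=100, n_features=3, random_state=1)

def custom_metric(y_true, y_predicted):
    # Any function of the true and predicted values can go here; we simply reuse R^2
    return r2_score(y_true, y_predicted)

scorer = make_scorer(custom_metric, greater_is_better=True)
print(cross_val_score(Ridge(), features, target, scoring=scorer, cv=3))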
Solution Plot the learning curve, which shows a model's performance on the training set and during cross-validation as the number of training observations grows. Learning curves are commonly used to determine if our learning algorithms would benefit from gathering additional training data. For a quick text report of evaluation metrics, scikit-learn's classification_report prints precision, recall, and F1 score for each class along with support; support refers to the number of observations in each class. Solution Plot the validation curve.
One hyperparameter in random forest classifiers is the number of trees in the forest. Usually we would pick its value during model selection, but it is occasionally useful to visualize how model performance changes as the hyperparameter value changes.
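A minimal validation curve sketch over the number of trees; the digits dataset and the candidate tree counts are assumptions:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

digits = load_digits()
param_range = np.array([10, 50, 100, 200])
train_scores, test_scores = validation_curve(
    RandomForestClassifier(), digits.data, digits.target,
    param_name='n_estimators', param_range=param_range,
    cv=3, scoring='accuracy', n_jobs=-1)

plt.plot(param_range, train_scores.mean(axis=1), label='Training score')
plt.plot(param_range, test_scores.mean(axis=1), label='Cross-validation score')
plt.xlabel('Number of trees')
plt.ylabel('Accuracy')
plt.legend()
plt.show()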
When we have a small number of trees, both the training and cross-validation scores are low, suggesting the model is underfitted. As the number of trees increases, the accuracy of both levels off, suggesting there is probably not much value in the computational cost of training a massive forest. In addition to selecting a learning algorithm, many learning algorithms (e.g., support vector classifiers and random forests) have hyperparameters that must be chosen outside of the learning process itself. This is often referred to as hyperparameter tuning, hyperparameter optimization, or model selection.
In this book we refer to both selecting the best learning algorithm and selecting its best hyperparameters as model selection. The reason is straightforward: imagine we have data and want to train a support vector classifier with 10 candidate hyperparameter values and a random forest classifier with 10 candidate hyperparameter values.
The result is that we are trying to select the best model from a set of 20 candidate models. Throughout this chapter we will refer to specific hyperparameters, such as C (the inverse of regularization strength).
We will cover them in later chapters; for now, just realize that C and the regularization penalty can take a range of values, which have to be specified prior to training. GridSearchCV is a brute-force approach to model selection using cross-validation: for every candidate combination of hyperparameter values it trains a model, and the model with the best performance score is selected as the best model (a sketch follows).
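A minimal grid search sketch over C and the penalty; the iris data, the candidate grid, and the liblinear solver are illustrative assumptions:

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

search_space = {'C': np.logspace(0, 4, 10),
                'penalty': ['l1', 'l2']}
grid = GridSearchCV(LogisticRegression(solver='liblinear'), search_space, cv=5, verbose=0)
best_model = grid.fit(iris.data, iris.target)
print(best_model.best_params_)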
For each combination of C and regularization penalty values, we train the model and evaluate it using k-fold cross-validation. The verbose parameter determines whether messages are printed out during the search; while mostly unnecessary, it can be reassuring during long searching processes to receive an indication that the search is progressing. Solution Create a dictionary of candidate learning algorithms and their hyperparameters.
Recent versions of scikit-learn allow us to include learning algorithms as part of the search space. In our solution we define a search space that includes two learning algorithms: logistic regression and random forest classifier.
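A sketch of that pattern, treating the classifier itself as a searched-over parameter of a pipeline; the two algorithms and their candidate values are assumptions:

import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

iris = datasets.load_iris()
pipe = Pipeline([('classifier', LogisticRegression(solver='liblinear'))])

search_space = [{'classifier': [LogisticRegression(solver='liblinear')],
                 'classifier__C': np.logspace(0, 4, 5)},
                {'classifier': [RandomForestClassifier()],
                 'classifier__n_estimators': [10, 100]}]

best_model = GridSearchCV(pipe, search_space, cv=5).fit(iris.data, iris.target)
print(best_model.best_estimator_.get_params()['classifier'])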
First, GridSearchCV uses cross-validation to determine which model has the highest performance. However, in cross-validation we are in effect pretending that the held-out fold is unseen data, and thus it must not be used to fit any preprocessing steps (e.g., scaling or standardization). For this reason, we cannot preprocess the data and then run GridSearchCV. Rather, the preprocessing steps must be a part of the set of actions taken by GridSearchCV.
While this might appear complex, the reality is that scikit-learn makes it simple. FeatureUnion allows us to combine multiple preprocessing actions into a single object; in our solution this object is called preprocess and contains both of our preprocessing steps, and it is then placed in a pipeline with our learning algorithm. Second, some preprocessing methods have their own parameters, which often have to be supplied by the user.
Luckily, scikit-learn makes this easy. When we include candidate component values in the search space, they are treated like any other hyperparameter to be searched over.
However, in the real world we will often have many thousands or tens of thousands of models to train, and the end result is that it can take many hours to find the best model. To speed up the process, scikit-learn lets us train multiple models simultaneously. Without going into too much technical detail, scikit-learn can simultaneously train models up to the number of cores on the machine; the parameter n_jobs defines the number of models to train in parallel. Most modern laptops have four cores, so (assuming you are currently on a laptop) we can potentially train four models at the same time.
This will dramatically increase the speed of our model selection process. In addition, many learning algorithms in scikit-learn (e.g., ridge, lasso, and elastic net regression) have algorithm-specific cross-validation methods. For example, LogisticRegression is used to conduct a standard logistic regression classifier, while LogisticRegressionCV implements an efficient cross-validated logistic regression classifier that can identify the optimum value of the hyperparameter C. LogisticRegressionCV's Cs parameter accepts either a list or an integer: if supplied a list, Cs contains the candidate hyperparameter values to select from.
If supplied an integer, Cs generates that many candidate values, drawn logarithmically from a reasonable default range. However, a major downside to LogisticRegressionCV is that it can only search a range of values for C. A related subtlety concerns evaluating performance after model selection. Recall that in k-fold cross-validation we train on k − 1 folds, evaluate on the held-out fold, and then repeat this process k times. In the model selection searches described in this chapter (e.g., GridSearchCV), we used cross-validation both to choose hyperparameter values and to estimate performance, so the performance estimate can be optimistic.
The solution? Wrap the cross-validation used for model search in another cross-validation! If you are confused, try a simple experiment. If you have ever taken an introductory statistics course in college, likely the final topic you covered was linear regression.
Solution Use scikit-learn's LinearRegression. Linear regression assumes that the relationship between the features and the target vector is approximately linear; that is, the effect (also called coefficient, weight, or parameter) of the features on the target vector is constant. In our solution, for the sake of explanation we have trained our model using only two features. After we have fit our model, we can view the value of each parameter.
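A minimal sketch of fitting a line and inspecting its parameters; simulated data with two features stands in for the book's housing dataset:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

features, target = make_regression(n_samples=100, n_features=2, noise=10, random_state=1)

regression = LinearRegression()
model = regression.fit(features, target)
print('Intercept:', model.intercept_)
print('Coefficients:', model.coef_)   # the effect of each feature on the target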
For example, the first feature in our solution is the number of crimes per resident, and its coefficient tells us the change in the target for a one-unit change in that feature. Sometimes, however, a feature's effect on the target depends on another feature. Imagine a coffee example with two binary features—whether sugar was added and whether the coffee was stirred: sugar alone does not make the coffee taste sweet until it is stirred in, so the effects of sugar and stir on sweetness are dependent on each other. In this case we say there is an interaction effect between the features sugar and stirred. In our solution, we used a dataset containing only two features. To create interaction terms using PolynomialFeatures, there are three important parameters we must set.
First, interaction_only=True tells PolynomialFeatures to return only interaction terms (and not polynomial features). Second, by default PolynomialFeatures adds a feature containing ones (a bias), which we can prevent with include_bias=False. Finally, the degree parameter determines the maximum number of features to create interaction terms from, in case we wanted an interaction term that is the combination of three features (see the sketch after this paragraph). In linear regression, we assume the effect of the number of stories on building height is approximately constant, meaning a 20-story building will be roughly twice as high as a 10-story building, which will be roughly twice as high as a 5-story building.
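A minimal interaction-term sketch; the tiny feature matrix is made up for illustration:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

features = np.array([[2, 3],
                     [2, 3],
                     [2, 3]])

interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(interaction.fit_transform(features))   # columns: x1, x2, x1*x2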
Many relationships of interest, however, are not strictly linear. Often we want to model a non-linear relationship—for example, the relationship between the number of hours a student studies and the score she gets on the test. Intuitively, we can imagine there is a big difference in test scores between students who study for one hour compared to students who did not study at all. However, there is a much smaller difference in test scores between a student who studied for 99 hours and a student who studied for 100 hours.
To create a polynomial regression, we keep the linear model used for fitting a line but add polynomial features. How are we able to use a linear regression for a nonlinear function? The answer is that we do not change how the linear regression fits the model; we only add polynomial features, and the regression just considers each one more variable. A more practical description might be in order. To model nonlinear relationships, we can create new features that raise an existing feature, x, up to some power: x², x³, and so on.
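A minimal polynomial regression sketch; the toy data and its quadratic relationship are made up to show the idea:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.arange(10).reshape(-1, 1).astype(float)
y = 0.5 * x.ravel() ** 2 + x.ravel() + 2        # an assumed nonlinear relationship

polynomial = PolynomialFeatures(degree=3, include_bias=False)
x_poly = polynomial.fit_transform(x)            # columns: x, x^2, x^3

model = LinearRegression().fit(x_poly, y)
print(model.coef_)                              # the x^2 coefficient should be near 0.5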
From there, you can insert, combine, or adapt the code to help construct your application. Recipes also include a discussion that explains the solution and provides meaningful context. This cookbook takes you beyond theory and concepts by providing the nuts and bolts you need to construct working machine learning applications.