In this post we'll see how to use decision tree regression, which applies decision trees to regression tasks, i.e. predicting continuous values. Decision trees can also be used to build classification models that predict a category or class label.
How does decision tree regression work
A decision tree regressor splits the data using features and threshold values, which enables it to capture complex, non-linear relationships.
The decision tree regression model has a binary tree-like structure consisting of-
- Root node- Starting point which represents the whole dataset.
- Decision nodes- A decision point where the algorithm chooses a feature and a threshold to split the data into subsets.
- Branches- From each decision node there are branches to child nodes, each representing an outcome of the rule (the tested decision). For example, if you have a housing dataset with square footage as one of the features and 1500 as the threshold value, the algorithm at a decision node asks- Is square footage ≤ 1500?
- If yes, go left (houses of 1500 sq ft or smaller)
- If no, go right (houses larger than 1500 sq ft)
- Leaf node- Contains the final predicted value. Also known as the terminal node.
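These pieces can be seen by fitting a tiny tree on hypothetical data (the square-footage example above, not the post's dataset) and printing its text form- the root's rule and the two leaf values are the means of each side:

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# hypothetical square-footage data mirroring the example above
X = [[1000], [1400], [1600], [2000]]
y = [100.0, 110.0, 200.0, 210.0]

reg = DecisionTreeRegressor(max_depth=1).fit(X, y)
# the printout shows the root's decision rule and the two leaf values
print(export_text(reg, feature_names=["square_footage"]))
```

With max_depth=1 the tree is just a root node and two leaves, so the printout makes the root/branch/leaf terminology concrete.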
How is the feature selected
If there are multiple features, at each node only one of them is selected for the decision rule, but that feature is not picked arbitrarily. All of the features are evaluated using the following steps-
- For each feature, the algorithm considers possible split points (threshold values).
- For each candidate split, the algorithm computes the decrease in impurity after splitting. For decision tree regressor,
impurity is measured using one of the following metrics-
- Mean Squared Error (MSE) which is the default
- Friedman MSE
- Mean Absolute Error (MAE)
- Poisson deviance
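In scikit-learn these metrics map to the criterion parameter of DecisionTreeRegressor- a small sketch (parameter values as named in current scikit-learn versions):

```python
from sklearn.tree import DecisionTreeRegressor

# criterion selects the impurity metric:
# "squared_error" (MSE, the default), "friedman_mse",
# "absolute_error" (MAE), "poisson" (Poisson deviance)
reg_mse = DecisionTreeRegressor(criterion="squared_error")
reg_mae = DecisionTreeRegressor(criterion="absolute_error")
```

MAE-based trees use medians at the leaves instead of means, so they are more robust to outliers but slower to train.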
When splitting a parent node into left (L) and right (R) child nodes, the total cost of the split is the sum of the child costs-
$$C(\mathrm{split})=C(L)+C(R)$$
where \(C(L)\) and \(C(R)\) are the weighted costs of the left and right child nodes. The algorithm evaluates all possible features and thresholds, and chooses the split that minimizes this total squared error across the child nodes.
For a parent node with N samples split into-
- Left child with N_L samples
- Right child with N_R samples
The cost of the split is-
$$C(\mathrm{split})=\frac{N_L}{N}\cdot MSE(L)\; +\; \frac{N_R}{N}\cdot MSE(R)$$
where-
$$MSE(L)=\frac{1}{N_L}\sum_{i\in L}(y_i-\bar{y}_L)^2 \qquad MSE(R)=\frac{1}{N_R}\sum_{i\in R}(y_i-\bar{y}_R)^2$$
- \(y_i\) = target value of sample i
- \(\bar{y}_L, \bar{y}_R\) = mean target values in the left and right child nodes
- \(N=N_L+N_R\)
- The same procedure is repeated recursively for each child node until a stopping criterion is met (e.g., max_depth, min_samples_leaf, max_leaf_nodes, or no further improvement). In other words, at each node:
- Compute the cost of split for all candidate features and thresholds.
- Choose the split with the minimum cost.
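To make the search concrete, here is a minimal sketch (on hypothetical toy data, not the author's code) that scans the candidate thresholds for a single feature and computes the weighted-MSE cost of each split:

```python
import numpy as np

def split_cost(y_left, y_right):
    """Weighted cost: C(split) = (N_L/N)*MSE(L) + (N_R/N)*MSE(R)."""
    def mse(y):
        return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0
    n_l, n_r = len(y_left), len(y_right)
    return (n_l / (n_l + n_r)) * mse(y_left) + (n_r / (n_l + n_r)) * mse(y_right)

def best_split(x, y):
    """Try the midpoint between each pair of adjacent feature values; return the cheapest."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xs = np.unique(x)
    best_t, best_c = None, np.inf
    for t in (xs[:-1] + xs[1:]) / 2:  # candidate thresholds
        c = split_cost(y[x <= t], y[x > t])
        if c < best_c:
            best_t, best_c = t, c
    return best_t, best_c

# hypothetical toy data: square footage -> price, with a jump around 1500 sq ft
sqft = [1000, 1200, 1400, 1600, 1800, 2000]
price = [100, 110, 105, 200, 210, 205]
t, c = best_split(sqft, price)
print(t)  # 1500.0 -> the threshold with the lowest weighted MSE
```

A real implementation repeats this over every feature and then recurses into the two children; this sketch only shows the cost computation for one feature.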
Scikit-learn uses the Classification and Regression Tree (CART) algorithm to train decision trees.
Here is the decision tree structure (with max depth as 3) for the laptop data used in the example in this post.
How is the value predicted
Before going into how prediction is done in a decision tree regressor, note that by following the decision rules at each node, a sample will ultimately fall into one of the leaf nodes. The value you see in each leaf node in the above image is the average of the target values of all the training samples that ended up in that leaf.
To make a prediction for a new data point, you traverse the tree from the root to a leaf node by following the decision rules. In the above image, at the root node the algorithm has chosen the "TypeName" feature (that feature is one-hot encoded, which is why the feature name appears as "encoder_type_notebook") with a threshold value of 0.5. In the same way, each of the other decision nodes has a rule that is evaluated for the new data point, so it too ultimately falls into one of the leaf nodes. The predicted value for the new data point is then the value stored in that leaf (the average of the target values of the training samples that ended up there).
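As a sketch of that traversal (on hypothetical toy data, since the post's fitted tree isn't reproduced here), you can walk scikit-learn's fitted tree arrays by hand and confirm the result matches predict:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# hypothetical toy data: square footage -> price, jump around 1500 sq ft
X = np.array([[1000], [1200], [1400], [1600], [1800], [2000]])
y = np.array([100.0, 110.0, 105.0, 200.0, 210.0, 205.0])

reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
tree = reg.tree_

def manual_predict(x):
    """Walk from the root, applying each node's rule, until a leaf is reached."""
    node = 0
    while tree.children_left[node] != -1:  # children_left is -1 at a leaf
        if x[tree.feature[node]] <= tree.threshold[node]:
            node = tree.children_left[node]
        else:
            node = tree.children_right[node]
    # leaf value = mean of the training targets that ended up in that leaf
    return tree.value[node][0][0]

x_new = np.array([1700.0])
print(manual_predict(x_new), reg.predict(x_new.reshape(1, -1))[0])  # the two agree
```

The loop is the whole prediction algorithm: one comparison per level of the tree, which is why decision tree prediction is fast even for large trees.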
Decision tree regression using scikit-learn Python library
Dataset used here can be downloaded from- https://www.kaggle.com/datasets/illiyask/laptop-dataset
The goal is to predict the price of a laptop based on the given features.
In the implementation, the code is broken into several smaller units with some explanation of the steps in between.
1. Importing libraries and reading CSV file
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
df = pd.read_csv('./laptop_eda.csv')
laptop_eda.csv file is in the current directory.
2. Getting info about the data.
df.describe(include='all')
Analyzing the data shows there are 1300 rows; Price has a lot of variance, with a minimum value of 9270.72 and a maximum value of 324954.72.
3. Check for duplicates and missing values
# for duplicates
df.duplicated().value_counts()
# for missing values
df.isnull().sum()
Output (for duplicates)
False    1270
True       30
Name: count, dtype: int64
There are duplicates which can be removed.
df.drop_duplicates(inplace=True)
4. Plotting pairwise relationship in the dataset
sns.pairplot(df[["Company", "Ram", "Weight", "SSD", "Price"]], kind="reg")
plt.show()
This helps in understanding the relationships between features as well as with the dependent variable.
If you analyse the plots, the relationship between Price and Ram looks roughly linear; otherwise the relationships among the pairs are non-linear.
5. Checking for outliers
To check for extreme values the IQR method is used. The IQR (Interquartile Range) method detects outliers by finding data points falling below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
Here IQR = Q3 - Q1 (middle 50% of data)
- Q1 is the 25th percentile
- Q3 is the 75th percentile
for label, content in df.select_dtypes(include='number').items():
q1 = content.quantile(0.25)
q3 = content.quantile(0.75)
iqr = q3 - q1
outl = content[(content <= q1 - 1.5 * iqr) | (content >= q3 + 1.5 * iqr)]
perc = len(outl) * 100.0 / df.shape[0]
print("Column %s outliers = %.2f%%" % (label, perc))
Output
Column Ram outliers = 17.24%
Column Weight outliers = 3.54%
Column Touchscreen outliers = 100.00%
Column ClockSpeed outliers = 0.16%
Column HDD outliers = 0.00%
Column SSD outliers = 1.42%
Column PPI outliers = 9.37%
Column Price outliers = 2.20%
Going back to where the data info was displayed, Ram values range from 2 GB to 64 GB, which looks fine in the context of a laptop dataset and doesn't require dropping any rows.
Touchscreen has only 2 values, 0 and 1 (a binary column), so the 100% figure is an artifact of the IQR method on a zero-IQR column and no rows need to be deleted as outliers there either.
Price also has a lot of variance. You can plot its distribution to take a closer look.
dp = sns.displot(df['Price'], kde=True, bins=30)
dp.set(xlim=(0, None))
The plot shows positive skewness, but in this example no Price outliers are deleted. You can compare the final results with all rows kept versus with outliers removed.
6. Feature and label selection
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
7. Checking for multicollinearity
Multicollinearity check is generally not required for decision tree regression. Decision trees split the data based on thresholds of individual features. They don't estimate coefficients like linear regression does, so correlated predictors don't distort parameter estimates.
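A quick sketch on synthetic data (an illustrative demo, not part of the original post) of why this is so- duplicating a feature as a perfectly correlated copy leaves the tree's fitted predictions unchanged, because the copy offers no split the original feature couldn't already make:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = (X[:, 0] > 5) * 100.0 + rng.normal(0, 1.0, 200)

# add a second feature that is perfectly correlated with the first
X_corr = np.hstack([X, 2 * X])

p_single = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y).predict(X)
p_corr = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_corr, y).predict(X_corr)

# the tree may split on either copy, but the resulting partitions
# (and hence the leaf means and predictions) are the same
print(np.allclose(p_single, p_corr))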
8. Splitting and encoding data
Splitting is done using train_test_split where test_size is passed as 0.2, meaning 20% of the data is used as test data whereas 80% of the data is used to train the model.
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
ct = ColumnTransformer([
('encoder', OneHotEncoder(sparse_output = False, drop = 'first', handle_unknown = 'ignore'), X.select_dtypes(exclude='number').columns)
],remainder = 'passthrough')
X_train_enc = ct.fit_transform(X_train)
X_test_enc = ct.transform(X_test)
9. Training the model and predicting values
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(max_depth=5, random_state=42)
regressor.fit(X_train_enc, y_train)
y_pred = regressor.predict(X_test_enc)
df_result = pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})
df_result.head(10)
Output (side-by-side comparison of actual values and predicted values)
          Actual      Predicted
1176  114731.5536   90147.685886
1117   69929.4672   60902.414400
427   106506.7200   60902.414400
351    75071.5200   60902.414400
364    20725.9200   27029.100706
853    41931.3600   55405.469087
1018  118761.1200   55405.469087
762    60153.1200   83108.137838
461    39906.7200   55405.469087
883    19660.3200   20666.681974
10. Checking model metrics such as R² to see whether the model is overfitting
from sklearn.metrics import r2_score, mean_squared_error
# for training data
print(regressor.score(X_train_enc, y_train))
# for predicted values
print(r2_score(y_test, y_pred))
Output
0.8003266191317047
0.7163190370541831
The R² score for the training data is 0.80 whereas for the test data it is 0.71.
If the training score were very high (close to 1.0) and the test score much lower (like 0.3–0.4), that would indicate overfitting.
If both scores were low (say <0.5), that would indicate the model is too simple and not capturing the patterns.
The gap between 0.80 and 0.71 is modest. This indicates slight overfitting, but nothing extreme; the model generalizes reasonably well.
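One way to see this trade-off is to sweep max_depth and compare train vs test R². The sketch below uses synthetic data from make_regression (results on the laptop CSV will differ, and the exact scores here are not the post's numbers):

```python
import numpy as np
from sklearn.datasets import make_regression  # synthetic stand-in for the laptop data
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=8, noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for depth in [2, 5, 10, None]:  # None = grow until leaves are pure
    reg = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    print(depth, round(reg.score(X_tr, y_tr), 3), round(reg.score(X_te, y_te), 3))
# an unlimited-depth tree memorizes the training set (train R² near 1.0)
# while the test score lags behind: the widening gap is the overfitting signal
```

Tuning max_depth (or min_samples_leaf) against a held-out set or via cross-validation is the usual way to pick a depth that keeps the gap small.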
11. Plotting the tree
If you want to check how decision nodes were created by the algorithm you can plot the decision tree.
from sklearn.tree import plot_tree
plt.figure(figsize=(20,15))
plot_tree(regressor, filled=True, fontsize=10)
plt.show()
Another way to do it is by using the graphviz library. But that means downloading Graphviz from this location- https://graphviz.org/download/
It also requires setting the path to the Graphviz bin directory, which can be done programmatically.
feature_names = ct.get_feature_names_out()  # names of the features used (after encoding)
X_train_final = pd.DataFrame(X_train_enc, columns=feature_names, index=X_train.index)

from sklearn.tree import export_graphviz
dot_data = export_graphviz(
    regressor,
    out_file=None,
    feature_names=X_train_final.columns,
    rounded=True,
    filled=True
)

import os
# setting path
os.environ["PATH"] += os.pathsep + r"D:\Softwares\Graphviz-14.1.1-win64\bin"

from graphviz import Source
Source(dot_data)
That's all for this topic Decision Tree Regression With Example. If you have any doubt or any suggestions to make please drop a comment. Thanks!