Thursday, April 2, 2026

Decision Tree Regression With Example

In this post we'll see how to use decision tree regression, which applies decision trees to regression tasks to predict continuous values. Decision trees can also be used to build classification models that predict a category or class label.

How does decision tree regression work?

In a decision tree regressor, the tree splits the data using features and threshold values, which enables it to capture complex, non-linear relationships.

The decision tree regression model has a binary tree-like structure made up of the following parts-

  1. Root node- The starting point, which represents the whole dataset.
  2. Decision nodes- A decision point where the algorithm chooses a feature and a threshold to split the data into subsets.
  3. Branches- From each decision node there are branches to child nodes, representing the outcome of the rule (the tested decision). For example, if you have a housing dataset, one of the features is square footage and the threshold value is 1500, then at a decision node the algorithm decides- Is square footage ≤ 1500?
    • If yes, go left (houses of 1500 sq ft or smaller)
    • If no, go right (houses larger than 1500 sq ft).
  4. Leaf node- Contains the final predicted value. Also known as a terminal node.
Decision Tree Structure
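As a quick illustration of these parts, here is a minimal sketch on made-up square-footage data (hypothetical values, not the laptop dataset used later in this post). Scikit-learn's `export_text` prints the fitted tree so you can see the root, the decision rules, the branches, and the leaf values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# hypothetical square-footage / price data (assumed values for illustration)
sqft = np.array([[800], [1200], [1400], [1600], [2000], [2400]])
price = np.array([100, 150, 170, 220, 300, 360])

tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(sqft, price)

# decision nodes print as "sqft <= threshold" rules, leaves as "value: ..."
print(export_text(tree, feature_names=["sqft"]))
```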

How is a feature selected?

If there are multiple features, only one of them is selected at each node for the decision rule, but that feature is not picked arbitrarily by the algorithm. All of the features are evaluated using the steps given below-

  1. For each feature, the algorithm considers possible split points (threshold values).
  2. For each candidate split, the algorithm computes the decrease in impurity after splitting. For decision tree regressor, impurity is measured using one of the following metrics-
    • Mean Squared Error (MSE), which is the default
    • Friedman MSE
    • Mean Absolute Error (MAE)
    • Poisson deviance

    When splitting a parent node into left (L) and right (R) child nodes:

    $$C(\mathrm{split})=C(L)+C(R)$$

    The algorithm evaluates all possible features and thresholds, and chooses the split that minimizes the weighted squared error across the child nodes.

    For a parent node with N samples split into-

    • Left child with N_L samples
    • Right child with N_R samples

    The cost of the split is-

    $$C(\mathrm{split})=\frac{N_L}{N}\cdot MSE(L)\; +\; \frac{N_R}{N}\cdot MSE(R)$$

    where-

    $$MSE(L)=\frac{1}{N_L}\sum _{i\in L}(y_i-\bar {y}_L)^2$$

    $$MSE(R)=\frac{1}{N_R}\sum _{i\in R}(y_i-\bar {y}_R)^2$$

    • \(y_i\) = target value of sample i
    • \(\bar {y}_L, \bar {y}_R\) = mean target values in the left and right child nodes
    • \(N=N_L+N_R\)
  3. The same procedure is repeated recursively for each child node until a stopping criterion is met (e.g., max depth, min samples per leaf, max_leaf_nodes, or no further improvement). This means at each node:
    • Compute cost of split (C-split) for all candidate features and thresholds.
    • Choose the split with the minimum cost of split.

    Scikit-learn uses the Classification and Regression Tree (CART) algorithm to train decision trees.
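The split-selection steps above can be sketched in plain NumPy. This is an illustrative brute-force search over one feature, not scikit-learn's actual CART implementation; in scikit-learn the impurity metric is picked via the `criterion` parameter of `DecisionTreeRegressor` ('squared_error' is the default, with 'friedman_mse', 'absolute_error', and 'poisson' as alternatives). The feature values below are made up for illustration.

```python
import numpy as np

def split_cost(y_left, y_right):
    """Weighted cost C(split) = N_L/N * MSE(L) + N_R/N * MSE(R).

    np.var is the mean squared deviation from the mean, which is
    exactly MSE(L) / MSE(R) in the formulas above.
    """
    n_l, n_r = len(y_left), len(y_right)
    return (n_l * np.var(y_left) + n_r * np.var(y_right)) / (n_l + n_r)

def best_split(x, y):
    """Try the midpoint between each pair of adjacent sorted feature
    values and return the threshold with the minimum cost."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_t, best_c = None, np.inf
    for i in range(1, len(x_sorted)):
        t = (x_sorted[i - 1] + x_sorted[i]) / 2  # candidate threshold
        c = split_cost(y_sorted[:i], y_sorted[i:])
        if c < best_c:
            best_t, best_c = t, c
    return best_t, best_c

# hypothetical square-footage -> price data (not the laptop dataset)
sqft = np.array([800, 1200, 1400, 1600, 2000, 2400], dtype=float)
price = np.array([100, 150, 170, 220, 300, 360], dtype=float)
print(best_split(sqft, price))  # threshold 1800.0 gives the lowest cost here
```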

Here is the decision tree structure (with max_depth set to 3) for the laptop data used in the example in this post.

Decision Tree

How is the value predicted?

Before looking at how prediction is done in a decision tree regressor, note that by following the decision rules at each node, a sample ultimately falls into one of the leaf nodes. The value you see in each leaf node in the above image is the average of the target values of all the training samples that ended up in that specific leaf node.

To make a prediction for a new data point, you traverse the tree from the root to a leaf node by following the decision rules. In the above image, at the root node the algorithm has chosen the "TypeName" feature (the TypeName feature is encoded, which is why the feature name appears as "encoder_type_notebook") with a threshold value of 0.5. In the same way, each of the other decision nodes has a rule that is evaluated for the new data point, and the new data point ultimately falls into one of the leaf nodes. The predicted value for the new data point is then the value in that leaf node (the average of the target values of all the training samples that ended up there).
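This traversal can also be followed programmatically. A minimal sketch on made-up data (not the post's laptop model): `apply` returns the id of the leaf each sample lands in, and `predict` returns that leaf's stored average.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# hypothetical data: square footage -> price (assumed values for illustration)
X = np.array([[800.0], [1200.0], [1600.0], [2000.0]])
y = np.array([100.0, 150.0, 220.0, 300.0])
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

new_point = np.array([[1700.0]])
leaf_id = tree.apply(new_point)[0]       # id of the leaf the sample falls into
prediction = tree.predict(new_point)[0]  # average target of the training samples in that leaf
print(leaf_id, prediction)
```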

Decision tree regression using scikit-learn Python library

Dataset used here can be downloaded from- https://www.kaggle.com/datasets/illiyask/laptop-dataset

The goal is to predict the price of a laptop based on the given features.

In the implementation, the code is broken into several smaller units with some explanation of the steps in between.

1. Importing libraries and reading CSV file

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
df = pd.read_csv('./laptop_eda.csv')

The laptop_eda.csv file is in the current directory.

2. Getting info about the data.

df.describe(include='all')

Analyzing the data shows there are 1300 rows and that price has a lot of variance; the minimum value is 9270.72 whereas the maximum value is 324954.72.

3. Check for duplicates and missing values

# for duplicates
df.duplicated().value_counts()
# for missing values
df.isnull().sum()

Output (for duplicates)

False    1270
True       30
Name: count, dtype: int64

There are duplicates, which can be removed.

df.drop_duplicates(inplace=True)

4. Plotting pairwise relationships in the dataset

sns.pairplot(df[["Company", "Ram", "Weight", "SSD", "Price"]], kind="reg")
plt.show()

This helps in understanding the relationships among the features as well as with the dependent variable.

Decision Tree Regression

If you analyse the plots, the relationship between Price and Ram looks roughly linear; otherwise the relationships among the pairs are non-linear.

5. Checking for outliers

To check for extreme values the IQR method is used. The IQR (Interquartile Range) method detects outliers by finding data points falling below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.

Here IQR = Q3 - Q1 (the range of the middle 50% of the data)

  • Q1 is the 25th percentile
  • Q3 is the 75th percentile

for label, content in df.select_dtypes(include='number').items():
    q1 = content.quantile(0.25)
    q3 = content.quantile(0.75)
    iqr = q3 - q1
    outl = content[(content <= q1 - 1.5 * iqr) | (content >= q3 + 1.5 * iqr)]
    perc = len(outl) * 100.0 / df.shape[0]
    print("Column %s outliers = %.2f%%" % (label, perc))

Output

Column Ram outliers = 17.24%
Column Weight outliers = 3.54%
Column Touchscreen outliers = 100.00%
Column ClockSpeed outliers = 0.16%
Column HDD outliers = 0.00%
Column SSD outliers = 1.42%
Column PPI outliers = 9.37%
Column Price outliers = 2.20%

Going back to where the data info was displayed, Ram values range from 2 GB to 64 GB, which looks fine in the context of a laptop dataset and doesn't require dropping any rows.

Touchscreen has only 2 values, 0 and 1 (a binary column), so it also doesn't require deleting any rows as outliers.
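Since a binary column like Touchscreen will always be flagged by the IQR rule, one refinement is to skip numeric columns with two or fewer distinct values before running the check. A small sketch on a made-up toy frame (hypothetical values, not the laptop data):

```python
import pandas as pd

# toy frame for illustration: a binary flag plus a column with one extreme value
toy = pd.DataFrame({
    "Touchscreen": [0, 1, 0, 0, 1],        # binary column, IQR check is meaningless here
    "Weight": [1.2, 1.4, 1.3, 3.9, 1.5],   # 3.9 is the outlier in this toy data
})

for label, content in toy.select_dtypes(include="number").items():
    if content.nunique() <= 2:  # skip binary / constant columns
        continue
    q1, q3 = content.quantile(0.25), content.quantile(0.75)
    iqr = q3 - q1
    outl = content[(content < q1 - 1.5 * iqr) | (content > q3 + 1.5 * iqr)]
    print("Column %s outliers = %d" % (label, len(outl)))
```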

Price also has a lot of variance. You can also plot a displot to look at the distribution.

dp = sns.displot(df['Price'], kde=True, bins=30)
dp.set(xlim=(0, None))

The plot shows positive skewness, but in this example no outliers are deleted for Price. You can compare the final result with all rows kept against the result after deleting outliers.

6. Feature and label selection

X = df.iloc[:, :-1]
y = df.iloc[:, -1]

7. Checking for multicollinearity

A multicollinearity check is generally not required for decision tree regression. Decision trees split the data based on thresholds of individual features; they don't estimate coefficients like linear regression does, so correlated predictors don't distort parameter estimates.

8. Splitting and encoding data

Splitting is done using train_test_split with test_size passed as 0.2, meaning 20% of the data is used as test data whereas 80% of the data is used to train the model.

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ct = ColumnTransformer([
    ('encoder', OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore'),
     X.select_dtypes(exclude='number').columns)
], remainder='passthrough')

X_train_enc = ct.fit_transform(X_train)
X_test_enc = ct.transform(X_test)

9. Training the model and predicting values

from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(max_depth=5, random_state=42)
regressor.fit(X_train_enc, y_train)

y_pred = regressor.predict(X_test_enc)
df_result = pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})
df_result.head(10)

Output (side-by-side comparison of actual values and predicted values)

	 	Actual		Predicted
1176	114731.5536	90147.685886
1117	69929.4672	60902.414400
427		106506.7200	60902.414400
351		75071.5200	60902.414400
364		20725.9200	27029.100706
853		41931.3600	55405.469087
1018	118761.1200	55405.469087
762		60153.1200	83108.137838
461		39906.7200	55405.469087
883		19660.3200	20666.681974

10. Checking model metrics such as R squared to see whether the model is overfitting

from sklearn.metrics import r2_score, mean_squared_error
# for training data
print(regressor.score(X_train_enc, y_train))
# for predicted values
print(r2_score(y_test, y_pred))

Output

0.8003266191317047
0.7163190370541831

The R2 score for training is 0.80 whereas for test data it is 0.71.

If the training score were very high (close to 1.0) and the test score much lower (like 0.3–0.4), that would indicate overfitting.

If both scores were low (say <0.5), that would indicate the model is too simple and not capturing the patterns.

The gap between 0.80 and 0.71 is modest. This indicates slight overfitting, but nothing extreme; the model generalizes reasonably well.
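To narrow such a gap, the tree's depth and leaf size can be tuned with cross-validation. A hedged sketch using synthetic data from `make_regression` rather than the laptop dataset; the parameter grid values here are arbitrary choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression  # synthetic data, not the laptop set
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# search over depth and leaf size, scoring each candidate by cross-validated R2
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": [3, 5, 7, 10, None],
                "min_samples_leaf": [1, 5, 10]},
    scoring="r2",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```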

11. Plotting the tree

If you want to check how decision nodes were created by the algorithm you can plot the decision tree.

from sklearn.tree import plot_tree
plt.figure(figsize=(20,15)) 
plot_tree(regressor, filled=True, fontsize=10)
plt.show()

Another way to do it is by using the graphviz library, but that requires downloading Graphviz from this location- https://graphviz.org/download/

It also requires setting the path to the Graphviz bin directory, which can be done programmatically.

# get the names of the features used (after encoding)
feature_names = ct.get_feature_names_out()
X_train_final = pd.DataFrame(X_train_enc, columns=feature_names, index=X_train.index)

from sklearn.tree import export_graphviz
dot_data = export_graphviz(
    regressor,
    out_file=None,
    feature_names=X_train_final.columns,
    rounded=True,
    filled=True,
)
import os
# setting the path to the Graphviz bin directory
os.environ["PATH"] += os.pathsep + r"D:\Softwares\Graphviz-14.1.1-win64\bin"
from graphviz import Source
Source(dot_data)

That's all for this topic Decision Tree Regression With Example. If you have any doubt or any suggestions to make please drop a comment. Thanks!



Related Topics

  1. Simple Linear Regression With Example
  2. Multiple Linear Regression With Example
  3. Polynomial Regression With Example
  4. Support Vector Regression With Example
  5. Mean, Median and Mode With Python Examples

