Fully Understand Q-Q Plot for Probability Distribution in Machine Learning

Better understanding of skewed data in statistics and data science

Amit Chauhan


An image by the Author

In this article, we will study how to interpret Q-Q plots for different distribution shapes, namely normally distributed data and skewed data.

In simple terms, the Q-Q plot is most often used to check whether a feature (column) follows a normal distribution or not.

Q-Q stands for Quantile-Quantile.

Let's look at an example Q-Q plot and interpret it.

The image above is a Q-Q plot. For a univariate feature, we compute the quantiles of the data and the corresponding quantiles of a theoretical distribution, here the standard normal distribution.

We then plot these pairs of quantile values against each other. If the points lie close to the reference line, the data approximately follows a normal distribution. Systematic departures from the line mean that it does not, for example because the data is skewed.
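To make the quantile matching concrete, here is a minimal sketch that builds the Q-Q points by hand with NumPy and SciPy. The helper name qq_points, the synthetic sample, and the plotting positions (i - 0.5) / n are assumptions made for illustration; plotting libraries use slightly different conventions for the positions.

```python
import numpy as np
from scipy import stats


def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    # Observed quantiles are simply the sorted data values.
    observed = np.sort(np.asarray(sample, dtype=float))
    n = observed.size
    # Plotting positions (i - 0.5) / n: one common convention among several.
    probs = (np.arange(1, n + 1) - 0.5) / n
    # Standard normal quantiles at the same probabilities.
    theoretical = stats.norm.ppf(probs)
    return theoretical, observed


rng = np.random.default_rng(0)  # fixed seed, illustrative only
sample = rng.normal(loc=50, scale=10, size=500)  # synthetic "feature"

theoretical, observed = qq_points(sample)

# For normal data the pairs fall close to a straight line whose slope is
# roughly the standard deviation and whose intercept is roughly the mean.
slope, intercept = np.polyfit(theoretical, observed, deg=1)
print(f"slope ~ {slope:.2f}, intercept ~ {intercept:.2f}")
```

Plotting observed against theoretical gives the Q-Q plot itself; the straight-line fit is the reference line the points are compared with.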

We can make a Q-Q plot using different libraries.

1. scipy.stats

Here, we use the stats module from the SciPy library.

# Example
import matplotlib.pyplot as plt
from scipy import stats
stats.probplot(df[column_name], dist="norm", plot=plt)
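If we only want the numbers behind the plot, probplot can also be called without the plot argument. With its default settings it then returns the quantile pairs together with a least-squares line fit. The sketch below uses a synthetic NumPy array as a stand-in for df[column_name]; the seed and sample size are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed, illustrative only
values = rng.normal(loc=0.0, scale=1.0, size=300)  # stand-in for df[column_name]

# Without plot=..., nothing is drawn; probplot just returns the data:
#   osm: theoretical quantiles (order statistic medians)
#   osr: the sorted sample values
#   slope, intercept, r: least-squares line through the points and its
#   correlation coefficient
(osm, osr), (slope, intercept, r) = stats.probplot(values, dist="norm")

print(f"r = {r:.4f}")  # the closer to 1, the straighter the Q-Q plot
```

A value of r near 1 says the points are close to a straight line, which is the numeric counterpart of "the points lie on the line" in the plot.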

2. statsmodels.api

Here, we use the statsmodels library.

# Example
import statsmodels.api as sm
sm.qqplot(df[column_name], line='45', fit=True)
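The fit=True and line='45' arguments belong together: with fit=True, statsmodels fits the location and scale of the distribution and forms the quantiles from the standardized data, so the 45-degree line y = x becomes a sensible reference. The sketch below reproduces that idea with NumPy and SciPy rather than statsmodels itself; the synthetic data and the plotting positions are assumptions for illustration and will not match statsmodels' internals exactly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # fixed seed, illustrative only
raw = rng.normal(loc=170.0, scale=8.0, size=400)  # synthetic feature

# Fit a normal distribution, then standardize the data with the fitted values.
loc, scale = stats.norm.fit(raw)
standardized = np.sort((raw - loc) / scale)

# Standard normal quantiles at assumed plotting positions (i - 0.5) / n.
probs = (np.arange(1, raw.size + 1) - 0.5) / raw.size
theoretical = stats.norm.ppf(probs)

# Average vertical distance of the points from the 45-degree line y = x.
gap_standardized = np.mean(np.abs(standardized - theoretical))
gap_raw = np.mean(np.abs(np.sort(raw) - theoretical))

print(f"mean gap, standardized data: {gap_standardized:.3f}")
print(f"mean gap, raw data:          {gap_raw:.3f}")
```

Without standardization the raw values sit far from y = x even though the data is normal, which is why a 45-degree line is only meaningful once the data has been put on the same scale as the theoretical quantiles.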

The Q-Q plot examples below compare normally distributed data with skewed distributions.

Case 1: Normal distribution data

The Q-Q plot above comes from data that is almost normally distributed. We can observe that the data points lie almost entirely on the line.

Case 2: Right skewed data
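For right-skewed data the points bend away from the reference line in a curved, convex pattern: both tails sit above the fitted line while the middle sits below it. Here is a minimal sketch of that pattern in numbers, assuming synthetic samples (a normal one and an exponential one, which is right-skewed by construction) and a hypothetical helper named qq_summary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed, illustrative only
normal_sample = rng.normal(size=1000)
skewed_sample = rng.exponential(scale=1.0, size=1000)  # right-skewed


def qq_summary(sample):
    """Summarize how Q-Q points deviate from their least-squares line."""
    (osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm")
    residuals = osr - (slope * osm + intercept)  # point minus line
    k = len(osr) // 20  # 5% of the points
    mid = slice(len(osr) // 2 - k, len(osr) // 2 + k)
    return {
        "r": r,
        "lower_tail": residuals[:k].mean(),
        "middle": residuals[mid].mean(),
        "upper_tail": residuals[-k:].mean(),
    }


normal_summary = qq_summary(normal_sample)
skewed_summary = qq_summary(skewed_sample)

print("normal:", normal_summary)
print("skewed:", skewed_summary)
print("sample skewness:", stats.skew(skewed_sample))
```

For the skewed sample, positive tail residuals combined with a negative middle residual, and a lower r than for the normal sample, correspond to the bowed shape seen in a right-skewed Q-Q plot.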
