Normality Tests in Python: Assessing Data Distribution
In this comprehensive guide, we dive into the world of normality tests in Python, exploring various statistical methods to assess whether a dataset follows a normal distribution. We’ll use a real-world example, analyzing the weekly historical returns of Microsoft stock. By the end of this tutorial, you’ll gain insights into the Jarque-Bera test, Kolmogorov-Smirnov test, Anderson-Darling test, and Shapiro-Wilk test, along with practical Python implementations.
Sample Data for Normality Testing
Let’s begin by setting up our sample data. We’ll be working with weekly historical returns for Microsoft stock from January 1, 2018, to December 31, 2018. This dataset can be easily obtained from Yahoo! Finance.
import pandas as pd

# Load data from a CSV file
df = pd.read_csv('MSFT.csv')

# Select relevant columns
df = df[['Date', 'Close']]
We’ll convert stock prices into returns, a common practice in financial analysis. Then, we’ll visualize the data with a histogram.
import numpy as np
import matplotlib.pyplot as plt

# Calculate returns
df['diff'] = pd.Series(np.diff(df['Close']))
df['return'] = df['diff'] / df['Close']

# Drop missing values
df = df[['Date', 'return']].dropna()

# Visualize data with a histogram
plt.hist(df['return'])
plt.show()
Quantile-Quantile (Q-Q) Plot
We start our exploration with a Q-Q plot, a visual method to assess the normality of data. This plot compares observed quantiles against theoretical quantiles of a normal distribution. Deviations from a straight line indicate non-normality.
import pylab
import scipy.stats as stats

# Create a Q-Q plot against the normal distribution
stats.probplot(df['return'], dist="norm", plot=pylab)
pylab.show()
The Q-Q plot shows a reasonably linear relationship, suggesting that the data is approximately normal, though the fit is not perfect.
Jarque-Bera Test
The Jarque-Bera test assesses whether a dataset’s skewness and kurtosis match those of a normal distribution. A high test statistic indicates significant deviation from normality.
from scipy.stats import jarque_bera

# Run the test and print the statistic and p-value
result = jarque_bera(df['return'])
print(result)
In our case, the test statistic is 1.94 and the p-value is approximately 0.38. Since the p-value is greater than 0.05, we fail to reject the null hypothesis, suggesting that the data is consistent with a normal distribution.
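For intuition, the Jarque-Bera statistic can be recomputed by hand from the sample skewness and excess kurtosis. Here is a minimal sketch using scipy's estimators, reusing df['return'] from above:

from scipy.stats import skew, kurtosis

n = len(df['return'])
S = skew(df['return'])      # sample skewness; 0 for a normal distribution
K = kurtosis(df['return'])  # excess (Fisher) kurtosis; 0 for a normal distribution

# Jarque-Bera statistic: (n / 6) * (S^2 + K^2 / 4)
jb_stat = (n / 6) * (S ** 2 + K ** 2 / 4)
print(jb_stat)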
Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test is a non-parametric test that compares the empirical cumulative distribution function (ECDF) of the data to that of a theoretical distribution, such as the normal distribution.
from scipy.stats import kstest

# Compare the returns against a normal distribution
result = kstest(df['return'], cdf='norm')
print(result)
Here, the K-S statistic is 0.47 and the p-value is close to zero, so we reject the null hypothesis. Note, however, that kstest with cdf='norm' compares the data against a standard normal distribution (mean 0, standard deviation 1); because weekly returns are on a much smaller scale, the rejection largely reflects the unstandardized data rather than the shape of its distribution. Standardizing the returns first, as sketched below, gives a fairer comparison.
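As a sketch of the standardized approach, assuming df['return'] from the earlier steps, the returns can be centered and scaled before being passed to kstest:

# Standardize the returns so they are comparable to a standard normal
z = (df['return'] - df['return'].mean()) / df['return'].std()

result = kstest(z, cdf='norm')
print(result)

Because the mean and standard deviation are estimated from the same sample, the resulting p-value is only approximate; the Lilliefors test is the standard correction for this.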
Anderson-Darling Test
The Anderson-Darling test is a modification of the K-S test, providing more weight to the tails of the distribution. It is sensitive to deviations in both the tails and the center of the distribution.
from scipy.stats import anderson

# Run the Anderson-Darling test against the normal distribution
result = anderson(df['return'], dist='norm')
print(result)
The A-D statistic is 0.37, and critical values are provided for significance levels ranging from 15% to 1%. Because the statistic falls below the critical value at every level, we fail to reject the null hypothesis, suggesting the data is consistent with normality.
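Since scipy's anderson returns the statistic together with arrays of critical values and their significance levels, the comparison can be scripted. A minimal sketch, assuming the result object from above:

for level, critical in zip(result.significance_level, result.critical_values):
    decision = 'reject' if result.statistic > critical else 'fail to reject'
    print(f"At {level}%: critical value = {critical:.3f} -> {decision} H0")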
Shapiro-Wilk Test
The Shapiro-Wilk test assesses whether a dataset is significantly different from a normal distribution. It is particularly suitable for small sample sizes.
from scipy.stats import shapiro

# Run the Shapiro-Wilk test and print the statistic and p-value
result = shapiro(df['return'])
print(result)
With a Shapiro-Wilk statistic of 0.98 and a p-value of approximately 0.42, we fail to reject the null hypothesis, indicating that the data is not significantly different from a normal distribution.
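Since shapiro returns the statistic and p-value as a pair, the decision can be coded directly. A brief sketch using the conventional 5% level (an assumption; adjust to your analysis):

stat, p = shapiro(df['return'])
alpha = 0.05  # conventional significance level (assumption)
if p > alpha:
    print(f"p-value {p:.3f} > {alpha}: fail to reject normality")
else:
    print(f"p-value {p:.3f} <= {alpha}: reject normality")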
Comparing Normality Test Results
Here’s a summary of the normality test results for our Microsoft stock returns data:
| Test | H0: Normality |
| --- | --- |
| Jarque-Bera | Fail to reject |
| Kolmogorov-Smirnov | Reject |
| Anderson-Darling | Fail to reject |
| Shapiro-Wilk | Fail to reject |
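To reproduce a summary like this programmatically, one can loop over the tests that return p-values. A minimal sketch (Anderson-Darling is omitted because it reports critical values rather than a p-value; the 5% threshold is an assumption):

from scipy.stats import jarque_bera, kstest, shapiro

# Run the p-value based tests on the returns
results = {
    'Jarque-Bera': jarque_bera(df['return']),
    'Kolmogorov-Smirnov': kstest(df['return'], cdf='norm'),
    'Shapiro-Wilk': shapiro(df['return']),
}

# Summarize the decision at the 5% level
for name, (stat, p) in results.items():
    decision = 'Fail to reject' if p > 0.05 else 'Reject'
    print(f"{name}: statistic = {stat:.3f}, p-value = {p:.3f} -> {decision} H0")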
Advantages and Disadvantages of Different Normality Tests
In the realm of statistics and data analysis, testing for normality is a crucial step before applying various statistical methods. In this article, we have explored four common normality tests in Python: the Jarque-Bera test, Kolmogorov-Smirnov test, Anderson-Darling test, and Shapiro-Wilk test. Each of these tests serves a unique purpose and has its own set of advantages and disadvantages.
Jarque-Bera Test
Advantages:
- Relatively easy to implement;
- Suitable for large datasets;
- Accounts for skewness and kurtosis.
Disadvantages:
- Less powerful for small sample sizes;
- Assumes independence of observations.
Kolmogorov-Smirnov Test
Advantages:
- Non-parametric, making it distribution-free;
- Suitable for comparing a sample distribution to a known distribution.
Disadvantages:
- Less powerful for small sample sizes;
- Focuses on the maximum difference between ECDFs, which may not capture subtle departures from normality.
Anderson-Darling Test
Advantages:
- Powerful test, especially for small sample sizes;
- Provides critical values for different significance levels;
- Takes into account observations across the entire dataset.
Disadvantages:
- Requires pre-defined significance levels for interpretation;
- Sensitive to outliers.
Shapiro-Wilk Test
Advantages:
- Works well for small to moderately-sized datasets;
- Offers a p-value for hypothesis testing.
Disadvantages:
- May lead to Type II errors with large sample sizes;
- Sensitive to deviations in the tails of the distribution.
When choosing a normality test, it’s essential to consider the characteristics of your dataset, such as sample size and potential outliers. Each of these tests can be a valuable tool in assessing whether your data follows a normal distribution. However, no single test is universally superior, and the choice often depends on the specific context of your analysis.
Conclusion
In this comprehensive guide, we explored various normality tests in Python and applied them to real-world data. While different tests yielded varying results, it’s essential to consider the nature of your data and the specific requirements of your analysis when selecting a normality test. Understanding data distribution is a crucial step in many statistical analyses, as it can impact the validity of statistical tests and the choice of modeling techniques.
FAQ
What is normality testing?
Normality testing is a statistical process used to determine whether a dataset follows a normal (Gaussian) distribution. It helps assess whether a dataset's values are symmetrically distributed around the mean and exhibit the characteristic bell-shaped curve of a normal distribution.
Why is normality testing important?
Normality testing is essential in statistics because many statistical methods assume that the data being analyzed follows a normal distribution. Validating this assumption is crucial to ensure the accuracy of those methods. If the data is not normal, it may be necessary to use alternative statistical techniques.
What should I do if my data does not pass a normality test?
If a normality test indicates that your dataset does not follow a normal distribution, the assumptions of many parametric statistical tests may not be met. In such cases, consider using non-parametric tests or transformations to analyze your data, as sketched below.
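As an illustration of the transformation route, here is a minimal, self-contained sketch applying a log transform to hypothetical right-skewed, strictly positive data (the sample here is simulated, not the tutorial's returns):

import numpy as np
from scipy.stats import shapiro

# Simulated right-skewed, strictly positive sample (placeholder data)
rng = np.random.default_rng(42)
data = rng.lognormal(mean=0.0, sigma=1.0, size=200)

# A log transform often pulls right-skewed positive data toward normality
log_data = np.log(data)

print(shapiro(data))      # typically rejects normality for the raw data
print(shapiro(log_data))  # typically fails to reject after the transform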
Are normality tests enough on their own?
Normality tests are valuable tools, but they should be used in conjunction with visualizations (e.g., histograms and Q-Q plots) and domain knowledge. No single test can provide a definitive answer, and it's essential to interpret the results in context.