Normality Tests in Python: Assessing Data Distribution
In this comprehensive guide, we dive into the world of normality tests in Python, exploring various statistical methods to assess whether a dataset follows a normal distribution. We’ll use a real-world example, analyzing the weekly historical returns of Microsoft stock. By the end of this tutorial, you’ll gain insights into the Jarque-Bera test, Kolmogorov-Smirnov test, Anderson-Darling test, and Shapiro-Wilk test, along with practical Python implementations.
Sample Data for Normality Testing
Let’s begin by setting up our sample data. We’ll be working with weekly historical returns for Microsoft stock from January 1, 2018, to December 31, 2018. This dataset can be easily obtained from Yahoo! Finance.
import pandas as pd

# Load data from a CSV file
df = pd.read_csv('MSFT.csv')

# Select relevant columns
df = df[['Date', 'Close']]
We’ll convert stock prices into returns, a common practice in financial analysis. Then, we’ll visualize the data with a histogram.
import numpy as np
import matplotlib.pyplot as plt

# Calculate returns
df['diff'] = pd.Series(np.diff(df['Close']))
df['return'] = df['diff'] / df['Close']

# Drop missing values
df = df[['Date', 'return']].dropna()

# Visualize data with a histogram
plt.hist(df['return'])
plt.show()
Quantile-Quantile (Q-Q) Plot
We start our exploration with a Q-Q plot, a visual method to assess the normality of data. This plot compares observed quantiles against theoretical quantiles of a normal distribution. Deviations from a straight line indicate non-normality.
import pylab
import scipy.stats as stats

# Create a Q-Q plot against the normal distribution
stats.probplot(df['return'], dist="norm", plot=pylab)
pylab.show()
The Q-Q plot shows a reasonably linear relationship, suggesting that the data is approximately normal, though the fit is not perfect.
Jarque-Bera Test
The Jarque-Bera test assesses whether a dataset’s skewness and kurtosis match those of a normal distribution. A high test statistic indicates significant deviation from normality.
from scipy.stats import jarque_bera

# Run the test and print the statistic and p-value
result = jarque_bera(df['return'])
print(result)
In our case, the test statistic is 1.94 and the p-value is approximately 0.38. Since the p-value is greater than 0.05, we fail to reject the null hypothesis, suggesting that the data is consistent with a normal distribution.
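For intuition, the Jarque-Bera statistic can be recomputed by hand from the sample skewness and excess kurtosis. Here is a minimal sketch using scipy's estimators, reusing df['return'] from above:

from scipy.stats import skew, kurtosis

n = len(df['return'])
S = skew(df['return'])      # sample skewness; 0 for a normal distribution
K = kurtosis(df['return'])  # excess (Fisher) kurtosis; 0 for a normal distribution

# Jarque-Bera statistic: (n / 6) * (S^2 + K^2 / 4)
jb_stat = (n / 6) * (S ** 2 + K ** 2 / 4)
print(jb_stat)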
Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test is a non-parametric test that compares the empirical cumulative distribution function (ECDF) of the data to that of a theoretical distribution, such as the normal distribution.
from scipy.stats import kstest

# Compare the returns against a normal distribution
result = kstest(df['return'], cdf='norm')
print(result)
Here, the K-S statistic is 0.47 and the p-value is close to zero, so we reject the null hypothesis. Note, however, that kstest with cdf='norm' compares the data against a standard normal distribution (mean 0, standard deviation 1); because weekly returns are on a much smaller scale, the rejection largely reflects the unstandardized data rather than the shape of its distribution. Standardizing the returns first, as sketched below, gives a fairer comparison.
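As a sketch of the standardized approach, assuming df['return'] from the earlier steps, the returns can be centered and scaled before being passed to kstest:

# Standardize the returns so they are comparable to a standard normal
z = (df['return'] - df['return'].mean()) / df['return'].std()

result = kstest(z, cdf='norm')
print(result)

Because the mean and standard deviation are estimated from the same sample, the resulting p-value is only approximate; the Lilliefors test is the standard correction for this.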
Anderson-Darling Test
The Anderson-Darling test is a modification of the K-S test, providing more weight to the tails of the distribution. It is sensitive to deviations in both the tails and the center of the distribution.
from scipy.stats import anderson

# Run the Anderson-Darling test against the normal distribution
result = anderson(df['return'], dist='norm')
print(result)
The A-D statistic is 0.37, and critical values are provided for significance levels ranging from 15% to 1%. Because the statistic falls below the critical value at every level, we fail to reject the null hypothesis, suggesting the data is consistent with normality.
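Since scipy's anderson returns the statistic together with arrays of critical values and their significance levels, the comparison can be scripted. A minimal sketch, assuming the result object from above:

for level, critical in zip(result.significance_level, result.critical_values):
    decision = 'reject' if result.statistic > critical else 'fail to reject'
    print(f"At {level}%: critical value = {critical:.3f} -> {decision} H0")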
Shapiro-Wilk Test
The Shapiro-Wilk test assesses whether a dataset is significantly different from a normal distribution. It is particularly suitable for small sample sizes.
from scipy.stats import shapiro

# Run the Shapiro-Wilk test and print the statistic and p-value
result = shapiro(df['return'])
print(result)
With a Shapiro-Wilk statistic of 0.98 and a p-value of approximately 0.42, we fail to reject the null hypothesis, indicating that the data is not significantly different from a normal distribution.
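Since shapiro returns the statistic and p-value as a pair, the decision can be coded directly. A brief sketch using the conventional 5% level (an assumption; adjust to your analysis):

stat, p = shapiro(df['return'])
alpha = 0.05  # conventional significance level (assumption)
if p > alpha:
    print(f"p-value {p:.3f} > {alpha}: fail to reject normality")
else:
    print(f"p-value {p:.3f} <= {alpha}: reject normality")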
Comparing Normality Test Results
Here’s a summary of the normality test results for our Microsoft stock returns data:
| Test | H0: Normality |
| --- | --- |
| Jarque-Bera | Fail to reject |
| Kolmogorov-Smirnov | Reject |
| Anderson-Darling | Fail to reject |
| Shapiro-Wilk | Fail to reject |
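To reproduce a summary like this programmatically, one can loop over the tests that return p-values. A minimal sketch (Anderson-Darling is omitted because it reports critical values rather than a p-value; the 5% threshold is an assumption):

from scipy.stats import jarque_bera, kstest, shapiro

# Run the p-value based tests on the returns
results = {
    'Jarque-Bera': jarque_bera(df['return']),
    'Kolmogorov-Smirnov': kstest(df['return'], cdf='norm'),
    'Shapiro-Wilk': shapiro(df['return']),
}

# Summarize the decision at the 5% level
for name, (stat, p) in results.items():
    decision = 'Fail to reject' if p > 0.05 else 'Reject'
    print(f"{name}: statistic = {stat:.3f}, p-value = {p:.3f} -> {decision} H0")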
Advantages and Disadvantages of Different Normality Tests
In the realm of statistics and data analysis, testing for normality is a crucial step before applying various statistical methods. In this article, we have explored four common normality tests in Python: the Jarque-Bera test, Kolmogorov-Smirnov test, Anderson-Darling test, and Shapiro-Wilk test. Each of these tests serves a unique purpose and has its own set of advantages and disadvantages.
Jarque-Bera Test
Advantages:
- Relatively easy to implement;
- Suitable for large datasets;
- Accounts for skewness and kurtosis.
Disadvantages:
- Less powerful for small sample sizes;
- Assumes independence of observations.
Kolmogorov-Smirnov Test
Advantages:
- Non-parametric, making it distribution-free;
- Suitable for comparing a sample distribution to a known distribution.
Disadvantages:
- Less powerful for small sample sizes;
- Focuses on the maximum difference between ECDFs, which may not capture subtle departures from normality.
Anderson-Darling Test
Advantages:
- Powerful test, especially for small sample sizes;
- Provides critical values for different significance levels;
- Takes into account observations across the entire dataset.
Disadvantages:
- Requires pre-defined significance levels for interpretation;
- Sensitive to outliers.
Shapiro-Wilk Test
Advantages:
- Works well for small to moderately-sized datasets;
- Offers a p-value for hypothesis testing.
Disadvantages:
- May lead to Type II errors with large sample sizes;
- Sensitive to deviations in the tails of the distribution.
When choosing a normality test, it’s essential to consider the characteristics of your dataset, such as sample size and potential outliers. Each of these tests can be a valuable tool in assessing whether your data follows a normal distribution. However, no single test is universally superior, and the choice often depends on the specific context of your analysis.
Conclusion
In this comprehensive guide, we explored various normality tests in Python and applied them to real-world data. While different tests yielded varying results, it’s essential to consider the nature of your data and the specific requirements of your analysis when selecting a normality test. Understanding data distribution is a crucial step in many statistical analyses, as it can impact the validity of statistical tests and the choice of modeling techniques.
FAQ
What is normality testing?
Normality testing is a statistical process used to determine whether a dataset follows a normal (Gaussian) distribution. It helps assess whether a dataset's values are symmetrically distributed around the mean and exhibit the characteristic bell-shaped curve of a normal distribution.
Why is normality testing important?
Normality testing is essential in statistics because many statistical methods assume that the data being analyzed follows a normal distribution. Validating this assumption is crucial to ensure the accuracy of those methods. If the data is not normal, it may be necessary to use alternative statistical techniques.
What should I do if my data does not pass a normality test?
If a normality test indicates that your dataset does not follow a normal distribution, the assumptions of many parametric statistical tests may not be met. In such cases, consider using non-parametric tests or transformations to analyze your data, as sketched below.
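As an illustration of the transformation route, here is a minimal, self-contained sketch applying a log transform to hypothetical right-skewed, strictly positive data (the sample here is simulated, not the tutorial's returns):

import numpy as np
from scipy.stats import shapiro

# Simulated right-skewed, strictly positive sample (placeholder data)
rng = np.random.default_rng(42)
data = rng.lognormal(mean=0.0, sigma=1.0, size=200)

# A log transform often pulls right-skewed positive data toward normality
log_data = np.log(data)

print(shapiro(data))      # typically rejects normality for the raw data
print(shapiro(log_data))  # typically fails to reject after the transform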
Are normality tests enough on their own?
Normality tests are valuable tools, but they should be used in conjunction with visualizations (e.g., histograms and Q-Q plots) and domain knowledge. No single test can provide a definitive answer, and it's essential to interpret the results in context.