Statistical functions are the cornerstone for extracting meaningful insights from raw data. Python provides a powerful toolkit for statisticians and data scientists to understand and analyze datasets. Libraries like NumPy, Pandas, and SciPy offer a comprehensive suite of functions. This guide will go over 10 essential statistical functions in Python within these libraries.
Libraries for Statistical Analysis
Python offers many libraries specifically designed for statistical analysis. Three of the most widely used are NumPy, Pandas, and SciPy stats.
- NumPy: Short for Numerical Python, this library provides support for arrays, matrices, and a range of mathematical functions.
- Pandas: Pandas is a data manipulation and analysis library helpful for working with tables and time series data. It is built on top of NumPy and adds in additional features for data manipulation.
- SciPy stats: SciPy, short for Scientific Python, is a library for scientific and technical computing. Its stats module provides a large number of probability distributions, statistical functions, and hypothesis tests.
Python libraries must be installed and imported into the working environment before they can be used. To install a library, run the pip install command in the terminal. Once it has been installed, it can be loaded into your Python script or Jupyter notebook using the import statement. NumPy is normally imported as np, Pandas as pd, and typically only the stats module is imported from SciPy.
pip install numpy
pip install pandas
pip install scipy
import numpy as np
import pandas as pd
from scipy import stats
Where different functions can be calculated using more than one library, example code using each will be shown.
1. Mean (Average)
The mean, also known as the average, is the most fundamental statistical measure. It provides a central value for a set of numbers. Mathematically, it is the sum of all the values divided by the number of values present.
mean_numpy = np.mean(data)
mean_pandas = pd.Series(data).mean()
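As a minimal sketch of the two calls above, the snippet below uses a small made-up list (the values are an assumption chosen only for illustration). It also shows one practical difference: np.mean returns nan when a missing value is present, while the Pandas method skips missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen only for illustration
data = [2, 4, 4, 5, 7, 9, 11]

mean_numpy = np.mean(data)            # sum of the values divided by their count
mean_pandas = pd.Series(data).mean()  # same result via a Pandas Series

# Missing values are handled differently by default
with_gap = [2.0, 4.0, np.nan, 6.0]
nan_numpy = np.mean(with_gap)             # nan, because NaN propagates
nan_skipped = np.nanmean(with_gap)        # NumPy variant that ignores NaN
nan_pandas = pd.Series(with_gap).mean()   # Pandas skips NaN by default

print(mean_numpy, mean_pandas, nan_numpy, nan_skipped, nan_pandas)
```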
2. Median
The median is another measure of central tendency. It is the middle value of the dataset when all the values are sorted in order (or the average of the two middle values when the dataset has an even number of values). Unlike the mean, it is barely affected by a few extreme values. This makes it a more robust measure for skewed distributions and for data containing outliers.
median_numpy = np.median(data)
median_pandas = pd.Series(data).median()
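To illustrate the robustness described above, the sketch below adds a single extreme value to a small made-up dataset (both the data and the outlier are assumptions for illustration) and compares how far the mean and the median move.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen only for illustration
data = [2, 4, 4, 5, 7, 9, 11]
with_outlier = data + [1000]  # append one extreme value

median_numpy = np.median(data)
median_pandas = pd.Series(data).median()

# The median shifts only slightly, while the mean is pulled far upward
median_outlier = np.median(with_outlier)
mean_outlier = np.mean(with_outlier)

print(median_numpy, median_pandas, median_outlier, mean_outlier)
```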
3. Standard Deviation
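The standard deviation is a measure of the amount of variation or dispersion in a set of values. It is calculated from the squared differences between each data point and the mean. A low standard deviation indicates that the values in the dataset tend to be close to the mean, while a larger standard deviation indicates that the values are more spread out. Note that the two calls shown here use different defaults: np.std divides by n (the population standard deviation, ddof=0), while the Pandas std method divides by n - 1 (the sample standard deviation, ddof=1), so they return slightly different numbers for the same data unless ddof is set explicitly.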
std_numpy = np.std(data)
std_pandas = pd.Series(data).std()
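The two libraries use different default denominators, which is easy to miss. The sketch below, using made-up data (an assumption for illustration), shows the default results and how the ddof argument makes them agree.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen only for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

std_numpy = np.std(data)             # population SD: divides by n (ddof=0)
std_pandas = pd.Series(data).std()   # sample SD: divides by n - 1 (ddof=1)

# Setting ddof explicitly makes the two libraries match
std_numpy_sample = np.std(data, ddof=1)
std_pandas_population = pd.Series(data).std(ddof=0)

print(std_numpy, std_pandas, std_numpy_sample, std_pandas_population)
```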
4. Percentiles
Percentiles indicate the relative standing of a value within a dataset when all of the data is sorted in order. For example, the 25th percentile is the value below which 25% of the data lies. The median is technically defined as the 50th percentile.
In NumPy, percentiles are calculated with np.percentile, and the specific percentiles of interest must be passed to the function. In the example, the 25th, 50th, and 75th percentiles are calculated, but any percentile value from 0 to 100 is valid. Pandas offers the equivalent quantile method, which expects fractions from 0 to 1 instead.
percentiles = np.percentile(data, [25, 50, 75])
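A minimal sketch of the call above on made-up data (an assumption for illustration), alongside the Pandas quantile method. Both use linear interpolation between data points by default, so they should agree.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the integers 1 through 9
data = list(range(1, 10))

# NumPy takes percentiles on a 0-100 scale
percentiles = np.percentile(data, [25, 50, 75])

# Pandas takes quantiles on a 0-1 scale
quantiles = pd.Series(data).quantile([0.25, 0.5, 0.75])

print(percentiles)
print(quantiles.tolist())
```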
5. Correlation
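The correlation between two variables describes the strength and direction of their linear relationship, that is, how closely a change in one variable is accompanied by a proportional change in the other. The correlation coefficient ranges from -1 to 1, where -1 indicates a perfect negative correlation, 1 indicates a perfect positive correlation, and 0 indicates no linear relationship between the variables. Both calls compute the Pearson correlation by default, but np.corrcoef returns a 2x2 correlation matrix (the coefficient is the off-diagonal entry), while the Pandas method returns a single number.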
corr_numpy = np.corrcoef(x, y)
corr_pandas = pd.Series(x).corr(pd.Series(y))
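The two calls return different shapes, which the sketch below makes explicit. The x and y values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical paired observations, chosen only for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

corr_matrix = np.corrcoef(x, y)   # 2x2 matrix with 1.0 on the diagonal
corr_numpy = corr_matrix[0, 1]    # the x-y coefficient is off the diagonal

corr_pandas = pd.Series(x).corr(pd.Series(y))  # single Pearson coefficient

print(corr_matrix)
print(corr_numpy, corr_pandas)
```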
6. Covariance
Covariance is a statistical measure that represents the extent to which two variables change together. Because its size depends on the units of the variables, it does not express the strength of the relationship the way a correlation does, but its sign does give the direction of the relationship. It is also key to many statistical methods that look at the relationships between variables, such as principal component analysis. As with correlation, np.cov returns a 2x2 matrix (variances on the diagonal, the covariance off the diagonal), while the Pandas method returns a single number.
cov_numpy = np.cov(x, y)
cov_pandas = pd.Series(x).cov(pd.Series(y))
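The sketch below, on made-up x and y values (an assumption for illustration), shows how to pull the covariance out of the NumPy matrix and how dividing it by both standard deviations recovers the correlation coefficient.

```python
import numpy as np
import pandas as pd

# Hypothetical paired observations, chosen only for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_matrix = np.cov(x, y)        # [[var(x), cov(x, y)], [cov(x, y), var(y)]]
cov_numpy = cov_matrix[0, 1]     # sample covariance (divides by n - 1)
cov_pandas = pd.Series(x).cov(pd.Series(y))

# Scaling the covariance by both sample standard deviations gives the correlation
corr_from_cov = cov_numpy / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(cov_numpy, cov_pandas, corr_from_cov)
```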
7. Skewness
Skewness measures the asymmetry of the distribution of a continuous variable. Zero skewness indicates that the data is symmetrically distributed, as in the normal distribution; positive values indicate a longer right tail and negative values a longer left tail. Skewness can help flag potential outliers in the dataset, and checking for symmetry matters because some statistical methods and transformations assume it.
skew_scipy = stats.skew(data)
skew_pandas = pd.Series(data).skew()
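The two calls above will usually not return the same number, because stats.skew uses the biased estimator by default while the Pandas method applies a small-sample correction. A minimal sketch on made-up right-skewed data (an assumption for illustration):

```python
import pandas as pd
import numpy as np
from scipy import stats

# Hypothetical right-skewed data, chosen only for illustration
data = [1, 2, 2, 3, 3, 3, 4, 10]

skew_scipy = stats.skew(data)                   # biased estimator (bias=True)
skew_scipy_adj = stats.skew(data, bias=False)   # bias-corrected estimator
skew_pandas = pd.Series(data).skew()            # bias-corrected by default

print(skew_scipy, skew_scipy_adj, skew_pandas)
```

With bias=False, the SciPy result should line up with the Pandas one.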
8. Kurtosis
Often used in tandem with skewness, kurtosis describes how much area is in a distribution's tails relative to the normal distribution. It is used to indicate the presence of outliers and describe the overall shape of the distribution, such as being highly peaked (called leptokurtic) or more flat (called platykurtic). Both functions shown here report excess kurtosis by default, so a normal distribution scores approximately 0 rather than 3.
kurt_scipy = stats.kurtosis(data)
kurt_pandas = pd.Series(data).kurt()
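As with skewness, the defaults differ between the two libraries: stats.kurtosis uses the biased estimator unless bias=False is passed, while the Pandas method is bias-corrected. The sketch below uses made-up data (an assumption for illustration) and also shows the fisher argument, which switches between excess kurtosis and the raw (Pearson) definition.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical heavy-tailed data, chosen only for illustration
data = [1, 2, 2, 3, 3, 3, 4, 10]

kurt_scipy = stats.kurtosis(data)                   # excess kurtosis, biased
kurt_scipy_adj = stats.kurtosis(data, bias=False)   # excess kurtosis, bias-corrected
kurt_pandas = pd.Series(data).kurt()                # excess kurtosis, bias-corrected

# fisher=False returns Pearson kurtosis, where a normal distribution scores 3
kurt_pearson = stats.kurtosis(data, fisher=False)

print(kurt_scipy, kurt_scipy_adj, kurt_pandas, kurt_pearson)
```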
9. T-Test
A t-test is a statistical test used to determine whether there is a significant difference between the means of two groups. Or, in the case of a one-sample t-test, it can be used to determine if the mean of a sample is significantly different from a predetermined population mean.
This test is run using the stats module within the SciPy library. The test provides two pieces of output, the t-statistic and the p-value. By convention, if the p-value is less than 0.05, the result is considered statistically significant, meaning the data provide evidence that the two means differ (or, for the one-sample test, that the sample mean differs from the population mean).
t_stat, p_value = stats.ttest_ind(data1, data2)
onesamp_t_stat, onesamp_p_value = stats.ttest_1samp(data, popmean=0)
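A minimal sketch of both tests on two small made-up groups (the measurements and the population mean of 5.0 are assumptions for illustration). By default, ttest_ind assumes the two groups have equal variances; passing equal_var=False runs Welch's t-test instead, which drops that assumption.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for two independent groups
data1 = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
data2 = [6.2, 6.8, 7.1, 6.5, 6.9, 7.3]

# Two-sample t-test (assumes equal variances by default)
t_stat, p_value = stats.ttest_ind(data1, data2)

# Welch's t-test, which does not assume equal variances
welch_t, welch_p = stats.ttest_ind(data1, data2, equal_var=False)

# One-sample t-test against an assumed population mean of 5.0
one_t, one_p = stats.ttest_1samp(data1, popmean=5.0)

alpha = 0.05  # conventional significance threshold
print(f"two-sample: t={t_stat:.3f}, p={p_value:.4f}, significant={p_value < alpha}")
print(f"Welch:      t={welch_t:.3f}, p={welch_p:.4f}")
print(f"one-sample: t={one_t:.3f}, p={one_p:.4f}")
```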
10. Chi-Square
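The Chi-Square test is used with categorical data and comes in two common forms. The goodness-of-fit test, run with stats.chisquare, checks whether the observed frequencies of the categories differ significantly from a set of expected frequencies, and requires both the observed data and the expected data as input. The test of independence checks whether there is a significant association between two categorical variables, such as job title and gender; in SciPy this form is run with stats.chi2_contingency on a contingency table of counts. Similarly to the t-test, the output gives both a Chi-Squared test statistic and a p-value that can be compared to 0.05.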
chi_square_test, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
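The sketch below runs the goodness-of-fit call from the example on made-up counts, and then a test of association on a made-up 2x2 contingency table using stats.chi2_contingency. All of the counts are assumptions for illustration. Note that the observed and expected frequencies passed to stats.chisquare need to add up to the same total.

```python
import numpy as np
from scipy import stats

# Goodness of fit: hypothetical counts across five categories,
# compared with equal expected counts (both sum to 100)
observed = [18, 22, 20, 25, 15]
expected = [20, 20, 20, 20, 20]
chi_square_test, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

# Test of independence: hypothetical 2x2 table of counts
# (rows = one categorical variable, columns = the other)
table = np.array([[20, 30],
                  [30, 20]])
# For 2x2 tables, Yates' continuity correction is applied by default
chi2_stat, chi2_p, dof, expected_counts = stats.chi2_contingency(table)

print(f"goodness of fit: chi2={chi_square_test:.2f}, p={p_value:.4f}")
print(f"independence:    chi2={chi2_stat:.2f}, p={chi2_p:.4f}, dof={dof}")
```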
Summary
This article highlighted 10 key statistical functions within Python, but there are many more contained within various packages that can be used for more specific applications. Leveraging these tools for statistics and data analysis allows you to gain powerful insights from your data.
Mehrnaz Siavoshi holds a Master's in Data Analytics and is a full-time biostatistician working on complex machine learning development and statistical analysis in healthcare. She has experience with AI and has taught university courses in biostatistics and machine learning at University of the People.