Statistics for Machine Learning: A Complete Guide

AI Basics


Statistics is a core component of data analysis and machine learning. Analyze and visualize data to help find invisible patterns. If you are interested in machine learning and want to advance your career in machine learning, learning statistics along with programming should be your first step. In this article, you will learn all the concepts of statistics for machine learning.

What are statistics?

Statistics is a branch of mathematics that deals with the collection, analysis, interpretation, and visualization of empirical data. Descriptive statistics and inferential statistics are his two main areas of statistics. Descriptive statistics are meant to describe characteristics of data (what happened) in samples and populations. Inferential statistics uses these properties to test hypotheses, reach conclusions, and make predictions (what to expect).

Are you looking forward to becoming a machine learning engineer? Check out Simplilearn's AIML courses, machine learning courses and get certified today.

Using statistics in machine learning

Statistical use

  • Data questions
  • Data cleaning and preprocessing
  • Choosing the right features
  • Model evaluation
  • model prediction

Once you understand this basic, dive into all the important concepts related to machine learning statistics.

Population and sample

population:

In statistics, a population consists of all the observations (data points) about the subject of study.

An example of a population is surveying voters in an election. In the 2019 Lok Sabha elections, nearly 900 million voters were eligible to vote in 543 constituencies.

sample:

In statistics, a sample is a subset of a population. It is a small fraction of the total observed population.

An example of a sample is analyzing first-time voters in a public opinion poll.

Measuring central tendency

A measure of central tendency is a measure used to describe the distribution of data using a single value. The mean, median, and mode are three measures of central tendency.

average:

Arithmetic mean is the average of all data points.

If there are n observations and xi is the ith observation, then the mean is:

average

Consider the following data frame that contains the names of seven employees and their salaries.

Employee dataset

To find out the average salary of your employees, you can use Python's means() function.

Average salary.

Median:

The median is the middle value that divides the data into two equal parts after sorting the data in ascending order.

If the total number of data points (n) is odd, the median is the value at position (n+1)/2.

If the total number of observations (n) is even, then the median is the average value of the observations at n/2 and (n+2)/2 positions.

Python's median() function helps you find the median value of a column. From the data frame above, we find that the median salary is:

median salary

mode:

A mode is the most frequently occurring observation (value) in a data set. A dataset may have multiple modes.

The heights (cm) of students in the class are:

155, 157, 160, 159, 162, 160, 161, 165, 160, 158

mode = 160 cm.

The mode salary from the data frame can be calculated as follows:

mode salary

variance and standard deviation

Variance is used to measure the dispersion of data from the mean.

Dispersion formula

Consider the following dataset.

employee data frame

To calculate the grade variance, use:

dispersion grade

Standard deviation in statistics is the square root of the variance. Variance and standard deviation represent goodness of fit, meaning how well the mean represents the data.

Standard deviation formula

The standard deviation can be found using Python's std() function.

standard grade

Range and interquartile range

range:

In statistics, a range is the difference between the maximum and minimum values ​​in a dataset.

range

Interquartile range (IQR):

IQR is a measure of the distance between the first quartile (Q1) and the third quartile (Q3).

IQR

skewness and kurtosis

skewness:

Skewness measures the shape of a distribution. A distribution is symmetric if the proportion of data that is equidistant from the mean (or median) is equal. If the values ​​extend to the right, you are leaning to the right; if the values ​​extend to the left, you are leaning to the left.

skewness

kurtosis:

Kurtosis in statistics is used to see if there are extreme values ​​in the tails of a particular distribution. It also represents the shape of the probability distribution.

Skewness-Kurtosis

salary distortion

time skewness

grade skewness

We will now discuss a very common distribution in machine learning statistics: the Gaussian distribution.

Start your UI/UX career here!

Gaussian distribution

In statistics and probability, the Gaussian (normal) distribution is a commonly used continuous probability distribution for any random variable. It is characterized by two parameters: the mean μ and the standard deviation σ. Many natural phenomena, such as a person's height or IQ score, follow a normal distribution.

Gaussian distribution

Properties of Gaussian distribution:

  • The mean, median, and mode are the same
  • It is a symmetrical bell-shaped
  • 68% of the data is within 1 standard deviation of the mean
  • 95% of the data is within 2 standard deviations of the mean
  • 99.7% of the data is within 3 standard deviations of the mean

Gaussian code.

Gaussian plot

central limit theorem

The central limit theorem states that given a population with mean μ and standard deviation σ, if you draw a large random sample from the population, the sample mean will be The distribution will be approximately normal.

Rule of thumb: For the central limit theorem to hold, the sample size must be at least 30.

CLT

Here you will learn a very important concept in machine learning statistics: hypothesis testing.

Hypothesis verification

Hypothesis testing is a statistical analysis that uses experimental data to make decisions. This allows us to statistically back up some of the findings we have made by examining the data. Hypothesis testing makes a claim, and that claim is usually about a population parameter, such as the mean, median, or standard deviation.

  • The assumption made for a statistical test is called the null hypothesis (H0).
  • The alternative hypothesis (H1) contradicts the null hypothesis and indicates that the assumption does not hold at some level of significance.

Hypothesis testing allows you to decide whether to reject or maintain the null hypothesis.

Example: H0: Boys and girls in the class have the same average BMI

H1: The average BMI of boys and girls in a class is not the same.

To determine whether a finding is statistically significant, you must interpret the p-value. It is common to compare the p-value to a threshold called the significance level.

Significance levels are often set at 5% or 0.05.

If p-value > 0.05 – accept the null hypothesis.

If p-value < 0.05 - reject the null hypothesis.

Commonly used hypothesis tests include:

  • chi-square test
  • T-test
  • Z test
  • Analysis of variance (ANOVA)

conclusion

Statistics is a core component of machine learning. Analyzing raw data helps you draw meaningful conclusions. In this article about statistics for machine learning, we explained all the important concepts that are widely used to understand data.

If you want to learn more about machine learning with the aim of becoming a machine learning expert, Simplilearn's machine learning program in partnership with IIT Kanpur University is the ideal way to go. Ranked #1 for AI and Machine Learning courses by TechGig, this unique AI and Machine Learning program is extremely comprehensive, covering the most in-demand tools, skills, and techniques used in Machine Learning today. provides an applied learning curriculum. Sharpen your skills with three domain capstone projects and over 25 projects that use real industry data sets from companies like Twitter, Amazon, and Mercedes.

Do you have any questions about this article about statistics for machine learning? If so, please write them in the comments section. We will help you resolve your questions. Click the following link to learn more about important statistical methods: Mathematics for Machine Learning.



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *