Statistics for Data Science

DS - VRP
13 min read · Mar 1, 2022

Statistics plays an important role in data science. We can build models with the help of programming-language libraries, but we need statistics to interpret and analyze those models. This post covers the theory behind statistics; future blogs will cover the statistical tests in detail with examples.

Statistics is the process of collection, arrangement, analysis, interpretation and presentation of data. This means the subject under study must be capable of quantitative measurement.

Types Of Statistics:

  1. Descriptive: Studies the problem and describes what happened; it summarizes the data to extract information. Ex: where and when the company is losing sales.
  2. Predictive: Helps us look into the problem and forecast the future. Ex: what needs to change to improve sales.
  3. Prescriptive: Helps in decision making. Ex: it suggests ways to improve sales.

Statistics provides us various tools and techniques to understand, analyze and interpret data. Statistical problems start with data, and statistics helps us understand the patterns within it. Let's go through the main concepts of statistics.

  1. Data: The first term is data, the raw material on which the whole process of collection, arrangement, analysis, interpretation and presentation operates.

For the machine learning process, data is a solution, but for statistics, data is a problem to be understood. Thus, statistical analysis is an essential step in building machine learning models.

Data originates from different sources, and there are different methods to collect it.

There are mainly two data sources:

  1. Primary — The investigator or the organization/agency conducts the enquiry originally in the desired field and subject.
  2. Secondary — The investigator obtains data from someone else's enquiry. While relying on secondary sources of data, we need to ensure data accuracy, suitability, and adequacy.

After collecting the data, we should classify it on the following bases:

  1. Geographical — area- or region-wise
  2. Chronological — by occurrence in time
  3. Qualitative — by some characteristic or attribute
  4. Quantitative — by numerical values or magnitude, also known as classification by variable.

Variable: A quantitative phenomenon under study. There are two types of variables:

  1. Continuous: Variables which can take all possible values (integral as well as fractional) in a given specified range are termed continuous variables. For example, the age of the students in a school.
  2. Discrete: Variables which cannot take all possible values within a given specified range are termed discrete, also known as discontinuous variables. For example, marks in an exam or the population of a city.

Frequency Distribution: The organization of data pertaining to a quantitative phenomenon involves four stages:

  1. Unorganized — As raw data are unwieldy and scattered, they do not provide much information. To understand the data, it is necessary to arrange them in ascending order; the arranged data are called an array.
  2. Discrete or Ungrouped Frequency Distribution — We count the number of times each value of the variable appears in the data. This is facilitated by using tally marks.

3. Grouped Frequency Distribution: The values of the variable are divided into a suitable number of groups called classes, and the number of observations falling in each group is recorded. The groups into which the values of the variable are classified are known as classes or class intervals; the length of the class interval is called the width or magnitude of the class. The two values specifying a class are called the class limits; the larger value is the upper class limit and the smaller value is the lower class limit.

4. Continuous Frequency Distribution: The presentation of the data in continuous classes of the above type, along with the corresponding frequencies, is known as a continuous frequency distribution.
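The first two stages can be sketched in Python with the standard library; the marks below are a small hypothetical dataset, not from the article:

```python
from collections import Counter

# Marks scored by 20 students (illustrative data)
marks = [5, 7, 5, 8, 6, 5, 7, 9, 6, 5, 8, 7, 6, 5, 9, 6, 7, 8, 6, 7]

# Stage 1: arrange the raw data in ascending order (the "array")
array = sorted(marks)

# Stage 2: ungrouped (discrete) frequency distribution --
# count how many times each distinct value appears
freq = Counter(array)
for value in sorted(freq):
    print(value, freq[value])
```

The `Counter` plays the role of the tally marks: each distinct value maps to the number of times it occurs.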

To calculate a suitable number of classes, we use Sturges' rule: k = 1 + 3.322 log10(N), where N is the number of observations.

The range helps determine the size of the class intervals; it is obtained as max − min.

Class intervals can be classified into two types. Inclusive classes are those in which both the upper and lower limits are included in the class, while exclusive classes are those in which the upper limit is excluded from the class and included in the immediately following class. For discrete variables, inclusive classes may be used, while for continuous variables, exclusive classes are to be used.
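Sturges' rule and the range-based class width can be computed directly; the dataset below is illustrative, and the rounding choices (ceiling) are one common convention:

```python
import math

data = [12, 45, 23, 67, 34, 89, 21, 56, 43, 78, 90, 15, 38, 52, 61]

n = len(data)
k = math.ceil(1 + 3.322 * math.log10(n))   # Sturges' rule: number of classes
rng = max(data) - min(data)                 # range = max - min
width = math.ceil(rng / k)                  # width (magnitude) of each class

# Build exclusive-type class intervals [lower, upper)
lower = min(data)
classes = []
for _ in range(k):
    classes.append((lower, lower + width))
    lower += width

print(f"n={n}, classes={k}, range={rng}, width={width}")
print(classes)
```

For these 15 values the rule gives 5 classes of width 16, starting from the minimum value 12.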

Presentation of Data: The data can be presented in the following ways:

  1. Table: The systematic presentation of the information contained in the data, in rows and columns.
  2. Line Diagram: The simplest of all the diagrams. It consists of drawing vertical lines, each vertical line being proportional to the frequency. It facilitates comparison.
  3. Bar Diagram: The easiest and most commonly used diagram for the presentation of data. There are different types of bar diagrams.

i. Simple bar diagram: used frequently for the comparative study of two or more items or values of a single variable or a single classification or category of data.

ii. Subdivided bar diagram: used when the total magnitude of the given variable is to be divided into various parts or sub-classes.

iii. Percentage bar diagram: subdivided bar diagrams presented graphically on a percentage basis give percentage bar diagrams.

iv. Multiple bar diagram: used for the comparison of two or more variables in the same time period.

4. Pie chart: also used for the comparison of different categories. It is constructed on the basis of the area of a circle.

5. Line Graph: used for time-series data. It describes the frequencies across different time periods.

6. Box or boxen plot: represents the quartiles, median and interquartile range of the dataset. It is used to understand the data distribution and to check for outliers.

Averages: Averages are statistical constants which enable us to comprehend in a single effort the significance of the whole data. They are also known as measures of central tendency. There are various measures of averages.

  1. Mean: The mean of a given set of observations is their sum divided by the number of observations: mean = (x1 + x2 + … + xn) / n.

2. Median: The median is the value of the variable which divides the group into two equal parts, one part comprising all values greater than the median and the other all values less than it. It is the middle value of an array.

If the number of observations is odd, the median is the middle value. If it is even, the median is the mean of the two middle terms. Ex: in the array 1, 2, 3, 4, 5, the median is 3. If we add another number 6, making the array 1, 2, 3, 4, 5, 6, the median becomes (3 + 4) / 2 = 3.5.

There is a difference between mean and median: the median is the value of the typical (middle) observation, while the mean is the arithmetic average, and the mean is sensitive to outliers.

3. Mode: The mode is the value which occurs most frequently in a set of observations. For example, in the array 1, 2, 2, 3, 4, 5, the mode is 2, because it is the most frequently occurring value.

Relationship between mean, median and mode: In a symmetrical distribution, Mean = Median = Mode. In a skewed distribution, the mean and mode lie at the two ends and the median lies between them; for moderately skewed distributions, the empirical relation Mode ≈ 3 Median − 2 Mean holds.
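Python's built-in statistics module computes all three averages; the example below also demonstrates the mean's sensitivity to an outlier:

```python
import statistics

data = [1, 2, 2, 3, 4, 5]

print(statistics.mean(data))    # sum 17 over 6 observations
print(statistics.median(data))  # even count: mean of the two middle values, 2 and 3
print(statistics.mode(data))    # 2 occurs most often

# The mean is sensitive to outliers; the median is robust:
with_outlier = data + [100]
print(statistics.mean(with_outlier))    # pulled up sharply by 100
print(statistics.median(with_outlier))  # barely moves
```

Adding a single extreme value drags the mean from about 2.83 to about 16.7, while the median only shifts from 2.5 to 3.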

Dispersion: Dispersion is a measure of the extent to which the individual observations vary. The literal meaning of dispersion is scatteredness.

Measures of Dispersion:

  1. Range: The range is the simplest of all the measures of dispersion. It is defined as the difference between the two extreme observations of the distribution.

Range = Max − Min

2. Quartile Deviation: A measure of dispersion based on the upper quartile (Q3) and lower quartile (Q1). QD is obtained by dividing the interquartile range by 2: QD = (Q3 − Q1) / 2. It is also known as the semi-interquartile range.

3. Standard Deviation: Defined as the positive square root of the mean of the squares of the deviations of the given observations from their mean. It is denoted by the lowercase Greek letter sigma (σ).

4. Variance: Variance is the square of the standard deviation.
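All four measures of dispersion can be computed with the standard statistics module; the dataset below is a small illustrative example (its population standard deviation works out to exactly 2):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

rng = max(data) - min(data)                   # range = max - min
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles (q2 is the median)
qd = (q3 - q1) / 2                            # quartile (semi-interquartile) deviation
sd = statistics.pstdev(data)                  # population standard deviation
var = statistics.pvariance(data)              # variance = sd squared

print(rng, qd, sd, var)
```

Note that `statistics.quantiles` defaults to the "exclusive" interpolation method; other conventions give slightly different quartiles on small samples.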

Skewness: Skewness helps us study the shape of the data. It means lack of symmetry. Skewness is often created by extreme values, popularly known as outliers, and the mean is sensitive to outliers.

Types of Skewness:

  1. Symmetrical Distribution: A distribution is called symmetric when the values of mean, median and mode are the same. Its curve is a perfect bell curve.

2. Positively Skewed Distribution: A frequency distribution whose curve has a longer tail towards the right is said to be positively skewed. For a positively skewed distribution, the mean is the greatest of the three measures and the mode is the least.

3. Negatively Skewed Distribution: A frequency distribution whose curve has a longer tail towards the left is said to be negatively skewed. For a negatively skewed distribution, of the three measures of average, the mode has the maximum value and the mean the least.

Kurtosis: Kurtosis is concerned with the flatness or peakedness of the frequency curve. It is often used as a measure of tail risk or volatility.

A mesokurtic distribution has a similar extreme-value character to a normal distribution (kurtosis equal to three). A platykurtic distribution has thinner tails than a normal distribution (kurtosis less than three), and leptokurtic distributions have kurtosis greater than three, with fatter tails.
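As a sketch, moment-based (population) skewness and kurtosis can be computed by hand from the second, third and fourth central moments; kurtosis here is on the scale where a normal distribution equals 3:

```python
def moments_skew_kurtosis(data):
    """Population skewness and kurtosis from central moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # variance
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2                            # equals 3 for a normal curve
    return skew, kurt

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 2, 3, 10]   # one extreme value drags the tail to the right

print(moments_skew_kurtosis(symmetric))   # skewness is 0 for symmetric data
print(moments_skew_kurtosis(right_tail))  # positive skewness
```

Libraries such as scipy report the same quantities (sometimes as "excess kurtosis", i.e. kurtosis minus 3), but the definition above is the underlying formula.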

Correlation: The correlation is a statistical tool which studies the relationship between two variables. Two variables are said to be correlated if the change in one variable results in a corresponding change in the other variable.

Types of correlation:

  1. Positive Correlation: the increase in the values of one variable results in an increase in the other variable, and vice versa.

2. Negative Correlation: the increase in the values of one variable results in a decrease in the other variable, and vice versa.
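Pearson's correlation coefficient, the usual numerical measure of this relationship, can be computed from its definition; the datasets below are hypothetical:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

hours = [1, 2, 3, 4, 5]
score = [52, 58, 61, 67, 72]   # rises with hours: positive correlation
price = [95, 88, 80, 74, 66]   # falls as x rises: negative correlation

print(pearson_r(hours, score))
print(pearson_r(hours, price))
```

The coefficient ranges from −1 (perfect negative) through 0 (no linear relationship) to +1 (perfect positive).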

The methods above are used to understand the data; now we will learn about the statistical methods used to make inferences from the data.

Probability: A numerical measure of uncertainty is provided by a very important branch of Statistics called the theory of probability. Statistics is the science of decision making with calculated risks in the face of uncertainty.

Following are the basic terminologies of probability.

  1. Favorable Cases: the outcomes of a random experiment which entail the happening of an event.
  2. Mutually Exclusive Events: events are mutually exclusive if the happening of any one of them excludes the happening of all the others in the same experiment.
  3. Equally Likely Events: equally likely implies that all of the outcomes have the same probability of occurring.
  4. Independent Events: events are independent of each other if the happening of any one of them is not affected by, and does not affect, the happening of any of the others.
  5. Sample Space: the set of all possible outcomes of a random experiment, denoted by S.

Conditional Probability: the probability of the first event times the conditional probability of the second event, given that the first event has already occurred, gives the joint probability. The conditional probability of B happening given that A has happened is written P(B|A).

Expected Value: The expected value (EV) is an anticipated value for an investment at some point in the future. The expected value is calculated by multiplying each of the possible outcomes by the likelihood each outcome will occur and then summing all of those values.
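For example, the expected value of a fair six-sided die follows directly from that definition:

```python
# Expected value of a fair six-sided die: sum of each outcome times its probability
outcomes = [1, 2, 3, 4, 5, 6]
prob = 1 / 6                              # each face is equally likely
ev = sum(x * prob for x in outcomes)
print(ev)                                 # (1+2+3+4+5+6)/6 = 3.5
```

Note the expected value (3.5) need not be an outcome the experiment can actually produce.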

Bayes' Theorem: Conditional probability is the likelihood of an outcome occurring based on a previous outcome occurring in similar circumstances. If an event A can occur only in conjunction with one of the mutually exclusive and exhaustive events E1, E2, E3, …, En, and if A actually happens, then the probability that it was preceded by the particular event Ei is given by P(Ei|A) = P(Ei) P(A|Ei) / [P(E1) P(A|E1) + … + P(En) P(A|En)].
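A small numerical sketch of Bayes' theorem, with two hypothetical machines E1 and E2 and made-up production shares and defect rates:

```python
# Hypothetical setup: machines E1 and E2 produce 60% and 40% of all items;
# their defect rates are 2% and 5%. Given a defective item (event A),
# what is the probability it came from E1?
prior = {"E1": 0.6, "E2": 0.4}          # P(Ei)
likelihood = {"E1": 0.02, "E2": 0.05}   # P(A | Ei)

# Total probability of A over all the mutually exclusive sources
p_a = sum(prior[e] * likelihood[e] for e in prior)

# Bayes' theorem: P(E1 | A) = P(E1) * P(A | E1) / P(A)
posterior_e1 = prior["E1"] * likelihood["E1"] / p_a
print(round(posterior_e1, 4))           # 0.012 / 0.032 = 0.375
```

Even though E1 produces more items, a defective item is more likely to have come from E2, because E2's defect rate is higher.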

Random Variable: A random variable is a variable whose value is unknown or a function that assigns values to each of an experiment’s outcomes.

Theoretical Distribution: The values of a variable may be distributed according to some definite probability law which can be expressed mathematically, and the corresponding probability distribution is known as a theoretical distribution. Important theoretical distributions include the binomial, Poisson and normal distributions.

Binomial Distribution: The binomial distribution is a probability distribution that summarizes the likelihood that a value will take one of two independent values under a given set of parameters or assumptions. The underlying assumptions of the binomial distribution are that there is only one outcome for each trial, that each trial has the same probability of success, and that each trial is mutually exclusive or independent of one another.
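The binomial probability mass function follows directly from this definition, using the combinations function from the standard library:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 3 heads in 5 fair coin tosses
print(binom_pmf(3, 5, 0.5))   # 10 / 32 = 0.3125

# Sanity check: the pmf sums to 1 over all possible outcomes
total = sum(binom_pmf(k, 5, 0.5) for k in range(6))
print(total)
```

Each term of the sum is the probability of one possible count of successes, so together they must exhaust all outcomes.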

After all the above steps we understand the data; now we will learn how to make decisions from the data.

Hypothesis Testing

A hypothesis is a claim; using a statistical test to verify the claim is hypothesis testing.

Hypothesis testing starts with the formulation of these two hypotheses:

  • Null Hypothesis: the status quo.
  • Alternate Hypothesis: the challenge to the status quo.

We seek evidence for the alternate hypothesis, and we never accept the null hypothesis; we either reject it or fail to reject it.

You can use the following rule to formulate the null and alternate hypotheses:

  • The null hypothesis always has one of the following signs: = OR ≤ OR ≥
  • The alternate hypothesis always has one of the following signs: ≠ OR > OR <

Type of Tests:

  1. Two-tailed Test: A method in which the critical area of a distribution is two-sided, testing whether a sample is greater than or less than a certain range of values. We can use a two-tailed test whenever we use '=' in the null hypothesis.

2. One-tailed Test: A statistical test in which the critical area of a distribution is one-sided, so that it is either greater than or less than a certain value, but not both. We can use a one-tailed test whenever the alternate hypothesis uses '>' or '<'.

Statistical Test: Statistical tests help us decide whether to reject the null hypothesis based on the sample data.

Types of Statistical Test

  1. Z-Test — A z-test is a statistical test used to determine whether two population means are different when the variances are known and the sample size is large. It is a hypothesis test in which the z-statistic follows a normal distribution.
  2. T-Test — We use a t-test when we don't know the population variance and the sample size is small (conventionally, less than 30).

3. F-Test — The F-statistic is used when the test statistic follows an F distribution, which is always positive and skewed right.

4. Chi-Square Test — We use the chi-squared test of independence to determine whether or not there is a significant relationship between two categorical variables.

5. ANOVA — Analysis of variance (ANOVA) is an analysis tool used in statistics that splits the observed aggregate variability found inside a data set into two parts: systematic factors and random factors.

6. A/B Testing — Also known as split testing, where we compare two versions of the same variable. It is used for comparative study.
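As a sketch of how such a test is carried out, here is a manual two-tailed one-sample z-test on hypothetical data, using only the standard library:

```python
import math
from statistics import NormalDist

# Hypothetical example: a machine is supposed to fill bottles with mean 500 ml
# (the null hypothesis), with known population sd 10 ml;
# a sample of 50 bottles averages 497 ml.
mu0, sigma, n, xbar = 500, 10, 50, 497

z = (xbar - mu0) / (sigma / math.sqrt(n))   # z-statistic
p_value = 2 * NormalDist().cdf(-abs(z))     # two-tailed p-value

alpha = 0.05
print(f"z = {z:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```

Here the sample mean is far enough from 500 ml (about 2.12 standard errors) that the p-value falls below 0.05, so the null hypothesis is rejected at the 5% significance level.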

Critical Region: A critical region, also known as the rejection region, is a set of values for the test statistic for which the null hypothesis is rejected.

Confidence Interval: A confidence interval, also known as the acceptance region, is a set of values for the test statistic for which we fail to reject the null hypothesis.

Significance Level: Confidence intervals can be calculated at different significance levels. We use α (alpha) to denote the level of significance.

Critical Value: The critical value at a certain significance level can be thought of as a cut-off point. If a test statistic on one side of the critical value results in failing to reject the null hypothesis, a test statistic on the other side will result in rejecting it.

p-value: The p-value tells us the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true; it expresses the likelihood that the observed data occurred randomly by chance. If the p-value is less than 0.05, the null hypothesis is rejected; if the p-value is more than 0.05, we fail to reject the null hypothesis.

Error: There is a chance of error occurring in a hypothesis test. Two kinds of error can occur during any statistical test: a Type I error (rejecting a true null hypothesis) and a Type II error (failing to reject a false null hypothesis).

Statistics plays a very vital role in decision making during any data science and machine learning process.

Please suggest any changes required. Stay tuned for all the tests in detail with examples.

Source: Fundamentals of Statistics — S.C. Gupta

Statistics for Research

Please connect over LinkedIn.


DS - VRP

A budding data scientist who is learning and evolving every day