Stats for Data Science

DonDuminda
4 min read · Jan 15, 2020

Some of you, like me, studied stats to a certain extent 10 years back. It's always nice to refresh a bit or learn new things while you are working on your dream of becoming a Data Scientist.

First we need to understand the most basic distinction in statistics: population vs sample. A population, by definition, is the set of all items of interest to our study. Its size is usually denoted by an upper case N. The numbers we obtain from the population are called ‘parameters’. A sample is a subset of the population, and its size is denoted by a lower case n. The numbers we obtain from the sample are called ‘statistics’.
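As a rough illustration of the parameter vs statistic distinction, here is a minimal Python sketch. The population of heights, the sample size of 100 and the variable names are all assumptions made up for this example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: heights (cm) of N = 10,000 people.
population = [random.gauss(170, 10) for _ in range(10_000)]
N = len(population)

# A sample is a subset of the population; here n = 100 items drawn at random.
sample = random.sample(population, 100)
n = len(sample)

parameter = statistics.mean(population)  # computed from the whole population
statistic = statistics.mean(sample)      # computed from the sample only

print(f"N = {N}, population mean (parameter) = {parameter:.2f}")
print(f"n = {n}, sample mean (statistic)     = {statistic:.2f}")
```

The two means should come out close but not identical, which is the whole point: a statistic is an estimate of a parameter.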

Next up, it's important to understand the types of variables we encounter when doing statistical testing. Variables can be classified by,

1. data types and 2. measurement levels.

Types of Data : Categorical, Numerical

The categorical type represents groupings of data, like car types or answers to yes/no questions. The numerical type represents numbers and is further divided into two kinds, discrete and continuous.

Discrete means that the data can only take separate, countable values. E.g. the number of children you will have. At least you know the answer is an integer.

Continuous data can take any value within a range, so it can be measured as finely as you like. E.g. your weight or the speed of a car. Such data is often recorded over time, with time on the X axis.

pic 1 : Types of Data

Next up, let's discuss the measurement types.

Types of Measurements : Qualitative and Quantitative data.

There is a fundamental distinction between two types of measurement data: quantitative data is information about quantities, and therefore numbers, while qualitative data is descriptive and concerns phenomena which can be observed but not measured, such as language.

Then again, qualitative data can be divided into nominal and ordinal. Take a car categorization: BMW, AUDI and HONDA are brands, and each has its own qualities. The three categories do not have an order as such, which means the data is nominal. But let's say we categorize the cars according to price. Then there is an order: HONDA, AUDI, BMW if we list them in ascending order. In this case the data is ordinal.
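The nominal vs ordinal idea can be sketched in a few lines of Python. The list of observations and the `price_rank` mapping are hypothetical; the ranking simply encodes the HONDA, AUDI, BMW price order used above.

```python
from collections import Counter

# Hypothetical observations of car brands.
brands = ["BMW", "HONDA", "AUDI", "HONDA"]

# Nominal view: the labels have no order, so we can only count them.
counts = Counter(brands)

# Ordinal view: an assumed ranking by price gives the labels an order.
price_rank = {"HONDA": 1, "AUDI": 2, "BMW": 3}
sorted_by_price = sorted(brands, key=lambda brand: price_rank[brand])

print(counts)
print(sorted_by_price)
```

Note that the ranks only say which brand comes first. They say nothing about how much more expensive one brand is than another.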

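The quantitative measurement type has two levels: ratio and interval. Both are represented by numbers but have one main difference: ratio data has a true 0 value and interval data doesn't. For example, weight is ratio data because 0 kg means no weight at all, while temperature in Celsius is interval data because 0 °C does not mean there is no temperature. Please read this article to understand the concept in detail. link 5 mins read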

After 5 mins….

Great you are back :)
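To make the interval vs ratio distinction concrete, here is a small Python sketch using temperature. The two temperature values are arbitrary; the only fact relied on is that Kelvin equals Celsius plus 273.15.

```python
# Interval scale: Celsius has an arbitrary zero, so ratios are not meaningful.
c_low, c_high = 10.0, 20.0
celsius_ratio = c_high / c_low  # 2.0, yet 20 °C is not "twice as hot" as 10 °C

# Ratio scale: Kelvin has a true zero, so ratios are meaningful.
k_low, k_high = c_low + 273.15, c_high + 273.15
kelvin_ratio = k_high / k_low   # roughly 1.035

# Differences are meaningful on both scales and agree with each other.
celsius_diff = c_high - c_low
kelvin_diff = k_high - k_low

print(f"Celsius ratio: {celsius_ratio:.3f}, Kelvin ratio: {kelvin_ratio:.3f}")
print(f"Celsius diff:  {celsius_diff:.1f}, Kelvin diff:  {kelvin_diff:.1f}")
```

The same two temperatures give very different ratios depending on the scale, but the same difference. That is why ratios are only trusted on a scale with a true 0.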

Next up are the visualization techniques. Visualizing data is one of the most effective ways to analyse it. But we need to know which method to use by looking at the data type.

Let's focus on categorical data first. To represent categorical data we can use frequency tables, bar charts and pie charts. (see pic below)

pic 2 categorical data representation
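A frequency table is easy to build by hand. Below is a minimal Python sketch using `collections.Counter`; the list of cars is made-up data for illustration only.

```python
from collections import Counter

# Hypothetical categorical data: car brands observed in a car park.
cars = ["BMW", "Audi", "Honda", "Honda", "BMW", "Honda", "Audi", "Honda"]

freq = Counter(cars)        # maps each brand to its number of occurrences
total = sum(freq.values())

# Print the frequency table: absolute count and relative share per brand.
print(f"{'Brand':<8}{'Count':>6}{'Share':>8}")
for brand, count in freq.most_common():
    print(f"{brand:<8}{count:>6}{count / total:>8.1%}")
```

The counts are what a bar chart would plot, and the shares are what a pie chart would plot.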

Next up, let's identify the methods of visual representation for numerical data. Basically histograms and scatter plots are used. I have given a few examples below. (Do you know the difference between a histogram and a bar chart? In a histogram the bars touch each other, since each bar covers an interval of a continuous numerical range, unlike the separate groups in a categorical scenario.)

pic 3 histogram
pic 4 scatter plot
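To show what a histogram actually counts, here is a small Python sketch that groups numbers into equal-width intervals (bins). The ages, the bin width of 10 and the starting value of 20 are assumptions picked for this example.

```python
# Hypothetical numerical data: ages of 12 people.
ages = [21, 23, 25, 28, 31, 32, 35, 38, 41, 44, 52, 58]

bin_width = 10
start = 20
num_bins = 4  # covers [20, 30), [30, 40), [40, 50), [50, 60)

# Count how many values fall into each interval.
counts = [0] * num_bins
for age in ages:
    index = (age - start) // bin_width
    counts[index] += 1

# Draw a simple text histogram, one row per interval.
for i, count in enumerate(counts):
    low = start + i * bin_width
    print(f"[{low}, {low + bin_width}): {'#' * count}")
```

Each interval ends exactly where the next one begins, which is why the bars of a histogram touch.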

Below I have prepared a summary of the diagrams we can use depending on the data types.

pic 5 summary

Next up we will talk about the mean, median and mode, which are also called measures of central tendency.

The population mean is represented by the Greek letter mu (μ). We find the mean by adding up all the values and then dividing by the number of values. It is as simple as it sounds, but we need to understand the shortcomings of the mean. It's easily affected by outliers. Therefore using the mean alone is not sufficient.

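The median is the middle number of the sorted data. You get the position of the median by adding 1 to the number of values and then dividing by 2. If the number of values is even, that position falls between two values and the median is their average.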

Next up, the mode is the value with the most occurrences.

Using only one measure to understand a data set is not sufficient. So basically we need to understand each concept and use it accordingly.
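Here is a quick Python sketch of the three measures using the standard `statistics` module. The salary figures are invented for illustration, with one deliberately extreme value to show how an outlier pulls the mean.

```python
import statistics

# Hypothetical monthly salaries (in thousands); the last value is an outlier.
salaries = [30, 32, 32, 35, 38, 40, 400]

mean = statistics.mean(salaries)      # sum of values / number of values
median = statistics.median(salaries)  # middle value of the sorted data
mode = statistics.mode(salaries)      # most frequent value

print(f"mean = {mean:.2f}, median = {median}, mode = {mode}")

# The same measures without the outlier, for comparison.
trimmed = salaries[:-1]
print(f"mean = {statistics.mean(trimmed):.2f}, "
      f"median = {statistics.median(trimmed)}")
```

With the outlier included, the mean ends up larger than every "normal" salary in the list, while the median and mode barely move. That is why looking at more than one measure helps.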

Then comes skewness, which is not a popular term, but let's understand the concept in brief. Skewness indicates whether the observations are concentrated on one side or not. Let's take a mean of 2.5 and a median of 2.0. Mean > median, which usually indicates a positive skew. If we draw a histogram, the occurrences are concentrated on the left for the above scenario, with a long tail to the right. Skewness indicates where the data is located. Take a look at the diagram below. It has a graphical explanation of positive, negative and zero skew.

pic 6 skewness
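Skewness can also be put into a number. The sketch below uses one common definition, the moment-based (Fisher-Pearson) coefficient, which is my choice for this example and only one of several formulas in use. The three small data sets are made up to show a positive, a negative and a zero skew.

```python
import statistics

def skewness(values):
    """Moment-based (Fisher-Pearson) skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data sets.
right_tailed = [1, 1, 2, 2, 2, 3, 3, 4, 9]  # bulk on the left, tail to the right
left_tailed = [1, 6, 7, 7, 8, 8, 8, 9, 9]   # bulk on the right, tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right_tailed", right_tailed),
                   ("left_tailed", left_tailed),
                   ("symmetric", symmetric)]:
    print(f"{name}: mean = {statistics.mean(data)}, "
          f"median = {statistics.median(data)}, "
          f"skewness = {skewness(data):.2f}")
```

In the right-tailed data the mean is above the median and the skewness is positive. In the left-tailed data it is the other way around, and the symmetric data has a skewness of zero.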

… to be continued..
