Mean, Median and Mode in Data Science

We have discussed Mean, Median and Mode before, although not in great detail.

However, these three types of averages play a key role in data science.

Firstly, they are mostly statistical concepts, but they also relate to Discrete Mathematics. Secondly, we can compute them for any data set.

Why? We’ll see in a minute.

Finding an average from a range of data differs from one type of average to another.

We have seen before that in discrete mathematics, two concepts always go together.

One is ‘order of operation’, and the other is ‘distributive property’. 

Consider some factors like this:

    x = 3 * (4 + 6)
    x = 3 * 10
    x = 30

Now, we can rearrange the same factors as follows.

    x = 3 * (4 + 6)
    x = 3*4 + 3*6
    x = 12 + 18
    x = 30

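It shows that the order of operations and the distributive property work well together when the parentheses contain addition or subtraction.

However, this will not work when the parentheses contain multiplication or division. For example, 3 * (4 * 6) is 72, but (3*4) * (3*6) is 216.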

Just try it.
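Here is a quick way to try it in Python. This is only a sketch, and it assumes that "does not work" means distributing the outer factor over each operand inside the parentheses. The variable names are made up for this illustration.

```python
# Multiplication distributes over addition and subtraction.
assert 3 * (4 + 6) == 3 * 4 + 3 * 6      # both sides are 30
assert 3 * (10 - 6) == 3 * 10 - 3 * 6    # both sides are 12

# It does not distribute over multiplication.
direct_product = 3 * (4 * 6)             # 72
distributed_product = (3 * 4) * (3 * 6)  # 216

# It does not distribute over division either.
direct_quotient = 3 * (12 / 4)             # 9.0
distributed_quotient = (3 * 12) / (3 * 4)  # 3.0

print(direct_product, distributed_product)
print(direct_quotient, distributed_quotient)
```

The two numbers on each printed line differ, so the same rearrangement that was safe for addition changes the result here.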

Any algebraic operation is nothing but a kind of algorithm.

In the above case, our algorithm works very well with addition and subtraction, but it fails with multiplication and division.

The same thing might happen for a data set.

We can add or subtract two vectors element by element, just as we add or subtract scalar values. To see this, we can plot them and find that their components add like scalar values.

Figure: Adding two vectors and plotting the graph
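The addition itself can be sketched in a few lines of NumPy. The vectors u and v below are hypothetical values chosen for this illustration; they are not the ones used in the figure.

```python
import numpy as np

# Two hypothetical 2-D vectors
u = np.array([1, 2])
v = np.array([3, 1])

total = u + v        # element-wise: [1 + 3, 2 + 1]
difference = u - v   # element-wise: [1 - 3, 2 - 1]

print(total.tolist())       # [4, 3]
print(difference.tolist())  # [-2, 1]
```

Each component behaves exactly like a scalar sum or difference.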

But the product of vectors works on a different principle.

Let’s see how the products of two arrays work. 

First the code.

import numpy as np 
import matplotlib.pyplot as plt

a = np.array([
    [1, 2], 
    [0, 1]
])

b = np.array([
    [2, 0], 
    [3, 4]
])

# Matrix product of a and b
prod = a @ b

print(prod)

# Plot the two arrays and their product.
# Each column of a 2-D array is drawn as a separate line.
plt.plot(a)
plt.plot(b)
plt.plot(prod)

# Add a title
plt.title("Matplotlib Plotting Product of NumPy Arrays")

plt.show()

If you want to see the output in GitHub with the image, please visit this repository.

Figure: The product of two arrays is different from the sum

The above image shows the output and the plots. 
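To make the difference concrete, here is a small sketch that reuses the same two arrays and compares the element-wise product with the matrix product. The variable names are chosen only for this illustration.

```python
import numpy as np

a = np.array([[1, 2], [0, 1]])
b = np.array([[2, 0], [3, 4]])

elementwise = a * b  # multiplies matching entries, like addition does
matrix_ab = a @ b    # rows of a combined with columns of b
matrix_ba = b @ a    # swapping the operands changes the result

print(elementwise.tolist())  # [[2, 0], [0, 4]]
print(matrix_ab.tolist())    # [[8, 8], [3, 4]]
print(matrix_ba.tolist())    # [[2, 4], [3, 10]]
```

Unlike addition, the matrix product is not element-wise and depends on the order of the operands.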

Mean, Median and Mode

In a collection of numbers, we might try to apply the same algorithm. Consider the following data set.

{14, 5, {78, 1, 56}, 8, 96, 0, 6}

We find a data set inside a data set.

But we cannot simply apply our algorithm of addition or subtraction here, because one of the elements is itself a collection.

Therefore, we need to use the ‘order of operation’ rule and get the addition done inside the sub-data set first.
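One way to sketch this in Python is shown below. It assumes the data set is stored as a list with one nested list, because a Python set cannot contain another set. The names data and total are made up for this illustration.

```python
# The nested data set, written as a list with one inner list
data = [14, 5, [78, 1, 56], 8, 96, 0, 6]

total = 0
for item in data:
    if isinstance(item, list):
        # Resolve the inner collection first, like a parenthesised term
        total += sum(item)
    else:
        total += item

print(total)  # 264
```

The inner collection is reduced to a single number (135) before it joins the outer sum.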

Now, we understand one thing.

What we can do with discrete or scalar integers, we cannot always do with a collection of integers. We need to invent some different algorithms.

In any algebraic data set, there is always an average of the collection. Consider this data set:

{6, 2, 3, 8, 1}

The average of this data set is (6 + 2 + 3 + 8 + 1) / 5 = 20 / 5, or 4.

Mean

In algebra this is the Mean of a data set. In mathematics there is another ‘mean’ as well: the geometric mean, which you may have heard of.
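As a minimal sketch, both kinds of mean can be computed with Python's standard statistics module. It assumes Python 3.8 or later, where statistics.geometric_mean is available.

```python
import statistics

data = [6, 2, 3, 8, 1]

# Arithmetic mean: sum of the values divided by their count
arithmetic = sum(data) / len(data)
print(arithmetic)  # 4.0

# The library function gives the same value
print(statistics.mean(data) == arithmetic)  # True

# Geometric mean: the n-th root of the product of the n values
geometric = statistics.geometric_mean(data)
print(geometric < arithmetic)  # True
```

For positive numbers that are not all equal, the geometric mean is always smaller than the arithmetic mean.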

On the other hand, in any algebraic data set, the Median represents the middle value of the data set once the values are sorted.

In any collection with an odd number of integers, finding the Median is quite easy.

Median

In the above data set {6, 2, 3, 8, 1}, which is {1, 2, 3, 6, 8} when sorted, 3 is the Median or middle value. But it is not that simple when the number of integers in a collection is even.

Consider the following data set.

{1, 2, 3, 4}

Finding the Mean is quite easy. To find the Median we need to take the average of 2 and 3, because they are the two middle values.

The average of 2 and 3 is 2.5. Therefore, that is the Median of the above data set.
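The odd and even cases can be sketched in one small function. The function name median and the list names below are chosen for this illustration; the standard library's statistics.median does the same job.

```python
import statistics

def median(values):
    """Return the middle value of the sorted data."""
    ordered = sorted(values)  # the Median is defined on sorted data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists
        return ordered[mid]
    # Even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [6, 2, 3, 8, 1]
even_data = [1, 2, 3, 4]

print(median(odd_data))   # 3
print(median(even_data))  # 2.5
print(statistics.median(even_data))  # 2.5
```

Sorting comes first: the 3 in the first data set is the Median because it sits in the middle of 1, 2, 3, 6, 8.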

Mode

Another important property of a data set is the Mode. In some data sets, one value, or more than one value, appears most often.

Consider the following data set.

{1, 56, 7, 89, 7, 3, 56, 2, 1, 7, 8, 9, 3, 45, 3, 96, 3, 78, 5, 3}

In the above data set, more than one value appears repeatedly. Right? We can see that 1 and 56 each appear two times, and 7 appears three times.

Moreover, the integer 3 appears five times. Therefore, the most frequent integer is 3. It is the Mode of the data set.
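Counting by hand is error-prone, so here is a short sketch that counts the values with collections.Counter from the standard library. It also uses statistics.multimode, which assumes Python 3.8 or later.

```python
from collections import Counter
import statistics

data = [1, 56, 7, 89, 7, 3, 56, 2, 1, 7, 8, 9, 3, 45, 3, 96, 3, 78, 5, 3]

# Count how many times each value appears
counts = Counter(data)
print(counts[1], counts[56], counts[7], counts[3])  # 2 2 3 5

# The most common value and its count
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # 3 5

# multimode returns every value that shares the highest count
print(statistics.multimode(data))  # [3]
```

If two values tied for the highest count, multimode would return both of them, which matches the remark that a data set may have more than one Mode.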

Why we need them in data science

Reading so far, we may ask ourselves, why do we need to know this before studying data science?

Well, there is an answer. 

The sum of a collection of numbers divided by the count of numbers in the collection is called the arithmetic mean.

It is true for mathematics and statistics. 

Therefore, we refer to the ‘Mean’ of a data set as the ‘arithmetic mean’ so that there is no confusion.

After all, we have learned before that there are other means, such as the ‘geometric mean’ and the ‘harmonic mean’.

In data science, sometimes we need to find out the arithmetic average income of a nation’s population, which is known as per capita income.

Now, we can logically guess that the ‘Mean’ of a data set cannot always represent the central tendency of that data set.

Consider a data set where a small section of people’s income is much, much greater than most people’s income. On paper, the nation’s per capita income suggests that a typical person is well off, which is a false picture.

In reality, in that country, 70 percent of people might live under the poverty line.

In such a situation, the ‘Median’ may be a better description of the central tendency.

Why? Because, the ‘Median’ separates the higher half from the lower half of a data sample. For a data set, we have seen that it represents the middle value. 
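A small sketch makes the gap visible. The ten incomes below are hypothetical numbers invented for this illustration, with one very rich person among nine ordinary earners.

```python
import statistics

# Hypothetical yearly incomes; the last one is an extreme value
incomes = [12_000, 15_000, 18_000, 20_000, 22_000,
           25_000, 30_000, 35_000, 40_000, 5_000_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(mean_income)    # 521700
print(median_income)  # 23500.0

# How many people earn less than the "average" income?
below_mean = sum(1 for income in incomes if income < mean_income)
print(below_mean)  # 9
```

Nine out of ten people earn less than the Mean, while the Median sits between the fifth and sixth incomes and describes a typical earner much better.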

Whereas a few extreme values can skew the ‘Mean’ and give a false representation of the central tendency.

Here is the basic advantage of the ‘Median’. 

It may give a better idea of typical central tendencies. 

Only if more than half the data are replaced by false, extremely large or extremely small values will the ‘Median’ give an arbitrarily large or small result.
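This limit can be sketched by corrupting a small data set on purpose. The values, the HUGE constant and the corrupt helper are all hypothetical and exist only for this illustration.

```python
import statistics

clean = [10, 11, 12, 13, 14, 15, 16]
HUGE = 10**9  # stands in for a false, extremely large value

def corrupt(values, k):
    """Return a copy with the last k values replaced by HUGE."""
    return values[: len(values) - k] + [HUGE] * k

few_bad = corrupt(clean, 3)   # 3 of 7 values are false (less than half)
many_bad = corrupt(clean, 4)  # 4 of 7 values are false (more than half)

print(statistics.median(clean))     # 13
print(statistics.median(few_bad))   # 13
print(statistics.median(many_bad))  # 1000000000

# The Mean is already ruined by the smaller corruption
print(statistics.mean(few_bad) > 1_000_000)  # True
```

With three false values the Median does not move at all, but as soon as the false values make up more than half of the data, the Median becomes one of them.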

