Why is the Median better than the Mean in Data Science? This question haunts beginners in particular, and we will discuss it in this section.
An experienced data scientist knows the answer. Still, we can question this conviction too.
Why?
We’ll look at that in the next section, so I request you to read it after finishing this article.
First, let's take the general claim that the Median is better than the Mean in Data Science.
Why?
Because when we handle income data for a population, especially in a poor or developing country, the Mean can be misleading.
For example, a few rich people with a lot of money will pull the Mean far above what most people earn.
In that case, the Median will do justice. It reflects what a typical person earns far better than the Mean does.
We can show this with two simple programs that work on the same data set.
Finding Mean
Let’s find the Mean first.
import numpy as np

def find_mean(the_array, num):
    # Add up the values one by one, then divide by how many there are
    total = 0.0
    i = 0
    while i < num:
        total = total + the_array[i]
        i = i + 1
    return total / num

the_array = np.array([1, 2, 3, 4, 5, 3, 2, 568, 1200])
length_of_array = len(the_array)
print(f'Mean from original program: {find_mean(the_array, length_of_array)}')
# NumPy gives the same answer in a single call
numpy_mean = np.mean(the_array)
print(f'Mean from NumPy Mean: {numpy_mean}')
'''
Mean from original program: 198.66666666666666
Mean from NumPy Mean: 198.66666666666666
'''
We have worked on a small sample of a data set. It shows the incomes of nine people. Of these nine, seven are poor and two are rich.
The two rich people greatly influence the average income.
They don’t belong to the majority of the population, yet they make the average income look like a farce.
Isn’t it?
In Data Science, we call such values “outliers”.
For what reason? Because they represent a tiny segment that lies far away from the main body of the data.
The above program has two parts. In the first part we used Python's imperative programming style to write an algorithm that finds the Mean.
In the second part we used the NumPy library and called its mean() function. Here we only state what we want, not how to compute it, which is closer to a declarative programming style.
Either way, the result is the same.
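To see how strongly the two outliers pull the Mean, we can compute it once with them and once without them. This is only a rough sketch: the cut-off of 100 and the names `threshold` and `typical` are assumptions made for this illustration, not a standard rule for detecting outliers.

```python
import numpy as np

incomes = np.array([1, 2, 3, 4, 5, 3, 2, 568, 1200])

# Assumption for illustration only: treat any income above 100 as an outlier
threshold = 100
typical = incomes[incomes <= threshold]

mean_all = np.mean(incomes)
mean_typical = np.mean(typical)

print(f'Mean with the two rich people: {mean_all:.2f}')
print(f'Mean without them: {mean_typical:.2f}')
'''
Mean with the two rich people: 198.67
Mean without them: 2.86
'''
```

Dropping just two of the nine values moves the Mean from about 198 to below 3, which shows how sensitive it is to a handful of extreme incomes.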
Median represents a better output than Mean
In the next program we will follow the same path to show why the Median represents a better output than the Mean.
In the first part, we write our own algorithm to get the Median of the same data set.
In the second part, we use the NumPy library and call its median() function.
No doubt NumPy makes our life much easier, because we don’t have to bother about what is going on in the background.
As a result, we get the answer in one line of code.
However, for data science as well as machine learning, we need to know how to write algorithms in Python.
Anyway, let’s take a look at the code below.
import numpy as np

def get_median(the_array, num):
    # the median is the middle value of the sorted data,
    # so sort a copy first and leave the caller's array untouched
    sorted_array = np.sort(the_array)
    middle = num // 2
    if num % 2 != 0:
        # odd count: there is exactly one middle element
        return sorted_array[middle]
    else:
        # even count: average the two middle elements
        return (sorted_array[middle - 1] + sorted_array[middle]) / 2.0

the_array = np.array([1, 2, 3, 4, 5, 3, 2, 568, 1200])
# len() counts every element; count_nonzero() would skip any zero income
length = len(the_array)
print(f'Median: {get_median(the_array, length)}')
print(f'Median from NumPy: {np.median(the_array)}')
'''
Median: 3
Median from NumPy: 3.0
'''
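The nine-value data set has an odd count, so it never exercises the even case. The sketch below adds a tenth, made-up income of 4 purely to show what happens when there are two middle values. The helper name `median_of` and the extra value are assumptions for this illustration.

```python
import numpy as np

def median_of(values):
    # Hypothetical helper: sort a copy, then pick or average the middle
    ordered = np.sort(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 != 0:
        return ordered[middle]
    # Even count: average the two values on either side of the centre
    return (ordered[middle - 1] + ordered[middle]) / 2.0

# Same incomes as before plus one invented value (4) to make the count even
even_incomes = np.array([1, 2, 3, 4, 5, 3, 2, 568, 1200, 4])

print(f'Median: {median_of(even_incomes)}')
print(f'Median from NumPy: {np.median(even_incomes)}')
'''
Median: 3.5
Median from NumPy: 3.5
'''
```

After sorting, the two middle values are 3 and 4, so the Median is their average, 3.5. The original nine incomes take the odd branch instead and give 3.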
Certainly the Median represents the typical income of our nine-person data set far better.
Why?
Because the sample of incomes shows that most of the people earn around 3, not the 198 that our Mean produced earlier.
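We can put a number on "most of the people" with a quick count. The sketch below checks how many of the nine incomes sit below the Mean and how many sit close to the Median. The choice of "within 2 units" as the meaning of "close" is an assumption made only for this illustration.

```python
import numpy as np

incomes = np.array([1, 2, 3, 4, 5, 3, 2, 568, 1200])

mean_income = np.mean(incomes)
median_income = np.median(incomes)

# How many people earn less than the Mean?
below_mean = np.count_nonzero(incomes < mean_income)

# Assumption for illustration: "close" means within 2 units
near_median = np.count_nonzero(np.abs(incomes - median_income) <= 2)
near_mean = np.count_nonzero(np.abs(incomes - mean_income) <= 2)

print(f'People earning less than the Mean: {below_mean} of {len(incomes)}')
print(f'People within 2 of the Median: {near_median}')
print(f'People within 2 of the Mean: {near_mean}')
'''
People earning less than the Mean: 7 of 9
People within 2 of the Median: 7
People within 2 of the Mean: 0
'''
```

Seven of the nine people earn less than the Mean, and nobody earns anything close to it, while seven people fall within 2 units of the Median.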
As we progress in Data Science and Machine Learning, we will find that the Median is not always the true representation of the average of every data set.
How and why, we’ll discuss in the next section.
If you want to read such code snippets that revolve around discrete mathematics, data structures and algorithms, please clone the respective GitHub Repository.