Why Median is better than Mean in Data Science

Why is the Median better than the Mean in Data Science? This question puzzles beginners in particular, and it is the topic of this section.

An experienced data scientist knows the answer. Still, we can question this conviction too.

Why?

We’ll examine that in the next section, so please read it after finishing this one.

First, let’s take the general claim that the Median is better than the Mean in Data Science.

Why?

Because when we handle income data for the population of a poor or developing country, the Mean can mislead us.

For example, a few very rich people pull the Mean far above what most people earn.

In that case, the Median does the data justice. It reflects what a typical person earns far better than the Mean does.

We can demonstrate this with two simple programs that work on the same data set.

Finding Mean

Let’s find the Mean first.

import numpy as np


def find_mean(the_array, num):
  # add up the first num values one by one
  total = 0.0
  i = 0
  while i < num:
    total = total + the_array[i]
    i = i + 1
  # divide the sum by the number of values
  return total / num


the_array = np.array([1, 2, 3, 4, 5, 3, 2, 568, 1200])
length_of_array = len(the_array)
print(f'Mean from original program: {find_mean(the_array, length_of_array)}')
numpy_mean = np.mean(the_array)
print(f'Mean from NumPy Mean: {numpy_mean}')

'''
Mean from original program: 198.66666666666666
Mean from NumPy Mean: 198.66666666666666
'''

We have worked on a small sample of a data set. It shows the incomes of nine people. Of these nine people, seven are poor and two are rich.

As a result, the two rich people greatly influence the average income.

These rich people, who don’t belong to the majority of the population, make the average income look like a farce.

Don’t they?

In a data set, or in Data Science, we call them “outliers”. 

Why? Because they are a tiny segment whose values lie far from those of most people.
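
To see how much these two values matter, here is a minimal sketch that computes the Mean of the same sample with and without them. The cut-off of 100 is an assumption, picked by eye to separate the two large incomes in this sample only. It is not a general rule for detecting outliers.

```python
incomes = [1, 2, 3, 4, 5, 3, 2, 568, 1200]

# hypothetical cut-off, picked by eye for this sample only
CUTOFF = 100
majority = [x for x in incomes if x < CUTOFF]

mean_all = sum(incomes) / len(incomes)
mean_majority = sum(majority) / len(majority)

print(f'Mean of all nine incomes: {mean_all:.2f}')
print(f'Mean without the two outliers: {mean_majority:.2f}')
```

Leaving out the two large incomes moves the Mean from about 198.67 to about 2.86.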

The above program has two parts. In the first part we used Python in an imperative programming style to write an algorithm that finds the Mean.

In the second part we used the NumPy library and called its mean() function. This is closer to a declarative programming style, because we say what we want and not how to compute it.

Either way, the result is the same.

Median represents a better output than Mean

In the next program we will follow the same path to show why the Median represents a better output than Mean.

In the first part, we write our own algorithm to find the Median of the same data set.

In the second part, we use the NumPy library and call its median() function.

No doubt NumPy makes our life much easier, because we don’t have to bother about what is going on in the background.

We get the result in one line of code.

However, for data science as well as machine learning, we need to know how to write algorithms in Python.

Anyway, let’s take a look at the code below.

import numpy as np


def get_median(the_array, num):
  # the median is computed differently for odd and even counts
  # sort a copy first, so the caller's array stays unchanged
  ordered = np.sort(the_array)
  middle = num // 2
  if num % 2 != 0:
    # odd case: the single middle value
    return ordered[middle]
  else:
    # even case: the average of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2.0


the_array = np.array([1, 2, 3, 4, 5, 3, 2, 568, 1200])

# len() counts every value; np.count_nonzero() would skip zeros
length = len(the_array)

print(f'Median: {get_median(the_array, length)}')
print(f'Median from NumPy: {np.median(the_array)}')

'''
Median: 3
Median from NumPy: 3.0
'''
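
The sample has nine values, so only the odd case runs. As a rough sketch of the even case, the snippet below drops the largest income to leave eight values and averages the two middle ones. The helper name median_of is hypothetical, and the result is compared with statistics.median from Python’s standard library.

```python
from statistics import median


def median_of(values):
  # hypothetical helper: sort a copy, then pick the middle
  ordered = sorted(values)
  count = len(ordered)
  middle = count // 2
  if count % 2 != 0:
    # odd count: the single middle value
    return ordered[middle]
  # even count: the average of the two middle values
  return (ordered[middle - 1] + ordered[middle]) / 2.0


eight_incomes = [1, 2, 3, 4, 5, 3, 2, 568]
even_median = median_of(eight_incomes)
print(f'Median of eight values: {even_median}')
print(f'Median from statistics: {median(eight_incomes)}')
```

With an even count there is no single middle value, so the two values on either side of the centre are averaged.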

For this particular data set, the Median is clearly the better measure of the typical income.

Why?

Because the sample shows that most people earn around 3, not the 198 that the Mean produced earlier.

As we progress in Data Science and Machine Learning, we will find that the Median is not the best representation of every data set.

We’ll discuss how and why in the next section.
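
One way to check this claim is to count how many incomes sit near each measure. The sketch below is only an illustration: it uses Python’s standard statistics module, and the width of 2 around the Median is an arbitrary choice made for this sample.

```python
from statistics import mean, median

incomes = [1, 2, 3, 4, 5, 3, 2, 568, 1200]

the_mean = mean(incomes)
the_median = median(incomes)

# how many people earn less than the Mean
below_mean = sum(1 for x in incomes if x < the_mean)

# how many people earn within 2 units of the Median
# (the width of 2 is an arbitrary choice for this illustration)
near_median = sum(1 for x in incomes if abs(x - the_median) <= 2)

print(f'{below_mean} of {len(incomes)} incomes are below the Mean')
print(f'{near_median} of {len(incomes)} incomes are within 2 of the Median')
```

In this sample, seven of the nine incomes fall below the Mean, and the same seven lie within 2 of the Median.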

If you want to read more code snippets on discrete mathematics, data structures and algorithms, please clone the respective GitHub repository.
