Can We Skew the Median in Data Science?

In the previous section we saw that we can trust the Median more than the Mean. But in reality, the Median can be skewed too.

And we can make the Median look much greater than it should be. 

In data science, as well as in statistics, we can prove that. We can skew the Median and maximise its value.

Firstly, how can we do that?

Secondly, why does it happen?

We’ll explain this with a simple Python algorithm and, side by side, show the true value using the NumPy library.

Let’s see the code and output first. After that we will discuss the process of skewing the Median value in detail.

import numpy as np

def get_median(the_array, num):
  # the median is computed differently for odd and even lengths
  # let us sort the array first (this sorts the caller's array in place)
  the_array.sort()
  # check for the odd case first: return the single middle element
  if (num % 2 != 0):
    middle = num // 2
    median = the_array[middle]
    return median
  else:
    # even case: average the two middle elements
    # use // so the indices are integers; / would produce floats in Python 3
    median = (the_array[(num - 1) // 2] + the_array[num // 2]) / 2.0
    return median

the_array = np.array([1, 1200, 2, 3, 568, 4, 5, 3, 2])
print(the_array) 
print(f'Array after the sorting: {np.sort(the_array)}') 
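# count_nonzero equals the array length here because the array has no zeros
# we deliberately pass a length that is 8 more than the real one
length = np.count_nonzero(the_array) + 8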


print(f'Median: {get_median(the_array, length)}')
print(f'Median from NumPy: {np.median(the_array)}')  
  
'''
output:
=======
# this is the data set before and after sorting

[   1 1200    2    3  568    4    5    3    2]
Array after the sorting: [   1    2    2    3    3    4    5  568 1200]

# and these are two types of Median
# the first one is obviously skewed
# the second one is the real Median

Median: 1200
Median from NumPy: 3.0
'''

The above code works on a data set of unordered numbers. We know that the way the Median is calculated depends on the length of the array. If the length is odd, the Median is the single middle element of the sorted array. If it is even, the Median is the average of the two middle elements.
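
As a quick illustration of the odd and even cases, here is a minimal sketch that relies only on NumPy's median() method. The two small arrays are made up for this example and are not part of the data set above.

```python
import numpy as np

# Hypothetical sample arrays, chosen only to show the two cases
odd_length = np.array([1, 2, 3, 4, 5])       # 5 elements
even_length = np.array([1, 2, 3, 4, 5, 6])   # 6 elements

# Odd length: the median is the single middle element
odd_median = np.median(odd_length)

# Even length: the median is the average of the two middle elements
even_median = np.median(even_length)

print(f'Odd-length median: {odd_median}')
print(f'Even-length median: {even_median}')
```

In the first array the middle element is 3. In the second, the two middle elements are 3 and 4, so their average is 3.5.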

Therefore, we have sorted our array and checked whether that array length is even or odd.

After we’ve sorted the array or data set, the larger values move to the right half of the array.

In the above array, the largest value is 1200, so it goes to the far right end of the array.

Now, at run time, we pass the function a length that is 8 more than the actual length of the array. The array itself does not change.

Why?

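Because the array length is 9, its last valid index is 8. For an odd length the function reads the element at index (9 + k) // 2, where k is the value we add. That index stays at or below 8 only while k is less than 9, so 8 is the largest value we can add.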

What happens? Watch the outcome.

Our function has returned the largest value in the array as the Median. We have not only maximised the Median value, we have skewed it.

How do we skew the Median?

If we ran the code without increasing the length, the Median would come out as 3. That is the real value, the same one we get from the NumPy median() method. Right? We couldn’t skew that value.

However, when we pass a length at run time that is bigger than the array actually is, our algorithm reads an element further to the right. With the largest permitted increase, it picks up the largest value.

What would happen if, instead of adding the biggest value that the function permits, we added 2?

Let’s run the code in Google Colab, and see the result.

length = np.count_nonzero(the_array) + 2
...
[   1 1200    2    3  568    4    5    3    2]
Array after the sorting: [   1    2    2    3    3    4    5  568 1200]
Median: 4
Median from NumPy: 3.0
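
The two runs above follow a simple pattern, which the sketch below tries to make visible. It lists the index that the odd-length branch reads for each even value added to the real length. The helper name middle_index and the offsets list are assumptions made for this illustration; they are not part of the original code.

```python
import numpy as np

def middle_index(num):
    # Same arithmetic as the odd-length branch of get_median: num // 2
    return num // 2

sorted_array = np.sort(np.array([1, 1200, 2, 3, 568, 4, 5, 3, 2]))
real_length = len(sorted_array)  # 9

# Hypothetical even offsets added to the real length
offsets = [0, 2, 4, 6, 8]
picked = {}
for offset in offsets:
    index = middle_index(real_length + offset)
    picked[offset] = int(sorted_array[index])
    print(f'offset {offset}: index {index} -> value {picked[offset]}')
```

Each extra 2 in the length moves the chosen index one step to the right in the sorted array, which is why adding 8 lands on the last element, 1200.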

We are nearing a bitter truth: the Median is not trustworthy anymore. We can skew it, just as we skewed the Mean before.

One thing is certain: we cannot add just any number to the length of the array.

It depends on the original array length. 

Firstly, if the array length is odd, we have to add an even value, so that the length we pass stays odd and the function returns a single element.

In short, since the array length in this code is 9, we add an even number rather than an odd one like 1 or 3.

Secondly, the value we add should be less than the array length.

If it crosses that limit, the index the function computes goes out of range, and the code raises an error.
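
To see the out-of-range case concretely, here is a minimal sketch that assumes we add 10 to the real length of 9. The variable names are made up for this illustration, and the index is computed the same way as in the odd-length branch of our function.

```python
import numpy as np

sorted_array = np.sort(np.array([1, 1200, 2, 3, 568, 4, 5, 3, 2]))
real_length = len(sorted_array)  # 9, so valid indices are 0..8

# Hypothetical offset that is even but not less than the array length
too_large = real_length + 10     # 19
index = too_large // 2           # 9, one past the last valid index

try:
    value = sorted_array[index]
    out_of_range = False
except IndexError as error:
    out_of_range = True
    print(f'IndexError: {error}')
```

NumPy refuses to read index 9 from an array of 9 elements, so the lookup fails instead of returning a Median.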
