NumPy and Pandas for Machine Learning

NumPy and Pandas are both essential for scientific computing, which includes machine learning and data science.

By now we know that both are libraries, and that we need them at almost every step.

In this section we will take a close look at the key differences between these two libraries.

First, though, note that both libraries are known for their intuitive syntax and high-performance array and matrix computation.

As a result, we use them for data science applications very often.

On top of that, both NumPy and Pandas are open-source Python libraries and have a few things in common.

Despite that, they have some key differences.

Let’s see the major differences.

We use NumPy for numerical operations on arrays and matrices of large data sets.

Why? Because it provides functions for performing mathematical operations on these data structures.

Examples include functions for linear algebra and random number generation.
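
To make the two examples above concrete, here is a minimal sketch. The matrix, the right-hand side and the seed are made-up values chosen only for illustration.

```python
import numpy as np

# Linear algebra: solve the system A @ v = b for v
A = np.array([[2.0, 1.0], [1.0, 3.0]])  # illustrative coefficients
b = np.array([3.0, 5.0])                # illustrative right-hand side
v = np.linalg.solve(A, b)
print(v)

# Random number generation: a fixed seed makes the draw repeatable
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0.0, scale=1.0, size=5)
print(samples)
```

We can check the solution by confirming that A @ v gives back b, up to floating-point rounding.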

Pandas, in contrast, is built on top of NumPy and provides additional functionality for working with structured data.

For both of them, though, the scope is vast.

Pandas, for its part, has data structures for storing and manipulating tabular data.

We know them as data frames.

Besides, it has functions for reading and writing data to and from various file formats.

For example, we can read CSV files as well as tabular data files in other formats.
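
Reading and writing might look like the sketch below. To keep it self-contained, it reads from an in-memory string instead of a file on disk. The column names and values are hypothetical.

```python
import io
import pandas as pd

# A tiny CSV held in memory (hypothetical data, standing in for a real file)
csv_text = "name,score\nalice,90\nbob,85\n"

# Read the CSV text into a DataFrame
scores = pd.read_csv(io.StringIO(csv_text))
print(scores)

# Write the DataFrame back out as CSV text, without the index column
buffer = io.StringIO()
scores.to_csv(buffer, index=False)
print(buffer.getvalue())
```

With a real file, we would pass a file path to read_csv and to_csv instead of the in-memory buffers.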

Since it is made for manipulating numerical data, NumPy is generally faster than pandas for numerical operations on large data sets.

This is largely because NumPy implements its core routines in C, a lower-level language.
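
If we want to check the speed claim on our own machine, a rough sketch like the following can help. The array size and the number of repetitions are arbitrary choices. The timings depend on hardware and library versions, so treat them as indicative only.

```python
import timeit
import numpy as np
import pandas as pd

# The same one million values, held once as a NumPy array and once as a Pandas Series
values = np.arange(1_000_000, dtype=np.float64)
series = pd.Series(values)

# Time 20 repetitions of the same sum in each library
numpy_time = timeit.timeit(lambda: values.sum(), number=20)
pandas_time = timeit.timeit(lambda: series.sum(), number=20)

print(f"NumPy:  {numpy_time:.4f} s")
print(f"Pandas: {pandas_time:.4f} s")
```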

However, beginners often find Pandas much easier to handle, and Pandas is the more user-friendly library when we work with tabular data.

In other words, both libraries can work with numerical data and tabular data.

There is no hard and fast rule, as we can often perform the same action on a data set with either library by issuing different commands.
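
As a small illustration of this point, the sketch below computes the same column averages once with NumPy and once with Pandas. The names arr and frame are chosen only for this example.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3], [4, 5, 6]])
frame = pd.DataFrame(arr, columns=['a', 'b', 'c'])

# NumPy: average down the rows (axis=0) to get one mean per column
numpy_means = arr.mean(axis=0)

# Pandas: DataFrame.mean() also returns one mean per column by default
pandas_means = frame.mean()

print(numpy_means)
print(pandas_means)
```

The numbers are the same in both cases. Pandas just labels each one with its column name.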

Both are comprehensive tools for working with both numerical and categorical data in a variety of formats.

Let us see some code now. We start with NumPy and then move on to some Pandas code.

First we will import NumPy and do some operations.

import numpy as np

# Create a 1D NumPy array
x = np.array([1, 2, 3, 4, 5])

# Create a 2D NumPy array
y = np.array([[1, 2, 3], [4, 5, 6]])

print(x)

print(y)

# output
[1 2 3 4 5]
[[1 2 3]
 [4 5 6]]

-----------------
# Finding Mean
m = np.mean(x)
print(m)

#output
3.0

------------------
# Find the standard deviation of an array
s = np.std(x)
print(s)
#output
1.4142135623730951
-------------------
# We can reshape an array
x_reshaped = x.reshape((5, 1))
print(x_reshaped) 
#output
[[1]
 [2]
 [3]
 [4]
 [5]]
-----------------
# We can transpose an array (a 1D array has only one axis, so .T returns it unchanged)
x_transposed = x.T
print(x_transposed)
#output
[1 2 3 4 5]
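
Transposing the 1D array x above gives back the same values, because a 1D array has only one axis. The effect shows up on a 2D array. Here is a short sketch that reuses the 2D array y from earlier.

```python
import numpy as np

y = np.array([[1, 2, 3], [4, 5, 6]])

# Transposing swaps rows and columns: shape (2, 3) becomes (3, 2)
y_transposed = y.T

print(y.shape)
print(y_transposed.shape)
print(y_transposed)
```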

After looking at some NumPy code, let’s write some Pandas code.

By the way, the dataset we load here is a large one that gives details of Internet users in each country.

# Now we can see a few basic examples of Pandas library

import pandas as pd

url = 'https://raw.githubusercontent.com/sanjibsinha/Machine-Learning-Primer/main/world_internet_user.csv'
df = pd.read_csv(url, encoding='unicode_escape', engine='python')
df.head()
#output
          Country   Region  Population  Internet Users  % of Population
0          _World      NaN  7920539977      5424080321            68.48
1      Afganistan     Asia    40403518         9237489            22.86
2         Albania   Europe     2872758         2191467            76.28
3         Algeria   Africa    45150879        37836425            83.80
4  American Samoa  Oceania       54995           34800            63.28

We can mix NumPy and Pandas together and do some operations.

# creating a dataframe

import pandas as pd
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
df
#output
   a  b  c
0  1  2  3
1  4  5  6
--------------------
# To access a column of the DataFrame
import pandas as pd
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
# df
col_b = df['b']
print(col_b)
#output
0    2
1    5
Name: b, dtype: int64
------------------------
# To access a row of the DataFrame by position (here the second row, at position 1)
import pandas as pd
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
row_1 = df.iloc[1]
print(row_1)
#output
a    4
b    5
c    6
Name: 1, dtype: int64

Want to add a new column to an existing dataframe?

It’s quite easy with NumPy and Pandas.

import pandas as pd
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
df['d'] = [7, 8]
print(df)
# output
   a  b  c  d
0  1  2  3  7
1  4  5  6  8
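
The example above fills the new column from a plain Python list. A new column can also be computed with NumPy, as in this sketch. The column names total and sqrt_a are made up for illustration.

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data, columns=['a', 'b', 'c'])

# Row-wise sum computed by NumPy (axis=1) and stored as a new column
df['total'] = np.sum(data, axis=1)

# A NumPy function applied element-wise to an existing column
df['sqrt_a'] = np.sqrt(df['a'])

print(df)
```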

For more such machine learning code, please visit the respective GitHub Repository.