Both NumPy and Pandas are essential for scientific computing, including machine learning and data science.
By now we know that both are libraries and that we need them at almost every step.
In this section we will take a close look at the key differences between these two libraries.
Both libraries are well known for their intuitive syntax and high-performance matrix computation capabilities.
As a result, we use them for data science applications very often.
On top of that, both NumPy and Pandas are open-source Python libraries and have a few things in common.
Despite that, they have some key differences.
Let’s see the major differences.
We use NumPy for numerical operations on arrays and matrices of large data sets.
Why? Because it provides functions for performing mathematical operations on these data structures.
Examples of such mathematical functions include linear algebra routines and random number generation.
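As a minimal sketch of those two areas, the snippet below solves a small linear system and draws a few random numbers. The matrix, the vector, and the seed are made-up values chosen only for illustration.

```python
import numpy as np

# Linear algebra: solve the system A @ v = b for v
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
v = np.linalg.solve(A, b)
print(v)

# Random number generation: a seeded generator gives reproducible draws
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0.0, scale=1.0, size=5)
print(samples.shape)  # (5,)
```

Multiplying `A @ v` should give back `b`, which is a quick way to check the solution.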
In contrast, the Pandas library, which is built on top of NumPy, provides additional functionality for working with structured data.
For both of them, the scope is vast.
Pandas has data structures for storing and manipulating tabular data.
We know them as data frames.
Besides, it has functions for reading and writing data to and from various file formats.
For example, we can read CSV files as well as tabular data files in other formats.
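Here is a small sketch of reading and writing CSV data. To keep it self-contained, the CSV text is a hypothetical two-row example held in memory rather than a real file.

```python
import io
import pandas as pd

# Hypothetical CSV content, held in memory so the example needs no file on disk
csv_text = "name,score\nAna,90\nBen,85\n"

# read_csv accepts a path, a URL, or a file-like object
scores = pd.read_csv(io.StringIO(csv_text))
print(scores.shape)  # (2, 2)

# to_csv with no path returns the CSV as a string instead of writing a file
round_trip = scores.to_csv(index=False)
print(round_trip)
```

In real work we would usually pass a file name to `read_csv` and `to_csv` instead of an in-memory buffer.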
Since it is made for manipulating numerical data, NumPy is generally faster than Pandas for numerical operations on large data sets.
This is because NumPy implements its core routines in C, a lower-level language, and works on plain typed arrays, whereas Pandas adds labels and indexes on top of them.
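One rough way to check this on our own machine is to time the same operation in both libraries. The sketch below is only an assumption about how we might measure it. The array size and repeat count are arbitrary, and the timings will vary from one machine and library version to another.

```python
import timeit
import numpy as np
import pandas as pd

# One million floats, stored once as a NumPy array and once as a Pandas Series
values = np.arange(1_000_000, dtype=np.float64)
series = pd.Series(values)

# Time the same reduction (a sum) in both libraries, 100 runs each
numpy_time = timeit.timeit(lambda: values.sum(), number=100)
pandas_time = timeit.timeit(lambda: series.sum(), number=100)

print(f"NumPy:  {numpy_time:.4f} s")
print(f"Pandas: {pandas_time:.4f} s")
```

Both calls return the same total, so any difference we see is overhead rather than a difference in the result.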
However, beginners find Pandas much easier to handle, and the Pandas library is without doubt more user-friendly when we work with data.
That said, both libraries can work with numerical data and tabular data.
There is no hard and fast rule, as we can often perform the same action on a data set with either library by issuing different commands.
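To illustrate, the sketch below computes the mean and the standard deviation of the same five numbers with each library. One thing to watch is that the defaults are not identical: `np.std` divides by n (`ddof=0`), while the Pandas `std` method divides by n - 1 (`ddof=1`).

```python
import numpy as np
import pandas as pd

values = [1, 2, 3, 4, 5]
arr = np.array(values)
ser = pd.Series(values)

# The mean agrees in both libraries
print(np.mean(arr))  # 3.0
print(ser.mean())    # 3.0

# The standard deviation differs by default:
# NumPy divides by n (ddof=0), Pandas divides by n - 1 (ddof=1)
print(np.std(arr))
print(ser.std())

# Passing ddof explicitly makes the two agree
print(np.isclose(np.std(arr, ddof=1), ser.std()))  # True
```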
Both are comprehensive tools for working with both numerical and categorical data in a variety of formats.
Let us see some code now. We start with NumPy and then move on to some Pandas code.
First we will import NumPy and do some operations.
import numpy as np
# Create a 1D NumPy array
x = np.array([1, 2, 3, 4, 5])
# Create a 2D NumPy array
y = np.array([[1, 2, 3], [4, 5, 6]])
print(x)
print(y)
# output
[1 2 3 4 5]
[[1 2 3]
 [4 5 6]]
-----------------
# Finding Mean
m = np.mean(x)
print(m)
#output
3.0
------------------
# Find the standard deviation of an array
s = np.std(x)
print(s)
#output
1.4142135623730951
-------------------
# We can reshape an array
x_reshaped = x.reshape((5, 1))
print(x_reshaped)
#output
[[1]
 [2]
 [3]
 [4]
 [5]]
-----------------
# We can transpose an array; for a 1D array such as x, .T returns it unchanged
x_transposed = x.T
print(x_transposed)
#output
[1 2 3 4 5]
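Transposing only has a visible effect on arrays with two or more dimensions. As a short sketch, here is the same operation applied to the 2D array `y` created earlier, where rows and columns swap places.

```python
import numpy as np

y = np.array([[1, 2, 3], [4, 5, 6]])

# The shape flips from (2, 3) to (3, 2)
print(y.shape)    # (2, 3)
print(y.T.shape)  # (3, 2)

# Each original row becomes a column
print(y.T)
```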
After looking at some NumPy code, let’s write some Pandas.
By the way, the dataset we load here is a large one that gives us details of Internet users in each country.
# Now we can see a few basic examples of Pandas library
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/sanjibsinha/Machine-Learning-Primer/main/world_internet_user.csv', encoding='unicode_escape', engine='python')
df.head()
#output
          Country   Region  Population  Internet Users  % of Population
0          _World      NaN  7920539977      5424080321            68.48
1      Afganistan     Asia    40403518         9237489            22.86
2         Albania   Europe     2872758         2191467            76.28
3         Algeria   Africa    45150879        37836425            83.80
4  American Samoa  Oceania       54995           34800            63.28
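As a quick sanity check, we can recompute the percentage column ourselves. The sketch below types in the five rows shown in the `df.head()` output by hand, so it runs without a network connection. The column name `Percent` is a hypothetical name chosen for this example.

```python
import pandas as pd

# The five rows shown in the df.head() output above, typed in by hand
sample = pd.DataFrame({
    "Country": ["_World", "Afganistan", "Albania", "Algeria", "American Samoa"],
    "Population": [7920539977, 40403518, 2872758, 45150879, 54995],
    "Internet Users": [5424080321, 9237489, 2191467, 37836425, 34800],
})

# Vectorized arithmetic: the division runs over whole columns at once
sample["Percent"] = (sample["Internet Users"] / sample["Population"] * 100).round(2)
print(sample[["Country", "Percent"]])
```

The recomputed values match the `% of Population` column in the output above.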
We can mix NumPy and Pandas together and do some operations.
# creating a dataframe
import pandas as pd
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
df
#output
   a  b  c
0  1  2  3
1  4  5  6
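The conversion also works in the other direction. As a small sketch, `to_numpy()` hands back the values of a DataFrame as a plain NumPy array, without the row and column labels.

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data, columns=['a', 'b', 'c'])

# to_numpy() returns the values as a NumPy array, dropping the labels
back = df.to_numpy()
print(type(back))
print(np.array_equal(back, data))  # True
```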
--------------------
# To access a column of the DataFrame
import pandas as pd
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
# df
col_b = df['b']
print(col_b)
#output
0    2
1    5
Name: b, dtype: int64
------------------------
# To access a row of the DataFrame
import pandas as pd
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
# iloc selects by position, so index 1 is the second row
row_1 = df.iloc[1]
print(row_1)
#output
a    4
b    5
c    6
Name: 1, dtype: int64
Want to add a new column to an existing dataframe?
It’s quite easy with NumPy and Pandas.
import pandas as pd
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
df['d'] = [7, 8]
print(df)
# output
   a  b  c  d
0  1  2  3  7
1  4  5  6  8
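The new column does not have to be a Python list. As a sketch, the snippet below adds the same column from a NumPy array and then computes another one from the existing columns. The column name `total` is a hypothetical name used only for this example.

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data, columns=['a', 'b', 'c'])

# A NumPy array works as a new column as long as its length matches the row count
df['d'] = np.array([7, 8])

# A column can also be computed from existing ones with NumPy
df['total'] = df[['a', 'b', 'c', 'd']].to_numpy().sum(axis=1)
print(df)
```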
For more machine learning code like this, please visit the accompanying GitHub repository.