Pandas dataframe operations for Machine Learning

As we said before, Pandas is one of the most useful machine learning libraries. To use it well, however, we need to know the Pandas DataFrame operations.

In particular, we need the operations we use every day to prepare training and test data for machine learning.

In our previous sections, we have seen how we can import different libraries such as Pandas, scikit-learn, Matplotlib, and NumPy.

Each of them is a very handy machine learning library, and we cannot work in this field without them.

As always, we start by importing the Pandas library. We have used Google Colab to run the examples.

We read the dataset from the GitHub repository that we created for this purpose.

import pandas as pd

df = pd.read_csv(
    'https://raw.githubusercontent.com/sanjibsinha/Machine-Learning-Primer/main/world_internet_user.csv',
    encoding='unicode_escape',
    engine='python',
)

df.head()

This gives us a table with the first five rows of the internet user statistics.

Pandas DataFrame output of first five rows

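Now we can summarize the text columns with the help of the describe() method. With include='object', it reports for each object (text) column the number of non-missing values (count), the number of distinct values (unique), the most frequent value (top), and how often that value occurs (freq).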


df.describe(include='object')

# output
       Country  Region
count      243     242
unique     243       6
top     _World  Africa
freq         1      58
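To make the four rows of that summary concrete, here is a minimal sketch that recomputes them by hand for one column. The toy DataFrame and its country names are made up for illustration and are not rows from the real CSV.

```python
import pandas as pd

# Toy data for illustration only (not taken from the real dataset).
toy = pd.DataFrame({
    "Country": ["_World", "Country A", "Country B", "Country C"],
    "Region": [None, "Europe", "Europe", "Asia"],
})

non_null = toy["Region"].count()       # 'count': number of non-missing values
distinct = toy["Region"].nunique()     # 'unique': number of distinct non-missing values
counts = toy["Region"].value_counts()  # occurrences per value, most frequent first
top, freq = counts.index[0], counts.iloc[0]  # 'top' and 'freq'

print(non_null, distinct, top, freq)
```

In this toy frame the missing Region is not counted, which is one way a column can end up with a lower count than its neighbours.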

What is the problem with this dataset?

It includes a _World row that holds the world population, so the world is treated as if it were a country.

We can easily solve that problem by working with the dataset without the _World row.
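One possible way to leave out that row is a boolean filter on the Country column. The sketch below uses a small made-up DataFrame in place of the real one, and the name countries is only our choice for the filtered result.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are made up.
toy = pd.DataFrame({
    "Country": ["_World", "Country A", "Country B"],
    "Region": [None, "Europe", "Asia"],
    "Population": [300, 100, 200],
})

# Keep every row whose Country is not the aggregate '_World' entry,
# then renumber the index so it starts at 0 again.
countries = toy[toy["Country"] != "_World"].reset_index(drop=True)

print(countries)
```

The original DataFrame stays untouched, so we can still look up the world total when we need it.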

However, we can get many more insights.

Besides, we can operate directly on the DataFrame.

Suppose we want to replace the spaces in the column names with hyphens.

df.columns = df.columns.str.replace(' ', '-')
df.columns

# output
Index(['Country', 'Region', 'Population', 'Internet-Users', '%-of-Population'], dtype='object')
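One thing to keep in mind: a hyphenated name cannot be used with dot access, because Python would read the hyphen as a minus sign. The sketch below, on a made-up two-column DataFrame, shows bracket access for hyphenated names and, as an alternative we are only suggesting here, underscores for names that also work with dot access.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Internet Users": [10, 20],
    "% of Population": [50.0, 40.0],
})

hyphenated = toy.copy()
hyphenated.columns = hyphenated.columns.str.replace(" ", "-")
# hyphenated.Internet-Users would be parsed as a subtraction,
# so hyphenated names need bracket access.
users = hyphenated["Internet-Users"]

underscored = toy.copy()
underscored.columns = underscored.columns.str.replace(" ", "_")
# With underscores, the name is a valid Python identifier.
users_attr = underscored.Internet_Users

print(list(hyphenated.columns))
print(list(underscored.columns))
```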

When we sort the values of a column such as ‘Population’, we can pass ascending=False to sort from the largest value to the smallest.

df.Population.sort_values(ascending=False)

# output
0      7920539977
44     1448314408
99     1402228175
230     335226482
100     278268685
          ...    
160          1748
159          1644
218          1385
235           804
46            596
Name: Population, Length: 243, dtype: int64

When we want ascending order, we don’t have to set the parameter explicitly, because ascending is True by default.

The call below spells it out only for clarity.

df.Population.sort_values(ascending=True)

# output
46            596
235           804
218          1385
159          1644
160          1748
          ...    
100     278268685
230     335226482
99     1402228175
44     1448314408
0      7920539977
Name: Population, Length: 243, dtype: int64
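Sorting a single column shows only index labels next to the values. If we want to see which country each value belongs to, one option is to sort the whole DataFrame by that column. The sketch below uses made-up data, and the names ranked and top_two are only our choices.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C"],
    "Population": [100, 300, 200],
})

# Sort whole rows by Population, largest first, so the names stay attached.
ranked = toy.sort_values(by="Population", ascending=False)
print(ranked)

# nlargest returns the n rows with the largest values in a column.
top_two = toy.nlargest(2, "Population")
print(top_two)
```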

Pandas DataFrame operations are quite handy: they give us more insight during data analysis and, later, help us build our machine learning models.

As we progress, we will learn more about such Pandas dataframe operations for machine learning.

So stay tuned.

And for more machine learning code, don’t forget to visit the respective GitHub Repository.

