As we said before, Pandas is one of the most useful libraries for machine learning. To use it well, we need to know the common Pandas DataFrame operations,
especially the ones we need every day to prepare training and test data for machine learning.
In the previous sections we saw how to import libraries such as Pandas, scikit-learn, Matplotlib, and NumPy.
Each of them is a handy library that we can hardly work without in this field.
As always, we start by importing the Pandas library. The examples here were run in Google Colab.
We read the dataset from the GitHub repository that we created for this purpose.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/sanjibsinha/Machine-Learning-Primer/main/world_internet_user.csv', encoding='unicode_escape', engine='python')
df.head()
This gives us a neat table with the statistics of internet users.
![Pandas DataFrame output of first five rows](https://i0.wp.com/sanjibsinha.com/wp-content/uploads/2022/12/Pandas-DataFrame-output-of-first-five-rows.webp?ssl=1)
Now we can summarize the text columns with the describe() method; include='object' restricts the summary to the string columns.
df.describe(include='object')
# output
Country Region
count 243 242
unique 243 6
top _World Africa
freq 1 58
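To see what each row of that summary means, here is a minimal sketch on a tiny hand-made DataFrame. The values are invented for illustration and are not taken from the dataset. In the summary, count is the number of non-missing values, unique is the number of distinct values, top is the most frequent value, and freq is how often that value occurs.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset.
# dtype=object keeps the columns as plain object columns.
toy = pd.DataFrame({
    'Country': ['_World', 'India', 'Kenya', 'Ghana'],
    'Region': [None, 'Asia', 'Africa', 'Africa'],
}, dtype=object)

# Summarize only the object (string) columns
summary = toy.describe(include='object')
print(summary)

# The missing Region value is not counted, so count is lower for Region
print(summary.loc['count', 'Region'])
```

The same reading applies to the real output above: Region has 242 values against 243 for Country, so one row has no region.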
What is the problem with this dataset?
It includes the world total and treats the world as a country: _World appears in the Country column.
We can easily solve that problem by leaving out the _World row before we analyze the data.
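One way to leave out the _World row, sketched on a small made-up DataFrame (the numbers are placeholders, not values from the dataset), is to filter with a boolean mask on the Country column.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    'Country': ['_World', 'India', 'Kenya'],
    'Population': [1000, 600, 400],
})

# Keep every row except the aggregate '_World' row,
# then renumber the index from 0
countries = toy[toy['Country'] != '_World'].reset_index(drop=True)
print(countries)
```

The original DataFrame is left untouched, so we can still look at the world total when we need it.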
However, the dataset offers many more insights.
Besides, we can operate directly on the DataFrame.
Suppose we want to replace the spaces in the column names with hyphens.
df.columns = df.columns.str.replace(' ', '-')
df.columns
# output
Index(['Country', 'Region', 'Population', 'Internet-Users', '%-of-Population'], dtype='object')
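One side effect is worth knowing. A hyphen is not valid in a Python identifier, so a renamed column such as Internet-Users can be reached with bracket notation but not with attribute access of the kind used for df.Population. The sketch below uses a made-up DataFrame with placeholder values. Replacing spaces with underscores is shown only as an alternative, not as the approach used in this section.

```python
import pandas as pd

# Hypothetical toy data with a space in one column name
toy = pd.DataFrame({'Internet Users': [10, 20], 'Population': [100, 200]})

# Hyphenated names: use bracket notation to select the column
hyphenated = toy.copy()
hyphenated.columns = hyphenated.columns.str.replace(' ', '-')
print(hyphenated['Internet-Users'])

# Underscored names: attribute access also works
underscored = toy.copy()
underscored.columns = underscored.columns.str.replace(' ', '_')
print(underscored.Internet_Users)
```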
When we sort the values of a column such as 'Population', we can set the ascending parameter to False to get the largest values first.
df.Population.sort_values(ascending=False)
# output
0 7920539977
44 1448314408
99 1402228175
230 335226482
100 278268685
...
160 1748
159 1644
218 1385
235 804
46 596
Name: Population, Length: 243, dtype: int64
We don't have to pass ascending=True explicitly when we want the smallest values first.
It is True by default.
df.Population.sort_values(ascending=True)
# output
46 596
235 804
218 1385
159 1644
160 1748
...
100 278268685
230 335226482
99 1402228175
44 1448314408
0 7920539977
Name: Population, Length: 243, dtype: int64
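Sorting a single column returns only the values and their row labels, so we would still have to look up which country sits at index 44 or 99. A minimal sketch, again on invented placeholder data, sorts the whole DataFrame with sort_values(by=...) so that the country names stay attached to their populations.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Population': [400, 1000, 600],
})

# Sort all rows by the Population column, largest first
by_population = toy.sort_values(by='Population', ascending=False)
print(by_population)
```

The original row labels are kept, which is why the index of the sorted result is no longer in order.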
Pandas DataFrame operations like these give us more insight during data analysis and help us later when we build machine learning models.
As we progress, we will learn more Pandas DataFrame operations for machine learning.
So stay tuned.
And for more machine learning code, don't forget to visit the GitHub repository.