Pandas and Machine Learning

Pandas and machine learning have become almost synonymous. In practice, Pandas is one of the libraries we reach for most often in machine learning work.

In the previous sections we saw how to import different libraries such as Pandas, scikit-learn, Matplotlib, and NumPy.

In this section, however, we will start with Pandas only. We will read and operate on data that we have downloaded from this link.

There are many CSV files available to download there.

The data gives us an insight into how people travel to work in New Zealand.

Let’s read the dataset first.

import pandas as pd
import numpy as np
from sklearn import datasets, linear_model, model_selection
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/sanjibsinha/Machine-Learning-Primer/main/main-means-of-travel-to-work-2018-census-csv.csv')

df

This displays the data as a neatly formatted table.

We can take a look at the table.


Reading CSV file with the help of Pandas library
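Before working with a freshly loaded data frame, it often helps to check its size, column names, and data types. The following is a minimal sketch that does not need a network connection: it reads a tiny in-memory CSV whose values are copied from the outputs in this section. The short column name `count` and the names `csv_text` and `sample_df` are placeholders used only for illustration; the real file uses the much longer column name shown later.

```python
import io
import pandas as pd

# A tiny stand-in for the census CSV. The values are copied from this section;
# the column name "count" is a placeholder, not the real column name.
csv_text = """Main_means_of_travel_to_work,count
Worked at home,291135
Public bus,103194
Train,48777
Bicycle,47811
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)             # (number of rows, number of columns)
print(sample_df.columns.tolist())  # the column names as a plain list
print(sample_df.dtypes)            # the data type pandas inferred for each column
print(sample_df.head(2))           # the first two rows only
```

The same calls work unchanged on the `df` loaded from the URL above.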

The table lists the different means of transportation.

From it we get an idea of how people travel to work in New Zealand.

However, the Pandas library does much more than read datasets.

It offers a wide range of operations that help data scientists and machine learning engineers.





Let’s look at some more code to get an idea of the power of the Pandas library.


df['Main_means_of_travel_to_work']

Here we simply index the data frame we have just created with a column name in square brackets.

It gives us that single column as output.

0                                    Worked at home
1                 Drove a private car, truck or van
2                 Drove a company car, truck or van
3     Passenger in a car, truck, van or company bus
4                                        Public bus
5                                             Train
6                                           Bicycle
7                                  Walked or jogged
8                                             Ferry
9                                             Other
10                          Response unidentifiable
11                                       Not stated
12                                     Total stated
13                                            Total
Name: Main_means_of_travel_to_work, dtype: object
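The `Name: ..., dtype: object` footer in the output above indicates that a single column comes back as a pandas Series. As a small sketch, assuming a three-row stand-in frame called `modes` with a placeholder `count` column, the difference between single and double square brackets looks like this:

```python
import pandas as pd

# Stand-in data copied from this section; "count" is a placeholder column name.
modes = pd.DataFrame({
    "Main_means_of_travel_to_work": ["Worked at home", "Public bus", "Train"],
    "count": [291135, 103194, 48777],
})

# Single brackets with one column name return a Series
as_series = modes["Main_means_of_travel_to_work"]

# Double brackets (a list of names) return a DataFrame, even with one column
as_frame = modes[["Main_means_of_travel_to_work"]]

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```

Which form is preferable depends on what comes next: a Series is convenient for calculations on one column, while a DataFrame keeps the tabular shape.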

If a column holds numerical data, we can get its maximum or minimum value quite easily.


df['Employed_census_usually_resident_population_count_aged_15_years_and_over'].max()
df['Employed_census_usually_resident_population_count_aged_15_years_and_over'].min()
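In a notebook cell only the last expression is displayed, so it can help to print both values explicitly. The sketch below also uses `idxmax()` to find which row holds the maximum. It runs on a three-row stand-in frame called `counts`, with values copied from this section and the placeholder column names `mode` and `count`.

```python
import pandas as pd

# Stand-in data; "mode" and "count" are placeholder column names.
counts = pd.DataFrame({
    "mode": ["Worked at home", "Drove a private car, truck or van", "Ferry"],
    "count": [291135, 1412994, 6045],
})

largest = counts["count"].max()
smallest = counts["count"].min()
print(largest, smallest)

# idxmax() returns the index label of the row holding the maximum value;
# .loc then fetches that whole row.
top_row = counts.loc[counts["count"].idxmax()]
print(top_row["mode"])
```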

Selecting a subset of columns with the filter() method is also quite simple.


df.filter(items=['Main_means_of_travel_to_work'])

The Pandas library gives us a nicely formatted output. It also gives us the kind of insight into the data that we always need in machine learning.

	Main_means_of_travel_to_work
0	Worked at home
1	Drove a private car, truck or van
2	Drove a company car, truck or van
3	Passenger in a car, truck, van or company bus
4	Public bus
5	Train
6	Bicycle
7	Walked or jogged
8	Ferry
9	Other
10	Response unidentifiable
11	Not stated
12	Total stated
13	Total
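Note that `filter()` selects by label (column names by default), not by the values stored in the rows. A minimal sketch of the difference, using a stand-in frame called `travel` with a placeholder `count` column and an arbitrary threshold chosen only for illustration:

```python
import pandas as pd

# Stand-in data copied from this section; "count" is a placeholder column name.
travel = pd.DataFrame({
    "Main_means_of_travel_to_work": ["Public bus", "Train", "Bicycle", "Ferry"],
    "count": [103194, 48777, 47811, 6045],
})

# filter(items=...) keeps the columns whose names match exactly
only_modes = travel.filter(items=["Main_means_of_travel_to_work"])

# filter(like=...) keeps the columns whose names contain the given substring
travel_cols = travel.filter(like="travel")

# Filtering rows by value uses a boolean mask instead
busy = travel[travel["count"] > 40_000]
print(busy)
```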

As we can see in the above output, we don’t need the last two rows, “Total stated” and “Total”, because they are totals rather than individual means of travel.

We can easily remove them from our dataset with a single line of Python code.

new_df = df[0:12]
new_df

Now we have a new data frame where we don’t have the last two rows anymore.
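Slicing with fixed numbers works as long as the row count is known. As a hedged alternative, the sketch below drops the last two rows by position with `iloc`, and also by content, without hard-coding the length. The four-row frame `rows` is a stand-in built from labels in this section.

```python
import pandas as pd

# Stand-in data: the last four labels from the column shown above.
rows = pd.DataFrame({
    "Main_means_of_travel_to_work": ["Ferry", "Other", "Total stated", "Total"],
})

# By position: keep everything except the last two rows
by_position = rows.iloc[:-2]

# By content: keep the rows whose label does not start with "Total"
is_total = rows["Main_means_of_travel_to_work"].str.startswith("Total")
by_label = rows[~is_total].copy()  # .copy() gives an independent data frame

print(by_position)
print(by_label)
```

The content-based version keeps working if the rows are reordered, at the cost of assuming that no genuine category begins with "Total".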


Managing datasets with the help of Pandas library

As we can see, managing datasets becomes quite easy with Pandas.

In addition, if we want, we can work on the population column on its own.


new_df['Employed_census_usually_resident_population_count_aged_15_years_and_over']

# output
0      291135
1     1412994
2      274905
3       97584
4      103194
5       48777
6       47811
7      127350
8        6045
9       35343
10          0
11          0
Name: Employed_census_usually_resident_population_count_aged_15_years_and_over, dtype: int64
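Once a numerical column is isolated, ordinary arithmetic applies to it as a whole. The following sketch sums four of the counts above, converts them to percentage shares, and sorts them. The Series `counts` holds only four categories, so the shares are relative to those four and not to the whole census.

```python
import pandas as pd

# Four counts copied from the output above, indexed by their travel mode.
counts = pd.Series(
    [291135, 103194, 48777, 47811],
    index=["Worked at home", "Public bus", "Train", "Bicycle"],
    name="count",
)

total = counts.sum()

# Element-wise arithmetic: each count as a percentage of the four-category total
share = (counts / total * 100).round(1)

# Largest first
ranked = counts.sort_values(ascending=False)

print(total)
print(share)
print(ranked)
```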

Finally, we will look at a great feature of the Pandas library: we can create our own datasets from a Python dictionary.

It’s handy, and we can practice with such small datasets.

students = {
    'name' : ['Mary', 'John', 'Json', 'Catty', 'Ana'],
    'age' : [11, 8, 13, 14, 10],
    'class' : [5, 3, 6, 8, 4]
}

data_frame = pd.DataFrame(students)
data_frame

# output
    name	age	class
0	Mary	11	5
1	John	8	3
2	Json	13	6
3	Catty	14	8
4	Ana	    10	4
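A small data frame like this is convenient for practising the operations covered earlier. The sketch below rebuilds the same dictionary and applies a few of them; the variable names `oldest`, `mean_age`, and `older` are introduced here only for illustration.

```python
import pandas as pd

students = {
    'name': ['Mary', 'John', 'Json', 'Catty', 'Ana'],
    'age': [11, 8, 13, 14, 10],
    'class': [5, 3, 6, 8, 4],
}
data_frame = pd.DataFrame(students)

# Name of the oldest student: find the row label of the maximum age, then look up 'name'
oldest = data_frame.loc[data_frame['age'].idxmax(), 'name']

# Average age across all rows
mean_age = data_frame['age'].mean()

# Boolean mask: keep only the students older than 10
older = data_frame[data_frame['age'] > 10]

# 'class' is a Python keyword, so this column needs bracket access
# rather than attribute access.
print(data_frame['class'].tolist())
print(oldest, mean_age)
print(older)
```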

For more code from this machine learning primer series, please visit the respective GitHub repository.
