Pandas and machine learning have become almost synonymous. In fact, Pandas is one of the libraries we use most often in machine learning.
In our previous sections we have seen how to import different libraries like Pandas, scikit-learn, Matplotlib, and NumPy.
In this section, however, we will start with Pandas only. We will read and operate on data that we have downloaded from this link.
There you will find many CSV files to download.
The data gives us an insight into how people travel to work in New Zealand.
Let’s read the dataset first.
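import pandas as pd
import numpy as np
from sklearn import datasets, linear_model, model_selection
import matplotlib.pyplot as plt
# Read the 2018 census CSV from the GitHub repository into a data frame
df = pd.read_csv('https://raw.githubusercontent.com/sanjibsinha/Machine-Learning-Primer/main/main-means-of-travel-to-work-2018-census-csv.csv')
# In a notebook, a bare variable name on the last line displays the table
df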
It produces a neat data table.
The table lists the different means of transportation.
From it we get an idea of how people travel to work in New Zealand.
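Before going further, it often helps to check the size and the column types of a freshly loaded table. The following is only a minimal sketch: to run without a network connection it assumes a small hand-built data frame called `sample` (a hypothetical name), with four rows typed in from the outputs shown later in this article instead of read from the CSV.

```python
import pandas as pd

MODE_COL = 'Main_means_of_travel_to_work'
COUNT_COL = 'Employed_census_usually_resident_population_count_aged_15_years_and_over'

# Assumption: four rows typed in by hand, not the full census file
sample = pd.DataFrame({
    MODE_COL: ['Worked at home', 'Public bus', 'Train', 'Bicycle'],
    COUNT_COL: [291135, 103194, 48777, 47811],
})

print(sample.shape)             # (number of rows, number of columns)
print(sample.columns.tolist())  # column names as a plain Python list
print(sample.dtypes)            # data type of each column
print(sample.head(2))           # only the first two rows
```

The same calls should work on the full `df` once it has been read with `pd.read_csv`.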
However, the Pandas library does more than read datasets.
It can do a great many things that help data scientists and machine learning engineers.
Let’s see some more code to get an idea of the power of the Pandas library.
df['Main_means_of_travel_to_work']
Here we simply pass the column name to the data frame we have just created.
It gives us that single column as output.
0 Worked at home
1 Drove a private car, truck or van
2 Drove a company car, truck or van
3 Passenger in a car, truck, van or company bus
4 Public bus
5 Train
6 Bicycle
7 Walked or jogged
8 Ferry
9 Other
10 Response unidentifiable
11 Not stated
12 Total stated
13 Total
Name: Main_means_of_travel_to_work, dtype: object
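The `Name:` and `dtype:` footer in the output above tells us that a single column comes back as a Pandas Series, not as a data frame. As a small sketch, assuming a hypothetical three-row `sample` frame typed in by hand, we can compare single and double square brackets:

```python
import pandas as pd

MODE_COL = 'Main_means_of_travel_to_work'

# Assumption: three labels typed in by hand for illustration
sample = pd.DataFrame({MODE_COL: ['Worked at home', 'Public bus', 'Train']})

as_series = sample[MODE_COL]    # one column name -> Series
as_frame = sample[[MODE_COL]]   # list of column names -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

The double-bracket form is useful when a later step expects a two-dimensional table, even if it has only one column.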
If a column is numerical, we can get its maximum or minimum value quite easily.
df['Employed_census_usually_resident_population_count_aged_15_years_and_over'].max()
df['Employed_census_usually_resident_population_count_aged_15_years_and_over'].min()
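The maximum and minimum values alone do not tell us which means of travel they belong to. One possible way to find out is `idxmax()` and `idxmin()`, which return the row label of the largest and smallest value. The sketch below assumes a hypothetical four-row `sample` frame with counts copied from the outputs in this article. Note that the full table also contains total rows, so on the unsliced `df` the maximum would presumably be a total and not a single means of travel.

```python
import pandas as pd

MODE_COL = 'Main_means_of_travel_to_work'
COUNT_COL = 'Employed_census_usually_resident_population_count_aged_15_years_and_over'

# Assumption: four rows typed in by hand, without the total rows
sample = pd.DataFrame({
    MODE_COL: ['Worked at home', 'Drove a private car, truck or van',
               'Public bus', 'Ferry'],
    COUNT_COL: [291135, 1412994, 103194, 6045],
})

largest_row = sample[COUNT_COL].idxmax()   # row label of the largest count
smallest_row = sample[COUNT_COL].idxmin()  # row label of the smallest count

print(sample.loc[largest_row, MODE_COL])
print(sample.loc[smallest_row, MODE_COL])
```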
By the way, selecting columns by name with the filter() method is quite simple.
df.filter(items=['Main_means_of_travel_to_work'])
The Pandas library gives us a nice output. Moreover, it gives us the kind of insight into the data that we always need in machine learning.
Main_means_of_travel_to_work
0 Worked at home
1 Drove a private car, truck or van
2 Drove a company car, truck or van
3 Passenger in a car, truck, van or company bus
4 Public bus
5 Train
6 Bicycle
7 Walked or jogged
8 Ferry
9 Other
10 Response unidentifiable
11 Not stated
12 Total stated
13 Total
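It is worth keeping in mind that `filter()` selects by label (here, by column name) and does not look at the values. To keep only the rows that meet a condition, a common approach is a boolean mask. The sketch below shows both, assuming a hypothetical four-row `sample` frame and an arbitrary threshold of 100000 chosen only for illustration.

```python
import pandas as pd

MODE_COL = 'Main_means_of_travel_to_work'
COUNT_COL = 'Employed_census_usually_resident_population_count_aged_15_years_and_over'

# Assumption: four rows typed in by hand for illustration
sample = pd.DataFrame({
    MODE_COL: ['Worked at home', 'Public bus', 'Train', 'Bicycle'],
    COUNT_COL: [291135, 103194, 48777, 47811],
})

# filter() works on labels: keep columns whose name contains 'travel'
only_mode = sample.filter(like='travel')

# A boolean mask works on values: keep rows with more than 100000 people
# (the threshold is an arbitrary choice for this example)
busy = sample[sample[COUNT_COL] > 100000]

print(only_mode.columns.tolist())
print(busy[MODE_COL].tolist())
```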
In the column output, we can see that we don’t need the last two rows, 'Total stated' and 'Total'.
We can easily drop them from our dataset with a simple slice.
new_df = df[0:12]
new_df
Now we have a new data frame that no longer contains the last two rows.
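The slice `df[0:12]` depends on the table having exactly 14 rows. As a sketch of two alternatives, we could count from the end with `iloc`, or drop the rows by their label. The example assumes a frame that holds only the label column, typed in from the output shown earlier.

```python
import pandas as pd

MODE_COL = 'Main_means_of_travel_to_work'

# Assumption: only the label column, typed in from the output in this article
df = pd.DataFrame({MODE_COL: [
    'Worked at home', 'Drove a private car, truck or van',
    'Drove a company car, truck or van',
    'Passenger in a car, truck, van or company bus',
    'Public bus', 'Train', 'Bicycle', 'Walked or jogged', 'Ferry', 'Other',
    'Response unidentifiable', 'Not stated', 'Total stated', 'Total',
]})

# Drop the last two rows, whatever the length of the table
by_position = df.iloc[:-2]

# Drop every row whose label starts with 'Total'
by_label = df[~df[MODE_COL].str.startswith('Total')]

print(len(by_position), len(by_label))
```

The label-based version does not rely on the total rows sitting at the end of the table.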
As we see, managing datasets becomes quite easy with Pandas.
In addition, if we want, we can work on the population column on its own.
new_df['Employed_census_usually_resident_population_count_aged_15_years_and_over']
# output
0 291135
1 1412994
2 274905
3 97584
4 103194
5 48777
6 47811
7 127350
8 6045
9 35343
10 0
11 0
Name: Employed_census_usually_resident_population_count_aged_15_years_and_over, dtype: int64
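With the total rows gone, one thing we might do with this column is turn the raw counts into percentage shares. This is just a sketch: the twelve rows are typed in from the outputs above, and `share_percent` is a hypothetical column name introduced for the example.

```python
import pandas as pd

MODE_COL = 'Main_means_of_travel_to_work'
COUNT_COL = 'Employed_census_usually_resident_population_count_aged_15_years_and_over'

# Rows 0 to 11, typed in from the outputs shown in this article
new_df = pd.DataFrame({
    MODE_COL: [
        'Worked at home', 'Drove a private car, truck or van',
        'Drove a company car, truck or van',
        'Passenger in a car, truck, van or company bus',
        'Public bus', 'Train', 'Bicycle', 'Walked or jogged', 'Ferry',
        'Other', 'Response unidentifiable', 'Not stated',
    ],
    COUNT_COL: [291135, 1412994, 274905, 97584, 103194, 48777,
                47811, 127350, 6045, 35343, 0, 0],
})

total = new_df[COUNT_COL].sum()
share = (new_df[COUNT_COL] / total * 100).round(1)

# share_percent is a hypothetical column name used only in this example
summary = new_df.assign(share_percent=share).sort_values(COUNT_COL, ascending=False)
print(summary[[MODE_COL, 'share_percent']].head(3))
```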
Finally, we will see a great feature of the Pandas library: we can create datasets of our own from a Python dictionary.
It’s handy, and we can practice with such small datasets.
students = {
    'name': ['Mary', 'John', 'Json', 'Catty', 'Ana'],
    'age': [11, 8, 13, 14, 10],
    'class': [5, 3, 6, 8, 4]
}
data_frame = pd.DataFrame(students)
data_frame
# output
    name  age  class
0   Mary   11      5
1   John    8      3
2   Json   13      6
3  Catty   14      8
4    Ana   10      4
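A small data frame like this is a convenient place to practice the operations we used on the census data. The sketch below rebuilds the same dictionary and tries a row filter, a sort, and a mean. The variable names `older`, `by_age`, and `mean_age` are only illustrative.

```python
import pandas as pd

students = {
    'name': ['Mary', 'John', 'Json', 'Catty', 'Ana'],
    'age': [11, 8, 13, 14, 10],
    'class': [5, 3, 6, 8, 4],
}
data_frame = pd.DataFrame(students)

older = data_frame[data_frame['age'] > 10]  # rows where age is above 10
by_age = data_frame.sort_values('age')      # youngest student first
mean_age = data_frame['age'].mean()         # average age of all students

print(older['name'].tolist())
print(by_age['name'].tolist())
print(mean_age)
```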
For more code from this machine learning primer series, please visit the respective GitHub repository.