We have already seen how versatile the Pandas package is. In this section we’ll find how we can filter Pandas DataFrame.
We have been working with the Pandas GitHub repository of data.
The DataFrame is simple. Let’s see the code.
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/tips.csv")
df.shape
# output
(244, 7)
There are altogether 244 rows and 7 columns.
By the way, we refer to rows as entries also. In short, there are 244 entries.
We can take a look at the columns.
df.columns
# output
Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')
To understand the tabular data we can add an extra column and name it “index”.
As a result, we can say each row has an index number. Certainly that starts with 0.
df.reset_index(inplace=True)
df
# output
index total_bill tip sex smoker day time size
0 0 16.99 1.01 Female No Sun Dinner 2
1 1 10.34 1.66 Male No Sun Dinner 3
2 2 21.01 3.50 Male No Sun Dinner 3
3 3 23.68 3.31 Male No Sun Dinner 2
4 4 24.59 3.61 Female No Sun Dinner 4
... ... ... ... ... ... ... ... ...
239 239 29.03 5.92 Male No Sat Dinner 3
240 240 27.18 2.00 Female Yes Sat Dinner 2
241 241 22.67 2.00 Male Yes Sat Dinner 2
242 242 17.82 1.75 Male No Sat Dinner 2
243 243 18.78 3.00 Female No Thur Dinner 2
If we want to restrict the output to 5 entries or rows, we can use the head() method.
We have seen how to use the head() method before.
Now each “time” column has two entries. Either Dinner, or Lunch.
As a result, we get the following output.
df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/tips.csv", index_col='time')
pd.set_option('display.max_columns', 5)
pd.set_option('display.max_rows', 5)
df.head()
# output
total_bill tip ... day size
time
Dinner 16.99 1.01 ... Sun 2
Dinner 10.34 1.66 ... Sun 3
Dinner 21.01 3.50 ... Sun 3
Dinner 23.68 3.31 ... Sun 2
Dinner 24.59 3.61 ... Sun 4
5 rows × 6 columns
The “set_option()” method takes two arguments. We can pass the columns and as well as rows.
We can get more entries by increasing the number.
However, the output of the first five rows gives us a pattern which shows maximum people came for dinner.
This pattern finding helps us to analyse data for data science and machine learning.
We can filter more.
More filter options in Pandas DataFrame
Suppose we want to see whether the people who came for dinner were smokers or not.
Because we know there is one “smoker” column.
As a result, we can filter that data first.
After that we can see whether the value is true or false.
If the customer is a smoker, the output will be true. Otherwise it comes out as false.
Let’s see the code first.
After that we will discuss the relevant method.
filtering_smoker = df['smoker'].str.contains('No')
filtering_smoker
# output
time
Dinner True
Dinner True
...
Dinner True
Dinner True
Name: smoker, Length: 244, dtype: bool
filtering_smoker = df['smoker'].str.contains('Yes')
filtering_smoker
# output
time
Dinner False
Dinner False
...
Dinner False
Dinner False
Name: smoker, Length: 244, dtype: bool
Let’s try to make it simple. And break the code down.
Firstly, the “smoker” column has two inputs – Yes, or No.
Secondly, the method checks whether it is Yes, or No.
If Yes, it displays “True”. Else it displays “No”.
To filter data in Pandas DataFrame is as simple as this.
Finally, we get a pattern out of the output.
Most of the customers don’t smoke.
You can try it on Google Colab.
For more such code to practice, please visit the GitHub repository.
Leave a Reply