How to filter Pandas DataFrame

We have already seen how versatile the Pandas package is. In this section we’ll find how we can filter Pandas DataFrame. 

We have been working with the Pandas GitHub repository of data.

The DataFrame is simple. Let’s see the code.

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/tips.csv")

df.shape

# output
(244, 7)

There are altogether 244 rows and 7 columns. 

By the way, we refer to rows as entries also. In short, there are 244 entries.

We can take a look at the columns.


df.columns

# output
Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')

To understand the tabular data we can add an extra column and name it “index”. 

As a result, we can say each row has an index number. Certainly that starts with 0.

df.reset_index(inplace=True)
df

# output
  index	total_bill	tip	sex	smoker	day	time	size
0	0	16.99	1.01	Female	No	Sun	Dinner	2
1	1	10.34	1.66	Male	No	Sun	Dinner	3
2	2	21.01	3.50	Male	No	Sun	Dinner	3
3	3	23.68	3.31	Male	No	Sun	Dinner	2
4	4	24.59	3.61	Female	No	Sun	Dinner	4
...	...	...	...	...	...	...	...	...
239	239	29.03	5.92	Male	No	Sat	Dinner	3
240	240	27.18	2.00	Female	Yes	Sat	Dinner	2
241	241	22.67	2.00	Male	Yes	Sat	Dinner	2
242	242	17.82	1.75	Male	No	Sat	Dinner	2
243	243	18.78	3.00	Female	No	Thur Dinner	2

If we want to restrict the output to 5 entries or rows, we can use the head() method.

We have seen how to use the head() method before.

But before that we can filter the DataFrame in such a way so that the column “time” will come in the place of “index”.

Now each “time” column has two entries. Either Dinner, or Lunch.

As a result, we get the following output.

df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/tips.csv", index_col='time')

pd.set_option('display.max_columns', 5)
pd.set_option('display.max_rows', 5)

df.head()

# output
      total_bill tip	...	day	size

time					
Dinner	16.99	1.01	...	Sun	  2
Dinner	10.34	1.66	...	Sun	  3
Dinner	21.01	3.50	...	Sun	  3
Dinner	23.68	3.31	...	Sun	  2
Dinner	24.59	3.61	...	Sun	  4
5 rows × 6 columns

The “set_option()” method takes two arguments. We can pass the columns and as well as rows. 

But there is another advantage. Because along with that, we can also pass the numbers of the columns and rows.

We can get more entries by increasing the number.

However, the output of the first five rows gives us a pattern which shows maximum people came for dinner.

This pattern finding helps us to analyse data for data science and machine learning. 

We can filter more. 

More filter options in Pandas DataFrame

Suppose we want to see whether the people who came for dinner were smokers or not.

Because we know there is one “smoker” column.

As a result, we can filter that data first.

After that we can see whether the value is true or false.

If the customer is a smoker, the output will be true. Otherwise it comes out as false.

Let’s see the code first.

After that we will discuss the relevant method.

filtering_smoker = df['smoker'].str.contains('No')

filtering_smoker
# output
time
Dinner    True
Dinner    True
          ... 
Dinner    True
Dinner    True
Name: smoker, Length: 244, dtype: bool

filtering_smoker = df['smoker'].str.contains('Yes')

filtering_smoker
# output
time
Dinner    False
Dinner    False
          ...  
Dinner    False
Dinner    False
Name: smoker, Length: 244, dtype: bool

Let’s try to make it simple. And break the code down.

Firstly, the “smoker” column has two inputs – Yes, or No.

Secondly, the method checks whether it is Yes, or No.

If Yes, it displays “True”. Else it displays “No”.

To filter data in Pandas DataFrame is as simple as this.

Finally, we get a pattern out of the output.

Most of the customers don’t smoke.

You can try it on Google Colab.

For more such code to practice, please visit the GitHub repository.

What Next?

Books at Leanpub

Books in Apress

My books at Amazon

GitHub repository

Flutter, Dart and Algorithm

Twitter

Comments

Leave a Reply