How do I read and write tabular data?

In our previous section we learned how to use the Pandas package in python. In this section we’ll learn to read and write tabular data. 

Tabular data is nothing but a two dimensional array. We place them as rows and columns. 

In any relational database, we get tabular data. Not only that, we can get it in any format. 

CSV, Excel, JSON, SQL, and many more.

The Pandas package provides the read_csv() function to read data stored as a CSV file.

Not only the CSV file, with the prefix read_*, we can read any format that supports tabular data.

In the Pandas package, the DataFrame represents as follows.

How Pandas represent data
How Pandas represent data

The column goes vertically, and the rows go horizontally.

As a result, in tabular data, usually, the number of columns is less than the number of rows.

To start with we import the Pandas package.

import pandas as pd

We’ll read the Titanic data stored in Google, to analyse for data science and machine learning purposes.

titanic_train = pd.read_csv("https://storage.googleapis.com/tf-datasets/titanic/train.csv")

print(titanic_train)

As an output, the DataFrame will display the first and last 5 ros.

     survived     sex   age  n_siblings_spouses  parch     fare   class  \
0           0    male  22.0                   1      0   7.2500   Third   
1           1  female  38.0                   1      0  71.2833   First   
2           1  female  26.0                   0      0   7.9250   Third   
3           1  female  35.0                   1      0  53.1000   First   
4           0    male  28.0                   0      0   8.4583   Third   
..        ...     ...   ...                 ...    ...      ...     ...   
622         0    male  28.0                   0      0  10.5000  Second   
623         0    male  25.0                   0      0   7.0500   Third   
624         1  female  19.0                   0      0  30.0000   First   
625         0  female  28.0                   1      2  23.4500   Third   
626         0    male  32.0                   0      0   7.7500   Third   

        deck  embark_town alone  
0    unknown  Southampton     n  
1          C    Cherbourg     n  
2    unknown  Southampton     y  
3          C  Southampton     n  
4    unknown   Queenstown     y  
..       ...          ...   ...  
622  unknown  Southampton     y  
623  unknown  Southampton     y  
624        B  Southampton     y  
625  unknown  Southampton     n  
626  unknown   Queenstown     y  

[627 rows x 10 columns]

We can see there are 627 rows and 10 columns.

The above data is for training purposes. However, for evaluating we have got another set of data.

Therefore we can read that as well. 

titanic_eval = pd.read_csv("https://storage.googleapis.com/tf-datasets/titanic/eval.csv")

print(titanic_eval)

Let’s see the output.

     survived     sex   age  n_siblings_spouses  parch     fare   class  \
0           0    male  35.0                   0      0   8.0500   Third   
1           0    male  54.0                   0      0  51.8625   First   
2           1  female  58.0                   0      0  26.5500   First   
3           1  female  55.0                   0      0  16.0000  Second   
4           1    male  34.0                   0      0  13.0000  Second   
..        ...     ...   ...                 ...    ...      ...     ...   
259         1  female  25.0                   0      1  26.0000  Second   
260         0    male  33.0                   0      0   7.8958   Third   
261         0  female  39.0                   0      5  29.1250   Third   
262         0    male  27.0                   0      0  13.0000  Second   
263         1    male  26.0                   0      0  30.0000   First   

        deck  embark_town alone  
0    unknown  Southampton     y  
1          E  Southampton     y  
2          C  Southampton     y  
3    unknown  Southampton     y  
4          D  Southampton     y  
..       ...          ...   ...  
259  unknown  Southampton     n  
260  unknown  Southampton     y  
261  unknown   Queenstown     n  
262  unknown  Southampton     y  
263        C    Cherbourg     y  

[264 rows x 10 columns]

Altogether, in the above output 264 rows and 10 columns are there.

first_five_rows = titanic_eval.head(5)

print(first_five_rows)

# output:
   survived     sex   age  n_siblings_spouses  parch     fare   class  \
0         0    male  35.0                   0      0   8.0500   Third   
1         0    male  54.0                   0      0  51.8625   First   
2         1  female  58.0                   0      0  26.5500   First   
3         1  female  55.0                   0      0  16.0000  Second   
4         1    male  34.0                   0      0  13.0000  Second   

      deck  embark_town alone  
0  unknown  Southampton     y  
1        E  Southampton     y  
2        C  Southampton     y  
3  unknown  Southampton     y  
4        D  Southampton     y 

We can read the first five rows and columns. To do that, we need to use the Pandas head() method. In addition, we need to pass the number as the argument.

For example we have wanted 5 rows and columns, so we have passed 5 as an argument.

We can use the head() method directly to read the data.

print(pd.read_csv("https://storage.googleapis.com/tf-datasets/titanic/train.csv").head(3))

# output:
   survived     sex   age  n_siblings_spouses  parch     fare  class     deck  \
0         0    male  22.0                   1      0   7.2500  Third  unknown   
1         1  female  38.0                   1      0  71.2833  First        C   
2         1  female  26.0                   0      0   7.9250  Third  unknown   

   embark_town alone  
0  Southampton     n  
1    Cherbourg     n  
2  Southampton     y  

Shape of the tabular data

Usually while displaying the DataFrame, it gives us the number of rows and columns.

But we can use the shape() method to get it directly from any tabular data.

data_frame_shape = pd.read_csv("https://storage.googleapis.com/tf-datasets/titanic/train.csv")

print(data_frame_shape.shape)

# output:
(627, 10)

How to read CSV file from GitHub using Pandas

The GitHub repository of Pandas package has numerous CSV files to read data. 

Although we need to read from the raw data source.

Let’s take a look at the code below.

weather_quality_station = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_stations.csv")

print(weather_quality_station)

# output:
                      location  coordinates.latitude  coordinates.longitude
0                      BELAL01              51.23619                4.38522
1                      BELHB23              51.17030                4.34100
2                      BELLD01              51.10998                5.00486
3                      BELLD02              51.12038                5.02155
4                      BELR833              51.32766                4.36226
..                         ...                   ...                    ...
61             Southend-on-Sea              51.54420                0.67841
62  Southwark A2 Old Kent Road              51.48050               -0.05955
63                    Thurrock              51.47707                0.31797
64      Tower Hamlets Roadside              51.52253               -0.04216
65        Groton Fort Griswold              41.35360              -72.07890

[66 rows x 3 columns]

We’ll discuss more on this topic. We’ll also see how we can use this data for machine learning and data science.

What Next?

Books at Leanpub

Books in Apress

My books at Amazon

GitHub repository

Flutter, Dart and Algorithm

Twitter

Comments

4 responses to “How do I read and write tabular data?”

  1. […] In the last section, we have seen how we can read tabular data in Pandas. Besides reading, writing tabular data in Pandas is also easy. […]

  2. […] As we’ve downloaded the csv file in the same folder, we can use the Pandas read_csv() method. We’ve learned how to read and write tabular data using the Pandas package. […]

  3. […] In addition, we have seen how DataFrame works. […]

  4. […] with the help of the “iloc” method, we can select rows and columns of the tabular data at specific integer […]

Leave a Reply