In our previous section we learned how to use the Pandas package in python. In this section we’ll learn to read and write tabular data.
Tabular data is nothing but a two dimensional array. We place them as rows and columns.
In any relational database, we get tabular data. Not only that, we can get it in any format.
CSV, Excel, JSON, SQL, and many more.
The Pandas package provides the read_csv() function to read data stored as a CSV file.
Not only the CSV file, with the prefix read_*, we can read any format that supports tabular data.
In the Pandas package, the DataFrame represents as follows.
![How Pandas represent data](https://i0.wp.com/io.sanjibsinha.com/wp-content/uploads/2022/07/How-Pandas-represent-data.webp?ssl=1)
The column goes vertically, and the rows go horizontally.
As a result, in tabular data, usually, the number of columns is less than the number of rows.
To start with we import the Pandas package.
import pandas as pd
We’ll read the Titanic data stored in Google, to analyse for data science and machine learning purposes.
titanic_train = pd.read_csv("https://storage.googleapis.com/tf-datasets/titanic/train.csv")
print(titanic_train)
As an output, the DataFrame will display the first and last 5 ros.
survived sex age n_siblings_spouses parch fare class \
0 0 male 22.0 1 0 7.2500 Third
1 1 female 38.0 1 0 71.2833 First
2 1 female 26.0 0 0 7.9250 Third
3 1 female 35.0 1 0 53.1000 First
4 0 male 28.0 0 0 8.4583 Third
.. ... ... ... ... ... ... ...
622 0 male 28.0 0 0 10.5000 Second
623 0 male 25.0 0 0 7.0500 Third
624 1 female 19.0 0 0 30.0000 First
625 0 female 28.0 1 2 23.4500 Third
626 0 male 32.0 0 0 7.7500 Third
deck embark_town alone
0 unknown Southampton n
1 C Cherbourg n
2 unknown Southampton y
3 C Southampton n
4 unknown Queenstown y
.. ... ... ...
622 unknown Southampton y
623 unknown Southampton y
624 B Southampton y
625 unknown Southampton n
626 unknown Queenstown y
[627 rows x 10 columns]
We can see there are 627 rows and 10 columns.
The above data is for training purposes. However, for evaluating we have got another set of data.
Therefore we can read that as well.
titanic_eval = pd.read_csv("https://storage.googleapis.com/tf-datasets/titanic/eval.csv")
print(titanic_eval)
Let’s see the output.
survived sex age n_siblings_spouses parch fare class \
0 0 male 35.0 0 0 8.0500 Third
1 0 male 54.0 0 0 51.8625 First
2 1 female 58.0 0 0 26.5500 First
3 1 female 55.0 0 0 16.0000 Second
4 1 male 34.0 0 0 13.0000 Second
.. ... ... ... ... ... ... ...
259 1 female 25.0 0 1 26.0000 Second
260 0 male 33.0 0 0 7.8958 Third
261 0 female 39.0 0 5 29.1250 Third
262 0 male 27.0 0 0 13.0000 Second
263 1 male 26.0 0 0 30.0000 First
deck embark_town alone
0 unknown Southampton y
1 E Southampton y
2 C Southampton y
3 unknown Southampton y
4 D Southampton y
.. ... ... ...
259 unknown Southampton n
260 unknown Southampton y
261 unknown Queenstown n
262 unknown Southampton y
263 C Cherbourg y
[264 rows x 10 columns]
Altogether, in the above output 264 rows and 10 columns are there.
first_five_rows = titanic_eval.head(5)
print(first_five_rows)
# output:
survived sex age n_siblings_spouses parch fare class \
0 0 male 35.0 0 0 8.0500 Third
1 0 male 54.0 0 0 51.8625 First
2 1 female 58.0 0 0 26.5500 First
3 1 female 55.0 0 0 16.0000 Second
4 1 male 34.0 0 0 13.0000 Second
deck embark_town alone
0 unknown Southampton y
1 E Southampton y
2 C Southampton y
3 unknown Southampton y
4 D Southampton y
We can read the first five rows and columns. To do that, we need to use the Pandas head() method. In addition, we need to pass the number as the argument.
For example we have wanted 5 rows and columns, so we have passed 5 as an argument.
We can use the head() method directly to read the data.
print(pd.read_csv("https://storage.googleapis.com/tf-datasets/titanic/train.csv").head(3))
# output:
survived sex age n_siblings_spouses parch fare class deck \
0 0 male 22.0 1 0 7.2500 Third unknown
1 1 female 38.0 1 0 71.2833 First C
2 1 female 26.0 0 0 7.9250 Third unknown
embark_town alone
0 Southampton n
1 Cherbourg n
2 Southampton y
Shape of the tabular data
Usually while displaying the DataFrame, it gives us the number of rows and columns.
But we can use the shape() method to get it directly from any tabular data.
data_frame_shape = pd.read_csv("https://storage.googleapis.com/tf-datasets/titanic/train.csv")
print(data_frame_shape.shape)
# output:
(627, 10)
How to read CSV file from GitHub using Pandas
The GitHub repository of Pandas package has numerous CSV files to read data.
Although we need to read from the raw data source.
Let’s take a look at the code below.
weather_quality_station = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_stations.csv")
print(weather_quality_station)
# output:
location coordinates.latitude coordinates.longitude
0 BELAL01 51.23619 4.38522
1 BELHB23 51.17030 4.34100
2 BELLD01 51.10998 5.00486
3 BELLD02 51.12038 5.02155
4 BELR833 51.32766 4.36226
.. ... ... ...
61 Southend-on-Sea 51.54420 0.67841
62 Southwark A2 Old Kent Road 51.48050 -0.05955
63 Thurrock 51.47707 0.31797
64 Tower Hamlets Roadside 51.52253 -0.04216
65 Groton Fort Griswold 41.35360 -72.07890
[66 rows x 3 columns]
We’ll discuss more on this topic. We’ll also see how we can use this data for machine learning and data science.
Leave a Reply