Pandas: Tabular data

Unit 05

Working with tabular data

Pandas is one of the most widely used Python libraries for science, data science, and machine learning. It excels at manipulating tabular data, i.e. the kind of data you would find in a spreadsheet, a CSV file, or a weather station record.

Pandas is a fantastic library: it will take you minutes to grasp the fundamentals, and years to master it: this is what I like most about it ;-).

Fabien Maussion

Getting started with Pandas

For today, I strongly recommend following the tutorials on the "Getting started" page of the pandas documentation. Make sure you have looked at assignment #5 first, so you know where to pay close attention and where a lighter read will do. Another great idea is to solve the assignment tasks while you work your way through the tutorials. Use the learning checklist below to gauge whether you have covered the ground.

Focus on the following:

Numpy vs. Pandas

You might ask yourself whether you need to know both Numpy and Pandas. The simple answer is yes! Numpy and Pandas are two essential Python libraries for data manipulation and analysis, but they serve slightly different purposes. Numpy focuses on efficient numerical computation and provides multi-dimensional arrays and matrices. It is the foundation of many other libraries, including Pandas. Pandas builds on Numpy's capabilities by offering data structures that are more intuitive and easier to work with for labeled, tabular data (like Excel sheets). As a result, Pandas excels at handling real-world datasets that require organizing, cleaning, and analyzing diverse data types.

In fact, let’s have a look at a Pandas DataFrame and confirm that the underlying arrays are actually Numpy arrays:

import pandas as pd
df = pd.DataFrame(
  {
    "Country": ["Austria", "Germany", "Switzerland"],
    "Size"   : [83e3, 257e3, 41e3]
  }
)

df["Size"]

0     83000.0
1    257000.0
2     41000.0
Name: Size, dtype: float64

type(df["Size"])

pandas.core.series.Series

df["Size"].values

array([ 83000., 257000.,  41000.])

type(df["Size"].values)

numpy.ndarray

We see that the values attribute of a Pandas Series is actually a Numpy array. However, the pandas documentation recommends the to_numpy() method over the values attribute, so we will use to_numpy() whenever we specifically need the underlying Numpy array:

df["Size"].to_numpy()

array([ 83000., 257000.,  41000.])
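
The relationship between the two libraries can be sketched a little further. The snippet below is a minimal illustration, not part of the original notes: it recreates the df from above so that it runs on its own, and the names log_size and raw are made up for this example. It shows that a Numpy function applied to a Series returns a Series that keeps the index, while to_numpy() hands back the bare array without labels.

```python
import numpy as np
import pandas as pd

# Same example table as above, recreated so this block is self-contained
df = pd.DataFrame(
    {
        "Country": ["Austria", "Germany", "Switzerland"],
        "Size": [83e3, 257e3, 41e3],
    }
)

# Numpy functions accept a Series and return a Series with the same index
log_size = np.log10(df["Size"])

# to_numpy() returns the plain Numpy array, without index or name
raw = df["Size"].to_numpy()

print(type(log_size))  # a pandas Series
print(type(raw))       # a numpy ndarray
print(raw.mean())      # plain Numpy arithmetic on the values
```

In practice you can usually stay in Pandas and let it call Numpy for you. Reaching for to_numpy() is mostly useful when another library expects a plain array.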

Learning checklist

I know what Pandas Series and DataFrames are.
I can create Series and DataFrames or read tabular data from CSV files.
I can subset specific column(s) and/or row(s) of DataFrames using the loc/iloc indexers.
I can do simple calculations with DataFrame subsets and create new columns.
I can do basic data exploration of DataFrames and subsets using pandas methods related to summary statistics.
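
As a rough self-check, the checklist items can be walked through in a few lines. This is only a sketch under assumptions: the data reuse the small example table from above, the CSV content is an in-memory string standing in for a real file on disk, and all variable names (sizes, large, first_size, Size_thousands, stats) are illustrative.

```python
import io

import pandas as pd

# A Series is a labeled one-dimensional array
sizes = pd.Series(
    [83e3, 257e3, 41e3],
    index=["Austria", "Germany", "Switzerland"],
    name="Size",
)

# A DataFrame is a table of columns that share one index
df = pd.DataFrame(
    {
        "Country": ["Austria", "Germany", "Switzerland"],
        "Size": [83e3, 257e3, 41e3],
    }
)

# Reading CSV data: the string below stands in for a file path
csv_text = "Country,Size\nAustria,83000\nGermany,257000\nSwitzerland,41000\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# loc selects by label or boolean mask: countries with Size above 50e3
large = df.loc[df["Size"] > 50e3, "Country"]

# iloc selects by integer position: first row, second column
first_size = df.iloc[0, 1]

# A simple calculation stored as a new column
df["Size_thousands"] = df["Size"] / 1000

# Summary statistics for one column
stats = df["Size"].describe()

print(large.tolist())
print(first_size)
print(stats["mean"])
```

If each of these lines makes sense to you, you have covered most of the checklist. The tutorials and the assignment go into more depth on every step.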