Pandas: Index & datetime

Unit 06

The Index

When working with python lists or NumPy arrays, we accessed individual elements of the sequence by their integer location or index, like

import numpy as np
A = np.arange(6)
A[3]
3

In Pandas DataFrames,

import pandas as pd
DF = pd.DataFrame({'col1':[1, 2, 3], 'col2':['a', 'b', 'c'], 'col3':['row1', 'row2', 'row3']})
DF
col1 col2 col3
0 1 a row1
1 2 b row2
2 3 c row3

we access different columns by their names, like

DF['col1']
0    1
1    2
2    3
Name: col1, dtype: int64

Think of it as “row names”

So, what’s the index? An index is a data structure that provides an identifier for each row or observation in a DataFrame or Series. In the above example, the default index of our DataFrame, the very first column to the left without a name, is an integer array starting with 0. That’s neither new nor surprising, we know that from Numpy arrays already. However it’s new, that in Pandas DataFrames, we can make any column the index:

DF = DF.set_index('col3')
DF
col1 col2
col3
row1 1 a
row2 2 b
row3 3 c

Now we can access the row by it’s name

DF.loc['row2', 'col2']
'b'

So, a DataFrame has an index and several columns. If you extract a column, you get a Series. The Series has the same index as the DataFrame and the actual values of the column. In fact, the (numeric) values of the column are nothing else then a numpy array. The index and values are attributes of the DataFrame or Series, so you can

print(DF.index)
print(DF['col2'].values)
Index(['row1', 'row2', 'row3'], dtype='object', name='col3')
['a' 'b' 'c']

Alignment on the index

When working with multiple DataFrames or Series, the index can be very helpful to align corresponding rows. However, it can also cause confusion if you are not aware of the index. Check out this:

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([2, 3, 4], index=['b', 'c', 'd'])
s1 - s2
a    NaN
b    0.0
c    0.0
d    NaN
dtype: float64

Oops. We wanted to subtract two Series with three elements. We probably expected to get a result like -1, -1, -1. However, the Series had shifted indices, and Pandas aligned the rows, so that the result contained four elements, one for each index. When both indices existed, pandas solved the math, and in the other case it created a ‘NaN’.

Datetime

In almost all (larger) data sets, time plays some sort of a role. So knowing how to work with time series data efficiently will save you a lot of headache in the future. It’s super important. And the good news is, Pandas made it super simple!

Create datetime

Pandas has a datetime data type which contains all sort of date and time components. You can create a DatetimeIndex very easily yourself and initialize a DataFrame that contains a datetime index like so

datetime = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
DF = pd.DataFrame(index=datetime)

The datetime information does not have to be the index, but it often makes a lot of sense.

Make sure to use this opportunity to read up on handling time series data with ease if you have not done so last week. Furthermore, have a look at that overview table for time series frequencies. Above, I used freq='D' to denote daily sampling.

Parse datetime from spreadsheet

If you read spreadsheets with Pandas, they will often contain a useful datetime column. By default, Pandas will read these column values as strings, not as datetimes. Make sure you tell Pandas which column does contain datetime info with the argument parse_dates = ["name-of-datetime-column"] during your read_*() function call. Alternatively, you can convert the relevant column to datetime at any time by DF["name-of-datetime-column"] = pd.to_datetime(DF["name-of-datetime-column"]).

Resampling

A powerful feature of datetimes is that you can super easily resample them. Consider this example using above’s DataFrame that was created with daily sampling and the datetime as index

# Create column with some data
DF['count'] = np.arange(1, 366)
# Resample column to monthly sampling
DF['count'].resample('M')
<pandas.core.resample.DatetimeIndexResampler object at 0x7f174293dc10>

Only resampling alone does not yield anything interesting: A Pandas object that we can’t look at meaningfully. That makes sense though, because we need to tell Pandas how to process the resampled values. Let’s take the sum of count at the end of each month:

DF['count'].resample('M').sum()
2023-01-31      496
2023-02-28     1274
2023-03-31     2325
2023-04-30     3165
2023-05-31     4216
2023-06-30     4995
2023-07-31     6107
2023-08-31     7068
2023-09-30     7755
2023-10-31     8959
2023-11-30     9585
2023-12-31    10850
Freq: M, Name: count, dtype: int64

You can also upsample, like that

DF['count'].resample('H').mean()
2023-01-01 00:00:00      1.0
2023-01-01 01:00:00      NaN
2023-01-01 02:00:00      NaN
2023-01-01 03:00:00      NaN
2023-01-01 04:00:00      NaN
                       ...  
2023-12-30 20:00:00      NaN
2023-12-30 21:00:00      NaN
2023-12-30 22:00:00      NaN
2023-12-30 23:00:00      NaN
2023-12-31 00:00:00    365.0
Freq: H, Name: count, Length: 8737, dtype: float64

For all datetimes that didn’t exist before, Pandas inserts a row, but does not know a value to fill. So it puts NaN, or when you upsample with the .sum() method it puts a 0.

If you want to perform further tasks with your resampled Series, store it as variable, e.g., monthly_sum = DF['count'].resample('M').sum().

Accessing datetime components

Although the datetime object contains all the components, you can always just extract those you are interested in:

print(DF.index[35])
print(DF.index[35].month)
print(DF.index.weekday.unique())
2023-02-05 00:00:00
2
Index([6, 0, 1, 2, 3, 4, 5], dtype='int32')

Have a look at this overview table for datetime components to familiarize yourself with your many options.

If you work with a datetime object that is not the index, you need to access the components slightly differently:

DF['datetime'] = DF.index
DF['datetime'].dt.year.unique()
array([2023], dtype=int32)

Slicing

You can use the datetime to extract slices of your data like

DF.loc['2023-01-01':'2023-01-03', 'count']
2023-01-01    1
2023-01-02    2
2023-01-03    3
Freq: D, Name: count, dtype: int64

Be aware that the slicing is inclusive with the loc operator, and exclusive with the iloc operator (just like in NumPy).

Time durations

To handle time offsets, Pandas has another data type Timedelta.

dt = pd.Timedelta(hours=12)
print(dt)
print(DF.index[0])
print(DF.index[0] + dt)
0 days 12:00:00
2023-01-01 00:00:00
2023-01-01 12:00:00

Handling Time Zones

And finally, if you need a datetime with an associated timezone, use the Timestamp.

pd.Timestamp('2023-01-01 12:00:00', tz='UTC')
Timestamp('2023-01-01 12:00:00+0000', tz='UTC')

External resources

The following links are referenced in the text above:

Learning checklist

  • I know that the index plays an important role in Pandas Series and DataFrames.
  • I know what the index is, how to set it, and how to access it.
  • I understand the concept of alignment on the index when working with multiple Series/DataFrames.
  • I am aware that working with dates and times is crucial for many projects.
  • I know how to work with datetime objects in Pandas, how to read them from file, convert them after reading, or extract individual datetime components.
  • I can create a DatetimeIndex from scratch, i.e., by knowing start and end dates as well as a sampling frequency.
  • I can manipulate my Pandas objects by resampling the DatetimeIndex (and potentially computing summary statistics on the resampled object).
  • I can handle time differences with the Timedelta data type. I know how to add/subtract Timedeltas to Datetimes.