Unit 06

When working with Python lists or NumPy arrays, we accessed individual elements of the sequence by their integer location or index, like

import numpy as np
A = np.arange(6)
A[3]

np.int64(3)
Pandas DataFrames, on the other hand, work a little differently:
import pandas as pd
DF = pd.DataFrame({'col1':[1, 2, 3], 'col2':['a', 'b', 'c'], 'col3':['row1', 'row2', 'row3']})
DF
|   | col1 | col2 | col3 |
|---|------|------|------|
| 0 | 1    | a    | row1 |
| 1 | 2    | b    | row2 |
| 2 | 3    | c    | row3 |
Here, we access the different columns by their names.
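For example, selecting col2 of the DataFrame above returns that column:

DF['col2']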
So, what’s the index? An index is a data structure that provides an identifier for each row or observation in a DataFrame or Series. In the above example, the default index of our DataFrame, the very first column to the left without a name, is an integer array starting with 0. That’s neither new nor surprising; we know that from NumPy arrays already. What is new, however, is that in Pandas DataFrames we can make any column the index:
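For example, we can turn col3 into the index; the choice of column is an assumption here, made so that the row labels 'row1' to 'row3' used below become available:

# Make 'col3' the new index; its values become the row labels
DF = DF.set_index('col3')
DF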
Now we can access a row by its name:
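For instance, selecting the row labelled 'row2' with the loc operator:

DF.loc['row2']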
So, a DataFrame has an index and several columns. If you extract a column, you get a Series. The Series has the same index as the DataFrame and contains the actual values of the column. In fact, the (numeric) values of the column are nothing other than a NumPy array. The index and values are attributes of the DataFrame or Series, so you can access them directly:
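A quick look at these attributes for the DataFrame above:

DF.index           # the row labels ('row1', 'row2', 'row3')
DF['col1'].values  # the underlying NumPy array of the column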
When working with multiple DataFrames or Series, the index can be very helpful to align corresponding rows. However, it can also cause confusion if you are not aware of the index. Check out this:
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([2, 3, 4], index=['b', 'c', 'd'])
s1 - s2
a NaN
b 0.0
c 0.0
d NaN
dtype: float64
Oops. We wanted to subtract two Series with three elements each, so we probably expected a result like -1, -1, -1. However, the two Series had shifted indices, and Pandas aligned the rows by index label, so the result contains four elements, one for each label. Where a label exists in both Series, Pandas does the math; where it exists in only one, it inserts NaN.
In almost all (larger) data sets, time plays some sort of a role. So knowing how to work with time series data efficiently will save you a lot of headache in the future. It’s super important. And the good news is, Pandas made it super simple!
Pandas has a datetime data type which contains all sorts of date and time components. You can easily create a DatetimeIndex yourself and initialize a DataFrame that uses it as its index, like so:
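A sketch of such a setup; the concrete dates are an assumption, chosen to match the daily, year-2023 index that the resampling examples below rely on:

# A daily DatetimeIndex covering the year 2023 (365 entries)
dti = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
# Initialize a DataFrame that uses this DatetimeIndex as its index
DF = pd.DataFrame(index=dti)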
The datetime information does not have to be the index, but it often makes a lot of sense.
Make sure to use this opportunity to read up on handling time series data with ease if you have not done so last week. Furthermore, have a look at that overview table for time series frequencies. Above, I used freq='D' to denote daily sampling.
If you read spreadsheets with Pandas, they will often contain a useful datetime column. By default, Pandas reads these column values as strings, not as datetimes. Make sure you tell Pandas which column contains datetime info with the argument parse_dates=["name-of-datetime-column"] during your read_*() function call. Alternatively, you can convert the relevant column to datetime at any time with DF["name-of-datetime-column"] = pd.to_datetime(DF["name-of-datetime-column"]).
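For example (the file name measurements.csv and the column name timestamp are just placeholders for illustration):

DF2 = pd.read_csv('measurements.csv', parse_dates=['timestamp'])
# DF2['timestamp'] now has dtype datetime64[ns] instead of object (string)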
A powerful feature of datetimes is that you can easily resample them. Consider this example, using the DataFrame from above that was created with daily sampling and the datetime as its index:
# Create column with some data
DF['count'] = np.arange(1, 366)
# Resample column to monthly sampling
DF['count'].resample('ME')   # 'ME' = month-end frequency (the older alias 'M' is deprecated)
<pandas.core.resample.DatetimeIndexResampler object at 0x7e9bc0108990>
Resampling alone does not yield anything interesting: just a Pandas resampler object that we can't look at meaningfully. That makes sense, though, because we still need to tell Pandas how to aggregate the resampled values. Let's take the sum of count at the end of each month:
DF['count'].resample('ME').sum()
2023-01-31 496
2023-02-28 1274
2023-03-31 2325
2023-04-30 3165
2023-05-31 4216
2023-06-30 4995
2023-07-31 6107
2023-08-31 7068
2023-09-30 7755
2023-10-31 8959
2023-11-30 9585
2023-12-31 10850
Freq: ME, Name: count, dtype: int64
You can also upsample, like this:
DF['count'].resample('h').mean()
2023-01-01 00:00:00 1.0
2023-01-01 01:00:00 NaN
2023-01-01 02:00:00 NaN
2023-01-01 03:00:00 NaN
2023-01-01 04:00:00 NaN
...
2023-12-30 20:00:00 NaN
2023-12-30 21:00:00 NaN
2023-12-30 22:00:00 NaN
2023-12-30 23:00:00 NaN
2023-12-31 00:00:00 365.0
Freq: h, Name: count, Length: 8737, dtype: float64
For all datetimes that did not exist before, Pandas inserts a row but does not know which value to fill in, so it puts NaN. (If you upsample with the .sum() method instead, those empty bins show up as 0.)
If you want to perform further tasks with your resampled Series, store it as a variable, e.g., monthly_sum = DF['count'].resample('ME').sum().
Although the datetime object contains all the components, you can always just extract those you are interested in:
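One way to produce values like the output below; the particular index position and components picked here are assumptions for illustration, using the daily 2023 index from above:

ts = DF.index[35]              # a single Timestamp from the index (2023-02-05)
print(ts)                      # the full datetime
print(ts.month)                # just the month component
DF.index.dayofweek.unique()    # day-of-week codes occurring in the index (Mon=0 ... Sun=6)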
2023-02-05 00:00:00
2
Index([6, 0, 1, 2, 3, 4, 5], dtype='int32')
Have a look at this overview table for datetime components to familiarize yourself with your many options.
If you work with a datetime object that is not the index, you need to access the components slightly differently:
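A minimal sketch (the DataFrame DF3 here is made up for illustration): for a datetime column that is not the index, you go through the .dt accessor.

DF3 = pd.DataFrame({'date': pd.date_range('2023-01-01', periods=3, freq='D'),
                    'value': [10, 20, 30]})
DF3['date'].dt.month       # month component, via the .dt accessor
DF3['date'].dt.dayofweek   # day of week (Mon=0 ... Sun=6)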
You can also use the datetime index to extract slices of your data.
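For example, label-based slicing with date strings, assuming the daily DF from above (this is what produces the output below):

DF.loc['2023-01-01':'2023-01-03', 'count']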
2023-01-01 1
2023-01-02 2
2023-01-03 3
Freq: D, Name: count, dtype: int64
Be aware that this label-based slicing is inclusive of the end point with the loc operator, whereas position-based slicing with the iloc operator excludes the end point (just like slicing in NumPy).
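A quick sketch of the difference, again using the daily DF from above:

DF['count'].loc['2023-01-01':'2023-01-03']   # end label included   -> Jan 1, 2 and 3
DF['count'].iloc[0:3]                        # end position excluded -> rows 0, 1, 2 (also Jan 1-3)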
To handle time durations, Pandas has another data type Timedelta.
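A small sketch of how Timedeltas arise and can be created (the dates and duration are arbitrary):

pd.Timestamp('2023-03-01') - pd.Timestamp('2023-01-01')   # Timedelta('59 days 00:00:00')
pd.Timedelta('90 min')                                    # durations can also be built from strings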
And finally, if you need a datetime with an associated timezone, use the Timestamp.
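For example (the chosen timezone is arbitrary):

pd.Timestamp('2023-07-01 12:00', tz='Europe/Vienna')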
If you want to dig even deeper, make sure to consider the official Pandas user guide on Time series and date functionality.