Pandas: File Input/Output#

For more cf. Pandas IO Tools.

import pandas as pd
import numpy as np

CSV: comma separated values#

df = pd.DataFrame(10*np.random.randn(10,3)+3, 
                  columns= ['temperature','CO2','energy consumption'],
                  index = range(2000,2010))
df
temperature CO2 energy consumption
2000 -12.286623 -9.017741 4.175555
2001 -0.206236 2.605373 -5.076079
2002 -8.669134 10.150683 3.959053
2003 -2.827104 -16.340451 -6.078593
2004 -16.600510 -0.214877 1.328272
2005 19.249710 -4.046548 15.387640
2006 2.724853 7.748419 9.717703
2007 0.738379 26.695446 1.515813
2008 -7.855695 -1.237906 -3.721848
2009 -11.135411 1.170897 -9.016716
df.to_csv('daten/my_test_data.csv')

Delete the data frame and read it in from the csv-file:

del df
df= pd.read_csv('daten/my_test_data.csv')
df
Unnamed: 0 temperature CO2 energy consumption
0 2000 -12.286623 -9.017741 4.175555
1 2001 -0.206236 2.605373 -5.076079
2 2002 -8.669134 10.150683 3.959053
3 2003 -2.827104 -16.340451 -6.078593
4 2004 -16.600510 -0.214877 1.328272
5 2005 19.249710 -4.046548 15.387640
6 2006 2.724853 7.748419 9.717703
7 2007 0.738379 26.695446 1.515813
8 2008 -7.855695 -1.237906 -3.721848
9 2009 -11.135411 1.170897 -9.016716
df= pd.read_csv('daten/my_test_data.csv', index_col=0)
df
temperature CO2 energy consumption
2000 -12.286623 -9.017741 4.175555
2001 -0.206236 2.605373 -5.076079
2002 -8.669134 10.150683 3.959053
2003 -2.827104 -16.340451 -6.078593
2004 -16.600510 -0.214877 1.328272
2005 19.249710 -4.046548 15.387640
2006 2.724853 7.748419 9.717703
2007 0.738379 26.695446 1.515813
2008 -7.855695 -1.237906 -3.721848
2009 -11.135411 1.170897 -9.016716

Excel#

df.to_excel('daten/my_test_data.xlsx')

Delete the data frame and read it in from the xlsx-file:

del df
df = pd.read_excel('daten/my_test_data.xlsx', sheet_name='Sheet1')
df
Unnamed: 0 temperature CO2 energy consumption
0 2000 -12.286623 -9.017741 4.175555
1 2001 -0.206236 2.605373 -5.076079
2 2002 -8.669134 10.150683 3.959053
3 2003 -2.827104 -16.340451 -6.078593
4 2004 -16.600510 -0.214877 1.328272
5 2005 19.249710 -4.046548 15.387640
6 2006 2.724853 7.748419 9.717703
7 2007 0.738379 26.695446 1.515813
8 2008 -7.855695 -1.237906 -3.721848
9 2009 -11.135411 1.170897 -9.016716

Many more file formats can be written and read, cf. Pandas IO Tools.

Date and Time Data#

For more cf. Pandas Time series / date functionality.

import datetime as dt
now = dt.datetime.now()
now
datetime.datetime(2022, 8, 29, 14, 51, 23, 765003)
print(now.year)
print(now.month)
print(now.day)
print(now.year)
print(now.hour)
print(now.minute)
# etc.
2022
8
29
2022
14
51
timestamp1 = dt.datetime(2014, 3, 28, 11, 30)
timestamp1 
datetime.datetime(2014, 3, 28, 11, 30)
timestamp2 = dt.datetime(1614, 12, 24, 19, 30)
timestamp2
datetime.datetime(1614, 12, 24, 19, 30)
delta = timestamp1 - timestamp2
delta
datetime.timedelta(days=145825, seconds=57600)
print(delta.days)
print(delta.seconds)
print(delta.total_seconds())
145825
57600
12599337600.0
timestamp1 + dt.timedelta(14)
datetime.datetime(2014, 4, 11, 11, 30)

Converting between string and datetime#

stamp = dt.datetime(2011, 1, 3, 17, 45)
type(stamp)
datetime.datetime
str(stamp)
'2011-01-03 17:45:00'
stamp.strftime('%Y/%B/%d-%H')
'2011/January/03-17'

Directive

Meaning

%a

Locale’s abbreviated weekday name

%A

Locale’s full weekday name

%b

Locale’s abbreviated month name

%B

Locale’s full month name

%c

Locale’s appropriate date and time representation

%d

Day of the month as a decimal number [01,31]

%H

Hour (24-hour clock) as a decimal number [00,23]

%I

Hour (12-hour clock) as a decimal number [01,12]

%j

Day of the year as a decimal number [001,366]

%m

Month as a decimal number [01,12]

%M

Minute as a decimal number [00,59]

%p

Locale’s equivalent of either AM or PM

%S

Second as a decimal number [00,61]

%U

Week number of the year (Sunday as the first day of the week)

%w

Weekday as a decimal number [0(Sunday),6]

%W

Week number of the year (Monday as the first day of the week)

%x

Locale’s appropriate date representation

%X

Locale’s appropriate time representation

%y

Year without century as a decimal number [00,99]

%Y

Year with century as a decimal number

%Z

Time zone name (no characters if no time zone exists)

%%

A literal '%' character

For more on format strings cf. http://docs.python.org/2/library/time.html#time.strftime.

dt.datetime.strptime('24/12/2012','%d/%m/%Y')
datetime.datetime(2012, 12, 24, 0, 0)

Time Series Basics#

Synthetic data:

df = pd.DataFrame(np.random.randn(10,3), 
                  columns= ['temperature','CO2','energy consumption'],
                  index = range(2000,2010))
df
temperature CO2 energy consumption
2000 -0.175572 1.750254 0.867122
2001 -0.056392 0.063977 -0.410385
2002 -1.324750 -0.037674 0.630271
2003 -0.867190 0.078205 -0.078580
2004 -0.359342 -0.782809 0.153157
2005 1.259952 0.016337 -0.935126
2006 -1.323949 -1.755193 -0.449625
2007 -2.010301 -0.642709 -0.381315
2008 0.026720 -0.491324 -1.285634
2009 -0.218182 -0.546904 0.283252
df.index
RangeIndex(start=2000, stop=2010, step=1)
pd.to_datetime(df.index, format='%Y')
DatetimeIndex(['2000-01-01', '2001-01-01', '2002-01-01', '2003-01-01',
               '2004-01-01', '2005-01-01', '2006-01-01', '2007-01-01',
               '2008-01-01', '2009-01-01'],
              dtype='datetime64[ns]', freq=None)
df.index = pd.to_datetime(df.index, format='%Y')
df
temperature CO2 energy consumption
2000-01-01 -0.175572 1.750254 0.867122
2001-01-01 -0.056392 0.063977 -0.410385
2002-01-01 -1.324750 -0.037674 0.630271
2003-01-01 -0.867190 0.078205 -0.078580
2004-01-01 -0.359342 -0.782809 0.153157
2005-01-01 1.259952 0.016337 -0.935126
2006-01-01 -1.323949 -1.755193 -0.449625
2007-01-01 -2.010301 -0.642709 -0.381315
2008-01-01 0.026720 -0.491324 -1.285634
2009-01-01 -0.218182 -0.546904 0.283252

Read an Excel file with a column already formated as date:

df = pd.read_excel('daten/my_test_data_2.xls', sheet_name='Sheet1', index_col=0)
df
temperature CO2 energy consumption
date
2000-06-15 0.294403 -0.484523 0.642076
2001-06-15 -1.146592 0.824076 -0.756397
2002-06-15 1.345931 -0.446263 -0.139374
2003-06-15 -0.840468 -1.695912 -0.674641
2004-06-15 0.473816 -1.183667 -1.006740
2005-06-15 0.957741 0.214721 -0.750404
2006-06-15 1.361329 1.388585 0.492154
2007-06-15 0.559999 0.417268 0.793223
2008-06-15 0.301742 0.068523 0.645997
2009-06-15 0.936508 1.420025 -0.793608
df.iloc[0]
temperature           0.294403
CO2                  -0.484523
energy consumption    0.642076
Name: 2000-06-15 00:00:00, dtype: float64
df['time'] = df.index - dt.datetime(2000,1,1)
df
temperature CO2 energy consumption time
date
2000-06-15 0.294403 -0.484523 0.642076 166 days
2001-06-15 -1.146592 0.824076 -0.756397 531 days
2002-06-15 1.345931 -0.446263 -0.139374 896 days
2003-06-15 -0.840468 -1.695912 -0.674641 1261 days
2004-06-15 0.473816 -1.183667 -1.006740 1627 days
2005-06-15 0.957741 0.214721 -0.750404 1992 days
2006-06-15 1.361329 1.388585 0.492154 2357 days
2007-06-15 0.559999 0.417268 0.793223 2722 days
2008-06-15 0.301742 0.068523 0.645997 3088 days
2009-06-15 0.936508 1.420025 -0.793608 3453 days
df['days'] = df['time'].dt.days
df
temperature CO2 energy consumption time days
date
2000-06-15 0.294403 -0.484523 0.642076 166 days 166
2001-06-15 -1.146592 0.824076 -0.756397 531 days 531
2002-06-15 1.345931 -0.446263 -0.139374 896 days 896
2003-06-15 -0.840468 -1.695912 -0.674641 1261 days 1261
2004-06-15 0.473816 -1.183667 -1.006740 1627 days 1627
2005-06-15 0.957741 0.214721 -0.750404 1992 days 1992
2006-06-15 1.361329 1.388585 0.492154 2357 days 2357
2007-06-15 0.559999 0.417268 0.793223 2722 days 2722
2008-06-15 0.301742 0.068523 0.645997 3088 days 3088
2009-06-15 0.936508 1.420025 -0.793608 3453 days 3453

Generating Date Ranges#

pd.date_range('2014-03-01','2016-07-05')    # default frequency = calender day
DatetimeIndex(['2014-03-01', '2014-03-02', '2014-03-03', '2014-03-04',
               '2014-03-05', '2014-03-06', '2014-03-07', '2014-03-08',
               '2014-03-09', '2014-03-10',
               ...
               '2016-06-26', '2016-06-27', '2016-06-28', '2016-06-29',
               '2016-06-30', '2016-07-01', '2016-07-02', '2016-07-03',
               '2016-07-04', '2016-07-05'],
              dtype='datetime64[ns]', length=858, freq='D')
# custom frequency and number of periods
pd.date_range('2014-04-05 08:00', periods=10, freq='2h30min') 
DatetimeIndex(['2014-04-05 08:00:00', '2014-04-05 10:30:00',
               '2014-04-05 13:00:00', '2014-04-05 15:30:00',
               '2014-04-05 18:00:00', '2014-04-05 20:30:00',
               '2014-04-05 23:00:00', '2014-04-06 01:30:00',
               '2014-04-06 04:00:00', '2014-04-06 06:30:00'],
              dtype='datetime64[ns]', freq='150T')
pd.date_range('2014-03-01','2016-07-05', freq='B') # business day
DatetimeIndex(['2014-03-03', '2014-03-04', '2014-03-05', '2014-03-06',
               '2014-03-07', '2014-03-10', '2014-03-11', '2014-03-12',
               '2014-03-13', '2014-03-14',
               ...
               '2016-06-22', '2016-06-23', '2016-06-24', '2016-06-27',
               '2016-06-28', '2016-06-29', '2016-06-30', '2016-07-01',
               '2016-07-04', '2016-07-05'],
              dtype='datetime64[ns]', length=612, freq='B')
pd.date_range('2014-03-01','2016-07-05', freq='BMS') # first week day of month
DatetimeIndex(['2014-03-03', '2014-04-01', '2014-05-01', '2014-06-02',
               '2014-07-01', '2014-08-01', '2014-09-01', '2014-10-01',
               '2014-11-03', '2014-12-01', '2015-01-01', '2015-02-02',
               '2015-03-02', '2015-04-01', '2015-05-01', '2015-06-01',
               '2015-07-01', '2015-08-03', '2015-09-01', '2015-10-01',
               '2015-11-02', '2015-12-01', '2016-01-01', '2016-02-01',
               '2016-03-01', '2016-04-01', '2016-05-02', '2016-06-01',
               '2016-07-01'],
              dtype='datetime64[ns]', freq='BMS')

For more frequency options cf. [Pandas Time series / date functionality ](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

Indexing, Selection, Subsetting#

dates = pd.date_range('2014-03-01','2016-07-05', freq='BMS')
df = pd.DataFrame(np.random.randn(len(dates),3), index = dates, columns = ['A','B','C'])
df.head()
A B C
2014-03-03 0.394580 -1.494955 -0.347213
2014-04-01 -0.828459 -1.064979 0.781559
2014-05-01 -0.477372 0.553372 -0.465750
2014-06-02 -2.028449 0.365953 0.158726
2014-07-01 0.941145 -0.978360 1.302139
df.loc['2014']
A B C
2014-03-03 0.394580 -1.494955 -0.347213
2014-04-01 -0.828459 -1.064979 0.781559
2014-05-01 -0.477372 0.553372 -0.465750
2014-06-02 -2.028449 0.365953 0.158726
2014-07-01 0.941145 -0.978360 1.302139
2014-08-01 -0.902680 -0.186172 -0.919617
2014-09-01 -0.465930 -0.631799 1.320748
2014-10-01 0.509732 -1.081103 2.057128
2014-11-03 -0.514350 0.729604 -0.561315
2014-12-01 0.021127 0.030578 -0.437966
df.loc['2014-06':'2014-12']
A B C
2014-06-02 -2.028449 0.365953 0.158726
2014-07-01 0.941145 -0.978360 1.302139
2014-08-01 -0.902680 -0.186172 -0.919617
2014-09-01 -0.465930 -0.631799 1.320748
2014-10-01 0.509732 -1.081103 2.057128
2014-11-03 -0.514350 0.729604 -0.561315
2014-12-01 0.021127 0.030578 -0.437966
df.loc[:'2014-12-01']
A B C
2014-03-03 0.394580 -1.494955 -0.347213
2014-04-01 -0.828459 -1.064979 0.781559
2014-05-01 -0.477372 0.553372 -0.465750
2014-06-02 -2.028449 0.365953 0.158726
2014-07-01 0.941145 -0.978360 1.302139
2014-08-01 -0.902680 -0.186172 -0.919617
2014-09-01 -0.465930 -0.631799 1.320748
2014-10-01 0.509732 -1.081103 2.057128
2014-11-03 -0.514350 0.729604 -0.561315
2014-12-01 0.021127 0.030578 -0.437966

Shifting (Leading and Lagging) Data#

# frequency M indicates the last calendar day of a month
ts = pd.Series(np.random.randn(4), index = pd.date_range('2000-01-01', periods=4, freq='M')) 
ts
2000-01-31   -1.513709
2000-02-29   -0.450101
2000-03-31    0.309443
2000-04-30   -1.272072
Freq: M, dtype: float64
ts.shift(2)
2000-01-31         NaN
2000-02-29         NaN
2000-03-31   -1.513709
2000-04-30   -0.450101
Freq: M, dtype: float64
ts.shift(-2)
2000-01-31    0.309443
2000-02-29   -1.272072
2000-03-31         NaN
2000-04-30         NaN
Freq: M, dtype: float64

If the frequency is passed to shift the timestamps are advanced.

ts.shift(2, freq='M')
2000-03-31   -1.513709
2000-04-30   -0.450101
2000-05-31    0.309443
2000-06-30   -1.272072
Freq: M, dtype: float64
ts.shift(2, freq='D')
2000-02-02   -1.513709
2000-03-02   -0.450101
2000-04-02    0.309443
2000-05-02   -1.272072
dtype: float64