Assignment #05

#05-04: Explore WX

- Read the comma separated spreadsheet WX_GNP.csv

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
WX = pd.read_csv("WX_GNP.csv", sep=",", parse_dates=["datetime"])
WX

|  | datetime | station_id | hs | hn24 | hn72 | rain | iswr | ilwr | ta | rh | vw | dw | elev |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-09-04 06:00:00 | VIR075905 | 0.000 | 0.000000 | 0.00000 | 0.000000 | 0.000 | 256.326 | 6.281980 | 94.7963 | 2.32314 | 241.468 | 2121 |
| 1 | 2019-09-04 17:00:00 | VIR075905 | 0.000 | 0.000000 | 0.00000 | 0.000196 | 555.803 | 288.803 | 12.524600 | 72.0814 | 4.38687 | 247.371 | 2121 |
| 2 | 2019-09-05 17:00:00 | VIR075905 | 0.000 | 0.000000 | 0.00000 | 0.000045 | 534.011 | 287.089 | 14.265400 | 55.1823 | 1.93691 | 239.254 | 2121 |
| 3 | 2019-09-06 17:00:00 | VIR075905 | 0.000 | 0.000000 | 0.00000 | 0.000026 | 546.008 | 292.024 | 14.136600 | 72.5560 | 3.67782 | 239.715 | 2121 |
| 4 | 2019-09-07 17:00:00 | VIR075905 | 0.000 | 0.000000 | 0.00000 | 0.000150 | 528.582 | 289.508 | 14.623800 | 68.9262 | 1.90232 | 227.356 | 2121 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 178733 | 2021-05-25 17:00:00 | VIR088016 | 279.822 | 2.197360 | 7.07140 | 0.108814 | 346.135 | 314.824 | 2.471430 | 94.4473 | 1.28551 | 170.356 | 2121 |
| 178734 | 2021-05-26 17:00:00 | VIR088016 | 272.909 | 0.000000 | 1.73176 | 0.001291 | 747.383 | 256.193 | 5.066340 | 78.2142 | 4.11593 | 208.909 | 2121 |
| 178735 | 2021-05-27 17:00:00 | VIR088016 | 267.290 | 2.412910 | 2.80912 | 0.167091 | 185.431 | 316.157 | 1.090880 | 97.3988 | 4.65204 | 218.963 | 2121 |
| 178736 | 2021-05-28 17:00:00 | VIR088016 | 275.573 | 11.509200 | 12.80780 | 0.000000 | 269.329 | 308.515 | 0.247583 | 88.5444 | 5.33803 | 263.788 | 2121 |
| 178737 | 2021-05-29 17:00:00 | VIR088016 | 267.562 | 0.259779 | 6.15200 | 0.000766 | 766.358 | 247.267 | 4.666620 | 69.3598 | 1.92538 | 256.041 | 2121 |
178738 rows × 13 columns
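As a quick sanity check (an extra sketch, not an original cell of this notebook), we can verify that parse_dates actually converted the datetime column to a proper datetime type:

# 'datetime' should show as datetime64[ns] after parse_dates, not object
print(WX['datetime'].dtype)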
In [3]:
tabledict = {
    'datetime': 'datetime in the form YYYY-MM-DD HH:MM:SS',
    'station_id': 'ID of virtual weather station (i.e., weather model grid point)',
    'hs': 'Snow height (cm)',
    'hn24': 'Height of new snow within last 24 hours (cm)',
    'hn72': 'Height of new snow within last 72 hours (cm)',
    'rain': 'Liquid water accumulation within last 24 hours (mm)',
    'iswr': 'Incoming shortwave radiation (W/m2)',
    'ilwr': 'Incoming longwave radiation (W/m2)',
    'ta': 'Air temperature (degrees Celsius)',
    'rh': 'Relative humidity (%)',
    'vw': 'Wind speed (m/s)',
    'dw': 'Wind direction (degrees)',
    'elev': 'Station elevation (m asl)'
}

In [4]:
for key in tabledict:
    print(f'{key:>10}: {tabledict[key]}')

  datetime: datetime in the form YYYY-MM-DD HH:MM:SS
station_id: ID of virtual weather station (i.e., weather model grid point)
hs: Snow height (cm)
hn24: Height of new snow within last 24 hours (cm)
hn72: Height of new snow within last 72 hours (cm)
rain: Liquid water accumulation within last 24 hours (mm)
iswr: Incoming shortwave radiation (W/m2)
ilwr: Incoming longwave radiation (W/m2)
ta: Air temperature (degrees Celsius)
rh: Relative humidity (%)
vw: Wind speed (m/s)
dw: Wind direction (degrees)
elev: Station elevation (m asl)
- How many different virtual stations are included in the data frame?
- How many unique time stamps does the data frame contain?
- What are the earliest and latest time records?
- How many stations are located above 2000m but below 2200m?
- Characterize the elevation distribution of the stations with a number of percentiles!
In [5]:
station_ids = WX['station_id'].unique()
n_stations = len(station_ids)
print(f'There are {n_stations} unique stations in the data frame and their labels are as follows:\n {station_ids}')

There are 238 unique stations in the data frame and their labels are as follows:
['VIR075905' 'VIR075906' 'VIR075907' 'VIR076452' 'VIR076456' 'VIR076457'
'VIR076458' 'VIR076459' 'VIR077002' 'VIR077003' 'VIR077004' 'VIR077007'
'VIR077008' 'VIR077009' 'VIR077010' 'VIR077548' 'VIR077549' 'VIR077550'
'VIR077551' 'VIR077552' 'VIR077553' 'VIR077554' 'VIR077555' 'VIR077556'
'VIR077557' 'VIR077558' 'VIR077559' 'VIR077560' 'VIR077561' 'VIR078100'
'VIR078101' 'VIR078102' 'VIR078103' 'VIR078104' 'VIR078105' 'VIR078106'
'VIR078107' 'VIR078108' 'VIR078109' 'VIR078110' 'VIR078111' 'VIR078112'
'VIR078652' 'VIR078653' 'VIR078654' 'VIR078655' 'VIR078656' 'VIR078657'
'VIR078658' 'VIR078659' 'VIR078660' 'VIR078661' 'VIR078662' 'VIR079203'
'VIR079204' 'VIR079205' 'VIR079206' 'VIR079207' 'VIR079208' 'VIR079209'
'VIR079210' 'VIR079211' 'VIR079212' 'VIR079213' 'VIR079754' 'VIR079755'
'VIR079756' 'VIR079757' 'VIR079758' 'VIR079759' 'VIR079760' 'VIR079761'
'VIR079762' 'VIR079763' 'VIR079764' 'VIR080304' 'VIR080305' 'VIR080306'
'VIR080307' 'VIR080308' 'VIR080309' 'VIR080310' 'VIR080311' 'VIR080312'
'VIR080313' 'VIR080314' 'VIR080315' 'VIR080855' 'VIR080856' 'VIR080857'
'VIR080858' 'VIR080859' 'VIR080860' 'VIR080861' 'VIR080862' 'VIR080863'
'VIR080864' 'VIR080865' 'VIR080866' 'VIR081406' 'VIR081407' 'VIR081408'
'VIR081409' 'VIR081410' 'VIR081411' 'VIR081412' 'VIR081413' 'VIR081414'
'VIR081415' 'VIR081416' 'VIR081417' 'VIR081420' 'VIR081956' 'VIR081957'
'VIR081958' 'VIR081959' 'VIR081960' 'VIR081961' 'VIR081962' 'VIR081963'
'VIR081964' 'VIR081965' 'VIR081966' 'VIR081967' 'VIR081968' 'VIR081969'
'VIR081970' 'VIR081971' 'VIR082508' 'VIR082509' 'VIR082510' 'VIR082511'
'VIR082512' 'VIR082513' 'VIR082514' 'VIR082515' 'VIR082516' 'VIR082517'
'VIR082518' 'VIR082519' 'VIR082520' 'VIR082521' 'VIR082522' 'VIR082523'
'VIR083059' 'VIR083060' 'VIR083061' 'VIR083062' 'VIR083063' 'VIR083064'
'VIR083065' 'VIR083066' 'VIR083067' 'VIR083068' 'VIR083069' 'VIR083070'
'VIR083071' 'VIR083072' 'VIR083073' 'VIR083610' 'VIR083611' 'VIR083612'
'VIR083613' 'VIR083614' 'VIR083615' 'VIR083616' 'VIR083617' 'VIR083618'
'VIR083619' 'VIR083620' 'VIR083621' 'VIR083622' 'VIR083623' 'VIR084161'
'VIR084162' 'VIR084163' 'VIR084164' 'VIR084165' 'VIR084166' 'VIR084167'
'VIR084168' 'VIR084169' 'VIR084170' 'VIR084171' 'VIR084711' 'VIR084712'
'VIR084713' 'VIR084714' 'VIR084715' 'VIR084716' 'VIR084717' 'VIR084718'
'VIR084719' 'VIR084720' 'VIR084721' 'VIR084722' 'VIR084723' 'VIR085261'
'VIR085262' 'VIR085263' 'VIR085264' 'VIR085265' 'VIR085266' 'VIR085267'
'VIR085268' 'VIR085269' 'VIR085270' 'VIR085271' 'VIR085272' 'VIR085812'
'VIR085813' 'VIR085814' 'VIR085815' 'VIR085816' 'VIR085817' 'VIR085818'
'VIR085819' 'VIR085820' 'VIR085821' 'VIR085822' 'VIR085823' 'VIR086362'
'VIR086363' 'VIR086364' 'VIR086365' 'VIR086371' 'VIR086372' 'VIR086373'
'VIR086374' 'VIR086912' 'VIR086913' 'VIR086914' 'VIR086915' 'VIR086916'
'VIR087463' 'VIR087464' 'VIR087465' 'VIR088016']
In [6]:
print(f"There are {len(WX['datetime'].unique())} unique time stamps between '{WX['datetime'].min()}' and '{WX['datetime'].max()}'.")There are 751 unique time stamps between '2018-09-05 06:00:00' and '2021-05-29 17:00:00'.
In [8]:
WX.loc[(WX['elev'] > 2000) & (WX['elev'] < 2200), 'elev'].unique().shape[0]
## or alternatively (less recommended):
# WX['elev'][(WX['elev'] > 2000) & (WX['elev'] < 2200)].unique().shape[0]
## this is powerful as well:
# WX.loc[(WX['elev'] > 2000) & (WX['elev'] < 2200), ['elev', 'station_id']]

30
In [9]:
WX['elev'].describe()

count 178738.000000
mean 1836.436975
std 236.796505
min 1279.000000
25% 1671.000000
50% 1828.000000
75% 1989.000000
max 2497.000000
Name: elev, dtype: float64
In [10]:
WX['elev'].quantile(0.90)

np.float64(2139.0)
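quantile() also accepts a list of probabilities, so a whole set of percentiles can be computed in one call; a minimal sketch (the percentile choice is arbitrary):

# characterize the elevation distribution with several percentiles at once
WX['elev'].quantile([0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95])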
#05-05: Subset WX and compute more summary stats

- Create another data frame as subset of WX, which contains all the data from one station_id of your choice.
- What is the average air temperature and its standard deviation?
- What is the median relative humidity rh when either hn24 is greater than 10 cm or rain is greater than 2 mm? What about the median rh during the opposite conditions?
- Compute a new column hn72_check that should conceptually be identical to hn72. Use only hn24 to derive hn72_check.
- Test whether hn72_check is indeed equal to hn72. Why not?
- Store the new data frame in a csv file.
In [11]:
wx = WX[WX['station_id'].isin(['VIR075905'])].copy()
## or more generically:
# wx = WX.loc[WX['station_id'].isin([WX['station_id'].unique()[0]]), :]
wx

|  | datetime | station_id | hs | hn24 | hn72 | rain | iswr | ilwr | ta | rh | vw | dw | elev |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-09-04 06:00:00 | VIR075905 | 0.000 | 0.000000 | 0.00000 | 0.000000 | 0.000 | 256.326 | 6.28198 | 94.7963 | 2.32314 | 241.468 | 2121 |
| 1 | 2019-09-04 17:00:00 | VIR075905 | 0.000 | 0.000000 | 0.00000 | 0.000196 | 555.803 | 288.803 | 12.52460 | 72.0814 | 4.38687 | 247.371 | 2121 |
| 2 | 2019-09-05 17:00:00 | VIR075905 | 0.000 | 0.000000 | 0.00000 | 0.000045 | 534.011 | 287.089 | 14.26540 | 55.1823 | 1.93691 | 239.254 | 2121 |
| 3 | 2019-09-06 17:00:00 | VIR075905 | 0.000 | 0.000000 | 0.00000 | 0.000026 | 546.008 | 292.024 | 14.13660 | 72.5560 | 3.67782 | 239.715 | 2121 |
| 4 | 2019-09-07 17:00:00 | VIR075905 | 0.000 | 0.000000 | 0.00000 | 0.000150 | 528.582 | 289.508 | 14.62380 | 68.9262 | 1.90232 | 227.356 | 2121 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 114506 | 2021-05-25 17:00:00 | VIR075905 | 281.450 | 0.000000 | 0.00000 | 0.401526 | 346.135 | 314.824 | 3.71658 | 88.5444 | 1.56797 | 156.521 | 2121 |
| 114507 | 2021-05-26 17:00:00 | VIR075905 | 276.290 | 0.000000 | 0.00000 | 0.025453 | 774.383 | 242.693 | 6.56515 | 72.3112 | 5.24579 | 208.540 | 2121 |
| 114508 | 2021-05-27 17:00:00 | VIR075905 | 265.546 | 0.000000 | 0.00000 | 0.313308 | 239.181 | 316.157 | 2.42825 | 97.3988 | 7.03859 | 218.594 | 2121 |
| 114509 | 2021-05-28 17:00:00 | VIR075905 | 268.436 | 5.094300 | 5.09430 | 0.108912 | 376.829 | 301.828 | 1.07766 | 84.1171 | 6.69847 | 266.648 | 2121 |
| 114510 | 2021-05-29 17:00:00 | VIR075905 | 265.560 | 0.797445 | 3.09568 | 0.010523 | 793.358 | 240.517 | 5.01248 | 70.8355 | 2.92842 | 265.633 | 2121 |
751 rows × 13 columns
Mind the .copy() when assigning the filtered data frame WX to a new one wx! What happens if you don’t copy? (Tip: Try it out and see whether you get a warning several cells below!)
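A minimal sketch of that experiment (assuming you deliberately omit the .copy(); the column name ta_f is made up for illustration):

# filter WITHOUT .copy(): wx_view may be a view on WX rather than an independent frame
wx_view = WX[WX['station_id'] == 'VIR075905']
# assigning to it can then trigger a SettingWithCopyWarning
# (depending on your pandas version and copy-on-write settings)
wx_view['ta_f'] = wx_view['ta'] * 9 / 5 + 32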
In [12]:
wx['ta'].mean()

np.float64(-3.7703614580559255)
In [13]:
wx['ta'].std()

np.float64(6.582577628629118)
In [14]:
wx.loc[(wx['hn24'] > 10) | (wx['rain'] > 2), 'rh'].median()

np.float64(89.947)
In [15]:
wx.loc[~((wx['hn24'] > 10) | (wx['rain'] > 2)), 'rh'].median()
## or equivalently:
# wx.loc[~(wx['hn24'] > 10) & ~(wx['rain'] > 2), 'rh'].median()
# wx.loc[(wx['hn24'] <= 10) & (wx['rain'] <= 2), 'rh'].median()

np.float64(82.03875)
In [16]:
import numpy as np

The following solution matches the Pandas knowledge from this unit,
In [17]:
wx['hn72_check'] = np.nan
for i, val in enumerate(wx['hn24']):
    if i > 3:
        # column 13 is hn72_check, column 3 is hn24:
        # sum hn24 over the current and the two preceding (daily) rows
        wx.iloc[i, 13] = wx.iloc[i-2:i+1, 3].sum()

whereas this next solution would be my go-to choice. There are a few extra tricks in here: What it comes down to is that we would need the iloc operator for selecting the rows based on the integer count i, but at the same time we want to access the column by its name and not by its location, which requires the loc operator. I'm ultimately going for the loc operator to maximize readability and convenience. To still access the correct rows, however, I need to select wx.index[i] instead of i alone. Why? Because wx.index[i] returns the row name at location i (see Unit 6).
In [18]:
wx['hn72_check_II'] = np.nan
for i, val in enumerate(wx['hn24']):
    if i > 3:
        # label-based access: wx.index[i] maps the integer position i
        # back to the row name that loc expects
        wx.loc[wx.index[i], 'hn72_check_II'] = wx.loc[wx.index[i-2]:wx.index[i], 'hn24'].sum()

In [19]:
wx[['hn24', 'hn72', 'hn72_check', 'hn72_check_II']]

|  | hn24 | hn72 | hn72_check | hn72_check_II |
|---|---|---|---|---|
| 0 | 0.000000 | 0.00000 | NaN | NaN |
| 1 | 0.000000 | 0.00000 | NaN | NaN |
| 2 | 0.000000 | 0.00000 | NaN | NaN |
| 3 | 0.000000 | 0.00000 | NaN | NaN |
| 4 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... |
| 114506 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 114507 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 114508 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 114509 | 5.094300 | 5.09430 | 5.094300 | 5.094300 |
| 114510 | 0.797445 | 3.09568 | 5.891745 | 5.891745 |
751 rows × 4 columns
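The task also asks to test the equality explicitly; one possible sketch uses numpy's tolerance-based comparison to sidestep floating-point round-off (the variable names are just for illustration):

# compare hn72 and hn72_check where hn72_check is defined
valid = wx['hn72_check'].notna()
matches = np.isclose(wx.loc[valid, 'hn72'], wx.loc[valid, 'hn72_check'])
print(f'{matches.sum()} of {valid.sum()} compared values agree')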
hn72 was computed on the hourly time series. Since we compute hn72_check on the daily time series, some values are identical while others are not. In any case, the first 4 entries cannot be identical, because we are missing the data points further in the past that would be needed to compute the 3-day height of new snow.
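For completeness, the loops above could also be replaced by a vectorized rolling window; a sketch with a hypothetical column name hn72_check_III (note that rolling already yields values from the third row on, whereas the loops start at i > 3):

# 3-day sum of the daily hn24 values, vectorized
wx['hn72_check_III'] = wx['hn24'].rolling(window=3).sum()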
In [76]:
wx.to_csv("WX_subset.csv", index=False)
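If you load the file again later, remember that the datetime column comes back as plain strings unless you re-parse it; a minimal round-trip sketch:

# re-read the stored subset and parse the datetime column again
wx_reloaded = pd.read_csv("WX_subset.csv", parse_dates=["datetime"])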