Assignment #05

#05-04: Explore WX

Read the comma separated spreadsheet WX_GNP.csv

In [1]:

import pandas as pd
import matplotlib.pyplot as plt

In [2]:

WX = pd.read_csv("WX_GNP.csv", sep=",", parse_dates=["datetime"])
WX

	datetime	station_id	hs	hn24	hn72	rain	iswr	ilwr	ta	rh	vw	dw	elev
0	2019-09-04 06:00:00	VIR075905	0.000	0.000000	0.00000	0.000000	0.000	256.326	6.281980	94.7963	2.32314	241.468	2121
1	2019-09-04 17:00:00	VIR075905	0.000	0.000000	0.00000	0.000196	555.803	288.803	12.524600	72.0814	4.38687	247.371	2121
2	2019-09-05 17:00:00	VIR075905	0.000	0.000000	0.00000	0.000045	534.011	287.089	14.265400	55.1823	1.93691	239.254	2121
3	2019-09-06 17:00:00	VIR075905	0.000	0.000000	0.00000	0.000026	546.008	292.024	14.136600	72.5560	3.67782	239.715	2121
4	2019-09-07 17:00:00	VIR075905	0.000	0.000000	0.00000	0.000150	528.582	289.508	14.623800	68.9262	1.90232	227.356	2121
...	...	...	...	...	...	...	...	...	...	...	...	...	...
178733	2021-05-25 17:00:00	VIR088016	279.822	2.197360	7.07140	0.108814	346.135	314.824	2.471430	94.4473	1.28551	170.356	2121
178734	2021-05-26 17:00:00	VIR088016	272.909	0.000000	1.73176	0.001291	747.383	256.193	5.066340	78.2142	4.11593	208.909	2121
178735	2021-05-27 17:00:00	VIR088016	267.290	2.412910	2.80912	0.167091	185.431	316.157	1.090880	97.3988	4.65204	218.963	2121
178736	2021-05-28 17:00:00	VIR088016	275.573	11.509200	12.80780	0.000000	269.329	308.515	0.247583	88.5444	5.33803	263.788	2121
178737	2021-05-29 17:00:00	VIR088016	267.562	0.259779	6.15200	0.000766	766.358	247.267	4.666620	69.3598	1.92538	256.041	2121

178738 rows × 13 columns

In [3]:

tabledict = {
    'datetime': 'datetime in the form YYYY-MM-DD HH:MM:SS',
    'station_id': 'ID of virtual weather station (i.e., weather model grid point)',
    'hs': 'Snow height (cm)',
    'hn24': 'Height of new snow within last 24 hours (cm)',
    'hn72': 'Height of new snow within last 72 hours (cm)',
    'rain': 'Liquid water accumulation within last 24 hours (mm)',
    'iswr': 'Incoming shortwave radiation (Wm**-2)',
    'iswr': 'Incoming longave radiation (W/m2)',
    'ta': 'Air temperature (degrees Celsius)',
    'rh': 'Relative humidity (%)',
    'vw': 'Wind speed (m/s)',
    'dw': 'Wind direction (degrees)',
    'elev': 'Station elevation (m asl)'
}

In [4]:

for key in tabledict:
    print(f'{key:>10}: {tabledict[key]}')

  datetime: datetime in the form YYYY-MM-DD HH:MM:SS
station_id: ID of virtual weather station (i.e., weather model grid point)
        hs: Snow height (cm)
      hn24: Height of new snow within last 24 hours (cm)
      hn72: Height of new snow within last 72 hours (cm)
      rain: Liquid water accumulation within last 24 hours (mm)
      iswr: Incoming longave radiation (W/m2)
        ta: Air temperature (degrees Celsius)
        rh: Relative humidity (%)
        vw: Wind speed (m/s)
        dw: Wind direction (degrees)
      elev: Station elevation (m asl)

How many different virtual stations are included in the data frame?
How many unique time stamps does the data frame contain?
What are the earliest and latest time records?
How many stations are located above 2000m but below 2200m?
Characterize the elevation distribution of the stations with a number of percentiles!

In [5]:

station_ids = WX['station_id'].unique()
n_stations = len(station_ids)
print(f'There are {n_stations} unique stations in the data frame and there labels are as follows:\n {station_ids}')

There are 238 unique stations in the data frame and there labels are as follows:
 ['VIR075905' 'VIR075906' 'VIR075907' 'VIR076452' 'VIR076456' 'VIR076457'
 'VIR076458' 'VIR076459' 'VIR077002' 'VIR077003' 'VIR077004' 'VIR077007'
 'VIR077008' 'VIR077009' 'VIR077010' 'VIR077548' 'VIR077549' 'VIR077550'
 'VIR077551' 'VIR077552' 'VIR077553' 'VIR077554' 'VIR077555' 'VIR077556'
 'VIR077557' 'VIR077558' 'VIR077559' 'VIR077560' 'VIR077561' 'VIR078100'
 'VIR078101' 'VIR078102' 'VIR078103' 'VIR078104' 'VIR078105' 'VIR078106'
 'VIR078107' 'VIR078108' 'VIR078109' 'VIR078110' 'VIR078111' 'VIR078112'
 'VIR078652' 'VIR078653' 'VIR078654' 'VIR078655' 'VIR078656' 'VIR078657'
 'VIR078658' 'VIR078659' 'VIR078660' 'VIR078661' 'VIR078662' 'VIR079203'
 'VIR079204' 'VIR079205' 'VIR079206' 'VIR079207' 'VIR079208' 'VIR079209'
 'VIR079210' 'VIR079211' 'VIR079212' 'VIR079213' 'VIR079754' 'VIR079755'
 'VIR079756' 'VIR079757' 'VIR079758' 'VIR079759' 'VIR079760' 'VIR079761'
 'VIR079762' 'VIR079763' 'VIR079764' 'VIR080304' 'VIR080305' 'VIR080306'
 'VIR080307' 'VIR080308' 'VIR080309' 'VIR080310' 'VIR080311' 'VIR080312'
 'VIR080313' 'VIR080314' 'VIR080315' 'VIR080855' 'VIR080856' 'VIR080857'
 'VIR080858' 'VIR080859' 'VIR080860' 'VIR080861' 'VIR080862' 'VIR080863'
 'VIR080864' 'VIR080865' 'VIR080866' 'VIR081406' 'VIR081407' 'VIR081408'
 'VIR081409' 'VIR081410' 'VIR081411' 'VIR081412' 'VIR081413' 'VIR081414'
 'VIR081415' 'VIR081416' 'VIR081417' 'VIR081420' 'VIR081956' 'VIR081957'
 'VIR081958' 'VIR081959' 'VIR081960' 'VIR081961' 'VIR081962' 'VIR081963'
 'VIR081964' 'VIR081965' 'VIR081966' 'VIR081967' 'VIR081968' 'VIR081969'
 'VIR081970' 'VIR081971' 'VIR082508' 'VIR082509' 'VIR082510' 'VIR082511'
 'VIR082512' 'VIR082513' 'VIR082514' 'VIR082515' 'VIR082516' 'VIR082517'
 'VIR082518' 'VIR082519' 'VIR082520' 'VIR082521' 'VIR082522' 'VIR082523'
 'VIR083059' 'VIR083060' 'VIR083061' 'VIR083062' 'VIR083063' 'VIR083064'
 'VIR083065' 'VIR083066' 'VIR083067' 'VIR083068' 'VIR083069' 'VIR083070'
 'VIR083071' 'VIR083072' 'VIR083073' 'VIR083610' 'VIR083611' 'VIR083612'
 'VIR083613' 'VIR083614' 'VIR083615' 'VIR083616' 'VIR083617' 'VIR083618'
 'VIR083619' 'VIR083620' 'VIR083621' 'VIR083622' 'VIR083623' 'VIR084161'
 'VIR084162' 'VIR084163' 'VIR084164' 'VIR084165' 'VIR084166' 'VIR084167'
 'VIR084168' 'VIR084169' 'VIR084170' 'VIR084171' 'VIR084711' 'VIR084712'
 'VIR084713' 'VIR084714' 'VIR084715' 'VIR084716' 'VIR084717' 'VIR084718'
 'VIR084719' 'VIR084720' 'VIR084721' 'VIR084722' 'VIR084723' 'VIR085261'
 'VIR085262' 'VIR085263' 'VIR085264' 'VIR085265' 'VIR085266' 'VIR085267'
 'VIR085268' 'VIR085269' 'VIR085270' 'VIR085271' 'VIR085272' 'VIR085812'
 'VIR085813' 'VIR085814' 'VIR085815' 'VIR085816' 'VIR085817' 'VIR085818'
 'VIR085819' 'VIR085820' 'VIR085821' 'VIR085822' 'VIR085823' 'VIR086362'
 'VIR086363' 'VIR086364' 'VIR086365' 'VIR086371' 'VIR086372' 'VIR086373'
 'VIR086374' 'VIR086912' 'VIR086913' 'VIR086914' 'VIR086915' 'VIR086916'
 'VIR087463' 'VIR087464' 'VIR087465' 'VIR088016']

In [6]:

print(f"There are {len(WX['datetime'].unique())} unique time stamps between '{WX['datetime'].min()}' and '{WX['datetime'].max()}'.")

There are 751 unique time stamps between '2018-09-05 06:00:00' and '2021-05-29 17:00:00'.

In [8]:

WX.loc[(WX['elev'] > 2000) & (WX['elev'] < 2200), 'elev'].unique().shape[0]
## or alternatively (less reccommended):
# WX['elev'][(WX['elev'] > 2000) & (WX['elev'] < 2200)].unique().shape[0]
## this is powerful as well:
# WX.loc[(WX['elev'] > 2000) & (WX['elev'] < 2200), ('elev', 'station_id')]

In [9]:

WX['elev'].describe()

count    178738.000000
mean       1836.436975
std         236.796505
min        1279.000000
25%        1671.000000
50%        1828.000000
75%        1989.000000
max        2497.000000
Name: elev, dtype: float64

In [10]:

WX['elev'].quantile(0.90)

np.float64(2139.0)

#05-05: Subset WX and compute more summary stats

Create another data frame as subset of WX, which contains all the data from one station_id of your choice.
What is the average air temperature and its standard deviation?
What is the median relative humidity rh when either hn24 is greater than 10 cm or rain is greater than 2 mm? What about the median rh during the opposite conditions?
Compute a new column hn72_check that should conceptually be identical to hn72. Use only hn24 to derive hn72_check.
Test whether hn72_check is indeed equal to hn72. Why not?
Store the new data frame in a csv file.

In [11]:

wx = WX[WX['station_id'].isin(['VIR075905'])].copy()
## or more generically:
# wx = WX.loc[WX['station_id'].isin([WX['station_id'].unique()[0]]), ]
wx

	datetime	station_id	hs	hn24	hn72	rain	iswr	ilwr	ta	rh	vw	dw	elev
0	2019-09-04 06:00:00	VIR075905	0.000	0.000000	0.00000	0.000000	0.000	256.326	6.28198	94.7963	2.32314	241.468	2121
1	2019-09-04 17:00:00	VIR075905	0.000	0.000000	0.00000	0.000196	555.803	288.803	12.52460	72.0814	4.38687	247.371	2121
2	2019-09-05 17:00:00	VIR075905	0.000	0.000000	0.00000	0.000045	534.011	287.089	14.26540	55.1823	1.93691	239.254	2121
3	2019-09-06 17:00:00	VIR075905	0.000	0.000000	0.00000	0.000026	546.008	292.024	14.13660	72.5560	3.67782	239.715	2121
4	2019-09-07 17:00:00	VIR075905	0.000	0.000000	0.00000	0.000150	528.582	289.508	14.62380	68.9262	1.90232	227.356	2121
...	...	...	...	...	...	...	...	...	...	...	...	...	...
114506	2021-05-25 17:00:00	VIR075905	281.450	0.000000	0.00000	0.401526	346.135	314.824	3.71658	88.5444	1.56797	156.521	2121
114507	2021-05-26 17:00:00	VIR075905	276.290	0.000000	0.00000	0.025453	774.383	242.693	6.56515	72.3112	5.24579	208.540	2121
114508	2021-05-27 17:00:00	VIR075905	265.546	0.000000	0.00000	0.313308	239.181	316.157	2.42825	97.3988	7.03859	218.594	2121
114509	2021-05-28 17:00:00	VIR075905	268.436	5.094300	5.09430	0.108912	376.829	301.828	1.07766	84.1171	6.69847	266.648	2121
114510	2021-05-29 17:00:00	VIR075905	265.560	0.797445	3.09568	0.010523	793.358	240.517	5.01248	70.8355	2.92842	265.633	2121

751 rows × 13 columns

Mind the copy in assigning the filtered data frame WX to a new one wx! What happens if you don’t copy? (Tip: Try it out and see whether you will see a Warning several cells below!)

In [12]:

wx['ta'].mean()

np.float64(-3.7703614580559255)

In [13]:

wx['ta'].std()

np.float64(6.582577628629118)

In [14]:

wx.loc[(wx['hn24'] > 10) | (wx['rain'] > 2), 'rh'].median()

np.float64(89.947)

In [15]:

wx.loc[~((wx['hn24'] > 10) | (wx['rain'] > 2)), 'rh'].median()
## or equivalently:
# wx.loc[~(wx['hn24'] > 10) & ~(wx['rain'] > 2), 'rh'].median()
# wx.loc[(wx['hn24'] <= 10) & (wx['rain'] <= 2), 'rh'].median()

np.float64(82.03875)

In [16]:

import numpy as np

The following solution matches the Pandas knowledge from this unit,

In [17]:

wx['hn72_check'] = np.nan
for i, val in enumerate(wx['hn24']):
    if i > 3:
        wx.iloc[i, 13] = wx.iloc[i-2:i+1, 3].sum()

whereas this next solution would be my go-to choice. There are a few extra tricks in here: What it comes down to, is that we would need to use the iloc operator for selecting the rows based on the integer count i, but at the same time we want to access the column by its name and not by its location, which requires the loc operator. I’m ultimately going for the loc operator to maximize readability and convenience. To still access the correct rows, however, I need to select wx.index[i] instead of i alone. Why? –> Because wx.index[i] will return the row names at the locations i (see Unit 6).

In [18]:

wx['hn72_check_II'] = np.nan
for i, val in enumerate(wx['hn24']):
    if i > 3:
        wx.loc[wx.index[i], 'hn72_check_II'] = wx.loc[wx.index[i-2]:wx.index[i], 'hn24'].sum()

In [19]:

wx[['hn24', 'hn72', 'hn72_check', 'hn72_check_II']]

	hn24	hn72	hn72_check	hn72_check_II
0	0.000000	0.00000	NaN	NaN
1	0.000000	0.00000	NaN	NaN
2	0.000000	0.00000	NaN	NaN
3	0.000000	0.00000	NaN	NaN
4	0.000000	0.00000	0.000000	0.000000
...	...	...	...	...
114506	0.000000	0.00000	0.000000	0.000000
114507	0.000000	0.00000	0.000000	0.000000
114508	0.000000	0.00000	0.000000	0.000000
114509	5.094300	5.09430	5.094300	5.094300
114510	0.797445	3.09568	5.891745	5.891745

751 rows × 4 columns

hn72 was computed on the hourly time series. Since we compute hn72_check on the daily time series, some values are identical while others are not. In any case, the first 4 entries can not be identical, because we are missing data points that are further in the past to compute the 3-day height of new snow.

In [76]:

wx.to_csv("WX_subset.csv", index=False)

05_tasks_solved.ipynb

#05-04: Explore WX

#05-05: Subset WX and compute more summary stats