import numpy as np

vec1 = np.arange(1, 10)
vec1
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
Unit 08
You have already learned many skills in working with numerical arrays (using NumPy) and tabular data (using Pandas). So far, our NumPy arrays were mainly one-dimensional.
This week is about the analysis of gridded data, and focuses on two-dimensional data manipulation and visualization using both NumPy multi-dimensional arrays and Pandas data frames.
Multidimensional arrays are not fundamentally different from one-dimensional arrays. Let’s start with a one-dimensional array and reshape it into a 2D array.
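A minimal sketch of this reshaping step (the variable name mat1 is taken from the comparison below):

mat1 = vec1.reshape(3, 3)  # turn the 9-element vector into a 3x3 matrix
mat1
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])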
We could also create the 2D array manually, like this:
mat2 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# confirm that both matrices have the same shape and contents:
np.array_equal(mat1, mat2)
True
Note that I constructed the matrix by passing three lists with three entries each to np.array(). Each list forms one row of the matrix.
From the NumPy cheat sheet table of Unit 3, you know that NumPy is able to perform linear algebra tasks. For example, you can compute the determinant of the matrix we just created:
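A sketch of such a call, using NumPy's linear algebra module:

np.linalg.det(mat1)  # the rows are linearly dependent, so the determinant is (numerically close to) zero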
If you indeed have to do linear algebra computations in the future, the numpy.linalg documentation offers all the relevant information about its functionality.
Since we have mainly worked with 1D arrays so far, here is also a quick recap of indexing nd-arrays. The image below illustrates various indexing examples.
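In addition to the illustration, here are a few typical 2D indexing patterns in code, using the 3x3 matrix from above (a sketch for illustration):

mat1[0, 0]      # single element: first row, first column -> 1
mat1[1, :]      # entire second row -> array([4, 5, 6])
mat1[:, 2]      # entire third column -> array([3, 6, 9])
mat1[0:2, 1:3]  # 2x2 sub-matrix from the top-right corner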
Note that indexing works slightly differently for NumPy arrays and Pandas data frames. If you are unsure, read up again on Unit 6.
Multi-dimensional arrays allow you to store data in the shape that is most meaningful for its analysis, display, and storage. Think of the following example, where a quantity like PV potential is stored on a two-dimensional grid of spatial coordinates.
Think of some other examples where data could be represented in such a format, for example involving model parameters, other physical or energy quantities, or variations over time and space.
Gridded data refers to information that is organized in a grid or matrix structure. The grid can be regularly or irregularly spaced.
Let’s put (a few) numbers behind the previous example of the temperature map/grid. In this Section, we focus on a regularly spaced grid:
np.meshgrid
The function np.meshgrid() can be very useful when working with gridded data. Can you guess what it does from the example below?
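A sketch of such a call, assuming coordinate vectors running from -3 to 2 and from -2 to 7 (this reproduces the output below):

x = np.arange(-3, 3)   # 6 values: -3 ... 2
y = np.arange(-2, 8)   # 10 values: -2 ... 7
X, Y = np.meshgrid(x, y)
X, Y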
(array([[-3, -2, -1, 0, 1, 2],
[-3, -2, -1, 0, 1, 2],
[-3, -2, -1, 0, 1, 2],
[-3, -2, -1, 0, 1, 2],
[-3, -2, -1, 0, 1, 2],
[-3, -2, -1, 0, 1, 2],
[-3, -2, -1, 0, 1, 2],
[-3, -2, -1, 0, 1, 2],
[-3, -2, -1, 0, 1, 2],
[-3, -2, -1, 0, 1, 2]]),
array([[-2, -2, -2, -2, -2, -2],
[-1, -1, -1, -1, -1, -1],
[ 0, 0, 0, 0, 0, 0],
[ 1, 1, 1, 1, 1, 1],
[ 2, 2, 2, 2, 2, 2],
[ 3, 3, 3, 3, 3, 3],
[ 4, 4, 4, 4, 4, 4],
[ 5, 5, 5, 5, 5, 5],
[ 6, 6, 6, 6, 6, 6],
[ 7, 7, 7, 7, 7, 7]]))
Let’s say we have a formula that computes temperature or any other quantity z of interest based on our grid coordinates. Then, we can easily compute the field Z, for example:
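One formula that reproduces the output below (an assumption; the exact formula used originally is not shown) is Z = X² + Y:

Z = X**2 + Y   # element-wise: every grid point gets its own value
Z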
array([[ 7, 2, -1, -2, -1, 2],
[ 8, 3, 0, -1, 0, 3],
[ 9, 4, 1, 0, 1, 4],
[10, 5, 2, 1, 2, 5],
[11, 6, 3, 2, 3, 6],
[12, 7, 4, 3, 4, 7],
[13, 8, 5, 4, 5, 8],
[14, 9, 6, 5, 6, 9],
[15, 10, 7, 6, 7, 10],
[16, 11, 8, 7, 8, 11]])
So, now we have computed a gridded 2D data set Z with the shape (len(y), len(x)) from the coordinates x and y. X and Y have the same shape as Z and are called coordinate grids.
You could easily add more dimensions to our example, like time:
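A sketch of what adding a time dimension could look like (the time axis t and the time dependence of the field are assumptions for illustration; the grids are named X3, Y3, T3, Z3 here so the 2D field Z used for plotting below stays intact):

t = np.arange(0, 4)                  # a hypothetical time axis with 4 steps
X3, Y3, T3 = np.meshgrid(x, y, t)    # three 3D coordinate grids of shape (10, 6, 4)
Z3 = X3**2 + Y3 + T3                 # e.g. let the field grow linearly in time
Z3.shape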
Now, each element of Z corresponds to one unique combination of x, y, t.
While we are not limited in adding more dimensions to our data set, visualizing these data sets is limited to a few dimensions. In fact, there is plenty of cognitive visualization design research telling us that humans are bad at interpreting figures with more than 2 dimensions. So we will stick with 2D plots, but those can be very exciting!
I encourage you to go back to the overview page of matplotlib’s plot types to check out your options for gridded data sets. In the following we will look at the two most important types.
axes.pcolormesh
pcolormesh fills each grid cell with a color as defined by the values in Z:
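A minimal sketch of such a plot (the import of matplotlib.pyplot as plt is included here for completeness; the later cells assume it is already available):

import matplotlib.pyplot as plt

f, ax = plt.subplots()
pc = ax.pcolormesh(X, Y, Z)
f.colorbar(pc)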
<matplotlib.colorbar.Colorbar at 0x7f24d6b34850>
You can call pcolormesh either with the grid coordinates X and Y, with the coordinate vectors x and y, or without coordinates at all. Note however that calling ax.pcolormesh(Z) will result in a plot that does not know how to properly label the coordinates. Try it out to see what I mean.
To change the colormap or the mapping between values and colors you can play with the following parameters:
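A sketch of such a call; cmap, vmin, and vmax are standard pcolormesh parameters, and the specific values used here are assumptions:

f, ax = plt.subplots()
pc = ax.pcolormesh(X, Y, Z, cmap="Reds_r", vmin=0, vmax=15)
f.colorbar(pc)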
<matplotlib.colorbar.Colorbar at 0x7f24d6bdb9d0>
cmap="Reds_r"
. What does it do?contour
and contourf
Contours are leaving the “pixel space” to make smooth contour fields around the data points. Note that while polormesh
displays the (raw) data as is, contour plots create a continuous representation of the data by interpolating between the data points. Both plot types have their use cases and meaning, but can not always be used interchangingly without distorting the meaning of the figure.
Let’s first create a contour line plot that can be annotated with inline text
f, ax = plt.subplots()
c = ax.contour(X, Y, Z, levels=20, cmap="Reds_r")
ax.clabel(c, inline=True, fontsize=10)
<a list of 21 text.Text objects>
and a filled contour plot
f, ax = plt.subplots()
cf = ax.contourf(X, Y, Z, levels=np.arange(0, 15), cmap='Reds_r', extend='both')
f.colorbar(cf)
<matplotlib.colorbar.Colorbar at 0x7f24d52f39d0>
contour and contourf can be called analogously to pcolormesh:
- levels controls the number of levels, and also their locations if provided as a vector.
- extend can be used to indicate whether there are values in the data set that extend beyond the limits of the colorbar (set it to ‘max’, ‘min’, ‘both’, or ‘neither’).

You will not always compute gridded data from scratch like we did in the previous example. Sometimes, you will read data from a csv file and would like to plot 2D data. In this case, some data manipulation is required. Let’s look into a simple example.
The following data frame represents the data you read from your csv file. It contains 2D gridded data in a long format. That means you get two coordinate columns x and y, and a data column z. Take a few moments to read through the rows and understand the data format:
|    | x | y  | z  |
|----|---|----|----|
| 0  | 1 | -1 | 0  |
| 1  | 2 | -1 | 3  |
| 2  | 3 | -1 | 8  |
| 3  | 1 | 0  | 1  |
| 4  | 2 | 0  | 4  |
| 5  | 3 | 0  | 9  |
| 6  | 1 | 1  | 2  |
| 7  | 2 | 1  | 5  |
| 8  | 3 | 1  | 10 |
| 9  | 1 | 2  | 3  |
| 10 | 2 | 2  | 6  |
| 11 | 3 | 2  | 11 |
Now, unfortunately we cannot simply plot a 2D field from this data format, because the plotting functions we just learned about require the data in a matrix format, like the grid data Z from the previous Section. We need to change the long data frame format to a wide format, which is the term used in data frame context for matrix-like data representation. Let me show you how to reshape the data frame and you will see what I mean:
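A sketch of this reshaping step, assuming the long table above lives in a data frame named df_long (the name used further below); pivot() turns the coordinate columns into row and column labels:

import pandas as pd

# re-create the long-format table from above (in practice you would read it from a csv file)
df_long = pd.DataFrame({
    "x": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    "y": [-1, -1, -1, 0, 0, 0, 1, 1, 1, 2, 2, 2],
    "z": [0, 3, 8, 1, 4, 9, 2, 5, 10, 3, 6, 11],
})

# long -> wide: y becomes the index, x the columns, z the cell values
df_wide = df_long.pivot(index="y", columns="x", values="z")
df_wide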
| x  | 1 | 2 | 3  |
|----|---|---|----|
| y  |   |   |    |
| -1 | 0 | 3 | 8  |
| 0  | 1 | 4 | 9  |
| 1  | 2 | 5 | 10 |
| 2  | 3 | 6 | 11 |
While our long data frame had three columns (2 coordinates, 1 data) with 12 rows, our wide data frame looks like a (4x3) matrix (= 12 data elements) where the coordinates are the row names (i.e., index) and column names. From this format, we can call the plotting functions:
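For example with contourf, analogous to the transposed example further below (df_wide is the frame created in the sketch above):

f, ax = plt.subplots()
cf = ax.contourf(df_wide.columns, df_wide.index, df_wide)
f.colorbar(cf)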
<matplotlib.colorbar.Colorbar at 0x7f24d526ef90>
If you want to plot the coordinates on the opposite axes, you can pivot your data frame the other way:
df_wide_T = df_long.pivot(index='x', columns='y', values='z')
f, ax = plt.subplots()
cf = ax.contourf(df_wide_T.columns, df_wide_T.index, df_wide_T)
f.colorbar(cf)
<matplotlib.colorbar.Colorbar at 0x7f24d4ecb9d0>
Together with the official pandas tutorial on pivoting data frames (how to reshape the layout of tables), this should give you a pretty good understanding of how to plot 2D data from tabular sources. Note that the pandas tutorial also covers the opposite task, reshaping wide data frames to long formats using pd.DataFrame.melt().
Let’s go back from plotting 2D surfaces to standard plots, for example line plots or box plots. You have learned already that an individual component or element of a figure, such as one line, one box, etc., is referred to as an artist. You can create a figure with multiple artists very easily if your data is stored in a wide format: each column will result in a different artist. Consider this example:
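A minimal sketch, reusing the df_wide frame from above; pandas draws one line per column against the index:

ax = df_wide.plot()   # one line (artist) per column, x-axis = the index (our y coordinate)
ax.set_xlabel("y")
ax.set_ylabel("z")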
I find this functionality particularly useful for quick working plots to understand my data. Hence, the example is kept very simple, but of course you could also create a styled plot from this approach.
It might happen at some point that you pivot a data frame and find out that almost all elements are NaNs. What happened?
Sometimes an example says more than many words; consider this long-format data frame that I will pivot:
|   | x | y  | z  |
|---|---|----|----|
| 0 | 1 | -1 | 0  |
| 1 | 2 | 0  | 4  |
| 2 | 3 | 1  | 10 |
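The pivot call works exactly as before; a sketch, assuming the small frame is called df_small:

df_small = pd.DataFrame({"x": [1, 2, 3], "y": [-1, 0, 1], "z": [0, 4, 10]})
df_small.pivot(index="y", columns="x", values="z")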
| x  | 1   | 2   | 3    |
|----|-----|-----|------|
| y  |     |     |      |
| -1 | 0.0 | NaN | NaN  |
| 0  | NaN | 4.0 | NaN  |
| 1  | NaN | NaN | 10.0 |
The data was indeed in a long format, but it did not specify a complete 2D grid. It contained 3 data elements and we created a (3x3) matrix from it, so the data ends up along the diagonal and all other cells are filled with NaNs.
Imagine your data set contains measurements. For specific combinations of your coordinates x and y, you took several measurements to reduce potential error. Your data set could look like this:
|   | x | y  | z  |
|---|---|----|----|
| 0 | 1 | -1 | 0  |
| 1 | 2 | 0  | 4  |
| 2 | 3 | 1  | 99 |
| 3 | 3 | 1  | 10 |
Well, it turns out that you won’t be able to pivot your data frame at all; instead, you get the error message “Index contains duplicate entries, cannot reshape”. At this point you know already that the index of a data frame needs to be unique, so this error message should not surprise us all too much. In this case, what you likely want to do is pivot your data frame and, in case there are several measurements, apply some sort of aggregating function. You can achieve that task with pivot_table:
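A sketch, assuming the table with the duplicate measurement above is stored in a frame called df_dup; pivot_table aggregates duplicate entries, by default with the mean:

df_dup = pd.DataFrame({"x": [1, 2, 3, 3], "y": [-1, 0, 1, 1], "z": [0, 4, 99, 10]})

# the duplicate (x=3, y=1) entries are averaged: (99 + 10) / 2 = 54.5
df_dup.pivot_table(index="y", columns="x", values="z", aggfunc="mean")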
| x  | 1   | 2   | 3    |
|----|-----|-----|------|
| y  |     |     |      |
| -1 | 0.0 | NaN | NaN  |
| 0  | NaN | 4.0 | NaN  |
| 1  | NaN | NaN | 54.5 |
Note that I demonstrated this concept with only a few data elements to make the point as obvious as possible. Typically, your wide data frames will contain more data.
The pivot function is super useful in the context of plotting 2D grids from data frames or for plotting multiple artists from your data set. Ideally, the data content is already suited as is and you just reshape the data format. As soon as you start using pivot_table, you start to aggregate your data content as well. In the use case I showed you above, this is totally fine and valid. In more complex situations, you should be aware that pivot_table is sort of a special case of the more general and flexible groupby().agg() approach. If you cannot remember what that was about, you should go back to the pandas tutorial on how to calculate summary statistics and read up on split-apply-combine patterns.
The following external links are referenced in the text above:
- the numpy.linalg documentation
- matplotlib’s overview of plot types
- the pandas tutorial on reshaping the layout of tables (pivot and melt)
- the pandas tutorial on how to calculate summary statistics (groupby().agg())