Gridded 2D data

Unit 08

You have already learned many skills in working with numerical arrays (using NumPy) and tabular data (using Pandas). So far, our NumPy arrays were mainly one-dimensional.

This week is about the analysis of gridded data, and focuses on two-dimensional data manipulation and visualization using both NumPy multi-dimensional arrays and Pandas data frames.

Recap: NumPy’s ndarray

Multidimensional arrays work just like one-dimensional arrays; they simply have more than one axis. Let’s start with a one-dimensional array and reshape it into a 2D array.

import numpy as np
vec1 = np.arange(1, 10)
vec1

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

mat1 = vec1.reshape(3, 3)
mat1

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

By now, you are already familiar with code that uses method chaining to perform several tasks in one line of code (think back to our previous chapters and exercises using Pandas). So, you could easily rewrite the above statement to

mat1 = np.arange(1, 10).reshape(3, 3)

We could also create the 2d array manually like

mat2 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# confirm that both matrices have the same shape and contents:
np.array_equal(mat1, mat2)

True

Note that I constructed the matrix by passing a list of three lists, each with three entries, to np.array(). Each inner list forms one row of the matrix.

From the NumPy cheat sheet table of Unit 2, you know that NumPy can perform linear algebra tasks. For example, you can compute the determinant of the matrix we just created with

np.linalg.det(mat1)

0.0

If you indeed have to do linear algebra computations in the future, the numpy.linalg documentation provides you with all the relevant information about its functionality.

Since we have mainly worked with 1D arrays, here is also a quick recap of indexing nD arrays. The image below illustrates various indexing examples.

Image taken from the Scientific Python Lectures.
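As a minimal sketch of the most common indexing patterns, here is a small example on an array of my own choosing. The name grid and its values are purely illustrative and are not the matrix from the image:

```python
import numpy as np

# Illustrative 3x4 array (NOT the matrix shown in the image)
grid = np.arange(12).reshape(3, 4)

single = grid[1, 2]           # one element: row 1, column 2
row = grid[0]                 # first row (same as grid[0, :])
col = grid[:, 1]              # second column
block = grid[1:, :2]          # rows 1 to the end, columns 0 and 1
every_other = grid[::2, ::2]  # every second row and every second column

# Logical (boolean) indexing returns the elements where the mask is True
large = grid[grid > 8]
```

The first index always addresses the rows, the second the columns. Slices return arrays of lower or equal dimension, while the boolean mask returns a flat 1D array.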

Quick Exercise

Construct the matrix from the image yourself in a quick and efficient way
Think of another way we indexed arrays in our lecture: logical indexing. Index all rows whose first element is larger than 20.

Note that indexing works slightly differently for NumPy arrays and Pandas data frames. If you are unsure, read up again on Unit 6.

nD arrays as structured data

Multi-dimensional arrays allow you to store data in the shape that is most meaningful for its analysis, display, and storage. Think of the following examples:

a temperature map of shape (y, x) where y are the latitudes and x the longitudes
a series of temperature maps of shape (h, y, x) where h are different heights above ground (y and x remain lat/lon)
a series of temperature maps of shape (t, h, y, x) where t represents different time steps.
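To make the (t, h, y, x) layout more concrete, here is a small sketch. The grid sizes are hypothetical, and random numbers stand in for real temperature values:

```python
import numpy as np

# Hypothetical grid sizes: 4 time steps, 3 heights, 10 latitudes, 6 longitudes
n_t, n_h, n_y, n_x = 4, 3, 10, 6

# Random numbers as a stand-in for real temperature data
rng = np.random.default_rng(seed=0)
temperature = rng.normal(loc=15.0, scale=5.0, size=(n_t, n_h, n_y, n_x))

first_map = temperature[0, 0]         # 2D map (y, x) at the first time step and height
profile = temperature[0, :, 2, 3]     # vertical profile (h,) at one grid point
time_mean = temperature.mean(axis=0)  # average over time, shape (h, y, x)
```

Because every axis has a fixed meaning, selecting a map, a profile, or an average over one dimension becomes a single indexing or reduction step.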

Think of some other examples where data could be represented in such a format. For example, involving model parameters, other physical or energy quantities, time and space considerations, etc.

Gridded data — An example

Gridded data refers to information that is organized in a grid or matrix structure. The grid can be regularly or irregularly spaced.

Let’s put (a few) numbers behind the previous example of the temperature map/grid. This section focuses on a regularly spaced grid:

x = np.arange(-3, 3)
y = np.arange(-2, 8)

`np.meshgrid`

The function np.meshgrid() can be very useful when working with gridded data. Can you guess what it does from the example below?

X, Y = np.meshgrid(x, y)
X, Y

(array([[-3, -2, -1,  0,  1,  2],
        [-3, -2, -1,  0,  1,  2],
        [-3, -2, -1,  0,  1,  2],
        [-3, -2, -1,  0,  1,  2],
        [-3, -2, -1,  0,  1,  2],
        [-3, -2, -1,  0,  1,  2],
        [-3, -2, -1,  0,  1,  2],
        [-3, -2, -1,  0,  1,  2],
        [-3, -2, -1,  0,  1,  2],
        [-3, -2, -1,  0,  1,  2]]),
 array([[-2, -2, -2, -2, -2, -2],
        [-1, -1, -1, -1, -1, -1],
        [ 0,  0,  0,  0,  0,  0],
        [ 1,  1,  1,  1,  1,  1],
        [ 2,  2,  2,  2,  2,  2],
        [ 3,  3,  3,  3,  3,  3],
        [ 4,  4,  4,  4,  4,  4],
        [ 5,  5,  5,  5,  5,  5],
        [ 6,  6,  6,  6,  6,  6],
        [ 7,  7,  7,  7,  7,  7]]))

Let’s say we have a formula that computes temperature or any other quantity z of interest based on our grid coordinates. Then, we can easily compute the field Z, for example:

Z = X**2 + Y
Z

array([[ 7,  2, -1, -2, -1,  2],
       [ 8,  3,  0, -1,  0,  3],
       [ 9,  4,  1,  0,  1,  4],
       [10,  5,  2,  1,  2,  5],
       [11,  6,  3,  2,  3,  6],
       [12,  7,  4,  3,  4,  7],
       [13,  8,  5,  4,  5,  8],
       [14,  9,  6,  5,  6,  9],
       [15, 10,  7,  6,  7, 10],
       [16, 11,  8,  7,  8, 11]])

So, we have now computed a gridded 2D data set Z with the shape (len(y), len(x)) from the coordinates x and y: the rows follow y and the columns follow x. X and Y have the same shape as Z and are called coordinate grids.
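A quick way to convince yourself of the shapes and of what np.meshgrid() does is to check them directly. This is a small sketch that repeats the definitions from above so that it runs on its own:

```python
import numpy as np

x = np.arange(-3, 3)   # 6 values
y = np.arange(-2, 8)   # 10 values
X, Y = np.meshgrid(x, y)
Z = X**2 + Y

print(X.shape, Y.shape, Z.shape)   # (10, 6) (10, 6) (10, 6)

# Every row of X is a copy of x, every column of Y is a copy of y
rows_repeat_x = bool((X == x).all())
cols_repeat_y = bool((Y == y[:, np.newaxis]).all())

# Element [i, j] of Z belongs to the coordinate pair (x[j], y[i])
i, j = 4, 1
same_value = bool(Z[i, j] == x[j] ** 2 + y[i])
```

Note the index order: the first index of Z selects the y coordinate and the second index selects the x coordinate.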

You could easily add more dimensions to our example, like time. The new variable names keep the 2D arrays X, Y, and Z available for the plots below:

t = np.arange(1, 3)                  # two time step indices
T3, X3, Y3 = np.meshgrid(t, x, y)    # new 3D coordinate grids
Z3 = (X3**2 + Y3) / T3               # new 3D data set

Now, each element of Z3 corresponds to one unique combination of x, y, t.
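One detail worth knowing: by default, np.meshgrid() uses indexing="xy", which swaps the first two axes, so np.meshgrid(t, x, y) returns arrays of shape (len(x), len(t), len(y)). If you prefer the axes in the order in which you pass the coordinates, for example (t, y, x), you can use indexing="ij". A small sketch (the variable names with the suffix _ij are my own):

```python
import numpy as np

t = np.arange(1, 3)
x = np.arange(-3, 3)
y = np.arange(-2, 8)

# indexing="ij" keeps the output axes in the order of the inputs: (t, y, x)
T_ij, Y_ij, X_ij = np.meshgrid(t, y, x, indexing="ij")
Z_ij = (X_ij**2 + Y_ij) / T_ij

print(Z_ij.shape)        # (2, 10, 6)
first_step = Z_ij[0]     # 2D field (y, x) at the first time step
```

With this layout, Z_ij[0] is a regular (y, x) map that can be passed directly to the 2D plotting functions.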

Visualizing 2D gridded data

While nothing stops us from adding more dimensions to our data set, visualizing it is limited to a few dimensions. In fact, plenty of research on visualization design tells us that humans are bad at interpreting figures that have more than 2 dimensions. So we will stick with 2D plots, but those can be very exciting!

I encourage you to go back to the overview page of matplotlib’s plot types to check out your options for gridded data sets. In the following we will look at the two most important types.

`axes.pcolormesh`

pcolormesh fills each grid cell with a color as defined by the values in Z:

import matplotlib.pyplot as plt
f, ax = plt.subplots()
im = ax.pcolormesh(X, Y, Z)
f.colorbar(im)

<matplotlib.colorbar.Colorbar at 0x7f6cd6dc07d0>

You can call pcolormesh either with the grid coordinates X and Y, with the coordinate vectors x and y, or without coordinates at all. Note however that calling ax.pcolormesh(Z) will result in a plot that does not know how to properly label the coordinates. Try it out to see what I mean.
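The three call variants can be compared side by side. This is a sketch that rebuilds the 2D example so that it runs on its own; the subplot layout and titles are my own choice:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(-3, 3)
y = np.arange(-2, 8)
X, Y = np.meshgrid(x, y)
Z = X**2 + Y

f, axes = plt.subplots(1, 3, figsize=(12, 3))

im_grid = axes[0].pcolormesh(X, Y, Z)   # coordinate grids
axes[0].set_title("pcolormesh(X, Y, Z)")

im_vec = axes[1].pcolormesh(x, y, Z)    # coordinate vectors
axes[1].set_title("pcolormesh(x, y, Z)")

im_none = axes[2].pcolormesh(Z)         # no coordinates: axes show array indices
axes[2].set_title("pcolormesh(Z)")

f.colorbar(im_none, ax=axes)            # one shared colorbar for all panels
```

The first two panels are identical, while the third one labels the axes with row and column indices instead of the actual coordinates.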

To change the colormap or the mapping between values and colors you can play with the following parameters:

f, ax = plt.subplots()
im = ax.pcolormesh(X, Y, Z, cmap="Reds", vmin=0, vmax=12)
f.colorbar(im)

<matplotlib.colorbar.Colorbar at 0x7f6cd5500210>

Try cmap="Reds_r". What does it do?
Check out the different colormaps shown in the documentation of colormaps.

`contour` and `contourf`

Contour plots leave the “pixel space” and draw smooth contour fields around the data points. Note that while pcolormesh displays the (raw) data as is, contour plots create a continuous representation of the data by interpolating between the data points. Both plot types have their use cases, but they cannot always be used interchangeably without distorting the meaning of the figure.

Let’s first create a contour line plot that can be annotated with inline text

f, ax = plt.subplots()
c = ax.contour(X, Y, Z, levels=20, cmap="Reds_r")
ax.clabel(c, inline=True, fontsize=10)

<a list of 21 text.Text objects>

and a filled contour plot

f, ax = plt.subplots()
cf = ax.contourf(X, Y, Z, levels=np.arange(0, 15), cmap='Reds_r', extend='both')
f.colorbar(cf)

<matplotlib.colorbar.Colorbar at 0x7f6cd542a290>

contour and contourf can be called analogously to pcolormesh but have different parameters that control the appearance of the plot:
- levels controls the number of levels (if given as an integer) or their exact locations (if given as a vector).
- extend indicates whether there are values in the data set that extend beyond the limits of the colorbar (set it to ‘max’, ‘min’, ‘both’, or ‘neither’).
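The difference between the two ways of specifying levels can be sketched as follows. The example rebuilds the 2D field so that it runs on its own, and the chosen level values are arbitrary:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(-3, 3)
y = np.arange(-2, 8)
X, Y = np.meshgrid(x, y)
Z = X**2 + Y                     # values range from -2 to 16

f, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# Integer: matplotlib chooses "nice" level locations on its own
cf_auto = ax_left.contourf(X, Y, Z, levels=5)
f.colorbar(cf_auto, ax=ax_left)

# Vector: levels sit exactly at the given values. Since Z goes below 0 and
# above 12, extend="both" marks these out-of-range values on the colorbar
cf_fixed = ax_right.contourf(X, Y, Z, levels=np.arange(0, 13, 3), extend="both")
f.colorbar(cf_fixed, ax=ax_right)

print(cf_auto.levels)    # levels chosen by matplotlib
print(cf_fixed.levels)   # the levels that were passed in
```

Passing a vector is usually the better choice when several figures should share the same color scale.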

Gridded 2D data in Pandas: Wide frames

When you analyze gridded data, you will not always compute it from scratch like we did in the previous example. Sometimes, you will read data from a csv file and would like to plot it as a 2D field. In this case some data manipulation is required. Let’s look into a simple example.

The following data frame represents the data you read from your csv file. It contains 2D gridded data in a long format. That means you get two coordinate columns x and y, and a data column z. Take a few moments to read through the rows and understand the data format:

	x	y	z
0	1	-1	0
1	2	-1	3
2	3	-1	8
3	1	0	1
4	2	0	4
5	3	0	9
6	1	1	2
7	2	1	5
8	3	1	10
9	1	2	3
10	2	2	6
11	3	2	11
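If you want to follow along without a csv file, one way to recreate this data frame is sketched below. Reading the numbers off the table, the z values follow z = x**2 + y; that is my observation from the table, not necessarily how the data was originally produced:

```python
import pandas as pd

# Coordinates of a grid with 3 x values and 4 y values, written out row by row
df_long = pd.DataFrame({
    "x": [1, 2, 3] * 4,
    "y": [-1] * 3 + [0] * 3 + [1] * 3 + [2] * 3,
})

# The z column of the table matches z = x**2 + y (read off the table above)
df_long["z"] = df_long["x"] ** 2 + df_long["y"]
```

In practice, df_long would come from a call like pd.read_csv() on your own file.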

Now, unfortunately we cannot simply plot a 2D field from this data format, because the plotting functions we just learned about require the data in a matrix format, like the grid data Z from the previous Section. We need to change the long data frame format to a wide format, which is the term used in data frame context for matrix-like data representation. Let me show you how to reshape the data frame and you will see what I mean:

df_wide = df_long.pivot(index='y', columns='x', values='z')
df_wide

x	1	2	3
y
-1	0	3	8
0	1	4	9
1	2	5	10
2	3	6	11

While our long data frame had three columns (2 coordinates, 1 data) with 12 rows, our wide data frame looks like a (4x3) matrix (= 12 data elements) where the coordinates are the row names (i.e., index) and column names. From this format, we can call the plotting functions:

f, ax = plt.subplots()
cf = ax.contourf(df_wide.columns, df_wide.index, df_wide)
f.colorbar(cf)

<matplotlib.colorbar.Colorbar at 0x7f6cd5318d10>

If you want to swap the coordinates on the plot axes, you can pivot your data frame the other way:

df_wide_T = df_long.pivot(index='x', columns='y', values='z')
f, ax = plt.subplots()
cf = ax.contourf(df_wide_T.columns, df_wide_T.index, df_wide_T)
f.colorbar(cf)

<matplotlib.colorbar.Colorbar at 0x7f6cd5150850>

Together with the official pandas tutorial on pivoting data frames (“How to reshape the layout of tables”), this should give you a pretty good understanding of how to plot 2D data from tabular sources. Note that the pandas tutorial also covers reshaping wide data frames to long formats using pd.DataFrame.melt(), i.e., the opposite task.
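For completeness, here is a minimal sketch of that opposite task. It rebuilds the example data under the assumption that z = x**2 + y (read off the table above), pivots it to a wide frame, and melts it back:

```python
import pandas as pd

# Rebuild the long example data (assumption: z = x**2 + y, as in the table)
df_long = pd.DataFrame({
    "x": [1, 2, 3] * 4,
    "y": [-1] * 3 + [0] * 3 + [1] * 3 + [2] * 3,
})
df_long["z"] = df_long["x"] ** 2 + df_long["y"]
df_wide = df_long.pivot(index="y", columns="x", values="z")

# Turn the index y back into a column, then stack the x columns into rows
df_back = df_wide.reset_index().melt(id_vars="y", var_name="x", value_name="z")
df_back["x"] = df_back["x"].astype(int)   # former column labels are values again

# Same rows as df_long once sorted and put into the same column order
df_back = df_back.sort_values(["y", "x"]).reset_index(drop=True)[["x", "y", "z"]]
```

After sorting, df_back contains the same 12 rows as df_long, so no information is lost by switching between the two formats.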

Other tricks with wide data frames

Let’s go back from plotting 2D surfaces to standard plots, for example line plots or box plots. You have learned already that an individual component or element of a figure, such as one line, one box, etc., is referred to as an artist. You can create a figure with multiple artists very easily if your data is stored in a wide format. Each column will result in a different artist. Consider this:

df_wide.plot(marker='o')

<Axes: xlabel='y'>

I find this functionality particularly useful for quick working plots to understand my data. Hence, the example is kept very simple, but of course you could also create a styled plot from this approach.

Special cases

Missing data values

It might happen at some point that you pivot a data frame and find out that almost all elements are NaNs. What happened?

Sometimes an example tells more than many words. Consider this long-format data frame that I will pivot:

	x	y	z
0	1	-1	0
1	2	0	4
2	3	1	10

df_long.pivot(index='y', columns='x', values='z')

x	1	2	3
y
-1	0.0	NaN	NaN
0	NaN	4.0	NaN
1	NaN	NaN	10.0

The data was indeed in a long format, but it did not specify a complete 2D grid. It contained 3 data elements and we created a (3x3) matrix from it. The data is stored along the diagonal.
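A simple way to spot this situation is to compare the number of rows with the number of grid cells that the coordinates span. The following sketch uses the three rows from above; the name df_sparse is my own:

```python
import pandas as pd

df_sparse = pd.DataFrame({"x": [1, 2, 3], "y": [-1, 0, 1], "z": [0, 4, 10]})

# A complete grid needs one row for every combination of x and y
n_rows = len(df_sparse)
n_cells = df_sparse["x"].nunique() * df_sparse["y"].nunique()
print(n_rows, n_cells)   # 3 9

# After pivoting, the missing combinations show up as NaN
wide = df_sparse.pivot(index="y", columns="x", values="z")
n_missing = int(wide.isna().sum().sum())
```

Here only 3 of the 9 grid cells are filled, so the wide data frame contains 6 NaNs.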

Multiple measurements

Imagine your data set contains measurements. For specific combinations of your coordinates x and y you took several measurements to reduce potential error. Your data set could look like this:

	x	y	z
0	1	-1	0
1	2	0	4
2	3	1	99
3	3	1	10

Well, it turns out that you won’t be able to pivot this data frame. Instead, you get the error message “Index contains duplicate entries, cannot reshape”. This should not surprise us all too much: every combination of y and x ends up as exactly one cell of the wide data frame, so each combination may appear only once in the long data. In this case, what you likely want to do is pivot your data frame and apply some sort of aggregating function wherever there are several measurements. You can achieve that with pivot_table:

df_long.pivot_table(index='y', columns='x', values='z', aggfunc='mean')

x	1	2	3
y
-1	0.0	NaN	NaN
0	NaN	4.0	NaN
1	NaN	NaN	54.5

Note that I demonstrated this concept with few data elements to make the point as obvious as possible. Optimally, your wide data frame has more data in it.

More general and flexible approach of grouping data

The pivot function is super useful in the context of plotting 2D grids from data frames or for plotting multiple artists from your data set. Ideally, the data content is already suited as is and you just reshape the data format. As soon as you start using pivot_table, you start to aggregate your data content as well. In the use case I showed you above, this is totally fine and valid. In more complex situations, you should be aware that pivot_table is sort of a special case of the more general and flexible groupby().agg() approach. If you cannot remember what that was about, you should go back to the pandas tutorial on how to calculate summary statistics and read up on split-apply-combine patterns.
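To illustrate the relation, the pivot_table result for the multiple-measurements example can be reproduced with a groupby. This is a sketch; the names df_multi, via_pivot, and via_groupby are my own:

```python
import pandas as pd

# The multiple-measurements example from above
df_multi = pd.DataFrame({
    "x": [1, 2, 3, 3],
    "y": [-1, 0, 1, 1],
    "z": [0, 4, 99, 10],
})

via_pivot = df_multi.pivot_table(index="y", columns="x", values="z", aggfunc="mean")

# Split by (y, x), apply the mean, then move x from the index into the columns
via_groupby = df_multi.groupby(["y", "x"])["z"].agg("mean").unstack("x")
```

Both data frames hold the same values, including the averaged cell 54.5 and the NaNs for missing combinations. The groupby version is more verbose, but it lets you use several grouping columns or several aggregation functions without reshaping right away.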

External resources

The following external links are referenced in the text above:

numpy.linalg documentation
recap: overview page of matplotlib’s plot types
documentation of colormaps
recap: How to reshape the layout of tables
recap: How to calculate summary statistics

Learning checklist

I can work with multidimensional numpy arrays.
I can create 2D plots from gridded arrays.
I can translate long tables into wide tables and in doing so create 2D plots from tabular data as well.
I can plot several artists from one DataFrame plot command.
I know that pivoting tables is a special case of the more general split-apply-combine pattern employed by .groupby().agg().