Logical Indexing

Unit 04

From loops to vectorization

Computers excel at repetitive tasks. Since we humans get bored doing the same task over and over again, we can be happy to have learned about loops. Any task that needs to be applied to many numbers can be carried out with a loop, which applies the task to every number, one after another. While computers can do that (and won’t complain about getting bored), even computers take a lot longer to repeat a task for individual numbers than to apply it to many numbers simultaneously. The concept of applying one task to many numbers without using loops is called vectorization. We briefly heard about it during Workshop 1 when we introduced numpy.

In fact, I need to revisit the statement that I just made. I like the metaphor, but it is not strictly correct. To achieve vectorization, Python usually does not apply one task to multiple numbers at the same time. Instead, it relies on code written in another, lower-level programming language that is harder to write but often orders of magnitude faster to execute.
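To make this concrete, here is a minimal sketch that squares the same numbers once with a loop and once in vectorized form. The array size and the variable names are my own choices for illustration; any elementwise operation would work the same way.

```python
import numpy as np

values = np.arange(10_000)

# Loop version: Python handles every number one after another
squares_loop = np.empty_like(values)
for i, v in enumerate(values):
    squares_loop[i] = v * v

# Vectorized version: one expression, the loop runs inside numpy
squares_vec = values ** 2

# Both approaches give the same result
print(np.array_equal(squares_loop, squares_vec))  # True
```

If you want to see the speed difference yourself, you could time both versions, for example with the timeit module.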

Well, vectorization sounds great, so why do we even bother with loops any longer? Not all tasks can be achieved with vectorization, and sometimes they can, but only with complex logic that is (a) time consuming to come up with and (b) challenging to understand for others who want to read your code. In other situations, vectorization is super straightforward and simple. In these situations, your vectorized code will often be a lot easier to write and also easier to read. So, when you write code, my recommendation is to start out with the easiest way of solving your problem. Sometimes the easiest way will be vectorized, sometimes it will involve a loop. Once you have solved your problem, ask yourself: Will the computer have to execute this line of code thousands and thousands of times? If your answer is no, don’t bother to invest more time. If your answer is yes and you coded a loop, think about ways to vectorize your problem.

You will find relevant challenges in this week’s exercise!

Logical indexing as a means of vectorization

So far, we have learned to index and slice sequences based on integers:

numbers = [1, 2, 3]
numbers[1]

A very powerful feature when working with libraries such as numpy (and also pandas, which we will get to soon) is indexing based on logicals, for example:

import numpy as np
numbers = np.array(numbers)
numbers[np.array([True, False, True])]

array([1, 3])

This is particularly useful when using logical expressions to identify whether specific criteria are met for individual sequence elements:

numbers = np.linspace(0, 100, 33)
numbers_gt50 = numbers[numbers > 50]  # logical indexing

print(f"There are {len(numbers)} numbers, {len(numbers_gt50)} of which are > 50")

There are 33 numbers, 16 of which are > 50
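It can help to look at the intermediate step. The following sketch stores the result of the comparison in a variable before using it as an index. The name mask is my own choice for illustration.

```python
import numpy as np

numbers = np.linspace(0, 100, 33)

# The comparison alone returns a boolean array, one entry per element
mask = numbers > 50

print(mask.dtype)                   # bool
print(mask.shape == numbers.shape)  # True
print(mask.sum())                   # 16, because each True counts as 1

# Using the stored mask selects the same elements as numbers[numbers > 50]
numbers_gt50 = numbers[mask]
```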

You can use all the comparison operators (<, <=, >, >=, ==, !=) to create these conditions, and you can combine multiple conditions with the bitwise operators & (and), | (or), and ~ (not). The conditions can also involve other arrays of the same length, like this:

# create two arrays
numbers = np.arange(1, 11)
countdown = np.arange(9, -1, -1)

# filter numbers based on a logical mask
mask = ~(numbers%2 == 0) & (countdown < 5)
selected_numbers = numbers[mask]

print("numbers  countdown  mask")
for n, c, m in zip(numbers, countdown, mask):
    print(f"{n:^7} {c:^10} {m}")

print("Selected Numbers:")
print(selected_numbers)

numbers  countdown  mask
   1        9      False
   2        8      False
   3        7      False
   4        6      False
   5        5      False
   6        4      False
   7        3      True
   8        2      False
   9        1      True
  10        0      False
Selected Numbers:
[7 9]

Logical indexing is a very efficient and fast way to mask and filter arrays and is therefore a key component of most data analyses. If logical indexing were not possible, we would have to build all these masks with for loops and apply the logical condition to each array element individually. Thanks to logical indexing, we can apply the condition in a vectorized form.
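For comparison, here is a sketch of what the loop-based alternative could look like for the numbers and countdown example from above. The names selected_loop and selected_vec are made up for this illustration.

```python
import numpy as np

numbers = np.arange(1, 11)
countdown = np.arange(9, -1, -1)

# Loop version: check the condition for every pair of elements
selected_loop = []
for n, c in zip(numbers, countdown):
    if n % 2 != 0 and c < 5:  # single values, so `and` is fine here
        selected_loop.append(int(n))

# Vectorized version: one mask, applied to the whole array
selected_vec = numbers[(numbers % 2 != 0) & (countdown < 5)]

print(selected_loop)  # [7, 9]
print(selected_vec)   # [7 9]
```

Note that the loop compares single numbers, so the logical operator and works there, while the vectorized version compares whole arrays and needs &.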

Comment about logical and bitwise operators

We already know about the logical operators and, or, not. How are the bitwise operators &, |, ~ different from those? The logical operators (and, or, not) work with single boolean values, i.e., one instance of True or False. The bitwise operators (&, |, ~) operate on the binary level, and when applied to arrays they treat each element separately. When you evaluate conditions on arrays, you typically want a boolean result for each element of the array, and therefore you have to use the bitwise operators. Keep in mind that the bitwise operators bind more tightly than the comparison operators, so each condition needs its own parentheses. Here are a few examples to illustrate their different behaviour. Play with these examples yourself to understand when you can use which operators. Note that some of these examples produce errors. Make sure you understand which expressions are valid and which ones are not by executing them in a console.

numbers = np.arange(10)
numbers%2 == 0                                  # valid
(numbers%2 == 0)                                # valid
~(numbers%2 == 0)                               # valid
(numbers%2 == 0) & (numbers > 5)                # valid
numbers%2 == 0 & numbers > 5                    # Error
(numbers%2 == 0) and (numbers > 5)              # Error
(numbers%2 == 0).any() and (numbers > 5).all()  # valid
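If you want to explore the two failing lines without stopping your script, you could wrap them in try/except. This is only a sketch: the names is_even and is_large are my own, and the printed messages are placeholders.

```python
import numpy as np

numbers = np.arange(10)
is_even = numbers % 2 == 0
is_large = numbers > 5

# `and` needs a single True/False, but an array holds many values
try:
    is_even and is_large
except ValueError as err:
    print("ValueError:", err)

# `&` combines the two masks element by element
print(numbers[is_even & is_large])  # [6 8]

# Without parentheses, & is evaluated before == and >, so Python reads
# numbers % 2 == (0 & numbers) > 5, a chained comparison that uses `and`
try:
    numbers % 2 == 0 & numbers > 5
except ValueError:
    print("parentheses are required around each condition")
```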

Learning checklist

I know what vectorization means in the context of Python programming.
I can use logical (also known as boolean) indexing for subsetting variables.
I can apply logical indexing to arrays, and can chain multiple logical conditions.
I know the difference between logical and bitwise operators and know when to apply them.