import pandas as pd

df = pd.DataFrame(
    {"Country": ["Austria", "Germany", "Switzerland"],
     "Size": [83e3, 257e3, 41e3]
    })
Pandas: Tabular data
Unit 05
Working with tabular data
Pandas is one of the most widely used Python libraries for science, data science, and machine learning. It excels at manipulating tabular data, i.e. the kind of data you would find in a spreadsheet, a CSV file, or a weather station record.
Pandas is a fantastic library: it takes minutes to grasp the fundamentals and years to master, which is what I like most about it ;-).
Getting started with Pandas
For today, I strongly recommend following the tutorials on the Getting started page of the pandas documentation. Make sure you have looked at assignment #5 first, so you know where to pay close attention and where a lighter read will do. Another great idea is to solve the assignment tasks while you work your way through the tutorials. Use the learning checklist below to gauge whether you have covered the ground.
Focus on the following:
Numpy vs. Pandas
You might ask yourself whether you need to know both Numpy and Pandas. The simple answer is yes! Numpy and Pandas are two essential Python libraries for data manipulation and analysis, but they serve slightly different purposes. Numpy focuses on efficient numerical computation and provides multi-dimensional arrays and matrices. It is the foundation of many other libraries, including Pandas. Pandas builds on Numpy’s capabilities by offering data structures that are more intuitive and easier to work with for labeled, tabular data (like Excel sheets). As a result, Pandas excels at handling real-world datasets that require organizing, cleaning, and analyzing diverse data types.
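To make the difference concrete, here is a minimal sketch. The variable names `sizes` and `table` are made up for illustration, and the numbers are simply the example values used on this page. It shows the same data once as a plain Numpy array (position-based access) and once as a Pandas DataFrame (label-based access next to a text column).

```python
import numpy as np
import pandas as pd

# A plain Numpy array: homogeneous numbers, addressed by position only
sizes = np.array([83e3, 257e3, 41e3])
largest_position = int(sizes.argmax())  # position of the largest value

# The same numbers in a Pandas DataFrame, stored next to a text column
table = pd.DataFrame(
    {"Country": ["Austria", "Germany", "Switzerland"],
     "Size": sizes}
)

# Label-based selection: look up the country in the row with the largest size
largest_country = table.loc[table["Size"].idxmax(), "Country"]

print(largest_position)  # 1
print(largest_country)   # Germany
print(table.dtypes)      # each column keeps its own data type
```

With Numpy alone you would have to keep the country names in a separate array and remember that position 1 means Germany. The DataFrame keeps both columns aligned for you.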
In fact, let’s have a look at a Pandas DataFrame and confirm that the underlying arrays are actually Numpy arrays:
df["Size"]
0 83000.0
1 257000.0
2 41000.0
Name: Size, dtype: float64
type(df["Size"])
pandas.core.series.Series
df["Size"].values
array([ 83000., 257000., 41000.])
type(df["Size"].values)
numpy.ndarray
We see that the values attribute of a Pandas Series is actually a Numpy array. However, since the pandas documentation recommends to_numpy() over the values attribute, we will use the to_numpy() method whenever we specifically need to access the underlying Numpy array:
df["Size"].to_numpy()
array([ 83000., 257000., 41000.])
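As a small follow-up sketch (not part of the original example; the names `size`, `size_copy`, and `everything` are hypothetical), the snippet below looks at what `to_numpy()` returns for a single column and for the whole DataFrame, and uses the `copy=True` argument to request an array that is independent of the DataFrame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Country": ["Austria", "Germany", "Switzerland"],
     "Size": [83e3, 257e3, 41e3]}
)

# One column -> 1-D Numpy array with the column's dtype
size = df["Size"].to_numpy()
print(type(size), size.dtype, size.shape)

# copy=True requests an independent array, so editing it leaves df untouched
size_copy = df["Size"].to_numpy(copy=True)
size_copy[0] = 0.0
print(df["Size"].iloc[0])  # 83000.0

# Whole DataFrame -> 2-D array; text and numbers must share a single dtype
everything = df.to_numpy()
print(everything.dtype, everything.shape)
```

The last call illustrates why converting a whole mixed-type table to Numpy is rarely what you want: the per-column data types are lost, so it is usually better to convert only the numeric columns you need.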