Module 06

So you want to use pandas and matplotlib

Good choice. A surprising amount of research progress starts with “load the table, check the columns, and plot the thing before you overthink it.”

Estimated time: 45 to 60 min
Prerequisite: Jupyter notebooks

Load a table

import pandas as pd

data = pd.read_csv("../data/morphological_measurements.csv")
data.head()

The first thing to do is almost never modeling. It is checking whether the table looks sensible.
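One way to make "looks sensible" concrete is a short checklist: size, column types, missing values, and duplicated rows. The sketch below runs on a small made-up table so it works without the CSV file; `demo` and its values are hypothetical stand-ins, not the real measurements.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the measurements table; values are made up.
demo = pd.DataFrame({
    "Area": [120.5, 98.0, np.nan, 143.2],
    "Perimeter": [41.0, 37.5, 39.8, 45.1],
    "Cell Cycle Phase": ["G1", "S", "G1", "G2"],
})

print(demo.shape)    # (rows, columns)
print(demo.dtypes)   # measurement columns should be numeric, not object

# Count missing entries in each column.
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Count rows that are exact copies of an earlier row.
n_duplicates = demo.duplicated().sum()
print(n_duplicates)
```

The same four checks can be run on `data` once the real file is loaded.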

Inspect useful columns

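print(data.columns)                 # column names
print(data.shape)                   # (rows, columns)
print(data["Area"].mean())          # mean of one numeric column
print(data["Perimeter"].max())      # largest value in another
data[["Area", "Perimeter"]].head()  # first rows of the selected columns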

Make quick plots

import matplotlib.pyplot as plt

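# Phase is a label column, so count the categories instead of binning them.
data["Cell Cycle Phase"].value_counts().plot(kind="bar")
plt.title("Cell cycle phase counts")
plt.tight_layout()
plt.show()

data.plot.scatter(x="Area", y="Perimeter")
plt.title("Area vs Perimeter")
plt.tight_layout()
plt.show()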
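If you would rather see both quick plots side by side, one option is to create the axes yourself and pass them to pandas. This is a minimal sketch on a made-up table; `demo`, `ax_counts`, and `ax_scatter` are illustrative names, and the values are not real measurements.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in table; the values are made up for illustration.
demo = pd.DataFrame({
    "Area": [100.0, 120.0, 150.0, 170.0, 200.0],
    "Perimeter": [36.0, 40.0, 44.0, 47.0, 51.0],
    "Cell Cycle Phase": ["G1", "G1", "S", "S", "G2"],
})

# One figure with two axes, so each plot has an explicit home.
fig, (ax_counts, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Left panel: number of rows per phase label.
phase_counts = demo["Cell Cycle Phase"].value_counts()
phase_counts.plot(kind="bar", ax=ax_counts)
ax_counts.set_title("Cell cycle phase counts")

# Right panel: relationship between two measurements.
demo.plot.scatter(x="Area", y="Perimeter", ax=ax_scatter)
ax_scatter.set_title("Area vs Perimeter")

fig.tight_layout()
```

Passing `ax=` keeps each title attached to the plot it describes, which is easy to get wrong when several plots share one notebook cell.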

Group and summarize

summary = data.groupby("Cell Cycle Phase")[["Area", "Perimeter"]].mean()
print(summary)
summary.plot(kind="bar")
plt.ylabel("Mean value")
plt.tight_layout()
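A group mean says nothing about how many rows went into it. A small extension, sketched here on a made-up table, is to ask for the count and spread alongside the mean; `demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in table; the values are made up for illustration.
demo = pd.DataFrame({
    "Area": [100.0, 120.0, 150.0, 170.0, 200.0],
    "Cell Cycle Phase": ["G1", "G1", "S", "S", "G2"],
})

# Several statistics per group in one call.
stats = demo.groupby("Cell Cycle Phase")["Area"].agg(["count", "mean", "std"])
print(stats)
```

In this toy table the G2 group has a single row, so its standard deviation comes out as NaN. That is a useful warning that the G2 mean rests on one measurement.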

The habit worth building

Use pandas and matplotlib to ask the boring, high-value questions first: what is in the table, what is missing, what looks strange, and what changes across conditions?
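For the "what looks strange" question, a rough first pass might combine a physical plausibility check with a simple spread-based rule. The sketch below uses the common 1.5 × IQR rule as one possible choice, not as the recommended threshold for this dataset; `demo` and its values are made up.

```python
import pandas as pd

# Hypothetical stand-in column with two deliberately odd values.
demo = pd.DataFrame({"Area": [110.0, 115.0, 120.0, 125.0, 130.0, -5.0, 900.0]})

# An area cannot be zero or negative, so flag those rows directly.
non_positive = demo[demo["Area"] <= 0]

# Flag values far outside the middle half of the data (1.5 x IQR rule).
q1 = demo["Area"].quantile(0.25)
q3 = demo["Area"].quantile(0.75)
iqr = q3 - q1
outside = demo[(demo["Area"] < q1 - 1.5 * iqr) | (demo["Area"] > q3 + 1.5 * iqr)]

print(non_positive)
print(outside)
```

Flagged rows are candidates to look at, not rows to delete automatically; some of them will be real biology.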

Exercises

  1. Load the CSV file and print the first five rows.
  2. Plot a histogram of one numeric column.
  3. Make a scatter plot of two measurements.
  4. Group by one label column and compute a summary statistic.
  5. Write one sentence on what you think is worth checking next.