Module 06
So you want to use pandas and matplotlib
Good choice. A surprising amount of research progress starts with “load the table, check the columns, and plot the thing before you overthink it.”
Load a table
import pandas as pd
# The path is relative to the current working directory.
data = pd.read_csv("../data/morphological_measurements.csv")
# Preview the first five rows.
data.head()
The first thing to do is almost never modeling. It is checking whether the table looks sensible.
Inspect useful columns
print(data.columns)  # column names
print(data.shape)  # (number of rows, number of columns)
print(data["Area"].mean())
print(data["Perimeter"].max())
data[["Area", "Perimeter"]].head()  # preview only the two measurement columns
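A quick way to check whether the table looks sensible is to look at column types, missing values, and basic statistics. The sketch below uses a small made-up table in place of the real CSV. Only the column names are borrowed from this module, and the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real CSV; the values are illustrative only.
data = pd.DataFrame({
    "Area": [120.5, 98.2, np.nan, 143.0],
    "Perimeter": [42.1, 38.7, 40.3, np.nan],
    "Cell Cycle Phase": ["G1", "S", "G2", None],
})

# Column types: measurements should be numeric, not text.
print(data.dtypes)

# Number of missing values in each column.
missing = data.isna().sum()
print(missing)

# Count, mean, spread, and range for the numeric columns.
stats = data.describe()
print(stats)
```

If a measurement column turns up as text, or a column is mostly missing, that is worth sorting out before any plotting.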
Make quick plots
import matplotlib.pyplot as plt
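# "Cell Cycle Phase" is a label column, so count rows per label and draw bars
# instead of binning it with hist().
data["Cell Cycle Phase"].value_counts().plot(kind="bar")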
plt.title("Cell cycle phase counts")
plt.tight_layout()
data.plot.scatter(x="Area", y="Perimeter")
plt.title("Area vs Perimeter")
plt.tight_layout()
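For a numeric column, a histogram shows the shape of the distribution. Here is a minimal sketch using made-up Area values rather than the real table. The `Agg` backend line is an assumption that makes the code run without a display, so drop it in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up Area values, for illustration only.
data = pd.DataFrame(
    {"Area": [120.5, 98.2, 110.0, 143.0, 101.3, 99.8, 131.4, 125.0]}
)

fig, ax = plt.subplots()
ax.hist(data["Area"], bins=4)  # 4 bins is an arbitrary choice for a tiny table
ax.set_xlabel("Area")
ax.set_ylabel("Number of rows")
ax.set_title("Distribution of Area")
fig.tight_layout()
```

Try a few different values for `bins`. A pattern that only shows up at one bin count deserves some suspicion.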
Group and summarize
summary = data.groupby("Cell Cycle Phase")[["Area", "Perimeter"]].mean()
print(summary)
summary.plot(kind="bar")
plt.ylabel("Mean value")
plt.tight_layout()
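A group mean on its own hides how many rows went into it and how spread out they are. One way to see all three at once is `agg` with several statistics. The sketch below uses a small made-up table with the same column names.

```python
import pandas as pd

# Made-up table, for illustration only.
data = pd.DataFrame({
    "Cell Cycle Phase": ["G1", "G1", "S", "S", "G2"],
    "Area": [100.0, 120.0, 130.0, 150.0, 170.0],
    "Perimeter": [38.0, 42.0, 44.0, 48.0, 52.0],
})

# Mean, standard deviation, and row count for each phase.
summary = (
    data.groupby("Cell Cycle Phase")[["Area", "Perimeter"]]
    .agg(["mean", "std", "count"])
)
print(summary)
```

In this made-up table the G2 group has a single row, so its standard deviation comes out as NaN. That is a useful reminder to read the count column before trusting a mean.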
The habit worth building
Use pandas and matplotlib to ask the boring, high-value questions first: what is in the table, what is missing, what looks strange, and what changes across conditions?
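As one possible sketch of the "what looks strange" question, the checks below look for impossible measurements, duplicated rows, and inconsistent label spellings. The table is made up, with problems planted on purpose. Which checks make sense for the real data is an assumption you will need to revisit.

```python
import pandas as pd

# Made-up table with deliberately planted problems.
data = pd.DataFrame({
    "Area": [120.5, 98.2, -5.0, 143.0, 143.0],
    "Perimeter": [42.1, 38.7, 40.3, 47.9, 47.9],
    "Cell Cycle Phase": ["G1", "S", "G2", "g1", "g1"],
})

# Sizes should be positive; show any rows that are not.
non_positive = data[(data["Area"] <= 0) | (data["Perimeter"] <= 0)]
print(non_positive)

# Rows that exactly repeat an earlier row.
n_duplicates = data.duplicated().sum()
print(n_duplicates)

# Label counts make spelling variants such as "G1" vs "g1" easy to spot.
labels = data["Cell Cycle Phase"].value_counts()
print(labels)
```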
Exercises
- Load the CSV file and print the first five rows.
- Plot a histogram of one numeric column.
- Make a scatter plot of two measurements.
- Group by one label column and compute a summary statistic.
- Write one sentence on what you think is worth checking next.