Module 10
So you want to classify single cells
This lesson adapts `2_sklearn_classification.ipynb`, which uses morphological measurements to classify *S. aureus* cells into different cell-cycle phases.
Core learning arc from the notebook
- Load tabular measurements from `morphological_measurements.csv` with pandas.
- Explore columns, row previews, histograms, and scatter plots.
- Normalize features by standardization (Z-scores).
- Train a logistic regression model for cell-cycle classification.
- Evaluate with a confusion matrix.
- Use k-fold cross-validation and parameter sweeps to refine the model.
Why it is valuable
This notebook shows a very common bioimage-analysis pattern: a segmentation or measurement step produces tabular features, and those features then become the input to a classifier. It helps researchers see how image-derived measurements connect to cell-state inference.
How the website should present it
Part 1: data loading and exploration
Slow down here and make the learner inspect the table first. The notebook already uses `data.head()`, histograms, and scatter plots. Those should become explicit mini-tasks on the site.
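A minimal sketch of what such inspection mini-tasks could look like. It builds a small synthetic table rather than loading the real CSV, so every value, the numeric phase labels 1 to 3, and the relationship between `Area` and `Perimeter` are assumptions made for illustration only. The column names are borrowed from the lesson.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for morphological_measurements.csv (all values made up).
rng = np.random.default_rng(0)
n_cells = 300
phase = rng.integers(1, 4, size=n_cells)  # hypothetical phase labels 1, 2, 3
area = rng.normal(loc=1.0 + 0.3 * phase, scale=0.15, size=n_cells)
data = pd.DataFrame({
    "Area": area,
    "Perimeter": 3.5 * np.sqrt(area) + rng.normal(0, 0.1, size=n_cells),
    "Cell Cycle Phase": phase,
})

# Mini-task 1: how big is the table, and what type is each column?
print(data.shape)
print(data.dtypes)

# Mini-task 2: read the first rows before plotting anything.
print(data.head())

# Mini-task 3: is the label balanced? This is the numeric counterpart
# of a histogram of the label column.
phase_counts = data["Cell Cycle Phase"].value_counts().sort_index()
print(phase_counts)

# Mini-task 4: summarize one feature per phase, a numeric preview of
# what per-phase histograms of Area would show.
area_by_phase = data.groupby("Cell Cycle Phase")["Area"].describe()
print(area_by_phase[["mean", "std"]])
```

Each step answers one question about the table, which makes it easy to turn into a short prompt for the learner.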
Part 2: feature normalization
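Reintroduce standardization carefully. The notebook normalizes all non-label columns to mean 0 and standard deviation 1. This is a good chance to connect numeric preprocessing to model behavior: scikit-learn's `LogisticRegression` is regularized by default, so features left on very different scales are penalized unevenly and the solver can converge more slowly.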
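One way to make the standardization step concrete is to compare a manual Z-score with scikit-learn's `StandardScaler`. This is a sketch on made-up numbers, not the notebook's own code: `StandardScaler` is swapped in here only for comparison, and the feature scales are invented. The detail worth showing learners is that pandas `.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` uses the population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two made-up features on deliberately different scales.
rng = np.random.default_rng(1)
features = pd.DataFrame({
    "Area": rng.normal(1.5, 0.3, size=200),
    "Perimeter": rng.normal(450.0, 60.0, size=200),
})

# Manual Z-score: pandas .std() uses ddof=1 (sample standard deviation).
manual = (features - features.mean()) / features.std()

# StandardScaler divides by the population standard deviation (ddof=0).
scaled = pd.DataFrame(
    StandardScaler().fit_transform(features), columns=features.columns
)

print(manual.mean().round(6))       # close to 0 for both columns
print(manual.std().round(6))        # 1 when measured with ddof=1
print(scaled.std(ddof=0).round(6))  # 1 when measured with ddof=0
```

The two results differ only by the constant factor sqrt(n / (n - 1)), which is negligible for a table with many cells but explains small mismatches when learners compare outputs.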
Part 3: build a first classifier
Logistic regression is a helpful first model because it is not too opaque: it learns one weight per feature for each class, and those weights can be inspected directly. The focus should be on understanding the pipeline rather than treating the model as magic.
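To illustrate why the model is inspectable, the sketch below fits logistic regression on synthetic data and prints the fitted coefficients as a table. The data, the numeric phase labels, and the `Noise` column are all hypothetical: `Area` is constructed to depend on the phase and `Noise` is constructed to carry no information, so the contrast in the weights is visible.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_cells = 400
phase = rng.integers(1, 4, size=n_cells)  # hypothetical labels 1, 2, 3

X = pd.DataFrame({
    "Area": rng.normal(loc=phase.astype(float), scale=0.5),  # informative
    "Noise": rng.normal(size=n_cells),                       # uninformative
})
X = (X - X.mean()) / X.std()  # standardize so the weights are comparable

model = LogisticRegression(max_iter=1000)
model.fit(X, phase)

# One row per class, one column per feature.
coefficients = pd.DataFrame(
    model.coef_, index=model.classes_, columns=X.columns
)
print(coefficients)
```

Because the features are standardized, the magnitude of a weight gives a rough sense of how much the model relies on that feature. It is a reading aid, not a statement about biological causation.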
Part 4: evaluate and tune
The confusion matrix, cross-validation, and parameter sweeps make this a strong intermediate lesson. These ideas should be preserved because they teach good scientific skepticism about model quality.
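The cross-validation and parameter-sweep ideas could be sketched as follows. This is an illustration on synthetic data, not the notebook's implementation: the use of a `Pipeline`, the choice of five folds, and the choice of the regularization strength `C` as the swept parameter are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features that depend on a hypothetical 3-class label.
rng = np.random.default_rng(3)
n_cells = 300
y = rng.integers(1, 4, size=n_cells)
X = np.column_stack([
    rng.normal(loc=y, scale=0.6),
    rng.normal(loc=0.5 * y, scale=0.6),
])

# Scaling lives inside the pipeline, so it is refit on each training fold
# and the held-out fold never influences the mean or standard deviation.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# k-fold cross-validation: one accuracy score per fold.
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean(), scores.std())

# Parameter sweep over the regularization strength C (assumed parameter).
sweep = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
sweep.fit(X, y)
print(sweep.best_params_, sweep.best_score_)
print(sweep.cv_results_["mean_test_score"])
```

Comparing the spread of the fold scores with the differences between values of `C` is a simple way to ask whether a sweep found a meaningful improvement or only noise.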
Representative code examples
Load and inspect the data
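import pandas as pd

# Load the per-cell measurement table.
data = pd.read_csv("../data/morphological_measurements.csv")
data.head()  # preview the first rows

# Class balance: how many cells fall into each cell-cycle phase?
data["Cell Cycle Phase"].hist()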
Normalize feature columns
def normalize_column(data_column):
    # Z-score: subtract the column mean, divide by its standard deviation.
    mean = data_column.mean()
    std = data_column.std()
    normalized_column = (data_column - mean) / std
    return normalized_column

normalized_data = data.copy()
# Every column except the last is treated as a feature (label assumed last).
feature_columns = normalized_data.columns[:-1]
for column_name in feature_columns:
    normalized_data[column_name] = normalize_column(normalized_data[column_name])
Train a classifier
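from sklearn.linear_model import LogisticRegression

# Features are the standardized columns; the label is the phase column.
X = normalized_data[feature_columns]
y = normalized_data["Cell Cycle Phase"]

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Note: these are predictions on the training data itself.
predictions = model.predict(X)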
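Predicting on the same rows used for fitting tends to give an optimistic picture, so a confusion matrix is more informative on held-out cells. The sketch below shows one way to do that. It uses synthetic data with hypothetical labels 1 to 3, and the 75/25 split is an arbitrary choice, not something taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in: one informative feature, one pure-noise feature.
rng = np.random.default_rng(4)
n_cells = 300
y = rng.integers(1, 4, size=n_cells)
X = np.column_stack([rng.normal(loc=y, scale=0.6), rng.normal(size=n_cells)])

# Hold out a quarter of the cells, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Rows are true phases, columns are predicted phases.
matrix = confusion_matrix(y_test, model.predict(X_test), labels=[1, 2, 3])
print(matrix)

# Overall accuracy is the diagonal divided by the total count.
accuracy = np.trace(matrix) / matrix.sum()
print(accuracy)
```

Off-diagonal cells show which phases get confused with each other, which is usually more useful for discussion than the single accuracy number.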
Research-facing framing
The biological goal is not just “build a classifier.” It is understanding whether the measured features contain enough information to separate meaningful cell states reliably.
Exercises worth carrying over
- Plot histograms for multiple features and compare their distributions.
- Make a scatter plot of `Area` vs `Perimeter` and discuss how well the phases separate.
- Write a small normalization helper and apply it column by column.
- Train logistic regression and inspect the confusion matrix.
- Try one parameter sweep and discuss whether the model improved meaningfully.