Module 10

So you want to classify single cells

This lesson adapts `2_sklearn_classification.ipynb`, which uses morphological measurements to classify *S. aureus* cells into different cell-cycle phases.

Estimated time: 60 to 90 min

Main tools: pandas, matplotlib, scikit-learn

Core learning arc from the notebook

  1. Load tabular measurements from `morphological_measurements.csv` with pandas.
  2. Explore columns, row previews, histograms, and scatter plots.
  3. Normalize features by standardization (Z-scores).
  4. Train a logistic regression model for cell-cycle classification.
  5. Evaluate with a confusion matrix.
  6. Use k-fold cross-validation and parameter sweeps to refine the model.

Why it is valuable

This notebook shows a very common bioimage-analysis pattern: a segmentation or measurement step produces tabular features, and those features then become the input to a classifier. It helps researchers see how image-derived measurements connect to cell-state inference.

How the website should present it

Part 1: data loading and exploration

Slow down here and have the learner inspect the table before any modeling. The notebook already uses `data.head()`, histograms, and scatter plots; each of those should become an explicit mini-task on the site.
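One way the inspection mini-tasks could look is sketched here. The table is a synthetic stand-in built in code so the sketch runs without the CSV; the integer phase labels (1, 2, 3) and the numeric values are assumptions for illustration, not the notebook's data. Only the column names `Area`, `Perimeter`, and `Cell Cycle Phase` come from the lesson.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for morphological_measurements.csv (values are made up).
rng = np.random.default_rng(0)
n = 90
phase = np.repeat([1, 2, 3], n // 3)  # assumed integer encoding of the phases
data = pd.DataFrame({
    "Area": rng.normal(loc=1000 + 200 * phase, scale=80, size=n),
    "Perimeter": rng.normal(loc=110 + 15 * phase, scale=6, size=n),
    "Cell Cycle Phase": phase,
})

# Mini-task 1: look at the first rows and the column types.
print(data.head())
print(data.dtypes)

# Mini-task 2: count cells per phase to see whether the classes are balanced.
phase_counts = data["Cell Cycle Phase"].value_counts().sort_index()
print(phase_counts)

# Mini-task 3: one histogram per feature to compare ranges and shapes.
axes = data[["Area", "Perimeter"]].hist(bins=20)

# Mini-task 4: scatter plot of two features, colored by phase.
fig, ax = plt.subplots()
points = ax.scatter(data["Area"], data["Perimeter"],
                    c=data["Cell Cycle Phase"], cmap="viridis", s=12)
ax.set_xlabel("Area")
ax.set_ylabel("Perimeter")
fig.colorbar(points, ax=ax, label="Cell Cycle Phase")
```

In a notebook the figures render inline; in a plain script, a final `plt.show()` would be needed to display them.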

Part 2: feature normalization

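Reintroduce standardization carefully. The notebook normalizes all non-label columns to mean 0 and standard deviation 1. This is a good chance to connect numeric preprocessing to model behavior: scikit-learn's logistic regression is regularized by default, so features measured on very different scales would otherwise be penalized unequally and can slow solver convergence.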
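A small worked example may help make the Z-score concrete. The sketch below standardizes a toy table two ways: with plain pandas arithmetic and with scikit-learn's `StandardScaler`. The toy values and the `Eccentricity` column are hypothetical. One detail worth pointing out to learners is that pandas' `std()` uses the sample standard deviation (`ddof=1`) while `StandardScaler` uses the population standard deviation (`ddof=0`), so the two results differ by a constant factor.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy table: two features on very different scales.
toy = pd.DataFrame({
    "Area": [1200.0, 1500.0, 900.0, 1800.0],
    "Eccentricity": [0.20, 0.45, 0.10, 0.60],
})

# Z-score with pandas: subtract the column mean, divide by the sample std (ddof=1).
z_pandas = (toy - toy.mean()) / toy.std()

# Z-score with scikit-learn: same idea, but divides by the population std (ddof=0).
z_sklearn = pd.DataFrame(
    StandardScaler().fit_transform(toy), columns=toy.columns
)

# After standardization both columns have mean 0 and unit spread.
print(z_pandas.mean().round(6))
print(z_pandas.std().round(6))
print(z_sklearn.std(ddof=0).round(6))

# The two versions differ only by the factor sqrt(n / (n - 1)).
n_rows = len(toy)
ratio = np.sqrt(n_rows / (n_rows - 1))
print(np.allclose(z_sklearn.values, z_pandas.values * ratio))
```

For tables with many rows the factor is close to 1, so either convention is usually fine as long as it is applied consistently.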

Part 3: build a first classifier

Logistic regression is a helpful first model because it is relatively transparent: each feature gets a coefficient whose sign and size can be inspected. The focus should be on understanding the pipeline rather than treating the model as magic.

Part 4: evaluate and tune

The confusion matrix, cross-validation, and parameter sweeps make this a strong intermediate lesson. These ideas should be preserved because they teach good scientific skepticism about model quality.
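As a rough illustration of the evaluation steps named above, the following sketch computes a confusion matrix on a held-out split and then runs 5-fold cross-validation. It uses synthetic data from `make_classification` as a hypothetical stand-in for the measurement table, and it wraps scaling and the classifier in a pipeline so that the scaler is refit inside each fold; both choices are assumptions of this sketch, not necessarily what the notebook does.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 300 "cells", 4 features, 3 classes.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

# Hold out 25% of the cells so the model is judged on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)

# k-fold cross-validation (k=5): one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Off-diagonal entries in the matrix show which classes get confused with each other, and the spread of the fold scores gives a sense of how much a single accuracy number can be trusted.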

Representative code examples

Load and inspect the data

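import pandas as pd

# Load the per-cell measurement table.
data = pd.read_csv("../data/morphological_measurements.csv")

# Preview the first rows (run in its own notebook cell to see the output).
data.head()

# Histogram of the label column to check how balanced the phases are.
data["Cell Cycle Phase"].hist()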

Normalize feature columns

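def normalize_column(data_column):
    """Return the Z-scores of a pandas Series: (value - mean) / std."""
    mean = data_column.mean()
    std = data_column.std()  # pandas default: sample standard deviation (ddof=1)
    normalized_column = (data_column - mean) / std
    return normalized_column

# Work on a copy so the raw measurements stay available.
normalized_data = data.copy()
# Every column except the last; assumes the label column comes last in the table.
feature_columns = normalized_data.columns[:-1]

for column_name in feature_columns:
    normalized_data[column_name] = normalize_column(normalized_data[column_name])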

Train a classifier

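from sklearn.linear_model import LogisticRegression

# Features are the standardized measurements; the target is the phase label.
X = normalized_data[feature_columns]
y = normalized_data["Cell Cycle Phase"]

# max_iter is raised from the default of 100 to give the solver room to converge.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
# Note: these are predictions on the training data, so they overstate
# how well the model will do on unseen cells.
predictions = model.predict(X)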

Research-facing framing

The biological goal is not just “build a classifier.” It is understanding whether the measured features contain enough information to separate meaningful cell states reliably.
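One simple way to probe whether the features carry information at all is to compare the classifier against a baseline that ignores them. The sketch below is an illustration on synthetic stand-in data, not a step taken from the notebook; it uses scikit-learn's `DummyClassifier`, which here always predicts the most frequent class.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 300 "cells", 4 features, 3 classes.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

# Baseline: ignores the features and always predicts the most common class.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X, y, cv=5)

# Model: standardize, then logistic regression, evaluated on the same folds.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model_scores = cross_val_score(model, X, y, cv=5)

print(f"baseline accuracy: {baseline_scores.mean():.3f}")
print(f"model accuracy:    {model_scores.mean():.3f}")
```

If the model barely beats the baseline on real measurements, that suggests the chosen features may not separate the cell states, which is itself a useful finding.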

Exercises worth carrying over

  1. Plot histograms for multiple features and compare their distributions.
  2. Make a scatter plot of `Area` vs `Perimeter` and discuss separation.
  3. Write a small normalization helper and apply it column by column.
  4. Train logistic regression and inspect the confusion matrix.
  5. Try one parameter sweep and discuss whether the model improved meaningfully.
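For the parameter-sweep exercise, a minimal sketch might look like the following. The lesson does not say which parameter the notebook sweeps, so the regularization strength `C` of `LogisticRegression` is used here as an illustrative choice, with `GridSearchCV` doing the cross-validated comparison. The data are again a synthetic stand-in, and the step names `scale` and `clf` are arbitrary labels chosen for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 300 "cells", 4 features, 3 classes.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Candidate values for C (smaller C means stronger regularization).
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0, 100.0]}

# Each candidate is scored with 5-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

results = search.cv_results_
for c_value, mean, std in zip(
    results["param_clf__C"], results["mean_test_score"], results["std_test_score"]
):
    print(f"C={c_value}: accuracy {mean:.3f} +/- {std:.3f}")

print("best:", search.best_params_)
```

Comparing the differences between candidates with the fold-to-fold spread is a reasonable way to discuss whether an apparent improvement is meaningful or within noise.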