Feature Engineering Pipeline
A deep dive into the preprocessing pipeline behind the painting classifier --- how we turned mixed-type survey data (numbers, Likert scales, free text, multi-select categories) into a fixed-width numeric feature matrix, using only numpy and pandas.
Architecture
The pipeline follows a fit / transform / save / load pattern:
# Training time
prep = Preprocessor()
prep.fit(train_df)
X, y, groups = prep.transform(train_df)
prep.save("preprocessing_params.json")
# Inference time
prep = Preprocessor.load("preprocessing_params.json")
X, _, _ = prep.transform(test_df)

All statistics (means, standard deviations, TF-IDF vocabularies) are computed during fit() on training data only, then applied during transform(). This prevents data leakage --- the preprocessor never sees validation or test data during fitting.
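To make the leakage guarantee concrete, here is a toy sketch of the fit/transform contract (illustrative values only, not the actual pipeline):

```python
import numpy as np

# "Fit" computes statistics from the training split alone.
train = np.array([1.0, 2.0, 3.0, 4.0])
test = np.array([10.0, 20.0])

mean, std = train.mean(), train.std()   # fit on training data only
z_train = (train - mean) / std          # transform training data
z_test = (test - mean) / std            # reuse TRAIN stats at inference

print(z_test)  # test rows are scaled by train statistics, however extreme
```

If the test rows had leaked into the mean/std computation, the scaled values would shift and offline metrics would overstate real-world performance.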
Numeric Features: Z-Normalization
Numeric columns (emotional intensity, colour count, object count) are z-normalized: z = (x - mean) / std, with the mean and standard deviation computed on the training split.
The cost feature gets special treatment: it’s heavily right-skewed (a few students valued paintings in the millions), so we log-transform first:
log_cost = np.log1p(np.clip(cost, 0, None))
z_cost = (log_cost - mean) / std

log1p handles zero values gracefully (since log(1 + 0) = 0), and clipping prevents negative values from breaking the transform.
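A runnable version of the two lines above, with hypothetical cost values showing how the log transform tames the skew:

```python
import numpy as np

# Hypothetical cost responses: mostly small, one in the millions.
cost = np.array([0.0, 50.0, 200.0, 2_000_000.0])

log_cost = np.log1p(np.clip(cost, 0, None))   # log(1 + x); log1p(0) == 0
mean, std = log_cost.mean(), log_cost.std()
z_cost = (log_cost - mean) / std

print(z_cost.round(2))
```

Without the log, the single million-scale response would dominate the mean and compress every other value toward the same z-score.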
Ordinal Features
Likert-scale responses (Strongly Disagree = 1 through Strongly Agree = 5) are treated as continuous after z-normalization. This is a common simplification --- it assumes equal spacing between response levels, which is debatable but works well in practice for 5-point scales.
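The mapping can be sketched as follows; the endpoint labels come from the text above, while the intermediate labels ("Disagree", "Neutral", "Agree") are assumed:

```python
import pandas as pd

# Endpoints per the write-up; middle labels are an assumption.
LIKERT = {
    "Strongly Disagree": 1, "Disagree": 2, "Neutral": 3,
    "Agree": 4, "Strongly Agree": 5,
}

responses = pd.Series(["Agree", "Neutral", "Strongly Agree", "Disagree"])
codes = responses.map(LIKERT).astype(float)

# Treated as continuous: z-normalize using (training) statistics.
z = (codes - codes.mean()) / codes.std(ddof=0)
print(z.tolist())
```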
Multi-Hot Encoding
Multi-select fields (room location, viewing companion, season) are encoded as binary vectors. If a student selected “Winter” and “Fall” for season:
[Spring=0, Summer=0, Fall=1, Winter=1]

The vocabulary is learned during fit() (a sorted set of all categories seen in the training data), so unseen categories at inference time are simply ignored.
def _fit_multi_hot(self, df):
    vocab = {}
    for col in MULTI_SELECT_COLS:
        cats = set()
        for val in df[col].values:
            cats.update(_parse_multi_select(val))
        vocab[col] = sorted(cats)
    self.params["multi_hot_vocab"] = vocab

TF-IDF from Scratch
The three free-text columns (emotional description, food association, soundtrack association) are vectorized using TF-IDF with a vocabulary cap of 200 tokens per column, yielding a 3 × 200 = 600-dimensional sparse block.
The implementation:
- Tokenize: lowercase, strip non-alphanumeric characters, split on whitespace
- Document frequency: count how many responses contain each token
- Select top-k: keep the 200 most frequent tokens as the vocabulary
- Compute TF-IDF: for each response, multiply each token's term frequency by its IDF weight
- L2-normalize each row so response length doesn’t dominate
The IDF formula uses smoothing to prevent division by zero: idf(t) = log((1 + N) / (1 + df(t))) + 1, where N is the number of responses and df(t) is the document frequency of token t.
def _fit_tfidf(self, df):
    for col in TEXT_COLS:
        texts = df[col].fillna("").astype(str).tolist()
        tokenized = [_tokenize(t) for t in texts]
        n_docs = len(tokenized)
        doc_freq = Counter()
        for tokens in tokenized:
            for t in set(tokens):  # set() avoids double-counting
                doc_freq[t] += 1
        top = doc_freq.most_common(MAX_TFIDF_FEATURES)
        word2idx = {w: i for i, (w, _) in enumerate(top)}
        idf = np.zeros(len(word2idx))
        for w, i in word2idx.items():
            idf[i] = math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0

Why no pre-trained embeddings? The survey responses are short, informal, and emoji-heavy --- domain-specific TF-IDF on the actual vocabulary turned out to be more discriminative than generic embeddings would be.
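The transform counterpart of the fit code isn't shown; a minimal sketch, assuming the fitted word2idx and idf from above, with an illustrative _tokenize matching the description (lowercase, strip non-alphanumerics, split):

```python
import re
from collections import Counter
import numpy as np

def _tokenize(text):
    # Lowercase, keep alphanumeric runs only (strips punctuation).
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_row(text, word2idx, idf):
    """Transform step: term frequency x IDF, then L2-normalize the row."""
    vec = np.zeros(len(word2idx))
    for word, count in Counter(_tokenize(text)).items():
        i = word2idx.get(word)
        if i is not None:              # out-of-vocabulary tokens are dropped
            vec[i] = count * idf[i]
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

word2idx = {"calm": 0, "blue": 1}      # toy fitted vocabulary
idf = np.array([1.5, 2.0])
row = tfidf_row("Calm, calm and blue!", word2idx, idf)
print(row.round(3))
```

The L2 normalization at the end is what keeps a three-word response comparable to a three-sentence one.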
Engineered Features
TODO(human)
Serialization
The entire preprocessor state serializes to a single JSON file (preprocessing_params.json), containing:
- Imputation values (medians for numerics, modes for ordinals)
- Multi-hot vocabularies
- TF-IDF vocabularies and IDF vectors
- Z-normalization statistics (mean/std per feature)
This means the inference script (pred.py) can reconstruct the exact same preprocessing pipeline with zero sklearn dependency --- just numpy, pandas, and the JSON file.
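The one wrinkle in JSON serialization is that numpy arrays aren't JSON-native; a sketch of the round-trip (field names illustrative, not the actual schema):

```python
import json
import numpy as np

# Arrays become lists before json.dump, and back to ndarrays after load.
params = {
    "z_stats": {"colour_count": {"mean": 4.2, "std": 1.7}},
    "idf": {"emotional_description": np.array([1.3, 2.1]).tolist()},
}

with open("preprocessing_params.json", "w") as f:
    json.dump(params, f)

with open("preprocessing_params.json") as f:
    loaded = json.load(f)

idf = np.asarray(loaded["idf"]["emotional_description"])  # ndarray again
print(idf)
```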