Summary

XGBoost expects the feature matrix supplied at prediction time to have exactly the same number of columns, in the same order, as the matrix used for training. In competitions the test set usually does not contain the target column, so reusing the training-time column handling on it, or passing a frame whose columns differ from the training features, makes predict() fail with a feature-count mismatch. The fix is to build the prediction matrix from the same feature columns as the training matrix, with the target excluded, and to pass that matrix (or an xgb.DMatrix built from it) to predict().

Root Cause

XGBoost records the number of features the Booster was trained on (written here as n_features_model).
During predict(), it compares the column count of the new data (n_features_data) against that value and raises an error when the two differ.
The provided newdata still contains the target column (or is missing a column that was present in training), making the column counts differ (e.g., 8 vs. 7).
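
A quick way to see the mismatch before calling predict() is to compare column counts and names directly. The sketch below uses the Pima data from MASS and deliberately leaves the label attached to the new data; the variable names are illustrative and XGBoost itself is not involved.

```r
library(MASS)

# Training features: every column of Pima.tr except the label "type"
train_x <- data.matrix(Pima.tr[, setdiff(names(Pima.tr), "type")])

# New data converted as-is, with the label column still attached
newdata <- data.matrix(Pima.te)

ncol(train_x)   # 7 feature columns
ncol(newdata)   # 8 columns: the 7 features plus "type"

# Name the offending column(s) instead of only counting them
setdiff(colnames(newdata), colnames(train_x))   # "type"
```

Printing the set difference of the column names tends to be more useful than the raw counts, because it says which column to remove.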

Why This Happens in Real Systems

Data pipelines often concatenate train and test frames before splitting, then drop the label only for the test set.
Feature engineering (one‑hot encoding, interaction terms, etc.) is usually performed on the whole dataset; the resulting feature matrix for test may have a different column set if the label column isn’t removed correctly.
Dynamic column ordering (e.g., using data.frame subsetting by name) can reorder columns, causing a mismatch even when the label is removed.
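
The pipeline problems above can be reproduced with two toy data frames. This is a minimal sketch with made-up column names and values: it shows a positional drop removing a real feature from a label-free test frame, and selection by name restoring the training layout.

```r
# Toy frames; all column names and values are made up for illustration
train <- data.frame(age = c(31, 45), bmi = c(22.1, 30.4), label = c(0, 1))
test  <- data.frame(bmi = c(27.5, 24.0), age = c(52, 38))   # no label, different order

# Positional drop reused from training: removes "age", a real feature
bad_x <- data.matrix(test[, -ncol(test), drop = FALSE])
colnames(bad_x)    # "bmi"

# Selecting by name restores both the feature set and the training order
feature_cols <- setdiff(names(train), "label")
good_x <- data.matrix(test[, feature_cols, drop = FALSE])
colnames(good_x)   # "age" "bmi"
```

Deriving feature_cols from the training frame once and reusing it everywhere keeps the two code paths from drifting apart.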

Real-World Impact

Model deployment delays: engineers spend time debugging a simple shape mismatch.
Incorrect submissions in competitions: predictions may be generated on the wrong feature set, leading to poor scores.
Production outages when batch inference jobs crash because the incoming schema differs from the training schema.

Example or Code

library(xgboost)
library(MASS)

# ----- Training -----
train <- MASS::Pima.tr
train$type <- ifelse(train$type == "No", 0, 1)

train_x <- data.matrix(train[, setdiff(names(train), "type")])   # drop label by name
train_y <- train$type

dtrain <- xgb.DMatrix(data = train_x, label = train_y)

params <- list(max_depth = 3, objective = "binary:logistic", eval_metric = "logloss")
model <- xgb.train(params = params, data = dtrain, nrounds = 70)

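# ----- Prediction on new data -----
# Pima.te still carries the label; selecting the training columns by name drops it
# and also works when the test frame has no label column at all.
test <- MASS::Pima.te

test_x <- data.matrix(test[, colnames(train_x)])   # same columns, same order as train_x
stopifnot(identical(colnames(train_x), colnames(test_x)))
dtest  <- xgb.DMatrix(data = test_x)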

pred_prob <- predict(model, dtest)   # returns probabilities

How Senior Engineers Fix It

Explicitly drop the target column before creating the prediction matrix: data.matrix(df[ , setdiff(names(df), target_col)]).
Validate column order using identical(colnames(train_x), colnames(test_x)); if they differ, reorder test_x to match.
Encapsulate preprocessing in a reusable function or recipe (e.g., recipes::recipe) that guarantees the same feature set for train and inference.
Version the schema: store the column list used for training alongside the model and assert equality at inference time.
Use xgb.DMatrix for both training and prediction to avoid implicit conversion errors.
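
The name-based selection and schema-versioning steps above could be combined into one small helper. This is a sketch under stated assumptions, not a library API: align_to_schema, the column names, and the use of a temporary .rds file as the schema store are all hypothetical choices made for illustration.

```r
# Hypothetical helper: build a prediction matrix that matches a stored schema
align_to_schema <- function(df, feature_cols) {
  missing_cols <- setdiff(feature_cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing feature columns: ", paste(missing_cols, collapse = ", "))
  }
  # Selecting by name drops extras (such as the label) and fixes the order
  data.matrix(df[, feature_cols, drop = FALSE])
}

# At training time: store the column list next to the model artifact
feature_cols <- c("age", "bmi")              # made-up schema
schema_path  <- tempfile(fileext = ".rds")   # stand-in for a real artifact path
saveRDS(feature_cols, schema_path)

# At inference time: reload the schema and align the incoming frame
incoming <- data.frame(label = 1, bmi = 27.5, age = 52)
x_new <- align_to_schema(incoming, readRDS(schema_path))
stopifnot(identical(colnames(x_new), feature_cols))
```

Failing early with the names of the missing columns is a deliberate choice here: it turns a late shape error inside predict() into a message that points at the upstream data.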

Why Juniors Miss It

They assume predict() automatically aligns columns by name, which XGBoost does not do.
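They often mix data frames and matrices and never compare colnames() afterwards; data.matrix() keeps the column names, but nothing checks them against the training matrix unless the code does so explicitly.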
They skip the schema validation step, relying on “it works on the notebook” without testing a pure inference pipeline.
They may remove the label by position (for example, df[, -ncol(df)]) rather than by name, which drops a real feature when the test frame has no label and leaves the column count off by one.

Bottom line: always feed XGBoost a feature matrix that mirrors the training matrix in both shape and order, even when the target column is absent.