Summary
When using XGBoost, the model expects the feature matrix supplied at prediction time to have exactly the same number of columns, in the same order, as the matrix used for training. In competitions the test set does not contain the target column, so the naïve approach of passing the raw test data to predict() can fail with a feature‑count mismatch whenever its layout differs from the training features. The fix is to build the prediction matrix without the target column but with the same feature layout as the training matrix, and to pass that matrix (or an xgb.DMatrix built from it) to predict().
Root Cause
- XGBoost stores the number of features (n_features_model) when the Booster is built.
- During predict(), it checks n_features_data == n_features_model.
- The provided newdata still contains the target column (or is missing a column that was present in training), making the column counts differ (e.g., 8 vs. 7).
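The check can be mimicked in base R to make the 8 vs. 7 case concrete. This is a minimal sketch, not XGBoost's actual implementation: check_feature_count and the toy data frame are hypothetical names introduced only for illustration.

```r
# Hypothetical helper that mimics a feature-count check (not XGBoost's real code).
check_feature_count <- function(n_features_model, newdata) {
  n_features_data <- ncol(newdata)
  if (n_features_data != n_features_model) {
    stop(sprintf("feature count mismatch: model expects %d, data has %d",
                 n_features_model, n_features_data))
  }
  invisible(TRUE)
}

# Toy frame: 7 feature columns plus a label column named "type" (8 in total).
toy <- as.data.frame(matrix(0, nrow = 3, ncol = 7))
toy$type <- c(0, 1, 0)

n_features_model <- 7  # assumption: the model was trained on the 7 features only

# Passing the frame with the label still attached fails (8 vs. 7).
with_label <- tryCatch(check_feature_count(n_features_model, toy),
                       error = function(e) conditionMessage(e))

# Dropping the label by name restores the expected layout.
features_only <- toy[, setdiff(names(toy), "type"), drop = FALSE]
without_label <- check_feature_count(n_features_model, features_only)
```

Here with_label holds the error message, while without_label is TRUE because the label was removed before the check.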
Why This Happens in Real Systems
- Data pipelines often concatenate train and test frames before splitting, then drop the label only for the test set.
- Feature engineering (one‑hot encoding, interaction terms, etc.) is usually performed on the whole dataset; the resulting feature matrix for test may have a different column set if the label column isn’t removed correctly.
- Dynamic column ordering (e.g., using data.frame subsetting by name) can reorder columns, causing a mismatch even when the label is removed.
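The feature-engineering case can be reproduced with base R alone. The sketch below assumes a toy data set (the city and x columns are invented for illustration) in which one category appears only in the training rows. Encoding each frame separately gives different column sets, while sharing the factor levels keeps them identical.

```r
# Toy data (hypothetical): the level "c" appears only in the training rows.
train <- data.frame(city = c("a", "b", "c"), x = c(1, 2, 3), stringsAsFactors = FALSE)
test  <- data.frame(city = c("a", "b"),      x = c(4, 5),    stringsAsFactors = FALSE)

# Encoding each frame on its own yields different column sets.
train_x <- model.matrix(~ city + x - 1, data = train)
test_x  <- model.matrix(~ city + x - 1, data = test)
colnames(train_x)  # "citya" "cityb" "cityc" "x"
colnames(test_x)   # "citya" "cityb" "x"

# One possible remedy: fix the factor levels from the training data
# before encoding, so unseen-in-test levels still get an (all-zero) column.
lv <- sort(unique(train$city))
test$city <- factor(test$city, levels = lv)
test_x_fixed <- model.matrix(~ city + x - 1, data = test)

same_layout <- identical(colnames(train_x), colnames(test_x_fixed))
```

With shared levels, same_layout is TRUE and the test matrix carries an all-zero cityc column instead of silently dropping it.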
Real-World Impact
- Model deployment delays – engineers spend time debugging a simple shape mismatch.
- Incorrect submissions in competitions: predictions may be generated on the wrong feature set, leading to poor scores.
- Production outages when batch inference jobs crash because the incoming schema differs from the training schema.
Example or Code
library(xgboost)
library(MASS)  # provides the Pima.tr and Pima.te data sets
# ----- Training -----
train <- MASS::Pima.tr
train$type <- ifelse(train$type == "No", 0, 1)  # encode the label as 0/1
feature_cols <- setdiff(names(train), "type")   # drop the label by name, not by position
train_x <- data.matrix(train[, feature_cols])
train_y <- train$type
dtrain <- xgb.DMatrix(data = train_x, label = train_y)
params <- list(max_depth = 3, objective = "binary:logistic", eval_metric = "logloss")
model <- xgb.train(params = params, data = dtrain, nrounds = 70)
# ----- Prediction on new data (label not used) -----
test <- MASS::Pima.te
test$type <- ifelse(test$type == "No", 0, 1)    # only needed to score the predictions
test_x <- data.matrix(test[, feature_cols])     # same columns, same order as train_x
stopifnot(identical(colnames(train_x), colnames(test_x)))
dtest <- xgb.DMatrix(data = test_x)
pred_prob <- predict(model, dtest)              # returns probabilities
How Senior Engineers Fix It
- Explicitly drop the target column before creating the prediction matrix: data.matrix(df[ , setdiff(names(df), target_col)]).
- Validate column order using identical(colnames(train_x), colnames(test_x)); if they differ, reorder test_x to match.
- Encapsulate preprocessing in a reusable function or recipe (e.g., recipes::recipe) that guarantees the same feature set for train and inference.
- Version the schema: store the column list used for training alongside the model and assert equality at inference time.
- Use xgb.DMatrix for both training and prediction to avoid implicit conversion errors.
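The schema-versioning and reordering steps can be sketched in base R as follows. The helper align_features, the toy columns, and the temporary .rds file are all hypothetical choices made for illustration; a real pipeline would store the schema wherever the model artifact lives.

```r
# Hypothetical helper: select and order columns exactly as recorded at training time.
align_features <- function(df, feature_cols) {
  missing_cols <- setdiff(feature_cols, names(df))
  if (length(missing_cols) > 0) {
    stop("missing feature columns: ", paste(missing_cols, collapse = ", "))
  }
  # Selecting by name drops extras (such as the label) and fixes the order.
  data.matrix(df[, feature_cols, drop = FALSE])
}

# Training time: record the feature list, excluding the target by name.
train_df <- data.frame(age = c(30, 40), bmi = c(22.5, 31.0), type = c(0, 1))
target_col <- "type"
feature_cols <- setdiff(names(train_df), target_col)

# Persist the schema next to the model artifact (a temp file in this sketch).
schema_path <- tempfile(fileext = ".rds")
saveRDS(feature_cols, schema_path)

# Inference time: reload the schema and align a frame whose columns are shuffled.
new_df <- data.frame(bmi = 27.1, type = NA, age = 55)
test_x <- align_features(new_df, readRDS(schema_path))
stopifnot(identical(colnames(test_x), feature_cols))
```

The aligned test_x can then be handed to xgb.DMatrix() as in the example above. Failing loudly on missing columns is a deliberate choice here, since padding them silently would hide upstream schema drift.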
Why Juniors Miss It
- They assume predict() automatically aligns columns by name, which XGBoost does not do.
- They often mix data frames and matrices without noticing that data.matrix() keeps the data frame's column order as-is and silently recodes factor and character columns as integer codes.
- They skip the schema validation step, relying on “it works on the notebook” without testing a pure inference pipeline.
- They may remove the label column after converting to a matrix, resulting in an off‑by‑one column count.
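A related pitfall is removing the label by position after the conversion. The sketch below uses a hypothetical toy frame in which the label is not the last column; it shows that a position-based drop can leave the column count looking right while discarding a real feature and keeping the label.

```r
# Toy frame (hypothetical) where the label "type" is NOT the last column.
df <- data.frame(age = c(30, 40), type = c(0, 1), bmi = c(22.5, 31.0))
m <- data.matrix(df)

# Position-based drop removes "bmi" and keeps the label as a feature.
by_position <- m[, -ncol(m), drop = FALSE]
colnames(by_position)  # "age" "type"

# Name-based drop removes the label regardless of where it sits.
by_name <- m[, setdiff(colnames(m), "type"), drop = FALSE]
colnames(by_name)      # "age" "bmi"
```

Both results have two columns, so a count-only check would not catch the first one. Comparing column names against the training schema would.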
Bottom line: always feed XGBoost a feature matrix that mirrors the training matrix in both shape and order, even when the target column is absent.