Building a custom head on Keras Hub ViT backbone correctly

Summary

A developer attempted to build a custom classification head on top of a Keras Hub ViT backbone, but encountered ambiguity regarding the tensor topology and preprocessing pipeline. The primary failure point was an assumption about how the backbone handles the CLS token and whether the Backbone object includes an integrated preprocessing layer. Failure to align the input distribution with the pretraining dataset results in a model that converges poorly or fails to generalize entirely.

Root Cause

The technical friction stems from three specific architectural misconceptions:

Tensor Slicing Ambiguity: While the ViTPatchingAndEmbedding class does concatenate the class_token at index 0, the developer was unsure if the Backbone output preserved this token or if it returned only patch embeddings.
Implicit vs. Explicit Preprocessing: Using Backbone.from_preset() loads the feature extractor, not the full Task Model. This means the heavy lifting of normalization (mean/std subtraction) and resizing is not included in the backbone call.
Dataset Divergence: The naming convention vit_base_patch16_224_imagenet in the Keras Hub ecosystem often refers to weights fine-tuned on ImageNet-1k, which has significantly different distribution characteristics than the massive ImageNet-21k used for initial pretraining.

Why This Happens in Real Systems

In production machine learning pipelines, abstraction leakage is common. High-level APIs like Keras Hub provide “Presets” that encapsulate several distinct components:

The Preprocessor (Resizing, Rescaling, Normalization).
The Backbone (The transformer layers).
The Head (The classification layer).

When an engineer extracts only the Backbone, they are effectively “unwrapping” the model. If they do not manually re-apply the corresponding Preprocessor, the input tensors will have values (e.g., 0-255) that the transformer weights were never trained to handle (expecting -1 to 1 or 0 to 1), leading to gradient instability.

Real-World Impact

Silent Failure: The model will train and the loss will decrease, but the accuracy will plateau at a level significantly lower than the baseline. This is the most dangerous outcome because it looks like a “learning” issue rather than a “data” issue.
Inference Divergence: A model might perform well in a notebook with manual preprocessing but fail in a production API if the normalization constants used during training are not identical to those used in the live pipeline.
Resource Waste: Significant GPU hours are wasted fine-tuning a backbone that is effectively “blind” to the input scale due to missing preprocessing.

Example or Code

import keras
import keras_hub
from keras import layers, models

def get_correct_vit_model(preset_path, input_shape=(224, 224, 3), num_classes=3):
    # 1. Load the Preprocessor separately to ensure data distribution matches training
    preprocessor = keras_hub.models.ViTImagePreprocessor.from_preset(preset_path)

    # 2. Load the Backbone
    backbone = keras_hub.models.Backbone.from_preset(preset_path)

    inputs = layers.Input(shape=input_shape, name='input_layer')

    # 3. Apply preprocessing before the backbone
    x = preprocessor(inputs)

    # 4. Extract features
    # For ViT, the output shape is (batch, sequence_length, embed_dim)
    # The CLS token is indeed at index 0 if 'use_class_token' is True
    features = backbone(x)
    cls_token = features[:, 0, :] 

    # 5. Custom Head
    x = layers.Dense(128, activation='relu')(cls_token)
    outputs = layers.Dense(num_classes, activation='softmax')(x)

    model = models.Model(inputs=inputs, outputs=outputs)
    return model

How Senior Engineers Fix It

Verify Component Integrity: Never assume a Backbone is a standalone model. Always inspect the config to see if use_class_token is true and verify the input normalization requirements.
End-to-End Validation: Instead of just looking at loss, senior engineers run a single batch through the full pipeline (Preprocessor $\rightarrow$ Backbone $\rightarrow$ Head) and check if the output distribution is sane.
Explicit Preprocessing: They explicitly include the Preprocessor in the Model definition to ensure that the model.save() artifact is self-contained and deployment-ready.
Metadata Auditing: They check the specific version of the weights (e.g., checking if it is the imagenet_21k version or imagenet_1k version) to decide whether to use a higher learning rate or different augmentation strategies.

Why Juniors Miss It

The “Black Box” Fallacy: Juniors often treat from_preset() as a magic function that provides a “ready-to-use” model, failing to realize that a Backbone is a sub-component, not a complete solution.
Focusing on Architecture over Data: They spend time tweaking the Dense layer size or the number of layers in the head, while the input signal-to-noise ratio is broken due to incorrect normalization.
Ignoring Config Files: They read the code but skip the config.json files, which contain the ground truth regarding image_shape, patch_size, and use_class_token.