Training Guide

Training MTCNN Networks: PNet, RNet, ONet

MTCNN consists of three convolutional neural networks: PNet, RNet, and ONet, each responsible for different stages of face detection and landmark prediction. In this guide, we will explain the following:

  • The architecture of each model (PNet, RNet, ONet).
  • How to generate a dataset to train each network.
  • The process for training the networks.
  • How to save and load the trained weights.

1. Model Architectures

The three networks that make up MTCNN are progressively more complex, each building upon the output of the previous network. Below are the detailed architectures for PNet, RNet, and ONet.

PNet (Proposal Network)

PNet is responsible for generating initial face region proposals. It works by sliding over an image and outputting bounding boxes and a face/non-face classification score.

Model: "p_net_1" 
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)            ┃ Output Shape            ┃ Param #       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ conv1 (Conv2D)          │ (None, None, None, 10)  │ 280           │
│ prelu1 (PReLU)          │ (None, None, None, 10)  │ 10            │
│ maxpooling1 (MaxPooling)│ (None, None, None, 10)  │ 0             │
│ conv2 (Conv2D)          │ (None, None, None, 16)  │ 1,456         │
│ prelu2 (PReLU)          │ (None, None, None, 16)  │ 16            │
│ conv3 (Conv2D)          │ (None, None, None, 32)  │ 4,640         │
│ prelu3 (PReLU)          │ (None, None, None, 32)  │ 32            │
│ conv4-1 (Conv2D)        │ (None, None, None, 4)   │ 132           │
│ conv4-2 (Conv2D)        │ (None, None, None, 2)   │ 66            │
└─────────────────────────┴─────────────────────────┴───────────────┘
Total params: 6,632

The TensorFlow model can be instantiated directly with:

from mtcnn.network import PNet
import tensorflow as tf

inp_layer = tf.keras.layers.Input((None, None, 3))
out_layer = PNet()(inp_layer)
pnet = tf.keras.models.Model(inp_layer, out_layer)

RNet (Refinement Network)

RNet refines the face proposals generated by PNet and rejects false positives. It also outputs refined bounding boxes and face/non-face classification scores.

Model: "r_net_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)            ┃ Output Shape            ┃ Param #       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ conv1 (Conv2D)          │ (None, 22, 22, 28)      │ 784           │
│ prelu1 (PReLU)          │ (None, 22, 22, 28)      │ 28            │
│ maxpooling1 (MaxPooling)│ (None, 11, 11, 28)      │ 0             │
│ conv2 (Conv2D)          │ (None, 9, 9, 48)        │ 12,144        │
│ prelu2 (PReLU)          │ (None, 9, 9, 48)        │ 48            │
│ maxpooling2 (MaxPooling)│ (None, 4, 4, 48)        │ 0             │
│ conv3 (Conv2D)          │ (None, 3, 3, 64)        │ 12,352        │
│ prelu3 (PReLU)          │ (None, 3, 3, 64)        │ 64            │
│ flatten3 (Flatten)      │ (None, 576)             │ 0             │
│ fc4 (Dense)             │ (None, 128)             │ 73,856        │
│ prelu4 (PReLU)          │ (None, 128)             │ 128           │
│ fc5-1 (Dense)           │ (None, 4)               │ 516           │
│ fc5-2 (Dense)           │ (None, 2)               │ 258           │
└─────────────────────────┴─────────────────────────┴───────────────┘
Total params: 100,178

The TensorFlow model can be instantiated directly with:

from mtcnn.network import RNet
import tensorflow as tf

inp_layer = tf.keras.layers.Input((24, 24, 3))
out_layer = RNet()(inp_layer)
rnet = tf.keras.models.Model(inp_layer, out_layer)

ONet (Output Network)

ONet is the final stage, responsible for providing the most precise bounding boxes and detecting facial landmarks.

Model: "o_net_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)            ┃ Output Shape            ┃ Param #       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ conv1 (Conv2D)          │ (None, 46, 46, 32)      │ 896           │
│ prelu1 (PReLU)          │ (None, 46, 46, 32)      │ 32            │
│ maxpooling1 (MaxPooling)│ (None, 23, 23, 32)      │ 0             │
│ conv2 (Conv2D)          │ (None, 21, 21, 64)      │ 18,496        │
│ prelu2 (PReLU)          │ (None, 21, 21, 64)      │ 64            │
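│ maxpooling2 (MaxPooling)│ (None, 10, 10, 64)      │ 0             │
│ conv3 (Conv2D)          │ (None, 8, 8, 64)        │ 36,928        │
│ prelu3 (PReLU)          │ (None, 8, 8, 64)        │ 64            │
│ maxpooling3 (MaxPooling)│ (None, 4, 4, 64)        │ 0             │
│ conv4 (Conv2D)          │ (None, 3, 3, 128)       │ 32,896        │
│ prelu4 (PReLU)          │ (None, 3, 3, 128)       │ 128           │
│ flatten4 (Flatten)      │ (None, 1152)            │ 0             │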
│ fc5 (Dense)             │ (None, 256)             │ 295,168       │
│ prelu5 (PReLU)          │ (None, 256)             │ 256           │
│ fc6-1 (Dense)           │ (None, 4)               │ 1,028         │
│ fc6-2 (Dense)           │ (None, 10)              │ 2,570         │
│ fc6-3 (Dense)           │ (None, 2)               │ 514           │
└─────────────────────────┴─────────────────────────┴───────────────┘
Total params: 389,040

The TensorFlow model can be instantiated directly with:

from mtcnn.network import ONet
import tensorflow as tf

inp_layer = tf.keras.layers.Input((48, 48, 3))
out_layer = ONet()(inp_layer)
onet = tf.keras.models.Model(inp_layer, out_layer)

Feeding the networks

To check the shapes of the output tensors, feed each network a random input:

>>> import numpy as np

>>> # PNet
>>> dummy_input = np.random.randn(1, 100, 100, 3)  # batch of 1 image, 100x100 pixels, 3 channels
>>> result = pnet(dummy_input)
>>> print(result[0].shape)  # bounding-box regression
(1, 45, 45, 4)
>>> print(result[1].shape)  # face classification
(1, 45, 45, 2)

>>> # RNet
>>> dummy_input = np.random.randn(10, 24, 24, 3)  # batch of 10 fixed-size crops, 24x24 pixels, 3 channels
>>> result = rnet(dummy_input)
>>> print(result[0].shape)  # bounding-box regression
(10, 4)
>>> print(result[1].shape)  # face classification
(10, 2)

>>> # ONet
>>> dummy_input = np.random.randn(10, 48, 48, 3)  # batch of 10 fixed-size crops, 48x48 pixels, 3 channels
>>> result = onet(dummy_input)
>>> print(result[0].shape)  # bounding-box regression
(10, 4)
>>> print(result[1].shape)  # landmark regression
(10, 10)
>>> print(result[2].shape)  # face classification
(10, 2)
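
The 45x45 PNet output for a 100x100 input follows from the layer arithmetic: the parameter counts in the PNet table correspond to 3x3 kernels, each of which trims 2 pixels when applied without padding, and the pooling layer halves the resolution. The sketch below reproduces that calculation. The helper name `pnet_output_size` is hypothetical (it is not part of the mtcnn package), and the rounding-up behaviour of the pooling layer is an assumption.

```python
import math


def pnet_output_size(n: int) -> int:
    """Spatial output size of PNet along one axis for an input of n pixels.

    Assumptions (not read from the library): 3x3 convolutions without
    padding, and a 2x2 stride-2 max pooling that rounds up on odd sizes.
    """
    n = n - 2             # conv1: 3x3, no padding
    n = math.ceil(n / 2)  # maxpooling1: 2x2, stride 2
    n = n - 2             # conv2: 3x3, no padding
    n = n - 2             # conv3: 3x3, no padding
    return n


print(pnet_output_size(100))  # matches the 45x45 grid reported above
print(pnet_output_size(12))   # smallest input that yields one output cell
```

Under these assumptions a 12x12 input produces a single output cell, so each position of the PNet output grid can be read as the prediction for one 12x12 window of the input.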

2. Preparing the Dataset for Training

Each network in MTCNN (PNet, RNet, ONet) requires specific formats for the input and output data. Below, we describe how to structure the dataset for each network, including the expected shapes for the bounding boxes, classifications, and facial landmarks. Additionally, we explain the preprocessing steps required to prepare images for each network.

Input and Output Formats

  • Bounding Boxes:

    • Format: (x1, y1, x2, y2)

      • x1, y1: Coordinates of the top-left corner of the bounding box.
      • x2, y2: Coordinates of the bottom-right corner of the bounding box.
    • Shape: For each image, the bounding box is a 4-element array [x1, y1, x2, y2]. Hence, the expected result is an array of shape (batch_size, 4).

  • Classifications:

    • MTCNN uses a two-class classification output:
      • 0: Non-face region.
      • 1: Face region.
    • The classification is encoded as a one-hot vector of two categories:
      • For non-face: [1, 0].
      • For face: [0, 1].
    • Shape: For each image, the classification is a 2-element array [non-face, face]. Hence, the expected result is a one-hot array of shape (batch_size, 2).
  • Landmarks (for ONet only):

    • Format: 5 landmarks, where each landmark has two coordinates (x, y). The order is:

      1. Left eye.
      2. Right eye.
      3. Nose.
      4. Left mouth corner.
      5. Right mouth corner.
    • The predicted landmarks are structured as two consecutive arrays: first the x coordinates, then the y coordinates.

      • Example: [x_left_eye, x_right_eye, x_nose, x_mouth_left, x_mouth_right, y_left_eye, y_right_eye, y_nose, y_mouth_left, y_mouth_right]
    • Shape: A 10-element array for each image. Hence, the expected result is an array of shape (batch_size, 10).
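
The label layouts described in this section can be assembled as in the following minimal sketch. It uses plain Python lists to stay dependency-free; the helper names (`one_hot_face`, `flatten_landmarks`) and the sample coordinates are made up for illustration and are not part of the mtcnn package.

```python
def one_hot_face(is_face: bool) -> list:
    """Encode a face/non-face label as [non-face, face]."""
    return [0.0, 1.0] if is_face else [1.0, 0.0]


def flatten_landmarks(points: list) -> list:
    """Turn five (x, y) points into [x1..x5, y1..y5].

    Points must be ordered: left eye, right eye, nose,
    left mouth corner, right mouth corner.
    """
    if len(points) != 5:
        raise ValueError("expected exactly 5 landmarks")
    xs = [float(x) for x, _ in points]
    ys = [float(y) for _, y in points]
    return xs + ys


# One hypothetical sample: a face with its box and five landmarks.
bbox = [30.0, 40.0, 90.0, 110.0]  # [x1, y1, x2, y2]
points = [(45, 60), (75, 60), (60, 78), (48, 95), (72, 95)]

label = one_hot_face(True)
landmarks = flatten_landmarks(points)
print(label)
print(landmarks)
```

Stacking one such `bbox`, `label`, and `landmarks` row per sample gives arrays of shape (batch_size, 4), (batch_size, 2), and (batch_size, 10) respectively.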


3. Dataset Preparation for Each Network

Since each network (PNet, RNet, ONet) performs progressively more refined tasks, the dataset and preprocessing required differ for each stage. Below is the dataset preparation workflow for each network.

A. PNet (Proposal Network)

Task: Generate initial face region proposals from multiple image scales.

Input:

  • Images: Input images are resized to multiple scales (image pyramid) to detect faces of various sizes.
  • Labels:
    • Bounding Boxes: For each image, annotate face regions with bounding boxes in the format [x1, y1, x2, y2].
    • Classifications: For each bounding box, generate one-hot encoded labels [non-face, face].

Output:

  • PNet outputs:
    • Bounding Box Regression: The predicted bounding boxes for face regions.
    • Classifications: Whether a region contains a face (one-hot encoding).

Preprocessing:

  1. Image Pyramid: Rescale each image to multiple resolutions, typically shrinking it by a factor of 0.709 from one level to the next, to create an image pyramid.
  2. Sliding Window Detection: For each scale, PNet slides a window over the image, generating bounding boxes and classifications.
  3. Dataset:
    • Prepare multiple scales of each image.
    • Annotate each scale with the corresponding bounding boxes and classifications.
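
The image pyramid step can be sketched as a list of scale factors. The version below follows the convention used by common MTCNN implementations; the `min_face_size` parameter, its default of 20, and the stopping rule (stop once the shorter side drops below the 12-pixel window) are assumptions rather than details taken from this guide.

```python
def pyramid_scales(height, width, min_face_size=20, factor=0.709, window=12):
    """Return the scale factors of an image pyramid.

    The first scale maps faces of `min_face_size` pixels onto the
    `window`-pixel receptive field; each further level shrinks the image
    by `factor` until the shorter side would be smaller than the window.
    """
    scale = window / min_face_size
    min_side = min(height, width) * scale
    scales = []
    while min_side >= window:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales


scales = pyramid_scales(100, 100)
for s in scales:
    # Scale factor and the resulting (height, width) of that pyramid level.
    print(round(s, 4), (round(100 * s), round(100 * s)))
```

Each scale is applied to the original image, and any bounding-box annotations must be multiplied by the same factor so that they stay aligned with the resized image.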

Training:

  • Input: Image scales.
  • Output: Bounding boxes and face/non-face classifications.
  • The network learns to propose candidate face regions from different scales of the input image.

B. RNet (Refinement Network)

Task: Refine the bounding box proposals from PNet and reject false positives.

Input:

  • Cropped Face Proposals: After PNet generates bounding boxes, use them to crop the face regions from the original images.
  • Labels:

    • Refined Bounding Boxes: Provide corrections for the bounding boxes proposed by PNet. This involves calculating the difference between the proposed bounding box and the ground truth bounding box in the format [x1, y1, x2, y2].
    • Classifications: One-hot encoded labels [non-face, face] for each proposed region.

Output:

  • RNet outputs:

    • Refined Bounding Boxes: The improved coordinates of the face regions.
    • Classifications: Whether the refined region contains a face (one-hot encoding).

Preprocessing:

  1. Cropped Face Regions: Use the bounding box proposals from PNet to crop the face regions from the original image.
  2. Scale and Align: Resize the cropped regions to RNet's input size (24x24).
  3. Bounding Box Regression: For each cropped face, calculate the adjustment needed to align the PNet bounding box with the ground truth bounding box.
  4. Dataset:

    • For each cropped region, provide the bounding box adjustments and the one-hot encoded face/non-face label.
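
The bounding box adjustment in step 3 can be sketched as follows. Dividing each coordinate difference by the proposal's width or height is an assumption borrowed from common MTCNN training code, as is the use of intersection over union (IoU) to decide whether a proposal counts as a face; the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def bbox_offsets(proposal, ground_truth):
    """Corrections that move `proposal` onto `ground_truth`.

    Each coordinate difference is divided by the proposal's width (x)
    or height (y); this normalisation is an assumption.
    """
    w = proposal[2] - proposal[0]
    h = proposal[3] - proposal[1]
    return [
        (ground_truth[0] - proposal[0]) / w,
        (ground_truth[1] - proposal[1]) / h,
        (ground_truth[2] - proposal[2]) / w,
        (ground_truth[3] - proposal[3]) / h,
    ]


# Hypothetical PNet proposal and annotated face.
proposal = [20.0, 20.0, 60.0, 60.0]
ground_truth = [24.0, 18.0, 64.0, 62.0]
print(iou(proposal, ground_truth))
print(bbox_offsets(proposal, ground_truth))
```

Normalising by the proposal size keeps the regression targets independent of the crop's resolution, which matters because every crop is resized to the same input size before training.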

Training:

  • Input: Cropped face proposals, resized to RNet's input size (24x24).
  • Output: Refined bounding boxes and face classifications.

C. ONet (Output Network)

Task: Further refine bounding boxes and predict facial landmarks.

Input:

  • Cropped Face Proposals: As with RNet, but the crops are taken from RNet’s refined bounding boxes.
  • Labels:
    • Final Bounding Boxes: Provide final bounding box corrections based on the ground truth in the format [x1, y1, x2, y2].
    • Landmarks: For each face region, provide the coordinates of the 5 landmarks (left eye, right eye, nose, left mouth corner, right mouth corner) in the format [x_left_eye, x_right_eye, x_nose, x_mouth_left, x_mouth_right, y_left_eye, y_right_eye, y_nose, y_mouth_left, y_mouth_right].
    • Classifications: One-hot encoded labels [non-face, face] for each proposed region.

Output:

  • ONet outputs:

    • Final Bounding Boxes: The precise coordinates of the face regions.
    • Landmarks: The (x, y) coordinates for the 5 key facial landmarks.
    • Classifications: Whether the region contains a face (one-hot encoding).

Preprocessing:

  1. Cropped Face Regions: Use RNet’s output to crop the face regions from the original image.
  2. Scale and Align: Resize the cropped regions to ONet’s input size (48x48).
  3. Bounding Box Regression: For each cropped face, calculate the adjustment needed for the bounding box.
  4. Landmark Annotation: Provide the coordinates for the 5 key facial landmarks.
  5. Dataset:
    • Each image must have final bounding box adjustments, face classifications, and landmark annotations.
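
Because ONet sees a resized crop rather than the full image, landmark annotations given in image coordinates usually need to be expressed relative to the crop. The sketch below normalises them to the range [0, 1] inside the crop and emits them in the x-first layout described earlier. The normalisation convention and the function name are assumptions, and the coordinates are invented.

```python
def landmarks_relative_to_crop(points, crop):
    """Express (x, y) landmarks relative to a crop box, x values first.

    Assumption: coordinates are normalised to [0, 1] within the crop.
    """
    x1, y1, x2, y2 = crop
    w, h = x2 - x1, y2 - y1
    xs = [(x - x1) / w for x, _ in points]
    ys = [(y - y1) / h for _, y in points]
    return xs + ys


crop = [20.0, 20.0, 68.0, 68.0]  # hypothetical RNet-refined box (48x48)
# Left eye, right eye, nose, left mouth corner, right mouth corner.
points = [(32, 38), (56, 38), (44, 50), (35, 59), (53, 59)]
target = landmarks_relative_to_crop(points, crop)
print(target)
```

Whatever convention is chosen at training time has to be inverted at inference time to map ONet's landmark predictions back to image coordinates.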

Training:

  • Input: Cropped face regions.
  • Output: Final bounding boxes, face classifications, and landmarks.

4. Training Process

Once the dataset is prepared, the training process involves:

  1. Loading the Dataset: Use the prepared dataset (images, bounding boxes, classifications, and landmarks).
  2. Model Compilation: Each network is compiled with appropriate loss functions:
    • Bounding Box Loss: Mean squared error (MSE) or smooth L1 loss for bounding box regression.
    • Classification Loss: Categorical cross-entropy for face/non-face classification.
    • Landmark Loss (for ONet): Mean squared error for landmark regression.
  3. Training: Call .fit() on the model, passing the input data and the corresponding outputs.
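
To make the loss terms concrete, the sketch below computes them for a single sample in plain Python and combines them with a weighted sum. The weights and the sample values are purely illustrative, the dictionary keys are hypothetical, and the classification output is assumed to already be a probability distribution.

```python
import math


def mse(pred, target):
    """Mean squared error between two equally long vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def categorical_cross_entropy(probs, one_hot, eps=1e-7):
    """Cross-entropy between predicted probabilities and a one-hot label."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probs, one_hot))


def total_loss(pred, target, weights):
    """Weighted sum of per-task losses; landmarks are optional (ONet only)."""
    loss = weights["cls"] * categorical_cross_entropy(pred["cls"], target["cls"])
    loss += weights["bbox"] * mse(pred["bbox"], target["bbox"])
    if "landmarks" in target:
        loss += weights["landmarks"] * mse(pred["landmarks"], target["landmarks"])
    return loss


# Hypothetical single-sample prediction and label for PNet or RNet.
pred = {"cls": [0.2, 0.8], "bbox": [0.1, 0.0, 0.1, 0.0]}
target = {"cls": [0.0, 1.0], "bbox": [0.1, -0.1, 0.1, 0.1]}
weights = {"cls": 1.0, "bbox": 0.5, "landmarks": 0.5}  # illustrative values
print(total_loss(pred, target, weights))
```

In Keras, the same arrangement is normally expressed by passing one loss per output, plus optional `loss_weights`, to `compile()` before calling `.fit()`.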

5. Saving and Loading Weights

After training, save the model weights with joblib using LZ4 compression (joblib needs the lz4 package installed for this):

import joblib

# Saving the weights
joblib.dump(pnet.get_weights(), "pnet.lz4", compress=("lz4", 1))
joblib.dump(rnet.get_weights(), "rnet.lz4", compress=("lz4", 1))
joblib.dump(onet.get_weights(), "onet.lz4", compress=("lz4", 1))

To load the saved weights and assemble a detector from them:

from mtcnn import MTCNN
from mtcnn.stages import StagePNet, StageRNet, StageONet

# Build each stage from its saved weights file
stage_pnet = StagePNet(weights="pnet.lz4")
stage_rnet = StageRNet(weights="rnet.lz4")
stage_onet = StageONet(weights="onet.lz4")

# Chain the three stages into a single detector
mtcnn = MTCNN(stages=[stage_pnet, stage_rnet, stage_onet])

Conclusion

This guide covers the preparation of datasets, the specific formats required for each network, and the training process for PNet, RNet, and ONet. Each network performs progressively refined tasks, requiring different preprocessing steps and annotations. By following this guide, you can train MTCNN models on your custom datasets and use them for accurate face detection and landmark prediction.