Eric Schorn

Technical Director

Cryptography Services Practice, NCC Group

June 15, 2023

This executable blog post is the fourth in a series related to machine learning and is a fascinating trifecta involving hardened cryptography software, embedded IoT-type hardware, and deep machine learning techniques. While the AES algorithm is designed such that a brute-force secret key guessing attack would likely finish 'sometime near eternity', the power side-channel attack demonstrated here retrieves the 128-bit secret key 'probably closer to lunchtime'. After reviewing the specific attack scenario with its hardware and software elements, we use publicly available benchmark profiling data to train a deep machine learning model to support secret key extraction. We then proceed through a methodical process that begins with intermediate model predictions from benchmark attack data and removes the hardening protections to ultimately produce a secret key from approximately 40-100 power traces taken together. While the benchmark-oriented scenario is simplified for clarity, it is strongly indicative of how difficult it is to protect cryptographic primitives running on embedded hardware from power side-channel attacks.

The approach described here closely follows the paper "Side Channel Analysis against the ANSSI's protected AES implementation on ARM", and utilizes the ASCAD2 public benchmarking data/code available on GitHub at https://github.com/ANSSI-FR/ASCAD. The ASCAD2 traces were measured on an MCU running the "Hardened Library for AES-128 encryption/decryption on ARM Cortex M4 Architecture" developed by the French National Cybersecurity Agency (ANSSI). The ASCAD2 dataset was generated because many felt that prior benchmarking datasets had become too easy; this dataset is extremely difficult to attack successfully. This dataset offers an excellent opportunity for study, exploration and preparation to attack post-quantum algorithms such as Kyber, as noted in our Real World Crypto 2023 summary.

You can run everything for yourself by downloading the data and executing this .ipynb file on any Jupyter-based notebook system. The goal for the code below is to utilize state-of-the-art models while maximizing simplicity, understandability, and accessibility. This is consistent with the three prior posts in the blog series:

- Machine Learning 101: The Integrity of Image (Mis)Classification?
- Machine Learning 102: Attacking Facial Authentication with Poisoned Data
- Machine Learning 103: Exploring LLM Code Generation

A simplified scenario with full control over an Internet of Things microcontroller (MCU) is explored. The attacker can repeatedly encrypt chosen plaintext with a given key while capturing power consumption traces. Our goal is to develop and train a machine learning model on a large number of profiling traces to support key extraction. We then test the model's ability to support key extraction on a variable number of additional attack traces that were unseen during training. Note that in a real-world scenario there may be multiple hardware targets, such as one in the lab for controlled profiling with another under attack in the field, and data acquisition may be more challenging.

The target device for this research is an ARM Cortex-M4 based MCU on the ChipWhisperer board. This board allows users to load software and run repeated encryptions while capturing a tremendous amount of power trace data. The MCU is underclocked to 4 MHz to allow for 25 samples per clock cycle. Each sample is an 8-bit signed integer corresponding to instantaneous power consumption.

The ANSSI has developed a hardened AES library to protect against side-channel attacks, which exploit the physical properties of a device to reveal secret values. This library performs ECB-mode encryption and decryption in constant time (which itself can be challenging to achieve), and also uses an affine masking scheme. Each byte `x` of the internal (16-byte) state `s` is replaced with `alpha * x + beta`, where `alpha` is a multiplicative share and `beta` is an additive share in GF(2^{8}), both of which are constant across rounds. The initial state and the final results must be adjusted to compensate for the two shares. Additionally, some AES sub-routines are performed in a pseudorandom order, i.e., shuffled. These three additional measures increase the encryption run time to approximately 30-50k clock cycles per block, compared to around 1k clock cycles for a performance-optimized version. Overall, these measures are intended to provide robustness against side-channel attacks and prevent the leakage of sensitive information during the execution of the algorithm.
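To make the masking concrete, here is a minimal sketch of the affine encoding `alpha * x + beta`. It assumes that multiplication in GF(2^{8}) uses the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 and that `alpha` is nonzero so it can be inverted. The helper names (`gf_mul`, `gf_inv`, `mask`, `unmask`) and the example share values are illustrative and are not taken from the ANSSI library.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a          # addition in GF(2^8) is XOR
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B           # reduce back into 8 bits
    return result

def gf_inv(a: int) -> int:
    """Brute-force multiplicative inverse of a nonzero byte (fine for a demo)."""
    if a == 0:
        raise ValueError("zero has no multiplicative inverse")
    return next(b for b in range(1, 256) if gf_mul(a, b) == 1)

def mask(x: int, alpha: int, beta: int) -> int:
    """Affine encoding of one state byte: alpha * x + beta."""
    return gf_mul(alpha, x) ^ beta

def unmask(y: int, alpha: int, beta: int) -> int:
    """Invert the affine encoding: alpha^-1 * (y + beta)."""
    return gf_mul(gf_inv(alpha), y ^ beta)

state = list(range(16))              # hypothetical 16-byte state
alpha, beta = 0x57, 0xA3             # hypothetical shares (alpha must be nonzero)
masked = [mask(x, alpha, beta) for x in state]
recovered = [unmask(y, alpha, beta) for y in masked]
print(recovered == state)
```

Because multiplication by a nonzero `alpha` is a bijection on bytes, anyone who knows (or can predict) both shares can undo the encoding, which is exactly what the attack later exploits.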

The code starts here by importing standard machine learning frameworks and reporting major version information.

In [1]:

```
# Clone data at https://github.com/ANSSI-FR/ASCAD
DATA_FILE = "ASCAD/STM32_AES_v2/ASCAD_data/ASCAD_databases/ascadv2-extracted.h5"
import os ; os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import absl.logging ; absl.logging.set_verbosity(absl.logging.ERROR)
from random import shuffle
from functools import partial
import h5py
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
tf.get_logger().setLevel('ERROR')
from tensorflow import TensorSpec, convert_to_tensor, expand_dims, float32, int8
from tensorflow.data import AUTOTUNE, Dataset
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Activation, AveragePooling1D, BatchNormalization, \
    Conv1D, Dense, Flatten, Input, MaxPooling1D, add
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from tensorflow.keras.utils import to_categorical
print(f"Tensorflow version {tf.__version__} with Keras version {tf.keras.__version__}")
```

Tensorflow version 2.12.0 with Keras version 2.12.0

To train the model for power side-channel analysis, sufficient training data must first be collected from a target device. This process is known as profiling and initially involved capturing 500,000 traces that each contain 1 million raw samples, for a total of 500 GB. Given insight into the AES specification, the source code, and some visual inspection and analysis, two subset windows of potential side-channel leakage were identified. The first window of 5,000 samples is associated with the calculation of the initial/starting `s` states utilizing alpha and beta. The second window of 10,000 samples is associated with the first AES `SubBytes()` (or S-Box substitution) operations. Thus, each of the 500,000 derived training traces contains 15,000 samples that effectively carry the 'signal' and has attached metadata such as the secret key and plaintext. It is important to note that **NO** attack data will be used until after the model has been fully trained in order to ensure valid results.

To prepare the data for training, a custom Python generator is implemented and attached to two TensorFlow datasets suitable for model training and validation. A fixed 90% of the data is allocated to training only, with the remaining 10% held out for validation. This approach will allow us to effectively train the model to predict the sensitive variables in the target device's intermediate computations (with support from the metadata), thereby reducing the AES code's resistance to side-channel attacks.

In [2]:

```
# Python generator to yield samples of training data
def generate_data(h5_data, mode):
    traces = np.array(h5_data['traces'])
    labels = np.array(h5_data['labels'])
    alpha = [to_categorical(labels['alpha_mask'], num_classes=256)]
    beta = [to_categorical(labels['beta_mask'], num_classes=256)]
    sboxes = [to_categorical(labels['sbox_masked'][:,i], 256) for i in range(16)]
    perms = [to_categorical(labels['perm_index'][:,i], 16) for i in range(16)]
    targets = alpha + beta + sboxes + perms  # 34 separate targets per trace
    if mode == "train": indices = [i for i in range(int(len(traces)*0.90))]  # 90% train
    else: indices = [int(len(traces)*0.90)+i for i in range(int(len(traces)*0.10))]
    shuffle(indices)
    for index in indices:  # Yield one trace and one [targets] in random order
        yield expand_dims(convert_to_tensor(traces[index]), -1), \
              tuple([convert_to_tensor(targets[i][index]) for i in range(len(targets))])

# Load the data; we are constrained to strictly profiling/training data for now
data_file = h5py.File(DATA_FILE, "r")
profiling_data = data_file['Profiling_traces']

# Connect Python generators to Tensorflow datasets to feed model training
train_data = Dataset.from_generator(partial(generate_data, profiling_data, "train"),
    output_signature=((TensorSpec(shape=(15000, 1), dtype=int8),
        tuple([TensorSpec(shape=(256), dtype=float32) for i in range(18)] +
              [TensorSpec(shape=(16), dtype=float32) for i in range(16)]
    )))).cache().batch(64).prefetch(AUTOTUNE)
valid_data = Dataset.from_generator(partial(generate_data, profiling_data, "valid"),
    output_signature=((TensorSpec(shape=(15000, 1), dtype=int8),
        tuple([TensorSpec(shape=(256), dtype=float32) for i in range(18)] +
              [TensorSpec(shape=(16), dtype=float32) for i in range(16)]
    )))).cache().batch(64).prefetch(AUTOTUNE)
```

As can be seen from the output signature above, the TensorFlow datasets yield individual items consisting of one 15,000-sample trace and 34 target values, which are then collected into batches of 64. The 34 target values include an alpha and beta byte, a byte for each of the 16 masked S-Box output predictions, and a nibble for each of the 16 shuffling permutation indices. Note that each byte and nibble value is expanded into its one-hot representation of 256 and 16 target classes respectively. Thus, the model predictions will essentially be `2 * 256 + 16 * 256 + 16 * 16 = 4864` values, which is relatively large. We will ultimately use the intermediate alpha, beta, and shuffling information to unmask the masked S-Box output predictions. With known plaintext, the unmasked S-Box output predictions are used to extract the key.
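As a rough sketch of what "unmasking a prediction" can mean, the snippet below combines three 256-way probability vectors (for alpha, beta, and one masked S-Box output) into a single distribution over the unmasked S-Box output `z`. It assumes the masked value equals `alpha * z XOR beta` in GF(2^{8}) with the AES reduction polynomial, that alpha is never zero, and that the three predictions can be treated as independent. The function `unmask_distribution` and the example values are hypothetical; shuffling is not modelled here.

```python
import numpy as np

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result

# MUL[a, z] = a * z in GF(2^8), precomputed once
MUL = np.array([[gf_mul(a, z) for z in range(256)] for a in range(256)])
BETAS = np.arange(256)

def unmask_distribution(p_alpha, p_beta, p_masked):
    """Marginalize the two shares out of a masked S-Box prediction.

    P(z) is proportional to the sum over alpha != 0 and beta of
    P(alpha) * P(beta) * P_masked(alpha * z XOR beta).
    """
    p_z = np.zeros(256)
    for a in range(1, 256):                      # alpha = 0 assumed unused
        idx = MUL[a][:, None] ^ BETAS[None, :]   # idx[z, b] = a*z XOR b
        p_z += p_alpha[a] * (p_masked[idx] @ p_beta)
    return p_z / p_z.sum()

# Sanity check with perfectly confident (one-hot) hypothetical predictions
alpha, beta, z_true = 0x57, 0xA3, 0x3C
p_alpha = np.eye(256)[alpha]
p_beta = np.eye(256)[beta]
p_masked = np.eye(256)[gf_mul(alpha, z_true) ^ beta]
p_z = unmask_distribution(p_alpha, p_beta, p_masked)
print(int(np.argmax(p_z)) == z_true)
```

With real model outputs the three inputs are soft distributions rather than one-hot vectors, so the result spreads probability over several candidate values of `z`.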

One full trace consisting of 15,000 samples is plotted below along with a zoomed-in subset of the first 1,100 samples. The top rightmost portion of the trace (starting near sample 12,000) shows 10 rounds of `Cipher()` corresponding to Algorithm 1 in Section 5.1 of the AES specification.

In [3]:

```
for sample in train_data:
    fig = plt.figure(figsize=(12, 2))
    plt.plot(sample[0][0]); plt.xlim(0, 15000)
    fig = plt.figure(figsize=(12, 2))
    plt.plot(sample[0][0][0:1100]) ; plt.xlim(0, 1100)
    break  # We only want to draw the first trace
```

Now the model will be constructed by first defining a few helper functions that return common model sub-blocks and then utilizing them to build the full deep network model. Once completed, the model is ready for training with the profiling-only traces and intermediate/internal values. When the model is later used for the attack, it consumes the traces and predicts the intermediate/internal values.

The code below consists of a `resnet_layer()` function that returns an individual layer for subsequent incorporation into a Residual Neural Network or `resnet`-style block. The returned value is just a parameterized convolutional layer followed by optional batch normalization and activation operations.

The second function `output_block()` below returns an output block reused for each of the model output stages that predict target values. The returned block consists of a fully connected 1,024-element dense layer followed by a batch normalization operation and finalized by a dense layer with softmax activation. The softmax activation produces one non-negative value per class, summing to 1, which can be (loosely) considered probabilities over the one-hot target classes.
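For readers less familiar with softmax, the following small sketch (plain Python, independent of the model code, with made-up logits) shows how raw scores are turned into values that are positive and sum to 1. The function name `softmax` and the input values are illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]   # subtract max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1, -1.0])   # hypothetical logits for 4 classes
print([round(p, 3) for p in probs], round(sum(probs), 6))
```

The largest logit receives the largest share, but every class keeps some probability mass, which is why the outputs are compared against one-hot targets rather than being one-hot themselves.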

In [4]:

```
# Resnet layer sub-function
def resnet_layer(inputs, num_filters=16, k_size=11, strides=1,
                 activation='relu', batch_norm=True):
    x = inputs
    x = Conv1D(num_filters, kernel_size=k_size, strides=strides, padding='same')(x)
    if batch_norm: x = BatchNormalization()(x)
    if activation is not None: x = Activation(activation)(x)
    return x

# Output block sub-function for alpha, beta, sboxes and perms
def output_block(inputs, name, width):
    x = inputs
    x = Dense(1024, activation='relu', name=f'fc1_{name}')(x)
    x = BatchNormalization()(x)
    x = Dense(width, activation="softmax", name=f'{name}_output')(x)
    return x
```

Now the entire model can be built via the `build_model()` function implemented below. As described earlier, the model input consists of a 15,000-sample trace. Nine `resnet` components are stacked on top of each other with their output flattened before arriving at the 34 output blocks. Thus, the deep portion of the network is shared by all output target blocks. As can be seen, certain parameters, such as strides and the number of filters in a `resnet_layer`, are adapted as the network grows.
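To see how the strides and filter counts shape the data as it flows through the nine stacks, here is a small arithmetic sketch. It assumes the standard Keras behaviour that a `Conv1D` with `padding='same'` and stride `s` yields `ceil(length / s)` positions, and that `AveragePooling1D(pool_size=4)` with default settings yields `floor(length / 4)` positions. No TensorFlow is needed to run it.

```python
import math

length = 15000        # samples per input trace
num_filters = 16      # starting filter count
shapes = []           # (sequence length, channels) after each stack
for stack in range(9):
    strides = 1 if stack == 0 else 2          # stacks 1..8 halve the length
    length = math.ceil(length / strides)      # 'same' padding
    channels = num_filters
    shapes.append((length, channels))
    if num_filters < 256:                     # filters double until capped at 256
        num_filters *= 2

pooled = length // 4                          # AveragePooling1D(pool_size=4)
flat = pooled * channels                      # width seen by each output block
for stack, shape in enumerate(shapes):
    print(f"stack {stack}: {shape}")
print(f"pooled length {pooled}, flattened width {flat}")
```

Under these assumptions the final stack produces 59 positions of 256 channels, which pool down to 14 positions and flatten to 3,584 features shared by all 34 output blocks.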

In [5]:

```
# Build the complete 'ResNetSCA' model
def build_model():
    num_filters = 16; strides = 1  # Starting condition
    inputs = Input(shape=(15000,1))
    x = resnet_layer(inputs=inputs)
    for stack in range(9):
        if stack > 0: strides = 2
        y = resnet_layer(inputs=x, num_filters=num_filters, strides=strides)
        y = resnet_layer(inputs=y, num_filters=num_filters, activation=None)
        if stack > 0: x = resnet_layer(inputs=x, num_filters=num_filters, k_size=1,
                                       strides=strides, activation=None, batch_norm=False)
        x = add([x, y])
        x = Activation('relu')(x)
        if (num_filters<256): num_filters *= 2
    x = AveragePooling1D(pool_size=4)(x)
    x = Flatten()(x)
    # Total of 34-hot out of 4864 output bits...
    x_alpha = output_block(x, "alpha", 256)
    x_beta = output_block(x, "beta", 256)
    x_sboxes = [output_block(x, f"sbox_{i:02}", 256) for i in range(16)]
    x_perms = [output_block(x, f"perm_{i:02}", 16) for i in range(16)]
    return Model(inputs, [x_alpha, x_beta] + x_sboxes + x_perms, name='ResNetSCA')
```

First, the `plot_history()` helper function is implemented below to plot training results. While it is rather involved and somewhat compressed/messy, there is no magic to be found here.

Next, the model is built and trained. The model consists of 137M trainable parameters, so it will fit inside a consumer GPU. The initial learning rate is arguably excessive at `1e-3`, but a callback reduces the learning rate by a factor of 4 whenever forward progress is not made over 2 adjacent epochs (which will happen several times during training). Similarly, the model is checkpointed such that epochs without forward progress are discarded. Thus, the performance plots will show no extended backward trends in validation loss performance. The model was trained for a total of 30 epochs where minor-but-valuable improvements in later epochs appear to primarily involve the S-Box results.
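The learning-rate behaviour described above can be sketched without TensorFlow. The function below is a simplified stand-in for a reduce-on-plateau rule with `factor=0.25` and `patience=2`; it ignores details of the real Keras callback such as its minimum-improvement threshold and cooldown, and the validation losses fed to it are made up for illustration.

```python
def plateau_schedule(val_losses, lr=1e-3, factor=0.25, patience=2):
    """Return the learning rate in effect after each epoch (simplified rule)."""
    best = float("inf")
    wait = 0                      # epochs since the last improvement
    rates = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:  # no progress for `patience` epochs in a row
                lr *= factor
                wait = 0
        rates.append(lr)
    return rates

# Hypothetical validation losses: two plateaus, so two reductions
losses = [95.0, 93.0, 92.0, 92.5, 92.2, 91.0, 91.5, 91.2]
rates = plateau_schedule(losses)
print(rates)
```

Each reduction divides the rate by 4, so a handful of plateaus quickly moves the optimizer from the aggressive starting value into a much finer regime.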

The plotted performance results suggest that predicting alpha is relatively easy. Predicting the 16 shuffling permutations ranges from easy to moderately challenging, while predicting beta is moderately challenging (based on the shown validation accuracy). Recall that a random prediction for a one-hot byte would give 0.39% accuracy and a one-hot nibble gives 6.25% accuracy. As such, all but the S-Box output predictions look reasonably good at first glance.

The model's performance on S-Box output predictions is far lower than other values but still about 4X better accuracy than a random choice. As a result, the model will ultimately need to be run on multiple attack traces with predictions aggregated to strengthen the signal while weakening the noise.
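The idea of aggregating weak predictions can be illustrated with a self-contained toy. The sketch below does not use the trained model or the ASCAD2 data: it fabricates noisy 256-way "predictions" in which the true S-Box output is only modestly boosted, then sums log-probabilities per key-byte guess across traces. It assumes the predictions are already unmasked and that the target is `SBOX[plaintext XOR key]`; the names `synthetic_prediction` and `rank_key_bytes`, the boost factor, and the trace count are all hypothetical.

```python
import math
import random

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result

def build_sbox():
    """Derive the AES S-Box: field inverse followed by the affine transform."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(b for b in range(1, 256) if gf_mul(x, b) == 1)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()

def synthetic_prediction(rng, true_class, boost=4.0):
    """Fabricated stand-in for one model output: noise with the true class boosted."""
    scores = [rng.expovariate(1.0) + 1e-12 for _ in range(256)]
    scores[true_class] *= boost
    total = sum(scores)
    return [s / total for s in scores]

def rank_key_bytes(plaintexts, predictions):
    """Sum log-likelihoods per key-byte guess and return guesses, best first."""
    totals = [0.0] * 256
    for p, probs in zip(plaintexts, predictions):
        for guess in range(256):
            totals[guess] += math.log(probs[SBOX[p ^ guess]])
    return sorted(range(256), key=lambda g: totals[g], reverse=True)

rng = random.Random(1234)
true_key = 0x2B                                    # hypothetical key byte
plaintexts = [rng.randrange(256) for _ in range(200)]
predictions = [synthetic_prediction(rng, SBOX[p ^ true_key]) for p in plaintexts]
best_guess = rank_key_bytes(plaintexts, predictions)[0]
print(hex(best_guess), best_guess == true_key)
```

Any single fabricated prediction here is unreliable, but the correct guess accumulates a small advantage on every trace while wrong guesses do not, so the summed score separates them. The number of traces needed in practice depends on how strong the real per-trace signal is.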

In [6]:

```
# Plot the training history
def plot_history(hist):
    plt.figure(figsize=(14, 10)); xticks = range(0,len(hist.history["loss"]))
    ax = plt.subplot(2, 2, 1); plt.title("Total Loss")
    plt.plot(hist.history["loss"]); plt.plot(hist.history["val_loss"])
    plt.legend(["Train", "Validation"], loc="upper right"); plt.ylim(80, 100)
    plt.xlabel("End of Epoch"); plt.xticks(xticks)
    ax = plt.subplot(2, 2, 2); plt.title("Alpha/Beta Accuracy (val)")
    plt.plot(hist.history["val_alpha_output_accuracy"])
    plt.plot(hist.history["val_beta_output_accuracy"])
    plt.legend(["Alpha", "Beta"], loc="upper right")
    plt.xlabel("End of Epoch"); plt.xticks(xticks)
    ax = plt.subplot(2, 2, 3); plt.title("SBox 0-15 Accuracy (val)")
    for i in range(16): plt.plot(hist.history[f"val_sbox_{i:02}_output_accuracy"])
    plt.xlabel("End of Epoch"); plt.xticks(xticks)
    ax = plt.subplot(2, 2, 4); plt.title("Shuffling 0-15 Accuracy (val)")
    for i in range(16): plt.plot(hist.history[f"val_perm_{i:02}_output_accuracy"])
    plt.xlabel("End of Epoch"); plt.xticks(xticks)
    plt.show()

# Build and train the model
def train_model():
    model = build_model()
    model.compile(loss='categorical_crossentropy', optimizer=RMSprop(1e-3), metrics=['accuracy'])
    callbacks = [ReduceLROnPlateau(factor=0.25, patience=2, verbose=0)]
    callbacks.append(ModelCheckpoint("model_checkpoint", save_best_only=True, verbose=0))
    history = model.fit(train_data, validation_data=valid_data, epochs=30, callbacks=callbacks, verbose=0)
    return history, model

history, model = train_model()
print(f"{np.sum([np.prod(v.shape) for v in model.trainable_variables])} total trainable parameters")
plot_history(history)
```

137396064 total trainable parameters
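The reported parameter count can be cross-checked by hand. The sketch below tallies the trainable parameters layer by layer, assuming the usual Keras conventions: a `Conv1D` has `c_in * c_out * kernel + c_out` weights, a `Dense` layer has `n_in * n_out + n_out`, and `BatchNormalization` contributes two trainable values per channel (its moving statistics are not trainable). The helper names are illustrative.

```python
def conv1d_params(c_in, c_out, k):
    return c_in * c_out * k + c_out      # kernel weights + biases

def batch_norm_params(c):
    return 2 * c                          # trainable gamma and beta only

def dense_params(n_in, n_out):
    return n_in * n_out + n_out           # weights + biases

# Initial resnet_layer: 1 input channel -> 16 filters, kernel size 11
total = conv1d_params(1, 16, 11) + batch_norm_params(16)

c_in, num_filters = 16, 16
for stack in range(9):
    # Two kernel-11 convolutions per stack, each followed by batch norm
    total += conv1d_params(c_in, num_filters, 11) + batch_norm_params(num_filters)
    total += conv1d_params(num_filters, num_filters, 11) + batch_norm_params(num_filters)
    if stack > 0:
        total += conv1d_params(c_in, num_filters, 1)   # 1-wide shortcut, no batch norm
    c_in = num_filters
    if num_filters < 256:
        num_filters *= 2

flat = 14 * 256   # 59 positions pooled by 4 -> 14 positions x 256 channels
for width in [256] * 18 + [16] * 16:      # alpha, beta, 16 S-Boxes, then 16 perms
    total += dense_params(flat, 1024) + batch_norm_params(1024) + dense_params(1024, width)

print(total)  # 137396064
```

Roughly 130M of the 137M parameters sit in the 34 output blocks (mostly their 3,584-by-1,024 dense layers), while the shared convolutional trunk accounts for about 7.5M.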