Introduction to TinyML

In this blog post, you’ll learn about the Arduino Nano 33 BLE and the basics of TensorFlow Lite for Microcontrollers.

Since I was selected to contribute to TensorFlow Lite for Microcontrollers in this year’s Google Summer of Code program, I will be learning how to run machine learning models on microcontrollers and sharing my experiments and findings in a series of blog posts.

TinyML involves making predictions on microcontrollers using machine learning models. The most challenging part of TinyML is dealing with hardware constraints: we have to work with limited memory and computational power, and if the device runs on a battery, we must also keep power consumption down, since more computation drains the battery faster.

Although there are different frameworks for TinyML applications, Google’s TensorFlow Lite Micro is widely used in this field.

TensorFlow Lite for Microcontrollers officially supports a number of development boards; the full list is available on the project’s site (https://www.tensorflow.org/lite/microcontrollers).

For my experiments, I’m going to use the Arduino Nano 33 BLE.

According to the datasheet (https://docs.arduino.cc/resources/datasheets/ABX00030-datasheet.pdf), the key features of the Arduino Nano 33 BLE include the following.

· Regulates input voltage from up to 21 V with a minimum of 65% efficiency at minimum load

The Arduino Nano 33 BLE uses an ARM Cortex-M4F processor running at up to 64 MHz, which means a single clock cycle takes about 0.016 microseconds (roughly 15.6 nanoseconds). The processor has 32-bit registers, 1 MB of flash memory, and 256 KB of RAM. “The flash (non-volatile memory) can be read an unlimited number of times by the CPU, but it has restrictions on the number of times it can be written and erased and also on how it can be written”. [1] The processor also contains a floating-point unit (FPU), or math coprocessor, which performs operations such as addition, subtraction, multiplication, and division on floating-point numbers.
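Spelled out, the clock-cycle arithmetic is just the reciprocal of the clock frequency:

$$ t_{\text{cycle}} = \frac{1}{f} = \frac{1}{64\,\text{MHz}} \approx 15.6\,\text{ns} \approx 0.016\,\mu\text{s} $$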

“Arm Cortex processors with Digital Signal Processing (DSP) extensions offer high performance signal processing for voice, audio, sensor hubs and machine learning applications, with flexible, easy-to-use programming. The extensions provide a unique combination of compute scalability, power efficiency, determinism and interface options in order to perform the signal processing required in multi-sensor devices that do not require dedicated DSP hardware. The benefits of DSP extensions in Cortex processors include:

· Simplify the design, lower the bill of materials, reduce power and area with DSP and ML capabilities on Arm processors across a single architecture.

· Reduce system-level complexity by removing the need for shared memory and DSP communication, complex multi-processor bus architectures, and other custom ‘glue’ logic between the processor and DSP.

· Reduce software development costs, as the entire project can be supported using a single compiler, debugger or IDE, programmable in a high-level programming language such as C or C++.” (https://developer.arm.com/Architectures/Digital%20Signal%20Processing)

The ARM Cortex-M4F implements the Armv7E-M architecture. For more information about the Armv7E-M architecture, you can download the architecture reference manual from here (https://developer.arm.com/documentation/ddi0403/latest). ARM Cortex-M4 processors are based on the Harvard computer architecture, which means there are separate memories and pathways for instructions and data.

The Arduino Nano 33 BLE uses the u-blox NINA-B306, a powerful 2.4 GHz Bluetooth® 5 Low Energy module from u-blox with an internal antenna.

The board also carries an LSM9DS1 module, which detects orientation, motion, and vibration using a 3D accelerometer, gyroscope, and magnetometer.

The Arduino Nano 33 BLE runs on Mbed OS, an open-source (https://github.com/ARMmbed) operating system that targets microcontrollers, Internet of Things devices, and wearables.

TensorFlow Lite Micro is a machine learning framework for microcontrollers, designed for low memory usage and low power consumption.

TensorFlow Lite Micro employs an interpreter-based approach, which allows portability across different hardware platforms. In contrast, a code-generation approach, which compiles the model into C code, does not offer this portability. Furthermore, “code generation intersperses settings such as model architecture, weights, and layer dimensions in the binary, which means replacing the entire executable to modify a model. In contrast, an interpreted approach keeps all this information in a separate memory file/area, allowing model updates to replace a single file or contiguous memory area.” [2]
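To make the interpreter-based approach concrete, here is a minimal sketch using the desktop Python TFLite interpreter (the file name model.tflite is a placeholder). The C++ TensorFlow Lite Micro interpreter used on microcontrollers follows the same pattern, except the model is stored as a byte array in flash and tensors live in a statically allocated arena.

```python
import numpy as np
import tensorflow as tf

# The model lives in a separate .tflite file, not in the program binary,
# so it can be swapped out without recompiling the application.
interpreter = tf.lite.Interpreter(model_path="model.tflite")  # placeholder path
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input with the expected shape and dtype, then run inference.
dummy_input = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy_input)
interpreter.invoke()

prediction = interpreter.get_tensor(output_details[0]["index"])
print(prediction)
```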

According to the official TensorFlow Lite Micro documentation (https://www.tensorflow.org/lite/microcontrollers), the following steps are required to train a model, deploy it to the device, and run inference.

· Build your model considering the limits of your device. Smaller models can cause underfitting, and larger ones might result in a higher duty cycle, which drains more power.

· Convert your TensorFlow model to a TensorFlow Lite model.

· Convert your TensorFlow Lite model to a C array to deploy it to your device (see the sketch below).
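As a minimal sketch of the last two steps (the tiny Keras model, file names, and array names below are all placeholder choices), here is how a model can be converted and written out as a C array, roughly like the `xxd -i` command-line tool would produce:

```python
import tensorflow as tf

# A deliberately tiny model, just so there is something to convert.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])

# Convert the TensorFlow model to a TensorFlow Lite flatbuffer.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_bytes = converter.convert()

# Write the flatbuffer out as a C array so it can be compiled into firmware.
with open("model_data.h", "w") as f:
    f.write("const unsigned char g_model_data[] = {\n  ")
    f.write(", ".join(f"0x{b:02x}" for b in tflite_bytes))
    f.write("\n};\n")
    f.write(f"const unsigned int g_model_data_len = {len(tflite_bytes)};\n")
```

In practice you would train the model first and usually enable quantization in the converter before deploying it to a microcontroller.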

Unfortunately, TensorFlow Lite for Microcontrollers doesn’t support all of the TensorFlow operations. You can find the list of supported operations here (https://github.com/tensorflow/tflite-micro/blob/main/tensorflow/lite/micro/all_ops_resolver.cc).

Intermediate tensors hold intermediate computation results, which reduces inference latency. These intermediate tensors can use large amounts of memory and might even be larger than the model itself. TensorFlow Lite employs different approximation techniques to plan memory for these intermediate tensors. The techniques use a data structure called tensor usage records, which record how big an intermediate tensor is and when it is used for the first and last time. The memory manager reads these records and plans the intermediate tensor allocations to reduce the memory footprint. (https://blog.tensorflow.org/2020/10/optimizing-tensorflow-lite-runtime.html)
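As a simplified illustration (this is not the actual TensorFlow Lite planner, and the record fields below are my own stand-in), suppose each usage record carries a tensor’s size and the first and last operation that touch it. Planning with lifetimes needs far less memory than giving every intermediate tensor its own buffer:

```python
# Each record: (size_in_bytes, first_use_op, last_use_op)
records = [
    (16_000, 0, 1),  # e.g. output of op 0, consumed by op 1
    (8_000,  1, 2),
    (8_000,  2, 3),
    (4_000,  3, 4),
]

# Naive plan: every intermediate tensor keeps its own buffer for the whole run.
naive_total = sum(size for size, _, _ in records)

# Lifetime-aware plan: the arena only has to hold tensors that are alive at
# the same time, so peak usage is the largest simultaneous sum.
num_ops = max(last for _, _, last in records) + 1
peak = max(
    sum(size for size, first, last in records if first <= t <= last)
    for t in range(num_ops)
)

print(f"naive: {naive_total} bytes, lifetime-aware lower bound: {peak} bytes")
# naive: 36000 bytes, lifetime-aware lower bound: 24000 bytes
```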

The authors of the paper MicroNets: Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers ran experiments on different kinds of MCUs using TensorFlow Lite Micro. You can find their models at https://github.com/ARM-software/ML-zoo.

Figure 3: An example memory map of how an audio keyword spotting model is mapped onto an ARM Cortex-M7 with 320 KB of SRAM and 1 MB of eFlash

In Figure 3, the authors show an example memory occupancy map for a keyword spotting (KWS) model on an ARM Cortex-M7 with 320 KB of static random-access memory (SRAM) and 1 MB of on-chip embedded flash memory. The weights and biases (and the model architecture) must be stored in the flash memory because it retains its contents without power. In this example, the weights and biases occupy 500 KB of memory. The model size depends on how many layers we have, how many units each layer has, and how many bytes each parameter takes.
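A rough rule of thumb for estimating the flash needed for the weights (my own back-of-the-envelope framing, not taken from the paper):

$$ \text{model size} \approx N_{\text{parameters}} \times \text{bytes per parameter} $$

For example, about 500,000 parameters quantized to 8-bit integers occupy roughly 500 KB, while the same parameters stored as 32-bit floats would need around 2 MB.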

The paper also measures the latencies of different layer types under TensorFlow Lite Micro. Figure 5, Figure 6, and Equation 1 explain how convolution works.

RGB images consist of three channels (matrices): red, green, and blue. Depthwise separable convolution processes each channel separately. Similar to spatial separable convolution, the depthwise separable convolution layer also divides the kernel into smaller kernels.

These layers work in two stages. The first stage is the depthwise convolution, in which each channel of the image is processed separately. Its output is not the final result, but an intermediate result consumed by the pointwise convolution, the second stage of the depthwise separable convolution.

Pointwise convolution uses 1x1 kernels. For RGB images, the depth of this kernel must equal the number of channels, 3. By applying these 1x1 convolutions to the intermediate results from the depthwise step, we get an output of the same size as the output of a standard convolution layer.
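The payoff is a much smaller parameter count. Below is a minimal sketch, assuming an arbitrary 32x32 RGB input, 3x3 kernels, and 64 output channels (all illustrative choices), comparing a standard convolution with its depthwise separable counterpart in Keras:

```python
import tensorflow as tf

# Standard convolution: one 3x3x3 kernel per output channel.
standard = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(64, 3, padding="same"),
])

# Depthwise separable convolution: depthwise stage + 1x1 pointwise stage.
separable = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.DepthwiseConv2D(3, padding="same"),  # one 3x3 filter per input channel
    tf.keras.layers.Conv2D(64, 1, padding="same"),       # 1x1 kernels mixing channels
])

print(standard.count_params())   # 3*3*3*64 + 64 = 1792
print(separable.count_params())  # (3*3*3 + 3) + (1*1*3*64 + 64) = 286
```

Both models produce a 32x32x64 output, but the separable version uses far fewer parameters (and multiply-accumulate operations).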

According to Figure 4, depthwise convolutional layers increase the latency.

Table 1 shows which kinds of neural networks are typically used for different types of data. Convolutional neural networks are particularly useful for image data.

Google bought me an Arduino Nano 33 BLE board, but it hasn’t arrived yet. In the next blog post, I’ll build and train a convolutional model using TensorFlow and convert it to the TensorFlow Lite format and a C array. I’ll also cover the basics of deep learning and image classification, using the CIFAR-10 dataset for this project.

[5] Vincent Dumoulin and Francesco Visin, “A Guide to Convolution Arithmetic for Deep Learning,” arXiv, 2016.
