Edge AI on ESP32: Object Detection with Lightweight AI on Microcontrollers

By Haneen, May 25, 2026

Edge AI on microcontrollers, often called TinyML, represents a paradigm shift in artificial intelligence deployment: moving inference from cloud servers directly onto resource-constrained devices. This report examines running object detection models on the ESP32 and ESP32-CAM microcontrollers, which offer dual-core Xtensa LX6/LX7 processors, up to 520 KB of SRAM, integrated Wi-Fi and Bluetooth, and optionally 4 MB of PSRAM for larger tensor arenas. We cover the theoretical foundations of TinyML and model quantization, explain the TensorFlow Lite for Microcontrollers (TFLM) inference engine, document a step-by-step practical experiment for person detection, and provide complete, annotated Arduino/C++ source code. Materials, wiring, library installation, debugging, and performance benchmarks are all addressed. Upon completing this guide, a practitioner can deploy an INT8-quantized MobileNet model achieving ~60–80% person-detection accuracy entirely offline, with inference running locally at roughly 4–5 seconds per frame on a stock ESP32-CAM.

1. Introduction

1.1 What is Edge AI?

Traditional AI pipelines send sensor data (images, audio, accelerometer readings) to remote cloud servers for inference, then stream results back. This introduces three critical problems: latency (round-trip network delay), bandwidth cost (continuous raw data transmission), and privacy risk (sensitive data leaving the device). Edge AI solves all three by executing the neural network directly on the endpoint device.

TinyML is the sub-field of edge AI focused specifically on microcontrollers — devices with clock speeds under 500 MHz, RAM measured in kilobytes, and power budgets measured in milliwatts. The ESP32 family sits at the sweet spot for TinyML experimentation: powerful enough for INT8 inference, cheap enough for prototyping (under ₹500 for the base module), and supported by mature libraries and toolchains.

1.2 Why ESP32?

The ESP32 offers a compelling combination of specifications for edge AI:

• Dual-core Xtensa LX6 @ 240 MHz (or LX7 on the S3 variant) — provides raw CPU throughput for matrix multiplications

• 520 KB internal SRAM — sufficient for small tensor arenas; the CAM variant adds 4 MB PSRAM

• 4–16 MB Flash — enough to store a 300 KB quantized MobileNet model

• Built-in Wi-Fi + BLE — enables hybrid edge-cloud architectures when needed

• OV2640 camera module (on ESP32-CAM) — provides the image input for vision tasks

• Price: ₹350–600 (ESP32-CAM module) — democratizes edge AI development

1.3 Scope of This Report

This document focuses on the canonical entry-point edge AI project: real-time person detection using a quantized MobileNet model and TensorFlow Lite for Microcontrollers (TFLM), running entirely on-device with no cloud dependency. Extensions to custom models via Edge Impulse are also discussed.

2. Theoretical Background

2.1 The TinyML Stack

Deploying a neural network to a microcontroller requires five layers of technology working together:

Layer	Component for This Project
Training Framework	TensorFlow / Keras (on PC/cloud)
Model Architecture	MobileNetV1 (depthwise separable convolutions)
Compression	Post-Training INT8 Quantization
Inference Engine	TensorFlow Lite for Microcontrollers (TFLM)
Hardware	ESP32-CAM (Xtensa LX6 @ 240 MHz)

2.2 Model Quantization

Standard neural network weights are stored as 32-bit floating-point numbers (float32). On a microcontroller with no FPU or a limited one, each floating-point multiply-accumulate requires 4–10 extra cycles compared to an integer operation. Quantization converts weights and activations from float32 to 8-bit integers (INT8), achieving three simultaneous benefits:

• 4× reduction in model size (e.g., a 1.2 MB float32 model becomes ~300 KB as INT8)

• 2–4× speedup in inference because integer ALUs are faster on embedded cores

• Reduced power consumption — critical for battery-powered deployments

Post-Training Quantization (PTQ) applies this transformation after training, without modifying the original training loop. Accuracy loss is typically 1–3% on image classification tasks — acceptable for most edge applications. The ESP32-S3 variant performs INT8 operations over 2× faster than float32, making quantization essentially mandatory for real-time inference.
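The mapping behind INT8 quantization can be sketched in a few lines of plain C++. This is a minimal illustration, assuming the common affine scheme real ≈ scale × (q − zeroPoint); the function names, the example scale, and the weight count used below are hypothetical and are not taken from TFLM or from the bundled model.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Affine quantization: real ≈ scale * (q - zeroPoint).
// In practice the converter picks scale and zeroPoint per tensor;
// here they are plain parameters for illustration.
inline std::int8_t quantize(float real, float scale, int zeroPoint) {
  const int q = static_cast<int>(std::lround(real / scale)) + zeroPoint;
  // Saturate to the representable INT8 range.
  return static_cast<std::int8_t>(std::min(127, std::max(-128, q)));
}

// Inverse mapping back to an approximate real value.
inline float dequantize(std::int8_t q, float scale, int zeroPoint) {
  return scale * static_cast<float>(static_cast<int>(q) - zeroPoint);
}

// Storage needed for numWeights parameters at a given width in bytes.
inline std::size_t weightBytes(std::size_t numWeights, std::size_t bytesPerWeight) {
  return numWeights * bytesPerWeight;
}
```

With a scale of 1/128 and a zero point of 0, quantize(0.5f, ...) yields 64 and values outside roughly ±1 saturate at 127 or -128. For a hypothetical model with 300,000 weights, weightBytes gives 1,200,000 bytes at 4 bytes per weight and 300,000 bytes at 1 byte per weight, which is the 4× size reduction described above.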

2.3 MobileNet Architecture

MobileNetV1 (Howard et al., 2017) replaces standard convolutions with depthwise separable convolutions. This factorization reduces the number of multiply-accumulate operations by a factor of 8–9× compared to a standard convolution of equivalent receptive field, while retaining comparable feature extraction ability. The key parameters for the ESP32 target are:

Parameter	Value / Detail
Input resolution	96 × 96 pixels (grayscale)
Model size (INT8)	~300 KB Flash
Tensor arena (RAM)	~100 KB PSRAM
Output	Two scores: P(person), P(no-person)
Inference latency	4–5 seconds on ESP32 @ 240 MHz
Accuracy (person detect)	~60–80% in varied lighting
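The 8–9× reduction from depthwise separable convolutions can be checked with simple arithmetic. The sketch below counts multiply-accumulate operations for a single layer; the layer dimensions in the usage note are hypothetical and chosen only to illustrate the ratio, not taken from the bundled model.

```cpp
#include <cstdint>

// MACs for one standard convolution layer: k x k kernel,
// inChannels -> outChannels, output feature map outSize x outSize.
inline std::uint64_t standardConvMacs(std::uint64_t k, std::uint64_t inChannels,
                                      std::uint64_t outChannels, std::uint64_t outSize) {
  return k * k * inChannels * outChannels * outSize * outSize;
}

// MACs for the depthwise separable equivalent: a k x k depthwise pass
// per input channel, then a 1 x 1 pointwise convolution that mixes channels.
inline std::uint64_t separableConvMacs(std::uint64_t k, std::uint64_t inChannels,
                                       std::uint64_t outChannels, std::uint64_t outSize) {
  const std::uint64_t depthwise = k * k * inChannels * outSize * outSize;
  const std::uint64_t pointwise = inChannels * outChannels * outSize * outSize;
  return depthwise + pointwise;
}

// How many times fewer MACs the separable form needs.
inline double macReduction(std::uint64_t k, std::uint64_t inChannels,
                           std::uint64_t outChannels, std::uint64_t outSize) {
  return static_cast<double>(standardConvMacs(k, inChannels, outChannels, outSize)) /
         static_cast<double>(separableConvMacs(k, inChannels, outChannels, outSize));
}
```

For a 3×3 layer with 64 input channels, 128 output channels, and a 24×24 output map, the standard form needs 42,467,328 MACs and the separable form 5,050,368, a reduction of about 8.4×. The ratio approaches 9× as the number of output channels grows, because the saving is 1 / (1/outChannels + 1/k²).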

2.4 TensorFlow Lite for Microcontrollers (TFLM)

TFLM is a stripped-down C++ inference engine with zero dynamic memory allocation, no operating system dependency, and a binary footprint under 50 KB. It reads a model stored as a FlatBuffer byte array (compiled into a C header file), allocates all tensor memory in a statically declared byte array called the tensor arena, and executes the inference graph operation by operation.

Only the operators used by the deployed model need to be compiled in when a custom MicroMutableOpResolver is used, which keeps the binary small; AllOpsResolver, by contrast, links in every supported operator. TFLM supports both INT8 and float32 models, though INT8 is strongly preferred for MCU targets.
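To make the idea of a statically declared tensor arena concrete, here is a conceptual bump allocator in plain C++. It is a simplified sketch and not TFLM's actual allocator; the class name StaticArena and the 16-byte default alignment are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Conceptual sketch: hands out aligned slices of one fixed buffer.
// No malloc/new is ever called, so memory use is known at build time.
template <std::size_t Capacity>
class StaticArena {
 public:
  // Returns `bytes` of storage, or nullptr if the arena cannot satisfy the request.
  // Pointers are aligned as requested for alignments up to 16 bytes.
  void* allocate(std::size_t bytes, std::size_t alignment = 16) {
    if (alignment == 0) return nullptr;
    // Round the current offset up to the next multiple of `alignment`.
    const std::size_t start = (used_ + alignment - 1) / alignment * alignment;
    if (start > Capacity || bytes > Capacity - start) return nullptr;
    used_ = start + bytes;
    return buffer_ + start;
  }

  std::size_t usedBytes() const { return used_; }
  std::size_t capacity() const { return Capacity; }

 private:
  alignas(16) std::uint8_t buffer_[Capacity];
  std::size_t used_ = 0;
};
```

A StaticArena<1024> that serves two 100-byte requests reports 212 bytes used (the second slice starts at offset 112 to stay aligned) and rejects a 2000-byte request without changing its state. On the ESP32-CAM the equivalent buffer would be the roughly 100 KB arena discussed in this report.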

3. Materials Required

3.1 Hardware Components

Component	Purpose	Est. Cost (INR)	Qty
ESP32-CAM (AI Thinker)	Main compute + OV2640 camera + 4 MB PSRAM	₹350–500	1
USB-to-TTL serial adapter (FTDI FT232RL, CP2102, or CH340)	Flashing firmware via UART	₹150–250	1
Micro USB / USB-A cable	Power and data for FTDI	₹80–150	1
Jumper wires (Male-Female)	FTDI to ESP32-CAM connections	₹50–80	8–10
Breadboard (optional)	Secure wiring during development	₹60–120	1
5V 2A USB power adapter	Stable power (camera draws ~300 mA peak)	₹150–250	1
LED + 220Ω resistor (optional)	External visual indicator of detection	₹20	1 each
TFT LCD display (optional, SPI)	Display detection result on-screen	₹350–500	1

3.2 Software & Tools

• Arduino IDE 2.x — primary development environment

• Espressif ESP32 Board Package (v2.x — IMPORTANT: v3.x may have compatibility issues with some TFLite libraries)

• tflm_esp32 library (by eloquentarduino) — TFLite Micro runtime for ESP32

• EloquentEsp32Cam library — high-level camera abstraction

• eloquent_tinyml library — simplifies model loading and inference

• Edge Impulse CLI (optional) — for training and deploying custom models

• Python 3.x + TensorFlow (optional) — for custom model training and conversion

4. Circuit & Wiring

4.1 FTDI to ESP32-CAM Connections

The ESP32-CAM does not have a built-in USB-to-Serial chip, so an FTDI adapter is required for programming. Wire as follows:

FTDI Pin	ESP32-CAM Pin
GND	GND
5V (or 3.3V)	5V (use 5V — camera needs it)
TX	U0R (GPIO3 / RXD0)
RX	U0T (GPIO1 / TXD0)
— (during upload only)	GPIO0 → GND (enables flash mode)

⚠️ Flash Mode

Connect GPIO0 to GND ONLY during firmware upload. Remove this wire before running the sketch. Failing to do so will prevent the device from booting normally.

⚡ Power Note

The OV2640 camera draws up to 300 mA during capture. Always use the 5V rail from the FTDI or an external 5V supply — do NOT power the ESP32-CAM from the FTDI's 3.3V rail, as it cannot supply enough current and will cause random resets or corrupted frames.

4.2 Optional LED Indicator

The ESP32-CAM has an onboard flash LED on GPIO4 which is used directly in the code for person detection indication. If you want an external LED as an alternative or additional indicator, connect it through a 220Ω resistor between any free GPIO and GND. Modify the LED_GPIO_NUM define in the code accordingly.

5. Software Setup

5.1 Arduino IDE Configuration

1. Open Arduino IDE 2.x → File → Preferences

2. In 'Additional Board Manager URLs', paste: https://espressif.github.io/arduino-esp32/package_esp32_index.json

3. Go to Tools → Board → Boards Manager → search 'esp32' → install 'esp32 by Espressif Systems' (version 2.x)

4. Select board: Tools → Board → ESP32 Arduino → AI Thinker ESP32-CAM

5. Set CPU Frequency: Tools → CPU Frequency → 240 MHz

6. Set Upload Speed: 115200

5.2 Library Installation

Install all three libraries via Arduino IDE Library Manager (Tools → Manage Libraries):

• Search 'tflm_esp32' → install 'tflm_esp32 by eloquentarduino'

• Search 'EloquentEsp32Cam' → install the latest version

• Search 'eloquent_tinyml' → install the latest version

📌 Version Note

Install ESP32 core version 2.x, NOT 3.x. Version 3.x introduces breaking changes in camera and peripheral APIs that are incompatible with the tflm_esp32 and EloquentEsp32Cam libraries as of May 2026.

6. Complete Annotated Source Code

6.1 person_detection_esp32cam.ino

The following code captures grayscale frames at 96×96, runs the bundled person detection model (a MobileNet INT8 quantized model included in the tflm_esp32 library), and toggles GPIO4 (onboard flash LED) when a person is detected with confidence > 60%.

/*
 * Edge AI Object Detection on ESP32-CAM
 * Model: MobileNet (INT8 quantized, 96x96)
 * Framework: TensorFlow Lite for Microcontrollers
 * Libraries: tflm_esp32, EloquentEsp32Cam, eloquent_tinyml
 * Board: AI Thinker ESP32-CAM
 */

#include <Arduino.h>
#include <esp_camera.h>
#include <tflm_esp32.h>
#include <eloquent_tinyml.h>
#include <eloquent_tinyml/zoo/person_detection.h>
#include <eloquent_esp32cam.h>

using eloq::camera;
using eloq::tinyml::zoo::personDetection;

// ===== AI Thinker ESP32-CAM Pin Definitions =====
#define PWDN_GPIO_NUM  32
#define RESET_GPIO_NUM -1
#define XCLK_GPIO_NUM  0
#define SIOD_GPIO_NUM  26
#define SIOC_GPIO_NUM  27
#define Y9_GPIO_NUM    35
#define Y8_GPIO_NUM    34
#define Y7_GPIO_NUM    39
#define Y6_GPIO_NUM    36
#define Y5_GPIO_NUM    21
#define Y4_GPIO_NUM    19
#define Y3_GPIO_NUM    18
#define Y2_GPIO_NUM    5
#define VSYNC_GPIO_NUM 25
#define HREF_GPIO_NUM  23
#define PCLK_GPIO_NUM  22
#define LED_GPIO_NUM   4  // Onboard flash LED

void setup() {
  Serial.begin(115200);
  delay(2000);
  Serial.println("Edge AI Person Detection - ESP32-CAM");

  // Configure LED (off until a person is detected)
  pinMode(LED_GPIO_NUM, OUTPUT);
  digitalWrite(LED_GPIO_NUM, LOW);

  // Initialize camera at 96x96 grayscale for TFLite; retry until it succeeds
  while (!camera.begin(
           FRAMESIZE_96X96,
           PIXFORMAT_GRAYSCALE,
           /* fps= */ 10
         ).isOk()) {
    Serial.println(camera.exception.toString());
  }

  // Initialize TFLM and load the bundled model (tensor arena of ~100 KB in PSRAM)
  if (!personDetection.begin().isOk()) {
    Serial.println("[ERROR] TFLite init failed: " + personDetection.exception.toString());
    while (true) {
      delay(1000);  // Halt here: nothing useful can run without the model
    }
  }

  Serial.println("[OK] Model ready. Starting inference loop...");
}

void loop() {
  // Capture frame
  if (!camera.capture().isOk()) {
    Serial.println("Capture failed: " + camera.exception.toString());
    return;
  }

  // Run inference
  if (!personDetection.run(camera).isOk()) {
    Serial.println("Inference failed: " + personDetection.exception.toString());
    return;
  }

  float personScore = personDetection.outputs[0];    // Person probability
  float noPersonScore = personDetection.outputs[1];  // No-person probability

  Serial.print("[Person: ");
  Serial.print(personScore * 100, 1);
  Serial.print("%] [No-Person: ");
  Serial.print(noPersonScore * 100, 1);
  Serial.println("%]");

  // Turn the LED on while person confidence is above 60%, off otherwise
  if (personScore > 0.60f) {
    digitalWrite(LED_GPIO_NUM, HIGH);
    Serial.println(">>> PERSON DETECTED <<<");
  } else {
    digitalWrite(LED_GPIO_NUM, LOW);
  }

  // Short pause between frames; inference itself (~4-5 seconds on a plain ESP32)
  // dominates the loop time, so the effective rate is roughly 0.2 FPS
  delay(500);
}

6.2 Code Walkthrough

Includes & Namespaces

Besides Arduino.h, the five headers bring in the camera driver (esp_camera.h), the TFLite Micro runtime (tflm_esp32.h), the generic TinyML wrapper (eloquent_tinyml.h), the pre-built person detection model and inference logic (zoo/person_detection.h), and the camera convenience layer (eloquent_esp32cam.h). The two using declarations import the camera and personDetection singleton objects into the global namespace.

Pin Definitions

These 16 camera pin definitions map the ESP32's GPIOs to the OV2640 camera's parallel data bus (Y2–Y9), clock and sync lines (XCLK, PCLK, VSYNC, HREF), I2C-style lines for SCCB configuration (SIOD, SIOC), and power and reset control (PWDN, RESET); LED_GPIO_NUM is a seventeenth define for the onboard flash LED. They must match the AI Thinker board layout exactly. RESET_GPIO_NUM is -1 because the AI Thinker variant does not connect the camera reset line to a GPIO (it is pulled up on the board).

setup()

The camera is initialized at FRAMESIZE_96X96 with PIXFORMAT_GRAYSCALE — exactly the resolution and color space that the bundled MobileNet model expects. Using a color format or larger resolution would require resizing and conversion in firmware, adding latency. After the camera starts, personDetection.begin() initializes TFLM, allocates the tensor arena in PSRAM, and loads the model from Flash. If PSRAM is not available the library falls back to internal SRAM, which may not have enough space for this model.
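To give a sense of the extra work a color format or larger frame would add, the following sketch converts an RGB565 frame to grayscale and downsamples it with nearest-neighbour sampling. It is an illustration only: the function names are hypothetical, pixels are assumed to be already assembled into host-order 16-bit values, and std::vector is used for brevity where firmware would normally use fixed buffers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert one RGB565 pixel to 8-bit luminance with integer weights
// (77, 150, 29 sum to 256, so the shift by 8 normalizes the result).
inline std::uint8_t rgb565ToGray(std::uint16_t pixel) {
  const std::uint32_t r5 = (pixel >> 11) & 0x1F;
  const std::uint32_t g6 = (pixel >> 5) & 0x3F;
  const std::uint32_t b5 = pixel & 0x1F;
  // Expand each channel to 8 bits by replicating its high bits.
  const std::uint32_t r8 = (r5 << 3) | (r5 >> 2);
  const std::uint32_t g8 = (g6 << 2) | (g6 >> 4);
  const std::uint32_t b8 = (b5 << 3) | (b5 >> 2);
  return static_cast<std::uint8_t>((77 * r8 + 150 * g8 + 29 * b8) >> 8);
}

// Nearest-neighbour downscale of an RGB565 frame to a dstW x dstH grayscale image.
inline std::vector<std::uint8_t> toGrayscaleResized(const std::vector<std::uint16_t>& src,
                                                    std::size_t srcW, std::size_t srcH,
                                                    std::size_t dstW, std::size_t dstH) {
  std::vector<std::uint8_t> dst(dstW * dstH);
  for (std::size_t y = 0; y < dstH; ++y) {
    const std::size_t sy = y * srcH / dstH;  // Source row for this output row
    for (std::size_t x = 0; x < dstW; ++x) {
      const std::size_t sx = x * srcW / dstW;  // Source column for this output column
      dst[y * dstW + x] = rgb565ToGray(src[sy * srcW + sx]);
    }
  }
  return dst;
}
```

Every output pixel costs a handful of shifts, multiplies, and one memory read, and the source frame itself needs srcW × srcH × 2 bytes of RAM. Capturing directly at 96×96 grayscale avoids both costs.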

loop()

Each iteration captures one frame with camera.capture(), passes it directly to personDetection.run(camera), and reads two float outputs: outputs[0] is the probability that a person is present, outputs[1] is the probability that no person is present. A threshold of 0.60 (60%) triggers the LED. The delay(500) limits the loop rate, but the dominant bottleneck is inference time (~4–5 seconds on stock ESP32-CAM at 240 MHz), not the delay.
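The relationship between inference time, the trailing delay, and the resulting frame rate can be written out explicitly. This is a small sketch with hypothetical helper names; it ignores camera capture and Serial printing time, which are short compared with inference.

```cpp
// Effective loop rate in frames per second, given the per-frame inference
// time and the fixed delay at the end of loop(), both in milliseconds.
inline double effectiveFps(double inferenceMs, double delayMs) {
  return 1000.0 / (inferenceMs + delayMs);
}

// Decision rule used in loop(): a person is reported only when the
// score is strictly above the threshold (0.60 by default).
inline bool personDetected(float personScore, float threshold = 0.60f) {
  return personScore > threshold;
}
```

With 4,000 ms of inference and the 500 ms delay, effectiveFps returns about 0.22; with no delay at all it would be 0.25. Removing the delay therefore buys very little, which is why inference time is the figure to optimize.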

7. Step-by-Step Experiment Procedure

7.1 Phase 1 — Hardware Assembly

1. Identify the GND, 5V, U0T, U0R, and GPIO0 pins on your ESP32-CAM (consult the AI Thinker pinout diagram)

2. Wire the FTDI adapter to the ESP32-CAM per the table in Section 4.1

3. Do NOT connect GPIO0 to GND yet; leave it disconnected for now

4. Insert the OV2640 camera into the ESP32-CAM's ZIF connector (gold contacts facing down toward the board)

5. Connect the FTDI adapter to your PC via USB

7.2 Phase 2 — Firmware Upload

1. Open Arduino IDE. Create a new sketch and paste the full code from Section 6.1

2. Add the tflm_esp32, EloquentEsp32Cam, and eloquent_tinyml libraries if not already installed

3. Connect GPIO0 to GND on the ESP32-CAM (this enables bootloader/flash mode)

4. In Arduino IDE select the correct COM port under Tools → Port

5. Click Upload. The IDE will compile (~2–3 minutes first time) then transfer firmware

6. Once 'Done uploading' appears, immediately disconnect GPIO0 from GND

7. Press the RESET button on the ESP32-CAM (or power-cycle it)

7.3 Phase 3 — Running & Observing

1. Open Serial Monitor at 115200 baud

2. You should see 'Edge AI Person Detection - ESP32-CAM' followed by '[OK] Model ready'

3. After 4–5 seconds you will see score lines such as: [Person: 72.3%] [No-Person: 27.7%]

4. Hold the camera toward a person; the LED on GPIO4 should illuminate and '>>> PERSON DETECTED <<<' prints

5. Point the camera at a wall or empty room; the LED should turn off

7.4 Phase 4 — Observations & Experiments

• Vary lighting conditions (bright sunlight, dim indoor, backlit) and record detection accuracy changes

• Test distance sensitivity: at what distance does detection probability drop below 60%?

• Test with partial occlusion (only head/shoulders visible)

• Measure inference time using millis() before and after personDetection.run()

• Run at 240 MHz (default) vs. 160 MHz and compare inference time

• Log scores to Serial Plotter to visualize confidence over time
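The timing experiment above follows a simple pattern: read a clock, run the work, read the clock again. The sketch below shows that pattern in portable C++ using std::chrono so it can be tried on a PC; on the ESP32 the same structure would use millis() around personDetection.run(). The names elapsedMs and LatencyStats are hypothetical.

```cpp
#include <chrono>
#include <cstdint>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename Fn>
std::int64_t elapsedMs(Fn&& work) {
  const auto start = std::chrono::steady_clock::now();
  work();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}

// Running minimum, maximum, and mean over repeated measurements.
struct LatencyStats {
  std::int64_t count = 0;
  std::int64_t minMs = 0;
  std::int64_t maxMs = 0;
  std::int64_t totalMs = 0;

  void add(std::int64_t ms) {
    if (count == 0 || ms < minMs) minMs = ms;
    if (count == 0 || ms > maxMs) maxMs = ms;
    totalMs += ms;
    ++count;
  }

  double meanMs() const {
    return count ? static_cast<double>(totalMs) / static_cast<double>(count) : 0.0;
  }
};
```

Feeding each measurement into LatencyStats::add makes it easy to compare runs, for example 240 MHz against 160 MHz, by their mean and spread rather than by a single reading.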

8. Going Further: Custom Models with Edge Impulse

8.1 Why Edge Impulse?

The bundled person detection model is a general-purpose pre-trained model. For domain-specific applications — detecting a specific product defect on a conveyor, classifying crop diseases, or recognizing custom gestures — you need to train your own model. Edge Impulse is a web-based MLOps platform that handles the complete pipeline from data collection to Arduino library export.

8.2 Edge Impulse Workflow

1. Create a free account at edgeimpulse.com

2. Create a new project and choose 'Image Classification' or 'Object Detection'

3. Collect training images (minimum 50–100 per class recommended; 300+ for robust results)

4. Design the Impulse: Image 96×96 → Image processing block → Classifier block (MobileNetV2 0.35 recommended for ESP32)

5. Train the model (Edge Impulse handles quantization automatically)

6. Export as Arduino Library (.zip)

7. In Arduino IDE: Sketch → Include Library → Add .ZIP Library → select the downloaded file

8. Integrate with the camera code by replacing the personDetection calls with your model's inference API
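Once a custom classifier replaces the two-score person model, the firmware usually has to pick the winning class from a list of scores. The sketch below shows one way to do that in plain C++. It is not part of the Edge Impulse SDK: the Prediction struct, the topClass function, and the class labels in the usage note are hypothetical, and the 0.60 threshold simply mirrors the one used earlier in this report.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Result of post-processing one inference.
struct Prediction {
  std::string label;  // Highest-scoring class, empty if there were no scores
  float score;        // Score of that class
  bool confident;     // True when the score reaches minScore
};

// Picks the highest-scoring class from parallel label and score lists.
inline Prediction topClass(const std::vector<std::string>& labels,
                           const std::vector<float>& scores,
                           float minScore = 0.60f) {
  Prediction best{"", 0.0f, false};
  const std::size_t n = labels.size() < scores.size() ? labels.size() : scores.size();
  for (std::size_t i = 0; i < n; ++i) {
    if (i == 0 || scores[i] > best.score) {
      best.label = labels[i];
      best.score = scores[i];
    }
  }
  best.confident = n > 0 && best.score >= minScore;
  return best;
}
```

For labels {"healthy", "rust", "blight"} and scores {0.1, 0.7, 0.2}, the result is "rust" with confident set to true; for scores {0.4, 0.35, 0.25} the top label is "healthy" but confident is false, so the firmware can choose to ignore the frame.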

💡 Model Zoo

Edge Impulse's Model Zoo includes pre-optimized MobileNet, ResNet, and SqueezeNet variants tested on ESP32. You can also import custom ONNX or TensorFlow models if you have trained your own architecture.

9. Performance Benchmarks & Optimization

9.1 Measured Performance (ESP32-CAM)

Metric	Measured Value
Inference time (INT8 MobileNet)	~4,000–5,000 ms per frame
Model Flash footprint	~300 KB
Tensor arena (PSRAM)	~100 KB
Active power draw	~130–160 mW (inference running)
Detection FPS (effective)	0.2–0.25 FPS (limited by inference)
Accuracy (person/no-person)	~60–80% in varied conditions
INT8 vs Float32 speedup	>2× faster with INT8
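The power and latency rows in the table combine into an energy cost per inference, which is the figure that matters for battery-powered deployments. The sketch below is simple arithmetic under stated assumptions: the helper names are hypothetical, and the battery estimate ignores sleep current, the camera, and regulator losses.

```cpp
// Energy for one inference in joules.
// Milliwatts times milliseconds gives microjoules, hence the division by 1e6.
inline double energyPerInferenceJoules(double powerMilliwatts, double latencyMs) {
  return powerMilliwatts * latencyMs / 1e6;
}

// Rough upper bound on the number of inferences a battery could fund.
// One watt-hour equals 3600 joules.
inline double inferencesPerBattery(double batteryWattHours, double joulesPerInference) {
  return batteryWattHours * 3600.0 / joulesPerInference;
}
```

Using the table's figures, 130 mW for 4,000 ms is about 0.52 J per inference and 160 mW for 5,000 ms is about 0.8 J. At 0.8 J per inference, a hypothetical 1 Wh battery would cover at most around 4,500 inferences before any other consumption is counted.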

9.2 Optimization Strategies

Hardware Upgrades

• ESP32-S3 — features vector instructions that accelerate INT8 MAC operations, achieving 2–4× speedup over plain ESP32

• PSRAM — essential for the 100 KB tensor arena; without it the model cannot load

• External SPIRAM (8 MB) — allows larger models and bigger tensor arenas

Software Optimizations

• Reduce image resolution from 96×96 to 64×64 if lighting is good — cuts inference time by ~40%

• Use MicroMutableOpResolver instead of AllOpsResolver — only compiles operators actually used, saving Flash

• Apply hierarchical inference: run a tiny motion detector first; only trigger the full model on motion

• Reduce tensor arena size by checking actual usage with MicroInterpreter::arena_used_bytes() after AllocateTensors() and allocating only what is needed
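The hierarchical inference idea in the list above can be sketched with frame differencing as the cheap first stage. This is one possible motion detector, not the only one; the function name and both default thresholds are assumptions that would need tuning for a real scene.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Cheap motion check on two grayscale frames of equal size.
// Returns true when more than `minChangedFraction` of the pixels differ
// by more than `pixelThreshold` grey levels.
inline bool motionDetected(const std::vector<std::uint8_t>& previous,
                           const std::vector<std::uint8_t>& current,
                           int pixelThreshold = 25,
                           double minChangedFraction = 0.05) {
  if (previous.size() != current.size() || current.empty()) return false;
  std::size_t changed = 0;
  for (std::size_t i = 0; i < current.size(); ++i) {
    const int diff = std::abs(static_cast<int>(current[i]) - static_cast<int>(previous[i]));
    if (diff > pixelThreshold) ++changed;
  }
  return static_cast<double>(changed) >
         minChangedFraction * static_cast<double>(current.size());
}
```

In the main loop, the full model would run only when motionDetected returns true for the last two frames. A 96×96 comparison touches 9,216 pixels, which is negligible next to a multi-second inference, so a static scene costs almost nothing.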

10. Real-World Applications

• Smart security cameras — on-device person/vehicle detection without cloud subscription

• Attendance systems — offline face presence detection for classrooms or factories

• Industrial defect detection — classify pass/fail on a production line with a custom model

• Precision agriculture — crop disease classification directly in the field without connectivity

• Wildlife monitoring — motion-triggered animal classification on battery-powered remote cameras

• Retail analytics — customer counting and zone monitoring with privacy preservation (no video leaves the device)

• Assistive technology — gesture recognition for hands-free device control

11. Challenges & Limitations

• Inference speed — 4–5 seconds per frame is inadequate for real-time video surveillance; ESP32-S3 or dedicated NPU hardware is needed

• Memory constraints — ESP32's 520 KB SRAM limits model size; PSRAM required for vision models

• Thermal management — continuous inference at 240 MHz causes the chip to run warm; thermal throttling may occur in enclosed housings

• Accuracy ceiling — tiny models with aggressive quantization cannot match cloud-scale models; error rate of 20–40% is common on challenging scenes

• Lighting sensitivity — OV2640's auto-exposure struggles with high-contrast or very low-light scenes

• Model update complexity — updating the model requires re-flashing firmware; no over-the-air model replacement without extra engineering

12. Conclusion

Edge AI on the ESP32 represents a genuinely accessible entry point into embedded machine learning. With freely available tools (Arduino IDE, TensorFlow Lite for Microcontrollers, Edge Impulse), an ESP32-CAM costing under ₹500, and the annotated code in this report, a practitioner can have a fully offline object detection system running within an afternoon.

The key enabling technologies — INT8 post-training quantization, the MobileNet depthwise separable convolution architecture, and the TFLM zero-allocation inference engine — cooperate to fit a functional neural network into the 300 KB Flash and 100 KB PSRAM budget of the ESP32-CAM. While inference speed (~4–5 seconds per frame) and accuracy (~60–80%) fall short of cloud-scale systems, they are entirely adequate for a broad class of practical IoT applications including presence detection, simple classification, and event-triggered monitoring.

For applications demanding higher throughput, the ESP32-S3's vector acceleration units provide a 2–4× improvement without changing the software stack. For custom domain-specific models, Edge Impulse provides a streamlined training-to-deployment pipeline compatible with the same hardware and libraries described in this report.

Edge AI on microcontrollers is not a future technology — it is a mature, deployable, and cost-effective approach to bringing intelligence to the sensor layer of the IoT stack today.

References & Further Reading

• Espressif Systems. (2025). ESP32-CAM Datasheet. espressif.com

• TensorFlow Lite for Microcontrollers Documentation. tensorflow.org/lite/microcontrollers

• Edge Impulse Documentation. docs.edgeimpulse.com

• Howard, A. et al. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861

• Sensors (MDPI). (2025). Design and Implementation of ESP32-Based Edge Computing for Object Detection. doi:10.3390/s25061656

• EloquentArduino. ESP32-CAM Person Detection Library. eloquentarduino.com

• Zbotic. (2026). AI TinyML with Person Detection on ESP32-CAM Offline. zbotic.in

• MakerGuides. (2026). Train an Object Detection Model with Edge Impulse for ESP32-CAM. makerguides.com
