JetsonProj — CUDA Hook Analyzer
A CUDA-level profiling and intelligent multi-model inference scheduling framework for edge AI, achieving 114.67 FPS YOLOv8n on Jetson Orin Nano with TensorRT FP16 — 3.8x faster than real-time.
Project Overview
Jetson-HookAnalyzer (also referred to as JetsonProj) is a lightweight CUDA-level performance profiling and intelligent multi-model inference scheduling framework built for edge AI deployment on the NVIDIA Jetson Orin Nano. The flagship result is 114.67 FPS YOLOv8n object detection — 3.8× faster than the 30 FPS real-time target — achieved using TensorRT FP16 precision on a device with only 7.6 GB of LPDDR5 memory.
Situation & Motivation
Edge AI inference is fundamentally constrained: power budgets are tight, memory is scarce, and multiple models must often run concurrently (e.g. object detection + classification + tracking). Standard profiling tools like Nsight Systems provide excellent insight on workstations but are heavyweight for embedded deployments. Resource contention between concurrent models on a single small GPU can cause unpredictable latency spikes, yet the tooling to diagnose and schedule around this contention was largely absent from the open-source embedded AI ecosystem.
Task & Design Goals
Design and implement a four-layer framework: (1) a CUDA hook & profiling layer for real-time kernel-level visibility, (2) an intelligent multi-model scheduler with priority queues and dynamic batching, (3) TensorRT and ONNX Runtime engine adapters, and (4) a monitoring stack exposing metrics via Prometheus and a RESTful FastAPI interface. All components must compile and run on JetPack 6.x (CUDA 12.6, TensorRT 10.3).
Action & Implementation
The system is built in C++17 with CUDA 12.6 and structured into four main subsystems:
- CUDA Hook Layer — intercepts cudaMalloc/cudaFree calls at runtime, profiles kernel launch times via CUPTI, tracks memory access patterns, and monitors GPU utilisation. Error handling uses a defensive fallback pattern (falling back to device 0) to ensure robustness on constrained hardware.
- Intelligent Scheduler — priority-based multi-model task queue with dynamic batching, stream-level parallelism, and latency-aware resource allocation. A Google Test suite covers 9 scheduler test cases with a 100% pass rate.
- TensorRT Wrapper (~350 LOC C++) — engine loading and serialisation, GPU memory management, synchronous and asynchronous inference, FP16 precision, and a comprehensive benchmarking harness.
- Custom CUDA Kernels — hand-optimised GEMM achieving 146 GFLOPS (68.6% of cuBLAS) at 512×512, plus element-wise ops and activations. Key finding: 16×16 tiles give 100% SM occupancy on Jetson; 32×32 tiles drop to 67% occupancy and run 20% slower — a critical insight for edge GPU tuning.
HookAnalyzer/
├── core/
│   ├── cuda_hook/         # cudaMalloc/cudaFree interception + CUPTI
│   ├── scheduler/         # Priority queue, dynamic batching, streams
│   └── profiler/          # Flame graphs, Chrome trace export
├── engines/
│   └── tensorrt_adapter/  # 350 LOC TensorRT C++ wrapper
├── kernels/optimized/     # Custom GEMM, Conv, Softmax, ReLU
├── api/server/            # FastAPI REST interface
├── monitoring/            # Prometheus metrics + Grafana
├── bindings/              # pybind11 Python-C++ bridge (95% complete)
└── tests/                 # 27 Google Test unit tests (15/16 passing)
Results & Impact
Experiment 3 (TensorRT YOLOv8 integration) delivered production-grade numbers on a constrained edge device: 114.67 FPS throughput, 8.72 ms average latency (P99 < 14 ms), and a GPU memory footprint of only 7.4 MB for the YOLOv8n engine. The Google Test integration raised test coverage from 15% to 60%+, with 15 of 16 tests passing (93.8% pass rate). Python–C++ bindings via pybind11 are 95% complete, enabling Python orchestration of the C++ inference engine. The project demonstrates production-ready architecture for edge AI workloads on resource-constrained Jetson hardware.