COMPLETED Systems

JetsonProj — CUDA Hook Analyzer

A CUDA-level profiling and intelligent multi-model inference scheduling framework for edge AI, achieving 114.67 FPS YOLOv8n inference on the Jetson Orin Nano with TensorRT FP16 — 3.8× faster than real-time.

status COMPLETED
type Systems
stack ['C++17', 'CUDA 12.6', 'TensorRT 10.3', 'CMake', 'CUPTI', 'pybind11', 'FastAPI', 'Python', 'Prometheus', 'Docker', 'Google Test', 'JetPack 6.x']

// DESCRIPTION

Project Overview

Jetson-HookAnalyzer (also referred to as JetsonProj) is a lightweight CUDA-level performance profiling and intelligent multi-model inference scheduling framework built for edge AI deployment on the NVIDIA Jetson Orin Nano. The flagship result is 114.67 FPS YOLOv8n object detection — 3.8× faster than the 30 FPS real-time target — achieved using TensorRT FP16 precision on a device with only 7.6 GB of LPDDR5 memory.

Situation & Motivation

Edge AI inference is fundamentally constrained: power budgets are tight, memory is scarce, and multiple models must often run concurrently (e.g. object detection + classification + tracking). Standard profiling tools like Nsight Systems provide excellent insight on workstations but are too heavyweight for embedded deployments. Resource contention between concurrent models on a single small GPU can cause unpredictable latency spikes, yet tooling to diagnose and schedule around this contention was largely absent from the open-source embedded AI ecosystem.

Task & Design Goals

Design and implement a four-layer framework: (1) a CUDA hook & profiling layer for real-time kernel-level visibility, (2) an intelligent multi-model scheduler with priority queues and dynamic batching, (3) TensorRT and ONNX Runtime engine adapters, and (4) a monitoring stack exposing metrics via Prometheus and a RESTful FastAPI interface. All components must compile and run on JetPack 6.x (CUDA 12.6, TensorRT 10.3).
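The layering can be pictured as a stack of narrow interfaces. A minimal C++ sketch of how the layers compose — class names here (ProfilerHook, EngineAdapter, Scheduler) are illustrative stand-ins, not the project's actual API:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Layer 1 stand-in: the CUPTI-backed hook layer records runtime events.
struct ProfilerHook {
    std::vector<std::string> events;
    void record(const std::string& ev) { events.push_back(ev); }
};

// Layer 3 stand-in: engine adapter interface that TensorRT and
// ONNX Runtime backends would each implement.
struct EngineAdapter {
    virtual std::vector<float> infer(const std::vector<float>& input) = 0;
    virtual ~EngineAdapter() = default;
};

// Layer 2 stand-in: the scheduler dispatches work to an adapter while
// the hook layer observes each dispatch/completion.
struct Scheduler {
    ProfilerHook& hook;
    EngineAdapter& engine;
    std::vector<float> run(const std::vector<float>& input) {
        hook.record("dispatch");
        auto out = engine.infer(input);
        hook.record("complete");
        return out;
    }
};

// Layer 4 (monitoring) would scrape ProfilerHook::events into
// Prometheus metrics served over the FastAPI interface.
```

The point of the narrow interfaces is that each layer can be swapped or tested in isolation — the same property the Google Test suite relies on.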

Action & Implementation

The system is built in C++17 with CUDA 12.6 and structured into four main subsystems:

  • CUDA Hook Layer — intercepts cudaMalloc / cudaFree calls at runtime, profiles kernel launch times via CUPTI, tracks memory access patterns, and monitors GPU utilisation. Error handling uses a defensive fallback pattern (falling back to device 0) to ensure robustness on constrained hardware.
  • Intelligent Scheduler — priority-based multi-model task queue with dynamic batching, stream-level parallelism, and latency-aware resource allocation. Google Test suite covers 9 scheduler test cases with 100% pass rate.
  • TensorRT Wrapper (~350 LOC C++) — engine loading and serialisation, GPU memory management, synchronous and asynchronous inference, FP16 precision, and a comprehensive benchmarking harness.
  • Custom CUDA Kernels — hand-optimised GEMM achieving 146 GFLOPS (68.6% of cuBLAS) at 512×512, plus element-wise ops and activations. Key finding: 16×16 tiles give 100% SM occupancy on Jetson; 32×32 tiles drop to 67% occupancy and run 20% slower — a critical insight for edge GPU tuning.
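A minimal host-side sketch of the bookkeeping the hook layer performs when it intercepts allocation calls. The real implementation interposes on cudaMalloc / cudaFree; here plain stand-in allocators are used so the tracking logic can be shown on its own:

```cpp
#include <cassert>
#include <cstdlib>
#include <unordered_map>

// Tracks live allocations, running total, and peak usage -- the kind of
// state a hook layer maintains across intercepted alloc/free calls.
class AllocTracker {
public:
    void on_alloc(void* ptr, size_t bytes) {
        live_[ptr] = bytes;
        current_ += bytes;
        if (current_ > peak_) peak_ = current_;
    }
    void on_free(void* ptr) {
        auto it = live_.find(ptr);
        if (it == live_.end()) return;  // defensive: unknown pointer, ignore
        current_ -= it->second;
        live_.erase(it);
    }
    size_t current() const { return current_; }
    size_t peak() const { return peak_; }
private:
    std::unordered_map<void*, size_t> live_;
    size_t current_ = 0, peak_ = 0;
};

// Hooked stand-ins: forward to the underlying allocator, then record.
static AllocTracker g_tracker;
void* hooked_malloc(size_t bytes) {
    void* p = std::malloc(bytes);  // the real hook would call cudaMalloc here
    if (p) g_tracker.on_alloc(p, bytes);
    return p;
}
void hooked_free(void* p) {
    g_tracker.on_free(p);          // record before releasing
    std::free(p);
}
```

On Linux the interposition itself is typically done with LD_PRELOAD and dlsym(RTLD_NEXT, ...); the CUPTI callback API provides equivalent visibility into CUDA runtime calls without relinking.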
Repository layout:

HookAnalyzer/
├── core/
│   ├── cuda_hook/       # cudaMalloc/cudaFree interception + CUPTI
│   ├── scheduler/       # Priority queue, dynamic batching, streams
│   └── profiler/        # Flame graphs, Chrome trace export
├── engines/
│   └── tensorrt_adapter/ # 350 LOC TensorRT C++ wrapper
├── kernels/optimized/   # Custom GEMM, Conv, Softmax, ReLU
├── api/server/          # FastAPI REST interface
├── monitoring/          # Prometheus metrics + Grafana
├── bindings/            # pybind11 Python-C++ bridge (95% complete)
└── tests/               # 27 Google Test unit tests (15/16 passing)
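A compact sketch of the scheduling idea described above: a priority queue of inference tasks drained in batches up to a configurable size. The names and batching policy are illustrative, not the project's exact API:

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

struct InferenceTask {
    int priority;       // higher runs first
    long long seq;      // submission order, used as a FIFO tie-break
    std::string model;
};

// std::priority_queue keeps the "largest" element on top, so the
// comparator orders by priority, with earlier submissions winning ties.
struct TaskOrder {
    bool operator()(const InferenceTask& a, const InferenceTask& b) const {
        if (a.priority != b.priority) return a.priority < b.priority;
        return a.seq > b.seq;
    }
};

class BatchingScheduler {
public:
    void submit(int priority, const std::string& model) {
        queue_.push({priority, next_seq_++, model});
    }
    // Dynamic batching: pop up to max_batch tasks in priority order,
    // so a burst of requests is executed as one batched dispatch.
    std::vector<InferenceTask> next_batch(size_t max_batch) {
        std::vector<InferenceTask> batch;
        while (!queue_.empty() && batch.size() < max_batch) {
            batch.push_back(queue_.top());
            queue_.pop();
        }
        return batch;
    }
private:
    std::priority_queue<InferenceTask, std::vector<InferenceTask>, TaskOrder> queue_;
    long long next_seq_ = 0;
};
```

The real scheduler additionally maps batches onto CUDA streams and weighs observed latency when allocating GPU time; this sketch shows only the queueing and batching skeleton.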

Results & Impact

Experiment 3 (TensorRT YOLOv8 integration) delivered production-grade numbers on a constrained edge device: 114.67 FPS throughput, 8.72 ms average latency (P99 < 14 ms), and a GPU memory footprint of only 7.4 MB for the YOLOv8n engine. The Google Test integration raised test coverage from 15% to 60%+, with 15 of 16 tests passing (93.8% pass rate). Python–C++ bindings via pybind11 are 95% complete, enabling Python orchestration of the C++ inference engine. The project demonstrates production-ready architecture for edge AI workloads on resource-constrained Jetson hardware.
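The headline figures are mutually consistent: at 8.72 ms average latency, a single in-flight frame sustains 1000 / 8.72 ≈ 114.7 FPS. A small sketch of how such metrics are derived from per-frame timings — the data here is synthetic, not the project's measurements:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Mean latency in milliseconds over a run of per-frame timings.
double mean_ms(const std::vector<double>& lat) {
    double sum = 0.0;
    for (double v : lat) sum += v;
    return sum / lat.size();
}

// P99 latency via the nearest-rank method on a sorted copy.
double p99_ms(std::vector<double> lat) {
    std::sort(lat.begin(), lat.end());
    size_t rank = static_cast<size_t>(std::ceil(0.99 * lat.size()));
    return lat[rank - 1];
}

// Throughput implied by mean latency with one frame in flight.
double fps_from_mean(double mean) { return 1000.0 / mean; }
```

Note that FPS derived this way assumes serial, single-stream execution; with pipelining or multiple streams, measured throughput can exceed the 1000 / mean-latency figure.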

// HIGHLIGHTS

['114.67 FPS YOLOv8n inference on Jetson Orin Nano — 3.8× faster than 30 FPS real-time target with TensorRT FP16', '8.72 ms average latency, P99 < 14 ms, 7.4 MB GPU memory footprint — production-stable on 7.6 GB LPDDR5', 'CUDA hook layer intercepts cudaMalloc/cudaFree at runtime using CUPTI for kernel-level profiling on edge hardware', 'Custom CUDA GEMM kernel at 146 GFLOPS (68.6% cuBLAS); discovered 16x16 tiles achieve 100% SM occupancy vs 67% for 32x32', '~350 LOC C++17 TensorRT wrapper with sync/async inference, FP16 precision, and comprehensive benchmarking harness', '27 Google Test unit tests (15/16 passing, 93.8%) raising test coverage from 15% to 60%+', 'Four-layer architecture: CUDA hooks → intelligent scheduler → engine adapters → Prometheus/FastAPI monitoring stack', 'pybind11 Python-C++ bindings 95% complete, enabling Python-level orchestration of the C++ inference engine']