Deep Learning Performance Architect - Perf Tools
NVIDIA
We are looking for a first-class Deep Learning Performance Architect to help shape the performance analysis infrastructure for GPUs. We build cutting-edge analysis tools and visualization frameworks that empower engineers to optimize GPU performance for Deep Learning and HPC workloads, from pre-silicon architectural exploration through post-silicon validation and optimization. Your work will directly shape the tools that define how NVIDIA GPUs are analyzed, tuned, and scaled for next-gen AI systems, and will influence next-generation GPU architectures.
What you'll be doing:
Architect Performance Tooling: Develop infrastructure tools and libraries for GPU performance analysis, visualization, and automated workflows used across the GPU SW/HW development life cycle.
Unlock Architectural Insights: Analyze GPU workloads to identify bottlenecks and define new hardware profiling features that enhance performance debugging and profiling capabilities.
AI-Powered Automation: Build AI/ML-driven tools to automate performance analysis, generate performance optimization guidance, and improve the user experience of the profiling infrastructure.
Cross-Stack Collaboration: Partner with kernel developers, system software teams, and hardware architects to support performance studies, improve the CUDA software stack, and co-design performance-centric solutions for current and next-generation GPU architectures.
What we need to see:
BS+ in Computer Science, Electronic Engineering, or a related field (or equivalent experience).
4+ years of software development experience.
Strong software skills in design and coding (C++ and Python), plus strong analytical and debugging skills for low-level programs.
Strong grasp of computer architecture (pipelines, memory hierarchies) and operating system fundamentals.
Experience with performance modeling, architecture simulation, profiling, and analysis.
Self-starter who thrives in dynamic environments and manages competing priorities effectively.
Ways to stand out from the crowd:
Experience building performance debugging and analysis tools for silicon and simulators. Experience developing application snapshot and replay tools is a big plus.
Familiarity with the CUDA system software stack (e.g., CUDA Driver/Runtime APIs) and CUDA kernel optimization, and an understanding of GPU architecture.
Familiarity with GPU performance profiling tools such as Nsight Systems, Nsight Compute, and NVTX, or experience developing similar tools for other processors.
Practical experience or projects demonstrating AI/ML-based code generation, automated data analysis, or workflow assistants.