
AI/ML Cluster System Design Engineer, Contractor

Job Title: AI/ML Cluster System Design Engineer, Contractor
Office Location: San Jose, CA
Job Type: Contract (6 - 12 months)
Work Model: Onsite

About SK hynix America
At SK hynix America, we're at the forefront of semiconductor innovation, developing advanced memory solutions that power everything from smartphones to data centers. As a global leader in DRAM and NAND flash technologies, we drive the evolution of mobile technology, empower cloud computing, and pioneer future technologies. Our cutting-edge memory technologies are essential in today's most advanced electronic devices and IT infrastructure, enabling enhanced performance and user experiences across the digital landscape.
We're looking for innovative minds to join our mission of shaping the future of technology. At SK hynix America, you'll be part of a team that's pioneering breakthrough memory solutions while maintaining a strong commitment to sustainability. We're not just adapting to technological change – we're driving it, with significant investments in artificial intelligence, machine learning, and eco-friendly solutions and operational practices. As we continue to expand our market presence and push the boundaries of what's possible in semiconductor technology, we invite you to join our journey to create the next generation of memory solutions that will define the future of computing.

Position Overview:

SK hynix is seeking an expert AI/ML Cluster System Design Contractor to join our AI/ML IT infrastructure architecture team for a 6-12 month engagement. This role requires specialized expertise in designing, architecting, and optimizing large-scale GPU clusters for AI/ML training and inference workloads. The ideal candidate brings deep technical knowledge of compute architecture, high-performance networking, storage systems, and the intricate interdependencies between software frameworks and hardware infrastructure in AI/ML environments.

Responsibilities:

  • Architect robust, scalable, and efficient computing clusters that maximize AI workload performance while meeting operational and budgetary constraints.
  • Bridge hardware capabilities and AI/ML framework requirements, translating model training needs and inference performance targets into concrete system specifications.
  • Design end-to-end cluster architectures that encompass compute resources, networking fabric, storage subsystems, and power/cooling integration.
  • Select appropriate GPU platforms based on workload characteristics and design network topologies that minimize communication bottlenecks in distributed training scenarios.
  • Architect storage solutions that can sustain the high-throughput demands of large-scale AI operations.
  • Conduct detailed performance modeling and capacity planning exercises, predicting cluster behavior under various workload scenarios and identifying potential bottlenecks before deployment.
  • Guide decisions on cluster topology, including considerations for rail-optimized designs, spine-leaf architectures, and direct GPU-to-GPU connectivity technologies such as NVLink and InfiniBand configurations.
  • Understand and plan for the infrastructure requirements that support cluster operations, which includes calculating aggregate power requirements based on GPU selection and cluster scale, specifying cooling capacity needed to maintain optimal operating temperatures, determining network bandwidth requirements for different training paradigms, and identifying facility-level dependencies that impact cluster deployment feasibility.
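For illustration, the aggregate power and cooling sizing described in the last bullet can be sketched roughly as follows. All figures here (per-GPU TDP, per-node overhead, PUE, GPUs per node) are hypothetical placeholder assumptions, not values from this posting.

```python
# Rough facility power/cooling estimate for a GPU cluster.
# All default figures below are hypothetical examples.

def cluster_power_kw(num_gpus, gpu_tdp_w=700, gpus_per_node=8,
                     node_overhead_w=2000, pue=1.3):
    """Estimate facility power draw in kW.

    gpu_tdp_w:       per-GPU thermal design power (assumed ~700 W class)
    node_overhead_w: CPUs, NICs, fans, etc. per node (assumed)
    pue:             power usage effectiveness, i.e. facility/cooling overhead
    """
    nodes = -(-num_gpus // gpus_per_node)       # ceiling division
    it_load_w = num_gpus * gpu_tdp_w + nodes * node_overhead_w
    return it_load_w * pue / 1000.0             # IT load scaled by PUE

def cooling_capacity_kw(num_gpus, **kw):
    """Cooling must reject essentially all dissipated power as heat."""
    return cluster_power_kw(num_gpus, **kw)

if __name__ == "__main__":
    # Example: a 128-GPU (16-node) cluster under the assumed defaults.
    print(f"Estimated facility power: {cluster_power_kw(128):.0f} kW")
```

A real sizing exercise would replace these defaults with measured vendor figures and add facility-specific factors (redundancy margins, liquid- vs. air-cooling efficiency), but the structure of the calculation is the same.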

Continuous Contribution Areas:

  • Contribute your expertise by conducting architecture reviews, optimizing existing cluster configurations, and prototyping new design approaches.
  • Provide technical guidance on emerging technologies in AI accelerators, networking, and infrastructure, evaluate vendor solutions against architectural requirements, and benchmark alternative designs.
  • Contribute insights that shape both immediate deployment plans and long-term infrastructure strategy, ensuring AI computing capabilities remain competitive, efficient, and future-ready.

Qualifications:

  • Proven experience designing and deploying large-scale AI/ML clusters in production environments, including clusters with 100+ GPUs.
  • Direct involvement in hardware selection, network design, and performance optimization for AI workloads.
  • Hands-on expertise with modern GPU architectures from NVIDIA or AMD, plus familiarity with emerging AI accelerator technologies.
  • Comprehensive knowledge of AI/ML frameworks and their infrastructure requirements, including PyTorch and distributed training libraries such as DeepSpeed, Megatron-LM, and Ray.
  • Understanding of how framework-specific optimizations impact cluster design decisions and how architectural choices affect model training efficiency and scalability.
  • Strong background in high-performance networking, including designing low-latency, high-bandwidth network fabrics (e.g., InfiniBand, RoCE, or proprietary interconnects).
  • Understanding of network topology implications for distributed training patterns, including all-reduce operations, parameter server architectures, and pipeline parallelism.
  • Practical experience integrating cluster design decisions with facility requirements, including power density considerations based on GPU selection, cooling architecture for varying cluster sizes, and space optimization aligned with data center infrastructure.
  • Ability to collaborate effectively with facility engineers to ensure clusters are operationally feasible.
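As a small illustration of how the distributed training patterns listed above drive fabric requirements: in a ring all-reduce, each rank exchanges roughly 2·(N−1)/N of the payload, so gradient synchronization traffic scales with model size almost independently of cluster size. The model-size and cluster-size figures below are hypothetical examples.

```python
# Per-GPU traffic for a ring all-reduce (standard 2*(N-1)/N result).
# The 70B-parameter fp16 example is a hypothetical illustration.

def ring_allreduce_bytes_per_gpu(payload_bytes, num_gpus):
    """Bytes each rank sends (and receives) in one ring all-reduce."""
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes

if __name__ == "__main__":
    grad_bytes = 70e9 * 2   # 70B parameters in fp16 (2 bytes each)
    per_gpu = ring_allreduce_bytes_per_gpu(grad_bytes, 128)
    print(f"~{per_gpu / 1e9:.0f} GB moved per GPU per full-gradient all-reduce")
```

Estimates like this, combined with the target step time, give the per-link bandwidth a fabric (InfiniBand, RoCE, NVLink) must sustain.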

Preferred Qualifications:

  • Bachelor’s degree in an engineering or science discipline, with coursework equivalent to a standard computer engineering curriculum.
  • 8+ years of professional experience in systems architecture.
  • Minimum 3 years dedicated to AI/ML infrastructure design and deployment.
  • Track record of designing clusters supporting diverse workloads, from large language model training to high-performance computing and/or computer vision applications.
  • Deep understanding of how workload characteristics influence architectural decisions.
  • Proven ability to balance technical performance with practical constraints such as budget, timeline, and operational feasibility.

 

Equal Employment Opportunity:

SKHYA is an Equal Employment Opportunity Employer. We provide equal employment opportunities to all qualified applicants and employees and prohibit discrimination and harassment of any type without regard to race, sex, pregnancy, sexual orientation, religion, age, gender identity, national origin, color, protected veteran or disability status, genetic information or any other status protected under federal, state, or local applicable laws. 

 

Compensation:

Our compensation reflects the cost of labor across several U.S. geographic markets, and we pay differently based on those defined markets. Pay within the provided range varies by work location and may also depend on job-related skills and experience. Your Recruiter can share more about the specific salary range for the job location during the hiring process.

Pay Range: $150,000 - $200,000 USD