
HPC Systems Engineer - 132811

UC San Diego
United States, California, Oakland
1111 Franklin Street
Nov 10, 2024
Job Description
Extended Deadline: Thu 11/28/2024
UC San Diego values equity, diversity, and inclusion. If you are interested in being part of our team, possess the needed licensure and certifications, and feel that you have most of the qualifications and/or transferable skills for a job opening, we strongly encourage you to apply.

UCSD Layoff from Career Appointment: Apply by 10/22/24 for consideration with preference for rehire. All layoff applicants should contact their Employment Advisor.

Special Selection Applicants: Apply by 11/1/24. Eligible Special Selection clients should contact their Disability Counselor for assistance.

Job posting will remain open until a suitable candidate has been identified.

DESCRIPTION

DEPARTMENT OVERVIEW:

The Mission of the San Diego Supercomputer Center is to translate innovation into practice. SDSC adopts and partners on innovations in industry and academia in the areas of software, hardware, computational and data sciences, and related areas, and translates them into cyberinfrastructure that solves practical problems across any and all scientific domains and societal endeavors. Cyberinfrastructure refers to an accessible, integrated network of high-performance computing, data, and networking resources and expertise, focused on accelerating scientific inquiry and discovery. With more than 250 employees and $30-50M of revenue a year, SDSC is a global leader in the design, development, and operations of cyberinfrastructure.

SDSC supports hundreds of multidisciplinary programs spanning a wide variety of domains, from earth sciences and biology to astrophysics, bioinformatics, and health IT. SDSC presently operates multiple large HPC systems, ranging from a general-purpose system with 120k x86 CPU cores to a system explicitly designed for Artificial Intelligence and Machine Learning, and a nationally distributed system open for all of academia to integrate with. SDSC offers research data services across the entire vertical stack, from universally scalable storage to consulting services on FAIR, Big Data, and AI. SDSC offers a rich set of cloud services on-premises, in the commercial cloud, and as hybrid services across both.

SDSC has three geographic scopes: a national scope supporting cyberinfrastructure for the entire US research and education community; a California scope with a special focus on convergence research that addresses the three dominant threats to CA (drought, fire, and earthquakes); and a campus scope focusing on advancing the global impact of SDSC by advancing the research objectives of the UC San Diego faculty, researchers, and students. SDSC impacts researchers at scales from thousands to millions. SDSC annually trains thousands of researchers in cyberinfrastructure tools and software, and supports thousands of individual researchers via Unix accounts on its large HPC systems. SDSC was a leader in developing the Science Gateway concept, and continues to be a global leader in its evolution. SDSC operates multiple such major gateways, with user communities ranging from the tens of thousands to the millions. SDSC's educational programs include online courses that have been attended by more than a million students.

SDSC is committed to democratizing access to cyberinfrastructure across all of its geographic scopes. SDSC strives towards a culture that supports our employees to be their best, achieve their goals, and enjoy their lives, both professionally and personally.

The Data Enabled Scientific Computing (DESC) division within SDSC designs and, jointly with other SDSC researchers, proposes supercomputing systems in response to calls for proposals worth tens of millions of dollars from the National Science Foundation (NSF), various government organizations, and UC entities; it also responds to calls for proposals for cyberinfrastructure (CI) related research, solutions, and support. DESC manages, operates, and troubleshoots advanced, leading-edge, complex, multi-petaflop and multi-petabyte data-intensive supercomputer systems, file systems (Lustre, Ceph, etc.), interconnects (such as InfiniBand, NVLink, and Ethernet), and CI projects housed at SDSC. Research leaders within DESC submit high performance computing (HPC), high throughput computing (HTC), AI, CI, data science, computational science, science gateway, and scientific software research proposals and acquire funding from the NSF, National Institutes of Health (NIH), Department of Energy (DOE), Department of Defense (DOD), and industry. DESC carries out supercomputing, CI, data science, computational science, and scientific software research and development projects. This division provides consulting and user support to researchers and users from academia, and collaborates with them as well as with industrial users.
DESC provides advanced computational science, CI, and scientific software support for the national and UC user communities as a part of projects and machines such as the Expanse machine (a five-year, ~$25-million project supported by the NSF that enables tens of thousands of users to use HPC, HTC, and GPUs), the Voyager machine (a five-year, ~$12-million project supported by the NSF that enables researchers to experiment with and use AI-focused hardware for scientific applications), the PNRP project (a five-year, ~$12-million project supported by the NSF that enables distributed computing with GPU, FPGA, and CPU resources), the Triton Shared Compute Cluster (TSCC, a condo cluster primarily for UCSD researchers), and the CloudBank project (a five-year, ~$6-million project supported by the NSF to Simplify Cloud Access for Computer Science Research and Education). Various other funded CI research and development and domain science (e.g. biochemistry, bioinformatics, cosmology) projects are directed by DESC researchers. DESC staff and researchers are involved with the NSF-funded Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program. DESC is involved in various HPC/HTC/AI training, workshop, outreach, workforce development, and K-12 student programs and associated NSF-funded projects. This division stays current with HPC, HTC, accelerator, CI, computational science, and scientific software research and technology trends, and engages with supercomputer vendors (e.g. Dell, Supermicro, Intel, NVIDIA, AMD, IBM, Hewlett Packard Enterprise, DataDirect Networks, Aeon Computing, Arista) to remain current with future technologies utilized in supercomputer designs.

POSITION OVERVIEW:

The HPC Systems Engineer is responsible for overseeing the management of national and campus-level high-performance computing (HPC) clusters and their associated storage systems, including large parallel file systems, NFS file servers, and underlying storage technologies. This role involves Linux systems administration, with on-call duties, managing hardware, operating systems, I/O, and the installation and maintenance of the software environment. The engineer supports resource managers and schedulers, and ensures seamless client access to parallel and distributed file systems. In this capacity, they perform in-depth analysis, testing, scripting, and benchmarking, working with advanced systems, data, and networks in a research and performance evaluation setting. They contribute to the design, installation, and upgrade of large-scale HPC clusters and storage resources. Familiarity with high-performance file systems like Lustre, Ceph, and GPFS is required, along with deep knowledge of system internals, storage, networking, operating systems, and emerging technologies.

The engineer works closely with other teams to integrate HPC systems into SDSC's network, cloud, and user environments, and helps develop and implement security procedures to safeguard these systems. The role also involves managing performance testing at all levels, including CPU, memory, GPU, interconnect bandwidth, and file system IOPS. This performance testing is crucial for troubleshooting and optimizing system functionality. The engineer plays a key role in deploying monitoring tools with appropriate alert mechanisms to ensure rapid incident detection and response, and collaborates with SDSC Operations to streamline incident evaluation and resolution processes.

The engineer also serves as a liaison between the Operations team and computational scientists, providing advanced technical guidance and supporting ongoing training efforts. They contribute to national projects, like ACCESS, and may present their work at national meetings as necessary. Furthermore, they help implement version control for system configurations, utilizing tools like Git for tracking changes and ensuring system stability. With strong communication skills, the engineer works effectively in collaborative settings, often addressing undefined problems and making impactful recommendations that influence the overall project or system. Their expertise spans system internals, data management, storage, network infrastructure, security, and the interrelationships between these critical components.

For more information, please visit: https://www.sdsc.edu/

QUALIFICATIONS
  1. Bachelor's degree in related area and/or equivalent experience/training. Professional technical engineering or technical programming experience or Master's Degree preferred.

  2. Advanced knowledge of systems integration and deploying moderately complex systems integration solutions. Specifically demonstrated through experience administering large-scale HPC clusters and their related filesystems.

  3. Strong knowledge of administering Linux systems, primarily Red Hat and its derivatives, including services, networking, and file systems.

  4. Ability to install, maintain, upgrade, and troubleshoot large (petabyte scale) high performance parallel and distributed filesystems such as Lustre, GPFS and Ceph.

  5. Strong demonstrated experience with a major configuration management software, including application packaging and installation.

  6. Demonstrated experience programming and scripting. Experience with Bash and Python preferred.

SPECIAL CONDITIONS
  • Job offer is contingent upon satisfactory clearance based on Background Check results.

  • Occasional evenings and weekends may be required.

  • Overtime and weekends may be required.

Pay Transparency Act

Annual Full Pay Range: $94,400 - $176,800 (will be prorated if the appointment percentage is less than 100%)

Hourly Equivalent: $45.21 - $84.67

Factors in determining the appropriate compensation for a role include experience, skills, knowledge, abilities, education, licensure and certifications, and other business and organizational needs. The Hiring Pay Scale referenced in the job posting is the budgeted salary or hourly range that the University reasonably expects to pay for this position. The Annual Full Pay Range may be broader than what the University anticipates paying for this position, based on internal equity, budget, and collective bargaining agreements (when applicable).

If employed by the University of California, you will be required to comply with our Policy on Vaccination Programs, which may be amended or revised from time to time. Federal, state, or local public health directives may impose additional requirements.

To foster the best possible working and learning environment, UC San Diego strives to cultivate a rich and diverse environment, inclusive and supportive of all students, faculty, staff and visitors. For more information, please visit UC San Diego Principles of Community.

UC San Diego is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, age or protected veteran status.

For the University of California's Affirmative Action Policy please visit: https://policy.ucop.edu/doc/4010393/PPSM-20
For the University of California's Anti-Discrimination Policy, please visit: https://policy.ucop.edu/doc/1001004/Anti-Discrimination

UC San Diego is a smoke and tobacco free environment. Please visit smokefree.ucsd.edu for more information.

Application Instructions

Please click on the link below to apply for this position. A new window will open and direct you to apply at our corporate careers page. We look forward to hearing from you!

Apply Online
Payroll Title: SYS INTEGRATION ENGR 3
Department: San Diego Supercomputer Center
Hiring Pay Scale: $94,400 - $115,000 / Year
Worksite: Hybrid
Appointment Type: Career
Appointment Percent: 100%
Union: Uncovered
Total Openings: 1
Work Schedule: Days, 8 hrs/day, Mon - Fri
Posted: 11/10/2024

Job Reference #: 132811
