Failure Analysis Engineer
United States, California, Fremont
*Job Title: Failure Analysis Engineer*
*This role will not accept C2C candidates.*

*Location:* Fremont, CA

*Work-type:* On-Site

*Employment Type:* 6 Month Contract (Contract-to-Hire)

*Industry:* AI, Cloud Computing, Data Centers

*About the Role*

We are seeking a highly skilled Failure Analysis Engineer to join our team and work on diagnosing, troubleshooting, and resolving failures in server racks, AI hardware, and large-scale distributed systems. In this role, you will be responsible for analyzing failed components, debugging hardware/software issues, and optimizing system reliability in a fast-paced, mission-critical environment.

*This position requires strong expertise in Linux, debugging, AI/ML, storage, and server hardware, with hands-on experience in Kubernetes, Docker, firmware/BIOS debugging, and networking protocols in an enterprise environment.*

*Key Responsibilities*

* Manage and maintain a fleet of server racks across different OEMs (network, storage, compute, AI hardware).
* Debug and troubleshoot complex hardware and software failures related to storage, compute, and AI.
* Support failure analysis initiatives, validating system and component failures from data centers.
* Work with network infrastructure, configuring and managing TCP/IP, DNS, DHCP protocols.
* Interface with OEM vendors for firmware and driver updates.
* Design and implement containerized applications using Docker and Kubernetes.
* Manage and maintain virtual machines (VMware, KVM).
* Perform root cause analysis, diagnosing failures in platform, firmware, BIOS, CPLD, and other applications.
* Support failure analysis labs, including inventory management and safety audits.
* Collaborate with cross-functional teams to provide updates and implement solutions.
*Must-Have Qualifications*

* Bachelor's or Master's degree in Computer Science, Electrical Engineering, or related field.
* 5+ years of experience in server rack management, failure analysis, and hardware debugging.
* Strong expertise in Linux (RedHat, Fedora, CentOS) or Unix environments.
* Proficiency in scripting languages such as Python, PowerShell, PHP, or Perl.
* Hands-on experience with containerization, Kubernetes, Docker, and virtual machines.
* Experience with server hardware validation (BIOS, CPLD, firmware debugging).
* Knowledge of network protocols (TCP/IP, DNS, DHCP).
* Strong troubleshooting and problem-solving skills in complex systems.
* Excellent communication and documentation skills.

*Nice-to-Have Skills*

* Experience with LLMs (Large Language Models) and AI frameworks (TensorFlow, PyTorch).
* Hands-on experience with failure analysis tools for FW/BIOS debugging.
* Background in data center infrastructure or cloud computing.

*Experience Level*

Expert Level

*Pay and Benefits*

The pay range for this position is $60.00 - $75.00/hr.

Eligibility requirements apply to some benefits and may depend on your job classification and length of employment. Benefits are subject to change and may be subject to specific elections, plan, or program terms. If eligible, the benefits available for this temporary role may include the following:

* Medical, dental & vision
* Critical Illness, Accident, and Hospital
* 401(k) Retirement Plan - Pre-tax and Roth post-tax contributions available
* Life Insurance (Voluntary Life & AD&D for the employee and dependents)
* Short and long-term disability
* Health Spending Account (HSA)
* Transportation benefits
* Employee Assistance Program
* Time Off/Leave (PTO, Vacation or Sick Leave)

*Workplace Type*

This is a fully onsite position in Fremont, CA.

*Application Deadline*

This position is anticipated to close on Mar 26, 2025.

*About TEKsystems*

We're partners in transformation. We help clients activate ideas and solutions to take advantage of a new world of opportunity. We are a team of 80,000 strong, working with over 6,000 clients, including 80% of the Fortune 500, across North America, Europe and Asia. As an industry leader in Full-Stack Technology Services, Talent Services, and real-world application, we work with progressive leaders to drive change. That's the power of true partnership. TEKsystems is an Allegis Group company.

The company is an equal opportunity employer and will consider all applications without regard to race, sex, age, color, religion, national origin, veteran status, disability, sexual orientation, gender identity, genetic information or any characteristic protected by law.