
Manager, Site Reliability Engineering (SRE)

Intercontinental Exchange
United States, Florida, Jacksonville
4800 East Deer Lake Drive
Nov 05, 2025
Overview

Job Purpose

ICE Mortgage Technology is the leading provider for the mortgage finance industry. ICE Mortgage Technology provides best-in-class servicing solutions to help manage all aspects of loan servicing, from loan boarding to default. Transform your performance with automation and insights and enhance the customer experience. Our solutions support first mortgages as well as home equity loans, and help servicers lower costs, reduce risk, and operate more efficiently. We're looking for motivated, results-oriented people to join our team.

Intercontinental Exchange is seeking a Lead Site Reliability Engineer who is service oriented and delivery focused, and who can build rapport with key members of the Operations and SRE teams while specifying and implementing automation changes, fixes, and improvement projects. The ideal candidate will have excellent time and customer management skills combined with a range of technical skills and knowledge.

This position is for a hands-on technical manager to lead a team of Engineers and Analysts focused on providing resilient, secure, scalable, and supportable services for mortgage borrowers and lenders. You will contribute to the strategy and delivery of the team, as well as manage the day-to-day workload. This role requires building a close relationship with our customer support, operations, engineering, database, and product organizations.

Responsibilities

Team Leadership: Lead a geographically distributed team of Engineers and Analysts, providing guidance, mentoring, and performance management. Foster a culture of continuous improvement.
Process Improvement: Develop and refine processes that streamline workflows, reduce bottlenecks, and increase overall velocity. Participate in or lead continuous improvement projects driven by automation.
Set individual goals and manage the personal growth of team members.
Act as the primary point of contact for staff issues.
Training and Support: Provide training and documentation to other team members on the effective use of existing toolsets and best practices.
Collaboration and Communication: Work closely with partner teams to understand their pain points and automation goals. Work with teams to make them more efficient.
Lead complex projects such as data center migrations, major systems upgrades, and tech stacks.
Employ deep troubleshooting skills to improve the availability, performance, and security of IMT Services.
Participate in on-call rotations and lead Incident Response and Root Cause Analysis.
Conduct root cause analysis and post-mortems for production incidents.
Ensure services are designed with 24/7 availability and operational readiness and rigor.
Implement proactive monitoring, alerting, trend analysis, and self-healing systems.
Define non-functional requirements as part of the product lifecycle to influence new designs, standards, and methods for scalable, highly available distributed systems.
Identify, evaluate, and execute preventive measures to minimize or avoid impact to the customer's experience.
Proactive vs. customer-escalated resolution of product/service defects or design changes, infrastructure changes, or operational changes.
Additional Duties: Perform any other activities as directed by management.

Knowledge and Experience

5+ years of experience as a people manager or in a team lead role with delegation duties in a 24x7 production technical support services environment
BS in Computer Science, Computer Engineering, Math, or equivalent professional experience
Fluency with one or more current-generation scripting languages (Python/Shell/Perl/PHP/Ruby) AND/OR Java Development and .NET
Excellent troubleshooting skills, utilizing a systematic problem-solving approach
Experience leading Incident Response and root cause analysis (RCA) / post-mortems
Experience in Windows, Linux, OCP, and AWS
Experience with Continuous Integration and Continuous Delivery concepts
Experience with monitoring and alerting tools (Splunk, BigPanda, PagerDuty)
Experience with automation of business continuity/disaster recovery/application resiliency
Process-oriented with great documentation skills (Confluence)
Must be able to multitask in a fast-paced environment with a focus on timeliness, documentation, and communications with peers and business users alike
Strong communication skills