As a Site Reliability Engineer (SRE), you will help build a meaningful engineering discipline, combining software and systems to develop creative engineering solutions to operations problems.
Much of our support and software development focuses on optimizing existing systems, building infrastructure and reducing work through automation.
You’ll join a team of curious problem solvers with a diverse set of perspectives who are thinking big and taking risks. In this environment, you’ll take the lead on relevant projects, backed by an organization that provides the guidance and mentorship you need to learn and grow.
As an SRE, you’ll focus on making production applications and systems run better.
Design, code, test and deliver software to automate manual operational work
Troubleshoot priority incidents, facilitate blameless post-mortems and ensure permanent closure of incidents
Engage with development teams throughout the software life cycle to help build software for reliability and scale, minimizing later refactoring or changes
Identify application patterns and analyze operational data to support better service level objectives (SLOs)
Design self-healing and resiliency patterns
Design automated software and product upgrades, change management, and release management solutions
Coach or manage teams as applicable
Participate in 24x7 support coverage as needed
Bachelor’s degree or equivalent experience in a software engineering discipline
Expertise in designing, coding, testing, and delivering software in at least one technology stack
Proficiency in one or more technology domains; ideally a cross-domain expert able to solve complex, mission-critical problems within a business or across the firm
Working knowledge of infrastructure components (e.g., routers, load balancers, cloud products, container systems, compute, storage and networks)
Excellent debugging and troubleshooting skills
5+ years of hands-on experience in software engineering and/or site reliability engineering with the following technologies: Java, UNIX, and Oracle
Experience implementing and/or using Git/Stash, Jenkins, JIRA, and code quality and security scanning tools
Experience developing monitoring and log analysis tools to manage operations
Exposure to AppDynamics, Splunk, or Elasticsearch/Kibana would be a plus
Experience designing and contributing to performance monitoring and capacity management tools
Experience leading a team of engineers or production management personnel
Knowledge of cloud-based technologies and tools, especially for deployment, monitoring and operations, such as Kubernetes, AWS and PCF
Proficiency in making service-level changes to a system and troubleshooting its components
Proficiency in developing automated tools, systems and services across multiple technology domains
Experience in Agile development techniques, including Scrum
Proficiency with one or more infrastructure components such as networking, cloud services, orchestration tools, containerization, compute and storage systems
As a JPMorgan Chase & Co. Site Reliability Engineer (SRE), you will combine software and systems to help us build a world-class engineering function.
Working with your team, you’ll focus on improving our production applications and systems to creatively solve operations problems.