Feedzai is the market leader in managing financial risk with AI. We're coding the future of commerce with today's most advanced risk management platform powered by big data and machine learning. Founded and developed by data scientists and aerospace engineers, Feedzai has one mission: to make banking and commerce safe. The world's largest banks, processors, and retailers use Feedzai's fraud prevention and anti-money laundering products to manage risk while improving customer experience.
Engineering is responsible for all Feedzai product development. Together with Product Management and Data Science, we are building the next generation of tools to catch fraud in real time with a machine-learning-first approach. Formed by engineers and managed by engineers, Feedzai has one of the most talented teams out there, from junior to senior engineers, who provide a safe, open and collaborative environment where everyone keeps learning. While delivering the best value to our customers, you will be exposed to a wide range of technical challenges, such as building distributed systems that need to operate 24/7 at ultra-low latencies.
Feedzai's infrastructure is growing at a rate of 400% a year, and with a mindset of continuous innovation, keeping these services available and performant is a real challenge. As a Senior Site Reliability Engineer you will drive changes in architecture, processes and operations in order to push both availability and performance up.
At Feedzai you will work with a fast-growing infrastructure of thousands of machines that process terabytes of data on a daily basis. Your challenge is to guarantee that this operation runs like a Swiss clock!
Who are you:
You are a curious person who loves designing distributed systems, scaling them and planning for failures.
You accept the fact that failures will happen and plan component architecture to handle them as gracefully as possible.
You know the CAP theorem very well and how it relates to each component's architecture.
When problems occur, you feel the urge to identify their root cause and to work out how the solution can be improved to prevent them in the future.
You will prepare solution sizings based on client needs and on benchmarks of the Feedzai solution.
You will improve products by developing features that support the correct daily operation of the solution (e.g. exposing metrics on product internals such as thread pool usage, rate of events processed, exceptions raised, etc.).
You will maintain and evolve the monitoring and alerting platform in order to guarantee problems are detected in a timely manner and solved before causing significant service degradation.
You will take part in the design and development of product features, jointly with engineering teams, with a focus on operation and scalability.
You will work closely with Customer Success and Cloud Engineering Teams to ensure healthy environments.
You will automate routine manual tasks.
You will be responsible for supporting the Cloud Engineering team in producing Root Cause Analyses and post-mortems after incidents, by directly producing the documents and by guaranteeing that the necessary infrastructure and processes are put in place to support such analysis (e.g. monitoring, centralised logging, tracing, etc.).
Requirements:
Computer Science Degree
5+ years of software development in Java
Knowledge of designing, analyzing and troubleshooting distributed systems
Nice-to-have requirements:
Knowledge of CI/CD processes
Knowledge of cloud environments (e.g. AWS, Azure, Google Cloud)
Experience in monitoring distributed solutions and setting up alerting systems
Knowledge of RabbitMQ, Kafka and Cassandra internals
Continuous improvement SRE mindset