Senior ML Reliability Engineer

Remote, US
Classification: Direct Hire
Contract Length: N/A
Location: 100% Remote (United States)
Job ID:

CereCore® provides EHR implementations, IT and application support, IT managed services, technical staffing, strategic IT consulting, and advisory services to hospitals and health systems nationwide. Our heritage is in the hallways of some of America’s top-performing hospitals. We have served as leaders in finance, operations, technology, and as clinicians turned power users and innovators. At CereCore, we know firsthand the power that aligned technology can provide in delivering care. As a wholly-owned subsidiary of HCA Healthcare, we are committed to bringing the expertise we have gained as operators to deliver IT services that emphatically address the needs of health systems across the United States. Our team of over 600 clinical and technical professionals has implemented EHR systems in more than 400 facilities and provides managed services support to tens of thousands of health system employees. We work tirelessly to provide healthcare organizations specialized IT services that support the delivery of patient care. The Link to Life-Saving Care.

CereCore is seeking a Senior ML Reliability Engineer to join our team remotely. As a Senior ML Reliability Engineer on our Data Science team, you will be instrumental in building, managing, and maintaining the tools and infrastructure that ensure the reliability and efficiency of our Machine Learning Operations (MLOps) platform. This role requires a balance of technical expertise in reliability engineering and operational skills to maintain and optimize AI systems and infrastructure.

  • Tool Development and Management: Build, manage, and maintain tools for system reliability, including dashboards, logging systems, and pager systems.
  • Infrastructure Maintenance: Help maintain and enhance our CI/CD pipelines, logging infrastructure, and other operational systems crucial for MLOps.
  • Monorepo Management: Keep the monorepo up-to-date with the latest dependency and security updates, ensuring a secure and efficient development environment.
  • Vendor Collaboration: Assist in implementing and maintaining infrastructure and systems managed by external vendor teams.
  • Incident Management: Lead and participate in incident management processes, including troubleshooting, root cause analysis, and implementing corrective measures to prevent future occurrences.

  • Hands-on technical engineering, architecture, and development experience with Google Cloud Platform Vertex AI MLOps, live and in production at scale.
  • AI/ML Knowledge: Solid understanding of AI/ML principles and technologies.
  • System Monitoring and Tools: Experience with system monitoring tools and observability. Knowledge of GCP, Vertex AI, or other cloud platforms is highly beneficial.
  • Programming and Scripting: Proficiency in programming languages such as Python and scripting for automation.
  • Problem-Solving Skills: Strong analytical and problem-solving skills, with the ability to work under pressure.
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
  • 5+ years of experience in the technology field.
  • Proven experience in a reliability engineering role, preferably with a focus on AI/ML systems.
  • Experience in incident management and performance optimization.
  • Excellent communication and teamwork skills.
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
