Data Engineer, AI

Biohub is the first large-scale initiative bringing frontier AI models, massive compute, and experimental capabilities under one roof. The mission is to build a general-purpose system that integrates biological foundation models and lab capabilities to accelerate scientific discovery and cure disease.

The Team

The Data Engineering team owns the strategy and implementation for the data that fuels AI research. Their goal is to maximize the speed and capability of biological AI by connecting massive public data resources and Biohub’s experimental platforms to AI systems. They handle diverse modalities, including sequences, images, spatial coordinates, time series, and molecular structures.

The Opportunity

As a Data Engineer, you will design systems that ingest data from public repositories and transform heterogeneous biological formats into AI-ready datasets at petabyte scale. You will work in a small, well-resourced team that uses AI tools aggressively (Claude Code, automated agents, LLMs) and values technical correctness alongside scientific accuracy.

What You'll Do

  • Scale Infrastructure: Design and build data pipelines processing genomic and imaging data at petabyte scale.
  • Automation: Build agent-based systems for automated dataset curation, quality control, and workflow generation.
  • Optimization: Solve performance and bandwidth challenges using creative engineering.
  • Accessibility: Create tooling for data cataloging and registration to make datasets discoverable and accessible.
  • Collaboration: Partner with AI Research and scientists to translate model requirements into data specifications.
  • Reliability: Improve pipeline observability, targeting 99%+ success rates without manual intervention.

What You'll Bring

  • Experience: 5+ years building reliable data systems at scale (hundreds of terabytes to petabytes).
  • Distributed Systems: Proficiency with frameworks like Databricks, Spark, or Ray.
  • Cloud & HPC: Experience with AWS and High-Performance Computing environments.
  • Engineering Rigor: Strong software engineering fundamentals and interest in AI-native development practices.
  • Scientific Curiosity: A background in computational biology or bioinformatics is a plus, particularly familiarity with formats such as FASTQ, BAM, VCF, OME-Zarr, or HDF5.

Compensation & Logistics

  • Base Pay Range: $241,000 - $404,000+ (Redwood City, CA or NYC).
  • Work Model: Hybrid position (60% onsite, approximately 3 days per week).

Benefits

  • Generous 401(k) employer match.
  • Paid time off for volunteering.
  • Funding for family-forming benefits.
  • Relocation support.

Job Type: Permanent
Location: New York, NY (Hybrid); Redwood City, CA (Hybrid)
Date posted: April 13, 2026