Prompt/LLM Engineer

PhnyX Lab • Full-time • Palo Alto, CA, US

Company Overview

Founded in 2024, PhnyX Lab is a Stanford-based AI startup in Silicon Valley, California, on a mission to revolutionize the pharmaceutical and medical industries.

We’ve built Cheiron, a generative AI platform purpose-built for pharma, and in just 4 months since launch, we’ve captured 30% market share in the Korean pharmaceutical sector and are already generating meaningful revenue.

All of this has been achieved pre-VC, backed instead by a $6M+ raise from leading angels, including several international tycoon families such as SK and Samsung, and two co-authors of “Attention Is All You Need.”

We’re now growing our core team in Palo Alto and looking for exceptional AI and software engineers to join as founding engineers. If you’re excited to shape core infrastructure, ship cutting-edge GenAI products, and build systems that will redefine how pharmaceutical companies work, you’ll thrive here.

News:

https://finance.yahoo.com/news/ai-startup-phnyx-lab-secures-110000098.html
https://www.koreaherald.com/article/10512114
https://www.koreabiomed.com/news/articleView.html?idxno=26053

About The Role

PhnyX Lab is seeking a highly analytical and systematic Prompt/LLM Engineer to join our technical team. This is a critical, hands-on role focused on architecting the core logic that powers our generative AI platform, Cheiron. You will be responsible for designing, evaluating, and optimizing the high-performing prompts that generate accurate, safe, and contextually relevant outputs for the pharmaceutical industry. You will blend a deep technical understanding of LLM behavior with a rigorous, metrics-driven approach to experimentation. This is not just a creative role; we are looking for an engineer who knows how to systematically measure, debug, and improve prompt performance at scale.

Key Responsibilities

Develop, optimize, and document high-performing prompts for LLM-powered medical and pharmaceutical applications.
Build and maintain evaluation pipelines integrating both automated metrics (BLEU, ROUGE, BERTScore, embedding similarity, factual accuracy) and domain-expert reviews.
Tune prompts based on quantitative KPIs and qualitative feedback, using structured experimentation, A/B testing, and multi-metric scoring.
Conduct prompt failure-mode analysis, debug unexpected outputs, and design fixes for edge cases.
Curate and maintain domain-specific evaluation datasets incorporating medical ontologies (SNOMED CT, RxNorm, MeSH, ICD codes) and compliance constraints.
Translate clinical and pharmaceutical objectives into reproducible, high-quality prompt logic.
Collaborate with ML engineers, researchers, and compliance specialists to integrate prompts into production-grade GenAI workflows.
Continuously explore and apply state-of-the-art prompting techniques, including zero-shot, few-shot, chain-of-thought, and multi-agent orchestration.

Key Skill Sets

Bachelor’s degree or higher in Computer Science, Computational Linguistics, Data Science, or a related field.
Proven experience designing, tuning, and evaluating prompts for LLMs in production or research settings.
Deep familiarity with prompt structures, evaluation methodologies, and optimization strategies.
Proficiency in Python for evaluation scripting, automation, and batch testing.
Solid understanding of LLM behavior, limitations, and failure patterns.
Strong ability to track iterations, log prompt changes, and measure output quality over time.
Excellent organizational skills with attention to detail in evaluation and documentation.

Nice to have:

Experience in medical, healthcare, biotechnology, or pharmaceutical domains.
Familiarity with LangChain, LangGraph, DSPy, and AI agents.
Experience with RAG pipelines and vector databases (Milvus, Pinecone, FAISS, Weaviate).
Contributions to open-source AI/LLM tooling or published research in prompt engineering or LLM evaluation.
Startup experience or demonstrated success in fast-paced, high-growth environments.