Technical architecture

System architecture

ResistAI is a production full-stack platform integrating structural biology, protein language models, vector search, and LLM reasoning into a single automated pipeline.

Data pipeline

Input: UniProt REST API, WHO ESKAPE pathogens, M. tuberculosis targets
Structure: AlphaFold DB v4, ESMFold API (fallback), PDB coordinate parsing
Analysis: fpocket cavity detection, druggability scoring (≥0.7 = high), ESM-2 embeddings (480-dim)
Storage: proteins_annotated.csv, embeddings.parquet, ChromaDB vector index
Serving: FastAPI REST endpoints, PubMed E-utilities (RAG), Llama 3.3 70B via Groq
Frontend: Next.js 14 (App Router), Supabase auth, Resend transactional email

Pipeline orchestration

Workflow: Nextflow DSL2 (modular, reproducible)
Modules: fetch_card, esmfold, fpocket, summary
Containers: Docker + Singularity (HPC)
Scheduler: Slurm-compatible (LSF/PBS ready)
Scale: 2,433 proteins, parallelised

Infrastructure

API hosting: Render (FastAPI + uvicorn)
Frontend: Vercel (Next.js 14)
Auth: Supabase (PostgreSQL + JWT)
Vector DB: ChromaDB (local persistent)
Email: Resend (noreply@resistai.bio)
Domain: resistai.bio (Namecheap)

Scientific methods

Druggability assessment

fpocket 4.0 identifies candidate binding cavities by clustering alpha spheres on the protein surface. Its druggability score combines pocket volume, hydrophobicity, and polarity into a value between 0 and 1. Thresholds: score ≥ 0.7 is high; 0.4 ≤ score < 0.7 is medium.
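
The thresholds can be expressed as a small helper. This is a minimal sketch: the "low" label for scores under 0.4 and the choice to rate a protein by its best-scoring pocket are assumptions, since the page names only the high and medium tiers.

```python
from typing import Iterable

HIGH_THRESHOLD = 0.7    # from the text: score >= 0.7 is high
MEDIUM_THRESHOLD = 0.4  # from the text: score >= 0.4 is medium


def classify_druggability(score: float) -> str:
    """Map one fpocket druggability score to a tier label."""
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"  # assumed label; not named in the source


def protein_tier(pocket_scores: Iterable[float]) -> str:
    """Rate a protein by its best pocket (an assumption, not the documented rule)."""
    scores = list(pocket_scores)
    if not scores:
        return "low"  # no pocket detected
    return classify_druggability(max(scores))
```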

Protein embeddings

ESM-2 (esm2_t12_35M_UR50D, 35M parameters) yields a 480-dimensional embedding per protein that encodes evolutionary and structural context. The embeddings are used for similarity search and downstream ML classification.
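
ESM-2 emits one vector per token, so a per-protein embedding requires a pooling step. The page does not say which one is used; the sketch below assumes mean pooling over residue positions, with a mask that excludes special tokens. It uses plain Python lists and short vectors to stay self-contained; in practice the vectors would be 480-dimensional model outputs.

```python
from typing import List, Optional, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]],
              keep: Optional[Sequence[bool]] = None) -> List[float]:
    """Average the selected token vectors into one fixed-length vector.

    token_vectors: one vector per token (length L, each of dimension D).
    keep: optional mask; False drops a token (e.g. start/end or padding tokens).
    """
    if keep is None:
        keep = [True] * len(token_vectors)
    if len(keep) != len(token_vectors):
        raise ValueError("mask length must match number of tokens")
    kept = [vec for vec, use in zip(token_vectors, keep) if use]
    if not kept:
        raise ValueError("no tokens selected for pooling")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
```

Mean pooling gives every protein a vector of the same length regardless of sequence length, which is what a vector index and a tabular classifier both need.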

Literature retrieval (RAG)

PubMed abstracts are indexed in ChromaDB as dense vectors. At query time, the top-k most relevant abstracts are retrieved by cosine similarity and passed as context to Llama 3.3 70B, which produces a synthesis that cites PMIDs.
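
The retrieval step can be sketched without a vector database. The example below ranks documents by cosine similarity and formats the hits as PMID-tagged context. It stands in for what ChromaDB does internally and is not the platform's code; the function names and prompt layout are assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float],
          docs: Sequence[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[Tuple[str, float]]:
    """Return the k (pmid, similarity) pairs most similar to the query."""
    scored = [(pmid, cosine(query, vec)) for pmid, vec in docs]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def build_context(hits: Sequence[Tuple[str, float]], abstracts: Dict[str, str]) -> str:
    """Prefix each retrieved abstract with its PMID so the LLM can cite it."""
    return "\n\n".join(f"[PMID {pmid}] {abstracts[pmid]}" for pmid, _ in hits)
```

Tagging each abstract with its identifier in the prompt is what lets the model's answer point back to specific articles.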

Structural prediction

AlphaFold DB v4 provides precomputed high-confidence structures for the majority of targets. The ESMFold API is used as a fallback for proteins not covered by AlphaFold DB, and structures that neither source can resolve are handled gracefully with a placeholder.
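
The fallback order can be sketched as a chain of fetchers tried in sequence. The fetchers are injected as callables so the logic can be tested without network access. All names are hypothetical, and the AlphaFold DB file URL pattern is an assumption that should be checked against the AlphaFold DB documentation.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fetcher takes (accession, sequence) and returns PDB text, or None if unavailable.
Fetcher = Callable[[str, str], Optional[str]]


def alphafold_pdb_url(accession: str) -> str:
    """Assumed URL pattern for a v4 model file in AlphaFold DB."""
    return f"https://alphafold.ebi.ac.uk/files/AF-{accession}-F1-model_v4.pdb"


def resolve_structure(accession: str,
                      sequence: str,
                      fetchers: Sequence[Tuple[str, Fetcher]]) -> Tuple[str, Optional[str]]:
    """Try each source in order; fall back to a placeholder if all fail.

    Returns (source_name, pdb_text). A failed or empty fetch moves on to the
    next source instead of aborting the pipeline.
    """
    for name, fetch in fetchers:
        try:
            pdb_text = fetch(accession, sequence)
        except Exception:
            pdb_text = None  # treat any fetch error as "not available"
        if pdb_text:
            return name, pdb_text
    return "placeholder", None
```

A caller would pass something like `[("alphafold_db", fetch_alphafold), ("esmfold", fetch_esmfold)]`, where both fetchers are user-supplied functions.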

Model benchmarking note

Benchmarking ESM-2 35M (480-dim) against ESM-2 150M (640-dim) embeddings for druggability classification showed no performance gain from the larger model (ROC-AUC 0.79 in both). This suggests that the sequence-level druggability signal saturates at the smaller model capacity. The next lever for improvement would be incorporating explicit structural features (pocket geometry, hydrophobicity) rather than scaling the language model. The production pipeline therefore uses the 35M model for efficiency, with no measured loss in accuracy.
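
For readers unfamiliar with the metric, ROC-AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. It illustrates the metric only; it does not reproduce the benchmark protocol, which the page does not describe.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of examples,
    which is fine for illustration but slow for large datasets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because the metric depends only on the ranking of scores, two classifiers built on different embeddings can reach the same ROC-AUC even if their raw outputs differ.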

Scientific disclaimer

Druggability scores are computed by fpocket and serve as structural proxies for binding site tractability. Thresholds (high ≥ 0.7, medium ≥ 0.4) are conservative estimates based on Le Guilloux et al. 2009. Scores reflect static predicted structures (AlphaFold, or ESMFold where used as fallback) and do not account for protein flexibility or allosteric effects. Experimental validation is required to confirm druggability.

Datasets

UniProt (reviewed)

2,433 resistance proteins from WHO ESKAPE pathogens and M. tuberculosis

rest.uniprot.org

AlphaFold DB v4

High-confidence predicted 3D structures

alphafold.ebi.ac.uk

PubMed / NCBI

2,508 indexed antibiotic resistance articles

eutils.ncbi.nlm.nih.gov

ESM-2 model weights

esm2_t12_35M_UR50D (HuggingFace)

huggingface.co/facebook/esm2_t12_35M_UR50D