System architecture
ResistAI is a production full-stack platform integrating structural biology, protein language models, vector search, and LLM reasoning into a single automated pipeline.
Data pipeline
Pipeline orchestration
Infrastructure
Scientific methods
Druggability assessment
fpocket 4.0 identifies binding cavities by clustering alpha spheres on the protein surface. Its druggability score (0 to 1) combines pocket volume, hydrophobicity, and polarity. Thresholds: score ≥ 0.7 is rated high; 0.4 ≤ score < 0.7 is rated medium.
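The thresholds above can be sketched as a small classifier. This is a minimal illustration, not the pipeline's actual code: the function name and the "low" label for scores under 0.4 are assumptions, since the text only names the high and medium tiers.

```python
HIGH_THRESHOLD = 0.7    # from the text: score >= 0.7 is "high"
MEDIUM_THRESHOLD = 0.4  # from the text: score >= 0.4 is "medium"


def classify_druggability(score: float) -> str:
    """Map an fpocket druggability score (0 to 1) to a tier label.

    The "low" label is an assumption; the source only defines
    the "high" and "medium" tiers.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"druggability score out of range: {score}")
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"


if __name__ == "__main__":
    for s in (0.85, 0.7, 0.55, 0.4, 0.12):
        print(s, classify_druggability(s))
```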
Protein embeddings
ESM-2 (esm2_t12_35M_UR50D, 35M parameters) generates a 480-dimensional per-protein embedding that encodes evolutionary and structural context. The embedding is used for similarity search and downstream ML classification.
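One way such a per-protein vector could be produced is sketched below using the HuggingFace transformers library. The text does not say how per-residue outputs are reduced to one vector, so mean pooling over residues (excluding the special start and end tokens) is an assumption here, as are the function names.

```python
from typing import List

# HuggingFace hub id for the 35M-parameter checkpoint named in the text.
MODEL_NAME = "facebook/esm2_t12_35M_UR50D"


def mean_pool(residue_vectors: List[List[float]]) -> List[float]:
    """Average per-residue vectors into one fixed-size vector."""
    if not residue_vectors:
        raise ValueError("cannot pool an empty sequence")
    n = len(residue_vectors)
    dim = len(residue_vectors[0])
    return [sum(vec[i] for vec in residue_vectors) / n for i in range(dim)]


def embed_protein(sequence: str) -> List[float]:
    """Return a 480-dimensional embedding for one amino-acid sequence.

    Assumption: mean pooling over residues; the pipeline's actual
    pooling strategy is not stated in the source.
    """
    # Imported lazily so mean_pool works without the heavy dependencies.
    import torch
    from transformers import AutoTokenizer, EsmModel

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = EsmModel.from_pretrained(MODEL_NAME)
    model.eval()

    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, length + 2, 480)

    residues = hidden[0, 1:-1]  # drop the <cls> and <eos> positions
    return mean_pool(residues.tolist())
```

Calling `embed_protein` downloads the model weights on first use; in a batch setting the tokenizer and model would normally be loaded once and reused.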
Literature retrieval (RAG)
PubMed abstracts are indexed in ChromaDB as dense vectors. At query time, the top-k most relevant abstracts are retrieved by cosine similarity and passed as context to Llama 3.3 70B, which synthesizes an answer with PMID citations.
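The retrieval step can be illustrated with a plain-Python stand-in. This is not the ChromaDB API; it is a brute-force sketch of top-k cosine-similarity ranking, with hypothetical function names and toy two-dimensional vectors keyed by made-up identifiers.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k (identifier, similarity) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy index; real entries would be abstract embeddings keyed by PMID.
    toy_index = {"doc-a": [1.0, 0.0], "doc-b": [0.0, 1.0], "doc-c": [1.0, 1.0]}
    for doc_id, score in top_k([1.0, 0.0], toy_index, k=2):
        print(doc_id, round(score, 3))
```

A vector database replaces the linear scan with an approximate nearest-neighbour index, but the ranking criterion is the same.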
Structural prediction
AlphaFold DB v4 provides precomputed high-confidence structures for the majority of targets. The ESMFold API is used as a fallback for proteins not covered by AlphaFold DB, and proteins that neither source can resolve are handled gracefully with a placeholder.
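The fallback order described above might look like the following sketch. The two fetcher callables are hypothetical and injected by the caller, so no real endpoint or client is assumed; the source labels ("alphafold_db", "esmfold", "placeholder") are illustrative names as well.

```python
from typing import Callable, Optional, Tuple

# A fetcher takes an identifier or sequence and returns PDB text, or None.
Fetcher = Callable[[str], Optional[str]]


def _safe_call(fetch: Fetcher, arg: str) -> Optional[str]:
    """Run a fetcher, treating any failure as 'no structure available'."""
    try:
        return fetch(arg)
    except Exception:
        return None


def resolve_structure(
    uniprot_id: str,
    sequence: str,
    fetch_alphafold: Fetcher,
    fold_with_esmfold: Fetcher,
) -> Tuple[str, Optional[str]]:
    """Try AlphaFold DB first, then ESMFold, then fall back to a placeholder.

    Returns (source_label, pdb_text); pdb_text is None for the placeholder.
    """
    pdb = _safe_call(fetch_alphafold, uniprot_id)
    if pdb:
        return "alphafold_db", pdb

    pdb = _safe_call(fold_with_esmfold, sequence)
    if pdb:
        return "esmfold", pdb

    return "placeholder", None
```

Passing the fetchers in as arguments keeps the ordering logic testable without network access; downstream steps can check the returned label to skip pocket detection for placeholder entries.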
Model benchmarking note
Benchmarking ESM-2 35M (480-dim) against ESM-2 150M (640-dim) embeddings for druggability classification showed no performance gain from the larger model (ROC-AUC 0.79 for both). This suggests that the sequence-level druggability signal saturates at the smaller model capacity. The next lever for improvement would be incorporating explicit structural features (pocket geometry, hydrophobicity) rather than scaling the language model. The production pipeline therefore uses the 35M model for efficiency, with no measured loss in ROC-AUC.
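For readers unfamiliar with the metric, ROC-AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by direct pairwise comparison; it is an illustration on toy data, not the benchmark code, and the function name is hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy data: three of the four positive/negative pairs are ordered correctly.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise loop is quadratic in the number of examples; library implementations use a rank-based formulation, but the resulting value is the same.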
Scientific disclaimer
Druggability scores are computed by fpocket and serve as structural proxies for binding site tractability. Thresholds (high ≥ 0.7, medium ≥ 0.4) are conservative estimates based on Le Guilloux et al. 2009. Scores reflect static AlphaFold-predicted structures and do not account for protein flexibility or allosteric effects. Experimental validation is required to confirm druggability.
Datasets
UniProt (reviewed): 2,433 resistance proteins (WHO ESKAPE + M. tuberculosis)
AlphaFold DB v4: High-confidence predicted 3D structures
PubMed / NCBI: 2,508 indexed antibiotic resistance articles
ESM-2 model weights: esm2_t12_35M_UR50D (HuggingFace)