LLM and supervised learning for prediction of protein variants
Two zero-shot prediction models were used to guide the design of the initial variant library: EVmutation and ESM-2. EVmutation is a statistical model designed by Hopf et al.23 that captures the co-variation between pairs of residues in an amino acid sequence. This is achieved by fitting a pairwise graphical model to the multiple sequence alignment (MSA) of homologs of the target protein. The model then scores the impact of amino acid substitutions by calculating the log-ratio of sequence probabilities between the mutant and wild-type sequences. In this work, we used the integrated web server EVcouplings52 with the default search parameters and highlighted bit-scores. ESM-2 is a transformer-based protein language model designed by Rives et al.53 that is trained on a large and diverse protein sequence database and captures the rules governing protein structure and function. ESM-2 can output the probability of a given amino acid occurring at a given position based on the upstream and downstream sequence context. Thus, we can score any mutation by using the wild type as a reference and comparing the probability of the mutant amino acid with that of the wild-type amino acid. In this work, we used the workflow implemented in ESM-2 (https://github.com/facebookresearch/esm) to make predictions with the model esm2_t36_3B_UR50D.
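The log-ratio scoring described above can be sketched as follows. This is a minimal illustration that assumes per-position amino acid probabilities have already been obtained from a model; the function name `score_mutation` and the toy probability table are hypothetical and are not part of the ESM-2 or EVcouplings code.

```python
import math

def score_mutation(position_probs, wt_aa, mut_aa):
    """Log-ratio of the mutant to the wild-type probability at one position."""
    return math.log(position_probs[mut_aa]) - math.log(position_probs[wt_aa])

# Toy probabilities for a single position (hypothetical values, not model output)
probs_at_position = {"V": 0.50, "T": 0.25, "A": 0.25}
score = score_mutation(probs_at_position, wt_aa="V", mut_aa="T")
```

Under this convention a positive score means the model considers the mutant residue more likely than the wild-type residue, and a negative score means the opposite.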
For each round of engineering, we trained a supervised prediction model on all experimentally measured variant fitness data from the current round and all prior rounds. We first preprocessed the raw data by removing all variants with zero or negative measured values and then normalized the data by log transformation. We trained the supervised model by modifying the workflow implemented by Hsu et al.24 (https://github.com/chloechsu/combining-evolutionary-and-assay-labelled-data). We initially trained the model using the augmented Potts and ESM approaches, but the evolutionary density score feature was negatively correlated with the experimental results. Thus, we modified the original training script and used the command ‘python src/evaluate.py phytase onehot --n_seeds=1 --n_threads=1 --n_train=-1’ for training and prediction.
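The preprocessing and one-hot featurization steps can be illustrated with a short sketch. It assumes a natural-log transform (the base is not stated in the text) and a 20-letter amino acid alphabet; the helper names and the toy sequences are hypothetical and do not reproduce the code of Hsu et al.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def preprocess(fitness_by_variant):
    """Drop zero or negative measurements and log-transform the rest."""
    return {seq: math.log(v) for seq, v in fitness_by_variant.items() if v > 0}

def one_hot(seq):
    """Flatten a sequence into a vector with 20 entries per residue."""
    vec = []
    for aa in seq:
        vec.extend(1 if aa == ref else 0 for ref in AMINO_ACIDS)
    return vec

# Toy measurements keyed by sequence (hypothetical values)
raw = {"MKV": 2.0, "MKT": 0.0, "MRV": -0.3, "AKV": 1.0}
clean = preprocess(raw)
features = {seq: one_hot(seq) for seq in clean}
```

The resulting feature vectors and log-transformed labels are the kind of input a regression model on one-hot features would consume.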
GPT-based user interface
To implement a Generative Pre-trained Transformer (GPT)-based user interface, we used OpenAI’s Assistants Application Programming Interface (API) (https://platform.openai.com/docs/assistants/overview). The assistant was configured with the ‘gpt-4-turbo’ model and equipped with the file-search (https://platform.openai.com/docs/assistants/tools/file-search) and function-calling (https://platform.openai.com/docs/assistants/tools/function-calling) tools. The interface invokes file search to answer users’ general questions and function calling to assist in the design of the initial variant library.
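A configuration of this kind might look like the sketch below. It only builds the settings as a plain Python dictionary and makes no API call; the tool `type` values follow OpenAI's documented tool schema, while the function name `design_initial_library` and its parameters are hypothetical placeholders, since the actual functions exposed by the interface are not described here.

```python
# Hypothetical function tool; the name and parameters are illustrative only.
design_library_tool = {
    "type": "function",
    "function": {
        "name": "design_initial_library",
        "description": "Rank single mutants for an initial variant library.",
        "parameters": {
            "type": "object",
            "properties": {
                "sequence": {"type": "string", "description": "Wild-type protein sequence"},
                "library_size": {"type": "integer", "description": "Number of variants to propose"},
            },
            "required": ["sequence"],
        },
    },
}

# Assistant settings: model name from the text, plus the two tool types.
assistant_config = {
    "model": "gpt-4-turbo",
    "tools": [{"type": "file_search"}, design_library_tool],
}
```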
Expression plasmids
The HMT and CiVCPO plasmids were a gift from the Uwe Bornscheuer lab. Both the HMT plasmids and the empty vector (pET-28a) were transformed into the Escherichia coli BL21 (Δmtn) strain, also a gift from the Uwe Bornscheuer lab, which is an MTA/SAH nucleosidase knockout of E. coli BL21(DE3)25. The phytase cDNA was ordered from Twist Bioscience and cloned into the pET-28a vector using HiFi assembly. These plasmids were sequence-verified by Plasmidsaurus (San Francisco, CA, USA). Complete sequences of these plasmids are provided in Supplementary Data 5.
Site-directed mutagenesis
For site-directed mutagenesis, three PCR fragments were assembled using HiFi assembly (Fig. 2b). The mutagenesis primers (27 base pairs, 12 + 3 + 12) were designed using a Python script and ordered in 96-deepwell plates from IDT. For PCR templates, the plasmids were linearized with restriction enzymes and purified using a PCR cleanup kit (Zymo #D4018). The HMT plasmid was linearized with EcoRV (for mutagenesis PCR of the ORF) and XbaI (for backbone PCR). The phytase plasmid was linearized with HpaI (for mutagenesis PCR of the ORF) and XhoI (for backbone PCR). The PCR fragments contained 27-40 overlapping base pairs. The PCR product for the vector backbone was treated with DpnI (37 °C for 4 h, followed by overnight incubation at room temperature), purified using a PCR cleanup kit (Zymo #D4018), and stored at -20 °C until used. The mutagenesis PCRs for the forward and reverse fragments were run in separate 96-well plates, followed by DpnI treatment. Q5 DNA polymerase (NEB #M0491) was used for mutagenesis PCR (50 µL reaction, ~200 pg template, 18 cycles, 65 °C 30 s, and 72 °C 1 min). We observed that NEB Q5 polymerase tolerates a wide range of theoretical annealing temperatures, so all mutagenic PCRs were annealed at 65 °C for ease of automation. The correct amount of input template DNA was critical to successful PCRs and efficient SDM, so the DNA concentration was measured using an Invitrogen Qubit Fluorometer with the dsDNA Quantitation kit (Invitrogen #Q32853) and further verified by agarose gel electrophoresis. The primers were ordered from IDT in 96-well plates at 2 µM. For PCRs, 12.5 µL of primers were added to 96-well PCR plates on a Tecan Fluent. After PCR, 2.5 µL of the reaction was mixed with 50 µL of 1x EvaGreen Dye (Biotium #31000), and fluorescence (λex = 488 nm / λem = 535 nm) was measured to verify successful PCR.
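The 12 + 3 + 12 primer layout can be sketched as below. This is not the authors' script: it assumes the forward primer consists of 12 nt of upstream flank, the new codon, and 12 nt of downstream flank, and that the reverse primer is its reverse complement. The toy ORF and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def design_primers(orf, codon_index, new_codon, flank=12):
    """Return (forward, reverse) primers centred on the mutated codon.

    codon_index is 0-based and orf must start in frame.
    """
    start = codon_index * 3
    if start < flank or start + 3 + flank > len(orf):
        raise ValueError("codon too close to the ORF end for the requested flank")
    forward = orf[start - flank:start] + new_codon + orf[start + 3:start + 3 + flank]
    return forward, reverse_complement(forward)

# 39-nt toy ORF (hypothetical); mutate the sixth codon (CTG) to ACC
orf = "ATGGCTAAAGGTGAACTGTTCACCGGTGTTGTTCCGATC"
fwd, rev = design_primers(orf, codon_index=5, new_codon="ACC")
```

Codons closer than 12 nt to either end of the ORF would need flanking sequence from the vector, which this sketch rejects rather than handles.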
HiFi assembly reactions (15 µL) were prepared using 30 ng of purified vector backbone PCR product, 1.25 µL each of the DpnI-treated forward and reverse PCR reactions, and 7.5 µL of HiFi master mix (NEB #E2621), followed by incubation at 50 °C for 30 min. To increase the efficiency of HiFi assembly, 50 ng of single-strand binding protein (NEB #M2401) was added to 500 µL of HiFi master mix54. Homemade competent DH5α cells in a 96-well plate (Bio-Rad #HSS9641) were transformed with 5 µL of the HiFi reaction by a 30 s heat shock, followed by 1 h of outgrowth in 150 µL of SOC medium. DH5α outgrowth cultures were spread on 8-well omnitray plates (120 µL per well, kanamycin 50 µg/mL) and incubated overnight at 37 °C. DH5α colonies were picked using Pickolo and incubated overnight at 37 °C in a 96-deepwell plate containing 1 mL of Terrific Broth (TB) and kanamycin. Minipreps were then performed using the PureLink Pro Quick96 Plasmid Purification Kit (Invitrogen #K211004A). Miniprep DNA of some mutants from the 96-well plate was sent for sequencing. The miniprep DNA was used to transform competent BL21 cells in a 96-well plate for subsequent protein expression and functional assays.
E. coli heat shock competent cell preparation in 96-well plate
DH5α, BL21, and BL21(Δmtn) competent cells were prepared in the lab and stored in a high-profile 96-well PCR plate (Bio-Rad #HSS9641) at -80 °C. Briefly, an overnight preculture (no antibiotics, 37 °C) was started from a single colony. LB medium (250 mL) was inoculated with 5 mL of the starter culture, grown at 30 °C until the OD reached 0.4–0.6, and then transferred to a cold room (4 °C) for 1 h. The cells were centrifuged at 1000xg for 20 min at 4 °C. The cells were then gently resuspended in Buffer RF1 (100 mM rubidium chloride, 50 mM manganese chloride, 30 mM potassium acetate, 10 mM calcium chloride, 15% w/v glycerol, pH 5.8), incubated at 4 °C for 30 min, then centrifuged. Subsequently, the cells were resuspended in 10 mL of RF2 buffer (10 mM MOPS, 10 mM rubidium chloride, 75 mM calcium chloride, 15% w/v glycerol, pH 5.8). Aliquots of 50 µL were dispensed into each well of a 96-well plate in a cold room, followed by sealing and snap-freezing in liquid nitrogen. The cells were stored at -80 °C until used.
Iodide detection assay for the ethyltransferase activity of HMT
We followed the iodide quantification protocol developed by Tang et al.25 that can specifically detect iodide and is insensitive to chloride concentration as described here. For HMT screening in 96-well plates, the iodide detection reagent consisted of 1 μL purified chloroperoxidase from Curvularia inaequalis (CiVCPO, 0.75 mg/mL) and 79 μL 3,3′,5,5′-tetramethylbenzidine (TMB, Sigma #T0565), to which 20 μL of HMT reaction I (see below) was added. For enzyme kinetics of purified HMT proteins, the iodide detection reagent contained 47 μL TMB, 0.5 μL CiVCPO (0.75 mg/mL), and 2.5 μL HMT reaction I. For the standard curve (Supplementary Fig. 13), 2.5 μL of potassium iodide (KI) at varying concentrations (5 μM to 500 μM) was added to 47 μL TMB and 0.5 μL CiVCPO (0.75 mg/mL), and absorbance at 570 nm was measured (Supplementary Fig. 11) using a Tecan Infinite plate reader. Variance of iodide detection assay for AtHMT activity in 96-well plate was measured using wild-type and V140T mutant (Supplementary Fig. 6).
High throughput screening for ethyltransferase activity of HMT
Single colonies of HMT mutants in BL21 (Δmtn) cells were picked and inoculated into 800 μL LB broth supplemented with 50 μg/mL kanamycin for preculture. From the preculture, 300 μL was used for glycerol stocks and stored at -80 °C. Then, 50 μL of the preculture was inoculated into 1 mL of TB (96-deepwell plate, 50 μg/mL kanamycin, 100 μM isopropyl β-D-1-thiogalactopyranoside (IPTG)) and incubated overnight at 30 °C, 900 rpm in a Cytomat automated shaking incubator. Cells were pelleted at 2900xg for 15 min, then lysed in 300 μL of lysis buffer (1.5 mg/mL lysozyme, 0.1x Bugbuster reagent, 10 μg/mL DNase I, 1x Halt Protease inhibitor, 50 mM sodium phosphate buffer, pH 7.5). Lysis was carried out at 30 °C for 1.5 h at 900 rpm, followed by centrifugation for 20 min at 2900xg at room temperature.
For Reaction I, 160 μL of crude cell lysate was mixed with 20 μL SAH (1 mM) and 20 μL ethyl iodide (5 mM), both freshly prepared in DMSO. After 1 h incubation at room temperature, 20 μL of Reaction I was added to 80 μL of Reaction II (1 μL CiVCPO + 79 μL TMB), and absorbance at 570 nm was measured using Tecan Infinite plate reader. A Python script calculated the slope of the increase in absorbance for each well, then calculated the fold change relative to the wild type. Each screening plate contained 90 mutants with controls (two wells each of wild type, V140T, and empty vector). For screening V140T/S99T triple mutants, each 96-well plate contained duplicates of 36 triple mutants containing V140T/S99T, six triple mutants from the third round, and the same controls as above.
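The slope and fold-change calculation can be sketched as follows. This is an illustration rather than the authors' script: it assumes an ordinary least-squares slope over all time points and normalization to the mean slope of the wild-type wells; the well names and absorbance values are hypothetical.

```python
def slope(times, values):
    """Ordinary least-squares slope of values against times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den

def fold_changes(times, traces, wt_wells):
    """Slope of each well divided by the mean slope of the wild-type wells."""
    slopes = {well: slope(times, trace) for well, trace in traces.items()}
    wt_mean = sum(slopes[w] for w in wt_wells) / len(wt_wells)
    return {well: s / wt_mean for well, s in slopes.items()}

# Toy absorbance traces at 570 nm (hypothetical values); A1 and A2 are wild type
times = [0, 1, 2, 3]
traces = {
    "A1": [0.0, 0.1, 0.2, 0.3],
    "A2": [0.0, 0.1, 0.2, 0.3],
    "B1": [0.0, 0.3, 0.6, 0.9],
}
fc = fold_changes(times, traces, wt_wells=["A1", "A2"])
```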
HMT kinetics assay
The Michaelis-Menten kinetics for HMT and its mutants were determined for ethyl iodide and methyl iodide. A 50 μL reaction was prepared with 1 mM SAH, 10 μL haloalkane, and purified HMT in 50 mM sodium phosphate buffer (pH 7.5). To prevent hydrolysis, 5x haloalkane stocks were freshly prepared in DMSO. For ethyl iodide, purified wild-type HMT (wtHMT) was used at a final concentration of 87 μM (2.4 mg/mL), V140T, round 2, and round 3 mutants at 29 μM, and round 4 mutants at 8.7 μM. For methyl iodide, wtHMT was used at 0.145 nM, and V140T, round 2, round 3, and round 4 mutants at 0.29 nM. After a 10 min incubation, 2.5 μL of the above reaction was added to the iodide assay reagent (0.5 μL CiVCPO + 47 μL TMB), and absorbance at 570 nm was measured to determine iodide concentration. For the kinetic assay, the concentration of haloalkanes was varied while keeping HMT and SAH concentration constant. The specific activity was calculated at 15 mM ethyl iodide for all purified HMT proteins. Autohydrolysis rates for each haloalkane concentration were simultaneously measured and subtracted from enzymatic reaction rates. The amount of iodide produced per minute was calculated using the standard curve (Supplementary Fig. 9) and fit to the Michaelis-Menten model using Origin software. All reactions were performed in triplicate.
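The background subtraction and Michaelis-Menten fit can be illustrated with a dependency-free sketch. Note the swap: the fits reported here were done in Origin, whereas this example uses a Hanes-Woolf linearization (S/v against S), which is a simpler and less robust technique chosen only to keep the example short. All names and values are hypothetical.

```python
def linear_fit(xs, ys):
    """Least-squares slope and intercept of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def fit_michaelis_menten(substrate, rates, background):
    """Estimate (Vmax, Km) after subtracting autohydrolysis rates.

    Hanes-Woolf form: S/v = S/Vmax + Km/Vmax.
    """
    corrected = [v - b for v, b in zip(rates, background)]
    ratios = [s / v for s, v in zip(substrate, corrected)]
    slope, intercept = linear_fit(substrate, ratios)
    vmax = 1.0 / slope
    return vmax, intercept * vmax

# Synthetic data generated from Vmax = 4 and Km = 5 (arbitrary units)
substrate = [1.0, 2.0, 5.0, 10.0, 20.0]
background = [0.01 * s for s in substrate]
rates = [4.0 * s / (5.0 + s) + b for s, b in zip(substrate, background)]
vmax, km = fit_michaelis_menten(substrate, rates, background)
```

On noisy experimental data a nonlinear least-squares fit of the untransformed rates is generally preferable, because linearization distorts the error structure.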
Purification of HMT proteins
We followed the published protocol25 to purify chloroperoxidase from Curvularia inaequalis (CiVCPO) and stored it in 50 mM sodium phosphate buffer (pH 8.0) with 100 µM sodium orthovanadate. For the iodide detection assay, 0.375 µg CiVCPO (0.5 µL at 0.75 mg/mL) was added per 50 μL reaction.
For HMT purification, starter cultures were prepared from glycerol stocks of mutant libraries in BL21 Δmtn (DE3) strain. For protein expression, a 1:100 preculture was added to TB (50 μg/mL kanamycin) and grown at 37 °C until the OD600 reached ~0.6. Cells were cooled and incubated overnight at 20 °C with 200 μM IPTG. The cells were harvested by centrifugation at 10,000 x g at 4 °C for 10 min, and the pellet was resuspended in lysis buffer (20 mM sodium phosphate, 0.5 M sodium chloride, 20 mM imidazole, 1× PIC, pH 7.5). Cells were lysed by sonication (10 min on ice, 20% amplitude), followed by centrifugation (12,000xg, 20 min at 4 °C). His6-tagged proteins were purified using immobilized metal-affinity chromatography (IMAC). Lysates were clarified through a 0.45 μm filter, and 2 mL of pre-washed nickel beads (Nuvia™ IMAC Resin, Bio-Rad #7800800) were added. After 45 min of incubation and mixing at 4 °C, the lysates were centrifuged at 2900xg, and the supernatant was discarded. The affinity beads were transferred to a column and washed with 10x volume of lysis buffer. Proteins were eluted with elution buffer (20 mM sodium phosphate, 0.5 M sodium chloride, and 0.2 M imidazole, pH 7.5), then desalted using PD-10 desalting columns (Cytiva #17085101) and Amicon Ultra 10 kDa centrifugal filters. The proteins were stored in 50 mM sodium phosphate (pH 7.5), and concentrations were determined using absorbance at 280 nm with a Nanodrop, using the HMT extinction coefficient (42,065 M−1cm−1). Purity was analyzed via SDS-PAGE (Supplementary Fig. 15).
High throughput screening for phytase activity using 4-MUP assay
Single colonies of phytase plasmids in BL21 cells were picked and inoculated into 800 μL LB broth supplemented with 50 μg/mL kanamycin for preculture in a 96-deepwell plate. 1 mL of autoinduction media (15 g/L peptone, 30 g/L yeast extract, 6.25 mL/L glycerol, 90 mL 1 M potassium phosphate buffer pH 7, 10 mL glucose (50 g/L), 100 mL lactose (20 g/L), and 50 μg/mL kanamycin) was inoculated with 5 μL aliquot of the preculture in a 96-deepwell plate. Phytase was expressed overnight (16 h, 37 °C, 900 rpm) in a Cytomat automated shaking incubator. Cells were pelleted by centrifugation (2900xg, 15 min, room temperature) and resuspended in 200 μL lysis buffer (1 mg/mL lysozyme, 50 mM Tris-HCl buffer, pH 7.5). Lysis was carried out by incubating at 37 °C, 900 rpm for 1 h, followed by centrifugation to remove cell debris.
Phytase activity was measured using a fluorescence-based 4-MUP assay26,27. Cleared cell lysate (10 μL) was added to 90 μL of 1.11 mM 4-MUP (4-methylumbelliferyl phosphate, Sigma #M8168) in Tris-maleate buffer (0.2 M, pH 6.6) in black, clear-bottom plates. The increase in fluorescence (λex = 354 nm / λem = 465 nm) was measured over time using a Tecan Infinite M1000 plate reader. Each screening plate contained 90 mutants and six control wells (wild type, M16 (T44V/K45E) mutant, empty vector, and the best mutant from the previous round). The slope of the fluorescence increase was calculated for each well using a Python script, with empty vector values subtracted at each time point for normalization.
Standard curve for 4-MUP assay
4-Methylumbelliferone (4-MU, Sigma #M1381) was dissolved in methanol, and 50 µL of the 4-MU solution was mixed with 50 µL of the appropriate pH buffer. For pH 4.5, the buffer was 0.25 M sodium acetate, 1 mM calcium chloride, 0.01% Tween-20. For pH 5.6 and 6.6, 0.2 M Tris-maleate buffer was used. Fluorescence (λex = 354 nm / λem = 465 nm) was measured using a Tecan Infinite plate reader and plotted against 4-MU concentration to generate a standard curve for each pH (Supplementary Fig. 17).
Phytase kinetics assay
Kinetics of phytase variants were determined using a 4-methylumbelliferyl phosphate (4-MUP) assay as previously described5, with the modifications described here. Assays at pH 5.6 and 6.6 were performed in 0.2 M Tris-maleate buffer, while assays at pH 4.5 were performed in 0.25 M sodium acetate, 1 mM calcium chloride, 0.01% Tween-20 buffer. Standard curves for pH 4.5, 5.6 and 6.6 were prepared for different concentrations of 4-MU (4-methylumbelliferone), the fluorescent product of 4-MUP (pH 4.5: 2.5-640 μM; pH 5.6: 2.5-320 μM; pH 6.6: 3.125-40 μM) and can be viewed in Supplementary Fig. 15. 50 μL of 4-MU dissolved in methanol was mixed with 50 μL of the specified buffer, followed by fluorescence measurement (λex = 354 nm / λem = 465 nm) on a Tecan Infinite plate reader. To prepare enzymes for kinetic assays, proteins were first mixed with 1 mM AEBSF (4-(2-aminoethyl)benzenesulfonyl fluoride hydrochloride) and incubated at 37 °C for 30 min. 90 μL of 4-MUP substrate in the respective buffer was then mixed with 10 μL of the enzyme solution, followed by kinetic measurement on the Tecan Infinite. Fluorescence data were converted to mM 4-MU using the appropriate standard curve, then divided by the enzyme concentration and used to calculate initial reaction rates. As Yersinia mollaretii phytase is a tetrameric enzyme with allosteric interactions, initial reaction rates were fit using OriginLab graphing software to the Hill equation y = Vmax * x^n / (Khalf^n + x^n), where y is the initial reaction rate, Vmax is the maximum reaction rate, x is the substrate concentration, Khalf is the substrate concentration that gives half the maximum reaction rate, and n is the Hill coefficient.
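For reference, the rate law used in the fit can be written as a small function. This sketch assumes the general Hill form with a cooperativity exponent n, which reduces to the hyperbolic expression Vmax * x / (Khalf + x) when n = 1; the function name and the numerical values are hypothetical.

```python
def hill_rate(x, vmax, k_half, n=1.0):
    """Initial rate at substrate concentration x under the Hill equation."""
    return vmax * x ** n / (k_half ** n + x ** n)

# Whatever the value of n, the rate at x = k_half is half of vmax
half_rate = hill_rate(2.0, vmax=10.0, k_half=2.0, n=2.5)

# With n = 1 the expression is the simple hyperbolic form
hyperbolic = hill_rate(1.0, vmax=10.0, k_half=4.0)
```

Values of n above 1 produce a sigmoidal rate curve, which is the usual signature of positive cooperativity between subunits.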
Purification of phytase proteins
For phytase protein purification, BL21 glycerol stock from each round was used to inoculate a starter culture in LB medium followed by growth at 37 °C for 12–16 h. 5 mL of starter cultures were added to flasks containing 500 mL Terrific Broth (6 g tryptone, 12 g yeast extract, 2 mL glycerol, 0.085 mol potassium dihydrogen phosphate, 0.36 mol dipotassium hydrogen phosphate) supplemented with 50 μg/mL kanamycin. Flasks were then shaken at 200 rpm and 37 °C for 12–16 h. Following growth, protein expression was induced by adding 100 µM IPTG and culturing for an additional 4 h at 200 rpm and 37 °C. The cells were harvested by centrifugation at 10,000xg at 4 °C for 10 min, and the pellet was resuspended in lysis buffer (20 mM sodium phosphate, 0.5 M sodium chloride, 30 mM imidazole, 1× Protease Inhibitor Cocktail (Sigma #P8849), pH 7.5) supplemented with 1 mg/mL lysozyme. Cells were then lysed for 1 h at 37 °C followed by centrifugation (12,000xg, 20 min at 4 °C). His6-tagged proteins were purified using immobilized metal-affinity chromatography (IMAC). Lysates were clarified through a 0.45 μm filter, and 3 mL of pre-washed nickel beads (Nuvia™ IMAC Resin, Bio-Rad #7800800) were added. After 45 min of incubation and mixing at room temperature, the lysates were centrifuged at 2900xg, and the supernatant was discarded. The affinity beads were transferred to a column and washed with 10x volume of lysis buffer. Protein was eluted by adding 10 mL elution buffer (20 mM sodium phosphate, 0.5 M sodium chloride, and 0.2 M imidazole, pH 7.5). Protein was concentrated and the buffer was exchanged to 0.25 M sodium acetate, pH 5.5, 1 mM calcium chloride, 0.01% Tween-20 using Amicon Ultra 10 kDa centrifugal filters. Concentrations were determined using absorbance at 280 nm with a Nanodrop and enzymes were stored at 4 °C. Purity was analyzed via SDS-PAGE (Supplementary Fig. 16).
PAGE gels for purified proteins
For SDS-PAGE, the purified proteins were mixed with 1x Laemmli sample buffer (Bio-Rad #1610737) containing 5% 2-mercaptoethanol (Sigma #M6250) and heated at 95 °C for 5 min. An equal amount of each sample was then loaded onto a 10% precast gel (Bio-Rad #4561034) with a protein ladder (Bio-Rad #1610374) and run in Tris/Glycine/SDS running buffer (Bio-Rad #1610732). After Coomassie staining, the gels were thoroughly washed before imaging. For native gels of phytase proteins, the purified proteins were mixed with native sample buffer (Bio-Rad #1610738), and an equal amount of each sample was loaded onto a 7.5% gel (Bio-Rad #4568024) and run in Tris/glycine running buffer (Bio-Rad #1610771). NativeMark unstained protein standard (Invitrogen #LC0725) was used as the ladder for native gels.
Worklist generation and laboratory automation
The seamless integration of various modules required generating worklists for PCR, which were created using Python scripts. Each round of mutagenesis added one additional mutation to plasmids generated in the previous round, such that the variants in the third round of evolution contained three total mutations. Thus, linearized plasmids containing mutations from one round of screening serve as templates for the next round. A Python script efficiently selected the minimal number of PCR templates needed for the next cycle. The linearized PCR templates were transferred to a 384-well plate, and a worklist was generated to distribute them into a 96-well PCR plate. Another worklist was created to distribute the primers using Tecan Fluent in 96-well PCR plates. Primers were ordered in a 96-well format from IDT and diluted to 2 μM stocks with 12.5 μL being added to each PCR reaction. Worklists were also used for spreading the transformed E. coli cells in 8-well omnitray LB plates. Laboratory automation was facilitated using the iBioFAB facility, which employs a Thermo F5 robotic arm to connect instruments and Thermo Momentum Scheduling software to execute integrated and automated experimental workflows. The comprehensive descriptions about instrumentation and automation are provided in Supplementary Fig. 2–5.
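The template selection and worklist logic can be sketched under simple assumptions: each variant is recorded as a tuple of mutations with the newest mutation last, its PCR template is the plasmid carrying all earlier mutations, and the smallest template set is the set of distinct parents. The function names and most mutation labels are hypothetical, and the tuple layout is not the Tecan worklist file format.

```python
def parent_template(variant):
    """Mutations carried over from the previous round (all but the newest)."""
    return tuple(variant[:-1])

def plan_round(variants):
    """Return the distinct templates and one transfer per destination well.

    Each worklist entry is (template index, destination well index, new mutation).
    """
    templates = sorted({parent_template(v) for v in variants})
    worklist = [
        (templates.index(parent_template(v)), dest, v[-1])
        for dest, v in enumerate(variants)
    ]
    return templates, worklist

# Three third-round variants built on two second-round templates (toy example)
variants = [
    ("V140T", "S99T", "A10G"),
    ("V140T", "S99T", "L25M"),
    ("V140T", "K30R", "A10G"),
]
templates, worklist = plan_round(variants)
```

Here three variants need only two templates, so two source wells are enough to fill the three destination wells.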
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.