## Installation
```
$ pip install sequence-models
$ pip install git+https://github.com/microsoft/protein-sequence-models.git # bleeding edge, current repo main branch
```
```
from sequence_models.pretrained import load_model_and_alphabet
model, collater = load_model_and_alphabet('carp_640M')
```
Available models are
- `carp_600k`
- `carp_38M`
To encode a batch of sequences:
```
seqs = [['MDREQ'], ['MGTRRLLP']]
x = collater(seqs)[0] # (n, max_len)
rep = model(x) # (n, max_len, d_model)
```
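If you want one vector per sequence rather than per-token representations, a simple option is to pool over the length dimension. The sketch below is not from the repository; note that this naive mean also averages over any padding positions added by the collater, which you may want to mask out.
```
# Sketch (not part of the repo): average per-token representations into
# one embedding per sequence. Padding positions are included in this mean.
per_sequence_rep = rep.mean(dim=1)  # (n, d_model)
```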
CARP also supports computing representations from arbitrary layers and the final logits.
```
rep = model(x, repr_layers=[0, 2, 32], logits=True)
```
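The structure of what this call returns is not shown in this excerpt, so as a quick, assumption-free check you can inspect it before relying on a particular layout:
```
# Exploratory sketch: the return format is not documented in this excerpt,
# so inspect it rather than assuming where the representations and logits live.
out = model(x, repr_layers=[0, 2, 32], logits=True)
print(type(out), out.keys() if hasattr(out, 'keys') else None)
```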
### Compute embeddings in bulk from FASTA
We provide a script that efficiently extracts embeddings in bulk from a FASTA file. A CUDA device is optional and will be auto-detected. The following command uses the `carp_640M` model to extract embeddings from layers 0, 32, and 33, plus the per-token logits, for every sequence in a FASTA file:
```
$ python scripts/extract.py carp_640M examples/some_proteins.fasta \
examples/results/some_proteins_emb_carp_640M/ \
--repr_layers 0 32 33 logits --include mean per_tok
```
The directory `examples/results/some_proteins_emb_carp_640M/` now contains one `.pt` file per extracted embedding; use `torch.load()` to load them. `scripts/extract.py` has flags that determine what outputs are saved:
`--repr_layers` (default: final layer only) selects which layers to include embeddings from. `0` is the input embedding, and `logits` gives the per-token logits.
`--include` specifies what embeddings to save. You can use the following:
- `per_tok` includes the full sequence, with an embedding per amino acid (seq_len x hidden_dim).
- `mean` includes the embeddings averaged over the full sequence, per layer (only valid for representations).
- `logp` computes the average log probability per sequence and stores it in a csv (only valid for logits).
`scripts/extract.py` also has `--batchsize` and `--device` flags. For example, to use GPU 2 on a multi-GPU machine, pass `--device cuda:2`. The defaults are a batch size of 1 and `cpu` if CUDA is not detected, or `cuda:0` if it is.
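To work with the saved embeddings afterwards, something like the following should suffice. The file name is a placeholder, and the exact key layout inside each `.pt` file is determined by `scripts/extract.py`, so print the keys rather than assuming them.
```
import torch

# '<sequence_id>.pt' is a placeholder for one of the files written above.
emb = torch.load('examples/results/some_proteins_emb_carp_640M/<sequence_id>.pt')
# Inspect what was saved (per-token embeddings, means, logits, ...)
# before relying on specific key names.
print(emb.keys() if hasattr(emb, 'keys') else type(emb))
```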
## Masked Inverse Folding (MIF) and Masked Inverse Folding with Sequence Transfer (MIF-ST)
```
batch = [[wt, torch.tensor(dist, dtype=torch.float),
torch.tensor(omega, dtype=torch.float),
torch.tensor(theta, dtype=torch.float), torch.tensor(phi, dtype=torch.float)]]
src, nodes, edges, connections, edge_mask = collater(batch)
# can use result='repr' or result='logits'. Default is 'repr'.
rep = model(src, nodes, edges, connections, edge_mask)
```
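Per the comment in the block above, logits can be requested instead of representations; a minimal sketch, assuming only the `result` keyword shown there:
```
# Same inputs as above; result='logits' requests per-token logits instead
# of representations (per the comment in the preceding block).
logits = model(src, nodes, edges, connections, edge_mask, result='logits')
```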
### Compute embeddings in bulk from csv
We provide a script that efficiently extracts embeddings in bulk from a csv file. A CUDA device is optional and will be auto-detected. The following command uses the `mifst` model to extract per-token and mean-pooled representations for every sequence in a csv file:
```
$ python scripts/extract_mif.py mifst examples/gb1s.csv \
examples/ \
examples/results/some_proteins_mifst/ \
repr --include mean per_tok
```
The directory `examples/results/some_proteins_mifst/` now contains one `.pt` file per extracted embedding; use `torch.load()` to load them. `scripts/extract_mif.py` has flags that determine what outputs are saved.
The syntax is:
```
$ python scripts/extract_mif.py <model> <csv_fpath> <pdb_dir> <out_dir> <result> --include <pooling options>
```
The input csv should have columns for `name`, `sequence`, and `pdb`. The script looks in `pdb_dir` for the filenames in the `pdb` column.
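As an illustration of that layout, here is a sketch (not from the repository) that writes a minimal input csv; the row values reuse the `MDREQ` toy sequence from above, and the PDB file name is a placeholder that must exist inside `pdb_dir`.
```
import csv

# Placeholder row: 'MDREQ' is the toy sequence used earlier in this README,
# and 'example.pdb' must be present in the pdb_dir passed to the script.
rows = [{'name': 'example', 'sequence': 'MDREQ', 'pdb': 'example.pdb'}]
with open('examples/my_inputs.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'sequence', 'pdb'])
    writer.writeheader()
    writer.writerows(rows)
```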
The options for `result` are `repr` or `logits`.
`--include` specifies what embeddings to save. You can use the following:
- `per_tok` includes the full sequence, with an embedding per amino acid (seq_len x hidden_dim).
- `mean` includes the embeddings averaged over the full sequence, per layer (only valid for representations).
- `logp` computes the average log probability per sequence and stores it in a csv (only valid for logits).
`scripts/extract_mif.py` also has a `--device` flag. For example, to use GPU 2 on a multi-GPU machine, pass `--device cuda:2`. The default is `cpu` if CUDA is not detected, or `cuda:0` if it is.
## Biosynthetic gene cluster CARP (BiGCARP)
We make available pretrained CNN Pfam domain masked language models of BGCs. All of these have a ByteNet encoder architecture and are pretrained on antiSMASH using the same masked language modeling task as in BERT and ESM-1b.