Snakemake workflows¶
Introduction¶
The workflow is a collection of rules that define the construction of both the DTI and the protein pretraining datasets. Run it with the following command:
snakemake -j 1 --use-conda --configfile your_config_file.yaml
Configuration¶
The config files for the snakemake workflow are located in the config/snakemake/ directory.
The source field is required, because it sets the working directory. The results of the pipeline are written to the results directory inside that same folder.
Examples of files for most standard DTI datasets are provided in the aforementioned directory.
File naming¶
The resulting files are named according to the string entries in the config file, followed by a string representing a hash of the config dictionary. While this decreases human readability, it is necessary to ensure that results are not overwritten once the config changes.
For example, given the following config:
prots:
  structs:
    method: whole
  features:
    method: distance
    node_feats: onehot
    edge_feats: none
drugs:
  max_num_atoms: 150
  node_feats: label
  edge_feats: none
split_data:
  method: random
parse_dataset:
  filtering: all
  sampling: none
  task: class
The resulting file will be <target>/results/prepare_all/wdonlnranc_0f6b0ac6.pkl.
In this file name, the first part (wdonlnranc) is a human-readable abbreviation of the config (w for whole, d for distance, etc.), while the second part (0f6b0ac6) is a hash of the config.
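The naming scheme can be illustrated with a short sketch. It assumes that the prefix is built from the first letter of every string value in the config, taken in order, which reproduces wdonlnranc for the example above. The helper names (`string_leaves`, `result_name`) and the choice of SHA-256 over a sorted JSON dump are assumptions made for illustration only; the pipeline's actual hash function is not described here, so the hash part will not match 0f6b0ac6.

```python
import hashlib
import json


def string_leaves(config):
    """Yield every string value of a nested dict, in insertion order."""
    for value in config.values():
        if isinstance(value, dict):
            yield from string_leaves(value)
        elif isinstance(value, str) and value:
            yield value


def result_name(config):
    """Build '<first letters>_<8 hex chars>.pkl' (illustrative scheme only)."""
    prefix = "".join(entry[0] for entry in string_leaves(config))
    # Assumption: any stable hash of the whole config serves the same purpose.
    serialised = json.dumps(config, sort_keys=True).encode()
    digest = hashlib.sha256(serialised).hexdigest()[:8]
    return f"{prefix}_{digest}.pkl"


config = {
    "prots": {
        "structs": {"method": "whole"},
        "features": {"method": "distance", "node_feats": "onehot", "edge_feats": "none"},
    },
    "drugs": {"max_num_atoms": 150, "node_feats": "label", "edge_feats": "none"},
    "split_data": {"method": "random"},
    "parse_dataset": {"filtering": "all", "sampling": "none", "task": "class"},
}

name = result_name(config)
print(name)  # starts with "wdonlnranc_"
```

Note that max_num_atoms does not contribute to the prefix because it is not a string. A change to it is only visible in the hash part, which is why the hash is needed to keep results from different configs apart.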
File structure¶
It is recommended to organise your datasets folder as follows:
dataset1
└── resources
    ├── structures
    │   ├── struct1.pdb
    │   ├── struct2.pdb
    │   └── struct3.pdb
    ├── tables
    │   ├── inter.tsv
    │   ├── lig.tsv
    │   └── prot.tsv
    └── templates
        └── template1.pdb
After running the snakemake workflow for dataset1 and dataset2, the following files and directories will be generated (the actual directories might differ, depending on your config):
test_data
├── resources
│   ├── structures
│   ├── tables
│   └── templates
└── results
    ├── parse_dataset
    ├── parsed_structs
    ├── prepare_all
    ├── prepare_drugs
    ├── prot_data
    ├── pymol_logs
    ├── pymol_scripts
    ├── rinerator
    └── split_data
DTI dataset creation¶
In order to create a DTI dataset, the following requirements have to be met:
- PDB structures, located in the <source>/resources/structures directory
- Necessary tsv tables, located in the <source>/resources/tables directory:
  - <source>/resources/tables/inter.tsv - The interactions data, has to contain Drug_ID, Target_ID and Y columns
  - <source>/resources/tables/lig.tsv - The ligand data, has to contain Drug_ID and Drug columns, where Drug contains the SMILES representation of the drug
  - <source>/resources/tables/prot.tsv - The protein data, has to contain Target_ID and Target columns, where Target contains the protein sequence
- The only_proteins entry in the snakemake config has to be false
After running the pipeline with snakemake -j 16 --use-conda --configfile your_config_file.yaml, the pickle file should be created in the <target>/results/prepare_all/ folder.
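Before starting a long run, it can be useful to check the requirements listed above. The following is a minimal sketch that uses only the Python standard library. The function name `check_dti_source` and the `REQUIRED_COLUMNS` mapping are hypothetical helpers written for this example and are not part of the pipeline. The sketch only inspects the directory layout and the table headers, not the table contents.

```python
import csv
from pathlib import Path

# Required columns per table, as listed in the requirements above.
REQUIRED_COLUMNS = {
    "inter": {"Drug_ID", "Target_ID", "Y"},
    "lig": {"Drug_ID", "Drug"},
    "prot": {"Target_ID", "Target"},
}


def check_dti_source(source):
    """Return a list of problems; an empty list means the layout looks fine."""
    source = Path(source)
    problems = []

    # At least one PDB structure is expected in resources/structures.
    structures = source / "resources" / "structures"
    if not (structures.is_dir() and any(structures.glob("*.pdb"))):
        problems.append(f"no .pdb files found in {structures}")

    # Each table must exist and its header row must contain the required columns.
    for name, required in REQUIRED_COLUMNS.items():
        table = source / "resources" / "tables" / f"{name}.tsv"
        if not table.is_file():
            problems.append(f"missing table {table}")
            continue
        with table.open(newline="") as handle:
            header = next(csv.reader(handle, delimiter="\t"), [])
        missing = required - set(header)
        if missing:
            problems.append(f"{table.name} lacks columns: {sorted(missing)}")
    return problems
```

Calling `check_dti_source("dataset1")` returns an empty list when the structures and all three tables are in place, and one message per problem otherwise.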
File validation¶
The following code can be used to validate the configuration file:
from snakemake.utils import validate
from rindti.utils import read_config
default_config = read_config('config/snakemake/default.yaml')
your_config = read_config('config/snakemake/your_config.yaml')
default_config.update(your_config)
validate(default_config, 'workflow/schemas/config.schema.yaml')
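One detail of the snippet above is worth illustrating: `dict.update` merges only the top level, so a nested section in your config replaces the corresponding default section as a whole. Whether this matters depends on what `read_config` returns; the sketch below assumes plain nested dictionaries. The `deep_update` helper is hypothetical and not part of rindti or snakemake.

```python
import copy


def deep_update(base, override):
    """Return a copy of base with override merged in recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are sections: merge them key by key.
            merged[key] = deep_update(merged[key], value)
        else:
            # Scalars, lists and new keys simply take the override value.
            merged[key] = copy.deepcopy(value)
    return merged


defaults = {"drugs": {"max_num_atoms": 150, "node_feats": "label"}}
override = {"drugs": {"max_num_atoms": 100}}

shallow = dict(defaults)
shallow.update(override)                 # replaces the whole "drugs" section
merged = deep_update(defaults, override)  # keeps node_feats from the defaults

print(shallow)
print(merged)
```

With the shallow update, node_feats disappears from the drugs section. The recursive merge keeps it while still applying the overridden max_num_atoms.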
The following code can be used to validate the tables:
from snakemake.utils import validate
import pandas as pd
for i in ['inter', 'lig', 'prot']:
    # Read each table and validate it against the schema of the same name
    df = pd.read_csv(f'test/test_data/resources/tables/{i}.tsv', sep='\t')
    validate(df, f'workflow/schemas/{i}.schema.yaml')
Protein dataset creation¶
In order to create the protein-only dataset (for pretraining), the following requirements have to be met:
- PDB structures, located in the <source>/resources/structures directory
- The only_proteins entry in the snakemake config has to be true
After running the pipeline with snakemake -j 16 --use-conda, the pickle file should be created in the <target>/results/prot_data/ folder.
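To confirm that the run produced an output, the newest pickle in the results folder can be located and loaded. This is a minimal sketch: the helper names `newest_pickle` and `load_result` are hypothetical, and nothing is assumed about the structure of the stored object. Unpickling may require that the packages defining the stored objects are installed, and pickle files should only be loaded from sources you trust.

```python
import pickle
from pathlib import Path


def newest_pickle(folder):
    """Return the most recently modified .pkl file in folder, or None."""
    candidates = sorted(Path(folder).glob("*.pkl"), key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None


def load_result(folder):
    """Load the newest pickle in folder and return (path, object)."""
    path = newest_pickle(folder)
    if path is None:
        raise FileNotFoundError(f"no .pkl files in {folder}")
    with path.open("rb") as handle:
        return path, pickle.load(handle)
```

For example, `load_result("<target>/results/prot_data")` returns the path of the newest pickle together with its contents, with `<target>` replaced by your own directory.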