We downloaded association evidence for all direct target-disease association pairs from the Open Targets (OT) Platform via its API. Disease terms were filtered to remove non-specific terms, and we also removed disease terms whose therapeutic areas were measurement, phenotype, biological process, and/or cell proliferation disorders. After this processing, the filtered dataset included 1,378,786 direct target-disease associations for 24,064 unique targets. Among these, 990 targets had at least one indication in clinical trials. The OT associations for these targets were used to build a working dataset for model evaluation, which consisted of 229,228 target-disease pairs. The remaining 23,074 targets and their disease associations were used as a prediction dataset for novel indication prediction. The working dataset was further split into training and testing sets of 70% and 30% for model evaluation: the training set contained 159,249 target-disease pairs for 693 unique targets, and the testing set contained 69,979 target-disease pairs for 297 unique targets.
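The reported counts (693 and 297 unique targets, summing to 990) suggest that the 70/30 split was made at the target level, so that all pairs for a given target fall in the same set. The following is a minimal sketch under that assumption; the function name `split_by_target`, the seed, and the toy pairs are hypothetical and are not taken from the original analysis.

```python
import random

def split_by_target(pairs, train_fraction=0.7, seed=0):
    """Split (target, disease) pairs so each target lands entirely in one set.

    Assumption: the split is made over unique targets, not over pairs.
    """
    targets = sorted({target for target, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(targets)
    n_train = round(len(targets) * train_fraction)
    train_targets = set(targets[:n_train])
    train = [p for p in pairs if p[0] in train_targets]
    test = [p for p in pairs if p[0] not in train_targets]
    return train, test

# Toy data: 10 hypothetical targets with 1 to 3 disease associations each.
pairs = [(f"T{i}", f"D{j}") for i in range(10) for j in range(i % 3 + 1)]
train, test = split_by_target(pairs)
train_targets = {t for t, _ in train}
test_targets = {t for t, _ in test}
```

Because targets differ in how many associations they have, a 70/30 split over targets gives only approximately 70/30 over pairs, which is consistent with the pair counts reported above.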
To derive the target-target similarity matrix based on tissue specificity, we downloaded RNA-seq data from GTEx v8 [34], in which samples were obtained from 54 tissues in 30 tissue groups. We removed lowly expressed genes, keeping only genes with a non-zero count in at least one tissue. The limma package [35] was used to estimate tissue specificity by comparing samples from one tissue with samples from all other tissues in different tissue groups; the linear model included age and gender as covariates. We then estimated gene similarity by computing the cosine similarity on the tissue specificity matrix.
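The final step, cosine similarity between gene-level tissue specificity profiles, can be sketched as follows. This is only an illustration: the gene names and values are invented, and the real profiles would be the per-tissue statistics estimated with limma, whose exact form is not specified here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero profiles
    return dot / (norm_u * norm_v)

def similarity_matrix(profiles):
    """Pairwise cosine similarity for a dict of gene -> tissue profile."""
    genes = sorted(profiles)
    return {(g, h): cosine_similarity(profiles[g], profiles[h])
            for g in genes for h in genes}

# Hypothetical tissue specificity scores (one value per tissue).
tissue_specificity = {
    "GENE_A": [2.0, -1.0, 0.5],
    "GENE_B": [4.0, -2.0, 1.0],    # same pattern as GENE_A, larger magnitude
    "GENE_C": [-2.0, 1.0, -0.5],   # opposite pattern to GENE_A
}
sim = similarity_matrix(tissue_specificity)
```

Cosine similarity depends only on the direction of the profile, so genes with the same tissue pattern at different magnitudes score close to 1, and genes with opposite patterns score close to -1.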
The protein-protein interaction network was downloaded from the STRING database [16] to calculate a node embedding similarity matrix for gene pairs. STRING is one of the most comprehensive protein-protein interaction networks, containing both known and predicted interactions. Each edge is given a weight that indicates the degree of confidence in the interaction. To generate a reliable, high-confidence reference network for our approach, we kept only interactions with a STRING-defined confidence score greater than 0.5. Many analysis approaches can be used to estimate node similarity in a network, such as random walk [38] and network propagation [39]. In our approach, we used node2vec [17], an embedding model that performs random walks through the network by starting at a random node and following a series of steps to random neighbors. Each of these random walks forms a sentence that can be fed to word2vec [40] to generate an embedding for each node. Compared with other algorithms, node2vec captures both homophily and structural similarity by combining depth-first and breadth-first search, and it generates node embeddings that can be carried into predictive models for deeper investigation. Gene similarities can then be computed systematically as the cosine similarity between node embeddings.
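The edge filtering and walk generation described above can be sketched in plain Python. This is a simplified illustration rather than the pipeline actually used: the edge list is invented, the values chosen for the return parameter `p`, the in-out parameter `q`, and the walk length are arbitrary, and the word2vec training step (for example with a library such as gensim) is left out.

```python
import random
from collections import defaultdict

def build_graph(edges, min_score=0.5):
    """Keep undirected edges whose confidence score exceeds min_score."""
    graph = defaultdict(dict)
    for a, b, score in edges:
        if score > min_score:
            graph[a][b] = score
            graph[b][a] = score
    return graph

def biased_walk(graph, start, length, p=1.0, q=1.0, rng=random):
    """One node2vec-style second-order random walk of at most `length` nodes."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(graph[cur])
        if not nbrs:
            break
        weights = []
        for n in nbrs:
            w = graph[cur][n]
            if len(walk) > 1:
                prev = walk[-2]
                if n == prev:
                    w /= p              # step back to the previous node
                elif n not in graph[prev]:
                    w /= q              # step away from the previous node
            weights.append(w)
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

# Hypothetical scored interactions on a 0-1 confidence scale.
edges = [("P1", "P2", 0.9), ("P2", "P3", 0.7),
         ("P3", "P1", 0.6), ("P3", "P4", 0.4)]
graph = build_graph(edges)
rng = random.Random(0)
walks = [biased_walk(graph, node, 5, p=1.0, q=0.5, rng=rng)
         for node in sorted(graph) for _ in range(3)]
```

Each walk is a list of node identifiers that plays the role of a sentence for the embedding step. In this sketch a `q` below 1 favors outward, depth-first-like exploration, while a `q` above 1 keeps the walk closer to its previous node, which is breadth-first-like.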