Introduction
To speed up calculations, modern computers can run several processes in parallel, which is called parallel computing. In parallel computing, the workload is divided over several central processing units (CPUs). The aim of this project is to visualize how much time can be gained by parallel computing compared with sequential computing.
Not every task can be done in parallel: when one step of a process depends on the result of a previous step, those steps cannot run at the same time. Calculations that need to be performed on each entry of a large dataset are especially suitable for parallel computing, because the dataset can be split into smaller groups that are then processed in parallel. In this project, I calculate the logP value (the logarithm of the octanol-water partition coefficient) for all compounds in the Wikidata database [ref1] based on their simplified molecular-input line-entry system (SMILES) strings.
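A minimal sketch of this idea in R, using the base `parallel` package. The per-entry function `slow_score` and the four example SMILES are placeholders invented for illustration; the actual logP calculation in this project runs in the Nextflow pipeline, not in this code.

```r
library(parallel)

# Placeholder per-entry calculation (NOT a real logP calculation):
# it only counts the characters in a SMILES string.
slow_score <- function(smiles) {
  nchar(smiles)
}

# A few example SMILES strings (ethanol, benzene, acetic acid, propane).
smiles <- c("CCO", "c1ccccc1", "CC(=O)O", "CCC")

# Sequential version: one entry after the other.
sequential <- vapply(smiles, slow_score, integer(1), USE.NAMES = FALSE)

# Parallel version: a cluster of 2 worker processes; parSapply splits
# the entries over the workers and combines the results in input order.
cl <- makeCluster(2)
parallel_result <- parSapply(cl, smiles, slow_score, USE.NAMES = FALSE)
stopCluster(cl)

# Each entry is independent, so both versions give the same answer.
identical(sequential, parallel_result)
```

Because no entry needs the result of another entry, the order in which the workers finish does not matter. A calculation in which each step uses the previous result could not be split up this way.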
Methods
The compounds are obtained from Wikidata using the getSMILES.rq file provided by Egon Willighagen. This gives a tab-delimited text file containing the Wikidata link, canonical SMILES and isomeric SMILES for all compounds. The Nextflow file CPU_time_logP.nf uses this text file as input and calculates the logP values, using between 1 and 8 CPUs. It returns the calculation time for each number of CPUs in a tab-delimited text file called CPU_duration.tsv. This file is used here to plot the results.
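The Nextflow file itself is not reproduced here. As a rough R analogue of the kind of measurement it performs, the sketch below times the same job with different numbers of worker processes and collects the elapsed times in a table. The function `time_with_workers`, the stand-in workload `work` and the worker counts 1 and 2 are assumptions made for illustration, not the contents of CPU_time_logP.nf.

```r
library(parallel)

# Stand-in workload (hypothetical): a short pause per entry, so that
# the run time is dominated by the per-entry work.
work <- function(x) {
  Sys.sleep(0.05)
  x^2
}

# Time one run of the job with a given number of worker processes.
# Only the computation is timed, not starting or stopping the cluster.
time_with_workers <- function(n_workers, entries) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl))
  elapsed <- system.time(parLapply(cl, entries, work))[["elapsed"]]
  data.frame(cpus = n_workers, seconds = elapsed)
}

entries <- 1:8
timings <- do.call(rbind, lapply(1:2, time_with_workers, entries = entries))
print(timings)
```

Which part of the run is timed is a design choice: including the start-up of the workers in the measurement can hide the gain from parallel computing when the job itself is short.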
#Install required package stringr.
#install.packages('stringr')
library('stringr')
#Load the CPU_duration text file.
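CPU_duration <- read.table(file = 'CPU_duration.tsv', sep = '\t', header = FALSE, stringsAsFactors = FALSE)
#Split the time and seconds label, and store the time as a number for plotting.
CPU_duration[,2:3] <- str_split_fixed(CPU_duration[,2], " ", 2)
CPU_duration[,2] <- as.numeric(CPU_duration[,2])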
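To illustrate what the split does, the sketch below applies the same steps to a small hand-made table. The values and the " s" unit label are invented for illustration; the real contents of CPU_duration.tsv may be formatted differently.

```r
library(stringr)

# Hypothetical example rows: number of CPUs and a duration with a unit label.
example <- data.frame(V1 = c(1, 2, 4),
                      V2 = c("12.5 s", "11.9 s", "12.1 s"),
                      stringsAsFactors = FALSE)

# str_split_fixed returns a character matrix with one row per input
# string and (here) two columns: the number and the unit label.
parts <- str_split_fixed(example$V2, " ", 2)

# Store the numeric part as a number so it can be plotted, and keep the label.
example$V2 <- as.numeric(parts[, 1])
example$V3 <- parts[, 2]
str(example)
```

The split produces text, so the explicit `as.numeric` step is what turns the durations into numbers.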
Results
#Plot the calculation time vs. the number of CPUs.
plot(CPU_duration[,1],
     CPU_duration[,2],
     main = "Figure 1: Calculation time vs. the number of CPUs",
     xlab = "number of CPUs",
     ylab = "Calculation time (s)"
     )
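A common way to summarise such timings is the speedup S(n) = T(1) / T(n), where T(n) is the calculation time with n CPUs, together with the efficiency S(n) / n. Ideal scaling would give S(n) = n. A small sketch with invented timings (these are not the measured values from this project):

```r
# Invented timings for illustration only: seconds needed with 1, 2, 4 and 8 CPUs.
cpus    <- c(1, 2, 4, 8)
seconds <- c(80, 40, 25, 20)

# Speedup relative to the single-CPU run, and efficiency per CPU.
speedup    <- seconds[cpus == 1] / seconds
efficiency <- speedup / cpus

data.frame(cpus, seconds, speedup, efficiency)
```

A speedup that stays close to 1 for every number of CPUs means that adding CPUs did not shorten the calculation.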

Discussion
As can be seen in Figure 1, the computation time is not reduced when using more CPUs. One possible cause is that the logP value cannot be calculated for many of the compounds, and the process skips every SMILES for which the calculation fails. As a result, running the entire process sequentially could already be so fast that the improvement from parallel computing is hard to measure. A second possibility is that the method used here to measure the calculation time does not reflect the actual time spent on the logP calculation. Due to time constraints, it has not been possible to resolve these issues in the current project. It would, however, be interesting to address them in future versions of the code.
References
ref1: Wikidata, https://www.wikidata.org/wiki/Wikidata:Main_Page (accessed 12-10-2019)