---
title: Public datasets
sidebar_position: 4
---

# Public datasets

Datagrok hosts a number of public datasets that are ready to use for testing and prototyping:

| Dataset | Description |
|---------|-------------|
| Bioactive molecules with drug-like properties - ChEMBL data | A dataset from EBI's manually curated chemical database of bioactive molecules with drug-like properties. |
| Clinical trials - AACT data | The AACT (Aggregate Analysis of ClinicalTrials.gov) dataset is a publicly available, comprehensive resource with information on every study registered in ClinicalTrials.gov, including protocol and result data elements for each clinical trial. |
| Toxic chemical data - ToxCast data | EPA's most up-to-date, publicly available high-throughput toxicity data on thousands of chemicals, generated through the EPA's ToxCast research effort. The dataset includes qualitative results of over 600 experiments on 8,000 compounds. |
| Toxic chemical data - Tox21 Data Challenge 2014 | A dataset created as part of the initiative to build a public database measuring the toxicity of compounds, and used in the 2014 Tox21 Data Challenge. It provides assay activity data and chemical structures for the Tox21 collection of ~10,000 compounds (Tox21 10K). |
| Drug side effects - EMBL's SIDER | The SIDER (Side Effect Resource) dataset is a comprehensive database of marketed drugs and their associated adverse drug reactions (ADRs). The SIDER dataset in DeepChem groups drug side effects into 27 system organ classes following the MedDRA (Medical Dictionary for Regulatory Activities) classification. The dataset covers 1,427 approved drugs and contains information on their chemical structures, associated ADRs, and the frequency of these side effects. |
| Bioassays: small molecules - PCBA | A subset of the PubChem BioAssay (PCBA) dataset containing biological activities of small molecules generated by high-throughput screening. The selection consists of 128 assays measured over 400,000 compounds. |
| MUV data | A benchmark dataset selected from PubChem BioAssay by applying a refined nearest neighbor analysis. The MUV dataset contains 17 challenging tasks for around 90,000 compounds and is specifically designed for validating virtual screening techniques. |
| Lipophilicity - ChEMBL data | A curated dataset from the ChEMBL database with experimental results on the octanol/water distribution coefficient (logD at pH 7.4). |
| AIDS antiviral screen data - NCI DTP data | A dataset introduced by the Developmental Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability of over 40,000 compounds to inhibit HIV replication. |

## Synthetic datasets

In addition to public datasets, you can use the following synthetic datasets:

| Table name | Description |
|------------|-------------|
| Demog | Synthetic demographics data |
| Biosensor | Simulated biosensor signal (3-axis accelerometer, temperature, and EDA) |
| Plates | Experimental plate data: barcode, row, col, volume |
| Random walk | N columns; each row value differs from the previous one by a small delta |
| Molecules | Chemical compounds in SMILES format and their lipophilicity properties |
| Geo | Data points with their geographical coordinates |
| Stock prices | Stock prices (company ticker, date, price) |
| Dose-response | Effects of various compounds on cell viability |
| Cars | Information about cars |
| Customers | Customer ID and name |
| Orders | Customer orders (customer id, item, quantity, price) |
| Products | Products (product, id, category, price) |

To access these datasets, follow these steps:

  1. On the Sidebar, click the Hamburger icon > Tools > Dev > Open test dataset. The Open test dataset dialog opens.
  2. In the dialog, set the desired number of rows and columns, and select the demo table.
  3. Click OK to open the generated test dataset in Datagrok.
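
To get a feel for what the Random walk table contains for a given number of rows and columns, here is a minimal Python sketch of the idea described in the table above. It is an illustration only, not Datagrok's own generator: the function name `random_walk_table`, the `col_N` column names, the step size `max_delta`, and the fixed seed are all assumptions made for this example.

```python
import random


def random_walk_table(rows, cols, max_delta=1.0, seed=0):
    """Return a dict of `cols` columns with `rows` values each.

    Every value differs from the previous value in the same column
    by a random step no larger than `max_delta` (hypothetical parameter).
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same table
    table = {}
    for c in range(cols):
        value = 0.0  # each column starts its walk at zero
        column = []
        for _ in range(rows):
            column.append(value)
            value += rng.uniform(-max_delta, max_delta)  # small delta per row
        table[f"col_{c + 1}"] = column  # hypothetical column naming
    return table


if __name__ == "__main__":
    demo = random_walk_table(rows=5, cols=3)
    for name, column in demo.items():
        print(name, [round(v, 2) for v in column])
```

Passing larger `rows` and `cols` values mimics choosing a bigger table in the dialog; the row-to-row differences stay bounded by `max_delta` regardless of size.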

:::tip

You can connect to public providers, such as OpenWeatherMap, Alpha Vantage, and commerce.gov, by importing their Swagger files. To learn more about connecting to web services, see OpenAPI.

:::