Skip to content

Latest commit

 

History

History
359 lines (293 loc) · 23.4 KB

access-data.md

File metadata and controls

359 lines (293 loc) · 23.4 KB

Data access

This article will walk you through different ways of data extraction, from managing connections to data sources and running queries, to using REST APIs, and to reading files into dataframe objects in your application.

Connections

In the Datagrok platform, you can retrieve data from a variety of sources, be it a file, database, cloud, or web service. And what's remarkable, this can be accomplished equally well from the user interface and from your program. You can learn more about connections and look at how to create them from the UI in the dedicated article, while here we go over the technical details for developers, namely:

Let's outline a general workflow for accessing data using connections. You will add a connection to your package, use it to query the database, and get a dataframe after running the query in JavaScript code. This is how you obtain the data for further work.

Adding connections

As with everything else, development starts with a package. Packages may contain one or more data connections under the connections folder. For each connection, you need to create a separate json file with the required parameters. Here is an example:

{
  "name": "ChEMBL",
  "parameters": {
    "server": "$GROK_DB_SERVER",
    "db": "chembl_24"
  },
  "dataSource": "PostgreSQL",
  "description": "CHEMBL db",
  "tags": [
    "demo",
    "chem"
  ]
}

The field name is optional, if omitted, the file name (without the extension) will be used as the connection name. In any case, remember that you should not rely on letter case to distinguish between connections, since their names are not case-sensitive. Giving parameters for the connection in json is completely equivalent to what you can do from the platform's interface: you would go to Data | Databases and right-click on the data source PostgreSQL to add such a connection (or, more generally, perform it from Actions | Add New Connection).

Our package utilities provide a similar template on running the grok add connection <name> command. To see other examples, open Chembl or UsageAnalysis packages in our public repository.

Parameters

Connection parameters are specific to a data source. However, most of the data providers require some of these common parameters to be specified:

Data Source Server Port DB Cache Schema Cache Results SSL Connection String Login Password Other Parameters
Access
Athena See the list
BigQuery See the list
Cassandra
DB2
Denodo
DropBox See the list
Files See the list
Firebird
Git See the list
Google Cloud See the list
HBase
Hive
Hive2
Impala See the list
MariaDB
MongoDB
MS SQL
MySql
Neo4j
OData See the list
Oracle
PostgresNet
PostgreSQL
Redshift
S3 See the list
Snowflake
Socrata See the list
Sparql See the list
SQLite
Teradata
Twitter See the list
Vertica
Virtuoso
Web See the list

When providing a connection string, you do not have to pass other parameters, as they will not be taken into account. The only exception is credentials: for such parameters as Login and Password, or Access Key and Secret Key, the case is different. We will cover that in the next section. For now, let's assume that parameters carrying sensitive data form a distinct category. Given the parameters listed above, in the most general case, you would add a connection with similar contents to pull data from a provider:

{
  "name": "Northwind",
  "parameters": {
    "server": "dev.datagrok.ai",
    "port": 23306,
    "db": "Northwind",
    "cacheSchema": false,
    "cacheResults": false,
    "ssl": false,
    "connString": ""
  },
  "credentials": {
    "parameters": {
      "login": "",
      "password": ""
    }
  },
  "dataSource": "MariaDB",
  "description": "Northwind Connection",
  "tags": [
    "demo"
  ]
}

Managing credentials

Private information is always a special case. Datagrok has a built-in credentials management system that protects such data (for more details, see the Security article). For this reason, parameters of a connection that regulate access to the resource are processed independently. Therefore, you should not include these parameters to a custom connection string. However, pushing them in a json file to your project's repository is not a good idea either:

{
  "credentials": {
    "parameters": {
      "login": "<login>",
      "password": "<password>"
    }
  }
}

What you can do instead is to deploy a connection and send a POST request to $(GROK_HOST)/api/credentials/for/$(PACKAGE_NAME).$(CONNECTION_NAME) with raw body containing JSON, such as {"login": "abc", "password": "123"}, and headers {"Authorization": $(API_KEY), "Content-Type": "application/json"} (the API key should be taken from your profile page in Datagrok, e.g., https://public.datagrok.ai/u).

Queries

Creating queries

Once the connection is established, the next step is to extract data. This can be done by sending a query to the data source. In a package, queries are typically placed in the queries folder. Let's start with a simple example for your queries.sql file:

--name: protein classification
--connection: chembl
select * from protein_classification
--end

SQL statements are annotated with comments, just like scripts, since the underlying mechanism is essentially the same (read more on the concept of functions). Here we have two header parameters: the query name and the connection to use. In fact, this particular query could have been even simpler: there is no need to specify connection if the package only has one. Similarly, the tag end is not required if there is only one query per file: the parser needs it to understand where the current query ends and the next one begins. So safely omit the name of connection and/or the end tag if these conditions are met.

To use an existing connection in a query, specify its name along with the namespace in the connection parameter. For example, the above chembl connection that lives in the Chembl package has the following path: chembl:chembl. When browsing connections on the platform, you can identify such path by opening Links in the tab Details of the property panel.

You can find a list of header parameters and other details related to the query annotation in this article. In addition to this, examples of data queries are available in the Chembl package. To quickly insert a query template into your package, type grok add query <name> in the terminal.

Executing queries

There are several ways in which queries can be run in Datagrok. The first and most natural way is to launch a query from the interface, which will be equivalent to the line $(PACKAGE_NAME):$(QUERY_NAME)() in the console, for example, Chembl:ProteinClassification(). As you might know, it is possible to call any function that can be run in the console through Datagrok's JS API. Thus, the fact that a query behaves like a regular function allows us to use the corresponding method in JavaScript:

grok.functions.call('Chembl:ProteinClassification')
  .then(t => grok.shell.addTableView(t));

There is also a special method for queries that takes the query name as a required parameter and a few additional ones ( query parameters, whether it is ad hoc or not, and the polling interval):

grok.data.query(`${PACKAGE_NAME}:${QUERY_NAME}`, {'parameter': 'value'}, true, 100);

See how this method works in the example on Datagrok.

Sharing connections

Data connections can be shared as part of a project, package ( and repository containing this package), or as an independent entity. Access rights of a database connection inherit access rights of a query. However, access rights of the query don't inherit access rights of the database connection. Thus, if one shares a query, the associated database connection shall automatically be shared. At the same time, when you are sharing a connection, your queries aren't going to be shared automatically. As for web queries, they are automatically shared along with sharing the corresponding connection.

Caching results

See db query caching for details

Rest endpoints

Web services provide endpoints that you can programmatically connect to. There are two main options for this: the first is to use OpenAPI/Swagger format supported by Datagrok, the second one involves the use of the platform's server to send a network request. The details of Swagger-based connections are further explained in the dedicated article. The method used to proxy requests has an interface similar to the standard fetch API and can be applied as follows:

const url = 'https://jsonplaceholder.typicode.com/posts';
const data = {name: 'username', password: 'password'};

grok.dapi.fetchProxy(url, {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify(data)
}).then(response => grok.shell.info(response.ok));

Reading files

In addition to connections and queries, it is useful to look at methods for reading files. In our JavaScript API examples, you can find methods that provide data for demonstration (grok.data.demo or grok.data.getDemoTable) and testing purposes (grok.data.testData). However, here we will cover the part that you will often use to deliver data to your applications.

Package files

If you want to access data that resides within your package (e.g. open a csv table), load it from URL, as you would do with other external files. The package root for client side can be found with webRoot property. The example shown below gets the test.csv table from the data-samples subdirectory and opens a table view for it:

export let _package = new DG.Package();

grok.data.loadTable(`${_package.webRoot}data-samples/test.csv`)
  .then(t => grok.shell.addTableView(t));

File shares

Connecting to file shares offers you more opportunities: files form a hierarchy, which you can browse naturally from the interface. Let's start with an existing file share — the user's home directory. In Datagrok's file browser, each user has a special HOME folder to store their files, which makes it a perfect example. Here is how you can work with your files located there:

grok.functions.eval(`OpenServerFile("${USER}:Home/data.csv")`)
  .then(t => grok.shell.addTableView(t[0]));

But more importantly, file shares let you gain access to data from various locations. So let's find out how to create such a connection in your package. Again, it's just a json file with the corresponding parameters:

{
  "name": "New File Share",
  "parameters": {
    "dir": "/home/www/master/servergrok/data/demo",
    "indexFiles": true
  },
  "credentials": {
    "parameters": {
      "login": "",
      "password": ""
    }
  },
  "dataSource": "Files",
  "tags": [
    "demo"
  ]
}

You should specify two parameters for connection: the directory you are going to work with and whether you want to index its files (if you do, there will be an indexing data job). When referring to a file from your code, put the names of package and connection before the file. The path to it should be relative to what you previously specified in the dir parameter:

grok.functions.eval(`OpenServerFile("${PACKAGE_NAME}:${CONNECTION_NAME}/data.csv")`)
  .then(t => grok.shell.addTableView(t[0]));

Reading and writing files in file shares

The grok.dapi provides for fine-grained operations on files from file shares. It's possible to write files, read from them, check existence, search for presence by a pattern, rename, move and delete. For example, here is how simple it is to create a text file in your local file share:

grok.dapi.files.writeAsText('<YOUR_NAME>:Home/testFile.txt', 'Hello, world!');

All the dapi.files methods accept three types of inputs:

  • a fully specified file path, as in the examples
  • a variable of type file ( see FileInfo) which may come, for example, from info panels working on files
  • a string with a file share connection GUID

All the dapi.files methods are asynchronous.

This example provides a full understanding of the files API.

Other ways for reading files

Finally, let's walk through other methods that can be used to open files from JavaScript:

  • If you define a function that takes an input of file type (also see FileInfo) , this creates a number of opportunities. First, you can call file.readAsBytes() or file.readAsString() methods on it. For instance, if you pass an obtained string to grok.data.parseCsv(csv, options), it is possible to fine-tune the construction of dataframe from comma-separated values. Alternatively, you can pass a file to a script to compute something and get the results back to your application's code.
  • The method grok.data.openTable(id) might come in handy when you are reproducing a process and need to open a specific table by its id. See an example.

See also: