These programming exercises are designed to help developers get proficient with the Datagrok platform. The exercises are organized as progressive steps, with tasks of increasing complexity built on top of the previously completed steps.
During this course, we will be building support for handling DNA nucleotide sequences. Let that not scare you, think of
them as regular strings that can only contain characters G
, A
, C
, and T
(and now you know the origins of the
"Gattaca" movie name). We will start with writing standalone functions, then automatically recognizing nucleotide
sequences in the imported data, and then going all the way to custom visualizations, relational databases querying,
predictive models, integration with the external utilities, data augmentation, and custom applications.
- Querying databases
- Creating a scripting viewer
- Transforming dataframes
- Custom cell renderers with 3-rd party JS libraries
- Accessing Web services with OpenAPI
- Creating an info panel with a REST web service
- Enhancing Datagrok with dialog-based functions
Prerequisites: basic TypeScript or JavaScript knowledge. Useful links:
- Install the necessary tools (Node.js, npm, webpack, datagrok-tools) following these instructions
- Create a branch from master at GitHub or using your IDE; Use your credentials as a name of the branch
- Get a dev key for Dev Server (you will work with this server) and add it by
running
grok config
. Open https://dev.datagrok.ai/u, click onDeveloper key
, copy thegrok
command and execute it to add the key to your config - Create a default package in your branch called
<yourFirstName>-sequence
using datagrok-tools:grok create <yourFirstName>-sequence
with specifying the--ts
option to create a package with TypeScript configuration (if you are new to TypeScript, you can specify the--js
option); Note that detectors.js file should be in JavaScript anyway. Also you can add--eslint
option to add eslint checker feature to the package - Run
npm install
to link the dependencies mentioned inpackage.json
file of your package - Upload it to the server: run
webpack
andgrok publish dev
(see other options here) - Launch the platform and run the package's
info()
function using different methods:
- via the Functions view
- via the Packages menu (find your package, click on it and run
info()
from theFunctions
pane in the property panel on the left) - via the console: press
~
key anywhere inside Datagrok, the Console will appear to the right; execute<loginName>Sequence:info()
there. The identifier used as package name (before ':') will be obtained by transformation kebab style of folder name to camel style, or can be specified directly with attributefriendlyName
inpackage.json
file.
As a result of the function execution you should see an info notification with url of package's webRoot.
Prerequisites: basic TypeScript or JavaScript knowledge.
Details: How to define semantic type detectors, How to add an info panel.
You will learn: how to write semantic type detectors, how to develop context-specific data augmentation.
-
Create a
complement
function insrc/package.ts
which takes a nucleotide string and returns its complement://name: complement //input: string nucleotides //output: string result export function complement(nucleotides): /*type*/ { // your code goes here }
Note that comments on the top of the function declaration are crucial for running it on the platform. They determine the function name, the input and output types. Essentially, change each character to the complementary one:
A <=> T
,G <=> C
. Run it and check whether everything works fine. -
Now, let's specify that this function is meant to accept not any string, but nucleotides only, and to return a nucleotide string as well. In order to do that, let's annotate both input and output parameters with the
dna_nucleotide
semantic type://input: string nucleotides {semType: dna_nucleotide} //output: string result {semType: dna_nucleotide}
At this point,
dna_nucleotide
string does not have any meaning, but we will connect the dots later. -
Define a
detectNucleotides
semantic type detector function as part of the specialdetectors.js
file.class <yourFirstName>SequencePackageDetectors extends DG.Package { //tags: semTypeDetector //input: column col //output: string semType detectNucleotides(col) { // your code goes here } }
It should check whether a column is a string column, and whether each string represents a nucleotide. If condition is met, it should return
"dna_nucleotide"
string.For best performance, don't iterate over all column values, instead iterate on
column.categories
. Full Datagrok Column type API could be found here. -
Upload your package to
dev.datagrok.ai
usinggrok publish dev
command. When everything is done correctly, thedetectors.js
file will get loaded by the platform automatically, and thedetectNucleotides
function will be executed against every column in a newly added table. -
Reload the
dev.datagrok.ai
page to use updated version of your package. -
Test your implementation by opening the following CSV or TXT file (or go to
📁 (Data) | Text
and paste it there). Make sure you click on DONE (this will trigger semantic types detection):sequence, id GATTACA, 1997 ATTCGGA, 1984 TTTAGGC, 2021
Hover over the
sequence
column header after the data is imported — if everything is done correctly, you will seequality: dna_nucleotide
in the bottom of the tooltip: Alternatively, you can find this information if you click on the column and expand the 'Details' pane in the property panel on the right. -
Now let’s put the result of the previously created
complement
function into an info panel: Create functioncomplementWidget
and add special comments block to allow Datagrok system recognise it and upload properly (see an example here).//name: complementWidget //tags: panel, widgets //input: string nucleotides {semType: dna_nucleotide} //output: widget result //condition: true
The
panel
andwidgets
tags and output typewidget
allows Datagrok to determine how the result ofcomplementWidget
function will appear in the system. Listed above block of comments will instruct the platform to use thecomplementWidget
function for providing additional information for string values of thedna_nucleotide
semantic type. To test it, simply open our test file, click on any cell in thesequence
column, and find thecomplementWidget
property in the panel on the right as it is shown on screenshot:
Prerequisites: basic Python knowledge.
Details: Scripting, Dev Meeting 1 | First-class functions
You will learn: how to create and invoke Datagrok scripts in data science languages like R and Python.
In this exercise, we will count occurrences of a given subsequence in a nucleotide sequence, using Python.
-
Open Datagrok and navigate to
Functions | Scripts | Actions | New Python Script
. -
Observe a default script created for you. All script attributes are specified in the beginning in comments. There we have the script name, language, one input value of type
dataframe
, and one output value of typeint
. The script simply computes number of cells in the dataframe. Dataframe is a high-performance, easy to use tabular structure with strongly-typed columns of different types (supported types are:string
,bool
,int
,bigint
,double
,qnum
anddatetime
). In this exercise, we only see a dataframe as is in the default script; there is another exercise to learn manipulating dataframes in JavaScript. -
Run the script to get a hint for creating an input file. An attribute
#sample: cars.csv
is responsible for it. To open a default input filecars
, click theStar
icon in the top menu. -
Run the script again and proceed to the Datagrok's console. As in Quake, it's available by pressing a
~
button anywhere inside Datagrok. In the console, you would see the script execution result. Just one line above the result you could see the console's command to execute the script. Enter it again to the console to get the same result (but in console you should specify script with namespace prefix as<yourLogin>:<script_name>
). -
Let's modify the script to solve the task of counting sequence occurrences. Add a new preamble: (use any
#description
you like). Spaces are not allowed between '#' and attribute name:# name: CountSubsequencePython # language: python # input: string sequence # input: string subsequence # output: int count
In the body, implement a Python function counting all occurrences of a given
subsequence
in asequence
. Return acount
the same way as in the default script from p. 2. -
Run the script function, provide input values in the dialog and get to the console to see the result. Now run the script function again through the console completely, passing different arguments values:
<yourLogin>:CountSubsequencePython('ATGATC', 'A')
. You can find your login inside the profile page between name and email (under avatar), or in the profile URL:https://dev.datagrok.ai/u/<yourLogin>/summary
. -
Let's apply
CountSubsequencePython
to the input dataframe using Datagrok UI. Open a table — say, let's go forsars-cov-2.csv
. Navigate toData | Files
and openDemo Files / bio / sars-cov-2.csv
. Navigate to a menu itemEdit | Add New Column...
and click it. Type in your expression using the function you've just previously created: Observe how thePreview Result Columns
change while you are modifying the expression. There, notice a namespace<yourLogin>
as part of a qualified function name<yourLogin>:<functionName>
,JDoe:CountSubseqnecePython
in this case. Namepaces are used through Datagrok very commonly. In general, there shall be no case where you would call a function without specifying a namespace. Datagrok namespaces originate from the names of packages, projects, and users, and always qualify a resource name, be it a package, a function, a connection or a query. Now hit "Ok" and have the new column inserted to the dataframe.
Prerequisites: basic Python knowledge.
You will learn: how to manipulate tables, which we usually call dataframes, using a server scripting language, expand dataframes with newly computed values, and modify the dataframes.
In the previous exercise we learnt a fast method to apply a function to a table and produce a new column in it. Another means to introduce new columns to the dataframes is to programmatically manipulate dataframes right in scripts. Let's repeat what we've achieved in the last point of the previous exercise, now with more scripting.
-
Let's create a different kind of our
CountSubsequencePython
function, now calledCountSubsequencePythonDataframe
. While the original function could only operate on a single row, the new function shall operate on the entire dataframe. To start with, the function's Datagrok signature should look as follows:# name: CountSubsequencePythonDataframe # language: python # input: dataframe sequences # input: column columnName # input: string subsequence = "acc" # output: dataframe result {action:join(sequences)}
This function takes as an input a dataframe with a column containing nucleotide sequences, named as a value of
columnName
, a nucleotide subsequencesubsequence
being sought, and outputs an input dataframe with a new column appended to it, containing numbers of subsequence occurrences in the nucleotide sequences. Say, for a table on the left the following table on the right should be produced for a subsequenceacc
being sought:GenBank ID MT079845.1 ctacaagaga MT079851.1 attaaaggtt MT326187.1 gttctctaaa GenBank ID N(acc) MT079845.1 ctaccagaga 1 MT079851.1 attaaaggtt 0 MT326187.1 gttctctacc 1 -
Implement a function
CountSubsequencePythonDataframe
. Assume theresult
is a Python dataframe with just this one columncolumnName
. After theresult
column is computed and returned from the server to the client, based on thejoin
instruction,result
will be appended to the existing input dataframesequences
. As this is performed purely on the client, we save the bandwidth without needing to return a copy of a dataframe which we already passed to the server.- Use Pandas dataframes as
pd
to access the input dataframe and create an output dataframe - You don't need to import
pandas
, Datagrok does this automatically: to each Python script it adds a preamble with most popular imports (os
,io
,json
,pandas as pd
,requests
,datetime
,timedelta
) - Note that the column
columnName
is just a string with a column name passed to a script, not an actual column content
- Use Pandas dataframes as
-
Run the function with a "Play" button on top of the function window. The dialog will prompt you to select a dataframe. Navigate to a "Data" view (first button on the left sidebar) and open a file with nucleotide sequences (say,
Demo Files / bio / sars-cov-2.csv
available at public.datagrok.ai). Go back to theRun Function
dialog to select the opened dataframe. -
Now choose a column with nucleotide sequences from the dropdown. Notice how the list of columns is automatically formed for the selected dataframe. Finally, run the function to get the resulting dataframe.
-
As for modifying the dataframes in general, just consider removing the
{action:join}
option and do whatever is needed to the output dataframeresult
before the end line of the script. This will return exactly theresult
dataframe after all modifications. -
Consider that the function may have several outputs. In case you return two dataframes, both will appear in the Datagrok interface. There's also a special syntax in Datagrok JS API to call functions which return several parameters, we'll review this in one of the following exercises.
Prerequisites: basic JavaScript knowledge.
You will learn: how to create and invoke Datagrok JavaScript scripts.
- Go to
Functions | Scripts
and hitNew JavaScript Script
. - Implement the function
CountSubsequenceJS
in JavaScript, which does the same asCountSubsequencePython
. Follow the same conventions on the parameters in the comments block and returning a result via a variable. - Run
CountSubsequenceJS
using thePlay
button; using the console. From same console, runCountSubsequencePython
yet again. You can notice that both Python and JS versions of our function, implemented as scripts, are homogeneous functions in Datagrok. It's also possible to call them in a uniform fashion using our JavaScript API. - Don't forget to save these two scripts. We would re-use parts of them in the following exercises.
The difference between the two scripts is that the first, CountSubsequencePython
, runs on our server by
a compute virtual machine, whereas the second, CountSubsequenceJS
,
runs directly in the browser. To run CountSubsequencePython
, Datagrok passes the script arguments over the network and
fetches back the result to the browser.
Prerequisites: basic SQL knowledge
Details: Connecting to Databases, How to Access Data
Note: Editing an existing data query requires the respective access permission. You might need to request one.
In this exercise, we will work with a northwind
PostgreSQL database (in case the name sounds familiar, this is a demo
database that Microsoft often uses for showcasing its technology). The database is already deployed and is accessible
from our server.
-
Navigate to the
Data | Databases | PostgreSQL | northwind | Schemas | public | orders
table -
Make this table current by left-clicking on it, and explore its property panels on the right. The
Content
pane should be showing first 50 rows of that table. -
Right-click on the table, and choose
New SQL Query...
-
Execute the query and make sure it returns results.
-
Modify the query to accept a
country
parameter, and return the sum of freights for the specified country, grouped bycustomerid
:--input: string country select customerid, sum(freight) from public.orders where shipcountry = @country group by customerid
-
Run the query, enter one of the countries in the input box (such as
USA
, without quotation marks or apostrophes). Run it the second time, notice that previously entered parameters could be quickly reused by clicking on the watch icon in the left bottom corner of the dialog window. -
Rename this query from your name to
ordersByCountry
, and save it. -
Try different ways to execute it:
- Right-click on
Data | Databases | PostgreSQL | northwind | ordersByCountry
, selectRun
from the context menu, enter the country name, and run it - Click on
Data | Databases | PostgreSQL | northwind | ordersByCountry
, expand theRun
pane on the right, enter the country name and run it - Open console by pressing
~
key, see the results of the previous invocations. Copy-paste the corresponding command and run it from the console.
- Right-click on
-
Now, let's add this query to our package. Create a connection by running
grok add connection <yourFirstName>
, then, as instructed here, create the '.sql' file under thequeries
folder, and paste our query there. Give it a name by adding the--name: ordersByCountry
line on top of it. -
Deploy the package, launch the platform, find the query in the package, and run it.
-
Create a JavaScript function (in
src/package.js
) that has no parameters and returns a dataframe with the results of theordersByCountry('USA')
call://name: getOrders //output: dataframe df export async function getOrders() { return await grok.data.query(`${packageName}:${queryName}`, { country: 'USA'}); }
There is another way to pass a country name to the query: you can provide a default value for the input parameter (see examples in the article Parameterized Queries).
Prerequisites: basic Python knowledge, matplotlib or a similar library
Details: Scripting , Scripting Viewer, Creating a scripting viewer (video).
Amino acids counting task. In this exercise, we'd use a Python script to generate a histogram
(a distribution plot) for amino acids occurring in a column of nucleotide sequences. Amino acids are simply triples of
nucleotides from which the nucleotide DNA sequence is made of. These are also called triplets, or codon-codes. As there
are 4 letters G
, A
, T
, C
, there are 4 to power 3 protein amino acids: GTA
, AGC
, TTT
, and so forth.
We don't know at which starting point each nucleotide sequence was cut: it could either be a junction of two triplets, or one-third or two-third of a triplet. Therefore, we'd count in our statistics for all three possible cuts, starting the reading frame off at offsets 0, 1, and 2 from the beginning of the nucleotide sequence.
Say, we are given a sequence TTTAATTACAGACCTGAA
. We start to count triplets without overlap from an offset 0 first,
getting: TTT
, AAT
, TAC
, ...
, GAA
. Then we move to the offset 1, getting: TTA
, ...
, CTG
. Lastly, we move
to the offset 2 and get: TAA
, ...
, TGA
. In the histogram we'd count for all these triplets.
First, let's explore how scripting viewer works.
-
Open a
demog
demo file with demographic data. It is located atData | Files | Demo Files | demog.csv
.Data
corresponds to the first button from the top of the Datagrok sidebar. Make sure the table view with the data appears. -
Activate the top menu from the sidebar, using a
Windows | Menu
switch. -
In this menu, hit
Add | Scripting Viewers | Python | Scatter Plot
. -
See that the viewer appeared on the right, telling though it is "Unable to plot with current settings".
-
Proceed to the viewer properties by hitting on the gear icon in the viewer's title.
-
Make sure the chosen values for "Data" are
HEIGHT
forX
,WEIGHT
forY
, andAGE
forColor
. After checking this you should see a nice scatter plot forWEIGHT
andHEIGHT
with the color corresponding toAGE
: -
In the property panel, proceed to modify the value of the "Script" field by clicking on a "..." icon in the text field.
-
The Python code you see is what renders the scatter plot form p.6 on the Datagrok server. Let's walkthrough this code.
- The script takes as inputs the original dataframe and the three columns. Remember form p.6 there were selectors
for
X
,Y
, andColor
in the property panel. In fact, these three property names are declared with the notation<propertyName>ColumnName
in the names of the three#input
columns. - The script produces an
#output
of typegraphics
. It is important the graphics appear in the end of the Python script. This is exactly what happens with theplt.show()
in the last line of the script.
- The script takes as inputs the original dataframe and the three columns. Remember form p.6 there were selectors
for
-
Modify the name of
colorColumnName
to atemperatureColumnName
, hitApply
in the bottom of the window, and check what happens to theColor
field in the property panel. -
Add another input parameter to the script with a name
Title
. HitApply
and check what appears in the property panel. -
Add another input column to the script with a name
SEX
. HitApply
and check what appears in the property panel. -
Now there's all you need to create a Python scripting viewer for our amino acid histogram task. Open a demo file with nucleotide sequences. It is located at
Data | Files | Demo Files | bio | sars-cov-2.csv
.Data
corresponds to the first button from the top on the Datagrok sidebar. -
In the top menu you've activated at p.2, hit
Add | Scripting Viewers | New Scripting Viewer
. -
Follow what you've learned in the points 1 to 11 to create a scripting viewer taking a column of strings, expecting to have nucleotide sequences in them, and plotting a Matplotlib's histogram with all amino acid triplets occurred within all of these sequences. As you may notice,
numpy
andmatplotlib
are already available for your Python scripting in Datagrok. Reuse them to finish this exercise.
Prerequisites: exercises "Setting up the environment", "Semantic types".
You will learn: how to join and union dataframes using the knowledge of semantic types, and display the result.
-
Make sure the prerequisites are prepared on your machine, including the package called
<yourFirstName>-sequence
Assure the package carries a relevant semantic type detector from the exercise "Semantic Types". -
Add a function to the package as follows:
//name: fuzzyJoin //input: dataframe df1 //input: dataframe df2 //input: int N ...
-
Implement a
fuzzyJoin
function which takes two dataframesdf1
anddf2
, and does the following:- takes a first column in
df1
which has a semantic type ofdna_nucleotide
, let's say it iscol1
- takes a first column in
df2
which has a semantic type ofdna_nucleotide
, let's say it iscol2
- creates a dataframe
df
out ofdf1
anddf2
in the following way:- the content of
df2
goes afterdf1
, and all columns ofdf1
anddf2
are preserved — this is a UNION operation for dataframes, as in SQL; use the dataframe's.append
method - a new column
Counts
appears indf
, which contains:- for each row
R
fromdf1
,R.counts
is a number of matches of all the subsequences inR.col1
of lengthN
in all the sequences ofcol2
- symmetrically, same for each row from
df2
— consider this as a fuzzy, programmatic JOIN of the two dataframes; usedf.columns.addNew
,col.set(i, value)
on a newly created column
- for each row
- the content of
- displays
df
withgrok.shell.addTableView
- takes a first column in
-
Deploy the package with
webpack
andgrok publish dev
. Unlike with the first exercise, where the package was built on the Datagrok server, in this one we locally build the package before sending it. In addition, webpack output helps find some syntactic errors in JavaScript. -
Launch the platform, open the two files from
"Demo files"
:sars-cov-2.csv
anda-h1n1.csv
, and run the package'sfuzzyJoin
function using one of the methods you've learned. The result for N=3 should look similar to: -
Read more about joining dataframes through the case reviewed at our Community Forum, and with a sample.
You will learn: reuse 3-rd party JavaScript libraries in your Datagrok packages; render cells by semantic types.
Prerequisites: exercises "Setting up the environment", "Semantic types".
-
Navigate into the folder with your
<yourFirstName>-sequence
package created in "Setting up the environment". -
Let's add a custom cell renderer for a nucleotide sequence box to represent our sequences in high density on the screen. We need to render each nucleotide sequence with a monospace font in small letter sizing, fitting into a rectangular cell area and adding ellipsis to the end of the string if it won't fit. This is a basis for a very useful nucleotide sequence representation in bioscience applications. Let's use a 3-rd party JavaScript library
fusioncharts-smartlabel
to compute the text fit. Add it to your package by navigating in its folder and calling:npm install fusioncharts-smartlabel --save
The--save
key updatespackage.json
to add this library to your package dependencies. -
Add a class to
src/package.js
for the new cell renderer:- use
fusioncharts-smartlabel
to break the original sequence in the current cell into lines which fit into a cell's canvas rectangle; learn here how to do it, considerSmartLabel.textToLines(...).lines
as a target array of lines to render - Datagrok grid is rendered through an
HTML5 Canvas. The grid's canvas is
g.canvas
. Iterate through the resulting lines and bring them to ag.canvas
in therender
method withg.canvas.getContext("2d").fillText
; learn more about HTML Canvas if it's new for you - Hint: pay attention to managing
line-height
both at computing the box and rendering text lines
class NucleotideBoxCellRenderer extends DG.GridCellRenderer { get name() { return 'Nucleotide cell renderer'; } get cellType() { return 'dna_nucleotide'; } render(g, x, y, w, h, gridCell, cellStyle) { let seq = gridCell.cell.value; const sl = new SmartLabel('id', true); sl.setStyle({/* ... */}); // ... let ctx = g.canvas.getContext("2d"); ctx.font = '11px courier'; // ... const lines = labelObj.lines; for (let i = 0; i < lines.length; i++) ctx.fillText(/* ... */); } }
- use
-
Add the below to
src/package.js
to make the new cell renderer part of the package://name: nucleotideBoxCellRenderer //tags: cellRenderer //meta.cellType: dna_nucleotide //output: grid_cell_renderer result export function nucleotideBoxCellRenderer() { return new NucleotideBoxCellRenderer(); }
-
Deploy the package as usual with
grok publish dev
. In Datagrok, navigate to a file with nucleotide sequences from"Demo files"
, such assars-cov-2.csv
. Verify you get the desired result, it should look similar to this: Change the "Sequence" column width and rows heights with a mouse to see how things adujst. -
(*) Implement a colored nucleotide sequence box where backgrounds of
A
,G
,C
,T
vary. Choose one of the popular coloring conventions, following this link.
Details: OpenAPI access
Web services often provide their API specs in an OpenAPI (Swagger) format in a JSON or a yaml file. Because OpenAPI spec file is standardized, the API may now be directly loaded and later queried. Datagrok provides for connecting to API data sources and fetching API querying results as dataframes. In this lesson we will connect to the European Nucleotide Archive (ENA) and fetch some nucleotide data regarding coronavirus.
-
Obtain a ENA's Swagger JSON file for the ENA Browser, following this link. It usually takes you some effort to reach the JSON Swagger at the API Browser link. The rule of thumb there is to use the browser's "Network" tab in the Developers Console (available by F12), and identify there a resource called " api-docs". Usually, when there's an API Navigator, there's also a JSON Swagger. It's also possible to understand the API through its Swagger API tester. Follow recommendations here. In particular, modify the connection's
Name
toENA
,Url
tohttps://www.ebi.ac.uk/ena/browser/api/
. Save this file to a desktop with a name, say,ENA.json
. -
Load the Swagger into Datarok by drag-and-dropping it into the platform window.
-
Check the connection is valid with the
Test
button, and hitOk
to save the edit. -
In the expanded view of the
ENA
connection, locatePerform a text search and download data in XML format
and hitRun
or double-click it. -
Enter the parameter values: set
Query
tocoronavirus
,Result
toassembly
. HitOk
. As a result, you'd find a table, which was prepared from the received XML file by Datagrok. -
Close the table, locate the saved query in the list and run it.
-
Bring the connection to the package:
- Put the Swagger file in a
swaggers
folder of the package. Make sure the swagger contains the correctbasePath
andhost
, in some Swaggers it isn't always the case. - Add the following function to
package.js
:
//name: testENASwagger export async function testENASwagger() { let data = await grok.data.query('<yourFirstName>sequence:PerformATextSearchAndDownloadDataInXMLFormat', {'query': 'coronavirus', 'result': 'assembly'}); grok.shell.addTableView(data); }
- Note how the Swagger's query name translates into a package query name.
- You can obtain this query name with the Datagrok UI. Click on the query of interest,
"Perform a text search and download data in XML format"
in our case, and find aLinks...
section. Click it and copy a function's name from the URI. - Deploy the package and make sure
testENASwagger
function works in Datagrok.
- Put the Swagger file in a
We provide a handful of demo Swaggers, check their source JSON files here and see in action in Datagrok at
the Web Services
section of the Datagrok UI.
We will use the ENA REST API to output sequences and associated data in the info panel, based on the ENA sequence ID contained in a currently selected grid cell.
-
Searching through the ENA archive, you may notice the sequences' IDs have a format of
[A-Z]{2}[0-9]{6}
(two capital letters + six digits). Go to the detectors file of your package and add a detector which recognizes a string of this form://input: string str //output: bool result isPotentialENAId(str) { // returns true, if name is of the form [A-Z]{2}[0-9]{6} }
-
Use
fetchProxy
to get a sequence for the potential corresponding ENA ID in fasta format. For example, this GET fetches the sequence for theID=AA046425
:https://www.ebi.ac.uk/ena/browser/api/fasta/AA046425
Use the following structure for the into panel function in yoursrc/package.js
://name: ENA Sequence //tags: panel, widgets //input: string cellText {semType: ENA} //output: widget result //condition: isPotentialENAId(cellText) export async function enaSequence(cellText) { const url = `https://www.ebi.ac.uk/ena/browser/api/fasta/${cellText}`; const fasta = await (await grok.dapi.fetchProxy(url)).text(); return new DG.Widget(ui.box( // ... the widget controls are composed here )); }
Incorporate a
textInput
control to display a sequence in a scrollable fashion. Add a caption to that text area to display an ENA's name for this sequence, which also comes in the fasta file. Use asplitV
control to nicely locate the caption at the top and the text area at the bottom.
fetchProxy
mimics the regular fetch
method of ECMAScript, but solves
a CORS limitation of JavaScript. In this panel, you'd query
the external domain from your web page, whereas CORS prevents you from querying anything outside a reach of your web
page's domain. Thus Datagrok provides a proxy facility in the neat fetchProxy
wrapper.
In the previous exercises we've learned how the Datagrok function inputs are offered in a dialog window automatically once you run the function. In this exercise we find how to expand these dialogs with the behaviour beyond simple arguments-to-inputs mapping.
So, in some previous exercises we've used the files a-h1n1.csv
and sars-cov-2.csv
which we prepared in advance.
These files contain ENA sequence ID along with the first 60 letters of the sequence by ID. Let's construct a
dialog-based function which forms such files automatically by a given search input. The search topic may
be coronavirus
, influenza
etc.
-
This
GET
query performs a text search in the EMBL database, returning alimit
first results (10
in this case):https://www.ebi.ac.uk/ena/browser/api/embl/textsearch?result=sequence&query=coronavirus&limit=10
By the way, you could discover this API endpoint via a Swagger API navigator at this link. Let's assume the result we want is always of typesequence
. Create a function_fetchENASequence
which takes as parameters aquery
ad alimit
and returns a dataframe with two string columnsID
andSequence
. Use this structure for dataframe construction:df = DG.DataFrame.fromColumns([ DG.Column.fromList(DG.COLUMN_TYPE.STRING, 'ID', [ /* a list of IDs you've parsed from a ENA output */ ]), DG.Column.fromList(DG.COLUMN_TYPE.STRING, 'Sequence', [ /* corresponding list of sequences */ ]) ]);
The output from
ebi.ac.uk
is a raw text, and you need to parse it to get the desired pieces. Trim the sequence so that isn't longer than 60 characters. Use your previous knowledge aboutfetchProxy
to do theGET
query. -
Make a function
formENADataTable
which constructs a dialog giving the user a two-step process for constructing a dataframe with ENA sequence data in it.- First, the user can type in the query (
coronavirus
is the default setting) and see the first 10 results in the grid right in this window after clicking the "Search" button. Consider this as a preview before the actual dataframe is produced. - Second, when the user is happy with what's in the preview, he/she proceeds to the "Ok" button to get the actual
dataframe with the ENA data on the screen in the Datagrok's grid view. This table shall consist of the number of rows
the user chooses (
100
set as a default).
Here is the code scaffold for the
formENADataTable
function:let grid = DG.Viewer.grid(df); let limitInput = ui.intInput('How many rows: ', 100); let queryInput = ui.stringInput('Query: ', 'coronavirus'); let button = ui.button('Preview'); ui.dialog('Create sequences table') .add(ui.splitV([ ui.splitH([ ui.span([queryInput.root]), button ]), ui.div([grid]), ui.div([limitInput]) ])) .onOK(() => { /* Handle table creation */ // Display the resulting table grok.shell.addTableView(df); }) .show();
Re-use twice the
_fetchENASequence
function you've prepared previously. - First, the user can type in the query (
-
In this first version we fetched
60
characters for a sequence. Add a new text field calledSequece length
to let the user specify this trim length, set it60
as a default. -
Make your function set a proper semantic type for the
Sequence
column. -
(*) You may notice the sequences you get in this order are not too different. Add more diversity to these tables. For example, you can use the
offset
parameter of theGET
query.