================
Directory layout
================

./DB/              # Database schemas etc.
  Virtuoso/
    ATLAS.owl
  Impala/
    dkb_schema.sql

./DKBFrontEnd/     # Web interface
  src/             # Application
  deployment/      # Scripts and helpers for Web server deployment
  vendor/          # Vendor code (currently for Ontodia)

./Utils/           # Data and database management scripts
  Virtuoso/
    load_ontology.sh
    create_graph.sh
  Impala/
    create_dkb.sh
  Dataflow/
    StageX/
      README         # Description of input, tmp and output files
      stagex.sh
      stagex.py
    README         # Dataflow description
    config/        # Common directory for the stage configs
  Elasticsearch/   # Tools for working with Elasticsearch
    config/        # ES config files

./DataSamples/     # Data samples for dataflow scripts
  input/
     StageX/
  output/
     StageX/
  tmp/
     StageX/

./DatasetDiscovery # All information about datasets, their parameters,
                   # and Oracle/AMI/RUCIO requests

./README           # This file

========
Dataflow
========

All the data management scripts are meant to be treated as consecutive steps
of the dataflow.
For example:
1)   Get papers with links to supporting documents from GLANCE
  input/...  (please fill if aware)
  output/... (please fill if aware)
2)   Get papers metadata from CDS
  input/...  (please fill if aware)
  output/... (please fill if aware)
3)   Get supporting notes metadata from CDS
  input/...  (please fill if aware)
  output/... (please fill if aware)
4)   Download supporting notes PDFs from CDS:
  input/...  (please fill if aware)
  output/... (please fill if aware)
5)   Get PDF URLs from CDS
  input/...  (please fill if aware)
  output/... (please fill if aware)
6)   Convert PDF to a text file:
  input/PDF_Analyzer  -> (step 5 output)
  output/PDF_Analyzer           -- JSON files
7.1) Convert paper metadata to triples:
  input/preparePapers -> (step 2 output)
  output/preparePapers/ttl      -- TTL and...
  output/preparePapers/sparql   -- ...SPARQL files
7.2) Convert SupportingDocuments metadata to triples:
  input/prepareSDocs -> output/PDF_Analyzer
  output/prepareSDocs/ttl        -- TTL and...
  output/prepareSDocs/sparql     -- ...SPARQL files
7.3) Get dataset metadata:
  input/ds_get_metadata -> output/parseTXT
  output/ds_get_metadata        -- CSV files
8)   Convert dataset metadata to triples:
  input/prepareDatasets -> output/ds_get_metadata
  output/prepareDatasets/ttl    -- TTL and...
  output/prepareDatasets/sparql -- ...SPARQL files
9)   Upload data to Virtuoso (see the sketch below):
  input/upload2Virtuoso -> output/prepare*/*
  output/upload2Virtuoso         -- empty