PanDAWMS/dkb

================
Directory layout
================

./DB/              # Database schemas etc
  Virtuoso/
    ATLAS.owl
  Impala/
    dkb_schema.sql

./Utils/           # Data and database management scripts
  Virtuoso/
    load_ontology.sh
    create_graph.sh
  Impala/
    create_dkb.sh
  Dataflow/
    StageX/
      README         # Description of input, tmp and output files
      stagex.sh
      stagex.py
    README         # Dataflow description
    config/        # Common directory for the stage configs
  Elasticsearch/   # Tools for working with elasticsearch
    config/        # ES config files

./DataSamples/     # Data samples for dataflow scripts
  input/
     StageX/
  output/
     StageX/
  tmp/
     StageX/

./DatasetDiscovery # All information about datasets, their parameters, and Oracle/AMI/RUCIO requests

./README           # This file
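
The per-stage layout under ./DataSamples/ can be resolved with a small helper.
What follows is a minimal sketch and not part of the repository: the function
name `stage_dirs` and the default root are assumptions made for illustration,
and `StageX` stands for any stage name.

```python
from pathlib import Path

# Sub-directories of ./DataSamples/ as listed in the layout above.
KINDS = ("input", "output", "tmp")


def stage_dirs(stage, root="DataSamples"):
    """Return {kind: path} for one stage, e.g. input -> <root>/input/<stage>.

    Hypothetical helper: the name and signature are illustrative only.
    """
    base = Path(root)
    return {kind: base / kind / stage for kind in KINDS}


if __name__ == "__main__":
    for kind, path in stage_dirs("StageX").items():
        print(f"{kind:6} {path.as_posix()}")
```

Keeping the three directory kinds in one tuple means a stage script only needs
its own name to find its input, output and tmp locations.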

========
Dataflow
========

It is suggested to treat all the data management scripts as consecutive steps
of the dataflow.
For example:
1)   Get papers with links to supporting documents from GLANCE
  input/...  (please fill if aware)
  output/... (please fill if aware)
2)   Get papers metadata from CDS
  input/...  (please fill if aware)
  output/... (please fill if aware)
3)   Get supporting notes metadata from CDS
  input/...  (please fill if aware)
  output/... (please fill if aware)
4)   Download Supporting Notes PDF papers from CDS
  input/...  (please fill if aware)
  output/... (please fill if aware)
5)   Get PDF URLs from CDS
  input/...  (please fill if aware)
  output/... (please fill if aware)
6)   Convert PDF to a text file:
  input/PDF_Analyzer  -> (step 5 output)
  output/PDF_Analyzer           -- JSON files
7.1) Convert paper metadata to triples:
  input/preparePapers -> (step 2 output)
  output/preparePapers/ttl      -- TTL and...
  output/preparePapers/sparql   -- ...SPARQL files
7.2) Convert SupportingDocuments metadata to triples:
  input/prepareSDocs -> output/PDF_Analyzer
  output/prepareSDocs/ttl        -- TTL and...
  output/prepareSDocs/sparql     -- ...SPARQL files
7.3) Get dataset metadata:
  input/ds_get_metadata -> output/parseTXT
  output/ds_get_metadata        -- CSV files
8)   Convert dataset metadata to triples:
  input/prepareDatasets -> output/ds_get_metadata
  output/prepareDatasets/ttl    -- TTL and...
  output/prepareDatasets/sparql -- ...SPARQL files
9)   Upload data to Virtuoso:
  input/upload2Virtuoso -> output/prepare*/*
  output/upload2Virtuoso         -- empty
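
The "input/X -> ..." links above define which stage consumes which output, so a
valid execution order can be derived from them. The following is a minimal
sketch under stated assumptions, not the project's own tooling: the mapping
`DEPENDS_ON` and the function `run_order` are hypothetical names, "step 2" and
"step 5" are placeholders for stages whose directories are not named in the
list, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# stage -> set of stages whose output it reads, copied from the list above.
# "step 2" and "step 5" are placeholders: no directory names are given for them.
# "parseTXT" is taken as written in the list; it has no step of its own there.
DEPENDS_ON = {
    "PDF_Analyzer": {"step 5"},
    "preparePapers": {"step 2"},
    "prepareSDocs": {"PDF_Analyzer"},
    "ds_get_metadata": {"parseTXT"},
    "prepareDatasets": {"ds_get_metadata"},
    "upload2Virtuoso": {"preparePapers", "prepareSDocs", "prepareDatasets"},
}


def run_order(depends_on=DEPENDS_ON):
    """Return one valid order in which producers come before their consumers."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for stage in run_order():
        print(stage)
```

Several orders are valid because independent branches (papers, supporting
documents, datasets) may be interleaved; only the producer-before-consumer
constraint is guaranteed, with upload2Virtuoso always last.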