Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ontology repacking and exporting - relates to #34 #44

Open
wants to merge 30 commits into
base: master
Choose a base branch
from
Open

Conversation

dimatr
Copy link
Collaborator

@dimatr dimatr commented Apr 28, 2020

Please review.

  • To use the code please add to the parameters --do-ttl True
  • The output .ttl files will go to the {bin_width}-turtle folder besides the main .json output folder, e.g. 1000-turtle
  • Double check the repacking logic in the PangenomeSchematic.py
  • Check if the bin/path/etc. ids are used correctly
  • Do not run it on the heavy sets, the turtle repack+dump may take really long

@subwaystation
Copy link
Member

Hi @dimatr the good news first:
http://ttl.summerofcode.be/ says you produced valid .ttl syntax! Good job. One minor thing: Please remove the newlines between each record, this will save some space, hopefully. It will still be valid Turtle.

For better debugging, let's use the files given in svg_opg1.zip. It is a small example graph I came up with.
I ran

python ~/git/component_segmentation/segmentation.py -j svg1.gfa.og.w1.json -f svg1.gfa.og.fasta -o ./cs_svg1 -t svg1.gfa.og.w1.json.ttl

And got cs_svg1.zip.
On first glance everything looks good, except the following:

  • Each path name in a FALDO region is always path0. Here the corresponding path name should pop up.
  • Path names in vg:linkPaths are counted up from 1-n. Please replace these with the corresponding names.
  • vg:Link: I am missing forwardLinkEdge and reverseLinkEdge.
  • vg:Bin: I am missing forwardBinEdge and reverseBinEdge.

dimatr added 2 commits April 30, 2020 20:39
# Conflicts:
#	matrixcomponent/PangenomeSchematic.py
… edges definition for Component and Bin containers; Links are a part of ZoomLevel component
@josiahseaman josiahseaman requested review from subwaystation and 6br May 6, 2020 21:44
@josiahseaman josiahseaman removed their assignment May 6, 2020
@6br
Copy link

6br commented May 7, 2020

Thanks for your implementation!
My comments:

  • linkRank or forwardLinkEdge/reverseLinkEdge should be defined between two links, by sorting all paths with two keys (upstream, downstream).
  • I think vg:Link does not affect on forwardBinEdge and reverseBinEdge, just connecting bins with arrival and departure.
  • More precisely, faldo:region consists of two exact positions pointing out a reference c.f. https://github.com/OBF/FALDO.

@subwaystation
Copy link
Member

  • I see
<5/region/28-28> a faldo:Region ;
    faldo:begin <5/position/28> ;
    faldo:end <5/position/28> .

which does not make sense, because the overall pangenome nucleotide length is 17. See

[heumos@wave svg_opg1]$ odgi stats -S -i svg1.gfa.og -V
length:	17
nodes:	4
edges:	7
paths:	3
cov	sets
3	5,5-,
10	5,5-,6,
4	6,
  • No FALDO regions for paths -5 and 6. This means that all classes relying on that information have the wrong regions, as mentioned above.
  • An addition to @6br FALDO comment: For single positions, which we will have for w1, an exact position is sufficient -> https://github.com/OBF/FALDO#single-position. We don't need a full FALDO region.

- use faldo.ExactPosition when appropriate
@dimatr
Copy link
Collaborator Author

dimatr commented May 11, 2020

@subwaystation
Copy link
Member

I just realized, the testing GFA has two 6 paths in it....... sorry @dimatr .
I will fix this and start a fresh testing.

@subwaystation
Copy link
Member

This would also explain the weird odgi bin outputs we observed. So no bug, I guess :) Just a dump user.

Copy link

@6br 6br left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for update. I have left some comments.

matrixcomponent/PangenomeSchematic.py Outdated Show resolved Hide resolved
matrixcomponent/ontology.py Show resolved Hide resolved
matrixcomponent/ontology.py Show resolved Hide resolved
@subwaystation
Copy link
Member

<pg/zoom1/component2/bin3/cell5> a vg:Cell ;
    vg:cellRegion <5/region/3> ;
    vg:inversionPercent 0 ;
    vg:positionPercent 1 .

We shoud have a faldo:exactPosition and not a vg:cellRegion here, right? Else, we could not connect to the exact positions.

The cleaned data:
svg_opg1_12052020.zip

I think CS is outputting more links compared to what odgi bin gives us. Short example:

<pg/zoom1/link2> a vg:Link ;
    vg:arrival <pg/zoom1/component3/bin5> ;
    vg:linkPaths <5>,
        <5-> .

But in the odgi bin output using -g, such a link does not exist. CS must make them up?! @6br @dimatr Can you confirm this?
By the way, I saw these links also in the usual CS output, so it is not an issue with the Turtle implementation.

@subwaystation
Copy link
Member

I decided to open an issue for my concerns about CS #48.

- every Link has linkRank numbered after the pair sort (component.id, [component.departures])
@dimatr
Copy link
Collaborator Author

dimatr commented May 12, 2020

I have made another update. The Cell - Region - ExactPosition relations are now understandable:

<pg/zoom1/component2/bin3/cell5> a vg:Cell ;
    vg:cellRegion <5/3-3> ;
    vg:inversionPercent 0 ;
    vg:positionPercent 1 .
<5/3-3> a faldo:Region ;
    faldo:begin <5/3> ;
    faldo:end <5/3> .
<5/3> a faldo:ExactPosition ;
    faldo:position 3 .

This way all the object go down to the atomic ones. vg:Cell should contain vg:Region - this is in the vg schema.

I use a short identifier <path1/2-3> instead of a longer version <path1/region/2-3> as in the example. Is it an acceptable approach?

@6br
Copy link

6br commented May 13, 2020

For me, <path1/2-3> is natural and acceptable. My concern is if there is a way to describe the path name as objects explicitly.

@6br
Copy link

6br commented May 13, 2020

@dimatr Is it possible to embed a path on faldo:ExactPosition?

<5/1> a faldo:ExactPosition ;
    faldo:position 1 ;
    faldo:reference 5 .

The orientation of path can be encoded like this (if the reference is positive strand)

_:1b   a  faldo:ExactPosition, faldo:ForwardStrandPosition ;
            faldo:position 1 ;
            faldo:reference ddbj:XXXDSDS .

Because if the path is inverted is encoded in vg:inversionPercent, I think we can dispense with faldo:ForwardStrandPosition.

@6br
Copy link

6br commented May 13, 2020

As @JervenBolleman suggests, it's better to set a path name as an independent subject.

<chr1/exactposition/1>  a  faldo:ExactPosition, faldo:ForwardStrandPosition ;
            faldo:position 1 ;
            faldo:reference <chr1> .
<chr1> a vg:Path .

@subwaystation
Copy link
Member

@dimatr here is the current output of odgi bin with the orientation encoded in the ranges:
svg1.gfa.og.w1.json.zip
Command line I used:

/home/heumos/git/odgi/bin/odgi bin -i svg1.gfa.og -f svg1.gfa.og.fasta -j -w 1 -s > svg1.gfa.og.w1.json

Now it should be possible to encode the orientation of a path for all cases.

Do you have any more questions? Need some feedback? Or comments from @6br ?

@6br
Copy link

6br commented May 26, 2020

Feel free to contact me at any time when you have any questions!

dimatr added 3 commits June 16, 2020 09:59
# Conflicts:
#	matrixcomponent/JSONparser.py
#	matrixcomponent/PangenomeSchematic.py
#	matrixcomponent/matrix.py
#	segmentation.py
@6br
Copy link

6br commented Jun 25, 2020

I agree to have a consistent IRI with spOdgi. Also, I also would like to replace pg with vg so it will be easer to sync with SpOdgi.

subwaystation and others added 3 commits June 26, 2020 14:25
* parallel write out of the gzip compressed ontology files - no memory leaks due to the utilization of separate processes!
* use the N-triples format to be 10x quicker than the Turtle (see format='nt' in PangenomeSchematic.py)
* be gentle with the string variables, do not use "a"+"b" but rather "{0}{1}".format(a,b). This does not create small temporary object and leads to a lower memory fragmentation/leak
* the RDF output folder is named *-rdf
@dimatr
Copy link
Collaborator Author

dimatr commented Jul 4, 2020

I have just pushed a big update targeting large datasets, please have a look.

@subwaystation
Copy link
Member

Nice work @dimatr ! Really cool.

PubSeq

I can get a RDF output for the current PubSeq data of ~1300 genomes in 22 minutes :)
But it gives me the following warning:

[06/07/2020 08:48:39 - INFO - __main__ - 97] Starting Segmentation process on 1190 Paths.
[06/07/2020 08:48:39 - INFO - __main__ - 247] Largest bin_id was 129679; Found 7427 dividers.
[06/07/2020 08:48:39 - INFO - __main__ - 252] Input has 33806 listed Links.  Segmentation eliminated 0.0% of them.
[06/07/2020 08:48:39 - INFO - __main__ - 254] Found 33806 unique links
[06/07/2020 08:48:39 - INFO - __main__ - 105] Created dividers
[06/07/2020 08:48:39 - INFO - __main__ - 118] Created 7999 components
[06/07/2020 08:49:08 - INFO - __main__ - 92] Populated Matrix and Occupancy per component per path.
[06/07/2020 08:49:08 - INFO - __main__ - 122] populated matrix
[06/07/2020 08:49:09 - INFO - __main__ - 152] Created 12010 LinkColumns
/home/ubuntu/software/anaconda3/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
Saved file2bin mapping to PubSeqTurtle_02062020/bin2file.json
[06/07/2020 09:09:13 - INFO - __main__ - 413] Finished processing the file relabeledSeqs_dedup_relabeledSeqs_dedup.odgi.w1.g.json

real    22m9.622s
user    72m51.954s
sys     1m37.888s

The resulting n3 is 108GB uncompressed! So no use putting it into a SPARQL endpoint on a machine with 64GB for now.

Pantograph Demo

I also ran it on the current Pantograph demo data which consists of 169 genomes. Here it only took 3 minutes ;) We have 8GB uncompressed and it fits into the RAM database of a Fuseki-Jena-Server which needs 60GB RAM.
Of course, I tried our default queries, and the one which should retrieve all the links, including arrival, departure, linkPaths
and linkZoomLevel did not succeed. I double-checked and for some links departure exists but not the corresponding arrival. Please check the attachment for the complete list. If you want to take a closer look at these files just ping me.
In my example tests, this problem did not occur. Might be due to the nature of the SARS data, but still, we would need these links fully serialized to RDF. Or not at all in the RDF. Whatever the biology says.
selectLinksResult.txt

Path Encoding

The encoding of the paths is working, but still not 100% in line with SpOdgi. When listing them, we are doing it correctly, but when referring to them, we are still omitting the path/.

@6br
Copy link

6br commented Jul 17, 2020

PubSeq

It's possible to tune parameters to avoid warnings.

Pantograph Demo

At first glance on selectLinksResult.txt, the same link identifier is shared among more than two bins, which might be the matter.
I suspect that something wrong in some boundary condition, but I haven't checked yet.

Path Encoding

I'll vote for adding the prefix path/ in the path identifier. I'll update it if there is no objection.

@subwaystation
Copy link
Member

PubSeq

Cool, that would be awesome.

Pantograph Demo

I think the problem is that only departures are synced so far, but not arrivals. @dimatr suggested implementing the same sanity check for arrivals as was done in departures. Maybe he can give you some tips.

Path Encoding.

Yes, I vote for it! I had that, but I added it only in the Path ontology object. Then I realized, we need to alter the code in every ontology object, that emits a path. So be careful, when changing this!

@6br
Copy link

6br commented Jul 18, 2020

PubSeq

[18/07/2020 14:07:07 - INFO - __main__ - 97] Starting Segmentation process on 1190 Paths.
[18/07/2020 14:07:07 - INFO - __main__ - 247] Largest bin_id was 129679; Found 7427 dividers.
[18/07/2020 14:07:07 - INFO - __main__ - 252] Input has 33806 listed Links.  Segmentation eliminated 0.0% of them.
[18/07/2020 14:07:07 - INFO - __main__ - 254] Found 33806 unique links
[18/07/2020 14:07:07 - INFO - __main__ - 105] Created dividers
[18/07/2020 14:07:07 - INFO - __main__ - 118] Created 7999 components
[18/07/2020 14:07:35 - INFO - __main__ - 92] Populated Matrix and Occupancy per component per path.
[18/07/2020 14:07:35 - INFO - __main__ - 122] populated matrix
[18/07/2020 14:07:35 - INFO - __main__ - 152] Created 12010 LinkColumns
Saved file2bin mapping to PubSeqTurtle_02062020/bin2file.json
[18/07/2020 14:31:28 - INFO - __main__ - 413] Finished processing the file relabeledSeqs_dedup_relabeledSeqs_dedup.odgi.w1.g.json

real    26m12.550s
user    19m48.632s
sys     0m22.386s

Now warning messages disappeared.

self.path = path

def ns_term(self):
return "path/{0}".format(self.path) # path1
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All path should be changed as like that.

@6br
Copy link

6br commented Jul 23, 2020

In pubseq data, path name looks like
<path/http://collections.lugli.arvadosapi.com/c=13a2b522d373d0f6bfd95a58f821c677+123/sequence.fasta>
Is it fine?

Furthermore, cell name looks like
<http://example.org/vg/zoom1/component4/bin3/cellhttp://collections.lugli.arvadosapi.com/c=e4c1e7ed3a305e5b49993a2c042c4572+123/sequence.fasta>

@6br
Copy link

6br commented Jul 26, 2020

cell name is now updated to
<http://example.org/vg/zoom1/component4/bin3/cell/path<pathname>/http://collections.lugli.arvadosapi.com/c=e4c1e7ed3a305e5b49993a2c042c4572+123/sequence.fasta>

@6br
Copy link

6br commented Jul 26, 2020

I added assertion on building pangenomic sequence, and many downstream looks missing.

[26/07/2020 05:44:52 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 114499
[26/07/2020 05:44:52 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 110940
[26/07/2020 05:44:53 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 101825
[26/07/2020 05:44:54 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 60195
[26/07/2020 05:44:54 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 33805
[26/07/2020 05:44:54 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 98595
[26/07/2020 05:44:55 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 42908
[26/07/2020 05:44:55 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 105785
[26/07/2020 05:44:55 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 121714
[26/07/2020 05:44:56 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 34421
[26/07/2020 05:44:57 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 55038
[26/07/2020 05:44:57 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 80784
[26/07/2020 05:44:57 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 104129
[26/07/2020 05:44:58 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 76952
[26/07/2020 05:44:58 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 101961
[26/07/2020 05:44:58 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 59040
[26/07/2020 05:45:00 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 42975
[26/07/2020 05:45:00 - INFO - matrixcomponent.PangenomeSchematic - 141] No downstream 30423

@6br
Copy link

6br commented Jul 28, 2020

No-arrivals error is fixed

I added the assertion if there is no arrivals. So we easily know when arrivals are missing.
I found this error is due to multi-processing. Instead of using multi-processing, I succeeded to get RDF without any errors.

Saved file2bin mapping to PubSeqTurtle_02062020_2/bin2file.json
[28/07/2020 07:59:10 - INFO - __main__ - 413] Finished processing the file relabeledSeqs_dedup_relabeledSeqs_dedup.odgi.w1.g.json
24310.23user 418.13system 6:27:39elapsed 106%CPU (0avgtext+0avgdata 316126732maxresident)k
264inputs+235223048outputs (0major+118306390minor)pagefaults 0swaps
$ ~/data/PubSeqTurtle_02062020/PubSeqTurtle_02062020_2/1-rdf$ ls -ltrh
total 13G
-rw-rw-r-- 1 ubuntu ubuntu 13G Jul 28 07:22 seq_chunk00000_bin1.nt.gz

@josiahseaman
Copy link
Member

This has become a very long branch. Is this PR ready for merging now? I'm not available to review at the moment. @subwaystation are you available for review?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants