ontology repacking and exporting - relates to #34 #44
base: master
Hi @dimatr the good news first: For better debugging, let's use the files given in svg_opg1.zip. It is a small example graph I came up with.
And got cs_svg1.zip.
# Conflicts: # matrixcomponent/PangenomeSchematic.py
… edges definition for Component and Bin containers; Links are a part of ZoomLevel component
Thanks for your implementation!
<5/region/28-28> a faldo:Region ;
    faldo:begin <5/position/28> ;
    faldo:end <5/position/28> .

which does not make sense, because the overall pangenome nucleotide length is 17. See
|
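The inconsistency described above (a region at position 28 in a pangenome of length 17) is the kind of error a small bounds check could catch before any triples are written. The following is only a sketch: the function name `region_in_bounds` is hypothetical and not part of the repository, and it assumes 1-based, inclusive coordinates.

```python
def region_in_bounds(begin, end, pangenome_length):
    """Return True when the region [begin, end] lies inside the pangenome.

    Assumption: coordinates are 1-based and inclusive on both ends.
    """
    return 1 <= begin <= end <= pangenome_length


# The region from the comment above: position 28 in a pangenome of length 17.
print(region_in_bounds(28, 28, 17))
# A region that fits.
print(region_in_bounds(2, 3, 17))
```

Such a check could be turned into an assertion at export time, so that out-of-range regions fail loudly instead of ending up in the RDF output.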
- use faldo.ExactPosition when appropriate
|
I just realized, the testing GFA has two |
This would also explain the weird |
Thanks for update. I have left some comments.
We should have a The cleaned data: I think CS is outputting more links compared to what
But in the |
I decided to open an issue for my concerns about CS #48. |
- every Link has linkRank numbered after the pair sort (component.id, [component.departures])
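One possible reading of this commit message is that links are ordered by the pair (component id, departure) and then numbered consecutively. The sketch below illustrates that reading only; the function name `assign_link_ranks`, the dictionary layout, and the choice to start ranks at 1 are all assumptions, not taken from the repository.

```python
from typing import Dict, List, Tuple


def assign_link_ranks(
    departures_by_component: Dict[int, List[str]]
) -> List[Tuple[int, str, int]]:
    """Number links after sorting them by (component id, departure).

    Assumptions: departures are identified by strings, and ranks start at 1.
    """
    pairs = sorted(
        (component_id, departure)
        for component_id, departures in departures_by_component.items()
        for departure in departures
    )
    return [
        (component_id, departure, rank)
        for rank, (component_id, departure) in enumerate(pairs, start=1)
    ]


ranked = assign_link_ranks({2: ["b"], 1: ["z", "a"]})
for component_id, departure, rank in ranked:
    print(component_id, departure, rank)
```

Sorting before numbering makes the ranks deterministic, so two exports of the same schematic would assign the same rank to the same link.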
I have made another update. The
This way all the objects go down to the atomic ones. vg:Cell should contain vg:Region - this is in the vg schema. I use a short identifier <path1/2-3> instead of a longer version <path1/region/2-3> as in the example. Is it an acceptable approach? |
For me, |
@dimatr Is it possible to embed a path on
The orientation of a path can be encoded like this (if the reference is on the positive strand)
Because if the path is inverted is encoded in |
As @JervenBolleman suggests, it's better to set a path name as an independent subject.
@dimatr here is the current output of
Now it should be possible to encode the orientation of a path for all cases. Do you have any more questions? Need some feedback? Or comments from @6br ? |
Feel free to contact me at any time when you have any questions! |
# Conflicts: # matrixcomponent/JSONparser.py # matrixcomponent/PangenomeSchematic.py # matrixcomponent/matrix.py # segmentation.py
… in each Position. Add Path write out
I agree to have a consistent IRI with |
* parallel write-out of the gzip-compressed ontology files - no memory leaks thanks to the use of separate processes!
* use the N-Triples format, which is 10x quicker than Turtle (see format='nt' in PangenomeSchematic.py)
* be gentle with string variables: do not use "a"+"b" but rather "{0}{1}".format(a,b). This avoids creating small temporary objects and leads to lower memory fragmentation/leaks
* the RDF output folder is named *-rdf
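The approach listed above (one process per output file, gzip compression, N-Triples lines built with `str.format`) can be sketched with the standard library alone. This is not the code in PangenomeSchematic.py: the names `nt_line`, `write_chunk` and `write_all`, the `chunk{N}.nt.gz` file naming, and the restriction to IRI-only triples are assumptions made for illustration.

```python
import gzip
import os
from multiprocessing import Pool


def nt_line(subject, predicate, obj):
    """Format one N-Triples line; all three terms are assumed to be absolute IRIs."""
    return "<{0}> <{1}> <{2}> .\n".format(subject, predicate, obj)


def write_chunk(task):
    """Write one chunk of triples to a gzip-compressed N-Triples file."""
    path, triples = task
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for subject, predicate, obj in triples:
            handle.write(nt_line(subject, predicate, obj))
    return path


def write_all(chunks, out_dir, workers=2):
    """Write each chunk in a separate worker process and return the file paths."""
    os.makedirs(out_dir, exist_ok=True)
    tasks = [
        (os.path.join(out_dir, "chunk{0}.nt.gz".format(index)), triples)
        for index, triples in enumerate(chunks)
    ]
    with Pool(processes=workers) as pool:
        return pool.map(write_chunk, tasks)
```

When run as a script, `write_all` should be called from under an `if __name__ == "__main__":` guard so that it also works on platforms where worker processes are spawned rather than forked. Because each worker exits once the pool is closed, any memory it accumulated is returned to the operating system.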
I have just pushed a big update targeting large datasets, please have a look. |
Nice work @dimatr! Really cool.
PubSeq
I can get an RDF output for the current PubSeq data of ~1300 genomes in 22 minutes :)
The resulting
Pantograph Demo
I also ran it on the current Pantograph demo data which consists of 169 genomes. Here it only took 3 minutes ;) We have 8GB uncompressed and it fits into the RAM database of a Fuseki-Jena-Server which needs 60GB RAM.
Path Encoding
The encoding of the paths is working, but still not 100% in line with SpOdgi. When listing them, we are doing it correctly, but when referring to them, we are still omitting the |
PubSeq
It's possible to tune parameters to avoid warnings.
Pantograph Demo
At first glance on
Path Encoding
I'll vote for adding the prefix |
PubSeq
Cool, that would be awesome.
Pantograph Demo
I think the problem is that only departures are synced so far, but not arrivals. @dimatr suggested implementing the same sanity check for arrivals as was done in departures. Maybe he can give you some tips.
Path Encoding
Yes, I vote for it! I had that, but I added it only in the |
PubSeq
Now warning messages disappeared. |
        self.path = path

    def ns_term(self):
        return "path/{0}".format(self.path)  # path1
All paths should be changed like that.
In pubseq data, path name looks like Furthermore, cell name looks like |
cell name is now updated to |
I added an assertion on building the pangenomic sequence, and many downstreams appear to be missing.
No-arrivals error is fixed
I added an assertion for the case where there are no arrivals, so we can easily tell when arrivals are missing.
This has become a very long branch. Is this PR ready for merging now? I'm not available to review at the moment. @subwaystation are you available for review? |
Please review.
--do-ttl True
{bin_width}-turtle
folder beside the main .json output folder, e.g. 1000-turtle