WS285 data missing for Affymetrix chip GPL19230 #238
Comments
Hi @sdiamantakis, fyi
These are all the probes missing genes:
We create these mappings during the build: Microarray_results : "SMD_0C24D10.5" Microarray_results : "SMD_0C26F1.3"
Using the script map_microarray.pl
The probes belong to the genome array: Microarray_results : "GPL19230_18574193"
My reply to Wen: During the build we first map CDSs, transcripts, and pseudogenes to the Oligo_set and PCR_product objects. We then map these results onto the Microarray_results objects using the xref from Oligo_set/PCR_product to Microarray_results, and add the Gene annotation based on xrefs from the mapped entities. Looking at the citace dump used for the latest build, I cannot see any link between a Microarray_results object and the Microarray GPL19230. I believe this is the root cause of the problem. Hopefully that all makes sense. Please let me know if you have any more questions or if you still think that the problem lies in the build process.
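For anyone following along, the two-step propagation described above can be sketched roughly as follows. This is only an illustration in Python, not what map_microarray.pl actually does; the function name, the dictionary layout, and every identifier in the example are hypothetical.

```python
def propagate_genes(all_results, probe_to_features, feature_to_gene, probe_to_results):
    """Return {Microarray_results name: sorted list of gene IDs}.

    all_results       -- every Microarray_results name in the database
    probe_to_features -- Oligo_set/PCR_product -> mapped CDSs, transcripts, pseudogenes
    feature_to_gene   -- mapped feature -> Gene (hypothetical xref table)
    probe_to_results  -- Oligo_set/PCR_product -> Microarray_results (the xref step)
    """
    # Start every result with an empty gene set so that objects with no
    # probe xref at all still show up in the output as unmapped.
    result_genes = {name: set() for name in all_results}
    for probe, results in probe_to_results.items():
        features = probe_to_features.get(probe, ())
        genes = {feature_to_gene[f] for f in features if f in feature_to_gene}
        for name in results:
            result_genes.setdefault(name, set()).update(genes)
    return {name: sorted(genes) for name, genes in result_genes.items()}


def unmapped_results(result_genes):
    """Names of Microarray_results objects that ended up with no Gene."""
    return sorted(name for name, genes in result_genes.items() if not genes)


# Toy data: result_C has no probe xref, result_B's probe hit no feature.
example = propagate_genes(
    all_results=["result_A", "result_B", "result_C"],
    probe_to_features={"oligo_A": ["CDS_1"], "oligo_B": []},
    feature_to_gene={"CDS_1": "GENE_1"},
    probe_to_results={"oligo_A": ["result_A"], "oligo_B": ["result_B"]},
)
print(unmapped_results(example))
```

In this toy model a Microarray_results object ends up without a Gene either because nothing links it to a probe, or because its probe was never mapped to a feature. Both cases look identical in the final output.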
From Wen: I did not know that the names of the Microarray_results must match the platform name. All GPL19230 Microarray_results link to Microarray "Affymetrix_C.elegans_Genome_Array", which is the original/generic name for all platforms by Affymetrix. You cannot see Microarray_results under Microarray. There is no XREF in the schema.

20 years ago Affymetrix only had one major platform, called GPL200 (in GEO), and we named it "Affymetrix_C.elegans_Genome_Array". Sometimes people custom-made a new chip using the same probes, which gets a new platform name in GEO, but WormBase does not create a new Microarray object for that. Instead we group all of them into the same generic platform in WormBase. But GPL19230 is a platform with all new probes, totally unrelated to GPL200. We created a new Microarray object for it, so it makes sense to point it to its own platform. I can update this in WS286 so that they point to the Microarray object "GPL19230".

Looking at the rest of the Microarray_results without a Gene link, I saw that 841 GPL200 objects (named as *_at) have no Gene link, while the remaining 21732 GPL200 Microarray_results have Gene links. All of them point to Microarray "Affymetrix_C.elegans_Genome_Array". So these 841 objects indeed map nowhere to the genome?

And the Pristionchus pacificus microarray platform GPL14372 has 29547 Microarray_results. All of them point to Microarray "GPL14372" but none got a Gene link. This is really strange because I am pretty sure that they used to be fine. I will dig into the earlier archives of WormBase to see when we lost them.

Wen

My reply: After further investigation, I've realised that the issue with the GPL19230 array is that we have not mapped the probesets against the genome. This is done outside of the normal build cycle and it looks like it hasn't been done since December 2018, so we will need to investigate whether there are additional arrays that have not been mapped. We'll work on remedying this before the next build starts.
Regarding the Pristionchus microarray GPL14372: although this has ~29,500 Microarray_results that don't have any mapped gene, it has ~62,000 that are mapped to genes. If this hasn't always been the case and some gene mappings have been lost, then there must be a separate issue causing the discrepancy. As for the GPL200 (*_at) objects without a gene link, it looks to me like the vast majority of these are mapped to the genome. Is it possible that these simply do not overlap genes, or do those numbers not make sense to you? Let me know if you have any further questions / thoughts. Thanks,
Emailed Affymetrix regarding issues with missing probe sequences and mismatching information. Kevin previously emailed them regarding the same issues several years ago, but 2nd time lucky perhaps! "We, at WormBase, would like to align the probe sequences for the GPL19230 dataset against the latest version of the C. elegans assembly. However, we have run into a number of issues. On the GEO page (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL19230) the pgf file, which contains probe sequences, refers to a different probe set to the mps file. It is the latter file, which does not contain sequence information, that contains the probe set that we want. Confusingly, the probe set annotation file linked to from the Thermofisher page for the GeneChip C. elegans Gene 1.0 ST Array (https://www.thermofisher.com/order/catalog/product/902160) matches the probe set in the pgf file, while the annotation table on the GEO page matches the probe set in the mps file. Furthermore, although we really need the probe sequences themselves, I tried extracting sequences from the specified assembly (UCSC.ce6) using the chromosomal coordinates given in the GEO page table, but a number of the coordinates lie outside of the chromosome limits for that assembly. Would it be possible for you to send us the probe sequences for this array?"
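The out-of-range coordinate problem mentioned in the email can be checked with something like the sketch below. It is a minimal illustration only: the tuple layout, the function name, and the chromosome lengths are hypothetical toy values, not the real ce6 sizes or the actual GEO table columns, and it assumes 1-based inclusive coordinates.

```python
def out_of_bounds(probes, chrom_lengths):
    """Return IDs of probes whose coordinates cannot lie on the assembly.

    probes        -- iterable of (probe_id, chrom, start, end), 1-based inclusive
    chrom_lengths -- {chromosome name: length in bp} for the assembly
    """
    bad = []
    for probe_id, chrom, start, end in probes:
        length = chrom_lengths.get(chrom)
        # Flag unknown chromosomes, inverted ranges, and ranges that
        # start before position 1 or run past the chromosome end.
        if length is None or start < 1 or start > end or end > length:
            bad.append(probe_id)
    return bad


# Toy assembly with a single 1000 bp chromosome (made-up numbers).
toy_lengths = {"chrI": 1000}
toy_probes = [
    ("p1", "chrI", 10, 34),     # fine
    ("p2", "chrI", 990, 1014),  # runs past the chromosome end
    ("p3", "chrX", 1, 25),      # chromosome not in the assembly
]
print(out_of_bounds(toy_probes, toy_lengths))
```

A probe flagged this way usually means the coordinates were computed on a different assembly than the one named in the table, which is why the sequences themselves are the safer thing to ask for.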
Among the 813514 Microarray_results objects in WS285, 178124 objects have no gene mapping, including all the probes from the Affymetrix chip GPL19230.
Would you take a look and find out what went wrong? Although these do not affect the website of WormBase, I rely on the probe-gene mapping to generate SPELL.
It probably has been going on for a long time. I only noticed it today because it only affects a small percentage of datasets in SPELL.