get_eurostat fails with correct table ID #293

Open
bt-hb opened this issue Jan 29, 2024 · 10 comments
Labels
API issues (Issues related to functionalities / limitations of the Eurostat API), documentation

Comments

@bt-hb

bt-hb commented Jan 29, 2024

Many thanks for the very useful package.

Up until early January (I think prior to the latest update of the namq_10_gdp dataset on 26th January), I was able to pull quarterly GDP data using the code:
data <- eurostat::get_eurostat("namq_10_gdp")

However, now I have been getting the following error message:
Error in eurostat::get_eurostat("namq_10_gdp") : get_eurostat_raw fails with the id namq_10_gdp

I have double checked the dataset ID using search_eurostat and manually via the website and I believe it is correct. https://ec.europa.eu/eurostat/databrowser/view/namq_10_gdp/default/table?lang=en&category=euroind.ei_qna.ei_namq_10_ma

Other datasets download fine -- for example nama_10_gdp works -- and check_access_to_data() is TRUE.

For info, I am running v4.0 of the eurostat package with RStudio v2023.03.1.

@CubicTom

I have the same issue with different codes. Any news or hints on how to resolve this?

@pitkant
Member

pitkant commented Apr 25, 2024

@bt-hb @CubicTom Thank you for reporting.

I tried replicating data <- eurostat::get_eurostat("namq_10_gdp") with package version 4.0.0 and was able to download the dataset:

data <- eurostat::get_eurostat("namq_10_gdp")
trying URL 'https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data/namq_10_gdp?format=TSV&compressed=true'
downloaded 17.6 MB

Table namq_10_gdp cached at /var/folders/f4/h_r3y60n0nn0qm6qx5hnx1s00000gn/T//RtmpflpkBC/eurostat/43b7bc3103625c870037c912b6b61df5.rds
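For anyone who wants to check by hand what the server sends back for a given table, the request URL shown in the log above can be rebuilt from the table ID. This is only a sketch: `build_sdmx_tsv_url` is a hypothetical helper, not a function of the eurostat package, and the URL pattern is copied from the log line rather than taken from any documentation.

```r
# Hypothetical helper (not part of the eurostat package).
# The base URL and query string are copied verbatim from the download log above.
build_sdmx_tsv_url <- function(id) {
  stopifnot(is.character(id), length(id) == 1L, nzchar(id))
  base <- "https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data/"
  paste0(base, id, "?format=TSV&compressed=true")
}

# Build the URL for the table discussed in this issue and print it.
url <- build_sdmx_tsv_url("namq_10_gdp")
print(url)
```

Opening the printed URL in a browser or downloading it manually shows whether the server returns the compressed data or something else.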

@CubicTom what datasets do you get the issue with, or is it with all available datasets? If all the datasets that you fail to download are on the larger side (such as the quarterly data namq_10_gdp with 6.4 M rows) it might give me a clue on what the issue is.

@CubicTom

CubicTom commented Apr 25, 2024

@pitkant Thanks for your reply! Using version 4.0.0 I can confirm that namq_10_gdp can be downloaded.

Two series IDs that are reproducing the error for me are sbs_na_ind_r2 and sbs_ovw_act, e.g.:

## get annual detailed enterprise statistics for industry from Eurostat
eurostat_data <- get_eurostat(id="sbs_ovw_act", time_format="num", keepFlags = TRUE)
 
Error in get_eurostat(id = "sbs_ovw_act", time_format = "num", keepFlags = TRUE) :
  get_eurostat_raw fails with the id sbs_ovw_act

The last time I used the function successfully with these IDs was March 28th 2024.

@pitkant
Member

pitkant commented Apr 25, 2024

Right, thanks. I think I now understand what the problem is. If you look at that dataset in the Eurostat data browser you can see that it's a rather big one, with 920 different categories for different types of activities alone.

The returned object of those two queries is not the data but an XML file, like 31a8c39c-c1e3-4ab6-b698-1ddb7aa851e5.xml:

<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/">
  <env:Header/>
  <env:Body>
    <ns0:syncResponse xmlns:ns0="http://estat.ec.europa.eu/disschain/soap/extraction">
      <processingTime>36</processingTime>
      <queued>
        <id>31a8c39c-c1e3-4ab6-b698-1ddb7aa851e5</id>
        <status>PROCESSING</status>
      </queued>
    </ns0:syncResponse>
  </env:Body>
</env:Envelope>

This is described in the Eurostat help pages: API - Detailed guidelines - Asynchronous API
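A rough sketch of how such a "queued" response could be recognised in base R, using the XML shown above as input. `parse_async_response` is a hypothetical name and nothing like it exists in the eurostat package. The sketch uses plain regular expressions rather than a real XML parser and assumes the `<id>` and `<status>` values contain no whitespace, as in the example.

```r
# Hypothetical helper: pull the request id and status out of a queued response.
# Assumption: the response looks like the XML quoted above.
parse_async_response <- function(xml_text) {
  # Collapse to a single string so the patterns can span line breaks.
  txt <- paste(xml_text, collapse = "\n")
  extract <- function(tag) {
    pattern <- sprintf("<%s>[[:space:]]*([^<[:space:]]+)[[:space:]]*</%s>", tag, tag)
    m <- regmatches(txt, regexec(pattern, txt))[[1]]
    # m is c(full match, captured group), or empty when nothing matched.
    if (length(m) < 2L) NA_character_ else m[2]
  }
  list(
    queued = grepl("<queued>", txt, fixed = TRUE),
    id     = extract("id"),
    status = extract("status")
  )
}

# Trimmed version of the response quoted above.
response <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  "<queued>",
  "  <id>31a8c39c-c1e3-4ab6-b698-1ddb7aa851e5</id>",
  "  <status>PROCESSING</status>",
  "</queued>"
)
str(parse_async_response(response))
```

For anything beyond a quick diagnostic, a proper XML parser would be the more robust choice.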

When I use the abovementioned URI for an asynchronous request, I get the following message:

<?xml version="1.0"?>
<S:Fault xmlns:S="http://schemas.xmlsoap.org/soap/envelope/">
  <faultcode>100</faultcode>
  <faultstring>
DATA_NOT_YET_AVAILABLE: Requested data is not yet available for download. Check the status of your request.
</faultstring>
</S:Fault>

And so on. The eurostat package does not currently have the functionality to handle asynchronous requests. This might get implemented sometime in the future, but I have to be frank that it's not very high on my priority list right now. PRs are of course always welcome.
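For illustration only, handling an asynchronous request would roughly mean polling until the status is no longer PROCESSING. The sketch below shows that loop in base R without touching the network. `wait_until_available` and `fetch_status` are hypothetical names, the status check itself is left to the caller, and `"FINISHED_PLACEHOLDER"` is a made-up value standing in for whatever the real API reports when the data is ready. Treating PROCESSING as the only "keep waiting" status is also an assumption.

```r
# Hypothetical polling loop. `fetch_status` is a caller-supplied function that
# returns the current status as a string; how it talks to the server is not shown.
wait_until_available <- function(fetch_status, max_tries = 10L, wait_seconds = 5) {
  for (i in seq_len(max_tries)) {
    status <- fetch_status()
    # Assumption: anything other than "PROCESSING" means we can stop waiting.
    if (!identical(status, "PROCESSING")) {
      return(list(status = status, tries = i))
    }
    if (i < max_tries) Sys.sleep(wait_seconds)
  }
  list(status = "PROCESSING", tries = max_tries)
}

# Fake status source for demonstration: reports "PROCESSING" n times, then a
# placeholder terminal status (made up, not a documented Eurostat value).
make_fake_status <- function(n_processing) {
  calls <- 0L
  function() {
    calls <<- calls + 1L
    if (calls <= n_processing) "PROCESSING" else "FINISHED_PLACEHOLDER"
  }
}

result <- wait_until_available(make_fake_status(2L), max_tries = 5L, wait_seconds = 0)
str(result)
```

The loop gives up after `max_tries` attempts instead of waiting forever, which seems the safer default for an interactive R session.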

@CubicTom

@pitkant Thanks for the explanation. Any idea why this has only stopped working recently? I have been using this request regularly for about two or three years now without any issues...

@pitkant
Member

pitkant commented Apr 25, 2024

Good question! It may be that Eurostat has changed something on their side. The Asynchronous API guidelines page also seems to be more detailed now than it was when I last checked it. See especially the "More details on asynchronous trigger and thresholds..." collapsible section there:

When a data request is initiated, the system first checks if the exact same request was already performed previously and if applicable lookup the data directly from an internal cache and return it as a response.
If the data is not cached, the data needs to be extracted and the system estimates the related "extraction cost" in term of potential number of data cells returned.
To compute this cost, the system resolves the number of positions matched by each dimension filter.

If you only need a subset of the data then filtering it accordingly might solve your problem.

I will have to make sure that a sensible message is displayed to the end user if the server is attempting to give an asynchronous response.
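One possible shape for such a check, sketched in base R: look at the first bytes of the downloaded file and stop with a readable message if it starts like an XML document rather than data. `starts_with_xml` and `explain_failed_download` are hypothetical names, not functions of the eurostat package, and the sketch assumes the XML message arrives uncompressed, as the example earlier in this thread suggests.

```r
# Hypothetical check: does the file begin with "<" (ignoring leading whitespace)?
starts_with_xml <- function(path, n = 64L) {
  # raw = TRUE makes sure the bytes are read exactly as stored on disk.
  con <- file(path, open = "rb", raw = TRUE)
  on.exit(close(con))
  bytes <- readBin(con, what = "raw", n = n)
  whitespace <- as.raw(c(0x20, 0x09, 0x0a, 0x0d))
  bytes <- bytes[!(bytes %in% whitespace)]
  length(bytes) > 0L && bytes[1] == as.raw(0x3c)  # 0x3c is "<"
}

# Hypothetical wrapper that turns the check into a user-facing error.
explain_failed_download <- function(path, id) {
  if (starts_with_xml(path)) {
    stop(sprintf(
      paste0("The server returned an XML message instead of data for '%s'. ",
             "The request may have been queued as asynchronous; ",
             "filtering the query might help."),
      id
    ), call. = FALSE)
  }
  invisible(TRUE)
}

# Demonstration with a temporary file that mimics the queued response.
xml_path <- tempfile(fileext = ".xml")
writeLines(c('<?xml version="1.0" encoding="UTF-8"?>', "<queued/>"), xml_path)
message(tryCatch(
  explain_failed_download(xml_path, "sbs_ovw_act"),
  error = function(e) conditionMessage(e)
))
```

Checking bytes instead of parsing keeps the test cheap and avoids any dependency on an XML package.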

@CubicTom

@pitkant If using filters, get_eurostat will not allow me to get the flags. If you have any idea how to retrieve those, I would happily filter the query.

pitkant added the "API issues" label (Issues related to functionalities / limitations of the Eurostat API) on Apr 26, 2024
@pitkant
Member

pitkant commented Apr 26, 2024

@CubicTom in that case it seems I should hurry with the 4.1 release, which adds the option to make SDMX queries with filters instead of directing filtered queries to API Statistics.

Retrieving some big datasets can quite quickly reach "between 500 000 cells and 5 000 000 cells", the level where async kicks in. Above 5 000 000 cells the query needs to be filtered, because otherwise it seems that it won't play nice at all: "if above 5 000 000 cells, a client request error is returned and more filters need to be added to the extraction query to reduce its estimated cost."

EDIT:

How stringent is this limit of 500 000 cells in practice? It of course depends on the number of values and so on, but also on the number of categories:

As an example, if a dataset has 3 dimensions with respectively 5, 10 and 20 positions available for each dimension, the dataset cardinality is 5 x 10 x 20 = 1000 cells.
An extraction request asking for:

  • 3 positions for the first dimension
  • 2 positions for the second dimension
  • no filtering for the third dimension

will potentially match 3 x 2 x 20 = 120 cells which is also the estimated cost of this request.
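The arithmetic in the quoted example, together with the thresholds quoted earlier in this comment, can be sketched in a few lines of base R. Both function names are hypothetical, and how the server treats values exactly at 500 000 or 5 000 000 cells is a guess. That requests below 500 000 cells are answered synchronously is also an assumption, not something the quoted text states.

```r
# Estimated extraction cost: product of the positions matched per dimension.
estimate_extraction_cost <- function(positions_matched) {
  prod(positions_matched)
}

# Classify a request by the thresholds quoted above.
# Assumption: the boundary values themselves fall into the asynchronous band.
classify_request <- function(cost, async_from = 5e5, reject_above = 5e6) {
  if (cost > reject_above) {
    "client error: add more filters"
  } else if (cost >= async_from) {
    "asynchronous"
  } else {
    "synchronous"
  }
}

# The example from the Eurostat documentation quoted above.
full_cardinality <- estimate_extraction_cost(c(5, 10, 20))  # 5 x 10 x 20
filtered_cost    <- estimate_extraction_cost(c(3, 2, 20))   # 3 x 2 x 20
print(c(full = full_cardinality, filtered = filtered_cost))
print(classify_request(filtered_cost))
```

Because the cost is a product, trimming a single large dimension (for example the activity categories) reduces it far more than trimming a small one.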

Testing with some items from the eurostat TOC, I noticed that datasets with under 1 million values were handled normally, whereas datasets with over 1 million values returned an XML response. I was writing the faster data.table functionalities with datasets that have 100+ million values in mind, so there has definitely been a policy change with regard to accessing data.

@pitkant
Member

pitkant commented Apr 29, 2024

As posted in #304 :

@CubicTom I have received the following message from Eurostat user support:

"Last week maintenance introduced major changes to internal storage that required a big step forward and to renew the cache from scratch.

Sadly this release also contained a performance issue that remained undetected that is currently slowing down the repopulation process.

A hotfix was applied today at around 13:30."

So I think that the issues are related to big datasets not being cached as they previously were. Excerpt from the Eurostat documentation:

"When a data request is initiated, the system first checks if the exact same request was already performed previously and if applicable lookup the data directly from an internal cache and return it as a response."

I'm not sure if today's hotfix has renewed the cache for all files or not (probably not, sounds like a process that takes some time) but maybe something has changed for the better now.

@CubicTom

@pitkant Thank you so much for this info! Indeed, the old code that I had already commented out in favor of another (but waaaaaay slooooower) procedure now works flawlessly again! 🚀

Best wishes and keep up the good work,
Thomas


3 participants