Json parsing error #62

Skealz opened this issue Jan 29, 2024 · 9 comments

bug Something isn't working


Skealz commented Jan 29, 2024

🐛 Bug Report

 41%|████      | 50/122 [00:21<00:29,  2.44it/s]ERROR:root:impossible to get data from CDSfor query: https://catalogue$filter=OData.CSC.Intersects(area=geography'SRID=4326;POLYGON((127.25 31.23
333333333333, 127.2432586173411 31.09610933687195, 127.2230993925645 30.96020688251075, 127.1897164700251 30.826934785
17709, 127.1434313455158 30.69757652802221, 127.0846897700877 30.57337790177694, 127.0140574572236 30.45553500710589, 
126.9322146347078 30.34518273550423, 126.8399494936612 30.24338383967217, 126.7381505978291 30.1511186986255, 126.6277
983262274 30.06927587610977, 126.5099554315564 29.99864356324564, 126.3857568053111 29.93990198781754, 126.25639854815
62 29.89361686330824, 126.1231264508226 29.86023394076881, 125.9872239964614 29.84007471599226, 125.85 29.833333333333
34, 125.7127760035386 29.84007471599226, 125.5768735491774 29.86023394076881, 125.4436014518437 29.89361686330824, 125
.3142431946889 29.93990198781753, 125.1900445684436 29.99864356324564, 125.0722016737726 30.06927587610977, 124.961849
4021709 30.1511186986255, 124.8600505063388 30.24338383967217, 124.7677853652922 30.34518273550423, 124.6859425427764 
30.45553500710589, 124.6153102299123 30.57337790177694, 124.5565686544842 30.69757652802221, 124.5102835299749 30.8269
3478517708, 124.4769006074355 30.96020688251075, 124.4567413826589 31.09610933687195, 124.45 31.23333333333333, 124.45
67413826589 31.37055732979472, 124.4769006074355 31.50645978415591, 124.5102835299749 31.63973188148958, 124.556568654
4842 31.76909013864446, 124.6153102299123 31.89328876488973, 124.6859425427764 32.01113165956077, 124.7677853652922 32
.12148393116244, 124.8600505063388 32.2232828269945, 124.9618494021709 32.31554796804117, 125.0722016737726 32.3973907
905569, 125.1900445684436 32.46802310342103, 125.3142431946889 32.52676467884913, 125.4436014518437 32.57304980335843,
 125.5768735491774 32.60643272589785, 125.7127760035386 32.62659195067441, 125.85 32.63333333333333, 125.9872239964614
 32.62659195067441, 126.1231264508226 32.60643272589786, 126.2563985481562 32.57304980335843, 126.3857568053111 32.526
76467884914, 126.5099554315564 32.46802310342103, 126.6277983262274 32.3973907905569, 126.7381505978291 32.31554796804
117, 126.8399494936612 32.22328282699451, 126.9322146347078 32.12148393116244, 127.0140574572236 32.01113165956078, 12
7.0846897700877 31.89328876488974, 127.1434313455158 31.76909013864447, 127.1897164700251 31.63973188148959, 127.22309
93925645 31.50645978415593, 127.2432586173411 31.37055732979473, 127.25 31.23333333333334, 127.25 31.23333333333333))'
) and Collection/Name eq 'SENTINEL-1' and Attributes/OData.CSC.StringAttribute/any(att:att/Name eq 'productType' and a
tt/OData.CSC.StringAttribute/Value eq 'GRD') and ContentDate/Start gt 2022-09-05T06:30:00.000Z and ContentDate/Start l
t 2022-09-05T07:30:00.000Z&$top=1000&$expand=Attributes: Traceback (most recent call last):
  File "/home1/datahome/oarcher/storm_watch/conda3/lib/python3.8/site-packages/cdsodatacli/", line 500, in fet
    json_data = requests.get(url).json()
  File "/home1/datahome/oarcher/storm_watch/conda3/lib/python3.8/site-packages/requests/", line 898, in json
    return complexjson.loads(self.text, **kwargs)
  File "/home1/datahome/oarcher/storm_watch/conda3/lib/python3.8/json/", line 357, in loads
    return _default_decoder.decode(s)
  File "/home1/datahome/oarcher/storm_watch/conda3/lib/python3.8/json/", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home1/datahome/oarcher/storm_watch/conda3/lib/python3.8/json/", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

🔬 How To Reproduce

It does not seem to reproduce consistently at all (even with the same geodataframe as input). I wonder whether it is related to rate-limiting, because it seems to happen only when I send several requests more or less back to back.

I would need to capture the content of the JSON response...
A first thing to do in the cdsodatacli code is to print the content returned by the website when an error occurs, before parsing it as JSON.
I'm trying to do that on my side.
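
A minimal sketch of that idea, assuming the response body is available as a string (for example `requests.get(url).text`): when the body is not valid JSON, log it and return `None` instead of raising. The helper name `parse_json_or_log` is hypothetical and is not part of cdsodatacli.

```python
import json
import logging


def parse_json_or_log(body, url):
    """Parse `body` as JSON; on failure, log the raw body and return None.

    Hypothetical helper for illustration, not part of cdsodatacli.
    """
    try:
        return json.loads(body)
    except json.JSONDecodeError as err:
        # Show what the server actually sent (truncated to keep logs readable).
        logging.error("non-JSON response for %s (%s): %r", url, err, body[:500])
        return None


# Possible usage with requests (assumed call pattern):
#     response = requests.get(url)
#     json_data = parse_json_or_log(response.text, url)
```

Returning `None` lets the caller count the URL as failed without losing the server's message.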


conda list

📈 Expected behavior

📎 Additional context

@Skealz Skealz added the bug Something isn't working label Jan 29, 2024
Skealz commented Jan 29, 2024

So I modified the cdsodatacli code to see what the site was returning (I printed response.text). Here it is:
reponse data : 'upstream connect error or disconnect/reset before headers. reset reason: connection termination'
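
This plain-text body is enough to explain the traceback in the report: it is not JSON, so parsing fails on the very first character. A short check using only the standard library, with the body copied from the output above:

```python
import json

# Body reported in this comment: plain text, not JSON.
body = (
    "upstream connect error or disconnect/reset before headers. "
    "reset reason: connection termination"
)

try:
    json.loads(body)
    message = None
except json.JSONDecodeError as err:
    message = str(err)

print(message)  # Expecting value: line 1 column 1 (char 0)
```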

I think it could be related to this:

Skealz commented Jan 29, 2024

I'm not sure because I just ran it again and still get the error.

Skealz commented Jan 30, 2024

@agrouaze I use the cdsodatacli query command in a script launched with xargs, so several queries run in parallel (5 the last time I tried).

I'll try without multi-process to see if the error still occurs.

Can you give us the snippet to reproduce your query?

Skealz commented Feb 5, 2024

I deactivated the multi-process, and I still got the issue.
Maybe this is still due to rate limiting? It seems related to how quickly I send the requests: without multi-process, my script ended up with many more results than with it, which suggests I got far fewer JSON parse errors. I'm not sure.

To reproduce, you can try:

ls /home/datawork-cersat-public/cache/project/hurricanes/analysis/best-tracks-atcf-merge/b*.dat | grep "$1" | egrep -v 'b..[89].*' | egrep -v 'b....201[01]' | xargs -n 1 -P 1 -r  stdbuf -oL /home1/datahome/oarcher/storm_watch/ --minspeed=34 --ddeg=0.4 --ddegcatfactor=2 --outdir=/tmp/bt_test

I just tried it and got the error.

You can use this conda env to run the code: /home1/datahome/oarcher/storm_watch/conda_bt2sar_new

Skealz commented Feb 6, 2024

I made some kind of patch, in

import logging
import time
import traceback

import requests


def get_json_with_retries(url, retries=3, delay=2):
    """Attempt to get JSON data from URL with specified retries and delay between retries."""
    for attempt in range(retries):
        try:
            response = requests.get(url)
            response.raise_for_status()  # Raises HTTPError for bad responses
            return response.json(), True
        except requests.exceptions.HTTPError as e:
            logging.error("HTTP Error for URL %s: %s", url, e)
        except requests.exceptions.ConnectionError as e:
            logging.error("Connection Error for URL %s: %s", url, e)
        except requests.exceptions.Timeout as e:
            logging.error("Timeout Error for URL %s: %s", url, e)
        except requests.exceptions.RequestException as e:
            logging.error("Request Exception for URL %s: %s", url, e)
        except KeyboardInterrupt:
            logging.info("Operation cancelled by user.")
            raise  # stop retrying when the user interrupts
        except Exception:
            logging.error("An error occurred for URL %s: %s", url, traceback.format_exc())

        # Log the attempt and wait before retrying
        logging.info("Attempt %d for URL %s failed, retrying in %d seconds...", attempt + 1, url, delay)
        time.sleep(delay)

    return None, False

def fetch_one_url(url, cpt, index, cache_dir):
    """
    Args:
        url (str)
        cpt (defaultdict(int))
        index (int)
        cache_dir (str)

    Returns:
        cpt (defaultdict(int))
        collected_data (pandas.GeoDataframe)
    """
    json_data = None
    collected_data = None
    if cache_dir is not None:
        cache_file = get_cache_filename(url, cache_dir)
        if os.path.exists(cache_file):
            cpt["cache_used"] += 1
            logging.debug("cache file exists: %s", cache_file)
            with open(cache_file, "r") as f:
                json_data = json.load(f)
                collected_data = process_data(json_data)
    if json_data is None:
        # the cache cannot be used (cache_dir=None, or no associated JSON file)
        logging.debug("no cache file -> go for query CDS")
        cpt["urls_tested"] += 1
        try:  # the matching except clauses are in the unchanged rest of the function
            json_data, success = get_json_with_retries(url, retries=10, delay=2)
            if not success:
                cpt["urls_KO"] += 1
                logging.error("Couldn't get data from API after multiple tries")
            else:
                # replaces: json_data = requests.get(url).json()
                cpt["urls_OK"] += 1
... rest of the function is the same
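
If the failures do come from rate limiting, as suspected earlier in this thread, a fixed 2-second delay could be replaced by a delay that grows with each attempt. The sketch below is only an illustration of exponential backoff; `backoff_delays` and all its parameters are hypothetical names, not part of cdsodatacli or of the patch above.

```python
import random


def backoff_delays(retries, base=2.0, cap=60.0, jitter=0.0, rng=random.random):
    """Return one wait time (seconds) per retry.

    Each delay is base * 2**attempt, capped at `cap` before jitter is applied.
    `jitter` adds up to that fraction of random extra delay so that parallel
    workers do not all retry at the same moment.
    """
    delays = []
    for attempt in range(retries):
        delay = min(cap, base * (2 ** attempt))
        delay += delay * jitter * rng()
        delays.append(delay)
    return delays


print(backoff_delays(5))  # [2.0, 4.0, 8.0, 16.0, 32.0]
```

Each value could be passed to `time.sleep` in place of the fixed `delay` in `get_json_with_retries`; a non-zero `jitter` may help when several queries are launched in parallel through xargs.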

agrouaze commented Feb 7, 2024

@Skealz The snippet you provided doesn't seem to be related to cdsodatacli.
Regarding the proposed source change, could you open a PR so that we can review it easily?

Skealz commented Feb 7, 2024 via email

