Github regular timeout issue #117

Open
kptdobe opened this issue Apr 6, 2020 · 2 comments
Labels
question Further information is requested

Comments

@kptdobe
Contributor

kptdobe commented Apr 6, 2020

We see a regular timeout happening when requesting raw content from GitHub. The default timeout is set to 1000 ms. In the discussion that started on Slack, we seem to have several tracks to "explore":

  • extend the timeout to x000ms
    • drawback: the timeout might still occur, probably less frequently, but this only pushes the problem somewhere else
  • implement a retry in the Downloader - on timeout, just retry a few times before really failing
  • add an MD cache
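The retry track above, combined with the exponential backoff suggested later in the thread, could be sketched as a small wrapper. This is an illustrative sketch, not existing helix code; `withRetry` and its parameters are hypothetical names:

```javascript
// Hypothetical sketch of the proposed Downloader retry: run an async
// operation, and on failure wait with exponential backoff before retrying.
async function withRetry(fn, { retries = 3, baseDelayMs = 100 } = {}) {
  let lastErr;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (attempt === retries) break;
      // backoff doubles each attempt: 100 ms, 200 ms, 400 ms, ...
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastErr;
}
```

In the Downloader, `fn` would be the raw.githubusercontent.com fetch; a timeout would surface as a rejection and trigger the next attempt.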

Initial discussion transcript:

@kptdobe GitHub was too slow to answer (>1s):
GET https://raw.githubusercontent.com/davidnuescheler/lr-landing/d90a42b1babf33d430450e05cbd3dc1edc6b7135/index.md timed out after 1000 ms
@rofe this is happening quite often...
@stefan-guggisberg To be precise: It's Fastly's connect timeout, i.e. setting up a secure connection to raw.github.com took more than 1s, which is quite long.
@MarquiseRosier This looks like dispatch territory? If it happens too often, maybe we can speed it up there. Perhaps using helix-fetch? (edited)
@kptdobe I do not understand. The action triggering this error is
/helix-pages/51eb728d800d5307212e9f96a65a6f9ff7ec1e47/html
how would that be Fastly?
Wherever it is, it happens "frequently", we should consider having a retry or something to handle that case.
@stefan-guggisberg Ok, the error you quoted led me on the wrong track. We do have a raw.github.com origin configured in the helix-pages service. The connect timeout for this origin is 1000 ms. That's why I thought it was Fastly experiencing a timeout while connecting to the origin. In such a situation Fastly returns a 503.
After looking into it I saw that Fastly returned a 504 so my assumption was wrong 🙂
@stefan-guggisberg Here's the problem: https://dashboard.epsagon.com/spans/0e6c4b50-cb13-7b34-b4a1-0207aa2d6547?tab=graph
@stefan-guggisberg The request to
https://raw.githubusercontent.com/davidnuescheler/lr-landing/d90a42b1babf33d430450e05cbd3dc1edc6b7135/index.md
took 1780 ms. So it's our timeout of 1000 ms that we pass to our request library.
And yes, we might want to consider increasing it.
https://github.com/adobe/helix-pipeline/blob/master/docs/secrets.md#HTTP_TIMEOUT
@kptdobe I am not sure about the timeout increase. It happens "rarely". If you increase the timeout, it might still happen, only more "rarely". So we would just push the problem into something harder to analyse. We should think about it. I'll create a ticket tomorrow to follow up and start the discussion.
@stefan-guggisberg Well, you said it happens "frequently" 🙂
I agree that we shouldn't increase the timeout unless there's no other option. I am reluctant to increase it.
David mentioned a couple of times that we need some sort of md cache 😉
@trieloff I wouldn’t take it for granted that an MD cache would solve much. Much of the stuff that is cacheable is already cached at a higher level, so an MD cache might just add another layer of caching, with low cache efficiency.
But we could implement a retry logic in the downloader, which would simply try again n times, with exponential backoff.
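As a generic illustration of the HTTP_TIMEOUT mechanism discussed above (not the actual helix-pipeline implementation), a configurable per-request timeout is typically applied by racing the request against a timer:

```javascript
// Generic sketch: reject a pending request if it takes longer than `ms`.
// Names are illustrative; helix-pipeline wires HTTP_TIMEOUT differently.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // clear the timer either way so the process can exit cleanly
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

This makes the trade-off in the thread concrete: raising `ms` hides slow responses behind a longer wait, while a retry keeps the short timeout and pays the cost only on failure.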

@kptdobe kptdobe added the question Further information is requested label Apr 6, 2020
@trieloff
Contributor

Do we have a way of getting the response time distribution (ms to first byte would be ideal) from Coralogix? Once we know the real response time distribution, we can set an informed timeout with a target failure rate.
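Once a response-time distribution is available, setting an informed timeout for a target failure rate amounts to reading off a percentile. A minimal, tooling-agnostic sketch (not Coralogix-specific):

```javascript
// Illustrative sketch: pick the timeout from an observed latency sample
// so that roughly (100 - p)% of requests would exceed it.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  // nearest-rank method: smallest value covering p% of the sample
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, idx))];
}
```

For example, setting the timeout to `percentile(latenciesMs, 99)` targets a ~1% timeout rate under the observed distribution.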

In addition, I think the retry strategy is something that could be added both to a downloader and to an MD cache. In the downloader it would be easier, because the downloader already exists.

@tripodsan
Contributor

we can grab the HTML action timing easily from Epsagon:

           "Server-Timing": {
                "0": "p00;dur=0.889538;desc=fetchFstab ,p01;dur=7385.588455;desc=fetchExternal ,p02;dur=0.609955;desc=fetchMarkupConfig ,p03;dur=1.390485;desc=fetchMarkdown ,p04;dur=13.146781;desc=parseMarkdown ,p05;dur=1.617181;desc=parseFrontmatter ,p06;dur=1.166587;desc=find ,p07;dur=1.271029;desc=fetch ,p08;dur=0.990822;desc=reformat ,p09;dur=0.776217;desc=iconize ,p10;dur=1.692485;desc=split ,p11;dur=3.467526;desc=getmetadata ,p12;dur=0.824109;desc=unwrap ,p13;dur=2.078935;desc=adjustMDAST ,p14;dur=0.795197;desc=selectstrain ,p15;dur=0.777052;desc=selecttest ,p16;dur=3.708712;desc=fillDataSections ,p17;dur=29.054696;desc=html ,p18;dur=0.843263;desc=adjustHTML ,p19;dur=0.79287;desc=sanitize ,p20;dur=20.437751;desc=once ,p21;dur=0.910263;desc=setmime ,p22;dur=1.061008;desc=cache ,p23;dur=1.942294;desc=key ,p24;dur=2.175862;desc=tovdom ,p25;dur=10.459243;desc=clean ,p26;dur=1.145533;desc=rewrite ,p27;dur=0.940876;desc=addHeaders ,p28;dur=1.458378;desc=stringify ,p29;dur=0.934212;desc=flag ,p30;dur=0.702583;desc=debug ,p31;dur=0.054923;desc=report",
                "1": "total;dur=7495.326742"
            },
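A Server-Timing value like the one above can be parsed to pull out the duration of a single phase (here, `fetchExternal`, which dominates the total). A hypothetical helper:

```javascript
// Sketch: extract the `dur` of a named entry from a Server-Timing value,
// e.g. "p01;dur=7385.588455;desc=fetchExternal ,p02;...".
function serverTimingDuration(header, name) {
  for (const entry of header.split(',')) {
    const parts = entry.trim().split(';').map((p) => p.trim());
    const dur = parts.find((p) => p.startsWith('dur='));
    const desc = parts.find((p) => p.startsWith('desc='));
    if (desc && desc.slice(5) === name && dur) {
      return parseFloat(dur.slice(4));
    }
  }
  return undefined;
}
```

Applied to the timing above, the ~7386 ms `fetchExternal` phase accounts for nearly all of the ~7495 ms total, which supports the theory that the GitHub fetch, not the pipeline, is the bottleneck.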
