Github regular timeout issue #117

Open
kptdobe opened this issue Apr 6, 2020 · 2 comments
Labels
question Further information is requested

Comments

@kptdobe
Contributor

kptdobe commented Apr 6, 2020

We see a regular timeout happening when requesting raw content from GitHub. The default timeout is set to 1000 ms. In the discussion that started on Slack, we seem to have several tracks to "explore":

  • extend the timeout to x000ms
    • drawback: the timeout might still occur, probably less frequently, but this only pushes the problem somewhere else
  • implement a retry in the Downloader - on timeout, just retry a few times before really failing
  • add an MD cache
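The retry track above, combined with the exponential backoff suggested later in the thread, could be sketched as a small wrapper. This is an illustrative sketch, not existing helix code; `withRetry` and its parameters are hypothetical names:

```javascript
// Hypothetical sketch of the proposed Downloader retry: run an async
// operation, and on failure wait with exponential backoff before retrying.
async function withRetry(fn, { retries = 3, baseDelayMs = 100 } = {}) {
  let lastErr;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (attempt === retries) break;
      // backoff doubles each attempt: 100 ms, 200 ms, 400 ms, ...
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastErr;
}
```

In the Downloader, `fn` would be the raw.githubusercontent.com fetch; a timeout would surface as a rejection and trigger the next attempt.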

Initial discussion transcript:

@kptdobe GitHub was too slow to answer (>1s):
GET https://raw.githubusercontent.com/davidnuescheler/lr-landing/d90a42b1babf33d430450e05cbd3dc1edc6b7135/index.md timed out after 1000 ms
@rofe this is happening quite often...
@stefan-guggisberg To be precise: It's Fastly's connect timeout, i.e. setting up a secure connection to raw.github.com took more than 1s, which is quite long.
@MarquiseRosier This looks like dispatch territory? If it happens too often, maybe we can speed it up there. Perhaps using helix-fetch? (edited)
@kptdobe I do not understand. The action triggering this error is
/helix-pages/51eb728d800d5307212e9f96a65a6f9ff7ec1e47/html
how would that be Fastly?
Wherever it is, it happens "frequently", we should consider having a retry or something to handle that case.
@stefan-guggisberg Ok, the error you quoted led me on the wrong track. We do have a raw.github.com origin configured in the helix-pages service. The connect timeout for this origin is 1000 ms. That's why I thought it was Fastly experiencing a timeout while connecting to the origin. In such a situation Fastly returns a 503.
After looking into it I saw that Fastly returned a 504 so my assumption was wrong 🙂
@stefan-guggisberg Here's the problem: https://dashboard.epsagon.com/spans/0e6c4b50-cb13-7b34-b4a1-0207aa2d6547?tab=graph
@stefan-guggisberg The request to
https://raw.githubusercontent.com/davidnuescheler/lr-landing/d90a42b1babf33d430450e05cbd3dc1edc6b7135/index.md
took 1780 ms. So it's our timeout of 1000 ms that we pass to our request library.
And yes, we might want to consider increasing it.
https://github.com/adobe/helix-pipeline/blob/master/docs/secrets.md#HTTP_TIMEOUT
@kptdobe I am not sure about the timeout increase. It happens "rarely". If you increase the timeout, it might still happen, only more "rarely". So we would just push the problem into something harder to analyse. We should think about it. I'll create a ticket tomorrow to follow up and start the discussion.
@stefan-guggisberg Well, you said it happens "frequently" 🙂
I agree that we shouldn't increase the timeout unless there's no other option. I am reluctant to increase it.
David mentioned a couple of times that we need some sort of md cache 😉
@trieloff I wouldn’t take it for granted that an MD cache would solve much. Much of the stuff that is cacheable is already cached at a higher level, so an MD cache might just add another layer of caching, with low cache efficiency.
But we could implement a retry logic in the downloader, which would simply try again n times, with exponential backoff.
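As a generic illustration of the HTTP_TIMEOUT mechanism discussed above (not the actual helix-pipeline implementation), a configurable per-request timeout is typically applied by racing the request against a timer:

```javascript
// Generic sketch: reject a pending request if it takes longer than `ms`.
// Names are illustrative; helix-pipeline wires HTTP_TIMEOUT differently.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // clear the timer either way so the process can exit cleanly
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

This makes the trade-off in the thread concrete: raising `ms` hides slow responses behind a longer wait, while a retry keeps the short timeout and pays the cost only on failure.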

@kptdobe kptdobe added the question Further information is requested label Apr 6, 2020
@trieloff
Contributor

Do we have a way of getting the response time distribution (ms to first byte would be ideal) from Coralogix? Once we know the real response time distribution, we can set an informed timeout with a target failure rate.
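Once a response-time distribution is available, setting an informed timeout for a target failure rate amounts to reading off a percentile. A minimal, tooling-agnostic sketch (not Coralogix-specific):

```javascript
// Illustrative sketch: pick the timeout from an observed latency sample
// so that roughly (100 - p)% of requests would exceed it.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  // nearest-rank method: smallest value covering p% of the sample
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, idx))];
}
```

For example, setting the timeout to `percentile(latenciesMs, 99)` targets a ~1% timeout rate under the observed distribution.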

In addition, I think the retry strategy is something that could be added both to a downloader and to an MD cache. In the downloader it would be easier, because the downloader already exists.

@tripodsan
Contributor

we can grab the HTML action timing easily from Epsagon:

           "Server-Timing": {
                "0": "p00;dur=0.889538;desc=fetchFstab ,p01;dur=7385.588455;desc=fetchExternal ,p02;dur=0.609955;desc=fetchMarkupConfig ,p03;dur=1.390485;desc=fetchMarkdown ,p04;dur=13.146781;desc=parseMarkdown ,p05;dur=1.617181;desc=parseFrontmatter ,p06;dur=1.166587;desc=find ,p07;dur=1.271029;desc=fetch ,p08;dur=0.990822;desc=reformat ,p09;dur=0.776217;desc=iconize ,p10;dur=1.692485;desc=split ,p11;dur=3.467526;desc=getmetadata ,p12;dur=0.824109;desc=unwrap ,p13;dur=2.078935;desc=adjustMDAST ,p14;dur=0.795197;desc=selectstrain ,p15;dur=0.777052;desc=selecttest ,p16;dur=3.708712;desc=fillDataSections ,p17;dur=29.054696;desc=html ,p18;dur=0.843263;desc=adjustHTML ,p19;dur=0.79287;desc=sanitize ,p20;dur=20.437751;desc=once ,p21;dur=0.910263;desc=setmime ,p22;dur=1.061008;desc=cache ,p23;dur=1.942294;desc=key ,p24;dur=2.175862;desc=tovdom ,p25;dur=10.459243;desc=clean ,p26;dur=1.145533;desc=rewrite ,p27;dur=0.940876;desc=addHeaders ,p28;dur=1.458378;desc=stringify ,p29;dur=0.934212;desc=flag ,p30;dur=0.702583;desc=debug ,p31;dur=0.054923;desc=report",
                "1": "total;dur=7495.326742"
            },
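A Server-Timing value like the one above can be parsed to pull out the duration of a single phase (here, `fetchExternal`, which dominates the total). A hypothetical helper:

```javascript
// Sketch: extract the `dur` of a named entry from a Server-Timing value,
// e.g. "p01;dur=7385.588455;desc=fetchExternal ,p02;...".
function serverTimingDuration(header, name) {
  for (const entry of header.split(',')) {
    const parts = entry.trim().split(';').map((p) => p.trim());
    const dur = parts.find((p) => p.startsWith('dur='));
    const desc = parts.find((p) => p.startsWith('desc='));
    if (desc && desc.slice(5) === name && dur) {
      return parseFloat(dur.slice(4));
    }
  }
  return undefined;
}
```

Applied to the timing above, the ~7386 ms `fetchExternal` phase accounts for nearly all of the ~7495 ms total, which supports the theory that the GitHub fetch, not the pipeline, is the bottleneck.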
