Do not accept invalid HTTPS connections #166

rgaudin · 2023-01-09T10:07:38Z

We've never allowed invalid HTTPS on the start URL for zimit, but I frequently see failed attempts in the youzim.it logs caused by self-signed certificates.

I am not sure about the crawler's behavior regarding this. There doesn't seem to be a related flag on the crawler, and there is no issue about certificates, so maybe the default is to allow insecure connections…
If it's not, we could still pass --allow-running-insecure-content via the CHROME_FLAGS environment variable.

Currently, it fails before we start the crawler, when we check the URL in zimit (Python).
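
To illustrate the kind of pre-crawl check described here, below is a minimal sketch using only the Python standard library. It is not zimit's actual code: the names `check_url` and `is_certificate_error` and the 10-second timeout are assumptions made for the example. It shows how a request made with a default (verifying) TLS context fails on a self-signed certificate before any crawl starts.

```python
import ssl
import urllib.error
import urllib.request


def is_certificate_error(exc: Exception) -> bool:
    """Return True if exc is, or wraps, a TLS certificate verification failure."""
    if isinstance(exc, ssl.SSLCertVerificationError):
        return True
    # urllib wraps low-level errors in URLError and keeps the cause in .reason
    reason = getattr(exc, "reason", None)
    return isinstance(reason, ssl.SSLCertVerificationError)


def check_url(url: str, timeout: float = 10.0) -> None:
    """Hypothetical pre-crawl check: abort on an invalid HTTPS certificate."""
    # The default context verifies the certificate chain and the hostname.
    context = ssl.create_default_context()
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context):
            pass  # the URL answered over a verified connection
    except urllib.error.URLError as exc:
        if is_certificate_error(exc):
            raise SystemExit(
                f"Invalid HTTPS certificate for {url}: {exc.reason}"
            ) from exc
        raise  # unrelated network error: let the caller deal with it
```

Calling `check_url("https://example.org/")` on a site with a valid certificate returns silently; a self-signed one would stop the run with the message above.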


kelson42 · 2023-01-09T11:24:19Z

> If it's not, we could still pass --allow-running-insecure-content via the CHROME_FLAGS environment variable.

--insecure indeed seems to be the way to go.

kelson42 · 2023-05-22T16:41:10Z

@rgaudin This seems straightforward to implement. By default it should not be insecure, but in the context of youzim.it I would activate it (no need to let users change this, IMO).

Doing this would easily reduce the number of scrapes that simply fail.

kelson42 · 2023-11-04T17:19:41Z

I found new errors (for Zimit scrapes) because the HTTPS configuration we are using is actually pretty picky. We should really implement this ASAP.

benoit74 · 2023-11-06T08:31:58Z

This is definitely an important issue BUT not a "good first issue" AFAIK, since the change must be made both in the Python check_url and in the call to Browsertrix crawler.
And we probably won't be able to use the Chrome flag mentioned by @rgaudin now that the crawler has been migrated to Brave.
I would rather not rely on a trick like this kind of flag, but implement the needed option in the crawler (maybe it even exists now). Otherwise our code is at risk whenever the underlying browser changes again, as just happened with the move to Brave (and implementing this upstream is probably a no-brainer).
With all that said, this is probably not a small issue we can implement quickly.

benoit74 · 2023-11-06T08:35:43Z

And from a higher PoV, I'm really not sure it is a good idea to make this a default.

As mentioned in another issue, the security of the content grabbed into the ZIM is quite important, since we are somehow responsible for it. If a virus manages to make its way into the ZIM, we will bear at least part of the responsibility.

If someone manages to man-in-the-middle any of our Zimit workers, they will very easily be able to inject all kinds of payloads into the ZIM.

With insecure HTTPS accepted by default, this kind of attack will be even easier to perform.

rgaudin · 2023-11-06T08:59:19Z

@kelson42 did not propose to make it a default but to use it by default on youzim.it. The responsibility chain is very different there, as users request a particular URL to be turned into a ZIM. TLS only ensures encryption between zimit and the server. Bad content can already be served and thus included in the ZIM.

While I'm in favor of an option to disable security, I'm not comfortable enabling it on youzim.it because we'd be breaking the chain of trust.

benoit74 · 2023-11-06T09:12:49Z

I agree I did not get the point very precisely.

And even if the chain of responsibility is different on youzim.it, while the end user can check that the URL he is entering has no bad content (or at least that it is what he intends to package), he won't be aware of any man-in-the-middle between zimit and his intended website manipulating the ZIM content. Ensuring there is minimal risk of such a man-in-the-middle is our responsibility. Ensuring no one can package bad content into a ZIM (e.g. by setting up a server with bad content and requesting it on youzim.it) is obviously far more debatable.

kelson42 · 2024-02-08T16:34:34Z

We regularly have failing cases because of the current strict TLS configuration, see https://github.com/kiwix/k8s/issues?q=is%3Aissue+is%3Aclosed+sort%3Aupdated-desc+routine. I have removed the question tag from this issue.

benoit74 · 2024-02-08T19:04:51Z

Correct me if you are not OK with the proposal below.

Part 1

* we want to add a new --insecure parameter to the crawler which would turn off all SSL certificate checks (or at least as many as possible)
* this flag would be turned off by default (i.e. we use a secure configuration)
* it will be exposed just like other flags on the zimfarm and zimit-frontend UIs

Part 2

* an environment variable at Zimfarm level will allow altering the default value of the --insecure parameter (just like we did for Publisher, which is "openZIM" by default but can be modified if needed); this will be used on youzim.it to reduce errors

Some notes:

* this situation is one more reason why we would benefit from a dry run of the zimit scraper, to raise a nice warning early if the user forgot to disable SSL checks for his insecure website (Design/Architecture: how to run some checks of recipe configuration from the UI (dry run) zimfarm#891)
* I'm not really comfortable with part 2, but I can live with it, especially until we have the dry-run feature (which might not be there for a few years, tbh)
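
As a rough illustration of parts 1 and 2 together, here is a small Python sketch of a flag that is off by default but whose default can be changed through an environment variable. Everything in it is hypothetical: the variable name `ZIMIT_INSECURE_DEFAULT`, the helper names and the program name are assumptions for the example, not existing zimit or Zimfarm settings.

```python
import argparse
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean from the environment; unset means `default`."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def build_parser(insecure_default: bool) -> argparse.ArgumentParser:
    """Build a parser whose --insecure default is decided by the deployment."""
    parser = argparse.ArgumentParser(prog="zimit-sketch")
    parser.add_argument(
        "--insecure",
        # Provides both --insecure and --no-insecure (Python 3.9+).
        action=argparse.BooleanOptionalAction,
        default=insecure_default,
        help="turn off TLS certificate checks (hypothetical flag)",
    )
    return parser


if __name__ == "__main__":
    # ZIMIT_INSECURE_DEFAULT is an assumed name, used only in this sketch.
    args = build_parser(env_flag("ZIMIT_INSECURE_DEFAULT")).parse_args()
    print(f"insecure mode: {args.insecure}")
```

With the variable unset the scraper stays secure unless `--insecure` is passed; a deployment such as youzim.it could set the variable to flip the default while still allowing `--no-insecure` per recipe.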

Popolechien · 2024-02-09T07:03:39Z

(stupid) question: what is the risk of having --insecure on by default? We would have fewer errors, but on the flip side...?

kelson42 · 2024-02-09T07:52:17Z

> * we want to add a new `--insecure` parameter to the crawler which would turn off all SSL certificate checks (or at least as many as possible)

For Zimit only.

> * this flag would be turned off by default (i.e. we use a secure configuration)

I don't care, but "yes".

> * it will be exposed just like other flags on the zimfarm and zimit-frontend UIs

Yes for Zimfarm, I don't think this is necessary for Zimit-frontend.

> * an environment variable at Zimfarm level will allow altering the default value of the `--insecure` parameter (just like we did for Publisher, which is "openZIM" by default but can be modified if needed); this will be used on youzim.it to reduce errors

Configure software with proper configuration files; configuration via ENV variables is simply bad in 90% of cases.

> * this situation is one more reason why we would benefit from a dry run of the zimit scraper, to raise a nice warning early if the user forgot to disable SSL checks for his insecure website ([Design/Architecture: how to run some checks of recipe configuration from the UI (dry run) zimfarm#891](https://github.com/openzim/zimfarm/issues/891))

99% of people have no clue how to deal with an SSL error, so this does not make sense here, IMHO. Keep it simple; the user risk is really almost nil.

benoit74 · 2024-02-09T09:01:41Z

> (stupid) question: what is the risk of having --insecure on by default? We would have fewer errors, but on the flip side...?

The flip side is that all recipes would run in an insecure configuration, which means a bigger attack surface and more chances of unsolicited content ending up in a ZIM. And this would be our responsibility, since we took the decision to run an insecure configuration by default.

The secure configuration / HTTPS ensures that the website which is responding is the proper one. If an attacker manages to modify our network so that a server other than the intended one responds, that server won't have the proper HTTPS setup. In the secure context the connection will then fail, and we won't create the ZIM with unexpected/bad content.
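
The difference between the two configurations can be sketched with Python's standard `ssl` module. This is only an illustration of the principle, not what the crawler does internally, and `make_ssl_context` with its `insecure` switch is a hypothetical helper.

```python
import ssl


def make_ssl_context(insecure: bool = False) -> ssl.SSLContext:
    """Build a client-side TLS context; `insecure` is a hypothetical switch."""
    # Secure default: the certificate chain must validate against trusted
    # CAs and the certificate must match the requested hostname.
    context = ssl.create_default_context()
    if insecure:
        # Order matters: hostname checking has to be disabled before
        # verify_mode can be set to CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


secure = make_ssl_context()
insecure = make_ssl_context(insecure=True)
print(secure.verify_mode, secure.check_hostname)
print(insecure.verify_mode, insecure.check_hostname)
```

With the secure context, a handshake with an impostor server raises `ssl.SSLCertVerificationError`; with the insecure one, the handshake succeeds no matter who answers.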

When a user requests a zimit scrape, he probably hopes we will put content from the real website he is targeting inside the ZIM. If an attacker has modified our network to reply to web requests with whatever he likes and we are running in the insecure configuration, we will put this illegitimate content inside the ZIM. The user will never be warned about it. We will never be warned about it. The ZIM might now contain a defaced website, harmful payloads, ...

The risk of an attacker being able to modify the network setup is probably limited in the zimit setup, where we have full control over the worker. It is bigger on the openZIM farm, because we have no control over the worker (and the worker owner could be the attacker). Emmanuel and I do not share the same sensitivity to this risk.

If the risk materializes (and there is a significant chance it never will) and the insecure configuration has been used by default, we will have very little to say beyond "we are sorry, we decided to run an insecure configuration to have less work/errors". Most users won't care. Technical users will be very angry.

If the insecure configuration is not applied by default at all and it is the youzim.it user who explicitly decides to run the recipe in an insecure configuration then, as said, the risk is transferred and we have honored our responsibilities in the chain of trust.

Trust is always very difficult to gain and very easy to lose. But it is definitely your responsibility. I can definitely implement whatever you decide. I just cannot let you decide without warning you about the risks as I perceive them (and my perception might be totally wrong).

benoit74 · 2024-02-09T09:01:45Z

> For Zimit only.

Yes

> Yes for Zimfarm, I don't think this is necessary for Zimit-frontend.

OK

> Configure software with proper configuration files; configuration via ENV variables is simply bad in 90% of cases.

This is the current way of doing things in zimfarm (and most other tools we have, AFAIK). But I could say that the ENV variable is configured in a k8s configuration file, so we somehow match your requirement.

> 99% of people have no clue how to deal with an SSL error, so this does not make sense here, IMHO. Keep it simple; the user risk is really almost nil.

I disagree. I don't see why we can't say to the user something like "oh, we are sorry, but it looks like the website you are targeting is not running a secure configuration; do you allow us to proceed with an insecure one?". All browsers already do it. And yes, in 99% of cases the user decides to take the risk, but the responsibility has been explicitly transferred.
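
A minimal Python sketch of that consent flow, under stated assumptions: `probe` and `ask_user` are hypothetical callables standing in for "try to reach the website" and "show the question in the UI", and the function only decides which mode the scrape should run in.

```python
import ssl
from typing import Callable


def resolve_insecure_mode(
    probe: Callable[[bool], None],
    ask_user: Callable[[str], bool],
) -> bool:
    """Return the `insecure` setting to use for the scrape.

    `probe(insecure)` is a hypothetical callable that connects to the target
    and raises ssl.SSLCertVerificationError when verification fails.
    `ask_user(message)` is a hypothetical callable returning the user's answer.
    """
    try:
        probe(False)  # first attempt: full certificate verification
        return False
    except ssl.SSLCertVerificationError as exc:
        message = (
            "The website you are targeting is not running a secure "
            f"configuration ({exc}). Proceed with an insecure one?"
        )
        if ask_user(message):
            return True  # the user explicitly accepted the risk
        raise  # the user declined: keep failing, as a secure default would
```

The secure path never asks anything; the question only appears after a verification failure, and declining leaves the original error in place.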

benoit74 · 2024-03-25T12:39:59Z

#285 somehow resolved the point for Zimit2:

* Browsertrix crawler runs in insecure mode by default and has no option to force HTTPS certificate validation and so on
* We have now removed the additional Python code which was forcing the zimit scraper to require a valid HTTPS certificate

Should we open another issue to "force" the validity of HTTPS connection before proceeding?

kelson42 · 2024-03-25T12:50:46Z

> Should we open another issue to "force" the validity of HTTPS connection before proceeding?

This should be done by Browsertrix and we should instrument it. @rgaudin and @benoit74 wanted secure behaviour by default, and that is fine with me.

benoit74 · 2024-03-25T12:51:42Z

I'm going to open issues then

benoit74 · 2024-03-25T12:52:21Z

Or more exactly, I will reopen this one and open one in Browsertrix Crawler.

benoit74 · 2024-03-25T12:57:49Z

Upstream issue is here: webrecorder/browsertrix-crawler#510

benoit74 · 2024-05-28T13:22:50Z

Is this mandatory for 2.0?

rgaudin added the question label Jan 9, 2023

kelson42 added enhancement good first issue labels Feb 7, 2023

kelson42 added this to the 2.0.0 milestone Apr 24, 2023

kelson42 assigned rgaudin May 22, 2023

kelson42 modified the milestones: 2.0.0, 1.7.0 Nov 4, 2023

kelson42 assigned benoit74 and unassigned rgaudin Nov 4, 2023

kelson42 removed the question label Feb 8, 2024

benoit74 mentioned this issue Feb 29, 2024

Upgrade browsertrix crawler and remove redirect handling #285

Merged

benoit74 closed this as completed Mar 25, 2024

kelson42 modified the milestones: 1.7.0, 2.0.0 Mar 25, 2024

benoit74 reopened this Mar 25, 2024

benoit74 changed the title ~~Should we accept invalid HTTPS?~~ Do not accept invalid HTTPS connections Mar 25, 2024

kelson42 modified the milestones: 2.0.0, 2.1.0 May 28, 2024

benoit74 modified the milestones: 2.1.0, 2.2.0 Jun 18, 2024

benoit74 modified the milestones: 2.2.0, later Aug 7, 2024
