Configure HttpHook's auth_type from Connection #35591

Open
wants to merge 172 commits into
base: main

Conversation

Joffreybvn
Contributor

@Joffreybvn Joffreybvn commented Nov 12, 2023

Hello,

This PR makes it possible to set up and parameterize HttpHook's and HttpAsyncHook's auth_type from the Connection UI.

Concretely, this PR:

  • Adds two extra settings to the Http Connection's 'Extra':
    • The reserved auth_type field to define an auth class
    • The reserved auth_kwargs field to provide a dict of extra parameters to the auth_type class.
  • The auth_type is validated against a list of Auth classes, to protect against code injection.
  • The list of Auth classes can be customized:
    • Via the airflow_local_settings.
    • Via the new AIRFLOW__HTTP__EXTRA_AUTH_TYPES config
  • The changes are applied to both HttpHook and HttpAsyncHook, via a new common mixin class.
  • The UI of the Http Connection is refactored:
    • To make those Auth classes more discoverable, with a SelectField for auth_type
    • To have a convenient dedicated field for auth_kwargs
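
As a rough illustration of the two reserved fields, the sketch below shows what the 'Extra' JSON of an Http Connection might look like and how a hook could turn it into an auth object. The `TokenAuth` class, the `build_auth` helper, the module path and the field values are hypothetical stand-ins, not code from this PR; passing login and password positionally ahead of the keyword arguments is also an assumption.

```python
import json


class TokenAuth:
    """Hypothetical auth class that needs more than a login and a password."""

    def __init__(self, login, password, tenant_id, scope="default"):
        self.login = login
        self.password = password
        self.tenant_id = tenant_id
        self.scope = scope


# What the 'Extra' field of the Connection might contain (assumed layout).
extra = json.dumps(
    {
        "auth_type": "my_company.auth.TokenAuth",  # hypothetical import path
        "auth_kwargs": {"tenant_id": "abc-123", "scope": "read"},
    }
)


def build_auth(auth_class, login, password, extra_json):
    """Instantiate auth_class with the credentials plus any auth_kwargs."""
    auth_kwargs = json.loads(extra_json).get("auth_kwargs", {})
    return auth_class(login, password, **auth_kwargs)


auth = build_auth(TokenAuth, "user", "secret", extra)
print(auth.tenant_id, auth.scope)
```

The point of the sketch is only that the extra arguments live in the Connection, so nothing auth-specific has to appear in the dag file.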

Side effect of the UI changes: Until now, the Extra field was used to pass parameters to the Headers (anything in the Extra was passed to the Headers). But now, auth_kwargs and auth_type are also written there, which I don't find very convenient. Furthermore, this PR adds logic to exclude those keys from the Headers (IMO this is starting to become tech debt). And finally, users cannot pass a header with the same name as one of those keys (it is unlikely, but it could happen).

Thus, I propose to deprecate passing header parameters directly in the 'Extra' field, and to pass them via a dedicated "Headers" field instead.
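
A minimal sketch of that header handling, under stated assumptions: the dedicated field is stored under a `headers` key in the extra dict (the key name is a guess), legacy top-level keys are still honoured, and the reserved keys are filtered out. The `extract_headers` helper is hypothetical and is not the PR's implementation.

```python
# Keys in 'extra' that must never be sent as HTTP headers (assumed names).
RESERVED_KEYS = {"auth_type", "auth_kwargs", "headers"}


def extract_headers(extra: dict) -> dict:
    """Return request headers from a Connection 'extra' dict.

    Prefers the dedicated 'headers' key; falls back to the legacy behaviour
    where every non-reserved top-level key in 'extra' is treated as a header.
    """
    dedicated = dict(extra.get("headers") or {})
    legacy = {k: v for k, v in extra.items() if k not in RESERVED_KEYS}
    # On a key conflict, the dedicated headers win over the legacy ones.
    return {**legacy, **dedicated}


headers = extract_headers(
    {
        "auth_type": "requests.auth.HTTPBasicAuth",
        "X-Legacy": "1",
        "headers": {"Accept": "application/json"},
    }
)
print(headers)
```

With a dedicated field, a header that happens to be called `auth_type` could still be sent by placing it inside `headers`, which the legacy layout cannot express.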

UI:
Screenshot from 2024-01-05 09-05-03

Screenshot from 2024-01-05 09-04-59

Side effect:
Screenshot from 2024-01-05 09-11-34

I also tried to add a CodeMirrorField (for Headers and Auth kwargs), and a CollapsibleField (to hide Extra), but it was a bit too much compared to the initial goal of this PR. Maybe in a future one.

Use-case:

The auth_type is typically a subclass of requests.auth.AuthBase. Many custom Auth classes exist for many different protocols. Sometimes, passing only the two hard-coded conn.username and conn.password is not enough: the Auth class expects more than two arguments.

Examples:

Right now, to deal with those cases, there are three possibilities:

  • Using functools.partial, like mentioned in this PR, in the dag file / in the operator declaration.
    Opinion: The dag developer should not have to care about handling the connection. They just want a working connection_id to call an endpoint (especially if they are a beginner / less experienced dev). Furthermore, some parameters are sensitive and cannot be written in a dag.
  • Writing a custom Hook, which dispatches the parameters from the Connection correctly (possibly using partial).
    Opinion: This is not okay. Other hooks are doing better. Take the ODBCHook, which allows parameterizing every aspect of the connection without subclassing anything:
    • Defining which driver has to be used
    • Defining connection schema for SQLAlchemy
    • Parameterize the driver via extra driver-specific parameters
    • Parameterize the behavior of pyodbc ("connect_kwargs").
    Everything can be controlled in the Connection UI! I'd expect the HttpHook to behave similarly, and let me configure everything, including the underlying authentication.
  • Misusing the "username" and "password" fields of a Connection to stack and pass multiple parameters in them + implementing a thin layer on top of an Auth class to re-dispatch the parameters.
    Opinion: This is definitely a bad workaround. I'm mentioning it because this PR won't entirely solve the issue, and this may (continue to) happen.

Coming to this PR, I propose to add two reserved fields: "auth_type" and "auth_kwargs", which are passed to the underlying Auth class. No breaking change. This solves most of the issues: a partial is not needed anymore, a subclass is not needed anymore, and there are fewer cases where conn.username and conn.password will be misused.
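
For comparison, the first workaround listed above (functools.partial in the dag file) can be sketched roughly as follows. `ThreeArgAuth` and its `realm` argument are hypothetical; the only assumption about the hook is that it calls `auth_type(username, password)`.

```python
from functools import partial


class ThreeArgAuth:
    """Hypothetical auth class expecting an extra 'realm' argument."""

    def __init__(self, username, password, realm):
        self.username = username
        self.password = password
        self.realm = realm


# Workaround: pre-bind the extra argument in the dag file, so that code
# calling auth_type(username, password) still works.
auth_type = partial(ThreeArgAuth, realm="EXAMPLE.COM")

# What a hook would then effectively do with the Connection credentials.
auth = auth_type("user", "secret")
print(auth.realm)
```

The drawback described above is visible here: the `realm` value has to be written in the dag code instead of being stored in the Connection.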



@Joffreybvn Joffreybvn force-pushed the feature/http-extra-auth-parameter-in-connection branch from c80e6e1 to c8574c1 Compare November 12, 2023 21:15
@potiuk
Member

potiuk commented Nov 15, 2023

This looks good in principle (small and simple, yet powerful), but it would require more complete documentation and examples in order to be mergeable.

@Joffreybvn
Contributor Author

@potiuk What about a more aggressive PR, most likely with a breaking change? (Sorry for the big chunk of text, here is the important part:)

This PR could copy further what the ODBCHook does, to:

  • Implement a way to set a custom Auth class from the Connection UI / conn.extra_json.
  • Remove the two hard-coded conn.username and conn.password from class instantiation and replace them with keyword arguments, possibly customizable in conn.extra_json too.
  • Implement a way to customize proxy, ssl verify, and all other extra requests parameters from the Connection UI / conn.extra_json.

Would that be okay?

@potiuk
Member

potiuk commented Nov 15, 2023

  • Implement a way to set a custom Auth class from the Connection UI / conn.extra_json.

I'd say I am not so thrilled by the other option, and I certainly would not comment on it unless you show the code rather than explain in words what you really mean by changing it. I am not sure you can convey in words what you want to do without actually trying and implementing a POC where you would show the code and we could assess how "breaking" it is. Airflow is used by tens of thousands of enterprises and we cannot afford breaking changes that will break everyone's workflows when migrating. We can break a few people's workflows (this is inevitable) but not everyone's.

So before even attempting that, you should answer yourself a few questions. And decide if you want to go there at all.

  • Will that make everyone's connections stop working and require them to manually modify their connections? -> certainly NOT - probably not even in Airflow 3 (which we do not even plan yet).

  • Will there be a change in any of the Public APIs that you can use to access connections? -> absolutely not in Airflow 2. This is forbidden by SemVer and we cannot deliberately change the API (including the Connection Python object that is retrieved by Hooks and operators) in Airflow 2.

  • Will there be any migration (automated or not) for the users? Will it handle all the cases?

  • What exactly do you want to achieve and why do you think it is beneficial - i.e. are the cost of the change and the potential problems justified by the benefit?

  • How will it impact the UI and the ways of defining connections via other mechanisms (env vars, secrets)?

....

I think most of those questions are only worth looking at in detail when there is at least a Proof-Of-Concept, where the discussion can be had over the code rather than over an abstract concept of the change :). Otherwise it will take too much time for those who review it to understand what you really want to do - having code to look at is pretty much the starting point for someone proposing a change touching this part - a part that is pretty much the "core" of Airflow and part of the Public Interface of Airflow: https://airflow.apache.org/docs/apache-airflow/stable/public-airflow-interface.html

@Joffreybvn Joffreybvn force-pushed the feature/http-extra-auth-parameter-in-connection branch from c8574c1 to ba6d715 Compare November 23, 2023 06:19
@Joffreybvn Joffreybvn marked this pull request as draft November 23, 2023 21:30
@Joffreybvn Joffreybvn force-pushed the feature/http-extra-auth-parameter-in-connection branch from 73a3cfe to 0daaa70 Compare November 23, 2023 23:26
@Joffreybvn Joffreybvn marked this pull request as ready for review November 24, 2023 00:20
@Joffreybvn
Contributor Author

Joffreybvn commented Nov 24, 2023

Thanks for the detailed answer!

I won't go for proxy settings and extra parameters in this PR. Just for info, adding a tool like forwarder globally solves all proxy configuration issues. And disabling SSL "verify" can be done via the CURL_CA_BUNDLE trick instead of adding parameters to control that in the Connection.


For the rest, currently this PR:

  • Allows configuring the Auth class from the Connection
  • Allows passing extra parameters to instantiate the Auth class from the Connection

There is no breaking change in Airflow. It may be a breaking change for users relying on the previous logic of the property (see the code review below). But as the property was introduced recently, that won't break many users' workflows.


    @auth_type.setter
    def auth_type(self, v):
        self._auth_type = v
Contributor Author

@Joffreybvn Joffreybvn Nov 24, 2023

I replaced the auth_type property with a simple class attribute. It was introduced in a previous PR (#29206 (review) - 8 months ago) as a mechanism to detect whether the user passed a custom auth class. I replaced that with the self._is_auth_type_setup attribute.

Why? Maintaining this property means adding to it all the logic to load and return an auth_type potentially defined in the Connection, which is useless given how this property is used in the rest of the codebase (here in livy, and in the dbt hook).
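
A rough sketch of the idea, not the PR's actual code: a plain attribute holds the auth class, and a separate flag records whether the caller chose one explicitly. `SketchHook` and its default value are hypothetical.

```python
class SketchHook:
    """Hypothetical hook, only to illustrate the flag-based approach."""

    default_auth_type = "HTTPBasicAuth"  # placeholder value for the sketch

    def __init__(self, auth_type=None):
        # Plain attribute instead of a property with a setter.
        self.auth_type = auth_type or self.default_auth_type
        # Records whether the caller explicitly chose an auth class.
        self._is_auth_type_setup = auth_type is not None


print(SketchHook()._is_auth_type_setup)          # False
print(SketchHook("MyAuth")._is_auth_type_setup)  # True
```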

@Joffreybvn Joffreybvn changed the title Pass parameters to HttpHook's auth_type from Connection Configure HttpHook's auth_type from Connection Nov 24, 2023
@Joffreybvn Joffreybvn force-pushed the feature/http-extra-auth-parameter-in-connection branch from 14d4b30 to af69628 Compare November 25, 2023 15:19
@potiuk
Member

potiuk commented Nov 29, 2023

I just realised there is likely one big problem here - security.

While we cannot prevent it completely for some kinds of connections (this is why a Connection Editing user should be highly privileged), introducing RCE deliberately is another thing.

If I understand correctly, someone who edits a connection can decide which arbitrary class will be instantiated and executed when an HTTP connection is established via the HTTP Hook? Which - if I understand correctly - is basically a "no-go" - we have removed a number of cases like that in the past from a number of providers precisely for that reason.

Is there any way we can make the UI connection "declarative" for that? For example, we could limit the list of predefined auth types we can choose from. Does it make sense at all?

@Joffreybvn Joffreybvn marked this pull request as draft December 5, 2023 06:09
@Joffreybvn Joffreybvn force-pushed the feature/http-extra-auth-parameter-in-connection branch 3 times, most recently from 8ddedcf to b01d620 Compare December 6, 2023 19:25
@Joffreybvn
Contributor Author

Joffreybvn commented Dec 6, 2023

Makes total sense! I added an AIRFLOW__HTTP__EXTRA_AUTH_TYPES parameter for the Airflow 'Deployment manager' to control which classes can be set as "auth_type". By default, only classes from requests.auth and the ones added via this parameter can be used. If any other class is set, the following message will appear in the logs when executing the task:
MicrosoftTeams-image
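
The allow-list idea can be sketched like this. The comma-separated format of the environment variable, the exact default set and the helper names are assumptions made for illustration; only the three `requests.auth` class paths are known to exist.

```python
import os

# Assumed defaults: auth classes shipped with `requests`.
DEFAULT_AUTH_TYPES = frozenset(
    {
        "requests.auth.HTTPBasicAuth",
        "requests.auth.HTTPProxyAuth",
        "requests.auth.HTTPDigestAuth",
    }
)


def get_auth_types(env=os.environ):
    """Return the allowed auth class paths: defaults plus configured extras."""
    # Assumption: the setting is a comma-separated list of import paths.
    raw = env.get("AIRFLOW__HTTP__EXTRA_AUTH_TYPES", "")
    extras = {item.strip() for item in raw.split(",") if item.strip()}
    return DEFAULT_AUTH_TYPES | extras


def is_allowed(auth_type, env=os.environ):
    """Check an auth_type from a Connection against the allow-list."""
    return auth_type in get_auth_types(env)


print(is_allowed("requests.auth.HTTPBasicAuth", {}))
print(is_allowed("os.system", {}))
```

Because the check happens before any import, a class path typed into the Connection that is not on the list is never loaded.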


Still WIP: I'm looking into customizing the UI to have special fields based on the extra params.

@Joffreybvn Joffreybvn force-pushed the feature/http-extra-auth-parameter-in-connection branch 3 times, most recently from 8763e34 to 11408ce Compare December 20, 2023 15:18
@Joffreybvn Joffreybvn force-pushed the feature/http-extra-auth-parameter-in-connection branch from c22fd95 to 29c3cff Compare December 23, 2023 13:25
@Joffreybvn Joffreybvn force-pushed the feature/http-extra-auth-parameter-in-connection branch 2 times, most recently from c9e5d10 to 53a38eb Compare January 5, 2024 07:49
@Joffreybvn Joffreybvn force-pushed the feature/http-extra-auth-parameter-in-connection branch from 7f0e8fb to 070be4b Compare January 8, 2024 18:21
@dabla
Contributor

dabla commented Jan 11, 2025

I just realised there is likely one big problem here - security.

While we cannot prevent it completely for some kinds of connections (this is why a Connection Editing user should be highly privileged), introducing RCE deliberately is another thing.

If I understand correctly, someone who edits a connection can decide which arbitrary class will be instantiated and executed when an HTTP connection is established via the HTTP Hook? Which - if I understand correctly - is basically a "no-go" - we have removed a number of cases like that in the past from a number of providers precisely for that reason.

Is there any way we can make the UI connection "declarative" for that? For example, we could limit the list of predefined auth types we can choose from. Does it make sense at all?

@potiuk The number of possible auth types is already limited and is exposed as a frozenset in the HttpHook, so there is no way to fiddle with it. You can also configure the allowed auth types through airflow.cfg, which means that you would actually need access to the Airflow installation to be able to modify them. So maybe I'm naïve, but I think we're quite safe here, unless I'm missing something important, which could be the case. Even if you fiddled with the HTTP form on the client side, it wouldn't be accepted, as the changed auth type wouldn't be among the allowed auth types, which is checked in this part:

    def _load_conn_auth_type(self, module_name: str | None) -> Any:
        """
        Load auth_type module from extra Connection parameters.

        Check if the auth_type module is listed in 'extra_auth_types' and load it.
        This method protects against the execution of random modules.
        """
        if module_name:
            if module_name in self.get_auth_types():
                try:
                    module = import_string(module_name)
                    self._is_auth_type_setup = True
                    self.log.info("Loaded auth_type: %s", module_name)
                    return module
                except Exception as error:
                    self.log.error("Cannot import auth_type '%s' due to: %s", module_name, error)
                    raise AirflowException(error)
            self.log.warning(
                "Skipping import of auth_type '%s'. The class should be listed in "
                "'extra_auth_types' config of the http provider.",
                module_name,
            )
        return None

@dabla
Contributor

dabla commented Jan 11, 2025

@potiuk @jscheffl I think it would be nice to finish this PR once we know how the connection forms work in Airflow 3.0, as it would be a nice feature. For example, when using the LivyHook/Operator, it would then be easy to change the auth type to kerberos, which is how we use it; at the moment we have to patch the LivyHook to be able to use it that way. Of course, the feature should degrade gracefully if it detects that the provider is used on Airflow 2.x, which I think is feasible until the provider is Airflow 3.x only.

@jscheffl
Contributor

@potiuk @jscheffl I think it would be nice to finish this PR once we know how the connection forms work in Airflow 3.0, as it would be a nice feature. For example, when using the LivyHook/Operator, it would then be easy to change the auth type to kerberos, which is how we use it; at the moment we have to patch the LivyHook to be able to use it that way. Of course, the feature should degrade gracefully if it detects that the provider is used on Airflow 2.x, which I think is feasible until the provider is Airflow 3.x only.

Still not implemented any further - but if you want to contribute, we can also define it together. The starting point at the moment is the params UI parts, which, once specified, will then be usable for connection forms... at least that is the plan: #45270

@github-actions github-actions bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Jan 15, 2025
@potiuk
Copy link
Member

potiuk commented Jan 25, 2025

Some static checks and others are failing.

7 participants