DOC: Provide a public place for users to link to our documentation #55632

Open
1 task done
phofl opened this issue Oct 22, 2023 · 28 comments

@phofl
Member

phofl commented Oct 22, 2023

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

...

Documentation problem

Currently, if users or downstream packages want to link to pandas-specific docs (e.g. groupby ops and others), they have to specify the path in pandas.core.... This option becomes unavailable once pandas.core is deprecated. #55626 linked to pandas.api.typing as a temporary solution for groupby ops, but we should ideally provide a place outside of typing for docs linkage.

cc @rhshadrach

Suggested fix for documentation

See above

@phofl phofl added Docs Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 22, 2023
@jorisvandenbossche jorisvandenbossche added this to the 2.2 milestone Dec 14, 2023
@jorisvandenbossche jorisvandenbossche added Blocker Blocking issue or pull request for an upcoming release and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 14, 2023
@jorisvandenbossche
Member

I personally agree that for documentation purposes, ideally we have a "path" to link to that is not linked to typing.

To be explicit, I think the classes that we are discussing here (at least the ones that were already moved in #55626) are:

  • DataFrameGroupBy, SeriesGroupBy
  • Resampler
  • Rolling, Expanding, Window, ExponentialMovingWindow

There are a few others exposed in pandas.api.typing like DatetimeIndexResamplerGroupby, but those aren't used in the reference docs, I think.

One option could be to expose them directly in pandas.api instead of in a sub-submodule of it? (so eg pandas.api.DataFrameGroupBy instead of pandas.api.typing.DataFrameGroupBy)
I know at the moment pandas.api only has submodules, and no direct members, so that would be a change. But that would make the reference a bit shorter and remove the "typing" connotation.

Another option could be to simply add them to the top-level namespace. Of course, since a user never constructs those objects directly (using the class constructor), it's not very useful to have them there for purposes other than docs/typing, so it's a trade-off whether that alone is enough to justify adding them to the top-level namespace.
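
As an illustration of "a user never constructs those objects directly", the sketch below obtains each kind of object through a method call and prints the class it gets back. The data and variable names are hypothetical; Window is left out because creating one needs an extra win_type argument.

```python
import pandas as pd

# Hypothetical time-indexed data for illustration.
index = pd.date_range("2024-01-01", periods=6, freq="D")
df = pd.DataFrame({"key": list("ababab"), "value": range(6)}, index=index)
values = df[["value"]]  # numeric-only view for the window operations

# Each object is returned by a method call, never built via its constructor.
returned = {
    "df.groupby": df.groupby("key"),
    "series.groupby": df["value"].groupby(df["key"]),
    "df.resample": values.resample("2D"),
    "df.rolling": values.rolling(2),
    "df.expanding": values.expanding(),
    "df.ewm": values.ewm(alpha=0.5),
}

class_names = {call: type(obj).__name__ for call, obj in returned.items()}
for call, name in class_names.items():
    print(f"{call:15s} -> {name}")
```

Note that resample on a DatetimeIndex reports a subclass of Resampler rather than Resampler itself.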

@rhshadrach
Member

There was early opposition to having these classes in the top namespace in #48577. I do agree with that opposition, but would be okay with them there. I am also okay with where they currently are - pandas.api.typing is a bit odd for the URL, but I don't see it as problematic.

pandas.api seems like a reasonable location. But it's been mentioned (#48577 (comment)) that if we had clear public modules, then there is no reason for pandas.api. Of course, that doesn't mean it has to be removed.

@jorisvandenbossche
Member

But it's been mentioned (#48577 (comment)) that if we had clear public modules, then there is no reason for pandas.api

I would say: "there is no reason for pandas.api submodules" (eg pandas.api.typing could just be pandas.typing). But we could still use the pandas.api module itself to expose some things.
I think the main problem for the objects we are discussing here, if we want to put them in a public pandas.<.something.> submodule, is what to name it. They are public APIs in the sense that users work with those objects (get them returned, call methods on them), but since you don't call them directly they don't necessarily need to be top-level. But for a submodule, I can't directly think of a good name that would fit all of those objects together (eg pandas.groupby would also be a bit strange for Window). And then a general "api" for public user-facing API that doesn't need to be top-level might actually make sense?

(personally, I think I have a slight preference for just putting them top-level, but I certainly understand and agree with the downsides of that)

@rhshadrach
Member

I'm +1 on putting them in pandas.api, +0 on top-level, +0 on leaving them as-is.

@phofl phofl removed the Blocker Blocking issue or pull request for an upcoming release label Dec 19, 2023
@phofl
Member Author

phofl commented Dec 19, 2023

We reverted the change of path, so this issue no longer blocks the release.

@Dr-Irv
Contributor

Dr-Irv commented Dec 20, 2023

If the objects are just internal ones that our methods can return, what about pandas.api.internal or something like that, which would indicate that the object is internal to pandas?

@jorisvandenbossche jorisvandenbossche modified the milestones: 2.2, 3.0 Dec 21, 2023
@jorisvandenbossche
Member

I wouldn't call them "internal", in the way we typically speak about "our internals" (manager, blocks) or in the way they are internal details. Those objects are still public, the user sees those objects, calls methods on them, etc. The only thing you could consider "internal" is their __init__, as no user needs to call that.

So I would find pandas.api.internals a bit strange for this (the url of docs of public functions would then include "internals"). If at some point during the core->_core rename, we do decide that we want to provide a public location where downstream libraries can access eg BlockManager (big if ;)), then that's something that I would expect in pandas.api.internals

@Dr-Irv
Contributor

Dr-Irv commented Dec 21, 2023

So I would find pandas.api.internals a bit strange for this (the url of docs of public functions would then include "internals").

That's reasonable. How about pandas.api.transient? I can't think of a word that means "a class that is something that you should not instantiate". In Java, people call those "utility classes", so we could possibly use pandas.api.utility to mean that.

@rhshadrach
Member

rhshadrach commented Dec 26, 2023

I'll add intermediate as an alternative to transient. But I think I'm -1 on utility - we already use utility in a different way (e.g. https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.util.hash_array.html#pandas.util.hash_array).

Also: indirect, incidental, ancillary, accessory.

Thus far, ancillary is my favorite if we're not going with just putting them in pandas.api.

@Dr-Irv
Contributor

Dr-Irv commented Dec 26, 2023

Thus far, ancillary is my favorite if we're not going with just putting them in pandas.api.

That works for me. Also could consider auxiliary as well.

@rhshadrach
Member

I'd like to start working on this if we can get past the bike shedding 😆 The following seem appropriate to me, in my order of preference:

  1. auxiliary
  2. transient
  3. ancillary

cc @Dr-Irv @jorisvandenbossche @phofl

@Dr-Irv
Contributor

Dr-Irv commented Feb 9, 2024

I'd like to start working on this if we can get past the bike shedding 😆 The following seem appropriate to me, in my order of preference:

  1. auxiliary
  2. transient
  3. ancillary

I'm indifferent between auxiliary and ancillary, and probably prefer those 2 over transient.

@phofl
Member Author

phofl commented Feb 9, 2024

my opinion hasn't changed, I don't really care as long as I can access them.

auxiliary sounds good to me

@jorisvandenbossche
Member

What's the argument again for not putting them just in pandas.api ?

Above @rhshadrach you linked to a different thread (from pandas.api.typing), but as I commented above (#55632 (comment)) for me that's mostly an argument against having submodules in pandas.api (instead of just putting those submodules top-level), and not for putting things directly in pandas.api

Personally, I would say that any of auxiliary or ancillary or one of the other names add unnecessary verbosity to the URLs. Apart from preferring not to use any of them, I don't have a strong opinion on which sounds best (although as a non-native speaker, I find those all "difficult" words)

@rhshadrach
Member

@jorisvandenbossche - no opposition here, but I think we should try to have a clear rule as to what goes into pandas vs pandas.api vs pandas.api.foo. Is it:

  • top level is for classes and functions we expect the "typical" user to directly utilize
  • pandas.api is for classes and functions the "typical" user won't directly utilize; such as
    • classes returned by pandas methods/functions that aren't to be instantiated directly
    • classes and functions meant for downstream packages and extending pandas
  • pandas.api.foo is anything that would go into pandas.api, but put in foo for organizational purposes

@rhshadrach
Member

@jorisvandenbossche - friendly ping.

@jorisvandenbossche
Member

but I think we should try to have a clear rule as to what goes into pandas vs pandas.api vs pandas.api.foo

Your rationale of what to put where sounds perfect to me

@rhshadrach
Member

@Dr-Irv - are you good to move forward with @jorisvandenbossche's suggestion in #55632 (comment)? That is, put these in pandas.api directly.

@Dr-Irv
Contributor

Dr-Irv commented Mar 4, 2024

@Dr-Irv - are you good to move forward with @jorisvandenbossche's suggestion in #55632 (comment)? That is, put these in pandas.api directly.

Not really, and here's why. I think there should be a difference between classes/functions that users might use versus ones that are auxiliary. You previously wrote:

pandas.api is for classes and functions the "typical" user won't directly utilize; such as

  • classes returned by pandas methods/functions that aren't to be instantiated directly
  • classes and functions meant for downstream packages and extending pandas

We currently have things like pd.api.types.CategoricalDtype that a user might use, whereas the things currently in pandas.api.typing are just needed for documentation purposes. So right now, the sub-packages of pandas.api can be divided into two groups:

  • Things a user might use directly (e.g., pd.api.extensions.ExtensionArray, pd.api.types.CategoricalDtype)
  • Things a user will not use directly (e.g., pd.api.typing.Resampler, pd.api.typing.DataFrameGroupBy)

So the current subdivision of pandas.api into these categories allows us to clearly delineate what they are for.

  • interchange
  • extensions
  • indexers
  • types
  • typing

So I don't think that moving the things currently in typing up to pd.api keeps our design consistent. That's why I suggested words like "auxiliary" and "ancillary"

@rhshadrach
Member

Thanks for the response @Dr-Irv. If there is agreement on where these things should go, I'm happy to take up the work. But I won't be actively pushing people for this agreement to happen. I'll just end by stating that I think having core appear public is something that really should be fixed, and this is holding that up.

@jorisvandenbossche
Member

  • Things a user might use directly (e.g., pd.api.extensions.ExtensionArray, pd.api.types.CategoricalDtype)

For this specific example, CategoricalDtype can indeed be used directly by users, and therefore actually lives in the top-level namespace (it seems it is additionally exposed in pandas.api.types as well, but that is maybe a mistake?)

While ExtensionArray is generally not something a user should use? (or would you use it for type checking, like isinstance checks?)

Generally I would say if there is something in pandas.api that is clearly meant for users, it should be moved. So I am not sure this category exists (or should exist)
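
A small sketch of the two points above: CategoricalDtype is reachable from both the top level and pandas.api.types, and ExtensionArray can serve in an isinstance check. The example array is hypothetical.

```python
import pandas as pd

# The same class object is reachable from both locations.
same_class = pd.CategoricalDtype is pd.api.types.CategoricalDtype

# An isinstance check against the extension-array base class,
# using a nullable integer array as a hypothetical example.
arr = pd.array([1, 2, None], dtype="Int64")
is_extension_array = isinstance(arr, pd.api.extensions.ExtensionArray)

print(same_class, is_extension_array)
```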

@Dr-Irv
Contributor

Dr-Irv commented Mar 4, 2024

While ExtensionArray is generally not something a user should use? (or would you use it for type checking, like isinstance checks?)

You'd use it if you were implementing your own extension array.

Generally I would say if there is something in pandas.api that is clearly meant for users, it should be moved. So I am not sure this category exists (or should exist)

I'm not sure I understand this statement. Currently pandas.api only has sub-packages. We have a number of things in pandas.api.types that are meant for users, such as infer_dtype(), and a bunch of is_xxx() functions.

So I still don't think we should have any individual class or function at the pandas.api level - we should categorize them as we do now.
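
For readers unfamiliar with the user-facing helpers in pandas.api.types, a minimal sketch using infer_dtype() and one of the is_xxx() functions might look like this (the two Series are hypothetical):

```python
import pandas as pd
from pandas.api.types import infer_dtype, is_numeric_dtype

# Hypothetical inputs for illustration.
numbers = pd.Series([1, 2, 3])
words = pd.Series(["a", "b", "c"])

# infer_dtype returns a string label describing the values.
numbers_kind = infer_dtype(numbers)
words_kind = infer_dtype(words)

# The is_xxx helpers return a plain bool.
numbers_are_numeric = is_numeric_dtype(numbers)
words_are_numeric = is_numeric_dtype(words)

print(numbers_kind, words_kind, numbers_are_numeric, words_are_numeric)
```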

@jorisvandenbossche
Member

While ExtensionArray is generally not something a user should use? (or would you use it for type checking, like isinstance checks?)

You'd use it if you were implementing your own extension array.

Of course, but for this discussion that is not "typical day-to-day / interactive usage of pandas", but rather fits in the "classes and functions meant for downstream packages and extending pandas" as listed in the second bullet point in #55632 (comment)

The distinction between the two is always fuzzy (but in the end that's also what we have to deal with when deciding whether to put something in the top-level namespace or not)

But so that's what I meant with "nothing in pandas.api is clearly meant for users". Of course it's still for some users, but with the assumption it's only needed for "advanced" cases (typically downstream library development)

So I don't think that moving the things currently in typing up to pd.api keeps our design consistent. That's why I suggested words like "auxiliary" and "ancillary"

Yes, you are right it isn't consistent with how we currently do things. But we are discussing here a potential new option (put things directly in pd.api), so of course that's extending the current design.
The bullet points that @rhshadrach listed in #55632 (comment) are an attempt to describe new rules that we can then try to apply consistently.

Just like in the top-level pd namespace we have some objects directly in that namespace (eg pd.DataFrame) and some objects grouped in a sub-module (eg pd.arrays.DatetimeArray), I think we could follow a similar pattern in pd.api.

The main reason I don't like using a module named "auxiliary" or "ancillary", is that IMO those don't add any value for users (they don't give meaningful categorization like eg pd.api.types does), and just add noise to urls.

@Dr-Irv
Contributor

Dr-Irv commented Mar 5, 2024

Just like in the top-level pd namespace we have some objects directly in that namespace (eg pd.DataFrame) and some objects grouped in a sub-module (eg pd.arrays.DatetimeArray), I think we could follow a similar pattern in pd.api.

So how do we decide which ones go in pd.api versus go in a sub-package?

The main reason I don't like using a module named "auxiliary" or "ancillary", is that IMO those don't add any value for users (they don't give meaningful categorization like eg pd.api.types does), and just add noise to urls.

That's fair, but that's an argument to have everything that is currently in pd.api.* move up to pd.api .

@Dr-Irv
Contributor

Dr-Irv commented Mar 6, 2024

@jorisvandenbossche and I discussed this on a call today and had a couple other ideas for these classes:

  • pandas.aux
  • pandas.api.aux

And for reference, the reason they are in typing for now is because we think the only reason a user would use what is in there is if they wanted to create a method or function that had typing declarations to reference those object types. Here is the list currently in pandas.api.typing:

  • DataFrameGroupBy
  • DatetimeIndexResamplerGroupby
  • Expanding
  • ExpandingGroupby
  • ExponentialMovingWindow
  • ExponentialMovingWindowGroupby
  • JsonReader
  • NaTType
  • NAType
  • PeriodIndexResamplerGroupby
  • Resampler
  • Rolling
  • RollingGroupby
  • SeriesGroupBy
  • StataReader
  • TimedeltaIndexResamplerGroupby
  • TimeGrouper
  • Window

It's unclear, if we choose a new location (e.g., pandas.aux) for something like DataFrameGroupBy, whether we move all the things in pandas.api.typing into that location, or keep the pandas.api.typing package and split it.

@jorisvandenbossche
Member

Some general thoughts on a typing submodule. I do think it would be good that we have such a module (and personally I think it should just be pandas.typing instead of pandas.api.typing).
But for such a module, I would expect it to contain objects that are only used for type annotations. And while pandas users would probably mostly access eg the "Resampler" name to use it in a type annotation, this object itself is of course not a typing-only construct. If I compare it to numpy.typing, that module only exposes actual type vars, such as np.typing.DTypeLike. I don't think we do that currently(?), but at some point it probably makes sense to publicly expose some common type vars that now live in pandas._typing, such as our version of DType/DtypeObj or ArrayLike. At that point, if we put those in a pandas.(api.)typing, that might also be a strange mix with the objects we are currently discussing.
(to be honest, I am far from familiar enough with typing to really know if we actually want to expose those type vars at some point or not)
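
For comparison, this is roughly how the numpy.typing names mentioned above are used purely as annotations. as_float_array is a hypothetical function written for this sketch.

```python
import numpy as np
import numpy.typing as npt


def as_float_array(
    values: npt.ArrayLike, dtype: npt.DTypeLike = np.float64
) -> np.ndarray:
    """Convert array-like input to an ndarray (hypothetical helper)."""
    return np.asarray(values, dtype=dtype)


result = as_float_array([1, 2, 3])
print(result.dtype, result.shape)
```

ArrayLike and DTypeLike appear only in the signature here, which is the contrast with a class like Resampler that users also interact with at runtime.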


Looking at the current list of what is in pandas.api.typing, some other questions (those can probably be handled in separate issues):

  • Do we need to expose all three of DatetimeIndexResamplerGroupby, TimedeltaIndexResamplerGroupby and PeriodIndexResamplerGroupby? Or why not just one ResamplerGroupby? For the non-grouped Resampler, we also have just one Resampler exposed, and not those subclasses.
  • StataReader is included in the docs as pd.io.stata.StataReader, and so typing could also use that, if we are OK with that? (but of course the public/private status of pd.io is also very unclear ..).
  • TimeGrouper was previously exposed top-level, but was deprecated in favor of the more generic pd.Grouper. Is it therefore needed to actually expose this?

It's unclear, if we choose a new location for something like DataFrameGroupBy, whether we move all the things in pandas.api.typing into that location, or keep the pandas.api.typing package and split it.

For the original motivation of this issue ("public location to link to for the docs"), we don't need all of the items on the list. For example I think NaTType and NAType are not used for docs, but only for typing.

@Dr-Irv
Contributor

Dr-Irv commented Mar 6, 2024

For the original motivation of this issue ("public location to link to for the docs"), we don't need all of the items on the list. For example I think NaTType and NAType are not used for docs, but only for typing.

I agree. So if we take this list

  • DataFrameGroupBy
  • DatetimeIndexResamplerGroupby
  • Expanding
  • ExpandingGroupby
  • ExponentialMovingWindow
  • ExponentialMovingWindowGroupby
  • PeriodIndexResamplerGroupby
  • Resampler
  • Rolling
  • RollingGroupby
  • SeriesGroupBy
  • TimedeltaIndexResamplerGroupby
  • Window

and move it to pd.aux or pd.api.aux, I think that will work.

Note - the various *ResamplerGroupby are needed. If you look at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.resample.html you can see that any of them can be returned.

@jorisvandenbossche
Member

Note - the various *ResamplerGroupby are needed. If you look at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.resample.html you can see that any of them can be returned.

Yes, that's how we currently document it. But the same is true for DataFrame.resample as well (it can return one of three subclasses), but there we are fine with typing + documenting it as returning Resampler. So we could also change the docs you linked to just say that this method returns a ResamplerGroupby (note, currently such a base class doesn't exist)


4 participants