diff --git a/docs/advanced.rst b/docs/advanced.rst deleted file mode 100644 index 3d51b70..0000000 --- a/docs/advanced.rst +++ /dev/null @@ -1,193 +0,0 @@ -.. _advanced_usage: - -Advanced usage of EDVART -=========================================== - -This section describes several concepts behind edvart -and how you can modify your report before exporting it. - -Report class ------------- - -The most important class of the package :py:class:`~edvart.report.Report`. -The report consists of sections, which can be added via methods of the `Report` class. -The report is empty by default. -The class :py:class:`~edvart.report.DefaultReport` is a subclass of `Report`, -which contains a default set of sections. - -With created instance of `Report` you can: - -1. Show the report directly in your jupyter notebook using :py:meth:`~edvart.report.Report.show` method. -2. Export a new notebook using :py:meth:`~edvart.report.Report.export_notebook` method and edit it by yourself. -3. Export the output to a HTML report. You can also use a `.tpl` template to style the report. - -Exporting to HTML ------------------ -Apart from directly exporting a `Report`, you may also wish to export a generated notebook to HTML. -To export a notebook, you may use a tool called `jupyter nbconvert` (https://nbconvert.readthedocs.io/en/latest/). -For example, to export a notebook called `notebook.ipynb` using the `lab` template, you may use the following command: - -.. code-block:: bash - - poetry run jupyter nbconvert --to html notebook.ipynb --template lab - - - -TimeseriesReport class ----------------------- - -This class is a special version of the :py:class:`~edvart.report.Report` class which is specifically meant to be used for analysis of time series. - -The main differences are a different set of default sections including :py:class:`~edvart.report_sections.TimeseriesAnalysis`, -which cannot be added to the normal `Report` and the assumption that analyzed data is time-indexed. 
- -Helper functions :py:func:`~edvart.utils.reindex_to_period` or :py:func:`~edvart.utils.reindex_to_datetime` -can be used to index a DataFrame by a `pd.PeriodIndex` or a `pd.DatetimeIndex` respectively. - -Each column is treated as a separate timeseries. - -.. code-block:: python - - df = pd.DataFrame( - data=[ - ['2018Q1', 120000, 11000], - ['2018Q2', 150000, 13000], - ['2018Q3', 100000, 12000], - ['2018Q4', 110000, 11000], - ['2019Q1', 120000, 13000], - ['2019Q2', 110000, 12000], - ['2019Q3', 120000, 14000], - ['2019Q4', 90000, 12000], - ['2020Q1', 130000, 12000], - ], - columns=['Quarter', 'Revenue', 'Profit'], - ) - - # Reindex using helper function to have 'Quarter' as index - df = edvart.utils.reindex_to_datetime(df, datetime_column='Quarter') - report_ts = edvart.TimeseriesReport(df) - report_ts.show() - - -Modifying sections ------------------- - -The report consists of sections. - -In current version of edvart you can find following sections: - -* TableOfContents - - - Provides table of contents with links to all other sections. - - :py:meth:`~edvart.report.ReportBase.add_table_of_contents` - -* DatasetOverview - - - Provides essential information about whole dataset - - :py:meth:`~edvart.report.ReportBase.add_overview` - -* UnivariateAnalysis - - - Provides analysis of individual columns - - :py:meth:`~edvart.report.ReportBase.add_univariate_analysis` - -* BivariateAnalysis - - - Provides analysis of pairs of columns - - :py:meth:`~edvart.report.ReportBase.add_bivariate_analysis` - -* MultivariateAnalysis - - - Provides analysis of all columns together. Currently features PCA, parallel coordinates and parallel categories subsections. - - :py:meth:`~edvart.report.ReportBase.add_multivariate_analysis` - -* GroupAnalysis - - - Provides analysis of each column when grouped a column or a set of columns. Includes basic information similar to dataset overview and univariate analysis, but on a per-group basis. 
- - :py:meth:`~edvart.report.ReportBase.add_group_analysis` - -* TimeseriesAnalysis - - - Provides analysis specific for time series. - - :py:meth:`~edvart.report.TimeseriesReport.add_timeseries_analysis` - - -The edvart API allows you to choose which sections you want in the final report -or modifying sections settings. - -Selection of sections -~~~~~~~~~~~~~~~~~~~~~ -You can add sections using methods `add_*` of the `Report` class. - -.. code-block:: python - - # Shows only univariate and bivariate analysis - import edvart - df = edvart.example_datasets.dataset_titanic() - report = ( - edvart.Report(df) - .add_univariate_analysis() - .add_bivariate_analysis() - ) - - -Sections configuration -~~~~~~~~~~~~~~~~~~~~~~ - -Each section can be also configured. -For example you can define which columns should be used or omitted. - -Or you can set section verbosity (described later). - -.. code-block:: python - - # Configures sections to omit or use specific columns - import edvart - - df = edvart.example_datasets.dataset_titanic() - report = edvart.Report(df) - - report.add_overview(omit_columns=["PassengerId"]).add_univariate_analysis( - use_columns=["Name", "Sex", "Age"] - ) - - - -.. _verbosity: - -Verbosity ---------- - -EDVART provides a concept of a verbosity that is used during *export* into jupyter notebook. -The verbosity helps us to generate a code with a specific level of detail. - -edvart supports three levels of verbosity: - -- LOW - - High level functions for whole sections are generated. User can modify the markdown description. -- MEDIUM - - edvart functions are generated. User can modify parameters of these functions. -- HIGH - - Raw code is generated. User can do very advanced modification such as changing visualisations style. - -The verbosity can be set to whole report or to each section separately. - -Examples: - -.. 
code-block:: python - - # Set default verbosity for all sections to Verbosity.MEDIUM - import edvart - from edvart import Verbosity - - df = edvart.example_datasets.dataset_titanic() - edvart.DefaultReport(df, verbosity=Verbosity.MEDIUM).export_notebook("test-export.ipynb") - - -.. code-block:: python - - # Set default verbosity to Verbosity.MEDIUM but use verbosity Verbosity.HIGH for univariate analysis - import edvart - - df = edvart.example_datasets.dataset_titanic() - edvart.DefaultReport(df, verbosity=Verbosity.MEDIUM, verbosity_univariate_analysis=Verbosity.HIGH).export_notebook("test-export.ipynb") diff --git a/docs/api_reference.rst b/docs/api_reference.rst index 624bc18..48de4f9 100644 --- a/docs/api_reference.rst +++ b/docs/api_reference.rst @@ -1,4 +1,4 @@ -API reference +API Reference ============= .. toctree:: diff --git a/docs/getting_started.rst b/docs/getting_started.rst deleted file mode 100644 index 352fa81..0000000 --- a/docs/getting_started.rst +++ /dev/null @@ -1,37 +0,0 @@ -Getting started -=============== - -1. Start with default exploratory analysis in jupyter notebook. - -.. code-block:: python - - import edvart - df = edvart.example_datasets.dataset_titanic() - edvart.DefaultReport(df).show() - -2. Generate report notebook - -.. code-block:: python - - import edvart - df = edvart.example_datasets.dataset_titanic() - report = edvart.DefaultReport(df) - report.export_notebook("titanic_report.ipynb") - -You can modify the generated notebook if you want to modify some settings. -For more advanced usage of edvart, please read the documentation section -:ref:`Advanced usage `. - -3. Generate HTML report - -.. 
code-block:: python - - import edvart - df = edvart.example_datasets.dataset_titanic() - report = edvart.DefaultReport(df) - report.export_html( - html_filepath="titanic_report.html", - dataset_name="Titanic", - dataset_description="Dataset that contains data for 891 of the real Titanic passengers.", - ) - diff --git a/docs/index.rst b/docs/index.rst index b235fd4..b3f37c0 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,15 +1,27 @@ EDVART ================================ -Exploratory Data Analysis (EDA) is a very initial task a data scientist -or data analyst does when he reaches new data. -EDA refers to the critical process of performing -initial investigations on data to discover patterns, to spot -anomalies, to test hypothesis and to check assumptions with the help -of summary statistics and graphical representations. +Edvart is an open-source Python library designed to simplify and streamline +your exploratory data analysis (EDA) process. +Edvart supports different levels of customization: +from a default report generated in one line of code to a fully-customized +report down to the level of code generating the visualizations. + +Key Features +------------ +* **One-line Reports**: Generate a comprehensive set of pandas DataFrame visualizations using a single Python statement. + Edvart supports: + * Data overview, + * Univariate analysis, + * Bivariate analysis, + * Multivariate analysis, + * Grouped analysis, + * Time series analysis. + +* **Customizable Reports**: Produce, iterate, and style detailed reports in Jupyter notebooks and HTML formats. +* **Flexible API**: From high-level simplicity in a single line of code to detailed control, choose the API level that fits your needs. +* **Interactive Visualizations**: Many of the visualizations are interactive and can be used to explore the data in detail. -EDVART serves for speeding up EDA and for -creating Data analysis reports. 
Table of Contents ----------------- @@ -18,15 +30,16 @@ Table of Contents :maxdepth: 2 installation.rst - getting_started.rst - advanced.rst + usage.rst + sections.rst api_reference.rst .. include:: installation.rst -.. include:: getting_started.rst +.. include:: usage.rst +.. include:: sections.rst Links ------------ +----- * `GitHub repository `_ * :ref:`modindex` diff --git a/docs/installation.rst b/docs/installation.rst index 7b50e55..421fbd8 100644 --- a/docs/installation.rst +++ b/docs/installation.rst @@ -1,14 +1,16 @@ Installation ============ -edvart is distributed via PyPI. -Example installation with pip: +``edvart`` is distributed as a Python package via `PyPI `_. +It can be installed using ``pip``: .. code-block:: console $ pip install edvart -or you can add edvart into your environment file defined by `pyproject.toml`: +We recommend using `Poetry `_ for dependency management. +To add ``edvart`` into a Poetry environment, add the following snippet +to the ``pyproject.toml`` environment definition file: .. parsed-literal:: @@ -17,13 +19,22 @@ or you can add edvart into your environment file defined by `pyproject.toml`: edvart = "|VERSION|" +.. _extras: + Extras ------ -edvart also has an optional dependency "umap", which adds a plot called UMAP -(Universal Manifold Approximation) to Multivariate Analysis. To install edvart with the optional -extra, replace the above snippet of the `pyproject.toml` environment file with the following -snippet: +Edvart has an optional dependency ``umap``, which adds a plot called `UMAP `_ +to :ref:`Multivariate Analysis `. + +To install Edvart with the optional ``umap`` dependency via pip, run the following command: + +.. code-block:: console + + $ pip install "edvart[umap]" + +To install Edvart with the optional extra using Poetry, replace the snippet +of the ``pyproject.toml`` environment file above with the following snippet: .. 
parsed-literal:: @@ -31,40 +42,43 @@ snippet: python = ">=3.8, <3.12" edvart = { version = "|VERSION|", extras = ["umap"] } -To install edvart with the optional "umap" dependency via pip, run the following command: - -.. code-block:: console - - $ pip install "edvart[umap]" - +Rendering Plotly Interactive Plots +---------------------------------- -Plotly -====== +Edvart uses `Plotly `_ to render interactive plots. JupyterLab ----------- +~~~~~~~~~~ To display interactive plots which use Plotly in JupyterLab, you need to install some JupyterLab extensions. -To install the required extensions, you can follow the full guide at -https://plot.ly/python/getting-started/ or simply run the following commands -(inside the JupyterLab container if running in a container): +The extension ``jupyter-dash`` needs to be installed in order for Plotly plots +to be rendered correctly in JupyterLab. +It can be simply installed as a Python package, e.g. via ``pip``: .. code-block:: console - jupyter labextension install @jupyter-widgets/jupyterlab-manager@1.1 --no-build - jupyter labextension install jupyterlab-plotly@1.5.2 --no-build - jupyter labextension install plotlywidget@1.5.2 --no-build - jupyter lab build + pip install jupyter-dash + +To install ``jupyter-dash`` in a Poetry environment, add the following line +under ``tool.poetry.dependencies`` in the ``pyproject.toml`` environment definition file: -Visual Studio Code ------------------- -To display interactive plots which use Plotly in Visual Studio Code notebooks, -you need to install the following extensions: -* `Jupyter `_ is required to - run Jupyter notebooks in Visual Studio Code. -* `Jupyter Notebook Renderers `_ is required - to render Plotly plots in Visual Studio Code notebooks. +.. code-block:: toml + jupyter-dash = "^0.4.2" + + +See https://plot.ly/python/getting-started/ for more information. 
+ +Visual Studio Code +~~~~~~~~~~~~~~~~~~ +The following extensions need to be installed to display Plotly +interactive plots in Visual Studio Code notebooks: + +* `Jupyter `_ + is required to + run Jupyter notebooks in Visual Studio Code. +* `Jupyter Notebook Renderers `_ + is required to render Plotly plots in Visual Studio Code notebooks. diff --git a/docs/sections.rst b/docs/sections.rst new file mode 100644 index 0000000..fb7efe9 --- /dev/null +++ b/docs/sections.rst @@ -0,0 +1,41 @@ +Report Sections +--------------- + +Dataset Overview +~~~~~~~~~~~~~~~~ + - Provides essential information about the whole dataset, such as inferred + data types, number of rows and columns, number of missing values, duplicates, etc. + - See :py:meth:`edvart.report.ReportBase.add_overview` + +Univariate Analysis +~~~~~~~~~~~~~~~~~~~ + - Provides analysis of individual columns. The analysis differs based on the data type of the column. + - See :py:meth:`edvart.report.ReportBase.add_univariate_analysis` + +Bivariate Analysis +~~~~~~~~~~~~~~~~~~ + - Provides analysis of pairs of columns, such as correlations, scatter plots, contingency tables, etc. + - See :py:meth:`edvart.report.ReportBase.add_bivariate_analysis` + + +.. _multivariate_analysis: + +Multivariate Analysis +~~~~~~~~~~~~~~~~~~~~~ + - Provides analysis of all columns together. + - Currently features PCA, parallel coordinates and parallel categories subsections. + Additionally, a UMAP section is included if the :ref:`extra` dependency ``umap`` is installed. + - See :py:meth:`edvart.report.ReportBase.add_multivariate_analysis` + +Group Analysis +~~~~~~~~~~~~~~ + - Provides analysis of each column when grouped by a column or a set of columns. + Includes basic information similar to dataset overview and univariate analysis, + but on a per-group basis. + - See :py:meth:`edvart.report.ReportBase.add_group_analysis` + +Timeseries Analysis +~~~~~~~~~~~~~~~~~~~ + - Provides analysis specific to time series. 
+ - Used with :py:class:`edvart.report.TimeseriesReport` + - See :py:meth:`edvart.report.TimeseriesReport.add_timeseries_analysis` diff --git a/docs/usage.rst b/docs/usage.rst new file mode 100644 index 0000000..49da871 --- /dev/null +++ b/docs/usage.rst @@ -0,0 +1,255 @@ +Usage +===== + +Quick Start +----------- + +Show a Default Report in a Jupyter Notebook +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + import edvart + + + df = edvart.example_datasets.dataset_titanic() + edvart.DefaultReport(df).show() + +Export the Report Code to a Jupyter Notebook +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + import edvart + + + df = edvart.example_datasets.dataset_titanic() + report = edvart.DefaultReport(df) + report.export_notebook( + "titanic_report.ipynb", + dataset_name="Titanic", + dataset_description="Dataset of 891 of the real Titanic passengers.", + ) + +The exported notebook contains the code which generates the report. +It can be modified to fine-tune the report. +The code can be exported with different levels of detail (see :ref:`verbosity`). + +Export a Report to HTML +~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + import edvart + + + df = edvart.example_datasets.dataset_titanic() + report = edvart.DefaultReport(df) + report.export_html( + html_filepath="titanic_report.html", + dataset_name="Titanic", + dataset_description="Dataset of 891 of the real Titanic passengers.", + ) + + +A :py:class:`~edvart.report.Report` can be directly exported +to HTML via the :py:meth:`~edvart.report.ReportBase.export_html` method. + +Jupyter notebooks can be exported to other formats including HTML, using a tool +called `jupyter nbconvert` (https://nbconvert.readthedocs.io/en/latest/). +This can be useful to create a HTML report from a notebook which was exported +using the :py:meth:`~edvart.report.ReportBase.export_notebook` method. 
+ +Customizing the Report +---------------------- + +This section describes several concepts behind edvart and how a report +can be customized. + +Report Class +~~~~~~~~~~~~ + +The :py:class:`~edvart.report.Report` class is central to the edvart API. +A *Report* consists of sections, which can be added via methods of the :py:class:`~edvart.report.Report` class. +The class :py:class:`~edvart.report.DefaultReport` is a subclass of :py:class:`~edvart.report.Report`, +which includes a default set of sections. + +With an instance of :py:class:`~edvart.report.Report` you can: + +1. Show the report directly in a Jupyter notebook using the :py:meth:`~edvart.report.Report.show` method. +2. Export the code which generates the report to a new Jupyter notebook using the + :py:meth:`~edvart.report.ReportBase.export_notebook` method. + The code can be exported with different levels of :ref:`verbosity `. + The notebook containing the exported code can be modified to fine-tune the report. +3. Export the output to an HTML file. You can specify an + `nbconvert template + `_ + to style the report. + + +Selection of Sections +~~~~~~~~~~~~~~~~~~~~~ +You can add sections using methods ``add_*`` (e.g. :py:meth:`edvart.report.ReportBase.add_overview`) of the :py:class:`~edvart.report.Report` class. + +.. code-block:: python + + # Include univariate and bivariate analysis + import edvart + + + df = edvart.example_datasets.dataset_titanic() + report = ( + edvart.Report(df) + .add_univariate_analysis() + .add_bivariate_analysis() + ) + +.. _sections-config: + +Configuration of Sections +~~~~~~~~~~~~~~~~~~~~~~~~~ + +Each section can also be configured. +For example, you can define which columns should be used or omitted. + +.. 
code-block:: python + + import edvart + + + df = edvart.example_datasets.dataset_titanic() + report = ( + edvart.Report(df) + .add_overview(omit_columns=["PassengerId"]) + .add_univariate_analysis(use_columns=["Name", "Sex", "Age"]) + ) + + +Subsections +*********** + +Some sections are made of subsections. For those, you can configure which subsections are included. + +.. code-block:: python + + import edvart + from edvart.report_sections.dataset_overview import Overview + + + df = edvart.example_datasets.dataset_titanic() + report = edvart.Report(df) + + report.add_overview( + subsections=[ + Overview.OverviewSubsection.QuickInfo, + Overview.OverviewSubsection.DataPreview, + ] + ) + + +.. _verbosity: + +Verbosity +~~~~~~~~~ + +A :py:class:`~edvart.report.Report` can be exported to a Jupyter notebook containing +the code which generates the report. The code can be exported with different levels of detail, +referred to as *verbosity*. + +Verbosity can be set on the level of the whole report or on the level of each +section or subsection separately (see :ref:`sections-config`). + +Specific verbosity overrides general verbosity, i.e. the verbosity set on a +subsection overrides the verbosity set on a section, which overrides +the verbosity set on the report. + +EDVART supports three levels of verbosity: + +LOW + High-level functions for whole sections are exported, i.e. the output + of each section is generated by a single function call. + Suitable for small modifications such as changing parameters of the functions, + adding commentary to the report, adding visualizations which are not in EDVART, etc. + +MEDIUM + For report sections which consist of subsections, each subsection is + exported to a separate function call. + Same as LOW for report sections which do not consist of subsections. + +HIGH + The definitions of (almost) all functions are exported. + The functions can be modified or used as a starting point for custom analysis. + + +Examples +******** + +.. 
code-block:: python + + # Set default verbosity for all sections to Verbosity.MEDIUM + import edvart + from edvart import Verbosity + + + df = edvart.example_datasets.dataset_titanic() + edvart.DefaultReport(df, verbosity=Verbosity.MEDIUM).export_notebook("test-export.ipynb") + + +.. code-block:: python + + import edvart + from edvart import Verbosity + + + # Set report verbosity to Verbosity.MEDIUM but use verbosity Verbosity.HIGH for univariate analysis + df = edvart.example_datasets.dataset_titanic() + edvart.DefaultReport( + df, + verbosity=Verbosity.MEDIUM, + verbosity_univariate_analysis=Verbosity.HIGH, + ).export_notebook("exported-report.ipynb") + + +Reports for Time Series Datasets +-------------------------------- + +The class :py:class:`~edvart.report.TimeseriesReport` is a version +of the :py:class:`~edvart.report.Report` class which is specific for creating +reports on time series datasets. +There is also a :py:class:`~edvart.report.DefaultTimeseriesReport`, which contains +a default set of sections, similar to :py:class:`~edvart.report.DefaultReport`. + + +The main differences compared to the report for tabular data are: + +* a different set of default sections for :py:class:`~edvart.report.DefaultTimeseriesReport` +* :py:class:`~edvart.report_sections.TimeseriesAnalysis` section, which contains visualizations + for analyzing time series data +* the assumption that the input data is time-indexed and sorted by time. + +Helper functions :py:func:`edvart.utils.reindex_to_period` or :py:func:`edvart.utils.reindex_to_datetime` +can be used to index a DataFrame by a ``pd.PeriodIndex`` or a ``pd.DatetimeIndex`` respectively. + +Each column in the input data is treated as a separate time series. + +.. 
code-block:: python + + df = pd.DataFrame( + data=[ + ["2018Q1", 120000, 11000], + ["2018Q2", 150000, 13000], + ["2018Q3", 100000, 12000], + ["2018Q4", 110000, 11000], + ["2019Q1", 120000, 13000], + ["2019Q2", 110000, 12000], + ["2019Q3", 120000, 14000], + ["2019Q4", 160000, 12000], + ["2020Q1", 130000, 12000], + ], + columns=["Quarter", "Revenue", "Profit"], + ) + + # Reindex using helper function to have 'Quarter' as index + df = edvart.utils.reindex_to_datetime(df, datetime_column="Quarter") + report_ts = edvart.DefaultTimeseriesReport(df) + report_ts.show()