pandas-validity


What is it?

pandas-validity is a Python library for validating pandas DataFrames. It provides a DataFrameValidator class that works as a context manager: inside the with block you can run multiple validation checks, and any errors they find are collected rather than raised immediately. When the block ends, DataFrameValidator raises a single ValidationErrorsGroup exception that summarizes all collected errors.
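To make the collect-then-raise behavior described above concrete, here is a minimal, self-contained sketch of the pattern. It is not the library's implementation: CollectingValidator, CheckError, CheckErrorsGroup, and run_checks are hypothetical names invented for this illustration, and the sketch works on a plain list of column names instead of a DataFrame.

```python
class CheckError(Exception):
    """Hypothetical stand-in for a single validation failure."""


class CheckErrorsGroup(Exception):
    """Hypothetical stand-in for an exception that summarizes many failures."""

    def __init__(self, errors):
        super().__init__(f"Validation errors found: {len(errors)}.")
        self.errors = list(errors)


class CollectingValidator:
    """Records failures during the with block and raises them together on exit."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.errors = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Raise the collected errors only if the block itself did not fail.
        if exc_type is None and self.errors:
            raise CheckErrorsGroup(self.errors)
        return False  # never suppress an exception raised inside the block

    def has_required_columns(self, expected):
        missing = [c for c in expected if c not in self.columns]
        if missing:
            self.errors.append(CheckError(f"missing columns: {missing}"))

    def has_no_redundant_columns(self, expected):
        redundant = [c for c in self.columns if c not in expected]
        if redundant:
            self.errors.append(CheckError(f"redundant columns: {redundant}"))


def run_checks(columns, expected):
    """Run both checks and return the collected error messages as strings."""
    try:
        with CollectingValidator(columns) as validator:
            validator.has_required_columns(expected)
            validator.has_no_redundant_columns(expected)
    except CheckErrorsGroup as group:
        return [str(error) for error in group.errors]
    return []


messages = run_checks(["A", "B", "C", "D"], ["A", "B", "C", "E"])
```

Both checks run even though the first one fails, which is the point of the pattern: the caller sees every problem in one pass instead of fixing them one at a time.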

Installation

Install the latest released version from the Python Package Index (PyPI) with pip:

pip install pandas-validity

Usage

import pandas as pd
import datetime
from pandas_validity import DataFrameValidator

# Create a sample DataFrame
df = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": ["a", None, "c"],
        "C": [2.3, 4.5, 9.2],
        "D": [
            datetime.datetime(2023, 1, 1, 1),
            datetime.datetime(2023, 1, 1, 2),
            datetime.datetime(2023, 1, 1, 3),
        ],
    }
)

# Define your expectations and data type mappings
expected_columns = ['A', 'B', 'C', 'E']
data_types_mapping = {
    "A": 'float',
    "D": 'datetime',
}

# Use DataFrameValidator for validation
with DataFrameValidator(df) as validator:
    validator.is_empty()
    validator.has_required_columns(expected_columns)
    validator.has_no_redundant_columns(expected_columns)
    validator.has_valid_data_types(data_types_mapping)
    validator.has_no_missing_data()

Output:

Error occurred: (<class 'pandas_validity.exceptions.ValidationError'>) The dataframe has missing columns: ['E']
Error occurred: (<class 'pandas_validity.exceptions.ValidationError'>) The dataframe has redundant columns: ['D']
Error occurred: (<class 'pandas_validity.exceptions.ValidationError'>) Column 'A' has an invalid data type: 'int64'
Error occurred: (<class 'pandas_validity.exceptions.ValidationError'>) Found 1 missing value: [{'index': 1, 'column': 'B', 'value': None}]
  + Exception Group Traceback (most recent call last):
...
  | pandas_validity.exceptions.ValidationErrorsGroup: Validation errors found: 4. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | pandas_validity.exceptions.ValidationError: The dataframe has missing columns: ['E']
    +---------------- 2 ----------------
    | pandas_validity.exceptions.ValidationError: The dataframe has redundant columns: ['D']
    +---------------- 3 ----------------
    | pandas_validity.exceptions.ValidationError: Column 'A' has an invalid data type: 'int64'
    +---------------- 4 ----------------
    | pandas_validity.exceptions.ValidationError: Found 1 missing value: [{'index': 1, 'column': 'B', 'value': None}]
    +------------------------------------
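The last error in the output above lists each missing value with its index and column. As a rough illustration of how such a report can be produced with plain pandas, here is a sketch; find_missing is a hypothetical helper written for this example and is not claimed to be how has_no_missing_data works internally.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "B": ["a", None, "c"],
        "C": [2.3, 4.5, 9.2],
    }
)


def find_missing(frame):
    """Return one record per missing cell: its index label, column, and value."""
    mask = frame.isna()  # boolean frame, True where a value is missing
    records = []
    for column in frame.columns:
        # Select the index labels of the rows where this column is missing.
        for index in frame.index[mask[column].to_numpy()]:
            records.append(
                {"index": index, "column": column, "value": frame.at[index, column]}
            )
    return records


missing = find_missing(df)
print(f"Found {len(missing)} missing value(s): {missing}")
```

The exact representation of the missing value (None or NaN) depends on the column dtype that pandas infers, so the sketch reports whatever is stored in the cell.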

The library supports the following data types for validation:

  • predefined: "str", "int", "float", "datetime", "bool"
  • or any Callable that accepts a data type/dtype object and returns a boolean value indicating whether the data type is valid, for example pd.api.types.is_string_dtype
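The callable option in the list above can be sketched with pandas alone. The example below builds a mapping of column names to dtype checks and applies each check to the column's dtype directly; is_numeric is a hypothetical custom check, and applying the callables by hand is an assumption made for illustration, since how the library invokes them internally is not shown here.

```python
import datetime

import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3],
        "C": [2.3, 4.5, 9.2],
        "D": [
            datetime.datetime(2023, 1, 1, 1),
            datetime.datetime(2023, 1, 1, 2),
            datetime.datetime(2023, 1, 1, 3),
        ],
    }
)


def is_numeric(dtype) -> bool:
    """Hypothetical custom check: accept any numeric dtype (int or float)."""
    return pd.api.types.is_numeric_dtype(dtype)


# Each value takes a dtype object and returns True when the dtype is acceptable.
data_types_mapping = {
    "A": is_numeric,
    "C": pd.api.types.is_float_dtype,
    "D": pd.api.types.is_datetime64_any_dtype,
}

# Apply every check to the dtype of its column.
results = {
    column: check(df[column].dtype) for column, check in data_types_mapping.items()
}

# A stricter check on "A" fails, because the column holds integers.
a_is_float = pd.api.types.is_float_dtype(df["A"].dtype)
```

A mapping of this shape is what would be passed to validator.has_valid_data_types in the usage example, in place of the predefined string names.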

Development

Prerequisites: poetry for environment management

The source code is currently hosted on GitHub at ohmycoffe/pandas-validity. To get the development version:

git clone git@github.com:ohmycoffe/pandas-validity.git

To install the project and development dependencies:

make install 

To run tests:

make test 

To view all possible commands, use:

make help

License

This project is licensed under the terms of the MIT license.