Overview

Whistler is an open source data quality and profiling tool. Along with profiling it also supports running custom constraints on the data. With underlining support of Apache Spark execution engine Whistler can be extended to data in magnitudes of GB's, TB's. .

🐣 Getting Started

1. Install Whistler

pip install dq-whistler

2. Create a Spark dataframe for you data

# Sample Data
Age,Description
1,"abc"
2,"abc1"
3,
4,"abc4"
10,"xyz"
12,"null"
17,"abc"
20,"abc3"
23,

# You can read data from all the supported sources as per Apache Spark module
df = spark.read.option("header", "true").csv("<your path>")

3. Create a config in the form of python dict or read it from any json file

config = [
   {
      "name": "Age",
      "datatype": "number",
      "constraints":[
         {
            "name": "gt_eq",
            "values": 5
         },
         {
            "name": "is_in",
            "values": [1, 23]
         }
         
      ]
   },
   {
      "name": "Description",
      "datatype": "string",
      "constraints":[
         {
            "name": "regex",
            "values": "([A-Za-z]+)"
         },
         {
            "name": "contains",
            "values": "abc"
         }
         
      ]
   }
]

4. Build an instance of Data Quality Analyzer and execute the checks

from dq_whistler import DataQualityAnalyzer

output = DataQualityAnalyzer(df, config).analyze()

print(output)

[
    {
        "col_name": "Age",
        "total_count": 9,
        "null_count": 0,
        "unique_count": 9,
        "topn_values": {
            "1": 1,
            "2": 1,
            "3": 1,
            "4": 1,
            "10": 1,
            "12": 1,
            "17": 1,
            "20": 1,
            "23": 1
        },
        "min": 1,
        "max": 23,
        "mean": 10.222222222222221,
        "stddev": 8.303279138054101,
        "quality_score": 0,
        "constraints": [
            {
                "name": "gt_eq",
                "values": 5,
                "constraint_status": "failed",
                "invalid_count": 4,
                "invalid_values": [
                    "1",
                    "2",
                    "3",
                    "4"
                ]
            },
            {
                "name": "is_in",
                "values": [
                    1,
                    23
                ],
                "constraint_status": "failed",
                "invalid_count": 7,
                "invalid_values": [
                    "2",
                    "3",
                    "4",
                    "10",
                    "12",
                    "17",
                    "20"
                ]
            }
        ]
    },
    {
        "col_name": "Description",
        "total_count": 9,
        "null_count": 2,
        "unique_count": 7,
        "topn_values": {
            "abc": 2,
            "abc1": 1,
            "xyz": 1,
            "abc4": 1,
            "abc3": 1
        },
        "quality_score": 0,
        "constraints": [
            {
                "name": "regex",
                "values": "([A-Za-z]+)",
                "constraint_status": "success",
                "invalid_count": 0,
                "invalid_values": []
            },
            {
                "name": "contains",
                "values": "abc",
                "constraint_status": "failed",
                "invalid_count": 2,
                "invalid_values": [
                    "xyz",
                    "null"
                ]
            }
        ]
    }
]

📦 Roadmap

The list below contains the functionality that contributors are planning to develop for this module

Visualization
- Visualization of profiling output

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Overview

🐣 Getting Started

1. Install Whistler

2. Create a Spark dataframe for you data

3. Create a config in the form of python dict or read it from any json file

4. Build an instance of Data Quality Analyzer and execute the checks

📦 Roadmap

🎓 Important Resources

👋 Contributing

✨ Contributors

Files

README.md

Latest commit

History

README.md

File metadata and controls

Overview

🐣 Getting Started

1. Install Whistler

2. Create a Spark dataframe for you data

3. Create a config in the form of python dict or read it from any json file

4. Build an instance of Data Quality Analyzer and execute the checks

📦 Roadmap

🎓 Important Resources

👋 Contributing

✨ Contributors