We provide Intel-optimized solutions for:
- Tabular - an auto feature engineering pipeline with 50+ essential primitives for feature engineering.
- LLM Text - 10+ essential primitives for text cleaning, fixing, and deduplication, 4 quality control modules, and 2 built-in high-quality data pipelines.
DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre graphviz
pip install pyrecdp --pre
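After installing, it can help to confirm that the prerequisites are actually visible from Python. The following is a minimal sketch, not part of pyrecdp: it assumes the JRE exposes a `java` executable and Graphviz exposes `dot` on `PATH`, and that the package is importable as `pyrecdp`. The helper name `check_prerequisites` is hypothetical.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report which prerequisites are discoverable in this environment."""
    return {
        # openjdk-8-jre installs the `java` launcher.
        "java": shutil.which("java") is not None,
        # graphviz installs the `dot` layout tool.
        "graphviz": shutil.which("dot") is not None,
        # find_spec returns None when a top-level package is not installed.
        "pyrecdp": importlib.util.find_spec("pyrecdp") is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

Any entry reported as missing points back to the corresponding install command above.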
Only 3 lines of code are needed to generate new features for your tabular data. Typically, 5x new features are found, with up to a 1.2x accuracy boost.
from pyrecdp.autofe import AutoFE
pipeline = AutoFE(dataset=train_data, label=target_label, time_series='Day')
transformed_train_df = pipeline.fit_transform()
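To give a sense of what a generated feature can look like, here is a minimal standard-library sketch that derives calendar features from a `Day` column. It illustrates the general idea only and is not how `AutoFE` is implemented. The function name `add_calendar_features`, the sample rows, the output column names, and the assumption that `Day` holds ISO-formatted dates are all hypothetical.

```python
from datetime import date


def add_calendar_features(rows, time_column="Day"):
    """Return copies of rows with features derived from an ISO date column."""
    enriched = []
    for row in rows:
        day = date.fromisoformat(row[time_column])
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row[f"{time_column}__year"] = day.year
        new_row[f"{time_column}__month"] = day.month
        # weekday(): Monday is 0, Sunday is 6.
        new_row[f"{time_column}__weekday"] = day.weekday()
        new_row[f"{time_column}__is_weekend"] = day.weekday() >= 5
        enriched.append(new_row)
    return enriched


# Hypothetical sample data, for illustration only.
sample = [
    {"Day": "2023-01-07", "sales": 120},
    {"Day": "2023-01-09", "sales": 95},
]
for row in add_calendar_features(sample):
    print(row)
```

Each input column here yields several derived columns, which is the sense in which automatic feature engineering multiplies the number of features.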
Low-code way to build your own pipeline:
from pyrecdp.LLM import ResumableTextPipeline
pipeline = ResumableTextPipeline("usecase/finetune_pipeline.yaml")
ret = pipeline.execute()
Or assemble the pipeline from individual operations:
from pyrecdp.primitives.operations import *
from pyrecdp.LLM import ResumableTextPipeline
pipeline = ResumableTextPipeline()
ops = [
    JsonlReader("data/"),
    URLFilter(),
    LengthFilter(),
    ProfanityFilter(),
    TextFix(),
    LanguageIdentify(),
    PIIRemoval(),
    PerfileParquetWriter("ResumableTextPipeline_output")
]
pipeline.add_operations(ops)
pipeline.execute()
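Conceptually, a list of operations like the one above behaves as a chain in which each record either passes through (possibly modified) or is dropped. The sketch below illustrates that idea in plain Python. It does not use or mirror pyrecdp's internals, and its operations are not equivalents of any pyrecdp operation: `normalize_whitespace`, `min_length_filter`, `run_pipeline`, the `text` field name, and the length threshold are all hypothetical.

```python
import json


def normalize_whitespace(record):
    """Collapse runs of whitespace in the text field (a 'fix' step)."""
    return dict(record, text=" ".join(record["text"].split()))


def min_length_filter(record, min_len=10):
    """Keep the record only if its text is long enough (a 'filter' step)."""
    return record if len(record["text"]) >= min_len else None


def run_pipeline(jsonl_lines, operations):
    """Apply each operation in order; a None result drops the record."""
    kept = []
    for line in jsonl_lines:
        record = json.loads(line)
        for op in operations:
            record = op(record)
            if record is None:
                break  # dropped by a filter, skip remaining operations
        else:
            kept.append(record)  # reached only if no operation dropped it
    return kept


# Hypothetical JSONL input, one JSON object per line.
lines = [
    '{"text": "  hello   world,  this is fine "}',
    '{"text": "hi"}',
]
print(run_pipeline(lines, [normalize_whitespace, min_length_filter]))
```

Ordering matters in a chain like this: running the fix step before the filter means length is measured on the cleaned text.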
- Apache 2.0
- Spark 3.4.*
- Python 3.*
- Ray 2.7.*