DataFlow
Overview
OpenDCAI/DataFlow is a data preparation and training system. It generates, refines, evaluates, and filters high-quality data for AI from noisy sources such as PDFs, plain text, and low-quality QA data.
Its goal is to improve the performance of large language models (LLMs) through targeted training in specific domains such as healthcare, finance, law, and academic research.
DataFlow uses an operator-based design that turns the entire data cleaning workflow into a reproducible, reusable, and shareable pipeline. It is intended to serve as core infrastructure for the Data-Centric AI community.
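The operator-based idea can be illustrated with a minimal Python sketch. This is not DataFlow's actual API: the `Operator` class, the `run_pipeline` helper, and the three example operators are hypothetical names chosen for illustration, and records are assumed to be plain dictionaries with a `text` field.

```python
from dataclasses import dataclass
from typing import Callable, List

Record = dict  # assumption: each record is a dict with a "text" key


@dataclass(frozen=True)
class Operator:
    """Hypothetical operator: a named, reusable transformation over records."""
    name: str
    fn: Callable[[List[Record]], List[Record]]

    def __call__(self, records: List[Record]) -> List[Record]:
        return self.fn(records)


def _dedupe(records: List[Record]) -> List[Record]:
    """Keep the first occurrence of each distinct text."""
    seen = set()
    unique = []
    for record in records:
        if record["text"] not in seen:
            seen.add(record["text"])
            unique.append(record)
    return unique


# Illustrative operators, not taken from the DataFlow code base.
strip_whitespace = Operator(
    "strip_whitespace",
    lambda rs: [{**r, "text": r["text"].strip()} for r in rs],
)
drop_short = Operator(
    "drop_short",
    lambda rs: [r for r in rs if len(r["text"]) >= 10],
)
dedupe = Operator("dedupe", _dedupe)


def run_pipeline(operators: List[Operator], records: List[Record]) -> List[Record]:
    """Apply each operator in order; the operator list itself is the pipeline."""
    for operator in operators:
        records = operator(records)
    return records


PIPELINE = [strip_whitespace, drop_short, dedupe]

raw = [
    {"text": "  What is a data pipeline?  "},
    {"text": "What is a data pipeline?"},
    {"text": "ok"},
]
cleaned = run_pipeline(PIPELINE, raw)
```

Because the pipeline in this sketch is just an ordered list of named operators, it can be stored, shared, and rerun on new data, which is the reproducibility property the description emphasizes.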
OpenDCAI/DataFlow also includes an agent that can assemble new pipelines on demand, either by recombining existing operators or by creating new ones.
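One way to picture recombining existing operators is a registry that maps operator names to functions, so that a plan (a list of names) can be turned into a pipeline at run time. The sketch below is an assumption about how such assembly could work, not a description of DataFlow's agent: `REGISTRY`, `register`, and `assemble` are hypothetical, and the plan is hard-coded where an agent would normally produce it.

```python
from typing import Callable, Dict, List

Record = dict  # assumption: each record is a dict with a "text" key
OperatorFn = Callable[[List[Record]], List[Record]]

# Hypothetical registry of available operators, keyed by name.
REGISTRY: Dict[str, OperatorFn] = {}


def register(name: str) -> Callable[[OperatorFn], OperatorFn]:
    """Decorator that adds an operator function to the registry."""
    def decorator(fn: OperatorFn) -> OperatorFn:
        REGISTRY[name] = fn
        return fn
    return decorator


@register("lowercase")
def lowercase(records: List[Record]) -> List[Record]:
    return [{**r, "text": r["text"].lower()} for r in records]


@register("drop_empty")
def drop_empty(records: List[Record]) -> List[Record]:
    return [r for r in records if r["text"].strip()]


def assemble(plan: List[str]) -> OperatorFn:
    """Build a pipeline from operator names, failing early on unknown names."""
    missing = [name for name in plan if name not in REGISTRY]
    if missing:
        raise KeyError(f"unknown operators: {missing}")
    steps = [REGISTRY[name] for name in plan]

    def pipeline(records: List[Record]) -> List[Record]:
        for step in steps:
            records = step(records)
        return records

    return pipeline


# Stand-in for a plan that an agent might propose.
plan = ["drop_empty", "lowercase"]
pipeline = assemble(plan)
result = pipeline([{"text": "Hello"}, {"text": "   "}])
```

Creating a new operator on demand would, in this sketch, amount to registering another function before calling `assemble`.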
Pipelines are built visually with low-code tooling and can be orchestrated flexibly across domains and use cases to turn raw data into high-quality LLM training datasets.
DataFlow also covers text, math, and code data generation, along with tools such as AgenticRAG and Text2SQL for data creation. Other features include large-scale PDF-to-QA conversion and structured data extraction.