Skip to content

yival.data_generators.openai_prompt_data_generator

This module provides an implementation for data generation using OpenAI's model.

The primary goal of this module is to programmatically generate data examples based on a given prompt and configuration. It employs OpenAI's models to produce these examples, and offers utility functions for transforming and processing the generated data.

OpenAIPromptDataGenerator Objects

class OpenAIPromptDataGenerator(BaseDataGenerator)

Data generator using OpenAI's model based on provided prompts and configurations.

This class is responsible for the generation of data examples using OpenAI's models. The generated data can be used for various purposes, including testing, simulations, and more. The nature and number of generated examples are determined by the provided configuration.

prepare_messages

def prepare_messages(all_data_content,
                     number_of_examples) -> List[Dict[str, Any]]

Prepare the messages for GPT API based on configurations.

process_outputs

def process_outputs(output_content: str,
                    all_data: List[InputData],
                    chunk: List[InputData],
                    fixed_input: Dict[str, Any] | None = {})

Process the output from GPT API and update data lists.

process_output

def process_output(output_content: str,
                   all_data: List[InputData],
                   chunk: List[InputData],
                   fixed_input: Dict[str, Any] | None = {})

Process the output from GPT API and update data lists.

yival.data_generators.base_data_generator

This module provides a foundational architecture for programmatically generating data.

Data generators are responsible for creating data programmatically based on certain configurations. The primary utility of these generators is in scenarios where synthetic or mock data is required, such as testing, simulations, and more. This module offers a base class that outlines the primary structure and functionalities of a data generator. It also provides methods to register new generators, retrieve existing ones, and fetch their configurations.

BaseDataGenerator Objects

class BaseDataGenerator(ABC)

Abstract base class for all data generators.

This class provides a blueprint for data generators and offers methods to register new generators, retrieve registered generators, and fetch their configurations.

Attributes:

  • _registry Dict[str, Dict[str, Any]] - A registry to keep track of data generators.
  • default_config Optional[BaseDataGeneratorConfig] - Default configuration for the generator.

get_data_generator

@classmethod
def get_data_generator(cls, name: str) -> Optional[Type['BaseDataGenerator']]

Retrieve data generator class from registry by its name.

get_default_config

@classmethod
def get_default_config(cls, name: str) -> Optional[BaseDataGeneratorConfig]

Retrieve the default configuration of a data generator by its name.

get_config_class

@classmethod
def get_config_class(cls,
                     name: str) -> Optional[Type[BaseDataGeneratorConfig]]

Retrieve the configuration class of a generator_info by its name.

register_data_generator

@classmethod
def register_data_generator(
        cls,
        name: str,
        data_generator_cls: Type['BaseDataGenerator'],
        config_cls: Optional[Type[BaseDataGeneratorConfig]] = None)

Register data generator class with the registry.

generate_examples

@abstractmethod
def generate_examples() -> Iterator[List[InputData]]

Generate data examples and return an iterator of lists containing InputData.

This method is designed to produce data programmatically. The number and nature of data examples are determined by the generator's configuration.

Returns:

  • Iterator[List[InputData]] - An iterator yielding lists of InputData objects.

generate_example_id

def generate_example_id(content: str) -> str

Generate a unique identifier for a given content string.

Arguments:

  • content str - The content for which an ID should be generated.

Returns:

  • str - A unique MD5 hash derived from the content.

yival.data_generators.document_data_generator

This module provides an implementation for generating question data from documents. Supported types of document sources include: - plain text - unstructured files: Text, PDF, PowerPoint, HTML, Images, Excel spreadsheets, Word documents, Markdown, etc. - documents from Google Drive (provide file id). Currently support only one document a time.

DocumentDataGenerator Objects

class DocumentDataGenerator(BaseDataGenerator)

prepare_messages

def prepare_messages() -> List[Dict[str, Any]]

Prepare the messages for GPT API based on configurations.

process_output

def process_output(output_content: str, all_data: List[InputData],
                   chunk: List[InputData])

Process the output from GPT API and update data lists.