Skip to content

Creating a Custom Data Generator with BaseDataGenerator

This guide will walk you through creating a custom data generator using the provided BaseDataGenerator.

Introduction

The ability to programmatically generate data is crucial in scenarios where synthetic or mock data is required, such as in testing, simulations, and more. The provided foundational architecture for data generators allows for flexibility and extensibility, enabling you to create custom data generators tailored to specific needs.

In this guide, we will demonstrate how to create a custom data generator by extending the BaseDataGenerator. Our custom generator will output a list of predefined strings. By following this guide, you'll gain an understanding of the structure and process, enabling you to develop even more complex generators as needed.

Step 1: Subclassing the BaseDataGenerator

First, create a ListStringDataGenerator that simply outputs a list of strings as specified in its configuration.

from typing import Iterator, List

from list_string_data_generator_config import ListStringGeneratorConfig
from yival.data_generators.base_data_generator import BaseDataGenerator
from yival.schemas.common_structures import InputData



class ListStringDataGenerator(BaseDataGenerator):
    def __init__(self, config: 'ListStringGeneratorConfig'):
        super().__init__(config)

    def generate_examples(self) -> Iterator[List[InputData]]:
        for string_data in self.config.strings_to_generate:
            yield [InputData(example_id=self.generate_example_id(string_data), content=string_data)]

Step 2: Providing a Configuration Class

To specify the list of strings our generator should output, define a custom configuration class:

from dataclasses import dataclass, field
from typing import List

from yival.schemas.data_generator_configs import BaseDataGeneratorConfig


@dataclass
class ListStringGeneratorConfig(BaseDataGeneratorConfig):
    """
    Configuration for the ListStringDataGenerator.
    """
    strings_to_generate: List[str] = field(default_factory=list)

Config

In your configuration (YAML), you can now specify the use of this data generator:

custom_data_generators:
  list_string_data_generator:
    class: /path/to/list_string_data_generator.ListStringDataGenerator
    config_cls: /path/to/list_string_data_generator_config.ListStringGeneratorConfig
dataset:
  data_generators:
    list_string_data_generator:
      strings_to_generate:
        - abc
        - def
  source_type: machine_generated