yival.data.csv_reader

Read data from CSV

get_valid_path

def get_valid_path(user_specified_path)

Get valid csv input path.

CSVReader Objects

class CSVReader(BaseReader)

CSVReader is a class derived from BaseReader to read datasets from CSV files.

Attributes:

  • config CSVReaderConfig - Configuration object specifying reader parameters.
  • default_config CSVReaderConfig - Default configuration for the reader.

Methods:

  • __init__(self, config: CSVReaderConfig) - Initializes the CSVReader with a given configuration.
  • read(self, path: str) -> Iterator[List[InputData]] - Reads the CSV file and yields chunks of InputData.

Notes:

  • The read method checks for headers in the CSV file and raises an error if they are missing.
  • It also checks each row for missing data; rows with missing values are skipped and logged.
  • If a column is designated as holding expected results, read extracts those results from each row.
  • Rows are read in chunks, and each chunk is yielded once its size reaches chunk_size.
  • The class can be registered with BaseReader through the register_reader method.

Usage:

    reader = CSVReader(config)
    for chunk in reader.read(path_to_csv_file):
        process(chunk)
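
The behaviour described in the notes for CSVReader can be illustrated with a standalone sketch built only on the standard library. It is not the yival implementation: the function name read_csv_in_chunks, the expected_result_column parameter, the dictionary layout of each record, and the decision to yield a final partial chunk are all assumptions made for illustration.

```python
import csv
import logging
from typing import Any, Dict, Iterator, List, Optional

logger = logging.getLogger(__name__)


def read_csv_in_chunks(
    path: str,
    chunk_size: int = 100,
    expected_result_column: Optional[str] = None,
) -> Iterator[List[Dict[str, Any]]]:
    """Hypothetical helper: yield lists of row records, chunk_size rows at a time."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # An empty file has no header row, so fieldnames is None.
        if not reader.fieldnames:
            raise ValueError(f"CSV file {path!r} has no header row")

        chunk: List[Dict[str, Any]] = []
        for row in reader:
            # DictReader fills short rows with None; empty cells are "".
            if any(value is None or value == "" for value in row.values()):
                logger.warning("Skipping row with missing data: %r", row)
                continue

            # Separate the expected result (if any) from the input columns.
            expected = None
            if expected_result_column is not None:
                expected = row.pop(expected_result_column, None)

            chunk.append({"content": row, "expected_result": expected})
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []

        # Assumption: leftover rows are yielded as a final, smaller chunk.
        if chunk:
            yield chunk
```

Because the function is a generator, only one chunk of rows is held in memory at a time, which is the property the chunked design relies on.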

yival.data.base_reader

This module provides an abstract foundation for data readers.

Data readers are responsible for reading data from various sources, and this module offers a base class to define and register new readers, retrieve existing ones, and fetch their configurations. The design encourages efficient parallel processing by reading data in chunks.

BaseReader Objects

class BaseReader(ABC)

Abstract base class for all data readers.

This class provides a blueprint for data readers and offers methods to register new readers, retrieve registered readers, and fetch their configurations.

Attributes:

  • _registry Dict[str, Dict[str, Any]] - A registry to keep track of data readers.
  • default_config Optional[BaseReaderConfig] - Default configuration for the reader.

get_reader

@classmethod
def get_reader(cls, name: str) -> Optional[Type['BaseReader']]

Retrieve reader class from registry by its name.

get_default_config

@classmethod
def get_default_config(cls, name: str) -> Optional[BaseReaderConfig]

Retrieve the default configuration of a reader by its name.

get_config_class

@classmethod
def get_config_class(cls, name: str) -> Optional[Type[BaseReaderConfig]]

Retrieve the configuration class of a reader by its name.

register_reader

@classmethod
def register_reader(cls,
                    name: str,
                    reader_cls: Type['BaseReader'],
                    config_cls: Optional[Type[BaseReaderConfig]] = None)

Register a reader subclass under the given name, along with its default configuration and config class.
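
As a rough illustration of how the four class methods above could fit together, here is a minimal registry sketch. It is not the yival source: SketchBaseReader, SketchReaderConfig, LineReader, and the keys used inside the registry dictionary are hypothetical names, and taking the default configuration from the reader class attribute is an assumption.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, ClassVar, Dict, Iterator, List, Optional, Type


@dataclass
class SketchReaderConfig:
    """Hypothetical stand-in for a reader configuration class."""
    chunk_size: int = 100


class SketchBaseReader(ABC):
    # One registry shared by all subclasses: name -> registration details.
    _registry: ClassVar[Dict[str, Dict[str, Any]]] = {}
    default_config: ClassVar[Optional[SketchReaderConfig]] = None

    @classmethod
    def register_reader(
        cls,
        name: str,
        reader_cls: Type["SketchBaseReader"],
        config_cls: Optional[Type[SketchReaderConfig]] = None,
    ) -> None:
        # Assumption: the default config is read from the reader class itself.
        cls._registry[name] = {
            "class": reader_cls,
            "default_config": reader_cls.default_config,
            "config_cls": config_cls,
        }

    @classmethod
    def get_reader(cls, name: str) -> Optional[Type["SketchBaseReader"]]:
        entry = cls._registry.get(name)
        return entry["class"] if entry else None

    @classmethod
    def get_default_config(cls, name: str) -> Optional[SketchReaderConfig]:
        entry = cls._registry.get(name)
        return entry["default_config"] if entry else None

    @classmethod
    def get_config_class(cls, name: str) -> Optional[Type[SketchReaderConfig]]:
        entry = cls._registry.get(name)
        return entry["config_cls"] if entry else None

    @abstractmethod
    def read(self, path: str) -> Iterator[List[Any]]:
        ...


class LineReader(SketchBaseReader):
    """Hypothetical reader that yields all lines of a text file as one chunk."""
    default_config = SketchReaderConfig(chunk_size=2)

    def read(self, path: str) -> Iterator[List[str]]:
        with open(path, encoding="utf-8") as f:
            yield [line.rstrip("\n") for line in f]


SketchBaseReader.register_reader("line_reader", LineReader, SketchReaderConfig)
```

Looking up a name that was never registered returns None in this sketch, which matches the Optional return types in the signatures above.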

read

@abstractmethod
def read(path: str) -> Iterator[List[InputData]]

Read data from the given file path and return an iterator of lists containing InputData.

This method is designed to read data in chunks for efficient parallel processing. The chunk size is determined by the reader's configuration.

Arguments:

  • path str - The path to the file containing data to be read.

Returns:

  • Iterator[List[InputData]] - An iterator yielding lists of InputData objects.
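
To show why yielding lists of items suits parallel processing, the sketch below feeds chunks to a thread pool. Everything in it is hypothetical: fake_read stands in for a reader's read method, integers stand in for InputData objects, and process_chunk is a placeholder for real per-chunk work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Iterator, List


def fake_read(n_items: int, chunk_size: int) -> Iterator[List[int]]:
    """Hypothetical stand-in for reader.read(path): yields lists of items."""
    chunk: List[int] = []
    for i in range(n_items):
        chunk.append(i)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def process_chunk(chunk: List[int]) -> int:
    """Placeholder for per-chunk work; here it just sums the items."""
    return sum(chunk)


def process_all(chunks: Iterable[List[int]], max_workers: int = 4) -> List[int]:
    # Each chunk becomes one task; map returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_chunk, chunks))


results = process_all(fake_read(n_items=10, chunk_size=4))
```

Note that Executor.map submits every chunk up front, so this particular pattern trades away some of the memory savings of lazy reading in exchange for simplicity.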

generate_example_id

def generate_example_id(row_data: Dict[str, Any], path: str) -> str

Default function to generate an example_id for a given row of data.
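
The documentation does not say how the ID is derived. One plausible approach, shown here purely as an assumption, is to hash the file path together with a canonical serialization of the row, so that the same row from the same file always maps to the same ID. The function name sketch_example_id is hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def sketch_example_id(row_data: Dict[str, Any], path: str) -> str:
    """Hypothetical ID scheme: MD5 of the path plus the row serialized as JSON."""
    # sort_keys makes the ID independent of column order in the dict;
    # default=str lets non-JSON values (e.g. dates) be serialized.
    row_identifier = path + json.dumps(row_data, sort_keys=True, default=str)
    return hashlib.md5(row_identifier.encode("utf-8")).hexdigest()
```

MD5 is used here only as a fast fingerprint, not for security; any stable hash would serve the same purpose.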

yival.data.huggingface_dataset_reader