Writing a Custom Data Reader in Python with BaseReader
This guide shows how to create a custom data reader by subclassing the provided BaseReader class. The example builds a TXTReader that reads .txt files.
Table of Contents
- Introduction
- BaseReader Overview
- Creating a Custom Reader (TXTReader)
- Conclusion
1. Introduction
Data readers are responsible for reading data from various sources. By subclassing BaseReader, you can create custom readers tailored to your specific data format.
2. BaseReader Overview
The BaseReader class offers a blueprint for designing data readers. It has methods to:
- Register new readers.
- Retrieve registered readers and their configurations.
- Read data in chunks.
The class declares an abstract method, read, that you must override in your custom reader. The method yields data in chunks, which keeps memory use bounded and lets the chunks be processed in parallel.
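The pattern described above (a registry of readers plus an abstract, chunk-yielding read method) can be sketched in plain Python. This is a minimal illustration, not YiVal's actual BaseReader: the names SketchReader, UpperCaseReader, register_reader, and get_reader are hypothetical and chosen only for this example.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Optional, Type


class SketchReader(ABC):
    """Hypothetical stand-in for a reader base class (not YiVal's BaseReader)."""

    # Maps a reader name to its class; shared by the base class and its subclasses.
    _registry: Dict[str, Type["SketchReader"]] = {}

    @classmethod
    def register_reader(cls, name: str, reader_cls: Type["SketchReader"]) -> None:
        cls._registry[name] = reader_cls

    @classmethod
    def get_reader(cls, name: str) -> Optional[Type["SketchReader"]]:
        return cls._registry.get(name)

    @abstractmethod
    def read(self, path: str) -> Iterator[List[str]]:
        """Yield the data at `path` one chunk (a list of items) at a time."""


class UpperCaseReader(SketchReader):
    """Toy subclass: never opens the file, just yields a single one-item chunk."""

    def read(self, path: str) -> Iterator[List[str]]:
        yield [path.upper()]


SketchReader.register_reader("upper", UpperCaseReader)
reader_cls = SketchReader.get_reader("upper")
chunks = list(reader_cls().read("data.txt"))
```

Because read is abstract, instantiating SketchReader directly raises a TypeError; only subclasses that override read can be created.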
3. Creating a Custom Reader (TXTReader)
3.1. Design the TXTReaderConfig Class
Before creating the reader, design a configuration class specific to the TXTReader. This class inherits from BaseReaderConfig:
    from dataclasses import asdict, dataclass

    from yival.data.base_reader import BaseReaderConfig


    @dataclass
    class TXTReaderConfig(BaseReaderConfig):
        """
        Configuration specific to the TXT reader.
        """

        delimiter: str = "\n"  # Default delimiter for txt files.

        def asdict(self):
            return asdict(self)
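To see how such a config behaves without installing the library, the sketch below uses a stand-in base class. StandInReaderConfig and SketchTXTReaderConfig are hypothetical names, and the chunk_size default of 100 is an assumption made for illustration; only the existence of a chunk_size setting is implied by the reader code in this guide.

```python
from dataclasses import asdict, dataclass


@dataclass
class StandInReaderConfig:
    """Hypothetical stand-in for BaseReaderConfig; the default of 100 is assumed."""

    chunk_size: int = 100


@dataclass
class SketchTXTReaderConfig(StandInReaderConfig):
    delimiter: str = "\n"

    def asdict(self):
        # Inside a method body, the bare name `asdict` resolves to the
        # module-level dataclasses.asdict, not to this method.
        return asdict(self)


# Override both the inherited field and the subclass field.
config = SketchTXTReaderConfig(chunk_size=2, delimiter="\t")
as_dict = config.asdict()
```

Fields inherited from the base dataclass come first in the resulting dictionary, followed by the fields the subclass adds.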
3.2. Implement the TXTReader Class
Now create the TXTReader class by subclassing BaseReader:
    from typing import Iterator, List

    from txt_reader_config import TXTReaderConfig
    from yival.data.base_reader import BaseReader
    from yival.schemas.common_structures import InputData


    class TXTReader(BaseReader):
        """
        TXTReader is a class derived from BaseReader to read datasets from TXT
        files.

        Attributes:
            config (TXTReaderConfig): Configuration object specifying reader parameters.

        Methods:
            __init__(self, config: TXTReaderConfig): Initializes the TXTReader with
                a given configuration.
            read(self, path: str) -> Iterator[List[InputData]]: Reads the TXT file
                and yields chunks of InputData.
        """

        config: TXTReaderConfig
        default_config = TXTReaderConfig()

        def __init__(self, config: TXTReaderConfig):
            super().__init__(config)
            self.config = config

        def read(self, path: str) -> Iterator[List[InputData]]:
            chunk = []
            chunk_size = self.config.chunk_size

            with open(path, mode="r", encoding="utf-8") as file:
                for line in file:
                    # With the default "\n" delimiter, this yields a one-element list.
                    line_content = line.strip().split(self.config.delimiter)

                    # Each line in the TXT file is treated as a separate data point.
                    example_id = self.generate_example_id({"content": line_content}, path)
                    input_data_instance = InputData(
                        example_id=example_id,
                        content=line_content
                    )
                    chunk.append(input_data_instance)

                    # Emit a full chunk and start a new one.
                    if len(chunk) >= chunk_size:
                        yield chunk
                        chunk = []

                # Emit any remaining items as a final, smaller chunk.
                if chunk:
                    yield chunk
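The chunking logic in read can be tried in isolation. The following sketch reproduces only the loop structure with plain strings instead of InputData; read_lines_in_chunks is a hypothetical helper written for this example and is not part of the library.

```python
import os
import tempfile
from typing import Iterator, List


def read_lines_in_chunks(path: str, chunk_size: int) -> Iterator[List[str]]:
    """Yield stripped lines from `path` in lists of at most `chunk_size` items."""
    chunk: List[str] = []
    with open(path, mode="r", encoding="utf-8") as file:
        for line in file:
            chunk.append(line.strip())
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
    # The last chunk may be smaller than chunk_size.
    if chunk:
        yield chunk


# Write a five-line sample file to a temporary location.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("a\nb\nc\nd\ne\n")
    tmp_path = tmp.name

try:
    chunk_sizes = [len(c) for c in read_lines_in_chunks(tmp_path, chunk_size=2)]
finally:
    os.remove(tmp_path)
```

With five lines and a chunk size of 2, the generator produces two full chunks followed by one chunk holding the remaining line, which is why the trailing `if chunk:` check matters.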
3.3. Config
After defining the config class and the reader subclass, we can define the YAML config:
    custom_reader:
      txt_reader:
        class: /path/to/text_reader.TXTReader
        config_cls: /path/to/txt_reader_config.TXTReaderConfig
    dataset:
      source_type: dataset
      reader: txt_reader
      file_path: "/Users/taofeng/YiVal/data/headline_generation.txt"
      reader_config:
        delimiter: "\n"
4. Conclusion
Creating custom data readers with the provided framework is straightforward: subclass BaseReader and override its read method to support a new data format. Because readers yield data in chunks, they are suitable for parallel processing and large datasets.