Data

Data is the class used to represent any body of data. Its built-in methods are designed for memory-efficient loading and manipulation of large datasets.

class src.abstractions.data.Data(data_name: str, data_type: Literal['pretrain', 'sft', 'preference'] = 'sft', data_path: str | None = None, data_content: List[Dict] | None = None, **kwargs)

The Data class stores a body of data’s path, format, relevant fields, etc., allowing for memory-efficient manipulation of large bodies of data.

__init__(data_name: str, data_type: Literal['pretrain', 'sft', 'preference'] = 'sft', data_path: str | None = None, data_content: List[Dict] | None = None, **kwargs)

Initialize.

Parameters:
  • data_name (str) – Necessary, name of the data

  • data_type (Literal["pretrain", "sft", "preference"] = "sft") – Optional, type of usage of data, i.e. which stage of training it will be used in.

  • data_path (Optional[str] = None) – Optional. Search path of data. When data_path is omitted, make sure the dataset exists in ‘./libs/llama_factory/data/’ or another data search path (see abstractions_config.json) and is recognized by Llama-Factory.

  • data_content (List[Dict] = None) – Optional. Content of data. When data_content is provided, it is written to data_path to create a new dataset; if data_path is not provided, the dataset is saved to ‘./output/datasets/{data_name}.json’.

Raises:

FileNotFoundError – If data_path is not specified and the file is not found in the default search paths.

Examples:
Data('c4_demo', data_type = 'sft', data_path = './libs/llama_factory/data/c4_demo.json')
Data('c4_demo', data_type = 'sft')
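
The data_content argument is a list of dicts, and the resulting dataset file is a single JSON list. The following is a minimal sketch of that layout using only the standard library; the record fields, the dataset name my_sft_demo and the temporary path are hypothetical, and the Data calls are shown only as comments because they need the library to be importable.

```python
import json
import os
import tempfile

# Hypothetical SFT records; these field names are illustrative only.
records = [
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
    {"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"},
]

# Write the records as one JSON list, i.e. the List[Dict] layout.
dataset_path = os.path.join(tempfile.mkdtemp(), "my_sft_demo.json")
with open(dataset_path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Read the file back to confirm that it round-trips.
with open(dataset_path, encoding="utf-8") as f:
    loaded = json.load(f)

# With the library importable, either call below would then apply (not run here):
#   Data('my_sft_demo', data_type='sft', data_path=dataset_path)
#   Data('my_sft_demo', data_type='sft', data_content=records)
```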

all_passages() → Iterable[Dict[Hashable, Any]]

Returns an iterator of all passages (json dicts) in this dataset.

copy() → Data

Returns a shallow copy of the current Data instance.

manage_llama_factory_registration(operation: Literal['add', 'remove', 'query'], forced_update: bool = True) → bool

Add, remove, or query the registration status of the current dataset in Llama-Factory. No changes are made when adding an already existing dataset, or when removing a non-existent dataset.

Parameters:
  • operation (Literal["add", "remove", "query"]) – The operation to perform. It can be “add”, “remove”, or “query”.

  • forced_update (bool = True) – Whether to forcefully update the data

Returns:

A boolean indicating whether the dataset was registered before this operation.

Return type:

bool
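
The return convention (the status before the operation, with repeated adds and removes being no-ops) can be illustrated with a toy model. This is only a sketch: manage_registration and the in-memory set are hypothetical stand-ins, not the library's code, and forced_update is deliberately not modelled.

```python
from typing import Literal, Set


def manage_registration(
    registry: Set[str],
    name: str,
    operation: Literal["add", "remove", "query"],
) -> bool:
    """Toy model: apply the operation and return the status *before* it."""
    was_registered = name in registry
    if operation == "add":
        registry.add(name)        # no effect if already present
    elif operation == "remove":
        registry.discard(name)    # no effect if absent
    elif operation != "query":
        raise ValueError(f"unknown operation: {operation!r}")
    return was_registered


registry: Set[str] = set()
first_add = manage_registration(registry, "c4_demo", "add")
second_add = manage_registration(registry, "c4_demo", "add")
queried = manage_registration(registry, "c4_demo", "query")
removed = manage_registration(registry, "c4_demo", "remove")
after_removal = manage_registration(registry, "c4_demo", "query")
```

Under this model, first_add is False because nothing was registered yet, while second_add, queried and removed are True.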

save_permanent_and_register(saved_name: str | None = None, forced_rewrite: bool = False)

Data will be saved to the data_save_path specified in abstractions_config.json. Even without calling this method, the data remains in ./output/ and can be used directly next time without specifying the full path. Do not include a path or suffix in the saved_name argument.

set_key_fields(prompt_field_name: str | None = None, query_field_name: str | None = None, response_field_name: str | None = None, system_field_name: str | None = None, suppress_registration_update: bool = False, **kwargs) → None

Specify which of the dict fields to use for training. In-place.

Pass an empty string to an argument to erase the corresponding field.

The registration is updated automatically if the dataset is already registered.

Parameters:
  • prompt_field_name (Optional[str] = None) – The name of the prompt field

  • query_field_name (Optional[str] = None) – The name of the query field

  • response_field_name (Optional[str] = None) – The name of the response field

  • system_field_name (Optional[str] = None) – The name of the system field

  • suppress_registration_update (bool = False) – Whether to suppress the update of the registration

Example:
data.set_key_fields(prompt_field_name='content') # for pretraining dataset stored in content field
data.set_key_fields(prompt_field_name='instruction', query_field_name='input', response_field_name='output') # for QA dataset with instruction, input and output fields
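
The update rule described above (a name sets a field, an empty string erases it) can be sketched as a small function over a plain dict. The function update_key_fields, the role names used as dict keys, and the choice to treat None as "leave unchanged" are all assumptions made for illustration; none of them are taken from the library's source.

```python
from typing import Dict, Optional


def update_key_fields(
    current: Dict[str, str],
    prompt_field_name: Optional[str] = None,
    query_field_name: Optional[str] = None,
    response_field_name: Optional[str] = None,
    system_field_name: Optional[str] = None,
) -> Dict[str, str]:
    """Return a new role -> field-name mapping (hypothetical helper)."""
    updated = dict(current)
    requested = {
        "prompt": prompt_field_name,
        "query": query_field_name,
        "response": response_field_name,
        "system": system_field_name,
    }
    for role, field in requested.items():
        if field is None:
            continue                 # assumption: None leaves the role untouched
        if field == "":
            updated.pop(role, None)  # empty string erases the role
        else:
            updated[role] = field
    return updated


qa_fields = update_key_fields(
    {},
    prompt_field_name="instruction",
    query_field_name="input",
    response_field_name="output",
)
without_query = update_key_fields(qa_fields, query_field_name="")
```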

transform(transformation: Callable[[Dict], Dict] | Callable[[List[Dict]], List[Dict]], result_data_name: str, forced_rewrite: bool = False, max_batch_size: int = 1, keep_key_fields: bool = True) → Data

Apply a transformation to every element of the current dataset (a JSON list of JSON dicts whose values may be of mutable or immutable types), and return a Data instance containing the resulting dataset.

Out-of-place. Does not modify self.

This function (like all others in abstractions) is memory-efficient for huge JSON files. The data file is expected to be a JSON file of type List[Dict[Hashable, Any]].

Parameters:
  • transformation (Union[Callable[[Dict], Dict], Callable[[List[Dict]], List[Dict]]]) – Transformation to be performed upon the dataset.

  • result_data_name (str) – The name of the resulting data. Do not include path in result_data_name.

  • forced_rewrite (bool = False) – Whether to forcefully rewrite the existing data

  • max_batch_size (int = 1) – If max_batch_size is specified and is >1, the transformation function must take inputs of type List[Dict] and return a List[Dict].

  • keep_key_fields (bool = True) – If keep_key_fields is True, the registered key_fields names will be copied to the new Data instance. Only do so if the transformation doesn’t rename the key fields.

Returns:

The data after transformation.

Return type:

Data
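
The two accepted shapes of the transformation argument can be sketched as follows: a per-element function for the default max_batch_size of 1, and a batched function for larger batch sizes. The field name content_length and both function names are hypothetical, and the transform calls are left as comments since they require the library.

```python
from typing import Dict, List


def add_length(sample: Dict) -> Dict:
    """Per-element shape: takes one dict and returns one dict."""
    result = dict(sample)  # copy so the input dict is left untouched
    result["content_length"] = len(result.get("content", ""))
    return result


def add_length_batched(samples: List[Dict]) -> List[Dict]:
    """Batched shape: takes a list of dicts and returns a list of dicts."""
    return [add_length(sample) for sample in samples]


sample = {"content": "hello world"}
single = add_length(sample)
batch = add_length_batched([sample, {"content": ""}])

# With the library importable (not run here):
#   data.transform(add_length, 'c4_demo_with_length')
#   data.transform(add_length_batched, 'c4_demo_with_length', max_batch_size=64)
```

Because this sketch adds a field rather than renaming one, the default keep_key_fields=True would remain appropriate.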

class src.abstractions.data.DataFileCollection(collection_name: str, data_type: Literal['pretrain', 'sft', 'preference'] = 'pretrain', collection_path: str | None = None, file_selection_func: Callable[[str], bool] | None = None, **kwargs)

__init__(collection_name: str, data_type: Literal['pretrain', 'sft', 'preference'] = 'pretrain', collection_path: str | None = None, file_selection_func: Callable[[str], bool] | None = None, **kwargs)

Initialize.

Parameters:
  • collection_name (str) – Necessary, name of the collection

  • data_type (Literal["pretrain", "sft", "preference"] = "pretrain") – Optional, type of usage of data, i.e. which stage of training it will be used in.

  • collection_path (Optional[str] = None) – Optional. Path of the collection.

  • file_selection_func (Optional[Callable[[str], bool]] = None) – Optional. Predicate on a file path; only files for which it returns True are captured.

If collection_path is omitted, collection_name is searched for in the directories specified in abstractions_config.json. When file_selection_func is supplied, files are captured in real time rather than only at initialization. Only JSON files are captured. You may want to exclude undated.json using file_selection_func, since that file is huge.

Example:
DataFileCollection(collection_name='histtext_1826_to_2018',
                data_type='pretrain',
                collection_path = './dataset/dataset_text_sequence/',
                file_selection_func = (lambda path: 1826 <= int(path.split('/')[-1][1:6]) <= 2018))
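
A named predicate can make the file filter easier to test and more tolerant of files such as undated.json. The sketch below assumes file names consist of one letter followed by a five-digit year and the .json suffix (for example Y01826.json), which is inferred from the [1:6] slice in the example above and is not confirmed by the source; select_by_year is a hypothetical name.

```python
import os
import re

# Assumed file-name pattern: one letter, five digits, '.json' (e.g. 'Y01826.json').
_YEAR_FILE = re.compile(r"^[A-Za-z](\d{5})\.json$")


def select_by_year(path: str, first: int = 1826, last: int = 2018) -> bool:
    """Return True only for year-named files within [first, last]."""
    match = _YEAR_FILE.match(os.path.basename(path))
    if match is None:
        return False  # e.g. 'undated.json' is skipped instead of raising
    return first <= int(match.group(1)) <= last


in_range = select_by_year("./dataset/dataset_text_sequence/Y01826.json")
too_late = select_by_year("./dataset/dataset_text_sequence/Y02019.json")
undated = select_by_year("./dataset/dataset_text_sequence/undated.json")

# With the library importable (not run here):
#   DataFileCollection(collection_name='histtext_1826_to_2018',
#                      data_type='pretrain',
#                      collection_path='./dataset/dataset_text_sequence/',
#                      file_selection_func=select_by_year)
```

Unlike a bare int() call on a slice of the file name, which raises ValueError for non-numeric names, this predicate simply returns False for them.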

all_files() → Iterable[str]

Returns an iterator of all JSON files in this collection. If file_selection_func was specified, files are captured in real time rather than only at initialization.

all_passages() → Iterable[Dict[Hashable, Any]]

Returns an iterator of all passages (JSON dicts) in this collection. If file_selection_func was specified, files are captured in real time rather than only at initialization.

always_force_rewrite: bool = True

The DataFileCollection class stores multi-file data by name, path, type, etc. Before being used for training, a DataFileCollection needs to be converted to Data. Nevertheless, operations similar to those of Data are available in this class.

convert_to_Data(result_data_name: str, forced_rewrite: bool = False, filter_fields=None)

Convert self to a Data instance by merging all data files into one JSON file, and return that Data instance with the name result_data_name. Out-of-place. Does not modify self, and self is still usable after this operation. All data files should be JSON files of type List[Dict[Hashable, Any]].

Parameters:
  • result_data_name (str) – The name of the resulting data

  • forced_rewrite (bool = False) – Whether to forcefully rewrite the existing data

  • filter_fields (Optional = None) – Fields to filter the data (default is None)
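
The merge described above can be pictured with a standard-library sketch that concatenates several JSON list files into one. The helper merge_json_files and the file names are hypothetical, and this version loads every file fully into memory, so it illustrates only the resulting layout and not the library's memory-efficient approach.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, List


def merge_json_files(paths: Iterable[str], merged_path: str) -> int:
    """Concatenate JSON list files into one JSON list; return the element count."""
    merged: List[Dict] = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            merged.extend(json.load(f))  # each file holds a List[Dict]
    with open(merged_path, "w", encoding="utf-8") as f:
        json.dump(merged, f)
    return len(merged)


# Build two small hypothetical input files in a temporary directory.
workdir = tempfile.mkdtemp()
inputs = {
    "part_a.json": [{"content": "first"}, {"content": "second"}],
    "part_b.json": [{"content": "third"}],
}
paths = []
for name, rows in inputs.items():
    file_path = os.path.join(workdir, name)
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(rows, f)
    paths.append(file_path)

merged_path = os.path.join(workdir, "merged.json")
count = merge_json_files(paths, merged_path)
with open(merged_path, encoding="utf-8") as f:
    merged_rows = json.load(f)
```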

copy() → DataFileCollection

Returns a shallow copy of the current DataFileCollection instance.

save_permanent(saved_name: str | None = None, forced_rewrite: bool = False)

DataFileCollection will be saved to the data_save_path specified in abstractions_config.json. Without save_permanent, it will still be present in ./output/ and can still be used directly next time without specifying the full path. Normally, you should not include a full path or suffix in saved_name. If you do, it will be treated as a path, and the collection may not be autodiscovered by abstractions for future use.

transform(transformation: Callable[[Dict], Dict | None] | Callable[[List[Dict]], List[Dict]], result_collection_name: str, forced_rewrite: bool = False, max_batch_size: int = 1, suppress_tqdm: bool = False)

Apply a transformation to every element of the current dataset (a JSON list of JSON dicts whose values may be of mutable or immutable types), and return a DataFileCollection instance containing the resulting dataset. Out-of-place. Does not modify self.

This function (like all others in abstractions) is memory-efficient for huge json files. All data files should be json files with the type of List[Dict[Hashable, Any]].

Parameters:
  • transformation (Union[Callable[[Dict], Optional[Dict]], Callable[[List[Dict]], List[Dict]]]) – Transformation applied to every element of the current dataset

  • result_collection_name (str) – The name of the resulting collection. Do not include path in result_collection_name.

  • forced_rewrite (bool = False) – Whether to forcefully rewrite the existing data

  • max_batch_size (int = 1) – The maximum batch size. If max_batch_size is specified and is >1, the transformation function must take inputs of type List[Dict] and return a List[Dict].

  • suppress_tqdm (bool = False) – Whether to suppress the tqdm progress bar
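
The per-element transformation type here is Callable[[Dict], Optional[Dict]], so a transformation may return None. The sketch below assumes that a None result means the element is dropped; the text above does not state this, so treat it as an assumption. The driver apply_elementwise is a hypothetical stand-in for the library's own loop, and the content field and 20-character threshold are illustrative.

```python
from typing import Callable, Dict, List, Optional


def keep_long_passages(sample: Dict) -> Optional[Dict]:
    """Return the sample unchanged, or None if its content is too short."""
    if len(sample.get("content", "")) < 20:
        return None
    return sample


def apply_elementwise(
    samples: List[Dict],
    transformation: Callable[[Dict], Optional[Dict]],
) -> List[Dict]:
    """Hypothetical driver: keeps only non-None results (assumed semantics)."""
    results = []
    for sample in samples:
        transformed = transformation(sample)
        if transformed is not None:
            results.append(transformed)
    return results


passages = [
    {"content": "too short"},
    {"content": "this passage is comfortably long enough"},
]
kept = apply_elementwise(passages, keep_long_passages)

# With the library importable (not run here):
#   collection.transform(keep_long_passages, 'histtext_long_passages_only')
```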