Data

This is the class used to represent any body of data. With several memory-efficient methods built in, this class is designed for efficient manipulation and loading of data.

class src.abstractions.data.Data(data_name: str, data_type: Literal['pretrain', 'sft', 'preference'] = 'sft', data_path: str | None = None, data_content: List[Dict] | None = None, **kwargs)

The Data class stores a body of data’s path, format, relevant fields, etc., allowing for memory-efficient manipulation of large bodies of data.

__init__(data_name: str, data_type: Literal['pretrain', 'sft', 'preference'] = 'sft', data_path: str | None = None, data_content: List[Dict] | None = None, **kwargs)

Initialize.

Parameters:
  • data_name (str) – Necessary, name of the data

  • data_type (Literal["pretrain", "sft", "preference"] = "sft") – Optional, type of usage of data, i.e. which stage of training it will be used in.

  • data_path (Optional[str] = None) – Optional. Search path of data. When data_path is omitted, make sure it exists in ProgressGym/libs/llama_factory/data/ or other data search paths (see abstractions_config.json) and is recognized by Llama-Factory.

  • data_content (List[Dict] = None) – Optional. Content of data. When data_content is provided, the given content will be written to a data_path to create a new dataset, unless data_path is not provided in which case the dataset will be saved to ‘./output/datasets/{data_name}.json’.

Raises:

FileNotFoundError – If file is not found in default search path and path is not specified.

Examples:
Data('c4_demo', data_type = 'sft', data_path = './libs/llama_factory/data/c4_demo.json')
Data('c4_demo', data_type = 'sft')
all_passages() Iterable[Dict[Hashable, Any]]

Returns an iterator of all passages (json dicts) in this dataset.

append_content(field_key: str, content: str | Iterable[str], out_of_place: bool = False, map_key_fields: bool = False) Data

Append content to a specified field in the dataset.

Parameters:
  • field_key (str) – The key of the field to append content to.

  • content (Union[str, Iterable[str]]) – The content to append. Can be a single string or an iterable of strings, where each string corresponds to the content to append for a different sample in the dataset.

  • out_of_place (bool = False) – Whether to perform the operation out-of-place. If out_of_place is True, the original data will not be modified, and a new Data instance with an annotated name will be returned. Otherwise, the original data will be modified in-place, and the same Data instance will be returned.

  • map_key_fields (bool = False) – Whether to map the key fields to the default key fields before appending content.

Returns:

The data after the operation.

Return type:

Data.

copy(data_name: str | None = None) Data

Returns a copy of the current Data instance. Shallow copy if data_name is not provided or identical to the current data_name; deep copy otherwise.

filter_incomplete_samples(out_of_place: bool = False) Data

Remove the samples that has at least one of the key fields missing.

Parameters:

out_of_place (bool = False) – Whether to perform the operation out-of-place. If out_of_place is True, the original data will not be modified, and a new Data instance with an annotated name will be returned. Otherwise, the original data will be modified in-place, and the same Data instance will be returned.

Returns:

The data after the operation.

Return type:

Data.

manage_llama_factory_registration(operation: Literal['add', 'remove', 'query'], forced_update: bool = True) bool

Add, remove, or query the registration status of the current dataset in Llama-Factory. No changes are made when adding an already existing dataset, or when removing a non-existent dataset.

Parameters:
  • operation (Literal["add", "remove", "query"]) – The operation to perform. It can be “add”, “remove”, or “query”.

  • forced_update (bool = True) – Whether to forcefully update the data

Returns:

A boolean meaning the registration status before this operation.

Return type:

bool.

move_current_to_history(out_of_place: bool = False) Data

Move the current dialogue turn in the prompt/question field and the response/predict field to the history field.

Parameters:

out_of_place (bool = False) – Whether to perform the operation out-of-place. If out_of_place is True, the original data will not be modified, and a new Data instance with an annotated name will be returned. Otherwise, the original data will be modified in-place, and the same Data instance will be returned.

Returns:

The data after the operation.

Return type:

Data.

save_permanent_and_register(saved_name: str | None = None, forced_rewrite: bool = False)

Data will be saved to data_save_path from abstractions_config.json. Without save_permanent, it will still be present in ./output/ and can still be directly used next time without specifying the full path. Do not include path and suffix in the saved_name argument.

set_key_fields(prompt_field_name: str | None = None, query_field_name: str | None = None, response_field_name: str | None = None, system_field_name: str | None = None, history_field_name: str | None = None, suppress_registration_update: bool = False, **kwargs) None

Specify which of the dict fields to use for training. In-place.

Pass empty string to an argument in order to erase that argument.

Will automatically update registration, if already registered.

Parameters:
  • prompt_field_name (Optional[str] = None) – The name of the prompt field

  • query_field_name (Optional[str] = None) – The name of the query field

  • response_field_name (Optional[str] = None) – The name of the response field

  • system_field_name (Optional[str] = None) – The name of the system field

  • history_field_name (Optional[str] = None) – The name of the history field

  • suppress_registration_update (bool = False) – Whether to suppress the update of the registration

Example:
data.set_key_fields(prompt_field_name='content') # for pretraining dataset stored in content field
data.set_key_fields(prompt_field_name='instruction', query_field_name='input', response_field_name='output') # for QA dataset with system prompt
switch_role_to_assistant(assistant_system_prompt: str | Iterable[str] | None = None, out_of_place: bool = False) Data

Switch the prompt/question field and the response/predict field, thereby shifting the dialogue turn from the user to the assistant.

Parameters:
  • assistant_system_prompt (Union[str, Iterable[str]] = None) – The system prompt of the assistant role. Can be a single string or an iterable of strings, where each string corresponds to the prompt for a different sample in the dataset. If None, a default prompt will be used.

  • out_of_place (bool = False) – Whether to perform the operation out-of-place. If out_of_place is True, the original data will not be modified, and a new Data instance with an annotated name will be returned. Otherwise, the original data will be modified in-place, and the same Data instance will be returned.

Returns:

The data after the operation.

Return type:

Data.

switch_role_to_user(user_system_prompt: str | Iterable[str] | None = None, dialogue_starter: str | Iterable[str] | None = None, out_of_place: bool = False) Data

Switch the prompt/question field and the response/predict field, thereby shifting the dialogue turn from the assistant to the user.

Parameters:
  • user_system_prompt (Union[str, Iterable[str]] = None) – The system prompt of the user role. Can be a single string or an iterable of strings, where each string corresponds to the prompt for a different sample in the dataset. If None, a default prompt will be used.

  • dialogue_starter (str = None) – Placeholder message for the “zeroth” dialogue turn by the assistant that prompts the user to start the conversation.

  • out_of_place (bool = False) – Whether to perform the operation out-of-place. If out_of_place is True, the original data will not be modified, and a new Data instance with an annotated name will be returned. Otherwise, the original data will be modified in-place, and the same Data instance will be returned.

Returns:

The data after the operation.

Return type:

Data.

to_openai_format() Iterable[List[Dict[str, str]]]

Convert the data to OpenAI format, where each dialogue is a list of dictionaries with string keys and string values. Each dictionary represents a dialogue turn.

transform(transformation: Callable[[Dict], Dict] | Callable[[List[Dict]], List[Dict]], result_data_name: str, forced_rewrite: bool = False, max_batch_size: int = 1, keep_key_fields: bool = True, map_key_fields: bool = False) Data

Apply transformation to every element of the current dataset (in the format of a json list of json dicts where the values are of mutable or immutable types), and returns a Data instance containing the resulting dataset.

Out-of-place. Does not modify self.

This function (like all others in abstractions) is memory-efficient for huge json files. The data file will be a json file with the type of List[Dict[Hashable, Any]].

Parameters:
  • transformation (Union[Callable[[Dict], Dict], Callable[[List[Dict]], List[Dict]]) – Transformation to be performed upon the dataset.

  • result_data_name (str) – The name of the resulting data. Do not include path in result_data_name.

  • forced_rewrite (bool = False) – Whether to forcefully rewrite the existing file, if there is one.

  • max_batch_size (int = 1) – If max_batch_size is specified and is >1, the transformation function must take inputs of type List[Dict] and return a List[Dict].

  • keep_key_fields (bool = True) – If keep_key_fields is True, the registered key_fields names will be copied to the new Data instance. Only do so if the transformation doesn’t rename the key fields.

Returns:

The data after transformation.

Return type:

Data.

class src.abstractions.data.DataFileCollection(collection_name: str, data_type: Literal['pretrain', 'sft', 'preference'] = 'pretrain', collection_path: str | None = None, file_selection_func: Callable[[str], bool] | None = None, **kwargs)
__init__(collection_name: str, data_type: Literal['pretrain', 'sft', 'preference'] = 'pretrain', collection_path: str | None = None, file_selection_func: Callable[[str], bool] | None = None, **kwargs)

Initialize.

Parameters:
  • prompt_field_name (Optional[str] = None) – The name of the prompt field

  • query_field_name (Optional[str] = None) – The name of the query field

  • response_field_name (Optional[str] = None) – The name of the response field

  • system_field_name (Optional[str] = None) – The name of the system field

  • suppress_registration_update (bool = False) – Whether to suppress the update of the registration

If collection_path is omitted, we will search for collection_name in directories specified in abstractions_config.json. When file_selection_func is supplied, files will be captured real-time, instead of only when initializing. Only json files will be captured. You may want to exclude undated.json using file_selection_func. That file is huge.

Example:
DataFileCollection(collection_name='histtext_1826_to_2018',
                data_type='pretrain',
                collection_path = f'{root}/dataset/dataset_text_sequence/',
                file_selection_func = (lambda path: 1826 <= int(path.split('/')[-1][1:6]) <= 2018))
all_files() Iterable[str]

Returns an iterator of all json files in this collection. If file_selection_func had been specified, files will be captured real-time, instead of only when initializing.

all_passages() Iterable[Dict[Hashable, Any]]

Returns an iterator of all passages (json dicts) in this collection. If file_selection_func had been specified, files will be captured real-time, instead of only when initializing.

always_force_rewrite: bool = True

The Data File Collection class stores multi-file data, by name, path and type, etc. Before being used for training, the Data File Collection needs to be converted to Data. Operations similar to those of Data are available in this class nevertheless.

convert_to_Data(result_data_name: str, forced_rewrite: bool = False, filter_fields=None)

Convert self to an Data instance by merging all data files into one json, and return that Data instance with name result_data_name. Out-of-place. Does not modify self, and self is still usable after this operation. All data files should be json files with the type of List[Dict[Hashable, Any]].

Parameters:
  • result_data_name (str) – The name of the resulting data

  • forced_rewrite (bool = False) – Whether to forcefully rewrite the existing data

  • filter_fields (Optional = None) – Fields to filter the data (default is None)

copy() DataFileCollection

Returns a shallow copy of the current DataFileCollection instance.

save_permanent(saved_name: str | None = None, forced_rewrite: bool = False)

DataFileCollection will be saved to data_save_path from abstractions_config.json. Without save_permanent, it will still be present in ./output/ and can still be directly used next time without specifying the full path. Normally, you should not include full path and/or suffix in saved_name. If you do, it will be seen as a path. In this case, the collection may not be autodiscovered by abstractions for future use.

transform(transformation: Callable[[Dict], Dict | None] | Callable[[List[Dict]], List[Dict]], result_collection_name: str, forced_rewrite: bool = False, max_batch_size: int = 1, suppress_tqdm: bool = False)

Apply transformation to every element of the current dataset (in the format of a json list of json dicts where the values are of mutable or immutable types), and returns a DataFileCollection instance containing the resulting dataset. Out-of-place. Does not modify self.

This function (like all others in abstractions) is memory-efficient for huge json files. All data files should be json files with the type of List[Dict[Hashable, Any]].

Parameters:
  • transformation (Union[Callable[[Dict], Optional[Dict]], Callable[[List[Dict]], List[Dict]]]) – Transformation applied to every element of the current dataset

  • result_collection_name (str) – The name of the resulting collection. Do not include path in result_collection_name.

  • forced_rewrite (bool = False) – Whether to forcefully rewrite the existing data

  • max_batch_size (int = 1) – The maximum batch size. If max_batch_size is specified and is >1, the transformation function must take inputs of type List[Dict] and return a List[Dict].

  • suppress_tqdm (bool = False) – Whether to suppress the tqdm progress bar