Working with Corpora

See the documentation for the fileorganize package, particularly the function dir_to_df(), which is extremely helpful when working with a large collection of audio and label files.

fileorganize.dir_to_df()

See the fileorganize documentation.

phonlab.srt_to_df(srtfile, verbose=True)

Read subtitles in an .srt file and return as a dataframe.

The dataframe is checked for overlapping subtitle texts, and a warning is issued if any overlaps are found.

Parameters:
  • srtfile (pathlike) – Input .srt file path as a Path object or string.

  • verbose (bool (default True)) – If True, print informational messages.

Returns:

df – The output dataframe with time columns t1 and t2 that indicate start and end times of subtitle content, which is in the text column.

Return type:

dataframe
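To make the t1/t2/text layout and the overlap check concrete, here is a minimal standard-library sketch of the kind of parsing described above. It is not phonlab's implementation: the helper names parse_srt, to_seconds, and find_overlaps and the sample subtitles are hypothetical, and the rows are plain dicts rather than a dataframe.

```python
import re

# Hypothetical sample subtitles for illustration only.
SAMPLE_SRT = """\
1
00:00:01,000 --> 00:00:03,500
Speaker1: Hello there.

2
00:00:03,000 --> 00:00:05,000
Speaker2: Hi.

3
00:00:06,000 --> 00:00:07,250
How are you?
"""

TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp):
    """Convert an 'HH:MM:SS,mmm' timestamp to seconds as a float."""
    h, m, s, ms = (int(g) for g in TIME_RE.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def parse_srt(text):
    """Return one {'t1', 't2', 'text'} row per subtitle block."""
    rows = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        # lines[0] is the subtitle index, lines[1] the time range,
        # and any remaining lines are the subtitle text.
        start, end = lines[1].split("-->")
        rows.append(
            {"t1": to_seconds(start), "t2": to_seconds(end), "text": " ".join(lines[2:])}
        )
    return rows


def find_overlaps(rows):
    """Return index pairs (i, i + 1) where a subtitle starts before the previous one ends."""
    return [(i, i + 1) for i in range(len(rows) - 1) if rows[i + 1]["t1"] < rows[i]["t2"]]


rows = parse_srt(SAMPLE_SRT)
overlaps = find_overlaps(rows)
```

In this sample the second subtitle starts at 3.0 s, before the first ends at 3.5 s, so that pair is reported as an overlap. This is the situation in which srt_to_df is documented to issue a warning.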

phonlab.split_speaker_df(df, textcol='text', ts=['t1', 't2'], sep=None, ffill=True, include=[], exclude=[], as_dict=True, verbose=True)

Split speaker identifier from the text contained in a dataframe column, and add speaker as new column.

To help guard against the misparsing of speaker identifiers, an error is raised if any speaker identifier found in the dataframe is not explicitly listed in either the include or the exclude parameter.

Parameters:
  • df (dataframe) – Input dataframe of speaker utterances.

  • textcol (str) – Name of the column in df that contains utterance content. Speaker identifiers are split off from the values in this column, e.g. ‘Speaker1: Some utterance’ yields ‘Speaker1’ and ‘Some utterance’ as the new speaker and textcol columns.

  • ffill (bool (default True)) – If True, df rows which have no speaker value (i.e. do not contain sep and cannot be split) inherit speaker from the immediately preceding row.

  • ts (list of str (default ['t1', 't2'])) – The names of the start and end time columns in the df dataframe. The first name defines the start time of the interval, and the second name defines the end time.

  • include (list of str (default [])) – List of speaker identifiers and associated rows to include in the return value. Hint: To build a list of numbered speaker identifiers, you can use a list comprehension. For example, includelist = [f'Speaker-{n}' for n in range(3)] creates a list of three speakers: ['Speaker-0', 'Speaker-1', 'Speaker-2'].

  • exclude (list of str (default [])) – List of speaker identifiers and associated rows to exclude from the return value.

  • sep (str (default None)) – String on which to split textcol into speaker and utterance.

  • as_dict (bool (default True)) – If True, the return value is a dict with speaker identifiers as keys. The values are dataframes of utterance rows for that speaker. If False, return the original dataframe with a new speaker column.

  • verbose (bool (default True)) – If True, print informational messages.

Returns:

df – If as_dict is True, a dict of dataframes is returned in which the keys are speaker identifiers and the values are the dataframes of utterances by the speaker. If as_dict is False, then a single dataframe is returned with the speaker identifiers in a new column named speaker added to the input dataframe, and with the speaker identifiers removed from textcol.

Return type:

dataframe or dict of dataframes
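The splitting, forward-fill, and include/exclude behavior described above can be illustrated with a small standard-library sketch. This is not phonlab's implementation: split_speakers is a hypothetical helper that works on a list of dicts rather than a dataframe, the ': ' separator is an assumed value for sep, and the utterances are invented sample data.

```python
def split_speakers(rows, textcol="text", sep=": ", ffill=True, include=(), exclude=()):
    """Split 'Speaker: utterance' strings and group the rows by speaker.

    Hypothetical helper that mimics the documented as_dict=True behavior.
    """
    out = []
    previous = None
    for row in rows:
        new = dict(row)
        if sep in row[textcol]:
            # Split only on the first separator so the utterance may contain sep.
            new["speaker"], new[textcol] = row[textcol].split(sep, 1)
        else:
            # No identifier in this row: inherit the preceding speaker if ffill is set.
            new["speaker"] = previous if ffill else None
        previous = new["speaker"]
        out.append(new)

    # Guard against misparsed identifiers: every speaker must be listed explicitly.
    found = {r["speaker"] for r in out if r["speaker"] is not None}
    unknown = found - set(include) - set(exclude)
    if unknown:
        raise ValueError(f"Unlisted speaker identifiers: {sorted(unknown)}")

    # Keep only included speakers, one list of rows per speaker.
    by_speaker = {}
    for r in out:
        if r["speaker"] in include:
            by_speaker.setdefault(r["speaker"], []).append(r)
    return by_speaker


# Invented sample utterances with start (t1) and end (t2) times in seconds.
utterances = [
    {"t1": 1.0, "t2": 3.5, "text": "Speaker-0: Hello there."},
    {"t1": 3.5, "t2": 5.0, "text": "Speaker-1: Hi."},
    {"t1": 6.0, "t2": 7.25, "text": "How are you?"},
]

# Build the include list with a list comprehension, as suggested in the hint.
includelist = [f"Speaker-{n}" for n in range(2)]
result = split_speakers(utterances, include=includelist)
```

Here the third row has no separator, so with ffill enabled it is attributed to Speaker-1, the speaker of the preceding row. Passing include=['Speaker-0'] alone would raise an error in this sketch, because Speaker-1 would then appear in neither include nor exclude.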