mealprep package

Submodules

mealprep.mealprep module

mealprep.mealprep.find_bad_apples(df)

This function uses a univariate approach to outlier detection. For each column that contains outliers (values that are 2 or more standard deviations from the mean), it records the row indices of the outliers and the total number of outliers in that column.

Note: This function works best for small datasets with unimodal variable distributions.

Note: If your dataframe has duplicate column names, only the last of the duplicated columns will be checked.

Parameters: df (pandas.core.frame.DataFrame) – A dataframe containing numeric data
Returns: bad_apples – A dataframe with 3 columns: Variable (column name), Indices (list of row indices with outliers), and Total Outliers (number of outliers in the column)
Return type: pandas.DataFrame

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({'A' : ['test', 1, 1, 1, 1]})
>>> find_bad_apples(df)
AssertionError: Every column in your dataframe must be numeric.
>>> df = pd.DataFrame({'A' : [1, 1, 1, 1, 1],
...                    'B' : [10000, 1, 1, 1, 1]})
>>> find_bad_apples(df)
AssertionError: Sorry, you don't have enough data.
The dataframe needs to have at least 30 rows.
>>> df = pd.DataFrame({'A' : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
...                             1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
...                    'B' : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,-100,
...                             1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,100],
...                    'C' : [1,1,1,1,1,19,1,1,1,1,1,1,1,1,19,1,1,1,1,
...                             1,1,1,1,1,1,1,19,1,1,1,1,1,1,1,1]})
>>> find_bad_apples(df)
Variable      Indices     Total Outliers
    B         [17, 34]          2
    C      [5, 14, 26]          3
>>> df = pd.DataFrame({'A' : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
...                             1,1,1,1,1,1,1,1,1,1,1,1,1],
...                    'B' : [1.000001, 1.000001, 1.000001, 1.000001,
...                           1.000001, 1.000001, 1.000001, 1.000001,
...                           1.000001, 1.000001, 1.000001, 1.000001,
...                           1.000001, 1.000001, 1.000001,
...                           1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]})
>>> find_bad_apples(df)
Variable                Indices     Total Outliers
No outliers detected        []              0
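
The two-standard-deviation rule described above can be sketched without any dependencies. This is an illustration of the rule only, not the package's implementation: the helper name flag_outliers is hypothetical, a plain list stands in for one dataframe column, and the use of the sample standard deviation is an assumption, since the documentation does not say which estimator is used.

```python
from statistics import mean, stdev


def flag_outliers(values, threshold=2.0):
    """Return the indices of values at least `threshold` SDs from the mean.

    Hypothetical helper: it mirrors the rule in the description above,
    not the actual code of find_bad_apples.
    """
    centre = mean(values)
    spread = stdev(values)  # assumption: sample SD (n - 1 denominator)
    if spread == 0:
        return []  # a constant column has no outliers
    return [i for i, v in enumerate(values)
            if abs(v - centre) >= threshold * spread]


# Same data as column 'C' in the example above: 35 values, three of them 19.
column_c = [1] * 35
for i in (5, 14, 26):
    column_c[i] = 19

print(flag_outliers(column_c))  # [5, 14, 26]
```

The guard for a zero standard deviation matters because a constant column would otherwise have every value flagged: each deviation of 0 satisfies "at least 2 × 0".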

mealprep.mealprep.find_fruits_veg(df, type_of_out='categ')

This function drops rows that contain NAs and returns the indices of the columns whose values are all numeric or all categorical, depending on type_of_out.

Parameters:
  • df (pandas.core.frame.DataFrame) – Dataframe to be processed
  • type_of_out (string) – Type of columns whose indices should be returned
  • list_of_index (list) – List of index values
Returns:

  • list_of_categ (list) – List of indices of the categorical columns
  • list_of_num (list) – List of indices of the numerical columns

Example

>>> import pandas as pd
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b']})
>>> find_fruits_veg(df, type_of_out = 'categ')
[1]
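
The behaviour described above (drop incomplete rows, then report column positions by type) can be sketched as follows. This is a dependency-free illustration, not the package's code: column_positions is a hypothetical name, a dict of lists stands in for the dataframe, None marks a missing value, and the option value "num" is an assumption because only 'categ' appears in the documentation.

```python
from numbers import Number


def column_positions(columns, type_of_out="categ"):
    """Return positions of categorical or numeric columns.

    `columns` maps a column name to a list of values; None marks a
    missing value. Hypothetical stand-in for a dataframe.
    """
    names = list(columns)
    n_rows = len(columns[names[0]]) if names else 0
    # Keep only the rows that have no missing value in any column.
    complete = [r for r in range(n_rows)
                if all(columns[name][r] is not None for name in names)]
    numeric, categorical = [], []
    for position, name in enumerate(names):
        values = [columns[name][r] for r in complete]
        is_numeric = all(isinstance(v, Number) and not isinstance(v, bool)
                         for v in values)
        (numeric if is_numeric else categorical).append(position)
    # Assumption: any value other than "categ" requests numeric columns.
    return categorical if type_of_out == "categ" else numeric


table = {"col1": [1, 2, None], "col2": ["a", "b", "c"]}
print(column_positions(table, type_of_out="categ"))  # [1]
print(column_positions(table, type_of_out="num"))    # [0]
```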

mealprep.mealprep.find_missing_ingredients(data)

For each column with missing values, this function creates a reference list of the affected row indices, counts the missing values, and calculates their proportion.

Parameters: data (pandas.core.frame.DataFrame) – A dataframe to be processed
Returns: A dataframe summarizing the indices, count, and proportion of missing values in each column
Return type: pandas.core.frame.DataFrame

Example

>>> import pandas as pd
>>> df = pd.DataFrame({"letters": ["a", "b", "c"],
...                    "numbers": [1, 2, 3]})
>>> find_missing_ingredients(df)
'There are no missing values'
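
For a case that does contain missing values, the summary described above (indices, count, and proportion per column) can be sketched like this. The sketch is illustrative only: summarize_missing is a hypothetical name, a dict of lists stands in for the dataframe, None marks a missing value, and the shape of the returned summary is an assumption rather than the package's actual output format.

```python
def summarize_missing(columns):
    """Summarize missing values per column.

    `columns` maps a column name to a list of values; None marks a
    missing value. Columns without missing values are left out.
    """
    summary = {}
    for name, values in columns.items():
        indices = [i for i, v in enumerate(values) if v is None]
        if indices:
            summary[name] = {
                "indices": indices,
                "count": len(indices),
                "proportion": len(indices) / len(values),
            }
    return summary


data = {"letters": ["a", None, "c", None], "numbers": [1, 2, 3, 4]}
print(summarize_missing(data))
# {'letters': {'indices': [1, 3], 'count': 2, 'proportion': 0.5}}
```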

mealprep.mealprep.make_recipe(X, y, recipe, splits_to_return='train_test', random_seed=None, train_valid_prop=0.8)

The make_recipe() function is used to quickly apply common data preprocessing techniques.

Parameters:
  • X (pandas.DataFrame) – A dataframe containing training, validation, and testing features.
  • y (pandas.DataFrame) – A dataframe containing training, validation, and testing response.
  • recipe (str) – A string specifying which recipe to apply to the data. The only recipe currently available is “ohe_and_standard_scaler”. More recipes are under development.
  • splits_to_return (str, optional) – “train_test” to return train and test splits, “train_test_valid” to return train, test, and validation data, “train” to return all data without splits. By default “train_test”.
  • random_seed (int, optional) – The random seed to set for splitting data to create reproducible results. By default None.
  • train_valid_prop (float, optional) – The proportion to split the data by. Should be between 0 and 1. By default 0.8.
Returns:

A tuple of dataframes: (X_train, X_valid, X_test, y_train, y_valid, y_test)

Return type:

Tuple of pandas.DataFrame

Example

>>> import pandas as pd
>>> from vega_datasets import data
>>> from mealprep.mealprep import make_recipe
>>> df = pd.read_json(data.cars.url).drop(columns=["Year"])
>>> X = df.drop(columns=["Name"])
>>> y = df[["Name"]]
>>> X_tr, X_va, X_te, y_tr, y_va, y_te = make_recipe(
...        X=X, y=y, recipe="ohe_and_standard_scaler",
...        splits_to_return="train_test")
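
To show how a proportion and a random seed can produce a reproducible split, here is a minimal dependency-free sketch. It is not how make_recipe is implemented: split_rows and its train_prop argument are hypothetical, and the documentation does not state exactly how train_valid_prop is applied, so treating it as the share of rows kept for training is an assumption.

```python
import random


def split_rows(n_rows, train_prop=0.8, random_seed=None):
    """Shuffle row positions and cut them into two groups.

    Hypothetical helper: the first `train_prop` share of the shuffled
    positions goes to the first group, the rest to the second.
    """
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(random_seed).shuffle(indices)
    cut = int(n_rows * train_prop)
    return indices[:cut], indices[cut:]


train_idx, test_idx = split_rows(10, train_prop=0.8, random_seed=123)
print(len(train_idx), len(test_idx))  # 8 2
```

Calling split_rows again with the same random_seed returns the same two groups, which is the property the random_seed parameter is documented to provide.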

Module contents