mealprep package
Submodules
mealprep.mealprep module
mealprep.mealprep.find_bad_apples(df)
This function uses a univariate approach to outlier detection. For each column that contains outliers (values that are 2 or more standard deviations from the column mean), it records the row indices of the outliers and the total number of outliers in that column.
Note: This function works best for small datasets with unimodal variable distributions.
Note: If your dataframe has duplicate column names, only the last of the duplicated columns will be checked.
Parameters: df (pandas.core.frame.DataFrame) – A dataframe containing numeric data
Returns: bad_apples – A dataframe with 3 columns: Variable (column name), Indices (list of row indices with outliers), and Total Outliers (number of outliers in the column)
Return type: pandas.DataFrame
Examples
>>> import pandas as pd
>>> from mealprep.mealprep import find_bad_apples
>>> df = pd.DataFrame({'A' : ['test', 1, 1, 1, 1]})
>>> find_bad_apples(df)
AssertionError: Every column in your dataframe must be numeric.
>>> df = pd.DataFrame({'A' : [1, 1, 1, 1, 1],
...                    'B' : [10000, 1, 1, 1, 1]})
>>> find_bad_apples(df)
AssertionError: Sorry, you don't have enough data. The dataframe needs to have at least 30 rows.
>>> df = pd.DataFrame({'A' : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
...                           1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
...                    'B' : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,-100,
...                           1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,100],
...                    'C' : [1,1,1,1,1,19,1,1,1,1,1,1,1,1,19,1,1,1,1,
...                           1,1,1,1,1,1,1,19,1,1,1,1,1,1,1,1]})
>>> find_bad_apples(df)
Variable      Indices  Total Outliers
       B     [17, 34]               2
       C  [5, 14, 26]               3
>>> df = pd.DataFrame({'A' : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
...                           1,1,1,1,1,1,1,1,1,1,1,1,1],
...                    'B' : [1.000001, 1.000001, 1.000001, 1.000001,
...                           1.000001, 1.000001, 1.000001, 1.000001,
...                           1.000001, 1.000001, 1.000001, 1.000001,
...                           1.000001, 1.000001, 1.000001,
...                           1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]})
>>> find_bad_apples(df)
            Variable Indices  Total Outliers
No outliers detected      []               0
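The rule described above (flag values 2 or more standard deviations from the mean) can be sketched on a single column without any dependencies. This is an illustration only, not the package's implementation: the helper name flag_outliers is hypothetical, and the use of the sample standard deviation and the handling of constant columns are assumptions.

```python
from statistics import mean, stdev


def flag_outliers(values, n_std=2.0):
    """Return the positions of values lying n_std or more standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation; an assumption, not confirmed by the docs
    if sigma == 0:
        return []  # a constant column has no spread, so nothing is flagged
    return [i for i, v in enumerate(values) if abs(v - mu) >= n_std * sigma]


# Column 'C' from the example above: 35 values, with 19 at positions 5, 14 and 26.
column_c = [1] * 35
for position in (5, 14, 26):
    column_c[position] = 19

print(flag_outliers(column_c))  # [5, 14, 26]
print(flag_outliers([1] * 35))  # []
```

Because a single extreme value inflates the standard deviation itself, a rule of this kind tends to behave best on roughly unimodal columns, which matches the note above.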
mealprep.mealprep.find_fruits_veg(df, type_of_out='categ')
This function drops rows containing NAs, then finds the indices of the columns that hold only numeric values or only categorical values, depending on type_of_out.
Parameters: - df (pandas.core.frame.DataFrame) – The dataframe to process
- type_of_out (string) – The type of columns whose indices should be returned
- list_of_index (list) – List of index values
Returns: - list_of_categ (list) – List of the indices of categorical columns
- list_of_num (list) – List of the indices of numeric columns
Example
>>> import pandas as pd
>>> from mealprep.mealprep import find_fruits_veg
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b']})
>>> find_fruits_veg(df, type_of_out = 'categ')
[1]
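One way to obtain column positions by type is pandas' select_dtypes. The sketch below is an assumption about the general approach, not the package's code: the helper name column_positions and the 'num' option are hypothetical, and every non-numeric column is treated as categorical.

```python
import pandas as pd


def column_positions(df, kind="categ"):
    """Return the positions of numeric ('num') or non-numeric ('categ') columns."""
    clean = df.dropna()  # drop rows with missing values first, as the description states
    if kind == "num":
        selected = clean.select_dtypes(include="number").columns
    else:
        # Assumption: anything that is not numeric counts as categorical.
        selected = clean.select_dtypes(exclude="number").columns
    return [clean.columns.get_loc(name) for name in selected]


df = pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"]})
print(column_positions(df, "categ"))  # [1]
print(column_positions(df, "num"))    # [0]
```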
mealprep.mealprep.find_missing_ingredients(data)
For each column with missing values, this function records the row indices of the missing values, counts them, and calculates the proportion of values that are missing.
Parameters: data (pandas.core.frame.DataFrame) – The dataframe to process
Returns: A dataframe summarizing the indices, count, and proportion of missing values in each column
Return type: pandas.core.frame.DataFrame
Example
>>> import pandas as pd
>>> from mealprep.mealprep import find_missing_ingredients
>>> df = pd.DataFrame({"letters": ["a", "b", "c"], "numbers": [1, 2, 3]})
>>> find_missing_ingredients(df)
'There are no missing values'
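The kind of summary described above can be built with pandas' isna. This is a sketch under assumptions rather than the package's implementation: the helper name summarize_missing and the output column names (Column, Indices, Count, Proportion) are hypothetical.

```python
import pandas as pd


def summarize_missing(data):
    """Summarize missing values per column: row indices, count and proportion."""
    rows = []
    for name in data.columns:
        mask = data[name].isna()
        if mask.any():
            rows.append({
                "Column": name,
                "Indices": mask[mask].index.tolist(),  # row labels where the value is missing
                "Count": int(mask.sum()),
                "Proportion": float(mask.mean()),      # the mean of a boolean mask is the share of True
            })
    # Column names are illustrative, not the package's actual output schema.
    return pd.DataFrame(rows, columns=["Column", "Indices", "Count", "Proportion"])


df = pd.DataFrame({
    "letters": ["a", None, "c", None],
    "numbers": [1.0, 2.0, None, 4.0],
})
summary = summarize_missing(df)
print(summary)
```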
mealprep.mealprep.make_recipe(X, y, recipe, splits_to_return='train_test', random_seed=None, train_valid_prop=0.8)
The make_recipe() function is used to quickly apply common data preprocessing techniques.
Parameters: - X (pandas.DataFrame) – A dataframe containing training, validation, and testing features.
- y (pandas.DataFrame) – A dataframe containing training, validation, and testing response.
- recipe (str) – A string specifying which recipe to apply to the data. The only recipe currently available is “ohe_and_standard_scaler”. More recipes are under development.
- splits_to_return (str, optional) – “train_test” to return train and test splits, “train_test_valid” to return train, test, and validation data, “train” to return all data without splits. By default “train_test”.
- random_seed (int, optional) – The random seed to set for splitting data to create reproducible results. By default None.
- train_valid_prop (float, optional) – The proportion used to split the data. Should be between 0 and 1. By default 0.8.
Returns: A tuple of dataframes: (X_train, X_valid, X_test, y_train, y_valid, y_test)
Return type: Tuple of pandas.DataFrame
Example
>>> import pandas as pd
>>> from vega_datasets import data
>>> from mealprep.mealprep import make_recipe
>>> df = pd.read_json(data.cars.url).drop(columns=["Year"])
>>> X = df.drop(columns=["Name"])
>>> y = df[["Name"]]
>>> X_tr, X_va, X_te, y_tr, y_va, y_te = make_recipe(
...     X=X, y=y, recipe="ohe_and_standard_scaler",
...     splits_to_return="train_test")
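The recipe name "ohe_and_standard_scaler" suggests one-hot encoding of categorical features combined with standard scaling of numeric ones. The sketch below shows that general technique in plain pandas. It is not make_recipe's implementation: the helper name ohe_and_scale is hypothetical, and fitting the statistics on the training split only, using the population standard deviation, and filling unseen categories with zeros are all assumptions.

```python
import pandas as pd


def ohe_and_scale(train, test):
    """One-hot encode non-numeric columns and standardize numeric ones, fitted on train only."""
    num_cols = train.select_dtypes(include="number").columns
    cat_cols = train.columns.difference(num_cols)

    mu = train[num_cols].mean()
    # Population standard deviation (ddof=0); zero spread is replaced by 1 to avoid dividing by zero.
    sigma = train[num_cols].std(ddof=0).replace(0, 1)

    def transform(frame):
        scaled = (frame[num_cols] - mu) / sigma
        dummies = pd.get_dummies(frame[cat_cols], dtype=float)
        return pd.concat([scaled, dummies], axis=1)

    train_out = transform(train)
    # Align test columns to train: unseen categories are dropped, absent ones filled with 0.
    test_out = transform(test).reindex(columns=train_out.columns, fill_value=0.0)
    return train_out, test_out


train = pd.DataFrame({"hp": [100.0, 200.0, 300.0], "origin": ["USA", "Japan", "USA"]})
test = pd.DataFrame({"hp": [200.0], "origin": ["Europe"]})
train_out, test_out = ohe_and_scale(train, test)
print(train_out)
print(test_out)
```

Computing the mean and standard deviation from the training split alone keeps information from the validation and test splits out of the preprocessing step.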