Data Cleaning Module
Overview
The cleaning module prepares raw Yelp business data for analysis by standardizing columns, handling missing values, and creating derived variables used throughout the project.
Raw Yelp data contains nested fields, inconsistent column names, and categorical values that are not immediately ready for analysis. This module ensures that the cleaned dataset is consistent and that the cleaning steps are reproducible.
Functions
Main Function: clean_data()
from yelp_final_project.cleaning import clean_data
df_clean = clean_data()
This function:
- Loads the raw Yelp dataset
- Applies all cleaning steps internally
- Returns a cleaned pandas DataFrame ready for analysis
Supporting/Helper Functions:
- load_data(): loads the raw CSV file
- drop_columns(): removes unnecessary columns
- convert_price(): converts price strings to numeric levels
- clean_delivery(): standardizes delivery/pickup availability
- sort_cities(): sorts and cleans city names
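To make two of the helpers above concrete, here is a minimal sketch of what convert_price() and clean_delivery() might look like. It is not the module's actual code. The input column names (price, transactions), the "$" to 1 through "$$$$" to 4 mapping, and the service_type labels are all assumptions made for illustration.

```python
import pandas as pd

# Assumed mapping from Yelp-style price strings to numeric levels.
PRICE_LEVELS = {"$": 1, "$$": 2, "$$$": 3, "$$$$": 4}


def convert_price(df: pd.DataFrame) -> pd.DataFrame:
    """Add a numeric price_level column from an assumed 'price' string column."""
    out = df.copy()
    # Missing or unmapped prices become <NA> in a nullable integer column.
    out["price_level"] = out["price"].map(PRICE_LEVELS).astype("Int64")
    return out


def clean_delivery(df: pd.DataFrame) -> pd.DataFrame:
    """Add a service_type column from an assumed 'transactions' column."""
    out = df.copy()
    # Treat missing values as "no services listed" and compare case-insensitively.
    text = out["transactions"].fillna("").astype(str).str.lower()
    has_delivery = text.str.contains("delivery", regex=False)
    has_pickup = text.str.contains("pickup", regex=False)

    # Hypothetical labels; the real module may use different ones.
    out["service_type"] = "neither"
    out.loc[has_delivery & ~has_pickup, "service_type"] = "delivery"
    out.loc[~has_delivery & has_pickup, "service_type"] = "pickup"
    out.loc[has_delivery & has_pickup, "service_type"] = "both"
    return out
```

Both functions copy the input before modifying it, so the raw DataFrame stays unchanged and each step can be rerun independently.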
Parameters:
df (pandas.DataFrame): Raw Yelp business data
Returns:
pandas.DataFrame with engineered features
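The DataFrame-in, DataFrame-out contract described above means the cleaning steps can be chained. The sketch below shows one way to do that with DataFrame.pipe. The function bodies, the dropped column names (phone, image_url), the title-casing of city names, and the wrapper name clean_frame() are hypothetical and are not taken from the module.

```python
import pandas as pd


def drop_columns(df: pd.DataFrame, unwanted=("phone", "image_url")) -> pd.DataFrame:
    """Drop columns not needed for analysis (the column names are assumptions)."""
    # errors="ignore" keeps the step safe if a column is already absent.
    return df.drop(columns=list(unwanted), errors="ignore")


def sort_cities(df: pd.DataFrame) -> pd.DataFrame:
    """Trim and title-case city names, then sort rows by city."""
    out = df.copy()
    out["city"] = out["city"].str.strip().str.title()
    return out.sort_values("city", kind="stable").reset_index(drop=True)


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical wrapper: apply each cleaning step in a fixed order."""
    return df.pipe(drop_columns).pipe(sort_cities)
```

Keeping each step as a function that takes and returns a DataFrame makes the order of operations explicit and lets individual steps be tested in isolation.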
Output
The cleaned dataset includes standardized columns such as:
- name
- city
- rating
- review_count
- price_level
- service_type
This dataset is used by both the analysis module and the Streamlit app.
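Because two downstream consumers depend on these columns, a quick schema check can catch regressions early. The following is a small sketch, not part of the module. The helper name missing_columns() is hypothetical, and the column list is copied from the Output section above.

```python
import pandas as pd

# Columns listed in the Output section of this document.
EXPECTED_COLUMNS = [
    "name",
    "city",
    "rating",
    "review_count",
    "price_level",
    "service_type",
]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are absent from df, in listed order."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]
```

A typical use would be to call missing_columns(clean_data()) at the start of the analysis module or the Streamlit app and raise an error if the returned list is not empty.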