This project automates the extraction, transformation, and loading (ETL) of electricity data from raw .csv and .json files into a MySQL database. It's based on a foundational DataCamp project and expanded as a full end-to-end data engineering pipeline.
.
βββ datasets/
β βββ electricity_sales.csv
β βββ electricity_capability_nested.json
βββ main.py
βββ nootbook.ipynb
βββ .env
βββ requirements.txt
βββ README.md
-
electricity_sales.csv
- Monthly electricity sales data across different states and sectors.
- Columns:
period,stateid,stateDescription,sectorid,sectorName,price,price-units
-
electricity_capability_nested.json
- Nested JSON representing generation capacity per energy source per state.
- Flattened to:
period,stateId,stateDescription,energySourceid,energySourceDescription,energySourceCapability,energySourceCapabilityUnits
-
Extract:
- Reads CSV and JSON files from the
datasets/directory. - JSON is flattened for relational DB compatibility.
- Reads CSV and JSON files from the
-
Transform:
- For
electricity_sales:- Filters only for
residentialandtransportationsectors. - Parses
periodintoyearandmonth. - Drops unused columns and missing values.
- Filters only for
- For
electricity_capability:- Normalizes nested JSON and converts
capabilitytofloat.
- Normalizes nested JSON and converts
- For
-
Load:
- Writes both cleaned DataFrames into a MySQL database using SQLAlchemy.
- Clone the repository
git clone https://github.com/elliedel/electricty_etl.git
cd electricity-etl-project- Create and configure your
.envfile
USER=your_mysql_username
PASSWORD=your_mysql_password
- Install dependencies
pip install -r requirements.txt- Run the ETL pipeline
python etl_script.pyMake sure MySQL is running and the database electricity_db exists.
create schema if not exists electricity_db;- Python 3.10+
- Pandas
- SQLAlchemy
- MySQL
- dotenv
After running the ETL, you will have two tables in your MySQL database:
electricity_saleselectricity_capability
You can use these cleaned tables for further analytics or dashboarding.
- Based on a DataCamp project: βPowering Data for the Department of Energy - Building an ETL Pipeline"