This repository contains instructions and requirements for lab activities in the Data Processing and Analytics course.
Pandas is a popular Python library for data processing and analysis. Depending on the Python installation on your computer, you may run one of the following options (using the command-line interface):
Install pandas via pip:
pip install pandas
Use pip3 if you are using Python 3:
pip3 install pandas
On Windows, you may need to use the Python launcher (py):
py -m pip install pandas
Matplotlib and seaborn are two of the most popular modules for data visualization in Python. Install these packages using pip:
pip install matplotlib
pip install seaborn
Adapt the pip command to the Python installation on your computer, as mentioned earlier.
Scikit-learn is a module that provides tools for predictive modeling. Install scikit-learn using pip:
pip install scikit-learn
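To confirm that the installations above succeeded, a short check such as the following may help. It is only a sketch: the helper name `check_packages` is hypothetical and not part of this repository, and it relies solely on the standard library, so it runs even if a package is missing.

```python
from importlib.util import find_spec


def check_packages(names):
    """Return a dict mapping each import name to True if Python can locate it."""
    return {name: find_spec(name) is not None for name in names}


# Note: scikit-learn is imported under the name "sklearn", not "scikit-learn".
status = check_packages(["pandas", "matplotlib", "seaborn", "sklearn"])
for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Any package reported as missing can be installed with the pip command that matches your Python installation.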
We will create our data mining models using the Data Mining Project Template. The template comprises six sections:
This section briefly explains the project from a business perspective, casting business objectives into a data mining problem definition. In your course project, you will complement this section with a slideshow presentation.
The purpose of this section is to improve the organization and efficiency of your Python code.
Explore the data through visualizations: check the ranges and distributions of numeric values using histograms, and examine correlations among the attribute variables. In supervised learning, also examine correlations between the target variables and the attributes.
Perform data cleaning and transformation tasks as necessary.
Train different models and tune the parameters of the most promising ones.
Measure the performance of your final model on the test set to estimate the generalization error.
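The exploration, preparation, modeling, and evaluation steps described above can be sketched in a few lines. This is a minimal illustration, not the template itself. It assumes scikit-learn's built-in iris dataset as a stand-in for your project data, a logistic regression model, and a small, arbitrary grid of values for the regularization parameter `C`. Plots are omitted so the example runs without a display.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Setup: load a small built-in dataset as a DataFrame
# (assumption: this stands in for your own project data).
iris = load_iris(as_frame=True)
df = iris.frame  # feature columns plus a "target" column

# Data exploration: ranges and distributions, and correlation with the target.
summary = df.describe()
target_correlations = df.corr()["target"].drop("target")
print(summary)
print(target_correlations)

# Hold out a test set that is only used for the final evaluation.
X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Data preparation and modeling: scale the features, then fit a classifier.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)

# Tune one parameter with 5-fold cross-validation on the training set.
search = GridSearchCV(pipeline, {"model__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Evaluation: accuracy on the held-out test set estimates generalization.
test_accuracy = search.score(X_test, y_test)
print(f"Best parameters: {search.best_params_}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

Wrapping the scaler and the model in a single pipeline means the scaling statistics are learned only from the training folds during cross-validation, which keeps the test-set estimate honest.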
The following notebook contains a template for building a data mining project in Python:
Click here to download this repository and access the template on your local computer.