Python Libraries for the Different Stages of the Predictive Modeling Process
The basic steps for building a model are the same across all modeling methods. Generating high-quality predictive models is time-consuming because finding the optimum model parameters requires tuning, and the models often need to be reused in the future. It is therefore important to follow standard methodologies and industry best practices.
Some common stages in the predictive modeling process are:
1. Data Extraction
2. Data Cleaning
3. Data Visualization
4. Feature Engineering
5. Model Building
6. Model Evaluation
7. Model Deployment
The data science landscape is changing rapidly, and the number of tools for extracting value from data has grown with it. Machine learning is one of the significant elements used to maximize value from data. With Python as the data science tool, exploring the basics of machine learning becomes easy and effective. It has become one of the most preferred languages for machine learning because it makes it easy to 'do math': for most mathematical functions, there is a Python package that meets the requirement.
In this article, we will see the most commonly used Python libraries for the different stages of modeling.
Data Extraction:
Extracting or collecting data is a crucial skill for a data scientist. Here are some useful Python libraries that help with data extraction and collection.
Beautiful Soup: Beautiful Soup is a popular library for web scraping. It is an HTML and XML parser that builds a parse tree for each parsed page, from which data can be extracted. The process of extracting data from web pages is called web scraping.
Scrapy: Scrapy is another powerful library for web scraping. It is a crawling framework that can retrieve structured data from the web, and developers also use it for gathering data from APIs.
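As a minimal sketch of how a parse tree is used to pull data out of a page, the example below parses a small inline HTML string rather than a live website. The HTML snippet, the "article" class name and the URLs are made up for illustration; only the Beautiful Soup calls themselves (BeautifulSoup, find, find_all, get_text) come from the library.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded web page.
html = """
<html><body>
<h1>Latest articles</h1>
<ul>
  <li><a class="article" href="/posts/pandas-intro">Intro to Pandas</a></li>
  <li><a class="article" href="/posts/numpy-basics">NumPy Basics</a></li>
  <li><a class="nav" href="/about">About</a></li>
</ul>
</body></html>
"""

# Build the parse tree with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Read the text of the first <h1> element.
heading = soup.find("h1").get_text(strip=True)

# Collect title and link for every <a> tag with the (assumed) class "article".
articles = [
    {"title": a.get_text(strip=True), "url": a["href"]}
    for a in soup.find_all("a", class_="article")
]

print(heading)
for article in articles:
    print(article["title"], "->", article["url"])
```

In a real scraper the HTML would usually come from an HTTP client, and the tag and class names would depend on the target page.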
Data Cleaning:
Pandas: Pandas is an open-source Python data analysis library (the name derives from "panel data") that helps you work with "labeled" and "relational" data. It is based on two main data structures:
a. Series: one-dimensional, like a list
b. DataFrame: two-dimensional, like a table
Pandas can read data from files such as .csv and .tsv and convert it into a Python object called a DataFrame, with rows and columns. Here is a list of things that can be achieved with the help of the Pandas library:
- Indexing, manipulating, renaming, sorting and merging data frames.
- Updating, adding and deleting columns in a data frame.
- Handling missing values.
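The operations listed above can be sketched on a tiny data set. The CSV content, column names and the choice of filling missing ages with the mean are assumptions made for illustration; a real cleaning strategy depends on the data.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path.
csv_text = """name,age,city
Asha,34,Chennai
Ben,,London
Chen,29,
"""

# Read the CSV into a DataFrame (rows and columns).
df = pd.read_csv(io.StringIO(csv_text))

# Rename a column.
df = df.rename(columns={"name": "customer"})

# Handle missing values: numeric gaps get the column mean,
# text gaps get a placeholder label.
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna("unknown")

# Add a derived column, then sort by age and rebuild the index.
df["is_adult"] = df["age"] >= 18
df = df.sort_values("age").reset_index(drop=True)

print(df)
```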
NumPy: The general purpose of the NumPy library is array processing. It supports large multi-dimensional arrays and matrices. Here is a list of things that can be achieved with the help of the NumPy library:
- Array operations such as adding, multiplying, slicing, flattening, reshaping and indexing arrays.
- Working with datetime values or linear algebra.
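A short sketch of these array operations follows. The array values and dates are arbitrary and chosen only to keep the results easy to check by hand.

```python
import numpy as np

# Reshape a flat range into a 2x3 array: [[0, 1, 2], [3, 4, 5]].
a = np.arange(6).reshape(2, 3)
b = np.ones((2, 3))

total = a + b          # element-wise addition
scaled = a * 2         # element-wise multiplication
first_col = a[:, 0]    # slicing: first column of every row
flat = a.flatten()     # back to one dimension

# Linear algebra: matrix product of a (2x3) with its transpose (3x2).
product = a @ a.T

# Datetime arithmetic: difference between two dates in days.
days = np.datetime64("2024-03-01") - np.datetime64("2024-02-01")

print(total, scaled, first_col, flat, product, days, sep="\n")
```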
SciPy: SciPy works well for scientific programming projects. It contains modules with efficient mathematical routines for linear algebra, interpolation, numerical integration, ordinary differential equations, statistics and optimization. SciPy is built upon NumPy and its arrays.
PyOD: PyOD is a comprehensive and scalable toolkit for detecting outliers in data. An outlier is a data point that is distant from the other, similar data points in a given data set.
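To give a feel for these routines, here is a small sketch that touches three SciPy modules: integration, optimization and linear algebra. The functions and the equation system are toy examples chosen for illustration.

```python
import numpy as np
from scipy import integrate, linalg, optimize

# Numerical integration: area under sin(x) between 0 and pi (exactly 2).
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# Optimization: minimum of a simple parabola, located at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2 + 1)

# Linear algebra: solve 3x + y = 9 and x + 2y = 8.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
solution = linalg.solve(A, b)

print(area, result.x, solution)
```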
Data Visualization:
Matplotlib: Matplotlib is a standard data science library for generating two-dimensional diagrams and graphs. It provides an object-oriented API for embedding plots into applications. It also supports labels, grids, legends and many other formatting elements. Listed below are some of the types of graphs that Matplotlib supports:
- Line plot
- Scatter plot
- Area plot
- Bar charts
- Histograms
- Pie charts
- Stem plot
- Contour plot
- Quiver plot
- Spectrogram
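The sketch below uses the object-oriented API to draw three of the plot types listed above (line plot, scatter plot and histogram) with labels, a grid and a legend. The data is synthetic, and the non-interactive "Agg" backend is selected only so the example also runs on machines without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # render off-screen; not needed in a notebook or desktop session
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
rng = np.random.default_rng(0)

# One figure with two axes side by side.
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(9, 3.5))

# Line plot and scatter plot on the same axes.
ax_line.plot(x, np.sin(x), label="sin(x)")
ax_line.scatter(x[::10], np.cos(x[::10]), color="tab:orange", label="cos(x) samples")
ax_line.set_xlabel("x")
ax_line.set_ylabel("value")
ax_line.grid(True)
ax_line.legend()

# Histogram of synthetic, normally distributed values.
ax_hist.hist(rng.normal(size=500), bins=20)
ax_hist.set_title("Histogram of random values")

fig.tight_layout()

# Write the figure as PNG into an in-memory buffer;
# passing a file name to savefig would write it to disk instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```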
Seaborn: Seaborn is built on top of Matplotlib and offers advanced features with a simpler, more concise syntax. Here are some scenarios in which Seaborn is useful:
- Determining correlation
- Univariate and bivariate distributions of variables
- Plotting linear regression models between variables
- Joint plots, time series and violin plots.
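Two of these scenarios, correlation and a linear regression plot, can be sketched as follows. The data frame is synthetic and its column names (hours_studied, exam_score, shoe_size) are invented for illustration; the "Agg" backend is again chosen only so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: exam_score depends on hours_studied, shoe_size is unrelated.
rng = np.random.default_rng(42)
hours = rng.uniform(0, 10, size=100)
df = pd.DataFrame({
    "hours_studied": hours,
    "exam_score": 50 + 4 * hours + rng.normal(scale=5, size=100),
    "shoe_size": rng.normal(loc=40, scale=2, size=100),
})

# Pairwise correlation between all columns.
corr = df.corr()

fig, (ax_heat, ax_reg) = plt.subplots(1, 2, figsize=(10, 4))

# Heatmap of the correlation matrix, annotated with the coefficients.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax_heat)

# Scatter plot with a fitted linear regression line.
sns.regplot(data=df, x="hours_studied", y="exam_score", ax=ax_reg)

fig.tight_layout()
```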
Bokeh: Bokeh can be used to generate interactive plots, dashboards and data applications that target modern web browsers for presentation. It is independent of Matplotlib and renders its output in the browser through its own JavaScript library, BokehJS.
Model Building, Model Evaluation and Feature Engineering:
Scikit-learn: Scikit-learn is the workhorse for building models. It is built on NumPy, SciPy and Matplotlib. Data scientists use it for standard machine learning tasks such as classification, regression, clustering, model selection and dimensionality reduction.
TensorFlow: TensorFlow is a popular deep learning library, developed by Google, that helps to build and train different models. It helps with tasks such as object identification and speech recognition, and with building artificial neural networks that need to handle multiple datasets. Typical applications include:
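The three stages in this section's heading can be sketched in a few lines: a scaling step stands in for feature engineering, logistic regression for model building, and accuracy on a held-out split for model evaluation. The Iris data set, the choice of model and the split parameters are assumptions made for illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small built-in data set (150 flowers, 4 features, 3 classes).
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows for evaluation, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Feature engineering (scaling) and model building chained in one pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

# Model evaluation on data the model has not seen during training.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are learned from the training rows only, which avoids leaking information from the test split.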
- Voice recognition
- Sentiment Analysis
- Text based Apps
- Face/Image recognition
- Video detection
Statsmodels: Statsmodels provides easy computation of descriptive statistics, as well as estimation and inference for statistical models.
Model Deployment:
Flask: After putting a lot of effort into building a model with good accuracy, the next step is to deploy it. Flask is a web framework commonly used for deploying data science models. It is built on two components: Werkzeug, a WSGI utility library, and Jinja, a template engine.
Django: Django is a high-level web framework that provides a powerful and flexible toolkit for building web-based APIs for machine learning model deployment.
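A deployment of this kind can be sketched as a single prediction endpoint. The /predict route, the JSON payload shape and the predict_score function are hypothetical; predict_score is a stand-in for a real trained model, which would typically be loaded from disk when the application starts.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_score(features):
    """Hypothetical stand-in for a trained model: returns the mean of the inputs."""
    return sum(features) / len(features)


@app.route("/predict", methods=["POST"])
def predict():
    # Parse the JSON body; fall back to an empty dict if it is missing or invalid.
    payload = request.get_json(silent=True) or {}
    features = payload.get("features", [])
    if not features:
        return jsonify(error="features must be a non-empty list"), 400
    return jsonify(prediction=predict_score(features))


# Exercise the endpoint with Flask's built-in test client (no server needed).
client = app.test_client()
response = client.post("/predict", json={"features": [1.0, 2.0, 3.0]})
result = response.get_json()
print(response.status_code, result)
```

The test client is used here so the sketch runs without opening a network port. To serve real requests during development, the application would be started with the flask run command or app.run(), and behind a production WSGI server for live traffic.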
Happy Learning!!
Thanks, Pavithra Jayashankar