Understanding Boosted Trees Models

In the previous post, we learned about tree-based learning methods: the basics of tree-based models and the use of bagging to reduce variance. We also looked at one of the most famous learning algorithms based on the idea of bagging: random forests. In this post, we will look into the details of yet another type of tree-based learning algorithm: boosted trees.

Reading Time: 30 minutes

A Practical Guide to Tree Based Learning Algorithms

Tree-based learning algorithms are quite common in data science competitions. These algorithms give predictive models high accuracy, stability, and ease of interpretation. Unlike linear models, they map non-linear relationships quite well. Common examples of tree-based models are decision trees, random forests, and boosted trees.

Reading Time: 30 minutes

Understanding Support Vector Machine via Examples

In the previous post on Support Vector Machines (SVM), we looked at the mathematical details of the algorithm. In this post, I will discuss practical implementations of SVM for classification as well as regression. I will use the iris dataset as an example for the classification problem, and a randomly generated dataset as an example for the regression problem.

Reading Time: 18 minutes

Switching to Hugo from Nikola

I have been using Nikola to build this blog. It's a great static site build system based on Python. However, it has a crazy number of dependencies (to get a reasonable-looking site). It uses reStructuredText (rst) as the primary language for content creation. Personally, I use Markdown for almost everything else: taking notes, keeping a diary, code documentation, etc. Furthermore, given that Nikola tries to support almost everything in a static site builder, lately it has been becoming more and more bloated.

Reading Time: 10 minutes

Descriptive Statistics

One of the first tasks involved in any data science project is to understand the data. This can be extremely beneficial for several reasons: catching mistakes in the data, seeing patterns in the data, finding violations of statistical assumptions, generating hypotheses, etc. We can think of this task as an exercise in summarizing the data. To summarize the main characteristics of the data, two methods are often used: numerical and graphical.

Reading Time: 15 minutes

My Arch Linux Setup with Plasma 5

Arch Linux is a general-purpose GNU/Linux distribution that provides the most up-to-date software by following the rolling-release model. Arch Linux allows you to use updated, cutting-edge software and packages as soon as the developers release them. KDE Plasma 5 is the current generation of the desktop environment created by KDE primarily for Linux systems. In this post, we will do a complete installation of Arch Linux with Plasma 5 as the desktop environment.

Reading Time: 21 minutes