Reddit Survey: Introduction to Pandas

The data set used here comes from a project in the UD651 course offered on Udacity by Facebook.

The project data comes from a survey of reddit.com users. You can load it with the following command. We will first look at the different attributes of this data using the pandas describe() method.

In [45]:
import pandas as pd
import numpy as np

#Read csv file
reddit = pd.read_csv("https://s3.amazonaws.com/udacity-hosted-downloads/ud651/reddit.csv").astype(object)
#summarize data
reddit.describe(include='all', percentiles=[]).T.replace(np.nan,' ', regex=True)

Out[45]:
                     count   unique  top                    freq
id                 32754.0  32754.0  32756                   1.0
gender             32553.0      2.0  0                   26418.0
age.range          32666.0      7.0  18-24               15802.0
marital.status     32749.0      6.0  Single              10428.0
employment.status  32603.0      6.0  Employed full time  14814.0
military.service   32749.0      2.0  No                  30526.0
children           32535.0      2.0  No                  27488.0
education          32610.0      7.0  Bachelor's degree   11046.0
country            32577.0    439.0  United States       20967.0
state              20846.0     52.0  California           3401.0
income.range       31139.0      8.0  Under $20,000        7892.0
fav.reddit         28393.0   1833.0  askreddit            2123.0
dog.cat            32749.0      3.0  I like dogs.        17151.0
cheese             32749.0     11.0  Other                6563.0

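The describe() method gives us an overview of all the data available to us. Because every column was cast to object with astype(object), pandas treats each one as non-numeric (in effect, categorical) and reports count, unique, top and freq rather than numeric statistics such as the mean.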
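To see what the cast to object changes, here is a minimal sketch on a toy frame. The column names mirror the survey file, but the values are invented for illustration and the variable names (toy, numeric_summary, object_summary) are hypothetical.

```python
import pandas as pd

# Toy stand-in for the survey data; the values are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": [0, 0, 1, 0],
    "dog.cat": ["I like dogs.", "I like cats.", "I like dogs.", "I like dogs."],
})

# On numeric columns, describe() reports mean, std, min, quartiles and max.
numeric_summary = toy[["id", "gender"]].describe()

# After casting to object, every column gets count/unique/top/freq instead.
object_summary = toy.astype(object).describe().T

print(numeric_summary.index.tolist())
print(object_summary)
```

With the cast, a 0/1 coded column such as gender is summarized by its most frequent value rather than by an average, which is usually what we want for survey codes.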

Let us look at the age.range variable in more detail. We can list the different levels of this variable using the cat.categories property, which is available once the pandas Series has been converted to the category dtype.

In [46]:
reddit["age.range"].astype('category').cat.categories

Out[46]:
Index(['18-24', '25-34', '35-44', '45-54', '55-64', '65 or Above', 'Under 18'], dtype='object')

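This shows there are 7 possible values of this variable. Missing answers (NA) do not appear as a level; the summary above lists 32666 age.range values against 32754 ids, so 88 responses have no age range.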
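Missing answers can also be counted directly. The sketch below uses an invented sample rather than the real survey column, so the numbers it prints are not those of the data set; the same calls should apply to reddit["age.range"].

```python
import pandas as pd

# Invented sample of age.range answers, including two missing responses.
ages = pd.Series(
    ["18-24", "25-34", None, "18-24", "Under 18", None],
    name="age.range",
)

n_missing = int(ages.isna().sum())        # number of missing answers
n_levels = ages.nunique()                 # distinct non-missing levels
counts = ages.value_counts(dropna=False)  # per-level counts, missing included

print(n_missing, n_levels)
print(counts)
```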

A more pictorial view can be obtained with a count plot, a bar chart showing the number of responses in each level.

In [57]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
%matplotlib inline

newOrder = ["Under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above"]
ax = sns.countplot(x="age.range", data=reddit, order=newOrder)

Similarly, we can also plot a distribution of income range.

In [51]:
ax = sns.countplot(x="income.range", data=reddit)
locs, labels = plt.xticks()
ax = plt.setp(labels, rotation=90)

One problem with the income plot above is that its levels are not in a meaningful order. This can be fixed with ordered categoricals (the pandas counterpart of R's ordered factors) or, as done here, by passing an explicit order to the plot. Additionally, we need shorter, more readable x-labels for income.range.

In [52]:
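# New labels are applied by position, so they must follow the existing
# categories, which pandas sorts alphabetically ("Under $20,000" comes last).
newLevels = ["100K", ">150K", "20K","30K", "40K", "50K", "70K", "<20K"]
reddit["income.range"] = reddit["income.range"].astype('category')
reddit["income.range"] = reddit["income.range"].cat.rename_categories(newLevels)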

In [55]:
newOrder = ["<20K", "20K","30K", "40K", "50K", "70K", "100K", ">150K"]
ax = sns.countplot(x="income.range", data=reddit, order=newOrder)
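The ordering can also be stored in the column itself with an ordered categorical, so that sorting and comparisons follow it without repeating the order list. The sketch below assumes the short labels used above and runs on invented answers rather than the real reddit["income.range"] column; income_order and income_dtype are hypothetical names.

```python
import pandas as pd

# Short income labels, listed from lowest to highest.
income_order = ["<20K", "20K", "30K", "40K", "50K", "70K", "100K", ">150K"]
income_dtype = pd.CategoricalDtype(categories=income_order, ordered=True)

# Invented answers standing in for reddit["income.range"].
income = pd.Series(["70K", "<20K", ">150K", "20K", "<20K"]).astype(income_dtype)

# Sorting, min and max now follow the declared order, not the alphabet.
sorted_income = income.sort_values().tolist()
print(sorted_income)
print(income.min(), income.max())

# Counts indexed in category order (levels with no answers show up as 0).
print(income.value_counts().sort_index())
```

seaborn generally picks up the category order of a categorical column when drawing a count plot, so the order argument may become unnecessary, though passing it explicitly as above remains the safer choice.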
