I have been playing a lot lately with Julia. In the past I have used python and R for my data analysis tasks. I always tend to have a cheatsheet in some form for the basic dataframe operations in pandas and R. Its hard to make that context switch in the syntax from one language to another.
With Julia, my brain was just not ready to accept another set of syntax. So I thought a quick reference comparing the basic dataframe manipulation syntax for all 3 languages would be nice. Most of my data tasks start with some ETL tasks on dataframes. So this is more of a reference for my use in future but can be useful to others too who face the same situation.
I took this awesome tutorial by Greg Reda on “Working with DataFrames” and tried to port the example to R and Julia.
All the code from this post can be found at this github repo dataframes-compare.
We start with the MovieLens 100k dataset.
This data set consists of:
- 100,000 ratings (1-5) from 943 users on 1682 movies.
- Each user has rated at least 20 movies.
- Simple demographic info for the users (age, gender, occupation, zip)
Data import
Lets start with loading the data into our dataframe. The methods to read the data from a flat file are pretty simple and straight forward in all 3, the only tricky part is when adding column names in Julia. But there are ways to get around.
Julia
using DataFrames
u_col_names=[symbol("user_id"), symbol("age"), symbol("sex"), symbol("occupation"), symbol("zip_code")]
Another way to do the same without adding each entry as a symbol. Thanks to Jubobs from this stackoverflow post for the suggestion.
col_names=["user_id", "age", "sex", "occupation", "zip_code"]
u_col_names=map(symbol, col_names)
users = DataFrames.readtable("data/ml-100k/u.user", separator='|', header=false, names=u_col_names)
Edit: Probably the easiest way to do this is (as pointed out by lot of people on HN and in the comments section here)
u_col_names=[:user_id, :age, :sex, :occupation, :zip_code]
users = DataFrames.readtable("data/ml-100k/u.user", separator='|', header=false, names=u_col_names)
There is no way right now to load selective columns from a file. The next IO version will hopefully have that mechanism. Stackoverflow question selecting-columns-while-importing-data-with-dataframes-readtable
Python
import pandas as pd
u_col_names = ['user_id', 'age', 'sex', 'occuptation', 'zip_code']
users = pd.read_csv('data/ml-100k/u.user', sep = '|', names=u_col_names)
Let’s only load the first five columns of the file with usecols
m_col_names = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies = pd.read_csv('data/ml-100k/u.item', sep='|', names=m_col_names, usecols=range(5))
R
u_col_names <- c('user_id', 'age', 'sex', 'occupation', 'zip_code')
users <- read.csv('data/ml-100k/u.user', sep='|', col.names=u_col_names, header=FALSE)
Loading the first 5 columns only
m_col_names = c('movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url')
movies = read.table('data/ml-100k/u.item', sep='|', colClasses=c("integer", "character", "factor", "factor", "character", rep("NULL", 19)), quote="")
Stackoverflow question only-read-limited-number-of-columns-in-r also quotes in strings cause importing errors so you need quote=”” Cannot specify col.names in read.table as we are skipping 19 columns. R will complain with “more columns than column names”
colnames(movies) <- m_col_names
If you notice closely R and python accept string literals in single quotes ‘ but Julia treats is as a character literal. I personally feel thats how it should be if I think about how C/C++ treats string literals.
Sanity check
Now that we have the data loaded in our dataframes, its usually good practice to see the classes/types of each column. We will also try to get the summary statistics for the columns of our dataframe.
Julia
eltypes(users)
5-element Array{Type{T<:Top},1}:
Int64
Int64
UTF8String
UTF8String
UTF8String
describe(users)
user_id
Min 1.0
1st Qu. 236.5
Median 472.0
Mean 472.0
3rd Qu. 707.5
Max 943.0
NAs 0
NA% 0.0%
age
Min 7.0
1st Qu. 25.0
Median 31.0
Mean 34.05196182396607
3rd Qu. 43.0
Max 73.0
NAs 0
NA% 0.0%
sex
Length 943
Type UTF8String
NAs 0
NA% 0.0%
Unique 2
occupation
Length 943
Type UTF8String
NAs 0
NA% 0.0%
Unique 21
zip_code
Length 943
Type UTF8String
NAs 0
NA% 0.0%
Unique 795
Python
type(users)
pandas.core.frame.DataFrame
users.dtypes
user_id int64
age int64
sex object
occuptation object
zip_code object
dtype: object
users.describe
user_id age
count 943.000000 943.000000
mean 472.000000 34.051962
std 272.364951 12.192740
min 1.000000 7.000000
25% 236.500000 25.000000
50% 472.000000 31.000000
75% 707.500000 43.000000
max 943.000000 73.000000
R
class(users)
[1] "data.frame"
lapply(users, class)
$user_id
[1] "integer"
$age
[1] "integer"
$sex
[1] "factor"
$occupation
[1] "factor"
$zip_code
[1] "factor"
str(users)
'data.frame': 943 obs. of 5 variables:
'data.frame': 943 obs. of 5 variables:
$ user_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ age : int 24 53 23 24 33 42 57 36 29 53 ...
$ sex : Factor w/ 2 levels "F","M": 2 1 2 2 1 2 2 2 2 2 ...
$ occupation: Factor w/ 21 levels "administrator",..: 20 14 21 20 14 7 1 1 19 10 ...
$ zip_code : Factor w/ 795 levels "00000","01002",..: 623 690 271 332 134 759 652 49 2 643 ...
$ user_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ age : int 24 53 23 24 33 42 57 36 29 53 ...
$ sex : Factor w/ 2 levels "F","M": 2 1 2 2 1 2 2 2 2 2 ...
$ occupation: Factor w/ 21 levels "administrator",..: 20 14 21 20 14 7 1 1 19 10 ...
$ zip_code : Factor w/ 795 levels "00000","01002",..: 623 690 271 332 134 759 652 49 2 643 ...
summary(users)
user_id age sex occupation zip_code
Min. : 2.0 Min. : 7.00 F:273 student :196 55414 : 9
1st Qu.:237.2 1st Qu.:25.00 M:669 other :105 55105 : 6
Median :472.5 Median :31.00 educator : 95 10003 : 5
Mean :472.5 Mean :34.06 administrator: 79 20009 : 5
3rd Qu.:707.8 3rd Qu.:43.00 engineer : 67 55337 : 5
Max. :943.0 Max. :73.00 programmer : 66 27514 : 4
(Other) :334 (Other):908
user_id age sex occupation zip_code
Min. : 2.0 Min. : 7.00 F:273 student :196 55414 : 9
1st Qu.:237.2 1st Qu.:25.00 M:669 other :105 55105 : 6
Median :472.5 Median :31.00 educator : 95 10003 : 5
Mean :472.5 Mean :34.06 administrator: 79 20009 : 5
3rd Qu.:707.8 3rd Qu.:43.00 engineer : 67 55337 : 5
Max. :943.0 Max. :73.00 programmer : 66 27514 : 4
(Other) :334 (Other):908
We can see that the ratio is around 1:2.5 between female and male users and most of them (if you exclude “other”) are students. Minimum age is 7 and it goes till 73 as the max.
Subsetting
Lets try to query our data now and filter it with some conditions.
Head/Tail
We will warmup with some top/bottom values to get a feel of the data we have at hand.
Julia
head(users)
6x5 DataFrame
| Row | user_id | age | sex | occupation | zip_code |
|-----|---------|-----|-----|--------------|----------|
| 1 | 1 | 24 | "M" | "technician" | "85711" |
| 2 | 2 | 53 | "F" | "other" | "94043" |
| 3 | 3 | 23 | "M" | "writer" | "32067" |
| 4 | 4 | 24 | "M" | "technician" | "43537" |
| 5 | 5 | 33 | "F" | "other" | "15213" |
| 6 | 6 | 42 | "M" | "executive" | "98101" |
tail(users)
6x5 DataFrame
| Row | user_id | age | sex | occupation | zip_code |
|-----|---------|-----|-----|-----------------|----------|
| 1 | 938 | 38 | "F" | "technician" | "55038" |
| 2 | 939 | 26 | "F" | "student" | "33319" |
| 3 | 940 | 32 | "M" | "administrator" | "02215" |
| 4 | 941 | 20 | "M" | "student" | "97229" |
| 5 | 942 | 48 | "F" | "librarian" | "78209" |
| 6 | 943 | 22 | "M" | "student" | "77841" |
If you want to select a custom number of rows in head or tail you can pass it as a param
head(users, 3)
3x5 DataFrame
| Row | user_id | age | sex | occupation | zip_code |
|-----|---------|-----|-----|--------------|----------|
| 1 | 1 | 24 | "M" | "technician" | "85711" |
| 2 | 2 | 53 | "F" | "other" | "94043" |
| 3 | 3 | 23 | "M" | "writer" | "32067" |
I wont paste the output of these for python and R as they are similar to what Julia shows..
Python
users.head()
users.tail()
users.head(3)
R
head(movies)
tail(movies)
head(movies, n=3)
Row subset
Lets try to get all the rows from 50th row to the 55th row
Julia
users[50:55,:]
6x5 DataFrame
| Row | user_id | age | sex | occupation | zip_code |
|-----|---------|-----|-----|--------------|----------|
| 1 | 50 | 21 | "M" | "writer" | "52245" |
| 2 | 51 | 28 | "M" | "educator" | "16509" |
| 3 | 52 | 18 | "F" | "student" | "55105" |
| 4 | 53 | 26 | "M" | "programmer" | "55414" |
| 5 | 54 | 22 | "M" | "executive" | "66315" |
| 6 | 55 | 37 | "M" | "programmer" | "01331" |
Python
users[49:55]
R
users[50:55,]
Notice that Julia and R have 1-based indexing whereas python has 0-based indexing. Also python slicing a:b is a(included) and b(excluded) so you have to do 49:55 to get rows 50-55 in python.
Column subset
You can select a single column by column name.
Julia
head(users[:occupation])
6-element DataArray{UTF8String,1}:
"technician"
"other"
"writer"
"technician"
"other"
"executive"
Python
users['occupation'].head()
R
head(users$occupation)
Multiple column selection works by passing a vector of column names
Julia
head(users[:,[:occupation, :sex, :age]])
6x3 DataFrame
| Row | occupation | sex | age |
|-----|--------------|-----|-----|
| 1 | "technician" | "M" | 24 |
| 2 | "other" | "F" | 53 |
| 3 | "writer" | "M" | 23 |
| 4 | "technician" | "M" | 24 |
| 5 | "other" | "F" | 33 |
| 6 | "executive" | "M" | 42 |
Python
users[['occupation', 'sex', 'age']].head()
R
head(users[,c('occupation', 'sex', 'age')])
Query / Conditional subset
Subsetting a dataframe based on querying a column for a condition can be achieved like this -
Julia
users[users[:occupation] .== "writer", :]
45x5 DataFrame
| Row | user_id | age | sex | occupation | zip_code |
|-----|---------|-----|-----|------------|----------|
| 1 | 3 | 23 | "M" | "writer" | "32067" |
| 2 | 21 | 26 | "M" | "writer" | "30068" |
| 3 | 22 | 25 | "M" | "writer" | "40206" |
| 4 | 28 | 32 | "M" | "writer" | "55369" |
| 5 | 50 | 21 | "M" | "writer" | "52245" |
⋮
| 43 | 853 | 49 | "M" | "writer" | "40515" |
| 44 | 896 | 28 | "M" | "writer" | "91505" |
| 45 | 911 | 37 | "F" | "writer" | "53210" |
Python
users[users.occupation == 'writer']
There is another way in python, by using the query construct of the dataframe.
users.query('occupation=="writer"')
R
users[users$occupation == 'writer',]
Notice the subtle changes in all these examples. For example in the last query subsetting, for python you dont need to specify “select all columns” by adding a ‘ ,” ‘ as you do in Julia or a ‘, ‘ as you do in R.
I will try to write a followup post on Joins on Dataframes in these 3.
Discuss this post on hacker news.