Getting Started with DataFrames.jl: A Beginner's Guide
Mastering Tabular Data in Julia: A Simple Guide for Beginners.
When doing any sort of development one will often find themselves in need of working with data in a tabular format. This is especially true for those of us in data science, or data analysis, fields. In the Julia programming language one of the more popular libraries for this type of data wrangling is DataFrames.jl. In this blog post we'll explore the beginnings of working with this package.
Introduction
The great thing about a package like Dataframes.jl
is that it bridges the gap between traditional
programming and SQL (Structured Query Language). Databases are great tools for easily gaining insights
into your data by joining, filtering, aggregating, sorting, etc... Dataframes.jl
brings those goodies right
into your hands by simply adding the package into your julia session. So, lets get started!
Getting Started
Adding the package is a few simple steps.
julia> using Pkg
julia> Pkg.add("DataFrames")
julia> using DataFrames
The constructor for a DataFrame
provides flexibility to create from arrays, tuples, constants, or files. The
documentation covers all these, but for this post we'll just explore one of the more common ways.
julia> df = DataFrame(a = 1:4, b = rand(4), c = "My first DataFrame")
4×3 DataFrame
Row │ a b c
│ Int64 Float64 String
─────┼─────────────────────────────────────
1 │ 1 0.141874 My first DataFrame
2 │ 2 0.432084 My first DataFrame
3 │ 3 0.47098 My first DataFrame
4 │ 4 0.414639 My first DataFrame
You'll notice in the code above we use a mix of datatypes including range, array, and scalar. The underlying vectors must be of the same size and the scalar gets broadcasted, or repeated, for each row. Also, pay attention that the types of each column are inferred based on the arrays passed into the constructor.
Now, to access a column of a DataFrame
there are also a few different possibilities. Here are a few examples of accessing the
first column "a".
julia> df.a
4-element Vector{Int64}:
1
2
3
4
julia> df."a"
4-element Vector{Int64}:
1
2
3
4
julia> df[!, "a"]
4-element Vector{Int64}:
1
2
3
4
julia> df[!, :a]
4-element Vector{Int64}:
1
2
3
4
julia> df[:, :a]
4-element Vector{Int64}:
1
2
3
4
In these examples columns can be access directly with literals such as df.a
, or more
dynamically using brackets (since variables could be substituted.) You may also find
yourself wondering the difference between !
and :
, which is an important distinction!
The !
returns the underlying vector and :
returns a copy. This can be showcased in an
example where we will attempt to change the description of the second value in column c
to "I love Julia!"
julia> df[:, :c][2] = "I love Julia!"
3
julia> df
4×3 DataFrame
Row │ a b c
│ Int64 Float64 String
─────┼─────────────────────────────────────
1 │ 1 0.394165 My first DataFrame
2 │ 2 0.809883 My first DataFrame
3 │ 3 0.124035 My first DataFrame
4 │ 4 0.886781 My first DataFrame
julia> df[!, :c][2] = "I love Julia!"
3
julia> df
4×3 DataFrame
Row │ a b c
│ Int64 Float64 String
─────┼─────────────────────────────────────
1 │ 1 0.394165 My first DataFrame
2 │ 3 0.809883 I love Julia!
3 │ 3 0.124035 My first DataFrame
4 │ 4 0.886781 My first DataFrame
Notice how the change will only persist to df
when we access the column with !
.
There is often a tradeoff between returning copies versus the actual underlying vectors. Returning a copy is generally considered safer since if the copy is later mutated the underlying DataFrame remains unchanged. However, with very large DataFrames copying every column access will result in an increase in memory. It is best to weigh those considerations and figure out what approach will work best for a given program.
Data Wrangling
Import / Export
Another great feature of the Julia programming language is that many different packages will interact well
when used together. For instance, DataFrames.jl
and CSV.jl
can be used to very easily import and export
data.
First, we can save the DataFrame
from above to CSV.
julia> using CSV
julia> path = joinpath(homedir(), "my_df.csv")
julia> CSV.write(path, df)
And, reading in the DataFrame
from file is just as easy!
julia> CSV.read(path, DataFrame)
4×3 DataFrame
Row │ a b c
│ Int64 Float64 String31
─────┼─────────────────────────────────────
1 │ 1 0.601361 My first DataFrame
2 │ 2 0.178065 My first DataFrame
3 │ 3 0.729591 My first DataFrame
4 │ 4 0.280314 My first DataFrame
There are many keyword arguments to explore when handling csv files and the documentation is best for covering all of these CSV.jl.
DataFrames.jl
also supports writing and reading to multiple files types such as Arrow, JSON, Parquet, and others.
Joins
A join is a way to merge data from two DataFrames into a single DataFrame. There are several types and they generally mimic the same types that a database would support.
innerjoin
leftjoin
rightjoin
outerjoin
semijoin
antijoin
crossjoin
Definitions of each can be found in either the documentation, or docstrings, but lets take a look at a few
examples. Say we have the following DataFrame
sets containing information from a school.
julia> student_df = DataFrame(student_id = 1:10, student_name = ["Joe", "Sally", "Jim", "Sandy", "Beth", "Alex", "Tom", "Liz", "Bill", "Carl"], teacher_id = repeat([1,2],5))
10×3 DataFrame
Row │ student_id student_name teacher_id
│ Int64 String Int64
─────┼──────────────────────────────────────
1 │ 1 Joe 1
2 │ 2 Sally 2
3 │ 3 Jim 1
4 │ 4 Sandy 2
5 │ 5 Beth 1
6 │ 6 Alex 2
7 │ 7 Tom 1
8 │ 8 Liz 2
9 │ 9 Bill 1
10 │ 10 Carl 2
julia> teacher_df = DataFrame(teacher_id = 1:2, teacher_name = ["Mr. Jackson", "Ms. Smith"])
2×2 DataFrame
Row │ teacher_id teacher_name
│ Int64 String
─────┼──────────────────────────
1 │ 1 Mr. Jackson
2 │ 2 Ms. Smith
julia> grade_df = DataFrame(exam_id = 1, student_id = vcat(1:3, 5:10), grade = [0.95, 0.93, 0.81, 0.85, 0.73, 0.88, 0.77, 0.75, 0.93])
9×3 DataFrame
Row │ exam_id student_id grade
│ Int64 Int64 Float64
─────┼──────────────────────────────
1 │ 1 1 0.95
2 │ 1 2 0.93
3 │ 1 3 0.81
4 │ 1 5 0.85
5 │ 1 6 0.73
6 │ 1 7 0.88
7 │ 1 8 0.77
8 │ 1 9 0.75
9 │ 1 10 0.93
If we look at the grade_df
we can see there are 9 results, but in the student_df
we have 10 students.
So, someone must have missed the exam! Let's find out who that way we can alert the teacher to schedule
a makeup.
Let's do a leftjoin
, which means every row will persist from the first DataFrame
regardless if there is
a match to the second DataFrame
. The leftjoin
function also takes an on
keyword argument
to signify what column needs to be used to find matches.
julia> student_grade_df = leftjoin(student_df, grade_df, on=:student_id)
10×5 DataFrame
Row │ student_id student_name teacher_id exam_id grade
│ Int64 String Int64 Int64? Float64?
─────┼───────────────────────────────────────────────────────────
1 │ 1 Joe 1 1 0.95
2 │ 2 Sally 2 1 0.93
3 │ 3 Jim 1 1 0.81
4 │ 5 Beth 1 1 0.85
5 │ 6 Alex 2 1 0.73
6 │ 7 Tom 1 1 0.88
7 │ 8 Liz 2 1 0.77
8 │ 9 Bill 1 1 0.75
9 │ 10 Carl 2 1 0.93
10 │ 4 Sandy 2 missing missing
We notice Sandy has a missing
value for both the exam_id
and grade
fields. missing
is a
special datatype in Julia that is similar to a null
value in databases. This would signify to
us that there was no match in the grade_df
meaning Sandy missed the exam. We can add one more
join to get the respective teacher's name.
julia> result_df = innerjoin(student_grade_df, teacher_df, on=:teacher_id)
10×6 DataFrame
Row │ student_id student_name teacher_id exam_id grade teacher_name
│ Int64 String Int64 Int64? Float64? String
─────┼─────────────────────────────────────────────────────────────────────────
1 │ 1 Joe 1 1 0.95 Mr. Jackson
2 │ 2 Sally 2 1 0.93 Ms. Smith
3 │ 3 Jim 1 1 0.81 Mr. Jackson
4 │ 5 Beth 1 1 0.85 Mr. Jackson
5 │ 6 Alex 2 1 0.73 Ms. Smith
6 │ 7 Tom 1 1 0.88 Mr. Jackson
7 │ 8 Liz 2 1 0.77 Ms. Smith
8 │ 9 Bill 1 1 0.75 Mr. Jackson
9 │ 10 Carl 2 1 0.93 Ms. Smith
10 │ 4 Sandy 2 missing missing Ms. Smith
We used an innerjoin
this time since we know that every student would have a teacher assigned.
Now, we can let Ms. Smith know that she needs to reach out to Sandy to re-schedule her exam.
Sorting
Another helpful function for analysis is sort
. Let's sort our result_df
by the grade
column.
julia> sort(result_df, [:grade])
10×6 DataFrame
Row │ student_id student_name teacher_id exam_id grade teacher_name
│ Int64 String Int64 Int64? Float64? String
─────┼─────────────────────────────────────────────────────────────────────────
1 │ 6 Alex 2 1 0.73 Ms. Smith
2 │ 9 Bill 1 1 0.75 Mr. Jackson
3 │ 8 Liz 2 1 0.77 Ms. Smith
4 │ 3 Jim 1 1 0.81 Mr. Jackson
5 │ 5 Beth 1 1 0.85 Mr. Jackson
6 │ 7 Tom 1 1 0.88 Mr. Jackson
7 │ 2 Sally 2 1 0.93 Ms. Smith
8 │ 10 Carl 2 1 0.93 Ms. Smith
9 │ 1 Joe 1 1 0.95 Mr. Jackson
10 │ 4 Sandy 2 missing missing Ms. Smith
The function takes the DataFrame
and an array of columns to sort on. Our result of sort
is putting the lowest
grade first, but if we wanted it descending we can pass a rev
keyword argument.
julia> sort(result_df, [:grade], rev=true)
10×6 DataFrame
Row │ student_id student_name teacher_id exam_id grade teacher_name
│ Int64 String Int64 Int64? Float64? String
─────┼─────────────────────────────────────────────────────────────────────────
1 │ 4 Sandy 2 missing missing Ms. Smith
2 │ 1 Joe 1 1 0.95 Mr. Jackson
3 │ 2 Sally 2 1 0.93 Ms. Smith
4 │ 10 Carl 2 1 0.93 Ms. Smith
5 │ 7 Tom 1 1 0.88 Mr. Jackson
6 │ 5 Beth 1 1 0.85 Mr. Jackson
7 │ 3 Jim 1 1 0.81 Mr. Jackson
8 │ 8 Liz 2 1 0.77 Ms. Smith
9 │ 9 Bill 1 1 0.75 Mr. Jackson
10 │ 6 Alex 2 1 0.73 Ms. Smith
In both these cases a copy of the DataFrame
is returned and the result_df
is left unchanged. But, if we wanted to
sort in-place we can also use the sort!
function that will update the passed DataFrame
.
julia> sort!(result_df, [:grade], rev=true)
10×6 DataFrame
Row │ student_id student_name teacher_id exam_id grade teacher_name
│ Int64 String Int64 Int64? Float64? String
─────┼─────────────────────────────────────────────────────────────────────────
1 │ 4 Sandy 2 missing missing Ms. Smith
2 │ 1 Joe 1 1 0.95 Mr. Jackson
3 │ 2 Sally 2 1 0.93 Ms. Smith
4 │ 10 Carl 2 1 0.93 Ms. Smith
5 │ 7 Tom 1 1 0.88 Mr. Jackson
6 │ 5 Beth 1 1 0.85 Mr. Jackson
7 │ 3 Jim 1 1 0.81 Mr. Jackson
8 │ 8 Liz 2 1 0.77 Ms. Smith
9 │ 9 Bill 1 1 0.75 Mr. Jackson
10 │ 6 Alex 2 1 0.73 Ms. Smith
Split-apply-combine
Now that we have some basics down, it's time to dive into aggregating results. In DataFrames.jl
this is
referred to as a split-apply-combine strategy. It is a bit of a mouthful, but let's walk through what
exactly this is referring to.
Split is simply breaking the DataFrame
into groups using the groupby
function. In our example lets split
our DataFrame
by the teacher_name
column.
julia> grouped_df = groupby(result_df, :teacher_name)
GroupedDataFrame with 2 groups based on key: teacher_name
First Group (5 rows): teacher_name = "Mr. Jackson"
Row │ student_id student_name teacher_id exam_id grade teacher_name
│ Int64 String Int64 Int64? Float64? String
─────┼───────────────────────────────────────────────────────────────────────
1 │ 1 Joe 1 1 0.95 Mr. Jackson
2 │ 3 Jim 1 1 0.81 Mr. Jackson
3 │ 5 Beth 1 1 0.85 Mr. Jackson
4 │ 7 Tom 1 1 0.88 Mr. Jackson
5 │ 9 Bill 1 1 0.75 Mr. Jackson
⋮
Last Group (5 rows): teacher_name = "Ms. Smith"
Row │ student_id student_name teacher_id exam_id grade teacher_name
│ Int64 String Int64 Int64? Float64? String
─────┼─────────────────────────────────────────────────────────────────────────
1 │ 2 Sally 2 1 0.93 Ms. Smith
2 │ 6 Alex 2 1 0.73 Ms. Smith
3 │ 8 Liz 2 1 0.77 Ms. Smith
4 │ 10 Carl 2 1 0.93 Ms. Smith
5 │ 4 Sandy 2 missing missing Ms. Smith
The result of calling groupby
is of type GroupedDataFrame
, which is basically a wrapper
around one, or many, groups of a DataFrame
. In our example we have two teachers and so the result
GroupedDataFrame
has two groups.
Now, lets try to get an average exam grade for our two teachers. This will introduce the combine
function that takes a GroupedDataFrame
and any number of aggregation functions. Let's also add
the Statistics.jl
package, so we can take advantage of the mean
function.
julia> using Statistics
julia> combine(grouped_df, :grade => mean)
2×2 DataFrame
Row │ teacher_name grade_mean
│ String Float64?
─────┼───────────────────────────
1 │ Mr. Jackson 0.848
2 │ Ms. Smith missing
The result is a DataFrame
where the first column(s) will match our GroupedDataFrame
key(s) and the
subsequent column(s) will match the function(s) we pass for aggregation. However, Ms. Smith has a grade_mean
of missing
!?
In our earlier discussion we found that Sandy missed the exam, so her grade was set to missing
. A missing
value
behaves differently than normal numbers, which is problematic in our aggregation function. Take a look at a very simple
example.
julia> 1 + missing
missing
We notice that adding a value of 1 to missing
equals missing
. This is a necessary evil and you may be wondering
why don't we just treat it as 0? Let's see what happens to our results if we replace missing
with 0.
julia> combine(grouped_df, :grade => (x -> mean(coalesce.(x, 0))))
2×2 DataFrame
Row │ teacher_name grade_function
│ String Float64
─────┼──────────────────────────────
1 │ Mr. Jackson 0.848
2 │ Ms. Smith 0.672
In the above example, instead of just passing mean
as the function we create an anonymous function. This allows us to get a little
more clever with adding a coalesce
to replace the missing
values with 0. We see from the results that Ms. Smith has a much lower
scoring average than Mr. Jackson. But, if we think about it the results are getting incorrectly skewed. We know Sandy didn't actually
score a 0, but rather didn't take the test at all. Treating her result as a 0 is skewing the average much lower than it should be.
In some cases replacing with a 0 would make sense, but not in this scenario. Here are a few better options:
We could just drop the rows that contain missing
values prior to aggregation. DataFrames.jl
provides a dropmissing
function
specifically for this.
julia> result_no_missing_df = dropmissing(result_df)
9×6 DataFrame
Row │ student_id student_name teacher_id exam_id grade teacher_name
│ Int64 String Int64 Int64 Float64 String
─────┼──────────────────────────────────────────────────────────────────────
1 │ 1 Joe 1 1 0.95 Mr. Jackson
2 │ 2 Sally 2 1 0.93 Ms. Smith
3 │ 3 Jim 1 1 0.81 Mr. Jackson
4 │ 5 Beth 1 1 0.85 Mr. Jackson
5 │ 6 Alex 2 1 0.73 Ms. Smith
6 │ 7 Tom 1 1 0.88 Mr. Jackson
7 │ 8 Liz 2 1 0.77 Ms. Smith
8 │ 9 Bill 1 1 0.75 Mr. Jackson
9 │ 10 Carl 2 1 0.93 Ms. Smith
julia> grouped_no_missing_df = groupby(result_no_missing_df, :teacher_name)
GroupedDataFrame with 2 groups based on key: teacher_name
First Group (5 rows): teacher_name = "Mr. Jackson"
Row │ student_id student_name teacher_id exam_id grade teacher_name
│ Int64 String Int64 Int64 Float64 String
─────┼──────────────────────────────────────────────────────────────────────
1 │ 1 Joe 1 1 0.95 Mr. Jackson
2 │ 3 Jim 1 1 0.81 Mr. Jackson
3 │ 5 Beth 1 1 0.85 Mr. Jackson
4 │ 7 Tom 1 1 0.88 Mr. Jackson
5 │ 9 Bill 1 1 0.75 Mr. Jackson
⋮
Last Group (4 rows): teacher_name = "Ms. Smith"
Row │ student_id student_name teacher_id exam_id grade teacher_name
│ Int64 String Int64 Int64 Float64 String
─────┼──────────────────────────────────────────────────────────────────────
1 │ 2 Sally 2 1 0.93 Ms. Smith
2 │ 6 Alex 2 1 0.73 Ms. Smith
3 │ 8 Liz 2 1 0.77 Ms. Smith
4 │ 10 Carl 2 1 0.93 Ms. Smith
julia> combine(grouped_no_missing_df, :grade => mean)
2×2 DataFrame
Row │ teacher_name grade_mean
│ String Float64
─────┼──────────────────────────
1 │ Mr. Jackson 0.848
2 │ Ms. Smith 0.84
We now see that the two teachers average test scores are very similar. This approach would work well
if we never again needed the rows containing missing
values.
But, if we wanted to keep those rows around and rather just exclude them from certain calculations.
We can make use of another function, skipmissing
, which will simply skip over the missing
values.
julia> combine(grouped_df, :grade => (x -> mean(skipmissing(x))))
2×2 DataFrame
Row │ teacher_name grade_function
│ String Float64
─────┼──────────────────────────────
1 │ Mr. Jackson 0.848
2 │ Ms. Smith 0.84
One last thing to note on missing
values is that it is easy to identify if one, or more, of your DataFrame
columns contains missing
values. We talked earlier that DataFrames.jl
infers the types for each
column and displays them in the output. You'll notice in result_df
that the column teacher_id
is of datatype
Int64
and exam_id
is of Int64?
. Here the ?
denotes that missing
values were found, so be careful!
Conclusion
We've touched on some of the topics that makes DataFrames.jl
such a great general purpose package. It is a helpful
tool to quickly interact for data exploration, or to be used in production code to manipulate tabular data. I hope
you've enjoyed today's reading and be sure to check out the rest of our blog posts on blog.glcs.io!