When doing any sort of development one will often find themselves in need of working with data in a tabular format. This is especially true for those of us in data science, or data analysis, fields. In the Julia programming language one of the more popular libraries for this type of data wrangling is DataFrames.jl. In this blog post we'll explore the beginnings of working with this package.

Introduction

The great thing about a package like Dataframes.jl is that it bridges the gap between traditional programming and SQL (Structured Query Language). Databases are great tools for easily gaining insights into your data by joining, filtering, aggregating, sorting, etc... Dataframes.jl brings those goodies right into your hands by simply adding the package into your julia session. So, lets get started!

Getting Started

Adding the package is a few simple steps.

julia> using Pkg
julia> Pkg.add("DataFrames")
julia> using DataFrames

The constructor for a DataFrame provides flexibility to create from arrays, tuples, constants, or files. The documentation covers all these, but for this post we'll just explore one of the more common ways.

julia> df = DataFrame(a = 1:4, b = rand(4), c = "My first DataFrame")
4×3 DataFrame
 Row │ a      b         c                  
     │ Int64  Float64   String             
─────┼─────────────────────────────────────
   1 │     1  0.141874  My first DataFrame
   2 │     2  0.432084  My first DataFrame
   3 │     3  0.47098   My first DataFrame
   4 │     4  0.414639  My first DataFrame

You'll notice in the code above we use a mix of datatypes including range, array, and scalar. The underlying vectors must be of the same size and the scalar gets broadcasted, or repeated, for each row. Also, pay attention that the types of each column are inferred based on the arrays passed into the constructor.

Now, to access a column of a DataFrame there are also a few different possibilities. Here are a few examples of accessing the first column "a".

julia> df.a
4-element Vector{Int64}:
 1
 2
 3
 4

julia> df."a"
4-element Vector{Int64}:
 1
 2
 3
 4

julia> df[!, "a"]
4-element Vector{Int64}:
 1
 2
 3
 4

julia> df[!, :a]
4-element Vector{Int64}:
 1
 2
 3
 4

julia> df[:, :a]
4-element Vector{Int64}:
 1
 2
 3
 4

In these examples columns can be access directly with literals such as df.a, or more dynamically using brackets (since variables could be substituted.) You may also find yourself wondering the difference between ! and :, which is an important distinction!

The ! returns the underlying vector and : returns a copy. This can be showcased in an example where we will attempt to change the description of the second value in column c to "I love Julia!"

julia> df[:, :c][2] = "I love Julia!"
3

julia> df
4×3 DataFrame
 Row │ a      b         c                  
     │ Int64  Float64   String             
─────┼─────────────────────────────────────
   1 │     1  0.394165  My first DataFrame
   2 │     2  0.809883  My first DataFrame
   3 │     3  0.124035  My first DataFrame
   4 │     4  0.886781  My first DataFrame

julia> df[!, :c][2] = "I love Julia!"
3

julia> df
4×3 DataFrame
 Row │ a      b         c                  
     │ Int64  Float64   String             
─────┼─────────────────────────────────────
   1 │     1  0.394165  My first DataFrame
   2 │     3  0.809883  I love Julia!
   3 │     3  0.124035  My first DataFrame
   4 │     4  0.886781  My first DataFrame

Notice how the change will only persist to df when we access the column with !.

There is often a tradeoff between returning copies versus the actual underlying vectors. Returning a copy is generally considered safer since if the copy is later mutated the underlying DataFrame remains unchanged. However, with very large DataFrames copying every column access will result in an increase in memory. It is best to weigh those considerations and figure out what approach will work best for a given program.

Data Wrangling

Import / Export

Another great feature of the Julia programming language is that many different packages will interact well when used together. For instance, DataFrames.jl and CSV.jl can be used to very easily import and export data.

First, we can save the DataFrame from above to CSV.

julia> using CSV

julia> path = joinpath(homedir(), "my_df.csv")

julia> CSV.write(path, df)

And, reading in the DataFrame from file is just as easy!

julia> CSV.read(path, DataFrame)
4×3 DataFrame
 Row │ a      b         c                  
     │ Int64  Float64   String31           
─────┼─────────────────────────────────────
   1 │     1  0.601361  My first DataFrame
   2 │     2  0.178065  My first DataFrame
   3 │     3  0.729591  My first DataFrame
   4 │     4  0.280314  My first DataFrame

There are many keyword arguments to explore when handling csv files and the documentation is best for covering all of these CSV.jl.

DataFrames.jl also supports writing and reading to multiple files types such as Arrow, JSON, Parquet, and others.

Joins

A join is a way to merge data from two DataFrames into a single DataFrame. There are several types and they generally mimic the same types that a database would support.

innerjoin
leftjoin
rightjoin
outerjoin
semijoin
antijoin
crossjoin

Definitions of each can be found in either the documentation, or docstrings, but lets take a look at a few examples. Say we have the following DataFrame sets containing information from a school.

julia> student_df = DataFrame(student_id = 1:10, student_name = ["Joe", "Sally", "Jim", "Sandy", "Beth", "Alex", "Tom", "Liz", "Bill", "Carl"], teacher_id = repeat([1,2],5))
10×3 DataFrame
 Row │ student_id  student_name  teacher_id 
     │ Int64       String        Int64      
─────┼──────────────────────────────────────
   1 │          1  Joe                    1
   2 │          2  Sally                  2
   3 │          3  Jim                    1
   4 │          4  Sandy                  2
   5 │          5  Beth                   1
   6 │          6  Alex                   2
   7 │          7  Tom                    1
   8 │          8  Liz                    2
   9 │          9  Bill                   1
  10 │         10  Carl                   2

julia> teacher_df = DataFrame(teacher_id = 1:2, teacher_name = ["Mr. Jackson", "Ms. Smith"])
2×2 DataFrame
 Row │ teacher_id  teacher_name 
     │ Int64       String       
─────┼──────────────────────────
   1 │          1  Mr. Jackson
   2 │          2  Ms. Smith

julia> grade_df = DataFrame(exam_id = 1, student_id = vcat(1:3, 5:10), grade = [0.95, 0.93, 0.81, 0.85, 0.73, 0.88, 0.77, 0.75, 0.93])
9×3 DataFrame
 Row │ exam_id  student_id  grade   
     │ Int64    Int64       Float64 
─────┼──────────────────────────────
   1 │       1           1     0.95
   2 │       1           2     0.93
   3 │       1           3     0.81
   4 │       1           5     0.85
   5 │       1           6     0.73
   6 │       1           7     0.88
   7 │       1           8     0.77
   8 │       1           9     0.75
   9 │       1          10     0.93

If we look at the grade_df we can see there are 9 results, but in the student_df we have 10 students. So, someone must have missed the exam! Let's find out who that way we can alert the teacher to schedule a makeup.

Let's do a leftjoin, which means every row will persist from the first DataFrame regardless if there is a match to the second DataFrame. The leftjoin function also takes an on keyword argument to signify what column needs to be used to find matches.

julia> student_grade_df = leftjoin(student_df, grade_df, on=:student_id)
10×5 DataFrame
 Row │ student_id  student_name  teacher_id  exam_id  grade      
     │ Int64       String        Int64       Int64?   Float64?   
─────┼───────────────────────────────────────────────────────────
   1 │          1  Joe                    1        1        0.95
   2 │          2  Sally                  2        1        0.93
   3 │          3  Jim                    1        1        0.81
   4 │          5  Beth                   1        1        0.85
   5 │          6  Alex                   2        1        0.73
   6 │          7  Tom                    1        1        0.88
   7 │          8  Liz                    2        1        0.77
   8 │          9  Bill                   1        1        0.75
   9 │         10  Carl                   2        1        0.93
  10 │          4  Sandy                  2  missing  missing

We notice Sandy has a missing value for both the exam_id and grade fields. missing is a special datatype in Julia that is similar to a null value in databases. This would signify to us that there was no match in the grade_df meaning Sandy missed the exam. We can add one more join to get the respective teacher's name.

julia> result_df = innerjoin(student_grade_df, teacher_df, on=:teacher_id)
10×6 DataFrame
 Row │ student_id  student_name  teacher_id  exam_id  grade       teacher_name 
     │ Int64       String        Int64       Int64?   Float64?    String       
─────┼─────────────────────────────────────────────────────────────────────────
   1 │          1  Joe                    1        1        0.95  Mr. Jackson
   2 │          2  Sally                  2        1        0.93  Ms. Smith
   3 │          3  Jim                    1        1        0.81  Mr. Jackson
   4 │          5  Beth                   1        1        0.85  Mr. Jackson
   5 │          6  Alex                   2        1        0.73  Ms. Smith
   6 │          7  Tom                    1        1        0.88  Mr. Jackson
   7 │          8  Liz                    2        1        0.77  Ms. Smith
   8 │          9  Bill                   1        1        0.75  Mr. Jackson
   9 │         10  Carl                   2        1        0.93  Ms. Smith
  10 │          4  Sandy                  2  missing  missing     Ms. Smith

We used an innerjoin this time since we know that every student would have a teacher assigned. Now, we can let Ms. Smith know that she needs to reach out to Sandy to re-schedule her exam.

Sorting

Another helpful function for analysis is sort. Let's sort our result_df by the grade column.

julia> sort(result_df, [:grade])
10×6 DataFrame
 Row │ student_id  student_name  teacher_id  exam_id  grade       teacher_name 
     │ Int64       String        Int64       Int64?   Float64?    String       
─────┼─────────────────────────────────────────────────────────────────────────
   1 │          6  Alex                   2        1        0.73  Ms. Smith
   2 │          9  Bill                   1        1        0.75  Mr. Jackson
   3 │          8  Liz                    2        1        0.77  Ms. Smith
   4 │          3  Jim                    1        1        0.81  Mr. Jackson
   5 │          5  Beth                   1        1        0.85  Mr. Jackson
   6 │          7  Tom                    1        1        0.88  Mr. Jackson
   7 │          2  Sally                  2        1        0.93  Ms. Smith
   8 │         10  Carl                   2        1        0.93  Ms. Smith
   9 │          1  Joe                    1        1        0.95  Mr. Jackson
  10 │          4  Sandy                  2  missing  missing     Ms. Smith

The function takes the DataFrame and an array of columns to sort on. Our result of sort is putting the lowest grade first, but if we wanted it descending we can pass a rev keyword argument.

julia> sort(result_df, [:grade], rev=true)
10×6 DataFrame
 Row │ student_id  student_name  teacher_id  exam_id  grade       teacher_name 
     │ Int64       String        Int64       Int64?   Float64?    String       
─────┼─────────────────────────────────────────────────────────────────────────
   1 │          4  Sandy                  2  missing  missing     Ms. Smith
   2 │          1  Joe                    1        1        0.95  Mr. Jackson
   3 │          2  Sally                  2        1        0.93  Ms. Smith
   4 │         10  Carl                   2        1        0.93  Ms. Smith
   5 │          7  Tom                    1        1        0.88  Mr. Jackson
   6 │          5  Beth                   1        1        0.85  Mr. Jackson
   7 │          3  Jim                    1        1        0.81  Mr. Jackson
   8 │          8  Liz                    2        1        0.77  Ms. Smith
   9 │          9  Bill                   1        1        0.75  Mr. Jackson
  10 │          6  Alex                   2        1        0.73  Ms. Smith

In both these cases a copy of the DataFrame is returned and the result_df is left unchanged. But, if we wanted to sort in-place we can also use the sort! function that will update the passed DataFrame.

julia> sort!(result_df, [:grade], rev=true)
10×6 DataFrame
 Row │ student_id  student_name  teacher_id  exam_id  grade       teacher_name 
     │ Int64       String        Int64       Int64?   Float64?    String       
─────┼─────────────────────────────────────────────────────────────────────────
   1 │          4  Sandy                  2  missing  missing     Ms. Smith
   2 │          1  Joe                    1        1        0.95  Mr. Jackson
   3 │          2  Sally                  2        1        0.93  Ms. Smith
   4 │         10  Carl                   2        1        0.93  Ms. Smith
   5 │          7  Tom                    1        1        0.88  Mr. Jackson
   6 │          5  Beth                   1        1        0.85  Mr. Jackson
   7 │          3  Jim                    1        1        0.81  Mr. Jackson
   8 │          8  Liz                    2        1        0.77  Ms. Smith
   9 │          9  Bill                   1        1        0.75  Mr. Jackson
  10 │          6  Alex                   2        1        0.73  Ms. Smith

Split-apply-combine

Now that we have some basics down, it's time to dive into aggregating results. In DataFrames.jl this is referred to as a split-apply-combine strategy. It is a bit of a mouthful, but let's walk through what exactly this is referring to.

Split is simply breaking the DataFrame into groups using the groupby function. In our example lets split our DataFrame by the teacher_name column.

julia> grouped_df = groupby(result_df, :teacher_name)
GroupedDataFrame with 2 groups based on key: teacher_name
First Group (5 rows): teacher_name = "Mr. Jackson"
 Row │ student_id  student_name  teacher_id  exam_id  grade     teacher_name 
     │ Int64       String        Int64       Int64?   Float64?  String       
─────┼───────────────────────────────────────────────────────────────────────
   1 │          1  Joe                    1        1      0.95  Mr. Jackson
   2 │          3  Jim                    1        1      0.81  Mr. Jackson
   3 │          5  Beth                   1        1      0.85  Mr. Jackson
   4 │          7  Tom                    1        1      0.88  Mr. Jackson
   5 │          9  Bill                   1        1      0.75  Mr. Jackson
⋮
Last Group (5 rows): teacher_name = "Ms. Smith"
 Row │ student_id  student_name  teacher_id  exam_id  grade       teacher_name 
     │ Int64       String        Int64       Int64?   Float64?    String       
─────┼─────────────────────────────────────────────────────────────────────────
   1 │          2  Sally                  2        1        0.93  Ms. Smith
   2 │          6  Alex                   2        1        0.73  Ms. Smith
   3 │          8  Liz                    2        1        0.77  Ms. Smith
   4 │         10  Carl                   2        1        0.93  Ms. Smith
   5 │          4  Sandy                  2  missing  missing     Ms. Smith

The result of calling groupby is of type GroupedDataFrame, which is basically a wrapper around one, or many, groups of a DataFrame. In our example we have two teachers and so the result GroupedDataFrame has two groups.

Now, lets try to get an average exam grade for our two teachers. This will introduce the combine function that takes a GroupedDataFrame and any number of aggregation functions. Let's also add the Statistics.jl package, so we can take advantage of the mean function.

julia> using Statistics

julia> combine(grouped_df, :grade => mean)
2×2 DataFrame
 Row │ teacher_name  grade_mean  
     │ String        Float64?    
─────┼───────────────────────────
   1 │ Mr. Jackson         0.848
   2 │ Ms. Smith     missing

The result is a DataFrame where the first column(s) will match our GroupedDataFrame key(s) and the subsequent column(s) will match the function(s) we pass for aggregation. However, Ms. Smith has a grade_mean of missing!?

In our earlier discussion we found that Sandy missed the exam, so her grade was set to missing. A missing value behaves differently than normal numbers, which is problematic in our aggregation function. Take a look at a very simple example.

julia> 1 + missing
missing

We notice that adding a value of 1 to missing equals missing. This is a necessary evil and you may be wondering why don't we just treat it as 0? Let's see what happens to our results if we replace missing with 0.

julia> combine(grouped_df, :grade => (x -> mean(coalesce.(x, 0))))
2×2 DataFrame
 Row │ teacher_name  grade_function 
     │ String        Float64        
─────┼──────────────────────────────
   1 │ Mr. Jackson            0.848
   2 │ Ms. Smith              0.672

In the above example, instead of just passing mean as the function we create an anonymous function. This allows us to get a little more clever with adding a coalesce to replace the missing values with 0. We see from the results that Ms. Smith has a much lower scoring average than Mr. Jackson. But, if we think about it the results are getting incorrectly skewed. We know Sandy didn't actually score a 0, but rather didn't take the test at all. Treating her result as a 0 is skewing the average much lower than it should be.

In some cases replacing with a 0 would make sense, but not in this scenario. Here are a few better options:

We could just drop the rows that contain missing values prior to aggregation. DataFrames.jl provides a dropmissing function specifically for this.

julia> result_no_missing_df = dropmissing(result_df)
9×6 DataFrame
 Row │ student_id  student_name  teacher_id  exam_id  grade    teacher_name 
     │ Int64       String        Int64       Int64    Float64  String       
─────┼──────────────────────────────────────────────────────────────────────
   1 │          1  Joe                    1        1     0.95  Mr. Jackson
   2 │          2  Sally                  2        1     0.93  Ms. Smith
   3 │          3  Jim                    1        1     0.81  Mr. Jackson
   4 │          5  Beth                   1        1     0.85  Mr. Jackson
   5 │          6  Alex                   2        1     0.73  Ms. Smith
   6 │          7  Tom                    1        1     0.88  Mr. Jackson
   7 │          8  Liz                    2        1     0.77  Ms. Smith
   8 │          9  Bill                   1        1     0.75  Mr. Jackson
   9 │         10  Carl                   2        1     0.93  Ms. Smith

julia> grouped_no_missing_df = groupby(result_no_missing_df, :teacher_name)
GroupedDataFrame with 2 groups based on key: teacher_name
First Group (5 rows): teacher_name = "Mr. Jackson"
 Row │ student_id  student_name  teacher_id  exam_id  grade    teacher_name 
     │ Int64       String        Int64       Int64    Float64  String       
─────┼──────────────────────────────────────────────────────────────────────
   1 │          1  Joe                    1        1     0.95  Mr. Jackson
   2 │          3  Jim                    1        1     0.81  Mr. Jackson
   3 │          5  Beth                   1        1     0.85  Mr. Jackson
   4 │          7  Tom                    1        1     0.88  Mr. Jackson
   5 │          9  Bill                   1        1     0.75  Mr. Jackson
⋮
Last Group (4 rows): teacher_name = "Ms. Smith"
 Row │ student_id  student_name  teacher_id  exam_id  grade    teacher_name 
     │ Int64       String        Int64       Int64    Float64  String       
─────┼──────────────────────────────────────────────────────────────────────
   1 │          2  Sally                  2        1     0.93  Ms. Smith
   2 │          6  Alex                   2        1     0.73  Ms. Smith
   3 │          8  Liz                    2        1     0.77  Ms. Smith
   4 │         10  Carl                   2        1     0.93  Ms. Smith

julia> combine(grouped_no_missing_df, :grade => mean)
2×2 DataFrame
 Row │ teacher_name  grade_mean 
     │ String        Float64    
─────┼──────────────────────────
   1 │ Mr. Jackson        0.848
   2 │ Ms. Smith          0.84

We now see that the two teachers average test scores are very similar. This approach would work well if we never again needed the rows containing missing values.

But, if we wanted to keep those rows around and rather just exclude them from certain calculations. We can make use of another function, skipmissing, which will simply skip over the missing values.

julia> combine(grouped_df, :grade => (x -> mean(skipmissing(x))))
2×2 DataFrame
 Row │ teacher_name  grade_function 
     │ String        Float64        
─────┼──────────────────────────────
   1 │ Mr. Jackson            0.848
   2 │ Ms. Smith              0.84

One last thing to note on missing values is that it is easy to identify if one, or more, of your DataFrame columns contains missing values. We talked earlier that DataFrames.jl infers the types for each column and displays them in the output. You'll notice in result_df that the column teacher_id is of datatype Int64 and exam_id is of Int64?. Here the ? denotes that missing values were found, so be careful!

Conclusion

We've touched on some of the topics that makes DataFrames.jl such a great general purpose package. It is a helpful tool to quickly interact for data exploration, or to be used in production code to manipulate tabular data. I hope you've enjoyed today's reading and be sure to check out the rest of our blog posts on blog.glcs.io!

Getting Started with DataFrames.jl: A Beginner's Guide

Introduction

Getting Started

Data Wrangling

Import / Export

Joins

Sorting

Split-apply-combine

Conclusion

Comments

Diving into Julia

Mastering Efficient Array Operations with StaticArrays.jl in Julia

More from this blog

How to Integrate Julia Code Within a Python Program

Simplifying Julia Package Integration with Extensions

Discover the Key Features and Updates in Julia 1.12

Integrating Julia and MATLAB: Julia and MATLAB Can Coexist

MATLAB vs Julia: Best Programming Language for Renewable Energy Simulations

Command Palette

Introduction

Getting Started

Data Wrangling

Import / Export

Joins

Sorting

Split-apply-combine

Conclusion

Comments

Diving into Julia

Mastering Efficient Array Operations with StaticArrays.jl in Julia

More from this blog