Getting Started with DataFrames.jl: A Beginner's Guide

Getting Started with DataFrames.jl: A Beginner's Guide

Mastering Tabular Data in Julia: A Simple Guide for Beginners.

·

14 min read

When doing any sort of development one will often find themselves in need of working with data in a tabular format. This is especially true for those of us in data science, or data analysis, fields. In the Julia programming language one of the more popular libraries for this type of data wrangling is DataFrames.jl. In this blog post we'll explore the beginnings of working with this package.

Introduction

The great thing about a package like Dataframes.jl is that it bridges the gap between traditional programming and SQL (Structured Query Language). Databases are great tools for easily gaining insights into your data by joining, filtering, aggregating, sorting, etc... Dataframes.jl brings those goodies right into your hands by simply adding the package into your julia session. So, lets get started!

Getting Started

Adding the package is a few simple steps.

julia> using Pkg
julia> Pkg.add("DataFrames")
julia> using DataFrames

The constructor for a DataFrame provides flexibility to create from arrays, tuples, constants, or files. The documentation covers all these, but for this post we'll just explore one of the more common ways.

julia> df = DataFrame(a = 1:4, b = rand(4), c = "My first DataFrame")
4×3 DataFrame
 Row │ a      b         c                  
     │ Int64  Float64   String             
─────┼─────────────────────────────────────
   11  0.141874  My first DataFrame
   22  0.432084  My first DataFrame
   33  0.47098   My first DataFrame
   44  0.414639  My first DataFrame

You'll notice in the code above we use a mix of datatypes including range, array, and scalar. The underlying vectors must be of the same size and the scalar gets broadcasted, or repeated, for each row. Also, pay attention that the types of each column are inferred based on the arrays passed into the constructor.

Now, to access a column of a DataFrame there are also a few different possibilities. Here are a few examples of accessing the first column "a".

julia> df.a
4-element Vector{Int64}:
 1
 2
 3
 4

julia> df."a"
4-element Vector{Int64}:
 1
 2
 3
 4

julia> df[!, "a"]
4-element Vector{Int64}:
 1
 2
 3
 4

julia> df[!, :a]
4-element Vector{Int64}:
 1
 2
 3
 4

julia> df[:, :a]
4-element Vector{Int64}:
 1
 2
 3
 4

In these examples columns can be access directly with literals such as df.a, or more dynamically using brackets (since variables could be substituted.) You may also find yourself wondering the difference between ! and :, which is an important distinction!

The ! returns the underlying vector and : returns a copy. This can be showcased in an example where we will attempt to change the description of the second value in column c to "I love Julia!"

julia> df[:, :c][2] = "I love Julia!"
3

julia> df
4×3 DataFrame
 Row │ a      b         c                  
     │ Int64  Float64   String             
─────┼─────────────────────────────────────
   11  0.394165  My first DataFrame
   22  0.809883  My first DataFrame
   33  0.124035  My first DataFrame
   44  0.886781  My first DataFrame

julia> df[!, :c][2] = "I love Julia!"
3

julia> df
4×3 DataFrame
 Row │ a      b         c                  
     │ Int64  Float64   String             
─────┼─────────────────────────────────────
   11  0.394165  My first DataFrame
   23  0.809883  I love Julia!
   33  0.124035  My first DataFrame
   44  0.886781  My first DataFrame

Notice how the change will only persist to df when we access the column with !.

There is often a tradeoff between returning copies versus the actual underlying vectors. Returning a copy is generally considered safer since if the copy is later mutated the underlying DataFrame remains unchanged. However, with very large DataFrames copying every column access will result in an increase in memory. It is best to weigh those considerations and figure out what approach will work best for a given program.

Data Wrangling

Import / Export

Another great feature of the Julia programming language is that many different packages will interact well when used together. For instance, DataFrames.jl and CSV.jl can be used to very easily import and export data.

First, we can save the DataFrame from above to CSV.

julia> using CSV

julia> path = joinpath(homedir(), "my_df.csv")

julia> CSV.write(path, df)

And, reading in the DataFrame from file is just as easy!

julia> CSV.read(path, DataFrame)
4×3 DataFrame
 Row │ a      b         c                  
     │ Int64  Float64   String31           
─────┼─────────────────────────────────────
   11  0.601361  My first DataFrame
   22  0.178065  My first DataFrame
   33  0.729591  My first DataFrame
   44  0.280314  My first DataFrame

There are many keyword arguments to explore when handling csv files and the documentation is best for covering all of these CSV.jl.

DataFrames.jl also supports writing and reading to multiple files types such as Arrow, JSON, Parquet, and others.

Joins

A join is a way to merge data from two DataFrames into a single DataFrame. There are several types and they generally mimic the same types that a database would support.

  • innerjoin
  • leftjoin
  • rightjoin
  • outerjoin
  • semijoin
  • antijoin
  • crossjoin

Definitions of each can be found in either the documentation, or docstrings, but lets take a look at a few examples. Say we have the following DataFrame sets containing information from a school.

julia> student_df = DataFrame(student_id = 1:10, student_name = ["Joe", "Sally", "Jim", "Sandy", "Beth", "Alex", "Tom", "Liz", "Bill", "Carl"], teacher_id = repeat([1,2],5))
10×3 DataFrame
 Row │ student_id  student_name  teacher_id 
     │ Int64       String        Int64      
─────┼──────────────────────────────────────
   11  Joe                    1
   22  Sally                  2
   33  Jim                    1
   44  Sandy                  2
   55  Beth                   1
   66  Alex                   2
   77  Tom                    1
   88  Liz                    2
   99  Bill                   1
  1010  Carl                   2

julia> teacher_df = DataFrame(teacher_id = 1:2, teacher_name = ["Mr. Jackson", "Ms. Smith"])
2×2 DataFrame
 Row │ teacher_id  teacher_name 
     │ Int64       String       
─────┼──────────────────────────
   11  Mr. Jackson
   22  Ms. Smith

julia> grade_df = DataFrame(exam_id = 1, student_id = vcat(1:3, 5:10), grade = [0.95, 0.93, 0.81, 0.85, 0.73, 0.88, 0.77, 0.75, 0.93])
9×3 DataFrame
 Row │ exam_id  student_id  grade   
     │ Int64    Int64       Float64 
─────┼──────────────────────────────
   11           1     0.95
   21           2     0.93
   31           3     0.81
   41           5     0.85
   51           6     0.73
   61           7     0.88
   71           8     0.77
   81           9     0.75
   91          10     0.93

If we look at the grade_df we can see there are 9 results, but in the student_df we have 10 students. So, someone must have missed the exam! Let's find out who that way we can alert the teacher to schedule a makeup.

Let's do a leftjoin, which means every row will persist from the first DataFrame regardless if there is a match to the second DataFrame. The leftjoin function also takes an on keyword argument to signify what column needs to be used to find matches.

julia> student_grade_df = leftjoin(student_df, grade_df, on=:student_id)
10×5 DataFrame
 Row │ student_id  student_name  teacher_id  exam_id  grade      
     │ Int64       String        Int64       Int64?   Float64?   
─────┼───────────────────────────────────────────────────────────
   11  Joe                    1        1        0.95
   22  Sally                  2        1        0.93
   33  Jim                    1        1        0.81
   45  Beth                   1        1        0.85
   56  Alex                   2        1        0.73
   67  Tom                    1        1        0.88
   78  Liz                    2        1        0.77
   89  Bill                   1        1        0.75
   910  Carl                   2        1        0.93
  104  Sandy                  2  missing  missing

We notice Sandy has a missing value for both the exam_id and grade fields. missing is a special datatype in Julia that is similar to a null value in databases. This would signify to us that there was no match in the grade_df meaning Sandy missed the exam. We can add one more join to get the respective teacher's name.

julia> result_df = innerjoin(student_grade_df, teacher_df, on=:teacher_id)
10×6 DataFrame
 Row │ student_id  student_name  teacher_id  exam_id  grade       teacher_name 
     │ Int64       String        Int64       Int64?   Float64?    String       
─────┼─────────────────────────────────────────────────────────────────────────
   11  Joe                    1        1        0.95  Mr. Jackson
   22  Sally                  2        1        0.93  Ms. Smith
   33  Jim                    1        1        0.81  Mr. Jackson
   45  Beth                   1        1        0.85  Mr. Jackson
   56  Alex                   2        1        0.73  Ms. Smith
   67  Tom                    1        1        0.88  Mr. Jackson
   78  Liz                    2        1        0.77  Ms. Smith
   89  Bill                   1        1        0.75  Mr. Jackson
   910  Carl                   2        1        0.93  Ms. Smith
  104  Sandy                  2  missing  missing     Ms. Smith

We used an innerjoin this time since we know that every student would have a teacher assigned. Now, we can let Ms. Smith know that she needs to reach out to Sandy to re-schedule her exam.

Sorting

Another helpful function for analysis is sort. Let's sort our result_df by the grade column.

julia> sort(result_df, [:grade])
10×6 DataFrame
 Row │ student_id  student_name  teacher_id  exam_id  grade       teacher_name 
     │ Int64       String        Int64       Int64?   Float64?    String       
─────┼─────────────────────────────────────────────────────────────────────────
   16  Alex                   2        1        0.73  Ms. Smith
   29  Bill                   1        1        0.75  Mr. Jackson
   38  Liz                    2        1        0.77  Ms. Smith
   43  Jim                    1        1        0.81  Mr. Jackson
   55  Beth                   1        1        0.85  Mr. Jackson
   67  Tom                    1        1        0.88  Mr. Jackson
   72  Sally                  2        1        0.93  Ms. Smith
   810  Carl                   2        1        0.93  Ms. Smith
   91  Joe                    1        1        0.95  Mr. Jackson
  104  Sandy                  2  missing  missing     Ms. Smith

The function takes the DataFrame and an array of columns to sort on. Our result of sort is putting the lowest grade first, but if we wanted it descending we can pass a rev keyword argument.

julia> sort(result_df, [:grade], rev=true)
10×6 DataFrame
 Row │ student_id  student_name  teacher_id  exam_id  grade       teacher_name 
     │ Int64       String        Int64       Int64?   Float64?    String       
─────┼─────────────────────────────────────────────────────────────────────────
   14  Sandy                  2  missing  missing     Ms. Smith
   21  Joe                    1        1        0.95  Mr. Jackson
   32  Sally                  2        1        0.93  Ms. Smith
   410  Carl                   2        1        0.93  Ms. Smith
   57  Tom                    1        1        0.88  Mr. Jackson
   65  Beth                   1        1        0.85  Mr. Jackson
   73  Jim                    1        1        0.81  Mr. Jackson
   88  Liz                    2        1        0.77  Ms. Smith
   99  Bill                   1        1        0.75  Mr. Jackson
  106  Alex                   2        1        0.73  Ms. Smith

In both these cases a copy of the DataFrame is returned and the result_df is left unchanged. But, if we wanted to sort in-place we can also use the sort! function that will update the passed DataFrame.

julia> sort!(result_df, [:grade], rev=true)
10×6 DataFrame
 Row │ student_id  student_name  teacher_id  exam_id  grade       teacher_name 
     │ Int64       String        Int64       Int64?   Float64?    String       
─────┼─────────────────────────────────────────────────────────────────────────
   14  Sandy                  2  missing  missing     Ms. Smith
   21  Joe                    1        1        0.95  Mr. Jackson
   32  Sally                  2        1        0.93  Ms. Smith
   410  Carl                   2        1        0.93  Ms. Smith
   57  Tom                    1        1        0.88  Mr. Jackson
   65  Beth                   1        1        0.85  Mr. Jackson
   73  Jim                    1        1        0.81  Mr. Jackson
   88  Liz                    2        1        0.77  Ms. Smith
   99  Bill                   1        1        0.75  Mr. Jackson
  106  Alex                   2        1        0.73  Ms. Smith

Split-apply-combine

Now that we have some basics down, it's time to dive into aggregating results. In DataFrames.jl this is referred to as a split-apply-combine strategy. It is a bit of a mouthful, but let's walk through what exactly this is referring to.

Split is simply breaking the DataFrame into groups using the groupby function. In our example lets split our DataFrame by the teacher_name column.

julia> grouped_df = groupby(result_df, :teacher_name)
GroupedDataFrame with 2 groups based on key: teacher_name
First Group (5 rows): teacher_name = "Mr. Jackson"
 Row │ student_id  student_name  teacher_id  exam_id  grade     teacher_name 
     │ Int64       String        Int64       Int64?   Float64?  String       
─────┼───────────────────────────────────────────────────────────────────────
   11  Joe                    1        1      0.95  Mr. Jackson
   23  Jim                    1        1      0.81  Mr. Jackson
   35  Beth                   1        1      0.85  Mr. Jackson
   47  Tom                    1        1      0.88  Mr. Jackson
   59  Bill                   1        1      0.75  Mr. Jackson
⋮
Last Group (5 rows): teacher_name = "Ms. Smith"
 Row │ student_id  student_name  teacher_id  exam_id  grade       teacher_name 
     │ Int64       String        Int64       Int64?   Float64?    String       
─────┼─────────────────────────────────────────────────────────────────────────
   12  Sally                  2        1        0.93  Ms. Smith
   26  Alex                   2        1        0.73  Ms. Smith
   38  Liz                    2        1        0.77  Ms. Smith
   410  Carl                   2        1        0.93  Ms. Smith
   54  Sandy                  2  missing  missing     Ms. Smith

The result of calling groupby is of type GroupedDataFrame, which is basically a wrapper around one, or many, groups of a DataFrame. In our example we have two teachers and so the result GroupedDataFrame has two groups.

Now, lets try to get an average exam grade for our two teachers. This will introduce the combine function that takes a GroupedDataFrame and any number of aggregation functions. Let's also add the Statistics.jl package, so we can take advantage of the mean function.

julia> using Statistics

julia> combine(grouped_df, :grade => mean)
2×2 DataFrame
 Row │ teacher_name  grade_mean  
     │ String        Float64?    
─────┼───────────────────────────
   1 │ Mr. Jackson         0.848
   2 │ Ms. Smith     missing

The result is a DataFrame where the first column(s) will match our GroupedDataFrame key(s) and the subsequent column(s) will match the function(s) we pass for aggregation. However, Ms. Smith has a grade_mean of missing!?

In our earlier discussion we found that Sandy missed the exam, so her grade was set to missing. A missing value behaves differently than normal numbers, which is problematic in our aggregation function. Take a look at a very simple example.

julia> 1 + missing
missing

We notice that adding a value of 1 to missing equals missing. This is a necessary evil and you may be wondering why don't we just treat it as 0? Let's see what happens to our results if we replace missing with 0.

julia> combine(grouped_df, :grade => (x -> mean(coalesce.(x, 0))))
2×2 DataFrame
 Row │ teacher_name  grade_function 
     │ String        Float64        
─────┼──────────────────────────────
   1 │ Mr. Jackson            0.848
   2 │ Ms. Smith              0.672

In the above example, instead of just passing mean as the function we create an anonymous function. This allows us to get a little more clever with adding a coalesce to replace the missing values with 0. We see from the results that Ms. Smith has a much lower scoring average than Mr. Jackson. But, if we think about it the results are getting incorrectly skewed. We know Sandy didn't actually score a 0, but rather didn't take the test at all. Treating her result as a 0 is skewing the average much lower than it should be.

In some cases replacing with a 0 would make sense, but not in this scenario. Here are a few better options:

We could just drop the rows that contain missing values prior to aggregation. DataFrames.jl provides a dropmissing function specifically for this.

julia> result_no_missing_df = dropmissing(result_df)
9×6 DataFrame
 Row │ student_id  student_name  teacher_id  exam_id  grade    teacher_name 
     │ Int64       String        Int64       Int64    Float64  String       
─────┼──────────────────────────────────────────────────────────────────────
   11  Joe                    1        1     0.95  Mr. Jackson
   22  Sally                  2        1     0.93  Ms. Smith
   33  Jim                    1        1     0.81  Mr. Jackson
   45  Beth                   1        1     0.85  Mr. Jackson
   56  Alex                   2        1     0.73  Ms. Smith
   67  Tom                    1        1     0.88  Mr. Jackson
   78  Liz                    2        1     0.77  Ms. Smith
   89  Bill                   1        1     0.75  Mr. Jackson
   910  Carl                   2        1     0.93  Ms. Smith

julia> grouped_no_missing_df = groupby(result_no_missing_df, :teacher_name)
GroupedDataFrame with 2 groups based on key: teacher_name
First Group (5 rows): teacher_name = "Mr. Jackson"
 Row │ student_id  student_name  teacher_id  exam_id  grade    teacher_name 
     │ Int64       String        Int64       Int64    Float64  String       
─────┼──────────────────────────────────────────────────────────────────────
   11  Joe                    1        1     0.95  Mr. Jackson
   23  Jim                    1        1     0.81  Mr. Jackson
   35  Beth                   1        1     0.85  Mr. Jackson
   47  Tom                    1        1     0.88  Mr. Jackson
   59  Bill                   1        1     0.75  Mr. Jackson
⋮
Last Group (4 rows): teacher_name = "Ms. Smith"
 Row │ student_id  student_name  teacher_id  exam_id  grade    teacher_name 
     │ Int64       String        Int64       Int64    Float64  String       
─────┼──────────────────────────────────────────────────────────────────────
   12  Sally                  2        1     0.93  Ms. Smith
   26  Alex                   2        1     0.73  Ms. Smith
   38  Liz                    2        1     0.77  Ms. Smith
   410  Carl                   2        1     0.93  Ms. Smith

julia> combine(grouped_no_missing_df, :grade => mean)
2×2 DataFrame
 Row │ teacher_name  grade_mean 
     │ String        Float64    
─────┼──────────────────────────
   1 │ Mr. Jackson        0.848
   2 │ Ms. Smith          0.84

We now see that the two teachers average test scores are very similar. This approach would work well if we never again needed the rows containing missing values.

But, if we wanted to keep those rows around and rather just exclude them from certain calculations. We can make use of another function, skipmissing, which will simply skip over the missing values.

julia> combine(grouped_df, :grade => (x -> mean(skipmissing(x))))
2×2 DataFrame
 Row │ teacher_name  grade_function 
     │ String        Float64        
─────┼──────────────────────────────
   1 │ Mr. Jackson            0.848
   2 │ Ms. Smith              0.84

One last thing to note on missing values is that it is easy to identify if one, or more, of your DataFrame columns contains missing values. We talked earlier that DataFrames.jl infers the types for each column and displays them in the output. You'll notice in result_df that the column teacher_id is of datatype Int64 and exam_id is of Int64?. Here the ? denotes that missing values were found, so be careful!

Conclusion

We've touched on some of the topics that makes DataFrames.jl such a great general purpose package. It is a helpful tool to quickly interact for data exploration, or to be used in production code to manipulate tabular data. I hope you've enjoyed today's reading and be sure to check out the rest of our blog posts on blog.glcs.io!