10.3. File Processing

A crucial part of many programs is interacting with the file system. This is often called I/O (Input/Output). Your program might need to read input data from a file, or write its results to a file for later use.

Julia provides a straightforward and powerful set of tools for file I/O. Let’s cover the essentials.

10.3.1. Reading Text Files

Let’s assume we have a simple text file named test_file.txt in the same directory as this notebook. Reading its contents is a fundamental task.

10.3.1.1. The open/close Pattern

The basic process for interacting with a file involves three steps:

  1. open the file to get a stream object, which is like a connection to the file’s content.

  2. Read from (or write to) the stream.

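  3. close the file to release it back to the operating system. This step matters: output written to a stream may sit in a buffer until the stream is closed, so skipping close can leave a written file incomplete, and every open stream holds an operating-system file handle.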

Here is the low-level way to read a file line by line:

# 1. Open the file for reading.
f = open("test_file.txt")

# 2. Loop until we reach the end-of-file (eof).
while !eof(f)
    # Read the next line from the stream.
    str = readline(f)
    println("Read line: ", str)
end

# 3. CRITICAL: Always close the file when you are done.
close(f)
Read line: This is a test file
Read line: ===================
Read line: 
Read line: This is line #4
Read line: Here are some comma-separated numbers:
Read line: 
Read line: 1,2,3,4,5
Read line: 5,-4,3e3,2.0,1
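
If you do manage the stream by hand, a try/finally block is one way to make sure close still runs when an error occurs partway through reading. The following is only a sketch: the function name count_lines and the scratch file are made up for illustration, so the example runs without test_file.txt.

```julia
# Hypothetical scratch file, so the sketch does not depend on test_file.txt.
path = joinpath(mktempdir(), "try_finally_demo.txt")
write(path, "alpha\nbeta\ngamma\n")

# Hypothetical helper: counts lines using the explicit open/close pattern.
function count_lines(path)
    f = open(path)
    try
        n = 0
        while !eof(f)
            readline(f)
            n += 1
        end
        return n
    finally
        # Runs whether the loop finished normally or threw an error.
        close(f)
    end
end

println("Lines read: ", count_lines(path))
```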

10.3.1.2. Easier and Safer Methods

Because it’s easy to forget close(f), Julia provides more convenient and safer ways to work with files that handle closing automatically. You should always prefer these methods.

The eachline function is the most common way to iterate through a file’s lines. It takes care of opening and closing the file for you.

# This loop does the same thing as the code above, but is more concise and safer.
for line in eachline("test_file.txt")
    println("Read line: ", line)
end
Read line: This is a test file
Read line: ===================
Read line: 
Read line: This is line #4
Read line: Here are some comma-separated numbers:
Read line: 
Read line: 1,2,3,4,5
Read line: 5,-4,3e3,2.0,1
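
Another form that closes the file automatically is open with a do block, which is useful when you need the stream itself rather than just its lines. The sketch below uses a made-up scratch file and a hypothetical variable name (line_lengths) so that it runs on its own.

```julia
# Hypothetical scratch file, so the sketch does not depend on test_file.txt.
path = joinpath(mktempdir(), "do_block_demo.txt")
write(path, "first line\nsecond line\n")

# `open(path) do f ... end` passes the stream to the block and closes it
# when the block finishes, even if an error is thrown inside it.
# The value of the block becomes the value of the whole `open` call.
line_lengths = open(path) do f
    [length(line) for line in eachline(f)]
end

println(line_lengths)   # prints [10, 11]
```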

If you need to read the entire file into a single string, use read.

# This "slurps" the whole file content, including newline characters, into one variable.
# Best for smaller files to avoid using too much memory.
full_content = read("test_file.txt", String)
"This is a test file\n===================\n\nThis is line #4\nHere are some comma-separated numbers:\n\n1,2,3,4,5\n5,-4,3e3,2.0,1"

Alternatively, you can read the entire file into an array of strings, where each element is one line, using readlines.

# This also loads the entire file into memory.
lines = readlines("test_file.txt")
8-element Vector{String}:
 "This is a test file"
 "==================="
 ""
 "This is line #4"
 "Here are some comma-separated numbers:"
 ""
 "1,2,3,4,5"
 "5,-4,3e3,2.0,1"

Once you have the lines in an array, you can process them using standard array operations.

println("Line #2 says: ", lines[2])
println()
println("Here are all the lines with 1 to 18 characters:\n")
for line in lines
    if 1 ≤ length(line) ≤ 18
        println(line)
    end
end
Line #2 says: ===================

Here are all the lines with 1 to 18 characters:

This is line #4
1,2,3,4,5
5,-4,3e3,2.0,1
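
The same kind of processing can often be written without an explicit loop. As a rough sketch, the snippet below applies filter, split, and a broadcast parse to a hard-coded copy of a few of the sample lines; the variable names (sample_lines, short_lines, fields, numbers) are illustrative only.

```julia
# A hard-coded copy of a few lines, so the sketch runs without the file.
sample_lines = ["This is a test file", "", "This is line #4",
                "1,2,3,4,5", "5,-4,3e3,2.0,1"]

# Keep only the lines with 1 to 18 characters.
short_lines = filter(line -> 1 <= length(line) <= 18, sample_lines)

# Split the last line at the commas and parse each field as a Float64.
fields = split(sample_lines[end], ',')
numbers = parse.(Float64, fields)

println(short_lines)
println(numbers)
```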

10.3.2. Writing to Text Files

Writing to files follows a similar open/close pattern. When opening a file for writing, you must specify a mode. The most common are:

  • "w": write mode. Creates a new file or overwrites an existing file.

  • "a": append mode. Adds new content to the end of an existing file, creating the file if it does not exist yet.

# Open a file in write mode ("w"). If it exists, it will be erased!
f = open("created_data.txt", "w")

# `write` sends a string to the file stream.
write(f, "Here are some random strings:\n")
for i in 1:5
    str = String(rand('a':'z', 20)) # Create a random string
    write(f, str * "\n")
end

# `print` and `println` also work with file streams.
println(f, "\nHere are some Fibonacci numbers:")
x, y = 1, 1
print(f, "$x $y")
for i in 1:10
    x, y = y, x + y
    print(f, " $y")
end
println(f) # Add a final newline

# CRITICAL: Always close a file you've written to save the changes.
close(f)

# Let's read the file we just created to verify its contents.
for line in eachline("created_data.txt")
    println(line)
end
Here are some random strings:
xnluhsxogmyjyhvcekmq
zjggxsoxqixzsuczrnkg
byehkvyyrkbntouthzjf
vgsxpukrbapxioymgywm
yhjxvfdozcvolpwliwkk

Here are some Fibonacci numbers:
1 1 2 3 5 8 13 21 34 55 89 144
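
The example above only uses "w". To make the difference from "a" concrete, here is a small sketch that writes to a made-up scratch file three times; the file name and the variable names after_append and after_overwrite are assumptions for illustration.

```julia
# Hypothetical scratch file used only for this sketch.
path = joinpath(mktempdir(), "append_demo.txt")

# The do-block form of `open` closes the file when the block ends.
open(path, "w") do f
    println(f, "first run")
end

# "a" keeps the existing content and adds to the end.
open(path, "a") do f
    println(f, "second run")
end
after_append = readlines(path)

# "w" discards whatever was in the file before.
open(path, "w") do f
    println(f, "third run")
end
after_overwrite = readlines(path)

println(after_append)
println(after_overwrite)
```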

10.3.3. Working with Delimited Files (like CSVs)

A very common data format is a delimited file, where data is arranged in a grid and columns are separated by a special character (a delimiter), such as a comma (,), a tab (\t), or a space.

The standard DelimitedFiles package provides two handy functions for this:

  • writedlm(filename, A, delim): Writes an array A to a file, separating elements with delim.

  • readdlm(filename, delim): Reads data with delimiter delim from a file into an array.

using DelimitedFiles

# 1. Create some sample matrix data.
A = rand(-100:100, 8, 3)

# 2. Write the matrix to a file, using a comma as the delimiter.
writedlm("created_data.csv", A, ',')

# 3. Let's look at the raw file content.
println("--- File Content ---")
for line in eachline("created_data.csv")
    println(line)
end
println("--------------------\n")

# 4. Read the data back from the file into a new array.
B = readdlm("created_data.csv", ',')

# 5. Compare the original and read-back data. `readdlm` returns the numbers
#    as Float64 here, but `isequal` compares the values element by element.
isequal(A, B)
--- File Content ---
-73,15,90
-30,100,-89
-72,74,-25
75,-66,-75
34,-25,85
-27,33,54
-8,-41,-65
14,1,79
--------------------
true
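
One detail worth knowing: when no element type is given, readdlm returns purely numeric data as Float64, even if the file contains only integers. Passing a type as a third argument requests that type instead. The sketch below illustrates this on a small made-up matrix written to a hypothetical scratch file.

```julia
using DelimitedFiles

# Hypothetical scratch file and a small integer matrix for illustration.
path = joinpath(mktempdir(), "ints.csv")
A = [1 2 3; 4 5 6]
writedlm(path, A, ',')

B = readdlm(path, ',')        # no type given: numeric data comes back as Float64
C = readdlm(path, ',', Int)   # explicit element type: values are parsed as Int

println(eltype(B), " ", eltype(C))
```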

10.3.4. Example: Coded Triangle Numbers

Let’s solve Problem 42 from Project Euler, which combines file processing and string manipulation.

The nth term of the sequence of triangle numbers is given by, \(t_n = n(n+1)/2\); so the first ten triangle numbers are:

1, 3, 6, 10, 15, 21, 28, 36, 45, 55, ...

By converting each letter in a word to a number corresponding to its alphabetical position and adding these values we form a word value. For example, the word value for SKY is \(19 + 11 + 25 = 55 = t_{10}\). If the word value is a triangle number then we shall call the word a triangle word.

Using p042_words.txt (right click and ‘Save Link/Target As…’), a 16K text file containing nearly two-thousand common English words, how many are triangle words?

First, you’ll need to download the p042_words.txt file and place it in the same directory as this notebook.

# This function calculates the value of a word.
function word_value(word)
    # `collect(word)` creates a vector of Chars.
    # `.- 'A'` uses broadcasting to subtract the integer value of 'A' from each.
    # `.+ 1` converts from a 0-based index (A=0) to a 1-based index (A=1).
    # `sum` adds them all up.
    return sum(collect(word) .- 'A' .+ 1)
end

word_value("SKY")
55
# 1. Generate a list of triangle numbers to check against.
#    n = 1:50 reaches t_50 = 1275, enough for any word of up to 49 letters (49 × 26 = 1274).
trinums = Set([n*(n+1)÷2 for n in 1:50]) # A Set provides fast lookups.

# 2. Read the comma-delimited file of words into an array.
words = readdlm("p042_words.txt", ',', String)

# 3. Use a comprehension to check each word, then count the `true` results.
is_triangle_word = [word_value(word) ∈ trinums for word in words]
num_triangle_words = count(is_triangle_word)

println("There are $num_triangle_words triangle words in the list.")
There are 162 triangle words in the list.
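
As an aside, a precomputed set is not the only way to test for triangle numbers. Since \(t = n(n+1)/2\) is equivalent to \(8t + 1 = (2n+1)^2\), a positive integer t is a triangle number exactly when \(8t + 1\) is a perfect square. The sketch below uses this as an alternative check; the function name is_triangle is hypothetical and not part of the solution above.

```julia
# Hypothetical helper: t is a triangle number exactly when 8t + 1 is a perfect square.
# `isqrt` returns the integer square root, so no floating-point rounding is involved.
is_triangle(t::Integer) = t > 0 && isqrt(8t + 1)^2 == 8t + 1

# Recover the first ten triangle numbers by testing every integer up to 55.
first_ten = filter(is_triangle, 1:55)
println(first_ten)
```

With such a check there is no need to guess an upper bound like 1:50 in advance, at the cost of a little arithmetic per word.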

10.3.5. Advanced Parsing with Regular Expressions (Regex)

Sometimes data in files isn’t neatly delimited. For complex or irregular text patterns, you need a more powerful tool: Regular Expressions, or Regex.

Think of a regex as a mini-language for creating sophisticated search patterns. Most programming languages support some dialect of it (Julia's regexes are Perl-compatible, provided by the PCRE library), so it’s a valuable skill to learn. A great interactive tutorial can be found at RegexOne.

In Julia, you create a regex using the r"..." string macro. The match function then searches for the first occurrence of that pattern in a string.
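
Before applying this to a file, here is a minimal sketch of match on a single made-up string. It shows the two things you typically use from the result: the matched text (m.match) and the captured groups (m.captures). When nothing matches, match returns nothing.

```julia
# The input string and variable names are made up for illustration.
m = match(r"(-?\d+),(-?\d+)", "point: -12,7")

println(m.match)                    # the full matched text: -12,7
coords = parse.(Int, m.captures)    # one captured string per group, parsed to Int
println(coords)

# No digits in this string, so there is no match object to inspect.
no_match = match(r"\d+", "no digits here")
println(no_match === nothing)
```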

# This function parses a file where each line has three comma-separated integers.
function interpret_3digit_file(filename)
    # Store the results in a vector of vectors
    array = Vector{Int64}[]
    
    for line in eachline(filename)
        # Define the regex pattern.
        # `(-?\d+)` is a capture group:
        #   - `-?` matches an optional minus sign.
        #   - `\d+` matches one or more digits.
        pattern = r"(-?\d+),(-?\d+),(-?\d+)"
        
        # Find the first match in the current line.
        m = match(pattern, line)
        
        # If a match was found, parse the captured strings into integers.
        if m !== nothing
            push!(array, parse.(Int, m.captures))
        end
    end
    return array
end

# We can run this on the CSV file we created earlier.
interpret_3digit_file("created_data.csv")
8-element Vector{Vector{Int64}}:
 [-73, 15, 90]
 [-30, 100, -89]
 [-72, 74, -25]
 [75, -66, -75]
 [34, -25, 85]
 [-27, 33, 54]
 [-8, -41, -65]
 [14, 1, 79]

The eachmatch function is similar to match, but it returns an iterator that finds all non-overlapping matches in a string, not just the first one.

# This function parses a file, extracting all words from each line.
function interpret_words_file(filename)
    sentences = Vector{String}[]
    
    for line in eachline(filename)
        # Define the regex pattern.
        # `(\w+)` is a capture group that matches one or more "word" characters
        # (letters, digits, and underscore).
        pattern = r"(\w+)"
        
        # Find all matches in the line.
        matches_iterator = eachmatch(pattern, line)
        
        # Use a comprehension to extract the matched string from each match object.
        sentence = [m.match for m in matches_iterator]
        
        # Add the array of words to our list of sentences.
        if !isempty(sentence)
            push!(sentences, sentence)
        end
    end
    return sentences
end

interpret_words_file("test_file.txt")
5-element Vector{Vector{String}}:
 ["This", "is", "a", "test", "file"]
 ["This", "is", "line", "4"]
 ["Here", "are", "some", "comma", "separated", "numbers"]
 ["1", "2", "3", "4", "5"]
 ["5", "4", "3e3", "2", "0", "1"]
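
Notice in the last two rows that \w+ drops the minus sign of -4 and splits 2.0 into "2" and "0". If the goal is to extract numbers rather than words, a more specific pattern helps. The following is one possible sketch, not the only way to write such a pattern; the names number_pattern, tokens, and parsed are illustrative, and the input is a hard-coded copy of the last line of the sample file.

```julia
# Optional minus sign, digits, optional decimal part, optional exponent.
# `(?: ... )` groups without capturing.
number_pattern = r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"

line = "5,-4,3e3,2.0,1"

# Collect the text of every non-overlapping match, then parse each one.
tokens = [m.match for m in eachmatch(number_pattern, line)]
parsed = parse.(Float64, tokens)

println(tokens)
println(parsed)
```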