10.3. File Processing

Programs commonly use files, for example to save computed data or to read tables and other external data. Julia has extensive support for working with files, but here we focus on only the basic functionality.

10.3.1. Reading files

First we consider the most basic way to read a text file. We create an example text file named test_file.txt (e.g. using the Jupyter notebook or an editor), containing some lines of text.

The code below shows how to read each line of this file into a string, which can then be further processed in Julia (here it simply displays each line as a Julia string).

The function f = open(filename) returns a so-called stream f for accessing the data in the file filename. It throws an error if the operation cannot be completed, for example if the file does not exist.
The function eof(f) (end-of-file) returns true if the stream f has reached the end of the file.
The function readline(f) returns a string containing the next line in the stream f, without the trailing newline character.
The function close(f) closes the stream f.

f = open("test_file.txt")
while !eof(f)
    str = readline(f)
    display(str)
end
close(f)

"This is a test file"

"==================="

""

"This is line #4"

"Here are some comma-separated numbers:"

""

"1,2,3,4,5"

"5,-4,3e3,2.0,1"

The function eachline lets you do this in an easier way, and it also accepts a filename instead of a stream:

for line in eachline("test_file.txt")
    display(line)
end

"This is a test file"

"==================="

""

"This is line #4"

"Here are some comma-separated numbers:"

""

"1,2,3,4,5"

"5,-4,3e3,2.0,1"

You can also read the entire file into a single Julia string with the read function:

str = read("test_file.txt", String)

"This is a test file\n===================\n\nThis is line #4\nHere are some comma-separated numbers:\n\n1,2,3,4,5\n5,-4,3e3,2.0,1"

Alternatively, you can read the entire file into an array, with each line an element:

lines = readlines("test_file.txt")

8-element Vector{String}:
 "This is a test file"
 "==================="
 ""
 "This is line #4"
 "Here are some comma-separated numbers:"
 ""
 "1,2,3,4,5"
 "5,-4,3e3,2.0,1"

You can then access these strings using the usual array syntax, or loop over all of them:

println("Line #2 says: ", lines[2])
println()
println("Here are all the lines which have between 1 and 18 characters:\n")
for line in lines
    if 1 ≤ length(line) ≤ 18
        println(line)
    end
end

Line #2 says: ===================

Here are all the lines which have between 1 and 18 characters:

This is line #4
1,2,3,4,5
5,-4,3e3,2.0,1
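
As a sketch of the "further processing" mentioned earlier, a line of comma-separated numbers can be turned into an array of numbers with split and parse. The string below is typed in directly (it mirrors the last line of test_file.txt), and the variable names are only illustrative.

```julia
# One line of text, as it would be returned by readline or readlines
line = "5,-4,3e3,2.0,1"

# Split at the commas, giving a vector of substrings
fields = split(line, ',')

# Convert every substring to a Float64 (the dot broadcasts parse over the vector)
numbers = parse.(Float64, fields)

println(numbers)
```

Float64 is used here because entries such as 3e3 and 2.0 cannot be parsed as Int.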

10.3.2. Writing files

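Writing files works similarly: open the file with the mode "w" (write), which creates the file or overwrites an existing one, and then call write, print, or println with the stream as the first argument. The basic usage is demonstrated below: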

f = open("created_data.txt", "w")
for i = 1:5
    # Create random strings of letters
    str = String(rand('a':'z', 50))
    write(f, str * "\n")  # Write string to stream f
end
println(f) # println can be used with streams too

# Print Fibonacci numbers to file
x = y = 1
print(f, "$x $y")
for i = 1:50
    z = x + y
    x = y
    y = z
    print(f, " $z")
end
println(f)
close(f)

# Read file and print each line
for line in eachline("created_data.txt")
    println(line)
end

xzoytpdpkzpnebtfaueervjukbqiyezlcyzkihzmmttsgfvqsp
hulojjuagvkwviizodevqrjylaukijmvupazrndtampanionvp
lofssumggaxrffpuczxjztdabewywbysomlegwhsrhzknbgnog
fwjpnzunuqswaflkpludowtmtgtnqvujrodlyfeftdlcaqifcx
jptwhttqguafjkczbslgbkoeydxxjuxdcitwerieiyyhwsqmcp

1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1346269 2178309 3524578 5702887 9227465 14930352 24157817 39088169 63245986 102334155 165580141 267914296 433494437 701408733 1134903170 1836311903 2971215073 4807526976 7778742049 12586269025 20365011074 32951280099
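
A common variation, sketched below, is to pass a do block to open. The stream is then closed automatically when the block ends, even if an error occurs inside it, so no explicit close is needed. The sketch also shows the mode "a", which appends to an existing file instead of overwriting it. The file name log.txt and its location in a temporary directory are assumptions made only to keep the example self-contained.

```julia
# Hypothetical file in a fresh temporary directory
path = joinpath(mktempdir(), "log.txt")

# "w" creates the file, or truncates it if it already exists
open(path, "w") do f
    println(f, "first run")
end

# "a" keeps the existing content and writes at the end
open(path, "a") do f
    println(f, "second run")
end

# Read the file back to check its content
println(readlines(path))
```

The same do-block form works for reading, for example open(filename) do f ... end.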

10.3.3. Delimited files

The DelimitedFiles package contains two convenient functions for reading and writing arrays of data:

writedlm(filename, A, delim) writes the array A to the file filename, using the character or string delim between the elements of each row.
readdlm(filename, delim, T) reads an array from a file in a similar way, with the (optional) element type T.

The code below demonstrates these functions.

using DelimitedFiles

# Write file
A = rand(-100:100, 8,3)    # Sample data
writedlm("created_data.txt", A, ',')

# Print file
for line in eachline("created_data.txt")
    println(line)
end

# Read into array
B = readdlm("created_data.txt", ',')

isequal(A, B) # Check that the values match (B is read back as Float64)

-44,-58,46
72,55,78
4,-90,-12
92,-58,-97
70,-32,-61
56,52,74
15,-82,-16
-89,50,-21

true
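
The optional element type T matters when the type of the result is important. The sketch below, which uses a small hand-written matrix and a hypothetical file name in a temporary directory, compares reading with and without T.

```julia
using DelimitedFiles

# Hypothetical file in a fresh temporary directory
path = joinpath(mktempdir(), "matrix.csv")

A = [1 -2 3; 4 5 -6]
writedlm(path, A, ',')

B = readdlm(path, ',')        # no type given: numeric data is read as Float64
C = readdlm(path, ',', Int)   # element type requested explicitly

println(eltype(B), " ", eltype(C))
println(A == C)
```

Here B has the same values as A but a different element type, while C matches A in both values and type.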

10.3.4. Example: Coded triangle numbers

Project Euler, problem 42:

The \(n^{\text{th}}\) term of the sequence of triangle numbers is given by \(t_n = n(n+1)/2\), so the first ten triangle numbers are:
1, 3, 6, 10, 15, 21, 28, 36, 45, 55, ...
By converting each letter in a word to a number corresponding to its alphabetical position and adding these values we form a word value. For example, the word value for SKY is \(19 + 11 + 25 = 55 = t_{10}\). If the word value is a triangle number then we shall call the word a triangle word.

Using p042_words.txt (right click and ‘Save Link/Target As…’), a 16K text file containing nearly two-thousand common English words, how many are triangle words?

function word_value(word)
    return sum(collect(word) .- 'A' .+ 1)
end
word_value("SKY")

trinums = [n*(n+1)÷2 for n = 1:50]
words = readdlm("p042_words.txt", ',', String)
nbrtriwords = count([word_value(word) ∈ trinums for word in words])
println("There are $nbrtriwords triangle words in the list")

There are 162 triangle words in the list
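
The solution above tests membership in a precomputed list of triangle numbers. As an alternative sketch (not the method used above), one can test a value directly: solving \(v = n(n+1)/2\) for \(n\) shows that \(v\) is a triangle number exactly when \(8v + 1\) is a perfect square. The helper name is_triangular is hypothetical, and the word value function assumes words made of the uppercase letters A to Z only.

```julia
# Word value: sum of alphabetical positions (assumes uppercase A-Z only)
word_value(word) = sum(c - 'A' + 1 for c in word)

# v is a triangle number exactly when 8v + 1 is a perfect square
function is_triangular(v)
    r = isqrt(8v + 1)          # integer square root, rounded down
    return r * r == 8v + 1
end

println(word_value("SKY"))
println(is_triangular(word_value("SKY")))
```

This avoids having to guess how many triangle numbers need to be tabulated in advance.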

10.3.5. Regex

Regular expressions (regex) are used to extract information from strings (often read from files) in a systematic way. A regex defines a search pattern, and matching it against a string finds the text that fits the pattern. The regex syntax is more cryptic than most Julia code, but essentially the same syntax is used across (almost) all programming languages. The basics are therefore well worth learning, for instance through this tutorial.

In Julia, a regex is written as a string literal prefixed with the character r, such as r"word \d+". One searches a string using the match(Regex, String) function. It returns nothing if the pattern is not found. Otherwise it returns a RegexMatch object m, where m.match holds the matched text and m.captures holds the substrings captured by the parenthesized groups of the pattern. For most purposes we only need these captured substrings, accessed as an array with m.captures.
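
A minimal sketch of match on two made-up strings is shown below. The pattern adds a parenthesized group around \d+ so that the number is available as a capture.

```julia
pattern = r"word (\d+)"

m = match(pattern, "this is word 42 in the text")
if m !== nothing
    println(m.match)                    # the full matched text
    println(m.captures)                 # the parenthesized groups
    println(parse(Int, m.captures[1]))  # first group converted to an integer
end

# When the pattern does not occur, match returns nothing
no_match = match(pattern, "no numbers here")
println(no_match === nothing)
```

Checking for nothing before using m.captures avoids an error on lines that do not fit the pattern.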

For instance, if a file contains many lines, each with 3 integers separated by commas, you could parse it into an array using the following function.

function interpret_3digit_file(filename)
    array = zeros(Int64, 3, countlines(filename))
    for (iteration,line) in enumerate(eachline(filename))
        # Capture the 3 integers in this line
        pattern = Regex("(-?\\d+),(-?\\d+),(-?\\d+)") # more easily written as r"(-?\d+),(-?\d+),(-?\d+)"
        m = match(pattern, line)
        # Convert all 3 captured strings into integers
        array[:,iteration] = parse.(Int, m.captures)
    end
    return array
end

interpret_3digit_file("created_data.txt")

3×8 Matrix{Int64}:
 -44  72    4   92   70  56   15  -89
 -58  55  -90  -58  -32  52  -82   50
  46  78  -12  -97  -61  74  -16  -21

The function eachmatch(Regex, String) works like match, but instead of only the first match it returns an iterator over all non-overlapping matches in the string.
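
The difference between the two functions can be seen in a small sketch on a made-up string:

```julia
text = "a1b22c333"

# match stops at the first run of digits
first_match = match(r"\d+", text)

# eachmatch yields one RegexMatch per run of digits; collect the matched text
all_matches = [m.match for m in eachmatch(r"\d+", text)]

println(first_match.match)
println(all_matches)
```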

If a file contains many lines, each made up of sentence fragments of English words, you could parse it into an array of arrays using the following function.

function interpret_words_file(filename)
    sentences = Array[]
    for line in eachline(filename)
        # Capture all words in this line
        pattern = Regex("(\\w+)") # more easily written as r"(\w+)"
        ms = eachmatch(pattern, line)
        # Collect all words into an array
        sentence = [m.match for m in ms]
        # Push current array into total array
        if !isempty(sentence)
            push!(sentences, sentence)
        end
    end
    return sentences
end

interpret_words_file("test_file.txt")

5-element Vector{Array}:
 SubString{String}["This", "is", "a", "test", "file"]
 SubString{String}["This", "is", "line", "4"]
 SubString{String}["Here", "are", "some", "comma", "separated", "numbers"]
 SubString{String}["1", "2", "3", "4", "5"]
 SubString{String}["5", "4", "3e3", "2", "0", "1"]