Common issues with CSV parsing and solutions to them

Although CSV is a straightforward format for data that can later be used in import/export flows, you may sometimes run into parsing issues. This article is a handy first aid kit when it comes to Ruby and CSV.

Fixing performance issues

Problem: you are parsing a large CSV file, and it takes a lot of time to process the whole file, or your server is running out of memory when the process is in the middle.

Solution: when processing large CSV files with Ruby, prefer the CSV.foreach method over CSV.read or CSV.parse. The reason is simple: while the read and parse methods load the whole file into memory and then process it, the foreach method processes the file line by line without loading all of the contents at once. By iterating over the file line by line, you keep memory usage low and get access to each row immediately.

Imagine that you are looking for one row in a file that consists of millions of rows. With CSV.foreach, you won't load all rows at once, and you can stop processing as soon as you find the row you are looking for.
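The streaming approach can be sketched as follows; the file contents and the "find user 42" goal are made-up examples, and a throwaway Tempfile stands in for a real CSV file:

```ruby
require "csv"
require "tempfile"

# Build a sample CSV file to demonstrate streaming (hypothetical data).
file = Tempfile.new(["users", ".csv"])
file.write("id,name\n")
(1..100_000).each { |i| file.write("#{i},user_#{i}\n") }
file.flush

found = nil
# CSV.foreach yields one row at a time, so memory stays flat and we
# can break out as soon as the row we need appears.
CSV.foreach(file.path, headers: true) do |row|
  if row["id"] == "42"
    found = row["name"]
    break
  end
end

found # => "user_42"
```

Note that row values arrive as strings by default, which is why the comparison is against "42" rather than the integer 42.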

Fixing formatting issues

Problem: you are parsing a CSV file where some of the values are integers, dates, or formats that require special parsing. By default, you would receive all values as strings, and you would have to perform additional parsing on your own.

Solution: to speed up the parsing process, you can use preprocessors, which the Ruby standard library calls converters. A preprocessor is a simple anonymous function that formats a value while the CSV file is being parsed.

By default, we have the following preprocessors available in the standard library:

  • :integer - parse values that are integers
  • :float - parse values that are floats
  • :numeric - parse values that are numbers (integers or floats)
  • :date - parse values that are dates
  • :date_time - parse values that are dates with the time
  • :all - use all available preprocessors

As I mentioned before, a preprocessor is just a simple anonymous function, so you can call it directly using the call method:

CSV::Converters[:float].call("2.1")
# => 2.1

When parsing a CSV, you can define which preprocessors you would like to use by passing the converters option:

csv = "first_name,age,birth_date\nJohn,30,2000-01-15"
CSV.parse(csv, headers: true, converters: [:integer, :date_time]).first
# => #<CSV::Row "first_name":"John" "age":30 "birth_date":#<DateTime: 2000-01-15T00:00:00+00:00 ((2451559j,0s,0n),+0s,2299161j)>>

As you can see in the above example, both age and birth_date attributes were appropriately formatted.

Custom parsing

If you want to parse something more than just a simple number or date, you can write your own converter in seconds. If you would like to parse every value that is a URL, you can write the following code:

url_converter = ->(val) {
  # Match only values that start with http:// or https://
  val.to_s.match?(%r{\Ahttps?://}) ? URI.parse(val) : val
}

CSV.parse("first_name,url\nJohn,https://google.com", converters: [url_converter])
# => [["first_name", "url"], ["John", #<URI::HTTPS https://google.com>]]
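A custom converter can also accept a second argument, a CSV::FieldInfo struct that exposes the field's header, index, and line, which lets you convert values only in specific columns. Here is a small sketch; the "country" column and its values are made up for illustration:

```ruby
require "csv"

# When a converter lambda takes two arguments, the CSV library passes
# a CSV::FieldInfo struct as the second one. We use its header to
# upcase values only in the hypothetical "country" column.
country_converter = lambda do |value, field_info|
  field_info.header == "country" ? value.upcase : value
end

row = CSV.parse("name,country\nJohn,pl",
                headers: true,
                converters: [country_converter]).first
row.to_h
# => {"name"=>"John", "country"=>"PL"}
```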

Fixing encoding issues

Problem: you would like to parse a CSV file that contains special characters, but you get a CSV::MalformedCSVError error at every attempt to parse the file’s contents.

Solution: specify the file’s encoding when parsing the file. You can do this by passing the encoding option with the name of the encoding:

CSV.read("./users.csv", encoding: "ISO-8859-1", headers: true)
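You can also ask Ruby to transcode while reading by using the "SOURCE:DESTINATION" encoding form. A minimal sketch, using a throwaway Tempfile with made-up Latin-1 contents in place of a real users file:

```ruby
require "csv"
require "tempfile"

# Create a sample file whose contents are ISO-8859-1 encoded:
# "Zürich" contains the byte 0xFC, which is invalid in UTF-8.
file = Tempfile.new(["cities", ".csv"])
file.binmode
file.write("name\nZ\xFCrich\n".b)
file.flush

# "ISO-8859-1:UTF-8" reads the bytes as Latin-1 and transcodes them,
# so the parsed values come back as regular UTF-8 strings.
rows = CSV.read(file.path, encoding: "ISO-8859-1:UTF-8", headers: true)
rows.first["name"]
# => "Zürich"
```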

Fixing duplication issues

Problem: you have to process a file where one header is duplicated, but you would like to keep all values because they are different.

Solution: pass the headers option and iterate over each row without transforming it into a hash, because converting a row into a hash makes the standard library drop the duplicated keys automatically:

CSV.parse("first_name,role,role\nJohn,admin,user", headers: true).first.to_h
# => {"first_name"=>"John", "role"=>"admin"}

To keep all values from the duplicated column, you can use the following approach:

data = []

CSV.parse("first_name,role,role\nJohn,admin,user", headers: true).each do |row|
  attributes = {}
  row.each_pair do |column, value|
    attributes[column] ||= []
    attributes[column] << value
  end

  data << attributes
end

data.map { |d| d.transform_values { |val| val.size == 1 ? val.first : val } }
# => [{"first_name"=>"John", "role"=>["admin", "user"]}]

It is helpful to know how the CSV library behaves when it comes to parsing duplicated headers.