Switching from Python to Rust for extracting data from Git repositories

Background #

Python is great for collecting and manipulating data, and it is usually quick and easy to get something running. However, I recently ran into issues, mostly related to encodings, when working with emails in mbox format and with Git repositories. Specifically, parsing emails failed because of broken character encodings, and there were some exceptions that I was not able to catch. Maybe I missed something, but I decided to see whether the same problems would occur in Rust.

Switching to Rust #

I started with Rust by parsing a few commits that were problematic in Python. Testing a small set of problematic commits seemed the best choice before committing to rewriting my code in Rust. It turned out that the encoding issues disappeared and I was able to output the data I needed. In the following sections, I show how I used Rust to extract data from Git repositories and from the mbox format.

Extracting data from repositories using git2-rs crate #

For working with Git repositories, I used the git2-rs crate, which provides libgit2 bindings for Rust.

Extracting data from emails stored as mbox using mail-parser crate #

There are several crates for working with emails in mbox format, but none of them is as mature as the libraries available in other ecosystems.

mbox-reader - has not been updated for the past two years
mailbox - last update was in summer 2017
lettre - excellent email library, though it does not support email parsing
mail-parser - fairly new, actively developed library that explicitly mentions it can decode messages in 41 different character sets, including obsolete formats such as UTF-7

The mail-parser crate sounded promising so I decided to use that. However, I quickly realized that there was no way to read and parse an mbox file. The parser needed an email as a string. Luckily, the maintainers were kind enough to add support for reading mbox files, and I was able to quickly get it up and running.

Here is a code snippet showing how one could use the mail-parser crate to parse an mbox file.


// Assumption: these import paths match the mail-parser release that first
// shipped mbox support; the module layout may differ in other releases.
use mail_parser::{parsers::mbox::MBoxParser, HeaderValue, Message};

#[derive(Debug)]
struct EmailData {
    address: Option<String>,
    from: Option<String>,
    message_id: Option<String>,
    date: Option<String>,
}

fn parse_mbox() -> Vec<EmailData> {
    let p = "path_to_mbox";
    let mbox_file = match std::fs::File::open(p) {
        Ok(val) => val,
        Err(e) => panic!("Error while opening the mbox file: {}", e),
    };

    let mut emails = Vec::new();
    for raw_message in MBoxParser::new(mbox_file) {
        // Skip messages that cannot be parsed instead of aborting the whole run.
        let email = match Message::parse(&raw_message) {
            Some(email) => email,
            None => continue,
        };

        // Sender's display name and address, if the From header holds a single address.
        let (from, address) = match email.get_from() {
            HeaderValue::Address(x) => (
                Some(x.name.as_deref().unwrap_or("").to_string()),
                Some(x.address.as_deref().unwrap_or("").to_string()),
            ),
            _ => (None, None),
        };

        emails.push(EmailData {
            address,
            from,
            message_id: email.get_message_id().map(|m| m.to_string()),
            date: email.get_date().map(|d| d.to_string()),
        });
    }
    emails
}

Lessons learned #

While it is much faster to get something up and running in Python, I ended up spending more time debugging and figuring out where the Python scripts failed than I spent rewriting them in Rust. Here are a few of my thoughts on the whole process:

Picking the right tool for the job is important...
So is figuring out when to stop and try something else
Write your code in a way that ensures you handle most error cases. It is better to be more "defensive" in your coding than to spend time debugging and patching things up afterwards
I prefer strong typing over dynamic typing, particularly when I expect things to go wrong
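
To make the last two points concrete, here is a small sketch that uses only the standard library. It shows one way the type system can force a caller to deal with broken encodings: the decoding function never fails, but its return type records whether any bytes had to be replaced. The Decoded enum and the decode_field function are hypothetical names invented for this example, not part of any crate mentioned above.

```rust
use std::borrow::Cow;

/// Hypothetical result type: records whether decoding had to replace bytes.
#[derive(Debug, PartialEq)]
enum Decoded {
    /// The input was valid UTF-8 and is returned unchanged.
    Clean(String),
    /// Invalid byte sequences were replaced with U+FFFD.
    Lossy(String),
}

/// Decodes a raw field without ever failing.
fn decode_field(raw: &[u8]) -> Decoded {
    // from_utf8_lossy borrows the input when it is valid UTF-8 and only
    // allocates a new String when replacement characters were inserted.
    match String::from_utf8_lossy(raw) {
        Cow::Borrowed(valid) => Decoded::Clean(valid.to_owned()),
        Cow::Owned(replaced) => Decoded::Lossy(replaced),
    }
}
```

A caller has to match on both variants to get at the text, so the compiler points out any place where the lossy case was forgotten.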