Switching from Python to Rust for extracting data from Git repositories
Background
Python is great for collecting and manipulating data. It is usually very easy and fast to get something running. However, I recently ran into issues, mostly related to encodings, when working with emails in mbox format and with Git repositories. Specifically, parsing emails failed because of broken character encodings. I was also unable to catch some exceptions. Maybe I missed something, but I decided to see whether I would hit the same problems in Rust.
Switching to Rust
I started with Rust by parsing a few commits that were problematic in Python. Testing a small set of problematic commits first was the best choice before committing to rewriting my code in Rust. It turned out that the encoding issues disappeared and I was able to output the data I needed. In the following sections, I will show how I used Rust to extract data from Git repositories and from the mbox format.
Extracting data from repositories using the git2-rs crate
For working with Git repositories, I used the git2-rs crate, which provides libgit2 bindings for Rust.
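One reason the encoding problems are easier to handle here: as far as I can tell from the git2 documentation, `Commit::message()` returns an `Option<&str>` that is `None` when the message is not valid UTF-8, while `Commit::message_bytes()` gives access to the raw bytes. The sketch below uses only the standard library and shows one way to decode such bytes without failing. The helpers `decode_commit_text` and `summary_line` are hypothetical names for illustration, and the byte slices stand in for whatever git2 would return.

```rust
use std::borrow::Cow;

/// Hypothetical helper: decodes raw bytes (for example, what git2's
/// `Commit::message_bytes()` returns) into text without ever failing.
/// Invalid UTF-8 sequences are replaced with U+FFFD.
fn decode_commit_text(raw: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(raw)
}

/// Hypothetical helper: returns the first line of a commit message
/// (the summary), trimmed of surrounding whitespace.
fn summary_line(raw: &[u8]) -> String {
    decode_commit_text(raw)
        .lines()
        .next()
        .unwrap_or("")
        .trim()
        .to_string()
}
```

Lossy decoding trades accuracy for robustness: a commit message stored in Latin-1 will contain replacement characters, but the extraction run keeps going instead of aborting on the first odd commit.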
Extracting data from emails stored as mbox using the mail-parser crate
There are several crates for working with emails in mbox format, but none of them is as mature as its counterparts in other ecosystems.
- mbox-reader - has not been updated for the past two years
- mailbox - last update was in summer 2017
- lettre - excellent email library, though it does not support email parsing
- mail-parser - a fairly new library that is actively developed; it explicitly mentions that it can decode messages in 41 different character sets, including obsolete formats such as UTF-7.
The mail-parser crate sounded promising, so I decided to use it. However, I quickly realized that it had no way to read an mbox file directly: the parser needed a single email as a string. Luckily, the maintainers were kind enough to add support for reading mbox files, and I was able to quickly get it up and running.
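To give an idea of what reading an mbox file involves, here is a minimal sketch that uses only the standard library. It is not how mail-parser does it; `split_mbox` is a hypothetical helper, and it assumes the simplest mbox convention, where every message starts with a separator line beginning with "From ".

```rust
/// Hypothetical helper: splits the raw bytes of an mbox file into messages.
/// A new message starts at every line that begins with "From "
/// (the mbox separator line). The separator line itself is dropped.
fn split_mbox(data: &[u8]) -> Vec<Vec<u8>> {
    let mut messages: Vec<Vec<u8>> = Vec::new();
    let mut current: Option<Vec<u8>> = None;

    // Iterate line by line, keeping the trailing newline on each line.
    for line in data.split_inclusive(|&b| b == b'\n') {
        if line.starts_with(b"From ") {
            // Separator line: finish the previous message and start a new one.
            if let Some(msg) = current.take() {
                messages.push(msg);
            }
            current = Some(Vec::new());
        } else if let Some(msg) = current.as_mut() {
            msg.extend_from_slice(line);
        }
        // Bytes before the first separator line are ignored.
    }

    // Do not forget the last message in the file.
    if let Some(msg) = current {
        messages.push(msg);
    }
    messages
}
```

Working on bytes rather than on `String` means a message with a broken encoding cannot make the split itself fail. Note that this sketch would wrongly split on a body line that starts with "From "; mbox variants that escape such lines (for example as ">From ") avoid that problem.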
Here is a code snippet showing how one could use the mail-parser crate to parse an mbox file.
    use std::fs::File;

    // Assumption: these import paths match the mail-parser release that first
    // shipped MBoxParser; they may differ in other versions of the crate.
    use mail_parser::{parsers::mbox::MBoxParser, HeaderValue, Message};

    #[derive(Debug)]
    struct EmailData {
        address: Option<String>,
        from: Option<String>,
        message_id: Option<String>,
        date: Option<String>,
    }

    fn parse_mbox() {
        let p = "path_to_mbox";
        let mbox_file = match File::open(p) {
            Ok(file) => file,
            Err(e) => panic!("Error while opening the mbox file: {}", e),
        };
        for raw_message in MBoxParser::new(mbox_file) {
            // Skip messages that cannot be parsed instead of aborting the run.
            let email = match Message::parse(&raw_message) {
                Some(email) => email,
                None => continue,
            };
            // Sender name and address; each falls back to "" when missing.
            let (from, address) = match email.get_from() {
                HeaderValue::Address(x) => (
                    Some(x.name.as_deref().unwrap_or("").to_string()),
                    Some(x.address.as_deref().unwrap_or("").to_string()),
                ),
                _ => (None, None),
            };
            let data = EmailData {
                address,
                from,
                message_id: email.get_message_id().map(|m| m.to_string()),
                date: email.get_date().map(|d| d.to_string()),
            };
            println!("{:?}", data);
        }
    }
Lessons learned
While it's much faster to get something up and running in Python, I ended up spending more time debugging and figuring out where it failed than I spent rewriting the scripts in Rust. Here are a few of my thoughts on the whole process:
- Picking the right tool for the job is important...
- So is figuring out when to stop and try something else
- Write your code in a way that ensures you handle most error cases. It is better to be more "defensive" in your coding than to spend time debugging and patching things up
- I prefer strong typing over dynamic typing, particularly when I expect things to go wrong
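As a rough sketch of what "defensive" can look like in Rust, the example below puts the expected failures into a single error type so the compiler forces the caller to deal with each of them. It uses only the standard library; `ExtractError`, `decode_strict`, and `read_text` are hypothetical names for illustration and are not taken from my actual scripts.

```rust
use std::fmt;
use std::fs;
use std::io;

/// Hypothetical error type covering the failures this sketch expects.
#[derive(Debug)]
enum ExtractError {
    /// The file could not be read.
    Io(io::Error),
    /// The bytes were not valid UTF-8; `valid_up_to` is the length of the valid prefix.
    InvalidUtf8 { valid_up_to: usize },
}

impl fmt::Display for ExtractError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ExtractError::Io(e) => write!(f, "I/O error: {}", e),
            ExtractError::InvalidUtf8 { valid_up_to } => {
                write!(f, "invalid UTF-8 after byte {}", valid_up_to)
            }
        }
    }
}

impl From<io::Error> for ExtractError {
    fn from(e: io::Error) -> Self {
        ExtractError::Io(e)
    }
}

/// Strictly decodes bytes as UTF-8 and reports where decoding stopped.
fn decode_strict(raw: Vec<u8>) -> Result<String, ExtractError> {
    String::from_utf8(raw).map_err(|e| ExtractError::InvalidUtf8 {
        valid_up_to: e.utf8_error().valid_up_to(),
    })
}

/// Reads a file and decodes it; `?` converts `io::Error` through the `From` impl.
fn read_text(path: &str) -> Result<String, ExtractError> {
    let raw = fs::read(path)?;
    decode_strict(raw)
}
```

A caller can then `match` on the result and decide per case, for example logging and skipping a file with a broken encoding while stopping on an I/O error. Nothing fails silently, which was the part I missed most in my Python scripts.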