I try to keep my documents in some order. For invoices, I prefer to store each one in a folder named after the month the invoice belongs to, which is not necessarily the month in which I downloaded or received it. To do that, I needed to extract a date from each invoice, so I decided to write an invoice date parser script. The task turned out to be more difficult than I initially expected.
Motivation
Invoices often include several separate dates, with names such as ‘invoice date’, ‘order date’, ‘payment deadline’ etc. I needed a script that would automatically pick the correct date from a given invoice, so that another script could move the invoice to a directory with all invoices from the same month. I couldn’t find such a script, so I decided to create one myself. You can find it in this GitHub repository.
Issues with invoice date parsing
There are a couple of problems with parsing data from PDF invoices:
- the invoices are formatted in many different ways. The actual date can be next to its name/title, or above/below it. Sometimes a single line in a PDF invoice contains multiple date-name pairs.
- the dates can be in multiple different formats
- the invoice can contain multiple different dates with different names/titles. Not all of them might be important.
- which date name/title is relevant depends on the case. For example, for a purchase invoice the important dates are ‘invoice date’, ‘issue date’ etc., but for a sales invoice the earliest date from a given subset is the important one.
- in my case the additional difficulty comes with the fact that I’m not from an English-speaking country and the invoices I deal with are in Polish or English.
The default configuration contains regex patterns for English and Polish invoices. I’ve tried to configure the script according to information from a Polish accountant.
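To make the problems above a bit more concrete, here is a minimal sketch of the label-based approach. It is not the script's actual configuration: the label list, the two date formats, the day-first assumption, and the function names (extract_labeled_date, to_iso, earliest_date) are all made up for illustration. It only handles the case where the date sits on the same line as its name/title.

```shell
#!/usr/bin/env bash
# Hypothetical label and date patterns; the real configuration may differ.
LABELS='invoice date|issue date|data wystawienia|data faktury'
# Two illustrative formats: YYYY-MM-DD and DD.MM.YYYY (also DD/MM/YYYY, DD-MM-YYYY).
DATE_RE='[0-9]{4}-[0-9]{2}-[0-9]{2}|[0-9]{2}[./-][0-9]{2}[./-][0-9]{4}'

# Read text on stdin and print the first date that follows a known label
# on the same line. Dates with other labels (e.g. 'order date') are skipped.
extract_labeled_date() {
  grep -iEo "(${LABELS})[^0-9]*(${DATE_RE})" | head -n 1 | grep -Eo "(${DATE_RE})$"
}

# Convert DD.MM.YYYY (assumed day-first) to YYYY-MM-DD; ISO dates pass through.
to_iso() {
  sed -E 's#^([0-9]{2})[./-]([0-9]{2})[./-]([0-9]{4})$#\3-\2-\1#'
}

# For the 'earliest date' case: ISO dates sort correctly as plain text.
earliest_date() {
  sort | head -n 1
}

# A line with two date-name pairs; only the invoice date is picked.
printf 'Order date: 05.01.2023  Invoice date: 10.01.2023\n' | extract_labeled_date | to_iso
# prints 2023-01-10
```

With real invoices the input would come from a PDF text extraction step instead of printf, and the case where the date is above or below its name/title would need extra handling that this sketch leaves out.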
The initial requirements
The invoice date parser is written in bash and uses several tools available on most Unix/Linux operating systems. I developed the script with macOS and Linux in mind. The tools used in the shell script are:
- bash
- pdfgrep
- grep
- sed
- awk
- GNU date (on macOS it can be installed with coreutils)
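Before running anything, it can help to check that the tools listed above are available. The sketch below is one possible way to do it and is not taken from the script itself. The function names missing_tools and gnu_date_cmd are hypothetical, and the gdate name assumes coreutils was installed with Homebrew, which prefixes the GNU commands with g.

```shell
#!/usr/bin/env bash
# Print the names of the given commands that are not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Print the name of a GNU date binary, or return 1 if none is found.
# Assumption: on macOS, Homebrew's coreutils provides GNU date as 'gdate'.
gnu_date_cmd() {
  if command -v gdate >/dev/null 2>&1; then
    echo gdate
  elif date --version 2>/dev/null | grep -q 'GNU'; then
    echo date
  else
    return 1
  fi
}

# Report anything that is missing on stderr.
missing=$(missing_tools pdfgrep grep sed awk)
[ -z "$missing" ] || echo "Missing tools: $missing" >&2
gnu_date_cmd >/dev/null || echo "GNU date not found" >&2
```

The check for GNU date is separate because the date that ships with macOS is the BSD variant, which does not understand the same options.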