I first installed Linux on an old PC way back in 1993. Today, I used sed for the first time. I should have been using it years ago.
A side project that I volunteer on has me downloading reports in PDF format that contain information we use as a checklist of sorts (I have no control over the format of the output). The problem is that in many instances the report format is extremely wasteful: 3500 lines of data take up 2000 report pages.
Not very efficient.
Worse, this 2000-page report needs to be split up into sub-reports in order to be handed off to the people on the ground who will do the actual work. The sub-reports are distributed electronically, so they must contain only the pages needed by the field worker. Shades of my old Navy days (can you say “compartmentalization”?).
Some Googling led me to Ghostscript as the tool needed for splitting my big PDF into small PDFs. So I had my small sub-reports that I needed for distribution.
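For anyone who wants to try the same thing, here is a rough sketch of how a page range can be pulled out of a big PDF with Ghostscript. The function name, file names, and page numbers are made up for illustration; the flags are standard Ghostscript options, but this is not necessarily the exact command I ended up with.

```shell
#!/bin/sh
# Sketch: copy pages FIRST..LAST of a PDF into a new, smaller PDF.
# extract_pages is a hypothetical helper name; gs must be on the PATH.
extract_pages() {
    input=$1    # the big source PDF
    first=$2    # first page to keep (1-based)
    last=$3     # last page to keep
    output=$4   # the sub-report to write

    # pdfwrite re-emits the selected pages as a new PDF;
    # -dNOPAUSE and -dBATCH keep gs from waiting for input.
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
       -dFirstPage="$first" -dLastPage="$last" \
       -sOutputFile="$output" "$input"
}
```

Calling `extract_pages big-report.pdf 3 7 team-a.pdf` would, under these assumptions, write pages 3 through 7 to `team-a.pdf`.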
Except… that people did not like having to print a large number of pages that had mostly whitespace on them. I knew that if I could extract the text in the PDFs I might be able to reformat the text into a more usable format.
So, I began doing some research. One thing’s for sure: the Internet knows all.
In fairly short order I found the pdftotext utility that already resides on my Linux box. Almost magically, I had extracted the text from my mini-mountain of PDF files. Neat!
Except… the format left a lot to be desired.
I opened one of the text files in good ole vim (actually, wimpy old me usually uses gVim). I worked through the file using six (or so) macros to transform the text into something I could use. Looking better. I quickly combined my six (or so) macros into one large macro that was executed within vim to format my text. Everything was looking good.
The next step was to process the mini-mountain of PDFs, using vim to transform the raw text into my preferred format in something like a batch (or at least from the command line in a script of some sort).
I was Googling for how to pass a macro name into vim as a command line parameter when I remembered that sed exists for tasks exactly like this. The only problem: I’ve never used it before.
More Googling. More good news. sed uses commands that are very similar to the ones I was using in vim. In short order, I had a script file for sed that transformed the raw text file into the format that I wanted for my distribution-ready files.
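To give a feel for the similarity: a substitution typed in vim as `:%s/foo/bar/g` is written in sed as `s/foo/bar/g`. The sketch below shows the kind of cleanup this makes possible. The three rules are hypothetical stand-ins, since the real ones depend entirely on the report layout, and `clean_report_text` is just a name I picked for the example.

```shell
#!/bin/sh
# Sketch: tidy raw extracted text read from stdin, write result to stdout.
# The rules are illustrative assumptions, not the actual report rules.
clean_report_text() {
    # 1. strip trailing whitespace from every line
    # 2. collapse runs of two or more spaces into a single space
    # 3. delete lines that are now empty
    sed -e 's/[[:space:]]*$//' \
        -e 's/ \{2,\}/ /g' \
        -e '/^$/d'
}
```

The same three commands, one per line and without the `-e`, could live in a script file and be run with `sed -f`.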
A few more minutes and I had a tiny shell script that starts with a mini-PDF, generates a text file, processes the text file with sed and voilà, we have a well-formatted text file suitable for printing, framing, or using as intended (I’m sure that I can save a step or two, but that’s something for tomorrow).
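A script along these lines might look like the sketch below. It assumes pdftotext is installed and that the sed commands live in a separate script file; the function name and file names are hypothetical. It also pipes pdftotext straight into sed rather than writing an intermediate text file, which is one way to save a step.

```shell
#!/bin/sh
# Sketch: turn one PDF into a cleaned-up text file next to it.
# pdf_to_clean_text is a hypothetical helper name.
pdf_to_clean_text() {
    pdf=$1                 # e.g. team-a.pdf
    sed_script=$2          # file of sed commands, e.g. cleanup.sed
    out=${pdf%.pdf}.txt    # team-a.pdf -> team-a.txt

    # "-" makes pdftotext write to standard output;
    # -layout asks it to preserve the page's physical layout.
    pdftotext -layout "$pdf" - | sed -f "$sed_script" > "$out"
}
```

Processing the whole mini-mountain is then a loop such as `for pdf in *.pdf; do pdf_to_clean_text "$pdf" cleanup.sed; done`.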
So, I’m twenty years late in making good use of tools that are available in every Linux installation. Tools that I’ve had at my disposal since day one. But I’m sure that this won’t be the last time I use these tools.
Now, on to awk…