So I was doing my regular job, writing new queries to test out my team’s new databases on Vertica, when suddenly my senior manager sent out an email to the whole team asking for an estimated disk space requirement for each of our projects. I had to find out how much space the raw data files for a complete history, the ones used to load my databases, occupied on disk.

I knew I could find this out by checking our archive space. We use Linux. So suppose I stored my data files, after zipping, into the ~myname/archive/ directory. I have 3 databases, each of which has two broad categories of files, stored in two subdirectories under a directory named after the database. Something like below:

~myname/archive > ls   
mydb1  
mydb2  
mydb3  
~myname/archive > cd mydb1  
~myname/archive/mydb1 > ls  
broadcategory1  
broadcategory2  
~myname/archive/mydb2 > ls  
broadcategory1  
broadcategory2  
~myname/archive/mydb3 > ls  
broadcategory1  
broadcategory2  

Each of the broadcategory directories would have subcategory directories, which would depend on the database. So in my case I had something similar to this:

~myname/archive/mydb1/broadcategory1 > ls  
subcategory1  
subcategory2  
subcategory3  
~myname/archive/mydb1/broadcategory2 > ls  
subcategory1  
subcategory2  
subcategory3  
  
~myname/archive/mydb2/broadcategory1 > ls  
subcategory4  
subcategory5  
  
~myname/archive/mydb2/broadcategory2 > ls  
subcategory4  
subcategory5  
  
~myname/archive/mydb3/broadcategory1 > ls  
subcategory1  
  
~myname/archive/mydb3/broadcategory2 > ls  
subcategory1  

Now in each of those directories I had zip files with “yyyymmdd*history” in their names. That was how I could pick the right archive files. But we also had a policy of storing only 2 GB of data in a zip file. If a zip file was going to be larger than 2 GB, we would split it by generating a new zip file with “_sequenceNumber” added to its name. And not all databases were refreshed with full history data on the same yyyymmdd, so I would need to go in and check to figure that out. But once I found the right files, all I had to do was run unzip -v on each file, get the uncompressed sizes, sum them up, populate the results in a nice Excel sheet and send it across.
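
To give a feel for the naming (the file names below are invented; only the “yyyymmdd*history” pattern and the “_sequenceNumber” suffix are real), one of those directories might look like this:

~myname/archive/mydb1/broadcategory1/subcategory1 > ls  
20130112_fullhistory.zip  
20130112_fullhistory_2.zip  
20130201_dailyload.zip  

The last file would be ignored because it doesn’t match the history pattern.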

If I were to run unzip manually on each of those files, traversing the directory hierarchy, I would waste my entire day on it, because there were quite a lot of zip files. Even then, I did it manually for the first few files. When I got tired, I was reminded of the cool ‘xargs’ and decided to write a one-liner that would get me the information I needed in a form I could easily copy-paste into Excel. So unzip -v zipfile.zip lists a few columns for each entry in the archive: the size of each uncompressed file, the size of its compressed version, the compression ratio of each file, etc. At the end of the listing, it sums up all the sizes and neatly prints them out below as 4 different fields. I needed the first (total size of the uncompressed files in the archive), the second (total size of the compressed files in the archive) and the fourth field (total number of files). And I also wanted the name of each zip file, just to show the breakdown (in case I was ever asked how I calculated the size).
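
To make those field numbers concrete, here is roughly what that output looked like for me (the exact column layout can vary between unzip versions, and the archive name and numbers below are made up):

~myname/archive/mydb1/broadcategory1/subcategory1 > unzip -v 20130112_fullhistory.zip  
Archive:  20130112_fullhistory.zip  
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name  
--------  ------  ------- ---- ---------- ----- --------  ----  
 1048576  Defl:N   524288  50% 2013-01-12 10:00 1a2b3c4d  part1.dat  
 2097152  Defl:N   943718  55% 2013-01-12 10:01 5e6f7a8b  part2.dat  
--------          -------  ---                            -------  
 3145728          1468006  53%                            2 files  

Everything I needed is on that last totals line.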

Now I knew that I needed to list out the file names matching the pattern in each directory, run unzip -v on each of those zip files, and capture the name of each file along with the last line of the unzip -v output. After some trial and error using man and Google, I came up with the series of commands below:

ls *20130112*history*.zip | xargs -d '\n' -I % sh -c 'unzip -v % | tail -n1 | awk -v fname="%" '\''BEGIN{OFS=",";}{print fname,$1,$2,$4}'\'''  

In case you didn’t understand what each of those pipe-separated commands did, read further. ’ls *20130112*history*.zip’ lists out all the files that match the pattern, one per line. I pass this list on to xargs. Now xargs is a command that accepts arguments from stdin and runs another command on each of those arguments. To know more, go to xargs - Wikipedia.

To elaborate on xargs and the options that I used: ‘-d’ denotes the delimiter between the arguments. In my case the list of file names was separated by newlines, as provided by ’ls’, so I specified “-d ‘\n’”. The -I option of xargs is a way of marking a placeholder. -I is followed by ‘%’, which tells xargs to replace every occurrence of ‘%’ in the command that follows with a file name.
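
As a stripped-down illustration of the placeholder (echo stands in for the real work here):

ls *20130112*history*.zip | xargs -d '\n' -I % echo "next archive: %"  

This prints one line per zip file, with ‘%’ replaced by the file name each time.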

‘sh -c’ lets me run the commands inside the single quotes that follow; in other words, it passes the quoted string to ‘sh’ to execute as a command. You can google about it.
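
The reason for the extra ‘sh -c’ is that I wanted the pipes inside to run once per zip file; without it, a pipe placed after xargs would apply to the combined output of all the unzip runs rather than to each one. A cut-down version of the same idea:

ls *20130112*history*.zip | xargs -d '\n' -I % sh -c 'unzip -v % | tail -n1'  

This prints just the totals line of each archive, one after another.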

In the quotes that follow, I write the stuff that I actually need to collect.

unzip -v %   

The above unzip command displays what I mentioned earlier; the % is replaced with the name of each zip file, one at a time. I pass the output of this command on to tail.

tail -n1   

The tail command by default prints out the last 10 lines of whatever stream is piped into it. Here I’m passing in the output of unzip -v and using the -n1 option to retrieve only the last line. The output of this is passed on to an awk command.
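
With the made-up archive from earlier, that leaves only the totals line, which is exactly what awk will receive:

~myname/archive/mydb1/broadcategory1/subcategory1 > unzip -v 20130112_fullhistory.zip | tail -n1  
 3145728          1468006  53%                            2 files  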

Now I can’t teach you awk. Not because I don’t like it, but because someone has already taken all the pain in the world to put together the best tutorial I’ve ever seen on the web about awk here: Awk Tutorial. So I’ll just explain the options that I have used with awk.

awk -v  

The -v option defines a variable that can be used inside awk’s script. So I’ve declared a variable fname with the value of % (which is nothing but the name of the file). Now you might see an awkward, cluttered set of single quotes and backslashes. To know more about quoting in bash, please use Google. Sorry for being lazy. The idea was to get the quote that begins the awk script inside the quotes in which we pass all these commands to ‘sh -c’.
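
If that quoting is hard to follow, here is the same trick in isolation (a toy example that has nothing to do with the zip files). Each '\'' closes the current single-quoted string, adds a literal single quote, and opens a new single-quoted string:

sh -c 'awk '\''BEGIN{print "the quoted script ran"}'\'''  

What sh actually receives as its command is: awk 'BEGIN{print "the quoted script ran"}'.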

Then I define the Output Field Separator (OFS) as a simple comma, because I intend to copy comma-separated values, paste them into MS Excel, and have it split them into columns automatically for me. And finally I print out the fname variable, the first field, the second field and the fourth field. These fields are explained above.
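
Put together, each archive produces one comma-separated line that is ready to be pasted into Excel (the file names and numbers here are invented):

20130112_fullhistory.zip,3145728,1468006,2  
20130112_fullhistory_2.zip,4294967296,1932735283,731  

One row per zip file: the archive name, the total uncompressed size, the total compressed size and the number of files inside.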