Yo’ dawg, I heard you like archives, so I put some tar.gz in your tar so you can extract archives from your archives!

This isn’t incredibly mindblowing, but I was tickled when I discovered today that I could untar a file in memory, and then extract the gzip archives within… all still in memory! Here is an example.

The Variables

The variables in the script below include:

input_tar

A tar file path that includes a single folder of .tar.gz files from arXiv, (e.g., “arxiv/1306.tar”) has one subfolder “1306” and this corresponds to a bunch of papers from June (06) 2013 (13). We write a scrappy script that expects this to be the first argument (if you are writing a real script, please use argparse)!

tar

The “input_tar” once it’s read into memory with tarfile.

member

Each of the “*.tar.gz” (tarinfo) objects corresponds to a compressed paper (e.g., “1306/1306.5867.tar.gz”). The reason we check for the extension and that it’s a file is because the top level folder is also a tarinfo object, and we would get an error if we treated it like a file. I suspect when a user uploads their paper, it gets assigned this unique id that includes the month, year, and then the paper number (5867).

subtar

A member (.tar.gz) that is read into a second tar object, but this one with mode r|gz because it’s gzipped. In this subtar we expect to find the files of interest that we want to parse.

submember

The members of the subtar, or the .tar.gz file. One or more of these will be tex files, and if we read into a file object we get the LaTeX!

The Algorithm

Before I show you a dump of code, let’s walk through the psuedocode so you know what’s going on.

  1. We start with a tar file, input_tar, and make sure it exists.
  2. We extract the input tar into memory, this is the variable tar.
  3. We iterate through the contents (members) of the input tar, and look for files that end in .tar.gz.
  4. When we find a .tar.gz, we read it into another tar object by passing it as a file object.
  5. We then iterate through the members of this second tar until we find the content we are looking for.


The Code

First, here are some helper functions.

# in helpers.py

def check_exists(input_file):
    '''if input_file doesn't exist, tell user and exit. We do this
       at the onset of the script for a clean and quick exit if the input
       is invalid
    '''
    if not os.path.exists(input_file):
        print('Cannot find %s!' % input_tar)
        sys.exit(1)

def extract_member(tar, member):
    '''a wrapped to extractfile, simple extract the content with "read"
       and return it
    '''
    with tar.extractfile(member) as m:
        content = m.read()
    return content

And here is the main function, named very appropriately if you ask me!


def yodawg_extract(input_tar, tar_mode='r', subtar_mode='r|gz'):
    '''extract tar.gz in memory from an initial tar file
    '''
    extracted = []
    tar = tarfile.open(input_tar, tar_mode)
    for member in tar:

       # Are we dealing with a file?
       if member.isfile():

           # Is it a gzip archive?
           if member.name.endswith('.tar.gz'):
               subtar = tarfile.open(mode=subar_mode, fileobj=tar.extractfile(member))
           
               # Now we can find papers (.tex LaTeX files) inside
               for submember in subtar:
                   if submember.name.endswith('.tex'):    

                       # We extract the submember from it's parent subtar
                       tex = extract_member(subtar, submember)
                       extracted.append(tex)

               subtar.close()

    # Don't forget to close your file handles!
    tar.close()

    return extracted

It would then be used like this:

#!/usr/bin/env python

import sys
import tarfile
from helpers import yodawg_extract

input_tar = sys.argv[1]

# If input tar is not found, do not proceed
check_exists(input_tar)
contents = yodawg_extract(input_tar)

Wait! Can we do this recursively?

Actually, now that I just wrote this function, if you think about it, we could very easily have the function recursively call itself (with the correct mode) each time it finds another archive! We can have archives inside archives inside archives… forever! I don’t know if this would work (and I’ll likely try it after the fact, I need to eat dinner soon!) I’m really excited about this, and so is someone else:


Let’s start with the parent function that will call a helper function to do the recursion. The main difference is that the first opening with tarfile is for an actual file, and the remaining are opening file objects.


import re

def yodawg_extract(input_tar, extension='.tex'):
    '''extract tar or targz files (in memory) from an original input tar (file)
    '''

    extracted = []
    tar = tarfile.open(input_tar, tar_mode)
    for member in tar:

       # Are we dealing with a file?
       if member.isfile():

           # Is it a gzip archive?
           if re.search('(tar$|gz$)', member.name):

               extracted += yodawg_helper(tar, member, extension)

    # Don't forget to close your file handle!
    tar.close()

    return extracted

Now here is the yodawg helper! We basically just read in the file object passed from the member, and then either extract a file to return (given the right extension, .tex) OR call ourselves again to keep digging.


def yodawg_helper(tar, member, extension, mode='r'):
    '''keep extracting until we don't find any more to extract.
    '''
    extracted = []

    # Do we have gzip and not tar?
    if member.name.endswith('tar.gz'):
        mode = 'r|gz'

    subtar = tarfile.open(mode=mode, fileobj=tar.extractfile(member))
           
    for submember in subtar:

        # Case 1: Another tar or tar.gz!
        if re.search('(tar$|gz$)', submember.name):
            extracted = extracted + yodawg_helper(subtar, submember, extension)
 
        elif submember.name.endswith(extension):
            tex = extract_member(subtar, submember)
            extracted.append(tex)

    return extracted

And of course and at the end, you would want to do something with the contents that you read.



Final Thoughts

I want to note that I didn’t test the above, I wrote it while I was making dinner (and if you test and have improvements please comment and contribute!)

Why did I want to write this? Maybe I just really like data structures, or working with file objects, or just playing with Python, but I think this is some really cool beans. Keep in mind some potential limitations:

Memory

Keep memory in mind as you parse through archives. For example, I was doing something similar with containers, and for larger ones my computer would slow and then poop out.

Modes

The file modes are important. Notice we use ‘r’ for read, and ‘gz’ when it’s a gzipped file. If you look in the Resources link below, you’ll notice different format strings to specify different kinds of compression. Your tar is probably not like my tar, so give a glance here first.

Generators

If the list of extracted gets too big, you could modify the code to yield one of the results. I didn’t do this because I really need to start making dinner now!

If you do something cool, even if it’s just as simple as the above, please share! It’s fun to learn things and tricks.

Resources