Mastering document metadata
During the information gathering stage of a penetration test, the key is to gather as much information about the target as possible before you begin to actively engage with the person or organisation in question.
Almost all commercial document creation and publishing tools embed additional information into files when they are saved. This extra information is known as metadata - the data about the data. It includes both technical and non-technical details about the tool, the computer it was used on and the document itself.
The type, quantity and detail of this metadata vary from tool to tool and from version to version.
Typical fields held within document metadata include:
- Usernames
- Email addresses
- Paths
- Software names and versions
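To make the list above concrete, here is a minimal sketch that reads a few of these fields from an Office Open XML file (.docx, .xlsx or .pptx). These files are zip archives that carry their core properties in docProps/core.xml. The function name read_core_properties and the choice of three fields are our own, for illustration only; other formats store their metadata differently.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml in Office Open XML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical selection of fields; the part can hold more than these.
WANTED = {
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
}


def read_core_properties(package_bytes):
    """Return the WANTED core properties present in an OOXML file's bytes."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    found = {}
    for key, tag in WANTED.items():
        node = root.find(tag, NS)
        # Skip fields that are missing or empty in this document.
        if node is not None and node.text:
            found[key] = node.text
    return found
```

Calling read_core_properties(open("report.docx", "rb").read()) on a downloaded file (report.docx is a placeholder name) returns a small dictionary of whichever of these fields the document contains.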
Tip
Don't forget image files when examining document metadata. A modern camera can embed dates, times and GPS co-ordinates in the images it produces.
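EXIF stores each GPS co-ordinate as degrees, minutes and seconds plus a reference letter (N, S, E or W), whereas most mapping tools expect signed decimal degrees. The following is a small sketch of that conversion; the function name dms_to_decimal is our own.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert a degrees/minutes/seconds co-ordinate to signed decimal degrees.

    ref is the EXIF reference letter: "N", "S", "E" or "W".
    South and west are returned as negative values.
    """
    decimal = degrees + minutes / 60 + seconds / 3600
    return -decimal if ref in ("S", "W") else decimal
```

For example, dms_to_decimal(51, 30, 0, "N") gives 51.5.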
The intended purpose of this metadata is to add extra value to the end user's experience of the software. It is used for a variety of things such as document previews, thumbnails and search.
As penetration testers, we can harness this metadata as a non-invasive source of target data. Targets often publish all sorts of documents as part of normal day-to-day business, whether via their web presence or on request via email. All we need to do is collect and analyse them.
Many tools can help you do this, ranging from small script-based penetration testing tools to tools intended for forensics.
Let's have a look at two of them:
Metagoofil
Metagoofil is a tool from Edge Security and is included in BackTrack by default. You can visit its home page here.
Metagoofil is a Python script that automates the process of acquiring documents hosted on publicly accessible web servers and extracting the embedded metadata.
./metagoofil.py -d [target domain] -t [file type(s)] -n [xx] -l [xx] -o [destination directory] -f [results filename]
How does it work?
At its simplest, this tool carries out the following:
- Searches Google for the target domain and desired filetype
- Downloads the files to your machine
- Runs the relevant extraction libraries over the files based on the file type
- Stores all of the findings in a handy, human-readable HTML document.
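The first of the steps above can be sketched in a few lines. This is not Metagoofil's own code, just an illustration of how a per-file-type search query could be put together; build_search_queries and search_url are hypothetical names, and example.com stands in for the target domain.

```python
from urllib.parse import urlencode


def build_search_queries(domain, filetypes):
    """Return one Google query per file type, restricted to the target domain."""
    return [f"site:{domain} filetype:{ext}" for ext in filetypes]


def search_url(query):
    """Turn a query string into a Google search URL with the query URL-encoded."""
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    for query in build_search_queries("example.com", ["pdf", "doc", "xls"]):
        print(search_url(query))
```

The remaining steps (downloading, extraction and reporting) are where the tool earns its keep, so in practice you would let it handle the whole chain.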
Tip
Don't have access to these tools? You can do this manually on Google using the inurl: or site: advanced search operator combined with the filetype: operator, then use a tool such as strings to view the raw file contents and the metadata within them.
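If strings itself is not to hand, its basic behaviour is easy to approximate. The sketch below pulls runs of printable ASCII characters out of raw bytes, using a minimum length of four as strings does by default. The function name printable_strings is our own, and unlike the real tool this only handles single-byte ASCII text.

```python
import re


def printable_strings(data, min_length=4):
    """Return runs of printable ASCII at least min_length long found in data."""
    # 0x20 to 0x7e covers space through tilde, i.e. printable ASCII.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py downloaded_file
    with open(sys.argv[1], "rb") as handle:
        for text in printable_strings(handle.read()):
            print(text)
```

Usernames, paths and software names often show up as plain text in the output, although some formats compress or encode their metadata, in which case this approach will miss it.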
Things to remember
- Even though metagoofil uses Google to find the documents, they are requested and downloaded from the target web server. As a result, the requests will show in the target's web server log. This must be considered if you are running a low-profile red team engagement or need to avoid direct interaction with the target, as it can be considered "active reconnaissance".
- The -n and -l flags are really handy if time or bandwidth is a concern: -l limits the number of search results examined and -n limits the number of files downloaded. The more documents you request, the more time and space will be required. Annual reports and other large, image-heavy documents are common and can take a long time to gather.
- You can use full paths for the destination directory and results file name - this can be useful when working on multiple targets and organising your findings.
- The destination directory doesn't need to exist before you start - the tool will create it.
- Already collected documents or retrieved them from another source? Use the "-h yes" option to do a local analysis. With this option, the destination directory flag (-o) is used to specify where the local files are stored.
- Don't forget to take into account document creation date when prioritising the importance of your findings with this tool. Usernames found in a document created last week are more likely to be valid than those in a file created a year ago.
- Leave out the -t option to analyse all files found, or specify the file types you are interested in using a comma-separated list.
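The point above about document creation dates can be turned into a simple ranking. This sketch assumes you have already reduced your findings to (username, creation date) pairs by hand or by script; the function name rank_usernames and the input shape are our own and are not something metagoofil produces directly.

```python
from datetime import date


def rank_usernames(findings):
    """Order usernames by the newest document each one appears in.

    findings is an iterable of (username, creation_date) pairs.
    Returns a list of (username, newest_date), most recent first.
    """
    newest = {}
    for username, created in findings:
        if username not in newest or created > newest[username]:
            newest[username] = created
    return sorted(newest.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = [
        ("jsmith", date(2012, 1, 5)),
        ("adoe", date(2013, 3, 1)),
        ("jsmith", date(2013, 2, 20)),
    ]
    for username, seen in rank_usernames(sample):
        print(username, seen.isoformat())
```

Usernames at the top of the list are the ones most likely to still be valid.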
Exiftool
Exiftool is a specialist Perl tool for reading, writing and manipulating metadata. Although best known for image files, it works with a multitude of file types and is cross-platform. You can find out more about this superbly powerful tool here.
Unlike metagoofil, exiftool will not find and download the files for you. We will assume you have done this stage yourself and stored the files in a handy directory for processing.
Tip
Be very careful when using exiftool to analyse files, particularly if you are unfamiliar with the syntax. Exiftool can modify metadata, so keep a separate copy of the files you are analysing and run an md5sum tool over each file before and after running exiftool. This will confirm you haven't changed the files as part of your work. There is nothing worse than realising your findings are the result of your bad process, not bad target security.
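The before-and-after check described in this tip can also be scripted. The following is a minimal sketch using Python's hashlib in place of the md5sum command; md5_of_file and unchanged_after are our own helper names, and the action argument stands for whatever analysis step you want to wrap.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unchanged_after(path, action):
    """Run action(path) and report whether the file's MD5 is still the same."""
    before = md5_of_file(path)
    action(path)
    return md5_of_file(path) == before
```

If unchanged_after returns False, go back to your untouched copy before drawing any conclusions from the file.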
How does it work?
This tool is extremely powerful so we will just use a simple set of options for now. Feel free to experiment with a more complex set in your own time.
exiftool [image name]
This will run exiftool across the supplied image file and print the metadata it finds to the screen.
exiftool -htmldump -w tmp/%f_%e.html t/images
This generates HTML pages from a hex dump of the EXIF information in all images in the "t/images" directory. The output HTML files are written to the "tmp" directory (which is created if it does not exist), with names of the form 'FILENAME_EXT.html'.
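When you would rather process the results in a script than read them on screen, exiftool's -json option prints the metadata as a JSON array with one object per file. The sketch below runs exiftool read-only and keeps a handful of tags. The helper names and the INTERESTING tag selection are our own assumptions; which tags are actually present depends on the file type.

```python
import json
import subprocess

# Hypothetical shortlist of tags; adjust to suit the job in hand.
INTERESTING = ("Author", "Creator", "Producer", "CreateDate", "GPSPosition")


def run_exiftool_json(path):
    """Run exiftool in read-only mode on one file and return its JSON text."""
    result = subprocess.run(
        ["exiftool", "-json", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def pick_fields(json_text, wanted=INTERESTING):
    """Keep only SourceFile and the wanted tags from exiftool's JSON output."""
    records = json.loads(json_text)
    return [
        {tag: record[tag] for tag in ("SourceFile", *wanted) if tag in record}
        for record in records
    ]
```

A typical call would be pick_fields(run_exiftool_json("photo.jpg")), where photo.jpg is a placeholder for one of your collected files.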
Things to remember
- Only use options once you understand what they do.
- As you gain experience you will begin to spot which bits of metadata are more important than others. For example, names are generally handy for most jobs; however, some red team jobs will rely on more subtle details such as the physical location where an image was taken.
- Exiftool is installed by default in BackTrack and has a comprehensive manpage (including many examples).
Wrap up
Information gathering is definitely not the sexiest phase of penetration testing but, done well, it can feed valuable intelligence into every stage of your work. Document metadata is a great source of this information and can be simply and easily extracted and analysed.



