Sarah Alger, Vanessa Attia, Griffin Brown,
Tim Mapp, Kathleen O’Connell
Open Source Software Presentation
Archival Enterprise II • INF 389S • April 10, 2014
Overview
I. What? Why? Who? When? Where? How?
II. Installation
III. Overview of features
IV. Walkthrough and class exercise
V. Evaluation
What?
● A bundle of software that combines existing digital forensics tools with interfaces and workflows that specifically meet the needs of collecting institutions.
● An Ubuntu Linux-based operating system that can be installed either standalone on a dedicated computer or in a virtual machine running inside another operating system, such as Mac or PC.
● BitCurator provides a practical, logical workflow for collecting institutions to follow.
○ It uses easy-to-navigate graphical user interfaces (GUIs) for its main functions.
○ Advanced users can also run all functions from the command line.
So What?
The software addresses five main areas of digital forensics that are important to collecting institutions:
● Forensic disk imaging
○ Creates a disk image without altering the original disk, gathers metadata about the disk, and documents how the disk image was created. Includes a software write blocker.
● File system analysis and reporting
○ Examines the structure of information on the disk, including the disk format, amount of space used, file directory, and individual file formats (MIME types), and exports a report as DFXML.
● Identification of private and individually identifying information
○ Looks for private information such as Social Security numbers, email addresses, and phone numbers. Detection depends on the information being in the expected format, e.g. 608-231-2196.
● Export of technical and other metadata
○ Creates easy-to-read PDF reports from DFXML, including graphs of file types.
● Identification and removal of duplicate files
○ Quickly identifies where duplicate files exist on a disk, with the option to delete them.
Why?
To develop a system for collecting professionals that incorporates the functionality of many digital forensics tools. Existing digital forensics tools do not address two fundamental needs of collecting institutions:
● Working within a collection management workflow
○ Preserving information and documenting its chain of custody and provenance.
● Making the resulting data available to the public
○ Reports are in a format that is easier to understand, and redaction of personal information makes it easier for institutions to make collections available for research.
Who?
BitCurator is being developed jointly by the School of Information and Library Science at the University of North Carolina at Chapel Hill (UNC SILS) and the Maryland Institute for Technology in the Humanities (MITH). The project is funded by the Andrew W. Mellon Foundation. Currently there are seven people on the project development team:
Co-Principal Investigators
Christopher Lee (UNC SILS)
Matthew Kirschenbaum (MITH)
The five other members are at the postdoctoral, PhD, and master's levels.
Who Else?
Beyond the development team, two groups of external partners contribute to the development process:
● Professional Expert Panel (PEP): individuals at various stages of implementing digital forensics tools and methods in their collecting institutions; currently 10 people.
● Development Advisory Group (DAG): individuals with software development experience; currently 10 people.
When?
Created in two phases:
● Phase One: November 2011 – September 2013
○ Creation of the BitCurator environment, which brings together existing open source digital forensics tools and some custom tools built by the BitCurator team.
● Phase Two: now ongoing
○ Continue to build the software but, more importantly, grow the user base by conducting training programs.
○ Follow @BitCurator on Twitter for details.
● Updates are released frequently. Current version: 0.8.0
○ Six updates since January 2014
Where?
Cultural heritage institutions throughout the country are using it, but it is still new and not ubiquitous in the profession. The project is making a concerted effort to grow its user base and promote use of the software by offering free workshops and webinars.
● Webinars:
○ http://www.bitcurator.net/blog/
○ The next one is April 14, 2014: Metadata Output
● On-site visits and training
○ Grant money is available to conduct training, by request, until September 14, 2014
● Online community
○ Google Group: https://groups.google.com/forum/?fromgroups#!forum/bitcurator-users
● Software can be downloaded from the BitCurator wiki:
○ http://wiki.bitcurator.net/index.php?title=Main_Page
● BitCurator is open source, and the software components are available on GitHub:
○ https://github.com/kamwoods/bitcurator
Context
Relation to Other Archival Software: BitCurator was created to be the first step in an archival workflow, covering appraisal, ingest (creation of SIPs), and metadata extraction for digital materials. The goal of the software is to gain intellectual control over newly acquired digital records and extract enough information about those records (administrative, descriptive, structural, and preservation metadata) that they can be turned over to other archival software, such as Archivematica, ICA-AtoM, or DSpace, for collection management, access, and/or preservation.
Future Plans: Continue to update the software, add new tools, and create GUIs for existing tools. Grow the user base through creator-guided training, in the hope of reaching a critical mass of users so that the community becomes its own self-sustaining support system. Ensure interoperability with, or possibly merge BitCurator into, other archival software platforms (collection management, access, preservation) to create a complete digital records workflow that can be performed by a single archivist with a medium level of computer literacy.
Use Scenarios:
● Your archive has received a faculty member’s collection that includes floppy disks and CD-ROMs amidst the paper materials.
● A photographer has sent you several hundred master images on an external hard drive.
Installation
● Software is available through wiki.bitcurator.net
● May be installed either as a virtual machine from a .vdi (virtual disk image) or as a standalone operating system from an .iso (disk image)
● A medium level of proficiency is required because of hardware compatibility issues, troubleshooting, and working in Ubuntu
● The virtual machine requires additional software to run the virtual environment, and the potential need for troubleshooting makes it harder to get the software running successfully
● ISO installation requires a dedicated computer or hard drive
● BitCurator’s YouTube channel is full of how-to videos: http://www.youtube.com/channel/UC17vmx0W-AiYFpdaWshd7yg
Terms to Know
● bitstream - a binary string; the most basic information that makes up a file.
● hash value/checksum - a number computed from a file's contents, used to verify that the file has not changed.
● MD5 - a particular hash algorithm.
● file structure - the “file tree”; the nested structure of folders and files.
● Nautilus - the GUI file manager for Ubuntu, like Windows “Explorer” or “Finder” on a Mac. BitCurator has added some back-end scripts to Nautilus to facilitate on-the-fly file analysis.
● feature directory - the folder where output is saved after the various functions have been run.
● SIP (Submission Information Package) - a set of files, including the file to be archived and its associated metadata, to be submitted to a digital repository.
Archival Workflow Part 1:
Forensic Disk Imaging
Mounting the source disk & navigating disk content
What does it do?
● Write-blocks disk images and devices attached to BitCurator (including USB drives and hard drives). Devices and images are mounted read-only, so the files and file structures they contain can be accessed without inadvertently changing the files or their metadata. This tool performs essentially the same function as a hardware write-blocker.
Why do we care?
● Blocks changes to attached devices so that the digital files are not altered before the archival disk images can be captured.
● Allows the resulting disk image to be examined without permitting changes to the files, the directory structure, or any metadata.
● Maintains the integrity and authenticity of archival materials during and after ingest.
● Real-world archival scenario: You’ve just received a collection containing old floppy disks. You want to image the disks (see next slide), but you want to make sure you don’t inadvertently change any access metadata when you use BitCurator to examine them.
*This is a Linux/Ubuntu tool that BitCurator employs, not specifically a BitCurator feature.
BitCurator Tool: Safe Mount
Creating a forensic disk image
What does it do?
● Captures bit-identical images from magnetic (floppy disks), optical (CDs), solid-state (SSDs and flash drives), and hybrid media. Disk images can be captured in various formats, including raw (just the bitstream), E01 (Expert Witness Format, the standard format), and AFF (Advanced Forensic Format). [1]
Why do we care?
● “Forensic images ensure that no inadvertent changes are made during pre-ingest chain-of-custody.” [1]
● Maintains the integrity of the collection by capturing the entire bitstream.
● Captures the entire file structure and all content (including data about deleted files) as it was on the original media, in a compressed or uncompressed format.
● Real-world archival scenario: You receive a UT faculty member’s 3GB flash drive. You want to get all of its content off the original device and into your digital repository while maintaining the integrity of the files, directories, and their metadata. Once the flash drive is safely mounted (read-only), your task is to create a disk image of the drive media, capturing its components completely and authentically.
[1] http://wiki.bitcurator.net/index.php?title=BitCurator_and_Archival_Workflows
BitCurator Tool: Guymager
Archival Workflow Part 2:
Forensic Processing
Data analysis & metadata extraction
What does it do?
● A function of the BitCurator Reporting Tool, fiwalk produces a DFXML (Digital Forensics XML) report on the contents of a disk image. Fiwalk analyzes the disk image and produces an XML file detailing the file system hierarchy within the disk image, including files and folders, deleted materials, and information in unallocated space. It also extracts metadata about each file, including date of last access and modification, file type, the user who created the file, file size, and the physical location of the file on the disk (byte run).
● This XML file will be used by the BitCurator Annotated Reporting tool to generate a human-readable report.
Why do we care?
● Creates a detailed map of the disk image, allowing the archivist and, eventually, researchers to gain an understanding of the physical and intellectual organization of the files.
● Extracts creation, access, and modification metadata from files without changing that metadata.
● Helps protect the privacy of the donor by ensuring files thought deleted are not accessed unexpectedly.
● Provides information about file types in newly acquired records so repositories can make decisions about file type support and access.
● Real-world archival scenario: You have received an external hard drive and several flash drives from a well-known author. After imaging the drives, you run fiwalk and discover several drafts of a work the author had thought deleted, which the author permits your archive to keep and add to the existing collection.
BitCurator Tool: fiwalk
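To make the idea of a DFXML "map" more concrete, here is a minimal Python sketch that lists the files described in a DFXML-style report. The sample XML is hand-written and heavily simplified, and the element names (fileobject, filename, filesize, alloc) are assumptions based on a general reading of DFXML; real fiwalk output carries namespaces and many more fields, so check an actual report before relying on this.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for a fiwalk report (not real data).
SAMPLE_DFXML = """<dfxml>
  <volume>
    <fileobject>
      <filename>drafts/chapter1.doc</filename>
      <filesize>24576</filesize>
      <alloc>1</alloc>
    </fileobject>
    <fileobject>
      <filename>drafts/old_notes.txt</filename>
      <filesize>512</filesize>
      <alloc>0</alloc>
    </fileobject>
  </volume>
</dfxml>
"""


def local_name(tag):
    """Strip any '{namespace}' prefix so the sketch works with or without one."""
    return tag.rsplit("}", 1)[-1]


def list_files(dfxml_text):
    """Return one dict per <fileobject> with name, size, and a deleted flag."""
    root = ET.fromstring(dfxml_text)
    files = []
    for element in root.iter():
        if local_name(element.tag) != "fileobject":
            continue
        record = {local_name(child.tag): (child.text or "").strip() for child in element}
        files.append({
            "filename": record.get("filename", ""),
            "filesize": int(record.get("filesize", "0")),
            # Assumption for this sample: alloc == "0" marks a deleted file.
            "deleted": record.get("alloc") == "0",
        })
    return files


for entry in list_files(SAMPLE_DFXML):
    status = "deleted" if entry["deleted"] else "present"
    print(f'{entry["filename"]}: {entry["filesize"]} bytes ({status})')
```

A listing like this is roughly the information the human-readable reports summarize: which files exist, how large they are, and which ones the donor may have thought were gone.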
Identifying sensitive information
What does it do?
● Scans files using a variety of scanners (PDF, image files, email, zip files, etc.) to search for potentially sensitive data, including geolocation metadata, email addresses, phone numbers, credit card numbers, and Social Security numbers. Also scans deleted files.
● Can be used to search for specific character strings (e.g. specific names or terms) across multiple files and file types.
Why do we care?
● Allows archivists to protect the materials' donor by discovering potentially sensitive information before making a disk public. Particularly important for finding potentially sensitive data in files thought to be deleted.
● Permits archivists to explore and appraise materials through searches for particular terms or names (e.g. correspondence from a specific email address, mention of a known individual, use of a particular term of art), resulting in much deeper, faster, and more detailed analysis than could be done by hand.
● Real-world archival scenario: Your repository is processing the collection of a state senator. You run Bulk Extractor to ensure no credit card or Social Security numbers remain in the collection’s digital files. To help with the creation of the Scope and Content note and other access points, you use Bulk Extractor to find how much material uses the term “university education.”
BitCurator Tool: Bulk Extractor
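The dependence on "expected format" is easier to see with a small example. The Python sketch below scans a raw byte string for phone-number-like and Social Security-number-like patterns and records the byte offset of each hit. It is only an illustration of pattern-based scanning; the patterns and the function name are our own assumptions, not Bulk Extractor's actual scanners, which are considerably more sophisticated.

```python
import re

# Illustrative patterns only; real scanners handle many more formats.
SSN_PATTERN = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")
PHONE_PATTERN = re.compile(rb"\b\d{3}-\d{3}-\d{4}\b")


def find_features(data):
    """Return (byte offset, feature type, matched text) tuples, sorted by offset."""
    features = []
    for label, pattern in (("ssn", SSN_PATTERN), ("phone", PHONE_PATTERN)):
        for match in pattern.finditer(data):
            features.append((match.start(), label, match.group().decode("ascii")))
    return sorted(features)


sample = b"Call 608-231-2196 or 608.231.2196; SSN 123-45-6789."
for offset, label, text in find_features(sample):
    print(offset, label, text)
```

Note that the second phone number, written with periods instead of hyphens, is not reported. A number in an unexpected format can slip past a scanner in the same way, which is why results should be treated as a starting point for review rather than a guarantee.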
Generating reports
BitCurator Tool: BitCurator Reporting Tool
What does it do?
● The Annotated Features tab matches the "features" found by bulk_extractor with their corresponding file on the disk image. This step is necessary because bulk_extractor locates features by scanning the bit stream, not the file system. The annotated features report acts as a bridge between the output from bulk_extractor and the DFXML report from fiwalk to create a report that not only locates a feature, but also identifies the specific file in which it can be found. [1]
● BitCurator Forensic Reports bring together the various outputs from bulk_extractor, fiwalk, and the annotation tool to generate both machine- and human-readable reports that can be read directly or crosswalked to other archival tools. [2]
● In addition to being able to run fiwalk, the annotation reports, and the BitCurator forensic reports individually, the BitCurator Reporting Tool allows the archivist to execute the entire process from the "Run All" tab. [3]
Why do we care?
● Creates human- and computer-readable reports that allow repository staff and software to understand and make use of information found and extracted by other BitCurator tools.
● Allows the repository to create and maintain information about the collection: metadata, directory structures, file types and contents, access restrictions. Also allows new reports to be generated as new information is required.
● Real-world archival scenario: After considerable analysis of a collection hard drive using the other forensics tools, you use the Reporting Tool to generate a report that provides information about the unrecognized file types present, the potentially sensitive information found in the deleted files, and a list of files with credit card and phone numbers that need to be redacted.
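The "bridge" between a feature's position in the bit stream and the file that contains it can be sketched in a few lines of Python. The example assumes we already have each file's byte run (start offset and length within the image) and a feature's byte offset. The data, tuple layout, and function names are hypothetical; this is an illustration of the lookup idea, not BitCurator's annotation code.

```python
from bisect import bisect_right

# Hypothetical byte runs: (start offset in image, length in bytes, filename).
BYTE_RUNS = [
    (4096, 2048, "letters/draft.doc"),
    (8192, 1024, "finance/receipt.txt"),
    (16384, 4096, "photos/img001.jpg"),
]


def build_index(byte_runs):
    """Sort runs by start offset and keep the starts in a parallel list for searching."""
    runs = sorted(byte_runs)
    starts = [run[0] for run in runs]
    return starts, runs


def file_for_offset(offset, index):
    """Return the filename whose byte run contains offset, or None if no run does."""
    starts, runs = index
    position = bisect_right(starts, offset) - 1
    if position < 0:
        return None  # offset lies before the first run
    start, length, filename = runs[position]
    if offset < start + length:
        return filename
    return None  # offset falls in a gap, e.g. unallocated space


index = build_index(BYTE_RUNS)
print(file_for_offset(8200, index))   # inside the receipt's byte run
print(file_for_offset(12000, index))  # between runs, so no file is reported
```

Offsets that fall outside every run correspond to features found in unallocated space, which is where remnants of deleted files tend to turn up.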
Archival Workflow Part 3:
Data Triage
Data Triage
What do they do?
● ClamAV / ClamTK: scan disk images and files for viruses and malware
● ssdeep / sdhash: create hash values/checksums to help determine whether files are similar or exact duplicates
● FSlint: identifies and deletes duplicate files
● Windows registry analysis (Python scripts): extracts the identity of the Windows user who created or modified files
Why do we care?
● Ensures no viruses are introduced to an archive’s digital repository.
● Hash values help differentiate between similar files and allow a file’s continued integrity to be tested.
● Duplicate identification and deletion can help repositories save storage space or flag less interesting files.
● Windows registry analysis provides more information about file creation in Windows environments.
● Real-world archival scenario: A newly arrived hard drive has several dozen files named “doc1.” After scanning the hard drive image for viruses, you use the sdhash tool to discover which files, if any, are unique and FSlint to delete the exact duplicates. Windows registry analysis helps you discover which of the hard drive’s two users created which files.
BitCurator Tools: Data Triage
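For the "doc1" scenario, exact duplicates can be found by comparing files by content rather than by name. The Python sketch below groups files first by size and then by SHA-256 digest. It is a simplified illustration of the idea, not how FSlint or sdhash work internally, and it reads each candidate file fully into memory, so it is only suitable for modestly sized files.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root):
    """Return lists of paths under root whose contents are byte-for-byte identical."""
    by_size = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a file with a unique size cannot have an exact duplicate
        by_digest = defaultdict(list)
        for path in same_size:
            # Reads the whole file at once; fine for a sketch, not for huge files.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
        duplicates.extend(group for group in by_digest.values() if len(group) > 1)
    return duplicates
```

Calling find_duplicates on an exported copy of the files returns groups of identical files for review. Deciding which copy to keep, if any are deleted at all, remains an appraisal decision for the archivist.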
Archival Workflow Part 4:
Metadata Export
Metadata export
What does it do?
● Exports technical metadata from DFXML files (created by fiwalk and other forensics tools) to common preservation and archival metadata formats, which can be incorporated into other stable standards, like METS and EAD.
● BitCurator generates PREMIS (preservation) metadata for each data forensics tool that is used on a disk image, providing an accurate record of provenance for each stage of processing.
Why do we care?
● Allows data collected or created by the other forensics tools to be exported and used by other archival software and to be incorporated into collection documentation (inventories and finding aids).
● Creates documentation of all forensics tools used by the archivist, maintaining a record of provenance for the collection and ensuring the collection’s integrity.
● Real-world archival scenario: You have just finished processing a hybrid collection, using BitCurator to process the digital materials. You use the metadata export feature to add details about file times and the file structure to your EAD finding aid.
BitCurator Tools: Export Tools
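As a rough picture of what a PREMIS record of one processing step might contain, the Python sketch below builds a single PREMIS-style event element. The element names and namespace follow our understanding of the PREMIS version 2 schema; the identifier, event type, and detail text are made-up examples, and BitCurator's actual PREMIS output may be structured differently.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace assumed from the PREMIS version 2 schema.
PREMIS_NS = "info:lc/xmlschema/premis-v2"
ET.register_namespace("premis", PREMIS_NS)


def premis_tag(name):
    """Qualify an element name with the PREMIS namespace."""
    return f"{{{PREMIS_NS}}}{name}"


def premis_event(event_id, event_type, detail, outcome, when=None):
    """Build one PREMIS-style <event> element describing a processing step."""
    when = when or datetime.now(timezone.utc)
    event = ET.Element(premis_tag("event"))
    identifier = ET.SubElement(event, premis_tag("eventIdentifier"))
    ET.SubElement(identifier, premis_tag("eventIdentifierType")).text = "local"
    ET.SubElement(identifier, premis_tag("eventIdentifierValue")).text = event_id
    ET.SubElement(event, premis_tag("eventType")).text = event_type
    ET.SubElement(event, premis_tag("eventDateTime")).text = when.isoformat()
    ET.SubElement(event, premis_tag("eventDetail")).text = detail
    outcome_info = ET.SubElement(event, premis_tag("eventOutcomeInformation"))
    ET.SubElement(outcome_info, premis_tag("eventOutcome")).text = outcome
    return event


# Hypothetical values for illustration only.
event = premis_event(
    "event-0001",
    "disk imaging",
    "Disk image captured with Guymager",
    "success",
)
print(ET.tostring(event, encoding="unicode"))
```

One such event per tool run, kept alongside the disk image, is the kind of provenance trail the export feature is meant to provide.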
Walkthrough
NOTE:
If you leave BitCurator inactive for a period of time, it will lock. Remember the password is:
bcadmin
And if something does not work...
Scenario for our walk-through
You are the UT University Archivist. Dr. Lee, a retiring faculty member and donor, has just cleared out her office and hands you a box full of the CD-ROMs she found in the process. None are externally labeled. She’s not sure if they’re all related to her research or if personal materials have found their way onto some of them. You need to:
- View specific contents of the disks without compromising their archival integrity.
- Capture and preserve the disk contents.
- Determine if there is personal material that needs to be redacted.
- Generate and extract appropriate metadata about file content/structure and the preservation process itself.
Here’s how we’ll use BitCurator to do that...
Creating a Disk Image
1.) Open file manager.
2.) Right-click on the white part of the browser and select “Create New Folder” to create a folder for your disk image.
3.) Name the folder for your disk image.
Access the CD-ROM drive
This will allow you to use CD-ROMs in BitCurator.
This is a VirtualBox function.
Safe Mount a Drive
1. Double-click the glowing green drive at top.
2. Check the appropriate device, in this case the CD.
3. Click OK to safe mount the device.
This will make the contents of the CD explorable in Nautilus.
4.) From the Desktop, open the “Imaging Tools” folder.
5.) Open Guymager.
6.) In Guymager, locate the disk you want to image from the list of devices.
7.) Right-click the desired disk and select “Acquire image”.
8.) Specify the file format you would like the image to be captured in, the location in which you would like to save the image (destination directory), and the filename you would like to give to the image.
9.) Additionally, select the checksum verifications you want to create for the image.
10.) Click Start.
Optional: additional metadata in Expert Witness Format.
11.) The State column will indicate whether imaging is “running” (i.e. in progress) and when it is finished.
12.) Once Guymager is done imaging the media, navigate to the destination directory you specified and you’ll find your image.
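The checksum selected in step 9 can be used later to confirm that the image file has not changed. Below is a minimal Python sketch of such a fixity check, reading the image in chunks so that large files do not need to fit in memory. The function names are our own, and the expected value is assumed to be the MD5 recorded at imaging time; this is an illustration of the principle rather than a description of Guymager's own verification.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 digest of a file, reading it one chunk at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_image(path, expected_md5):
    """Return True if the file's current MD5 matches the recorded value."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

If verify_image returns False for an image that previously matched, the file has been altered or corrupted since it was captured and should be investigated before further processing.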
Using Bulk Extractor
Running a report on the materials to find potentially sensitive material
1.) Create a folder to hold the data generated from the bulk extractor. Let’s call this folder “test_data”.
2.) Next, navigate to the Bulk Extractor Viewer, which is located in the “Forensics Tools” folder on the Desktop.
3.) In the Bulk Extractor Viewer, go to Tools → “Run bulk_extractor” or Ctrl+R to run the bulk extractor.
4.) Specify the disk image file on which to run the bulk_extractor.
5.) Specify the directory location in which to save the bulk_extractor data (“Output Feature Directory”).
6.) In the Scanners section, select the types of objects (or “features,” as BitCurator calls them) that you want Bulk Extractor to search for.
BitCurator has a chart explaining what these scanners look for in more detail, but a few that are important for us include:
- The “accts” scanner looks for formatted information such as credit card numbers, Social Security numbers, and phone numbers.
- You can also search for email addresses and URLs found in the data, for EXIF data from images, or within PDFs.
7.) Click “Submit Run” to run the report.
8.) Once the bulk_extractor scan is complete, you can view the output of the report in the Reports section of the Viewer window.
We will explain what these reports mean later in the presentation.
Using fiwalk
Extracting metadata about the files
1.) Open the “Forensics Tools” folder on the desktop and select the “BitCurator Reporting Tool”.
2.) Select the “Fiwalk XML” tab.
3.) Navigate to the path location of the disk image you want analyzed.
4.) Select the disk image file you want to analyze.
5.) Navigate to the path location in which to save the fiwalk DFXML report.
6.) Create a name for the XML document that fiwalk will create.
7.) Click “Run” to run fiwalk and generate a DFXML file describing the directories and files contained within the disk image.
8.) Once fiwalk has finished analyzing the disk image, the command line output will display either an error message or a success message.
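For users who prefer the command line, the GUI steps in the Bulk Extractor and fiwalk walkthroughs correspond roughly to two invocations. The Python sketch below only builds the argument lists. The flags shown (-o and -e for bulk_extractor, -X for fiwalk) reflect our understanding of those tools' command-line help and should be confirmed with each tool's -h output; the file and folder names are placeholders.

```python
import shlex


def bulk_extractor_command(image_path, output_dir, scanners=()):
    """Build a bulk_extractor command line (flags assumed; confirm with -h)."""
    command = ["bulk_extractor", "-o", output_dir]  # -o: output feature directory
    for scanner in scanners:
        command += ["-e", scanner]  # -e: enable a named scanner
    command.append(image_path)
    return command


def fiwalk_command(image_path, dfxml_path):
    """Build a fiwalk command line (flag assumed; confirm with -h)."""
    return ["fiwalk", "-X", dfxml_path, image_path]  # -X: write DFXML to a file


# Placeholder names for illustration only.
print(shlex.join(bulk_extractor_command("disk.E01", "test_data/be_out", ["accts", "email"])))
print(shlex.join(fiwalk_command("disk.E01", "test_data/fiwalk.xml")))
```

On a BitCurator machine, each list could be passed to subprocess.run(command, check=True) to execute the tool, which makes it possible to script the same processing for a whole box of disks.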
Annotated Features
Linking Metadata to Locations on Disk
1.) In the BitCurator Reporting Tool, select “Annotated Features” tab.
2.) Type in or navigate to the location of the disk image you want to run the Annotated Features report on.
3.) Type in or navigate to the location in which you saved the bulk extractor results.
4.) Create a new folder for the Annotated Features output reports.
5.) In the Annotated Features tab of the BitCurator Reporting Tool, specify the folder you created for the Annotated Features output reports.
6.) Click “Run”.
7.) Once finished, the command line output will display either an error message or success message.
Generating human-readable reports
The “Reports” feature of the BitCurator Reporting Tool combines the output from bulk_extractor, fiwalk, and Annotated Features to generate human-readable reports in PDF format.
1.) Open the BitCurator Reporting Tool.
2.) Enter the path location of the DFXML file generated by fiwalk.
3.) Enter the path location of the directory holding the Annotated Features files.
4.) Enter an output location in which to save the PDF reports.
5.) Click Run and wait for a success message in the command line output.
Reading Human-readable Reports
● bc_format_bargraph.pdf
○ Displays files by type (abbreviated) and quantity in bar graph form
● format_table.pdf
○ Displays the bar graph information with full file type names
● FiwalkDeletedFiles.pdf
○ Displays files that would be found in the “Trash”
● Fiwalk.xlsx
○ Displays the fiwalk report in xlsx for metadata encoding
● FiwalkReport.pdf
○ Displays a summary chart of the fiwalk xlsx report
● BeReport.pdf
○ Displays the allocated and unallocated files
● Features Reports
○ exif.xlsx
■ Displays Exchangeable Image File Format (EXIF) information
○ hex.xlsx
■ Displays the hexadecimal identifier for the file
○ domain.xlsx
■ Displays web domains found in allocated and unallocated files
○ email.xlsx
■ Displays email addresses found in allocated and unallocated files
This scenario taught us how to:
❏ Safely mount and capture a disk image to view its metadata.
❏ Identify potentially sensitive materials (“features”) in the bit stream (Bulk Extractor).
❏ Extract metadata about the files on the disk (fiwalk).
❏ Link the location of features found in the bit stream to their location in the file structure (Annotated Features).
❏ Create human-readable reports to describe the file system.
Your turn!
Divide into three to four groups and take one of the provided CDs.
Your task is to:
- Safely mount the disk and create a disk image.
- Run Bulk Extractor.
- Run fiwalk and Annotated Features, and generate Reports on the file structure.
- Find the location of a file or files with personal information, like a phone number.
We’ll reconvene as a class in 15 minutes to share your experience with BitCurator’s challenges and affordances.
How’d you do?
Short cut...
Simply go to ‘Run All’ under ‘Reports’.
● ‘Image file’ is the disk image created in the first step.
● ‘Bulk Extractor’ is the output generated in the second step.
● Create an ‘Output’ folder and name the document.
● Click Run to generate all reports.
Evaluation
Pros
● Open source (i.e. FREE)
● Still in active development, so constantly improving
● Wide range of support options (wiki, forums, digital forensics community)
● Useful for a variety of media, such as CDs, DVDs, floppies, flash memory, and hard drives
● Requires no specialized forensics equipment
● Generates reports that provide detailed information about files and their metadata
● Good entry point for users with low to medium computer expertise, while providing more functionality and customizability for those with more expertise
● Low learning curve for essential functions
● Easier than putting together all of the individual tools oneself
Evaluation
Cons
● Employs an aggregation of independently developed tools, so it can be difficult to know when particular features need updating.
● GUIs not yet created for the full suite of tools.
● Archives need staff who are willing and able to install, configure, and maintain the software. They will probably need in-house programming/IT staff to benefit from the full range of potential functions, which currently require competency with the command line interface and scripting languages like Python.
● Is not an infrastructure/solution for end-user (patron) access.
● Not yet designed to deal with digital files in or from the cloud.
Implications for Archives
● Allows for faster and more efficient ingest of digital media in archives
● Reports generated by BitCurator may be used in a variety of environments, including other archive management software
● Allows an archivist to avoid fragmenting hybrid collections
Resources
BitCurator wiki
http://wiki.bitcurator.net/
Quickstart Guide
http://wiki.bitcurator.net/downloads/BitCurator-Quickstart-v0.8.0.pdf
How do these tools address archival concerns?
http://wiki.bitcurator.net/index.php?title=How_do_these_tools_address_archival_concerns%3F
Archival Workflows for BitCurator
http://wiki.bitcurator.net/index.php?title=BitCurator_and_Archival_Workflows
Screencast Tutorials
http://wiki.bitcurator.net/index.php?title=Screencast_Tutorials
BitCurator white paper
http://sils.unc.edu/news/2013/bitcurator-white-paper
Questions?