YOUTRACE: A SMARTPHONE SYSTEM FOR TRACKING
VIDEO MODIFICATIONS
By
ELIJAH JS HOULE
A thesis submitted in partial fulfillment of the requirements for the degree of
MASTER OF SCIENCE IN COMPUTER SCIENCE
WASHINGTON STATE UNIVERSITY
School of Engineering and Computer Science, Vancouver
MAY 2015
To the Faculty of Washington State University:
The members of the Committee appointed to examine the thesis of
ELIJAH JS HOULE find it satisfactory and recommend that
it be accepted.
Scott Wallace, Ph.D., Chair
Xinghui Zhao, Ph.D.
Sarah Mocas, Ph.D.
ACKNOWLEDGMENTS
I would like to thank both former and present faculty for their support and guidance
throughout my pursuit of this program, including Dr. Thanh Dang, Dr. Scott Wallace, Dr.
Xinghui Zhao, Dr. Sarah Mocas, and Dr. David Chiu.
I would also like to thank the friends and family who have encouraged me, especially my
girlfriend Mahal and our feline companion Tessa, for helping me to manage my time and to
balance work with fun.
Lastly, I would like to thank you, the reader, for even looking at this thesis. I truly hope
that you gain something from it in return and that your curiosity never ends.
YOUTRACE: A SMARTPHONE SYSTEM FOR TRACKING
VIDEO MODIFICATIONS
Abstract
by Elijah JS Houle, M.S.
Washington State University
MAY 2015
Chair: Scott Wallace
As smartphone cameras and processors grow more capable, content creators increasingly
use them to both record and edit videos. On hosting sites, the lack of information
about how an upload relates to the original recording complicates the question of whether to
trust the content, especially for citizen journalism programs like CNN’s iReport. This thesis
introduces YouTrace, a system consisting of a trusted Android client (IntegriDroid) that
tracks modifications made to videos recorded on the smartphone, and a hosting server that
maintains lineage trees for near-duplicate videos from both trusted and untrusted sources.
YouTrace analyzes videos in a non-blind fashion, with the core algorithm, called video-
diff, comparing a parent and child video, and then reporting the transformations used to
produce the child through a structure called a delta-report. The comparison algorithm and
the report structure are capable of detecting and recording the type and degree of temporal
modifications to clips, such as scaling (stretching/shrinking duration) and trimming, as well
as spatial modifications to frames, such as scaling, cropping, bordering, color adjustment, and
content tampering. We implement the IntegriDroid client prototype on a Galaxy Nexus by
building on TaintDroid’s file tracking and porting an emulated Trusted Platform Module onto
Android 4.3. Evaluation of detection accuracy, speed, and power consumption demonstrates
the feasibility for services to utilize a future system built on YouTrace to determine content
integrity.
TABLE OF CONTENTS
ACKNOWLEDGMENTS iii
ABSTRACT iv
TABLE OF CONTENTS viii
LIST OF TABLES ix
LIST OF FIGURES xi
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Organizing a growing source of traffic . . . . . . . . . . . . . . . . . . 1
1.1.2 Building integrity to trust content . . . . . . . . . . . . . . . . . . . . 2
1.2 Use Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.7 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Related Works 8
2.1 Video Integrity Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2 Watermarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.3 Fingerprinting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.4 Classification and evaluation . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Trusted Sensing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Tracing Lineage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 Data provenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 System Architecture 15
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Certificate Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.1 Trusted configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.2 Delta-report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Server Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.1 Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 Client Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4.1 IntegriDroid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.5 video-diff — Comparing Original and Derived Videos . . . . . . . . . . . . . 24
4 Evaluation 29
4.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.1 Spatial scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1.2 Spatial cropping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.3 Block modification (general content tampering) . . . . . . . . . . . . 31
4.1.4 Color adjustment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.5 Border change . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.6 Temporal scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1.7 Temporal cropping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.1 Concurrency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 IntegriDroid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3.1 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.2 Power consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.3 Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5 Conclusion 41
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Appendices 43
A Classification and Evaluation of Video Authentication Schemes 43
Bibliography 51
List of Tables
2.1 Taxonomy for video watermarking techniques. . . . . . . . . . . . . . . . . . 8
2.2 Taxonomy of authentication schemes for online video sharing. . . . . . . . . 9
2.3 Ratings for selected video authentication schemes. . . . . . . . . . . . . . . . 11
4.1 T-tests for F-ratio samples of different color components among “lighter”,
“darker”, and “normal” (no filter, border change only) color curve presets. . 34
A.1 Classification of video authentication schemes. . . . . . . . . . . . . . . . . . 44
A.2 Evaluation of video authentication schemes. . . . . . . . . . . . . . . . . . . 47
List of Figures
1.1 Global IP traffic by application category. The Cisco Visual Networking Index
(VNI) forecasts Internet video to become the majority of traffic by 2018. “The
percentages within parentheses next to the legend denote the relative traffic
shares in 2013 and 2018, respectively.” [1]. . . . . . . . . . . . . . . . . . . . 3
3.1 Design of the overall architecture with both trusted and untrusted clients. . . 15
3.2 Delta-report structure for recording modifications. . . . . . . . . . . . . . . . 19
3.3 Media player with TraceBack button added, which links to the lineage tree
for the given video. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4 Examples of TraceBack lineage trees generated and stored on the server, trac-
ing lineage for the bottom videos going upward through the parents (denoted
by IDs). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5 Video upload to server from an untrusted (non-IntegriDroid) client. The
server matches the upload to a verified video if possible and describes the
modifications made. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.6 Design of the IntegriDroid client. . . . . . . . . . . . . . . . . . . . . . . . . 24
3.7 Video alignment of a temporally scaled sequence. . . . . . . . . . . . . . . . 25
4.1 video-diff accuracy in detecting spatial scaling by 2/3 and 3/2. . . . . . . . . 30
4.2 video-diff accuracy in detecting degree of spatial cropping with 50 and 100
pixels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Logo inserted into videos to evaluate block modification detection. Copy-
righted by Larry Ewing, Simon Budig, Anja Gerwinski. [2] . . . . . . . . . . 32
4.4 video-diff accuracy in detecting temporal scaling by 1/2, 2/3, 1, 10/7, and 2. 35
4.5 Running time versus number of threads for video-diff's average luminance loop. 37
4.6 Running time versus number of threads for video-diff's distance matrix
computation on longer videos (large matrices). . . . . . . . . . . . . . . . . . 38
Dedication
This thesis is dedicated to Mahal, for her constant motivation, inspiration, and patience
while I carried out my research.
Chapter 1
Introduction
1.1 Motivation
As video becomes an increasingly important form of media, with major decisions being made
based on video content, the following two issues become clear:
• Many videos on hosting sites consist of near-duplicates, complicating the ability to
quickly distinguish among different versions of a video.
• When presented with a video, users and services cannot attest to how it has been
edited since it was recorded, challenging the decision of whether to trust its content.
These issues motivate the work, as presented in the following subsections.
1.1.1 Organizing a growing source of traffic
With the present popularity of online sharing platforms such as YouTube, video has become
a predominant source of information and entertainment on the web. Even excluding those
shared on peer-to-peer networks, video made up 66% of global IP traffic in 2013, a percentage
that continues to grow at a rapid rate [3] (Figure 1.1). YouTube itself receives 100 hours of
video every minute [4]. However, 27% of the results for popular queries are near-duplicates [5],
videos that share most of their content with the most popular result, implying that some of
these uploads are redundant and would benefit from an organization scheme.
Major hosting sites already integrate near-duplicate detection into their systems. For
example, YouTube’s Content ID matches videos and live streams with a database of copy-
righted files to automate copyright infringement claims [6]. Recommendation systems likely
also use near-duplicate detection to filter out close matches, e.g., to avoid showing that a
video is a “related video” to itself. However, these sites would benefit from technology that
builds lineage trees, showing how a given video differs from its near-duplicates and, for un-
original content, what transformations the creator used on the original to produce it. Users
would have an interface that allows them to find original videos for popular content and to
distinguish them from derived videos. For example, a user may stumble upon a clip and want
to find the original recording to see the context or reference it. Proposed lineage detection
systems from the literature, along with their weaknesses, are discussed in Chapter 2.
1.1.2 Building integrity to trust content
In addition, user-generated content has become powerfully influential as smartphones have
become ubiquitous. Users can easily record and upload from anywhere, facilitating citi-
zen journalism with videos shared over social media. CNN takes advantage of this with
their iReport initiative, by crowdsourcing stories that may be difficult for reporters to cover
comprehensively, especially unexpected events [7]. However, only some stories are approved
for CNN, which requires manual verification. Most videos are not verified, which weakens
the usefulness of the service, and those that are verified are done so by humans, who may
be tricked by clever, malicious modifications. This and other citizen journalism programs
Figure 1.1: Global IP traffic by application category. The Cisco Visual Networking Index
(VNI) forecasts Internet video to become the majority of traffic by 2018. “The percentages
within parentheses next to the legend denote the relative traffic shares in 2013 and 2018,
respectively.” [1].
would benefit from an automatic video integrity mechanism to strengthen the utility and
trustworthiness of the service.
On any hosting service, users tend to make benign changes to recordings of events before
uploading them. For example, users may want to share an event they witnessed, after
stripping irrelevant content, piecing together clips, or anonymizing faces and objects by
covering them through blurring, blocking, or pixelation. Traditional integrity mechanisms
do not support benign processing because it changes the video file at a low level. Even
mechanisms made to be robust to some forms of processing such as compression (discussed
in detail in Chapter 2) tend not to support intentional, high-level changes where the meaning
of the content is still preserved but the video structure changes, and they also cannot report
the specific modifications made. People would benefit from a system that lists modifications
that have been made to a video since it was recorded, in order to inform their decision on
how much to trust the content.
1.2 Use Case
The increasing ubiquity, camera quality, and video editing power of smartphones lead to a
common pattern in content creation:
• On smartphone:
– User U1 records and saves original video V .
– User U1 uses some app to edit video V and export as a new video V ′.
– User U1 uploads resulting video V ′ to hosting server.
• From server, some other user U2 downloads video V ′.
• User U2 makes some changes to the video V ′ and exports it as a new video V ′′.
• User U2 uploads resulting video V ′′ to hosting server.
We have two problems here. First, how can we track the transformations mapping
V → V ′ on the smartphone, so that we can decide whether to trust the content of V ′ as
genuine? (This assumes that the original recording V is trusted, because “analog” attacks, as
in non-digital special effects or event fabrication, are outside the scope of this work.) Second,
how can we both trace V ′′ back to V ′ and track the transformations mapping V ′ → V ′′?
This thesis proposes possible solutions to both questions, by building a tracking framework
for smartphones on Android software and trusted hardware, and then by extending the
framework to the server by combining the core modification tracking component with near-
duplicate detection.
1.3 Problem Statement
This thesis addresses the following problem:
Although users and services would benefit from access to the modification history of a
video, the components necessary for tracking modifications currently suffer from restrictive
assumptions and have not been applied to video.
1.4 Thesis Statement
Existing systems make restrictive assumptions, requiring all clients to be trusted or
considering only an unrealistic set of subtle transformations. As a result, they cannot be
applied directly to real domains where content creators apply a series of spatial and temporal
transformations, on a variety of devices, to produce a video. Therefore, the following
statement guides this thesis:
Combining trusted sensing with computer vision approaches, and applying them to real-
istic video modifications, could address the shortcomings of each while enabling a complete
video tracking framework.
1.5 Challenges
Our goal is to design and develop a video tracking framework to verify our thesis. To achieve
this, we need to address the following challenges.
• Android native tracking — To lower the barrier to deployment and let the smart-
phone user create/edit videos with any app, traditional data tracking will not work.
Android taint tracking typically uses the Dalvik Virtual Machine, which can only track
Java methods. We need to track how information propagates through native code as
well. Section 3.4.1 discusses our proposed solution through vidtracker, an Android file
system monitor that approximates video dependencies.
• Modification report structure — As video represents both spatial and temporal
information, a report of modifications needs to capture both types of changes. For our
purpose, it needs to adequately and concisely describe the degree of modification in a
human-readable manner, without explicitly holding each version of the data. This way,
other parties can use the report to decide whether they trust that the video’s content
has been preserved. We propose the delta-report, discussed in Section 3.2.2.
• Accuracy and efficiency — The main barrier to deployment is whether a
system that tracks modifications can do so accurately while running with
low latency on the target device. In Chapter 4, we evaluate our core video-diff code
with these criteria in mind.
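To make the delta-report challenge above concrete, the following sketch shows one possible shape for such a report (the field names here are illustrative, not the exact schema of Section 3.2.2): an ordered list of detected transformations, each recording its domain, type, and degree, without storing any intermediate version of the video.

```python
# Hypothetical delta-report sketch (illustrative names, not the actual
# schema): an ordered list of detected transformations, each with its
# domain, type, and degree, with no intermediate copies of the video.
from dataclasses import dataclass, field

@dataclass
class Modification:
    domain: str   # "spatial" or "temporal"
    kind: str     # e.g. "scale", "crop", "border", "color", "block"
    degree: str   # human-readable degree of the change

@dataclass
class DeltaReport:
    parent_id: str                          # video this one derives from
    child_id: str
    modifications: list = field(default_factory=list)

    def summary(self) -> str:
        """Concise, human-readable description of all recorded changes."""
        lines = [f"{self.parent_id} -> {self.child_id}:"]
        for m in self.modifications:
            lines.append(f"  [{m.domain}] {m.kind}: {m.degree}")
        return "\n".join(lines)

report = DeltaReport("V", "V'")
report.modifications.append(Modification("temporal", "trim", "removed first 5 s"))
report.modifications.append(Modification("spatial", "scale", "resized by 2/3"))
print(report.summary())
```

Because the report holds only descriptions of changes, other parties can read it to judge whether the content's meaning survived, without downloading every intermediate version.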
1.6 Contributions
This thesis proposes YouTrace, a full video tracking framework that takes advantage of both
trusted sensing and computer vision techniques. YouTrace makes it possible for users to
trace the history of a video found online back to the device that recorded it. This has
applications in journalism, social media and trend analysis, and forensics and law
enforcement, among others. Specifically, the YouTrace system consists of:
• A fully functioning tracking framework including a server and smartphone client.
• Data structure (delta-report) and algorithm (video-diff) for tracking modifications.
• Evaluation of the framework with real video data from YouTube.
We evaluate our system quantitatively by considering accuracy, speed, and power con-
sumption, and qualitatively by analyzing the attack architecture.
1.7 Thesis Outline
The remainder of the thesis is organized as follows. Chapter 2 summarizes related efforts
in tracking and organizing data to provide integrity and lineage. Chapter 3 presents an
overview of YouTrace’s architecture. Chapter 4 evaluates the system. Chapter 5 summarizes
and concludes this work. Appendix A contains a survey of video authentication schemes.
Chapter 2
Related Works
2.1 Video Integrity Mechanisms
Authentication schemes for video have been explored extensively in the literature, generally
as an extension of image integrity. Unlike traditional file integrity mechanisms (such as
checksums), video authentication needs to compensate for transcoding transformations that
preserve the high-level meaning but alter the low-level bits, such as scaling and compression.
Depending on the scheme, it may treat frame dropping as benign, due to packet loss, or
malicious, due to temporal tampering.
2.1.1 Taxonomy
Generation: Content-Based, Content-Independent
Embedding: Imperceptible (Invisible), Perceptible (Visible)
Robustness: Semi-Fragile, Fragile, Robust
Detection: Blind (Oblivious), Semi-Blind, Non-Blind
Table 2.1: Taxonomy for video watermarking techniques.
Two main paradigms exist for authentication techniques: watermarking and fingerprinting
(or perceptual hashing). Both types of techniques similarly extract some set of features
from the video. They differ in that watermarking embeds information based on the features
back into the video content by modifying it imperceptibly, while fingerprinting keeps this
information as a signature alongside the video (often storing it into a database like a hash
to preserve the content) [8].
Watermarking can be classified by four dimensions, shown in Table 2.1. Watermarking
schemes mainly differ in whether generating the watermark depends on the video content,
whether they are embedded visibly, how well they withstand content transformations, and
whether detecting the watermark requires the original content.
Table 2.2 shows a general taxonomy that we propose for video authentication schemes.
In general, content integrity involves the extraction, encoding, and transmission of features
that describe the content.
Transmission Paradigm: Watermarking (Embedded), Fingerprinting (Streamed)
Feature Extraction: Pixel Color Values, Pixel Luminance Values, Transform Coefficients, Key Frames
Feature Compression/Encoding: Quantization, Error Correction Coding, Cryptographic Hashing (MD5, SHA), Cloud Drops, Variable Length Code
Table 2.2: Taxonomy of authentication schemes for online video sharing.
Table 2.2: Taxonomy of authentication schemes for online video sharing.
2.1.2 Watermarking
Fragile watermarks verify the integrity of video content by breaking under modification.
However, because any modification can break them, they are not robust enough for online
video sharing. On the other hand, robust watermarks are made to survive attacks, and
so they generally do not offer tampering detection except in extreme cases of tampering.
In contrast, semi-fragile watermarks break under malicious transformations while remaining
robust to acceptable modifications. Likewise, content-independent watermarks are too
insecure for this application, being easily extracted and forged [9].
Watermarking techniques also vary on whether the original video data is necessary to
detect the watermark. A watermark that one can detect without the original data is con-
sidered blind or oblivious. If the values of parameters used to generate the watermark are
necessary, then the detection is semi-blind. Otherwise, if the source data is necessary, then
the watermark detection is non-blind.
One way to track modifications to video content would be to embed a watermark into the
original recording, and then check whether it has changed upon upload. Some watermarking
schemes are capable of localizing tampering, based on how the watermark is broken in the
tampered version. However, many of these schemes only authenticate I-frames or keyframes
(usually DCT/DFT coefficients or luminance values), and so they can only detect spatial
tampering [10; 11; 12]. Those that can detect temporal tampering use motion vectors or
shots as features, which limits robustness [13].
Because we assume that, in our system, a certificate is generated for each video, and
we are mostly concerned with the series of transformations that produce a video, we do
not use watermarking. By treating the video features themselves as the watermark and not
embedding one, we do not have to worry about losing track of features when a watermark
is broken.
2.1.3 Fingerprinting
Cryptographic hashing algorithms such as SHA generate a fixed-size digest from arbitrary
input, where small changes in the data result in drastically different hashes. In contrast,
video hashing algorithms allow small changes to the content to result in similar feature
vectors. Because of the inconsistent nomenclature, they are sometimes called “robust hashes”,
“soft hashes”, or “passive fingerprints” by different researchers. A video’s digital signature
refers to a perceptual hash encrypted with the signer’s private key [14].
We make use of fingerprinting in our system by extracting features from both original and
derived videos, and comparing the features to discover the series of transformations between
them (discussed more in detail in Chapter 3).
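The fingerprinting idea can be illustrated with a minimal sketch. This is not the exact feature set used by video-diff; it simply reduces each frame to its average luminance (a feature the system does compute, per Chapter 4) so that a whole video becomes a short vector that tolerates benign re-encoding, and compares two such vectors.

```python
# Minimal fingerprinting sketch (illustrative, not video-diff's actual
# features): each frame collapses to its mean luma, and two equal-length
# fingerprints are compared by mean absolute difference.
def frame_luminance(frame):
    """Mean luma of one frame, given rows of (R, G, B) pixels (ITU-R BT.601 weights)."""
    pixels = [px for row in frame for px in row]
    luma = [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels]
    return sum(luma) / len(luma)

def fingerprint(frames):
    """One luminance value per frame: the video's feature vector."""
    return [frame_luminance(f) for f in frames]

def distance(fp_a, fp_b):
    """Mean absolute difference between two equal-length fingerprints."""
    assert len(fp_a) == len(fp_b)
    return sum(abs(a - b) for a, b in zip(fp_a, fp_b)) / len(fp_a)

# Two tiny 2x2-pixel "videos": identical except the second's first frame is darkened.
white = (255, 255, 255)
dark = (10, 10, 10)
black = (0, 0, 0)
video_a = [[[dark, dark], [dark, dark]], [[white, white], [white, white]]]
video_b = [[[black, black], [black, black]], [[white, white], [white, white]]]
print(distance(fingerprint(video_a), fingerprint(video_b)))  # small but nonzero
```

A slightly modified video yields a nearby fingerprint, while a cryptographic hash of the same two files would differ completely; this is exactly the property that lets the comparison step estimate the transformations between parent and child.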
• Watermarking based on DCT DC coefficients of 4x4 subblocks of I-frame macroblocks [10]: score 12
• Watermarking based on compressive sensing, red and blue values of I-frames [11]: score 9 (robustness not measured)
• Watermarking based on applying error correction coding (ECC) to encode angular radial transformation (ART) coefficients [12]: score 12 (but ECC can be exploited)
• Watermarking based on cloud model, generated by DCT energy distribution of I-frames in shots [13]: score 9
• Watermarking based on combining robust and fragile watermarks [15]: score 8
• Fingerprinting based on radial projection of key frame pixels, denoted Radial hASHing (RASH), the variance of pixel luminance values along evenly spaced lines articulated around the center of each keyframe between 0 and 180 degrees [16]: score 10
• Fingerprinting based on MD5 hash of features obtained from DCT coefficients and a secret key for each block of each frame [17]: score 12
• Fingerprinting based on cryptographic secret sharing, with an aggregation of keyframes from each shot used as the master secret [8]: score 12
• Hybrid scheme based on “content-fragile watermarking”, edge characteristics for each frame [18]: score 8
• Hybrid scheme based on combining fragile watermarking [19] with digital signature [20], hash of transform coefficients used as watermark [21]: score 6
Table 2.3: Ratings for selected video authentication schemes.
2.1.4 Classification and evaluation
Appendix A contains classification and evaluation of selected watermarking and fingerprinting
mechanisms from the literature. In Table 2.3, we summarize the results by assigning a
rating to each based on how well it claims to perform on each of four criteria (Tampering
Detection, Tampering Localization, Geometric/Transcoding Robustness, and Loss Tolerance).
For each criterion, the scheme can receive a rating of 1, 2, or 3 (for Low, Medium, or High),
giving a total score out of 12.
2.2 Trusted Sensing
The authentication schemes above assume that the sender (uploader) is already trusted:
they ensure that the receiver obtains the video exactly as the sender transmitted it, whether
or not the sender modified it first. To prevent the sender from manipulating the video before
sending it, systems must leverage trusted hardware.
Gilbert et al. [22] demonstrated a system called “YouProve” for preserving the meaning
of photo and audio content recorded and uploaded from Android smartphones. Using the
TaintDroid framework, YouProve tracks information as it flows through applications and
records modifications. A Trusted Platform Module is used to attest to a hosting service that
the phone is running a trusted configuration with a report of the changes made to content.
However, this system has not been extended to video. In addition, although the system is
built on top of Android (the most popular mobile operating system) to lower the barrier
to deployment, the hosting service relies on the clients running a trusted configuration.
Our system leverages information from both trusted and untrusted clients to build a more
complete picture, further lowering the barrier to deployment.
2.3 Tracing Lineage
Video lineage is related to the problem of near-duplicate detection [5], which extends the
corresponding problem for images [23]. However, tracing lineage also requires identifying
parent-child relationships among an item’s near-duplicates, building a directed graph that
traces the item’s history. The problem of lineage, also called phylogeny or archaeology,
currently has three main solutions (applied first to images but with possible extension to
video):
Given a set of near-duplicates:
1. Process item pairs using a specialized detector for each transformation (e.g., scaling,
cropping), each outputting the direction of derivation. A consensus of results implies
a parent-child relationship [24].
2. For each pair of items, detect a dependency by computing the mutual information
between both their content and their noise [25].
3. Calculate a dissimilarity/distance matrix and use it to build the tree (phylogeny) by
graph-theoretic algorithms [26; 27; 28; 29].
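Solution 3 can be sketched with a toy example. Given a (possibly asymmetric) dissimilarity matrix where d[i][j] approximates the cost of deriving item j from item i, a tree can be grown greedily from an assumed root by always attaching the cheapest remaining item to something already in the tree. This Prim-style heuristic stands in for the more sophisticated minimum-spanning-arborescence algorithms used in the cited works.

```python
# Sketch of Solution 3: build a lineage tree from a dissimilarity matrix.
# d[i][j] ~ cost of deriving item j from item i. A greedy Prim-style
# heuristic stands in for the cited arborescence algorithms.
def build_lineage_tree(d, root=0):
    n = len(d)
    parent = {root: None}               # child -> parent; root has none
    while len(parent) < n:
        best = None                     # (cost, parent, child)
        for i in parent:                # candidates already in the tree
            for j in range(n):
                if j not in parent and (best is None or d[i][j] < best[0]):
                    best = (d[i][j], i, j)
        parent[best[2]] = best[1]       # attach cheapest remaining item
    return parent

# Toy matrix: 1 derives cheaply from 0, and 2 derives cheaply from 1.
d = [
    [0, 1, 5],
    [9, 0, 1],
    [9, 9, 0],
]
print(build_lineage_tree(d))  # {0: None, 1: 0, 2: 1}
```

The recovered child-to-parent mapping is exactly the directed lineage graph described above; the hard part in practice is producing a dissimilarity matrix that actually reflects derivation cost.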
Solution 1 relies on the agreement of several detectors, which may not work for subtle
manipulations. Solution 2 and Solution 3, the latter of which has been previously extended
to video [30] as well as large-scale analysis [31], can detect subtle transformations but do
not attempt to describe the transformations involved in the parent-child relationship. They
also do not consider a realistic set of transformations: they include only resampling, cropping,
and color adjustment, neglecting both other common spatial transformations and the
temporal domain entirely. Thus, our system specifically detects and records the degree of
modification for a set of common spatial and temporal transformations.
Another issue in simply extending image algorithms to video is the common assumption
that the frames in the child (derived video) still line up perfectly with the frames in the parent
(original). Because videos can be modified temporally by changing frame rate and splicing
or trimming clips to remove frames, video solutions need to align clips before applying the
spatial techniques from the image solution. Our system incorporates an alignment step to
match frames, extended from a recent work [32].
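The alignment idea can be illustrated with classic dynamic time warping over per-frame features (plain numbers below stand in for luminance values). This generic DTW, not the exact algorithm of [32], matches each child frame to a parent frame even after temporal scaling.

```python
# Generic dynamic time warping sketch (not the exact algorithm of [32]):
# aligns a child video's frames to a parent's frames using per-frame
# feature values, tolerating temporal scaling.
def dtw_align(parent, child):
    n, m = len(parent), len(child)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(parent[i - 1] - child[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # parent frame skipped
                                 cost[i][j - 1],      # child frame repeated
                                 cost[i - 1][j - 1])  # frames matched 1:1
    # Backtrack to recover the frame-to-frame correspondence.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min((cost[i - 1][j - 1], i - 1, j - 1),
                   (cost[i - 1][j], i - 1, j),
                   (cost[i][j - 1], i, j - 1))
        i, j = step[1], step[2]
    return path[::-1]

# Child is the parent slowed to half speed (each frame doubled).
parent = [10, 20, 30]
child = [10, 10, 20, 20, 30, 30]
print(dtw_align(parent, child))  # each parent frame aligns to two child frames
```

Once this correspondence is known, the spatial comparisons from the image literature can be applied to matched frame pairs, which is the role the alignment step plays in video-diff.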
Many of these works come from a funded European project called REWIND, short for
“REVerse engineering of audio-VIsual coNtent Data”. The project mainly assumes that
multimedia transformations leave footprints that can be analyzed to trace back the content
modification history without access to the original content [33]. This represents a blind
approach to lineage detection, while we use a non-blind approach, assuming that changes
have been recorded starting with the original content on the device. Despite starting with
different assumptions, both approaches work toward the same goal of modification detection
and can complement each other.
2.3.1 Data provenance
The notion of provenance refers to a history of the entities and processes that produced
a given data object. Recently, semantic web applications have emerged, with the W3C
(World Wide Web Consortium) publishing a family of documents specifying web data prove-
nance [34]. In this model, our system acts as a step toward automating the generation of a
process description in a trustworthy manner.
Chapter 3
System Architecture
Figure 3.1: Design of the overall architecture with both trusted and untrusted clients.
3.1 Overview
Shown in Figure 3.1, the YouTrace architecture consists of a trusted smartphone client (Sec-
tion 3.4) and a hosting server (Section 3.3), with support for untrusted clients. The client
runs a custom system called IntegriDroid, built on top of Android with trusted hardware in-
tegration. The trusted hardware attests to the original integrity of video recordings, and the
system tracks modifications made to these recordings. When the client uploads a video, it
also sends a certificate (Section 3.2) containing a report of modifications along with informa-
tion about the trusted hardware. The server stores the video and the certificate, associating
them together in a database, and then creates a new lineage tree with the video as the root
node. When the server receives a video without a certificate (from an untrusted client), it
looks for a matching video and inserts it at the appropriate place in the match’s lineage tree.
If there is no match, then the server can store the video as an unverified root node. However,
because the video cannot be traced back to its original recording, users may not trust it.
3.2 Certificate Design
A core component of this work is the certificate, a data structure that includes information
about the device that signed it and information about changes made to the associated video
content. The former is accomplished by using a device’s Trusted Platform Module (TPM) to
measure its configuration, which services can then verify as matching a trusted configuration
and a known TPM. We propose the delta-report to record changes made to a video.
Our certificate is adapted from YouProve’s “fidelity certificate” [22] to accommodate a
delta-report rather than their type-specific content analysis results. Thus, the certificate
contains the following fields: digest of the content being uploaded or stored, timestamp of
original recording, delta-report describing how content differs from the original, digest of
the report, boot and system partition digests as hashed by a PCR, the TPM’s public key, and
a quote from the TPM that signs the report digest with the value of the PCR using the
private key. (The TPM’s public key is backed by a Certificate Authority, and the private
key is theoretically never exposed.)
3.2.1 Trusted configuration
Because it is becoming increasingly easy both to falsify videos on smartphones and to
modify the operating system to undermine the trustworthiness of a system that tracks
changes, trusted hardware attests that both the sensor readings from the camera and the
content-tracking system are relatively unchanged. In our system, this takes the form of a Trusted Platform
Module, a cryptographic processor with the following capabilities:
• Remote attestation: attesting to the software running on the platform to a remote
entity.
• Sealed storage: allowing data access only when the system is in a trusted configuration.
Both capabilities involve measuring the system configuration. The TPM uses Platform
Configuration Registers (PCRs) to accomplish this. PCRs are a set of stored hash chains to
which values can be concatenated and hashed using the extend command, but not directly
overwritten. Then, a chain of system operations is easy to verify, via the quote command
that reports PCR values signed with the TPM’s key, but infeasible to forge.
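The extend operation can be sketched in a few lines. This is an illustrative model of a SHA-1-based PCR (as in TPM 1.2), not our emulator's code, and the "partition image" strings stand in for real measured data:

```python
import hashlib

def extend(pcr: bytes, measurement: bytes) -> bytes:
    # New PCR value = hash(old value || measurement); values can only
    # be folded into the chain, never directly overwritten.
    return hashlib.sha1(pcr + measurement).digest()

# Illustrative boot-time measurement chain (digests stand in for real
# partition images; TPM 1.2 PCRs are 20-byte SHA-1 values).
pcr0 = b"\x00" * 20  # PCRs start zeroed at reset
pcr0 = extend(pcr0, hashlib.sha1(b"boot partition image").digest())
pcr0 = extend(pcr0, hashlib.sha1(b"system partition image").digest())

# A verifier that replays the same measurements reaches the same value,
# so a quote (a signature over pcr0) attests to the whole chain.
```

Because the hash is not invertible, an attacker cannot choose measurements that drive a PCR to a target value, which is what makes a quote over the final value meaningful.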
Although no commercial smartphones presently have a TPM, they will likely incorporate
these capabilities in the future. The Trusted Computing Group has been working on inte-
grating TPM features into the currently used Trusted Execution Environment [35], which
has isolated execution (for critical applications) and secure boot functionality. For this work,
we emulate the TPM on an Android device, though future work may leverage real hardware
with equivalent capabilities.
3.2.2 Delta-report
Changes made to a video can be either spatial or temporal in nature, and so an exhaustive
report needs to capture both types. Because transformations are usually performed at the
“clip” level (on sequences of frames rather than on individual frames), the report is a list of
entries describing clips. We propose the delta-report, illustrated in Figure 3.2. Each entry
consists of the clip’s endpoints (indices of the first and last frame) in the child video, the
matching clip’s endpoints in the parent video, and a list of transformations. The trans-
formation structure describes the type of transformation as a string, with the position (for
spatial transformations, given as the 2-dimensional coordinates of the top-left corner) and
the degree of transformation (length or factor of transformation along each axis). Although
some applications may want greater distinction for a given transformation, this model is
generic enough to describe all of the transformations with which we concern ourselves, with
some fields being unused by certain transformations.
The following transformations are considered in this work, with the delta-report being
generated by video-diff (Section 3.5):
• Spatial - scaling, cropping, block modification (general content tampering), color ad-
justment, grayscale, bordering
• Temporal - scaling (frame rate change or stretching), cropping (trimming)
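As a sketch, the delta-report could be represented with the following structures; the field names mirror Figure 3.2, but the concrete types and serialization shown here are our own illustrative choices:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Transformation:
    type: str                                     # e.g. "scale", "crop", "block", "border"
    position: Optional[Tuple[int, int]] = None    # top-left corner; spatial transformations only
    degree: Optional[Tuple[float, float]] = None  # length or factor along each axis

@dataclass
class Clip:
    endpoints: Tuple[int, int]         # first/last frame indices in the child video
    parent_endpoints: Tuple[int, int]  # matching endpoints in the parent video
    transformations: List[Transformation] = field(default_factory=list)

# A delta-report is a list of clips. Hypothetical example: the child keeps
# parent frames 0-449, temporally scaled to 2/3 and cropped by 50x100 pixels.
report = [
    Clip(endpoints=(0, 299), parent_endpoints=(0, 449),
         transformations=[
             Transformation("temporal-scale", degree=(2 / 3, 2 / 3)),
             Transformation("crop", position=(0, 0), degree=(50.0, 100.0)),
         ])
]
```

As the text notes, some fields go unused for certain transformations (e.g., position is meaningless for temporal scaling), which keeps the model generic.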
3.3 Server Design
The YouTrace server hosts videos with their certificates, maintaining lineage trees for all
videos. To develop this, we start with the MediaDrop project, an open source video hosting
platform, and add lineage functionality. The only change to the frontend consists of a
Figure 3.2: Delta-report structure for recording modifications.
“TraceBack” button on the video player (Figure 3.3), which takes the user to the lineage
tree for the current video (Figure 3.4). Building the lineage tree requires first finding matches
for a video, as described in the following section.
3.3.1 Matching
When a video is uploaded, feature extraction1 occurs as follows, derived from a process
described in a previous work [36]:
1. Obtain keyframes.
(a) Get one keyframe per shot. Shot segmentation is done using color correlation: for
each I-frame, a histogram is computed for each of the three channels (the colors
blue, green, and red). The histograms of consecutive I-frames are then compared
channel by channel, and the minimum correlation is taken. If this value is less
than some lower limit S for shot correlation, then the I-frame is returned as the
next keyframe, because it belongs to a different shot.
1Here we use color auto-correlograms, but implementations can use different features for near-duplicate detection as desired.
Figure 3.3: Media player with TraceBack button added, which links to the lineage tree for the given video.
Figure 3.4: Examples of TraceBack lineage trees generated and stored on the server, tracing lineage for the bottom videos going upward through the parents (denoted by IDs).
(b) Discard blanks. A keyframe is discarded if its number of detected keypoints is
less than some threshold B.
2. Preprocess keyframes.
(a) Denoise frames using a median filter.
(b) Remove borders. The border color is inferred from the top-left pixel and used to
trim solid strips of that color from top and bottom, left and right.
(c) Normalize aspect ratio. Frames are resized to some standard width and height
(W,H).
(d) Equalize histogram for the value/luminance channel (V in HSV, Y in YCbCr).
3. Extract the feature for each frame.
(a) Convert to the HSV color space, and quantize to 166 dimensions (18 hues × 3
saturations × 3 values + 4 grays).2
(b) Mask a square region of C × C pixels off each corner to focus on the central
portion of the frame for robustness.
(c) Extract the color auto-correlogram from both the central horizontal and vertical
strips to obtain a 332-dimensional feature. The color auto-correlogram is a
166-dimensional vector v; for each quantized color c ∈ [1, 166], v_c is the
probability of a pixel of color c having a neighbor (within some distance D) of
color c.
In our implementation, the following values are used: S = 0.7, B = 50, (W,H) =
(320, 180), C = 20, D = 7.
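The shot-segmentation test in step 1(a) can be sketched as follows. This is a sketch under our own choices: plain NumPy histograms and Pearson correlation stand in for whatever histogram-comparison routine an implementation actually uses, frames are assumed to be H × W × 3 arrays, and the bin count is arbitrary:

```python
import numpy as np

S = 0.7  # lower limit for shot correlation (value from our implementation)

def channel_histograms(frame, bins=64):
    # One intensity histogram per channel of an H x W x 3 frame.
    return [np.histogram(frame[:, :, c], bins=bins, range=(0, 256))[0].astype(float)
            for c in range(3)]

def min_channel_correlation(f1, f2):
    # Compare histograms channel by channel; keep the minimum correlation.
    return min(np.corrcoef(h1, h2)[0, 1]
               for h1, h2 in zip(channel_histograms(f1), channel_histograms(f2)))

def is_new_shot(prev_iframe, iframe):
    # An I-frame becomes the next keyframe when its minimum channel
    # correlation with the previous I-frame falls below S.
    return min_channel_correlation(prev_iframe, iframe) < S
```

Taking the minimum over channels makes the test conservative: a large change in any one color channel is enough to start a new shot.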
The features for all keyframes are saved to an HDF5 dataset. A searcher daemon then
adds the features to its index in memory and returns similar features from other videos, using
2In our implementation, grays are defined as saturation or value lower than 10%, but if value is higher than 80%, then saturation must be lower than 5%.
Figure 3.5: Video upload to server from an untrusted (non-IntegriDroid) client. The server matches the upload to a verified video if possible and describes the modifications made.
FLANN for indexing and searching.3 For uploads from untrusted clients, these matches are
used as potential parents in our model. Each matching parent is then associated in a database
with the upload, and the match is annotated with a delta-report generated by a server-side
video-diff, described in Section 3.5. The lineage tree can be constructed by following relations
in the database. Figure 3.5 illustrates this process.
As long as the original recording has come from a trusted device, untrusted clients can
modify the video to produce different versions and even remix the different versions, because
this chain of trust will be retained. When a trusted client uploads a video, it sends its
own certificate describing changes that have been made to the video since recording. If
the certificate can be verified (by matching the public key of the TPM to a known one and
verifying the signature), then the video is added as a root node, because the client has proven
that the video originated from a recording on its device.
3Each feature in our implementation consists of 332 32-bit floats, taking up about 1.328 KB. Assuming that a video has one keyframe per second, 209.17 hours of video could be represented by 1 GB of features.
3.4 Client Design
Extending the idea of YouProve [22] to video, the trusted client is a smartphone that relies on
software built on top of Android and hardware integration with a Trusted Platform Module.
As stated above (Section 3.2.1), the TPM is emulated for this prototype.
We use TaintDroid on Android 4.3 r1 for data tracking, build our system on top of it,
and run it on a Galaxy Nexus. We describe our system, IntegriDroid, in the next section.
Figure 3.6 shows the high-level design.
3.4.1 IntegriDroid
To track how a video changes from its source, we add IntegriDroid into the Android frame-
work as a new class that logs video recordings. The media recorder prepares by generating
a random 32-bit integer as the taint tag, used by TaintDroid for tracking the content both
on the filesystem and as it propagates through apps. When the recorder stops, it calls In-
tegriDroid’s log method on the video’s filename and the taint tag. IntegriDroid then inserts
these as well as the following into a centralized database on the device: current timestamp,
a digest of the video data, and a signature of all the fields by the TPM sealed with the first
PCR’s value (having been extended with measurements of the platform’s boot and system
partitions). This allows the device to attest that the recording is original and not digitally
falsified (assuming that the camera hardware is tamper-resistant).
vidtracker — File system monitor
Because TaintDroid is built into the Dalvik Virtual Machine, it can only interpose on I/O
by Java apps. Native apps (written in C++ for example) are not run by the VM, and so
TaintDroid fails to track data flowing through such apps. While developing IntegriDroid,
we discovered that video editors are often implemented natively for performance. To work
Figure 3.6: Design of the IntegriDroid client.
around this issue, we developed a service, vidtracker, that watches the filesystem for processes
reading tainted video files. If a process with a given UID reads a tainted video file and a
process with the same UID writes a video file, the written file is checked to see if its video’s
keyframes match the tainted video’s keyframes. (The best match can be used if the process
has read multiple matching files.) If it does, then vidtracker calls the video-diff code on the
tainted file (assumed to be the parent) and the new one (the child) to report on modifications.
The derived video is then tainted and associated with the parent in the database along with
a report. Later, when a certificate is requested for the derived video, it inherits its parent’s
report in addition to its own, which, if the parent is also a derived video, inherits its parent’s
and so on until getting back to the original recording. This way, the final report constructs
the entire chain of modifications, tracing the final video to the original recording.
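The report-inheritance walk above can be sketched as a toy model, with a dict standing in for the on-device database; the names and placeholder edit labels are ours:

```python
def full_report(video_id, db):
    # Walk parent links back to the original recording, collecting each
    # delta-report so the certificate covers the entire chain of edits.
    # db maps video_id -> (parent_id or None, delta_report).
    chain = []
    while video_id is not None:
        parent_id, report = db[video_id]
        chain.append(report)
        video_id = parent_id
    chain.reverse()  # oldest (the original recording) first
    return chain

# Toy database: v2 was derived from v1, which was derived from the
# original recording (edit labels are placeholders, not real reports).
db = {
    "orig": (None, []),
    "v1": ("orig", ["temporal-crop"]),
    "v2": ("v1", ["grayscale"]),
}
```

Reversing the chain orders the reports from the original recording forward, so a verifier can replay the modification history in the order it happened.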
3.5 video-diff — Comparing Original and Derived Videos
On the client side, when an app generates tainted video output, as detected by TaintDroid
or vidtracker, IntegriDroid compares the new (derived) video to the parent using video-diff.
(a) Distance matrix. The vertical axis corresponds to successive blocks of frames in the parent video, while the horizontal axis corresponds to those in the child video, which is the parent video temporally scaled to 70%. A point’s value falls in the range 0-64, being the Hamming distance between the 3D-DCT hashes of the two blocks of 64 frames each. Darker points represent lower values, where blocks match more.
(b) Distance matrix after binarizing. Points with values greater than τ = 16 are discarded. (16 is 25% of 64, so resulting blocks have hashes that match at least 75%.)
(c) Distance matrix after binarizing and applying morphological opening. This erodes spurious values and dilates larger matches, though outliers still occur where the video has similar segments repeating. (Outliers become insignificant during clustering, but implementations could also use RANSAC [37] to filter them out.) Unfortunately, thin components, in which surrounding frames do not match, erode into gaps. The kernel size must be chosen carefully for the application in order to minimize this case. After clustering the resulting points and fitting lines over distance minima in the clusters, the sequences are aligned.
Figure 3.7: Video alignment of a temporally scaled sequence.
On the server side, uploads from untrusted clients are matched to known, trusted contents
and compared likewise, using the matches as parents. Shown in Figure 3.7, the comparison
algorithm starts by aligning corresponding frames between the two videos to tolerate and
detect temporal transformations, leveraging a published technique [32]. Then, corresponding
keyframes are compared to discover spatial transformations. The algorithm goes as follows:
1. Align videos by frames.
(a) Split each video into overlapping blocks of 64 frames (i.e., the first block is frames
0-63, the second block is frames 1-64).
(b) Get the 3D-DCT hash of each block. Given the 64 3D-DCT coefficients for the
block indexed between (1, 1, 1) and (4, 4, 4) inclusively (where (0, 0, 0) is the DC
coefficient), set the corresponding bit in the hash to 1 if the coefficient is greater
than the median value, and 0 otherwise.
(c) Construct a distance matrix, where each (i, j) is the Hamming distance between
the hashes of block i of the parent video and block j of the child.
(d) Binarize the distance matrix using some threshold τ (values less than or equal to
τ are set to 1, otherwise 0).
(e) Erode and dilate the binarized matrix to discard outliers, using square kernels
(structuring elements) of sizes E × E and D × D, respectively.
(f) Find distance minima for rows and columns of the binarized matrix.
(g) Cluster the minima using k-means, starting with some predefined number of clus-
ters K and going down to 1 cluster iteratively. For each iteration, fit a line
through each cluster, and take the mean squared error between the line and the
minima to determine individual cluster quality. Take the mean of all clusters’
MSEs (MMSE) for total cluster quality. Choose the k-clustering with the lowest
MMSE and its fitted lines as the clips.
(h) Fine-tune the alignment for each clip by computing the average luminance value
for each frame in the clip in each video. Take the differences in this value between
successive frames to derive a signal for each video’s version of the clip. Cross-
correlate the signals from the two videos and shift the clip accordingly.
2. Detect temporal transformations.
(a) For each line (matching clip), use its slope to estimate temporal scaling.
(b) Use the number of blocks in the parent video without matches in the child to
estimate temporal cropping.
3. Detect spatial transformations.4 For each matching clip, select keyframes in the parent
(using the same color correlation as the server side, as described in Section 3.3.1), and
compare to the corresponding frame in the child:
(a) Check for borders in the child frame that do not exist in the parent frame; if
they are present, trim the second until it matches the first. Use this difference to
estimate degree of bordering modification.
(b) Detect keypoints using a standard algorithm like ORB or SURF5; crop both
frames to minimum bounding box around keypoints. Estimate cropping factor by
how much the parent needs to be cropped compared to the child (relative size of
keypoint structure to frame size).
4Although we did not consider block artifacting due to lossy compression, it can be estimated by dividing each matching frame of the second video into a grid of 8 × 8 blocks (assuming standard encoding) and comparing histograms of pixel differences within and outside blocks [38].
5Keypoint structures are used because we are concerned with transformations on the matching content itself. The frames may also contain non-matching content that we want to ignore.
(c) Estimate scaling by the relative size of the child’s keypoint structure compared
to the parent’s.
(d) Estimate color adjustment by comparing the standard deviations of both frames’
color channels. This can be done using a simple variant of the F-ratio: dividing the
larger standard deviation by the smaller one, and comparing it to some critical
value. If they differ greatly, check whether the second video’s frame has near-
equivalent mean and deviation for all color channels, which indicates a likely
grayscale operation.
(e) Estimate general content tampering in each block. Because the frame and key-
point alignment both have some error, a pixelwise comparison between frames
would be unsuitable. Divide each frame’s matching region (from cropping to
keypoints above) into a grid of square blocks of size N × N and compute the
structural similarity (SSIM) between corresponding blocks. Similar to YouProve’s
block comparison [22], but with a different metric, we consider the center region
12% smaller in each direction from the first frame’s block and compare it to each
equally sized subblock in the second frame’s block.
In our implementation, the following values are used: τ = 16, E = 8, D = 50, K = 5,
N = 128.
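Steps 1(b)-(d) of the alignment can be sketched as below. The helper names are ours; the block shape, the (1,1,1)-(4,4,4) index range, and τ = 16 follow the text, and we assume frames are already grayscale and resized to 32x32:

```python
import numpy as np
from scipy.fft import dctn

def block_hash(block):
    # 64-bit hash of a block of 64 frames, each already grayscale and
    # resized to 32x32 (block shape: 64 x 32 x 32). Take the 3D-DCT
    # coefficients indexed (1,1,1)..(4,4,4) -- skipping the DC term --
    # and set a bit iff the coefficient exceeds their median.
    coeffs = dctn(block.astype(float), type=2)[1:5, 1:5, 1:5].ravel()
    bits = coeffs > np.median(coeffs)
    h = 0
    for b in bits:
        h = (h << 1) | int(b)
    return h

def hamming(h1, h2):
    # Distance matrix entry (i, j): Hamming distance between the hashes
    # of parent block i and child block j.
    return bin(h1 ^ h2).count("1")

def binarize(distance, tau=16):
    # Keep only block pairs whose hashes match at least 75% (tau = 16).
    return distance <= tau
```

Thresholding against the median guarantees roughly half the 64 bits are set for any block, which keeps Hamming distances between unrelated blocks centered rather than skewed.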
Chapter 4
Evaluation
4.1 Accuracy
First we evaluate the accuracy of the core video-diff code, in its ability to detect the type
and degree of specific transformations. In general, detection performs well but returns false
positives (or falsely higher degrees) for spatial modifications due to the imprecision in the
alignment step. To evaluate video-diff’s performance on a given transformation, we download
a set of popular videos from YouTube (with content varying from trailers to vlogs), apply
the transformation using ffmpeg (version 2.5.git) for each video, and run video-diff on the
original and transformed video.
4.1.1 Spatial scaling
The spatial scaling detector compares the sizes of the bounding rectangles around matching
keypoints. In this evaluation, each frame in the parent video is scaled by some factor for
both width and height to produce the child video. Figure 4.1 shows the results for scaling
by 2/3 and 3/2. The estimated values have low error.
Actual factor of spatial scaling   Detected average   Std. dev.   Keyframes   Mean absolute error
0.67                               0.66               0.02        8           0.01
1.50                               1.52               0.10        40          0.04

Figure 4.1: video-diff accuracy in detecting spatial scaling by 2/3 and 3/2.
4.1.2 Spatial cropping
The spatial cropping detector works by matching keypoints in corresponding frames, and
then cropping the frames to bounding rectangles around the keypoints. To evaluate for
each video, a number of pixels is cropped from the bottom and right sides of each frame to
produce the child video. The detector returns a number for each axis, making the number
of samples twice the number of keyframes. Figure 4.2 shows the results for cropping by
50 and 100 pixels. The detector tends to underestimate, with the range of detected values
increasing for the higher actual value. It also performs better on some videos than others,
as demonstrated by the low differences between the standard deviations and mean absolute
errors. As with other detectors that work on matching regions, the observed errors here are
due to imprecision of frame alignment.
4.1.3 Block modification (general content tampering)
The block modification detector detects content tampering by computing the structural
similarity (SSIM) of blocks in matching keypoint regions between aligned frames. However,
this makes the detector sensitive to misalignment, for both false positives and false negatives.
Therefore, rather than evaluating on subtle manipulations, we take a logo of size 207x240
(Figure 4.3) and add it to the first 30 seconds of each video at position (256, 256). video-
diff only detects the modification in 11 out of 21 videos. For comparison, when running
video-diff on videos with borders added (but no other transformations) and their parents,
false positives for the detector occur in 10 out of 83 videos. This gives a true positive rate
of 52.38% and a true negative rate of 87.95%, meaning that the detector is biased toward
returning negative. This is likely because tampered frames are not recognized as matches to
the pre-tampered parents. In order for this detector to be more effective, frame alignment
accuracy must improve.
Actual number of cropped pixels   Detected average   Std. dev.   Keyframes   Mean absolute error
50                                24.79              33.70       163         28.56
100                               47.44              67.59       174         61.73

Figure 4.2: video-diff accuracy in detecting degree of spatial cropping with 50 and 100 pixels.
Figure 4.3: Logo inserted into videos to evaluate block modification detection. Copyrighted by Larry Ewing, Simon Budig, Anja Gerwinski. [2]
4.1.4 Color adjustment
In order to study whether the simple F-ratio can be used to determine color adjustment,
we take sample values from 23 keyframes of videos with the “lighter” curve preset filter, 19
keyframes with “darker”, and 48 with no color adjustment (only borders changed). Each
sample contains a red and blue component, where the standard deviations for that color are
taken from the parent and child frames, and the F-ratio is the greater standard deviation
divided by the lesser.
Lighter red (i.e., the red component of the “lighter” color transformation) and lighter blue
both give an average F-ratio for their respective color channels of 1.08, as does darker red,
while darker blue gives an average F-ratio of 1.12. “Normal” averages (no color adjustment)
give 1.03 for red and 1.04 for blue.
To test whether these last two means significantly differ from the others (and whether
there exists a basis for our claim), we conduct an independent two-sample T-test between
each, the results of which are shown in Table 4.1. The p-values computed between the
normal components and the color-adjusted components are less than 1%, indicating that
their means significantly differ, while the p-values between both color-adjusted components
are high, confirming the null hypothesis of statistical equivalence. These results demonstrate
that applications can use this simple F-test to determine color adjustment, by setting the
critical value somewhere between 1.04 and 1.08 exclusive (based on average F-ratios from
earlier).
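The resulting decision rule is simple enough to state in code; the critical value here is our arbitrary midpoint of the open interval (1.04, 1.08) that the averages above support:

```python
def f_ratio(sd_parent, sd_child):
    # Larger standard deviation over the smaller, so the ratio is >= 1.
    lo, hi = sorted((sd_parent, sd_child))
    return hi / lo

CRITICAL = 1.06  # any value in the open interval (1.04, 1.08) works here

def color_adjusted(sd_parent, sd_child):
    # Flag a channel as color-adjusted when its F-ratio exceeds the
    # critical value chosen between the "normal" and adjusted averages.
    return f_ratio(sd_parent, sd_child) > CRITICAL
```

Taking the larger deviation over the smaller makes the test symmetric, so it fires whether the edit widened or narrowed the channel's distribution.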
4.1.5 Border change
The border change detector is simple enough to have a 100% true positive rate on 37 videos
with borders added. False positives that occur on other videos generally have low degree
and are likely due to misalignments.
                Lighter                    Darker
                t-statistic   p-value      t-statistic   p-value
Normal red      3.2519        0.0018       3.4486        0.0010
Normal blue     2.7873        0.0069       3.2394        0.0019
Lighter red                                -0.3287       0.7441
Lighter blue                               -1.0909       0.2818

Table 4.1: T-tests for F-ratio samples of different color components among “lighter”, “darker”, and “normal” (no filter, border change only) color curve presets.
4.1.6 Temporal scaling
The temporal scaling detector simply measures the slope of the fitted line through each
matching clip of the videos, obtained from clustering their distance matrix. Figure 4.4
shows the results for scaling the video by 1/2, 2/3, 1 (from other transformations), 10/7,
and 2. Detected factors tend to stay close to the actual factors with low standard deviations.
4.1.7 Temporal cropping
The temporal cropping detector measures the lengths of clips in the parent without matching
clips in the child (through clustering and line fitting). For 37 videos that are spatially scaled
with borders added1, on average 15.56% of total parent time is falsely detected as cropped
out in the child. This amount indicates a small lack of accuracy in the video alignment,
likely where either DCT hashing fails or true matches in the distance matrix are eroded.
1Videos are spatially scaled to 720x432 then have black borders added to the top and bottom for a final size of 720x576. Videos almost all start at 1280x720, with three exceptions out of 37: 1280x534, 640x358, 512x288.
Actual factor of temporal scaling   Detected average   Std. dev.   Clips
0.500                               0.523              0.069       9
0.667                               0.666              0.001       7
1.000                               1.008              0.055       162
1.429                               1.426              0.007       11
2.000                               2.050              0.055       5

Figure 4.4: video-diff accuracy in detecting temporal scaling by 1/2, 2/3, 1, 10/7, and 2.
4.2 Speed
In this section, we evaluate the speed of the server-side video-diff, on a quad-core Intel Core
i5-2400S. (IntegriDroid client latency is evaluated in the next section.)
3D-DCT hashing operates on blocks of 64 frames each. However, before the DCT is
computed, the next frame is read, its color space is converted to gray and it is resized to
32x32. Therefore, the number of blocks that can be hashed per second depends on the
original frame size. Videos of size 1280x720 hash on average at 228.51 blocks per second,
while videos of size 720x576 hash on average at 333.41 blocks per second. Note that these
blocks are overlapping, so if the video plays at 30 frames per second, hashing takes the
video’s duration divided by 7.617 in the first case and divided by 11.11 in the second.
Applying a morphological opening (dilation of erosion), finding distance minima, clus-
tering, and cross-correlation (after the costly average luminance computation) all take prac-
tically negligible time, only going up to 4 seconds for all operations on a 9-minute video. On
the other hand, computing histograms for the keyframe selection runs on average at only
23.71 frames per second, which is slower than real time.
Two functions that can be easily made to run in parallel, due to a lack of interdepen-
dencies, include the mean luminance computation loop and distance matrix computation,
explored in the next section.
4.2.1 Concurrency
A main bottleneck in video-diff consists of the loop in which the mean luminance is computed
for a series of frames. For each video sequence, this produces a mono-dimensional signal
used to cross-correlate for fine-tuning the alignment. Although each iteration takes only
0–1 seconds, the number of frames adds up to dominate the running time of the algorithm.
Because the frames can be processed independently, we decided to examine the effect of
Figure 4.5: Running time versus number of threads for video-diff’s average luminance loop.
splitting the work among threads. Figure 4.5 shows how adding more threads decreases
the running time (normalized with respect to the time of one thread), averaged among 37
different videos. We would expect that doubling the number of threads would cut the running
time in half. The threading does not work as well as anticipated, only cutting the average
running time to 82.5% of normal at 7 threads. This is likely due to the cost of file I/O, as
each frame is read from the video file on disk. A better model in the future may be to have
the video file in RAM, across separate nodes on a distributed filesystem, or on solid-state
drives.
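The parallel version of the luminance loop can be sketched as below. This sketch takes already-decoded frames as input, so it sidesteps the file I/O that dominated our measurements; NumPy releases the GIL inside mean(), which is why threads (rather than processes) can help at all in CPython:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def mean_luminance(frame):
    # Average luminance of one (grayscale) frame.
    return float(frame.mean())

def luminance_signal(frames, workers=4):
    # Per-frame mean luminance computed across a thread pool; frames are
    # independent, and executor.map preserves input order. Differences
    # between successive frames give the mono-dimensional signal used
    # for cross-correlation when fine-tuning the alignment.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        means = list(pool.map(mean_luminance, frames))
    return np.diff(means)
```

Because executor.map preserves order, the differenced signal lines up with the frame sequence regardless of which thread processed which frame.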
Although it does not present a bottleneck (negligible for short videos), distance matrix
computation can easily run in parallel, with each thread taking a different set of rows to
process. Figure 4.6 shows the results of running with 1, 2, 4, and 7 threads on three different
videos with numbers of blocks 12666, 12686, 16081, against videos of the same length (so the
distance matrix is the square of the number of blocks). This matches the expected results.
Figure 4.6: Running time versus number of threads for video-diff’s distance matrix computation on longer videos (large matrices).
4.3 IntegriDroid
Now we evaluate IntegriDroid’s performance on a Galaxy Nexus. We cross-compile both
OpenCV and ffmpeg for Android and link our video-diff code, because the prebuilt OpenCV4Android
lacks the ffmpeg backend for reading video. Due to the Galaxy Nexus’s relatively weak
CPU (only 2 cores at 1.2 GHz) and limited multithreading capability, our client prototype
does not perform the mean luminance computation for fine-tuning alignment or color
correlation for keyframe selection. Instead it simply selects the first and last frame of each clip
as the keyframes. Additionally, it scales each frame down to fit inside a 1024x768 bounding
box, similarly to YouProve [22]. Future implementations on better hardware may lift these
restrictions.
4.3.1 Latency
After recording a video, it takes about 5 seconds to log it: for copying the original recording
into secure storage, computing a digest, signing the log by the TPM, and inserting the log
into the database.
After editing a video, video-diff does not start running until the file has been closed,
which can either occur when the exported video stops playing or when the app stops. (See
Section 4.3.3 for a discussion of the vidtracker model.)
Once video-diff starts, it hashes a 1280x720 video at an average of 8.4 blocks per second: 3.7% of the server's speed, or 28% of playback speed for a video playing at 30 frames per second. The distance matrix operations and clustering still take negligible time. Future hardware may increase hashing speed as both smartphone processors and file I/O improve.
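The reported figures are mutually consistent if one hashed block corresponds to one frame (our assumption for this check, not a statement from the measurements themselves):

```python
# Sanity check of the reported throughput figures, assuming one hashed
# block per frame (our assumption for illustration).
phone_blocks_per_s = 8.4
playback_fps = 30

# Fraction of real-time playback speed on the phone.
print(round(phone_blocks_per_s / playback_fps, 2))   # 0.28, i.e., 28%

# Implied server hashing rate, given the phone runs at 3.7% of it.
print(round(phone_blocks_per_s / 0.037))             # about 227 blocks/s
```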
4.3.2 Power consumption
We use the University of Michigan's PowerTutor app to report the system's power consumption as a weighted average over 5 minutes. All measurements are performed with WiFi off and the screen dimmed unless otherwise noted.
When not using the IntegriDroid framework, the Galaxy Nexus draws 5 mW when idle with the screen off, and 330 mW with the screen on. Starting the TPM emulation processes and vidtracker only brings this up to 337 mW.
While video-diff is running, consumption is surprisingly low at 600 mW. For comparison,
playing music (via the built-in Music app through the speaker at half the full volume) uses
735 mW. These results show that IntegriDroid has low power overhead and is feasible for
smartphone deployment.
4.3.3 Tracking
Although vidtracker works for prototyping, future work will require a more robust solution for native video dependency tracking. Because our model considers every tainted video that an app reads as a potential parent of its output videos, the number of videos the app reads before writing and closing a new one affects tracking complexity and performance: each candidate parent must be tested against the output video as a potential match. Realistically, the child video may have multiple parents. If the vidtracker model is retained, implementations may consider caching video features to speed up near-duplicate detection. However, other solutions for native taint tracking may offer higher utility, such as real-time disassembly of ARM instructions [39].
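The parent-candidate bookkeeping described above can be sketched as follows. Class and method names are ours; a real implementation would drive these callbacks from file-system events rather than direct calls.

```python
# Minimal sketch of the vidtracker parent-candidate model: every tainted
# video an app reads before writing and closing a new video is recorded
# as a potential parent of that output.
class VidTracker:
    def __init__(self, tainted):
        self.tainted = set(tainted)   # videos with trusted provenance logs
        self.reads = {}               # app -> tainted videos read so far

    def on_read(self, app, path):
        # Only reads of tainted videos matter for lineage.
        if path in self.tainted:
            self.reads.setdefault(app, set()).add(path)

    def on_write_close(self, app, out_path):
        # Every tainted video read so far is a candidate parent; each one
        # must later be tested against out_path by video-diff.
        parents = self.reads.pop(app, set())
        self.tainted.add(out_path)    # the child becomes tainted as well
        return parents
```

The cost noted above is visible here: the candidate set grows with every tainted read, and each member triggers a video-diff comparison against the output.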
Chapter 5
Conclusion
5.1 Summary
This thesis presents YouTrace, a complete video tracking framework. First, we outline related work in media integrity, lineage, and trusted sensing, showing that existing mechanisms suffer from restrictive or unrealistic assumptions when applied to videos. Second, we propose an architecture based on a client running a custom Android OS with trusted hardware (IntegriDroid) as well as a hosting server, both of which trace video modifications back to the original recording, as long as the original recording happens on a trusted IntegriDroid device. Third, we show the accuracy of our system for tracking various types of transformations, as well as its speed and power consumption on a real smartphone. The results demonstrate feasibility for future deployment. Although the implementation is still inefficient on smartphones, future hardware will likely increase speed significantly through faster and more parallel architectures while lowering power consumption. Still, the results suggest future work to improve YouTrace's performance, as discussed in the following section.
5.2 Future Work
Several avenues for future research exist, both within components of the current architecture and in transforming the architecture itself.
The video-diff algorithm still has a long way to go in both accuracy and speed, though improving one often comes at the expense of the other. The alignment algorithm suffers from imprecision that sometimes causes frames to be mismatched, resulting in false positives, especially for block modifications and temporal cropping. Implementations may choose to optimize for a target subset of transformations, based on how users typically edit videos. Accordingly, the delta-report structure may change to expose different dimensions of transformations.
For the proposed system to be practical, future work may target actual hardware on commercial smartphones. Although smartphones do not have TPMs, equivalent functionality may be possible with current trusted hardware, as the Trusted Computing Group is working on integrating TPM features into the Trusted Execution Environment [35].
Moreover, file-system monitoring with vidtracker may be an inefficient and imprecise mechanism for taint tracking through native code. Future systems could use more robust techniques, such as real-time disassembly of ARM instructions as proposed in recent work [39].
Currently, the architecture uses a simple client/server model. However, the server can encompass several services, all of them using a central database. Alternatively, video-diff computation and certificate storage can be distributed over a peer-to-peer network. Analogous to Bitcoin mining, a set of nodes in the network could "mine" by computing the delta-report for a parent and child video, possibly rewarded with some resource. This way, trust can be distributed rather than concentrated in a central server. (Note that the central server model still allows users to verify reports and certificates by recomputing, though this may take longer.)
Appendix A
Classification and Evaluation of Video
Authentication Schemes
Table A.1 classifies several watermarking and fingerprinting mechanisms, with columns corresponding to the proposed taxonomy (Table 2.2). Additionally, the "Parameters" column lists and describes the parameters each scheme accepts to configure application-dependent settings such as security and robustness.
Columns: Transmission Paradigm; Feature Extraction; Feature Encoding; Parameters.

[10] Paradigm: Watermarking. Feature Extraction: Transform Coefficients (DCT DC coefficients in I-frame sub-blocks). Feature Encoding: Quantization (diagonal quantized DCT coefficients). Parameters: threshold T for tampering detection.

[11] Paradigm: Watermarking. Feature Extraction: Pixel Color Values (red and blue layers of I-frames). Feature Encoding: Quantization (DCT coefficients of corresponding luminance blocks). Parameters: compressive sensing matrices, luminance threshold value T.

[12] Paradigm: Watermarking. Feature Extraction: Pixel Luminance Values (ART coefficients). Feature Encoding: Error Correction Coding (low and middle frequency DFT coefficients). Parameters: private key (optional).

[13] Paradigm: Watermarking. Feature Extraction: Transform Coefficients (average DCT energy of all I-frames in shot). Feature Encoding: Cloud Drops. Parameters: expected value Ex, entropy En, hyper entropy He, number of cloud drops n, secret key.

[15] Paradigm: Watermarking. Feature Extraction: Parameterized. Feature Encoding: Quantization. Parameters: fragile watermark feature MF, robust watermark feature MR.

[16] Paradigm: Digital Signature (Perceptual Hashing). Feature Extraction: Pixel Luminance Values (radial projection of key frame pixels). Feature Encoding: Quantization (RASH vector is quantized first 40 DCT coefficients). Parameters: key frame selection algorithm, threshold τ for visual equivalence decision.

[17] Paradigm: Digital Signature (Perceptual Hashing). Feature Extraction: Pixel Luminance Values. Feature Encoding: Cryptographic Hashing (MD5). Parameters: maximum allowable quantization ε, scalar constant for intensity transformation α, length/width of block P.

[8] Paradigm: Digital Signature. Feature Extraction: Key Frames. Feature Encoding: Aggregation of frames. Parameters: differential energy factor D, weight factor W.

[18] Paradigm: Hybrid (Content-Fragile Watermarking). Feature Extraction: Pixel Luminance Values (edge characteristics). Feature Encoding: Variable Length Code. Parameters: private key.

[21] Paradigm: Hybrid. Feature Extraction: Transform Coefficients. Feature Encoding: Cryptographic Hashing (embedded by changing the last LSB bit of x and y in motion vectors). Parameters: thresholds T1 and T2 for selecting motion vectors.

Table A.1: Classification of video authentication schemes.
Table A.2 evaluates the same schemes according to the following criteria:
• Tampering Detection: The scheme detects malicious modifications to the video content or source information.
• Tampering Localization: The scheme locates regions of a frame or video sequence that have been tampered with.
• Geometric/Transcoding Robustness: The scheme accepts content-preserving transformations such as scaling/re-compression.
• Loss Tolerance: The scheme tolerates packet/frame loss or compensates for it with redundancy.
Speed of authentication is also critical. The user uploading a video expects the upload to finish as quickly as possible, and the server needs to complete it with minimal authentication overhead so it can move on to other tasks. The tradeoff is to minimize overhead while still satisfying security requirements. To prevent delay, especially for the viewing client, authentication schemes should be capable of working in real time. However, although these characteristics may factor into the usefulness of a system, the literature rarely explores them in adequate detail for analysis. Future work should therefore include space, time, and energy requirements in its discussion.
Criteria: Detection; Localization; Robustness; Loss Tolerance.

[10] Detection: high, configurable according to threshold T. Localization: 4x4 sub-block. Robustness: high against Gaussian noise (89%), salt-and-pepper noise (94%), Gaussian low-pass filter (74%), contrast enhancement (94%). Loss Tolerance: high (only I-frames are significant).

[11] Detection: high; detection rate tested on four typical sequences (Carphone 92.5%, Container 93.3%, Mobile 91.4%, Paris 92.0%). Localization: 8x8 block. Robustness: not measured. Loss Tolerance: high (only I-frames are significant).

[12] Detection: high for non-overlapping, large objects. Localization: object-level. Robustness: high (greater than 85% in all video processing tests). Loss Tolerance: high (each frame authenticated independently).

[13] Detection: medium (85-91%); low for short shots, high for long ones. Localization: sub-region (depends on size), also temporal tampering degree (based on distance between W and W′). Robustness: high for recompression that preserves GOP structure. Loss Tolerance: low (frame dropping is considered a temporal attack).

[15] Detection: configurable, based on selected features. Localization: GOP level. Robustness: configurable. Loss Tolerance: low (requires motion vectors from inter-frames).

[16] Detection: configurable; correlates with threshold τ by reducing the risk of hash collision. Localization: frame level. Robustness: configurable, inversely correlated with threshold τ. Loss Tolerance: medium (only key frames are significant, so tolerance depends on key frame selection).

[17] Detection: configurable; correlates with parameter α. Localization: block-level (size P × P). Robustness: high for recompression, inversely correlated with parameter α. Loss Tolerance: configurable (can be considered temporal tampering or ignored).

[8] Detection: high; inversely correlates with parameter D overall, correlates with W at the block level. Localization: specified region. Robustness: high, correlated with parameter D. Loss Tolerance: high (configurable, exploits temporal redundancy).

[18] Detection: low (malicious modifications that preserve edge characteristics are accepted). Localization: frame level. Robustness: medium (uses robust watermarks, but edges are sensitive to high compression and scaling transformations). Loss Tolerance: high (frames can be authenticated independently).

[21] Detection: high (any trivial attack will alter motion vectors). Localization: GOP level. Robustness: low (recompression alters GOP structures and motion vectors). Loss Tolerance: low (requires motion vectors from inter-frames).

Table A.2: Evaluation of video authentication schemes.
Bibliography
[1] Cisco, "The zettabyte era: Trends and analysis," http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/VNI Hyperconnectivity WP.pdf, June 2014, retrieved 01 March 2015, archived by WebCite. [Online]. Available: http://www.webcitation.org/6WieRDCib
[2] S. Budig, "The linux-penguin again..." http://www.home.unix-ag.org/simon/penguin/, retrieved 03 April 2015. [Online]. Available: http://www.home.unix-ag.org/simon/penguin/
[3] Cisco, "Cisco visual networking index: Forecast and methodology, 2013-2018," http://www.cisco.com/c/en/us/solutions/collateral/service-provider/ip-ngn-ip-next-generation-network/white paper c11-481360.pdf, June 2014, retrieved 06 October 2014, archived by WebCite. [Online]. Available: http://www.webcitation.org/6T8ZCShaO
[4] YouTube, "Statistics," https://www.youtube.com/yt/press/statistics.html, retrieved 06 October 2014, archived by WebCite. [Online]. Available: http://www.webcitation.org/6T8e2x2HX
[5] X. Wu, A. G. Hauptmann, and C.-W. Ngo, "Practical elimination of near-duplicates from web video search," in Proceedings of the 15th International Conference on Multimedia. ACM, 2007, pp. 218–227.
[6] YouTube, "Content verification program," https://support.google.com/youtube/answer/6005923, retrieved 01 March 2015, archived by WebCite. [Online]. Available: http://www.webcitation.org/6WijIDfno
[7] CNN, "About iReport," http://ireport.cnn.com/about.jspa, retrieved 06 October 2014, archived by WebCite. [Online]. Available: http://www.webcitation.org/6T8eJ9VAy
[8] P. K. Atrey, W.-Q. Yan, and M. S. Kankanhalli, "A scalable signature scheme for video authentication," Multimedia Tools and Applications, vol. 34, no. 1, pp. 107–135, July 2007.
[9] J. Wang, J. Lu, S. Lian, and G. Liu, "On the design of secure multimedia authentication," Journal of Universal Computer Science, vol. 15, no. 2, pp. 426–443, January 2009.
[10] W. Zhang, R. Zhang, X. Liu, C. Wu, and X. Niu, "A video watermarking algorithm of H.264/AVC for content authentication," Journal of Networks, vol. 7, no. 8, pp. 1150–1154, 2012.
[11] C. Xiaoling and Z. Huimin, "A novel video content authentication algorithm combined semi-fragile watermarking with compressive sensing," in 2012 International Conference on Intelligent Systems Design and Engineering Application. IEEE, 2012.
[12] D. He, Q. Sun, and Q. Tian, "A secure and robust object-based video authentication system," EURASIP Journal on Advances in Signal Processing, vol. 2004, pp. 2185–2200, 2004.
[13] C.-Y. Liang, A. Li, and X.-M. Niu, "Video authentication and tamper detection based on cloud model," in IIHMSP 2007 - Third International Conference on Intelligent Information Hiding and Multimedia Signal Processing, November 2007.
[14] C. Zauner, "Implementation and benchmarking of perceptual image hash functions," Master's thesis, Upper Austria University of Applied Sciences, Hagenberg, 2010.
[15] P. Yin and H. H. Yu, "A semi-fragile watermarking system for MPEG video authentication," in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2002, pp. IV–3461–IV–3464.
[16] C. D. Roover, C. D. Vleeschouwer, F. Lefebvre, and B. Macq, "Robust video hashing based on radial projections of key frames," IEEE Transactions on Signal Processing, vol. 53, no. 10, pp. 4020–4037, October 2005.
[17] F. Ahmed and M. Y. Siyal, "A robust and secure signature scheme for video authentication," in 2007 IEEE International Conference on Multimedia and Expo. IEEE, July 2007, pp. 2126–2129.
[18] J. Dittmann, A. Steinmetz, and R. Steinmetz, "Content-based digital signature for motion pictures authentication and content-fragile watermarking," in IEEE International Conference on Multimedia Computing and Systems, vol. 2. IEEE, 1999, pp. 209–213.
[19] J. Zhang and A. T. Ho, "Efficient video authentication for H.264," in IEEE Proceedings of the First International Conference on Innovative Computing, Information and Control (ICICIC'06), 2006.
[20] N. Ramaswamy and K. R. Rao, "Video authentication for H.264/AVC using digital signature standard and secure hash algorithm," in NOSSDAV'06, May 2006.
[21] K. A. Saadi, A. Bouridane, and A. Guessoum, "Combined fragile watermark and digital signature for H.264/AVC video authentication," in 17th European Signal Processing Conference (EUSIPCO 2009), August 2009, pp. 1799–1803.
[22] P. Gilbert, J. Jung, K. Lee, H. Qin, D. Sharkey, A. Sheth, and L. P. Cox, "YouProve: authenticity and fidelity in mobile sensing," in Proceedings of the 9th ACM Conference on Embedded Networked Sensor Systems. ACM, 2011, pp. 176–189.
[23] Y. Ke, R. Sukthankar, and L. Huston, "Efficient near-duplicate detection and sub-image retrieval," in ACM Multimedia, vol. 4, no. 1, 2004, p. 5.
[24] L. Kennedy and S.-F. Chang, "Internet image archaeology: automatically tracing the manipulation history of photographs on the web," in Proceedings of the 16th ACM International Conference on Multimedia. ACM, 2008, pp. 349–358.
[25] A. De Rosa, F. Uccheddu, A. Costanzo, A. Piva, and M. Barni, "Exploring image dependencies: a new challenge in image forensics," Media Forensics and Security, p. 75410, 2010.
[26] Z. Dias, A. Rocha, and S. Goldenstein, "First steps toward image phylogeny," in Information Forensics and Security (WIFS), 2010 IEEE International Workshop on. IEEE, 2010, pp. 1–6.
[27] ——, "Image phylogeny by minimal spanning trees," Information Forensics and Security, IEEE Transactions on, vol. 7, no. 2, pp. 774–788, 2012.
[28] Z. Dias, S. Goldenstein, and A. Rocha, "Toward image phylogeny forests: Automatically recovering semantically similar image relationships," Forensic Science International, vol. 231, no. 1, pp. 178–189, 2013.
[29] A. M. Bronstein, M. M. Bronstein, and R. Kimmel, "The video genome," arXiv preprint arXiv:1003.5320, 2010.
[30] Z. Dias, A. Rocha, and S. Goldenstein, "Video phylogeny: Recovering near-duplicate video relationships," in Information Forensics and Security (WIFS), 2011 IEEE International Workshop on. IEEE, 2011, pp. 1–6.
[31] Z. Dias, S. Goldenstein, and A. Rocha, "Large-scale image phylogeny: Tracing image ancestral relationships," MultiMedia, IEEE, vol. 20, no. 3, pp. 58–70, 2013.
[32] S. Lameri, P. Bestagini, A. Melloni, S. Milani, A. Rocha, M. Tagliasacchi, and S. Tubaro, "Who is my parent? Reconstructing video sequences from partially matching shots," in 2014 IEEE International Conference on Image Processing (ICIP), 2014.
[33] REWIND, "Reverse engineering of audio-visual content data," http://www.rewindproject.eu/, retrieved 30 March 2015. [Online]. Available: http://www.rewindproject.eu/
[34] W3C, "PROV-Overview: An overview of the PROV family of documents," http://www.w3.org/TR/prov-overview/, April 2013, retrieved 30 March 2015. [Online]. Available: http://www.w3.org/TR/prov-overview/
[35] Trusted Computing Group, "TPM Mobile with Trusted Execution Environment for comprehensive mobile device security," https://www.trustedcomputinggroup.org/files/static page files/5999C3C1-1A4B-B294-D0BC20183757815E/TPM%20MOBILE%20with%20Trusted%20Execution%20Environment%20for%20Comprehensive%20Mobile%20Device%20Security.pdf, retrieved 05 January 2015, archived by WebCite. [Online]. Available: http://www.webcitation.org/6VMSMgBmh
[36] L. Xie, A. Natsev, X. He, J. Kender, M. Hill, and J. R. Smith, "Tracking large-scale video remix in real-world events," IEEE Transactions on Multimedia, vol. 15, no. 6, pp. 1244–1254, 2013.
[37] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[38] Z. Fan and R. L. de Queiroz, "Identification of bitmap compression history: JPEG detection and quantizer estimation," Image Processing, IEEE Transactions on, vol. 12, no. 2, pp. 230–235, 2003.
[39] V. Pistol, "Practical dynamic information-flow tracking on mobile devices," Ph.D. dissertation, Duke University, 2014.