
3-8 Information Hiding on Digital Documents by Adjustment of New-line Positions

TAKIZAWA Osamu, MATSUMOTO Tsutomu, NAKAGAWA Hiroshi, MURASE Ichiro, and MAKINO Kyoko

Journal of the National Institute of Information and Communications Technology Vol.52 Nos.1/2 2005

In the usual information hiding applied to digital documents, secret messages are embedded in the layout information (e.g., the space between lines or characters) because character codes have no redundancy. This paper describes a new method for hiding information in plain text without using any layout information. It enables a secret message to be embedded as binary digits that are related to the number of characters in each line of the cover text.

Keywords
Information hiding, Digital watermarking, Steganography, Document, Natural language processing

1 Introduction

With the expanding use of computer networks, information security techniques for transmitting information safely over a network are becoming increasingly important. Ciphers form one of these techniques, and are used in processing and decrypting information to hide it from attackers or to detect traces of tampering. Ciphers do not necessarily conceal their roles in carrying hidden information. Thus, it is easy to find cipher communications along the communication route, and an attacker, despite an inability to decode the cipher, can nevertheless find and interfere with important cipher communications. (That the communication is encrypted suggests to the attacker that the content of the communication is important.) An effective means of addressing such attacks is to hide the information, concealing the fact that secret information is embedded in the communication. Information hiding can be used not only as a means of camouflage but also as a means of embedding copyright information or distribution destination information in content, including images and music. This paper discusses an information hiding technique that uses a digital document as the cover medium and embeds secret information within the new-line codes inserted in the document.

2 Information hiding for documents[1]

2.1 What is information hiding?

Information hiding may be applied as a means of secret communication (as camouflage, in other words) when transmitting information. It may also be used as a means of embedding proprietary information, such as copyright and distribution destination details, in content such as images and music. When this approach is applied to secret communications it is referred to as “steganography”, and when it is applied to intellectual property rights it is referred to as “digital watermarking”.

Information hiding is a process of embedding secret messages or copyright information (referred to as the embedded data) into content (referred to as the cover data) to create content embedded with information (referred to as the stego data). The stego data is transmitted to the recipient, and the recipient extracts the embedded data from the stego data for use. The main subject of steganography is the embedded data, and the cover data is often used for camouflage only. On the other hand, the main subject of digital watermarking is the cover data (the content), and additional information concerning the cover data is hidden as the embedded data. Thus, steganography focuses on embedding as much data as possible, while digital watermarking focuses on minimizing the difference between the cover data and the stego data (in other words, minimizing the change in content).

2.2 Information hiding for documents; classifications

Information hiding that uses documents as cover data embeds information into the document by adding an artificial component unrecognizable as such by third parties; the aim is to allow only the rightful recipient to extract the secret information from the document.

Classical information-hiding techniques used throughout history have employed documents as the cover media. Today, steganography (secret communication) is the first known case in which these techniques are primarily used against threats such as electronic eavesdropping and filtering. Steganography in documents embeds secret information in data that a third party would regard as comprising ordinary communication. Along with steganography, digital watermarking, which embeds copyright information and “fingerprints” into digital content, is another important application of information-hiding in documents. Digital watermarking adds information to the content to allow identification of the people or organizations that are the rightful holders of the content. This process can identify the source of illegal redistribution and is thus expected to have a deterrent effect on the distribution of pirated files.

In terms of hiding information in documents, one must consider the amount of acceptable modification to the cover text data. If the cover text itself forms the content, such as a novel, in principle no modification is acceptable. On the other hand, when the main subject of the copyright claim is an item of software, an image, or a video, and the copyright information is embedded in the document attached to the content, the stego text data should simply maintain the meaning of the cover text; slight changes in the expression of this data may be acceptable. An example of such a case is seen when embedding information using a software package insert as the cover text, such as the manual or the license agreement. Further, in steganography, in which the embedded information is the focus and the stego text is merely camouflage, if the purpose of information hiding is to avoid automatic filtering, the stego text may not need to carry meaning as long as the structure is basically textual.

Information hiding is a technique of hiding information using redundant cover data. Thus, the technique can be classified into several categories according to the type of document redundancy employed. To make this classification easier to understand, it is best to divide information-hiding methods roughly into two groups: those methods in which the artificiality remains in the hard copy (considered here and below as including screen display) and those in which it does not. Whether the artificiality remains in the hard copy depends on the output system; this is therefore not a strict classification. Nevertheless, this is a convenient division for explanation. Below, these methods are outlined assuming generic output systems.

(1) Information hiding in which artificiality remains in the hard copy

The methods in which the artificiality remains in the hard copy are based on the premise that it is visually possible but difficult to recognize the artificiality. Thus, these methods can be used not only for distribution of electronic data but also for distribution of hard copies. On the other hand, the artificiality must be implemented carefully so that it is not discovered. This category is further classified into the following two types according to the principles used to avoid recognition.

(i) Methods that use visually concealed artificiality

This type of information hiding tries to embed information unrecognized, through subtle artificiality that cannot be detected even if the cover text and the stego text are compared side-by-side with the naked eye. Some implement this effect by adding artificiality to the layout of the document. The basic procedure adds subtle artificiality to the document layout using PostScript or other functions, and then the secret information is extracted by scanning the stego text printed as a hard copy. The content of the text is not important either in embedding or extracting the secret information. This technique makes use of visual differences between the cover and stego text. Thus, it can also be considered a special form of information hiding within images. When applying this method with hard copies, a weakness is found in that the secret information deteriorates and is lost as the images are repeatedly copied, reducing image quality. It is possible to dispense with hard copies and to receive and extract the secret information entirely in electronic data form. However, it is not then necessary to add artificiality to the layout in such cases, and thus these methods can be regarded as of the same type as information hiding within XML and LaTeX documents, discussed later.

Different methods have been proposed for adding artificiality to layout: scaling of the line spacing or word spacing, scaling of character widths, or rotation of the characters. For example, the standard number of bits between the lines is specified in advance, and the spacing is increased when bit 1 is embedded and decreased when bit 0 is embedded. Thus, the accuracy of extracting the secret information depends on the resolution of the scanner. Less scaling would render the artificiality more difficult to recognize, but then again, extraction error will also increase. The difficulty of recognition of the selected artificiality depends on the language. For example, the scaling of word spacing is said to be more advantageous in European languages (such as English) and the scaling and rotation of fonts is said to be more useful in languages that do not insert spaces between words and that use many ideographic characters, such as Japanese[2]. Some methods require collation between the stego text and the original cover text and some do not. Reference[3] describes a number of methods that add artificiality to the layout of the document.

In addition to artificiality within layout, some methods hide small characters and marks in the periphery of the document or within ruled lines. These methods also belong to the current category. Handwritten steganography[4], which hides information in artificiality within the coordinates of the writing or in the tool force, may also belong to this category, to the extent this is regarded as document-based information hiding.

(ii) Methods that use natural-appearing artificiality

Digital documents basically consist of character sequences and layout information. As the characters constitute part of the meaning of the document, indiscriminate digital artificiality in the characters, however slight, may garble them and perhaps significantly damage meaning (and thus reduce the quality of the document). This will also increase the possibility of detection of the artificiality. For this reason, many methods traditionally proposed for information hiding in documents use artificiality in the layout of the document, as described above. However, plain text such as that found in an email does not feature layout information. When hiding information in plain text, one needs to rely on the artificiality added to the characters themselves. In this case, the strategy is to abandon the effort to camouflage the artificiality and instead to rely on the apparent authenticity of directly observed stego text. With this method, artificiality would only be detected with a cover text for comparison. Thus, the assumed utility model does not include any cover text. The artificiality in this category is large, and the secret information is not easily degenerated or lost even with repeated copying in hard-copy format.

To avoid deterioration in documents when adding artificiality, two different approaches are possible: to apply natural language processing (such as word replacement) or to insert characters or character codes that do not influence the outward meaning of the document. Reference[5] presents examples of the former method. The current paper discusses the latter method, which will be explained in the next and subsequent sections.

Some methods do not require an original cover text, generating the stego text from scratch. These methods are also classified within this category. Two examples of proposed tools of this type are “Texto”, which converts uuencode files or PGP messages into English sentences resembling poetry, and “NICETEXT”, which converts binary data into English sentences of a specified style[6].

(2) Information hiding in which artificiality does not remain in the hard copy

With methods in which artificiality does not remain in the hard copy, the artificiality cannot be recognized visually, and thus is not easily detected. However, the secret information is eliminated when the document is converted from electronic data into the display media (paper or monitor screen). Thus, the methods in this category are applied under the assumption that the document is treated as electronic data until the secret information is extracted.

Among proposed methods of this type is “SNOW”, which uses English sentences as the cover text and embeds information by inserting null characters at the end of each line[6]. SNOW first encodes the secret information by compressing the data with Huffman coding, and then inserts up to seven null characters at the end of each line, corresponding to three bits of embedded information per line. Another example is the FFEncode tool[6], which distributes null characters within text data according to Morse code. Another method uses an English LaTeX document as the cover text and embeds information by controlling the positions of line feeds in the main body of the document source file[7]. The methods that embed information in structured documents such as XML documents are also classified into this category, as these also leave no traces of artificiality in hard copies[8].
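The end-of-line idea described above (up to seven invisible characters per line, hence three bits per line) can be illustrated with a minimal sketch. It is not the actual format used by SNOW: it assumes that plain spaces stand in for the "null characters", omits the compression step, and uses the hypothetical function names `embed_trailing` and `extract_trailing`.

```python
def embed_trailing(lines, bits):
    """Append 0-7 trailing spaces to each line; the count encodes three bits.

    Assumption: spaces play the role of the invisible end-of-line characters.
    """
    if len(bits) % 3:
        raise ValueError("bit string length must be a multiple of 3")
    groups = [bits[i:i + 3] for i in range(0, len(bits), 3)]
    if len(groups) > len(lines):
        raise ValueError("cover text has too few lines")
    out = []
    for i, line in enumerate(lines):
        line = line.rstrip(" ")              # discard any existing trailing spaces
        if i < len(groups):
            line += " " * int(groups[i], 2)  # e.g. "101" -> 5 spaces
        out.append(line)
    return out


def extract_trailing(lines, n_groups):
    """Recover three bits per line from the number of trailing spaces."""
    bits = []
    for line in lines[:n_groups]:
        count = len(line) - len(line.rstrip(" "))
        bits.append(format(count, "03b"))
    return "".join(bits)
```

Because the trailing spaces disappear when the text is printed or displayed, a scheme of this kind only works while the document is handled as electronic data, which is the defining property of this category.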

2.3 Information hiding through new-line position control

Here we discuss an information-hiding technique in which information is embedded by controlling the positions of line feeds in a document[9]. This method is intended for an agglutinative language such as Japanese, in which new lines may be started relatively freely. This technique assumes the use of a filler text as the cover text, with new-line codes only at the ends of paragraphs, such as those prepared by a word processor. Figure 1 shows the flow of the embedding and extraction processes for the embedded data in this method. Figure 2 shows examples of cover and stego texts. Embedding data by providing line feeds at appropriate intervals produces a document with many line feeds (the stego text). Two strategies are used when inserting line feeds: (1) reduction in line-length variation (the sum of the widths of the characters in each line) in order to preserve the apparent naturalness of the document and (2) avoidance of unnatural line feeds (such as those in the middle of a word). It is necessary to consider the tradeoffs between these two strategies and to determine the positions of the line feeds to make the document appear as natural as possible.

Fig.1 Process flow in information hiding through control of the number of characters in each line

Information hiding by controlling the positions of line feeds does not influence the content of the document. Thus, it can also be applied when the cover text cannot be easily modified. This method adds artificiality to plain text at the character level and also to the positions of line feeds, which form part of the document layout.

3 Information hiding through new-line position control

3.1 Introduction

In information hiding in which the number of characters in each line is controlled, the correlation between the position of the line feeds and the embedded data (in other words, the rule illustrated in Fig.1) is essential. This rule may be based on the positions of the line feeds within words or on the number of characters in each line. These approaches are described in detail below.

3.2 Method based on the positions of line feeds within words

In the method based on the positions of line feeds within words, information is embedded within the entry words of a morphological dictionary according to the relationship between the position of the line feed in each word (morpheme) and the embedded information bit (either 1 or 0). Figure 3 shows examples. It is specified in advance that the line feed in “suru” (the Japanese verb meaning “to do”) as “su|ru” corresponds to “1” (“|” indicates the position of the line feed). To maintain a natural appearance in the stego text, this method pays particular attention to the evenness of character density in each line, and makes the length of each line (the sum of the widths of the characters in the line) as uniform as possible. For this purpose, we define the width of a one-byte character as “1” and the width of a two-byte character such as a kana or kanji as “2”. According to the standard line length specified at the start of the embedding process, the word at the end of a line is subject to embedding. As shown in Fig.3, for long words such as “puroguramingu” (programming) or “comyunikeshon” (communication), 0 or 1 values are ascribed to two or more new-line positions; any of these positions may be used. In this manner, line feed encoding is possible without deviating too far from the standard line length.

Fig.2 Examples of cover and stego texts in information hiding through control of the number of characters in each line

Fig.3 Example bit assignment tables for each morpheme

(The morphemes are based on the attached dictionary of Reference[10].)

Figure 4 shows an example of information embedding using the assignment table shown in Fig.3. The words with embedded data (morphemes) are underlined. (The underlines are not shown in the actual text.) The text shown in Fig.4 is equally spaced. It is clear that the variation in line length is almost undetectable. In the example shown in Fig.4, “01111101011...” is the embedded data.

The technique described in this section has the following characteristics:

(1) Distinction between the types of characters (hiragana/katakana/kanji) enables processing with a lighter computational load without using morphological analysis.

(2) As the embedding method can be defined for each word, the rules for the correlation between the bit of the embedded information and the new-line position are more difficult to detect than in a method based on the number of characters in each line (discussed later). Thus, this method is more resistant to extraction attacks.

(3) As the new-line position can be defined for each word, unnatural line feeding can be avoided.

On the other hand, there are a number of problems with this technique, involving the handling of errors in morphological analysis and the handling of single-character morphemes.

3.3 Method based on the number of characters in each line

The method described in this section defines in advance an assignment table linking the number of characters in each line and an embedded bit. A new-line code is inserted where the number of characters in the line corresponds to the embedded data bit. The new-line codes are inserted in such a way that the line lengths remain as uniform as possible around the standard line length. When extracting the embedded data, the number of characters in each line is counted and the embedded data are extracted using the same assignment table. In other words, this method embeds a single bit per line. Figure 5 shows an example of information embedding using an assignment table correlating the number of characters in each line to an embedded bit.

Fig.4 Example of information embedding with the proposed method

[The number on the right edge is the embedded data (not shown in actual text).]

To render the line length as uniform as possible, the example in Fig.5 uses 40 characters in the first line, 33 characters in the second line, and so on, to embed “0100101...”

Unlike the method based on the new-line positions within words, this method does not require collation with the bit assignment table for each morpheme. Thus processing is rapid, with little need for error handling. On the other hand, the embedding rules are simple, which leads to a higher risk of extraction attacks.
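The one-bit-per-line scheme of this section can be sketched in a few lines. The sketch rests on several assumptions that are not taken from the paper: the assignment table is a simple parity rule (even length gives 0, odd length gives 1) rather than the table of Fig.5, the cover text is a single paragraph without line feeds, hyphenation restrictions are ignored, and the extractor is told how many bits to read. The names `bit_for_length`, `embed`, and `extract` are illustrative.

```python
def bit_for_length(n):
    """Hypothetical assignment table: even length -> '0', odd length -> '1'."""
    return str(n % 2)


def embed(cover, bits, standard=40):
    """Insert one line feed per bit, keeping each line near the standard length."""
    lines, pos = [], 0
    for bit in bits:
        # Try the lengths closest to the standard first: s, s-1, s+1, s-2, s+2.
        candidates = sorted(range(standard - 2, standard + 3),
                            key=lambda n: (abs(n - standard), n))
        length = next(n for n in candidates if bit_for_length(n) == bit)
        if pos + length > len(cover):
            raise ValueError("cover text too short for the embedded data")
        lines.append(cover[pos:pos + length])
        pos += length
    if pos < len(cover):
        lines.append(cover[pos:])  # the remainder carries no data
    return "\n".join(lines)


def extract(stego, n_bits):
    """Count the characters in each line and look up the bit in the same table."""
    lines = stego.split("\n")[:n_bits]
    return "".join(bit_for_length(len(line)) for line in lines)
```

With a standard length of 40 and the parity rule, the bits "0100101" yield lines of 40, 39, 40, 40, 39, 40, and 39 characters. A real implementation would trade some of this uniformity for natural break positions, as the implementation section discusses.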

4 Implementation

4.1 Introduction

This section discusses the results obtained through the implementation of information hiding tools for embedding one-bit data corresponding to the number of characters per line, as discussed in Section 3.3. Two tools are used here. One uses plain text as the cover text and inserts line feeds according to the bit sequence to be embedded consisting of zeros and ones (the embedded data containing the encrypted secret information) to create a document containing numerous line feeds (stego text). The other tool extracts the secret information from the stego text. The Java language is used in development, in consideration of the most appropriate development environment, future extensibility, and the use of encryption algorithms. The embedded data consists of secret information encrypted with RC4 (40-bit key length) to prevent decoding attacks. To thwart guesses as to the key assignment table for embedding information, the tool can create a table based on random numbers to prevent extraction attacks. The tool uses the random number generator Random(), provided by Java.
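The two preparatory steps described above, encrypting the secret information with RC4 and generating an assignment table from random numbers, might look as follows. This is a sketch in Python rather than the authors' Java tool: `rc4` is a textbook implementation of the cipher, while `to_bits`, `random_table`, the seed parameter, and the length range 1 to 80 are assumptions made for illustration. Python's `random.Random` stands in for Java's `Random()`; neither is a cryptographically secure generator.

```python
import random


def rc4(key: bytes, data: bytes) -> bytes:
    """Textbook RC4; the same call encrypts and decrypts."""
    s = list(range(256))
    j = 0
    for i in range(256):                     # key-scheduling algorithm
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    i = j = 0
    out = bytearray()
    for byte in data:                        # keystream generation and XOR
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


def to_bits(data: bytes) -> str:
    """Bytes to a string of '0'/'1' characters, most significant bit first."""
    return "".join(format(b, "08b") for b in data)


def random_table(seed, min_len=1, max_len=80):
    """Hypothetical random assignment table: line length -> '0' or '1'."""
    rng = random.Random(seed)
    return {n: rng.choice("01") for n in range(min_len, max_len + 1)}
```

A 40-bit key corresponds to five key bytes, for example `rc4(bytes([1, 2, 3, 4, 5]), b"secret")`. A purely random table may assign the same bit to many neighbouring lengths, so a practical tool would probably need to check that both bit values are reachable near the standard line length.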

4.2 Embedding method

The implemented tool selects from two types of methods for arranging the embedded data and three types of methods for determining the new-line positions. Combined, the tool offers six embedding methods. The following describes the details of each embedding method.

(A) Arrangement of the embedded data

With this tool, which embeds secret information in a document by mapping information to the number of characters in each line, it is necessary to implement a mechanism to identify the line containing the embedded data in the stego text when extracting the secret information. The authors have implemented two types of embedding methods: A1, which uses flags to indicate the embedded range, and A2, which embeds the data in sequence from the top of the cover text. These methods are described below.

Fig.5 Assignment table for the embedded bit and example stego text

(The thick numbers on the right edge are the embedded bits. The numbers in parentheses are the number of characters in the line.)

[Method A1] Secret information is embedded between the start and end flags.

Method A1 embeds the secret information somewhere within the cover text only once, placing the start flag, embedded data, and end flag in this order. The line feeds from the start of the cover text to the start flag are dummies containing no information. The start position for embedding and the positions of line feeds up to the start flag are determined using random numbers. Thus, the same input produces a different result in each run, to prevent extraction attacks. Figure 6 shows a conceptual diagram of embedding in the cover text using this method.

This method specifies the following parameters for the embedding process: the assignment table, standard line length, minimum line length, cipher (decoding) key, start flag (eight-bit binary), end flag (eight-bit binary), and maximum start line. When extracting the secret information, this method specifies the same assignment table, minimum line length, cipher (decoding) key, start flag, and end flag used for embedding. A minimum line length is specified to prevent the embedding of information in lines that are deemed too short. This is necessary to exclude lines with lengths that differ significantly from the others, as at the end of a paragraph or in captions, as embedding targets. The minimum line length is a parameter required both in embedding and extraction. The maximum start line specifies the maximum number of dummy line feeds up to the start flag. In embedding, the start line is placed after a number of lines that is randomly chosen within this maximum start-line value. The maximum start line is a required parameter only in embedding.

With this method, attackers cannot easily determine the location of embedded information. However, the data is embedded only once, so that resistance to attack (the conservation of embedded data) when the stego text is partially deleted for editing is low. Extracting the secret information also requires the start and end flags, in addition to the assignment table (which serves as the common key) and the cipher (decoding) key.
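The framing used by Method A1 (random dummy bits, start flag, embedded data, end flag) can be sketched at the level of the bit sequence that is later mapped onto line lengths. The function names `frame_a1` and `unframe_a1` are hypothetical, and two simplifications are assumed: dummy bits are regenerated until the start flag cannot be found too early, and the payload is assumed not to produce the end flag prematurely (the paper does not say how the tool handles either case).

```python
import random


def frame_a1(data_bits, start_flag, end_flag, max_start_line, rng):
    """Build: random dummy bits + start flag + data + end flag."""
    n_dummy = rng.randint(0, max_start_line)  # number of dummy line feeds
    while True:
        dummy = "".join(rng.choice("01") for _ in range(n_dummy))
        # Accept only dummies in which the first match of the start flag
        # is the real one (assumption; not described in the paper).
        if (dummy + start_flag).find(start_flag) == n_dummy:
            return dummy + start_flag + data_bits + end_flag


def unframe_a1(line_bits, start_flag, end_flag):
    """Return the bits between the first start flag and the next end flag."""
    start = line_bits.find(start_flag)
    if start < 0:
        return None
    start += len(start_flag)
    end = line_bits.find(end_flag, start)
    return None if end < 0 else line_bits[start:end]
```

For instance, with the eight-bit flags "01111110" and "10000001", `unframe_a1(frame_a1("0100101", ...), ...)` returns "0100101" regardless of how many dummy bits were drawn, which mirrors the property that the same input produces a different stego text in each run.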

This method is suitable when the embedded data is relatively large compared to the cover text and repeated embedding is difficult, or when partial deletion of the stego text for editing is unlikely.

[Method A2] Secret information is embedded repeatedly

Method A2 repeatedly embeds data from the beginning of the cover text in all line feeds. Thus, it requires no dummy line feeds, start flag, or end flag. Figure 7 shows a conceptual diagram of embedding in a cover text using this method.

This method specifies the following parameters when embedding information: the assignment table, standard line length, minimum line length, and cipher (decoding) key. When extracting the secret information, this method specifies the same assignment table, minimum line length, and cipher (decoding) key used for embedding. This method embeds data redundantly, so that if a means is provided to identify the start of the data for extraction, it is highly probable that the embedded data is correctly extracted even if the stego text is partially deleted for editing. However, this method poses a potentially high risk that the assignment table will be discovered from the repeated patterns.

Fig.6 Embedding technique for Method A1

Fig.7 Embedding procedure based on Method A2

(B) Method for determining the new-line positions

Three methods are implemented to determine the new-line positions, in consideration of the tradeoff between uniformity of line length and the natural appearance of the new-line positions. These three methods are explained below. Although the examples below all use Method A1 for the arrangement of the embedded data, these three methods can also be combined with Method A2.

[Method B1] Emphasis on uniformity in line length

Method B1 places line feeds near the standard line length while minimizing variation in lengths, subject to Japanese hyphenation and other punctuation restrictions. Japanese hyphenation is in accordance with standard MS-Word Japanese hyphenation rules for line heads and tails. Figure 8 shows an example of output using this method.

In this method, the variation in line length is small, so the document appears natural in terms of page design. However, many unnatural line feeds result, as in the middle of a word; the stego text thus may give readers the impression that something is awry.

[Method B2] Line feeds for particular types of characters are restricted

Method B2 applies additional restrictions to Method B1 and avoids line feeds in particular types of character sequences (numbers and alphabetic strings). Figure 9 shows an example of output with this method.

In Fig.9, an alphabetical string such as “representation” is not broken into two lines, so the line with this word is slightly longer than other lines. Thus, Method B2 allows greater variation in line length than Method B1.

[Method B3] Significant emphasis on character-type boundaries

Method B3 adds further constraints to Method B2 to avoid line feeds in kanji, hiragana, and katakana sequences and to restrict line feeds in parentheses. (With five or fewer characters within a pair of parentheses, line feed is avoided.) Thus, the line feed is primarily inserted between different types of characters (kanji/hiragana/katakana/alphabet). In Japanese, the boundary between different types of characters (such as between hiragana and kanji, or between katakana and hiragana) is often the boundary between clauses. Thus, this method increases the natural line feeds between clauses. Figure 10 shows an example of output with this method.

In Fig.10, the document appears to feature clause-based line feeds, which makes the document easy to read. Nevertheless, the deviation in line length is even greater than in Method B2.
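The character-type boundaries that Method B3 favors can be approximated with Unicode code-point ranges. The following is only a sketch: the classification function, its category names, and the ranges chosen are assumptions for illustration, and it ignores both the parenthesis rule and the hyphenation rules described above.

```python
# Hypothetical sketch: find candidate line-feed positions at boundaries
# between character types (kanji / hiragana / katakana / alphabet / digit).

def char_type(ch):
    """Coarse character class based on Unicode code-point ranges (assumed)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A1 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"


def type_boundaries(text):
    """Positions p where text[p-1] and text[p] differ in character type."""
    return [
        p for p in range(1, len(text))
        if char_type(text[p - 1]) != char_type(text[p])
    ]


def nearest_boundary(text, target):
    """Boundary closest to `target`, or None if the text has no boundary."""
    bounds = type_boundaries(text)
    return min(bounds, key=lambda p: abs(p - target)) if bounds else None
```

Restricting line feeds to these positions leaves fewer candidate lengths per line, which is consistent with the larger deviation in line length noted for Method B3.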

5 Evaluation

5.1 Introduction

Information hiding methods should be evaluated in light of (1) the amount of information that can be embedded, (2) the difficulty of detecting information embedding, (3) the difficulty of extracting the embedded data, and (4) the difficulty of destroying the embedded data. With respect to criterion (1), the embedding rate can be quantitatively evaluated. However, criteria (2), (3), and (4) involve evaluation of the behavior of attackers, thus requiring subjective evaluation using actual subjects. This section considers the criteria (2), (3) and (4) in more detail.

In the subjective evaluation for (2), (3), and (4), it is our opinion that criterion (2), the difficulty of detecting that information has been embedded, is equivalent to an assessment of the naturalness of the stego text. We also equate (3), the difficulty of extracting the embedded data, with the issue of security in information hiding, and (4), the difficulty of destroying embedded data (resistance to destructive attacks) with the strength of information hiding. As such, the subjective evaluation here in fact examines two aspects: one is the naturalness of the stego text, and the other is the security and tamper-proofing of the information hiding. Thus it would be reasonable to perform subjective evaluation experiments and subsequent analysis based on these two aspects. The experiments should vary the combination of implementation methods discussed in Section 4, (A) in the arrangement of embedded data or (B) in the determination of new-line positions, and the types of cover text should also be varied, as shown in Table 2. Table 1 summarizes the applicable classifications. The difference in the arrangement of the embedded data is considered to have an effect only when extracting or destroying the embedded information. Thus, this variable is included only in the evaluation of the security and tamper-proofing of information hiding. We also describe the details of the experimental procedure to evaluate the naturalness of the stego text based on different cover text genres (Section 5.3.2).

We nevertheless consider that the subjective evaluation experiments require future elaboration and improvements. Thus, in Sections 5.3 and 5.4 below we present only an overview of the subjective evaluation experiments.

Fig.8 Example of stego text based on Method B1

5.2 Cover texts used for evaluation

Table 2 shows the cover texts used in the evaluation. The characteristics of the cover texts will affect the results of subjective evaluation. Thus, various texts are prepared, including news articles, technical papers, and literary works.

5.3 Subjective evaluation of difficulty in detecting information embedding

5.3.1 Evaluation of the naturalness of the stego text based on method of determining new-line positions

This test evaluates the effect of the differences among the three methods of determining new-line positions, as discussed in Section 4.2 (B), on the naturalness of the generated stego text. The subject group, consisting of 5 to 10 people, is selected with no particular conditions. Stego texts are generated with the same cover data and different methods for determining new-line positions; the data is then provided to subjects in the form of paper or electronic documents. The subjects review each stego text and rate it using the five-point scale shown in Fig.11.

Fig.9 Example of stego text based on Method B2

5.3.2 Evaluation of the naturalness of the stego text based on the cover-text type

This test evaluates the effect of the type of cover text on the naturalness of the stego text. As in Section 5.3.1, the subject group is selected with no specific conditions to include five to 10 people. Stego texts are generated with a single method and different types of cover data; the data is then provided to subjects in the form of paper or electronic documents. The subjects review each stego text and rate it using the five-point scale shown in Fig.11.

The following are the details of the experimental procedure.

(1) Preparation

For the cover texts in Table 2, stego texts are generated using the tool discussed in Section 4, with the same embedded data. In this experiment, the method for determining new-line positions is Method B1, which emphasizes uniformity in line length, and Method A2 is used to arrange the embedded data (repeated data embedding). The methods are each limited to a single type to highlight the effect of cover-text type on the evaluation of naturalness among the defined number of subjects. The set relationship between the number of characters in a line and the bit value of the embedded data is simple: the bit value is “1” when the number of characters is even and “0” when it is odd.

Fig.10 Example stego text based on Method B3

(2) Experiment procedure

(i) Distribution of experiment sheet and evaluation sheet

The experiment sheet and evaluation sheet are distributed to the subjects on paper or as electronic documents. Figure 12 shows examples of the experiment sheet, and Figure 13 shows an example of the evaluation sheet.

(ii) Distribution of evaluation manual

The experiment leader distributes the “evaluation manual” shown in Fig.14 to the subjects and explains its contents. He or she then instructs the subjects to read the manual before beginning the evaluation.

(iii) Evaluation by subjects

The subjects evaluate the stego texts according to the distributed evaluation manual.

(iv) Collection of experimental data

The experiment leader collects the experiment sheet, the evaluation sheet, and the evaluation manual from the subjects after evaluation.

Table 1 Classification of the subjective evaluation experiment

Table 2 Cover texts used for evaluation

Fig.11 Standards for evaluation

(3) Analysis of the experimental results and evaluations

Experiments are performed using two or more documents of the different cover-text types indicated in Table 2 (“children’s news” text not used). Thus, it is possible to assess evaluations in terms of genre and in terms of different documents. The evaluation marks are calculated as follows:

(I) Evaluation distribution and average evaluation mark for each genre

(II) Evaluation distribution and average evaluation mark for each document

Based on the results of the above calculations, the results of (I) are used to analyze the effect of cover-text type on the evaluation of the naturalness of the stego text, and the results of (II) are used to analyze the effect of individual document choice on the evaluation of the naturalness of the stego text.
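Calculations (I) and (II) amount to grouping the five-point marks by genre and by document. A minimal sketch follows; the record layout, the function name, and the sample marks are invented purely for illustration and are not experimental results from this study.

```python
# Hypothetical sketch of calculations (I) and (II): distribution and average
# of five-point evaluation marks, grouped by genre or by document.
from collections import Counter, defaultdict
from statistics import mean


def summarize(records, key):
    """Group marks by records[i][key]; return {group: (distribution, average)}."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec["mark"])
    return {
        name: (dict(sorted(Counter(marks).items())), mean(marks))
        for name, marks in groups.items()
    }


# Invented sample data (not results from the paper).
records = [
    {"genre": "news", "document": "news-1", "mark": 4},
    {"genre": "news", "document": "news-1", "mark": 2},
    {"genre": "news", "document": "news-2", "mark": 3},
    {"genre": "literature", "document": "lit-1", "mark": 5},
    {"genre": "literature", "document": "lit-1", "mark": 4},
]

by_genre = summarize(records, "genre")        # calculation (I)
by_document = summarize(records, "document")  # calculation (II)
```

Comparing `by_genre` with `by_document` separates the effect of the cover-text type from the effect of the particular document chosen within a type.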

5.4 Subjective evaluation of security and tamper-proofing of information hiding

5.4.1 Evaluation of security and tamper-proofing of information hiding based on methods of arranging embedded data or of determining new-line positions

This test evaluates the effects on tamper-proofing of the two methods of arranging embedded data or the effects of the three methods of determining new-line positions, as discussed in Section 4.2 (A) and (B). The subject group consists of five to 10 undergraduate and graduate students in information engineering, presumed to have significant interest in cipher techniques. Stego texts are generated in six different ways by combining the two methods of arranging the embedded data and the three methods of determining new-line positions, and are distributed as electronic documents. The subjects are requested to modify freely any texts that they consider to hold embedded information, while maintaining textual meaning.

5.4.2 Evaluation of security and tamper-proofing of information hiding based on the cover-text type

This test evaluates the effect of cover-text type on the resistance to tampering. The experiment is planned as follows. As in Section 5.4.1, the subject group consists of five to 10 undergraduate and graduate students in information engineering, presumed to have significant interest in cipher techniques. Stego texts are generated using cover texts of different types and are distributed as electronic documents. The subjects are requested to modify freely any texts that they consider to hold embedded information, while maintaining textual meaning.

Fig.12a Example experiment sheet (Sheet number 1: General news)

Fig.12b Example experiment sheet (Sheet number 6: Children’s literature)

Fig.13 Example evaluation sheet

6 Discussion

As stated at the beginning of this paper, information hiding has two major applications: “digital watermarking”, which embeds copyright information or “fingerprints” (information for identifying the distribution destination) into electronic contents, and “steganography” (secret communication), intended to counter threats such as electronic eavesdropping and filtering by a third party. Information hiding in documents as discussed in this paper is considered best applied in cases in which a third party cannot easily modify the new-line positions, such as direct document exchanges between two people (as with email and printed documents). For example, when distributing a confidential printed document among concerned parties, “fingerprints” may be embedded based on the number of words in each line throughout the document without modifying the content. Then, as a person intending to leak the document cannot easily produce a paper copy that hides the source of the leak, this method deters leaks. When printed documents are used as the media, the secret information is extracted using an OCR (Optical Character Reader), as when information is hidden in the document layout; many related methods have been proposed. However, it should be recognized that the embedded data is not subtly conveyed, as in the size of line spacing, character spacing, or miniature characters, but instead corresponds to each line length (the sum of the widths of each character), which is relatively conspicuous. This method is nevertheless superior in that the secret information is not easily lost even if the document is repeatedly copied with low-quality reproduction.

Let us consider the points to keep in mind when applying this technique to steganography or digital watermarking. Steganography focuses on communicating secret information and uses the stego text only for camouflage. Thus, if the purpose of information hiding is to avoid automatic filtering by machines in the course of distribution as electronic data, a composition resembling natural language may be sufficient as the stego text, even in the absence of any logical meaning in the document. On the other hand, when applying the technique to digital watermarking, the cover text must have meaning. If content with significant meaning even in subtle expressions, as in novels, is to be used as the cover text, the text cannot be modified in any way. Even if the document emphasizes the basic meaning of the content, as in confidential documents and manuals, only subtle modification is allowed within a range that does not change the meaning of the document. In this respect, the developed tool will only modify the document in the position of the line feeds. Thus, this tool can be used for both steganography and digital watermarking.

Fig.14 Evaluation manual

When using the technique for steganography, it is particularly important to hide the fact that information is embedded in the document. Thus, it is necessary to devise methods that can maintain the visual naturalness of the stego text, i.e., the uniformity of line lengths and the naturalness of the new-line positions. To this end, it is effective to optimize the method of determining the new-line positions. It is also effective to use layout functions, such as justification, when displaying or printing the document.

Whether a technique is used for steganography or for digital watermarking, we must consider measures against decoding, extraction, tampering, and spoofing. The technique discussed in this paper uses randomizing of the assignment table and encoding of secret information. Error correction may also be used as an additional measure. When considering the distribution of the stego text as electronic data, it is also essential to take measures against destructive attacks through partial deletion of the stego text for editing and modification of new-line positions. The technique provides two methods for arranging the embedded data: redundant embedding in Method A2, and randomized selection of embedding position in Method A1, both of which are effective to an extent.
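To illustrate why redundant embedding offers some resistance to destructive attacks, the sketch below repeats the whole bit string several times and recovers each bit by majority vote. The exact arrangement used by Method A2 is not restated in this section, so this is an assumed, simplest-possible form; the function names and the number of copies are hypothetical.

```python
# Hypothetical sketch of redundant embedding: the whole bit string is repeated
# `copies` times and each bit is recovered by majority vote over its copies.

def arrange_redundant(bits, copies=3):
    """Return the bit sequence to embed: `bits` repeated `copies` times."""
    return list(bits) * copies


def recover_majority(embedded, nbits):
    """Majority vote per bit position; None marks unreadable lines."""
    recovered = []
    for i in range(nbits):
        votes = [b for b in embedded[i::nbits] if b is not None]
        ones = sum(votes)
        recovered.append(1 if ones * 2 > len(votes) else 0)
    return recovered
```

This models only isolated errors in individual lines. An attack that reflows the text shifts every subsequent line and would break the alignment between copies, which is one reason the protection holds only to an extent.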

The technique discussed in this paper may be applied not only to information hiding but also to detection of tampered documents. In other words, the hash value or message authentication codes (MAC) can be embedded into a text document according to this method as verification data; this data is then extracted for comparison with the stego text in verification. Any tampering can thus be detected [11].
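One way to picture this verification flow is sketched below, using HMAC-SHA-256 from the Python standard library truncated to a small number of bits. The key, the truncation length, and the choice to compute the MAC over the text with line feeds removed (so that moving new-line positions does not change the value being protected) are assumptions made for illustration; the scheme in [11] may differ.

```python
# Hypothetical sketch: derive verification bits from the text body (line feeds
# removed), so that the bits can be embedded via new-line positions and later
# recomputed and compared.
import hashlib
import hmac


def mac_bits(text, key, nbits=32):
    """First `nbits` bits of HMAC-SHA-256 over the text without line feeds."""
    body = text.replace("\r", "").replace("\n", "").encode("utf-8")
    digest = hmac.new(key, body, hashlib.sha256).digest()
    value = int.from_bytes(digest, "big") >> (len(digest) * 8 - nbits)
    return [(value >> (nbits - 1 - i)) & 1 for i in range(nbits)]


def verify(stego_text, extracted_bits, key):
    """True if bits extracted from the stego text match the recomputed MAC."""
    expected = mac_bits(stego_text, key, len(extracted_bits))
    return hmac.compare_digest(bytes(expected), bytes(extracted_bits))
```

A short truncated MAC keeps the number of lines needed for embedding small, but it also raises the chance that a modified text passes verification by coincidence.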

7 Conclusions

This paper discusses an information hiding technique that uses a digital document as the embedding medium and encodes the secret information in the positions of the new-line codes inserted in the document. Even in our present society, in which multimedia technology continues to advance, text-based information such as e-mail is still the most important means of information exchange. Information hiding in documents is therefore likely to remain important, and many applications will continue to arise that lend themselves to related techniques.

Acknowledgements

This study is being conducted in the context of regular discussions with members of Prof. Tsutomu Matsumoto’s laboratory at Yokohama National University, members of Prof. Hiroshi Nakagawa’s laboratory at the University of Tokyo, and members of the Mitsubishi Research Institute, Inc. We appreciate their useful advice.


References

01 Hiroshi Nakagawa, Osamu Takizawa and Shingo Inoue, “Information Hiding on Digital Documents”, IPSJ Magazine, Vol.44, No.3, pp.248-253, 2003. (In Japanese)

02 Kineo Matsui, “Primer of Digital Watermarking”, Morikita Publishing, 1998. (In Japanese)

03 R.J.Anderson and F.A.P.Petitcolas, “Information Hiding - An Annotated Bibliography”, http://www.cl.cam.ac.uk/~fapp2/steganography/bibliography/Annotated_Bibliography.pdf, 1999.

04 Norihisa Segawa, Yuko Murayama and Masatoshi Miyazaki, “The Proposal of a Handwriting Steganography with the Characteristic of Handwriting Input Equipment”, Computer Security Symposium 2002, pp.215-219, 2002. (In Japanese)

05 Hiroshi Nakagawa, Koji Sanpei, Tsutomu Matsumoto, Takeshi Kashiwagi, Shuji Kawaguchi, Kyoko Makino and Ichiro Murase, “Meaning Preserving Information Hiding - Japanese text Case”, IPSJ Journal, Vol.42, No.9, pp.2339-2350, 2001. (In Japanese)

06 Information-Technology Promotion Agency, “Technical Research Report of Information Hiding”, http://www.ipa.go.jp/security/fy10/contents/crypto/report/Information-Hiding.htm, 1998. (In Japanese)

07 Tsutomu Matsumoto, Hiroshi Itoyama, “Can Bypassing Lawful Access be Always Detected?”, Technical Report of IEICE, ISEC96-79, pp.159-164, 1997. (In Japanese)

08 Shingo Inoue, Ichiro Murase, Osamu Takizawa, Tsutomu Matsumoto and Hiroshi Nakagawa, “A Proposal on Steganography Methods using XML”, The 2002 Symposium on Cryptography and Information Security, IEICE, pp.301-306, 2002. (In Japanese)

09 Osamu Takizawa, Tsutomu Matsumoto, Hiroshi Nakagawa, Ichiro Murase and Kyoko Makino, “Steganography on Digital Documents by Adjustment of New-line Positions”, IPSJ Journal, Vol.45, No.8, pp.1977-1979, 2004. (In Japanese)

10 “ChaSen - A morphological analysis system”, version 2.0 for Windows, Computational Linguistics Laboratory, Graduate School of Information Science, Nara Institute of Science and Technology, 1999. (In Japanese)

11 Tsutomu Matsumoto, Katsunari Yoshioka, Masataka Suzuki, Ken'ichiro Akai, Osamu Takizawa, Kyoko Makino and Hiroshi Nakagawa, “Text Alteration Detection by New-Line Positions”, The 2004 Symposium on Cryptography and Information Security, IEICE, pp.983-988, 2004. (In Japanese)


TAKIZAWA Osamu, Ph.D.

Senior Researcher, Security Advancement Group, Information and Network Systems Department

Contents Security, Telecommunication Technology for Disaster Relief

NAKAGAWA Hiroshi, Dr. Eng.

Professor, University of Tokyo

Natural Language Processing

MATSUMOTO Tsutomu, Dr. Eng.

Professor, Yokohama National University

Information Security

MURASE Ichiro

Mitsubishi Research Institute, Inc.

Information Security

MAKINO Kyoko

Mitsubishi Research Institute, Inc.

Information Security