me.mello

PDF Overview - Peering into the Internals of PDF


Author: Anand

It looks like a lot to read, but it isn't; the time will pass quickly.

Introduction

pdfinternals_logo.jpg

Portable Document Format (PDF) is a file format for representing documents in a manner independent of the application software, hardware, and operating system used to create them and of the output device on which they are to be displayed or printed.

In this introductory article I will explain the internals of a PDF document, its structures and components, with examples and screenshots. It will help you understand the inner workings of a PDF document and will be especially useful if you are into PDF malware analysis.

Components of PDF File

PDF syntax consists of four main components:

1. Objects

2. File Structure

3. Document Structure

4. Content Stream

PDF Objects

A PDF file consists primarily of objects, of which there are eight types:

1. Boolean values, representing true or false

2. Numbers, both integer and real

3. Strings

4. Names

5. Arrays, ordered collections of objects

6. Dictionaries, collections of objects indexed by Names

7. Streams, usually containing large amounts of data

8. The null object denoted by keyword null

I will explain each of these objects in more detail in the following sections.

PDF Objects -> Strings

String objects can be represented in two ways:

Literal Strings

Hexadecimal Strings

A literal string consists of any number of characters between opening and closing parentheses.

Example

(This is a string object)

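If a string is too long to fit on one line, it can be split across lines with a backslash at the end of the line, as shown below. The backslash and the line break that follows it are not part of the string.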

(This is a very long\

String.)

A hexadecimal string consists of hexadecimal digits enclosed in angle brackets.

Example:

<A0C1D2E3F1>

Here each pair of hexadecimal digits defines one byte of the string.
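The pairing of hexadecimal digits into bytes can be sketched in Python as below. The function name decode_hex_string is made up for illustration. The handling of whitespace and of an odd number of digits (a trailing 0 is assumed) follows the PDF specification rather than anything stated in this article.

```python
def decode_hex_string(token: str) -> bytes:
    """Decode a PDF hexadecimal string such as '<A0C1D2E3F1>' into bytes."""
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("hexadecimal strings are enclosed in < and >")
    # Whitespace between the digits is ignored.
    digits = "".join(token[1:-1].split())
    # An odd number of digits is treated as if a trailing 0 were present.
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits)


print(decode_hex_string("<A0C1D2E3F1>"))  # b'\xa0\xc1\xd2\xe3\xf1'
```

Each of the five digit pairs in the example becomes one byte of the result.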

PDF Objects -> Names

A name object is uniquely defined by a sequence of characters. A slash character (/) introduces a name.

Example

/secsavvy

/SecSavvy

These are two different names.

/Sec#20Savvy means "Sec Savvy"; 20 is the hexadecimal value of the space character.

Note: PDF is case-sensitive.
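A minimal Python sketch of the #xx escape in names follows. decode_name is a hypothetical helper, and it assumes every escaped value can simply be mapped to the character with that code, which is enough for the example above.

```python
import re


def decode_name(token: str) -> str:
    """Return the text of a PDF name token, expanding #xx hexadecimal escapes."""
    if not token.startswith("/"):
        raise ValueError("a name starts with a slash")
    # Replace each '#' followed by two hex digits with the character it encodes.
    return re.sub(
        r"#([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        token[1:],
    )


print(decode_name("/Sec#20Savvy"))  # Sec Savvy
print(decode_name("/secsavvy") == decode_name("/SecSavvy"))  # False
```

The second line of output reflects case sensitivity: the two names compare as different.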

PDF Objects -> Array

An array object is an ordered collection of objects. A PDF array can be heterogeneous, that is, its elements may be of different types. It is written inside square brackets.

Example

[1 (string) /Name 3.14]

PDF Objects -> Dictionary

A dictionary object consists of pairs of objects. The first element of each pair is the key and the second is the value.

The key must be a name. A dictionary is written as a sequence of key-value pairs enclosed in double angle brackets (<< ... >>).

Example

<< /Type /Pages

/Kids [ 4 0 R ]

/Count 1

>>

Here /Count is a key and 1 is its value.

PDF Objects -> Streams

A stream object, like a string object, is a sequence of bytes. A stream can be of unlimited length, whereas a string is subject to an implementation limit. For this reason, objects with potentially large amounts of data, such as images and page descriptions, are represented as streams.

A stream consists of a dictionary followed by zero or more bytes bracketed between the keywords stream and endstream:

dictionary

stream

... Zero or more bytes ...

endstream
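The layout above can be read back with a few lines of Python. This is only a sketch: extract_stream and the sample bytes are invented for illustration, and it assumes the dictionary carries a /Length entry given as a direct integer (as in the filter example later in this article), not as an indirect reference.

```python
import re


def extract_stream(obj: bytes) -> bytes:
    """Return the raw bytes that sit between 'stream' and 'endstream'."""
    # Assumption: /Length is a plain integer inside the stream dictionary.
    length = int(re.search(rb"/Length\s+(\d+)", obj).group(1))
    # The data starts right after the end-of-line that follows 'stream'.
    start = re.search(rb">>\s*stream\r?\n", obj).end()
    return obj[start:start + length]


sample = b"<< /Length 11 >>\nstream\nHello, PDF!\nendstream"
print(extract_stream(sample))  # b'Hello, PDF!'
```

Relying on /Length instead of searching for endstream matters because the stream bytes themselves may contain any byte sequence.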

PDF Objects -> Indirect Ones

Objects may be labeled so that they can be referred to by other objects. A labeled object is called an indirect object.

Example

Consider the object below, where obj and endobj are keywords.

10 0 obj

(SecSavvy String)

endobj

This defines a string object with object number 10 (the 0 that follows is its generation number).

Elsewhere in the file, this object can be referred to by an indirect reference:

10 0 R
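How a reference such as 10 0 R leads back to its object can be sketched in Python as follows. The regular expressions, the sample data and the resolve helper are illustrative assumptions; a plain regex scan like this can be fooled by stream contents, so real parsers locate objects through the cross-reference table instead.

```python
import re

# "<number> <generation> obj ... endobj" and "<number> <generation> R"
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")

data = (
    b"10 0 obj\n(SecSavvy String)\nendobj\n"
    b"11 0 obj\n<< /Title 10 0 R >>\nendobj\n"
)

# Index every indirect object by (object number, generation number).
objects = {
    (int(num), int(gen)): body.strip()
    for num, gen, body in OBJ_RE.findall(data)
}


def resolve(ref: bytes) -> bytes:
    """Look up the object that an indirect reference like b'10 0 R' points to."""
    match = REF_RE.fullmatch(ref.strip())
    if match is None:
        raise ValueError("not an indirect reference")
    return objects[(int(match.group(1)), int(match.group(2)))]


print(resolve(b"10 0 R"))  # b'(SecSavvy String)'
```

Object 11 in the sample data holds the reference 10 0 R, which resolves to the string defined in object 10.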

PDF Objects -> Streams -> Filters

A filter is an optional part of the specification of a stream, indicating how the data in the stream must be decoded before it is used. For example, if a stream has an ASCIIHexDecode filter, an application reading the data in that stream will transform the ASCII hexadecimal-encoded data in the stream into binary data.

Data encoded first with LZW and then with ASCII base-85 can be decoded using the following entry in the stream dictionary. The filters are listed in the order in which they are applied when decoding:

/Filter [ /ASCII85Decode /LZWDecode ]

Example

1 0 obj

<< /Length 534 /Filter [ /ASCII85Decode /LZWDecode ]>>

stream

J..)6T`?p&<!J9%_[umg"B7/Z7KNXbN'S+,*Q/&"OLT'FLIDK#!n`$"<Atdi`\Vn%b%)&'cA*VnK\CJY(sF>c!Jnl@RM]WM;jjH6Gnc75idkL5]+cPZKEBPWdR>FF(kj1_R%W_d&/jS!;iuad7h?[L-F$+]]0A3Ck*$I0KZ?;<)CJtqi65XbVc3\n5ua:Q/=0$W<#N3U;H,MQKqfg1?:lUpR;6oN[C2E4ZNr8Udn.'p+?#X+1>0Kuk$bCDF/(3fL5]Oq)^kJZ!C2H1'TO]Rl?Q:&?<5&iP!$Rq;BXRecDN[iJB`,)o8XJOSJ9sDS]hQ;Rj@!ND)bD_q&C\g:inYC%)&u#:u,M6Bm%IY!Kb1+?:aAa?S`ViJglLb8<W9k6Yl\\0McJQkDeLWdPN?9A?jX*al>iG1p&i;eVoK&juJHs9%;Xomop?5KatWRT?JQ#qYuL,JD?M$0QP)lKn06l1apKDC@\qJ4B!!(5m+j.7F790m(Vj88l8Q:_CZ(Gm1%X\N1&u!FKHMB~>

endstream

endobj

Here is the list of standard filters:

ASCIIHexDecode

ASCII85Decode

LZWDecode

FlateDecode

RunLengthDecode

CCITTFaxDecode

JBIG2Decode

DCTDecode

JPXDecode

Crypt
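Applying a /Filter array in order can be sketched with the Python standard library. This is a rough illustration, not a full decoder: it covers only ASCIIHexDecode, ASCII85Decode and FlateDecode (the standard library has no LZW decoder, so the example above is not reproduced), and decode_stream, FILTERS and the sample data are assumptions made for this sketch.

```python
import base64
import binascii
import zlib


def ascii_hex_decode(data: bytes) -> bytes:
    # Drop whitespace and the '>' end-of-data marker; pad an odd digit count with 0.
    digits = b"".join(data.split()).rstrip(b">")
    if len(digits) % 2:
        digits += b"0"
    return binascii.unhexlify(digits)


def ascii85_decode(data: bytes) -> bytes:
    # PDF streams end ASCII85 data with '~>' but usually carry no leading '<~'.
    data = data.strip()
    if not data.startswith(b"<~"):
        data = b"<~" + data
    return base64.a85decode(data, adobe=True)


FILTERS = {
    "ASCIIHexDecode": ascii_hex_decode,
    "ASCII85Decode": ascii85_decode,
    "FlateDecode": zlib.decompress,
}


def decode_stream(raw: bytes, filter_names: list) -> bytes:
    """Apply each named filter in the order it appears in the /Filter array."""
    for name in filter_names:
        raw = FILTERS[name](raw)  # raises KeyError for filters not covered here
    return raw


# Build test data the way a PDF writer would: compress first, then ASCII85-encode.
original = b"Hello, PDF filters!"
encoded = base64.a85encode(zlib.compress(original), adobe=True)[2:]  # strip '<~'
print(decode_stream(encoded, ["ASCII85Decode", "FlateDecode"]) == original)  # True
```

Decoding runs in the reverse order of encoding, which is why ASCII85Decode comes first in the array.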

File Structure

A PDF file consists of four main elements:

A header identifying the version of the PDF specification to which the file conforms.

A body containing the objects that make up the document contained in the file

A cross-reference table containing information about the indirect objects in the file

A trailer giving the location of the cross-reference table and of certain special objects within the body of the file.
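The four elements can be located in a tiny hand-built file, as in the Python sketch below. The %PDF-x.y header line, the startxref keyword and the %%EOF marker come from the PDF specification and are not described in this article; summarize and the sample file are made up for illustration.

```python
import re

# A tiny hand-built file: header + body, cross-reference table, trailer.
body = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref = b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
trailer = (
    b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % len(body)
)
pdf = body + xref + trailer


def summarize(data: bytes) -> dict:
    """Report the header version and where the cross-reference table starts."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    # The number after 'startxref' is the byte offset of the 'xref' keyword.
    startxref = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data)
    if header is None or startxref is None:
        raise ValueError("not a PDF file this sketch understands")
    offset = int(startxref.group(1))
    return {
        "version": header.group(1).decode("ascii"),
        "xref_offset": offset,
        "xref_found": data.startswith(b"xref", offset),
        "has_trailer": b"trailer" in data[offset:],
    }


info = summarize(pdf)
print(info)
```

A reader works from the end of the file backwards: it finds the offset of the cross-reference table first, then uses the table to jump to individual objects in the body.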

pdfinternals_screen1.jpg

Cross Reference Table

The cross-reference table contains information that permits random access to indirect objects within the file so that the entire file need not be read to locate any particular object. The table contains a one-line entry for each indirect object, specifying the location of that object within the body of the file.

Each cross-reference section begins with a line containing the keyword xref. Following this line are one or more cross-reference subsections, which may appear in any order.

Each cross-reference subsection contains entries for a contiguous range of object numbers. The subsection begins with a line containing two numbers separated by a space: the object number of the first object in this subsection and the number of entries in the subsection. For example, the line

0 8

introduces a subsection containing eight objects numbered consecutively from 0 to 7.

xref

0 8

0000000000 65535 f

0000000009 00000 n

0000000074 00000 n

0000000120 00000 n

0000000179 00000 n

0000000364 00000 n

0000000466 00000 n

0000000496 00000 n

In an in-use entry (marked n), 0000000009 is the 10-digit byte offset, giving the number of bytes from the beginning of the file to the beginning of the object.

In a free entry (marked f), 0000000000 is the 10-digit object number of the next free object. In both kinds of entry, the second field is the 5-digit generation number.
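The table above can be parsed with a short Python sketch. XrefEntry and parse_xref are hypothetical names, and the parser assumes a well-formed table with one entry per line, as shown in the example.

```python
from typing import List, NamedTuple


class XrefEntry(NamedTuple):
    obj_num: int
    offset_or_next_free: int  # byte offset if in use, next free object if free
    generation: int
    in_use: bool


def parse_xref(text: str) -> List[XrefEntry]:
    """Parse a classic cross-reference section given as text."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or lines[0] != "xref":
        raise ValueError("section must start with the keyword xref")
    entries = []
    i = 1
    while i < len(lines):
        # Subsection header: first object number and number of entries.
        first, count = map(int, lines[i].split())
        for k in range(count):
            field1, gen, kind = lines[i + 1 + k].split()
            entries.append(XrefEntry(first + k, int(field1), int(gen), kind == "n"))
        i += 1 + count
    return entries


table = """
xref
0 8
0000000000 65535 f
0000000009 00000 n
0000000074 00000 n
0000000120 00000 n
0000000179 00000 n
0000000364 00000 n
0000000466 00000 n
0000000496 00000 n
"""

for entry in parse_xref(table):
    print(entry)
```

The subsection header 0 8 yields eight entries, numbered 0 through 7; object 0 is the free entry and object 1 starts at byte offset 9.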

Example Screenshots: Simple Hello World Text PDF

Here is a series of screenshots showing the different parts of a sample PDF document.

pdfinternals_screen2.jpg

pdfinternals_screen4.jpg

pdfinternals_screen5.jpg

Conclusion

This article briefly explains the internals of a PDF document, its structures and components, with examples and detailed screenshots. I hope it will help you in malware research work revolving around PDF documents.

Although this is enough for beginners, advanced users are advised to read through the reference white paper for more granular details: http://www.aiim.org/documents/standards/PDF-Ref/References/Adobe/PDFReference17.pdf
