It was on July 28, when PDF 2. A PDF file is a set of bytes that can be grouped into tokens according to the syntax rules defined by the PDF specification. One or more tokens are combined to form higher-level syntactic entities, principally objects, which are the basic data values from which a PDF document is constructed.
The body of a PDF file consists of a sequence of indirect objects representing the contents of a document. The objects, as described above, represent components of the document such as fonts, pages and sampled images. Beginning with PDF 1. The cross-reference table contains information that permits random access to indirect objects within the file so that the entire file need not be read to locate any particular object.
The table shall contain a one-line entry for each indirect object, specifying the byte offset of that object within the body of the file. The trailer of a PDF file enables a conforming reader to quickly find the cross-reference table and certain special objects. Conforming readers should read a PDF file from its end. The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section.
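Reading the file from its end can be sketched in a few lines of Python. This is only an illustration: the function name find_startxref, the 1024-byte window and the sample bytes are assumptions made for the example, not part of any PDF library.

```python
import re

def find_startxref(data: bytes) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    # Only the tail of the file is inspected, the way a reader that starts
    # from the end would do it. The 1024-byte window is an arbitrary choice.
    tail = data[-1024:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref/%%EOF pair found near the end of the file")
    return int(matches[-1])

# A made-up, heavily shortened file: 9 header bytes plus 11 body bytes,
# so the xref keyword begins at byte offset 20.
sample = (
    b"%PDF-1.4\n"
    b"...body...\n"
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<< /Size 1 >>\n"
    b"startxref\n20\n%%EOF\n"
)
xref_offset = find_startxref(sample)
```

With the offset in hand, a reader can jump straight to the cross-reference section instead of scanning the whole body.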
Any object in a PDF file may be labelled as an indirect object. An indirect object is given a unique object identifier by which other objects can refer to it. Cross-references to these objects are maintained in an index table marked with the xref keyword, which follows the main body and gives the byte offset of each indirect object from the start of the file.
PDF layouts are categorized as linear and non-linear, depending upon the target applications and other factors. In a non-linear file the pages of the document are scattered across the PDF file, which is why non-linear files are slower to access than linear files. As mentioned, the PDF body is a collection of the objects described above.
As we learnt earlier, deleted objects have the letter 'f' at the end of their cross-reference entry. This means that if, say, object 5 existed before and was deleted during the update, the new cross-reference section will have an entry for object 5 with 'f' as its last character.
When the PDF file gets updated, along with a new cross-reference section a new trailer is added. This contains all the entries from the previous trailer but will have a different value for the Prev entry in the dictionary.
The Prev entry will have the location of the previous cross-reference section. Hopefully we will discuss this in detail later.
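The chain of Prev entries can be followed with a short loop. The sketch below is a simplification under stated assumptions: it only handles classic xref tables, it finds the trailer dictionary with a plain text search (which would break on nested dictionaries), and the function name xref_offsets and the sample bytes are made up for illustration.

```python
import re

def xref_offsets(data: bytes, start: int) -> list:
    """Follow /Prev entries from the newest cross-reference section backwards."""
    offsets = []
    seen = set()
    pos = start
    while pos is not None and pos not in seen:
        seen.add(pos)               # guards against a looping Prev chain
        offsets.append(pos)
        trailer_at = data.find(b"trailer", pos)
        if trailer_at == -1:
            break
        # Simplification: the trailer dictionary is assumed to end at the first ">>".
        trailer_end = data.find(b">>", trailer_at)
        match = re.search(rb"/Prev\s+(\d+)", data[trailer_at:trailer_end])
        pos = int(match.group(1)) if match else None
    return offsets

# Made-up file with an original section and one appended by an update.
header = b"%PDF-1.4\n"
first = b"xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 >>\n"
second_start = len(header) + len(first)
second = b"xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 /Prev %d >>\n" % len(header)
data = header + first + second

chain = xref_offsets(data, second_start)
```

The result lists the newest section first, which is also the order in which a reader should trust the entries: a newer entry for an object overrides an older one.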
The structure of a PDF file is like the levels of hierarchy found in a typical company. The trailer has a Root entry that gives the location of the catalog. The Document Catalog is the centre from which all information about the PDF file can be found. Being a dictionary, it consists of various keys. For the time being we will look only at the mandatory keys.
Type - will always be Catalog (a Name).
Pages - An indirect reference to the object that is the root of the page tree (we will look at this later).
In this case, it is a 'Catalog' dictionary. An application that reads the above Catalog dictionary will know that it needs to read the 'Pages' dictionary (indirect object 3) to get information about the pages in this PDF file. The page tree has two types of nodes: page tree nodes and page objects. Each page in a PDF file is represented as a Page object, and each of these objects is a 'leaf' node in the page tree. A page tree node has the following keys.
Type - will always be Pages for a page tree node.
Parent - the page tree node which is this node's parent. Not allowed in the root node.
Kids - an array referring to the children of this node. The children can only be page tree nodes or page objects.
Count - the number of page objects (leaf nodes) that are descendants of this node.
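The relationship between Kids and Count can be illustrated by modelling the page tree as nested Python dictionaries. This is a toy model, not a PDF parser: indirect references are replaced by direct nesting, and the tree itself is invented for the example.

```python
def count_pages(node: dict) -> int:
    """Count the page objects (leaf nodes) below a page tree node."""
    if node["Type"] == "Page":
        return 1
    return sum(count_pages(kid) for kid in node["Kids"])

def iter_pages(node: dict):
    """Yield page objects in document order by walking Kids left to right."""
    if node["Type"] == "Page":
        yield node
    else:
        for kid in node["Kids"]:
            yield from iter_pages(kid)

# Invented tree: the root has one page and one intermediate node with two pages.
inner = {"Type": "Pages", "Count": 2, "Kids": [
    {"Type": "Page", "Label": "p2"},
    {"Type": "Page", "Label": "p3"},
]}
root = {"Type": "Pages", "Count": 3, "Kids": [
    {"Type": "Page", "Label": "p1"},
    inner,
]}

total = count_pages(root)
order = [page["Label"] for page in iter_pages(root)]
```

Because Count stores the number of leaf descendants, a reader can skip whole subtrees when looking for page number n instead of visiting every page.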
The PDF that I had created earlier has this page tree (remember that the Catalog dictionary was pointing to indirect object 3). This page tree node has only one kid, which is object 4. The Parent key is missing, and therefore this is the root node. Page Objects: A page object is a dictionary that describes the characteristics of the page itself. Some of the keys are listed below. Note: Most of the keys are new to me. I have purposefully left out keys that make no sense to me at this moment.
Parent - An indirect reference to the parent of this page.
LastModified - Date and time when this page was last modified.
Resources - The resources required by this page. This usually refers to the fonts used on this page and other info.
MediaBox - A rectangle that defines the boundary inside which the page has to be displayed.
Contents - A content stream that describes the contents of this page.
Rotate - The number of degrees by which the page is rotated before it is displayed.
Thumb - A stream object that gives the thumbnail image for this page.
Dur - The number of seconds the page will be displayed in presentations before automatically moving on to the next page.
Trans - A dictionary advising what transition to use when displaying the page during a presentation.
Annots - An array of dictionaries containing references to all the annotations for this page.
AA - Short for additional-actions. This dictionary defines the actions that need to be taken when the page is opened or closed.
Metadata - A stream that contains metadata for this page.
As you can see, Object 1 is the catalog that directs the PDF reading application to the root of the page tree (Object 3). Object 3, the root node, has only one kid (Object 4) and, being the root, cannot have a parent.
Object 4 is 'displayed' within the rectangle given by its MediaBox, is not rotated (Rotate 0) and has Object 3 as its parent. Its resources as well as its contents (Object 5) are included.
Here is Object 5 from my file. As we discussed earlier, the stream in this object starts with a dictionary that gives the length of the stream, which is stored in Object 6. We will discuss content streams further down. Page attributes are inherited: Here is an interesting fact. Certain attributes of a page can be inherited from its parent or any of its ancestors in the page tree. This eliminates the need to keep repeating the same attributes for every child, grandchild, etc.
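Inheritance of this kind amounts to walking up the Parent chain until a value is found. The sketch below uses plain Python dictionaries in place of PDF objects; the function name page_attribute and the attribute values are assumptions made for the example. A real reader would also restrict the lookup to the attributes the specification marks as inheritable.

```python
def page_attribute(page: dict, key: str):
    """Return the value of key from the page or its nearest ancestor."""
    node = page
    while node is not None:
        if key in node:
            return node[key]       # the nearest definition wins
        node = node.get("Parent")  # the root node has no Parent
    return None

# Invented tree: the root defines MediaBox and Rotate for all its descendants.
root = {"Type": "Pages", "MediaBox": [0, 0, 612, 792], "Rotate": 0}
plain_page = {"Type": "Page", "Parent": root}
rotated_page = {"Type": "Page", "Parent": root, "Rotate": 90}

inherited_box = page_attribute(plain_page, "MediaBox")
inherited_rotation = page_attribute(plain_page, "Rotate")
own_rotation = page_attribute(rotated_page, "Rotate")
```

The second page shows the override case: its own Rotate value takes precedence over the one defined on the root.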
If an ancestor defines a value for an attribute, that value can be replaced or changed by the child. Name Dictionary: Rather than being referred to by their references, some objects can be referred to by name. The link between the names and their references is stored in the PDF file's name dictionary. Names, one of the optional keys in the Catalog, is used to specify the name dictionary.
In Object 5 of my PDF file, mentioned earlier and repeated below, we can see a stream. The data in the stream makes no sense when read directly because it has been encoded (converted from its original form to another).
In the following sections we will look in more detail at the structure of this data and understand how it forms instructions for the PDF reading application to display the page. Note that, unlike other objects in a PDF file, the instructions in the content stream are read and followed sequentially, one after the other.
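To get a feel for what "encoded" means here, the round trip below compresses and decompresses a small set of page instructions with Python's zlib module. It assumes the stream uses Flate (zlib) compression, which the text above does not state, and the instruction bytes are invented for the example rather than taken from my file.

```python
import zlib

# Invented content: begin text, select font F1 at 24 points, move to (72, 720),
# show the string, end text.
content = b"BT /F1 24 Tf 72 720 Td (Hello) Tj ET"

encoded = zlib.compress(content)    # what would sit between stream and endstream
decoded = zlib.decompress(encoded)  # what the reading application works with

# The decoded instructions are consumed in order, one token after another.
tokens = decoded.split()
```

The encoded bytes look like noise in a text editor, but after decoding they are an ordinary sequence of operands and operators.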
Before proceeding further we will try to create a simple PDF file from what we have learnt so far. You can copy this file from here and save it in a text editor like notepad. Save it with a filename but with a file extension "pdf". In notepad, you will have to save as "filename.
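If you would rather generate such a file than type it, the sketch below builds a one-page PDF in Python and computes the cross-reference offsets automatically. It is not the file from this article: the object numbering, page size, font and function name build_pdf are all assumptions chosen to keep the example short, and the text must not contain parentheses or backslashes because no escaping is done.

```python
def build_pdf(text: str) -> bytes:
    """Build a minimal one-page PDF that shows a single line of text."""
    content = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode("ascii")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"          # object 0 heads the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)

pdf_bytes = build_pdf("Hello")
```

Writing pdf_bytes to a file opened in binary mode, for example open("hello.pdf", "wb"), gives a file that can be opened in a PDF viewer. Computing the offsets in code avoids the most common mistake when writing a PDF by hand, which is a cross-reference table that no longer matches the body after an edit.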
Note: Not all PDF files are as simple as this. This PDF file is very basic and just displays a single line of text. I love your feedback and suggestions. Please leave a comment below or contact me at steve printmyfolders.
In the example above, we can see that we have four subsections (note the four lines that contain only two numbers). The first number in those lines is the object number of the first object in the subsection, while the second number states the number of objects in the subsection.
Each object is represented by one entry, which is 20 bytes long including the CRLF. The last object in the cross-reference table uses the generation number 0. The second subsection has an object ID 3 and contains one element, the object 3 that starts at an offset bytes from the beginning of the document. The third subsection has four objects, the first of which has an ID 21 and starts at an offset from the beginning of the file.
Other objects have the subsequent numbers 22, 23 and These objects contain a reference to the next free object and the generation number to be used if the object becomes valid again. Note that the object zero points to the next free object in the table, object Since object 23 is also free, it itself points to the next free object in the table, object If we represent the above cross-reference table with every object number, it would look as follows:.
Multiple subsections are usually present in PDF documents that have been incrementally updated, otherwise only one subsection starting with the number zero should be present.
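The subsection layout described above is simple enough to parse by hand. The following Python sketch is an illustration only: the function name parse_xref is made up, and the sample table uses invented offsets rather than the values from the example document.

```python
def parse_xref(section: bytes) -> dict:
    """Map object number -> (offset or next free object, generation, in_use)."""
    lines = section.splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("section does not start with the xref keyword")
    entries = {}
    i = 1
    while i < len(lines):
        header = lines[i].split()
        if len(header) != 2:
            break  # anything else (for example the trailer keyword) ends the table
        first, count = int(header[0]), int(header[1])
        for k in range(count):
            value, generation, flag = lines[i + 1 + k].split()
            # For an 'n' entry the first field is a byte offset; for an 'f'
            # entry it is the number of the next free object.
            entries[first + k] = (int(value), int(generation), flag == b"n")
        i += 1 + count
    return entries

# Invented table with three subsections.
sample = (
    b"xref\n"
    b"0 1\n"
    b"0000000000 65535 f \n"
    b"3 1\n"
    b"0000025325 00000 n \n"
    b"21 2\n"
    b"0000025518 00002 n \n"
    b"0000000000 00001 f \n"
    b"trailer\n"
)
table = parse_xref(sample)
```

Each subsection header supplies the first object number, and the entries that follow are numbered consecutively from it, which is why the entries themselves never need to store an object number.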
The PDF trailer specifies how the application reading the PDF document should find the cross-reference table and other special objects. Before the end of the file tag, there is a line with a startxref string that specifies the offset from beginning of the file to the cross-reference table.
In our case the cross-reference table starts at offset bytes. Before that is a trailer string that specifies the start of the Trailer section. We can see that the trailer section defines several keys, each of them for a particular action. The trailer section can specify the following keys:.
We must remember that the initial structure can be modified if we update the PDF document at a later time. The update usually appends additional elements to the end of the file. The PDF has been designed with incremental updates in mind, since we can append some objects to the end of the PDF file without rewriting the entire file. Because of this, changes to a PDF document can be saved quickly. The new structure of the PDF document can be seen in the picture below:.
Figure 3: PDF structure. We can see that the PDF document still contains the original header, body, cross-reference table and the trailer. Additionally, there are also other body, cross-reference and trailer sections that were added to the PDF document.
The additional cross-reference sections will contain only the entries for objects that have been changed, replaced or deleted. In PDF versions 1. Upon opening this PDF document it looks as shown below:. Figure 4: PDF document sample. Figure 5: Cross-reference and trailer sections. The cross-reference section has been reduced for clarity.
The cross-reference section contains one subsection that itself contains objects. The trailer section starts at byte offset , includes objects where the root element points to object and the info element points to object The PDF document contains eight basic types of objects described below.
These types are: booleans, numbers, strings, names, arrays, dictionaries, streams and the null object. Objects may be labeled so that they can be referenced by other objects.
A labeled object is also called an indirect object. There are two keywords, true and false, that represent the boolean values. There are two types of numbers in a PDF document: integer and real. An integer consists of one or more digits optionally preceded by a plus or minus sign. Examples of integer objects are 123, +17, -98 and 0. A real value is written as one or more digits with an optional sign and a leading, trailing or embedded decimal point (a period).
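The integer and real syntax described above maps naturally onto regular expressions. The sketch below is one possible reading of those rules in Python; the pattern and function names are made up for illustration.

```python
import re

# One or more digits with an optional sign.
INTEGER = re.compile(rb"[+-]?\d+")
# Optional sign, then digits with a trailing or embedded period, or a leading period.
REAL = re.compile(rb"[+-]?(\d+\.\d*|\.\d+)")

def classify_number(token: bytes) -> str:
    """Classify a token as 'integer', 'real' or 'not a number'."""
    if INTEGER.fullmatch(token):
        return "integer"
    if REAL.fullmatch(token):
        return "real"
    return "not a number"

kinds = [classify_number(t) for t in (b"123", b"-98", b"34.5", b"-.002", b"4.", b"abc")]
```

Note that a lone period is rejected, because a real number still needs at least one digit.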
Examples of real numbers are 34.5, -3.62, +123.6, 4., -.002 and 0.0. There is a limit on the length of a name, measured in bytes. A name is introduced by a slash; the slash is not part of the name but a prefix indicating that what follows is a sequence of characters representing the name.
If we want to use whitespace or any other special character as part of a name, it must be encoded as a number sign (#) followed by its two-digit hexadecimal code. Figure 6: PDF names source. Strings in a PDF document are represented as a series of bytes surrounded by parentheses or angle brackets, and are subject to a maximum length in bytes. Any character may be written directly as ASCII or, alternatively, in octal or hexadecimal notation. Octal notation requires the character to be written in the form \ddd, where ddd is an octal number.
An example of a string in parentheses is (This is a string). We can also use escape sequences for well-known special characters: \n for new line, \r for carriage return, \t for horizontal tab, \b for backspace, \f for form feed, \( for left parenthesis, \) for right parenthesis and \\ for backslash.
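The escape sequences listed above can be decoded with a small loop. This Python sketch works on the bytes between the outer parentheses and is deliberately incomplete: it does not handle a backslash followed by a line break, and the function name decode_literal is an assumption made for the example.

```python
def decode_literal(body: bytes) -> bytes:
    """Decode backslash escapes in the text between a string's parentheses."""
    simple = {
        ord("n"): 10, ord("r"): 13, ord("t"): 9, ord("b"): 8, ord("f"): 12,
        ord("("): ord("("), ord(")"): ord(")"), ord("\\"): ord("\\"),
    }
    out = bytearray()
    i = 0
    while i < len(body):
        c = body[i]
        if c != 0x5C:                      # not a backslash: copy the byte
            out.append(c)
            i += 1
            continue
        i += 1                             # skip the backslash
        if i >= len(body):
            break
        c = body[i]
        if c in simple:
            out.append(simple[c])
            i += 1
        elif 0x30 <= c <= 0x37:            # octal escape: one to three digits
            j = i
            while j < len(body) and j < i + 3 and 0x30 <= body[j] <= 0x37:
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        else:
            out.append(c)                  # unknown escape: drop the backslash
            i += 1
    return bytes(out)

decoded = decode_literal(rb"Line1\nLine2 \(in brackets\) \101\102")
```

The octal branch stops after three digits, so a sequence such as \1012 decodes to the letter A followed by the digit 2.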
Arrays in PDF documents are represented as a sequence of PDF objects enclosed in square brackets. The elements may be of different types, so an array can hold any object type, such as numbers, strings, dictionaries and even other arrays. An array may also have zero elements.
An example of an array is [549 3.14 false (Ralph) /SomeName]. A dictionary is a table of key/value pairs enclosed in double angle brackets (<< and >>). The key must be a name object, whereas the value can be any object, including another dictionary. There is also a limit on the number of entries in a dictionary.
A stream object is represented by a sequence of bytes and may be unlimited in length, which is why images and other big data blocks are usually represented as streams. A stream object consists of a dictionary followed by the keyword stream, a newline, the stream data and the keyword endstream. The stream dictionary specifies the exact number of bytes of the stream.
After the data there should be a newline and the endstream keyword. The Length entry is mandatory in every stream dictionary; the other common entries are optional. In an object stream, the stream data begins with N pairs of integers, where the first integer of each pair is the object number and the second integer is the byte offset of that object in the decoded stream, measured from the first object.
The First entry in the dictionary gives the byte offset, in the decoded stream, of the first object in the object stream.
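The N pairs and the First entry are enough to cut an object stream into its objects. The Python sketch below works on already decoded stream data; the function name split_object_stream and the sample bytes are made up, and the offsets in the pairs are taken to be relative to the position given by First.

```python
def split_object_stream(decoded: bytes, n: int, first: int) -> dict:
    """Return {object number: object bytes} for a decoded object stream."""
    numbers = [int(x) for x in decoded[:first].split()[:2 * n]]
    pairs = list(zip(numbers[0::2], numbers[1::2]))  # (object number, offset)
    objects = {}
    for index, (object_number, offset) in enumerate(pairs):
        start = first + offset
        # An object ends where the next one starts; the last runs to the end.
        if index + 1 < len(pairs):
            end = first + pairs[index + 1][1]
        else:
            end = len(decoded)
        objects[object_number] = decoded[start:end].strip()
    return objects

# Invented stream holding two objects, numbered 11 and 12.
header = b"11 0 12 18 "
payload = b"<< /Type /Font >> [1 2 3]"
decoded = header + payload

objects = split_object_stream(decoded, n=2, first=len(header))
```

Note that the objects inside the stream carry no obj and endobj keywords; their numbers come only from the pairs at the start of the stream.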