Vanilla.PDF  1.5.1
Cross-platform toolkit for creating and modifying PDF documents
Architecture

The library is written in standard C++ (currently C++14) and can be compiled with Visual Studio (2015 and 2017) as well as GCC (tested on Ubuntu 16.04).

The build is driven by CMake (https://cmake.org/), a cross-platform build tool. CMake also integrates a packaging system that produces one-click installable packages for each platform.

Currently supported package systems:

The library provides only an ANSI C API. The reason why I did not expose a native C++ interface is that the C++ ABI is incompatible between compilers. Functions across the interface use the standard C cdecl calling convention, in which the caller cleans up the stack.
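The following is a minimal sketch of what a C-only boundary with an opaque handle can look like. All names in it (ExampleBufferHandle, the Example_Buffer_* functions and the EXAMPLE_* codes) are assumptions made for illustration and are not the library's actual API. The calling-convention macro is left out because it is compiler specific.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <new>
#include <vector>

// Internal C++ type; it never appears in the public header.
struct Buffer {
    std::vector<char> bytes;
};

extern "C" {

// Opaque handle: the caller only sees a pointer to an incomplete type.
typedef struct ExampleBufferHandleTag ExampleBufferHandle;

// Hypothetical error codes for this sketch.
enum { EXAMPLE_OK = 0, EXAMPLE_BAD_PARAMETER = 1, EXAMPLE_NO_MEMORY = 2 };

uint32_t Example_Buffer_Create(ExampleBufferHandle** result) {
    if (result == nullptr) return EXAMPLE_BAD_PARAMETER;
    // nothrow keeps std::bad_alloc from crossing the C boundary.
    Buffer* buffer = new (std::nothrow) Buffer();
    if (buffer == nullptr) return EXAMPLE_NO_MEMORY;
    *result = reinterpret_cast<ExampleBufferHandle*>(buffer);
    return EXAMPLE_OK;
}

uint32_t Example_Buffer_GetSize(ExampleBufferHandle* handle, size_t* size) {
    if (handle == nullptr || size == nullptr) return EXAMPLE_BAD_PARAMETER;
    *size = reinterpret_cast<Buffer*>(handle)->bytes.size();
    return EXAMPLE_OK;
}

uint32_t Example_Buffer_Release(ExampleBufferHandle* handle) {
    delete reinterpret_cast<Buffer*>(handle);
    return EXAMPLE_OK;
}

} // extern "C"
```

Because only C types cross the boundary, a caller built with a different compiler never depends on the layout of Buffer or on C++ name mangling.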

Error handling

The library uses C++ exceptions internally. Each interface function is wrapped in a try-catch block so that no exception can escape across the C boundary and potentially crash the application.

This is an example of how interface functions usually look:

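// The function name, parameter types and error constants are illustrative.
error_type Buffer_SetData(BufferHandle* handle, const char* data, size_type size)
{
    Buffer* obj = reinterpret_cast<Buffer*>(handle);
    if (obj == nullptr) {
        return ERROR_PARAMETER_VALUE; // invalid handle
    }
    if (data == nullptr) {
        return ERROR_PARAMETER_VALUE; // invalid data pointer
    }
    try
    {
        obj->assign(data, data + size);
        return ERROR_SUCCESS;
    } catch (const std::exception& e) {
        // Store the error message
        return ERROR_GENERAL;
    } catch (...) {
        return ERROR_GENERAL;
    }
}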
Note
The wrapping try-catch should have negligible performance impact on most compilers.

Error codes and messages

All exceptions thrown in this way are caught, and their messages are stored in a thread-local buffer. Each thread has its own buffer, and its size is pre-allocated so that an error can still be reported when memory is short.

The following code snippet declares the variables that carry the error information:

thread_local uint32_t m_error;
thread_local size_type m_message_length;
thread_local char m_message[constant::MAX_MESSAGE_SIZE];
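As a rough illustration of how such a pre-allocated buffer might be filled and read back, here is a minimal sketch. The namespace, the g_ variable names, the capacity of 256 and the accessor functions are assumptions made for this example, not the library's actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

namespace example {

constexpr std::size_t MAX_MESSAGE_SIZE = 256; // assumed capacity

thread_local std::uint32_t g_error = 0;
thread_local std::size_t g_message_length = 0;
thread_local char g_message[MAX_MESSAGE_SIZE] = {};

// Copies the message into the fixed buffer, truncating if needed.
// No allocation happens here, so it still works when memory is exhausted.
inline void SetLastError(std::uint32_t code, const char* message) noexcept {
    g_error = code;
    std::size_t length = (message != nullptr) ? std::strlen(message) : 0;
    if (length > MAX_MESSAGE_SIZE - 1) {
        length = MAX_MESSAGE_SIZE - 1; // leave room for the terminator
    }
    if (length > 0) {
        std::memcpy(g_message, message, length);
    }
    g_message[length] = '\0';
    g_message_length = length;
}

inline std::uint32_t LastErrorCode() noexcept { return g_error; }
inline const char* LastErrorMessage() noexcept { return g_message; }

} // namespace example
```

A catch block in an interface function would call something like example::SetLastError(code, e.what()) before returning the error code, and the caller would read the message back on the same thread.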

Object ownership

All handles are basically opaque pointers to internal structures. The library uses a so-called intrusive pointer reference counting mechanism. Usually, the structure and its reference counter are two separate objects. In this case, the reference counter is embedded inside the structure body.
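To make the idea concrete, here is a minimal sketch of intrusive reference counting. RefCounted, IntrusivePtr and ExampleBuffer are hypothetical names chosen for this example; the library's own classes may be organized differently.

```cpp
#include <atomic>
#include <cstddef>
#include <utility>

// Base class that embeds the counter in the object itself.
class RefCounted {
public:
    void AddRef() noexcept { m_refs.fetch_add(1, std::memory_order_relaxed); }
    void Release() noexcept {
        // The last owner destroys the object.
        if (m_refs.fetch_sub(1, std::memory_order_acq_rel) == 1) delete this;
    }
    std::size_t UseCount() const noexcept { return m_refs.load(std::memory_order_relaxed); }

protected:
    RefCounted() = default;
    virtual ~RefCounted() = default;

private:
    std::atomic<std::size_t> m_refs{0};
};

// Smart pointer that only stores the raw pointer; the count lives in T.
template <typename T>
class IntrusivePtr {
public:
    IntrusivePtr() = default;
    explicit IntrusivePtr(T* p) : m_ptr(p) { if (m_ptr) m_ptr->AddRef(); }
    IntrusivePtr(const IntrusivePtr& other) : m_ptr(other.m_ptr) { if (m_ptr) m_ptr->AddRef(); }
    IntrusivePtr(IntrusivePtr&& other) noexcept : m_ptr(other.m_ptr) { other.m_ptr = nullptr; }
    IntrusivePtr& operator=(IntrusivePtr other) noexcept {
        std::swap(m_ptr, other.m_ptr);
        return *this;
    }
    ~IntrusivePtr() { if (m_ptr) m_ptr->Release(); }

    T* get() const noexcept { return m_ptr; }
    T* operator->() const noexcept { return m_ptr; }

private:
    T* m_ptr = nullptr;
};

// Hypothetical internal structure that carries its own counter.
struct ExampleBuffer : RefCounted {
    int size = 0;
};
```

Since the counter travels with the object, a raw pointer that has been handed out as a handle can be wrapped in an IntrusivePtr again at any time and still shares the one and only counter.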

Intrusive vs Shared

Let's compare intrusive pointers with the traditional C++ shared pointers.

Transferring an object handle across the library boundary is clearer.

Buffer* buffer = new Buffer();
*result = reinterpret_cast<BufferHandle*>(buffer);
...
Buffer* buffer = reinterpret_cast<Buffer*>(handle);

Intrusive pointers guarantee that an object never has more than one reference counter.

Intrusive pointers should perform better (in some cases) than traditional C++ shared pointers. The main reason is that a shared pointer keeps its reference counter in a separate control block, so updating the count goes through a second pointer, while an intrusive counter sits inside the object itself. The other reason is that the whole object, counter included, lives in a single allocation, while a shared pointer's object and control block are often allocated separately.

Note
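A shared pointer can be created with std::make_shared, which typically places the object and its reference counter in a single allocation. In addition, to ensure (not guarantee) that there is only a single reference counter object, the objects may be derived from std::enable_shared_from_this.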
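For comparison, the standard-library counterpart mentioned in the note can be sketched as follows. Node and SharedOwners are hypothetical names used only for this example.

```cpp
#include <memory>

// Deriving from enable_shared_from_this lets an object hand out new
// shared pointers that join its existing control block.
struct Node : std::enable_shared_from_this<Node> {
    std::shared_ptr<Node> Self() { return shared_from_this(); }
};

inline long SharedOwners() {
    auto first = std::make_shared<Node>(); // creates the object together with its control block
    auto second = first->Self();           // second owner, same control block
    return first.use_count();              // both owners are counted here
}
```

This only helps as long as every owner goes through shared_from_this. Nothing stops code from wrapping the same raw pointer in a second, unrelated std::shared_ptr, which is why the note says "ensure" rather than "guarantee".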

File layer

The file layer allows access to the file contents at the syntactic level. It also has the few semantic features that are required for parsing the syntax.

For example, an IndirectReferenceObjectHandle often has to be resolved before an object can be read. A StreamObjectHandle often has its Length stored as an indirect object, and that Length has to be resolved before the stream itself can be parsed and validated.
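The Length case can be sketched as below. The types ObjectId, LengthEntry and IntegerObjects and the function ResolveLength are hypothetical stand-ins invented for this example; they only illustrate why the parser needs a lookup step before it knows how many stream bytes to read.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <utility>

// (object number, generation number) identifies an indirect object.
using ObjectId = std::pair<std::uint32_t, std::uint16_t>;

// Hypothetical stand-in for the value of a stream's Length entry.
struct LengthEntry {
    bool is_reference;   // true when Length is written as "N G R"
    std::int64_t value;  // used when is_reference is false
    ObjectId target;     // used when is_reference is true
};

// Hypothetical table of integer objects already located in the file.
using IntegerObjects = std::map<ObjectId, std::int64_t>;

// Returns the number of stream bytes, following the reference if needed.
std::int64_t ResolveLength(const LengthEntry& entry, const IntegerObjects& objects) {
    if (!entry.is_reference) {
        return entry.value;
    }
    auto found = objects.find(entry.target);
    if (found == objects.end()) {
        throw std::runtime_error("Length refers to an unknown object");
    }
    return found->second;
}
```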

IO Streams

The library uses C++ I/O streams for reading source files and writing output files. Interfaces that represent these streams already exist and are used throughout the library interface.

Note
These interfaces could be made overridable in the future, so that the user can provide a custom implementation for reading the source file. This is often helpful when interacting with other applications that need to share file access.
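One possible shape for such a stream interface is sketched below. IInputStream and StdInputStream are hypothetical names, and the Get/Peek pair is modeled on the calls used in the tokenizer listing; the actual interfaces may expose more operations.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Abstract input stream; a custom implementation could read from memory,
// a network source, or a file shared with another application.
class IInputStream {
public:
    virtual ~IInputStream() = default;
    virtual int Get() = 0;  // consumes and returns the next byte, or -1 at the end
    virtual int Peek() = 0; // returns the next byte without consuming it, or -1
};

// Default implementation backed by a standard C++ stream.
class StdInputStream : public IInputStream {
public:
    explicit StdInputStream(std::istream& stream) : m_stream(stream) {}

    int Get() override { return Normalize(m_stream.get()); }
    int Peek() override { return Normalize(m_stream.peek()); }

private:
    static int Normalize(std::istream::int_type value) {
        return value == std::char_traits<char>::eof() ? -1 : static_cast<int>(value);
    }

    std::istream& m_stream;
};
```

Wrapping a std::istringstream or std::ifstream in StdInputStream is enough to feed code that is written against IInputStream only.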

Tokenizer

Tokens are the smallest syntactic elements and are separated by whitespace or a delimiter. Which characters count as whitespace and which as delimiters is discussed in section 7.2 - Lexical Conventions of the PDF specification.

Note
PDF supports comments, but they are currently ignored. They might be persisted in the future.

The tokenizer uses look-ahead to determine the proper token type, since some tokens are ambiguous from their first character. For example, a hexadecimal string is enclosed in single angle brackets "<" and ">", while a dictionary is enclosed in double angle brackets "<<" and ">>".

Sample parsing loop for a hexadecimal string:

int ch = m_stream->Get();
if (ch == Delimiter::LESS_THAN_SIGN) {
    // One character of look-ahead separates "<<" from "<"
    int ahead = m_stream->Peek();
    if (ahead == Delimiter::LESS_THAN_SIGN) {
        return Token::Type::DICTIONARY_BEGIN;
    }
    for (;;) {
        int hex_char = m_stream->Get();
        if (hex_char == Delimiter::GREATER_THAN_SIGN) {
            break;
        }
        if (IsNumeric(hex_char) || IsAlpha(hex_char)) {
            continue;
        }
        // Found unknown character - terminate
        break;
    }
    return Token::Type::HEXADECIMAL_STRING;
}
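The same look-ahead decision can be tried out in isolation with the sketch below. It reads from a plain std::istream instead of the library's stream class, and TokenType, Token and ReadAngleToken are hypothetical names used only here. Unlike the listing above, it collects the digits, skips whitespace inside the string and reports malformed input with an exception.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

enum class TokenType { DictionaryBegin, HexadecimalString };

struct Token {
    TokenType type;
    std::string text;
};

// Reads a token that starts with '<' and decides between "<<" and "<...>".
Token ReadAngleToken(std::istream& in) {
    if (in.get() != '<') {
        throw std::runtime_error("expected '<'");
    }
    if (in.peek() == '<') {
        in.get(); // consume the second '<'
        return {TokenType::DictionaryBegin, "<<"};
    }
    std::string digits;
    for (;;) {
        int c = in.get();
        if (c == std::char_traits<char>::eof()) {
            throw std::runtime_error("unterminated hexadecimal string");
        }
        if (c == '>') {
            break;
        }
        if (std::isspace(c)) {
            continue; // simplification: std::isspace stands in for PDF whitespace
        }
        if (!std::isxdigit(c)) {
            throw std::runtime_error("invalid character in hexadecimal string");
        }
        digits.push_back(static_cast<char>(c));
    }
    return {TokenType::HexadecimalString, digits};
}
```

Feeding it a std::istringstream holding "<48 65 6C>" yields a hexadecimal string token with the text "48656C", while "<</Type" yields the dictionary-begin token and leaves the stream positioned at "/Type".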

Parser

Tokens are passed to the parser, which is responsible for constructing objects. The parser uses look-ahead as well, since multiple tokens may form a single object.

Picture 1: Diagram for parsing indirect object references (indirect_reference_parsing.png)
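Since the diagram itself is not reproduced here, the following sketch shows one plausible way to make that decision in code; it is not a transcription of the diagram. An indirect reference is written as two integers followed by the keyword R (for example "12 0 R"), so after seeing an integer the parser has to look two tokens ahead. The token and result types are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class Kind { Integer, Keyword };

struct Tok {
    Kind kind;
    std::string text;
};

enum class ObjKind { Integer, IndirectReference };

struct Parsed {
    ObjKind kind;
    long number;     // integer value, or the object number of a reference
    long generation; // generation number, only meaningful for references
};

// Parses the object starting at tokens[pos], which must be an integer token,
// and advances pos past every token that was consumed.
Parsed ParseIntegerOrReference(const std::vector<Tok>& tokens, std::size_t& pos) {
    long first = std::stol(tokens.at(pos).text);
    bool is_reference = pos + 2 < tokens.size()
        && tokens[pos + 1].kind == Kind::Integer
        && tokens[pos + 2].kind == Kind::Keyword
        && tokens[pos + 2].text == "R";
    if (is_reference) {
        long generation = std::stol(tokens[pos + 1].text);
        pos += 3; // object number, generation number, "R"
        return {ObjKind::IndirectReference, first, generation};
    }
    pos += 1;
    return {ObjKind::Integer, first, 0};
}
```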

Function callbacks

The library provides multiple interfaces that can be overridden by the calling application.

For instance, when signing a document, it is possible to use a classic PKCS#12 file (Personal Information Exchange, described in RFC 7292). Unfortunately, this does not work with smart cards, where the private key is not directly accessible. The user can override SigningKeyHandle and provide the signing implementation outside the library boundary.
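A callback-based signing key could be shaped roughly like the sketch below. The callback signatures, ExampleSigningKey and SignWith are assumptions made for this example and do not describe the real SigningKeyHandle interface. The toy device is a stand-in for a smart card and computes an XOR checksum, not a cryptographic signature.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical callback signatures; a return value of 0 means success.
using SignUpdateFn = std::uint32_t (*)(void* user_data, const std::uint8_t* data, std::size_t size);
using SignFinishFn = std::uint32_t (*)(void* user_data, std::uint8_t* out, std::size_t* out_size);

struct ExampleSigningKey {
    void* user_data;     // passed back to every callback untouched
    SignUpdateFn update; // receives the bytes to be signed
    SignFinishFn finish; // reports the size, then writes the signature
};

// Library side: never sees the private key, only the callbacks.
std::uint32_t SignWith(const ExampleSigningKey& key,
                       const std::vector<std::uint8_t>& bytes,
                       std::vector<std::uint8_t>& signature) {
    std::uint32_t rc = key.update(key.user_data, bytes.data(), bytes.size());
    if (rc != 0) return rc;
    std::size_t size = 0;
    rc = key.finish(key.user_data, nullptr, &size); // first call: query the size
    if (rc != 0) return rc;
    signature.resize(size);
    return key.finish(key.user_data, signature.data(), &size);
}

// Application side: toy stand-in for a device that keeps its key private.
struct ToyDevice {
    std::uint8_t checksum = 0;
};

std::uint32_t ToyUpdate(void* user_data, const std::uint8_t* data, std::size_t size) {
    auto* device = static_cast<ToyDevice*>(user_data);
    for (std::size_t i = 0; i < size; ++i) {
        device->checksum ^= data[i];
    }
    return 0;
}

std::uint32_t ToyFinish(void* user_data, std::uint8_t* out, std::size_t* out_size) {
    *out_size = 1;
    if (out != nullptr) {
        out[0] = static_cast<ToyDevice*>(user_data)->checksum;
    }
    return 0;
}
```

The application fills ExampleSigningKey with a pointer to its own ToyDevice plus the two functions and hands it to SignWith; the user_data pointer is the only state that crosses the boundary.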

More extendable interfaces:

Dependencies

The library also depends on the following libraries, which require runtime support:

Internal dependency without runtime support: