 .. _Readers:

 Developing lld Readers
 ======================

 Note: this document discusses the Mach-O port of LLD. For ELF and COFF,
 see :doc:`index`.

 Introduction
 ------------

 The purpose of a "Reader" is to take an object file in a particular format
 and create an `lld::File`:cpp:class: (which is a graph of Atoms)
 representing the object file.  A Reader inherits from
 `lld::Reader`:cpp:class: which lives in
 :file:`include/lld/Core/Reader.h` and
 :file:`lib/Core/Reader.cpp`.

 The Reader infrastructure for an object format ``Foo`` requires the
 following pieces in order to fit into lld:

 :file:`include/lld/ReaderWriter/ReaderFoo.h`

    .. cpp:class:: ReaderOptionsFoo : public ReaderOptions

       This Options class is the only way to configure how the Reader will
       parse any file into an `lld::File`:cpp:class: object.  This class
       should be declared in the `lld`:cpp:class: namespace.

    .. cpp:function:: Reader *createReaderFoo(ReaderOptionsFoo &reader)

       This factory function configures and creates the Reader. This function
       should be declared in the `lld`:cpp:class: namespace.

 :file:`lib/ReaderWriter/Foo/ReaderFoo.cpp`

    .. cpp:class:: ReaderFoo : public Reader

       This is the concrete Reader class which can be called to parse
       object files. It should be declared in an anonymous namespace or,
       if it shares code with the `lld::WriterFoo`:cpp:class:, in a
       nested namespace (e.g. `lld::foo`:cpp:class:).

 You may have noticed that :cpp:class:`ReaderFoo` is not declared in the
 ``.h`` file. An important design aspect of lld is that all Readers are
 created *only* through an object-format-specific
 :cpp:func:`createReaderFoo` factory function. The creation of the Reader is
 parametrized through a :cpp:class:`ReaderOptionsFoo` class. This options
 class is the one-and-only way to control how the Reader operates when
 parsing an input file into an Atom graph. For instance, you may want the
 Reader to only accept certain architectures. The options class can be
 instantiated from command line options or be programmatically configured.
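
 The split between header and implementation can be sketched as follows.
 This is a minimal, self-contained illustration and not the actual lld
 sources: the ``Reader`` and ``ReaderOptions`` base classes are reduced to
 stand-ins, ``addArch()``, ``archs()`` and ``acceptsArch()`` are hypothetical
 names invented for the example, and the factory returns a
 ``std::unique_ptr`` (rather than the raw pointer shown above) only to make
 ownership explicit in the sketch. It assumes C++14 or later.

 .. code-block:: c++

    #include <memory>
    #include <set>
    #include <string>
    #include <utility>

    namespace lld {

    // Stand-ins for the real base classes (assumption: heavily simplified).
    class ReaderOptions {
    public:
      virtual ~ReaderOptions() = default;
    };

    class Reader {
    public:
      virtual ~Reader() = default;
      // Hypothetical query, present only so the sketch has observable behavior.
      virtual bool acceptsArch(const std::string &arch) const = 0;
    };

    // ---- What would live in ReaderFoo.h: options and factory only ----
    class ReaderOptionsFoo : public ReaderOptions {
    public:
      void addArch(std::string arch) { archs_.insert(std::move(arch)); }
      const std::set<std::string> &archs() const { return archs_; }

    private:
      std::set<std::string> archs_; // empty means "accept every architecture"
    };

    std::unique_ptr<Reader> createReaderFoo(const ReaderOptionsFoo &options);

    // ---- What would live in ReaderFoo.cpp: the concrete class is hidden ----
    namespace {
    class ReaderFoo : public Reader {
    public:
      explicit ReaderFoo(const ReaderOptionsFoo &options) : options_(options) {}

      bool acceptsArch(const std::string &arch) const override {
        return options_.archs().empty() || options_.archs().count(arch) != 0;
      }

    private:
      ReaderOptionsFoo options_; // copied, so the Reader does not dangle
    };
    } // namespace

    std::unique_ptr<Reader> createReaderFoo(const ReaderOptionsFoo &options) {
      return std::make_unique<ReaderFoo>(options);
    }

    } // namespace lld

 A client only ever sees ``ReaderOptionsFoo`` and ``createReaderFoo()``; it
 fills in the options, calls the factory, and works with the returned
 ``Reader`` through the base-class interface.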

 Where to start
 --------------

 The lld project already has a skeleton of source code for Readers for
 ``ELF``, ``PECOFF``, ``MachO``, and lld's native ``YAML`` graph format.
 If your file format is a variant of one of those, you should modify the
 existing Reader to support your variant. This is done by customizing the Options
 class for the Reader and making appropriate changes to the ``.cpp`` file to
 interpret those options and act accordingly.

 If your object file format is not a variant of any existing Reader, you'll need
 to create a new Reader subclass with the organization described above.

 Readers are factories
 ---------------------

 The linker will usually only instantiate your Reader once.  That one Reader will
 have its ``loadFile()`` method called many times with different input files.
 To support multithreaded linking, the Reader may be parsing multiple input
 files in parallel. Therefore, there should be no parsing state in your Reader
 object.  Any parsing state should be kept in member variables of your File
 subclass or in some temporary object.

 The key function to implement in a reader is::

   virtual error_code loadFile(LinkerInput &input,
                               std::vector<std::unique_ptr<File>> &result);

 It takes an input whose memory buffer contains the contents of the object
 file being read, and fills ``result`` with instantiated lld::File objects,
 each of which is a collection of Atoms. The result is a vector of File
 pointers (instead of simply a single File pointer) because some file formats
 allow multiple object "files" to be encoded in one file system file.

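 The sketch below shows one way to keep a Reader free of parsing state. It
 is not the lld interface: ``File`` is a stand-in, the ``loadFile()``
 signature is simplified to take a path and a ``std::string_view``, and the
 container format (members separated by ``;``, atom names separated by
 ``,``) is invented purely for the example. It assumes C++17.

 .. code-block:: c++

    #include <cstddef>
    #include <memory>
    #include <string>
    #include <string_view>
    #include <system_error>
    #include <utility>
    #include <vector>

    // Stand-in for lld::File: everything learned while parsing lives here.
    struct File {
      std::string path;
      std::vector<std::string_view> atomNames; // views into the input buffer
    };

    class ReaderFoo {
    public:
      // const and no data members: concurrent calls cannot interfere.
      std::error_code loadFile(const std::string &path, std::string_view buffer,
                               std::vector<std::unique_ptr<File>> &result) const {
        if (buffer.empty())
          return std::make_error_code(std::errc::invalid_argument);

        // Temporary parsing state; published to 'result' only at the end.
        std::vector<std::unique_ptr<File>> parsed;
        std::size_t pos = 0;
        while (pos < buffer.size()) {
          std::size_t end = buffer.find(';', pos);
          if (end == std::string_view::npos)
            end = buffer.size();
          std::string_view member = buffer.substr(pos, end - pos);

          auto file = std::make_unique<File>();
          file->path = path;
          std::size_t p = 0;
          while (p < member.size()) {
            std::size_t e = member.find(',', p);
            if (e == std::string_view::npos)
              e = member.size();
            if (e > p)
              file->atomNames.push_back(member.substr(p, e - p));
            p = e + 1;
          }
          parsed.push_back(std::move(file));
          pos = end + 1;
        }

        for (auto &file : parsed)
          result.push_back(std::move(file));
        return std::error_code(); // success
      }
    };

 Because one input can yield several ``File`` objects, the caller passes in
 a vector and inspects both the returned error code and the number of
 entries appended.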

 Memory Ownership
 ----------------

 Atoms are always owned by their File object. During core linking when Atoms
 are coalesced or stripped away, core linking does not delete them.
 Core linking just removes those unused Atoms from its internal list.
 The destructor of a File object is responsible for deleting all Atoms it
 owns, and if ownership of the MemoryBuffer was passed to it, the File
 destructor needs to delete that too.
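
 The ownership rule can be illustrated with a small sketch. The types below
 are stand-ins rather than the lld classes; ``LiveAtoms``, ``strip()`` and
 the ``liveCount`` counter are hypothetical and exist only to make the
 behavior visible. It assumes C++17.

 .. code-block:: c++

    #include <algorithm>
    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    struct Atom {
      explicit Atom(std::string n) : name(std::move(n)) { ++liveCount; }
      ~Atom() { --liveCount; }
      Atom(const Atom &) = delete;
      Atom &operator=(const Atom &) = delete;

      std::string name;
      inline static int liveCount = 0; // illustration only: Atoms alive now
    };

    // The File owns its Atoms (and, in this sketch, its buffer).
    class File {
    public:
      Atom *addAtom(std::string name) {
        atoms_.push_back(std::make_unique<Atom>(std::move(name)));
        return atoms_.back().get();
      }

    private:
      std::string buffer_;                       // stand-in for a MemoryBuffer
      std::vector<std::unique_ptr<Atom>> atoms_; // destroyed with the File
    };

    // Core linking's view: a list of non-owning pointers.
    struct LiveAtoms {
      std::vector<Atom *> atoms;

      // Dropping an Atom from the list does not delete it.
      void strip(Atom *atom) {
        atoms.erase(std::remove(atoms.begin(), atoms.end(), atom), atoms.end());
      }
    };

 Stripping an Atom only shortens the linker's list; the Atom itself stays
 alive until its ``File`` is destroyed.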

 Making Atoms
 ------------

 The internal model of lld is purely Atom based.  But most object files do not
 have an explicit concept of Atoms; instead, most have "sections". The way
 to think of this is that a section is just a list of Atoms with common
 attributes.

 The first step in parsing section-based object files is to cleave each
 section into a list of Atoms. The technique may vary by section type. For
 code sections (e.g. .text), there are usually symbols at the start of each
 function. Those symbol addresses are the points at which the section is
 cleaved into discrete Atoms.  Some file formats (like ELF) also include the
 length of each symbol in the symbol table. Otherwise, the length of each
 Atom is calculated to run to the start of the next symbol or the end of the
 section.
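
 The length calculation described above can be sketched as follows. The
 ``Symbol`` and ``AtomRange`` types and the ``cleaveBySymbols()`` function
 are hypothetical; the sketch assumes every symbol offset lies inside the
 section and does not treat aliases (two symbols at the same offset)
 specially.

 .. code-block:: c++

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    struct Symbol {
      std::string name;
      std::uint64_t offset; // offset of the symbol within its section
      std::uint64_t size;   // 0 when the format does not record a size
    };

    struct AtomRange {
      std::string name;
      std::uint64_t offset;
      std::uint64_t length;
    };

    // Cleave one section into Atom ranges at its symbol offsets.
    std::vector<AtomRange> cleaveBySymbols(std::vector<Symbol> symbols,
                                           std::uint64_t sectionSize) {
      std::stable_sort(symbols.begin(), symbols.end(),
                       [](const Symbol &a, const Symbol &b) {
                         return a.offset < b.offset;
                       });

      std::vector<AtomRange> atoms;
      atoms.reserve(symbols.size());
      for (std::size_t i = 0; i < symbols.size(); ++i) {
        // Without a recorded size, run to the next symbol or the section end.
        std::uint64_t end =
            (i + 1 < symbols.size()) ? symbols[i + 1].offset : sectionSize;
        std::uint64_t length =
            symbols[i].size != 0 ? symbols[i].size : end - symbols[i].offset;
        atoms.push_back({symbols[i].name, symbols[i].offset, length});
      }
      return atoms;
    }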

 Other section types can be implicitly cleaved. For instance, C-string literals
 or unwind info (e.g. .eh_frame) can be cleaved by having the Reader look at
 the content of the section.  It is important to cleave sections into Atoms
 to remove false dependencies. For instance, the .eh_frame section often
 has no symbols, but contains "pointers" to the functions for which it
 has unwind info.  If the .eh_frame section were not cleaved (but left as one
 big Atom), there would always be a reference (from the eh_frame Atom) to
 each function.  So the linker would be unable to coalesce or dead strip
 the function atoms.
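
 For C-string literals, cleaving by content amounts to splitting at each NUL
 terminator. The following is a minimal sketch; ``cleaveCStrings()`` is a
 hypothetical helper, and it assumes C++17 for ``std::string_view``. Each
 returned view includes its terminator and points into the original section
 bytes, so nothing is copied.

 .. code-block:: c++

    #include <cstddef>
    #include <string_view>
    #include <vector>

    // Split a C-string section into one range per NUL-terminated literal.
    std::vector<std::string_view> cleaveCStrings(std::string_view section) {
      std::vector<std::string_view> atoms;
      std::size_t pos = 0;
      while (pos < section.size()) {
        std::size_t nul = section.find('\0', pos);
        // An unterminated tail becomes a final Atom running to the section end.
        std::size_t end =
            (nul == std::string_view::npos) ? section.size() : nul + 1;
        atoms.push_back(section.substr(pos, end - pos));
        pos = end;
      }
      return atoms;
    }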

 The lld Atom model also requires that a reference to an undefined symbol be
 modeled as a Reference to an UndefinedAtom. So the Reader also needs to
 create an UndefinedAtom for each undefined symbol in the object file.

 Once all Atoms have been created, the second step is to create References
 (recall that Atoms are "nodes" and References are "edges"). Most References
 are created by looking at the "relocation records" in the object file. If
 a function contains a call to "malloc", there is usually a relocation record
 specifying the address in the section and the symbol table index. Your
 Reader will need to convert the address to an Atom and offset and the symbol
 table index into a target Atom. If "malloc" is not defined in the object file,
 the target Atom of the Reference will be an UndefinedAtom.
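
 The conversion from relocation records to References can be sketched as
 below. All types and the ``addReferences()`` function are hypothetical
 simplifications: it assumes the defined Atoms of a section are sorted by
 offset, and that a table mapping each symbol table index to its Atom
 (defined or undefined) has already been built.

 .. code-block:: c++

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct Atom;

    struct Reference {
      std::uint64_t offsetInAtom; // where in the source Atom the fixup lives
      const Atom *target;         // the Atom being referred to
    };

    struct Atom {
      std::string name;
      std::uint64_t offset = 0; // offset within the section
      std::uint64_t length = 0;
      bool defined = true;      // false models an UndefinedAtom
      std::vector<Reference> refs;
    };

    struct Relocation {
      std::uint64_t address;   // address within the section
      std::size_t symbolIndex; // index into the symbol table
    };

    // Returns false if a relocation cannot be mapped to a source or target.
    bool addReferences(const std::vector<Atom *> &definedByOffset,
                       const std::vector<Atom *> &symbolToAtom,
                       const std::vector<Relocation> &relocs) {
      for (const Relocation &r : relocs) {
        // First Atom starting after the address; its predecessor contains it.
        auto it = std::upper_bound(
            definedByOffset.begin(), definedByOffset.end(), r.address,
            [](std::uint64_t addr, const Atom *a) { return addr < a->offset; });
        if (it == definedByOffset.begin() ||
            r.symbolIndex >= symbolToAtom.size())
          return false;
        Atom *source = *(it - 1);
        std::uint64_t offsetInAtom = r.address - source->offset;
        if (offsetInAtom >= source->length)
          return false;
        source->refs.push_back({offsetInAtom, symbolToAtom[r.symbolIndex]});
      }
      return true;
    }

 A call to an external function such as ``malloc`` then shows up as a
 Reference whose target has ``defined == false``.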


 Performance
 -----------

 Once you have the above working to parse an object file into Atoms and
 References, you'll want to look at performance.  Some techniques that can
 help performance are:

 * Use llvm::BumpPtrAllocator or pre-allocate one big vector<Reference> and then
   just have each atom point to its subrange of References in that vector.
   This can be faster than allocating each Reference as a separate object.
 * Pre-scan the symbol table to determine how many atoms are in each section,
   then allocate space for all the Atom objects at once.
 * Don't copy symbol names or section content to each Atom. Instead, use
   StringRef and ArrayRef in each Atom to point to its name and content in the
   MemoryBuffer.
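
 The first and third techniques can be sketched together. To stay
 self-contained, this example uses ``std::string_view`` and index ranges as
 stand-ins for ``llvm::StringRef`` and ``llvm::ArrayRef``; the ``File``,
 ``Atom`` and ``addAtom()`` names are hypothetical. It assumes C++17.

 .. code-block:: c++

    #include <cstddef>
    #include <cstdint>
    #include <string_view>
    #include <vector>

    struct Reference {
      std::uint32_t offsetInAtom;
      std::uint32_t targetIndex; // index of the target in File::atoms
    };

    struct Atom {
      std::string_view name;    // points into the input buffer, not a copy
      std::string_view content; // likewise
      std::size_t firstRef = 0; // start of this Atom's subrange in File::refs
      std::size_t refCount = 0;
    };

    struct File {
      std::vector<Atom> atoms;
      std::vector<Reference> refs; // one allocation shared by all Atoms

      const Reference *refsBegin(const Atom &a) const {
        return refs.data() + a.firstRef;
      }
      const Reference *refsEnd(const Atom &a) const {
        return refsBegin(a) + a.refCount;
      }
    };

    // Append an Atom and record where its References sit in the shared vector.
    void addAtom(File &file, std::string_view name, std::string_view content,
                 const std::vector<Reference> &atomRefs) {
      Atom atom{name, content, file.refs.size(), atomRefs.size()};
      file.refs.insert(file.refs.end(), atomRefs.begin(), atomRefs.end());
      file.atoms.push_back(atom);
    }

 If a pre-scan has counted the atoms and relocations, calling ``reserve()``
 on both vectors before the first ``addAtom()`` avoids reallocation while
 the file is being built. Storing indices rather than pointers into
 ``refs`` keeps each Atom's subrange valid even if the vector does grow.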


 Testing
 -------

 We are still working on infrastructure to test Readers. The issue is that
 you don't want to check binary files in to the test suite, and the tools
 for creating your object file from assembly source may not be available on
 every OS.

 We are investigating a way to use YAML to describe the sections, symbols,
 and content of a file, together with some code which will write out an
 object file from that YAML description.

 Once that is in place, you can write test cases that contain section/symbol
 YAML and are run through the linker to produce Atom/Reference based YAML,
 which is then run through FileCheck to verify the Atoms and References are as
 expected.