| Data Formatters |
| =============== |
| |
| This page is an introduction to the design of the LLDB data formatters |
| subsystem. The intended target audience are people interested in understanding |
| or modifying the formatters themselves rather than writing a specific data |
| formatter. For the latter, refer to :doc:`/use/variable/`. |
| |
| This page also highlights some open areas for improvement to the general |
| subsystem, and more evolutions not anticipated here are certainly possible. |
| |
| Overview |
| -------- |
| |
| The LLDB data formatters subsystem is used to allow the debugger as well as the |
| end-users to customize the way their variables look upon inspection in the user |
| interface (be it the command line tool, or one of the several GUIs that are |
| backed by LLDB). |
| |
| To this aim, they are hooked into the ``ValueObjects`` model, in order to |
| provide entry points through which such customization questions can be |
| answered. For example: What format should this number be printed as? How many |
| child elements does this ``std::vector`` have? |
| |
| The architecture of the subsystem is layered, with the highest level layer |
| being the user visible interaction features (e.g. the ``type ***`` commands, |
| the SB classes, ...). Other layers of interest that will be analyzed in this |
| document include: |
| |
| * Classes implementing individual data formatter types |
| * Classes implementing formatters navigation, discovery and categorization |
| * The ``FormatManager`` layer |
| * The ``DataVisualization`` layer |
| * The SWIG <> LLDB communication layer |
| |
| Data Formatter Types |
| -------------------- |
| |
| As described in the user documentation, there are four types of formatters: |
| |
| * Formats |
| * Summaries |
| * Filters |
| * Synthetic children |
| |
| Formatters have descriptor classes, ``Type*Impl``, which contain at least a |
| "Flags" nested object, which contains both rules to be used by the matching |
| algorithm (e.g. should the formatter for type Foo apply to a Foo*?) or rules to |
| be used by the formatter itself (e.g. is this summary a oneliner?). |
| |
| Individual formatter descriptor classes then also contain data items useful to |
| them for performing their functionality. For instance ``TypeFormatImpl`` |
| (backing formats) contains an ``lldb::Format`` that is the format to then be |
| applied were this formatter to be selected. Upon issuing a ``type format add`` |
| a new ``TypeFormatImpl`` is created that wraps the user-specified format, and |
| matching options: |
| |
| :: |
| |
| entry.reset(new TypeFormatImpl( |
| format, TypeFormatImpl::Flags() |
| .SetCascades(m_command_options.m_cascade) |
| .SetSkipPointers(m_command_options.m_skip_pointers) |
| .SetSkipReferences(m_command_options.m_skip_references))); |
| |
| |
| While formats are fairly simple and only implemented by one class, the other |
| formatter types are backed by a class hierarchy. |
| |
| Summaries, for instance, can exist in one of three "flavors": |
| |
| * Summary strings |
| * Python script |
| * Native C++ |
| |
| The base class for summaries, ``TypeSummaryImpl``, is a pure virtual class that |
| wraps, again, the Flags, and exports among others: |
| |
| :: |
| |
| virtual bool FormatObject (ValueObject *valobj, std::string& dest) = 0; |
| |
| |
| This is the core entry point, which allows subclasses to specify their mode of |
| operation. |
| |
| ``StringSummaryFormat``, which is the class that implements summary strings, |
| does a check as to whether the summary is a one-liner, and if not, then uses |
| its stored summary string to call into ``Debugger::FormatPrompt``, and obtain a |
| string back, which it returns in ``dest`` as the resulting summary. |
| |
| For a Python summary, implemented in ``ScriptSummaryFormat``, |
| ``FormatObject()`` calls into the ``ScriptInterpreter`` which is supposed to |
| hold the knowledge on how to bridge back and forth with the scripting language |
| (Python in the case of LLDB) in order to produce a valid string. Implementors |
| of new ``ScriptInterpreters`` for other languages are expected to provide a |
| ``GetScriptedSummary()`` entry point for this purpose, if they desire to allow |
| users to provide formatters in the new language |
| |
| Lastly, C++ summaries (``CXXFunctionSummaryFormat``), wrap a function pointer |
| and call into it to execute their duty. It should be noted that there are no |
| facilities for users to interact with C++ formatters, and as such they are |
| extremely opaque, effectively being a thin wrapper between plain function |
| pointers and the LLDB formatters subsystem. |
| |
| Also, dynamic loading of C++ formatters in LLDB is currently not implemented, |
| and as such it is safe and reasonable for these formatters to deal with |
| internal ``ValueObjects`` instances instead of public ``SBValue`` objects. |
| |
| An interesting data point is that summaries are expected to be stateless. While |
| at the Python layer they are handed an ``SBValue`` (since nothing else could be |
| visible for scripts), it is not expected that the ``SBValue`` should be cached |
| and reused - any and all caching occurs on the LLDB side, completely |
| transparent to the formatter itself. |
| |
| The design of synthetic children is somewhat more intricate, due to them being |
| stateful objects. The core idea of the design is that synthetic children act |
| like a two-tier model, in which there is a backend dataset (the underlying |
| unformatted ``ValueObject``), and an higher level view (frontend) which vends |
| the computed representation. |
| |
| To implement a new type of synthetic children one would implement a subclass of |
| ``SyntheticChildren``, which akin to the ``TypeFormatImpl``, contains Flags for |
| matching, and data items to be used for formatting. For instance, |
| ``TypeFilterImpl`` (which implements filters), stores the list of expression |
| paths of the children to be displayed. |
| |
| Filters are themselves synthetic children. Since all they do is provide child |
| values for a ``ValueObject``, it does not truly matter whether these come from the |
| real set of children or are crafted through some intricate algorithm. As such, |
| they perfectly fit within the realm of synthetic children and are only shown as |
| separate entities for user friendliness (to a user, picking a subset of |
| elements to be shown with relative ease is a valuable task, and they should not |
| be concerned with writing scripts to do so). |
| |
| Once the descriptor of the synthetic children has been coded, in order to hook |
| it up, one has to implement a subclass of ``SyntheticChildrenFrontEnd``. For a |
| given type of synthetic children, there is a deep coupling with the matching |
| front-end class, given that the front-end usually needs data stored in the |
| descriptor (e.g. a filter needs the list of child elements). |
| |
| The front-end answers the interesting questions that are the true raison d'ĂȘtre |
| of synthetic children: |
| |
| :: |
| |
| virtual size_t CalculateNumChildren () = 0; |
| virtual lldb::ValueObjectSP GetChildAtIndex (size_t idx) = 0; |
| virtual size_t GetIndexOfChildWithName (const ConstString &name) = 0; |
| virtual bool Update () = 0; |
| virtual bool MightHaveChildren () = 0; |
| |
| Synthetic children providers (their front-ends) will be queried by LLDB for a |
| number of children, and then for each of them as necessary, they should be |
| prepared to return a ``ValueObject`` describing the child. They might also be |
| asked to provide a name-to-index mapping (e.g. to allow LLDB to resolve queries |
| like ``myFoo.myChild``). |
| |
| ``Update()`` and ``MightHaveChildren()`` are described in the user |
| documentation, and they mostly serve bookkeeping purposes. |
| |
| LLDB provides three kinds of synthetic children: filters, scripted synthetics, |
| and the native C++ providers Filters are implemented by |
| ``TypeFilterImpl::FrontEnd``. |
| |
| Scripted synthetics are implemented by ``ScriptedSyntheticChildren::FrontEnd``, |
| plus a set of callbacks provided by the ``ScriptInterpteter`` infrastructure to |
| allow LLDB to pass the front-end queries down to the scripting languages. |
| |
| As for C++ native synthetics, there is a ``CXXSyntheticChildren``, but no |
| corresponding ``FrontEnd`` class. The reason for this design is that |
| ``CXXSyntheticChildren`` store a callback to a creator function, which is |
| responsible for providing a ``FrontEnd``. Each individual formatter (e.g. |
| ``LibstdcppMapIteratorSyntheticFrontEnd``) is a standalone frontend, and once |
| created retains to relation to its underlying ``SyntheticChildren`` object. |
| |
| On a ``ValueObject`` level, upon being asked to generate synthetic children for |
| a ``ValueObject``, LLDB spawns a ValueObjectSynthetic object which is a |
| subclass of ``ValueObject``. Building upon the ``ValueObject`` infrastructure, |
| it stores a backend, and a shared pointer to the ``SyntheticChildren``. Upon |
| being asked queries about children, it will use the ``SyntheticChildren`` to |
| generate a front-end for itself and will let the front-end answer questions. |
| The reason for not storing the ``FrontEnd`` itself is that there is no |
| guarantee that across updates, the same ``FrontEnd`` will be used over and over |
| (e.g. a ``SyntheticChildren`` object could serve an entire class hierarchy and |
| vend different frontends for different subclasses). |
| |
| Formatters Matching |
| ------------------- |
| |
| The problem of formatters matching is going from "I have a ``ValueObject``" to |
| "these are the formatters to be used for it." |
| |
| There is a rather intricate set of user rules that are involved, and a rather |
| intricate implementation of this model. All of these relate to the type of the |
| ``ValueObject``. It is assumed that types are a strong enough contract that it |
| is possible to format an object entirely depending on its type. If this turns |
| out to not be correct, then the existing model will have to be changed fairly |
| deeply. |
| |
| The basic building block is that formatters can match by exact type name or by |
| regular expressions, i.e. one can describe matching by saying things like "this |
| formatters matches type ``__NSDictionaryI``", or "this formatter matches all |
| type names like ``^std::__1::vector<.+>(( )?&)?$``." |
| |
| This match happens in class ``FormattersContainer``. For exact matches, this |
| goes straight to the ``FormatMap`` (the actual storage area for formatters), |
| whereas for regular expression matches the regular expression is matched |
| against the provided candidate type name. If one were to introduce a new type |
| of matching (say, match against number of ``$`` signs present in the typename, |
| ``FormattersContainer`` is the place where such a change would have to be |
| introduced). |
| |
| It should be noted that this code involves template specialization, and as such |
| is somewhat trickier than other formatters code to update. |
| |
| On top of the string matching mechanism (exact or regex), there are a set of |
| more advanced rules implemented by the ``FormattersContainer``, with the aid of the |
| ``FormattersMatchCandidate``. Namely, it is assumed that any formatter class will |
| have flags to say whether it allows cascading (i.e. seeing through typedefs), |
| allowing pointers-to-object and reference-to-object to be formatted. Upon |
| verifying that a formatter would be a textual match, the Flags are checked, and |
| if they do not allow the formatter to be used (e.g. pointers are not allowed, |
| and one is looking at a Foo*), then the formatter is rejected and the search |
| continues. If the flags also match, then the formatter is returned upstream and |
| the search is over. |
| |
| One relevant fact to notice is that this entire mechanism is not dependent on |
| the kind of formatter to be returned, which makes it easier to devise new types |
| of formatters as the lowest layers of the system. The demands on individual |
| formatters are that they define a few typedefs, and export a Flags object, and |
| then they can be freely matched against types as needed. |
| |
| This mechanism is replicated across a number of categories. A category is a |
| named bucket where formatters are grouped on some basis. The most common reason |
| for a category to exist is a library (e.g. ``libcxx`` formatters vs. ``libstdcpp`` |
| formatters). Categories can be enabled or disabled, and they have a priority |
| number, called position. The priority sets a strong order among enabled |
| categories. A category named "default" is always the highest priority one and |
| it's the category where all formatters that do not ask for a category of their |
| own end up (e.g. ``type summary add ....`` without a ``w somecategory`` flag |
| passed) The algorithm inquires each category, in the order of their priorities, |
| for a formatter for a type, and upon receiving a positive answer from a |
| category, ends the search. Of course, no search occurs in disabled categories. |
| |
| At the individual category level, there is the first dependence on the type of |
| formatter to be returned. Since both filters and synthetic children proper are |
| implemented through the same backing store, the matching code needs to ensure |
| that, were both a synthetic children provider and a filter to match a type, |
| only the most recently added one is actually used. The details of the algorithm |
| used are to be found in ``TypeCategoryImpl::Get()``. |
| |
| It is quite obvious, even to a casual reader, that there are a number of |
| complexities involved in this algorithm. For starters, the entire search |
| process has to be repeated for every variable. Moreover, for each category, one |
| has to repeat the entire process of crawling the types (go to pointee, ...). |
| This is exactly the algorithm initially implemented by LLDB. Over the course of |
| the life of the formatters subsystem, two main evolutions have been made to the |
| matching mechanism: |
| |
| * A caching mechanism |
| * A pregeneration of all possible type matches |
| |
| The cache is a layer that sits between the ``FormatManager`` and the |
| ``TypeCategoryMap``. Upon being asked to figure out a formatter, the ``FormatManager`` |
| will first query the cache layer, and only if that fails, will the categories |
| be queried using the full search algorithm. The result of that full search will |
| then be stored in the cache. Even a negative answer (no formatter) gets stored. |
| The negative answer is actually the most beneficial to cache as obtaining it |
| requires traversing all possible formatters in all categories just to get a |
| no-op back. |
| |
| Of course, once an answer is cached, getting it will be much quicker than going |
| to a full category search, as the cached answers are of the form "type foo" --> |
| "formatter bar". But given how formatters can be edited or removed by the user, |
| either at the command line or via the API, there needs to be a way to |
| invalidate the cache. |
| |
| This happens through the ``FormatManager::Changed()`` method. In general, anything |
| that changes the formatters causes ``FormatManager::Changed()`` to be called |
| through the ``IFormatChangeListener`` interface. This call increases the |
| ``FormatManager``'s revision and clears the cache. The revision number is a |
| monotonically increasing integer counter that essentially corresponds to the |
| number of changes made to the formatters throughout the current LLDB session. |
| This counter is used by ``ValueObjects`` to know when their formatters are out of |
| date. Since a search is a potentially expensive operation, before caching was |
| introduced, individual ``ValueObjects`` remembered which revision of the |
| ``FormatManager`` they used to search for their formatter, and stored it, so that |
| they would not repeat the search unless a change in the formatters had |
| occurred. While caching has made this less critical of an optimization, it is |
| still sensible and thus is kept. |
| |
| Lastly, as a side note, it is worth highlighting that any change in the |
| formatters invalidates the entire cache. It would likely not be impossible to |
| be smarter and figure out a subset of cache entries to be deleted, letting |
| others persist, instead of having to rebuild the entire cache from scratch. |
| However, given that formatters are not that frequently changed during a debug |
| session, and the algorithmic complexity to "get it right" seems larger than the |
| potential benefit to be had from doing it, the full cache invalidation is the |
| chosen policy. The algorithm to selectively invalidate entries is probably one |
| of the major areas for improvements in formatters performance. |
| |
| The second major optimization, introduced fairly recently, is the pregeneration |
| of type matches. The original algorithm was based upon the notion of a |
| ``FormatNavigator`` as a smart object, aware of all the intricacies of the |
| matching rules. For each category, the ``FormatNavigator`` would generate the |
| possible matches (e.g. dynamic type, pointee type, ...), and check each one, |
| one at a time. If that failed for a category, the next one would again generate |
| the same matches. |
| |
| This worked well, but was of course inefficient. The |
| ``FormattersMatchCandidate`` is the solution to this performance issue. In |
| top-of-tree LLDB, the ``FormatManager`` has the centralized notion of the |
| matching rules, and the former ``FormatNavigators`` are now |
| ``FormattersContainers``, whose only job is to guarantee a centralized storage |
| of formatters, and thread-safe access to such storage. |
| |
| ``FormatManager::GetPossibleMatches()`` fills a vector of possible matches. The |
| way it works is by applying each rule, generating the corresponding typename, |
| and storing the typename, plus the required Flags for that rule to be accepted |
| as a match candidate (e.g. if the match comes by fetching the pointee type, a |
| formatter that matches will have to allow pointees as part of its Flags |
| object). The ``TypeCategoryMap``, when tasked with finding a formatter for a |
| type, generates all possible matches and passes them down to each category. In |
| this model, the type system only does its (expensive) job once, and textual or |
| regex matches are the core of the work. |
| |
| FormatManager and DataVisualization |
| ----------------------------------- |
| |
| There are two main entry points in the data formatters: the ``FormatManager`` and |
| the ``DataVisualization``. |
| |
| The ``FormatManager`` is the internal such entry point. In this context, |
| internal refers to data formatters code itself, compared to other parts of |
| LLDB. For other components of the debugger, the ``DataVisualization`` provides |
| a more stable entry point. On the other hand, the ``FormatManager`` is an |
| aggregator of all moving parts, and as such is less stable in the face of |
| refactoring. |
| |
| People involved in the data formatters code itself, however, will most likely |
| have to confront the ``FormatManager`` for significant architecture changes. |
| |
| The ``FormatManager`` wraps a ``TypeCategoryMap`` (the list of all existing |
| categories, enabled and not), the ``FormatCache``, and several utility objects. |
| Plus, it is the repository of named summaries, since these don't logically |
| belong anywhere else. |
| |
| It is also responsible for creating all builtin formatters upon the launch of |
| LLDB. It does so through a bunch of methods ``Load***Formatters()``, invoked as |
| part of its constructor. The original design of data formatters anticipated |
| that individual libraries would load their formatters as part of their debug |
| information. This work however has largely been left unattended in practice, |
| and as such core system libraries (mostly those for masOS/iOS development as of |
| today) load their formatters in an hardcoded fashion. |
| |
| For performance reasons, the ``FormatManager`` is constructed upon being first |
| required. This happens through the ``DataVisualization`` layer. Upon first |
| being inquired for anything formatters, ``DataVisualization`` calls its own |
| local static function ``GetFormatManager()``, which in turns constructs and |
| returns a local static ``FormatManager``. |
| |
| Unlike most things in LLDB, the lifetime of the ``FormatManager`` is the same |
| as the entire session, rather than a specific ``Debugger`` or ``Target`` |
| instance. This is an area to be improved, but as of now it has not caused |
| enough grief to warrant action. If this work were to be undertaken, one could |
| conceivably devise a per-architecture-triple model, upon the assumption that an |
| OS and CPU combination are a good enough key to decide which formatters apply |
| (e.g. Linux i386 is probably different from masOS x86_64, but two macOS x86_64 |
| targets will probably have the same formatters; of course versioning of the |
| underlying OS is also to be considered, but experience with OSX has shown that |
| formatters can take care of that internally in most cases of interest). |
| |
| The public entry point is the ``DataVisualization`` layer. |
| ``DataVisualization`` is a static class on which questions can be asked in a |
| relatively refactoring-safe manner. |
| |
| The main question asked of it is to obtain formatters for ``ValueObjects`` (or |
| typenames). One can also query ``DataVisualization`` for named summaries or |
| individual categories, but of course those queries delve deeper in the internal |
| object model. |
| |
| As said, the ``FormatManager`` holds a notion of revision number, which changes |
| every time formatters are edited (added, deleted, categories enabled or |
| disabled, ...). Through ``DataVisualization::ForceUpdate()`` one can cause the |
| same effects of a formatters edit to happen without it actually having |
| happened. |
| |
| The main reason for this feature is that formatters can be dynamically created |
| in Python, and one can then enter the ``ScriptInterpreter`` and edit the |
| formatter function or class. If formatters were not updated, one could find |
| them to be out of sync with the new definitions of these objects. To avoid the |
| issue, whenever the user exits the scripting mode, formatters force an update |
| to make sure new potential definitions are reloaded on demand. |
| |
| The SWIG Layer |
| -------------- |
| |
| In order to implement formatters written in Python, LLDB requires that |
| ``ScriptInterpreter`` implementations provide a set of functions that one can call |
| to ask formatting questions of scripts. |
| |
| For instance, in order to obtain a scripting summary, LLDB calls: |
| |
| :: |
| |
| virtual bool |
| GetScriptedSummary(const char *function_name, llldb::ValueObjectSP valobj, |
| lldb::ScriptInterpreterObjectSP &callee_wrapper_sp, |
| std::string &retval) |
| |
| |
| For Python, this function is implemented by first checking if the |
| ``callee_wrapper_sp`` is valid. If so, LLDB knows that it does not need to |
| search a function with the passed name, and can directly call the wrapped |
| Python function object. Either way, the call is routed to a global callback |
| ``g_swig_typescript_callback``. |
| |
| This callback pointer points to ``LLDBSwigPythonCallTypeScript``. The details |
| of the implementation require familiarity with the Python C API, plus a few |
| utility objects defined by LLDB to ease the burden of dealing with the |
| scripting world. However, as a sketch of what happens, the code tries to find a |
| Python function object with the given name (i.e. if you say ``type summary add |
| -F module.function`` LLDB will scan for the ``module`` module, and then for a |
| function named ``function`` inside the module's namespace). If the function |
| object is found, it is wrapped in a ``PyCallable``, which is an LLDB utility class |
| that wraps the callable and allows for easier calling. The callable gets |
| invoked, and the return value, if any, is cast into a string. Originally, if a |
| non-string object was returned, LLDB would refuse to use it. This disallowed |
| such simple construct as: |
| |
| :: |
| |
| def getSummary(value,*args): |
| return 1 |
| |
| Similar considerations apply to other formatter (and non-formatter related) |
| scripting callbacks. |