| # Data flow analysis: an informal introduction | 
 |  | 
 | ## Abstract | 
 |  | 
 | This document introduces data flow analysis in an informal way. The goal is to | 
 | give the reader an intuitive understanding of how it works, and show how it | 
 | applies to a range of refactoring and bug finding problems. | 
 |  | 
 | Data flow analysis is a well-established technique; it is described in many | 
papers, books, and videos. If you would like a more formal or more thorough
explanation of the concepts mentioned in this document, please refer to the
following resources:
 |  | 
 | *   [The Lattice article in Wikipedia](https://en.wikipedia.org/wiki/Lattice_\(order\)). | 
 | *   Videos on the PacketPrep YouTube channel that introduce lattices and the | 
 |     necessary background information: | 
 |     [#20](https://www.youtube.com/watch?v=73j_FXBXGm8), | 
 |     [#21](https://www.youtube.com/watch?v=b5sDjo9tfE8), | 
 |     [#22](https://www.youtube.com/watch?v=saOG7Uooeho), | 
 |     [#23](https://www.youtube.com/watch?v=3EAYX-wZH0g), | 
 |     [#24](https://www.youtube.com/watch?v=KRkHwQtW6Cc), | 
 |     [#25](https://www.youtube.com/watch?v=7Gwzsc4rAgw). | 
 | *   [Introduction to Dataflow Analysis](https://www.youtube.com/watch?v=OROXJ9-wUQE) | 
 | *   [Introduction to abstract interpretation](http://www.cs.tau.ac.il/~msagiv/courses/asv/absint-1.pdf). | 
 | *   [Introduction to symbolic execution](https://www.cs.umd.edu/~mwh/se-tutorial/symbolic-exec.pdf). | 
 | *   [Static Program Analysis by Anders Møller and Michael I. Schwartzbach](https://cs.au.dk/~amoeller/spa/). | 
 | *   [EXE: automatically generating inputs of death](https://css.csail.mit.edu/6.858/2020/readings/exe.pdf) | 
 |     (a paper that successfully applies symbolic execution to real-world | 
 |     software). | 
 |  | 
 | ## Data flow analysis | 
 |  | 
 | ### The purpose of data flow analysis | 
 |  | 
 | Data flow analysis is a static analysis technique that proves facts about a | 
 | program or its fragment. It can make conclusions about all paths through the | 
 | program, while taking control flow into account and scaling to large programs. | 
 | The basic idea is propagating facts about the program through the edges of the | 
 | control flow graph (CFG) until a fixpoint is reached. | 
 |  | 
 | ### Sample problem and an ad-hoc solution | 
 |  | 
We would like to explain data flow analysis by working through an example.
Let's imagine that we want to track the possible values of an integer variable
in our program. Here is how a human could annotate the code:
 |  | 
 | ```c++ | 
 | void Example(int n) { | 
 |   int x = 0; | 
 |   // x is {0} | 
 |   if (n > 0) { | 
 |     x = 5; | 
 |     // x is {5} | 
 |   } else { | 
 |     x = 42; | 
 |     // x is {42} | 
 |   } | 
 |   // x is {5; 42} | 
 |   print(x); | 
 | } | 
 | ``` | 
 |  | 
 | We use sets of integers to represent possible values of `x`. Local variables | 
 | have unambiguous values between statements, so we annotate program points | 
 | between statements with sets of possible values. | 
 |  | 
Here is how we arrived at these annotations. Assigning a constant to `x` lets
us conclude that `x` can have only that one value. When control flow from the
"then" and "else" branches joins, `x` can have either value.
 |  | 
 | Abstract algebra provides a nice formalism that models this kind of structure, | 
 | namely, a lattice. A join-semilattice is a partially ordered set, in which every | 
 | two elements have a least upper bound (called a *join*). | 
 |  | 
 | ``` | 
 | join(a, b) ⩾ a   and   join(a, b) ⩾ b   and   join(x, x) = x | 
 | ``` | 
 |  | 
 | For this problem we will use the lattice of subsets of integers, with set | 
 | inclusion relation as ordering and set union as a join. | 
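
As a minimal illustration (the helper names below are ours, not part of any
framework), the ordering and the join for this lattice can be written directly
in terms of standard set operations:

```c++
#include <algorithm>
#include <set>

// Possible values of an integer variable, represented as a set.
using ValueSet = std::set<int>;

// Join is set union: the least set that contains both arguments.
ValueSet Join(const ValueSet &A, const ValueSet &B) {
  ValueSet Result = A;
  Result.insert(B.begin(), B.end());
  return Result;
}

// The partial order is set inclusion: A ⩽ B iff A ⊆ B.
bool LessOrEqual(const ValueSet &A, const ValueSet &B) {
  return std::includes(B.begin(), B.end(), A.begin(), A.end());
}
```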
 |  | 
 | Lattices are often represented visually as Hasse diagrams. Here is a Hasse | 
 | diagram for our lattice that tracks subsets of integers: | 
 |  | 
 |  | 
 |  | 
 | Computing the join in the lattice corresponds to finding the lowest common | 
ancestor (LCA) between two nodes in its Hasse diagram. There is a vast amount
of literature on efficiently implementing LCA queries for a DAG; however,
Efficient Implementation of Lattice Operations (1989)
([CiteSeerX](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.4911),
[doi](https://doi.org/10.1145%2F59287.59293)) describes a scheme that is
particularly well-suited for programmatic implementation.
 |  | 
 | ### Too much information and "top" values | 
 |  | 
 | Let's try to find the possible sets of values of `x` in a function that modifies | 
 | `x` in a loop: | 
 |  | 
 | ```c++ | 
 | void ExampleOfInfiniteSets() { | 
 |   int x = 0; // x is {0} | 
 |   while (condition()) { | 
 |     x += 1;  // x is {0; 1; 2; …} | 
 |   } | 
 |   print(x);  // x is {0; 1; 2; …} | 
 | } | 
 | ``` | 
 |  | 
We have an issue: `x` can have any non-negative value; if the program operated
on mathematical integers, that would be an infinite set. In C++, `int` is
bounded by `INT_MAX`, so technically we have the finite set
`{0; 1; …; INT_MAX}`, which is still too large to track explicitly.
 |  | 
 | To make our analysis practical to compute, we have to limit the amount of | 
 | information that we track. In this case, we can, for example, arbitrarily limit | 
 | the size of sets to 3 elements. If at a certain program point `x` has more than | 
 | 3 possible values, we stop tracking specific values at that program point. | 
 | Instead, we denote possible values of `x` with the symbol `⊤` (pronounced "top" | 
 | according to a convention in abstract algebra). | 
 |  | 
 | ```c++ | 
 | void ExampleOfTopWithALoop() { | 
 |   int x = 0;  // x is {0} | 
 |   while (condition()) { | 
 |     x += 1;   // x is ⊤ | 
 |   } | 
 |   print(x);   // x is ⊤ | 
 | } | 
 | ``` | 
 |  | 
 | The statement "at this program point, `x`'s possible values are `⊤`" is | 
 | understood as "at this program point `x` can have any value because we have too | 
 | much information, or the information is conflicting". | 
 |  | 
 | Note that we can get more than 3 possible values even without a loop: | 
 |  | 
 | ```c++ | 
 | void ExampleOfTopWithoutLoops(int n) { | 
 |   int x = 0;  // x is {0} | 
 |   switch(n) { | 
 |     case 0:  x = 1; break; // x is {1} | 
 |     case 1:  x = 9; break; // x is {9} | 
 |     case 2:  x = 7; break; // x is {7} | 
 |     default: x = 3; break; // x is {3} | 
 |   } | 
 |   // x is ⊤ | 
 | } | 
 | ``` | 
 |  | 
 | ### Uninitialized variables and "bottom" values | 
 |  | 
 | When `x` is declared but not initialized, it has no possible values. We | 
 | represent this fact symbolically as `⊥` (pronounced "bottom"). | 
 |  | 
 | ```c++ | 
 | void ExampleOfBottom() { | 
 |   int x;    // x is ⊥ | 
 |   x = 42;   // x is {42} | 
 |   print(x); | 
 | } | 
 | ``` | 
 |  | 
Note that using values read from uninitialized variables is undefined behavior
in C++. Generally, compilers and static analysis tools can assume that
undefined behavior does not happen. We need to model uninitialized variables
only when implementing a checker that specifically tries to find uninitialized
reads. In this example we model uninitialized variables only to demonstrate the
concept of "bottom", and how it applies to possible-value analysis. We describe
an analysis that finds uninitialized reads in a section below.
 |  | 
 | ### A practical lattice that tracks sets of concrete values | 
 |  | 
 | Taking into account all corner cases covered above, we can put together a | 
 | lattice that we can use in practice to track possible values of integer | 
 | variables. This lattice represents sets of integers with 1, 2, or 3 elements, as | 
 | well as top and bottom. Here is a Hasse diagram for it: | 
 |  | 
 |  | 
 |  | 
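One way to encode such a lattice element in C++ is a small value type that
stores either up to 3 concrete values or an explicit `⊤` marker. Here is a
hedged sketch; the names are invented for this document:

```c++
#include <cstddef>
#include <set>

// A lattice element: either ⊤ ("any value"), or a set of at most kMaxSize
// concrete values. The empty set plays the role of ⊥.
struct PossibleValues {
  static constexpr std::size_t kMaxSize = 3;
  bool IsTop = false;
  std::set<int> Values;  // Meaningful only when !IsTop.

  static PossibleValues Top() { return {/*IsTop=*/true, {}}; }
  static PossibleValues Bottom() { return {/*IsTop=*/false, {}}; }
};

// Join: ⊤ absorbs everything; otherwise take the set union and widen the
// result to ⊤ if it grows beyond kMaxSize elements.
PossibleValues Join(const PossibleValues &A, const PossibleValues &B) {
  if (A.IsTop || B.IsTop)
    return PossibleValues::Top();
  PossibleValues Result = A;
  Result.Values.insert(B.Values.begin(), B.Values.end());
  if (Result.Values.size() > PossibleValues::kMaxSize)
    return PossibleValues::Top();
  return Result;
}
```
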
 | ### Formalization | 
 |  | 
 | Let's consider a slightly more complex example, and think about how we can | 
 | compute the sets of possible values algorithmically. | 
 |  | 
 | ```c++ | 
 | void Example(int n) { | 
 |   int x;          // x is ⊥ | 
 |   if (n > 0) { | 
 |     if (n == 42) { | 
 |        x = 44;    // x is {44} | 
 |     } else { | 
 |        x = 5;     // x is {5} | 
 |     } | 
 |     print(x);     // x is {44; 5} | 
 |   } else { | 
 |     x = n;        // x is ⊤ | 
 |   } | 
 |   print(x);       // x is ⊤ | 
 | } | 
 | ``` | 
 |  | 
 | As humans, we understand the control flow from the program text. We used our | 
 | understanding of control flow to find program points where two flows join. | 
 | Formally, control flow is represented by a CFG (control flow graph): | 
 |  | 
 |  | 
 |  | 
 | We can compute sets of possible values by propagating them through the CFG of | 
 | the function: | 
 |  | 
 | *   When `x` is declared but not initialized, its possible values are `{}`. The | 
 |     empty set plays the role of `⊥` in this lattice. | 
 |  | 
 | *   When `x` is assigned a concrete value, its possible set of values contains | 
 |     just that specific value. | 
 |  | 
 | *   When `x` is assigned some unknown value, it can have any value. We represent | 
 |     this fact as `⊤`. | 
 |  | 
 | *   When two control flow paths join, we compute the set union of incoming | 
    values (limiting the number of elements to 3, representing larger sets as
 |     `⊤`). | 
 |  | 
 | The sets of possible values are influenced by: | 
 |  | 
 | *   Statements, for example, assignments. | 
 |  | 
 | *   Joins in control flow, for example, ones that appear at the end of "if" | 
 |     statements. | 
 |  | 
 | **Effects of statements** are modeled by what is formally known as a transfer | 
 | function. A transfer function takes two arguments: the statement, and the state | 
 | of `x` at the previous program point. It produces the state of `x` at the next | 
 | program point. For example, the transfer function for assignment ignores the | 
 | state at the previous program point: | 
 |  | 
 | ```c++ | 
 | // GIVEN: x is {42; 44} | 
 | x = 0; | 
 | // CONCLUSION: x is {0} | 
 | ``` | 
 |  | 
 | The transfer function for `+` performs arithmetic on every set member: | 
 |  | 
 | ```c++ | 
// GIVEN: x is {42; 44}
x = x + 100;
// CONCLUSION: x is {142; 144}
 | ``` | 
 |  | 
 | **Effects of control flow** are modeled by joining the knowledge from all | 
 | possible previous program points. | 
 |  | 
 | ```c++ | 
 | if (...) { | 
 |   ... | 
 |   // GIVEN: x is {42} | 
 | } else { | 
 |   ... | 
 |   // GIVEN: x is {44} | 
 | } | 
 | // CONCLUSION: x is {42; 44} | 
 | ``` | 
 |  | 
 | ```c++ | 
 | // GIVEN: x is {42} | 
 | while (...) { | 
 |   ... | 
 |   // GIVEN: x is {44} | 
 | } | 
// CONCLUSION: x is {42; 44}
 | ``` | 
 |  | 
 | The predicate that we marked "given" is usually called a precondition, and the | 
 | conclusion is called a postcondition. | 
 |  | 
 | In terms of the CFG, we join the information from all predecessor basic blocks. | 
 |  | 
 |  | 
 |  | 
 | Putting it all together, to model the effects of a basic block we compute: | 
 |  | 
 | ``` | 
 | out = transfer(basic_block, join(in_1, in_2, ..., in_n)) | 
 | ``` | 
 |  | 
 | (Note that there are other ways to write this equation that produce higher | 
 | precision analysis results. The trick is to keep exploring the execution paths | 
separately and delay joining until later. However, we won't discuss those
 | variations here.) | 
 |  | 
 | To make a conclusion about all paths through the program, we repeat this | 
 | computation on all basic blocks until we reach a fixpoint. In other words, we | 
 | keep propagating information through the CFG until the computed sets of values | 
 | stop changing. | 
 |  | 
If the lattice has a finite height and the transfer functions are monotonic,
the algorithm is guaranteed to terminate. Each iteration of the algorithm can
 | change computed values only to larger values from the lattice. In the worst | 
 | case, all computed values become `⊤`, which is not very useful, but at least the | 
 | analysis terminates at that point, because it can't change any of the values. | 
 |  | 
Fixpoint iteration can be optimized by reprocessing only those basic blocks
whose inputs changed on the previous iteration. This is typically implemented
using a worklist queue. With this optimization the time complexity becomes
`O(m * |L|)`, where `m` is the number of basic blocks in the CFG and `|L|` is
the size of the lattice used by the analysis.
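
Putting the pieces together, here is a hedged sketch of the worklist loop over
the illustrative `PossibleValues` lattice defined earlier; the CFG is
simplified to predecessor lists and one transfer callback per basic block,
which is not how Clang's real `CFG` class is organized:

```c++
#include <deque>
#include <functional>
#include <vector>

struct BasicBlockInfo {
  std::vector<int> Predecessors;  // Indices of predecessor blocks.
  // Transfer function for the whole basic block.
  std::function<PossibleValues(const PossibleValues &)> Transfer;
};

std::vector<PossibleValues> RunAnalysis(
    const std::vector<BasicBlockInfo> &CFG) {
  // Start every block at ⊥ (no known values yet).
  std::vector<PossibleValues> Out(CFG.size(), PossibleValues::Bottom());
  std::deque<int> Worklist;
  for (int I = 0; I < (int)CFG.size(); ++I)
    Worklist.push_back(I);

  while (!Worklist.empty()) {
    int Block = Worklist.front();
    Worklist.pop_front();

    // out = transfer(basic_block, join(in_1, in_2, ..., in_n))
    PossibleValues In = PossibleValues::Bottom();
    for (int Pred : CFG[Block].Predecessors)
      In = Join(In, Out[Pred]);
    PossibleValues NewOut = CFG[Block].Transfer(In);

    // Reprocess successors only if this block's output actually changed.
    if (NewOut.IsTop != Out[Block].IsTop ||
        NewOut.Values != Out[Block].Values) {
      Out[Block] = NewOut;
      for (int Succ = 0; Succ < (int)CFG.size(); ++Succ)
        for (int Pred : CFG[Succ].Predecessors)
          if (Pred == Block)
            Worklist.push_back(Succ);
    }
  }
  return Out;
}
```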
 |  | 
 | ## Symbolic execution: a very short informal introduction | 
 |  | 
 | ### Symbolic values | 
 |  | 
 | In the previous example where we tried to figure out what values a variable can | 
 | have, the analysis had to be seeded with a concrete value. What if there are no | 
 | assignments of concrete values in the program? We can still deduce some | 
 | interesting information by representing unknown input values symbolically, and | 
 | computing results as symbolic expressions: | 
 |  | 
 | ```c++ | 
 | void PrintAbs(int x) { | 
 |   int result; | 
 |   if (x >= 0) { | 
 |     result = x;   // result is {x} | 
 |   } else { | 
 |     result = -x;  // result is {-x} | 
 |   } | 
 |   print(result);  // result is {x; -x} | 
 | } | 
 | ``` | 
 |  | 
 | We can't say what specific value gets printed, but we know that it is either `x` | 
 | or `-x`. | 
 |  | 
Data flow analysis is an instance of abstract interpretation, and does not dictate
 | how exactly the lattice and transfer functions should be designed, beyond the | 
 | necessary conditions for the analysis to converge. Nevertheless, we can use | 
 | symbolic execution ideas to guide our design of the lattice and transfer | 
 | functions: lattice values can be symbolic expressions, and transfer functions | 
 | can construct more complex symbolic expressions from symbolic expressions that | 
 | represent arguments. See [this StackOverflow | 
 | discussion](https://cstheory.stackexchange.com/questions/19708/symbolic-execution-is-a-case-of-abstract-interpretation) | 
 | for a further comparison of abstract interpretation and symbolic execution. | 
 |  | 
 | ### Flow condition | 
 |  | 
 | A human can say about the previous example that the function returns `x` when | 
 | `x >= 0`, and `-x` when `x < 0`. We can make this conclusion programmatically by | 
 | tracking a flow condition. A flow condition is a predicate written in terms of | 
 | the program state that is true at a specific program point regardless of the | 
 | execution path that led to this statement. For example, the flow condition for | 
 | the program point right before evaluating `result = x` is `x >= 0`. | 
 |  | 
 | If we enhance the lattice to be a set of pairs of values and predicates, the | 
 | dataflow analysis computes the following values: | 
 |  | 
 | ```c++ | 
 | void PrintAbs(int x) { | 
 |   int result; | 
 |   if (x >= 0) { | 
 |     // Flow condition: x >= 0. | 
 |     result = x;   // result is {x if x >= 0} | 
 |   } else { | 
 |     // Flow condition: x < 0. | 
 |     result = -x;  // result is {-x if x < 0} | 
 |   } | 
 |   print(result);  // result is {x if x >= 0; -x if x < 0} | 
 | } | 
 | ``` | 
 |  | 
 | Of course, in a program with loops, symbolic expressions for flow conditions can | 
 | grow unbounded. A practical static analysis system must control this growth to | 
 | keep the symbolic representations manageable and ensure that the data flow | 
 | analysis terminates. For example, it can use a constraint solver to prune | 
 | impossible flow conditions, and/or it can abstract them, losing precision, after | 
 | their symbolic representations grow beyond some threshold. This is similar to | 
 | how we had to limit the sizes of computed sets of possible values to 3 elements. | 
 |  | 
 | ### Symbolic pointers | 
 |  | 
 | This approach proves to be particularly useful for modeling pointer values, | 
 | since we don't care about specific addresses but just want to give a unique | 
 | identifier to a memory location. | 
 |  | 
 | ```c++ | 
 | void ExampleOfSymbolicPointers(bool b) { | 
 |   int x = 0;     // x is {0} | 
 |   int* ptr = &x; // x is {0}      ptr is {&x} | 
 |   if (b) { | 
 |     *ptr = 42;   // x is {42}     ptr is {&x} | 
 |   } | 
 |   print(x);      // x is {0; 42}  ptr is {&x} | 
 | } | 
 | ``` | 
 |  | 
 | ## Example: finding output parameters | 
 |  | 
 | Let's explore how data flow analysis can help with a problem that is hard to | 
 | solve with other tools in Clang. | 
 |  | 
 | ### Problem description | 
 |  | 
 | Output parameters are function parameters of pointer or reference type whose | 
 | pointee is completely overwritten by the function, and not read before it is | 
 | overwritten. They are common in pre-C++11 code due to the absence of move | 
 | semantics. In modern C++ output parameters are non-idiomatic, and return values | 
 | are used instead. | 
 |  | 
 | Imagine that we would like to refactor output parameters to return values to | 
 | modernize old code. The first step is to identify refactoring candidates through | 
 | static analysis. | 
 |  | 
 | For example, in the following code snippet the pointer `c` is an output | 
 | parameter: | 
 |  | 
 | ```c++ | 
 | struct Customer { | 
 |   int account_id; | 
 |   std::string name; | 
 | } | 
 |  | 
 | void GetCustomer(Customer *c) { | 
 |   c->account_id = ...; | 
 |   if (...) { | 
 |     c->name = ...; | 
 |   } else { | 
 |     c->name = ...; | 
 |   } | 
 | } | 
 | ``` | 
 |  | 
 | We would like to refactor this code into: | 
 |  | 
 | ```c++ | 
 | Customer GetCustomer() { | 
 |   Customer c; | 
 |   c.account_id = ...; | 
 |   if (...) { | 
 |     c.name = ...; | 
 |   } else { | 
 |     c.name = ...; | 
 |   } | 
 |   return c; | 
 | } | 
 | ``` | 
 |  | 
 | However, in the function below the parameter `c` is not an output parameter | 
 | because its field `name` is not overwritten on every path through the function. | 
 |  | 
 | ```c++ | 
 | void GetCustomer(Customer *c) { | 
 |   c->account_id = ...; | 
 |   if (...) { | 
 |     c->name = ...; | 
 |   } | 
 | } | 
 | ``` | 
 |  | 
The function also must not read the pointee before overwriting it:
 |  | 
 | ```c++ | 
 | void GetCustomer(Customer *c) { | 
 |   use(c->account_id); | 
 |   c->name = ...; | 
 |   c->account_id = ...; | 
 | } | 
 | ``` | 
 |  | 
 | Functions that escape the pointer also block the refactoring: | 
 |  | 
 | ```c++ | 
 | Customer* kGlobalCustomer; | 
 |  | 
 | void GetCustomer(Customer *c) { | 
 |   c->name = ...; | 
 |   c->account_id = ...; | 
 |   kGlobalCustomer = c; | 
 | } | 
 | ``` | 
 |  | 
 | To identify a candidate function for refactoring, we need to do the following: | 
 |  | 
 | *   Find a function with a non-const pointer or reference parameter. | 
 |  | 
 | *   Find the definition of that function. | 
 |  | 
 | *   Prove that the function completely overwrites the pointee on all paths | 
 |     before returning. | 
 |  | 
 | *   Prove that the function reads the pointee only after overwriting it. | 
 |  | 
 | *   Prove that the function does not persist the pointer in a data structure | 
 |     that is live after the function returns. | 
 |  | 
 | There are also requirements that all usage sites of the candidate function must | 
 | satisfy, for example, that function arguments do not alias, that users are not | 
 | taking the address of the function, and so on. Let's consider verifying usage | 
 | site conditions to be a separate static analysis problem. | 
 |  | 
 | ### Lattice design | 
 |  | 
To analyze the function body we can use a lattice that consists of normal
states and failure states. A normal state describes program points where we are
sure that no behaviors that block the refactoring have occurred. A normal state
keeps track of all of the parameter's member fields that are known to be
overwritten on every path from function entry to the corresponding program
point. A failure state accumulates observed violations (unsafe reads and
pointer escapes) that block the refactoring.
 |  | 
In the partial order of the lattice, failure states compare greater than
normal states, which guarantees that they "win" when joined with normal states.
Order between failure states is determined by the inclusion relation on the
sets of accumulated violations (the lattice's `⩽` is `⊆` on the sets of
violations). Order between normal states is determined by the reversed
inclusion relation on the sets of overwritten member fields (the lattice's `⩽`
is `⊇` on the sets of overwritten fields).
 |  | 
 |  | 
 |  | 
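As a rough sketch of such a lattice element and its join (the type and field
names are invented here; a real implementation would identify fields through
Clang's AST rather than by name):

```c++
#include <set>
#include <string>

// A normal state records which of the parameter's fields are definitely
// overwritten on every path so far; a failure state records observed
// violations that block the refactoring.
struct OutputParamState {
  bool IsFailure = false;
  std::set<std::string> OverwrittenFields;  // Used when !IsFailure.
  std::set<std::string> Violations;         // Used when IsFailure.
};

OutputParamState Join(const OutputParamState &A, const OutputParamState &B) {
  OutputParamState Result;
  // Failure states "win": they compare greater than every normal state.
  if (A.IsFailure || B.IsFailure) {
    Result.IsFailure = true;
    Result.Violations.insert(A.Violations.begin(), A.Violations.end());
    Result.Violations.insert(B.Violations.begin(), B.Violations.end());
    return Result;
  }
  // For normal states the order is reversed inclusion, so the least upper
  // bound keeps only the fields overwritten on *both* incoming paths.
  for (const std::string &Field : A.OverwrittenFields)
    if (B.OverwrittenFields.count(Field))
      Result.OverwrittenFields.insert(Field);
  return Result;
}
```
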
To determine whether a statement reads or writes a field, we can implement
symbolic evaluation of `DeclRefExpr`s, `LValueToRValue` casts, the pointer
dereference operator, and `MemberExpr`s.
 |  | 
 | ### Using data flow results to identify output parameters | 
 |  | 
 | Let's take a look at how we use data flow analysis to identify an output | 
 | parameter. The refactoring can be safely done when the data flow algorithm | 
 | computes a normal state with all of the fields proven to be overwritten in the | 
 | exit basic block of the function. | 
 |  | 
 | ```c++ | 
 | struct Customer { | 
 |   int account_id; | 
 |   std::string name; | 
 | }; | 
 |  | 
 | void GetCustomer(Customer* c) { | 
 |   // Overwritten: {} | 
 |   c->account_id = ...; // Overwritten: {c->account_id} | 
 |   if (...) { | 
 |     c->name = ...;     // Overwritten: {c->account_id, c->name} | 
 |   } else { | 
 |     c->name = ...;     // Overwritten: {c->account_id, c->name} | 
 |   } | 
 |   // Overwritten: {c->account_id, c->name} | 
 | } | 
 | ``` | 
 |  | 
When the data flow algorithm computes a normal state, but not all fields are
proven to be overwritten, we can't perform the refactoring.
 |  | 
 | ```c++ | 
 | void target(bool b, Customer* c) { | 
 |   // Overwritten: {} | 
 |   if (b) { | 
 |     c->account_id = 42;     // Overwritten: {c->account_id} | 
 |   } else { | 
 |     c->name = "Konrad";  // Overwritten: {c->name} | 
 |   } | 
 |   // Overwritten: {} | 
 | } | 
 | ``` | 
 |  | 
 | Similarly, when the data flow algorithm computes a failure state, we also can't | 
 | perform the refactoring. | 
 |  | 
 | ```c++ | 
 | Customer* kGlobalCustomer; | 
 |  | 
 | void GetCustomer(Customer* c) { | 
 |   // Overwritten: {} | 
 |   c->account_id = ...;    // Overwritten: {c->account_id} | 
 |   if (...) { | 
 |     print(c->name);       // Unsafe read | 
 |   } else { | 
 |     kGlobalCustomer = c;  // Pointer escape | 
 |   } | 
 |   // Unsafe read, Pointer escape | 
 | } | 
 | ``` | 
 |  | 
 | ## Example: finding dead stores | 
 |  | 
 | Let's say we want to find redundant stores, because they indicate potential | 
 | bugs. | 
 |  | 
 | ```c++ | 
 | x = GetX(); | 
 | x = GetY(); | 
 | ``` | 
 |  | 
The value of the first store to `x` is never read, which probably indicates a
bug.
 |  | 
 | The implementation of dead store analysis is very similar to output parameter | 
 | analysis: we need to track stores and loads, and find stores that were never | 
 | read. | 
 |  | 
 | [Liveness analysis](https://en.wikipedia.org/wiki/Live_variable_analysis) is a | 
 | generalization of this idea, which is often used to answer many related | 
 | questions, for example: | 
 |  | 
 | * finding dead stores, | 
 | * finding uninitialized variables, | 
 | * finding a good point to deallocate memory, | 
 | * finding out if it would be safe to move an object. | 
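
To give a flavor of the computation, here is a minimal sketch of the classic
per-statement liveness rule, a backward analysis; the `Statement`
representation and the names below are our own simplification:

```c++
#include <set>
#include <string>

struct Statement {
  std::set<std::string> Uses;  // Variables read by the statement.
  std::set<std::string> Defs;  // Variables written by the statement.
};

// Liveness is propagated backwards through the CFG:
//   live_before = (live_after \ defs) ∪ uses
std::set<std::string> LiveBefore(const Statement &S,
                                 const std::set<std::string> &LiveAfter) {
  std::set<std::string> Result = LiveAfter;
  for (const std::string &D : S.Defs)
    Result.erase(D);
  Result.insert(S.Uses.begin(), S.Uses.end());
  return Result;
}

// A store to `Var` is dead if `Var` is not live immediately after it.
bool IsDeadStore(const std::string &Var,
                 const std::set<std::string> &LiveAfter) {
  return LiveAfter.count(Var) == 0;
}
```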
 |  | 
 | ## Example: definitive initialization | 
 |  | 
Definitive initialization proves that variables are known to be initialized
when read. If we find a variable that is read before it is initialized, we
generate a warning.
 |  | 
 | ```c++ | 
 | void Init() { | 
 |   int x;    // x is uninitialized | 
 |   if (cond()) { | 
 |     x = 10; // x is initialized | 
 |   } else { | 
 |     x = 20; // x is initialized | 
 |   } | 
 |   print(x); // x is initialized | 
 | } | 
 | ``` | 
 |  | 
 | ```c++ | 
 | void Uninit() { | 
 |   int x;    // x is uninitialized | 
 |   if (cond()) { | 
 |     x = 10; // x is initialized | 
 |   } | 
 |   print(x); // x is maybe uninitialized, x is being read, report a bug. | 
 | } | 
 | ``` | 
 |  | 
For this purpose we can use a lattice in the form of a mapping from variable
declarations to initialization states; each initialization state is represented
by the following lattice:
 |  | 
 |  | 
 |  | 
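One possible encoding of the per-variable initialization state and its join
(the enumerator names are ours; the real lattice may be shaped differently):

```c++
// Initialization state of a single variable. "MaybeInitialized" plays the
// role of ⊤: the incoming facts conflict, so a read must be reported.
enum class InitState {
  Bottom,            // No information yet.
  Uninitialized,
  Initialized,
  MaybeInitialized,  // Initialized on some paths, uninitialized on others.
};

InitState Join(InitState A, InitState B) {
  if (A == InitState::Bottom) return B;
  if (B == InitState::Bottom) return A;
  if (A == B) return A;
  return InitState::MaybeInitialized;  // Conflicting information.
}
```
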
 | A lattice element could also capture the source locations of the branches that | 
 | lead us to the corresponding program point. Diagnostics would use this | 
 | information to show a sample buggy code path to the user. | 
 |  | 
 | ## Example: refactoring raw pointers to `unique_ptr` | 
 |  | 
Modern idiomatic C++ uses smart pointers to express memory ownership; however,
in pre-C++11 code one can often find raw pointers that own heap memory blocks.
 |  | 
 | Imagine that we would like to refactor raw pointers that own memory to | 
 | `unique_ptr`. There are multiple ways to design a data flow analysis for this | 
 | problem; let's look at one way to do it. | 
 |  | 
 | For example, we would like to refactor the following code that uses raw | 
 | pointers: | 
 |  | 
 | ```c++ | 
 | void UniqueOwnership1() { | 
 |   int *pi = new int; | 
 |   if (...) { | 
 |     Borrow(pi); | 
 |     delete pi; | 
 |   } else { | 
 |     TakeOwnership(pi); | 
 |   } | 
 | } | 
 | ``` | 
 |  | 
 | into code that uses `unique_ptr`: | 
 |  | 
 | ```c++ | 
 | void UniqueOwnership1() { | 
 |   auto pi = std::make_unique<int>(); | 
 |   if (...) { | 
 |     Borrow(pi.get()); | 
 |   } else { | 
 |     TakeOwnership(pi.release()); | 
 |   } | 
 | } | 
 | ``` | 
 |  | 
This problem can be solved with a lattice in the form of a map from value
declarations to pointer states:
 |  | 
 |  | 
 |  | 
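One possible encoding of these pointer states (the state names follow the
annotations in the examples below; the join rule is our guess at a reasonable,
conservative design):

```c++
// Ownership state of a raw pointer variable.
enum class PointerState {
  Compatible,   // Not currently owning: uninitialized, deleted, or released.
  Defined,      // Currently owns a heap allocation.
  Conflicting,  // Mixes borrowed and owning uses; refactoring is blocked.
};

PointerState Join(PointerState A, PointerState B) {
  if (A == B)
    return A;
  // ASSUMPTION: this sketch treats any disagreement between branches
  // conservatively as Conflicting.
  return PointerState::Conflicting;
}
```
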
 | We can perform the refactoring if at the exit of a function `pi` is | 
 | `Compatible`. | 
 |  | 
 | ```c++ | 
 | void UniqueOwnership1() { | 
 |   int *pi;             // pi is Compatible | 
 |   pi = new int;        // pi is Defined | 
 |   if (...) { | 
 |     Borrow(pi);        // pi is Defined | 
 |     delete pi;         // pi is Compatible | 
 |   } else { | 
 |     TakeOwnership(pi); // pi is Compatible | 
 |   } | 
 |   // pi is Compatible | 
 | } | 
 | ``` | 
 |  | 
 | Let's look at an example where the raw pointer owns two different memory blocks: | 
 |  | 
 | ```c++ | 
 | void UniqueOwnership2() { | 
 |   int *pi = new int;  // pi is Defined | 
 |   Borrow(pi); | 
 |   delete pi;          // pi is Compatible | 
 |   if (smth) { | 
 |     pi = new int;     // pi is Defined | 
 |     Borrow(pi); | 
 |     delete pi;        // pi is Compatible | 
 |   } | 
 |   // pi is Compatible | 
 | } | 
 | ``` | 
 |  | 
 | It can be refactored to use `unique_ptr` like this: | 
 |  | 
 | ```c++ | 
void UniqueOwnership2() {
  auto pi = std::make_unique<int>();
  Borrow(pi.get());
  if (smth) {
    pi = std::make_unique<int>();
    Borrow(pi.get());
  }
}
 | ``` | 
 |  | 
 | In the following example, the raw pointer is used to access the heap object | 
 | after the ownership has been transferred. | 
 |  | 
 | ```c++ | 
 | void UniqueOwnership3() { | 
 |   int *pi = new int; // pi is Defined | 
 |   if (...) { | 
 |     Borrow(pi); | 
 |     delete pi;       // pi is Compatible | 
 |   } else { | 
    vector<unique_ptr<int>> v = {std::unique_ptr<int>(pi)}; // pi is Compatible
 |     print(*pi); | 
 |     use(v); | 
 |   } | 
 |   // pi is Compatible | 
 | } | 
 | ``` | 
 |  | 
 | We can refactor this code to use `unique_ptr`, however we would have to | 
 | introduce a non-owning pointer variable, since we can't use the moved-from | 
 | `unique_ptr` to access the object: | 
 |  | 
 | ```c++ | 
 | void UniqueOwnership3() { | 
 |   std::unique_ptr<int> pi = std::make_unique<int>(); | 
 |   if (...) { | 
    Borrow(pi.get());
 |   } else { | 
 |     int *pi_non_owning = pi.get(); | 
 |     vector<unique_ptr<int>> v = {std::move(pi)}; | 
 |     print(*pi_non_owning); | 
 |     use(v); | 
 |   } | 
 | } | 
 | ``` | 
 |  | 
 | If the original code didn't call `delete` at the very end of the function, then | 
 | our refactoring may change the point at which we run the destructor and release | 
memory. Specifically, if there is some user code after `delete`, then extending
the lifetime of the object until the end of the function may hold locks for
longer than necessary, introduce memory overhead, and so on.
 |  | 
 | One solution is to always replace `delete` with a call to `reset()`, and then | 
 | perform another analysis that removes unnecessary `reset()` calls. | 
 |  | 
 | ```c++ | 
 | void AddedMemoryOverhead() { | 
 |   HugeObject *ho = new HugeObject(); | 
 |   use(ho); | 
 |   delete ho; // Release the large amount of memory quickly. | 
 |   LongRunningFunction(); | 
 | } | 
 | ``` | 
 |  | 
This analysis will refuse to refactor code that mixes borrowed pointer values
and unique ownership. In the following code, `GetPtr()` returns a borrowed
pointer, which is assigned to `pi`; later, `pi` is reused to hold a
uniquely-owned pointer. The analysis doesn't distinguish between these two
kinds of assignments, and it expects every assignment to be paired with a
corresponding sink; otherwise, it transitions the pointer to a `Conflicting`
state, as in this example.
 |  | 
 | ```c++ | 
 | void ConflictingOwnership() { | 
 |   int *pi;           // pi is Compatible | 
 |   pi = GetPtr();     // pi is Defined | 
 |   Borrow(pi);        // pi is Defined | 
 |  | 
 |   pi = new int;      // pi is Conflicting | 
 |   Borrow(pi); | 
 |   delete pi; | 
 |   // pi is Conflicting | 
 | } | 
 | ``` | 
 |  | 
We could still handle this case by finding a maximal range in the code where
`pi` could be in the `Compatible` state, and refactoring only that part.
 |  | 
 | ```c++ | 
 | void ConflictingOwnership() { | 
 |   int *pi; | 
 |   pi = GetPtr(); | 
 |   Borrow(pi); | 
 |  | 
 |   std::unique_ptr<int> pi_unique = std::make_unique<int>(); | 
 |   Borrow(pi_unique.get()); | 
 | } | 
 | ``` | 
 |  | 
 | ## Example: finding redundant branch conditions | 
 |  | 
In the code below, `b1` should not be checked in both the outer and inner "if"
statements; the duplication likely indicates a bug.
 |  | 
 | ```c++ | 
void F(bool b1, bool b2) {
 |   if (b1) { | 
 |     f(); | 
 |     if (b1 && b2) {  // Check `b1` again -- unnecessary! | 
 |       g(); | 
 |     } | 
 |   } | 
 | } | 
 | ``` | 
 |  | 
 | A checker that finds this pattern syntactically is already implemented in | 
 | ClangTidy using AST matchers (`bugprone-redundant-branch-condition`). | 
 |  | 
 | To implement it using the data flow analysis framework, we can produce a warning | 
 | if any part of the branch condition is implied by the flow condition. | 
 |  | 
 | ```c++ | 
void F(bool b1, bool b2) {
 |   // Flow condition: true. | 
 |   if (b1) { | 
 |     // Flow condition: b1. | 
 |     f(); | 
 |     if (b1 && b2) { // `b1` is implied by the flow condition. | 
 |       g(); | 
 |     } | 
 |   } | 
 | } | 
 | ``` | 
 |  | 
One way to check this implication is to use a SAT solver. Without a SAT solver,
we could keep the flow condition in CNF form, which would make the implication
check straightforward.
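
As a rough illustration of the idea (not how a real implementation would
represent boolean formulas), if the flow condition is kept as a set of atomic
conditions known to be true, the redundancy check becomes a lookup:

```c++
#include <set>
#include <string>

// Flow condition, drastically simplified to a conjunction of atomic boolean
// facts that are known to be true at a program point.
using FlowCondition = std::set<std::string>;

// An atom of a branch condition is redundant if the flow condition already
// implies it, i.e. it is already known to be true.
bool IsRedundantAtom(const FlowCondition &FC, const std::string &Atom) {
  return FC.count(Atom) > 0;
}

// Inside `if (b1) { ... }` the flow condition contains "b1", so checking
// `b1` again in the inner `if (b1 && b2)` would be flagged.
```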
 |  | 
 | ## Example: finding unchecked `std::optional` unwraps | 
 |  | 
 | Calling `optional::value()` is only valid if `optional::has_value()` is true. We | 
 | want to show that when `x.value()` is executed, the flow condition implies | 
 | `x.has_value()`. | 
 |  | 
 | In the example below `x.value()` is accessed safely because it is guarded by the | 
 | `x.has_value()` check. | 
 |  | 
 | ```c++ | 
 | void Example(std::optional<int> &x) { | 
 |   if (x.has_value()) { | 
 |     use(x.value()); | 
 |   } | 
 | } | 
 | ``` | 
 |  | 
When entering the "if" branch, we record that `x.has_value()` is implied by the
flow condition.
 |  | 
 | ```c++ | 
void Example(std::optional<int> &x) {
 |   // Flow condition: true. | 
 |   if (x.has_value()) { | 
 |     // Flow condition: x.has_value() == true. | 
 |     use(x.value()); | 
 |   } | 
 |   // Flow condition: true. | 
 | } | 
 | ``` | 
 |  | 
We also need to prove that `x` is not modified between the check and the value
access. A modification of `x` may be very subtle:
 |  | 
 | ```c++ | 
 | void F(std::optional<int> &x); | 
 |  | 
 | void Example(std::optional<int> &x) { | 
 |   if (x.has_value()) { | 
 |     // Flow condition: x.has_value() == true. | 
 |     unknown_function(x); // may change x. | 
 |     // Flow condition: true. | 
 |     use(x.value()); | 
 |   } | 
 | } | 
 | ``` | 
 |  | 
 | ## Example: finding dead code behind A/B experiment flags | 
 |  | 
 | Finding dead code is a classic application of data flow analysis. | 
 |  | 
Unused A/B experiment flags hide dead code. However, this flavor of dead code
is invisible to the compiler because the flag can be turned on at any moment.
 |  | 
We could make a tool that deletes experiment flags. The user tells us which
flag they want to delete, and we assume that its value is a given constant.
 |  | 
 | For example, the user could use the tool to remove `example_flag` from this | 
 | code: | 
 |  | 
 | ```c++ | 
 | DEFINE_FLAG(std::string, example_flag, "", "A sample flag."); | 
 |  | 
 | void Example() { | 
 |   bool x = GetFlag(FLAGS_example_flag).empty(); | 
 |   f(); | 
 |   if (x) { | 
 |     g(); | 
 |   } else { | 
 |     h(); | 
 |   } | 
 | } | 
 | ``` | 
 |  | 
 | The tool would simplify the code to: | 
 |  | 
 | ```c++ | 
 | void Example() { | 
 |   f(); | 
 |   g(); | 
 | } | 
 | ``` | 
 |  | 
 | We can solve this problem with a classic constant propagation lattice combined | 
 | with symbolic evaluation. | 
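
For reference, the per-variable constant propagation lattice has `⊥`, a single
known constant, and `⊤`; here is a minimal sketch with invented names:

```c++
#include <optional>

// Constant propagation lattice for a single value: ⊥ (no information),
// a single known constant, or ⊤ (could be anything).
struct ConstLattice {
  bool IsTop = false;
  std::optional<int> Constant;  // Engaged only when a constant is known.

  static ConstLattice Top() { return {true, std::nullopt}; }
  static ConstLattice Bottom() { return {false, std::nullopt}; }
  static ConstLattice Known(int C) { return {false, C}; }
};

ConstLattice Join(const ConstLattice &A, const ConstLattice &B) {
  if (A.IsTop || B.IsTop) return ConstLattice::Top();
  if (!A.Constant) return B;  // ⊥ join X = X
  if (!B.Constant) return A;
  if (*A.Constant == *B.Constant) return A;
  return ConstLattice::Top();  // Two different constants.
}
```

Once the flag is pinned to the user-provided constant, the analysis computes a
constant value for `x`, so one branch of the `if` is provably dead and can be
deleted along with the flag.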
 |  | 
 | ## Example: finding inefficient usages of associative containers | 
 |  | 
 | Real-world code often accidentally performs repeated lookups in associative | 
 | containers: | 
 |  | 
 | ```c++ | 
map<int, Employee> xs;
xs[42].name = "...";
xs[42].title = "...";
 | ``` | 
 |  | 
To find the above inefficiency we can use available expressions analysis to
understand that `xs[42]` is evaluated twice. The fix is to reuse the result of
the first lookup:
 |  | 
 | ```c++ | 
 | map<int, Employee> xs; | 
Employee &e = xs[42];
e.name = "...";
e.title = "...";
 | ``` | 
 |  | 
We can also track the `contains()` check in the flow condition to find
redundant checks, like in the example below.
 |  | 
 | ```c++ | 
 | std::map<int, Employee> xs; | 
 | if (!xs.contains(42)) { | 
 |   xs.insert({42, someEmployee}); | 
 | } | 
 | ``` | 
 |  | 
 | ## Example: refactoring types that implicitly convert to each other | 
 |  | 
Refactoring one strong type to another is difficult, but the compiler can
help: once you refactor one reference to the type, the compiler will flag the
other places where this information flows as type mismatch errors.
Unfortunately, this strategy does not work when the types implicitly convert to
each other, for example, when replacing `int32_t` with `int64_t`.
 |  | 
 | Imagine that we want to change user IDs from 32 to 64-bit integers. In other | 
 | words, we need to find all integers tainted with user IDs. We can use data flow | 
 | analysis to implement taint analysis. | 
 |  | 
 | ```c++ | 
 | void UseUser(int32_t user_id) { | 
 |   int32_t id = user_id; | 
 |   // Variable `id` is tainted with a user ID. | 
 |   ... | 
 | } | 
 | ``` | 
 |  | 
Taint analysis is well suited to this problem because the program rarely
branches on user IDs, and almost certainly does not perform any computation on
them (like arithmetic).
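
A minimal two-point taint lattice and an assignment transfer rule could look
like this (a sketch with invented names; a complete taint analysis also needs
to model function calls, fields, and so on):

```c++
#include <map>
#include <string>

// Two-point lattice: a value either definitely does not carry a user ID, or
// it might. "Tainted" is the top element.
enum class Taint { NotTainted, Tainted };

Taint Join(Taint A, Taint B) {
  return (A == Taint::Tainted || B == Taint::Tainted) ? Taint::Tainted
                                                      : Taint::NotTainted;
}

// Transfer for an assignment `lhs = rhs;`: taint flows from right to left.
void TransferAssign(std::map<std::string, Taint> &State,
                    const std::string &Lhs, const std::string &Rhs) {
  State[Lhs] = State.count(Rhs) ? State.at(Rhs) : Taint::NotTainted;
}
```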