How do garbage collection algorithms work?

Garbage collection

Garbage collection (GC, also called automatic garbage collection or free-space collection) in software engineering ideally denotes the minimization of a process's memory requirements without any effort either during software development or at runtime. In practice, memory areas that have already been allocated to the process but are no longer required are released or reused. Eliminating external fragmentation is a further option. In terms of time and space complexity, real implementations deviate more or less significantly from this ideal.

Functionality

In many software systems, memory is reserved as needed in order to store the information about a data object. If the object is no longer used after a part of the program has been processed, its memory should be made available again. This task is performed automatically by a routine called a garbage collector, without it having to be coded explicitly in the program. In addition, garbage collection can usually be triggered at any time in a running program.

Algorithms

There are several garbage collection algorithms, some of which also tackle the problem of memory fragmentation.

Mark-and-sweep algorithm

With this method of garbage collection, all references to other objects are followed, starting from objects that are known to be still in use (the roots). Every object reached in this way is marked. Afterwards, all unmarked objects are released for reuse.

The release can lead to memory fragmentation, although the problem is somewhat smaller than with manual memory management: while manual deallocation always takes place immediately, mark-and-sweep almost always removes several objects at once, which frees up larger contiguous memory areas.
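The two phases can be sketched in a few lines. The Obj class and the explicit heap list here are illustrative stand-ins for a real runtime's object headers and allocator state:

```python
# Minimal sketch of mark-and-sweep over a toy object graph.
# "Obj", "heap" and the mark bit are illustrative, not a real runtime's layout.

class Obj:
    def __init__(self, name):
        self.name = name
        self.refs = []       # outgoing references to other objects
        self.marked = False

def mark(obj):
    """Mark phase: follow references from a root, marking every reachable object."""
    if obj.marked:
        return
    obj.marked = True
    for child in obj.refs:
        mark(child)

def sweep(heap):
    """Sweep phase: drop every unmarked object, clear the mark bit on survivors."""
    survivors = []
    for obj in heap:
        if obj.marked:
            obj.marked = False
            survivors.append(obj)
    return survivors

# Build a small heap: root -> a -> b is reachable, c is unreachable garbage.
root, a, b, c = Obj("root"), Obj("a"), Obj("b"), Obj("c")
root.refs.append(a)
a.refs.append(b)
heap = [root, a, b, c]

mark(root)
heap = sweep(heap)   # c is collected
```

Real collectors avoid recursion (e.g. via an explicit mark stack) and obtain their roots from registers, stacks and global variables; the sketch only shows the principle.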

Mark-and-compact algorithm

The mark-and-compact algorithm uses, like mark-and-sweep, the principle of reachability in graphs to detect objects that are still referenced. It then copies these objects to another location in memory. The entire area from which the still-referenced (also called "live") objects were copied is then regarded as a free memory area.

The disadvantage of this method is that the "live" objects themselves are moved, so pointers to them become invalid and have to be adjusted. There are basically at least two ways to do this:

  1. Each object is addressed via two indirections (via a pointer to a pointer to the object), so that when it is moved, only the pointer that points directly to the object has to be adjusted.
  2. All references refer directly to the object, to avoid time-consuming double dereferencing, and are adjusted accordingly after a move.

Moving the objects, however, has the advantage that those that "survived" the cleanup are now compacted and the memory is practically defragmented. Allocation also becomes very fast, because free memory does not have to be searched for laboriously: if the referenced objects are moved to the "beginning" of the memory, new memory can simply be allocated at the "end", behind the last live object. Allocation thus works comparatively simply, similar to a stack.
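Assuming a heap modeled as a simple Python list of slots (all names and sizes here are illustrative), the compaction and the subsequent bump allocation described above can be sketched as:

```python
# Sketch: after compaction the live objects sit at the start of the heap,
# so allocation is a simple "bump" of a free pointer. HEAP_SIZE, the slot
# model and the forwarding table are illustrative, not a real allocator.

HEAP_SIZE = 10
heap = [None] * HEAP_SIZE

def compact(live_slots):
    """Move live objects to the front and record their new addresses."""
    forwarding = {}                  # old address -> new address (for pointer fixup)
    new_addr = 0
    for old_addr in sorted(live_slots):
        heap[new_addr] = heap[old_addr]
        forwarding[old_addr] = new_addr
        new_addr += 1
    for i in range(new_addr, HEAP_SIZE):
        heap[i] = None               # everything past the last survivor is free
    return forwarding, new_addr      # new_addr is the bump pointer

# Live objects scattered at slots 2, 5 and 7:
heap[2], heap[5], heap[7] = "x", "y", "z"
forwarding, free_ptr = compact({2, 5, 7})

def allocate(value):
    """Bump allocation: no searching, just take the next free slot."""
    global free_ptr
    addr = free_ptr
    heap[addr] = value
    free_ptr += 1
    return addr
```

The forwarding table is what a real collector would use to adjust all pointers to the moved objects, corresponding to the two fixup strategies listed above.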

Generational

Generational GCs shorten the duration of memory release. They exploit the fact that in practice the lifespans of objects vary greatly: on the one hand, there are objects that survive the entire runtime of the application; on the other, there are a large number of objects that are only needed temporarily to perform a single task. With generational GCs, the memory is divided into several sub-areas (generations). Longevity is quantified by a counter that is incremented with each garbage collection. With each application of the release algorithm (e.g. mark-and-compact or stop-and-copy), long-lived objects are moved to a higher generation. The advantage is that garbage collection can be performed more frequently and faster for the lower generations, since only some of the objects need to be moved and their pointers changed. Higher generations most likely contain only live (or very few dead) objects and therefore need to be cleaned up less frequently.

The number of generations is determined heuristically (for example, three in .NET, two in Sun's Java VM). In addition, different algorithms can be used for each generation. In Java, for example, a modified stop-and-copy algorithm is used for the lowest generation (the young generation) and mark-and-compact for the higher one (the tenured generation).
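A minimal sketch of the promotion mechanism, with an invented promotion threshold of two survived collections (real VMs tune thresholds and generation counts heuristically):

```python
# Sketch of generational promotion: each object carries a survival counter;
# after surviving PROMOTION_AGE minor collections it moves to the old
# generation. The threshold and the "live" set are illustrative.

PROMOTION_AGE = 2

class Obj:
    def __init__(self, name):
        self.name = name
        self.age = 0          # number of collections survived so far

def minor_collect(young, old, live):
    """Collect only the young generation: dead objects vanish, survivors
    age, and sufficiently old survivors are promoted to the old generation."""
    survivors = []
    for obj in young:
        if obj in live:
            obj.age += 1
            if obj.age >= PROMOTION_AGE:
                old.append(obj)       # promotion to the tenured generation
            else:
                survivors.append(obj)
    return survivors

young, old = [Obj("a"), Obj("b")], []
a, b = young
young = minor_collect(young, old, live={a, b})  # both survive, age 1
young = minor_collect(young, old, live={a})     # b dies; a is promoted
```

Only the small young generation is scanned in each minor collection, which is exactly why these collections can run frequently and quickly.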

Reference counting

With this method, each object keeps a counter of the number of references that point to it. If the reference counter of an object falls to zero, the object can be released.

A particular problem of garbage collection with reference counting lies in so-called cyclic references, in which objects hold references to one another but are otherwise no longer used by any consumer in the system. Suppose, for example, that object A holds a reference to object B and vice versa, while the rest of the system no longer needs their services. Since the two objects refer to each other cyclically, their counters never reach zero, and reference counting alone cannot detect that they are no longer being used. As a consequence, the memory remains occupied for the duration of the program execution. There are various algorithms that can recognize and resolve such situations, mostly based on the principle of reachability in graphs.
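A hand-modeled sketch of the counting scheme and of the cycle problem just described (in a real implementation the runtime maintains the counters; all names here are illustrative):

```python
# Sketch of reference counting and the cycle problem. The counters are
# modeled by hand for illustration; a real runtime updates them on every
# reference assignment.

class Obj:
    def __init__(self):
        self.refcount = 0
        self.refs = []

def add_ref(src, dst):
    src.refs.append(dst)
    dst.refcount += 1

def drop_ref(src, dst):
    src.refs.remove(dst)
    dst.refcount -= 1
    if dst.refcount == 0:             # no references left: release the object
        for child in list(dst.refs):  # and recursively drop what it held
            drop_ref(dst, child)

root = Obj()
a, b = Obj(), Obj()
add_ref(root, a)
add_ref(a, b)
add_ref(b, a)        # cycle: a <-> b

drop_ref(root, a)    # the rest of the system no longer needs a or b...
# ...but their mutual references keep both counters at 1: a leak that a
# reachability-based collector would have detected.
```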

Consequences

Some programming errors in the handling of resources can be avoided entirely by garbage collection, in particular the double release of resources and the dereferencing of resources that were accidentally released too early (dangling pointers).

The analysis and release of ten million pointers takes only a fraction of a second on modern machines with efficient runtime systems. One disadvantage of some garbage collection algorithms is that the point at which they run may be unpredictable. In real-time systems, for example, it is unacceptable for program execution to be interrupted at unforeseeable times by garbage collection. For real-time systems, automatic garbage collection must therefore be implemented preemptively (e.g. in the idle process) and incrementally. Simple incremental methods work, for example, with so-called tri-color marking.[1]
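A sketch of the tri-color marking idea: white objects are unvisited, gray objects are visited but not yet scanned, black objects are fully scanned. Because the state lives in explicit sets rather than in a recursion, the loop can be paused and resumed after any step, which is the basis of incremental collection (the Obj class and graph are illustrative):

```python
# Tri-color marking sketch: white = not yet seen, gray = seen but children
# not yet scanned, black = fully scanned. Each loop iteration is one small,
# resumable unit of work.

class Obj:
    def __init__(self, name, refs=()):
        self.name = name
        self.refs = list(refs)

def tricolor_mark(roots, heap):
    white = set(heap)
    gray = []
    for r in roots:
        white.discard(r)
        gray.append(r)
    black = set()
    while gray:                  # the collector could pause between iterations
        obj = gray.pop()
        for child in obj.refs:
            if child in white:   # shade reachable children gray
                white.remove(child)
                gray.append(child)
        black.add(obj)
    return black, white          # black survives, white is garbage

c = Obj("c")
b = Obj("b")
a = Obj("a", [b])
black, white = tricolor_mark(roots=[a], heap=[a, b, c])
```

An incremental collector additionally needs a write barrier so that the program cannot hide a white object behind an already-black one while marking is paused; that part is omitted here.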

As a consequence of Rice's theorem, it cannot be determined whether referenced objects will ever be used again. This is why automatic garbage collection only releases objects that are no longer referenced by the program and cannot prevent memory leaks.

Automatic garbage collection can speed up programs[2] if memory is only released when system load is currently low or if the system's memory management is relieved by defragmentation. There are microbenchmarks showing that in programming languages with automatic garbage collection, the creation and release of objects is faster than without.[3][4]

Distribution

Some older programming languages (APL, LISP, BASIC) and many modern ones have built-in automatic garbage collection.

For programming languages such as C, in which the programmer has to perform memory management "by hand", there are sometimes libraries that provide automatic garbage collection. Such a library can, however, easily be bypassed during programming, or even has to be bypassed in systems programming. For this reason, in some modern development environments, modules programmed at the system level are excluded from automatic garbage collection by explicitly marking them (for example, in C# with the option /unsafe or in Component Pascal with the mandatory statement IMPORT SYSTEM).

Further examples of programming languages with automatic memory management are Smalltalk, Haskell, Java, Oberon, OCaml, Perl, Visual Objects, ABAP and Objective-C (from version 2.0), as well as all languages developed for the Common Language Runtime of .NET (for example C# or VB.NET).

Conservative and non-conservative garbage collection

A conservative automatic garbage collector is one that cannot reliably detect all unreferenced objects. Such a collector usually has no information about where references to other objects are located in memory.

While a non-conservative collector (sometimes called an "exact collector") has metadata with which it can find all references within objects and stack frames, a conservative one must scan the memory for possible references. Any bit sequence that could be a valid reference into the heap is treated as a reference, since it cannot be determined whether it is not merely a random pattern. Conservative collectors therefore occasionally recognize objects as referenced when they actually are not. Because a garbage collector must never remove objects that could still be needed, it has to assume conservatively that the recognized bit sequence is a reference.
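The conservative scan can be sketched as follows, with made-up heap bounds and word values: any value that happens to fall within the heap's address range must be treated as a potential reference:

```python
# Sketch of conservative scanning: every machine word whose value lies inside
# the heap's address range is treated as a potential pointer. The bounds and
# the stack contents below are invented for illustration.

HEAP_START, HEAP_END = 0x1000, 0x2000   # hypothetical heap address range

def conservative_scan(words):
    """Return every word value that *could* be a pointer into the heap."""
    return [w for w in words if HEAP_START <= w < HEAP_END]

# Raw words of a stack frame: one real pointer (0x1a40), one integer that
# merely *looks* like a pointer (0x1234), and two harmless small values.
stack_words = [7, 0x1a40, 0x1234, 42]
possible_refs = conservative_scan(stack_words)
# Both 0x1a40 and 0x1234 are retained: the collector cannot distinguish the
# accidental bit pattern from a real reference, so both targets stay alive.
```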

A conservative collector can be a risk especially when the garbage collector also has to free resources that are more urgent than memory (see finalization). In general, conservative GCs are found where exact automatic memory management is difficult to implement, for example in the C++ and C languages. (Note: this does not apply to the "managed types" in C++/CLI, since separate reference types for automatic garbage collection were introduced there, which do not allow the address of an object to be read out directly.)

Finalization

The technique, used in many systems, of deinitializing objects via automatic garbage collection is also known as finalization. For example, objects in the Java programming language have a special method called finalize() that is used for this purpose.

This procedure, i.e. having object deinitialization performed by the automatic garbage collection, is now regarded as a design error and is avoided in newer architectures.

Problems resulting from finalization are:

  • Objects that need finalization have a longer lifetime than others; it can even be significantly longer than that of objects without finalization, since an object can only be released after it has been finalized. If scarce resources are managed this way, it can lead to blocking states within the program flow.
  • Finalization creates additional processing load for automatic garbage collection.
  • There is no defined order of finalization. It can therefore happen that during finalization other objects are accessed that are themselves subject to finalization but no longer exist at that point.
  • Depending on the implementation (e.g. in the Java programming language), there is no guarantee that the finalization routine will be called by the automatic garbage collector at all.

For these reasons, newer designs attempt to dispense with finalization entirely. The management of resources other than memory is kept out of the garbage collector's hands, which then has memory management as its sole task.
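In Python, for example, the deterministic alternative to finalizers is the context-manager protocol (the counterpart of Java's try-with-resources); FakeHandle below is an illustrative stand-in for a scarce resource such as a file or socket handle:

```python
# Sketch: instead of relying on a finalizer whose timing is uncertain, a
# scarce resource is released deterministically via the context-manager
# protocol. "FakeHandle" is an invented illustrative class.

class FakeHandle:
    """Stands in for a scarce resource such as a file or socket handle."""
    def __init__(self):
        self.open = True

    def close(self):
        self.open = False

    # Context-manager protocol: cleanup runs when the `with` block exits,
    # independent of when (or whether) the garbage collector ever runs.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False        # do not swallow exceptions

with FakeHandle() as h:
    assert h.open           # resource is usable inside the block
# Here the handle is already closed, deterministically.
```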

Fragmentation

Traditional memory management is prone to fragmentation over time. This problem is caused by the different lifetimes of objects. The memory management keeps track of which areas are "free memory" that can be allocated, and which are already occupied by objects. Explicitly releasing memory creates gaps that cannot always be refilled immediately. If new objects are larger than the freed gaps, an unallocated area must be found elsewhere.

Problems that can occur with fragmentation:

  • A certain part of the available memory remains unused.
  • Allocating memory takes longer, because the data structures that manage the heap become more complex and searching for a free memory block of suitable size becomes more expensive.
  • Objects allocated one after the other often end up not adjacent in memory (this is referred to as poor locality of reference). Studies have shown that objects created in succession are often used together in a single operation. If they are not close enough to one another, accesses go to the slower main memory instead of the fast cache, which can severely slow them down.
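The first problem can be illustrated with a tiny first-fit sketch (the free-gap list and sizes are made up): the total free memory would suffice, but no single contiguous gap does:

```python
# Sketch of external fragmentation: enough memory is free in total, but it
# is scattered into gaps that are each too small for the request.

def first_fit(free_gaps, size):
    """Return the index of the first gap that can hold `size`, else None."""
    for i, gap in enumerate(free_gaps):
        if gap >= size:
            return i
    return None

# After many allocate/free cycles the free space is scattered:
free_gaps = [3, 2, 3]      # 8 units free in total
request = 5                # ...but no contiguous run of 5 units exists

assert sum(free_gaps) >= request
slot = first_fit(free_gaps, request)   # None: allocation fails despite
                                       # enough total free memory
```

A compacting collector would merge the three gaps into one contiguous block of 8 units, after which the request would succeed.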

However, fragmentation can be completely avoided by compacting algorithms (see mark-and-compact). Although this leads to longer pauses when memory is released, it reduces the allocation time. To keep memory release as short as possible, large memory areas are cleaned up as rarely as possible, which is why these algorithms are preferably combined with generational methods.

Literature

  • Richard Jones, Rafael Lins: Garbage Collection. John Wiley and Sons, 1996, ISBN 0-471-94148-4.

References

  1. ^ Frank Joachim Frey: Garbage Collection for Parallel and Distributed Systems. May 7, 2002.
  2. ^ Benjamin Zorn: The Measured Cost of Conservative Garbage Collection. Software - Practice and Experience 23(7): 733-756, 1992.
  3. ^ Microbenchmarking C++, C#, and Java: Object creation/destruction and method call. Dr. Dobb's Journal, July 1, 2005. Retrieved October 26, 2009.
  4. ^ Arne Schäpers, Rudolf Huttary: Daniel Düsentrieb - C#, Java, C++ and Delphi in the efficiency test, part 2. c't, December 2003, pp. 222-227. Retrieved October 26, 2009: "The results show, firstly, that a garbage collector (during destruction) does not seem to bring any noticeable disadvantages in runtime behavior" and "the sometimes almost double time required by C++ for construction compared to the other candidates ...".