Python, like most modern languages, has its own garbage collector (GC). But how does the CPython GC work?
First of all, does it really matter? After all, a GC is a GC, right? Well, not exactly. The GC in Java has evolved quite a bit since the days of Java 1.0 (you can find a great deal of information about the various algorithms used here). In an interview, three Twitter developers explained their move from Ruby to Scala (a language that runs on top of the Java VM). One of the reasons was that “because Ruby’s garbage collector is not quite as good as Java’s, each process uses up a lot of memory”. So yes, the GC can matter.
In the rest of this post, the term “collection” refers to the action of garbage-collecting objects which have become unreachable. It does NOT refer to structures such as lists or dictionaries.
The algorithm used
The main algorithm used for CPython is based on reference counting. Each object in Python contains a reference count – that is, the number of times it is referenced by another object or by a variable. Once this reference count reaches zero, it is deleted on the spot. Because that object may reference other objects (thus decreasing their reference count), this can trigger a chain reaction where a lot of objects are being deleted in a row.
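Reference counts can be observed from Python itself with sys.getrefcount, a CPython-specific function. One quirk to keep in mind: the object passed to getrefcount gains a temporary extra reference (the function's own argument), so the reported count is one higher than you might expect.

```python
import sys

class Node:
    pass

n = Node()
# One reference from the variable n, plus the temporary
# reference held by getrefcount's argument.
print(sys.getrefcount(n))  # 2

other = n                  # a second variable now references the object
print(sys.getrefcount(n))  # 3

del other                  # dropping the reference decrements the count
print(sys.getrefcount(n))  # 2
```

Once the last reference disappears, the count reaches zero and CPython frees the object immediately, without waiting for any collection cycle.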
Whereas traditional GC algorithms look for the objects which are reachable (collecting the objects that were not marked as reachable once all the reachable objects are accounted for), Python looks for the objects which are unreachable.
This approach, however, generates some overhead in memory and CPU. First of all, every single object, even a number or a string, must store a reference count. There is also a CPU overhead, as that reference count must be properly maintained. Even for code as simple as:
number = 42
number += 1
Python needs to first increment 42’s reference count (line 1), then in line 2 decrement it and increment 43’s reference count. But this extra processing is actually minor compared to the work Python does just to find the “number” variable, determine its type, etc.
Dealing with reference cycles
The one thing reference counting does not handle is reference cycles. Imagine a circular linked list, or an object referencing itself. Even when these objects are unreachable, their reference count remains at least 1.
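A small demonstration of such a cycle: after the two variables below are deleted, reference counting alone cannot reclaim the pair, since each object still holds a reference to the other. The cycle detector, however, can (the Link class here is just an illustrative container):

```python
import gc

class Link:
    pass

a = Link()
b = Link()
a.next = b
b.next = a          # a two-element reference cycle

del a, b            # no variable reaches the cycle any more, but each
                    # object is still referenced by the other one

# Reference counting leaves both objects alive;
# only the cycle detector can reclaim them.
unreachable = gc.collect()
print(unreachable)  # at least 2 objects found unreachable
```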
This is why CPython has an algorithm to detect those reference cycles, implemented in the function collect. It only focuses on container objects (i.e. objects that can contain a reference to one or more other objects): lists, dictionaries, user class instances, etc. As an extra optimization, the GC ignores tuples containing only immutable types (ints, strings, or tuples that themselves contain only immutable types).
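Whether a given object is followed by the cyclic GC can be checked with gc.is_tracked. Note that the untracking of immutable-only tuples happens lazily, during the first collection that examines them:

```python
import gc

# Atomic objects can never be part of a cycle, so the GC ignores them.
print(gc.is_tracked(42))        # False
print(gc.is_tracked("hello"))   # False

# Containers are tracked...
print(gc.is_tracked([1, 2]))    # True

# ...but a tuple holding only immutables is untracked lazily,
# during the first collection that examines it.
t = tuple([1, 2, 3])            # built at runtime, initially tracked
gc.collect()
print(gc.is_tracked(t))         # False
```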
For this, CPython maintains two doubly-linked lists: one list of objects to be scanned, and a tentatively unreachable list.
Let’s take the case of a circular linked list which has one link referenced by a variable A, and one self-referencing object which is completely unreachable.
1. When the GC starts, it has all the container objects it wants to scan on a first list (more on that later). Because most objects turn out to be reachable, it is more efficient to assume all objects are reachable and move them to an unreachable list if need be, rather than the other way around. On top of having a reference count (the number of objects referencing them), each container object also has a gc_ref field, initially set to the reference count:
2. The GC then goes through each container object and decrements by 1 the gc_ref of any other object it is referencing. In other words, we only care about the references from outside the “objects to scan” list (like variables) and not references from other container objects inside that list.
3. The GC scans the container objects again. The objects whose gc_ref is zero are marked as GC_TENTATIVELY_UNREACHABLE and moved to the tentatively unreachable list. In the following graph, the GC has processed the “link 3” and “link 4” objects but hasn’t processed “link 1” and “link 2” yet.
4. Let’s assume that the GC scans next the “link 1” object. Because its gc_ref is equal to 1, it is marked as GC_REACHABLE.
5. When the GC encounters an object which is reachable, it traverses its references to find all the objects that are reachable from it, marking them as well as GC_REACHABLE. This is what happens to “link 2” and “link 3” below as they are reachable from “link 1”. Because “link 3” is reachable after all, it is moved back to the original list.
6. When all the objects are scanned, the container objects in the tentatively unreachable list are really unreachable and can thus be garbage collected.
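The six steps above can be sketched in pure Python. This is only a toy model of the real C implementation in CPython, not its actual data structures: the Container class, the gc_ref bookkeeping and the example names (link1 … link4, matching the figures) are all illustrative.

```python
# Toy sketch of the cycle-detection pass (NOT CPython's actual C code).
# Each "container" knows its reference count and the containers it references.

class Container:
    def __init__(self, name, refcount):
        self.name = name
        self.refcount = refcount   # includes external refs (variables, ...)
        self.refs = []             # containers this object references

def find_unreachable(to_scan):
    # Step 1: copy the reference count into gc_ref.
    for obj in to_scan:
        obj.gc_ref = obj.refcount
    # Step 2: subtract the references coming from inside the list.
    for obj in to_scan:
        for ref in obj.refs:
            ref.gc_ref -= 1
    # Steps 3-5: objects with gc_ref > 0 are referenced from outside the
    # list; everything reachable from them is reachable as well.
    reachable = [obj for obj in to_scan if obj.gc_ref > 0]
    marked = set()
    while reachable:
        obj = reachable.pop()
        if obj.name in marked:
            continue
        marked.add(obj.name)
        reachable.extend(obj.refs)
    # Step 6: whatever was never marked is truly unreachable.
    return [obj for obj in to_scan if obj.name not in marked]

# The example from the text: links 1-3 form a cycle referenced by a
# variable (hence link1's refcount of 2), link 4 only references itself.
link1 = Container("link1", 2)
link2 = Container("link2", 1)
link3 = Container("link3", 1)
link4 = Container("link4", 1)
link1.refs = [link2]
link2.refs = [link3]
link3.refs = [link1]
link4.refs = [link4]

print([o.name for o in find_unreachable([link1, link2, link3, link4])])
# ['link4']
```

Only link 4 ends up on the unreachable list: the cycle of links 1–3 survives because link 1 still has one external reference (the variable A) after the internal references are subtracted.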
Optimization #1: limiting the time for each collection
To limit the time each garbage collection takes, the CPython GC uses the concept of object generations. The idea is that most objects have a very short lifespan and can thus be collected shortly after their creation. The corollary is that the older an object is, the less likely it is to become unreachable.
So CPython defines three generations of objects (0, 1 and 2). Every new object starts in generation 0. If it survives a garbage collection, it is moved to generation 1, where it will be surveyed for collection less often. If it survives another GC round, it is moved to generation 2, where it will be surveyed the least often.
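The generations can be observed with the gc module. This sketch assumes CPython 3.8 or later (for the generation argument of gc.get_objects) and pauses automatic collections to keep the demonstration deterministic:

```python
import gc

gc.disable()            # pause automatic collections for a deterministic demo

class Widget:
    pass

w = Widget()
# New objects start in generation 0.
print(any(o is w for o in gc.get_objects(generation=0)))  # True

gc.collect()            # w survives a full collection and is promoted
print(any(o is w for o in gc.get_objects(generation=0)))  # False

gc.enable()
```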
Optimization #2: limiting the number of full collections
The next question is: what triggers garbage collection? Whenever the number of objects in a particular generation gets over a certain threshold (which can be configured through gc.set_threshold()), collection starts. Note that when the GC decides to collect a given generation, it also collects the younger generations (e.g. collecting generation 1 also implies collecting generation 0).
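A quick sketch of inspecting and changing those thresholds with the gc module (the default values shown in the comments may vary across CPython versions):

```python
import gc

# One threshold per generation.
print(gc.get_threshold())        # e.g. (700, 10, 10) on many CPython builds

# Raise the generation-0 threshold to trade memory for fewer,
# larger collections.
gc.set_threshold(5000, 10, 10)
print(gc.get_threshold())        # (5000, 10, 10)
```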
As a second optimization, CPython also limits the number of “full” collections, i.e. those that go through all three generations of objects. A full collection is only run when the number of objects that have survived the last “non-full” collection (that is, a collection of generation 0 or 1) exceeds 25% of the number of objects that have survived the last full collection. This 25% ratio is hard-coded and cannot be configured.
As written in a comment in the GC source code: “each full garbage collection is more and more costly as the number of objects grows, but we do fewer and fewer of them”
The danger of finalizers
There is however another issue. In any language, the garbage collector’s worst enemy is finalizers – user-defined methods which are called when the GC wants to collect an object (in Python, a finalizer is created by defining the method __del__). What if a finalizer makes one or more objects reachable again? This is why, up to Python 3.3, container objects inside a reference cycle that have a finalizer are never garbage-collected.
In the code below, we define a class with a finalizer. We then instantiate that class and add a self-reference to the object in order to test the GC.
>>> class MyClass(object):
...     def __del__(self):
...         pass
...
>>> my_obj = MyClass()
>>> my_obj.ref = my_obj
>>> my_obj
<__main__.MyClass object at 0x00000000025FCEF0>
We then remove the reference to the object and run the garbage collector.
>>> import gc
>>> del my_obj
>>> gc.collect()
2
>>> gc.garbage
[<__main__.MyClass object at 0x00000000025FCEF0>]
gc.garbage returns a list of uncollectable objects (objects that are marked as unreachable but cannot be collected) which now contains the object we tried to delete.
PEP 442 (Python Enhancement Proposal 442), implemented in Python 3.4, however changes that behavior. Starting with Python 3.4, during a collection, the finalizers of all the objects in an unreachable group of circular references are called. Objects which become reachable again are “revived”. Then, provided that the finalizers did not make the group reachable again, all references are cleared and all the objects are collected. Note that this does not work if you set the GC debug mode (thanks to Antoine Pitrou, the author of PEP 442, for the tip).
But Python 3.4 is not a license to do anything you want with finalizers. If a finalizer makes an object reachable again, its side effects (like writing to a file) will not be rolled back.
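On Python 3.4 or later the new behavior can be observed directly: the finalizer of an object in a self-referencing cycle runs, the cycle is then freed, and gc.garbage stays empty (a small sketch; the Node class here is illustrative):

```python
import gc

class Node:
    def __del__(self):
        print("finalizer called")

n = Node()
n.ref = n           # a self-referencing cycle with a finalizer
del n

gc.collect()        # on Python 3.4+, the finalizer runs and the cycle is freed
print(gc.garbage)   # []: nothing is left uncollectable
```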
So the moral of the story is that you should be very careful with finalizers and only use them when you cannot do otherwise. Remember as well that you cannot tell when an object will really be deleted, even when you explicitly run the GC (the object could still be reachable from somewhere else). Likewise, the fact that the finalizer was called is not even a guarantee that the object was actually collected.
What can you do?
With all this, what can you do if you believe garbage collection is hampering your application performance?
- Do not use finalizers unless you really, really, really have to.
- If you have to use finalizers and as a result end up with too many objects in gc.garbage, consider upgrading to Python 3.4 or greater.
- If you are not creating reference cycles, you can disable the cyclic GC altogether by calling gc.disable() – reference counting will still reclaim objects.
- Remember that a GC does not entirely protect you from memory leaks. If you store a lot of data in global variables and forget to clear them, you will forever hold a reference to those objects – and there is nothing the GC can do about it. You can also use weak references – references which are not enough to keep an object alive.
- You can tune the GC using the gc module.
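As an illustration of the point about weak references above, here is a minimal sketch using the standard weakref module (the Cache class is just a placeholder):

```python
import weakref

class Cache:
    pass

obj = Cache()
ref = weakref.ref(obj)   # a weak reference: not counted by the GC

print(ref() is obj)      # True: the object is still alive

del obj                  # the only strong reference disappears;
                         # reference counting frees the object immediately
print(ref())             # None: the weak reference could not keep it alive
```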