Compact ordered dict (KLab meeting)

New dict implementation in Python 3.6

Inada Naoki (@methane)

About me

@methane

K-Labo, KLab Inc.

Python core developer

C, Go, Network (server) programming, MySQL clients

ISUCON 6 winner (See http://isucon.net/ )

Table of contents

● dict in Python
● Python 3.5 implementation
● Python 3.6 implementation
● Toward Python 3.7

Dict in Python

Dict

Key-value storage, a.k.a. associative array, map, or hash.

x = {"foo": 42, "bar": 84}

print( x["foo"] ) # => 42

Key features:

● Constant time lookup
● Amortized constant time insertion
● Support for custom (user-defined) key types
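The third feature can be illustrated with a small sketch. The `Point` class below is a hypothetical example, not something from the talk: a user-defined type becomes usable as a dict key once it defines `__eq__` and a consistent `__hash__`.

```python
class Point:
    """A user-defined key type: equal points must have equal hashes."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Hash the same fields that __eq__ compares.
        return hash((self.x, self.y))


names = {Point(0, 0): "origin"}
names[Point(1, 2)] = "p"        # insertion
found = names[Point(0, 0)]      # lookup with an equal, but distinct, object
```

The lookup succeeds because the dict compares the hash first and then falls back to `__eq__`, so two distinct objects with the same coordinates count as the same key.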

Dicts are everywhere in Python

x = 5                # the global namespace is a dict; this inserts 'x' into it
def add(a):          # inserts 'add' into the global dict
    return a + x     # looks up 'x' in the global dict
print(add(7))        # searches for 'print' and 'add' in the global dict ('print' is then found in builtins)
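A minimal way to see that a global namespace really is a dict is to run the same code with an explicit dict as its globals. The names `source` and `namespace` are introduced here only for illustration.

```python
source = """
x = 5
def add(a):
    return a + x
"""

namespace = {}
exec(source, namespace)            # run the code with 'namespace' as its global dict

before = namespace["add"](7)       # add() looks up 'x' in namespace
namespace["x"] = 10                # a plain dict store rebinds the global 'x'
after = namespace["add"](7)
```

Every global read inside `add` is a dict lookup on `namespace`, which is why lookup speed matters so much for the interpreter as a whole.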

There are many dicts in a Python program.

Lookup speed is critical.

Insertion speed and memory usage are very important too.

Python 3.5 implementation

The hash table has 8 slots (0 to 7); each entry stores a key, its hash, and a value.

d["foo"] = "spam"   # insert new item

hash("foo") = 42    # hash value is 42
42 % 8 = 2          # hash value % hash table size = 2

Slot 2 now holds: "foo", 42, "spam"

d["bar"] = "ham"

hash("bar") = 52
52 % 8 = 4

Slot 4 now holds: "bar", 52, "ham"

d["baz"] = "egg"

hash("baz") = 58
58 % 8 = 2          # "baz" conflicts with "foo"

"Open addressing" uses another slot in the table. (Another strategy is "chaining".)

For example, the "linear probing" algorithm uses the next entry.
※ Python uses more complex probing, but this example uses the simpler way.

Table after the insertion:

slot 2: "foo", 42, "spam"
slot 3: "baz", 58, "egg"
slot 4: "bar", 52, "ham"
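The insertion steps above can be sketched in a few lines. This is a simplified illustration, not CPython's code: it uses linear probing as in the slides, and `SLIDE_HASH` hard-codes the example hash values 42, 52, and 58 (real `hash()` results differ).

```python
# Hypothetical hash values taken from the slides (real hash() values differ).
SLIDE_HASH = {"foo": 42, "bar": 52, "baz": 58}
TABLE_SIZE = 8


def insert(table, key, value):
    """Insert (key, hash, value) using open addressing with linear probing."""
    h = SLIDE_HASH[key]
    slot = h % TABLE_SIZE
    for _ in range(TABLE_SIZE):
        entry = table[slot]
        if entry is None or entry[0] == key:   # free slot, or same key: store here
            table[slot] = (key, h, value)
            return slot
        slot = (slot + 1) % TABLE_SIZE         # collision: try the next slot
    raise RuntimeError("table is full")


table = [None] * TABLE_SIZE
slots = [insert(table, k, v)
         for k, v in [("foo", "spam"), ("bar", "ham"), ("baz", "egg")]]
```

With these hash values, "foo" lands in slot 2, "bar" in slot 4, and "baz" collides at slot 2 and moves on to slot 3.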

del d["foo"]

hash("foo") = 42
42 % 8 = 2

If slot 2 were simply emptied, the table would hold only:

slot 3: "baz", 58, "egg"
slot 4: "bar", 52, "ham"

x = d["baz"]

hash("baz") = 58
58 % 8 = 2 (!!?)    # slot 2 is empty, so the lookup would stop without finding "baz"

Therefore del d["foo"] leaves a DUMMY key in slot 2:

slot 2: DUMMY
slot 3: "baz", 58, "egg"
slot 4: "bar", 52, "ham"

x = d["baz"]

hash("baz") = 58
58 % 8 = 2 (conflict with dummy, then linear probing)
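The role of the DUMMY key can be sketched as follows. `ProbingTable` and `SLIDE_HASH` are names made up for this illustration, and the probing is the simplified linear version from the slides rather than Python's real probing.

```python
SLIDE_HASH = {"foo": 42, "bar": 52, "baz": 58}   # hypothetical hashes from the slides
DUMMY = object()                                  # marker left behind by deletions


class ProbingTable:
    def __init__(self, size=8):
        self.slots = [None] * size

    def _probe(self, key):
        """Yield slot numbers in linear-probing order, starting at hash % size."""
        size = len(self.slots)
        start = SLIDE_HASH[key] % size
        for i in range(size):
            yield (start + i) % size

    def insert(self, key, value):
        # Simplification: assumes the key is not already present.
        for slot in self._probe(key):
            if self.slots[slot] is None or self.slots[slot] is DUMMY:
                self.slots[slot] = (key, SLIDE_HASH[key], value)
                return
        raise RuntimeError("table is full")

    def find(self, key):
        for slot in self._probe(key):
            entry = self.slots[slot]
            if entry is None:              # a truly empty slot ends the search
                break
            if entry is not DUMMY and entry[0] == key:
                return slot
        raise KeyError(key)

    def delete(self, key):
        self.slots[self.find(key)] = DUMMY   # keep the probe chain intact

    def lookup(self, key):
        return self.slots[self.find(key)][2]


t = ProbingTable()
for k, v in [("foo", "spam"), ("bar", "ham"), ("baz", "egg")]:
    t.insert(k, v)
t.delete("foo")
value = t.lookup("baz")   # probes slot 2 (DUMMY), then finds "baz" in slot 3
```

Had `delete` stored `None` instead of `DUMMY`, the search for "baz" would stop at slot 2 and raise `KeyError`, which is the problem the "(!!?)" slide points at.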

Problems in classical open addressing hash table

● Large memory usage
  ○ At least 1/3 of entries are empty
    ■ Otherwise, "probing" can be too slow
  ○ One entry uses 3 words
    ■ word = 8 bytes on recent machine
  ○ minimum size = 192 bytes
    ■ 8 (byte/word) * 3 (word/entry) * 8 (table width)

Python 3.6 implementation

Compact and ordered dict

PyPy implemented it in 2015: https://morepypy.blogspot.jp/2015/01/faster-more-memory-efficient-and-more.html

The Python 3.6 dict is almost the same as PyPy's.

Ruby 2.4 and PHP 7 have similar ones.

Entries (key, hash, value) are stored densely in insertion order; a separate index table with 8 slots (0 to 7) maps each hash slot to an entry position.

d["foo"] = "spam"   # hash("foo") = 42, 42 % 8 = 2

index:   slot 2 -> 0
entries: 0: "foo", 42, "spam"

d["foo"] = "spam"
d["bar"] = "ham"    # hash("bar") = 52, 52 % 8 = 4

index:   slot 2 -> 0, slot 4 -> 1
entries: 0: "foo", 42, "spam"
         1: "bar", 52, "ham"

d["foo"] = "spam"
d["bar"] = "ham"
d["baz"] = "egg"
del d["foo"]

index:   slot 2 -> DUMMY, slot 3 -> 2, slot 4 -> 1
entries: 1: "bar", 52, "ham"
         2: "baz", 58, "egg"
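The layout above can be sketched as a sparse index list plus a dense entries list. This is an illustration under assumptions, not the CPython implementation: the `EMPTY` and `DUMMY` sentinel values, the `CompactTable` name, and the linear probing are all choices made for the sketch, and `SLIDE_HASH` reuses the example hashes from the slides.

```python
SLIDE_HASH = {"foo": 42, "bar": 52, "baz": 58}   # hypothetical hashes from the slides
EMPTY, DUMMY = -1, -2                             # assumed sentinel values for the index


class CompactTable:
    def __init__(self, size=8):
        self.index = [EMPTY] * size    # sparse: hash slot -> position in entries
        self.entries = []              # dense: (key, hash, value) in insertion order

    def _probe(self, key):
        size = len(self.index)
        start = SLIDE_HASH[key] % size
        for i in range(size):
            yield (start + i) % size   # linear probing, as in the slides

    def insert(self, key, value):
        # Simplification: assumes the key is new and the index has a free slot.
        for slot in self._probe(key):
            if self.index[slot] == EMPTY:
                self.index[slot] = len(self.entries)
                self.entries.append((key, SLIDE_HASH[key], value))
                return
        raise RuntimeError("table is full")

    def delete(self, key):
        for slot in self._probe(key):
            pos = self.index[slot]
            if pos == EMPTY:
                break
            if pos != DUMMY and self.entries[pos][0] == key:
                self.index[slot] = DUMMY
                self.entries[pos] = None   # leave a hole; other positions stay valid
                return
        raise KeyError(key)

    def keys(self):
        """Iterate the dense entries, which are already in insertion order."""
        return [e[0] for e in self.entries if e is not None]


t = CompactTable()
for k, v in [("foo", "spam"), ("bar", "ham"), ("baz", "egg")]:
    t.insert(k, v)
t.delete("foo")
```

After these operations `t.index` holds DUMMY, 2, 1 in slots 2 to 4, matching the last slide, and `t.keys()` walks only the dense list, so iteration never touches the empty index slots.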

● Less memory usage
  ○ Index can be 1 byte for small dict
  ○ 3*8*5 (entries) + 8 (index table) = 128 bytes
    ■ It was 192 bytes in legacy implementation
● Faster iteration (dense entries)
● Preserve insertion order
● (cons) One more indirect memory access
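The two size figures can be checked with a few lines of arithmetic. The constant names are introduced for this sketch; the numbers themselves (8-byte words, 3 words per entry, 8 slots, 5 dense entries, 1-byte index slots) come from the slides.

```python
WORD = 8            # bytes per word on a 64-bit machine
ENTRY_WORDS = 3     # key pointer, hash, value pointer
TABLE_WIDTH = 8     # slots in the smallest table

legacy_bytes = WORD * ENTRY_WORDS * TABLE_WIDTH          # 8 * 3 * 8

DENSE_ENTRIES = 5   # entries allocated for an 8-slot index (figure from the slides)
INDEX_BYTES = 1     # one byte per index slot for a small dict
compact_bytes = WORD * ENTRY_WORDS * DENSE_ENTRIES + INDEX_BYTES * TABLE_WIDTH

saving = 1 - compact_bytes / legacy_bytes                # fraction of memory saved
```

For the minimum-size table this works out to 128 bytes instead of 192, a saving of one third.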

New dict vs Legacy dict

Toward Python 3.7

Working on ...

● Remove redundant code that only optimized the legacy implementation.

● OrderedDict based on the new dict
  ○ Remove the doubly linked list used to keep order
  ○ About 1/2 memory usage!
  ○ Faster creation and iteration.
  ○ (cons) Slower .move_to_end() method
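As a rough illustration of why `.move_to_end()` might cost more without a linked list, here is how the operation could be expressed on a plain insertion-ordered dict. This is an assumption-laden sketch, not the planned OrderedDict code: the helper name `move_to_end` is a free function written for this example.

```python
def move_to_end(d, key):
    """Move an existing key to the end of an insertion-ordered dict."""
    d[key] = d.pop(key)   # delete, then re-insert at the end


d = {"foo": 1, "bar": 2, "baz": 3}
move_to_end(d, "foo")
order = list(d)
```

Each call is a deletion followed by an insertion, whereas a doubly linked list only has to relink a node.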

We're looking for new contributors

Contributing to Python is easier, thanks to GitHub.

● Read the devguide (https://devguide.python.org/ )
● Find an easy bug on https://bugs.python.org/ and fix it.
● Review others' code
● Translate documentation on Transifex
  ○ See https://docs.python.org/ja/

Future ideas

● Specialized dict for namespace
  ○ all keys are interned strings
  ○ only pointer comparison
  ○ no "hash" in entry -> more compact

● Implement set like dict
  ○ current set is larger than dict...

● functools.lru_cache
  ○ Use `od.move_to_end(key)`, instead of linked list
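The last idea can be sketched with `collections.OrderedDict`. The `LRUCache` class below is a hypothetical illustration of the `od.move_to_end(key)` approach, not the `functools.lru_cache` implementation.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.data = OrderedDict()

    def get(self, key, default=None):
        if key not in self.data:
            return default
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.maxsize:
            self.data.popitem(last=False)  # evict the least recently used item


cache = LRUCache(maxsize=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")          # "a" becomes the most recently used
cache.put("c", 3)       # evicts "b"
remaining = list(cache.data)
```

The dict order doubles as the recency order, so no separate linked list is needed to decide which item to evict.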

PEP 412: Key sharing dict

Introduced in Python 3.3

Instances of the same class can share a keys object.

class A:
    def __init__(self, a, b):
        self.foo = a
        self.bar = b

a = A("spam", "ham")
b = A("bacon", "egg")

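The class holds the shared keys object (index table and key/hash entries); each instance holds only its own values array.

index (class):        slot 2 -> 0, slot 4 -> 1
keys (class):         0: "foo", 42
                      1: "bar", 52
values (instance a):  "spam", "ham"
values (instance b):  "bacon", "egg"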
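The split between shared keys and per-instance values can be sketched in pure Python. `SharedKeys` and `SplitDict` are hypothetical names for this illustration; the real structure lives in C inside CPython and differs in detail.

```python
class SharedKeys:
    """Hypothetical stand-in for the keys object held by the class."""

    def __init__(self):
        self.positions = {}            # attribute name -> position in values arrays

    def position(self, name):
        # First use of a name assigns it the next position, for all instances.
        return self.positions.setdefault(name, len(self.positions))


class SplitDict:
    """Per-instance part: only a values array plus a reference to shared keys."""

    def __init__(self, shared):
        self.shared = shared
        self.values = []

    def __setitem__(self, name, value):
        pos = self.shared.position(name)
        if pos == len(self.values):
            self.values.append(value)  # next key in the shared order
        elif pos < len(self.values):
            self.values[pos] = value   # overwrite an existing attribute
        else:
            # A real implementation would stop sharing here; the sketch just refuses.
            raise ValueError("keys must be inserted in the shared order")

    def __getitem__(self, name):
        return self.values[self.shared.positions[name]]


keys = SharedKeys()
a = SplitDict(keys)
a["foo"] = "spam"
a["bar"] = "ham"
b = SplitDict(keys)
b["foo"] = "bacon"
b["bar"] = "egg"
```

Both instances store only a short list of values, while the key names and their positions exist once in `keys`.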

Problem

● Two instances can have different insertion orders
  ○ Drop key sharing dict?
    ■ Key sharing dict can save more memory.
      ● But __slots__ can be used for such cases!
    ■ Performance improvements in some microbenchmarks
      ● Does it matter in real cases? __slots__?
    ■ Needs consensus
      ● It's more difficult than the implementation

Keep key sharing dict support

● Only exactly the same insertion order is permitted
  ○ "skipped" keys are prohibited
  ○ deletion is also prohibited

● Otherwise, stop "key sharing"
  ○ `self.x = None` is faster than `del self.x`

Recommended
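In user code, the rules above suggest a simple pattern: assign every attribute in `__init__`, always in the same order, and reset attributes to `None` instead of deleting them. The `Connection` class is a made-up example, and the sketch only shows the pattern; it does not measure any speed or memory difference.

```python
class Connection:
    """Hypothetical class used only to illustrate the pattern."""

    def __init__(self, host):
        self.host = host
        self.socket = None          # always assign every attribute, in the same order

    def open(self):
        self.socket = "<socket to %s>" % self.host   # placeholder value

    def close(self):
        self.socket = None          # reset instead of `del self.socket`


c1 = Connection("a.example")
c2 = Connection("b.example")
c1.open()
c1.close()
same_layout = list(vars(c1)) == list(vars(c2))
```

Because no attribute is ever skipped or deleted, every instance keeps the same keys in the same order.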