The amazing world behind your ORM
Louise Grandjonc


Page 1: Conf orm - explain

The amazing world behind your ORM
Louise Grandjonc

Page 2: Conf orm - explain

Louise Grandjonc ([email protected])
Lead developer at Ulule (www.ulule.com)
Django developer - Postgres enthusiast
@louisemeta on Twitter

About me

Page 3: Conf orm - explain

1. How do we end up with performance problems?
2. How can we see them without roughly guessing how long you're waiting before seeing your page?
3. What does it change in our everyday developer job?

Today’s agenda

Page 4: Conf orm - explain

How do we end up with performance problems?

Page 5: Conf orm - explain

1. To annoy the DBAs
2. Because we can avoid having to worry about DB connections
3. We keep using our main language
4. We are a bit afraid of SQL
5. 90% of the time we don't really need to do more than really simple SELECTs and INSERTs, so why bother writing them worse than our ORM would?

Why do we use ORMs? (and why that’s not so terrible)

Page 6: Conf orm - explain

Not looking at what happens will cause performance problems, because…

1. The ORM executes queries that you might not expect
2. Your queries might not be optimised and you won't know about it
3. To make DBAs like you, even if you're using an ORM

Why we should know what our ORM is doing

Page 7: Conf orm - explain

How can we see them without roughly guessing how long you're waiting before seeing your page?

Page 8: Conf orm - explain

How can I see what is happening when I do stuff?

1. Django debug toolbar (to see queries and their EXPLAIN in your Django views)
   Advantages: can easily be included in your Django templates
   Problems: does not let you see everything (AJAX calls!); if you're working on an API, you cannot use it

2. Django devserver: puts all your database logs into your runserver output
   Advantages: you're not missing the AJAX calls
   (A Django logging sketch that gives similar output is shown below.)

3. Simply look at your database logs
   Advantages: you can see everything, you won't be disturbed if you ever change project/programming language/framework/computer, and you can configure how you see your logs
   Problems: you don't know where your logs are?
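Related to option 2: Django itself can dump every SQL query to your runserver console through its built-in 'django.db.backends' logger (only active when DEBUG = True). A minimal sketch for settings.py; the handler name is arbitrary, the logger name is Django's own:

# settings.py -- minimal sketch: print every SQL query to the console.
# The 'django.db.backends' logger is Django's built-in SQL logger;
# it only emits queries when DEBUG = True.
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {'class': 'logging.StreamHandler'},
    },
    'loggers': {
        'django.db.backends': {
            'handlers': ['console'],
            'level': 'DEBUG',
        },
    },
}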

Page 9: Conf orm - explain

Where are my logs?

owl_conference=# show log_directory ;
 log_directory
---------------
 pg_log
(1 row)

owl_conference=# show data_directory ;
     data_directory
-------------------------
 /usr/local/var/postgres
(1 row)

owl_conference=# show log_filename ;
       log_filename
-------------------------
 postgresql-%Y-%m-%d.log
(1 row)

Page 10: Conf orm - explain

Having good looking logs (and logging everything like a crazy owl)

owl_conference=# SHOW config_file;
                config_file
-----------------------------------------
 /usr/local/var/postgres/postgresql.conf
(1 row)

In your postgresql.conf:

log_filename = 'postgresql-%Y-%m-%d.log'
log_statement = 'all'
logging_collector = on
log_min_duration_statement = 0
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,host=%h,app=%a'

Page 11: Conf orm - explain

Having good looking logs

Your logs should look like:

user=owly,db=owl_conference,host=127.0.0.1,app=owl LOG: statement: SELECT "owl"."id", "owl"."name", "owl"."employer_name", "owl"."favourite_food", "owl"."job_id", "owl"."fur_color" FROM "owl" WHERE "owl"."job_id" = 1 LIMIT 10
user=owly,db=owl_conference,host=127.0.0.1,app=owl LOG: duration: 0.297 ms

The application_name you see in the prefix (app=owl) comes from your Django settings:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'owl_conference',
        'USER': 'owly',
        'PASSWORD': 'mouseEating',
        'HOST': '127.0.0.1',
        'OPTIONS': {'application_name': 'owl'},
    }
}

Page 12: Conf orm - explain

Yep! I've seen my logs… But… where are these queries executed in my code?

Django will always execute your queries when it needs to use the objects!

Let's take an example…

Page 13: Conf orm - explain

Example: the template

def index(request):
    owls = Owl.objects.filter(employer_name='Ulule')
    context = {'owls': owls}
    return render(request, 'owls/index.html', context)

SELECT "owl"."id", "owl"."name", "owl"."employer_name", "owl"."favourite_food", "owl"."job_id", "owl"."fur_color" FROM "owl" WHERE "owl"."employer_name" = 'Ulule'

{% for owl in owls %} <p> {{ owl.name }} </p> {% endfor %}

(The query is only executed when the template loops over the owls.)

Page 14: Conf orm - explain

Example: the view

def index(request):
    owls = Owl.objects.filter(employer_name='Ulule')
    owl_count = len(owls)
    context = {'owls': owls, 'owl_count': owl_count}
    return render(request, 'owls/index.html', context)

SELECT "owl"."id", "owl"."name", "owl"."employer_name", "owl"."favourite_food", "owl"."job_id", "owl"."fur_color" FROM "owl" WHERE "owl"."employer_name" = 'Ulule'

{% for owl in owls %} <p> {{ owl.name }} </p> {% endfor %}

(Here len(owls) evaluates the queryset, so the query is executed in the view, not in the template.)

Page 15: Conf orm - explain

Yep! I've seen my logs… But… where are these queries executed in my code?

How to spot where your query is executed?

1. Each model has a table to store its data. Find the model.

2. Where in my view or in my form am I using this model to get/filter objects?

3. Where am I using these objects? Is it in my view/form? Passed into the context and used in templates? (A quick way to check from a shell is sketched below.)
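If you want to check from a shell or a test which SQL a piece of code actually triggers, Django keeps the executed queries on the connection when DEBUG = True. A small sketch; the Owl queryset is just the example from the previous slides, and the import path of the model is hypothetical:

# Works in ./manage.py shell with DEBUG = True.
from django.db import connection, reset_queries
from owls.models import Owl   # hypothetical app path

reset_queries()                                    # start from a clean list
owls = Owl.objects.filter(employer_name='Ulule')   # lazy: nothing executed yet
print(len(connection.queries))                     # 0
names = [owl.name for owl in owls]                 # iterating evaluates the queryset
for query in connection.queries:                   # each entry has 'sql' and 'time'
    print(query['time'], query['sql'])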

Page 16: Conf orm - explain

What does it change in our everyday developer job?

(Or how to really do something when you have a problem)

Page 17: Conf orm - explain

The two most common problems of any ORM user…

1. I have way too many queries… Why?
2. One of my queries is freakin' slow… Why?

Page 18: Conf orm - explain

Once upon a time… 1000 times
The danger of loops in your code, and how your templates are making fun of you…

1. Preload stuff! The ORM executes queries when it needs the data: if you're looping over a foreign key without any preload, it will just query every time it needs the foreign key… Imagine you have a loop over 1 million objects. Use prefetch_related and select_related (see next slide).

2. In an ideal world, no query should ever be executed from your Django HTML template. All your data should be in your context; you should never have « surprise » queries coming from your templates!

Page 19: Conf orm - explain

Once upon a time… 1000 times
select_related or prefetch_related?

In Django, select_related and prefetch_related will help you lower your number of queries by preloading the foreign keys or many-to-many relations.

1. select_related uses a join (only for foreign keys):
   - Advantage: only one query
   - Problem: if you are joining big tables, with a lot of columns and no index, it can be slow… We'll talk about that next.

2. prefetch_related does a second query on your join table (for foreign keys and many-to-many):
   - Advantage: no big join
   - Problem: more queries

Page 20: Conf orm - explain

Example … 1

Without select_related:

def index(request):
    owls = Owl.objects.filter(employer_name='Ulule')
    context = {'owls': owls}
    for owl in owls:
        # do stuff
        owl.job
    return render(request, 'owls/index.html', context)

With select_related:

def index(request):
    owls = Owl.objects.filter(employer_name='Ulule').select_related('job')
    context = {'owls': owls}
    for owl in owls:
        # do stuff
        owl.job
    return render(request, 'owls/index.html', context)

Page 21: Conf orm - explain

Example … 1 Using select_related

owls = Owl.objects.filter(employer_name='Ulule').select_related('job')

SELECT "owl"."id", "owl"."name", "owl"."employer_name", "owl"."favourite_food", "owl"."job_id", "owl"."fur_color", "job"."id", "job"."name" FROM "owl" LEFT OUTER JOIN "job" ON ("owl"."job_id" = "job"."id") WHERE "owl"."employer_name" = 'Ulule'

Page 22: Conf orm - explain

Example … 1 Using prefetch_related

owls = Owl.objects.filter(employer_name='Ulule').prefetch_related('job')

SELECT "owl"."id", "owl"."name", "owl"."employer_name", "owl"."favourite_food", "owl"."job_id", "owl"."fur_color" FROM "owl" WHERE "owl"."employer_name" = 'Ulule'

SELECT "job"."id", "job"."name" FROM "job" WHERE "job"."id" IN (2)

Page 23: Conf orm - explain

One of my queries is super slow…

Let's talk about EXPLAIN!

Page 24: Conf orm - explain

What is EXPLAIN

Gives you the execution plan chosen by the query planner that your database will use to execute your SQL statement.

Using ANALYZE will actually execute your query! (Don't worry, you can ROLLBACK.)

EXPLAIN (ANALYZE) my super query;

BEGIN;
EXPLAIN ANALYZE my super query;
ROLLBACK;

Page 25: Conf orm - explain

Mmmm… Query planner?

The magical thing that generates execution plans for a query and calculates the cost of each plan.

The best one is used to execute your query (hopefully)

Page 26: Conf orm - explain

So, what does it look like?

Let's imagine a slow query… I'm trying to get all the owls working at Ulule (a super rare job for an owl).

Python version:

Owl.objects.filter(employer_name='Ulule')

DB version:

SELECT "owl"."id", "owl"."name", "owl"."employer_name", "owl"."favourite_food", "owl"."job_id", "owl"."fur_color" FROM "owl" WHERE "owl"."employer_name" = 'Ulule'

Page 27: Conf orm - explain

And…

owl_conference=# EXPLAIN ANALYZE SELECT * FROM owl WHERE employer_name='Ulule';

             QUERY PLAN
------------------------------------
 Seq Scan on owl  (cost=0.00..205.01 rows=1 width=35) (actual time=1.945..1.946 rows=1 loops=1)
   Filter: ((employer_name)::text = 'Ulule'::text)
   Rows Removed by Filter: 10000
 Planning time: 0.080 ms
 Execution time: 1.965 ms
(5 rows)

Page 28: Conf orm - explain

Let's go step by step! … 1. Costs

(cost=0.00..205.01 rows=1 width=35)
- cost=0.00: cost of retrieving the first row
- cost=..205.01: cost of retrieving all rows
- rows=1: number of rows returned
- width=35: average width of a row (in bytes)

(actual time=1.945..1.946 rows=1 loops=1)
- actual time: only shown if you use ANALYZE; gives you the real times
- loops: number of times your seq scan (index scan, etc.) was executed

Page 29: Conf orm - explain

Let's go step by step! … 2. Seq Scan

Seq Scan on owl ...
  Filter: ((employer_name)::text = 'Ulule'::text)
  Rows Removed by Filter: 10000

Scans the entire table and retrieves the rows that match your WHERE clause.

It's okay for small tables but can be very expensive… Do you need an index?
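If the answer is yes, in Django the usual way is to declare the index on the model and let a migration create it. A minimal sketch, assuming a hypothetical Owl model matching the columns in the queries above (field lengths are made up, and Django will generate its own index name, unlike the hand-made employer_name_owl index used on the next slide):

from django.db import models

class Owl(models.Model):
    name = models.CharField(max_length=100)
    # db_index=True makes makemigrations/migrate create a btree index on this
    # column, roughly: CREATE INDEX ... ON owl (employer_name);
    employer_name = models.CharField(max_length=100, db_index=True)
    favourite_food = models.CharField(max_length=100)
    fur_color = models.CharField(max_length=50)
    job = models.ForeignKey('Job', on_delete=models.CASCADE)  # FK columns are indexed by default

    class Meta:
        db_table = 'owl'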

Page 30: Conf orm - explain

Let's go step by step! … 3. Index Scan

What if there is an index on this column?

                   QUERY PLAN
-------------------------------------------------
 Index Scan using employer_name_owl on owl  (cost=0.29..8.30 rows=1 width=35) (actual time=0.034..0.034 rows=1 loops=1)
   Index Cond: ((employer_name)::text = 'Ulule'::text)
 Planning time: 0.387 ms
 Execution time: 0.066 ms
(4 rows)

The index is visited row by row in order to retrieve the data corresponding to your clause.

Page 31: Conf orm - explain

Let's go step by step! … 4. With an index and a really common value!

owl_conference=# EXPLAIN SELECT * FROM "owl" WHERE "owl"."employer_name" = 'post office';

                   QUERY PLAN
-------------------------------------------------
 Seq Scan on owl  (cost=0.00..205.01 rows=7001 width=35)
   Filter: ((employer_name)::text = 'post office'::text)
(2 rows)

For common values, it's quicker for the db to read all the data than to scan the index.

Page 32: Conf orm - explain

Let's go step by step! … 5. Bitmap Heap Scan
With an index and a common value (but not too common)

owl_conference=# EXPLAIN SELECT * FROM owl WHERE owl.employer_name = 'Hogwarts';

                   QUERY PLAN
-------------------------------------------------
 Bitmap Heap Scan on owl  (cost=47.78..152.78 rows=2000 width=35)
   Recheck Cond: ((employer_name)::text = 'Hogwarts'::text)
   ->  Bitmap Index Scan on employer_name_owl  (cost=0.00..47.28 rows=2000 width=0)
         Index Cond: ((employer_name)::text = 'Hogwarts'::text)
(4 rows)

Page 33: Conf orm - explain

Let's go step by step! … 5. Bitmap Heap Scan (continued)

Index Scan: goes through your index tuple-pointers one at a time and reads the data from the pages. Uses the index order.

Bitmap Heap Scan: orders the tuple-pointers in physical memory order and goes through them. Avoids little « physical jumps » between pages.

Page 34: Conf orm - explain

So we have 3 types of scans:

1. Sequential scan
2. Index scan
3. Bitmap heap scan

And now let's join stuff!

Page 35: Conf orm - explain

And now let's join stuff… Nested loops

owl_conference=# EXPLAIN ANALYZE SELECT * FROM owl JOIN job ON (job.id = owl.job_id) WHERE job.id=1;

                      QUERY PLAN
-------------------------------------------------------------
 Nested Loop  (cost=blabla) (actual time=blabla)
   ->  Seq Scan on job  (cost=blabla)
         Rows Removed by Filter: 6
   ->  Seq Scan on owl  (cost=blabla)
         Filter: (job_id = 1)
         Rows Removed by Filter: 1000
 Planning time: 0.150 ms
 Execution time: 3.663 ms
(9 rows)

Page 36: Conf orm - explain

And now let's join stuff… Nested loops

Used for little tables, can be slow.

[Diagram of a nested loop join. This image does not match the previous query ;)]
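For intuition only, a nested loop join behaves roughly like two nested Python loops: for every row of one relation, scan the other. A toy sketch with made-up dict rows (not how Postgres is actually implemented):

def nested_loop_join(outer_rows, inner_rows):
    # For every row of the outer relation, scan the whole inner relation.
    for outer in outer_rows:
        for inner in inner_rows:
            if outer['job_id'] == inner['id']:
                yield outer, inner

# Toy data, same shape as the owl/job tables in the slides (values invented):
owls = [{'id': 1, 'name': 'Hedwig', 'job_id': 1}, {'id': 2, 'name': 'Errol', 'job_id': 2}]
jobs = [{'id': 1, 'name': 'mail carrier'}]
print(list(nested_loop_join(owls, jobs)))   # only the owl with job_id=1 matches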

Page 37: Conf orm - explain

And now let's join stuff… Hash Join

owl_conference=# EXPLAIN ANALYZE SELECT * FROM owl JOIN job ON (job.id = owl.job_id) WHERE job.id>1;

                      QUERY PLAN
-------------------------------------------------------------
 Hash Join  (cost=1.17..318.70 rows=10001 width=56) (actual time=0.033..36.021 rows=1000 loops=1)
   Hash Cond: (owl.job_id = job.id)
   ->  Seq Scan on owl  (cost=blabla)
   ->  Hash  (cost=blabla)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on job  (cost=blabla)
               Filter: (id > 1)
               Rows Removed by Filter: 1
 Planning time: 0.235 ms
(10 rows)

Page 38: Conf orm - explain

And now let’s join stuff… Hash Join

The smaller table is the one that gets hashed, because the hash has to fit into memory.
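Same kind of toy sketch for a hash join: build a dict from the smaller relation keyed on the join column, then probe it in a single pass over the bigger one (assuming unique join keys on the small side; this is intuition, not Postgres internals):

def hash_join(small_rows, big_rows):
    # Build phase: hash the smaller relation on its join key (it has to fit in memory).
    hashed = {row['id']: row for row in small_rows}
    # Probe phase: a single pass over the bigger relation.
    for owl in big_rows:
        job = hashed.get(owl['job_id'])
        if job is not None:
            yield owl, job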

Page 39: Conf orm - explain

And now let's join stuff… Merge Join

owl_conference=# EXPLAIN ANALYZE SELECT * FROM owl JOIN job ON (job.id = owl.id) WHERE owl.id>1;

                      QUERY PLAN
-------------------------------------------------------------
 Merge Join  (cost=blabla)
   Merge Cond: (owl.id = job.id)
   ->  Index Scan using owl_pkey on owl  (cost=blabla)
         Index Cond: (id > 1)
   ->  Sort  (cost=blabla)
         Sort Key: job.id
         Sort Method: quicksort  Memory: 25kB
         ->  Seq Scan on job  (cost=blabla)
 Planning time: 0.453 ms
 Execution time: 0.102 ms
(10 rows)

Page 40: Conf orm - explain

And now let’s join stuff… Merge Join

Used for big tables; an index can be used to avoid sorting.
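One last toy sketch: a merge join walks two inputs that are already sorted on the join key in lockstep, which is why an index (or an explicit Sort node, as above) is needed on each side. Plain Python for intuition, assuming unique keys on each side:

def merge_join(left_rows, right_rows):
    # Both inputs must already be sorted on the join key ('id' here).
    left_iter, right_iter = iter(left_rows), iter(right_rows)
    left, right = next(left_iter, None), next(right_iter, None)
    while left is not None and right is not None:
        if left['id'] == right['id']:
            yield left, right
            left, right = next(left_iter, None), next(right_iter, None)
        elif left['id'] < right['id']:
            left = next(left_iter, None)   # advance the side that is behind
        else:
            right = next(right_iter, None)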

Page 41: Conf orm - explain

So we have 3 types of joins:

1. Nested loop
2. Hash join
3. Merge join

And a last word about ORDER BY (last part, I swear!)

Page 42: Conf orm - explain

And now let's order stuff…

owl_conference=# EXPLAIN ANALYZE SELECT * FROM owl ORDER BY owl.job_id, owl.favourite_food;

                      QUERY PLAN
-------------------------------------------------------------
 Sort  (cost=844.47..869.47 rows=10001 width=35) (actual time=7.252..8.090 rows=10001 loops=1)
   Sort Key: job_id, favourite_food
   Sort Method: quicksort  Memory: 1166kB
   ->  Seq Scan on owl  (cost=0.00..180.01 rows=10001 width=35) (actual time=0.017..1.181 rows=10001 loops=1)
 Planning time: 0.142 ms
 Execution time: 8.665 ms
(6 rows)

Everything is sorted in memory (which is why it can be costly).

Page 43: Conf orm - explain

And now let's order stuff… With an index

owl_conference=# EXPLAIN ANALYZE SELECT * FROM owl ORDER BY owl.job_id, owl.favourite_food;

                      QUERY PLAN
-------------------------------------------------------------
 Index Scan using owl_job_id_favourite_food on owl  (cost=0.29..544.66 rows=10001 width=35) (actual time=0.016..2.835 rows=10001 loops=1)
 Planning time: 0.098 ms
 Execution time: 3.510 ms
(3 rows)

It simply uses the index order.
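In Django, a composite index like the owl_job_id_favourite_food one used above can be declared on the model. A minimal sketch extending the hypothetical Owl model from earlier (index_together is the classic way to do it; the migration will generate its own index name, different from the hand-made one in this slide):

from django.db import models

class Owl(models.Model):
    name = models.CharField(max_length=100)
    employer_name = models.CharField(max_length=100, db_index=True)
    favourite_food = models.CharField(max_length=100)
    fur_color = models.CharField(max_length=50)
    job = models.ForeignKey('Job', on_delete=models.CASCADE)

    class Meta:
        db_table = 'owl'
        # Creates a btree index on (job_id, favourite_food); the planner can
        # then return rows already in ORDER BY job_id, favourite_food order.
        index_together = [('job', 'favourite_food')]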

Page 44: Conf orm - explain

And now let's order stuff… ORDER BY LIMIT

owl_conference=# EXPLAIN ANALYZE SELECT name, employer_name FROM owl ORDER BY name LIMIT 10;

                      QUERY PLAN
-------------------------------------------------------------
 Limit  (cost…) (actual time…)
   ->  Sort  (cost…) (actual time…)
         Sort Key: name
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on owl  (cost=0.00..180.01 rows=10001 width=16) (actual time=0.032..5.856 rows=10002 loops=1)
 Planning time: 0.201 ms
 Execution time: 15.846 ms
(7 rows)

Like with quicksort, all the data still has to be read… so why is the memory used so much smaller?

Page 45: Conf orm - explain

Top-N heap sort

- A heap (a sort of tree) is used, with a bounded size
- For each row:
  - If the heap isn't full, the tuple is added at the right place
  - If the heap is full and the value is smaller (for ASC) than the current values:
    - The tuple is inserted at the right place and the last value is popped
  - Else the value is discarded

(A toy sketch of this follows.)
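A toy Python illustration of the same idea, using the standard heapq module (this shows the bounded-heap mechanics, not Postgres's actual implementation):

import heapq

def top_n_ascending(values, n):
    """Return the n smallest values while keeping at most n items in memory."""
    heap = []  # max-heap of the n smallest seen so far (stored negated, heapq is a min-heap)
    for value in values:
        if len(heap) < n:
            heapq.heappush(heap, -value)       # heap not full: insert
        elif value < -heap[0]:
            heapq.heapreplace(heap, -value)    # smaller than the current worst: replace it
        # else: discard the value
    return sorted(-v for v in heap)

print(top_n_ascending([5, 1, 9, 3, 7], 3))  # [1, 3, 5]
# The standard-library shortcut for the same result: heapq.nsmallest(3, values)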

Page 46: Conf orm - explain

Top-N heap sort

[Diagram: the data to order, and the heap after iterations 1, 2, 3 … and iteration 10]

Page 47: Conf orm - explain

Top-N heap sort: example (if it wasn't clear…)

[Diagram: inserting a new smaller value, Potter eliminated (Voldy's dream)]

[Diagram: the heap at the end, after sorting everything]

Page 48: Conf orm - explain

Be careful when you ORDER BY!

1. Sorting on a sort key without a LIMIT or an index can be heavy

2. You might need an index; only EXPLAIN will tell you

Page 49: Conf orm - explain

Conclusion

Page 50: Conf orm - explain

Conclusion

- Looking at your DB logs, whatever your favourite solution is, will help you build a website with good performance

- Always know where your queries come from

- Be careful with loops! Use prefetch_related and select_related to avoid O(n) queries

- If you have a slow query, there is no magical solution: look into EXPLAIN to understand what's going wrong and find a solution

Page 51: Conf orm - explain

Thank you for your attention!

Any questions?

Owly design: zimmoriarty (https://www.instagram.com/zimmoriarty/)

Page 52: Conf orm - explain

To go further - sources

Owly design: zimmoriarty (https://www.instagram.com/zimmoriarty/)

https://momjian.us/main/writings/pgsql/optimizer.pdf

https://use-the-index-luke.com/sql/plans-dexecution/postgresql/operations

http://tech.novapost.fr/postgresql-application_name-django-settings.html