
Ramblings on optimizations, anti patterns and N+1

A lot of people ask me to teach them how to do query analysis and performance tuning. The truth is: there isn’t a script to follow. The following paragraphs are a brain dump of what usually goes through my mind when I am debugging and analyzing.

Please comment on what you think I should focus on to cover here.

TL; DR;

  • It’s just a messy post with database-y stuff
  • This post doesn’t have a conclusion, it is just me laying down my thoughts on performance and optimizations.

Thoughts

Query performance is a really difficult subject to talk about, mostly because SQL is a declarative language: it leaves it up to the optimizer to decide the best way to retrieve the information needed, and that decision is based on many variables.

The most common optimization problem I see comes not from the database itself but from how we handle requests in the application layer. The following, for instance, would cause an N+1 problem:

Code example:


users = User.all
users.each do |user|
  puts "Name #{user.name}"
  puts "Addresses: "
  user.addresses.each do |address|
    puts address.street
    puts "#{address.city} – #{address.state}"
  end
end

Although seemingly innocent at first, this code can easily slow down the database because of the number of queries it makes: one to load the users, plus one more per user to load that user’s addresses.
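To make the query count concrete, here is a minimal sketch in plain Ruby, with no Rails involved. `FakeDb` and its methods are hypothetical stand-ins that only count how many queries would be issued, and the users and addresses are invented sample data. In a real Rails app the usual fix is eager loading, for example `User.includes(:addresses)`.

```ruby
# Hypothetical in-memory stand-in for a database that counts queries.
class FakeDb
  attr_reader :query_count

  def initialize(users, addresses)
    @users = users
    @addresses = addresses
    @query_count = 0
  end

  # Stands in for: SELECT * FROM users
  def all_users
    @query_count += 1
    @users
  end

  # Stands in for: SELECT * FROM addresses WHERE user_id = ?
  def addresses_for(user_id)
    @query_count += 1
    @addresses.select { |address| address[:user_id] == user_id }
  end

  # Stands in for: SELECT * FROM addresses WHERE user_id IN (...)
  def addresses_for_all(user_ids)
    @query_count += 1
    @addresses.select { |address| user_ids.include?(address[:user_id]) }
              .group_by { |address| address[:user_id] }
  end
end

USERS = (1..100).map { |id| { id: id, name: "User #{id}" } }
ADDRESSES = USERS.map { |user| { user_id: user[:id], street: "#{user[:id]} Main St" } }

# N+1: one query for the users, then one more per user.
def n_plus_one_queries
  db = FakeDb.new(USERS, ADDRESSES)
  db.all_users.each { |user| db.addresses_for(user[:id]) }
  db.query_count
end

# Eager loading: one query for the users, one for all of their addresses.
def eager_loading_queries
  db = FakeDb.new(USERS, ADDRESSES)
  users = db.all_users
  by_user = db.addresses_for_all(users.map { |user| user[:id] })
  users.each { |user| by_user.fetch(user[:id], []) }
  db.query_count
end

puts "N+1: #{n_plus_one_queries} queries"
puts "Eager loading: #{eager_loading_queries} queries"
```

With 100 users, the first approach issues 101 queries and the second issues 2, no matter how many users there are.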

You also need to know the intricacies of indexes. Which type is best? If you have a composite index, which column should go first? What happens if I only use one of the fields of a two-column index in my search? Does it still use the index somehow? Another rule of thumb is that a single-column BTREE index can be used for either ASC or DESC ordering.
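One way to build intuition for the composite index question is to picture a B-tree index on (last_name, first_name) as a list of entries sorted by the first column and then by the second. The sketch below is plain Ruby, not database code, and the column names and data are made up for illustration. It shows why a search on the leading column can binary-search the entries while a search on only the second column has to look at every one of them. What a real optimizer does with such a query depends on the database and its version.

```ruby
# A composite index on (last_name, first_name), modeled as a sorted array.
INDEX = [
  %w[Smith John],
  %w[Lopez Zoe],
  %w[Adams Maria],
  %w[Smith Ana],
  %w[Lopez Ana]
].sort.freeze

# Leading column: entries are ordered by last_name, so binary search works.
def find_by_last_name(last_name)
  start = INDEX.bsearch_index { |(last, _)| last >= last_name }
  return [] if start.nil?

  INDEX.drop(start).take_while { |(last, _)| last == last_name }
end

# Second column only: matches are scattered, so every entry is inspected.
def find_by_first_name(first_name)
  INDEX.select { |(_, first)| first == first_name }
end

p find_by_last_name("Lopez")
p find_by_first_name("Ana")
```

The two "Lopez" entries sit next to each other in the sorted list, while the two "Ana" entries are separated by other rows, which is the reason the second lookup cannot jump straight to them.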

Or better yet: why are my transactions taking so long to complete? Does the table have too many indexes? Is any other query locking table X?

Even a single INNER JOIN can be highly costly when joining two large tables.

Why are you saving that JSON in a TEXT field? Since we are on the subject, do you really need the JSON in the relational database and not in a document store?

You don’t need to port all your data from PostgreSQL/MySQL to MongoDB if you want to have MongoDB on your stack. Everything has its place: relational data on relational databases and non-relational data on non-relational databases. I even find benchmarks between a SQL database and a NoSQL one unfair. They were made to solve different problems, so you can’t possibly have the same use case for both of them.

No, it’s not ok to have category_1, category_2, ..., category_n as columns on your products table.

Avoid nullable fields as much as possible.

Relationships should also explicitly live on the RDBMS, not only on your model. If you have a user_id on your addresses table, tell the database so. Naming it user_id doesn’t automatically create the foreign key.

You need:


ALTER TABLE addresses ADD CONSTRAINT `fk_user_id` FOREIGN KEY (`user_id`) REFERENCES users (`id`);

Or your migration should look something like this:


class CreateUsers < ActiveRecord::Migration[5.1]
  def change
    create_table :users do |t|
      t.string :name
      t.timestamps
    end
  end
end

class CreateAddresses < ActiveRecord::Migration[5.1]
  def change
    create_table :addresses do |t|
      t.text :street
      t.string :city
      t.string :state
      t.string :zipcode
      # bigint so the type matches users.id, which Rails 5.1 creates as BIGINT
      t.bigint :user_id
      t.timestamps
    end
    add_foreign_key :addresses, :users
  end
end

The add_foreign_key :addresses, :users call adds a foreign key on the addresses table that references users.

End

And you, what do you think is missing in this blog post? What do you want to go deeper on?

Congress, who is? – A Civic Tech project

A while ago I had this idea for a project: To show how representatives voted, either for or against, on bills.

People elect representatives but often forget to follow what they are up to. I asked around: who is your representative? The most common response: I don’t know. If people don’t even know who their representatives are, how are they going to contact their House or Senate member when they want to be heard?

That’s when Congress, who is? was born, out of a two-week project that I poured myself into, working with the ProPublica Congress API, the Twilio API and a bit of the Twitter API (those pictures must come from somewhere!).

People are able to search by zip code to find their representative, or filter by State/Territory, Party, House or Name. Once in a member’s profile, you can make a call directly from your browser to the member’s office.

Some images from https://www.congresswhois.com

The USA map is rendered showing a simple majority of the representation of the House. On click the listing of representatives is shown on the right.

It’s also possible to compare statistics from one politician to another: see how often each votes with their party and how often they vote the same way as each other.

Features to come

  • Show beyond the current Congress. At this moment the Congress number is 115, and the API can show me members from 102 to 115 for the House and from 80 to 115 for the Senate.
  • Show bills and votes
  • Add full text search
  • More to be defined

Code

Code will be released under the MIT license. There is some cleaning up to do, and I want to open source it with a few issues already opened and documented. As I said, the app was developed in two weeks, but it grew on me and I want to take it a step further.

Stack

Backend:

  • Ruby on Rails
  • PostgreSQL

Frontend:

  • React
  • Redux
  • Semantic UI

Contributions

Right now the code is running in a “closed” beta, if you can’t wait and want to help, DM me on Twitter (no need to follow back, DMs are open on my end), or use this website contact form, or simply mail me at gabriela.io.

Thank you

I want to give a special thanks to Twilio. During this year’s PHPWorld they hosted a competition to showcase projects using Twilio. I showcased this project and they awarded it a generous amount of credits, enough for us to keep it running for a while. So thank you for the support!

Disclaimers

Calls only work on Chrome, Firefox and Safari for Desktop. The client call doesn’t work on mobile, Internet Explorer or Opera. It’s more a technical limitation of how each browser implements its JavaScript than an application-level issue.

The data displayed may be incorrect. That is because it is synced daily with the ProPublica API: whatever they have on record is what I am showing.

Transferring ownership of repositories on GitHub

For the past couple months, I’ve been studying. As a side effect, my GitHub account was cluttered with code that is experimental. I didn’t exactly want to trash the experimental code. I wanted to keep my code but also not specifically keep it under my profile.

The solution I found was to create an organization and transfer my desired repositories there. I thought this was a great solution, there was just one little detail I was missing: the current GitHub API does not support repository ownership transfer. Which meant I would have to go to each repository, click on “Settings”, click on “Transfer”, fill in the “Repository Name” and then put the “Destination User”. A lot of steps for someone looking to move over 250 repositories.

The first thing that came to my mind was to use Selenium to automate this task. But my lack of exposure to the technology made me think a bit outside the box. One of the things I learned these past months was Capybara. Capybara is used alongside RSpec, extending the test suite DSL. It mimics user interaction with the browser and comes with a Selenium driver out of the box.

In other words, you can create a bot to open the browser, fill in forms and submit them. This is exactly what I was looking for. As I stated before, Capybara uses RSpec, so my code would actually have to be wrapped inside a test.

Caveats

  • You need to disable two-factor authentication on GitHub, and after the script finishes running set it back on.
  • It does not transfer private repositories.
  • You need Mozilla’s geckodriver installed. If you use macOS, you can use brew to install it.

The Script

This was developed as a hack. Use at your own risk. You can download it at gabidavila/github-move-repositories. As of now this code gets all public repositories, unless the variable ONLY_FORKS is set to TRUE, and moves them to a destination user. Do not push your .env file. If you want to move only specific repositories, you will need to edit the code yourself. For more information, the README.md of this project is kept up to date.

Contributions are welcome if you feel you can help improve the tool. For example: add options of which repositories to move.

Enjoy!

Using Active Record migrations beyond SQLite

SQLite is really a good tool to set up quick proof of concepts and small applications; however it’s not the most robust solution on the market for working with relational databases. In the open source community two databases take the top of the list: PostgreSQL and MySQL.

I did a small project for my studies. I was using SQLite as I didn’t need much out of it. Curious, I decided to see how the application would behave on other databases and decided to try PostgreSQL and MySQL. I had two problems to solve, and this post is about the first one: how to deal with the migrations. They were as follows:

class CreateArtists < ActiveRecord::Migration
  def change
    create_table :artists do |t|
      t.string :name
      t.timestamps
    end
    add_index :artists, :name
  end
end

class CreateSongs < ActiveRecord::Migration
  def change
    create_table :songs do |t|
      t.string :title
      t.integer :artist_id
      t.timestamps
    end
    add_index :songs, :title
    add_foreign_key :songs, :artists
  end
end

Active Record automatically puts the field id in all of its tables, which is why it is omitted from the migrations.

In PostgreSQL it went smoothly: all the migrations ran without a hiccup. MySQL, however, gave me an error!

StandardError: An error has occurred, all later migrations canceled:

Column `artist_id` on table `songs` has a type of `int(11)`.
This does not match column `id` on `artists`, which has type `bigint(20)`.
To resolve this issue, change the type of the `artist_id` column on `songs` to be :integer. (For example `t.integer artist_id`).

Original message: Mysql2::Error: Cannot add foreign key constraint: ALTER TABLE `songs` ADD CONSTRAINT `fk_rails_5ce8fd4cc7`
FOREIGN KEY (`artist_id`)
REFERENCES `artists` (`id`)

The problem, beyond the hard-to-read generated constraint name fk_rails_5ce8fd4cc7, is that artist_id on my table was an INT. The first thing I checked was whether artists.id was UNSIGNED and whether my foreign key was also unsigned. They weren’t, and since both were signed, that wouldn’t throw an error. Looking more closely at the error message, I noticed that the type of my foreign key column did not match the type of the primary key on the other table. Little did I know that Active Record generates the id field not as an INT, but as a BIGINT.

I decided to go back and look at PostgreSQL, and to my surprise (I am still not sure why), PostgreSQL did allow the column type mismatch where MySQL threw an error.

To fix it, I had to change the migration as follows:

class CreateSongs < ActiveRecord::Migration
  def change
    create_table :songs do |t|
      t.string :title
      t.integer :artist_id, limit: 8
      t.timestamps
    end
    add_index :songs, :title
    add_foreign_key :songs, :artists
  end
end

Digging online, I found out how to create a bigint field with AR: t.integer :artist_id, limit: 8. According to the post, this would only work on MySQL, which it did, but I found it also worked with PostgreSQL (I tested MySQL 5.7 and Postgres 9.6).

The limit is used to set a maximum length for string types or number of bytes for numbers.

Why type matching is important

Think of an INT as an espresso cup and a BIGINT as a Starbucks Venti cup. Sure, you can pour an espresso into the Venti cup, but the full content of a Venti will never fit into an espresso cup.

In the specific domain I am working on, if I had a big list of artists and happened to have an artist whose ID was higher than 2,147,483,647 (the signed INT maximum in both PostgreSQL and MySQL), I would get an error when trying to insert it into the songs table, since an artist id can be up to 8 bytes (a maximum of 9,223,372,036,854,775,807) while an INT artist_id holds only 4.

Example:

Queen has its Artist id as: 21474836481 (which is a BIGINT)

Trying to insert “We Will Rock You” into songs with that value in the artist_id column:

INSERT INTO songs (title, artist_id, created_at, updated_at)
VALUES ('We will Rock you', 21474836481, now(), now());

We get:

********** Error **********

ERROR: integer out of range
SQL state: 22003
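The limits quoted above follow directly from the column width in bytes. As a small sketch in plain Ruby (the helper name `signed_range` is made up for this example), a signed integer of n bytes can hold values from -2^(8n-1) to 2^(8n-1) - 1:

```ruby
# Range of values a signed integer column of `bytes` bytes can hold.
def signed_range(bytes)
  bits = bytes * 8
  (-(2**(bits - 1)))..(2**(bits - 1) - 1)
end

INT_RANGE = signed_range(4)     # INT, what t.integer gives by default
BIGINT_RANGE = signed_range(8)  # BIGINT, what limit: 8 asks for

queen_id = 21_474_836_481

puts "INT max:    #{INT_RANGE.end}"
puts "BIGINT max: #{BIGINT_RANGE.end}"
puts "Fits in INT?    #{INT_RANGE.cover?(queen_id)}"
puts "Fits in BIGINT? #{BIGINT_RANGE.cover?(queen_id)}"
```

Queen’s id fits comfortably in 8 bytes but not in 4, which is exactly what the “integer out of range” error is reporting.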

This is the kind of problem we don’t usually notice in the beginning, and more often than not the application runs in production for years before it shows up. But it can happen, and will happen, if we don’t pay attention to foreign key types.

After that change, all the migrations ran smoothly. And I could actually move forward to the next problem (and post): Filtering a song title or artist name.

ActiveRecord: Has Many Through Through Relationship

Developers in general love when stuff works. Having a solution that can solve about 80% of your problems can leave time for you to deal with the other 20%.

But this post is not about Active Record vs. Data Mapper or anything like it. Each one has use cases where it applies best, and it is up to you (or your team) to decide which to use. Keep in mind that with Active Record (AR), domain concerns and persistence concerns are mixed together, while with Data Mapper (DM) they are kept separate.

Let’s talk about magic. How magical AR can be and how it can make your life easier. The beauty of programming is that two different individuals can reach the same result using different routes even if using the same tools. The convention over configuration that some frameworks like Laravel and Rails use makes everything feel so effortless, while actually under the hood, there is a lot going on.

Solving a code challenge

This week I was given the following schema on this code challenge:

Database Mapping

With four models:

  • Boat:
    • belongs to a Captain
    • has many records of BoatClassification
    • has many records of Classification through BoatClassification
  • Captain
    • has many records of Boat
  • BoatClassification
    • belongs to a Boat
    • belongs to a Classification
  • Classification
    • has many records of BoatClassification
    • has many records of Boat through BoatClassification

And here is the code in Ruby:

### app/models/boat.rb ###
class Boat < ActiveRecord::Base
  belongs_to :captain
  has_many :boat_classifications
  has_many :classifications, through: :boat_classifications
end

### app/models/captain.rb ###
class Captain < ActiveRecord::Base
  has_many :boats
end

### app/models/boat_classification.rb ###
class BoatClassification < ActiveRecord::Base
  belongs_to :boat
  belongs_to :classification
end

### app/models/classification.rb ###
class Classification < ActiveRecord::Base
  has_many :boat_classifications
  has_many :boats, through: :boat_classifications
end

The models were given to me as shown above, including the relationships. Stuff started easy, like:

Class: Boat -> Retrieve all boats without a Captain:

Boat.where(captain_id: nil) which translates to:

SELECT `boats`.*
FROM `boats`
WHERE `boats`.`captain_id` IS NULL

But then, stuff started to get a bit more complicated…

Class: Boat -> Retrieve all boats with three Classifications:

My thought: I’ve got this one! The code already showed me the has many through from Boat to Classification, so all I need to do is GROUP BY boats.id and all will be fine…

Boat.joins(:classifications).group("boats.id").having("count(classifications.id) = ?", 3)

Active Record saved me the trouble of writing the following query:

SELECT `boats`.*
FROM `boats`
INNER JOIN `boat_classifications`
ON `boat_classifications`.`boat_id` = `boats`.`id`
INNER JOIN `classifications`
ON `classifications`.`id` = `boat_classifications`.`classification_id`
GROUP BY `boats`.`id`
HAVING count(classifications.id) = 3

These are 8 lines of SQL translated into one line of Ruby!
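To see what the GROUP BY and HAVING clauses are doing, here is a rough in-memory equivalent in plain Ruby. It is only a sketch: the ids are invented sample data, the array stands in for the rows of the boat_classifications join table, and the method name is made up for this example.

```ruby
# Each pair is one row of boat_classifications: [boat_id, classification_id].
BOAT_CLASSIFICATIONS = [
  [1, 10], [1, 11], [1, 12],          # boat 1 has three classifications
  [2, 10],                            # boat 2 has one
  [3, 10], [3, 11], [3, 12], [3, 13]  # boat 3 has four
].freeze

def boat_ids_with_classification_count(rows, count)
  rows
    .group_by { |(boat_id, _)| boat_id }        # GROUP BY boats.id
    .select { |_, group| group.size == count }  # HAVING count(...) = count
    .keys
end

p boat_ids_with_classification_count(BOAT_CLASSIFICATIONS, 3)
```

Only boat 1 comes back: the rows are grouped per boat first, and the filter is applied to each group’s size rather than to individual rows, which is the difference between HAVING and WHERE.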

magic trick

Ok, I thought, we are going places with this. Give me one more!

Class: Captain -> Retrieve all Captains that pilot a specific Classification of a Boat

I knew the SQL code for this one! It’s easy when coding to end up doing chained queries with subqueries inside. I wanted to avoid that as much as possible, knowing that I would be able to solve the question with joins.

Reading the documentation, I saw that Rails’ Active Record gives us tools to avoid this kind of situation, and one of them is relationships. Looking at the classes and diagrams, it’s possible to say that Classification and Captain have a nested relationship. A has many through through, if you like. Yes, that’s “through” twice.

Diagram

One way to remember is to look at the model Captain and Boat:

class Captain < ActiveRecord::Base
  has_many :boats
end

class Boat < ActiveRecord::Base
  belongs_to :captain
  has_many :boat_classifications
  has_many :classifications, through: :boat_classifications
end

:boats is a relationship for Boat. This means I can do a join, (specifically a nested one):

Captain -> Boat -> BoatClassification -> Classification

Captain.joins(boats: {boat_classifications: :classification})

confused, oh wait!

Ok, that makes sense, through :boats I have access to :boat_classifications which in turn has access to the :classification relationship. But, :boats also has access to :classifications, making this possible:

Captain -> Boat ->> Classification

Captain.joins(boats: :classifications)

And we finally add the filter to the query:

Captain.joins(boats: :classifications).where(classifications: {name: 'Sailboat'})

Saving us from having to write this:

SELECT
`captains`.*
FROM
`captains`
INNER JOIN
`boats` ON `boats`.`captain_id` = `captains`.`id`
INNER JOIN
`boat_classifications` ON `boat_classifications`.`boat_id` = `boats`.`id`
INNER JOIN
`classifications` ON `classifications`.`id` = `boat_classifications`.`classification_id`
WHERE
`classifications`.`name` = 'Sailboat'

TL;DR;

  • Chaining query methods on a model’s class returns a relation for that same model, which is why you can keep chaining
  • Hashes are more used than you would imagine
  • Avoid subqueries
  • Magic happens through relationships (which saves you from the subqueries)
  • Putting things on diagrams is not a question of being fancy, but rather to be able to better visualize problems.

Bonus – Performance

If we had used subqueries for the last search, we would have:

SELECT
    `captains`.*
FROM
    `captains`
WHERE
    `captains`.`id` IN (SELECT
            `boats`.`captain_id`
        FROM
            `boats`
        WHERE
            `boats`.`id` IN (SELECT
                    `boat_classifications`.`boat_id`
                FROM
                    `boat_classifications`
                WHERE
                    `boat_classifications`.`classification_id` IN (SELECT
                            `classifications`.`id`
                        FROM
                            `classifications`
                        WHERE
                            `classifications`.`name` = 'Sailboat')));

Doing a query cost analysis on it, by adding an EXPLAIN at the beginning of the query, returns:

SubqueriesAR

But using the correct relationships we have:

InnerJoinsAR.png

Don’t worry much about the numbers; look at the colors instead. By using the existing foreign keys we avoid a full table scan on the tables. Even with a join of four tables, the query plan showed the subquery version to be about 30% slower than the one using the existing indexes and relationships.