Month: July 2016

How I became a Data Engineer

A reminiscence of a personal timeline of events; please excuse minor errors, and/or let me know if my memory has shuffled things around a bit.

My first job was at a company that had their own CRM and wanted a web version of some parts of the system. At that time, I only had “experience” with ASP and Microsoft Access (I know, tough). They wanted the web version in PHP; their reasoning, I think, was that they wanted the integration to happen directly in the database, with the web app writing straight into it. The CRM system was written in Delphi with Firebird. So I learned PHP and my first real database, which wasn’t MySQL (I don’t count MS Access as a DB). After that I got a job where MySQL was used. I was really fresh on MySQL and didn’t know about storage engines and such, so it was a bit weird learning about MyISAM (which didn’t have foreign keys, for instance).

After that I got a job at a huge multinational that had a project to migrate every Excel spreadsheet to a PHP program. VBA was heavily used there, and they had entire programs running on it. What they didn’t tell us was that it was cheaper for them to have a whole team of PHP developers building an internal system than to have the features built into their ERP. For “security” reasons no business logic could live inside the PHP code, so I had to write tons of Stored Procedures. They also had integrations with MS SQL Server; the workflow system used it together with a PHP tool called Scriptcase.

Another job I had was at a different multinational, where I had to write a program to read from various sources and store the results in a DB for later reporting. It was a rough sketch of an ETL (Extract, Transform, Load), but at the time I wasn’t yet well versed in data warehouse techniques. For that job I used PostgreSQL. At the same company we later built Magento white-label stores for our clients (other big companies), and they had to integrate with our ERP (unlike at my first job). That integration went through a Service Bus written in Java, and the ERP used Oracle as its DB.

One of my employers noticed my interest in database tasks and how quickly I became familiarised with the schema and our data architecture. Our main relational database had millions and millions of records, terabytes of data, and they created a Data Team to make it more efficient. We would create workers to sync data into Elasticsearch, for instance. Officially I was still a PHP Developer, but I was mainly curating the DB and writing workers in NodeJS (we needed the async behaviour, and indexing time was crucial for the business).

I could go on and on through every job I’ve had. Most of my jobs have been at big corporations, and that has put me into contact with many flavours of Relational Databases and NoSQL too (MongoDB, Dynamo and Cassandra being the main ones).

In the end, this kind of exposure has made me the “DB” person among my fellow engineers, and anything considered more than simple CRUD tends to fall into my lap. I’m happy with that.

That’s how I discovered that I wanted to work with data. I had no idea what the title for that would be, but I knew I didn’t want to be a DBA, since many of the infrastructure tasks involved didn’t sound fun to me. Once the opportunity to work officially as a Data Engineer appeared, I grabbed it with both hands, and I knew it was right.

Creating Migrations with Liquibase

Liquibase is a versioning tool for databases. Currently it’s on version 3.5 and is installed as a JAR. It has been on the market since 2006 and recently celebrated its 10th anniversary. Its feature list includes:

  • Code branching and merging
  • Multiple database types
  • Supports XML, YAML, JSON and SQL formats
  • Supports context-dependent logic
  • Generate Database change documentation
  • Generate Database “diffs”
  • Run through your build process, embedded in your application or on demand
  • Automatically generate SQL scripts for DBA code review
  • Does not require a live database connection

Why do you need it?

Some frameworks come with built-in migration solutions out of the box, like Eloquent and Doctrine. There is nothing wrong with using something like that when you have only one DB per project, but when you have multiple systems, it starts to get complicated.

Since Liquibase works as a versioning tool, you can branch and merge as needed (like you would with code in git). You have contexts, which means changes can be applied to specific environments only, and tagging capabilities allow you to perform rollbacks.
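
As a sketch of both features, a changeset can carry a context and use the tagDatabase change type (the id, tag and context values here are hypothetical):

{
  "changeSet": {
    "id": "tag_v1_0",
    "author": "gabidavila",
    "context": "testing",
    "changes": [
      {
        "tagDatabase": {
          "tag": "v1.0"
        }
      }
    ]
  }
}

A changeset with context testing only runs when Liquibase is invoked with --contexts=testing, and liquibase rollback v1.0 brings the database back to the state it had at that tag.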

A rollback is a tricky thing; you can either do an automatic rollback or define a script. Scripted rollbacks are useful when dealing with MySQL, for instance, where DDL changes are NOT transactional.
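
As an illustration, here is a sketch of a changeset with a handwritten rollback. The table and column names are hypothetical, and I’m assuming the JSON format accepts a raw SQL string for rollback the same way the YAML format does:

{
  "changeSet": {
    "id": "drop_legacy_flag",
    "author": "gabidavila",
    "changes": [
      {
        "dropColumn": {
          "tableName": "mydb_users",
          "columnName": "legacy_flag"
        }
      }
    ],
    "rollback": "ALTER TABLE mydb_users ADD legacy_flag tinyint(1) NOT NULL DEFAULT '0'"
  }
}

Liquibase can generate automatic rollbacks for changes such as createTable, but it has no way to know how to undo a dropColumn, so you provide the statement yourself.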

Guidelines for changelogs and migrations

  • MUST be written using the JSON format. Exceptions are changes/legacy/base.xml and changes/legacy/base_procedures_triggers.sql.
  • MUST NOT be edited. If a new column is to be added, a new migration file MUST be created, and that file MUST be included AFTER the last changeset that has already run.

Branching

There could be 3 main branches:

  • production (master)
  • staging
  • testing

Steps:

  1. Create your changelog branch;
  2. Merge it into testing;
  3. When the feature is ready for staging, merge it into staging;
  4. When the feature is ready for production, merge it into production (see the git sketch below).
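
As a concrete sketch of that flow in git (the branch name here is hypothetical):

$ git checkout -b add_email_column master
# ... commit your migration files ...
$ git checkout testing && git merge add_email_column
$ git checkout staging && git merge add_email_column
$ git checkout master && git merge add_email_column
$ git branch -d add_email_column

The last step reflects the rule below: custom branches are deleted once merged into production.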

Rules:

  • testing, staging and production DO NOT merge amongst themselves in any capacity;
  • DO NOT rebase the main branches;
  • Custom branches MUST be deleted after being merged into production.

The downside of this approach is that the branches can diverge. The current process is to compare the branches from time to time and manually check the diffs for unplanned discrepancies.
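
Liquibase itself can help with that check: the diff command compares two live databases and reports the differences. A minimal sketch, assuming hypothetical staging and production hosts (driver and classpath can come from the liquibase.properties file described below):

$ liquibase \
    --url="jdbc:mysql://staging-host:3306/mydb" \
    --username=root --password=123 \
    --referenceUrl="jdbc:mysql://production-host:3306/mydb" \
    --referenceUsername=root --referencePassword=123 \
    diff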

Procedures for converting a legacy database to Liquibase migrations

Some projects are complete monoliths where more than one application connects to the same database, which is not a good practice. If you are working with that sort of project, I recommend treating the database source as its own repository rather than keeping it together with your application.

Writing migrations

This is the way I found to keep the structure reasonably sensible. Suggestions are welcome.

Create the property file

It should live in the root of the project and be named liquibase.properties:

driver: com.mysql.jdbc.Driver
classpath: /usr/share/java/mysql-connector-java.jar:/usr/share/java/snakeyaml.jar
url: jdbc:mysql://localhost:3306/mydb
username: root
password: 123

The JAR files in the classpath can be downloaded manually or installed through the server package manager.

Create the Migration file

You can choose between different formats. I chose to use JSON. In this instance I will be running this SQL:

CREATE TABLE `mydb_users` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `username` varchar(25) CHARACTER SET utf8 DEFAULT NULL,
  `password` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
  `activated` tinyint(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Which will translate to this:

{
  "databaseChangeLog": [
    {
      "changeSet": {
        "id": "create_mydb_users",
        "author": "gabidavila",
        "changes": [
          {
            "createTable": {
              "tableName": "mydb_users",
              "columns": [
                {
                  "column": {
                    "name": "id",
                    "type": "int unsigned",
                    "autoIncrement": true,
                    "constraints": {
                      "primaryKey": true,
                      "nullable": false
                    }
                  }
                },
                {
                  "column": {
                    "name": "username",
                    "type": "varchar(25)",
                    "encoding": "utf8",
                    "constraints": {
                      "nullable": true
                    }
                  }
                },
                {
                  "column": {
                    "name": "password",
                    "type": "varchar(255)",
                    "encoding": "utf8",
                    "constraints": {
                      "nullable": true
                    }
                  }
                },
                {
                  "column": {
                    "name": "activated",
                    "type": "tinyint",
                    "defaultValueNumeric": 0,
                    "constraints": {
                      "nullable": false
                    }
                  }
                }
              ]
            }
          }
        ]
      },
      "modifySql": {
        "dbms": "mysql",
        "append": [
          {
            "value": " ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci"
          }
        ]
      }
    }
  ]
}

Is it verbose? Yes, completely, but in exchange you get a tool that can show you what the SQL will look like and manage the rollbacks for you.
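
For instance, to preview the generated SQL without applying anything (using the liquibase.properties file from earlier), the updateSQL command prints the statements instead of executing them:

$ liquibase --changeLogFile=changes/changelog.json updateSQL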

Save the file as:

.
/changes
- changelog.json
- create_mydb_users.json

Where changelog.json looks like this:

{
  "databaseChangeLog": [
    {
      "include": {
        "file": "changes/create_mydb_users.json"
      }
    }
  ]
}

For each new change, add a new include entry to the end of the databaseChangeLog array, as in the sketch below.
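
For example, after creating a second (hypothetical) migration file named add_email_to_mydb_users.json, the changelog would look like this:

{
  "databaseChangeLog": [
    {
      "include": {
        "file": "changes/create_mydb_users.json"
      }
    },
    {
      "include": {
        "file": "changes/add_email_to_mydb_users.json"
      }
    }
  ]
}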

Run it

To run, execute:

$ liquibase --changeLogFile=changes/changelog.json migrate

Don’t worry if you run it twice; each changeset is only applied once, since Liquibase records the changesets it has executed in the DATABASECHANGELOG table it creates in your database.
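
If you want to check which changesets are still pending before migrating, Liquibase also has a status command (same properties file assumed):

$ liquibase --changeLogFile=changes/changelog.json status --verbose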

Next post will cover how to add a legacy DB into Liquibase.

To go deeper into the Liquibase formats, see the official documentation.
