Tag: Data Engineering

I don’t know Ops, and that may be OK

I am a Software Engineer at heart. I started as such and worked with PHP for about 7 years, always correlating my work with data somehow until I got an opportunity and decided to follow my instincts and be a Data Engineer.

I didn’t turn a Data Engineer from one night to another. It was a process. I was lucky to have a boss that noticed my skills with data and decided to give me room to play with it.

But the Ops part, was never my forte and this is why.

Data Integrity

This is my main concern. I am more worried about keeping consistency as much as possible and even in many times choosing it over performance.

Another trait I have is to be always looking for logic errors that may generate bad data into an application. I despise badly written models and have had a bit of problem when working with RDBMS and ActiveRecord. My take on it is: if you have a complex business model, what is easy may become painful. Also, there is no silver bullet solution, you don’t need to use only one technology.

But I won’t go into what is better, what is better it is what it works for you and make you application works and don’t let your users down while maintaining your data integrity.

Value

Your software is not valuable 99% of the times. Your software it is a means to an end. It is a path to interpret business logic and generate value to your company. And if your data sucks (duplicated records, lack of foreign key checks, extreme denormalization in the main DB if relational) you may have no real value at all.

Why don’t I know it?

When I needed doing ops, managing database servers, they were all single servers, at most a read replica on AWS, or only one slave.

One can say I’ve never needed actually to know it. I got lucky having good people working with me on the DevOps team, and we trusted each other’s work, in the end the managers would prefer to use my abilities in another area.

As I said before I do know it’s an area I need to improve, but it is ok to not know it, because even with me not being an expert on it my value lies in understanding the data, the data model and how application handles on data. As a DBA main job usually is to keep the database servers healthy, mine is to keep data itself as a valuable as possible to a company.

Not a DBA

What I do it is many times considered a DBA job, there are a couple areas where both can overlap but I try as a Data Engineer to support the Developers and the Business as a DevOps person do.

You may notice on my posts: I don’t write about replication, cluster, etc. One reason is: I have never in real life have to deal with this particular area deeply. I do know I should know more about that. But one thing is to set it up a couple servers on the cloud with no real data to analyze and performance issues to attend than doing it in real life. However, that doesn’t lessen my value.

I know how to prepare data, I do ETL’s, I do data modeling, I do deep research on which storage would be the best for a case scenario, I help to define policies around migrations and data access, for instance.

My Goal

My goal is to help developers. It’s to help them do the right thing regarding to data as they do regarding with test coverage and code quality.

Everybody reviews code, I rarely seem people reviewing data models. And I want to help to create a culture where people see data as their true value. Remember this: data leaks are more valuable than “code” leaks and potentially more devastating too. So please, let me help you.

How I became a Data Engineer

A reminiscence of a personal timeline of events; please, excuse minor errors and/or let me know if my memory has shuffled things around a bit.

My first job was at a company that had their own CRM and they wanted a web version of some parts of the system. At that time, I only had “experience” with ASP and Microsoft Access (I know, tough). They wanted the web version in PHP, their reasoning, I think, being that they wanted the integration to run directly into the database. The web app would write directly into the DB. The CRM system was written using Delphi and Firebird. So I learned PHP and my first database, which wasn’t MySQL (I don’t count MS Access as a DB). After that I got a job in which MySQL was used. I was really fresh on MySQL, and I didn’t know about the engines and such, so it was a bit weird learning about MyISAM (which didn’t have foreign keys for instance).

After that I got a job in a huge multinational where they had this project migrating every Excel spreadsheet to a PHP program. VBA was heavily used there, and they had entire programs running into that. What they didn’t tell us was that it was cheaper for them to have a whole team of PHP developers doing an internal system than to have the features built into their ERP. For “security” reasons no business logic could be inside the PHP code, so I had to do tons of Stored Procedures. They also had integrations with MS SQL Server. The workflow system used it together with a PHP tool called Scriptcase.

Another job I had was with a different multinational, I had to do a program to read from various sources and store in a DB for later reports and all. It would be a “scratch” of an ETL (Extract, Transform, Load), but at the time I wasn’t yet well versed with data warehouse techniques. For that job I used PostgreSQL. In the same company we later on did Magento white label stores for our clients (other big companies), and it had to have integration with our ERP (unlike my first job). This integration was through a Service Bus written in Java and the ERP had Oracle as the DB.

One of my employers noticed my interest in database tasks, and how quickly I became familiarised with the schema and our data architecture. Our main relational database had millions and millions of records, terabytes and they created a Data Team to make it more efficient. We would create workers to sync data into Elasticsearch for instance. I was still a PHP Developer officially, but mainly curating the DB and doing workers with NodeJS (we needed the async and the indexing time was crucial for business).

I could go on and on through every job I’ve had. Most of my jobs have been at big corporations, and that has put me into contact with many flavours of Relational Databases and NoSQL too (MongoDB, Dynamo and Cassandra being the main ones).

In the end, this kind of exposure has made me the “DB” person among my fellow engineers, and everything considered more than a CRUD tends to fall in my lap, and I’m happy with that.

That’s how I discovered that I wanted to work with data. I didn’t have any idea of what kind of title that would be, but I knew I didn’t want to be a DBA, since much of the infrastructure tasks involved in it didn’t sound fun to me. Once the opportunity to work officially as a Data Engineer appeared, I grabbed it with both hands and I knew it was right.

Creating Migrations with Liquibase

Liquibase is a versioning tool for databases. Currently, it’s on version 3.5 and is installed as a JAR. It has been on the market since 2006, and recently completed its 10th anniversary. In its feature list we have:

Code branching and merging
Multiple database types
Supports XML, YAML, JSON and SQL formats
Supports context-dependent logic
Generate Database change documentation
Generate Database “diffs”
Run through your build process, embedded in your application or on demand
Automatically generate SQL scripts for DBA code review
Does not require a live database connection

Why you need it?

Some frameworks comes with built-in solutions out of the box like Eloquent and Doctrine. There is nothing wrong with using something like that when you have only one DB per project, but when you have multiple systems, it starts to get complicated.

Since Liquibase works as a versioning tool, you can branch and merge as needed (like you would with code in git). You have contexts, which means changes can be applied to specific environments only, and tagging capabilities allow you to perform rollbacks.

A rollback is a tricky thing; you can either do an automatic rollback or define a script. Scripted rollbacks are useful when dealing with MySQL, for instance, where DDL changes are NOT transactional.

Guidelines for changelogs and migrations

MUST be written using the JSON format. Exceptions are changes/legacy/base.xml and changes/legacy/base_procedures_triggers.sql.
MUST NOT be edited. If a new column is to be added, a new migration file must be created and the file MUST be added AFTER the last run transaction.

Branching

There could be 3 main branches:

production (master)
staging
testing

Steps:

Create your changelog branch;
Merge into testing;
When the feature ready to staging, merge into staging;
When the feature is ready, merge into production.

Example:

download

Rules:

testing, staging and production DO NOT merge amongst themselves in any capacity;
DO NOT rebase the main branches;
Custom branch MUST be deleted after merged into production.

The downside of this approach is the diverging state between the branches. Current process is to, from time to time, compare the branches and manually check the diffs for unplanned discrepancies.

Procedures for converting a legacy database to Liquibase migrations

Some projects are complete monoliths. More than one application connects to it, and this is not a good practice. If you are working with that sort of project, I recommend you treating the database sourcing as its own repository, and not together with your application.

Writing migrations

This is a way I found for keeping the structure reasonably sensible. Suggestions are welcome.

Create the property file

Should be in the root of the project and be named liquibase.properties:

driver: com.mysql.jdbc.Driver
classpath: /usr/share/java/mysql-connector-java.jar:/usr/share/java/snakeyaml.jar
url: jdbc:mysql://localhost:3306/mydb
username: root
password: 123

The JAR files in the classpath can be manually downloaded or installed though the server package manager.

Create the Migration file

You can choose between different formats. I chose to use JSON. In this instance I will be running this SQL:

	CREATE TABLE `mydb_users` (
	`id` int(11) NOT NULL AUTO_INCREMENT,
	`username` varchar(25) CHARACTER SET utf8 DEFAULT NULL,
	`password` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
	`activated` tinyint(1) NOT NULL DEFAULT '0',
	PRIMARY KEY (`id`)
	) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

view raw mydb_users.sql hosted with ❤ by GitHub

Which will translate to this:

	{
	"databaseChangeLog": [
	{
	"changeSet": {
	"id": "create_mydb_users",
	"author": "gabidavila",
	"changes": [
	{
	"createTable": {
	"tableName": "mydb_users",
	"columns": [
	{
	"column": {
	"name": "id",
	"type": "int unsigned",
	"autoIncrement": true,
	"constraints": {
	"primaryKey": true,
	"nullable": false
	}
	}
	},
	{
	"column": {
	"name": "username",
	"type": "varchar(25)",
	"encoding": "utf8",
	"constraints": {
	"nullable": true
	}
	}
	},
	{
	"column": {
	"name": "password",
	"type": "varchar(255)",
	"encoding": "utf8",
	"constraints": {
	"nullable": true
	}
	}
	},
	{
	"column": {
	"name": "activated",
	"type": "tinyint",
	"defaultValueNumeric": 0,
	"constraints": {
	"nullable": false
	}
	}
	}
	]
	}
	}
	]
	},
	"modifySql": {
	"dbms": "mysql",
	"append": [
	{
	"value": " ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci"
	}
	]
	}
	}
	]
	}

view raw liquibase_migration.json hosted with ❤ by GitHub

It is verbose? Yes, completely, but then you have a tool to show you what the SQL will look like and be able to manage the rollbacks.

Save the file as:

.
/changes
- changelog.json
- create_mydb_users.json

Where changelog.json looks like this:

	{
	"databaseChangeLog": [
	{
	"include": {
	"file": "changes/create_mydb_users.json"
	}
	}
	]
	}

view raw changelog.json hosted with ❤ by GitHub

For each new change you add it to the end of the databaseChangeLog array.

Run it

To run, execute:

$ liquibase –changeLogFile=changes/changelog.json migrate

Don’t worry if you run it twice, the change only happens once.

Next post will cover how to add a legacy DB into Liquibase.

To learn how to go deeper into Liquibase formats and documentation, access this link.

Save