In an upcoming screencast I’m doing with O’Reilly, I’ll be discussing what reinforcement learning is and how it applies. I figured I’d give you all a little behind-the-scenes look. Here’s a quick intro in 8 slides:
How to build a successful data team without hiring a unicorn

You can’t go into battle by yourself; it takes an army. The same is true of a successful data project. The fact that we’ve focused so much on finding unicorns, individuals who can do it all, is foreign to me. Personally, I can handle a lot of engineering tasks and scientific algorithms, but I’m not as good at visualizing things.
I was once on a team that had excellent engineers and data scientists on board. We spent six months working on beautiful solutions to problems that we thought the rest of the team needed. After six months we were called into the CEO’s office to present our results. The problem was we didn’t have any; we had a pile of beautiful but useless code and models. Long story short, that project was killed that day. Six months of work down the drain!
That’s why today I’m going to talk about putting together a successful data team: who should get on the bus? While the details might change depending on the specifics of your organization, there are four important archetypes on any data team.
Before I explain what the four archetypes are, let’s talk a little about what archetypes are in general. Carl Jung was the psychologist who came up with the modern interpretation of archetypes. He defined them as highly developed elements of the collective unconscious; basically, they’re traits that are universal. A common example is “the hero,” which has been chronicled for millennia. Nobody is 100% any one archetype. Nobody is 100% “hero” (even though they might want to be); the “hero” is an ideal pattern, not a description of any real person.
But what does Jungian psychology have to do with putting together a successful data team?
I believe strongly that there are four major archetypes to draw on when building a data team. They are the scientist, the storyteller, the domain expert, and the engineer.

Each of these archetypes fits into a quadrant along two axes: internal vs. external, and individual vs. group. The scientist is individually focused on what’s going on inside the data and what makes it behave the way it does, while the engineer is focused on moving the data around. The storyteller is about telling the story to the rest of the group, drawing on what the scientist and engineer find, while the domain expert knows both the data and the rest of the group inside and out.
All of these archetypes work in conjunction: taking the problem, breaking it down systematically, discovering insight, and finally presenting that back to the rest of the company.

While I’ll be covering each of these archetypes individually in future blog posts, here’s some overarching information about each.
The Domain Expert
The domain expert is in many ways the most overlooked. This is the project manager, the business owner, the person who gets the domain. Of course, you’re probably thinking that data science should be impartial. Well, yes and no. Left to themselves, machine learning and AI can become unconstrained goal optimization, which leads to tricky situations where we’re breaking the law or doing something unethical. The less extreme version is simply failing to follow the guidelines the rest of the company works under.
The Engineer
After the domain expert has explained the problem, it’s up to the engineer to break it down systematically. You probably already know who the engineer is on your team: the get-it-done personalities who are willing and able to move data, program up scripts, and deploy things to production. These are the make-it-happen kind of folks.
The Scientist
After the engineer comes the scientist, who is like the frosting on the cake. While what they do can seem magical, the real point of science on a data team is to uncover insights, whether through mathematical models, clever ways of rearranging data, or anecdotal models.
The Storyteller
Last, and probably the most overlooked yet important step, is the storyteller. This person’s job is to take what the domain expert, engineer, and scientist have uncovered and present it to the rest of the company or to customers. Without this last step, the project would stagnate. The best part is that the storyteller can also take feedback and hand it back to the domain expert.
Note: These Are Not Four Individual People
Realize that all of these archetypes can be blended. You can, for instance, have someone who is both a domain expert and a storyteller. You can also have an engineer-scientist. It becomes harder, though, to find someone who satisfies three. If you’re looking for someone who satisfies all four of these archetypes, then you’re looking for a unicorn.
This is just the beginning of understanding the archetypes of a successful data team. We still need to discuss each one individually, which is why I’ll be publishing a series of these articles. The next one will be about finding and training a “Domain Expert,” and it will be out two weeks from now.
How to run Zero-Downtime migrations on an RDBMS (like MySQL)
Being able to migrate your database is an essential part of the health of any application in production. Without the ability to refactor or rethink your data model, things will either become slow or technical debt will pile up.
Migrations are essential… but they can also cause the app to break.
Say you want to add a column to a 20-million-row table. In databases like MySQL, that will most likely lock the entire table for the duration of the ALTER command.
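Concretely, the statement in question is a single ALTER like this (the column itself is just a hypothetical example):

ALTER TABLE receipts ADD COLUMN notes TEXT;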
Locking tables could mean your clients are losing money, or that you’re depleting your own bank account; either way, someone will be angry. Never mind the political consequences of having your app go down randomly throughout the week.
That’s why zero-downtime migrations are so essential to your business and your development team. This post outlines the approaches to zero-downtime migrations that I’ve found useful.
In general, all of these can be achieved with some SQL or by relying on a tool like pt-online-schema-change. The basic idea remains the same across all four: limit the number of ALTER commands run against a live table.
The Basic Zero-Downtime Migration Algorithm
The basic procedure is (this is MySQL-specific, but the same idea applies to Postgres and other databases):
- Create a copy of the table schema based on the original: CREATE TABLE _receipts_new LIKE receipts.
- Add triggers to the original table.
- Chunk through all data in the original table and insert it into the new table.
- Rename the old table to _receipts_old and the new table to receipts.
- Drop the triggers.
1. Create a copy of the table schema
The idea here is that whatever we want to migrate (we’ll get into the specific types in a bit), we make the change on a copy of the table. This is where we add the column, rename it, or change its type.
For instance, say we want to remove `url` from receipts (because we’re going to generate those URLs dynamically instead):
CREATE TABLE _receipts_new LIKE receipts;
ALTER TABLE _receipts_new DROP COLUMN `url`;
2. Add triggers to catch new data
Triggers are the bread and butter of zero-downtime migrations. While we could easily migrate all the data from the old table into the new table right now, we would miss any writes that happen in the meantime. These triggers keep the new table up to date as the original table changes.
The three triggers below capture whenever a row in the original table is inserted, updated, or deleted.
CREATE TRIGGER `after_receipts_ins` AFTER INSERT ON receipts FOR EACH ROW
  REPLACE INTO _receipts_new (id, ...) VALUES (NEW.id, ...);

CREATE TRIGGER `after_receipts_upd` AFTER UPDATE ON receipts FOR EACH ROW
  REPLACE INTO _receipts_new (id, ...) VALUES (NEW.id, ...);

CREATE TRIGGER `after_receipts_del` AFTER DELETE ON receipts FOR EACH ROW
  DELETE IGNORE FROM _receipts_new WHERE id = OLD.id;
3. Chunk through all the data and insert into the new table
Unfortunately, plain SQL doesn’t give you a great way to chunk through a table. MySQL, for example, lets you use LIMIT / OFFSET, but it gets slower the further you go into the table.
For this part, it’s smart to use a Ruby, Python, or JavaScript script. The idea is effectively this: determine an optimal chunk size (say 1000) and then iterate through the primary keys, incrementing by 1000 each time. Here’s a Rails-style version:
id = Receipt.first.id # this will be kind of weird if you have changed the default scope...
max_id = Receipt.last.id

# Integer#upto has no step argument, so use #step to walk the ids 1000 at a time.
# Chunks overlap by one id at each boundary, which REPLACE makes harmless.
id.step(max_id, 1000) do |i|
  ActiveRecord::Base.connection.execute(<<-SQL)
    REPLACE INTO _receipts_new (id, …)
    SELECT id, …
    FROM receipts
    WHERE id BETWEEN #{i} AND #{i + 1000}
  SQL
end
This script chunks through all of the available records and copies them over. Once it’s done, the new table should be up to date.
4. Optional: Check the integrity of the table
You can run a quick gut check to make sure everything made it over by doing something like this. Depending on the size of the table, though, this query might take a while.
SELECT COUNT(1)
FROM receipts r
LEFT JOIN _receipts_new rn ON rn.id = r.id
WHERE rn.id IS NULL;
This should return a count of 0. If it doesn’t, you’ve missed some data.
5. Rename the table and drop the triggers
Now it’s time to finish.
-- Swap both tables in one statement so there's never a moment without a receipts table
RENAME TABLE receipts TO _receipts_old,
             _receipts_new TO receipts;

DROP TRIGGER after_receipts_ins;
DROP TRIGGER after_receipts_upd;
DROP TRIGGER after_receipts_del;

-- After you verify, you can drop the old table too
DROP TABLE _receipts_old;
If at this point everything goes to hell, make sure you’re prepared to rename _receipts_old back to receipts:
RENAME TABLE receipts TO _receipts_new,
             _receipts_old TO receipts;
But this shouldn’t happen unless the code has a bug.
The 4 Types of Zero-Downtime Migrations
There are only four types of database schema migrations:
- Add a column
- Change a column type
- Rename a column
- Delete a column
How to Add a Column
Now I’m going to get into the specifics of each type of migration. Each of the four has its own nuances, mostly related to code. For instance, if you have code that references a column that doesn’t exist, that will cause production issues.
Let’s start with the easiest first: adding a column. Unless your code is a mutant, it won’t reference a column that shows up randomly, so adding a column this way is quite simple. You can add the column before any code changes go out and start referencing it as soon as it’s live.
There is one possible snag, though: adding a column that requires something to be in it. For example, let’s say you add a password field to a table. What do you do then?
Of course, this will depend on the code in question, but you can write code that fills the field if it exists and ignores it if it doesn’t.
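At the database level, one way to hedge (sketched here with a hypothetical users table and column) is to make the new column nullable or give it a default, so rows written by code that doesn’t know about the column are still valid:

CREATE TABLE _users_new LIKE users;
-- Nullable with a default, so old code can keep inserting rows without setting it
ALTER TABLE _users_new ADD COLUMN password_digest VARCHAR(255) NULL DEFAULT NULL;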
One important note: if you’re deriving the new column from other data, you’ll need to compute it as you populate the new table, and also update the triggers to reflect the change.
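For example (column names here are hypothetical), if the new column were derived from an existing one, the insert trigger from step 2 would need to compute it too:

-- Same trigger as in step 2, now also filling the derived column
CREATE TRIGGER `after_receipts_ins` AFTER INSERT ON receipts FOR EACH ROW
  REPLACE INTO _receipts_new (id, total, total_cents)
  VALUES (NEW.id, NEW.total, ROUND(NEW.total * 100));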
You Can Also Change a Column Type
Changing a column type can have substantial performance payoffs. For instance, getting blobs or long varchars out of a MySQL table can reduce its memory footprint.
In general, column type changes shouldn’t change the fidelity of the data, and as a result they should be about as simple as adding a column, as above.
For example, changing a signed int that only holds small, non-negative values to an unsigned tinyint doesn’t change the data. The same goes for converting a varchar with a fixed set of values to an enum.
When it comes to populating the new table, you might need to cast values accordingly, or you might find that some data doesn’t quite line up with what you thought was there.
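As a sketch of both points, here’s what the copy might look like for the signed-int-to-unsigned-tinyint case, plus a range check to run before the backfill (the item_count column is a hypothetical assumption):

CREATE TABLE _receipts_new LIKE receipts;
ALTER TABLE _receipts_new MODIFY item_count TINYINT UNSIGNED NOT NULL DEFAULT 0;

-- Catch values that won't fit the new type before the backfill trips over them
SELECT COUNT(*) FROM receipts WHERE item_count < 0 OR item_count > 255;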
Maybe you need to rename a column
Renaming a column is where things get a bit trickier. Renaming a column, say from pdf_identifier to pdf_id, can have far-reaching consequences in code. If you change the database and the field disappears, the code won’t know what to do.
That is why it’s important to ship code that can handle both names before migrating the table.
This connection between data and code adds complexity to the deployment process:
- Deploy code that allows both pdf_identifier and pdf_id as fields
- Migrate the table from pdf_identifier to pdf_id (sketched below)
- Ship code that removes references to pdf_identifier
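The migration in the middle is just the standard zero-downtime recipe from above, with the rename applied when you create the copy. A minimal sketch, assuming receipts is the table in question and the column is a VARCHAR(255) (both hypothetical here):

CREATE TABLE _receipts_new LIKE receipts;
-- MySQL's CHANGE requires restating the full column definition even when only the name changes
ALTER TABLE _receipts_new CHANGE pdf_identifier pdf_id VARCHAR(255);

The triggers and backfill then map pdf_identifier from the original table into pdf_id in the copy, and the final swap is the same as before.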
That is three times more complicated than most deploys! So if you can, it’s best to avoid these kinds of triple deploys. If a table is small enough (<50k rows), it’s probably safe to just migrate it with a plain ALTER command.
Here’s how to delete a column (without screwing up)
Removing a column is by far the hardest to do without any downtime or errors in judgment. Whenever you remove a column, realize that the data doesn’t go to a trash bin; it’s gone forever. So when it involves customer data, it’s important to be careful!
Pulling this off effectively requires the following:
- Deploy code that stops referencing the old column
- Migrate the column to a renamed version of itself, for example pdf_identifier -> pdf_identifier_deleted (sketched below)
- Wait a week
- If no errors crop up then remove column through a zero-downtime migration
- If errors crop up then hotfix code and wait another week
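The rename in step 2 is the same machinery as the previous section, just pointed at a throwaway name so the data sticks around while you watch for errors. A minimal sketch, with the column type again a hypothetical assumption:

CREATE TABLE _receipts_new LIKE receipts;
-- Keep the data under a name nothing references instead of dropping it outright
ALTER TABLE _receipts_new CHANGE pdf_identifier pdf_identifier_deleted VARCHAR(255);

Once a clean week has passed, dropping pdf_identifier_deleted in step 4 is just another pass through the zero-downtime recipe.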
Conclusion
Zero-downtime migrations can be tricky to pull off with raw SQL, so whenever possible rely on a tool like pt-online-schema-change, which automates much of this for you. Even though that tool works well, it sometimes doesn’t behave as expected, which is why this article exists.
These migrations take a long time to execute and are tedious, but they can keep you from losing money and possibly your customers. If your app is down, you can easily be losing thousands of dollars a minute as your users either go elsewhere or can’t make money themselves, which easily justifies the cost of having someone methodically work through zero-downtime migrations.
My Next Chapter
After a year and a half of working with Clickfunnels, I’ve decided to focus on my own business.
Considering the customer base has grown exponentially and the app runs on a substantial amount of hardware, I’m pleased we were able to shrink the database multiple times, by over 244GB in one instance!
What am I doing next?
I juggled my full-time position at Clickfunnels alongside the launch of my second book, Thoughtful Machine Learning with Python, and the start of my first-ever Safari course, A Practical Introduction to Machine Learning.
Both experiences taught me that I love teaching other engineers how to solve problems with data. In response to my book and classes, I built a video course on how to start a machine learning project, no advanced math required. Check it out if you’re serious about becoming an expert in machine learning.
Get my video course
I also found that high-growth startups want to pursue data science projects, but often don’t know which projects to pursue, or don’t have the resources for a full-time data person.
Therefore, I created a new offering called Your Chief Scientist, which supports engineering teams over a 3-month period with their data projects, ranging from database performance to deep learning.
If you or someone you know is:
- Concerned about the size of your database and the server costs.
- Excited about machine learning, only to get confused about how it applies to your job.
- Afraid of missing out on the possibilities of Deep Learning.
- Excited about learning how to write code that optimizes itself.
Then we should talk.
Regardless, feel free to e-mail me at matt [at] matthewkirk.com or reach out to me on Twitter. I’d love to hear from you.
Strength vs Power
Yesterday was a beautiful day here in the Seattle area. It’s been beautiful most days lately, so, seizing the day, I figured I’d go out for a bike ride. Biking is something I really enjoy; it helps me decompress and stay healthy. Plus, you get to see such beautiful scenery around here.
While I was out riding near Juanita Beach in Kirkland, I started thinking about the class 4 climb ahead of me. Naturally, I started pedaling with more force. With strong quads I can ride a bike fairly well, but as I got up the hill I was spent. I had to stop and take a break.
That got me thinking about strength vs power.