Monday, May 9, 2011

My Beef with Databases (Part 3)

This is the second in a series of blog posts about my beef with relational databases today, and some musings on how we could make databases better.

I have a bone to pick with the databases of today. It has developed because of the frustrations I've had in writing, deploying, and interacting with databases over the years. In fact, I have three specific issues I want to talk about, and perhaps, come up with a few ideas to make databases great.

Last time, I described my issues with the differences between the data definition between my application and database. So let's talk about my third biggest beef with databases as they exist today.

SQL is a human language used as a communication protocol

Before we go any further, let's make sure we're talking about the same thing. I am NOT talking about relationship databases, I'm fond of those. I am talking about the actual SQL language. You know, "select * from table;". And even then, I don't have a problem with the language itself. It was designed to be human readable first and it just happened to become something of a data exchange language.

I do, however, very much have an issue using it as a communication protocol between two computers. It's clearly not designed for that since it's structure is more linguistic than data structure. For example, you can't leave extra comma's after lists, or leaving an AND at the end of the where clause is invalid, or even change the order of the clauses, which by itself would simplify a lot of code. Just for starters, why isn't "SELECT user_id, username, password, WHERE username='atrodo' AND password LIKE '12345' AND FROM user;' acceptable? It is unambiguous and is far easier for programs to generate.

And that's the root of the problem. We're using programs to generate text that was meant to be written by humans in order to talk to another computer. It's not forgiving, it's not loosely defined, and it's not a data structure.  It's difficult to produce and difficult to parse, and most importantly, is as much of an API as cobol is.

Put another way, are we using Cobol text to create web APIs? Do we exchange ssl keys using C? Are http request headers a perl program?

Less and less SQL inside an application is being hand written. More and more SQL is being generated by ORMs or SQL abstraction layers. And these tools are rarely simple. Quite the opposite actually, they are normally much more complex than they really need to be. And in my experience these SQL generators are used to generate the SQL that works instead of really expressing what data is really wanted.

So then the question I ask myself is, if not SQL, then what? Is XML or JSON really a better option? Absolutely. Both were designed as mappings to real application data structures. In the case of JSON, it is a data structure. They are meant to be a data exchange format first and happen to be somewhat human readable.

Why do we continue to use SQL as a database API?

Friday, May 6, 2011

YouTube "Http/1.1 Service Unavailable"

Can anyone tell me why, right now YouTube is "Http/1.1 Service Unavailable" when I try to do anything with my account on the site?  Why does this even happen?

I thought the cloud was suppose to save us from this.  One node fails?  That's cool, we'll route to someone else. From my searches, it sounds like this is not an uncommon occurrence.  And has been happening for years.  This sounds like a lack of redundancy and a single point of failure.  This is exactly what Google assures us they do not have.

And yet, here I am, unable to access portions of YouTube with no explanation.  Nothing on the website, nothing on twitter, no signs at all of what's happening.  If this is what the cloud gives us, then it sounds like it's more cloud than computing.

My Beef with Databases (Part 2)

This is the second in a series of blog posts about my beef with relational databases today, and some musings on how we could make databases better.

I have a bone to pick with the databases of today. It has developed because of the frustrations I've had in writing, deploying, and interacting with databases over the years. In fact, I have three specific issues I want to talk about, and perhaps, come up with a few ideas to make databases great.

Last time, I described my number one issue, the code separation that databases enforce. That is the biggest issue primarily because the other two are related. So let's review my second biggest beef with databases as they exist today.

My data definition is not in my repository

Which, to a point, isn't completely accurate assessment of the situation. The definition of my data is in two places. Two very different, very disjointed places. And there is no mechanism for the two to know that the definition is different. This issue is exaggerated in the presence of ORMs or other tools that need to or want to understand the structure of the data in the database and depend on the programmer to describe the data definition.

Then the question becomes who is the authority on the data definition. On the theoretical side, the database is in charge of the definition and structure of the database. On the other side, the database is a tool for the application so the definition and structure should stay in the application. I'm a firm believer that is the proper case. My application should define all behaviors, structures and definitions.

When our applications connect to the database, both assumes that they are talking about the same data model. The reality is that often that is not the case. What's even worse is when the database is just subtly different, which means things will appear to happen correctly when it's not, potentially even corrupting data.

I can appreciate that this is a difficult problem in practice. That doesn't mean the problem should be ignored. Quite the opposite. My data is vital, which is why I use a database and not some mish-mash of files and /dev/null. I could accept the issue if I could use always use create scripts and not be forced to use a create script and then a series of alter scripts. Or, maybe even I could give it a create script, and it tells me what's different. That'd also be cool. But that doesn't happen. No one, to my knowledge, does that.

As I was describing this issue to my friend, he asks about object or document data stores, like what GAE uses or MongoDB. These things are cool, and to a degree they solve this issue, but in the general sense, they solve it by not enforcing data structure, making data integrity your responsibility. And that's entirely true, data integrity IS my responsibility. But I think that without the right tools, that responsibility becomes overly difficult and cumbersome. And when things are difficult, programmers are among the first to become lazy and ignore the problem. Ignore it until something goes wrong. And it will eventually go wrong.

I want to ask, why can't my application define my database?

Wednesday, May 4, 2011

My Beef with Databases (Part 1)

This is the first in a series of blog posts about my beef with relational databases today, and some musings on how we could make databases better.

I have a bone to pick with the databases of today. It has developed because of the frustrations I've had in writing, deploying, and interacting with databases over the years. In fact, I have three specific issues I want to talk about, and perhaps, come up with a few ideas to make databases great.

Before I get much farther, let's get a few points straight. First, I'm not in the NoSQL camp. Sorry if that disappoints you. Relational databases are great things. But, by the same token, the NoSQL camps have some neat ideas and there are times when going that route makes more sense than using relational databases. There are merits to both methodologies.

So let's get on with issue number one, and this is my biggest beef with databases as they exist today.

My code is not in my repository


Yes, that's right, it's as simple as that. But, what exactly does that mean? All of my code, including all of my business logic.  All of my stored procedures, views and data definitions are stored on the database. Now, I know what most people are going to say at this point, that what I'm saying is false and a good repository holds the create and alter scripts for everything and as long as the database is kept in sync with the current version of the scripts, everything works.

Except when it doesn't.

Let me take a detour for a moment. If you have Perl Best Practices laying around, it's time to crack it open because Damian Conway has some sage advice that speaks to this issue. In Chaper 13 of Perl Best Practices, Damian goes out of his way to explain why exceptions are better than returning special values, specifically:
It's human nature to "trust but not verify".
He goes on to explain that returning special values makes ignoring errors easy and dangerous. What happens when opening a file fails? In the normal case, open return a special value, in perl it's undef and in C it's -1. So, programmers have two choices. Either check for the special value and handle it, or ignore it and let the bug present itself later in some unrelated section of the code. Guess which one is easier? That's right,the obscure bug inducing option of doing nothing. Damian finishes this thought by explaining:
Don't return special error values when something goes wrong; throw an exception instead. The great advantage of exceptions is that they reverse the usual default behaviors, bringing untrapped errors to immediate and urgent attention.
So what does this have to do with databases? It illustrates why I think alter and create scripts are bad. The developer trusts that those scripts have run, and often, do nothing to verify that. Yes, developers should check, just like they should always check return codes. But they don't. And that's my point, why should they? We, as programmers, have moved away from assuming that everyone agrees that everyone is talking the same way. We have doc-types at the top of XML files since we no longer assume we're all talking HTML 3.2. We have protocol versions embedded in streams so everyone knows how to communicate. We have extendable file formats that allow older readers to read newer files.  But yet, when talking to databases, we assume that the triggers stored procedures are there.

Some clever programmers have recognized this issue and decided to completely avoid it by putting all of the business logic into the applications. I've done it before, and it's a solution.  The major flaw with that is the logic is farther from the data when the desire is to get the business logic closer to the data so the logic is always applied at the database level. That also assumes that everyone is going to use the business logic or ORM to access the data. Hint, they're not always going to.

So on the one side, we have the maintenance cost of making sure the database we are talking to is always running the correct version of our database code (which is completely separate from the rest of our code). On the other side we have to make sure all of our business logic is ran no matter who interacts with the database.

I want to ask, why do I have to make a decision?