Monday, May 9, 2011

My Beef with Databases (Part 3)

This is the third in a series of blog posts about my beef with relational databases today, and some musings on how we could make databases better.

I have a bone to pick with the databases of today. It has developed because of the frustrations I've had in writing, deploying, and interacting with databases over the years. In fact, I have three specific issues I want to talk about, and perhaps, come up with a few ideas to make databases great.

Last time, I described my issue with the data definition being split between my application and my database. So let's talk about my third biggest beef with databases as they exist today.

SQL is a human language used as a communication protocol

Before we go any further, let's make sure we're talking about the same thing. I am NOT talking about relational databases; I'm fond of those. I am talking about the actual SQL language. You know, "select * from table;". And even then, I don't have a problem with the language itself. It was designed to be human readable first and it just happened to become something of a data exchange language.

I do, however, very much have an issue with using it as a communication protocol between two computers. It's clearly not designed for that, since its structure is more linguistic than data structure. For example, you can't leave an extra comma after a list, you can't leave an AND at the end of the where clause, and you can't even change the order of the clauses, which by itself would simplify a lot of code. Just for starters, why isn't "SELECT user_id, username, password, WHERE username='atrodo' AND password LIKE '12345' AND FROM user;" acceptable? It is unambiguous and is far easier for programs to generate.

And that's the root of the problem. We're using programs to generate text that was meant to be written by humans in order to talk to another computer. It's not forgiving, it's not loosely defined, and it's not a data structure. It's difficult to produce and difficult to parse, and most importantly, it is as much of an API as COBOL is.
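To make the separator problem concrete, here is a minimal sketch of what even a tiny SQL generator ends up doing. The function name `build_select` and its arguments are made up for illustration; the point is only that the generator has to collect fragments and join them, because a trailing comma or a trailing AND is a syntax error.

```python
def build_select(table, columns, conditions):
    """Return (sql, params) for a simple SELECT with equality conditions.

    `conditions` maps column names to values. Table and column names are
    interpolated as-is, so they must come from trusted code, never from
    user input. Values travel separately as "?" placeholders.
    """
    # Commas go *between* columns, so the list is joined, never appended to.
    sql = "SELECT " + ", ".join(columns) + " FROM " + table

    params = []
    if conditions:
        clauses = []
        for column, value in conditions.items():
            clauses.append(column + " = ?")
            params.append(value)
        # Same dance for AND: it is a separator, not a terminator.
        sql += " WHERE " + " AND ".join(clauses)

    return sql, params


sql, params = build_select(
    "user",
    ["user_id", "username"],
    {"username": "atrodo", "password": "12345"},
)
print(sql)
print(params)
```

Note the special case for an empty set of conditions. If the language tolerated a dangling separator, that branch and both joins would disappear.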

Put another way, are we using COBOL text to create web APIs? Do we exchange SSL keys using C? Are HTTP request headers a Perl program?

Less and less SQL inside an application is being hand written. More and more SQL is being generated by ORMs or SQL abstraction layers. And these tools are rarely simple. Quite the opposite actually, they are normally much more complex than they really need to be. And in my experience, these SQL generators are used to generate SQL that works instead of expressing what data is actually wanted.

So then the question I ask myself is, if not SQL, then what? Is XML or JSON really a better option? Absolutely. Both were designed as mappings to real application data structures. In the case of JSON, it is a data structure. They are meant to be a data exchange format first and happen to be somewhat human readable.
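As a rough illustration of what that could look like, here is a query expressed as a plain data structure and serialized as JSON. The key names (`select`, `from`, `where`, `column`, `op`, `value`) are invented for this sketch; no database I know of accepts this exact format.

```python
import json

# A hypothetical query format: the query is just nested lists and dicts.
query = {
    "select": ["user_id", "username", "password"],
    "from": "user",
    "where": [
        {"column": "username", "op": "=", "value": "atrodo"},
    ],
}

# Adding a condition is a list append. There is no separator to manage
# and no clause order to respect.
query["where"].append({"column": "password", "op": "like", "value": "12345"})

# What would go over the wire is ordinary JSON text...
wire = json.dumps(query, sort_keys=True)
print(wire)

# ...and the receiving side gets the same structure back without writing
# a parser for a human language.
received = json.loads(wire)
print(received == query)
```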

Why do we continue to use SQL as a database API?

Friday, May 6, 2011

YouTube "Http/1.1 Service Unavailable"

Can anyone tell me why, right now YouTube is "Http/1.1 Service Unavailable" when I try to do anything with my account on the site?  Why does this even happen?

I thought the cloud was supposed to save us from this.  One node fails?  That's cool, we'll route to someone else. From my searches, it sounds like this is not an uncommon occurrence.  And has been happening for years.  This sounds like a lack of redundancy and a single point of failure.  This is exactly what Google assures us they do not have.

And yet, here I am, unable to access portions of YouTube with no explanation.  Nothing on the website, nothing on Twitter, no signs at all of what's happening.  If this is what the cloud gives us, then it sounds like it's more cloud than computing.

My Beef with Databases (Part 2)

This is the second in a series of blog posts about my beef with relational databases today, and some musings on how we could make databases better.

I have a bone to pick with the databases of today. It has developed because of the frustrations I've had in writing, deploying, and interacting with databases over the years. In fact, I have three specific issues I want to talk about, and perhaps, come up with a few ideas to make databases great.

Last time, I described my number one issue, the code separation that databases enforce. That is the biggest issue primarily because the other two are related. So let's review my second biggest beef with databases as they exist today.

My data definition is not in my repository

Which, to a point, isn't a completely accurate assessment of the situation. The definition of my data is in two places. Two very different, very disjointed places. And there is no mechanism for the two to know that the definition is different. This issue is exacerbated in the presence of ORMs or other tools that need to or want to understand the structure of the data in the database and depend on the programmer to describe the data definition.

Then the question becomes who is the authority on the data definition. On the theoretical side, the database is in charge of the definition and structure of the database. On the other side, the database is a tool for the application, so the definition and structure should stay in the application. I'm a firm believer that the latter is the proper case. My application should define all behaviors, structures and definitions.

When our applications connect to the database, both sides assume that they are talking about the same data model. The reality is that often that is not the case. What's even worse is when the database is just subtly different, which means things will appear to happen correctly when they don't, potentially even corrupting data.

I can appreciate that this is a difficult problem in practice. That doesn't mean the problem should be ignored. Quite the opposite. My data is vital, which is why I use a database and not some mish-mash of files and /dev/null. I could accept the issue if I could always use create scripts and not be forced to use a create script and then a series of alter scripts. Or, maybe I could even give it a create script, and it would tell me what's different. That'd also be cool. But that doesn't happen. No one, to my knowledge, does that.
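To show the kind of thing I mean by "give it a create script, and it tells me what's different", here is a small sketch using SQLite from Python. It is only an approximation done from the outside: it runs the create script against a scratch in-memory database and compares column names and declared types using SQLite's `PRAGMA table_info`. The function names and the `users` table are made up for the example, and it ignores indexes, constraints, triggers and everything else a real tool would need to cover.

```python
import sqlite3


def table_columns(conn, table):
    """Return {column_name: declared_type} for one table.

    `table` is interpolated into the PRAGMA, so it must be a trusted name.
    Each row of table_info is (cid, name, type, notnull, dflt_value, pk).
    """
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {row[1]: row[2] for row in rows}


def diff_against_create_script(live_conn, create_script, table):
    """Compare a live table with what the create script says it should be."""
    scratch = sqlite3.connect(":memory:")
    try:
        scratch.executescript(create_script)
        wanted = table_columns(scratch, table)
    finally:
        scratch.close()

    actual = table_columns(live_conn, table)
    shared = set(wanted) & set(actual)
    return {
        "missing": sorted(set(wanted) - set(actual)),
        "unexpected": sorted(set(actual) - set(wanted)),
        "type_changed": sorted(n for n in shared if wanted[n] != actual[n]),
    }


# The "deployed" database, which has drifted from the script in the repository.
live = sqlite3.connect(":memory:")
live.execute(
    "CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT, password INTEGER)"
)

script = """
CREATE TABLE users (
    user_id  INTEGER PRIMARY KEY,
    username TEXT,
    password TEXT,
    email    TEXT
);
"""

print(diff_against_create_script(live, script, "users"))
```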

As I was describing this issue to my friend, he asked about object or document data stores, like MongoDB or what GAE uses. These things are cool, and to a degree they solve this issue, but in the general sense, they solve it by not enforcing data structure, making data integrity your responsibility. And that's entirely true, data integrity IS my responsibility. But I think that without the right tools, that responsibility becomes overly difficult and cumbersome. And when things are difficult, programmers are among the first to become lazy and ignore the problem. Ignore it until something goes wrong. And it will eventually go wrong.

I want to ask, why can't my application define my database?

Wednesday, May 4, 2011

My Beef with Databases (Part 1)

This is the first in a series of blog posts about my beef with relational databases today, and some musings on how we could make databases better.

I have a bone to pick with the databases of today. It has developed because of the frustrations I've had in writing, deploying, and interacting with databases over the years. In fact, I have three specific issues I want to talk about, and perhaps, come up with a few ideas to make databases great.

Before I get much farther, let's get a few points straight. First, I'm not in the NoSQL camp. Sorry if that disappoints you. Relational databases are great things. But, by the same token, the NoSQL camps have some neat ideas and there are times when going that route makes more sense than using relational databases. There are merits to both methodologies.

So let's get on with issue number one, and this is my biggest beef with databases as they exist today.

My code is not in my repository

Yes, that's right, it's as simple as that. But, what exactly does that mean? All of my code, including all of my business logic.  All of my stored procedures, views and data definitions are stored on the database. Now, I know what most people are going to say at this point, that what I'm saying is false and a good repository holds the create and alter scripts for everything and as long as the database is kept in sync with the current version of the scripts, everything works.

Except when it doesn't.

Let me take a detour for a moment. If you have Perl Best Practices lying around, it's time to crack it open because Damian Conway has some sage advice that speaks to this issue. In Chapter 13 of Perl Best Practices, Damian goes out of his way to explain why exceptions are better than returning special values, specifically:
It's human nature to "trust but not verify".
He goes on to explain that returning special values makes ignoring errors easy and dangerous. What happens when opening a file fails? In the normal case, open returns a special value: in Perl it's undef and in C it's -1. So, programmers have two choices. Either check for the special value and handle it, or ignore it and let the bug present itself later in some unrelated section of the code. Guess which one is easier? That's right, the obscure, bug-inducing option of doing nothing. Damian finishes this thought by explaining:
Don't return special error values when something goes wrong; throw an exception instead. The great advantage of exceptions is that they reverse the usual default behaviors, bringing untrapped errors to immediate and urgent attention.
So what does this have to do with databases? It illustrates why I think alter and create scripts are bad. The developer trusts that those scripts have run and often does nothing to verify that. Yes, developers should check, just like they should always check return codes. But they don't. And that's my point, why should they? We, as programmers, have moved away from assuming that everyone is talking the same way. We have doctypes at the top of XML files since we no longer assume we're all talking HTML 3.2. We have protocol versions embedded in streams so everyone knows how to communicate. We have extensible file formats that allow older readers to read newer files.  And yet, when talking to databases, we assume that the triggers and stored procedures are there.
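For what it's worth, here is a sketch of what "verify, and fail loudly" could look like, combining the two ideas above: a version number stored with the data, and an exception instead of a return code nobody checks. It uses SQLite's `PRAGMA user_version`, an integer stored in the database file that applications may use however they like. The expected version number, the exception class and the function name are all invented for this example.

```python
import sqlite3

# Hypothetical: the schema version this application was written against.
EXPECTED_SCHEMA_VERSION = 3


class SchemaMismatch(Exception):
    """Raised when the database is not at the version the code expects."""


def verify_schema(conn):
    """Raise SchemaMismatch unless the database reports the expected version.

    Returns the version on success. The failure path is an exception on
    purpose: a caller that forgets to check still finds out immediately.
    """
    (found,) = conn.execute("PRAGMA user_version").fetchone()
    if found != EXPECTED_SCHEMA_VERSION:
        raise SchemaMismatch(
            f"expected schema version {EXPECTED_SCHEMA_VERSION}, found {found}"
        )
    return found


conn = sqlite3.connect(":memory:")

# A fresh SQLite database reports version 0, so this is caught right away.
try:
    verify_schema(conn)
except SchemaMismatch as error:
    print("refusing to run:", error)

# The last step of a (hypothetical) migration script would stamp the version.
conn.execute(f"PRAGMA user_version = {EXPECTED_SCHEMA_VERSION}")
print("schema ok, version", verify_schema(conn))
```

This only proves that someone stamped the number, not that the triggers and stored procedures really match, so it is a partial answer at best.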

Some clever programmers have recognized this issue and decided to completely avoid it by putting all of the business logic into the applications. I've done it before, and it's a solution.  The major flaw with that is the logic is farther from the data when the desire is to get the business logic closer to the data so the logic is always applied at the database level. That also assumes that everyone is going to use the business logic or ORM to access the data. Hint, they're not always going to.

So on the one side, we have the maintenance cost of making sure the database we are talking to is always running the correct version of our database code (which is completely separate from the rest of our code). On the other side we have to make sure all of our business logic is run no matter who interacts with the database.

I want to ask, why do I have to make a decision?

Tuesday, February 8, 2011

jQuery/pre-wrap trick

I've had a recent encounter with Internet Explorer 7 for $DAYJOB.  For this section of code, IE7 is our minimum required browser.  I was displaying some user input and wanted to maintain the text formatting while not using <pre>, which I find ugly.  So, for a while, I was using the CSS style "white-space: pre-wrap;" which will let the text flow like html, but break at new line characters like pre.  Perfect.  This has the added benefit that the text will be editable and we have a good jQuery plugin that will take a div and turn it into a textarea and use the content of said div as the initial content.  After a while, I noticed that IE7 was not behaving.  This is because IE7 does not support the pre-wrap CSS style.  Well, that's an issue.  So after a bit of reworking, I came up with this:

        function(v, i)
          return [

So let's break this down a little.  
  1. I create the div, and add a class to it.
  2. I begin an append of the content
  3. I then take the output from a map, and turn it into a jQuery object for the append (Yes, this is important)
  4. I use the data I've been given and split it by a new line character and use that array as the input to a jQuery map operation
  5. Please note that I used the quote form of split, not the regex form.  IE7 also removes empty array items when using the regex version.
  6. For each item from the split, I return an array.  The first is a span that includes the text we need, plus a new line, to maintain the text representation of the data.  The second is a <br/> to take the place of the newline in the html
  7. Last, I return the DOM elements of the span and br.  This is important, since that's what the jQuery constructor back in step 3 expects.  Also, I must ask for element 0 since otherwise, I'll get an array instead of an element back from .get().
The best part of this whole thing is that if I do a .text() on the div, I get exactly the text I need for the editable textarea.  Perfect!  Well, not exactly.  This may have issues with spaces being insignificant, but in our case, that's acceptable for now.
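The split-and-map idea from the steps above can be sketched outside the browser. This is Python rather than jQuery, purely so it runs anywhere, and it builds an HTML string instead of DOM elements. The function name `render_lines` is made up, and escaping the text is my assumption about what setting the span's text does.

```python
from html import escape


def render_lines(text):
    """Turn plain text into one <span> plus <br/> per line.

    Each span keeps its own trailing newline, so the text content of the
    result still contains the line breaks, while the <br/> makes the break
    visible without relying on "white-space: pre-wrap".
    """
    parts = []
    # A plain string split keeps empty items, so blank lines survive.
    for line in text.split("\n"):
        parts.append("<span>" + escape(line) + "\n</span><br/>")
    return "".join(parts)


print(render_lines("first line\n\nthird line"))
```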

Friday, February 4, 2011

The NQP question

This weekend was a busy weekend for me, but more importantly, it was a busy weekend for Parrot.  There was a Parrot Developer's Summit (PDS), which I wasn't able to attend, and a lot of good discussion happened.  Perhaps too much since I glazed over a couple parts of the discussion.  If you're interested, the summit can be found at  Make sure to hit the next day, since it spans two days worth.

There was some Lorito discussion, which is interesting for me.  But actually, most of the discussion about Lorito was about the announcement that by Parrot 3.3, which lands in April, we will have a spec and initial implementation of Lorito.  The team, to my recollection, is going to be cotto, dukeleto, bacek and myself.  There was also some clarification of the terms M0 and Lorito.  M0 being the opcode set and bare minimum, and Lorito being the whole project.

But this blog posting isn't about Lorito.  It's about NQP.  The major discussion that took place during PDS was the suitability of the new NQP being included as a core component of Parrot.  This discussion was interesting for me since there were good points on each side.

Before I go any farther, let me explain what's happening to NQP.  pmichaud explains the change in detail at, but the quick version of it is that NQP is going to become multi-backend.  It will support output to Parrot, JVM and CLI.  It was made clear that it may not be able to support all VM backends equally; some features that Parrot gives naturally will be difficult in JVM, for example.  But the goal is for Rakudo, and anyone using NQP, to be able to retarget to another platform with relative ease.

On the surface, this appears to be a noble goal and something Parrot would be interested in.  I'm going to admit, it's appealing.  But look deeper under the surface and you'll find some deep issues.  The big issue is maintenance cost.  This may not seem like much, since it's the NQP team that will be maintaining the code, but as long as NQP is a core component, included with Parrot, the Parrot team has a responsibility to it.  It's like when I was at college: students always wanted the IT department to install AIM on all the lab computers, since it always gets installed anyways.  What few people seem to realize is that by doing that, IT is taking some form of responsibility for the installation.  Students are more apt to go to IT first to resolve an issue with AIM.  There's a cost for that amount of support tickets.  When AIM is installed by IT, the support is implied at that point.  Same with Parrot: if JVM support is there in the Parrot core, support for it is implied by its existence.  In this regard, I agree, including the new NQP, alternate backends and all, is not what Parrot should do.

The issue isn't so cut and dried as to say that NQP should leave the nest because it supports multiple backends.  The reality is that Parrot needs a tool like NQP.  People that want to start developing and targeting Parrot need a tool that can get them started fast.  That's also not to say that there should be one true tool to use, there should be multiple to fit people's needs.  But Parrot needs a default, and it's something that Parrot needs to support.

In the PDS aftermath, there were a few reasonable options brought up.

  1. Include the new NQP, but only include the "use parrot;" equivalent functionality.  This seems to be the option that is in favor right now.
  2. Continue to use NQP-rx, which is what is currently supported.  Personally, I don't like this option, since it means that Parrot will diverge from NQP, which will make it difficult for HLL writers to move from NQP-rx to NQP.
  3. Use another, existing tool, something like Winxed.  Actually, the more I look at Winxed, the less I doubt it could fit the bill.
  4. Create a brand new tool.  Feasible, but wouldn't happen anytime soon.  Although my fear is that it'd end up being something unpleasant to work with like PASM/PIR.  That fear is probably unfounded, but it's a concern nonetheless.
So, in the end, the question I ask myself is, what do I think Parrot should do?  Personally, I think options 3 and 1 are the most reasonable options, and failing that, option 4.  Option 2 just shouldn't be an option in my opinion, too many problems could and would arise from having two divergent languages.  I do think option 3 is the most desirable, since that way we can support HLL writers out of the box, but they still have options.  I understand the cost to do that, and at that point, I concede that option 1 is probably, at least in the short term, the option that should be pursued.

But that's just my opinion.

Thursday, December 16, 2010

In the beginning (A Story of Lorito and Parrot)

In the beginning...

...there was Parrot.  And it was magical.  But it was slow and would not stand still, so it was hard to target.  Then there were calls amongst those near Parrot for speed and efficiency.  Thus, Lorito was imagined.  It was to be leaner, faster and more stable, but still be able to handle almost all the magic of Parrot.  Ideas were thrown around, and much hand waving ensued.  Then, an idea of limiting the core to 20 opcodes was thrown out.  This piqued my interest, so new code was written, stubs were created, and a project on github created.  Code was written fast and furious and a vision appeared.  I began to dig at the surface of the issues that Parrot exhibited today.  Implementation details began to form around avoiding the mistakes of the past, as they were described to me.  Soon, a mantra was formed: Less Magic == More Magic.  Simplicity and efficiency can be optimized later, but complexity today remains tomorrow.

First, I made the opcodes single sized.  8 bytes, and only 8 bytes ever.  1 for the opcode, 3 for the 3 arguments (destination, source 1 and 2), and 4 bytes for the immediate value.  Quickly, I met resistance.  "We need far more than 255 registers" was the cry I heard.  I took note, contemplated doubling opcode size, which could afford 4k of registers.  Then I heard the reason why they needed more registers.  "Because that's what's produced today."  When I dug beneath the surface of the claim, it appeared that there is no register allocation scheme.  That many were used because they could be.  I said to myself "I have a hard time believing that at one point in so many programs, that a function call would need to track more than 255 registers at one time."  So the decision stayed.
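For the curious, the 8-byte layout described above can be sketched with Python's `struct` module. The field order follows the description (opcode, destination, source 1, source 2, immediate); the little-endian byte order and the signed immediate are assumptions made for this sketch, not something the prototype necessarily does.

```python
import struct

# Four unsigned bytes (opcode, destination, source 1, source 2) followed by
# a 4-byte signed immediate. "<" means little-endian with no padding, which
# is what keeps the total at exactly 8 bytes.
INSTRUCTION = struct.Struct("<BBBBi")


def encode(opcode, dest, src1, src2, immediate=0):
    """Pack one instruction into its fixed 8-byte form."""
    return INSTRUCTION.pack(opcode, dest, src1, src2, immediate)


def decode(raw):
    """Unpack 8 bytes into (opcode, dest, src1, src2, immediate)."""
    return INSTRUCTION.unpack(raw)


raw = encode(1, 2, 3, 4, -5)
print(INSTRUCTION.size, len(raw))
print(decode(raw))

# A register number is a single byte, so anything past 255 simply cannot
# be encoded. This is the limit the resistance was about.
try:
    encode(1, 256, 0, 0)
except struct.error as error:
    print("cannot encode:", error)
```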

Then I said that all PMCs are pointers and have no inherent structure.  There would be no way to ask, politely or otherwise, the PMC itself what it looked like.  The PMC could have a gate keeper, but forcing that seems like the wrong solution to the problem.  Why, I was asked.  That will make them flexible, I respond.  I have removed no functionality, no features lost.  Everything that has been done before can still be done.  Do we complain that machines have blocks of memory that we access by address and not by name?  But now, you can manipulate, store and load any data of any size and requirements into a PMC as though it is merely memory, because it is.  Since the PMC is always the same size, and it is the only thing that points to its memory, rearranging that memory after garbage has been collected is trivial.

Then, upon the advice of elders, I stopped treating PMCs with a level of reverence that objects do not enjoy. They are the same, I declared.  No difference.  Objects are PMCs, and PMCs are objects.  No more tables of v, they are all methods to be looked up and called.  Perhaps, I whispered to myself, we still need some special functions.  Simple, I replied, for the information you can ask of all PMCs, make those special functions special.  But do not let others interfere, and make their numbers small.

I surveyed my objects around, and saw my scatterings.  Registers put into boxes, segments of consts, datas and code.  In inspiration, I promoted them all to the same level.  Special PMCs, yes, but PMCs nonetheless.  To be interacted with in an identical manner as all other objects.

Then I noticed the registers that spoke words.  What use are they as special?  But, how do we efficiently do lookups on objects without them?  Then someone spoke of symbols.  What are these symbols, I asked.  He showed me a world where comparison of large volumes of text was cheap.  Symbols become the new register, we gain performance for looked-up objects, and strings that need regular manipulation become objects, like everything else.
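For those who have not met symbols before, here is a minimal sketch of the idea: each distinct piece of text is stored once and handed out as a small integer, so comparing two symbols is an integer comparison no matter how long the text is. The class and method names are invented for this sketch and are not taken from the Lorito prototype.

```python
class SymbolTable:
    """Maps each distinct string to a small integer, once."""

    def __init__(self):
        self._ids = {}     # text -> symbol number
        self._names = []   # symbol number -> text

    def intern(self, text):
        """Return the symbol for `text`, creating it on first sight."""
        if text not in self._ids:
            self._ids[text] = len(self._names)
            self._names.append(text)
        return self._ids[text]

    def name(self, symbol):
        """Recover the text, e.g. for error messages."""
        return self._names[symbol]


symbols = SymbolTable()
a = symbols.intern("get_name")
b = symbols.intern("get_name")
c = symbols.intern("set_name")

# Equal text always yields the same number, so this is an integer compare.
print(a == b, a == c)
print(symbols.name(c))
```

The cost of hashing the text is paid once, when the symbol is created. After that, a register only has to hold and compare the number.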

I created a container that contained everything Lorito needed to know about what it was doing.  It had the registers, the code it's working on, the arguments it has and returns, and the constants it has access to.  It has no requirement to return where it came from, so that will relieve some concerns about being able to do magic called CPS.  This context object then leaves the details of how the operations actually take place inside an interpreter for Lorito.  Later, after a braindumping, the idea was reinforced by those in attendance.  Thankfully, brains intact.

How can you call methods of objects, some asked.  Here's an idea, called P&W, another one told me.  Interesting, it was, but perhaps was still too complex since it dictated inheritance.  What if there is no need for inheritance?  What if inheritance is complex?  Let the object deal with it all.  Make a special operation called lookup.  When it's called, call the object's special lookup method and pass it whatever object it handed us in the beginning.  It shall let everyone talk to everyone, and allow any possible complexity to be possible.
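A rough sketch of that lookup idea, with every name invented for illustration: the virtual machine has exactly one operation, and all it does is hand the question to the object. Whether an object answers from a flat table or walks up to a parent is entirely the object's business.

```python
class LoritoObject:
    """An object whose only fixed behaviour is its own lookup function."""

    def __init__(self, lookup):
        self._lookup = lookup

    def lookup(self, name):
        return self._lookup(self, name)


def op_lookup(obj, name):
    """The single 'lookup' operation: delegate entirely to the object."""
    return obj.lookup(name)


def flat_lookup(methods):
    """A lookup with no inheritance at all: just a table."""
    def lookup(obj, name):
        return methods.get(name)
    return lookup


def delegating_lookup(methods, parent):
    """A lookup that implements single inheritance by itself."""
    def lookup(obj, name):
        found = methods.get(name)
        if found is not None:
            return found
        return op_lookup(parent, name)
    return lookup


base = LoritoObject(flat_lookup({"describe": lambda: "base"}))
child = LoritoObject(delegating_lookup({"speak": lambda: "hello"}, base))

print(op_lookup(child, "speak")())
print(op_lookup(child, "describe")())
print(op_lookup(base, "speak"))
```

The operation itself never learns what inheritance is. A different object could plug in multiple inheritance, or none, without touching `op_lookup`.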

Lorito has limited operations, and the instructions are limited in size.  The way to call methods will have to change.  No longer can we pass all of the arguments in one operation.  So a context is created to contain these arguments and other details, and the arguments are added to this in separate operations, not unlike a stack of trays.  Then, when the method is done, it can return any information in the same list.  But this method is not without controversy.  Efficiency, speed, compatibility are the concerns I hear.  I cannot speak to any of the three, but I believed it was my only choice with my decisions.  But the calls to those not inside the sphere of Lorito, they can be simplified.  These outside functions can be treated inside Lorito exactly like Lorito methods.  Then, once this function is called, it is passed the context I already have and it can do its magic.  A nice, simple way of interacting with those that are not Lorito.  Sadly, the concerns of others cannot be satisfied for some time, but I persevere.

Parrot has a difficult time talking to itself.  It can ask itself a question and wait for an answer, what it likes to call its innerloop, but it's slow and prone to problems.  Others say that Lorito will eliminate this problem while I am less optimistic.  Lorito may alleviate the problem, but there is a lot of magic needed to eliminate it.  So, I save the solution for another day.

I wondered, can Lorito exist without goto or call?  I stood and thought, and decided that no, it can't.  Not unless we make a huge, flat map of all the instructions that Lorito knows about at any one time.  So I kept both types of operations and made the code available to Lorito self-contained.  I like things that are self-contained; they're easier to handle and simpler to work with.  So I made the sheets of instructions individual for each method.

But I worried, I needed to justify my operations to anyone that uses Lorito.  So I wrote instructions on how to use each of my 44 operations and justified the existence of each.

As I looked over what I had accomplished, myself asked me a question.  What if all this is not what others want?  That's okay, I assure myself.  It's not the destination, it's the journey.  I understand the decisions I made are a departure from what Parrot is today, but this is about how I would create Lorito to support Parrot.  I believe earnestly that anything Parrot can do, Lorito can do and then some thanks to the flexibility and simplicity of the decisions made.  Its future is uncertain, but I will continue to work on it until there is no need for it, either by others or myself.

Less Magic == More Magic

(Yea, probably a bad idea.  But fun, I must admit)

Tuesday, December 14, 2010

My Little Lorito World

It's been a long time coming, six months in fact, that I intended to start blogging again.  And by again, I mean actually try and do a good job at it.  We'll see.

But, with some prodding by cotto, I've decided, hey, why not, let's review atrodo's little Lorito world.  One could argue that I should step back and explain what I'm talking about, and why I'm interested.  But I won't, since it'll take too long and I don't expect people that don't know what I'm talking about to be here yet.  If you are, sorry.

But, real quick.  Parrot is a virtual machine, not unlike the JVM or .Net.  It is aimed at more dynamic languages, but I actually have an interest in static languages on Parrot.  Lorito is a hand-waving term for a group of refactorings the Parrot project is aiming for.  I started a Lorito prototype back in July-ish after some very basic guidelines were posted to parrot-dev by allison.  Since then, I've been working on it (along with Hordes of Tarshish and, not to mention $dayjob) in an attempt to reconcile the visions of other Parrot developers and my own.  I will be saving my current Lorito vision for another post (which is apparently a narrative, huh).  But at issue today is the meeting from last Thursday between some of the core Parrot developers.  I would have tried to attend, but Portland is nowhere close to where I live.

There were 10 points in cotto's notes, so I'll hit my highlights.  The original notes can be found on a gist I created, and went on to edit with my notes (this is the current version).

Point 1: There is only context.

This is kind of a muddy idea, leaving out a few details that I suspect are important.  But it's one I can get on board with.  I will, for the time being, keep an interpreter around, but change its function and the concept behind it.  From now on, the interpreter will be in charge of executing bytecode as well as holding the pointers to the loaded files, symbols and the PMC heaps (which are not implemented in Lorito yet).  The context, for all intents and purposes, holds everything else.

The rest, however, worries me.  It implies several key things that I'm not sure I'm on board with.  First, that the MOP is so tightly integrated with Lorito, that Lorito cannot be used without using the Lorito MOP.  Now, maybe that's desirable to some extent, but the idea of Parrot is to allow a large cross section of languages to target it.  If I have to use Lorito's MOP, no matter how flexible it sounds, that sounds contrary to these ideals.

The other worrisome part is that setting a program counter makes it jump to C.  To me, jumping between the boundaries should be well contained and explicit.  The C code is its own context, it does not share it with the VM.  There are also, I think, a lot of good intentions here that will make security more difficult down the line.  But that's my gut telling me that.

Point 2: Register Count

That's cool.  Although, I had already set my limit to 255, since that's what can fit in a byte to keep the opcode length down.  But what really concerns me, is the integration of the MOP with the registers.  I just don't understand.

Point 3: Alloc as an op

I've already done this.  It's my new op.  To me, PMCs are blobs of memory, so that way you don't have two ways of doing essentially the same thing.  Some may argue that a PMC needs structure, but I don't see why that's not true with my method.  Then again, I always take the encapsulation method.  Outside code should not be able to peel back the veil and inspect my data structure.  This actually leads me to considering a seal function on PMCs, so only a particular code segment can do store or load operations on a PMC.  But that has some developing to do.

Point 4: ffi

This is important, I must agree.  But I also think this might be outside the bounds of Lorito.  Personally, and I've gotten some feedback about this position, I think Lorito bytecode should have one way to interact with anything from the outside world.  This means Lorito does not interact with nci or ffi directly.  Instead, it delegates that complexity to other code.  Yes, I know this sounds inefficient, but I have heard whiteknight and chromatic repeat over and over that PCC (Parrot Calling Convention) is slow.  Really slow.  So I abandoned anything that resembled PCC for two reasons.  First, because it is slow.  Second, because I don't have the opcode length for it.  Instead, all Lorito calls are made by pushing arguments (all of which are PMCs) onto a stack in the context.  If it's a Lorito method, it can pop the arguments off.  If it's a C function, the C function can pop the arguments off.  In the case of nci or ffi, it can be a thunk that massages the arguments into C types and calls the actual C function, arguments intact.  The big advantage of this is the simplicity.  If you're interacting with Lorito, you can get the arguments easily with a (inlinable) function call.  If you're calling nci or ffi, you still have to do your thing.  But you would have to do it anyways.  So what I've done is made the simple case simple (and in theory, faster) and the complex case possible.  The PCC way is overly complex, and is always complex, even for the simple case.
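A small sketch of that calling convention, in Python for brevity. The names (`Context`, `lorito_add`, `make_thunk`, `call`) are invented here and the real prototype is not written this way. What it shows is that every callable has the same shape: it receives the context, pops what it needs and pushes what it returns. An outside function gets wrapped in a thunk that does the popping on its behalf.

```python
class Context:
    """Holds the argument stack shared by caller and callee."""

    def __init__(self):
        self.args = []

    def push(self, value):
        self.args.append(value)

    def pop(self):
        return self.args.pop()


def lorito_add(ctx):
    """A 'Lorito method': pops its own arguments, pushes its result."""
    right = ctx.pop()
    left = ctx.pop()
    ctx.push(left + right)


def make_thunk(native, arity):
    """Wrap a plain function so it can be called like a Lorito method."""
    def thunk(ctx):
        values = [ctx.pop() for _ in range(arity)]
        values.reverse()  # restore the order the caller pushed them in
        ctx.push(native(*values))
    return thunk


def call(target, *arguments):
    """Caller side: one push per argument, then a single call."""
    ctx = Context()
    for argument in arguments:
        ctx.push(argument)
    target(ctx)
    return ctx.pop()


print(call(lorito_add, 2, 3))
print(call(make_thunk(pow, 2), 2, 10))
```

The caller cannot tell the two targets apart. All the type massaging lives in the thunk, which is the part you would have had to write anyway.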

Point 5: int sizes

Haven't thought a lot about this.  Honestly, I've been ignoring the issue.

Point 6: Don't fear the REPR

Seems completely possible to me.  No matter if the REPR is at the Lorito level, or at a higher level, this seems worthwhile.

Point 7: CPS and Context

Exactly what I do today.  Hurray!

Point 8: Awaiting clarification.

Point 9: Pass

(I know it looks like I'm giving up, but really, there's nothing for me to talk about)

Point A: Strings, not Symbols, are PMCs

That sums it up.  The S registers in my Lorito are symbols.  Not Strings.  Strings are high level constructs and deserve to be elevated to PMC status.  But low level symbols are a highly useful and time saving construct.  I've added no functionality to manipulate symbols.  They are there to load into registers and compare.  That is about all they can do.


I was hoping for concrete answers from last Thursday, and I got some, but I also got a lot more possible directions.  So, I am going to keep on doing what I've been doing, and try to reconcile ideas as they come along and advocate for what I've done so far.  Not sure where this is going to lead me and Parrot, but the possibilities are interesting to watch unfold.

And wow, that's a lot of text for such a short topic.