Archive for the ‘Uncategorized’ Category

It’s been a few weeks since my last post. Things have shifted around for me career-wise, and while I’m enjoying my new position, I’m transitioning out of my old role. Sometimes it feels like I’ve worked two jobs when I get home! Anyway, I’m going to take a break from the cryptography series and talk about something much more practical.

At my job, I’m finding that a large percentage of the questions that come to my desk are “Hey this thing you did is broke,” or “Hey, this thing you worked with is on the fritz.” 99% of the time, I’d have no idea why the behavior they were asking about was occurring! However, I think I’ve got a great problem solving methodology, which I attribute to being key in my success. Maybe some agile writer can take the concepts here and we can develop a formal methodology for solving problems 🙂

Preventative Measures

As professionals, we should accept that production problems are unavoidable. You will create defects in your code, and no amount of testing will ever weed them all out. Code is rushed, servers crash, disks fail, and datacenters overheat. Accepting the fact that production problems occur is very important, but most importantly you need to plan for them. As one of my favorite financial advisor says, “It’s not a matter of if these events will happen; it’s simply a matter of when they will happen.” Plan your week/month with hours available to investigate strange behavior, crashes or user complaints.

  • Your developers must be on call. I’ve discovered there is a strong correlation between the cleanliness of a developer’s code, and whether or not he had to be on call at some point in his career. People that don’t have to shovel their own shiztpoo do not have stake in preventing problems! Here’s the thing though, the on-call phone is for emergencies. There should be an understanding that if the issue isn’t happening right now, then it doesn’t need to be worked right now. Too often a pager is used to extend an employee’s work day! If you get to the point where you are looking at code, you should probably agree to throw up a 500 page and sick the whole team on the problem in the morning. Furthermore, if a user calls in the middle of the night to report an issue that they failed to report hours ago, chances are it can wait till daylight. It’s important to protect the on call phone, save it for when the meadow is truly on fire!
  • Use a bug tracking system! This is so obvious it amazes me how many companies do not have this. When events occur, update the bug report with an occurrence. It’s very important that you keep record of the event… When you suggest to management they give you time to fix an issue, having a record of the hours spent doing crisis diagnosis is an invaluable resource!
  • Keep an application inventory. Have your developers create a wiki entry for every application, that has the following information:
    • What configuration files are need to run
    • Which databases are connected to
    • What tables are read
    • What EIS systems are interacted with
    • What names of queues/topics are used
    • What usernames are used to interact with above systems
    • Where the logs are located
    • Known problems!

    Do not underestimate how vital this information is! In your SDLC, you should have a step where the above information is updated!

Ok, I’m on call. The phone is ringing. What do I do?

Your first step should always be to get an accurate description of the symptoms. You then must narrow the problem scope to either being able: to reproduce it on demand, or predict when it will occur next. For instance, if you have a B2B webservice and the consumer is complaining your sending SOAP faults, your first move should be to have them dump the XML request that is causing the issue. If an end user is complaining about an error message when they click a link, you should have them copy the link and take screenshots. I call this “narrowing the problem.” This is _the_ most important step. It is not optional, you should not proceed to work on a system unless you know what the problem is! You could easily worsen the situation by making arbitrary changes, and the likelihood of doing something random to solve an unknown issue is abysmal.

What if the end user causing the problem is not available, or the person on the other end is unwilling to provide the information you’re requesting? I think the best tactic at this point to ask them how they would like you to proceed. Let them finish completely without interrupting. Perhaps they were onto something, or had an alternative. Or perhaps they’re in left field, and if it leans towards left field, explain to them that you cannot solve a problem you don’t understand and can’t recreate. Don’t be afraid to stick up for yourself, it takes information to solve problems!

Ok, lets get crackin. It’s late and I want to go back to sleep

After you’ve narrowed the problem to a very very specific story or use case, it’s time to get working. First, copy the logs of the production server onto your local computer. If the logs roll while you’re troubleshooting, you may never find root cause. Next, copy any data related to the problem out of the database and store in text format with the log files. This may mean taking a snapshot of configuration tables of an application, the user’s profile rows, and maybe the data they’re trying to access. If you think it’s related, copy it! If you think it’s unrelated, copy it! Finally, note the version of the application, the bugfix level of the server, and the version number of the libraries available on the server. Whew!

The next thing is to get a pencil and paper. You should write down ever step you took and the time, and more importantly why you did it. This must be done as you perform the steps, but it doesn’t have to be verbose. For example:

“3:12am Noticed the logfile contained many a transaction timeout, meaning the connection pool may be locked. Restarting the application server”

There are three parts to this entry, the first is the observation, the second is the analysis, and the third is the action to correct.

So how do you find the problem?

It’s time to use that application inventory that you did way back in the preventative measures section. Lets say the log is reporting that the user cannot be found in the database. Since you’ve already narrowed the problem, simply refer to your inventory and connect as the application, then proceed to locate the user. If you can’t find them, then problem solved (for now)! Your inventory will tell you exactly what table and database to retrieve this information from.

But what if you’re not finding anything? Well… you need to poke around some more. Filter through your logs files and database entries. Do you see anything that looks out of place? Since you’ve already narrowed the issue down and reproduce it, can you try reproducing it using a different user? How about if you tail the log in realtime while the end user create the problem again? Can you enable debug logging in the frameworks (Spring, Hibernate, etc)? Can you reproduce the problem in a non-production environment? How about remote enabling remote debugging and stepping through as the system executes? How about cleaning up temporary files or deleting JSP compilation directories? Has the system ran out of disk space? Finally, does a good old fashioned restart fix the problem?

The hangover

So you were on the phone for an hour or two last night, what do you do the next day besides show up late? Well before you leave your house, take that pencil/paper with you. Open your bug tracking system, or send an email to the owners and include the original complaint, the logs, the database entries, and what finally fixed the problem. Transcribe your quick notes into the ticket. It’s best if this is done the next day, before your forget how you fixed the problem and why you made the decisions you did.

If you followed every step in this entry, you should have no problem giving management a thorough report of what was wrong and how it was fixed. This information is critical to mend over any rough spots with your customers. It can also be used to CYA if any flak comes your way! Finally, I hope this will make you a problem solving rockstar, ready to write better, simpler code (with tons of debug statements using SLF4J).

One of the most interesting trends I’ve seen lately is the unpopularity of Java around blogs, DZone and others. It seems some people are even offended, some even on a personal level, by suggesting the Java is superior in any way to their favorite web2.0 language.

Java has been widely successful for a number of reasons:

  • It’s widely accepted in the established companies.
  • It’s one of the fastest languages.
  • It’s one of the most secure languages.
  • Synchronization primitives are built into the language.
  • It’s platform independent.
  • Hotspot is open source.
  • Thousands of vendors exist for a multitude of Java products.
  • Thousands of open source libraries exist for Java.
  • Community governance via that JCP (pre-Oracle).

This is quite a resume for any language, and it shows, as Java has enjoyed a long streak as being one of the most popular languages around.

So, why suddenly, in late 2010 and 2011, is Java suddenly the hated demon it is?

  1. It’s popular to hate Java.
  2. C-like syntax is no longer popular.
  3. Hate for Oracle is being leveraged to promote individual interests.
  4. People have been exposed to really bad code, that’s been written in Java.
  5. … insert next hundred reasons here.

Java, the actual language and API, does have quite a few real problems… too many to list here (a mix of native and object types, an abundance of abandoned APIs, inconsistent use of checked exceptions). But I’m offering an olive branch… Lets discuss the real problem and not throw the baby out with the bath water.

So what is the real problem in the this industry? Java, with its faults, has completely conquered web application programming. On the sidelines, charging hard, new languages are being invented at a rate that is mind-blowing, to also conquer web application programming. The two are pitted together, and we’re left with what looks a bunch of preppy mall-kids battling for street territory by break dancing. And while everyone is bickering around whether PHP or Rails 3.1 runs faster and can serve more simultaneous requests, there lurks a silent elephant in the room, which is laughing quietly as we duke it out in childish arguments over syntax and runtimes.

Tell me, what do the following have in common?

  • Paying with a credit card.
  • Going to the emergency room.
  • Adjusting your 401k.
  • Using your insurance card at the dentist.
  • Shopping around for the best car insurance.
  • A BNSF train pulling a Union Pacific coal car.
  • Transferring money between banks.
  • Filling a prescription.

All the above industries are billion dollar players in our economy. All of the above industries write new COBOL and mainframe assembler programs. I’m not making this up, I work in the last industry, and I’ve interviewed and interned in the others.

For god sakes people, COBOL, invented in 1959, is still being written today, for real! We’re not talking maintaining a few lines here and there, we’re talking thousands of new lines, every day, to implement new functionality and new requirements. These industries haven’t even caught word the breeze has shifted to the cloud. These industries are essential; they form the building blocks of our economy. Despite this, they do not innovate and they carry massive expenses with their legacy technology. The costs of running business are enormous, and a good percentage of those are IT costs.

How expensive? Lets talk about mainframe licensing, for instance. Lets say you buy the Enterprise version of MongoDB and put in on a box. You then proceed to peg out the CPU doing transaction after transaction to the database… The next week, you go on vacation, and leave MongoDB running without doing a thing. How much did MongoDB cost in both weeks? The same.

Mainframes software is licensed much different. Lets say you buy your mainframe for a couple million and buy a database product for it. You then spend all week pegging the CPU(s) with database requests. You check your mail, and you now have a million dollar bill from the database vendor. Wait, I bought the hardware, why am I paying another bill? The software on a mainframe is often billed by usage, or how many CPU cycles you spend using it. If you spend 2,000,000 cpu cycles running the database, you will end up owing the vendor $2mil. Bizzare? Absolutely!

These invisible industries you utilize every day are full of bloat, legacy systems, and high costs. Java set out to conquer many fronts, and while it thoroughly took over the web application arena, it fizzled out in centralized computing. These industries are ripe for reducing costs and becoming more efficient, but honestly, we’re embarrassing ourselves. These industries stick with their legacy systems because they don’t think Ruby, Python, Scala, Lua, PHP, Java could possibly handle the ‘load’, scalability, or uptime requirements that their legacy systems provide. This is so far from the truth, but again, there has been 0 innovation in the arenas in the last 15 years, despite the progress of web technology making galaxy-sized leaps.

So next week someone will invent another DSL that makes Twitter easier to use, but your bank will be writing new COBOL to more efficiently transfer funds to another Bank. We’re embarrassing ourselves with our petty arguments. There is an entire economy that needs to see the benefits of distributed computing, but if the friendly fire continues, we’ll all lose. Lest stop these ridiculous arguments, pass the torch peacefully, and conquer some of these behemoths!

While trying to think of something to write for a first post, I noticed that by default, WordPress creates a post for you called “Hello World.” For those that are coders, this phrase probably invokes memories of your first computer program that ran. For those that aren’t, for some reason, this phrase has been traditionally used as the output of a new coder’s first program.

If you Google for “Hello World!”, you’ll get hundreds of results relating from setting up a OrientDB cluster, to creating your first Scala program. With the explosion of dynamic languages in the last few years, “How do I create a Hello World” has reached almost meme status. The phrase is so pervasive, finding the true origin has lead me across the internet until finally I came to the entry in Wikipedia which states, “[Hello World] was inherited from a 1974 Bell Laboratories internal memorandum by Brian KernighanProgramming in C: A Tutorial, which contains the first known version.”

It is interesting how such a simple program has evolved into a standard right of passage for new languages. I think back to the first time I saw that phrase in a C programming class in college, and thinking how strange it was that the professor chose that phrase. Later when I reported it to a coworker, he said, “Yep, that’s everyone’s first program. It’s so simple, it’s only downhill from here…” Almost 10 years later, I found my answer.

My name is Jonathan, this is my Hello World! Thanks for reading.