Encoding vs. Internationalization
I received an email today that got me thinking: a question from an ex-coworker about a system I worked on 7 years ago. The question was about how to write out data in a specific encoding. Not to give her a hard time, but this subject has come up quite a bit over the years, and it seems that people struggle with the easy part of internationalization: simply getting the data to appear on screen in the proper character set.
This is very straightforward, especially with newer browsers and OSes. In C# and Java, text in memory is always in a flavor of Unicode (UTF-16), and you use a few well-named methods to write that text out in whatever encoding you want. On the reading side, the runtime either works out the input encoding for you (from a byte-order mark, say, or a platform default) or you tell it what to expect (by far the rarer case).
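As a minimal sketch of the "few well-named methods" in Java: the class and method names below (EncodingDemo, encode, decode) are made up for illustration, while String.getBytes(Charset) and new String(byte[], Charset) are the standard library calls doing the work.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Turn in-memory text (UTF-16 inside the JVM) into bytes in the requested encoding.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Turn bytes back into text; the caller states which encoding to expect.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "Gruesse" with u-umlaut and sharp s, written with escapes so the
        // source file's own encoding does not matter.
        String text = "Gr\u00fc\u00dfe";

        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        byte[] latin1 = encode(text, StandardCharsets.ISO_8859_1);

        System.out.println(utf8.length);   // 7: the two non-ASCII letters take 2 bytes each
        System.out.println(latin1.length); // 5: one byte per character

        // Decoding with the matching charset restores the original text.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(text));         // true
        // Decoding with the wrong charset is what produces garbled text on screen.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals(text));    // false
    }
}
```

The same pair of calls works for any charset the JDK supports; only the Charset argument changes.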
It seems to me that people struggle with making text appear properly, then give up before heading into the real substance of internationalization: number formats, dates, currencies, right-to-left text, regional differences in UI expectations, and, perhaps most unfortunately, how to manage content in multiple languages. There are too many apps out there built on content databases that have columns such as TextInLanguage1, TextInLanguage2, TextInLanguage3, etc. This design assumes that one instance of the database will never serve more than a handful of languages, which is very limiting for multinational companies.
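To make the number, currency, and date part concrete, here is a small Java sketch. LocaleFormatDemo and its method names are hypothetical; NumberFormat, DateTimeFormatter, and Locale are the standard classes. The exact strings printed (especially for currencies and dates) can vary between JDK versions, so treat the output as illustrative.

```java
import java.text.NumberFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

class LocaleFormatDemo {
    // Plain number with the locale's grouping and decimal separators.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Amount with the locale's currency symbol and placement.
    static String formatCurrency(double value, Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(value);
    }

    // Date spelled out in the locale's long style.
    static String formatDate(LocalDate date, Locale locale) {
        return DateTimeFormatter.ofLocalizedDate(FormatStyle.LONG)
                .withLocale(locale)
                .format(date);
    }

    public static void main(String[] args) {
        double amount = 1234567.891;
        LocalDate date = LocalDate.of(2006, 3, 4);
        Locale[] locales = { Locale.US, Locale.GERMANY, Locale.JAPAN };

        // The same values, rendered three different ways with no custom code.
        for (Locale locale : locales) {
            System.out.println(locale + ": "
                    + formatNumber(amount, locale) + " | "
                    + formatCurrency(amount, locale) + " | "
                    + formatDate(date, locale));
        }
    }
}
```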
Modern languages provide nearly all of the facilities necessary to handle internationalization; usually the only work left is to rework portions for scalability or flexibility. Java’s property bundles are a good example: they work great for most desktop apps right out of the box, and only need improvement if you want the on-screen content to be easily editable by business users. In that case you put it in a database, using proper organization rather than the kind I mention above: keep each value as a separate record, keyed by a content code of some sort and a language code.
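The record-per-value layout can be sketched in memory as follows. This is only an illustration of the keying scheme: ContentStore and its methods are hypothetical names, an in-memory map stands in for the database table, and the fallback to a default language is an assumption added here, not something the layout requires.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class ContentStore {
    // One entry per (content code, language code) pair, like one row per pair in a table.
    private final Map<String, String> rows = new HashMap<>();
    private final String defaultLanguage;

    ContentStore(String defaultLanguage) {
        this.defaultLanguage = defaultLanguage;
    }

    // Composite key; the NUL separator cannot collide with ordinary codes.
    private static String key(String contentCode, String languageCode) {
        return contentCode + "\u0000" + languageCode;
    }

    void put(String contentCode, String languageCode, String text) {
        rows.put(key(contentCode, languageCode), text);
    }

    // Look up the requested language first, then fall back to the default language
    // (the fallback is an assumption of this sketch).
    Optional<String> lookup(String contentCode, String languageCode) {
        String text = rows.get(key(contentCode, languageCode));
        if (text == null) {
            text = rows.get(key(contentCode, defaultLanguage));
        }
        return Optional.ofNullable(text);
    }

    public static void main(String[] args) {
        ContentStore store = new ContentStore("en");
        store.put("greeting", "en", "Hello");
        store.put("greeting", "de", "Hallo");

        System.out.println(store.lookup("greeting", "de").orElse("?")); // Hallo
        System.out.println(store.lookup("greeting", "fr").orElse("?")); // Hello (fallback)
    }
}
```

Adding a language means adding records, not columns, which is the point of the layout.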
The one thing I see lacking is a resource showing the mappings for all known encodings: a place to look up the byte values for all characters in Cp273, for example. The maps are spread across the internet and far from complete; for some encodings you literally have to order reprints of 20-year-old manuals from IBM.
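For encodings that a Java runtime happens to ship, one workaround is to ask the runtime itself for the table. The sketch below (CodePageDump is a made-up name) decodes each of the 256 byte values of a single-byte encoding and prints the resulting Unicode code point. It assumes the JDK in use knows the encoding under the given name; "IBM273" is the name Java typically uses for Cp273, but whether it is installed depends on the JDK, so the code checks first.

```java
import java.nio.charset.Charset;

class CodePageDump {
    // For a single-byte encoding, return the Unicode code point each byte value decodes to.
    // Bytes the charset cannot decode come back as U+FFFD (the replacement character).
    static int[] mapping(String charsetName) {
        Charset charset = Charset.forName(charsetName);
        int[] codePoints = new int[256];
        for (int b = 0; b < 256; b++) {
            String decoded = new String(new byte[] { (byte) b }, charset);
            codePoints[b] = decoded.isEmpty() ? 0xFFFD : decoded.codePointAt(0);
        }
        return codePoints;
    }

    public static void main(String[] args) {
        // Assumed default name; pass another charset name as the first argument.
        String name = args.length > 0 ? args[0] : "IBM273";
        if (!Charset.isSupported(name)) {
            System.err.println("This JDK does not provide charset " + name);
            return;
        }
        int[] table = mapping(name);
        for (int b = 0; b < table.length; b++) {
            System.out.printf("0x%02X -> U+%04X%n", b, table[b]);
        }
    }
}
```

This only covers what the runtime vendor chose to include, so it does not replace the missing reference, but it does save a trip to the manuals for the common code pages.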
SQL Server and… Scaling?
The new word is that Microsoft’s SQL Server 2005 is awesome, bringing more performance and features to the table than prior versions. It has been out for a while, but because of some backwards compatibility issues I have not yet had a chance to kick the tires much. Like any new car buyer, though, I keep circling it to see whether I actually want to invest in it. It has almost everything going for it, primarily that it is relatively cheap, yet I hold back. Where is the grid computing feature? What if I want to be reassured that I can scale this thing to meet not some faraway potential demand, but the demand that I see looming just months or quarters away? What do I do, buy a bigger box at that time? Thanks but no thanks; my other vendors don’t support the really big Wintel hardware such as Itanium. Why can’t Microsoft take a hint from Oracle circa 2000 and implement grid computing? How can I possibly condone a database management system that doesn’t have a clear scaling path at this point in time and technology? Don’t misunderstand, this is not a complaint about SQL Server, it is a plea: Microsoft, take the current “enterprise” version and call it what it is, a mid-market version, then build us a truly enterprise version that can beat the pants off Oracle. Feel free to raise the price, I don’t care. Your Transact-SQL trap means I have to run on SQL Server, so you’ve got us already, just like Oracle has everyone who didn’t migrate out of the PL/SQL nonsense.
The Standards Ogre
As an architect, I feel it necessary to be a complete jerk about standards. On the surface this could be explained easily as “I get to make everyone do it the way I want them to do it”, a fun little power trip that never stops. The reality of course is nearly the opposite: enforcing standards is such an incredible pain that it is the worst part of my job. You have to be cold-hearted, obstinate, and annoying about standards just to have an even chance of them being followed. I’m not going to bother to explain the value of standards; in the Aristotelian sense you either get it or you don’t. Unfortunately, even the people who get it think standards are a great idea only until one prevents them from carrying out some crackpot idea.
Everyone is a special case, and the standards don’t apply to them. A person, a team, a department, even entire divisions feel exempt from standards for one reason or another. Why?
- The usual cause is vanity - the standards don’t apply to you. You are the genius hero of your personal work area, and you know a better way to do it than what the standard says. You found a cool new product that allows you to streamline a standard business process at your location, never mind that there are only five people in the world who know how to configure that product and the vendor will be out of business in 2 years.
- Next up - The standard would prevent you from being able to do your job efficiently, thus it couldn’t possibly apply to you. Change control processes only apply to the app dev geeks, not to your task of swapping out a NIC on a production server because that would mean you couldn’t do it until after the CCB meeting tomorrow. Not your fault you didn’t know that server was being used for a demo to a critical sales prospect.
- And the case for ignorance - You are exempt because you had no idea that the standard applied to the particular context that you work within. Yes, the data security standards apply even though you’re only sending out that sensitive data once via email, just for testing purposes.
And so on. The point of this post is not to complain about these people (well, not entirely); it is to say that we must not give up the fight for standards. When you have a standard, you make it stick - that is part of your job. You always evaluate the crackpipe ideas people come up with to go against your standard, because it may be time to change the standard, but you make sure that you are involved and that you are the gatekeeper. “You” here means, in the broad sense, the architectural group within your organization. In my experience, the best way to do this is by gaining the respect of the key business owners. They often don’t really understand the reasons for the standards, but if they respect your ideas and how you do your job, they will keep the organization in line with your standards. You will never win against the genius heroes of your organization, not directly. Not that I’ve seen anyway. They need to be kept in line by their bosses, the ones who respect you. Those people know very well that they won’t be around to pick up the mess they made.
Securing Personal or Sensitive Data of Customers or Employees
One issue I have to deal with on a weekly basis is that of ensuring the security of our customers’ personal data. As a modern company, we have to exchange information with all sorts of other companies, either giving them customer data or employee data. For example, health care and benefits administration is often done by a 3rd party who provides a website for the employees to use and takes care of sending updates to the actual health care companies. This requires that some very sensitive and personal information be sent to the 3rd party handling the benefits administration. My job - to make sure that they are securing it appropriately. Well, it is my job because we are currently short a chief security officer type.
This does come up constantly, and I am sure that every other company out there is either also dealing with it or ignoring the matter (at great risk). So what really is the issue here? Whether customer or employee, we have their personal information and need to care for it responsibly. This includes making sure it is not used or accessed inappropriately anywhere, whether within our network or outside of it. This is no joke or half-hearted dusty policy; as with any public company within the US, we would take an enormous hit if this data were exposed in any way. The public image would be shattered, and the costs of notifying and preparing the victims are very high (estimated at $50 or more per victim, generally). So I have to take this very seriously. At the very least, my personal information is going out along with everyone else’s.
What I have seen is that this is quite a difficult task. Assume an ideal situation, you can have whatever you want, and solve the problem: your data is in someone else’s hands, and it needs to be so difficult to steal that it is not worth the effort. The way I approach the problem is to break it into two categories: securing the data in transmission, and securing the data while it is at rest.
Securing the Data in Transmission
This is really the easier of the two categories. What do you have to worry about while the data is moving between you and the other company (or between them and yet another company)?
- The data being seen and recorded (stolen) while en route
- The data being misdirected and routed to the wrong destination
- The data being altered while in transit
There are other concerns I’m sure, but these are the key ones that I look at. In each case, rather than trying to directly prevent the theft itself, it is easier to encrypt the data so that it is useless to anyone who does manage to steal it, and so that alteration in transit becomes obvious. For encryption I would use PGP to encrypt the files and then Secure FTP (SSH2 based) to move the files over the internet. The SFTP point-to-point encryption does not add any further protection on top of PGP, but I use SFTP for all transfers, not just those of sensitive data, so it is just my standard. Another reason to use both is that often other companies can support one or the other, but not both, so having SFTP and PGP mechanisms ready and in place means I can deal with nearly any company safely. It is important to note that SFTP negotiates an encryption method during the handshake, so you must look carefully at which methods you are comfortable using based on your security requirements, and disallow all others.
So far I’ve been talking about moving files. For RPC-style communication, SSL works fine to protect the data, again assuming you have high levels of encryption enabled so that the protocol negotiates to use them, and have deactivated the weaker methods.
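Deactivating the weaker methods can be sketched in Java with an explicit allow-list, so that negotiation can only land on something you have approved. StrictTls and its methods are hypothetical names, and the particular allow-list (TLSv1.3 and TLSv1.2) is an assumption reflecting current practice rather than anything specific to this post; SSLContext and SSLParameters are the standard JSSE classes.

```java
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;

class StrictTls {
    // Assumed allow-list; adjust to your own security requirements.
    static final List<String> ALLOWED = Arrays.asList("TLSv1.3", "TLSv1.2");

    // Keep only the protocol versions that are on the allow-list.
    static String[] allowedProtocols(String[] supported) {
        List<String> keep = new ArrayList<>();
        for (String protocol : supported) {
            if (ALLOWED.contains(protocol)) {
                keep.add(protocol);
            }
        }
        return keep.toArray(new String[0]);
    }

    // Start from what this JVM supports and strip everything not explicitly allowed.
    static SSLParameters strictParameters() throws NoSuchAlgorithmException {
        SSLParameters params = SSLContext.getDefault().getSupportedSSLParameters();
        params.setProtocols(allowedProtocols(params.getProtocols()));
        return params;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        SSLParameters params = strictParameters();
        System.out.println(Arrays.toString(params.getProtocols()));
    }
}
```

The resulting parameters would then be applied to a connection, for example with SSLSocket.setSSLParameters. The same allow-list idea extends to cipher suites via SSLParameters.setCipherSuites.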
The data is still effectively moving after it arrives at the partner’s network, usually sitting temporarily on a server waiting to be processed (and I make sure that the partner leaves it PGP encrypted until processing begins) or going directly into processing. While being processed, I have no concerns about the data being stolen - it is in memory and is definitely accessible, but I see no practical way to protect it there. The data comes to rest when it is written into a local file by the processing, or pushed into a database.
Securing the Data at Rest
This is the frustrating problem. It is common knowledge that the majority of security breaches occur from the inside of a company, either by current or former employees. I’d love to see some hard stats behind this, but since I’ve seen it so many times in so many places, I assume there is some truth to it. The case I’m about to describe is what I typically see at companies I deal with, before we impose our requirements on them:
The company very carefully protects the data while in transit, and then leaves it in plaintext within a database. The database server is in a secure network segment, secured using firewalls and careful access policies. Only production DBAs have logins to the database system, no developers or business people. The applications used by the company connect to the database and work off of the data.
So what is the problem with this? As much as they claim otherwise, people do have direct access to those databases. I was a developer, I know how it works. The logins to the database are either part of the codebase, in source control alongside the code, or in config files or screens on an application server. It takes a highly structured production control process to correctly secure that login information - I have never heard of it being done properly. Now I don’t suspect developers of stealing data generally, but what does happen is they export it for use in their own testing. Or the DBAs themselves take a copy of production and load it into a test or development database to ensure some new code really works. Or the team builds a demo for customers and uses a copy of the data - scary, but I have seen this happen more than once: a company’s sensitive data exposed to other companies because the partner happened to use it as data for a demo. The real problem is that production data does end up in non-production systems, which are far more accessible to the company at large, and there anyone with ill intent can misuse it.
Locking down access to the database is difficult, so as with data in transit, I think the solution is to make the sensitive data useless to a thief (conscious or unconscious). My preference for doing this is via database encryption and security. Using Oracle 10g as an example, I can have the sensitive data reside in tables that are literally invisible to a user who does not match a strict policy (coming from IP address xyz, for example), and can have the data itself encrypted with the key stored on separate servers, so that getting at it requires a far more deliberate effort - and at that point it is difficult to do unconsciously as well. When I deal with partners who do not use this method, I make it a requirement that they either implement this sort of security and database encryption before we sign a contract with them, or that they work it into their short-term product roadmap and leave an out clause in the contract in case they don’t follow through on it (limiting our exposure).
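The Oracle features above are configured in the database itself. As a rough illustration of the same idea in a different place, here is an application-level sketch in Java that encrypts a single sensitive value with AES-GCM before it would be stored. FieldCrypto and its methods are hypothetical names, and the key is generated in-process only to keep the example self-contained; in the arrangement described above it would be fetched from a separate key server.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

class FieldCrypto {
    private static final int IV_BYTES = 12;   // recommended nonce size for GCM
    private static final int TAG_BITS = 128;  // authentication tag length
    private static final SecureRandom RANDOM = new SecureRandom();

    // Encrypt one value; the random IV is stored in front of the ciphertext.
    static byte[] encrypt(SecretKey key, String plaintext) throws GeneralSecurityException {
        byte[] iv = new byte[IV_BYTES];
        RANDOM.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ciphertext = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
        byte[] stored = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, stored, 0, iv.length);
        System.arraycopy(ciphertext, 0, stored, iv.length, ciphertext.length);
        return stored;
    }

    // Decrypt a stored value; fails with an exception if the data was altered.
    static String decrypt(SecretKey key, byte[] stored) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, stored, 0, IV_BYTES));
        byte[] plaintext = cipher.doFinal(stored, IV_BYTES, stored.length - IV_BYTES);
        return new String(plaintext, StandardCharsets.UTF_8);
    }

    // Convenience check used for demonstration: encrypt then decrypt with a fresh key.
    static boolean roundTrips(String text) {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            SecretKey key = generator.generateKey(); // stand-in for a key held elsewhere
            return text.equals(decrypt(key, encrypt(key, text)));
        } catch (GeneralSecurityException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrips("123-45-6789")); // true
    }
}
```

A copy of the table loaded into a test database, or used for a demo, then contains only ciphertext unless the key is deliberately copied along with it.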
I’ve had companies suggest some alternatives to database encryption. None of them are any good in terms of security, and they fall into 2 basic patterns: file system encryption and disk-level encryption. Encryption of the file system is appealing to companies because it is cheap, and it is transparent to the database and to users. Unfortunately, that is also its downfall: it is equally transparent to anyone reading data out of the database. Useless. Disk-level encryption has the same failing, and is also quite expensive. Both of those solutions are more appropriate for file servers, or for protecting data so it cannot be stolen off of backup tapes.
I have yet to come across a company that has done a thorough implementation of database encryption, and I have talked to enough of them about it that I begin to wonder what the problem really is. I think it is just that IT at these companies thinks it is too hard, or too complex. And so it will be in most cases where people are using old technology. Upgrading your database software is rarely easy.
Other Concerns
The other sorts of questions I go through with companies are aimed at seeing how they really use our data internally. Are they perhaps sending a feed of it off to someone who has a handy Access database application? Bad. Do they ever share it with partners of theirs with whom my company does not have a business relationship? Do they use it in demos? And so forth. In the really important cases, the VP of infrastructure and I will even go and look at their data center and their business operations, to judge how secure and organized they are.
I’d love to get your thoughts on how to deal with these issues!
How to Shoot Yourself in the Foot
So it was only a month ago that I wrote about project management, and of course I forgot to knock on wood, so I ended up having to take a week-long class on project management. Kidding aside, I looked forward to my class; I wanted to see what a true “project management professional” had to say on the subject. The class was taught by one of these people: he was from a large training corporation and PMI certified. In brief, it was an enlightening week. Aside from a few impractical assertions, the material was excellent, and if I had project managers who actually used those methods I would be overjoyed. The methods we learned about were logical and measurable; the fascinating part to me was just how measurable they were. And I don’t mean those misleading MS Project plans, I’m talking about ways to judge the actual state of a project while it is active, in terms of time and money. So the course showed that there are well-defined things that can be done to manage projects, but it also clearly pointed out that the majority of a PM’s work is actually social and involves all of the little steps required to make sure that the project moves forward as expected. A perfect example of this was a story about someone who notified an executive by email that a specific risky task was going to be executed on the following weekend. The task was carried out, and the executive was furious the following week. The mistake? The PM assumed this exec read and was caught up on email. The lesson here was to not rely on implied consent, but to demand a response (through your approach, not by being a bully, of course). This simple lesson would be incredibly useful if it were well known and used.
I can think of 4 critical and expensive mistakes made on projects I’ve seen in the last 3 months that would have been avoided if the PMs had not assumed implied consent through email (or through poor processes, such as “requirements are assumed to be acceptable as submitted unless challenged in writing within 10 days of being submitted”).
All of this fascinating information came out of the course, but I do wonder about its usability in a real environment. It is not that I question the practices myself, but that I know several PMI-certified PMs and none of them do anything in their jobs remotely like the methods PMI tests. They are the professional (and expensively certified) PMs so I can’t really challenge them; perhaps I should assume that they have just decided the PMI methods don’t work well in practice. I suspect that the PMI methods require a rigor from both IT and business units that is typically not possible: either the organizational structure is wrong for supporting best-practice project management, or the companies are unwilling to spend the money to support it.