Archive for November, 2005

Hyperthreading CPUs and User Experience

Tuesday, November 29th, 2005

Brian has an article on his blog about Hyperthreaded CPUs and their effects on “the user experience”, by which I’m sure he means the typical response on a graphical desktop to a user’s actions — something like moving the mouse, dragging a window, opening up a menu, etc.

I disagree with a few of his assertions… namely that HT itself is responsible for improving the user experience. For example, if you have a single (and non-HT) processor and you run some CPU-intensive process (such as a compiler, a complex graphical manipulation that doesn’t take advantage of your graphics processor, or a poorly written program that runs away with your CPU in a tight loop), that process is going to eat cycles that would otherwise be used to redraw your mouse pointer (hardware-drawn cursors went away with Microsoft Windows 3.1), draw the menus in your spreadsheet program, or drag your windows around the screen. This makes the responsiveness of your graphical desktop seem sluggish.

The reason this happens is that CPUs can only do one thing at a time. Fortunately, they typically do things reeeeeally fast, so you don’t notice that it’s only doing that one thing at a time because it switches tasks and does a little bit of work here and a little bit of work there, and it magically looks like everything is getting done “all at the same time”.

With HT, the CPU itself can actually do more than one thing at a time. Sure, the CPU still does that frenzied-switcheroo dance, except that it can — ostensibly, anyway — do work on two whole tasks at once. Brian mentions that HT isn’t as nice as actually having two equally-fast processors, but let’s ignore that fact for the moment.

I assert that the responsiveness of the graphical desktop has more to do with the way that the desktop functions than the way the CPU works. Evidence? Compare any version of Microsoft Windows with a similar machine running Linux and any one of the graphical desktops that run atop it. When you launch a program under Microsoft Windows, you get an hourglass mouse pointer, the computer churns for a while, and the program window eventually opens. The next time you do that, move the mouse around… try to open another application…. try to drag another window around. For the most part, your desktop will respond quite favorably. The mouse cursor will smoothly follow your hand motions, the windows will redraw, and the second application will also eventually open.

My experience with Linux is not the same. If I open an application, the mouse cursor immediately starts jerking around and loses its smoothness. With the mouse jerking around, the windows jerk around as well. Other apps will start, of course, but it’s really things like dropping menus down and moving the mouse that people notice.

Note to Linux zealots: I totally love Linux. I run it on everything except the computer that I use as my primary desktop, mostly because of games that I want to play. Yeah, Wine just doesn’t work for me. Get over it.

Anyway, these observations lead me to believe that Microsoft Windows, no doubt through some kind of unholy voodoo, has taken great pains to schedule the user interface at the highest possible priority. Linux, in typical pragmatic style, has chosen not to hijack the CPU for such trivial details as turning your mouse pointer into the Energizer Bunny.

As for Brian’s compiler running in a virtual machine, it’s a shame that VMware doesn’t properly expose both of the processors available to Microsoft Windows to the OS running in the virtual machine. I would expect that a decent virtualization environment would allow you to set the number of CPUs to expose to the guest OS. I would have expected his Gentoo compile to be able to peg both of his virtual CPUs.

But back to CPU utilization versus user experience. I would bet that if he were using a threaded compiler (which almost doesn’t make any sense) directly in his Microsoft Windows environment, and compiling the same code (or at least performing a compile that was equally CPU-intensive), then both HT CPUs would be pegged, and he’d still be able to move the mouse around, click on things (with a slight delay), etc.

I think it comes down to scheduling. Your OS can always interrupt your compiler for any reason. Your compiler (well, really your VM) is probably scheduled in “normal” mode, whatever that means for your OS. I’m willing to bet dollars-to-doughnuts (mmmm… doughnuts) that Microsoft Windows’s graphical shell itself (explorer?) is scheduled at a higher-than-normal priority, or that all the UI calls that it makes are either running at the kernel level (which wouldn’t surprise me one bit for MS Windows, honestly) or at a higher-than-normal priority. It’s not the CPU, it’s the scheduler.

There are a lot of folks out there that say that HT is actually hurting performance. I haven’t read any of them, ’cause I’m honestly not that interested in looking at those numbers. After reading the ARS article linked above a few years ago, I thought that some really smart dudes got really high one day and had themselves a fantastic idea. I figured that it wasn’t as cool as the hype would suggest, but hey… why not squeeze as much out of the CPU as you possibly can? My gut reaction is that you can find data to either support or deride HT technology. I do know one thing: lots of Java developers were complaining in the past that HT CPUs would crash all the time with very strange errors, and turning off HT would solve their problems. >shrug<. You gotta do what you gotta do. Too bad those folks paid extra for their super-sexy HT processors.

I had a friend at Rose-Hulman that used to play Unreal Tournament with a couple of friends and me. He had just gotten a dual-CPU machine and decided to play with it a little: he created a dedicated server and set the processor affinity to his second processor (i.e. not the primary one). Then, he started UT in client mode so he could play it, and set the processor affinity for that process to the primary CPU. I’m not sure if it really made any difference compared to just running them both with no tweaks, but it was an interesting idea.

When I heard that he had done that, I decided that since the OS itself actually needs very little CPU time to do its stuff, an OS that could monopolize a considerably lower-powered processor and then schedule all user tasks on a much higher-powered processor would be great. Super-fast memory allocation (not that it’s particularly slow in the first place), buffer management, DMA, etc. For most OSs, this also means that the various hardware drivers would run on a CPU that wasn’t being used for applications. That would speed up graphics processing, since even a computer with the latest monster GPU still needs the graphics driver to actually send the data to the GPU to do the work.

Who knows. Maybe someone will steal my idea and make a jillion dollars. That would really suck for me.

Character Assassination

Friday, November 18th, 2005

At the dawn of (computer) time, someone decided that computers being able to deal with letters as well as numbers would be a great idea. And it turned out to be a big ‘ole mess.

The problem is that you have to decide how to encode these letters (or characters) into numbers, which is the only thing that computers can handle. EBCDIC and ASCII were two of the first, and while EBCDIC has effectively died, ASCII has turned into a few (relatively compatible) standards such as US-ASCII and ISO-8859-1 (also called “Latin-1”). These jumbles of letters are called character sets, and they describe how to take the concept of a letter and turn it into one or more 8-bit bytes for processing within the computer.

One of the most flexible character sets is called UTF-8, which packs characters efficiently by using only the minimum number of bytes necessary. For example, there are jillions of characters out there in human language if you take into account written languages like Chinese, Sanskrit, etc. We would need many bytes to represent all character possibilities (maybe 4 or 5), but UTF-8 has a trick up its sleeve that helps reduce the number of bytes taken up by common (read: Latin-1) characters. It’s also completely backward-compatible with ASCII, which makes it super-handy to use in places where ASCII was already being used and it’s time to add support for international characters.
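
Just to make the “minimum necessary” bit concrete, here’s a tiny sketch (my own illustration, not something out of the specs) that prints how many bytes a few characters take up:

import java.io.UnsupportedEncodingException;

public class Utf8SizeDemo
{
    public static void main(String[] args)
        throws UnsupportedEncodingException
    {
        // Plain ASCII: a single byte in ASCII, ISO-8859-1, and UTF-8 alike.
        System.out.println("A".getBytes("UTF-8").length);           // 1

        // "¡" (U+00A1): one byte in ISO-8859-1, but two bytes in UTF-8.
        System.out.println("\u00a1".getBytes("ISO-8859-1").length); // 1
        System.out.println("\u00a1".getBytes("UTF-8").length);      // 2

        // A Chinese character (U+4E2D): three bytes in UTF-8.
        System.out.println("\u4e2d".getBytes("UTF-8").length);      // 3
    }
}

The ASCII character stays a single byte, which is exactly why UTF-8 can sneak into systems that were only ever expecting ASCII.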

Now that the history lesson is over, it’s time to complain.

I’m writing an application in the Java programming language, which is generally highly touted as having excellent internationalization (or i18n) support: it has encoding and decoding capability for a number of different character sets (ASCII, UTF-8, Big5, Shift_JIS, any number of ISO-xyz-pdq encodings, etc.), natively uses Unicode (actually UTF-16, which is a specific encoding of Unicode), and has some really sexy ways to localize content (that’s the process of managing translations of your stuff into non-native languages, such as Spanish being non-native to me, an English speaker).
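
The localization piece, just to illustrate, boils down to ResourceBundle and Locale. The Messages bundle names and the “welcome” key below are made up for the sake of the example, not anything from my application:

import java.util.Locale;
import java.util.ResourceBundle;

public class LocalizationDemo
{
    public static void main(String[] args)
    {
        // Assumes hypothetical Messages.properties and Messages_es.properties
        // files on the classpath, each defining a "welcome" key.
        ResourceBundle english = ResourceBundle.getBundle("Messages", Locale.ENGLISH);
        ResourceBundle spanish = ResourceBundle.getBundle("Messages", new Locale("es"));

        System.out.println(english.getString("welcome")); // e.g. "Welcome!"
        System.out.println(spanish.getString("welcome")); // e.g. "¡Bienvenidos!"
    }
}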

I was trying to do something very simple: get my application to accept a “funny” (or “international” or non-Latin-1… I’ll just say “funny”, since I don’t use those characters very often) character. I love the Spanish use of open-exclamation and open-question characters. They’re upside-down versions of ! and ? and precede questions and exclamations. It makes sense when you think about it. Anyhow, I was trying to take the string “¡Bienvenidos!”, put it into my database, and get it back out successfully, using a web browser as the client and my own software to move the data back and forth.

It wasn’t working. Repeated submissions/views/re-submissions were resulting in additional characters being inserted before the “¡”. Funny stuff that I had clearly not entered.
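
My best guess at the mechanism, sketched here purely for illustration: the two UTF-8 bytes of “¡” were being decoded as ISO-8859-1 somewhere along the way, and every submit/view round trip piled another stray character onto the front:

import java.io.UnsupportedEncodingException;

public class MangleDemo
{
    public static void main(String[] args)
        throws UnsupportedEncodingException
    {
        String original = "\u00a1Bienvenidos!";

        // Encode as UTF-8 (what the browser sends), then decode as
        // ISO-8859-1 (what a confused server might assume). The two
        // UTF-8 bytes of "¡" turn into two separate Latin-1 characters.
        String once = new String(original.getBytes("UTF-8"), "ISO-8859-1");
        System.out.println(once);  // Â¡Bienvenidos!

        // Run the same round trip again and even more junk piles up in
        // front of "Bienvenidos!", just like my repeated submissions.
        String twice = new String(once.getBytes("UTF-8"), "ISO-8859-1");
        System.out.println(twice);
    }
}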

I’ve done this before, but the mechanics are miserable and I pretty much block out the painful memories each time it happens.

The problem is that many pieces of code get their grubby little hands on the data between the time you type it on your keyboard and the time it gets into my database. Here is a short list of code that handles those characters, and where opportunities for cock-ups occur.

  • Keyboard controller. Your keyboard has to be able to “type” these characters correctly so that the operating system can read them. I can’t type a “¡” on my keyboard, so I need to take other steps.
  • Your operating system. MS-DOS in its default configuration in the US isn’t going to handle Kanji characters very well.
  • Your web browser. The browser has to take your characters and submit them in a request to the web server. Guess what? There’s a character encoding that is used in the request itself, which can complicate matters.
  • The web server, which may or may not perform any interpretation of the bytes being sent from the web browser.
  • The application server, which provides the code necessary to convert incoming request data into Java strings.
  • My database driver, which shuttles data back and forth between Java and the database server.
  • The database itself, which has to store strings and retrieve them.

I can pretty much absolve the keyboard and operating system at this point. If I can see the “¡” on the screen, I’m pretty happy. I can also be reasonably sure that the web browser knows what character I’m talking about, since it’s being displayed in the text area where I’m entering this stuff. My web server is actually ignoring request content and just piping it through to my app server. The database and driver should be okay, as I have specified that I want UTF-8 to be used both as the storage format of characters in the database, and for communication between the Java database driver and the database server.
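
(For the record, the driver half of that configuration is just a couple of options on the connection URL. The snippet below uses MySQL Connector/J option names as an example; other drivers spell this differently, so treat it as a sketch rather than my exact setup.)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectionDemo
{
    public static void main(String[] args)
        throws ClassNotFoundException, SQLException
    {
        // Pre-JDBC-4 drivers have to be loaded explicitly.
        Class.forName("com.mysql.jdbc.Driver");

        // useUnicode/characterEncoding tell the driver to talk to the
        // database server in UTF-8.
        String url = "jdbc:mysql://localhost/mydb"
                     + "?useUnicode=true&characterEncoding=UTF-8";

        Connection conn = DriverManager.getConnection(url, "user", "password");
        conn.close();
    }
}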

That leaves 2 possibilities: the request itself (made by the web browser) or the application server (converts bytes into Java strings).

The first step in determining the problem is research: what happens when the web browser submits the form, and how is it accepted and converted into a Java string?

  1. The web browser creates a request by converting all the data in a form into bytes. It does this by using the content-type “application/x-www-form-urlencoded” and some character encoding. You can ignore the content-type for now.
  2. The request is sent to the server.
  3. The application uses the ServletRequest.getParameter method to get a String value for a request parameter.
  4. The application server reads the parameter out of the request using some character encoding, and converts it into a String.
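
Step 3, from my code’s point of view, is nothing more than the sketch below (“greeting” is a made-up parameter name, and this isn’t my actual servlet); the interesting part is that the charset decision from step 4 happens invisibly inside that one getParameter call:

import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class GreetingServlet
    extends HttpServlet
{
    protected void doPost(HttpServletRequest request,
                          HttpServletResponse response)
        throws ServletException, IOException
    {
        // The container decodes the raw request bytes into a String right
        // here, using whatever charset it has settled on; my code never
        // sees the bytes themselves.
        String greeting = request.getParameter("greeting");

        response.setContentType("text/html; charset=UTF-8");
        response.getWriter().println(greeting);
    }
}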

So, it looks like the possibilities for confusion are where the character sets are chosen. The W3C says that <form> elements can specify their preferred character set by using the accept-charset attribute. The default value for that attribute is “UNKNOWN”, which means that the browser is free to choose an arbitrary character set. A semi-tacit recommendation is that the browser use the character encoding that was used to provide the form (i.e. the charset of the current page) as the charset for the request.

That seems relatively straightforward. My responses are currently using UTF-8 as their only charset, so the forms ought to be submitted as UTF-8. Perfect! “¡” ought to be transmitted in UTF-8 format and go straight through to my database without ever being mangled. Since this wasn’t happening, there was obviously a problem. What character set *was* the browser using? A quick debug log message ought to help:

DEBUG - request charset=null 

Uh, oh. A null charset means that the app server has to do some of its own thinking, and that usually spells trouble.

Time to take a look at the ‘ole API specification. First stop, ServletRequest.getParameter(), which is the first place my code gets a crack at reading data. There’s no mention of charsets, but it does mention that if you’re using POST (which I am), that calling getInputStream or getReader before calling getParameter might cause problems. That’s a tip-off that one of those methods gets called in order to read the parameter values themselves. Since InputStreams don’t care about character sets (they deal directly with bytes), I can ignore that one. ServletRequest.getReader() claims to throw UnsupportedEncodingException if the encoding is (duh) unsupported, so it must be applying the encoding itself. There is no indication of how the API determines the charset to use.
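
Here’s a little sketch of the failure mode that warning hints at (an illustration, not code from my application): if a servlet drains the request body itself, the container has nothing left to parse into parameters.

import java.io.BufferedReader;
import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class GreedyServlet
    extends HttpServlet
{
    protected void doPost(HttpServletRequest request,
                          HttpServletResponse response)
        throws ServletException, IOException
    {
        // Reading the body directly consumes the POST data...
        BufferedReader reader = request.getReader();
        while(null != reader.readLine())
        {
            // ...and throws it away, line by line.
        }

        // ...so there is nothing left for the container to decode into
        // parameters; on most containers this now comes back null.
        String greeting = request.getParameter("greeting");
        response.getWriter().println("greeting=" + greeting);
    }
}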

The HTTP specification has a header field which can be used to communicate the charset to be used to decode the request. The header is “content-type”, and has the form: “Content-Type: major/minor; charset=[charset]”. I already mentioned that the content-type of a form submission was “application/x-www-form-urlencoded”, so I should expect something like “Content-Type: application/x-www-form-urlencoded; charset=UTF-8” to be included in the headers from the browser. Let’s have a look:

DEBUG - Header['host']=[deleted]
DEBUG - Header['user-agent']=Mozilla/5.0 [etc...]
DEBUG - Header['accept']=text/xml, [etc...]
DEBUG - Header['accept-language']=en-us,en;q=0.5
DEBUG - Header['accept-encoding']=gzip,deflate
DEBUG - Header['accept-charset']=ISO-8859-1,utf-8;q=0.7,*;q=0.7
DEBUG - Header['keep-alive']=300
DEBUG - Header['connection']=keep-alive
DEBUG - Header['referer']=[deleted]
DEBUG - Header['cookie']=JSESSIONID=[deleted]
DEBUG - Header['content-type']=application/x-www-form-urlencoded
DEBUG RequestDumper- Header['content-length']=121

Huh? The Content-Type line doesn’t contain a charset. That means that the application server is free to choose one arbitrarily. Again, the unspecified charset comes back to haunt me.

So, the implication is that the web browser is submitting the form using UTF-8, but that the app server is choosing its own character set. Since things aren’t working, I’m assuming that it’s choosing incorrectly. Since the Servlet spec doesn’t say what to do in the absence of a charset in the request, only reading the code can help you figure out what’s going on. Unfortunately, Tomcat’s code is so byzantine, you don’t get very far into the request wrapping and facade classes before you go crazy.

So, you try other things. Maybe the app server is using the default file encoding for the environment (it happens to be “ANSI_X3.4-1968” for me). Setting the “file.encoding” system property changes the file encoding used in the system, so I tried that. No change. The last-ditch effort was to simply smack the request into submission by explicitly setting the character encoding in the request if none was provided by the client (in this case, the browser).

The best way to do this is with a servlet filter, which gets ahold of the request before it is processed by any servlet. I simply check for a null charset and set it to UTF-8 if it’s missing.

import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class EncodingFilter
    implements Filter
{
    public static final String DEFAULT_ENCODING = "UTF-8";

    private String _encoding;

    /**
     * Called by the servlet container to indicate to a filter that it is
     * being put into service.
     *
     * @param config The Filter configuration.
     */
    public void init(FilterConfig config)
    {
        _encoding = config.getInitParameter("encoding");
        if(null == _encoding)
            _encoding = DEFAULT_ENCODING;
    }

    protected String getDefaultEncoding()
    {
        return _encoding;
    }

    /**
     * Performs the filtering operation provided by this filter.
     *
     * This filter performs the following:
     *
     * Sets the character encoding on the request to that specified in the
     * init parameters, but only if the request does not already have
     * a specified encoding.
     *
     * @param request The request being made to the server.
     * @param response The response object prepared for the client.
     * @param chain The chain of filters providing request services.
     */
    public void doFilter(ServletRequest request,
                         ServletResponse response,
                         FilterChain chain)
        throws IOException, ServletException
    {
        request.setCharacterEncoding(getCharacterEncoding(request));

        chain.doFilter(request, response);
    }

    protected String getCharacterEncoding(ServletRequest request)
    {
        String charset = request.getCharacterEncoding();

        if(null == charset)
            return this.getDefaultEncoding();
        else
            return charset;
    }

    /**
     * Called by the servlet container to indicate that a filter is being
     * taken out of service.
     */
    public void destroy()
    {
    }
}

This filter has been written before: at least here and here.

It turns out that adding this filter solves the problem. It’s very odd that browsers are not notifying the server about the charset they used to encode their requests. Remember the “accept-charset” attribute from the HTML <form> element? If you specify that to be “ISO-8859-1”, Mozilla Firefox will happily submit using ISO-8859-1 and not tell the server which encoding was used. Same thing with Microsoft Internet Explorer.

I can understand why the browser might choose not to include the charset in the content type header because the server ought to “know” what to expect, since the browser is likely to re-use the charset from the page containing the form. But what if the form comes from one server and submits to another? Neither of these two browsers provides the charset if the form submits to a different page, so it’s not just an “optimization”… it’s an oversight.

There’s actually a bug in Mozilla related to this. Unfortunately, the fix for it was removed because of incompatibilities that the addition of the charset to the content type was causing. Since Mozilla doesn’t want to get the reputation that their browser doesn’t work very well, they decided to drop the charset. :(

The bottom line is that, due to some bad implementations out there that ruin things for everyone, I’m forced to use this awful forced-encoding hack. Fortunately, it “degrades” nicely if and when browsers start enforcing the HTTP specification a little better. My interpretation is that “old” implementations always expect ISO-8859-1 and can’t handle the “charset” portion of the header. Fine. But, if a browser is going to submit data in any format other than ISO-8859-1, then they should include the charset in the header. It’s the only thing that makes sense.