Psychedelic Panorama of Foo

My name is Inigo Montoya. You killed my father. Prepare to die.

Sunday, August 03, 2008

 

Impressions of Git VCS

Hello. People have been asking a lot lately about git. Well, the questions have been about distributed VCS in general (git, mercurial, darcs, etc.). I figured that I would just answer some of that in this blog. Distributed VCS is becoming increasingly popular. So much so that companies owning PVCS, Perforce, and ClearCase are looking into distributed features for their products. Even Atlassian cannot ignore it. The shortcomings of traditional VCS are underscored by Fisheye.

Before getting started, the VCS I have used are SVN, CVS, SCCS, RCS, PVCS, CMVC, ClearCase, and git. Ones I definitely have no experience in are darcs, perforce, bitkeeper, and mercurial.

What is Distributed VCS?

You can get really semantic and technical here, but I'm going to keep it simple. Think of distributed VCS as P2P and your traditional VCS as client-server. That's about as simple as it gets. In fact, the advantages/disadvantages of both parallel those of P2P v. client-server architectures.

Compared to Traditional VCS (Pros and Cons)

I'm not going to separate these into pros/cons because depending on your situation, that can vary. You'll see what I mean.
  • Distributed bandwidth If you are on remote networks, the bandwidth can become distributed so that everyone is not pushing/pulling source from the same server, cluster of servers, or storage. If you're on the same network (an intranet, for example), this is less helpful because all of the traffic is still shared across that one network.
  • Security This is kind of a double-edged sword. Distributed or not, it can go either way for you.
    • Distributed
      When distributed, you don't have to worry about things like access control or user privileges. Everyone is treated the same and treated as though they are on a remote system. No one needs to have access to the main repository because the repository is actually here and there.

      Sounds like it's actually a good thing. There are still some logistical issues. You now have to work out how to distribute the changes. It's easy to get a repository started. Usually, you have at the very least read access to someone else's repository. You clone it. Ok, now what? What if I want to send someone changes? There are a number of ways to do this, and it's really up to you how you do that. I was recently on a project where some of us had write access to each other's repositories and others did not. That meant that some of us had to do things a little differently. The way it worked out is that we ended up mailing each other patches. We were a small group, so that worked out.

      I have also worked on a project where we distributed patches via RSS feed. I saw a project on Google Code that actually used SMTP-to-NNTP with git's git-send-email for delivering patches over email in order to publish patches via newsfeed. Mail works pretty well for me. I recently developed a procmail script that allows me to do the equivalent of SVN's update: it grabs all the changes that were not applied, and applies them automatically. Something like that would take a little more work for RSS or newsfeed.
    • Traditional
      With traditional VCS, there are a number of ways to host a repository securely. IMHO, the best way to do it is through SSH. Of course, SSH requires anyone that gets access to the repository to also have access to the machine. A number of system administration headaches stem from this.

      If the users get access through single sign-on access control, then your problems are actually probably lessened. It is likely that you can rely on your organization's identity management to handle this. Even if you are using LDAP or Kerberos, your life is muuuuch easier. It's when you decide to do something different that things get complicated. Suppose you decided to use local authn/authz. That means you are on your own for managing access control and user credentials. If you absolutely have to do this, I recommend using SSH and setting up a jail. That way you can be pretty liberal with access and just manage credentials.
  • Storage Normally, who cares about storage? Does a source code repository really take up that much space? No, not really, but it's a point that should be brought up.
    • Distributed
      Of course the amount of storage used is proportionate to the number of users. On the upside though we have a form of virtualized backup system. Since the amount of storage is proportionate to the number of users, the strength of the backup redundancy is also proportionate to the number of users. That's actually pretty cool. Get a VCS and backups all-in-one. If you're already getting backup virtualization like I am, this is really no gain at all.
    • Traditional
      Gets points for more central storage, but then your industrial projects will require redundancy and backups. This is stuff you really get from distributed VCS easily. With traditional VCS, you have to pay out a good deal of money for this.

Beyond Distributed or Traditional

Enough of the Distributed/Traditional blah. Let's talk about git and SVN. How do I think they match up outside of the other stuff?

Git

First, I'd like to declare that I like SVN better than git. There. I said it. My reasons are not because of distributed/traditional VCS though. My reasons are because of the user interface of git. I don't mean GUI either. I mean the command-line interface and the intuitive or non-intuitive nature of the commands. It may just be me because I come from using SVN for the last several years now.

Git also likes to store hashes of each file in its index. How is this a bad thing? Well, it's not. This is actually pretty helpful with managing binary files. I'm not sure if I can say either way (SVN or git) is better in this regard. All VCS really need to index in some way. Handling so many file changes, differences, and history is really difficult without some kind of quick reference map like an index. Indexes are also nice because they aren't mandatory. They can always be rebuilt, so if it gets corrupt, it's not a big deal.
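To make "storing hashes of each file" a little more concrete, here is a rough sketch of how the id of a file's contents can be computed: a SHA-1 over a short header (`blob <size>` plus a NUL byte) followed by the bytes themselves. The class and method names (GitBlobHash, blobId) are made up for this example; it only illustrates the hashing scheme and is not git's own code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of git: computes the id that git
// assigns to a file's contents (a "blob").
final class GitBlobHash {

    static String blobId(byte[] content) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            // The header "blob <size>" plus a NUL byte is hashed before the content.
            sha1.update(("blob " + content.length + "\0").getBytes(StandardCharsets.US_ASCII));
            sha1.update(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : sha1.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Identical contents always produce the identical id.
        System.out.println(blobId("hello\n".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(blobId(new byte[0]));
    }
}
```

Because the id depends only on the contents, two files with the same bytes share one id, which is part of why this works just as well for binary files as for text.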

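I'm not really sure about this, but it looks like git isn't really efficient about the way it stores files either. Of course, it uses the index and stores them by hash, but I'm not exactly clear on checkins. It looks like each checkin stores a full (compressed) copy of every changed file rather than a diff, at least until the repository is packed, at which point git delta-compresses the objects. Copying whole files is something I consider to be undesirable. At least, for me it is.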

All the bad stuff aside, I still really like git. I use git for smaller/personal projects. I know. That's really weird because I'm totally not taking advantage of the distributed nature of git. Well, that's just it. I never would. What I really like about git isn't that it's distributed. What I really like is how easy it is to get set up. Creating a project and creating a repository are the same command. Sharing my code is cake work. It's really perfect for maintaining versions for small, personal projects. This is particularly so because I can decide later on whether I want to share or not. I just don't have to think about things like that and can concentrate more on writing the software.

Git definitely simplifies branching, merging, and tagging. When I clone a repository, I can branch directly from there. I can easily merge my changes back to the repository I cloned from. This really promotes more branching than anything. It really benefits projects like the Linux kernel project. Normally, you wouldn't want to branch very often, but the Linux kernel has a branch for just about each committer. This really works well for the Linux kernel project.

Subversion

First, let's talk about what I like about SVN. All of my projects and source code exist in one location. I don't have to worry about managing where my projects are because they are all contained within the repository. With git, I tend to create a new repository per project. Probably not the best thing to do, but I can't help it. It's so easy.

It is also extremely secure. It's as secure as I want to make the server that it's hosted on. The way I tend to use SVN is I don't use it remotely. This is where SVN really shines. I use SSH with SVN, but the way I use SSH with SVN is that SVN commands are piped through SSH to svnserve. This is essentially the same as logging into the machine, and running svnserve to handle the version control of the source code. This basically contains everything within the server. The client/server communication is managed exclusively through SSH. The authz/authn is managed by whatever I'm using on my server. Finally, I use SVN like I'm local to the machine.

SVN, unlike git, likes to store everything as a revision. This actually contrasts with a number of other VCS. What it means is that only incremental changes are stored. Even the first version of a file is stored as a revision. I really like this concept. Even though I shouldn't be concerned about storage, I am. When I see that files just get copied, it disturbs me. I feel like I should only be committing what changed.

Now, for the downsides. Setting up a repository is a pain. If I make a mistake, I have to start over. For example, if I created a repository using Berkeley DB instead of filesystem, converting is much harder than just starting over. On the upside, I only have to do it once, but then again only doing it once makes it easy to forget for the next time. If it weren't for the convention of branches/ trunk/ tags/, I fear setting up a new project would be difficult as well because of the svnadmin tool.

Conclusion

To put it plainly, SVN is probably best for organizations that have internal projects, and need to manage them with VCS. git is probably better for far spread OSS projects that require melee branching and merging. git is also very good for personal projects, but has management overhead that needs to be considered.

I like git, but I just probably would never find as much use for it as I do SVN.



 

Static vs. Dynamic Languages (Loose vs. Strict Typing)

Hello. Recently, someone asked me about the difference between strict and loose typing. Specifically, "Why would anyone want strict-typing in the first place?" At first, I was a little stumped myself. I'm not good at making strong arguments impromptu. I always come off as sounding like I don't know what I'm talking about. Usually, it's something I have to take time to think about. I deliberated on this, and I even asked for some input from some friends. This is something I've argued pretty frequently in the past, so I know just whom to ask for counter-opinions.

I'm going to come to the defense of strictly-typed languages because I'm assuming we all know why we want to use loose-typing; therefore, I'm going to try and explain "Why strict-typing?" First, I'd like to explain their differences. Then, I'd like to outline the good and bad parts of each. Finally, I will give my own personal opinion in the matter. All of it will be from the perspective of "Why strict-typing?"

What is Strict Typing?

Static languages are languages that are strictly typed. This means that when a variable, method, or anything really is defined, a type must be explicitly declared as well. Take this example in Java:

public class SomeClass {
    @Override
    public String toString() {
        String retval = new String();
        return retval;
    }
}

You may observe that there is a type required for the class definition. Within the class is a method toString() whose signature has a String type declaration; likewise, within the method is a variable explicitly declared as type String that is then explicitly returned. It's true that you could do:

return "";

Most any language can do that though. My point is aimed specifically at variable declaration though. Rather, how would I have to declare the variable? That's when the strict-typing comes into play. Probably the most extreme example of strictly-typed language is "C":

/* A NULL-terminated array of pointers to functions that return a char. */
typedef char (**RandomWord)(void);
RandomWord randW = newRandomlyGeneratingWord();

The above describes a type where each character in a null-terminated character array actually points to a function that returns a random character. That means that each time the character array (String) is read, it's completely different. The important part to notice is that the type has to be explicitly declared. The typedef is there to show that one type can be defined as another. Each thing I just described is an example of strict-typing within a language.

What is Loose Typing?

Not all languages are like this. Loosely-typed (dynamic) languages are not. Take for example Perl:

{
    package SomeClass;

    sub new {
        foreach my $arg (@_) {
            # do some stuff
        }
    }

    sub toString {
        my $str = "";
        return $str;
    }
}

Where are the types? "Don't know. Don't care." Pretty simple, right? Then, why bother with a static language at all? Dynamic languages seem simple enough, and we don't have to deal with the hassle of types. Well, the thing about dynamic languages is that most of them are not compiled. That puts us into a sub-discussion of compiled vs. interpreted-uncompiled.

Compiled means that before it can run, the code goes through syntax checking, validation, and optimization. Either bytecode or machine code is produced as a result. This code typically can run much faster than interpreted code because interpreted code typically runs within a specific runtime environment which gets its own resource space. This can be a performance hit. Another performance hit comes from actually parsing the language. Compiled languages can defer this to the compilation step, so parsing doesn't happen during runtime. To counter, compiling code is an extra step. Interpreted code can just run, but compiled code must be preprocessed and processed before actually running. In short, you are given a choice. Do you want to take extra time to build the application or have the application take longer to function?

That's just compiled v. uncompiled. What if you want to compile a dynamic language? Does that mean it will run faster than an uncompiled language? Yes. Take Groovy for example. When it is optimized by being compiled into Java bytecode before execution, it runs faster than flushing the Groovy code directly through the interpreter. Something similar goes for Python. When pyc (Python compiled bytecode) is used instead of py (normal Python code), the program tends to start faster because the parsing and compilation have already been done, even though the bytecode itself runs at the same speed. What does this mean? Well, it means that for some languages compilation cannot be a factor when comparing static and dynamic languages. For example, if you want to compare Java and Groovy, you have to leave out the compilation part because both will compile to Java bytecode.

You'll have to make the decision for yourself when you weigh in dynamic or static language whether you want to consider compilation as a factor or not. I do consider it because I typically do not care to compile my dynamic language even though it is possible. To me, that takes away one of the reasons to use a dynamic language which is simpler deployment. When you start to make your dynamic language more static, you lose more of the reasons why you want to use it in the first place.

Which brings us to...

What do we Get From Static Languages

Above, I just illustrated my first point. "Why strict-typing?" Because it is compiled. Compiled means it's faster and more efficient. I know what you're thinking. "That might have been important to me 20 years ago." True. It's not a very strong point, but it is one. I'll explain some more below.

Forces the developer to think in terms of code reuse

Static languages are not very forgiving when it comes to API's. Take Java for example. It has access control for methods which most dynamic languages don't have. This can really affect what external classes get to use internal functions. For OO languages, we have to consider polymorphism. Java, for example, requires interfaces to fulfill polymorphism. Interfaces as we know do not have any implementation whatsoever. All methods within them are declared public. When we want to use polymorphism, we're forced to think about code reuse, API's, etc. because extends only gives us single inheritance. With generics and autoboxing, we can let the compiler handle optimizing these types and handling type correction for us. We don't get this in dynamic languages. Type correction is something we have to worry about on our own. This makes static languages muuuuch more useful for creating API's and frameworks.
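A small sketch may help picture the interface, generics, and autoboxing points. Everything named here (ApiDemo, Shape, Square, Circle) is invented for the illustration; it is one way to show the idea, not code from any particular framework.

```java
import java.util.ArrayList;
import java.util.List;

// All names in this class are made up for illustration.
final class ApiDemo {

    // The contract: implicitly public, no implementation.
    interface Shape {
        double area();
    }

    static final class Square implements Shape {
        private final double side; // private: callers cannot touch the internals
        Square(double side) { this.side = side; }
        public double area() { return side * side; }
    }

    static final class Circle implements Shape {
        private final double radius;
        Circle(double radius) { this.radius = radius; }
        public double area() { return Math.PI * radius * radius; }
    }

    // Depends only on the interface, so any Shape can be reused here.
    static double totalArea(List<Shape> shapes) {
        double total = 0;
        for (Shape s : shapes) {
            total += s.area();
        }
        return total;
    }

    // Generics plus autoboxing: the compiler inserts the Integer <-> int conversions.
    static int sum(List<Integer> numbers) {
        int total = 0;
        for (int n : numbers) { // unboxing
            total += n;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Shape> shapes = new ArrayList<Shape>();
        shapes.add(new Square(2));
        shapes.add(new Circle(1));
        System.out.println(totalArea(shapes));

        List<Integer> numbers = new ArrayList<Integer>();
        numbers.add(1); // boxing
        numbers.add(2);
        System.out.println(sum(numbers));
    }
}
```

Trying to add a String to the List<Integer>, or to read `side` from outside Square, is rejected at compile time, which is the "type correction" the compiler does for us.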

Static Languages are easier to read

Perl has earned a reputation as being the "write once, read never" language. Mostly because types are sooooo hard to determine. You can have an array within a hash within an array of references to functions that return objects blah blah blah. It's crazy. Unlike C, there is no typedef that declares something so complex. You have to figure it out.

Ruby and Groovy are really concise languages, but a lot of the code is difficult to understand because of closures, mixins, in-line anonymous classes, etc. Sometimes, code can almost look hackish because we know what it does (according to documentation), but we just can't understand how it's doing it by looking at it. It's magic.

Static languages give us method signatures which let us know what parameters a method takes, what it returns, and what exceptions (if any) are thrown. Sometimes we can get even more information. With IDE integration, this is great because you can have the IDE deal with things like creating methods for you based on parameters, return values, and exceptions. A friend of mine expressed frustration about this in dynamic languages because he feels he's forced to read the code which puts him in a tough spot. He has to write code with explicit knowledge of the internals. This speaks to the previous point about API's and frameworks. Without API's, frameworks, method signatures, etc... there isn't any abstraction. You have to develop knowing internals of the library or have an extremely strong faith in the documentation. Excellent point Andrew!
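As a small illustration of how much a signature says on its own, consider the sketch below. SignatureDemo and firstLines are hypothetical names; the point is only that the parameters, the return type, and the checked exception are all visible without reading the body.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical example class; the names are not from any real library.
final class SignatureDemo {

    /**
     * The signature alone says: pass in a Reader and a line count,
     * get back a String, and be prepared for an IOException.
     */
    static String firstLines(Reader source, int maxLines) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        StringBuilder out = new StringBuilder();
        String line;
        for (int i = 0; i < maxLines && (line = reader.readLine()) != null; i++) {
            out.append(line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.print(firstLines(new StringReader("a\nb\nc\n"), 2));
    }
}
```

A caller (or an IDE) can work out how to use firstLines from the declaration alone, which is the abstraction Andrew is asking for.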

Static languages are different. Types are easy to understand because they are explicit and simple. In Java, for example, we can easily determine all of the aggregates of a class just by looking at it. With an IDE, we can determine exceptions, signatures, errors, code completion, and code cross-referencing very easily. This brings us to the next point.

Static Languages play nicer with IDE's and tools

Because of strict typing, these languages are easier to parse than dynamic languages. Therefore, it's easier to do cross-referencing, debugging, type hierarchies, documentation, and code completion within IDE's. That isn't to say that dynamic languages cannot get this from an IDE. It's just more difficult. Java, for example, has just about every IDE under the sun supporting it; however, it is very difficult to find a good IDE for Ruby, PHP, or Python. Forget about finding one for Perl. For Ruby, I hear that IDEA is great, but Eclipse is pretty good for PHP, and Komodo for Python. Getting your favorite IDE to support all of them is just impossible. Even when an IDE does support it, it doesn't work as well as it does with a Static Language like C or Java.

Further, tools that work in your environment are difficult to obtain for these languages. Just try and find a line-by-line stepping debugger that integrates with your IDE, mod_python, fastcgi, etc. It's just not happening. Scala and Java let us use the Java Debugging Framework for profiling and debugging code. We can use it with IDEA or Eclipse, which also happen to support Tomcat and Jetty appservers. It's Nirvana!

Static Languages are compiled

I can say this because they have to be. There is no way that an uncompiled language can handle the type correction unless it was preconstructed to do so. It's just logically impossible.

Dynamic languages run into more problems here than just performance. You actually can't validate an entire application until you run it, which means there can be syntax errors in the code, and you would never know. With compiled code, you know for a fact that there aren't any mismatched type conditions, syntax errors, etc.
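Here is a minimal sketch of the difference, using made-up names (TypeCheckDemo and its methods). The commented-out lines are the kind of mistake javac refuses to compile; the cast through Object shows what happens once the static type information is thrown away and the mistake only surfaces when the line actually runs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; TypeCheckDemo is a made-up name.
final class TypeCheckDemo {

    static int totalLength(List<String> words) {
        int total = 0;
        for (String w : words) {
            total += w.length();
        }
        return total;
    }

    // Going through Object hides the real type from the compiler,
    // so the mistake is only discovered when the cast executes.
    static boolean failsOnlyAtRuntime() {
        Object value = Integer.valueOf(42);
        try {
            String s = (String) value; // compiles, but throws ClassCastException
            return s.isEmpty();
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<String>();
        words.add("foo");
        // words.add(42);   // rejected by javac: an int is not a String
        // int n = "foo";   // rejected by javac: incompatible types
        System.out.println(totalLength(words));
        System.out.println(failsOnlyAtRuntime());
    }
}
```

In a dynamic language every call site behaves like the cast through Object: nothing complains until that exact line is reached.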

I'm going to quote Andrew again here. He made a good point about the community regarding this opinion. Andrew, I totally agree with you.

"The Ruby/Python zealots will tell you that Unit Tests will solve nearly all the problems above, if done right. That's nice in principle, but I shouldn't have to do grunt work, and ruby/python at least force you to do so much that the computer or IDE could be doing for you. It's like digging ditches with a hand shovel, because it 'enforces good posture' or some garbage." --Andrew Hollamon

What do I Think About Static v. Dynamic Languages?

I think they're both great. There are reasons to use both. I think it's important to know what those reasons are because you don't want to use the wrong one just because you don't see the point in using the other.

How do you know when to use which one?

I tend to go with static languages for projects that require long-term sustainability. It's just better for management overall. Resource management is less of a hassle because it is easier to train people on the code, thanks to common API's and frameworks, as well as strong documentation and understandable code.

Smaller projects like POC or CRUD applications are probably better in dynamic languages. Rails for example is perfect for CRUD because a lot of the code is written for you. The conventions it uses make it less necessary to understand the code. What then when you have to make your application more complex? Well, then I would start looking into a static language. The last thing you want in a complex project is "write once, read never" code. This is especially the case if it is for a professional project. If it's personal, do whatever you want. It's your own grave.



Tuesday, June 17, 2008

 

Ashley at the Jedi Training Academy

Hello. w00t! I did my own fanfilm.



Saturday, May 03, 2008

 

String Concatenation in Java: A Case Study of Logging

Hello. In my entry on String Concatenation in Java, I went over a test that compares appending and concatenating a string literal "foo" to itself 10,000 times. The main points for its impracticality were:

  • Doesn't consider String Pool for string literals.
  • Doesn't consider compiler optimizations, garbage collection, or HotSpot
  • Who in the world is going to write an application that concatenates a string literal to itself 10,000 times consistently?

I decided to write my own tests specifically for logging in java. Here are my requirements:

  • Only 4 or 5 string concatenations/appends. Since String is an immutable type in Java, every concatenation produces a new String. 4 or 5 concatenations of combined literal and non-literal string instances seems like a normal use case for a logging message.
  • Exercises the stack a little bit. Most logging messages are pretty deep in the stack. (No, it doesn't affect the test at all, but I figured for the sake of completeness, I would add it for those who care.)
  • Run tests with 10,000 concatenation tests to compare results.
  • Produce a reasonable number of Strings. Upon starting an application I work on, 23973 log messages are produced. If just starting a mass-multiuser application creates this many log messages, then 10,000,000 log messages seems pretty reasonable for a 5-day work-week of 8 hours per day.
  • Eliminate compiler optimizations so neither method can take any shortcuts.
  • Use incremental garbage collection so garbage collection interference is mitigated.

To summarize, the goal is more strings, but fewer concatenations per string. How does StringBuilder hold up now?

leo@sark~/test (18:03:51) [28] ant
Buildfile: build.xml

test-compile:

test:
     [java] INFO - Starting test usingStringConcat
     [java] INFO - usingStringConcat finished. Created 10000000 strings in 154691 milliseconds
     [java] INFO - Starting test usingPlusConcat
     [java] INFO - usingPlusConcat finished. Created 10000000 strings in 114850 milliseconds
     [java] INFO - Starting test usingStringBuilder
     [java] INFO - usingStringBuilder finished. Created 10000000 strings in 122337 milliseconds
     [java] INFO - Starting test usingStringBuilderNoToString
     [java] INFO - usingStringBuilderNoToString finished. Created 10000000 strings in 108929 milliseconds
     [java] INFO - Starting test concat
     [java] INFO - concat finished. Created 10 strings in 10006 milliseconds
     [java] INFO - Starting test append
     [java] INFO - append finished. Created 10 strings in 97 milliseconds

First, I should explain the difference between the tests.

  • usingStringConcat Tests concatenation using the String#concat() method.
  • usingPlusConcat Tests concatenation using '+'
  • usingStringBuilder Tests appending using StringBuilder
  • usingStringBuilderNoToString Same as usingStringBuilder, but doesn't call toString() on the result.
  • concat Paul Barry's concat test
  • append Paul Barry's append test

You can see now that the margin between StringBuilder and '+' concatenation has been reduced. It has been reduced so much that now '+' concatenation is actually faster. The margin is approx. 7,500 milliseconds (122337 vs. 114850). Statistically speaking, I should run this several times and calculate the standard deviation, but I'm not willing to do that much work. I've noticed that '+' is consistently faster though. Why!? I don't know, but let's review the code, and I will fire off some speculations.

import java.io.PrintWriter;
import java.io.StringWriter;

import static org.kuali.kra.logging.BufferedLogger.*;

/**
 * Simple unit test class to compare various {@link String} concatenation methodologies in Java.
 *
 */
public class StringConcatenationTest extends ConcatenationTest {

    public StringConcatenationTest(String name) {
        super(name);
    }

    /**
     * Tests concatenating {@link String} literal instances. Handles conversion of integers to {@link String}
     * instances as well. The main part to realize is that this is using the '+' operator for this test.
     *
     */
    public void usingPlusConcat() {
        new StackSimulation(20).execute(new DispatchableStackElement() {
            public void dispatch() {
                for(int i = 0; i < iterations; i++) {
                    String foo = "foo" + i + "foofoo" + i + "foofoofoo" + i;
                }
            }
        });
    }

    /**
     * Tests concatenating {@link String} literal instances. Handles conversion of integers to {@link String}
     * instances as well. The main part to realize is that this is using the {@link String#concat(String)} method
     * for this test.
     *
     */
    public void usingStringConcat() {
        new StackSimulation(20).execute(new DispatchableStackElement() {
            public void dispatch() {
                for(int i = 0; i < iterations; i++) {
                    String foo = "foo".concat(String.valueOf(i))
                        .concat("foofoo").concat(String.valueOf(i))
                        .concat("foofoofoo").concat(String.valueOf(i));
                }
            }
        });
    }

    /**
     * Tests appending {@link String} literal instances. Handles conversion of integers to {@link String}
     * instances as well. The main part to realize is that {@link StringBuilder} is used for this
     * exercise.
     *
     */
    public void usingStringBuilder() {
        new StackSimulation(20).execute(new DispatchableStackElement() {
            public void dispatch() {
                for(int i = 0; i < iterations; i++) {
                    String foo = new StringBuilder()
                        .append("foo").append(i)
                        .append("foofoo").append(i)
                        .append("foofoofoo").append(i).toString();
                }
            }
        });
    }

    /**
     * Tests appending {@link String} literal instances. Handles conversion of integers to {@link String}
     * instances as well. The main part to realize is that {@link StringBuilder} is used for this
     * exercise.
     *
     */
    public void usingStringBuilderNoToString() {
        new StackSimulation(20).execute(new DispatchableStackElement() {
            public void dispatch() {
                for(int i = 0; i < iterations; i++) {
                    StringBuilder foo = new StringBuilder()
                        .append("foo").append(i)
                        .append("foofoo").append(i)
                        .append("foofoofoo").append(i);
                }
            }
        });
    }

    /**
     * C-Style formats are available in Java. This is just testing the impact of such a thing.
     * Runs in a loop and executes <code>iterations</code>. It will construct a completely separate
     * {@link String} instance each time.
     *
     */
    public void usingCStyle() {
        for(int i = 0; i < iterations; i++) {
            StringWriter writer = new StringWriter();
            new PrintWriter(writer).printf("%s%s%s%d", "foo", "foofoo", "foofoofoo", i);
            String foo = writer.getBuffer().toString();
        }
    }

    /**
     * Entry point for test
     */
    public static void main(String args[]) {
        if (args.length < 1) {
            error("Come on! Give me a test to run!");
            System.exit(1);
        }

        new StringConcatenationTest(args[0]).runTest();
    }
}
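For anyone who does want to do the statistical legwork mentioned earlier (several runs plus a standard deviation), a rough sketch could look like this. TimingStats and its methods are hypothetical helpers written for this illustration; they are not part of the test suite above, and the loop size in main is arbitrary.

```java
// A sketch only: TimingStats and its methods are hypothetical helpers,
// not part of the test suite shown in this post.
final class TimingStats {

    static double mean(long[] samples) {
        double sum = 0;
        for (long s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    // Sample standard deviation (divides by n - 1); needs at least two samples.
    static double sampleStdDev(long[] samples) {
        double m = mean(samples);
        double squares = 0;
        for (long s : samples) {
            squares += (s - m) * (s - m);
        }
        return Math.sqrt(squares / (samples.length - 1));
    }

    // Runs the task several times and records each wall-clock duration in ms.
    static long[] time(Runnable task, int runs) {
        long[] millis = new long[runs];
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            task.run();
            millis[r] = (System.nanoTime() - start) / 1000000L;
        }
        return millis;
    }

    public static void main(String[] args) {
        long[] millis = time(new Runnable() {
            public void run() {
                for (int i = 0; i < 100000; i++) {
                    String foo = "foo" + i + "foofoo" + i + "foofoofoo" + i;
                }
            }
        }, 5);
        System.out.println("mean=" + mean(millis) + " ms, stddev=" + sampleStdDev(millis) + " ms");
    }
}
```

If the gap between two tests is much larger than the standard deviation of either one, the difference is probably not just noise.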

Why the Integer Conversion of i? Why Not Just Append String Literals?

Let's look at usingPlusConcat()

for(int i = 0; i < iterations; i++) {
    String foo = "foo" + i + "foofoo" + i + "foofoofoo" + i;
}

At first, I used "foo" + "foofoo" + "foofoofoo". What's wrong with it? Well, it's exactly the same as "foofoofoo" + "foofoofoo". According to the Java Language Specification on the concatenation operator, this is basically a compile-time constant. That means, it gets created at compile-time. This won't yield a realistic test for us because log messages are not compile-time constants. To mix things up a little bit, I make it handle a conversion between int and String, then concatenate it non-literally. The String Pool doesn't get used as much, and naturally, the test ran much slower. I replicated the same effect with the usingStringBuilder test.

for(int i = 0; i < iterations; i++) {
    String foo = new StringBuilder()
        .append("foo").append(i)
        .append("foofoo").append(i)
        .append("foofoofoo").append(i).toString();
}
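The compile-time constant point can be checked with a few lines. This is a small sketch with made-up names (ConstantFoldingDemo, allLiterals, withInt); it uses == on purpose, because reference identity is what reveals whether a String was folded into a constant or built at run time.

```java
// Illustrative sketch; ConstantFoldingDemo is a made-up name.
final class ConstantFoldingDemo {

    // Every operand is a literal, so the compiler folds this into one constant.
    static String allLiterals() {
        return "foo" + "foofoo" + "foofoofoo";
    }

    // i is only known at run time, so a new String is built on each call.
    static String withInt(int i) {
        return "foo" + i;
    }

    public static void main(String[] args) {
        // Reference comparison (==) is deliberate: it exposes object identity.
        System.out.println(allLiterals() == "foofoofoo" + "foofoofoo");
        System.out.println(withInt(0) == "foo0");
        System.out.println(withInt(0).equals("foo0"));
    }
}
```

The first comparison is true because both sides are the same pooled constant; the second is false because the concatenation with i happens at run time, even though the contents are equal.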

StringBuilder Looks Fast Without toString()

You may have also noticed a test, usingStringBuilderNoToString. This is exactly the same test, except it doesn't call toString() at the end of the appends. Clearly, this has a noticeable impact on the results. Without the toString() call, StringBuilder is faster than concatenating with '+' (108929 vs. 114850 milliseconds).

If you look at the toString() method of StringBuilder, you'll see:

public String toString() {
    // Create a copy, don't share the array
    return new String(value, 0, count);
}

So why is it so heavyweight? I'll cut to it. The String constructor does this:

this.value = Arrays.copyOfRange(value, offset, offset+count);

It literally copies the buffer char-by-char. Ouch! Thanks to the immutable nature of String, just about everything you do to a String means copying it. Even when you append a String to a StringBuilder, you are copying the char[] out of the String and into the buffer of the StringBuilder.
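The copy is easy to observe from the outside. In this sketch (CopyDemo is a made-up name), a String taken from a StringBuilder is unaffected by later appends, and two toString() calls on the same builder hand back two separate objects.

```java
// Illustrative sketch; CopyDemo is a made-up name.
final class CopyDemo {

    static String[] snapshotThenAppend() {
        StringBuilder builder = new StringBuilder("foo");
        String snapshot = builder.toString(); // copies the builder's characters
        builder.append("bar");                // changes only the builder's buffer
        return new String[] { snapshot, builder.toString() };
    }

    static boolean toStringSharesInstance() {
        StringBuilder builder = new StringBuilder("foo");
        // Each call allocates a fresh String, so the references differ.
        return builder.toString() == builder.toString();
    }

    public static void main(String[] args) {
        String[] result = snapshotThenAppend();
        System.out.println(result[0]);
        System.out.println(result[1]);
        System.out.println(toStringSharesInstance());
    }
}
```

That independence is exactly what the copy buys, and exactly what it costs on every call.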

It All Comes Down to Optimization

After J2SE 1.4, the Sun Java compiler is ALWAYS optimizing. There's just no way to turn it off. The Java Language Specification says about Optimizing String Concatenation that a compiler can optimize it using StringBuilder. I can't really say how the optimization is happening, but it's pretty obvious. My best guess is that the object->string conversion is optimized by the compiler with '+' concatenation better than with StringBuilder. That's my guess. I know, that sounds kind of lame, but there's just no way to know for sure.
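The JLS remark about using StringBuilder can be pictured with a sketch. To my knowledge, javac releases from Java 5 through Java 8 usually turn a '+' chain into something close to the hand-written chain below, while newer JDKs use a different strategy, so treat this as an illustration and not as a dump of what the compiler produced for the tests above. PlusDesugarDemo and its method names are invented for the example.

```java
// Illustrative sketch; PlusDesugarDemo is a made-up name.
final class PlusDesugarDemo {

    // What we write.
    static String withPlus(int i) {
        return "foo" + i + "foofoo" + i + "foofoofoo" + i;
    }

    // Roughly what a Java 5 to 8 era compiler is allowed to emit for it:
    // one builder, one append per operand, one toString() at the end.
    static String handExpanded(int i) {
        return new StringBuilder()
                .append("foo").append(i)
                .append("foofoo").append(i)
                .append("foofoofoo").append(i)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(withPlus(7));
        System.out.println(withPlus(7).equals(handExpanded(7)));
    }
}
```

If that is what happens, the two tests end up running nearly the same instructions, which would fit with the small margin between them.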

Conclusion

I'm going to summarize the facts I discovered in the testing.

+= for String Concatenation Does not Optimize Well

If you look at Paul Barry's Test, you'll see += is used. This is actually a really inefficient way to concatenate strings. It's also a very unlikely way. You just don't see this used very often for string concatenation.
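To show why '+=' in a loop is a special case, here is a small sketch with made-up names (PlusEqualsDemo and its two methods). It is not Paul Barry's code; it only contrasts the two shapes of loop.

```java
// Illustrative sketch; PlusEqualsDemo is a made-up name.
final class PlusEqualsDemo {

    // Each += builds a brand-new String that copies everything accumulated so far.
    static String repeatWithPlusEquals(String piece, int times) {
        String result = "";
        for (int i = 0; i < times; i++) {
            result += piece;
        }
        return result;
    }

    // One buffer is reused, and toString() copies once at the very end.
    static String repeatWithBuilder(String piece, int times) {
        StringBuilder result = new StringBuilder(piece.length() * times);
        for (int i = 0; i < times; i++) {
            result.append(piece);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String a = repeatWithPlusEquals("foo", 1000);
        String b = repeatWithBuilder("foo", 1000);
        System.out.println(a.equals(b));
        System.out.println(a.length());
    }
}
```

Both produce the same text, but the first copies the growing result on every pass, so the work grows roughly with the square of the iteration count. A log message built in one expression never hits that case.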

primitive/Object->String Conversion is Inefficient

If you can avoid it, then avoid it! The Integer.getChars() method has huge amounts of overhead. If you look at the Java source code for Integer.toString(int), you'll find:

if (i == Integer.MIN_VALUE) return "-2147483648";

The reason it is hard-coded as a string literal is that:

  • It's added to the String Pool, so it pretty much becomes weightless
  • Avoids getChars(), so we don't have to worry about conversion or buffer space issues
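One more detail makes the special case plausible, and this is my own reading rather than anything documented in that source: in two's complement, Integer.MIN_VALUE has no positive counterpart, so any conversion that works on the absolute value of the number cannot handle it. A small sketch (the class name is just for illustration):

```java
public class MinValueDemo {

    /** Negating Integer.MIN_VALUE overflows and yields MIN_VALUE again. */
    static boolean negationOverflows() {
        int min = Integer.MIN_VALUE;
        return -min == min;
    }

    public static void main(String[] args) {
        System.out.println(negationOverflows());                 // true
        System.out.println(Integer.toString(Integer.MIN_VALUE)); // -2147483648
    }
}
```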

toString() has Some Overhead

You really can't avoid using it completely, but you can minimize the number of times you call it. Leaning on optimization to keep from throwing away Strings is a good idea. For example, '+=' is considered less efficient than just '+' because '+' can be optimized to call toString() less, but '+=' is forced to call toString() each time.
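To make the difference between '+' and '+=' concrete, here is a hand-written sketch of the two shapes. This is an assumption about roughly what a compiler of this era may generate, not a dump of real compiler output, and the class and method names are mine. A single '+' expression can share one builder and one toString(), while '+=' in a loop has to produce a complete String on every pass.

```java
public class ConcatShapes {

    /** One expression (a + b + c): a single builder and a single toString(). */
    static String singleExpression(String a, String b, String c) {
        return new StringBuilder().append(a).append(b).append(c).toString();
    }

    /** result += piece in a loop: a fresh builder and a toString() per pass. */
    static String plusEqualsLoop(String piece, int iterations) {
        String result = "";
        for (int i = 0; i < iterations; i++) {
            // Each pass copies everything accumulated so far, then copies it again in toString().
            result = new StringBuilder().append(result).append(piece).toString();
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(singleExpression("a", "b", "c")); // abc
        System.out.println(plusEqualsLoop("foo", 3));        // foofoofoo
    }
}
```

Both produce the right answer. The second one just throws away a String on every iteration to get there.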

With Optimization StringBuilder is NOT Faster than '+'

If you consider the compiler optimizing to keep from throwing away strings, then there is no advantage to using one over the other. It is possible that '+' optimizes object -> string conversion better than StringBuilder. That's just speculation though.

Source Code

The source code is a self-extracting shell script. That means it doesn't work in Windows. Really. This should work fine with OS X, BSD, Linux, Solaris, etc... Just not Windows.



Friday, May 02, 2008

 

String Concatenation in Java

Hello. For a while now, there has been a lot of controversy over whether to use '+' or StringBuilder. Usually, when one wants to compare the two, a technique resembling a strong inductive proof is used. We assume that

if p(i) ∧ p(i+1) ∧ ... ∧ p(k) is true, then p(k+1) is true.

How does this do anything for us? Suppose we set k = 10,000 and run a loop of k concatenations and k appends. We assume that if the result holds for every iteration up to k, then it also holds for k+1. That makes this kind of test a very sensible proof, but not a very practical one. Why isn't it practical?

I want to first go over the tests that I've seen people create to show how much faster StringBuilder is than '+'. Here is a test, for example, that does what I illustrated above. Let's review it.

String Pool

According to the documentation for the intern method, all String literals are added to the String Pool. There is a lot of speculation around how the String Pool is managed, but for our purposes it's enough to know that it exists.

for(int i = 0; i < iterations; i++) { foo += "foo"; }

The literal "foo" gets added to the String Pool; furthermore, with each iteration foo grows by "foo". This is impractical because with k = 10000, who in the world is ever going to append the same literal from the String Pool 10,000 times? It'll never happen. Even though it does cover several cases, it doesn't cover ALL cases. More importantly, it doesn't cover any of the cases that programmers are likely to encounter. In the case where append is used, the literal string "foo" likewise gets appended 10,000 times.
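If you want to see the String Pool for yourself, here's a small sketch (the class and method names are mine). It leans on reference comparison with ==, which is normally the wrong way to compare strings, but here identity is exactly what we want to observe.

```java
public class StringPoolDemo {

    /** Two identical literals refer to the same pooled instance. */
    static boolean literalsShareOneInstance() {
        String a = "foo";
        String b = "foo";
        return a == b;
    }

    /** new String(...) always creates a distinct object with equal contents. */
    static boolean newStringIsDistinct() {
        String literal = "foo";
        String copy = new String("foo");
        return copy != literal && copy.equals(literal);
    }

    /** intern() hands back the pooled instance for those contents. */
    static boolean internReturnsPooledInstance() {
        String literal = "foo";
        String copy = new String("foo");
        return copy.intern() == literal;
    }

    public static void main(String[] args) {
        System.out.println(literalsShareOneInstance());    // true
        System.out.println(newStringIsDistinct());         // true
        System.out.println(internReturnsPooledInstance()); // true
    }
}
```

So in the loop above, every iteration appends the very same pooled "foo" object. No new literal is ever created.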

Further, '+=' is considered less efficient than just '+' because '+' can be optimized to call toString() less, but '+=' is forced to call toString() each time. See Optimization of String Concatenation.

Compiler Optimizations

public static void main(String[] args) {
    StringTest test = new StringTest(100);
    for(int i = 0; i < 10; i++) {
        test.concat();
        test.append();
    }
}

100 is hardcoded as the iteration count in the test. The compiler will optimize this and replace each variable instance of iterations with 100 wherever it sees it. Basically this ...

for(int i = 0; i < iterations; i++) {

... becomes ...

for(int i = 0; i < 100; i++) {

This is important because the capacity of a StringBuilder normally is initialized with a default fixed capacity as a buffer. If it starts to exceed this capacity, it has to grow that capacity. What if it knew at compile-time exactly how long it was required to be? It could optimize that default capacity to be large enough that it wouldn't need to grow.

String concat took 1812 ms
String append took 0 ms
String concat took 1766 ms
String append took 0 ms
String concat took 1766 ms
String append took 0 ms
String concat took 1812 ms
String append took 0 ms
String concat took 1750 ms
String append took 0 ms
String concat took 1828 ms
String append took 0 ms
String concat took 1750 ms
String append took 0 ms
String concat took 1750 ms
String append took 0 ms
String concat took 1782 ms
String append took 0 ms
String concat took 1750 ms
String append took 0 ms

As for normal string concatenation, it doesn't look like it benefits from this optimization at all. This is probably because of '+=' that forces a new String to be created and thrown out with each iteration.
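The capacity behavior itself is easy to observe. Here's a minimal sketch (the class and method names are mine, and it says nothing about what the compiler actually does with the test above). It compares a default-capacity StringBuilder against one presized for the known number of appends.

```java
public class CapacityDemo {

    /** Capacity of a StringBuilder created with the no-argument constructor. */
    static int defaultCapacity() {
        return new StringBuilder().capacity();
    }

    /** Appends piece count times and reports whether the buffer had to grow. */
    static boolean grew(StringBuilder sb, String piece, int count) {
        int before = sb.capacity();
        for (int i = 0; i < count; i++) {
            sb.append(piece);
        }
        return sb.capacity() > before;
    }

    public static void main(String[] args) {
        int iterations = 100;
        String piece = "foo";
        System.out.println(defaultCapacity());                           // 16
        System.out.println(grew(new StringBuilder(), piece, iterations)); // true
        // Presized to exactly the final length, so no reallocation is needed.
        System.out.println(grew(new StringBuilder(piece.length() * iterations), piece, iterations)); // false
    }
}
```

Every time the buffer grows, the existing contents are copied into a bigger array, so knowing the final length up front saves all of that work.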

Resource Management and the Garbage Collector

Here is the iteration loop for the tests. It iterates each test 10 times. This will produce 10 strings that have "foo" concatenated 10,000 times.

for(int i = 0; i < 10; i++) { test.concat(); test.append(); }

Notice that concat() runs before append(), and also notice that they are running together in the same loop. This implies they are running in the same VM. concat() produces the same string that append() does. This is important because a majority of all the setup and overhead is handled by the concat() method.

Another thing to consider is heap size, stack size, and the garbage collector. For tests like these, we not only want to increase the heap size, but the stack size as well. The reason for it is we don't want the garbage collector or virtual memory to interfere with our test. It would be a shame if append() suddenly had to operate with the garbage collector cleaning up after the concat() run.

To avoid both of these scenarios, it is best to adjust the stack size with -XssNm and the heap size with -XmsNm and -XmxNm, and to run each test in a separate VM.
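Here's a rough sketch of what "a separate VM per test" could look like from plain Java. ConcatTest and AppendTest are hypothetical class names standing in for whatever main classes hold your tests, and the 512 MB heap is an arbitrary choice. Setting -Xms and -Xmx to the same value fixes the heap size so it doesn't resize in the middle of a run.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ForkedRun {

    /** Builds the command line for running mainClass in a fresh JVM with a fixed heap. */
    static List<String> command(String mainClass, int heapMb) {
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBin);
        cmd.add("-Xms" + heapMb + "m"); // initial heap size
        cmd.add("-Xmx" + heapMb + "m"); // maximum heap size
        cmd.add("-cp");
        cmd.add(System.getProperty("java.class.path"));
        cmd.add(mainClass);
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical test classes; each one gets its own JVM, heap, and garbage collector.
        for (String test : new String[] { "ConcatTest", "AppendTest" }) {
            Process p = new ProcessBuilder(command(test, 512)).inheritIO().start();
            System.out.println(test + " exited with " + p.waitFor());
        }
    }
}
```

Because the processes run one after another, garbage left behind by the concat run can never be collected during the append run.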

Leo's Version

I decided to create my own set of tests that run from Ant. I concatenated the same string 10,000 times for each of 10 strings. It yielded similar results.

[java] INFO - Starting test concat
[java] INFO - concat finished. Created 10 strings in 10314 milliseconds
[java] INFO - Starting test append
[java] INFO - append finished. Created 10 strings in 45 milliseconds

You'll notice that my 10 strings took several thousand milliseconds longer than Paul Barry's. The two things that changed are the jvmargs and fork="true".

  • jvmargs. An overview of JVM arguments shows that
    • -Xint will disable HotSpot and thus the native code. Everything runs interpreted.
    • -Xincgc will use an incremental garbage collector that is more predictable. The incremental garbage collector uses small increments of garbage collection to minimize impact by spreading out the garbage collection. IMHO, that's better than having all the garbage collection happen at once and skew test results. There's no guarantee that it will ever run.
  • fork="true" means that the particular instance will be forked from ant and run in its own JVM.



Monday, April 14, 2008

 

Easter Music Video

Hello. I put together a music video of Ashley and Leilia doing an egg hunt during Easter.



 

Ashley playing Go

Hello. I have some nice photos of Ashley playing Go.


On the right: Ashley playing Go around Christmas/New Years in 2007. She's always wanting to play more. I should be glad that she is so interested in it and play more with her.


The Match Go Board

On the left: The board around the time we decided to quit and pick it up again later. Not much for shapes, but I think it was a good learning game for her first time. Did I mention it was her first time?



Saturday, March 22, 2008

 

GNUL v. AOSX

Hello. I thought it would be amusing to title this after what they would look like if they were stock symbols. It has no further meaning than that. This is really "GNU/Linux vs. Apple Mac OS X." There, I fully qualified the operating system names for the picky enthusiasts on both sides.

Since an apparent rivalry is beginning between Linux and OS X, I thought I would offer up my perspective as a user of both. I've been using GNU/Linux since I installed version 0.2.1 (yes, the versioning system has existed that long), BSD since 1999, and OS X since 2001. The first computer I ever used was an Apple II, and I learned to type in 1991 on a Mac Classic. My first programming job was DCL and MACRO on VAXes, but my second job was virus protection for Macs written in Perl, and e-commerce sites also written in Perl. I developed on the Power Computing Mac clones. Remember those?

Why Do I Use Linux?

Linux is for ALL Development

Needless to say, I have serious Mac-envy. I love Macs. Why then did I become a Linux user? And why did I stay a Linux user after OS X was released? In simplest terms, I'm a software developer. Linux is THE software development platform. I'm not speaking for just web development, database programming, embedded software, or anything specific at all. It's just good for ANY and ALL software development. Seriously. Linux is an OS designed and created by developers for developers. OS X in contrast is an OS designed to be cool and draw consumers to Apple Consumer Electronics.

A Variety of Cross-Platform/Architecture Compilers and Tools

Here's an example. I haven't willingly run Windows on my home or work computer since 1996. I gave Windows 95 a chance, hoping it would improve on Windows for Workgroups 3.11. In 2003, a friend asked me to write a multi-threaded program to integrate with their performance testing tool on the Win32 platform. I'd done plenty of multi-threaded programming in and out of C++ by that point, but hardly any Win32 because I had sworn off that stuff. It was for a friend though. We do crazy things for favors. Using the MinGW libraries, I was able to produce something for ia32 and alpha platforms in a couple of weeks.

Here's another good example. There's a product called perl2exe. It's a build tool that takes your perl software and converts it to native binaries (bypassing code completely.) It just takes the perl and creates a binary plus library for any platform (Solaris, Linux MIPS, Linux ia32, Linux ia64, et al.)

Basically, Linux crosses architecture/platform barriers with ease. OS X can do this as well, but with a bit more tweaking. Some might argue that the examples I give are primarily aimed toward GCC, which is also available on OS X. OS X even has a prime IDE for C/C++/Obj-C development that comes with the operating system for free. That may be true, but the GCC compiler that is distributed for OS X is usually dated and not well supported. To get the newer features and the fixes for problems on OS X, one needs to download the new source and rebuild it. OS X does not lend itself well to this kind of tweaking, especially if one is not experienced with GCC from Linux.

Aside from GCC, there are other tools and compilers supported on Linux that are not supported on OS X for development. This is coming along, and maybe it will get there. My point is that the way things are right now, OS X has not caught up to Linux. Take Wind River for example. Wind River is an embedded compiler that supports a number of platforms. I prefer this compiler for all my embedded development. It has much better hardware optimization support than GCC. It is possible to just use GCC, but Wind River is head and shoulders above it. What's my point? If you follow the link, you'll find that it only runs on Windows, Linux, and Solaris. It's proprietary and not OSS, so you can't just rebuild it for OS X. It simply isn't supported. Maybe in the future it will be, but for now I'm using Linux for my embedded development.

A Variety of Hardware Emulation Platforms

Hardware emulators aren't just for emulating your favorite video game console or for running Windows, but that seems to be about the best OS X supports. Why do we need virtual machines and hardware emulators? Well, for the same reasons given above. Once you write your source and it compiles, do you really expect it to run on your machine? Absolutely not. You have to run it on a hardware emulator, and Linux has so many of them, with so much support, that it's insane.

Some may say that this is really only useful for hacking and enthusiasts, but that's absolutely not true. I think too that this is definitely the electronic age. If you truly want to get the most out of your products, you should be looking at ways to improve them. Likewise, this kind of thinking is what landed us the iPhone, other smartphones, and PDA phones. As a result, more jobs for this kind of work are appearing in the world. The demand for hardware emulators for professional software development is going up faster and faster. OS X has an uphill battle on this because the only embedded development platform it truly supports is for the iPhone and people have to pay to do that.

Linux is Lean

Now we get into probably my favorite reasons why I like to use Linux and not OS X. Linux is not a bloated BSD. I'll agree that the best thing to happen to BSD is OS X, but it's still bloated. It's got tons of software that you can't easily configure or trim. It takes forever and a day to install. I can install Linux in less than 10 minutes and have everything I need. OS X took me over an hour on my MacBook and it came with a ton of things I really didn't want. People don't realize it, but there are tons of things to configure to get your OS X to run A LOT faster. Whose idea was it to turn all the features on and make it run super slooooow out-of-the-box?

Most Linux systems are designed to run lean out-of-the-box. If you want the cool features, you have to know what to tweak. Furthermore, the number of Linux distributions that can run on small-footprint platforms is growing, and so is the desire for embedded Linux platforms because of the ease of development.

Desktop Environments Aimed for Geeks and Developers

This is probably my greatest reason for Linux. I don't think this will change either. The desktop environment is aimed toward developers. Linux Distributions (Ubuntu, Fedora, et al.) have made many attempts at a more user-friendly environment for desktop users, but they're in their prime right now for developer desktops. As a developer, let me list the things that impress me about Linux desktop.

  • Alt-Button-1 window movement. This is being able to move your window or resize it without clicking on the title bar or the window anchors. This is huge for me because sometimes on OS X, the window drifts off-screen. Now you can't move it unless you know the appropriate key-combination. I want to be able to click on a window and have it move. Sometimes, my mouse is near a window corner, but not the title bar. Or sometimes, my pointer is not near a window anchor. Maybe I don't like wasting my time trying to grab the ever so tiny window anchor for resizing. This is a big one for me. As long as Linux can do this and OS X doesn't, I'm going to love Linux. Yes, it's that important and I see it in OS X's future.
  • Multiple-desktops and Virtual Desktops. Leopard has this, but it's not very good. It's more flashy than functional. It's the basics, so that's alright. The support on most Linux Distributions is much better though. Furthermore, why is this important? Well, when I worked at IBM, this was the developers' prime reason for using Linux. To have more than one desktop, and organize them based on what one is doing. This was a huge productivity feature for many programmers. I don't use it as much, but I do use it a lot.
  • Mouse gestures built-in. I don't think I have to say anything else about this.
  • Customizable keystrokes. OS X comes with its own set of keystrokes that are standardized. Linux does as well, but you can deviate from the norm. I choose not to, but rather I like to add more keystroke commands. I have to say that my favorite ones are for bringing my favorite applications to the front. For example, I use Ctrl-Shift-F12 to bring my buddylist to the front. Then, I pick a user and message. That's nice. I like being able to cycle similar program windows (you can do this in OS X too) with Alt-Esc. I can use Ctrl-Shift-F10 to switch to Firefox, and then Alt-Esc to cycle windows. It's nice. My favorite is Ctrl-Alt-T. It is a standard keystroke in Linux for bringing up your preferred terminal emulator.

Why Do I Use OS X?

This may seem absurd to some, but I actually think OS X is the best OS out there right now for Consumer Level Desktops and Servers.

OS X is for Desktops

From the perspective of a Desktop User and a Desktop Support Specialist, OS X is a dream.

  • Everything just works.
  • It works well.
  • It does what you expect it to do.
  • Almost never crashes, and hung programs have a simple, intuitive way to kill them.
  • It's user friendly.
  • There's a consistent way to get to Preferences for each program.
  • Configurations are always simple and easy to understand.
  • The OS uses terminology that normal users can understand.
  • Great API integration so programs can easily reuse Services and work together cohesively. This produces an experience such that a user doesn't need to know what program he/she is in. They are all pretty much the same.
  • A very easy platform to support from the Desktop Support perspective.

There is just no denying it. OS X is the most dumbed-down, simple OS in the world. Perfect for the average consumer or someone who just doesn't want to worry about support of their PC.

Some have suggested that in order to get over my Linux Desktop environment hangups, I should run XonX. I want to dispel this as a good idea right here and now. That's just silly. Why would I run two desktop environments on the same system? I would run either X or Aqua. With that said, if I do choose X, why am I running OS X in the first place? That is my logic. If I'm going to run X on my Mac, I'm going to install Linux and let that be the end of it. I want to run OS X for its unique features. That's why I run it at all. When I get home, I am glad to know I am coming home to a stable system that I don't have to maintain.

OS X is for Servers

A lot of companies like to lean on Blade Center running Linux or Windows. Linux has primarily been chosen for most server platforms because it's lean, and Linux administrators are a dime a dozen. You can just get a Linux SA and a cheap ia32 machine and there you go. I don't necessarily believe that is all that great. Apple's Xserve is some of the best hardware for a 1U server that you can find anywhere. It comes with OS X though. You can install Linux on it, but wait! The same reasons illustrated above for Desktop Users are exactly why you want to keep OS X on your server. Setting it up becomes a no-brainer. It ships to you headless and ready-to-go. You can configure it remotely with no problems. It just works.

Basically, my logic is that Xserve is some of the best server hardware you can find, so why not buy it? It comes with OS X and is simple to manage on your own, so why install Linux on it? Apple has made it so that anything more than going to the Apple Store and purchasing an Xserve through it is just too much of a hassle.



Sunday, March 16, 2008

 

When to Use One-Line Comments

Hello. In my post on Comments in OO Java Software, I stated firmly that I believe one-line comments are bad smells and should be refactored. I still stand by that, but there are exceptions. This post is about those exceptional use cases.

Literate Programming

This is my #1 favorite excuse. What is it? Here's an explanation from the Literate Programming FAQ:

Literate programming is a phrase coined by Donald Knuth to describe the approach
of developing computer programs from the perspective of a report or prose. The focus, then, 
is on description (and documentation) of the approach in human-readable form. This is in 
contrast to the normal approach of focusing on the code. -- David B. Thompson

Here are a few words about it from Dr. Knuth himself.

The structure of a software program may be thought of as a web that is made up of many
interconnected pieces. To document such a program we want to explain each individual part
of the web and how it relates to its neighbours. The typographic tools provided by TeX give
us an opportunity to explain the local structure of each part by making that structure visible,
and the programming tools provided by languages such as C or Fortran make it possible for
us to specify the algorithms formally and unambiguously. By combining the two, we can
develop a style of programming that maximizes our ability to perceive the structure of a
complex piece of software, and at the same time the documented programs can be
mechanically translated into a working software system that matches the 
documentation. -- Donald E. Knuth 

In my own words, I would say that Literate Programming is a style of programming that makes source code a work of literature.

What advantage is there to writing software this way?

  • Reuse. Just like source code reuse is so popular in programming, documentation reuse is also a good idea. Literate Programming promotes documentation reuse by normalizing documentation that can be reused from source code to different documents.
  • Cross-referencing documentation. Documentation that is easy to read, navigate, and understand.
  • Documentation that can be treated as source code because it is in the source code.
    • Documentation version control. The documentation maintains the same version as the source.
    • Documentation and source code can always be found. If you have the source, you also have the documentation.
    • Documentation does not need to exist in a proprietary or specific format. It can be generated as long as it exists in the code.

Let's examine javadoc for example. This is a prime example of Literate Programming in OO software. It is useful because it can be generated from the source. A number of IDEs support it almost automatically. For example, Eclipse can be configured to produce javadoc comments for every method. These comments include input parameters, output values, exceptions, and cross-referencing with similar/related functionality. The javadoc comments are parsed and documentation is produced. The primary medium for output is HTML, but javadoc is not the tool. Javadoc is the style of documentation comments. The tool is irrelevant. For example, a tool could be created to produce LaTeX or even PDF output. That would not change that it is javadoc.

Javadoc is Literate Programming because it is a living work of literature within the source code. This does not mean it turns your program into a Literate Programming environment. The context is scoped just toward javadoc. It makes your source MORE Literate, but not a Literate Programming environment. It basically does for Java what WEB/CWEB did for Pascal, C, and other programming languages.

What does this have to do with one-line code comments? Let's examine javadoc again. If you follow the documentation commenting guidelines in The Elements of Java Style, then you have produced good documentation comments with input/output parameters, exception explanations, code examples, gotchas, domain-specific details, implementation details, et al.

However, the examples in the documentation comments are not continuously updated. They can change, and maybe even become neglected. Also, the source code itself is not included in any of the generated documentation. Therefore, we have all the documentation we may need, but no source to relate it to.

Another case would be where we have external documentation that refers to the source code. That is, the document is not part of the source code like javadoc is, but is created outside it. Not only is it created outside the source code, but it also refers to specific cases internal to the source code. Let's analyze the requirements:

  • Has access to the source code.
  • Is generated.
  • Refers to specific cases in the source code, but not by line number.
  • Not affected by concurrent changes to code. Whenever the code changes, the documentation does not need to be updated, just regenerated.

Basically, pulling out specific sections of code to be inserted into the documentation when it is generated, and always grabbing the correct source. How is this done? There are a number of ways to do it, but one of the most popular ways is to tag the source code itself. Here's an example using a tool called the Snippet Plugin for referring to source code from Confluence.

// START SNIPPET: deleteAbstract
public ActionForward deleteAbstract(ActionMapping mapping, ActionForm form, HttpServletRequest request, HttpServletResponse response) throws Exception {
    return confirm(buildDeleteAbstractConfirmationQuestion(mapping, form, request, response), CONFIRM_DELETE_ABSTRACT_KEY, EMPTY_STRING);
}
// END SNIPPET: deleteAbstract

Notice the one-line comments that say "START SNIPPET" and "END SNIPPET". These comments tag where to start the snip and where to end it. This is a common pattern for extracting source examples. A similar approach is used by The Pragmatic Programmers publishing house. The Pragmatic Programmers use a proprietary docbook stylesheet library to create consistent layouts for their books, but they also use a Perl-based token replacement system that scans source code for tags written in one-line comments and replaces the appropriate docbook code in their build system. The end result is a system where source code only needs to be changed in one place and the documentation is updated with it. Reusable documentation.

Some people think the comments for the Snippet Plugin couple us to the Snippet Plugin or even to Confluence. That is just not true. The comment is configurable for the Snippet Plugin. That means that you can make it look however you want. It doesn't even need the word SNIPPET in it. I use listings with my TeX documentation to include source code. It is the same concept: using arbitrary tags, which could look like anything really, to mark sections of code for use in external documentation. Contrary to what people might think, the source code does not require the Snippet Plugin or Confluence to run. That's the beauty of using a one-line comment.
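To show how little machinery this pattern needs, here's a minimal sketch of a tag-based extractor. To be clear, this is not how the Snippet Plugin or the Pragmatic Programmers' build is implemented. The class, method, and parameter names are all mine, and the tags are passed in precisely because they are arbitrary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SnippetExtractor {

    /** Returns the lines between the "// startTag id" and "// endTag id" comments. */
    static List<String> extract(List<String> source, String startTag, String endTag, String id) {
        String startMarker = "// " + startTag + " " + id;
        String endMarker = "// " + endTag + " " + id;
        List<String> snippet = new ArrayList<>();
        boolean inside = false;
        for (String line : source) {
            String trimmed = line.trim();
            if (inside && trimmed.equals(endMarker)) {
                break;                 // stop at the closing tag
            }
            if (inside) {
                snippet.add(line);     // keep original indentation
            }
            if (trimmed.equals(startMarker)) {
                inside = true;         // start collecting on the next line
            }
        }
        return snippet;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList(
                "int before = 0;",
                "// START SNIPPET: demo",
                "int shown = 1;",
                "// END SNIPPET: demo",
                "int after = 2;");
        System.out.println(extract(source, "START SNIPPET:", "END SNIPPET:", "demo")); // [int shown = 1;]
    }
}
```

Since the markers are ordinary one-line comments, the compiler never sees them and the code runs with or without any documentation tool installed.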

This isn't a new concept. This kind of thing has been around for a long time. Donald E. Knuth came up with Literate Programming and the WEB system in the early 1980s; his paper "Literate Programming" was published in 1984. The Pragmatic Programmers are just building on a solid foundation of concepts created by one of the greatest computer scientists in our history.

So that's one good reason to use one-line comments.

Temporary Metadata

What is that? Well, I think it is information that is abstract. It's not necessarily about the domain, the code or anything, but it is useful about a specific task. For example, javadoc and Eclipse both have this concept of a "todo." This is a temporary metadata comment. It exists for only a short while. It should only exist for a few hours at most. Some people manage to let them creep into the code and stay there indefinitely.

Below is another good example. For whatever reason (debugging, troubleshooting, whatever...) sometimes code is commented out. You don't want to use a C-Style comment or a documentation comment for this kind of stuff. It's best to use a one-line comment.

public ActionForward execute(ActionMapping mapping, ActionForm form, HttpServletRequest request, HttpServletResponse response) throws Exception {
    KualiConfigurationService configService = getService(KualiConfigurationService.class);
    ((ProposalDevelopmentForm)form).getProposalDevelopmentParameters().put("proposalNarrativeTypeGroup", configService.getParameter(Constants.PARAMETER_MODULE_PROPOSAL_DEVELOPMENT, Constants.PARAMETER_COMPONENT_DOCUMENT, "proposalNarrativeTypeGroup"));

    // ProposalDevelopmentForm proposalDevelopmentForm = (ProposalDevelopmentForm) form;
    // ProposalDevelopmentDocument proposalDevelopmentDocument = proposalDevelopmentForm.getProposalDevelopmentDocument();
    // proposalDevelopmentDocument.populateNarrativeRightsForLoggedinUser();

    ActionForward actionForward = super.execute(mapping, form, request, response);
    return actionForward;
}

Keep in mind you do not want to check this into your source code versioning system. This is something you as a developer are the only one who will see. It should also only exist for a short period of time for your own sake. Again, for the sake of the planet, don't check this stuff into CVS/SVN. Visual Sourcesafe is ok.
