Monday, January 10, 2005

Semantic Diff

Hari and I happened to read an article on Martin’s blog about semantic diff. It describes a smart diff utility that displays not only the differences between two files, but also the refactoring changes that have been made. Traditional file diff utilities display the changes that have occurred in each line, given two source files. They do this because they treat both source files as simple text files. But is this really the way to do it? Are source files just simple text files?

We don’t think so. We believe that source files should be treated as a stream of tokens or blocks. Consider a Java source file: the package declaration, import statements, and comments would be separate blocks. A class or interface definition would be a block that could contain other blocks, such as inner classes, methods, and member declarations. The diff utility then needs to compare the blocks it has identified. The order of these blocks would not be important. For example, a developer may choose to rearrange classes, or methods inside classes. A traditional diff utility would show a thousand differences between the two versions, but a good diff utility would identify them as cosmetic changes to the source.
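The block idea can be sketched in a few lines. This is only an illustration under some loud assumptions: it works on Python source rather than Java, because Python's standard library ships a parser (the ast module), it looks only at top-level functions and classes, and the names top_level_blocks and block_diff are made up for this sketch.

```python
import ast


def top_level_blocks(source):
    """Map each top-level function or class name to a dump of its syntax tree.

    ast.dump() leaves out line numbers by default, and the parser drops
    comments and blank lines, so a block that was merely moved or
    re-commented produces exactly the same dump.
    """
    blocks = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            blocks[node.name] = ast.dump(node)
    return blocks


def block_diff(old_source, new_source):
    """Compare two versions block by block, ignoring the order of the blocks."""
    old = top_level_blocks(old_source)
    new = top_level_blocks(new_source)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }
```

If two versions differ only in the order of their functions, or in comments and blank lines, block_diff reports nothing added, removed, or changed, where a line-based diff would flag nearly every line. A real tool would also need to handle nested blocks, imports, and renamed blocks, which this sketch does not attempt.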

It doesn’t end here. A really smart diff program should also show you refactoring changes such as method extraction. It should be able to detect that a new method has been extracted out of an existing method, and also display the locations where this refactoring has been applied. Such a diff utility would be a boon to developers.
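One naive way to spot an extracted method is sketched below. Again the assumptions are ours: Python source stands in for Java, only top-level functions are examined, and find_extractions is a hypothetical name. The heuristic reports a pair when a function is new, an existing function now calls it, and at least one statement that disappeared from the existing function reappears unchanged in the new one.

```python
import ast


def functions(source):
    """Map each top-level function name to its syntax tree node."""
    return {n.name: n for n in ast.parse(source).body
            if isinstance(n, ast.FunctionDef)}


def calls(func):
    """Names of the functions that func calls directly by name."""
    return {n.func.id for n in ast.walk(func)
            if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)}


def statement_dumps(func):
    """Position-independent dumps of the statements at the top of func's body."""
    return {ast.dump(stmt) for stmt in func.body}


def find_extractions(old_source, new_source):
    """Return (new_function, caller) pairs that look like a method extraction."""
    old, new = functions(old_source), functions(new_source)
    found = []
    for new_name in sorted(new.keys() - old.keys()):
        moved = statement_dumps(new[new_name])
        for name in sorted(old.keys() & new.keys()):
            # Statements present in the old version of the caller but not the new one.
            lost = statement_dumps(old[name]) - statement_dumps(new[name])
            if new_name in calls(new[name]) and moved & lost:
                found.append((new_name, name))
    return found
```

This only matches statements that were moved verbatim. If the extraction renamed a variable or turned a local into a parameter, the dumps no longer match, so a serious tool would need some form of fuzzy tree matching instead.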

We are trying to develop a program that implements a few of the ideas mentioned above. It is proving to be an extremely challenging task. We are making very slow progress with it, but progress nevertheless. Any help with this is more than welcome.


Aditya Kulkarni said...

Anirudh said...

I have tried Antlr and found it very useful, so I would suggest it to you as well. You will need to write a grammar, using regular expressions, for the various statements (import, assignment; you can even specify blocks for methods, which in turn contain statements). It will tokenise everything, and you can then use these tokens to compare the files. Antlr produces Java files which you can modify to suit your needs. Hope it helps!

Erich said...

Check out SSDDiff - this is a "semantic" diff for basically any kind of "semistructured data". Right now it only has an XML parser, but you could add a Java parser, too.
The full analysis mode can eat a lot of memory, which is why it was written in C++ and not in Java, but the fast mode should provide okay results in most real-world cases.
It tries to keep as much of the structure intact as possible when doing the diff.

Anonymous said...


Ira said...