Team Spider Development Web Page

5/15/2006

The three implementation changes outlined in the last update have been completed. The installer will be available later today to test out the new functionality. We should also have all our artifacts available on the website later today.

In the meantime, we have added the installer code to the deliverables section. This was requested by Dan so he could build a new installer for Office 2000.

5/10/2006

Team Spider met with Dan tonight to hand over our fully tested version of the Text Extraction Tool. Dan will be completing acceptance testing within the next few days. With any luck, everything will check out and we can get formal acceptance.

While at the meeting, Dan brought up a few final requests for us to complete. While a couple do involve changing the code, all are small and (hopefully) easily completed. The requests are listed below:

Add a parameter to specify where to send logging output
Create a canParse method in the TextExtractionTool interface that will return the id of the parser (as defined in the config.xml file) that would be used in parsing the file specified.
Add a parameter to specify the location of the config.xml file
Look into how we could add our DLL files to the Add References screen in Visual Studio .NET
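
The first three requests can be pictured with a small sketch. It is written in Python purely for illustration (the actual tool is VB.NET), and the class layout, the parameter names config_path and log_stream, the config.xml element names, and the choice to return None when no parser matches are all assumptions rather than the delivered interface.

```python
import io
import os
import sys
import tempfile
import xml.etree.ElementTree as ET


class TextExtractionTool:
    """Python stand-in for the interface; names and layout are hypothetical."""

    def __init__(self, config_path, log_stream=sys.stderr):
        # config_path: location of config.xml (third request).
        # log_stream: any writable text stream for log output (first request).
        self._log = log_stream
        self._ext_to_parser = {}
        for parser in ET.parse(config_path).getroot().findall("parser"):
            for ext in parser.findall("extension"):
                self._ext_to_parser[ext.text.strip().lower()] = parser.get("id")

    def can_parse(self, file_path):
        """Return the id of the parser that would handle file_path, or None."""
        ext = os.path.splitext(file_path)[1].lower()
        parser_id = self._ext_to_parser.get(ext)
        self._log.write(f"can_parse({file_path!r}) -> {parser_id!r}\n")
        return parser_id


# Demo with a throwaway config file and an in-memory log.
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "config.xml")
    with open(config_path, "w", encoding="utf-8") as fh:
        fh.write('<parsers><parser id="pdf"><extension>.pdf</extension></parser></parsers>')
    log = io.StringIO()
    tool = TextExtractionTool(config_path, log_stream=log)
    pdf_id = tool.can_parse("report.PDF")
    unknown_id = tool.can_parse("notes.xyz")
```

Passing the log destination as a stream keeps the caller in control of where output goes, whether that is a file, the console, or a buffer as in the demo.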

We will also be posting the final versions of all our deliverables on the website shortly in addition to providing TCN with a CD containing all our deliverables.

5/8/2006

It's been a busy seven days. Team Spider has been working on a number of things in preparation for the end of the school year (and our undergraduate degrees here at RIT!).

To start, we have been implementing the final part of the Text Extraction Tool. A last-minute change request was submitted to allow TCN to change the file types a parser will work for. We accomplished this with an XML configuration file. Now that we have completed that, we will be meeting with Dan on Wednesday to hand him our final product to begin acceptance testing. With a little bit of luck, we are done coding. We have also created a mini-app at Dan's request to show how an external system would interface with the Text Extraction Tool. That sample is now available on the Deliverables page.

The second item we have been working on is the poster presentation. Last Friday James and Ted presented our project to all who were interested in the Golisano building. From the reports we've received, the event went over very well. We were planning on a repeat event this coming Friday but it now sounds like that will not involve actually presenting the posters to outsiders. If it does, Jim and Dan are more than welcome to attend.

The final item we have been working on is our Technical Document. We will be presenting Prof. Reddy with our final draft for review on Wednesday. He has agreed to provide any final feedback by Friday so that we can turn it in during Week 10.

Everything is proceeding along very smoothly in all three of these areas.

5/1/2006

We met with Dr. Reddy today to discuss the Technical paper we have to produce before the end of the quarter. We had presented him with a draft copy which he commented on. We have some rephrasing to do across the document but all the important parts were present.

As for the project, we are almost complete. We need to complete support for the config file and we will be ready to start acceptance testing.

4/19/2006

The meeting with TCN today went well. I think we reached an agreement we are all at least somewhat happy with. After some discussion with Dan, we agreed that it was best to leave the design as it is, mainly since little or nothing is gained by implementing the change. Instead we are going to create a config file that the parsers will look to in order to determine what file extensions they can parse. This way, if an RTF file keeps failing with our parser, TCN can switch it over to the Word parser (which also supports RTF) quickly and easily. This change should be completed within the next week or two, depending on how much time the poster and technical paper take up.
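
To make the RTF example concrete, here is a minimal sketch in Python (used only for illustration; the tool itself is VB.NET). The XML layout, the element names and the parser ids "rtf" and "word" are assumptions, since the real config file format is not reproduced on this page.

```python
import xml.etree.ElementTree as ET

# Hypothetical config layout: each parser lists the extensions it handles.
CONFIG_BEFORE = """
<parsers>
  <parser id="rtf"><extension>.rtf</extension></parser>
  <parser id="word"><extension>.doc</extension></parser>
</parsers>
"""

# The same file after moving .rtf from the RTF parser to the Word parser.
CONFIG_AFTER = """
<parsers>
  <parser id="rtf"></parser>
  <parser id="word"><extension>.doc</extension><extension>.rtf</extension></parser>
</parsers>
"""


def load_extension_map(xml_text):
    """Return {extension: parser id}, with extensions lower-cased."""
    mapping = {}
    root = ET.fromstring(xml_text)
    for parser in root.findall("parser"):
        for ext in parser.findall("extension"):
            mapping[ext.text.strip().lower()] = parser.get("id")
    return mapping


before = load_extension_map(CONFIG_BEFORE)
after = load_extension_map(CONFIG_AFTER)
print(before[".rtf"], "->", after[".rtf"])
```

The point of the sketch is that rerouting a file type is a one-line edit to the config file, with no code change or rebuild.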

As for the technical paper and poster, we have submitted both for review by Professor Reddy. He has provided some initial feedback about the poster. We are continuing to revise both of these documents as we review them internally as well.

4/12/2006

We just got back from our meeting with TCN. The Word and Excel parsers we had implemented worked fine. Due to a miscommunication, Ted didn't have the latest code for the PowerPoint parser which caused it to produce errors. We are going to email TCN the updated version soon.

TCN has made two requests for changes to the Text Extraction Tool. The first is fairly small and almost expected. Right now the three Office parsers only parse their default document types (.doc for Word, .xls for Excel and .ppt for PowerPoint). That is a limitation we introduced into the tool on purpose. After discussion, Dan, Jim and the three of us agree that opening up the Office tools to parse the other formats they support is desirable. It is a relatively easy fix on our side and will open up the ability to parse many other formats like MS Works, WordPerfect, etc.

The second change is much larger. Dan has expressed a desire to be able to interact with each parser individually. Essentially, instead of having the MainControl class handle the determination of which parser to use, he would like to have that ability pulled out to within the Spider tool using our tool, which can then be used to make more decisions on their end as to which parser to use with a file. This was not an expected, planned or even envisioned change. We had been under the assumption from our requirements gathering that TCN wanted a single point of entry that took in a file location (plus some other administrative values) and returned the text from the file.

We do not see the change being terribly difficult to do (which is as much luck as anything this late in the game). However, the change does kill most of the flexibility we introduced into the system in our design. After an initial discussion immediately following the meeting, the three of us feel that this change will make it much harder to build new parsers, remove the ability to dynamically add and remove parsers and introduce a lot of copy-and-paste code between the parsers. This also will increase the coupling between the Spider tool and our parsers.
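
The trade-off can be sketched roughly as follows. This is an illustrative Python stand-in, not our VB.NET code: the MainControl name comes from our design, but the register, unregister and extract_text methods and the fake parsers are hypothetical.

```python
import os


class MainControl:
    """Single point of entry: the caller passes a file path and gets text back."""

    def __init__(self):
        self._parsers = {}  # extension -> callable(path) returning text

    def register(self, extension, parser):
        # Parsers can be added without touching the calling system.
        self._parsers[extension.lower()] = parser

    def unregister(self, extension):
        self._parsers.pop(extension.lower(), None)

    def extract_text(self, path):
        ext = os.path.splitext(path)[1].lower()
        if ext not in self._parsers:
            raise ValueError(f"no parser registered for {ext!r}")
        return self._parsers[ext](path)


# Hypothetical stand-in parsers that only label their input.
def fake_pdf_parser(path):
    return f"pdf text from {path}"


def fake_word_parser(path):
    return f"word text from {path}"


control = MainControl()
control.register(".pdf", fake_pdf_parser)
control.register(".doc", fake_word_parser)

# Current design: the calling system never names a parser.
via_entry_point = control.extract_text("manual.pdf")

# Requested design: the calling system picks the parser itself, so it has to
# know about every parser it wants to use.
via_direct_call = fake_pdf_parser("manual.pdf")
```

In the first call the Spider tool depends only on the entry point. In the second it depends on each parser directly, which is the added coupling described above.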

We are busy with the technical paper and poster design for the rest of this week and will not even begin to look at the necessary changes until next week. In that time we will definitely talk with Professor Reddy about the changes. We will also be filling out a change request to record the reasons why this decision is being made. Once completed we will also have to spend a considerable amount of time revising our requirements and design documentation.

4/10/2006

James and I met today to discuss the poster and prepare for the meeting with TCN on Wednesday. We agree that the poster is a good start but needs to be "prettied" up some and, more importantly, we have not finalized what we want our content to be. James is going to continue working on that through this week. We have not yet started the technical paper (other than an outline of the headings). The plan is to have that completed by Friday, reviewed by the team by Sunday and sent to Professor Reddy in time for him to review it for our Monday afternoon meeting.

4/5/2006

Today Professor Ludi and Professor Lutz presented on the two big "school" deliverables we have left to do for the senior project. The first item is a 30" x 40" project poster. These will be presented once or twice in May. The second item is our technical report. The purpose of this is to summarize our project in a level of detail that will allow others to become familiar with what we have done. The due date for the rough draft of this has been set to the 24th of April.

After the presentations by the Professors, Team Spider met for a short period and planned out our actions for next week. We want to get started on both the deliverables as well as finish development on the Office Automation parsers. We feel the parsers will be ready by next Wednesday and plan on meeting with TCN sometime next week.

4/3/2006

We met with Professor Reddy to go over the outlines for our technical report and poster design. We also brought him up to speed on our meeting with TCN last week. Together we set a team due date of April 17th for first drafts of both artifacts. This will give us time to review them with Professor Reddy the week before they are due.

3/30/2006

Been a bit since the last update to the status page so there is a bit to write about this time.

We'll start with the Team Spider meeting on Sunday. We met at school for two reasons. The first was to discuss what we were going to do with our Word parser. Our conclusion was that we needed to talk to TCN in depth and plan out a course of action.

The second and more important reason we met was to conduct a code review. As a team, we want to make sure we are all commenting in a standardized fashion, performing the same error checks, handling exceptions the same way, etc., as well as looking for potential defects. We analyzed the three big classes within our project: MainControl.vb, DynamicLoader.vb and ParserPDF.vb. As we reviewed the classes, the respective owner took notes on what suggestions we had. These changes will be implemented over the next week or so. All said, we did find a few potential defects and also got on the same page standards-wise.

On Wednesday we met with TCN to discuss the Word (and Excel and PowerPoint) parser. We presented the two working options we had for them and gave recommendations for going forward. After some discussion, we reached a decision and plan of action.

The first option we presented was Office Automation. This is the API Microsoft provides into their Office products. Essentially, Microsoft allows programmers to fire up a slimmed-down Word, PowerPoint, Excel, etc. process and interact with it through the Office Automation API. The positive side of this is that using Office tools provides very high levels of support for any Microsoft product, whether that be Word, Works, Excel, PowerPoint, or something else. These tools also theoretically make it easier to upgrade to support new versions of Office. Another positive side of Office Automation is that all our code would be in VB.NET. This is the requested language for development of the Text Extraction Tool. The downsides to Office Automation presented were that you must have Microsoft Office installed on the machine running the Text Extraction Tool and that you do have to fire up an external process.

The second option we have gotten working and presented to TCN was the catdoc tool. This tool is an open-source C library that can open and parse Word documents. The library would allow us to view the actual code being used to parse the file (which Office Automation does not). There are some significant problems with it however. First, the author states on his site that catdoc is not tested or built to run on Windows. We have gotten it to run within Windows, but with some painful restrictions, namely a limit on the filesize. Catdoc also suffers from the same problem as Office Automation in that we would have to fire up an external process to get it to run.
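
For a command-line tool, "firing up an external process" looks roughly like the sketch below. It is in Python for illustration only (our implementation is VB.NET), the function name and timeout default are hypothetical, and it assumes the tool writes the extracted text to standard output. Office Automation is driven through its own API rather than standard output, so this applies more directly to a catdoc-style tool.

```python
import subprocess
import sys


def run_external_extractor(command, timeout_seconds=30):
    """Run a command-line extractor and return what it writes to stdout.

    Raises subprocess.TimeoutExpired if the tool hangs and
    subprocess.CalledProcessError if it exits with a non-zero status.
    """
    completed = subprocess.run(
        command,
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
        check=True,
    )
    return completed.stdout


# Stand-in for a real extractor so the sketch runs anywhere: a child Python
# process that prints fixed text. With catdoc installed, the command would
# instead be something like ["catdoc", "letter.doc"] (an assumption here).
demo_command = [sys.executable, "-c", "print('extracted text')"]
demo_output = run_external_extractor(demo_command)
```

The timeout matters because a parser that hangs on a bad or oversized file would otherwise stall the calling system.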

Our recommendation going forward was to use the Office Automation tool. It is the method Microsoft supports for accessing Office files programmatically. It also offers better future support and doesn't have any issues running in Windows. After discussion, TCN made a compromise. We will develop the three parsers with Office Automation initially. Since this is fairly simple to do, we are planning on that taking no more than one week, with the possibility of testing extending over the following week. Once the initial parsers are completed, we will go back to the Word parser and attempt to develop our own version. It is to support the Word 2003 format and be independent from any external tools. We are researching OpenOffice and the code in catdoc to determine the format for Word 2003 files.

3/22/2006

Team Spider met again today to go over the progress we have made since Monday. We have a number of tasks going forward as well as some more questions that should probably result in another meeting with TCN. Everything is outlined below.

We investigated the possibility of utilizing the Swish-e tool for Microsoft format parsing. The tool itself actually uses a number of third-party, open-source C libraries that can parse Microsoft Word, Excel and PowerPoint files. The biggest downside is that the author of these three libraries has posted on his website that the tools run in all Unix/Linux distributions, Solaris, and DOS. He specifically points out that they are not supported under Windows. Whether this is just filename length issues or something more problematic we are not sure yet. One possibility would be to try to modify the code to work under Windows.

We have also been working on the Rich Text Format parser. It is progressing along nicely and we see a good chance that it will be ready next week.

Going forward, we are planning on having a code review this Sunday to review what we have written thus far and also to delve into the C code from the three Microsoft parsing libraries. The hope behind this is that we will find and remove as-of-yet undiscovered bugs in our current implementation and build up some working knowledge to make an estimate for modifying the Swish-e libraries to work under Windows.

We have a number of questions and concerns that we would like some time to raise with TCN. We feel, with the current state of affairs, there is a high risk of failure if you assume the Word parser is a requirement of success. To mitigate this we need to have a high level of communication to allow us to move quickly and not have to second guess our decisions. We would like to meet with TCN early next week to update our status and pose these questions. An email will be sent later this week to schedule that meeting.

3/20/2006

Team Spider met with Prof. Reddy to discuss the comments from other senior project advisors about our interim presentation. It was mostly positive, with a couple areas of improvement.

While there, Prof. Reddy also brought to our attention three possible open-source parsing tools we had not discovered yet. Our hypothesis as to why we never saw them is that we were searching for VB.NET specific libraries, which these are not. That said, the libraries do hold some promise for aiding us in completing as many of the requirements we elicited from TCN as possible.

The first tool, Swish-e, holds the most promise. It appears to essentially be an open-source web crawler that supports most standard document formats. It appears to be written in C, although we haven't had time yet to look at the code. It is open-source under the GNU General Public License, similar to xpdf, so we believe we can use it for free. It would probably be a good idea at some point to enlist the help of a lawyer in determining what the GPL allows.

The second tool is OpenOffice. OpenOffice is a free office suite that supports all of Microsoft's proprietary formats. If OpenOffice offers a programmatic interface similar to what we have used with Microsoft Word, we may be able to parse text through that. If the source code is available (which we are not sure of yet) we could also try to copy how the OpenOffice developers read in Microsoft formats. Once again, we have not yet looked at the code but OpenOffice is an offshoot of StarOffice produced by Sun so we are guessing it is written in Java.

The third option is Apache Lucene. Apache Lucene is a full-text search engine. We have not looked too in depth into the tool so we are not sure if it provides support for office documents or not. This tool is definitely written in Java.

We are planning on looking further into all three of these tools and from that knowledge, develop a recommendation for TCN going forward. We are also continuing to work on the RichText parser as well as updating some documentation requested by Prof. Reddy. Assuming everything goes well, we will probably request a meeting either late this week or early next week to decide our course of action going forward.

3/15/2006

The team met with Dan today. While there, we presented him with the latest version of the Text Extraction Tool and discussed the issues we have found with the Word parser.

While testing out the new install, Dan discovered that for some reason the PDF parser freezes on very large PDF files. We have tested files of that size with the parser before so something we did in the latest patch must have caused it. Right now we are at a loss as to what that is. The team will be working on getting that fixed ASAP.

As for the Word parser, we have ruled out the PIA libraries. They will not give us the data we need. We presented the two remaining options for the Word parser (and Excel and PowerPoint because they are Microsoft products) to Dan.

The first would require TCN to have a copy of Microsoft Office installed on the server they have running the Text Extraction Tool. Microsoft provides a programmatic interface into its Office applications and if we have Office on the server, we can interface with the application and retrieve the text fairly easily. This method would have minimal impact on our project schedule and may even put us ahead due to the simplicity of creating the Excel and PowerPoint parsers.

The second option would be to manually create our own Word parser. This has a number of negative side effects associated with it. First, we had never planned on implementing a parser from scratch. This is a large and complicated process and as such, will take a long time to do. We had planned on interfacing with free third-party software for the actual parsing of the files. If we do implement the Word parser on our own, that will probably be the last parser we work on. Second, Microsoft provides no public specification that we know of for the Microsoft Word formats. We would have to piece together what we can figure out. Not only is this a hack, but we are almost guaranteed to miss some Word construct and as a result either include binary information we shouldn't have or miss text that we should have retrieved. And third, our own homemade parser will most likely only support one version of Word. We may be able to create parsers for other versions fairly quickly due to similarities but that is unknown. We would only plan on a parser for one version initially. This is obviously not a good strategy going forward if TCN desires to have more parsers completed by the end of this quarter.

Unfortunately, Jim could not attend the meeting. Because of this we did not make a decision last night. Dan is planning on meeting with him tomorrow and then informing Team Spider of the decision made.

3/13/2006

Break is over and the three of us are ready to get back to work. We met tonight to discuss our current status and what our tasks were going forward.

Over break we have fixed all known defects from our test document (available on the deliverables page). We have also resolved the issue we discovered with the installer during our last meeting. And of course, we have been working on the Word parser. It is due today but is not yet complete. We will fall slightly behind schedule with it but are optimistic about catching up on the Excel and PowerPoint parsers.

We discussed our options for implementing the Word parser during the meeting. We have a working parser right now. Unfortunately it uses a library provided by Microsoft Office, which would mean that the server the Text Extraction Tool is running on would need to have a copy of Microsoft Office installed. Obviously this is not an ideal solution. We are currently experimenting with two similar tracks. The first involves using libraries provided by the free Microsoft Word Viewer application. We have yet to figure out if/how we can use these libraries. The second track involves using some publicly available PIA libraries Microsoft offers. We are not yet positive these will give us access to the file without Microsoft Office being installed though.

2/21/2006

We submitted our first deliverable to TCN this past Monday. It included the code for the core components and the PDF parser. All-in-all, it went fairly well. There were no big issues with the code so apparently we got the basic idea right. There is a small list of things we need to work on. This was to be expected and planned for in our process. We are now in the review phase of the first iteration of plug-in development.

Our main tasks over the next week are going to be to finish preparing our presentation for Thursday, complete the list of changes we received during our meeting with TCN and continue forward with the Word parser.

We have also updated our documentation on the website and have added the metrics we have gathered thus far as well as the results we are getting for our test cases. The installer we delivered should be posted within the next day. As we implement the changes these files will be updated.

2/16/2006

James and I met today (Ted is out of town for the weekend) and planned out our tasks for this weekend to get everything completed by Monday. Knowing that Xpdf is available to us, we are confident of getting the parser to work. The major tasks we see as having left to do are to implement a PDF parser using the Xpdf tool, get a copy of NUnit and complete our test cases and ideally get a code coverage suite (which we have not yet found) to determine the effectiveness of our test cases.

2/15/2006

We received a response from Glyph & Cog today. The Xpdf software is free for use. All we are required to do is include some documentation around the GPL with any distributions. If desired, Glyph & Cog also offer support contracts for the software that include technical support, interim releases and priority for bug fixes. For the module we will be using, the support contract is $1000 per year.

2/14/2006

We have officially released v2.0 of the team website. Essentially, it was boring so we made it more fun to look at. We also added the status page you're looking at to help keep everyone informed about what's happening with Team Spider.

At our meeting today we discussed the possibility of using Xpdf, an ANSI C++ PDF parser. It falls under the GNU General Public License (GPL) but also has a commercial version, so we are going to contact the owning company, Glyph & Cog (http://www.glyphandcog.com/). Assuming they don't require a fee to use it, going forward we will use Xpdf to aid our parsing of PDF files. James is looking into alternatives in case Xpdf does not pan out.

Since this is the first status post, we would also like to remind everyone involved that we are presenting on our work thus far on Thursday, Feb. 23 at 5:15PM in the Golisano Auditorium. All parties involved are more than welcome to attend.