Design: Office Importer
Proposal
Introduction
- use the OpenOffice runtime as a server to convert documents to HTML code
- clean the HTML code
- parse the HTML into XWiki syntax
- integrate these features into XWiki; see the mock-up below
Integration mock-up
The features below are usable only when the Office Converter Plugin is installed. After discussion with Vincent, we decided that the office converter integration will be a plugin plus an application. That is:
- an XWiki plugin for converting office documents to other document formats, such as PDF, HTML, and XWiki syntax
- an office import application that lets users import office documents into XWiki pages
- Import from WYSIWYG
- mock up demo: OfficeImporterWYSIWYG.png
- Preview Office document
- mock up demo: OfficeImporterPreview.png
Current State
Features
- Convert an office document to HTML code and save the HTML code to an XWiki page
- Handle XWiki syntax in the HTML content and escape special characters in the HTML content
- Supported document types: doc, xls, ppt, odt, odp, ods
- Support converting ppt and odp files to a zip file and displaying the zip in an iframe in an XWiki page
- Handle the images in office documents: pictures are uploaded to the XWiki page as attachments
- Integrate into XWiki as an XWiki plugin
- Provide an XWiki application for importing office documents, in which the user can select convert2html or convert2xwikisyntax
- An unfinished convert2xwikisyntax feature, to be finished in the next version
Quick Start
Install
- The latest XE 1.6 from svn trunk is required.
- Install OpenOffice (>= 2.3) on the computer on which XWiki will run. See http://www.openoffice.org
- Copy all the libraries listed below to XWIKI_WEB_HOME/WEB-INF/lib/
- All the required libraries can be downloaded here. The required libraries include:
- slf4j-api-1.4.3.jar
- slf4j-jdk14-1.4.3.jar
- jodconverter-2.2.1.jar http://sourceforge.net/project/showfiles.php?group_id=91849
- jurt-2.3.0.jar
- juh-2.3.0.jar
- ridl-2.3.0.jar
- unoil-2.3.0.jar
- htmlcleaner-2.0.jar http://htmlcleaner.sourceforge.net
- Copy the office importer plugin library to XWIKI_WEB_HOME/WEB-INF/lib/
- Add the office converter plugin to xwiki.cfg
- Edit your WEB-INF/xwiki.cfg file as follows:
xwiki.plugins=[...], com.xpn.xwiki.plugin.officeconverter.OfficeConverterPlugin
Start Server
- Start XWiki as you normally do.
- Start OpenOffice as a server on the same computer.
- On Windows this is a little more complicated; see http://www.artofsolving.com/node/11 for details.
- Alternatively, find the soffice executable (often in c:/program files/openoffice-2.3), change to that directory on the command line, and run:
soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard
- On Linux, the simplest way is to start it from the command line with the following options:
soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard
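Before trying a conversion, it can help to confirm that something is actually listening on the socket given in the -accept option. The following is a minimal sketch, not part of the plugin: the function name is hypothetical, and the host and port defaults are simply the values from the soffice command line.

```python
import socket


def office_server_is_listening(host="127.0.0.1", port=8100, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    This only checks that some process accepts connections on the port;
    it does not verify that the process is OpenOffice or that it speaks URP.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

Calling office_server_is_listening() with no arguments checks 127.0.0.1:8100; a False result usually means soffice was not started or was started with a different port.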
Use the plugin in xwiki
- Import the office import application into XWiki.
- Go to Import.WebHome to convert an office document.
- Select the source file and enter the space and page name of the target XWiki page.
- Select convert2xhtml or convert2xwiki.
- Click the "convert" button.
- If the conversion succeeds, you can click the "result" link to see the new page.
- The source file should have a normal filename with the correct extension.
- The target XWiki page should not already exist. Otherwise, you will be shown "not allowed to view the page."
- If you don't have edit rights on the target page, you will also be shown "not allowed to view the page."
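The filename requirement can be checked before uploading. This is a small sketch under assumptions: the function name is hypothetical, the extension set is the list of supported document types from the Features section, and treating extensions case-insensitively is my assumption rather than documented plugin behaviour.

```python
import os

# Document types listed under "Features".
SUPPORTED_EXTENSIONS = {"doc", "xls", "ppt", "odt", "odp", "ods"}


def has_supported_extension(filename):
    """Return True if the filename ends in one of the supported extensions."""
    _, ext = os.path.splitext(filename)
    # splitext keeps the leading dot; comparing in lower case is an assumption.
    return ext.lstrip(".").lower() in SUPPORTED_EXTENSIONS
```

For example, has_supported_extension("report.doc") is True, while a file without an extension or with an unrelated one such as "notes.txt" is rejected.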
ToDo List and plan
Use htmlcleaner, not jdom filters, to clean the html.
Time: 10 hours
Predict Begin: 2008.08.16
Predict End: 2008.08.17
Task:
- clean the html code so that it is well formed
- remove the head tag (the head tag can be handled by xhtmlparser)
- replace <img> tags with {image}
- remove empty links <a/>
- replace deprecated xhtml tags (if possible)
- pb: HTMLCleaner can't simply replace a tag, so this is a little hard.
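To make two of the cleaning tasks above concrete, here is a rough sketch that uses Python's standard html.parser instead of HTMLCleaner, so it only illustrates the intended input and output and says nothing about how the plugin implements it. The class and function names are hypothetical, and writing the image as {image:NAME} with the last path segment of src is an assumption; the task list only says {image}.

```python
from html.parser import HTMLParser
import posixpath


class CleaningSketch(HTMLParser):
    """Re-emits HTML, replacing <img> with {image:...} and dropping empty links."""

    def __init__(self):
        # Keep entities such as &amp; unconverted so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.out = []         # output fragments, joined at the end
        self.open_links = []  # index in self.out where each open <a> begins

    def _image_macro(self, attrs):
        src = dict(attrs).get("src") or ""
        return "{image:%s}" % posixpath.basename(src)

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.out.append(self._image_macro(attrs))
            return
        if tag == "a":
            self.open_links.append(len(self.out))
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag == "img":
            self.out.append(self._image_macro(attrs))
        elif tag != "a":  # a self-closed <a/> is an empty link: drop it
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "img":
            return
        if tag == "a" and self.open_links:
            start = self.open_links.pop()
            # Nothing but whitespace after the opening tag: remove the link.
            if not "".join(self.out[start + 1:]).strip():
                del self.out[start:]
                return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def clean_html(html):
    parser = CleaningSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The sketch leaves the head tag and deprecated tags untouched, and it drops comments and the doctype because those handlers are not overridden.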
Write test cases for the conversion
Time: 10 hours
Predict Begin: 2008.08.17
Predict End: 2008.08.18
Task:
- refactor the test framework of the office converter test cases
- make small test input files (MS Word, Excel, PowerPoint and OpenOffice) and verify the output
- test the HtmlCleaner (the filter has to be implemented and some bugs in htmlcleaner fixed, so this is off track)
- test the typeformat, util, and other classes
Insert task convert2html
- see here
- implement a convert2html feature
- clean the code
- write javadoc
- write readme
- feature list
- quick start for how to use it
Predict Begin: 2008.08.19
Predict End: 2008.08.19
Convert xhtml to xwiki syntax 2.0
Main Task:
- Write test cases for WikimodelXHTMLParser, covering all the basic tags in xhtml.
- Submit patches to wikimodel and xwiki-core-rendering so that WikimodelXHTMLParser + XWikiSyntaxRendering work well for all the test cases.
Predict Begin: 2008.08.18
Predict End: 2008.08.26
Detailed plan:
| Name | Predict time | Predict begin | Predict end | Test cases | Problems |
|---|---|---|---|---|---|
| | About 1 day | 2008.08.18 | 2008.08.19 | <b> <strong> <i> <u> <s> <strike> <em> <del> <ins> <sup> <sub> <p> (already exists), title or section level (already exists), <hr> <br> | If a tag is deprecated in xhtml, like <u>, how should it be dealt with? That would be the role of the HTML cleaner, so I need to do it in the "html clean" step. Add a TagHandler in wikimodel's XhtmlHandler and add blocks and parser methods in xwiki-core-rendering. |
| List | About 2 days | 2008.08.19 | 2008.08.21 | <html> <ol> <li>Item 1 <ol> <li>Item 2 <ul class="star"> <li>Item 3</li> </ul> </li> <li>Item 4</li> </ol> </li> <li>Item 5</li> </ol> <ul class="star"> <li>Item 1 <ul class="star"> <li>Item 2 <ul class="star"> <li>Item 3</li> </ul> </li> <li>Item 4</li> </ul> </li> <li>Item 5</li> <li>Item 6</li> </ul> </html> | This is hard to fix. Need to see what happens in wikimodel's xhtmlparser. |
| | About 2 days | 2008.08.21 | 2008.08.23 | <a href="http://www.xwiki.org">xwiki</a> | This is hard too. If it can't be solved in the parser, I will use a filter to replace links with xwiki syntax when cleaning the html. |
| Table | About 2 days | 2008.08.23 | 2008.08.25 | <html> <body> <table> <tr> <th>1.1</th> <th>1.2</th> </tr> <tr> <th>2.1</th> <th>2.2</th> </tr> </table> </body> </html> | Even harder, because tables are handled by a macro in the new rendering. Can I just add a simple temporary tableblock solution? |
| Image | 5 hours | 2008.08.25 | 2008.08.25 | <img src="imgurl"/> | Just ignore, as I replace <img> with {image}. |
| attribute | 10 hours | 2008.08.25 | 2008.08.26 | <p align="center" color="red">middle</p> | Use the style, but how? Need to find out. |
| class | | | | <span class="underline">test</span> | Maybe ignore, and treat it the same as without the class. |
| font | | | | <font size="1" style="font-size: 8pt">test</font> | Ignore? Or something else. |
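As a reference for what the inline and link test cases in the table are expected to produce, here is a standalone sketch that maps a few inline tags and links to XWiki syntax 2.0 markers. It is written with Python's html.parser and is not the WikimodelXHTMLParser; the names are hypothetical. Deprecated tags such as <u> are mapped directly here for brevity, whereas the plan above moves that decision into the html cleaning step.

```python
from html.parser import HTMLParser

# XWiki syntax 2.0 inline markers for the basic formatting tags.
INLINE_MARKERS = {
    "b": "**", "strong": "**",
    "i": "//", "em": "//",
    "u": "__", "ins": "__",
    "s": "--", "strike": "--", "del": "--",
    "sup": "^^", "sub": ",,",
}


class InlineToXWiki20(HTMLParser):
    """Converts inline formatting tags and links; every other tag is dropped."""

    def __init__(self):
        super().__init__()  # entities arrive already decoded in handle_data
        self.out = []       # output fragments, joined at the end
        self.links = []     # (href, index into self.out) for each open <a>

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_MARKERS:
            self.out.append(INLINE_MARKERS[tag])
        elif tag == "a":
            self.links.append((dict(attrs).get("href") or "", len(self.out)))

    def handle_endtag(self, tag):
        if tag in INLINE_MARKERS:
            self.out.append(INLINE_MARKERS[tag])
        elif tag == "a" and self.links:
            href, start = self.links.pop()
            # Everything emitted since the opening <a> becomes the link label.
            label = "".join(self.out[start:])
            del self.out[start:]
            self.out.append("[[%s>>%s]]" % (label, href))

    def handle_data(self, data):
        self.out.append(data)


def to_xwiki20(html):
    parser = InlineToXWiki20()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The sketch does not escape wiki markers that already occur in the text, and lists, tables, and attributes are out of its scope.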
Make ppt and odp work
Time: about 1 day
Predict Begin: 2008.08.27
Predict End: 2008.08.28
Test the project on windows
Time: 5 hours
Predict Begin: 2008.08.28
Predict End: 2008.08.29
If you are using Windows, perhaps you can help me test it. Thanks.
Documents and package
Time: 5 hours
Predict Begin: 2008.08.29
Predict End: 2008.08.30
Old Plan
core of plugin July 8 - July 12
Actually, this work will last until the end of the project, as the core code needs to change to fit the high-level API.
- Todo
- Clean up the code. Provide a low-level API and a high-level API, so that the plugin can be used both in xwiki pages and in other parts of xwiki. (to be detailed)
- Handle conflicts with the xwiki syntax (maybe that is the job of xhtmlparser)
Integration with xwiki
Develop an application July 13 - July 15
- upload a file
- select the target page
- convert the document to the page
- how to upload a file using fileuploadplugin and get the byte[] of the file
- a new page, or insert into the existing page
Source Code
This project has just started and has only produced the initial code. Any suggestions are appreciated; please add comments to this page to discuss.
- svn for office converter plugin: https://svn.xwiki.org/svnroot/xwiki/sandbox/xwiki-plugin-officeimporter
- svn for office import application: http://svn.xwiki.org/svnroot/xwiki/sandbox/xwiki-application-officeimporter
Build
This project uses maven2 as its project management tool. You can get the source code and type "mvn install" to build the plugin package.
But as it depends on some libraries which are not released yet, you need to build the dependencies if you want to try the latest version.
- Get the latest code from svn for the libraries below:
- xwiki-core
- xwiki-core-rendering
- xwiki-core-xml
- org.wikimodel.wem
- Patch them as described in these issues:
- Install the libraries above into your maven repository
- If you want to test the project with "mvn test" or "mvn install", you should start OpenOffice as a server first
- If you want to build it without tests, run "mvn install -Dmaven.test.skip=true"
POM File
Please see pom.xml
Reference Libraries
Libraries that Office Importer depends on.
Known issues
Support
For any questions or problems, please send an email to devs@xwiki.org (subscription required) or to me at daning106(at)gmail.com
Version 114.1 last modified by daning on 28/08/2008 at 18:14