A Perl script to convert a text file to Multiterm XML
Added: June 12, 2008, last updated: July 17, 2008
I've often read in forums and mailing lists about people having problems with importing a glossary into Multiterm. Most of the time you need a bilingual glossary from which you can insert terms into Word, TagEditor or SDL Edit.
A few years ago I had the same problem, and that was when I developed a Perl script to convert a tab-delimited text file into an XML file compatible with Multiterm's bilingual glossary template. Unfortunately, this script ran only inside NoteTab Pro, which can execute external scripts and grab their output.
I have since learned how to handle encoding conversions in Perl and made a standalone version of the script, which can be downloaded here. There are actually two scripts: one for English – Polish glossaries and one for Polish – English glossaries.
Before you use the script, you will have to edit it once to adapt it to your language pair. The script may be made smarter in the future, but for now it does its job well, once you edit the language settings.
When you open the script with a plain text editor (one that can edit/save UTF-8 encoded files), disable wrapping of long lines and go to line 31 which looks like this:
<language type="English (United States)" lang="EN-US"/>
Change English (United States) and EN-US to the correct settings for your source language.
Go to line 45 of the script and repeat this step for the target language.
You can check the correct language names and codes in step 3 of 5 of Multiterm's Termbase Creation Wizard.
Make sure that you do not delete the quotation marks around the language name and language code, nor the forward slash at the end of the tag.
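To illustrate what the edited line should look like, here is a small Python sketch (separate from the Perl script itself) that builds the language line from a language name and a language code. The function name `language_tag` is made up for this example, and the German values are only placeholders; take the real name and code for your language from the Termbase Creation Wizard.

```python
from xml.sax.saxutils import quoteattr


def language_tag(name, code):
    """Build a <language .../> line in the same shape as line 31 of the script."""
    # quoteattr() adds the surrounding quotation marks and escapes &, < and >,
    # so the quotes and the closing slash cannot be lost by accident.
    return f"<language type={quoteattr(name)} lang={quoteattr(code)}/>"


# The line as it appears in the downloaded script:
print(language_tag("English (United States)", "EN-US"))
# Placeholder values for another language; verify them in the wizard:
print(language_tag("German (Germany)", "DE-DE"))
```

The first call reproduces the line shown above, so you can compare your hand-edited line against the output character by character.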
Save the script. You can use the Save As command and rename the script from MTENUSPL.pl to something that represents your source and target language. Keep the .pl extension. It stands for “Perl”, not “Polish”. ;-)
To run the script, you need:
- Perl. You can get Perl for Windows as a free download from www.activestate.com
- A tab-delimited file with your glossary, saved in the UTF-8 encoding. You can save a tab-delimited file with this encoding in Word, OpenOffice Writer, UltraEdit, NoteTab Pro, PSPad etc. I can't make the script guess the encoding of the source file, so it must be in UTF-8. Also, if your tab-delimited file contains any of these characters: <, >, &, replace them as follows:
- replace < with &lt;
- replace > with &gt;
- replace & with &amp;
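If you leave these characters unchanged, they will break the import into Multiterm. If you do the replacements by hand, replace & first; otherwise the ampersands in the entities you have just inserted will be replaced as well.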
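If you would rather not do these replacements by hand, they can be scripted. The following is a minimal Python sketch, not part of the Perl script; the function name `escape_for_multiterm` is made up for this example. It handles the ampersand first so that the entities it inserts are not escaped a second time.

```python
def escape_for_multiterm(text):
    """Replace the three characters that break the import with XML entities."""
    # The ampersand must be handled first: the other two replacements
    # introduce new ampersands that must not be escaped again.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


print(escape_for_multiterm("R&D <draft>"))  # prints: R&amp;D &lt;draft&gt;
```

Python's standard library offers the same behaviour through `xml.sax.saxutils.escape`, which could be used instead of the hand-written function.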
To run the script, copy it to the folder that contains your tab-delimited file.
Open the command line (Windows key+R, type cmd, press Enter). Change to the directory that contains the script and the tab-delimited file.
Type the following command:
perl MTENUSPL.pl sourcefile.txt
and press Enter. If the source file is “well formed”, the script will process it and create a sourcefile.txt.xml file. This file can be imported into a Multiterm termbase based on the bilingual glossary template.
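If the script fails, it may help to check the source file first. The Python sketch below is one possible pre-flight check, under my own assumptions about what “well formed” means here: the file decodes as UTF-8 and every non-empty line consists of exactly two non-empty fields separated by one tab. The function name `check_glossary` is hypothetical, and the Perl script itself may be stricter or more lenient than this.

```python
import sys


def check_glossary(path):
    """Return a list of problems found in a tab-delimited glossary file."""
    try:
        with open(path, encoding="utf-8") as handle:
            lines = handle.read().splitlines()
    except UnicodeDecodeError as error:
        # The file is in some other encoding, e.g. a Windows code page.
        return [f"not valid UTF-8: {error}"]

    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are harmless
        fields = line.split("\t")
        # Assumption: one source term, one tab, one target term per line.
        if len(fields) != 2 or not all(field.strip() for field in fields):
            problems.append(f"line {number}: expected source<TAB>target")
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    for problem in check_glossary(sys.argv[1]):
        print(problem)
```

Saved under a name of your choice, for example check_glossary.py, it would be run as `python check_glossary.py sourcefile.txt`; no output means that no problems were found.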
You can download the two scripts in a zipped file here. For feedback about the scripts please use this form, or contact me through the cat_conv yahoogroup.
When you have successfully created your XML glossary file, you can import it into Multiterm. This short tutorial explains how to do it.
December 24, 2008:
Added the Multiterm Termbase Setup and Import Tutorial.
July 17, 2008:
- The name of the currently logged-on user is now inserted; previously “Piotr” was hardcoded;
- The current date and time are now inserted instead of the hardcoded date used in the earlier version;
- Fixed an error in the numbering of “concept” elements: all of them were numbered 1, and they are now numbered sequentially.