struggling with MediaWiki: help import
I have set up a new instance of MediaWiki, and since other people will use it, I wanted to import the help pages from the French Wikipedia (because by default, the help namespace is empty).
So, I go to the export page on Wikipedia; there’s an input field to add all pages from a given namespace. It doesn’t work, even though I tried different methods. I look around in the documentation, and the official way of getting the list of pages from a namespace is to use a special page (Special:AllPages) and to process the resulting HTML file through… Word. Yes, the official way of exporting a list of pages from a MediaWiki instance requires Microsoft Word. Well, there’s also a small Python script that does some regexp magic to extract this list.
All is well, then. I proceed and get my list of pages, go to my local import page and there we go. There are a lot of red links (i.e., links with no page behind them), but since most of them are project links, it doesn’t bother me. But there are also a few red help links. And that does bother me.
So, today, I check, and when looking at some of the missing pages I can see that most of those missing help pages have names containing accented characters. Doh! The regexp used by the Python helper only matches ASCII characters. I tried a few things (yeah, I know about re.U, but no, it doesn’t work as expected), and finally I get about 70 more pages in my list.
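For the record, here is a minimal sketch of how such an extraction can avoid the ASCII-only trap. It is not the helper script mentioned above; the `extract_titles` function, the `TITLE_RE` pattern and the sample HTML are all hypothetical, and it assumes each link in the Special:AllPages output carries a `title="Namespace:Page"` attribute. The idea is to match "anything but a closing quote" instead of a class of word characters, so accented names pass through untouched (Python 3, where `str` patterns are Unicode by default).

```python
import html
import re

# Hypothetical pattern: assumes each link carries a title="Namespace:Page" attribute.
# [^"]+ matches any character except a double quote, accented letters included.
TITLE_RE = re.compile(r'<a\s[^>]*\btitle="([^"]+)"')

def extract_titles(page_html, namespace="Aide"):
    """Return page titles in the given namespace, in order, without duplicates."""
    prefix = namespace + ":"
    seen = []
    for raw in TITLE_RE.findall(page_html):
        title = html.unescape(raw)  # turn entities such as &amp; back into characters
        if title.startswith(prefix) and title not in seen:
            seen.append(title)
    return seen

# Made-up sample, only to exercise the function.
sample = (
    '<a href="/wiki/Aide:Sommaire" title="Aide:Sommaire">Sommaire</a>'
    '<a href="/wiki/Aide:Cat%C3%A9gorie" title="Aide:Catégorie">Catégorie</a>'
    '<a href="/wiki/Projet:Accueil" title="Projet:Accueil">Accueil</a>'
)
print(extract_titles(sample))
```

The resulting list, one title per line, is what gets pasted into the export form.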
So, again, I import my dump. Hum, XML failure. Look at the file. Looks OK. Try again. Edit the PHP upload limit on the server (the file was large but not yet reaching the limit, but just in case…). Try again. No luck. Think about it. Rename the file from Wikip%C3%A9dia-20080114100132.xml to Aide2.xml. Try again. Bingo!
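I never found out why the rename helped; my guess (and it is only a guess) is that the percent-encoded name was the trigger. Under that assumption, a small sketch for turning a downloaded dump name into a plain ASCII one before uploading could look like this. The `ascii_safe_name` function is hypothetical and only uses the Python standard library.

```python
import re
import unicodedata
from urllib.parse import unquote

def ascii_safe_name(filename):
    """Return a plain-ASCII version of a possibly percent-encoded file name."""
    decoded = unquote(filename)  # "Wikip%C3%A9dia" becomes "Wikipédia"
    # Split accented letters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", decoded)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace anything that is not a letter, digit, dot, underscore or hyphen.
    return re.sub(r"[^A-Za-z0-9._-]", "_", ascii_only)

print(ascii_safe_name("Wikip%C3%A9dia-20080114100132.xml"))
```

Renaming by hand to something like Aide2.xml works just as well, of course.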
sigh.