Dot Grid: Visualizing Possible Problems Installing/Importing Flat CEDICT File To MySQL

May 25, 2007 – 6:58 pm

I have started writing a PHP5 script for quick CEDICT dictionary file installation into a MySQL 5.x database.

The initial plan is to first create a script that installs a local file into MySQL on the same server. Later features will include automatically grabbing updates from a central web server (as a web service, of course) so that connected dictionary sites run updated versions of the CEDICT file. Well, that’s way down the road.

Today’s first development task is the same-old, same-old cornhusking ritual required of anyone mining data. From the data provided in the CEDICT dictionary file, I needed the Traditional Chinese, Simplified Chinese, Pinyin and each definition for the entry. So, the idea is to get a clean set of data where multiple-definition entries are broken up into multiple fields.

The problem? Well, input into the dictionary file is neither standardized nor filtered YET :) Thus, there is no guaranteed syntax. However, there is a pattern that works for the vast majority of the entries.

Chinese Chinese [pin yin] /def def/def def/def def/
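To make that pattern concrete, here is a minimal sketch of splitting one line into its fields. It is written in Python rather than the PHP5 of the actual script, and the regular expression, the `parse_entry` name and the returned field names are all assumptions for illustration, not the real installer code.

```python
import re

# Assumed line shape: Traditional Simplified [pin yin] /def/def/.../
# (hypothetical pattern; the real script may check entries differently)
ENTRY_RE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.+)/\s*$")


def parse_entry(line):
    """Return a dict of fields, or None if the line does not fit the pattern."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    traditional, simplified, pinyin, raw_defs = match.groups()
    # Break the slash-separated definitions into separate values,
    # dropping any empty pieces left by doubled slashes.
    definitions = [d.strip() for d in raw_defs.split("/") if d.strip()]
    if not definitions:
        return None
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin,
        "definitions": definitions,
    }


if __name__ == "__main__":
    print(parse_entry("中國 中国 [Zhong1 guo2] /China/Middle Kingdom/"))
```

A line that comes back as `None` is a candidate for manual checking; each item in `definitions` would go into its own field or row.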

Ok, what’s the problem? Well, 43000+ entries are a lot to check manually as they are being installed (of course, later on crowdsourcing and some nifty AJAX can be used to mitigate that issue, but that’s for another blog post).

So, how do I visualize the problem entries? I use a Dot Grid: each entry shows up as a mark in the grid, and an “x” marks the spot where I can immediately check that entry for a syntax error. You can even create some fancy-schmancy popup-mouseover script that gives you more information about what may be going wrong. This was a handy tool for debugging on this task.
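Here is a rough sketch of the dot grid idea in Python (again, not the PHP5 script itself). It prints one character per entry, "." when the line fits the expected pattern and "x" when it does not. The pattern, the function names and the grid width are assumptions for illustration; lines that are not entries at all would also show up as "x" in this version.

```python
import re

# Assumed line shape: Traditional Simplified [pin yin] /def/def/.../
ENTRY_RE = re.compile(r"^\S+\s+\S+\s+\[[^\]]*\]\s+/.+/\s*$")


def fits_pattern(line):
    """True when the line matches the assumed entry pattern."""
    return ENTRY_RE.match(line.strip()) is not None


def dot_grid(lines, width=60):
    """Render one mark per entry, wrapped into rows of `width` marks."""
    marks = ["." if fits_pattern(line) else "x" for line in lines]
    rows = ["".join(marks[i:i + width]) for i in range(0, len(marks), width)]
    return "\n".join(rows)


def problem_line_numbers(lines):
    """1-based positions of the entries that would be drawn as 'x'."""
    return [n for n, line in enumerate(lines, start=1) if not fits_pattern(line)]


if __name__ == "__main__":
    sample = [
        "中國 中国 [Zhong1 guo2] /China/Middle Kingdom/",
        "你好 你好 ni3 hao3 /hello/",  # missing brackets around the pinyin
        "謝謝 谢谢 [xie4 xie5] /thanks/thank you/",
    ]
    print(dot_grid(sample, width=2))
    print(problem_line_numbers(sample))
```

The list from `problem_line_numbers` is the kind of extra detail a mouseover could show for each "x".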

I wish I had thought of this when I was working with 2 million row installs.

[Image: dot grid shows errors during installation]

:)

About Primezero

Primezero Research and Innovation is an engineering and semantics workshop, specializing in product development and rapid prototyping since 1996.

Primezero develops online learning tools for math, science, and Mandarin Chinese teachers, as well as software for bloggers.

Major projects include the Arizona AIMS Mathematics Test Preparation Web Site for teachers and students, Primezero Chinese Tools 2008, pzphp (open source toolsets) on Google Code, the Chinese Seal Chop Widget for WordPress, and the Chinese Seal Chop Google Gadget.

For more projects, see the Primezero Portfolio.
