How I extract data from MySQL dump files with mydp

author:Cimarron Taylor
version:0.1
date:January 16, 2009

From time to time I need to process the .sql files generated by mysqldump so I wrote a tool to help make this easy.

Overview

mydp is a program for processing mysqldump files in the manner of ‘awk’.

It reads mysqldump sql statements from standard input and executes Lua callback functions as table definitions and insert statements are encountered. The Lua callback functions are specified in a file passed to mydp via the command line, e.g:

% mydp callbacks.lua < dump.sql > output

The callbacks are typically used to filter and process only a small subset of the mysql dump data and/or save it for further processing.

mydp is written in Lex and Lua. It is released under the 2-clause FreeBSD licence.

Source

The source is kept in a Mercurial repository over at bitbucket which you can get via:

% hg clone http://bitbucket.org/cdtmydp/mydp

You can also browse the source at bitbucket here: http://bitbucket.org/cdtmydp/mydp/src

Requirements

Checking out mydp from bitbucket requires Mercurial

Building mydp requires make, gcc, flex and lua version 5.1

Re-building the mydp documentation from its reStructuredText sources requires Python and Sphinx

Lua callbacks

Mydp expects the lua file passed on the command line to define the following Lua functions:

  • create
  • column
  • insert
  • values
  • finish

create(tablename,position)

This function is called whenever a new CREATE TABLE statement is encountered in the dump file. The name of the table and the byte position in the stream are passed as arguments.

If create returns true, mydp will scan and process the data for the table. Otherwise it will skip forward until it exhausts its input or encounters the next CREATE TABLE statement.

column(columnname)

This function is called whenever a column definition within a CREATE TABLE is encountered. It is passed the name of the column as an argument.

if column returns true, mydp will scan and process the data for the column via the values() callback. Otherwise the column will be skipped.

insert(tablename)

This function is called when mydp encounters an INSERT INTO statement. It is useful when the processing involves initializing state between the calling of the column() callbacks and the values() callbacks.

values(...)

This function is called whenever mydp encounters a row of data. It is passed the values of only those columns for which the corresponding call to the column() callback returned true.

finish()

This function is called after all the data is processed.

Examples

Here are a few selected examples to show how the callbacks work.

src/example.lua

This is a trivial example which simply processes all tables and columns, printing the values on separate lines:

function finish()                                   -- cleanup any resources
end

function create(tablename,position)                 -- handle the start of a new 'CREATE TABLE'
    print( '-- processing ' .. tablename)
    return true
end

function column(columnname)                         -- handle the next column definition in the CREATE table
    return true                                     -- tell the lexer we want to process the column
end

function insert(tablename)                          -- handle the start of a new 'INSERT INTO'
end

function values(...)                                -- handle the next set of values
    for i=1, select('#', ...) do                    -- for each value passed
        local v = select(i,...)                     -- get the value
        print(v)                                    -- print the value
    end
    print()                                         -- print a blank line in between the records
end

src/example3.lua

In this example we split the mysql data into separate column files. This creates one directory for each table in the dump file with one column file holding the data for the corresponding column in the table:

table_insert = table.insert
TARGETDIR = 'work/example3'

cur_table      = nil                                    -- the current table we're processing
cur_columns    = nil                                    -- list of columns in the current table
cur_files      = nil                                    -- file handles

function finish()                                       -- cleanup any resources
    if cur_files then                                   -- \ if we have any files open from the previous
        for i,f in ipairs(cur_files) do                 --  \ create table statement then we close them
            assert(f:close())                           --   >
        end                                             --  /
        cur_files = nil                                 -- /
    end
end

function create(tablename,position)                     -- handle the start of a new 'CREATE TABLE'
    finish()                                            -- close files from previous table
    cur_columns = {}                                    -- reset column list
    cur_table   = tablename                             -- save new table name
    return true
end

function column(columnname)                             -- handle the next column definition in the CREATE table
    table_insert(cur_columns, columnname)               -- record the next column
    return true                                         -- tell the lexer we want to process the column
end

function insert(tablename)                              -- handle the start of a new 'INSERT INTO'
    if cur_files then return end                        -- nothing to do if we've already initialized files for this table
    cur_files = {}                                      -- otherwise this is the first call
    local tablename = cur_table                         -- \
    local tabledir = TARGETDIR .. '/' .. tablename      --  > create the directory for the table
    os.execute('mkdir -p '..tabledir)                   -- /
    for i,cname in ipairs(cur_columns) do
        local filename = tabledir .. '/' .. cname       -- \  for each column, open a new column file and
        local f = assert( io.open(filename, 'a+') )     --  > save the file handle so we can access it later
        cur_files[i] = f                                -- /
    end
end

function values(...)                                    -- handle the next set of values
    for i=1, select('#', ...) do                        -- for each value passed
        local v = select(i,...)                         -- get the value
        local f = cur_files[ i ]                        -- get the corresponding file
        if v then assert(f:write(v)) end                -- write the value if not NULL (mapped to nil by lexer)
        assert(f:write('\n'))                           -- write a newline to separate values
    end
end

Project Structure

mydp                          - project directory
  |-- Makefile                - build commands
  |-- doc                     - sphinx documentation
  |-- src                     - C and Lua sources
  |   |-- example.lua         - simple example which prints all tables/columns
  |   |-- example2.lua        - example which filters some values
  |   |-- example3.lua        - example which splits data into separate column files
  |   `-- mydp.l              - Lex source to mydp
  |-- test                    - sample test files and outputs
  `-- todo                    - known improvements

Additional directories are created by the Makefile directives:

mydp
  |-- bin                     - compiled executables
  `-- work                    - temporary files generated by tests