For inspiration
A few years back we migrated from Oracle 8 on a Unix box to "modern" MSSQL. Large tables were 8M - 20M rows. Everything was fixed format. Date formats were complete nonsense to MSSQL. And we knew some code tables could have "unexpected" codes. We had no influence on file formats.
We ended up doing the import for each of the large files via staging tables because the shear data volume meant we had to restrict data roundtrips between client app and database.
- Staging table #1: INITIAL LOAD
- ImportRowID IDENTITY column (PKEY)
- LineData VARCHAR(xxx) wide enough to hold each imported line.
- Staging table #2: IMPORT DATA
- ImportRowID
- One varchar column for each imported data field
- A status column to help later filtering/statistics
- Table #3: IMPORT MESSAGES
- NOTE: Potentially multiple message regarding same row
- Each row holds {msgID, ImportRowID, Severity, Message}
- Table #4: FINAL DATA
- ImportRowID
- Each data column in correct data type
- Rows in error have no final row, rows in warning have final row but user may want to ignore them
Import process
- Import full data file into table #1
Import to this table is trivial but takes long time due to data volume.
- Split data into separate columns
Basically one big INSERT … SELECT … - and therefore FAST!
- Validate data
This is the complex part.
- Code values need to checked
- Some codes need translation
- Dates etc. need format/validity check
- etc.
- Transform to correct data types
- Another big INSERT … SELECT … - and therefore FAST!
User starts the import - then waits because importing 8M - 15M rows takes time even if it is "fast". We let user filter/sort messages, seep through imported data - deciding whether some rows should be accepted despite warnings. User has to confirm import before app finally copied from "FINAL DATA" into the destination tables.
We did many validation "per column" for performance reasons like below:
INSERT INTO ImportMessage(ImportRowID, Severity, Message)
SELECT ImportRowID, 2, ('Unknown code ' + CodeXX)
FROM ImportData
WHERE CodeXX NOT IN (SELECT CodeValue FROM CodeTable WHERE CodeType = 'XX')
This validates millions of rows in fractions of time it takes for a client app to process data one row at a time.
HTH /Michael