Pages

Tuesday, August 2, 2011

So we have a corrupt database

When a developer started to run a query the other day he got error:
SQL Server detected a logical consistency-based I/O error: incorrect pageid (expected 6:16547936; actual 0:0). It occurred during a read of page (6:16547936) in database ID 15 at offset 0x00001f900c0000 in file '.....SampleDB_4_data.ndf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

So we checked the databaseID=15 which was SampleDB on the server and ran the
DBCC CHECKDB ('SampleDB') WITH NO_INFOMSGS
go

output:
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 80719340, index ID 1, partition ID 72057594103136256, alloc unit ID 72057594106806272 (type In-row data). Page (6:16547927) is missing a reference from previous page (6:16547928). Possible chain linkage problem.
Msg 8928, Level 16, State 1, Line 1
Object ID 80719340, index ID 1, partition ID 72057594103136256, alloc unit ID 72057594106806272 (type In-row data): Page (6:16547928) could not be processed. See other errors for details.
.
.
.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 80719340, index ID 1, partition ID 72057594103136256, alloc unit ID 72057594106806272 (type In-row data). Index node page (6:16585523), slot 15 refers to child page (6:16547950) and previous child (6:16547951), but they were not encountered.
Msg 8928, Level 16, State 1, Line 1
Object ID 80719340, index ID 1, partition ID 72057594103136256, alloc unit ID 72057594106806272 (type In-row data): Page (6:16547951) could not be processed. See other errors for details.

Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 80719340, index ID 1, partition ID 72057594103136256, alloc unit ID 72057594106806272 (type In-row data). Page (6:16547951) was not seen in the scan although its parent (6:16585523) and previous (6:16547952) refer to it. Check any previous errors.
CHECKDB found 0 allocation errors and 49 consistency errors in table 'SampleTable' (object ID 80719340).
CHECKDB found 0 allocation errors and 73 consistency errors in database 'SampleDB'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (SampleDB).


--Since it suggested running repair_allow_data_loss we wanted to find out what data we were going to loose. We did not have a backup of the database, since it was not one of our production databases.
So we ran
DBCC TRACEON (3604)
go
DBCC PAGE(SampleDB,6,16547952,2) --parent page
DBCC PAGE(SampleDB,6,16585523,2) --previous page
DBCC TRACEOFF(3604)
go
This puts out a lot of information, the most important being Metadata: IndexID. If this value is greater that 2, the non-clustered index on table is corrupt and needs to be dropped and recreated thus without any loss of data.

If the value is less than 2, which was in our case, it meant the primary key/clustered index was corrupt. There is not easy way to drop and recreate index and doing so will definitely result in loss of data. We did not have a backup of database, since this was internal reasearch database. Hence the other option left for DBCC CHECKDB('SampleDB', REPAIR_ALLOW_DATA_LOSS). Run repair data loss would remove the corrupt pages and re-allign the indexes to point to the correct ones.

For this first take a full backup of corrupt database, then bring the database in simple mode and then run the REPAIR_ALLOW_DATA_LOSS command.

Once that was run, the corrupted pages were got rid of and the database was put back in multiuser mode for everyone to use.
The next day, yes you guessed it right, they wanted the lost records back. Thanks to the foresight of my manager of taking a backup, we restored the corrupt database as SampleDBCorrupt.
The unique key was the primary key to get the records. But we were not able to query the corrupt database with Primary key, since the primary key clustered index was corrupt.
So we selected the non-clustered index which had maximum columns (from primary key). This was so that we could get as unique a record as possible while still forcing the corrupt database to use a secondary index. Those lost set of values were acquired running an except command across the 2 databases. These were only partial values of the columns in the index. We were able to put them in table and match it up with an older datamart to get all values. Luckily our corrupt data was from year 2004 and we were easily able to retrieve it without any loss. The retrieved records were then inserted into our repaired db SampleDB.

Take a look at a good link I found online from sql server pro on details of dbcc:
Also, Paul Randall's blogs are one of the best source on DBCC topics.

No comments:

Post a Comment