So we had a long-running query, run by an internal developer to process data. The query used a cursor and printed output after every record it processed. After a while the query failed with the following:
Probably the data output in the SSMS results window was too much to handle.
Online solutions suggest putting the query in a file and then running the file using SQLCMD. Instead, we created a temporary SQL Agent job, dumped the SQL into the job step, and directed its output to a text file. The query ran fast and we had all the data we needed.
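A minimal sketch of that workaround (the job, proc, database, and file names here are placeholders, not the actual ones we used):

USE msdb;
go
-- Create a one-off job whose step runs the long query and writes its output to a file.
EXEC dbo.sp_add_job @job_name = N'Temp_LongQuery';
EXEC dbo.sp_add_jobstep
    @job_name         = N'Temp_LongQuery',
    @step_name        = N'Run long query',
    @subsystem        = N'TSQL',
    @database_name    = N'YourDatabase',                  -- placeholder
    @command          = N'EXEC dbo.usp_LongRunningProc;', -- placeholder for the cursor query
    @output_file_name = N'C:\Temp\LongQueryOutput.txt';   -- PRINT/SELECT output lands here
EXEC dbo.sp_add_jobserver @job_name = N'Temp_LongQuery';  -- register the job on the local server
EXEC dbo.sp_start_job     @job_name = N'Temp_LongQuery';
go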
Every day brings something new. This is an attempt to blog things learnt day to day while solving issues in our environment, from DBAs, colleagues, and the great collection of articles in the field. Please share your comments, alternative solutions and suggestions.
Thursday, August 4, 2011
Slow running query
We recently got a call from one of our developers about a slow-running application on one of the databases. It was taking a lot longer than usual, and at this rate they were estimating nearly a day of processing compared to their usual couple of hours.
First, I ran sp_whoisactive (an awesome proc by Adam Machanic) and sp_who2 to see if there was any blocking or a long-running query on the server; that was quickly ruled out.
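For reference, those first checks look roughly like this (sp_whoisactive has to be installed separately; sp_who2 ships with SQL Server):

EXEC dbo.sp_WhoIsActive;   -- active requests, waits, and any blocking chains
EXEC sp_who2;              -- classic overview of all sessions on the server
go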
Second, I started Activity Monitor for that server in SSMS, which shows processor time and the recent expensive queries. As expected, our query was in there.
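Activity Monitor's "Recent Expensive Queries" pane is essentially built on the plan-cache DMVs; a rough equivalent you can run yourself (a sketch, ordered by total CPU):

SELECT TOP (10)
    qs.execution_count,
    qs.total_worker_time   / qs.execution_count AS avg_cpu_microseconds,
    qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
    qs.total_elapsed_time  / qs.execution_count AS avg_duration_microseconds,
    st.text AS batch_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
go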
Third, I started SQL Profiler to trace the running queries and capture a few parameters. The Reads, Writes, and Duration columns in the trace easily led us to the proc in the application that was running the slowest.
It also showed that this part of the application was being executed very frequently, as its processing apparently required. Since the number of iterations multiplied by the time per execution was proving costly, we decided to tune that query.
Going back to the great sp_whoisactive, this time with @get_plans = 1 (or @get_plans = 2), we were able to capture the exact execution plan for that proc. The plan was doing an index scan where it should have been doing an index seek with a key lookup. The necessary index existed, but the optimizer did not appear to be using it. We looked at the plans for the same query in other databases and they showed an index seek plus key lookup, which confirmed our suspicion: the database needed updated statistics to generate the correct plan.
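The plan-capture step was roughly this (a sketch):

EXEC dbo.sp_WhoIsActive @get_plans = 1;  -- plan for the currently executing statement
-- @get_plans = 2 returns the plan for the whole batch instead
go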
We ran sp_updatestats on that database, and the duration for that part of the query came down from half a minute to about 10 milliseconds, back to its original time. The overall processing finished within a couple of hours.
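The fix itself was a one-liner, run in the context of the affected database (the database name below is a placeholder):

USE YourDatabase;
go
EXEC sp_updatestats;  -- refresh out-of-date statistics on all tables in the database
go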
Tuesday, August 2, 2011
So we have a corrupt database
When a developer ran a query the other day, he got this error:
SQL Server detected a logical consistency-based I/O error: incorrect pageid (expected 6:16547936; actual 0:0). It occurred during a read of page (6:16547936) in database ID 15 at offset 0x00001f900c0000 in file '.....SampleDB_4_data.ndf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
So we checked which database had ID 15; it was SampleDB on that server.
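The database ID from the error message can be resolved with a quick one-liner (a sketch):

SELECT DB_NAME(15) AS database_name;  -- returns the name of database ID 15
go

With the database identified, we ran a full consistency check: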
DBCC CHECKDB ('SampleDB') WITH NO_INFOMSGS
go
output:
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 80719340, index ID 1, partition ID 72057594103136256, alloc unit ID 72057594106806272 (type In-row data). Page (6:16547927) is missing a reference from previous page (6:16547928). Possible chain linkage problem.
Msg 8928, Level 16, State 1, Line 1
Object ID 80719340, index ID 1, partition ID 72057594103136256, alloc unit ID 72057594106806272 (type In-row data): Page (6:16547928) could not be processed. See other errors for details.
.
.
.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 80719340, index ID 1, partition ID 72057594103136256, alloc unit ID 72057594106806272 (type In-row data). Index node page (6:16585523), slot 15 refers to child page (6:16547950) and previous child (6:16547951), but they were not encountered.
Msg 8928, Level 16, State 1, Line 1
Object ID 80719340, index ID 1, partition ID 72057594103136256, alloc unit ID 72057594106806272 (type In-row data): Page (6:16547951) could not be processed. See other errors for details.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 80719340, index ID 1, partition ID 72057594103136256, alloc unit ID 72057594106806272 (type In-row data). Page (6:16547951) was not seen in the scan although its parent (6:16585523) and previous (6:16547952) refer to it. Check any previous errors.
CHECKDB found 0 allocation errors and 49 consistency errors in table 'SampleTable' (object ID 80719340).
CHECKDB found 0 allocation errors and 73 consistency errors in database 'SampleDB'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (SampleDB).
Since it suggested running REPAIR_ALLOW_DATA_LOSS, we wanted to find out what data we were going to lose. We did not have a backup of the database, since it was not one of our production databases.
So we ran
DBCC TRACEON (3604)
go
DBCC PAGE (SampleDB, 6, 16547952, 2) -- previous page (6:16547952)
DBCC PAGE (SampleDB, 6, 16585523, 2) -- parent page (6:16585523)
DBCC TRACEOFF (3604)
go
This puts out a lot of information, the most important piece being Metadata: IndexId. If this value is 2 or greater, a non-clustered index on the table is corrupt and can simply be dropped and recreated, without any loss of data.
If the value is 0 or 1, as it was in our case, it means the primary key/clustered index itself is corrupt. There is no easy way to drop and recreate that, and repairing it will very likely lose data. We did not have a backup of the database, since it was an internal research database, so the only option left was DBCC CHECKDB ('SampleDB', REPAIR_ALLOW_DATA_LOSS). Repairing with data loss removes the corrupt pages and realigns the page linkage so the remaining pages point to each other correctly.
For this, first take a full backup of the corrupt database, then put the database in single-user mode, and then run the REPAIR_ALLOW_DATA_LOSS command.
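The sequence looks roughly like this (a sketch; the backup path is a placeholder):

BACKUP DATABASE SampleDB TO DISK = N'C:\Backup\SampleDB_corrupt.bak';
go
ALTER DATABASE SampleDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB ('SampleDB', REPAIR_ALLOW_DATA_LOSS);
ALTER DATABASE SampleDB SET MULTI_USER;
go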
Once that was run, the corrupted pages were gone and the database was put back in multi-user mode for everyone to use.
The next day, yes you guessed it, they wanted the lost records back. Thanks to my manager's foresight in taking a backup, we restored the corrupt database as SampleDBCorrupt.
The primary key was the unique key we needed to identify the missing records, but we could not query the corrupt copy on the primary key, since its clustered index was the corrupt one.
So we picked the non-clustered index that covered the most of the primary key columns. This let us identify records as uniquely as possible while still forcing the corrupt copy to use a secondary index. The lost key values were found by running an EXCEPT between the two databases. These were only partial values of the columns in that index, so we put them in a table and matched them against an older data mart to recover the full rows. Luckily the corrupt data was from 2004, and we were able to retrieve all of it without any loss. The retrieved records were then inserted back into our repaired database, SampleDB.
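The comparison was along these lines (a rough sketch; table, index, and column names are placeholders):

-- Key values present in the restored corrupt copy but missing from the repaired database.
SELECT KeyCol1, KeyCol2
FROM SampleDBCorrupt.dbo.SampleTable WITH (INDEX (IX_SampleTable_Keys)) -- avoid the corrupt clustered index
EXCEPT
SELECT KeyCol1, KeyCol2
FROM SampleDB.dbo.SampleTable;
go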
Take a look at a good article I found online from SQL Server Pro on the details of DBCC:
Also, Paul Randal's blog is one of the best sources on DBCC topics.