
Thursday, August 4, 2011

System.OutOfMemoryException Error

So we had a long-running query, run by an internal developer to process data. The query used a cursor and printed a bunch of output after every record was processed. After a point the query failed with a System.OutOfMemoryException.

Most likely the output accumulating in the SSMS results window was simply too much to handle.
Online solutions suggest putting the query in a file and then running the file using SQLCMD. Instead, we created a temporary SQL agent job on that database, dumped the SQL into the job, and sent the output to a text file. The query ran really fast and we had all the data we needed.
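Here is a minimal sketch of such a temporary agent job, with the database name, proc name and output path as placeholders of our own choosing; msdb's sp_add_jobstep writes everything the step prints to the file given in @output_file_name:

USE msdb;
GO
EXEC dbo.sp_add_job @job_name = N'Temp - long running extract';
EXEC dbo.sp_add_jobstep
    @job_name = N'Temp - long running extract',
    @step_name = N'Run query',
    @subsystem = N'TSQL',
    @database_name = N'SampleDB',            --placeholder database name
    @command = N'EXEC dbo.usp_ProcessData;', --placeholder for the original cursor query
    @output_file_name = N'C:\temp\ProcessData_output.txt'; --all step output lands here
EXEC dbo.sp_add_jobserver @job_name = N'Temp - long running extract';
GO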

Slow running query

We recently got a call from one of our developers saying they had a slow-running application on one of the databases. It was taking a lot longer than usual, and at this rate they were estimating nearly a day of processing compared to their usual couple of hours.

First, I ran sp_WhoIsActive (an awesome proc by Adam Machanic) and sp_who2 to see if there was any blocking or long-running query on the server. That was ruled out.
Second, I started Activity Monitor for that server in SSMS. It showed processor time and the recent expensive queries, and as expected our query was in there.
Third, I started a SQL Profiler trace to get a few parameters on the running query. The reads, writes and duration columns in the trace led us straight to the proc in the application that was running the slowest.
It also showed that this part of the application was being run very frequently, as the processing apparently required. Since the number of iterations * time taken was proving costly, we decided to tune that query.

Back again to the great sp_WhoIsActive with @get_plans = 1 (or @get_plans = 2), we were able to catch the exact execution plan for that proc. The plan was doing an index scan where it should have been doing a clustered index seek plus key lookup. The table had the necessary index, but the optimizer did not appear to be using it. We looked at the plans for the same proc in other databases and they showed an index seek and key lookup, which confirmed our suspicion: the database needed updated statistics to generate the correct plan.
We ran sp_updatestats on that database and the duration for that part of the query came down from half a minute to 10 milliseconds, taking it back to its original time. The processing finished within a couple of hours.
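For reference, a quick sketch of the two commands involved (the database name here is just a placeholder):

EXEC dbo.sp_WhoIsActive @get_plans = 1;  --adds a query_plan column for each running request

USE YourDatabase;  --the database with the stale statistics
GO
EXEC sp_updatestats;  --refreshes statistics on user tables whose data has changed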

Tuesday, August 2, 2011

So we have a corrupt database

When a developer started to run a query the other day he got error:
SQL Server detected a logical consistency-based I/O error: incorrect pageid (expected 6:16547936; actual 0:0). It occurred during a read of page (6:16547936) in database ID 15 at offset 0x00001f900c0000 in file '.....SampleDB_4_data.ndf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

So we checked database ID 15, which was SampleDB on that server, and ran:
DBCC CHECKDB ('SampleDB') WITH NO_INFOMSGS
go

output:
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 80719340, index ID 1, partition ID 72057594103136256, alloc unit ID 72057594106806272 (type In-row data). Page (6:16547927) is missing a reference from previous page (6:16547928). Possible chain linkage problem.
Msg 8928, Level 16, State 1, Line 1
Object ID 80719340, index ID 1, partition ID 72057594103136256, alloc unit ID 72057594106806272 (type In-row data): Page (6:16547928) could not be processed. See other errors for details.
.
.
.
Msg 8980, Level 16, State 1, Line 1
Table error: Object ID 80719340, index ID 1, partition ID 72057594103136256, alloc unit ID 72057594106806272 (type In-row data). Index node page (6:16585523), slot 15 refers to child page (6:16547950) and previous child (6:16547951), but they were not encountered.
Msg 8928, Level 16, State 1, Line 1
Object ID 80719340, index ID 1, partition ID 72057594103136256, alloc unit ID 72057594106806272 (type In-row data): Page (6:16547951) could not be processed. See other errors for details.

Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 80719340, index ID 1, partition ID 72057594103136256, alloc unit ID 72057594106806272 (type In-row data). Page (6:16547951) was not seen in the scan although its parent (6:16585523) and previous (6:16547952) refer to it. Check any previous errors.
CHECKDB found 0 allocation errors and 49 consistency errors in table 'SampleTable' (object ID 80719340).
CHECKDB found 0 allocation errors and 73 consistency errors in database 'SampleDB'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (SampleDB).


Since it suggested repair_allow_data_loss as the minimum repair level, we wanted to find out what data we were going to lose. We did not have a backup of the database, since it was not one of our production databases.
So we ran:
DBCC TRACEON (3604)
go
DBCC PAGE(SampleDB,6,16547952,2) --previous page
DBCC PAGE(SampleDB,6,16585523,2) --parent (index node) page
DBCC TRACEOFF(3604)
go
This puts out a lot of information, the most important piece being Metadata: IndexID. If this value is 2 or greater, the corruption is in a non-clustered index, which can simply be dropped and recreated without any loss of data.

If the value is 0 or 1, as it was in our case, the heap or the primary key/clustered index itself is corrupt. There is no easy way to drop and recreate that, and repairing it will definitely result in loss of data. We did not have a backup of the database, since this was an internal research database. Hence the only option left was DBCC CHECKDB('SampleDB', REPAIR_ALLOW_DATA_LOSS), which removes the corrupt pages and realigns the page linkages to point to the correct ones.

To do this, first take a full backup of the corrupt database, then put the database into single-user mode, and then run the REPAIR_ALLOW_DATA_LOSS command.
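A sketch of that sequence (the backup path is a placeholder; REPAIR_ALLOW_DATA_LOSS requires single-user mode):

BACKUP DATABASE SampleDB TO DISK = N'D:\Backups\SampleDB_PreRepair.bak';
ALTER DATABASE SampleDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB ('SampleDB', REPAIR_ALLOW_DATA_LOSS);
ALTER DATABASE SampleDB SET MULTI_USER;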

Once that was run, the corrupted pages were gone and the database was put back into multi-user mode for everyone to use.
The next day, yes you guessed it right, they wanted the lost records back. Thanks to my manager's foresight in taking a backup beforehand, we restored the corrupt database as SampleDBCorrupt.
The primary key was the obvious way to identify the missing records, but we could not query the corrupt copy by primary key, since the primary key's clustered index was the very thing that was corrupt.
So we picked the non-clustered index that contained the most primary key columns. That gave us records that were as close to unique as possible while still forcing the corrupt copy to read from a secondary index. The lost set of values was found by running an EXCEPT across the two databases. These were only partial values of the columns in the index, so we put them in a table and matched them up against an older data mart to get the full rows. Luckily our corrupt data was from 2004 and we were able to retrieve it without any loss. The retrieved records were then inserted into our repaired database, SampleDB.
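A sketch of the comparison we ran (table, column and index names below are placeholders); the index hint forces the corrupt copy to read from the surviving non-clustered index:

SELECT KeyCol1, KeyCol2, KeyCol3  --primary key columns covered by the non-clustered index
FROM SampleDBCorrupt.dbo.SampleTable WITH (INDEX (IX_SampleTable_Keys))
EXCEPT
SELECT KeyCol1, KeyCol2, KeyCol3
FROM SampleDB.dbo.SampleTable;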

Take a look at a good link I found online from SQL Server Pro on the details of DBCC.
Also, Paul Randal's blog is one of the best sources on DBCC topics.

Thursday, July 21, 2011

Redgate schema compare

We have a system of about 300+ databases which, because of the way the applications are designed, need to be identical. Our releases push objects to all the databases. But once in a while we have a production requirement or issue and have to roll out a hotfix to a particular database. You know how exciting DBA jobs get once in a while, and so we tend to forget to release those few objects to all the other databases.

Hence we wanted a database schema compare job that runs once a week or month.

Recently we bought the Redgate developer suite and loved its ability to integrate with SSMS.
Its database schema compare was also much faster than any other tool I have used previously, and it offered source control and database compare capabilities, which we had been looking for, for a while.

So, to automate the compare process, we decided to use the Redgate SQL Compare command-line features.
Here is a sample of which options were used or ignored and which switches were used to include and exclude objects:

1. echo Start Compare %date% - %time% >> C:\testlog.txt

2. cd "C:\Program Files\Red Gate\SQL Compare 9"
SqlCompare /S1:server1 /db1:baseDatabase /S2:server2 /db2:databaseToCompare /Exclude:Schema /Exclude:User /Exclude:Role /Options:Default,IgnoreComments,IgnoreConstraintNames,IgnorePermissions,IgnoreFillFactor,IgnoreWhiteSpace,IgnoreQuotedIdentifiersAndAnsiNullSettings,IgnoreWithNocheck >> C:\testlog.txt

3. echo End Compare %date% - %time% >> C:\testlog.txt

As you can see, the 1st and 3rd statements are used to get time estimates for the comparison. For us it took only about 40 minutes to go through 300+ databases, which is impressive for the amount of work this tool does in each database.

The 2nd statement is the actual command, run either in a Windows command prompt or from a bat file. It first changes directory to where SqlCompare.exe lives and then runs the compare between server1/database1 and server2/database2.

/Exclude is a switch that lets us exclude certain object types from the comparison, in our case user, role and schema. You can also exclude down to the level of an individual table name, which is a really cool aspect.
More information on switches can be found at:
Switches

/Options maps to the options tab you see in the SQL Compare GUI. You can use it to ignore permissions, whitespace, fill factors etc. during comparisons, which may not be that important during the first run.
More information on options is at:
Options

The way we designed it, I first ran SQL Compare across two identical databases, from the GUI and from the command line, using the same set of options each time, then added more options as needed. This way I could get exactly the same result from the command line as from the GUI.

We wanted this as a step-by-step process, so initially we compared only procedures, functions, views and tables, fixing them across all databases.
In the second iteration we compared only tables and constraint names, because we had hundreds of system-named default constraints, and we renamed them appropriately and consistently across all databases.

Needless to say, before making any of these object changes across all databases, we sent them to our software developers for a quick review and released them using the regular release process.

Saturday, July 9, 2011

Tempdb on separate drive on server

Case study:
How many times have we heard that tempdb should be on a fast cached drive? So to start with we had put the tempdb files on the fast drive along with the mdf files, while the ldf files for the databases were on a separate drive. This worked very well until SEs started writing queries to process data for internal use, growing the tempdb mdf, or until we had to load months of data into a database to resolve some issue, growing its mdf out of bounds and choking that drive.

So we decided to shift tempdb to its own fast cached read/write drive, T:. On average our tempdbs required only about 5 GB of space, but once in a while one shot up to 100 GB during some manual processing. What our operations team decided was to let the tempdbs of the various servers each sit on a T: drive and share the same common pool of space, that pool being about 100 GB. The logic was that when data was being processed in one database, the other databases would not necessarily be processing that amount of data at the same time. So even if one tempdb grew a lot, it could use 60-80 GB and shrink back when the processing was over.
That saved a lot of the space that would have been needed if each tempdb on each server had been kept separate and allocated 100 GB of its own. It also helped because tempdb being on a separate drive no longer caused a bottleneck on the mdf drives, so other production queries were not slowed down. This resulted in faster data retrieval for client applications.
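Moving tempdb itself is a couple of ALTER DATABASE statements (the logical file names below are the defaults, the paths are placeholders); the files move to the new drive the next time the instance is restarted:

ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'T:\tempdb\tempdb.mdf');
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'T:\tempdb\templog.ldf');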

Friday, July 8, 2011

Deleting data from heap table

We started running out of space on a few of our servers. Adding more space is an easy getaway, but everyone knows the perils of throwing more and more hardware at space issues. So, determined to use the available space, we decided to look for any particular tables that held a lot of data that could be deleted or archived. We found a heap table used in every database, in some as big as 200 GB, which held data from when the system had started, and new data was being added to it every day. After careful consideration we decided to keep one year of data and delete the rest, since we had flat files of the data over many years that could be reloaded if required.
The data was deleted in batches, since the table was in continuous use and we didn't want to lock it down for a long time. But as everyone knows, the major issue with deleting data from a heap is that the rows are deleted but the pages are not deallocated, unless you rebuild the table, have an index on it, or delete with a TABLOCK hint.
Creating an index, though an option, was not going to work for us, since it would take nearly as much space as the heap, and we did not have that much space on the server. Rebuilding also takes a lot of space, so with only a few GB remaining that was not an option either.

So for the time being we decided to delete 500 rows at a time with the TABLOCK option. With TABLOCK, the delete acquires a table lock and deallocates the pages as it deletes the data. We planned to run and monitor this so other queries would not wait on it, and to detect any blocking early on.
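A sketch of that batched delete (the table name and date column are placeholders):

WHILE 1 = 1
BEGIN
    DELETE TOP (500) FROM dbo.BigHeapTable WITH (TABLOCK)
    WHERE LoadDate < DATEADD(YEAR, -1, GETDATE());  --keep only the last year

    IF @@ROWCOUNT = 0 BREAK;  --nothing left to delete
END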

Once we got a lot of space back and the heap held only one year of data in all databases, we decided to create a new indexed table, have the queries use it, and drop the heap, as sketched below.
The other option considered was to make the new table compressed, so that the space required for a year's data would be considerably less.
We decided to drop the heap once the new table was in place and the data from the heap had been transferred to it.
Next, we put a job in place to run weekly or monthly and delete old data from the table, so that it stays manageable with only one year of data preserved.
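A sketch of the replacement table (names are placeholders; page compression assumes an edition that supports it):

CREATE TABLE dbo.BigTable_New
(
    RecordID INT NOT NULL,
    LoadDate DATETIME NOT NULL,
    Payload  VARCHAR(500) NULL
);

CREATE CLUSTERED INDEX CIX_BigTable_New_LoadDate
    ON dbo.BigTable_New (LoadDate)
    WITH (DATA_COMPRESSION = PAGE);

INSERT INTO dbo.BigTable_New (RecordID, LoadDate, Payload)
SELECT RecordID, LoadDate, Payload
FROM dbo.BigHeapTable
WHERE LoadDate >= DATEADD(YEAR, -1, GETDATE());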

These were a few options we used to sort out our space issues, besides deleting unused indexes, creating required indexes, and so on. If you know of any other options, please do share your thoughts.

Transactional Replication

The way our system is designed, we have been using transactional replication for quite a while now. It occasionally gets confused with snapshot replication, since transactional replication also creates a log reader job, a snapshot job and a distribution job.

The snapshot job is required only the very first time, when the articles in the database need to be published from the publisher to the subscriber. We found that this job is no longer required afterwards and is best left turned off. We realized this when we started seeing blocking in the middle of the day. Investigation revealed it was because of replication, yet replication ran at other times too and the blocking did not happen then. Looking closer, we found another job running at the same time as the snapshot generation and distribution; since the snapshot job locked the tables, the other job got blocked, causing blocking and deadlocks. We moved the other jobs to run 5 minutes after the replication job and disabled the snapshot job since it was not necessary.

As we experimented we realized that the log reader runs continuously, reading any new changes to the published articles from the transaction log, and the distribution agent, which can run either continuously or, say, every hour or once a day depending on requirements, picks up all the data and schema changes and distributes them to the other servers.
Our initial assumption that the distribution agent could distribute only data changes and not schema changes was wrong.
So we disabled the snapshot job.

The snapshot is only required when a new article (table) is added to the published set and needs to be created fresh at the subscriber. Since we have streamlined releases with client downtime, when we notice such a scenario (which is rare) we plan to run the snapshot job manually during the release, and so avoid the blocking the snapshot job causes.
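Disabling (rather than deleting) the snapshot agent job keeps it available for those manual runs; a sketch, with a placeholder job name (snapshot agent jobs show up under the REPL-Snapshot category in SQL Agent):

EXEC msdb.dbo.sp_update_job
    @job_name = N'PUBLISHER-SampleDB-SamplePublication-1',  --placeholder snapshot agent job name
    @enabled = 0;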

Saturday, June 25, 2011

Index Fill factor

Index fill factor is the percentage of each page you want filled, and therefore how much free space you want left on a page. A page is 8 KB; depending on the columns in the index, an index row can range from a few bytes upwards. Let's take 1000 bytes in our example.
The default fill factor is 100%, so roughly 8 such records fit on a page.

A fill factor of 80% would mean 8000 * 0.8 = 6400 bytes to fill, which allows about 6 records and leaves part of the page free.
When inserts or updates happen and there is free space on the page, there is no need for a page split or a new page to accommodate the new record. That is why fill factor matters: it reduces the number of page splits.
The fill factor is applied only when an index is created or rebuilt, since that is when SQL Server leaves this percentage of space free on every page.
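For example, rebuilding an index with an 80% fill factor (index and table names are placeholders) leaves roughly 20% of each leaf page free for future inserts and updates:

ALTER INDEX IX_SampleTable_Col1 ON dbo.SampleTable
REBUILD WITH (FILLFACTOR = 80);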

Outdated Statistics

If we get the actual execution plan for a query and the estimated versus actual row counts vary by a factor of 10 or more, it indicates the optimizer is working from outdated statistics, and is a sign that the stats need updating if the query is slow.

Reference: extract from SQL Server 2008 Query Performance Tuning Distilled - Grant Fritchey and Sajal Dam
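A quick way to see how stale the statistics are before updating them (the table name is a placeholder):

SELECT s.name AS stats_name,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('dbo.SampleTable');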

Friday, March 4, 2011

Full and differential backups with compression

For a long time we took backups of the databases on all our systems every night. That meant we backed up 12 TB of data every day and kept about two weeks' worth, since we deleted backups older than that. Imagine how much space that took. Getting the databases back from tape was another job in itself.

So recently at our company we set up a full backup once a week (with compression) and differential backups once a day. Since the databases were in simple recovery mode, point-in-time recovery was not possible, but we could always bring a database up to the latest differential backup and use the daily flat files of data to bring it up to date. With one compressed full backup per week, a 160 GB database went down to nearly 40 GB, so roughly a factor of four (not exactly). The total full backup went down to about 3 TB, and the differential backups came to hardly 70 GB for a week. Again, we deleted any full and differential backups more than two weeks old.

This greatly solved our space problem, since we now require about
3 TB * 2 (one full backup per week, two weeks retained) + 70 GB * 2 = roughly 6 TB to back up two weeks' worth of data,
with easy recovery, since the backups sit on the same box as the online databases.
It also means transferring the databases across our production and dev/QA environments is much faster.
That compares to the old requirement of:
12 TB * 14 days = 168 TB!
The thing we needed to be careful about was to use the COPY_ONLY option whenever we needed to copy current data from production, so as not to throw the backup chain out of sync by taking a manual full backup mid-day.
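An ad hoc copy of production then looks something like this (database name and path are placeholders); COPY_ONLY keeps it from resetting the differential base used by the scheduled jobs:

BACKUP DATABASE SampleDB
TO DISK = N'D:\Backups\SampleDB_adhoc.bak'
WITH COPY_ONLY, COMPRESSION;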


Full backup code:
Declare @vr_path VARCHAR(1000) = 'D:\SQLBackups\', --backup folder; this path is just a placeholder
@UseTimeInFileName Char(1) = 'N'
DECLARE @vr_dbname VARCHAR(100)
DECLARE @SQL VARCHAR(MAX)
DECLARE @BackupTime VARCHAR(50)

IF @UseTimeInFileName = 'Y'
BEGIN
SET @BackupTime = convert(varchar(8),getdate(),112) + '_' + replace(convert(varchar(8),getdate(),108),':','')
END
ELSE
BEGIN
SET @BackupTime = convert(varchar(8),getdate(),112)
END

SET @SQL = ''

SELECT @SQL = @SQL + 'BACKUP DATABASE [' + name + '] TO DISK = ' + Char(39) +
@vr_path + name
+ '_' + @BackupTime + '.bak'+ Char(39) + ' WITH INIT, COMPRESSION;'
FROM sys.databases
WHERE name not in ('tempdb')
AND state = 0 -- Online Databases Only
AND source_database_id IS NULL --Not a database snapshot
AND is_in_standby = 0

PRINT @SQL
EXEC (@SQL)

Differential backup code:
Declare @vr_path VARCHAR(1000) = 'D:\SQLBackups\', --backup folder; this path is just a placeholder
@UseTimeInFileName Char(1) = 'N'
DECLARE @vr_dbname VARCHAR(100)
DECLARE @SQL VARCHAR(MAX)
DECLARE @BackupTime VARCHAR(50)

IF @UseTimeInFileName = 'Y'
BEGIN
SET @BackupTime = convert(varchar(8),getdate(),112) + '_' + replace(convert(varchar(8),getdate(),108),':','')
END
ELSE
BEGIN
SET @BackupTime = convert(varchar(8),getdate(),112)
END

SET @SQL = ''

SELECT @SQL = @SQL + 'BACKUP DATABASE [' + name + '] TO DISK = ' + Char(39) +
@vr_path + name
+ '_' + @BackupTime + '.bak'+ Char(39) + ' WITH DIFFERENTIAL, COMPRESSION;'
FROM sys.databases
WHERE name not in ('tempdb', 'master') --master supports full backups only
AND state = 0 -- Online Databases Only
AND source_database_id IS NULL --Not a database snapshot
AND is_in_standby = 0

PRINT @SQL
EXEC (@SQL)

Thursday, February 24, 2011

Sql agent maintenance job run duration query

Run this across all registered servers, for the dates your servers run optimizations on.
You will get the total time the jobs took to run over the two weeks. Drop the values into Excel and plot a graph if required. Courtesy of Ken Simmons, who ran this in our production environment to show how reindexing only the fragmented indexes took a big chunk of time out of the optimization schedule. How he changed the optimization plan is a different blog post.

select
    STUFF(STUFF(RIGHT('00000000' + CAST(run_duration as varchar(10)),6),3,0,':'),6,0,':') AS run_duration,
    run_date
from msdb.dbo.sysjobs A
join msdb.dbo.sysjobhistory B ON A.job_id = B.job_id
WHERE A.name like '%opt%'
  and B.step_id = 2
  and (run_date = 20110212 or run_date = 20110205)

Wednesday, February 16, 2011

SSIS error running executable/batch file from network

So, I had this issue for some time now, where an SSIS package scheduled from a SQL agent job would hang at a particular step and not run at all. We figured it was due to an exe and a bat file in the SSIS package that needed a user prompt confirmation to go to the next step. I had looked online for an option to suppress the prompt inside SSIS but didn't find one.

Finally my manager pointed me to this solution:
"The publisher could not be verified" prompt running executable from network
http://www.annoyances.org/exec/forum/winvista/t1151260847

Problem:
Running XP SP2 or higher, you try to run an executable located on another machine
on your network. You're confronted with a prompt: "The publisher could not be verified".
You are forced to confirm that you wish to run this program... every time you run
it.

Solution:
Run gpedit.msc

Go to User Configuration >> Administrative Templates >> Windows Components >> Attachment
Manager

Add "*.exe;*.bat;" to the "Inclusion list for moderate risk file types" setting.



I configured this on the server that runs the package (as above),
and that fixed it! No more prompts.

Monday, February 14, 2011

SQL agent jobs (owner, last run, run status)

--Job owners: narrow down b.name in the LIKE filter to look for a particular login
select * from msdb.dbo.sysjobs a
inner join sys.server_principals b
on a.owner_sid = b.sid
where b.name like '%%'

SELECT j.[name],
MAX(CAST(
STUFF(STUFF(CAST(jh.run_date as varchar),7,0,'-'),5,0,'-') + ' ' +
STUFF(STUFF(REPLACE(STR(jh.run_time,6,0),' ','0'),5,0,':'),3,0,':') as datetime)) AS [LastRun],
CASE jh.run_status WHEN 0 THEN 'Failed'
WHEN 1 THEN 'Success'
WHEN 2 THEN 'Retry'
WHEN 3 THEN 'Canceled'
WHEN 4 THEN 'In progress'
END AS Status,b.name as jobowner
FROM msdb.dbo.sysjobs j
INNER JOIN msdb.dbo.sysjobhistory jh
ON jh.job_id = j.job_id AND jh.step_id = 0
inner join msdb.dbo.syscategories sc
on j.category_id = sc.category_id
inner join sys.server_principals b
on j.owner_sid=b.sid
--WHERE sc.[name] like '%DTS%'
GROUP BY j.[name], jh.run_status,b.name

Wednesday, February 9, 2011

Using the registered servers tab in SSMS

Register all your SQL servers under Registered Servers in SSMS, either by adding them manually or by importing a .regsrvr file:

1) In the Registered Servers pane, right-click Local Server Groups and choose Import -->

2) Select the exported file, which contains a list of already registered servers exported from someone else's SSMS. It is in .regsrvr format.



Once all servers are registered, a query can be run across all of them at once.

Very useful for things like:
Exec sp_who2
Exec sp_whoisactive
server configurations, etc. (one example below)
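A quick check of a couple of server configurations across the whole group, for instance (the two settings named are just ones we commonly look at):

SELECT @@SERVERNAME AS server_name, name, value_in_use
FROM sys.configurations
WHERE name IN ('max degree of parallelism', 'max server memory (MB)');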

Tuesday, January 25, 2011

Privileges for a non-admin to connect to SSIS

To configure rights for remote users on Windows Server 2003 or Windows XP

1. Very important

If the user/group is not a member of the local Administrators group, add the user to the Distributed COM Users group. You can do this in the Computer Management MMC snap-in accessed from the Administrative Tools menu.

2. Give the user/group Remote Launch and Remote Activation permissions, so that they can connect from their machines to the SSIS server.

Open Control Panel, double-click Administrative Tools, and then double-click Component Services to start the Component Services MMC snap-in.

3.
Expand the Component Services node in the left pane of the console. Expand the Computers node, expand My Computer, and then click the DCOM Config node.

4.
Select the DCOM Config node, and then select MsDtsServer in the list of applications that can be configured.

5.
Right-click on MsDtsServer and select Properties.

6.
In the MsDtsServer Properties dialog box, select the Security tab.

7.
Under Launch and Activation Permissions, select Customize, then click Edit to open the Launch Permission dialog box.

8.
In the Launch Permission dialog box, add or delete users, and assign the appropriate permissions to the appropriate users and groups. The available permissions are Local Launch, Remote Launch, Local Activation, and Remote Activation. The Launch rights grant or deny permission to start and stop the service; the Activation rights grant or deny permission to connect to the service.

9.
Click OK to close the dialog box.

10.
Under Access Permissions, repeat steps 7 and 8 to assign the appropriate permissions to the appropriate users and groups.

11.
Close the MMC snap-in.

12.
Restart the Integration Services service.

13. If you cannot add a package or migrate it to the server, try adding the user to the msdb database and granting read and write permissions (a sketch follows).
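A sketch of those msdb permissions (the login name is a placeholder; db_ssisltduser and db_ssisoperator are the built-in Integration Services roles in msdb):

USE msdb;
GO
CREATE USER [DOMAIN\ssis_user] FOR LOGIN [DOMAIN\ssis_user];  --placeholder login
EXEC sp_addrolemember N'db_ssisltduser', N'DOMAIN\ssis_user';
EXEC sp_addrolemember N'db_ssisoperator', N'DOMAIN\ssis_user';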

One other thing to keep in mind: replace "." with servername\instance in the MsDtsSrvr.ini.xml file on that server.