
Wednesday, March 12, 2014

All about R

SIMPLE Variable assignment and use of FUNCTIONS
a <- 4
a/2 = 2
a*2 = 8
b <- 1
a+b = 5, a-b = 3, b-a = -3
sin(a), cos(b)
a == b returns FALSE (or F)
a > b returns TRUE (or T)

VECTOR
NULL in SQL = NA in R
sum(a, na.rm=TRUE) will add up all the values, skipping any NAs
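A quick sketch, with made-up values, of how NA behaves in a sum:

v <- c(1, 2, NA, 4)
sum(v)               # NA -- any NA in the vector makes the whole sum NA
sum(v, na.rm=TRUE)   # 7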
x <- c(1, 2, 3)                                 # example values; the originals were lost in formatting
names(x) <- c("School", "College", "Masters")   # labels partially recovered from the garbled original
plot(x)
y <- 1:4
print(y)   # [1] 1 2 3 4

MATRIX
create matrix
matrix(1, 3, 4) corresponds to matrix(value for every cell, rows, columns)
A <- matrix(1, 3, 4)   # example call; the original values were lost in formatting
A[2, 2] <- 0           # vary one cell so the contour plot has something to draw
contour(A)             # creates a contour graph of the matrix so it is easily readable

3D perspective plot:
persp(A)

3D perspective with less expansion
persp(A, expand=0.2)

R includes some sample data sets to play around with. One of these is volcano, a 3D map of a dormant New Zealand volcano.

contour(volcano)
persp(volcano, expand=0.2)
image(volcano) ------the image function creates a heat map

SUMMARY STATISTICS
http://www.ltcconline.net/greenl/courses/201/descstat/mean.htm


  • Average value (mean) = sum of all the values / n
  • Most frequently occurring value (mode)
  • Variance and standard deviation measure, on average, how much each measurement deviates from the mean. Step by step (a worked R sketch follows this list):
    Calculate the mean, x̄.
    Write a table that subtracts the mean from each observed value.
    Square each of the differences.
    Add this column.
    Divide by n - 1, where n is the number of items in the sample. This is the variance.
    To get the standard deviation, take the square root of the variance.
    Finally, most values should lie between mean - sd and mean + sd; values far outside that band are candidate outliers.
  • Span of values over which your data set occurs (range), and
  • Median = the middle value when the values are ordered ascending or descending (the average of the two middle values when the count is even). This gives a better, more robust idea of the average than the mean, since it is not pulled around by outliers.
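A small worked sketch in R tracing the same steps (x2 is a made-up sample; x is already in use above):

x2 <- c(2, 4, 4, 4, 5)
m <- mean(x2)                        # 3.8
devs <- x2 - m                       # subtract the mean from each value
v <- sum(devs^2) / (length(x2) - 1)  # sample variance: 1.2
sqrt(v)                              # 1.095445
sd(x2)                               # same value; sd() performs exactly these steps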

mean(volcano)
barplot(x)
> limbs <- c(4, 3, 4, 3, 2, 4, 4)   # values reconstructed to match the statistics below
> mean(limbs)
[1] 3.428571
> names(limbs) <- c("One", "Two", "Three", "Four", "Five", "Six", "Seven")
> barplot(limbs)
> abline(h = mean(limbs))   # draw a horizontal line at the mean
> median(limbs)
[1] 4
> sd(limbs)
[1] 0.7867958
> abline(h = mean(limbs) + sd(limbs), lty = "dotted", col = "red")
> abline(h = mean(limbs) - sd(limbs), lty = "dotted", col = "red")

Factors

Data Frames

> types <- c("gold", "silver", "gems", "gold", "gems")
> weights <- c(300, 200, 100, 250, 150)
> prices <- c(9000, 5000, 12000, 7500, 18000)

> treasure <- data.frame(weights, prices, types)
> print(treasure) 
 
    weights prices  types
1     300   9000   gold
2     200   5000 silver
3     100  12000   gems
4     250   7500   gold
5     150  18000   gems

treasure[[2]], treasure[["prices"]] and treasure$prices are three equivalent ways to get the prices column:
[1]  9000  5000 12000  7500 18000 

Read files

read.csv("C:\\Program Files\\R\\targets.csv") 
read.table("C:\\Program Files\\R\\Infantry.txt",sep="\t")
read.table("C:\\Program Files\\R\\Infantry.txt",sep="\t",header=TRUE)
 
 
 plot(countries$GDP,countries$Piracy)
 
 cor.test(countries$GDP, countries$Piracy)
Pearson's product-moment correlation

data:  countries$GDP and countries$Piracy 
t = -14.8371, df = 107, p-value < 2.2e-16
Conventionally, any correlation with a p-value less than 0.05 is considered statistically significant, and this sample data's p-value is definitely below that threshold. In other words, yes, these data do show a statistically significant negative correlation between GDP and software piracy.
 
 If we know a country's GDP, can we use that to estimate its piracy rate?
We can, if we calculate the linear model that best represents all our data points (with a certain degree of error). The lm function takes a model formula, which is represented by a response variable (piracy rate), a tilde character (~), and a predictor variable (GDP). (Note that the response variable comes first.)
Try calculating the linear model for piracy rate by GDP, and assign it to the line variable:
line <- lm(countries$Piracy ~ countries$GDP)
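To see the fit, draw the fitted line over the scatter plot (same countries data as above):

plot(countries$GDP, countries$Piracy)
abline(line)   # overlay the fitted regression line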
 
Other statistical packages can be installed into R, for example ggplot2:
install.packages("ggplot2") 
 

Tuesday, March 11, 2014

Hortonworks Hadoop

Sharing an awesome post on starting and setting up Hortonworks to learn Hadoop:
http://codegumbo.com/index.php/2014/02/24/first-few-bites-of-the-elephant-working-with-hortonworks-hadoop/
I just set everything up on my Windows 8 Core i7 HP laptop. The one extra step I had to do was boot into the BIOS and turn the virtualization bit on; after that, everything ran like magic.


Tuesday, April 23, 2013

Fix Untrusted Foreign keys

Going off of Brent Ozar's note on fixing untrusted foreign keys here:
http://www.brentozar.com/blitz/foreign-key-trusted/
He explains how the optimizer cannot use untrusted FKs when generating a query plan, which can lead to suboptimal plans being selected and hence slowness.
We decided to work on the untrusted foreign key constraints in our system to get whatever performance improvement there was to gain. With a system of hundreds of dbs, below is the query I came up with for finding untrusted FKs along with the count of records that violate each FK constraint (if any, i.e. values in the child table that are not in the parent table). Those records need to be fixed before the untrusted FKs themselves can be fixed.

This was run on all dbs at once using registered servers, and the results were copied to Excel.


CREATE TABLE #temp
    (
      servername NVARCHAR(500) ,
      dbname NVARCHAR(500) ,
      parenttable NVARCHAR(500) ,
      RefTable NVARCHAR(500) ,
      FKName SYSNAME ,
      keyname NVARCHAR(1000) ,
      PTForeignKeyColumn NVARCHAR(500) ,
      RTForeignKeyColumn NVARCHAR(500) ,
      cntStmt NVARCHAR(MAX) ,
      cntnotexistsRecords INT NULL
    )

--select top 1 * from sys.foreign_key_columns
EXEC sp_msforeachdb N' USE [?]
INSERT INTO #temp(servername,dbname,parenttable,RefTable,FKName,keyname,PTForeignKeyColumn,RTForeignKeyColumn,cntStmt,cntnotexistsRecords)
SELECT DISTINCT @@servername AS servername, DB_NAME() AS dbname,OBJECT_NAME(i.parent_object_id) AS parenttable,OBJECT_NAME(i.referenced_object_id) AS RefTable,i.name AS FKName,
''ALTER TABLE ['' + s.name + ''].['' + o.name + ''] WITH CHECK CHECK CONSTRAINT ['' + i.name + '']'' AS keyname,
(SELECT
    c.name + '',''
FROM
    sys.columns c where fkc.parent_object_id = c.object_id AND c.column_id IN (select fkc2.parent_column_id FROM sys.foreign_key_columns fkc2 where i.parent_object_id = fkc2.parent_object_id AND i.referenced_object_id=fkc2.referenced_object_id and i.object_id=fkc2.constraint_object_id)
                                 FOR XML PATH('''')) AS PTForeignKeyColumn,
(SELECT
    c.name + '',''
FROM
    sys.columns c where fkc.referenced_object_id = c.object_id AND c.column_id IN (select fkc2.referenced_column_id FROM sys.foreign_key_columns fkc2 where i.parent_object_id = fkc2.parent_object_id AND i.referenced_object_id=fkc2.referenced_object_id)
                                 FOR XML PATH('''')  ) AS RTForeignKeyColumn,
'''' AS cntStmt,NULL AS cntnotexistsRecords
from sys.foreign_keys i
INNER JOIN sys.objects o ON i.parent_object_id = o.object_id
INNER JOIN sys.schemas s ON o.schema_id = s.schema_id
left JOIN sys.foreign_key_columns fkc ON i.parent_object_id = fkc.parent_object_id AND i.referenced_object_id=fkc.referenced_object_id
 WHERE i.is_not_trusted = 1 AND i.is_not_for_replication = 0
and i.is_disabled = 0
AND DB_NAME() not in (''master'',''msdb'',''tempdb'',''dfwr_th'',''model'',''mmshare'',''distribution'')
;'


/** Create untrusted-mismatched records count statement **/
UPDATE  t
SET     cntStmt = 'Select @cnt_OUT = COUNT(DISTINCT '
        + REPLACE(t.PTForeignKeyColumn, ',', '') + ') FROM  [' + t.dbname
        + '].[dbo].[' + t.parenttable + '] WHERE '
        + REPLACE(t.PTForeignKeyColumn, ',', '') + ' NOT IN (SELECT '
        + REPLACE(t.RTForeignKeyColumn, ',', '') + ' FROM [' + t.dbname
        + '].[dbo].' + t.RefTable + ')'
FROM    #temp t
WHERE   ( LEN(t.PTForeignKeyColumn) - LEN(REPLACE(t.PTForeignKeyColumn, ',',
                                                  '')) ) = 1


/* CURSOR through to execute statement to get mismatched/untrusted record counts */
DECLARE @dbName SYSNAME ,
    @parentbl SYSNAME ,
    @reftbl SYSNAME ,
    @fkname SYSNAME ,
    @sqlStmt NVARCHAR(MAX);   -- matches cntStmt, which is NVARCHAR(MAX); 500 could truncate long statements
DECLARE @ParmDefinition NVARCHAR(200)


DECLARE cntcursor CURSOR
FOR
    SELECT  dbname ,
            parenttable ,
            reftable ,
            fkname ,
            cntstmt
    FROM    #temp
OPEN cntcursor
FETCH NEXT FROM cntcursor
INTO @dbName, @parentbl, @reftbl, @fkname, @sqlStmt

WHILE @@FETCH_STATUS = 0
    BEGIN
        DECLARE @cnt INT

        SET @ParmDefinition = N'@cnt_OUT int OUTPUT'

        EXECUTE sp_executesql @sqlstmt, @ParmDefinition,
            @cnt_OUT = @cnt OUTPUT;



        UPDATE  #temp
        SET     cntnotexistsRecords = @cnt
        WHERE   dbname = @dbName
                AND parenttable = @parentbl
                AND RefTable = @reftbl
                AND FKName = @fkname

        SET @cnt = 0

        FETCH NEXT FROM cntcursor
INTO @dbName, @parentbl, @reftbl, @fkname, @sqlStmt
    END

CLOSE cntcursor
DEALLOCATE cntcursor

SELECT  *
FROM    #temp
ORDER BY dbname
DROP TABLE #temp

/*
While doing the last SELECT * FROM #temp, you can add WHERE cntnotexistsRecords = 0 (to get only the keys whose existing records all satisfy the FK) AND cntStmt IS NOT NULL (to get the keys whose relationship does not involve more than one column, because I have refrained from creating a count statement for FKs involving multiple columns). This query is designed for FKs with only one column.
*/
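Once a key's cntnotexistsRecords comes back as 0, the statement captured in the keyname column can be run to make the key trusted again. A hypothetical example, with made-up table and constraint names:

-- re-check the constraint so the optimizer can trust it again
ALTER TABLE [dbo].[OrderDetail] WITH CHECK CHECK CONSTRAINT [FK_OrderDetail_Order];

-- verify: is_not_trusted should now be 0
SELECT name, is_not_trusted
FROM sys.foreign_keys
WHERE name = 'FK_OrderDetail_Order';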

Monday, November 5, 2012

SQL hardware configuration

As a DBA, as much as knowledge of query tuning, indexing and maintenance is important, the other aspect, disk I/O configuration, is just as important. Sometimes, however much you tune the query, the problem lies elsewhere.

RAID 0: striping with no redundancy; should generally never be used for SQL Server
RAID 1: mirrors data; provides protection from loss of data
RAID 5: striping with parity
RAID 6: striping with double distributed parity
RAID 10: a striped set of mirrors
RAID 01: a mirrored set of stripes

For read-only data, RAID 5 or 6 is a good choice.

For heavy writes or OLTP, RAID 1+0 is a good choice, though expensive.

RAID 1 is a good choice for transaction logs, which are mostly written to continuously.

Since tempdb is mostly writes, RAID 1 or RAID 1+0 is a good option; so is RAM-based or other specialized hardware.

Direct attached storage (DAS): disks are attached to or built into the server chassis and hence dedicated to each server. Easy assembly as well as easy performance troubleshooting, but no support for failover clustering, disk array snapshots, or cross data center/array-based replication.

Storage area networks (SAN): storage shared between multiple servers; ensures maximum use of storage space.
Since the system is complex, performance troubleshooting becomes difficult.

Wednesday, October 31, 2012

Memory Management

This is an excerpt from the memory management chapter of the book Troubleshooting SQL Server: A Guide for the Accidental DBA (some notes I put together from the chapter for self-help; this chapter in the book is a wonderful read).

No memory leak: SQL Server keeps all the memory allocated to it, whether it uses it or not, and does not return memory that is no longer in use, so it can look like there is a memory leak. PS: SQL Server will not release this memory unless the OS sets the memory low resource notification flag, which signals SQL Server to release some of its allocations, while the memory high notification lets SQL Server know that it can grow to use additional memory. (A dedicated thread was introduced in SQL 2005 to monitor these memory notifications.)

Memory limits under Windows 2008 (these examples assume around 64GB of RAM):

    64-bit SQL: limited by the OS edition (32GB in this example)
    32-bit SQL: 4GB limit (enabling AWE helps SQL use additional memory, up to 64GB)

Only SQL 2000 Enterprise, or SQL 2005 Standard and above, have the AWE feature; SQL 2000 Standard edition and below do not. (PS: remember, if the RAM is less than 32GB then SQL's upper limit will accordingly be less than the physical memory available.)

SQL Server 32 bit:

  • Memory allocation: 2GB of user mode VAS and 2GB of kernel mode VAS, with Windows assigning it a total of 4GB to operate in. Of the 2GB user mode portion, most is buffer pool memory and some is non-buffer pool memory.

  • Buffer pool memory: SQL Server calls the VirtualAlloc function in the Windows API to allocate memory to itself. (VirtualAlloc returns a 32-bit pointer, which limits the amount of user mode VAS to 2GB.) Buffer pool memory allocations are for data pages and execution plans. This memory is pageable by Windows.

  • Non-buffer pool memory allocations: thread stack allocations, heap allocations, extended stored procedures, SQLCLR, linked servers, backup buffers. SQL Server calculates this as MemToLeave = MaxWorkerThreads * 0.5MB + default reservation (256MB), where MaxWorkerThreads = 256 + (ProcessorCount - 4) * 8 (generally comes to about 0.4GB). Generally the non-buffer pool allocations will not require more memory than this, but in case this memory becomes fragmented and there is not enough contiguous memory available, the amount of VAS reserved can be increased using the -g startup parameter (e.g. -g 384). In 64-bit SQL, VAS is 8TB, greater than the allowable physical RAM on a Windows server, so this issue does not exist.
    So for the 2GB of user mode VAS: 1.6GB is buffer pool and 0.4GB is non-buffer pool memory.

  • To allow 32-bit SQL to use more buffer pool memory, 2 options exist (see the sp_configure sketch after this list):
    1. VAS tuning (available for a Windows server with 4GB of RAM). Use with extreme caution, since here you are taking away about 1GB of kernel VAS and allocating it to user VAS. In most cases, avoid!
    2. Enabling AWE, Address Windowing Extensions (for more than 4GB of RAM). AWE is used to extend buffer pool user memory. It requires additional configuration of the OS to use Physical Address Extensions (PAE). When PAE is enabled, the 32-bit memory management pointer is expanded to 36 bits, allowing the OS to address 64GB of RAM or RAM's upper limit. For applications to make use of this additional memory they must use AWE: instead of calling VirtualAlloc, SQL Server now calls the AllocateUserPhysicalPages function in Windows to allocate memory. This memory is non-pageable by Windows.
    To use AWE, PAE is enabled in Windows 2008 using BCDEdit /set from the command prompt. Next, the 'awe enabled' sp_configure option is set in SQL Server, and the SQL Server service account must be granted the Lock Pages in Memory user right (assigned using the Windows Group Policy tool, gpedit.msc). A restart of SQL Server is required after enabling this. The 'max server memory' sp_configure option is then used to set the maximum memory allowed for AWE, e.g. 4GB or whatever memory we think SQL Server requires (setting it to MAX RAM is not recommended). Since this is non-pageable memory, the Windows OS cannot get this locked memory back if it needs it, so be careful while assigning the AWE memory limit.
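A hypothetical sketch of the SQL-side AWE settings (the memory value is made up for illustration):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'awe enabled', 1;                -- takes effect after a restart
EXEC sp_configure 'max server memory (MB)', 8192;  -- cap for the AWE-extended buffer pool
RECONFIGURE;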


SQL Server 64 bit:

  • In 64-bit SQL, VAS is 8TB for kernel mode and 8TB for user mode, greater than the allowable physical RAM on a Windows server. Hence the 'awe enabled' bit has no use in 64-bit SQL, but the Lock Pages in Memory option still exists.

  • Also, since there is so much VAS, MemToLeave for the non-buffer pool and the -g startup parameter have no significance in 64-bit either. But you have to manually monitor to ensure that Memory\Available MBytes (non-buffer pool memory) remains above 300MB as you gradually increase the value of max server memory.

  • Configuring memory through the min/max server memory settings (via SSMS or sp_configure): min server memory is the minimum size to which SQL Server can shrink the buffer pool when Windows memory is under pressure; max server memory is the maximum memory the buffer pool can use. Min server memory should always be set much lower than max server memory. (If min and max are set to the same value, which is not recommended, we are intentionally locking pages in memory, which limits how much space SQL can free up for Windows when Windows hits low memory.) A hypothetical sp_configure example follows this list. (More on how to lock pages in memory: http://msdn.microsoft.com/en-us/library/ms190730(v=sql.105).aspx)
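A hypothetical example of setting both bounds (the sizes are made up; keep min well below max):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'min server memory (MB)', 4096;
EXEC sp_configure 'max server memory (MB)', 28672;
RECONFIGURE;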


Diagnosing memory pressure: too little memory allocated to SQL means more pages must be fetched from disk, which means more physical I/O and poor performance. It results in continuous buffer pool flushing, called buffer pool churn.

SQL Server: Buffer Manager\Page Life Expectancy (the time in seconds that a page survives in cache): this counter must be monitored over long periods of time in order to properly identify the normal trend and deviations away from it.

Free List Stalls/sec being frequently > 0 indicates memory pressure. If the Free Pages counter is at or near 0 and PLE drops at the same time, that too means memory pressure. (Both counters can also be read from a DMV, as sketched below.)
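A small sketch reading the same counters from inside SQL Server:

SELECT [object_name], counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN ('Page life expectancy', 'Free list stalls/sec');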

Monday, September 10, 2012

Find and drop unused indexes across multiple servers

So we have a system with about 300+ dbs across 20 servers and all of the databases are identical. Each db represents a client.

/*
This query is used to find index seeks, scans, lookups and updates across all tables in all dbs on all servers.
The criteria for an unused index: seeks + scans + lookups < 10,
the index is not unique or clustered,
and the table is not a heap.

Run the query below across all servers using registered servers. It uses EXEC sp_msforeachdb to run the query on each db.
*/
CREATE TABLE #UnusedIndexes
(dbName sysname,
tablename sysname,
indexname sysname,
seeks BIGINT,
scans BIGINT,
lookups BIGINT,

updates BIGINT,
lastseek DATETIME,
lastscan DATETIME,
TotalSeeksScansLookups BIGINT,
SizeInKB BIGINT)


EXEC sp_msforeachdb N' USE [?]

INSERT INTO #UnusedIndexes
SELECT  DB_NAME(),o.name ,
i.name,
u.user_seeks,
u.user_scans,
u.user_lookups,
u.user_updates,
u.last_user_seek,
u.last_user_scan,
u.user_seeks+
u.user_scans+
u.user_lookups TotalSeeksScansLookups,
reserved_page_count * 8
FROM    sys.indexes i
JOIN sys.objects o ON i.object_id = o.OBJECT_ID
LEFT JOIN sys.dm_db_index_usage_stats u ON i.object_id = u.object_id
AND i.index_id = u.index_id
AND u.database_id = DB_ID()
LEFT JOIN sys.dm_db_partition_stats ps ON i.index_id = ps.index_id AND i.object_id = ps.object_id -- join on i, not u: truly unused indexes have no usage-stats row

WHERE   o.type <> ''S''
--and isnull(u.user_updates, 0) > 0
and i.type_desc <> ''HEAP''
AND ISNULL(u.user_seeks,0)+
ISNULL(u.user_scans,0)+
ISNULL(u.user_lookups,0) < 10
AND is_unique = 0
AND i.type_desc = ''NONCLUSTERED''
--  AND o.name = '''' AND i.name=''<indexname>''
AND DB_NAME() LIKE ''%_<type>''
;

'

SELECT * FROM #UnusedIndexes

DROP TABLE #UnusedIndexes

Once you get the output, put it in Excel or a SQL table for further querying.

----determine the total number of databases in the system with similar schemas (in ours it is 301 dbs)
SELECT *
FROM sys.databases WHERE name LIKE '%%_<type>' AND
name NOT IN ('master','tempdb','model','msdb','MMDB','pubs','Northwind','mmdb','MMShare')--301  dbs

/*
Here, I have imported the data into a table called UnusedIndexes and ran the count query below.
Wherever noofoccurences matches the total number of stage dbs (301 for this run)
AND the user seeks, scans and lookups are 0 or negligible compared to the updates,
none of the stage databases is using this index,
hence the index can be deleted across all dbs.

In case user seeks, scans, lookups and updates are all NULL, that does not give us much information, so we will have to go through each such index's usage and figure out whether to keep or delete it.
*/
SELECT tablename,indexname,COUNT(indexname) AS noofoccurences,SUM(totalSeeksScansLookups) as totalSeeksScansLookups,
SUM(updates) AS Totalupdates,sum(sizeinKB)/1000/1000 AS sizesavingsacrossAllinGB
from [dbo].[UnusedIndexes]
GROUP BY tablename, indexname
HAVING COUNT(indexname) >300
ORDER BY sum(sizeinKB) desc
/* As you begin deleting the indexes you might see a significant decrease in db size. This also helps the CPU, since that many fewer indexes need to be updated and maintained.

After the above run, keep lowering the HAVING COUNT(indexname) threshold (e.g. > 275) and see how many more indexes it surfaces.

Sometimes an index is picked up by the optimizer by mistake and hence shows up as used in a few databases. If that is the behaviour you see, those indexes can be dropped as well, unless you put the index in specifically for a large database or for a particular query.
*/

Be very careful dropping any index; double-check that it is not unique or clustered. We saw savings of approximately 50GB in total across all servers.
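Once an index passes those checks, the drop itself is simple. A hypothetical example, with a made-up table and index name, run across all dbs via registered servers the same way as above:

EXEC sp_msforeachdb N'USE [?];
IF DB_NAME() LIKE ''%_<type>''
   AND EXISTS (SELECT 1 FROM sys.indexes WHERE name = ''IX_Client_LegacyLookup'')
    DROP INDEX [IX_Client_LegacyLookup] ON [dbo].[Client];';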