I have been struggling with poor performance on my laptop for some time. I had realised that the problem was a regular stoppage of everything, but I couldn't figure out what was causing it.
Then, by luck, I spotted "saving auto recovery information" appear on the status bar in SQL Server Management Studio, at which point I found myself looking at the auto-recovery options in Tools >> Options. Yes, there are loads of blogs indicating how to change the frequency and retention period.
What became apparent is that I have only had this problem since moving to 2012; my 2008 SSMS worked fine, probably because of the absence of auto-recovery features. Indeed, I was using SSMSBoost, but the freebie version is a right pain when the license runs out, so I stopped using it.
Back to the topic: what I am missing is how to set the location of the auto-recovery files. I searched all the locations that are listed online and my files were in none of them; I even manually created the folders in My Documents to see if that would trigger some action, to no avail.
Today I found the files and now know why the auto-save is so painfully slow. It would be today, because I am working over a VPN and it turns out that the default document path for my account is set to a network location - hence auto-save is happening in a different town from the one I am sitting in.
Come on Microsoft make this stuff configurable.
Tuesday, 31 December 2013
Friday, 28 June 2013
Changing the Max Server Memory setting
Of late I have encountered a number of servers with no Max Server Memory setting; in fact, all around me I see SQL Server 2008R2 servers which are straight out-of-the-box installs.
A couple of these machines have had memory challenges. I was looking at one today, a virtual machine with 30GB of RAM allocated and 100% ad hoc queries (with "optimize for ad hoc workloads" not enabled, though I duly sorted that), where SQL Server was reporting 29.7GB of RAM in use plus 805MB of RAM outside of the buffer cache. Needless to say, paging was quite heavy, as SQL Server alone was using 30.5GB of a 30GB box.
However, there is a slight catch. The Max Server Memory setting is dynamic and can simply be changed, but very recent experience has shown me that, regardless of whether the new maximum is above or below current usage, the procedure cache is certainly dumped. I haven't checked the buffer cache, but I suspect the setting change is a blunt instrument, so be warned.
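For completeness, a minimal sketch of making the change; the 26624MB figure is only an example (leaving roughly 4GB for the OS on a 30GB box), so size it for your own server :
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 26624; -- example value only
EXEC sp_configure 'optimize for ad hoc workloads', 1; -- as mentioned above
RECONFIGURE;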
Friday, 19 April 2013
Null storage
I was trying to figure out how much space the storage of nulls takes. There are any number of articles online (SQL Server 2012 and prior versions) saying that nulls take no space except for the 2 byte row overhead for having any number of nullable columns.
However, I am convinced I read somewhere that null columns only take no space when they are not subsequently followed by non-null data.
So I set up a small experiment that is easy to repeat :
create table fred (
name varchar(100) not null,
name2 varchar(100) null,
name3 varchar(100) not null
)
create table fred2 (
name varchar(100) not null,
name3 varchar(100) not null,
name2 varchar(100) null
)
set nocount on
declare @i int
set @i = 0
while @i < 50000 begin
set @i=@i+1
insert into fred values ('dklfsdklfsdlhfsdfhsdfjsdjk',null,'jkadlfhklsdfsdfjksdjkfjksd')
insert into fred2 values ('dklfsdklfsdlhfsdfhsdfjsdjk','jkadlfhklsdfsdfjksdjkfjksd',null)
if @i%1000 = 0 select @i -- display progress occasionally
end
exec sp_spaceused 'fred'
exec sp_spaceused 'fred2'
drop table fred
drop table fred2
I had a couple of false starts when running this, so the output I received from sp_spaceused showed more rows than specified in the code, but the principle remains the same.
So to summarise what I've done: both tables have 3 varchar(100) fields; the table "fred" has the middle field nullable and the table "fred2" has the final field nullable. Both are filled with identical data, and the results show that fred2, with the trailing null column, has taken less space.
This demonstrates that storing nulls last is most space efficient, which on a larger table could translate into real savings in the amount of data stored on disk and in the buffer cache.
I have a document called "Writing Efficient SQL", aimed at in-house developers, and I have updated it to recommend that all not null fields in a table come first, followed by fixed-length nullable fields (fixed-length fields seem to take their full length whether null or not), followed by nullable variable-length fields ordered with the most likely to be not null first, to give the greatest chance of storage efficiency.
This does mean that fields are not in a natural order; indeed, it might be easier to simply keep the natural order for everything except the variable-length nullable fields. Generally, though, the ordering of fields is not relevant to presentation to the user, so it shouldn't really make any usability difference.
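To illustrate the recommendation, here is a hypothetical table laid out in the suggested order (the names and types are invented for the example) :
create table CustomerNote (
CustomerID int not null, -- not null fields first
CreatedOn datetime not null,
ReviewedOn datetime null, -- fixed length nullable fields next
NoteText varchar(500) null, -- variable length nullables last, most likely to be populated first
InternalRef varchar(50) null -- least likely to be populated
)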
Thursday, 28 February 2013
Uneven TempDB
With more than a single TempDB data file it is critical to ensure that all data files stay in the same proportion so that the proportional fill algorithm can operate.
To check for uneven tempdb run the following :
;with tab as ( -- identify the tempdb database files that are online and have uneven file sizes
select d.database_id
from sys.master_files mf
inner join sys.databases d on d.database_id = mf.database_id
where d.name = 'tempdb'
and mf.type_desc = 'ROWS'
and d.state = 0 -- ONLINE
group by d.database_id
having count(mf.file_id) > 1 -- no point checking if only one file (could be a design decision, assume designed)
and avg(mf.size) != min(mf.size)
)
select mf.name
, mf.physical_name
, mf.size*8/1024
from tab t
inner join sys.master_files mf on mf.database_id = t.database_id
where mf.type_desc = 'ROWS'
We need to check all of the data files (type_desc = 'ROWS'), and a simple method of deciding whether there is an issue is to compare the average file size logged in sys.master_files to the smallest file size - if they are not identical then at least one of the tempdb data files has gone out of step. Fixing it can simply mean increasing all the smaller files to the same size as the largest or, if a data file grew due to a one-off event, stopping the SQL Server service, deleting the tempdb files and restarting the service (tempdb is automatically created at startup).
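A minimal sketch of the first option; the logical file names and the 4096MB target are placeholders, so use the names and the largest size reported by the query above :
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev2, SIZE = 4096MB);
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev3, SIZE = 4096MB);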
Thursday, 21 February 2013
IGNORE_DUP_KEY
I have come across an interesting feature when trying to set up index rebuild processes on SQL Server 2008R2. I am constructing an alter index rebuild statement including appropriate IGNORE_DUP_KEY and other settings and am receiving failures because of IGNORE_DUP_KEY.
It would appear that IGNORE_DUP_KEY=ON is not a valid option when rebuilding an index that is a primary key or has a unique constraint, and yet I'm having no problem finding examples where this is exactly what is configured, notably within Microsoft databases; I first picked this up in the DataCollection database.
I have managed to get around this: I explicitly set IGNORE_DUP_KEY=ON in the alter index rebuild statement only where it is required, and I never explicitly set the OFF value. That means that when I rebuild an index without specifying whether to ignore duplicate keys, the setting is left exactly as it was configured and I do not need to worry about breaking any system functionality.
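A hedged sketch of the two forms of statement my process ends up generating; the index and table names are placeholders :
-- only produced where IGNORE_DUP_KEY genuinely needs to be ON
ALTER INDEX IX_Example ON dbo.SomeTable REBUILD WITH (IGNORE_DUP_KEY = ON, FILLFACTOR = 90);
-- for primary keys and unique constraints the option is simply omitted and the existing setting is preserved
ALTER INDEX PK_SomeTable ON dbo.SomeTable REBUILD WITH (FILLFACTOR = 90);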
Tuesday, 19 February 2013
Tempdb
Useful article on tempdb and the number of files required : tempdb article. I tend to start with just two files and extend if experiencing performance problems, simply because I've had a number of cases where tempdb activity is slower when spread across lots of small files than when running everything against fewer, larger files.
Thursday, 7 February 2013
Adventures with Powershell, finally some code
Time to show the code for the SQL Server service state monitoring. There is quite a lot, but hopefully it should be easy to follow; if all else fails, try it out :
/*
Check State
Create tables
Chris Page 2013
*/
use [DBAUtility]
go
-- Table to store the windows server name & SQL instance name
IF OBJECT_ID ( 'SQLInstance', 'U' ) IS null begin
CREATE TABLE [dbo].[SQLInstance](
ID int not null identity (1,1),
[WindowsServer] varchar(128) not NULL,
[SQLInstance] varchar(255) not NULL, -- either same as server or server\instance
Active bit -- 1 = active so check, 0 = inactive so don't check
constraint PK_SQLInstance primary key clustered (Id) with (fillfactor=100, pad_index=off)
)
create unique index Idx_SQLInstance on SQLInstance (WindowsServer,SQLInstance) with (fillfactor=70, pad_index=on)
-- Inserting initial data into SQLInstance
insert into SQLInstance values ('S-UAT-SQL108.TOWERGATE.LOCAL','S-UAT-SQL108',1) -- Default Instance
insert into SQLInstance values ('L-TPL-3058','L-TPL-3058',1) -- "server", "server\instance"
end
if OBJECT_ID ('SQLServiceState','U') is null begin
-- SQL Service Monitoring Output Table
CREATE TABLE [dbo].[SQLServiceState](
Id int identity (1,1) not null,
[WindowsServerID] int not null,
[SQLService] [varchar](128) not NULL,
[UpdatedOn] [smalldatetime] not NULL,
[ServiceStatus] [varchar](100) not NULL,
[StartMode] [varchar](100) not NULL,
[ServiceAccount] [varchar](1000) NULL, -- long to allow command to be stored in case of issues
constraint FK_SQLServiceState_WindowsServer foreign key (WindowsServerID) references SQLInstance (Id)
)
alter table SQLServiceState add constraint [DF_SQLServiceState_UpdatedOn] default (getdate()) for UpdatedOn
create unique index IX_SQLServiceState on SQLServiceState (ID) with (fillfactor=100, pad_index=off)
end
if OBJECT_ID('SQLServiceStateHistory','U') is null begin
CREATE TABLE [dbo].[SQLServiceStateHistory](
Id int not null, -- not an identity field as data coming from elsewhere
[WindowsServerID] int not null,
[SQLService] [varchar](500) NULL,
[UpdatedOn] [smalldatetime] NULL,
[ServiceStatus] [varchar](100) NULL,
[StartMode] [varchar](100) NULL,
[ServiceAccount] [varchar](1000) NULL,
constraint FK_SQLServiceStateHistory_WindowsServer foreign key (WindowsServerID) references SQLInstance (Id)
)
end
if OBJECT_ID('ProcessLog','U') is null begin -- *** STANDARD SPEC!
create table ProcessLog (
Id int not null identity (1,1), -- UID
Package varchar(200) not null,
Process varchar(100) not null,
ActionTaken varchar(255) not null,
DateStamp datetime not null,
Records int not null
constraint PK_ProcessLog primary key clustered (Id) with (fillfactor=90, pad_index=on)
)
create index IDX_ProcessLog_Process on ProcessLog (Process) with (fillfactor=60, pad_index=on)
create index IDX_ProcessLog_DateStamp on ProcessLog (DateStamp) with (fillfactor=60, pad_index=on)
alter table ProcessLog add constraint [DF_ProcessLog_Records] default (0) for Records
alter table ProcessLog add constraint [DF_ProcessLog_DateStamp] default (getdate()) for DateStamp
end
if OBJECT_ID('vSQLService','V') is not null begin
drop view dbo.vSQLService
end
go
create view vSQLService as
select w.WindowsServer
, w.SQLInstance
, s.SQLService
, s.ServiceAccount
, s.UpdatedOn
, s.StartMode
, s.ServiceStatus
from dbo.SQLServiceState s
inner join dbo.SQLInstance w on w.ID = s.WindowsServerID
This creates a table to log the servers that need checking (SQLInstance), a table in which to store the latest service state information (SQLServiceState), a table to store all previous service state information (SQLServiceStateHistory) - volumes can grow large quickly, hence separating the tables; if you want everything up to and including now, a union will work nicely - and finally a logging table (ProcessLog), which follows a standard layout I have been using to record what happened, when and why.
Lastly I have defined a view to put the principal information from SQLInstance and SQLServiceState together to make simple queries against the current data easier (ie I got fed up rekeying while testing).
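For example, a quick look at anything not currently running becomes as simple as :
select * from vSQLService where ServiceStatus != 'Running' order by WindowsServer, SQLService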
Next is the Check State script that does all the hard work :
use [DBAUtility]
go
/*
Check State
Chris Page 2013
*/
IF OBJECT_ID ( 'usp_CheckStateOfSQLServers', 'P' ) IS NOT NULL
DROP PROCEDURE usp_CheckStateOfSQLServers
go
create procedure usp_CheckStateOfSQLServers
as
begin
set nocount on
insert into ProcessLog (Package, Process,ActionTaken) values ('CheckStateOfServers','Process','Start')
declare @chkCMDShell as sql_variant
, @Server varchar(128)
, @SQLInstance varchar(128)
, @SQLInstanceID int
, @ERROR bit -- flag to stop onward processing
, @ErrMsg varchar(255)
, @ErrState int
, @ErrSeverity int
, @SQL varchar(1000)
set @ERROR = 0
begin try
select @chkCMDShell = value from sys.configurations where name = 'xp_cmdshell'
if @chkCMDShell = 0
begin
EXEC sp_configure 'Show Advanced Options', 1;
RECONFIGURE with override;
EXEC sp_configure 'xp_cmdshell', 1
RECONFIGURE with override;
EXEC sp_configure 'Show Advanced Options', 0;
RECONFIGURE with override;
insert into ProcessLog (Package, Process, ActionTaken) values ('CheckStateOfServers','xp_cmdshell','Allowed')
end
end try
begin catch
select @ERROR = 1, @ErrMsg = 'Could not enable XP_CMDSHELL '+ERROR_MESSAGE(),@ErrState = ERROR_STATE(),@ErrSeverity = ERROR_SEVERITY()
raiserror (@ErrMsg,@ErrState, @ErrSeverity)
insert into ProcessLog (Package, Process, ActionTaken) values ('CheckStateOfServers','xp_cmdshell','Could not be Allowed')
end catch
if @ERROR = 0 begin
begin try -- start by archiving results from the last run
insert into dbo.SQLServiceStateHistory -- move old entries to history
select *
from SQLServiceState
truncate table dbo.SQLServiceState -- bin the current data having moved to history
end try
begin catch
select @ERROR = 1, @ErrMsg = 'Could not move old ServiceStates to history '+ERROR_MESSAGE(),@ErrState = ERROR_STATE(),@ErrSeverity = ERROR_SEVERITY()
raiserror (@ErrMsg,@ErrState, @ErrSeverity)
insert into ProcessLog (Package, Process, ActionTaken) values ('CheckStateOfServers','ServerStateHistory','Could not moved.')
end catch
end
if @ERROR = 0 begin
create table #output (line varchar(max) null)
declare ServerCursor cursor fast_forward for
select distinct LTRIM(rtrim(s.WindowsServer))
, LTRIM(RTRIM(s.SQLInstance))
, s.ID
from dbo.SQLInstance s
where s.Active = 1
open ServerCursor
fetch next from ServerCursor into @Server, @SQLInstance, @SQLInstanceID
while @@FETCH_STATUS = 0 and @ERROR = 0 begin
set @sql = 'powershell.exe -c "Get-WmiObject -ComputerName ' + QUOTENAME(@Server,'''') + ' -Class win32_service | where {$_.name -like ' + QUOTENAME('*SQL*','''') + '} | select-object Name,state,systemname,startmode,startname | foreach{$_.name+''|''+$_.state+''%''+$_.systemname+''*''+$_.startmode+''@''+$_.startname+''!''}"'
begin try
insert #output EXEC xp_cmdshell @sql
end try
begin catch
select @ERROR = 1, @ErrMsg = 'xp_cmdshell execution failed '+ERROR_MESSAGE(),@ErrState = ERROR_STATE(),@ErrSeverity = ERROR_SEVERITY()
raiserror (@ErrMsg,@ErrState, @ErrSeverity)
insert into ProcessLog (Package, Process, ActionTaken) values ('CheckStateOfServers','xp_cmdshell','Execution failed')
end catch
if @ERROR = 0 begin
if (select COUNT(*) from #output o where o.line like 'Get-WmiObject : The RPC server is unavailable%') = 1 begin
insert into SQLServiceState (SQLService, ServiceStatus, ServiceAccount, StartMode, WindowsServerID) values
('All','Uncontactable','Command tried : '+@SQL, 'Uncontactable', @SQLInstanceID)
end else if (select COUNT(*) from #output o where o.line like 'Get-WmiObject : Access is denied%') = 1 begin
insert into SQLServiceState (SQLService, ServiceStatus, ServiceAccount, StartMode, WindowsServerID) values
('All','Access Denied','Command tried : '+@SQL, 'Access Denied', @SQLInstanceID)
end else begin
delete from #output where len(line) < 30 -- Deleting the rows which contains error or has not sufficient data
update #output set line = line + '!' where line not like '%!%'
IF (SELECT COUNT(*) FROM #output where line like '%Get-Wmi%') = 0 begin
insert into SQLServiceState(SQLService,ServiceStatus,StartMode,ServiceAccount, WindowsServerID)
select rtrim(ltrim(SUBSTRING(o.line,1,CHARINDEX('|',o.line) -1))) as SQLServiceName
, (rtrim(ltrim(SUBSTRING(o.line,CHARINDEX('|',o.line)+1, (CHARINDEX('%',o.line) -1)-CHARINDEX('|',o.line)) ))) as ServiceStatus
--,(rtrim(ltrim(SUBSTRING(line,CHARINDEX('%',line)+1, (CHARINDEX('*',line) -1)-CHARINDEX('%',line)) ))) as WindowsServerName
, (rtrim(ltrim(SUBSTRING(o.line,CHARINDEX('*',o.line)+1, (CHARINDEX('@',o.line) -1)-CHARINDEX('*',o.line)) ))) as startmode
, (rtrim(ltrim(SUBSTRING(o.line,CHARINDEX('@',o.line)+1, (CHARINDEX('!',o.line) -1)-CHARINDEX('@',o.line)) ))) as startname
, @SQLInstanceID
from #output o
where o.line is not null
and LEN(o.line) > 30
end else begin -- Unknown error
insert into SQLServiceState (SQLService, ServiceStatus, ServiceAccount, StartMode, WindowsServerID) values
('All','Unknown Error','Command tried : '+@SQL, 'Unknown Error', @SQLInstanceID)
end
end
end
truncate table #output
fetch next from ServerCursor into @Server, @SQLInstance, @SQLInstanceID
end -- WEND
close ServerCursor
deallocate ServerCursor
drop table #output
end
if @chkCMDShell = 0 -- turn off xp_cmdshell again if we turned it on during execution
begin try
EXEC sp_configure 'Show Advanced Options', 1;
RECONFIGURE;
EXEC sp_configure 'xp_cmdshell', 0
RECONFIGURE;
EXEC sp_configure 'Show Advanced Options', 0;
RECONFIGURE;
insert into ProcessLog (Package, Process, ActionTaken) values ('CheckStateOfServers','xp_cmdshell','DisAllowed')
end try
begin catch
set @ERROR = 1
select @ERROR = 1, @ErrMsg = 'Could not disable XP_CMDSHELL '+ERROR_MESSAGE(),@ErrState = ERROR_STATE(),@ErrSeverity = ERROR_SEVERITY()
raiserror (@ErrMsg,@ErrState, @ErrSeverity)
insert into ProcessLog (Package, Process, ActionTaken) values ('CheckStateOfServers','xp_cmdshell','Could not be disabled.')
end catch
insert into ProcessLog (Package, Process, ActionTaken) values ('CheckStateOfServers','Process','End')
end
This seems like a lot of code, but a lot of it is error checking and reporting rather than getting statuses. Note that right at the beginning the code turns on xp_cmdshell if it is turned off; if it does turn it on, it turns it off again at the end to close that security hole - the implication being that this task needs to run at a sufficiently high security level to be able to take that action. My assumption is that the process will be running at the SQL Server Agent level and will probably have the necessary rights anyway - we are, after all, looking at service statuses and that is a privileged action. After turning on xp_cmdshell, the Powershell command is run once for each server & instance found in the SQLInstance table. Yes, I have used a cursor to provide my loop through the SQLInstance table; it is set as fast_forward so it works quickly. You could change this to a while loop or something else if you want, but the coding is shorter and tidier with a cursor, and for something that is not OLTP I don't see problems on my systems.
The Powershell command is actually pretty simple if you've done any command line scripting (I'll jump back to my Unix comment from a previous post here - it's remarkably Unix like). Resolved for a single server it looks like this (server name substituted) :
powershell.exe -c "Get-WmiObject -ComputerName 'MYSERVER' -Class win32_service | where {$_.name -like '*SQL*'} | select-object Name,state,systemname,startmode,startname | foreach{$_.name+'|'+$_.state+'%'+$_.systemname+'*'+$_.startmode+'@'+$_.startname+'!'}"
This can be run independently of the SQL scripts: it simply runs Powershell, calls get-wmiobject for the given server and lists out the name, state, start mode and service account for any service that is SQL related. This is the only problematic hardcoded part (yes, you could code around it; is it necessary? We'll find out in a future version of SQL Server, I imagine) because it is looking for any service with SQL in the registered service name - so it might generate false positives on a box running non-SQL Server SQL services (MySQL etc.). I am not able to test that currently, but it should be easy to tweak and potentially is not such a bad thing. To help other people encountering Powershell for the first time: there is another command, get-service, which seems much simpler and tidier than get-wmiobject; however, it cannot run against a remote server. The output from the Powershell command is saved into a table and the useful parts processed to find the detail we actually need - both the Powershell command and the processing of its output are unmodified from Jugal's version as they work perfectly.
I handle as many errors as I have encountered, including inability to turn on or use xp_cmdshell, RPC service unavailable, server uncontactable, access denied and a final "unknown" category just in case something new comes up.
Okay, we've got some output in a table, now what? Well I guess someone needs to know about the results so the following script provides two routes, either direct to the console via Management Studio or via an HTML formatted email (designed to handle Outlook as used in my environment but the HTML is entirely standard so should be fine in any client that accepts HTML) :
use [DBAUtility]
go
/*
Check State
Report State
Chris Page 2013
*/
IF OBJECT_ID ( 'usp_ReportStateOfSQLServers', 'P' ) IS NOT NULL
DROP PROCEDURE usp_ReportStateOfSQLServers
go
create procedure usp_ReportStateOfSQLServers (
-- parameters could be
@ShowStopped bit = 1, -- show services not set as running (as opposed to show stopped in case other values are available but not known)
@ShowUncontactable bit = 1, -- show uncontactable servers (ignores @AutomaticOnly)
@ShowDenied bit = 1, -- show servers to whom access was denied
@ShowUnknown bit = 1, -- show servers with unknown errors
@AutomaticOnly bit = 1, -- show automatic start services (inc delayed) only
@Destination varchar(255) = 'chris.page@towergate.co.uk' -- if 'local' then print locally
)
as
begin
set nocount on
declare @OutputStopped varchar(max)
, @OutputUncontactable varchar(max) -- includes access denied
, @OutputAll varchar(max)
, @WindowsServer varchar(128)
, @SQLInstance varchar(255)
, @SQLService varchar(128)
, @ServiceAccount varchar(1000)
, @UpdatedOn smalldatetime
, @StartMode varchar(100)
, @ServiceStatus varchar(100)
select s.WindowsServer
, s.SQLInstance
, s.SQLService
, s.ServiceAccount
, s.UpdatedOn
, case
when s.ServiceStatus = 'Access Denied' then 'Check agent service account has rights on remote'
when s.ServiceStatus = 'Uncontactable' then 'Check server is up'
else ''
end as StartMode
, s.ServiceStatus
into #TempStateList -- store in temp table so local & email destination done in one query & no dynamic SQL
from vSQLService s
where ( (@AutomaticOnly = 0 or (@AutomaticOnly=1 and s.StartMode like 'Auto%'))
and
s.ServiceStatus != 'Running'
and
@ShowStopped = 1
)
or
(@ShowUncontactable = 0 or (s.ServiceStatus = 'Uncontactable' and @ShowUncontactable=1))
or
(@ShowDenied = 0 or (s.ServiceStatus = 'Access Denied' and @ShowDenied=1))
or
(@ShowUnknown = 0 or (s.ServiceStatus = 'Unknown Error' and @ShowUnknown=1))
order by s.WindowsServer
, s.SQLInstance
, s.SQLService
if lower(@Destination)='local' begin
select 'Unreachable Servers'
select * from #TempStateList t where t.ServiceStatus in ('Uncontactable','Unknown Error' ,'Access Denied')
select 'Stopped Services'
select * from #TempStateList t where t.ServiceStatus not in ('Running','Unknown Error','Uncontactable','Access Denied')
end else begin
declare ReportableServicesCursor cursor fast_forward for
select * from #TempStateList
set @OutputUncontactable = ''
set @OutputStopped = ''
open ReportableServicesCursor
fetch next from ReportableServicesCursor into @WindowsServer, @SQLInstance, @SQLService, @ServiceAccount, @UpdatedOn, @StartMode, @ServiceStatus
while @@FETCH_STATUS = 0 begin
--select @WindowsServer,@SQLInstance,@SQLService,@ServiceAccount,@UpdatedOn,@StartMode,@ServiceStatus
select @OutputUncontactable=@OutputUncontactable+'<tr>'+
'<td>'+@WindowsServer+'</td>'+
'<td>'+@SQLInstance+'</td>'+
'<td>'+@SQLService+'</td>'+
'<td>'+cast(@UpdatedOn as varchar)+'</td>'+
'<td><font color="red">'+@ServiceStatus+'</font></td>'+
'<td>'+@Startmode+'</td>'+
'<td>'+@ServiceAccount+'</td>'+
'</TR>'
where @ServiceStatus in ('Uncontactable','Unknown Error' ,'Access Denied')
select @OutputStopped =@OutputStopped+'<tr>'+
'<td>'+@WindowsServer+'</td>'+
'<td>'+@SQLInstance+'</td>'+
'<td>'+@SQLService+'</td>'+
'<td>'+cast(@UpdatedOn as varchar)+'</td>'+
'<td>'+@StartMode+'</td>'+
'<td><font color="red">'+@ServiceStatus+'</font></td>'+
'<td>'+@ServiceAccount+'</td>'+
'</TR>'
where @ServiceStatus not in ('Running','Unknown Error','Uncontactable','Access Denied')
fetch next from ReportableServicesCursor into @WindowsServer, @SQLInstance, @SQLService, @ServiceAccount, @UpdatedOn, @StartMode, @ServiceStatus
end
close ReportableServicesCursor
deallocate ReportableServicesCursor
if @OutputUncontactable!='' or @OutputStopped !='' begin -- only output results if there are some
set @OutputAll = '<body>'
if @OutputUncontactable != '' begin
set @OutputAll = @OutputAll+'The following SQL Server instances are unavailable : <br/><br/>'+
'<table border=1><tr><td><b>Windows Server</b></td><td><b>SQL Instance</b></td><td><b>SQL Service</b></td><td><b>Updated On</b></td>'+
'<td><b>Service Status</b></td><td><b>Notes</b></td><td><b>Script Used</b></td></tr>'+
@OutputUncontactable+'</table>'
end
if @OutputStopped != '' begin
if @OutputAll != '' begin
set @OutputAll = @OutputAll + '<br/><br/><br/>'
end
set @OutputAll = @OutputAll+
'The following SQL Server services are stopped :<br/><br/>'+
'<table border=1><tr><td><b>Windows Server</b></td><td><b>SQL Instance</b></td><td><b>SQL Service</b></td><td><b>Updated On</b></td>'+
'<td><b>Start Mode</b></td><td><b>Service Status</b></td><td><b>Service Account</b></td></tr>'+
@OutputStopped+'</table>'
end
set @OutputAll = @OutputAll + '</body>'
begin try
exec msdb.dbo.sp_send_dbmail @recipients=@Destination, @Importance='High'
, @subject='SQL Server Service Review', @Body=@OutputAll, @Body_Format='HTML'
end try
begin catch -- unable to send mail, might be a config issue (firewall for smtp) or send mail not configured / enabled / no default public profile
raiserror ('Unable to send email with service statuses.',16,1) with log
end catch
end
end
end
(Worth noting here that I used SimpleCode to embed the script and retain the HTML text - it pasted across to my machine and ran perfectly.) This code is a bit of a fudge because, having written a means of emailing the results to myself, I got fed up waiting for the emails to arrive while testing, so I needed an output to screen and it made sense to provide the same output both ways. Hence the code writes the desired output to a temporary table and then processes that differently according to the chosen destination.
This function is called as either :
usp_ReportStateOfSQLServers @Destination = 'local'
-- prints direct to screen
usp_ReportStateOfSQLServers @Destination = 'email@address'
-- send to email
You will note some additional parameters that vary the "sensitivity" of what is reported - the default settings will tell you about any SQL services set to automatic start (including delayed start) which are currently stopped, and include uncontactable / access denied and unknown errors. For my environment all SQL services that are normally used on a server are set to automatic start and we expect all machines to be visible all the time, so this gives me the answers I need. I deliberately leave my laptop on the list of servers so that when I am working remotely I'll get an email saying my laptop is uncontactable, which just keeps me aware that the process is alive, because it does not send an email if there is nothing to tell me (so if everything is working you won't see an email - be aware, some people like to get a positive response every time; that is easy enough to change).
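For example, based on the parameters above, to see only stopped automatic-start services on screen and ignore the uncontactable / access denied / unknown noise :
usp_ReportStateOfSQLServers @Destination = 'local', @ShowUncontactable = 0, @ShowDenied = 0, @ShowUnknown = 0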
You might have noted in the Check State code that records currently in the SQLServiceState are firstly moved to the SQLServiceStateHistory table before current statuses are checked so now we need to manage how large the history table can get :
use [DBAUtility]
go
/*
Check State
Manage State History
Chris Page 2013
*/
IF OBJECT_ID ( 'usp_ManageSQLServerStateHistory', 'P' ) IS NOT NULL
DROP PROCEDURE usp_ManageSQLServerStateHistory
go
create procedure usp_ManageSQLServerStateHistory (
-- parameters could be
@KeepAllHistoryForDays smallint = 7, -- delete anything "normal" after this length of time in days
@KeepIssueHistoryForDays smallint = 90-- Keep stopped / uncontactable items for this long (longer than KeepAll)
)
as
begin
set nocount on
insert into ProcessLog (Package, Process, ActionTaken) values ('ManageSQLServerStateHistory','Process','Start')
declare @ErrMsg varchar(255)
, @ErrState int
, @ErrSeverity int
if @KeepAllHistoryForDays > @KeepIssueHistoryForDays begin
insert into ProcessLog (Package, Process, ActionTaken) values ('ManageSQLServerStateHistory','Check periods','Keep All longer than Keep Issues, Keep Issues aligned')
set @KeepIssueHistoryForDays = @KeepAllHistoryForDays
end
begin try
delete from SQLServiceStateHistory -- start by deleting any non issues for the shorter period
where UpdatedOn < DATEadd (DD,@KeepAllHistoryForDAys*-1,GETDATE())
and ServiceStatus in ('Running')
delete from SQLServiceStateHistory -- then delete anything over the longer period
where UpdatedOn < DATEADD (dd, @KeepIssueHistoryForDays*-1, getdate())
end try
begin catch
select @ErrMsg = 'Could not delete history SQL Server State '+ERROR_MESSAGE(),@ErrState = ERROR_STATE(),@ErrSeverity = ERROR_SEVERITY()
raiserror (@ErrMsg,@ErrState, @ErrSeverity)
insert into ProcessLog (Package, Process, ActionTaken) values ('ManageSQLServerStateHistory','Delete History','Failed.')
end catch
insert into ProcessLog (Package, Process, ActionTaken) values ('ManageSQLServerStateHistory','Process','End')
end
This time there are two parameters (an example call follows the list) :
- @KeepAllHistoryForDays - deletes all "running" service records that exist in the database after this many days. Generally speaking I am not really interested in services that are running as expected, so I only want to keep that data for a short period and have a default value of one week.
- @KeepIssueHistoryForDays - deletes any records that still exist after this many days. I am working on the principle that maintaining a recent service problem history is useful and can be used in discussion with service providers as evidence to support reliability discussions (yes, I am taking a negative view, but if the history is empty then all is well!), hence I expect this data to be kept for longer and have defaulted it to 90 days, though it could potentially be significantly longer depending on local requirements.
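For instance, to keep the full history for a week and issues for three months (which matches the defaults) :
exec usp_ManageSQLServerStateHistory @KeepAllHistoryForDays = 7, @KeepIssueHistoryForDays = 90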
Next comes scheduling. This simply creates a scheduled job on your designated monitoring server; you will need to alter the Operator name (to ensure you get the failure email) and the email destination listed in the scheduled job :
USE [msdb]
GO
/*
Check State
Schedule Checking
Chris Page 2013
*/
/****** Object: Job [DBA Maintenance Tasks : Check & Report SQL Server Service States] Script Date: 02/04/2013 13:39:23 ******/
BEGIN TRANSACTION
DECLARE @ReturnCode INT
SELECT @ReturnCode = 0
/****** Object: JobCategory [[Uncategorized (Local)]]] Script Date: 02/04/2013 13:39:23 ******/
IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=N'[Uncategorized (Local)]' AND category_class=1)
BEGIN
EXEC @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'[Uncategorized (Local)]'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
END
DECLARE @jobId BINARY(16)
EXEC @ReturnCode = msdb.dbo.sp_add_job @job_name=N'DBA Maintenance Tasks : Check & Report SQL Server Service States',
@enabled=1,
@notify_level_eventlog=0,
@notify_level_email=2,
@notify_level_netsend=0,
@notify_level_page=0,
@delete_level=0,
@description=N'No description available.',
@category_name=N'[Uncategorized (Local)]',
@owner_login_name=N'chris.page',
@notify_email_operator_name=N'SQLAdministration', @job_id = @jobId OUTPUT
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
/****** Object: Step [Check] Script Date: 02/04/2013 13:39:23 ******/
EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'Check',
@step_id=1,
@cmdexec_success_code=0,
@on_success_action=3,
@on_success_step_id=0,
@on_fail_action=2,
@on_fail_step_id=0,
@retry_attempts=0,
@retry_interval=0,
@os_run_priority=0, @subsystem=N'TSQL',
@command=N'usp_checkstateofsqlservers',
@database_name=N'DBAUtility',
@flags=0
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
/****** Object: Step [Report] Script Date: 02/04/2013 13:39:23 ******/
EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'Report',
@step_id=2,
@cmdexec_success_code=0,
@on_success_action=1,
@on_success_step_id=0,
@on_fail_action=2,
@on_fail_step_id=0,
@retry_attempts=0,
@retry_interval=0,
@os_run_priority=0, @subsystem=N'TSQL',
@command=N'usp_ReportStateOfSQLServers @Destination=''email@address''',
@database_name=N'DBAUtility',
@flags=0
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobschedule @job_id=@jobId, @name=N'DBA Maintenance Tasks : Daily',
@enabled=1,
@freq_type=4,
@freq_interval=1,
@freq_subday_type=8,
@freq_subday_interval=1,
@freq_relative_interval=0,
@freq_recurrence_factor=0,
@active_start_date=20120914,
@active_end_date=99991231,
@active_start_time=90000,
@active_end_time=170000
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
COMMIT TRANSACTION
GOTO EndSave
QuitWithRollback:
IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION
EndSave:
GO
Last but not least is logging deployment. One day I will write something clever to keep all deployment logs in SQL Server and be able to output a list of everything that needs to be updated (automating the updates might be a step too far, in case there is a reason machines are out of step). This is just a screenshot :
All I've done is create a spreadsheet with conditional formatting on column B such that if the date in column B for any server (so B5 onwards) is before the date in B3, it shows the cell as red. Not entirely scientific, but what I've also noted in red text is that the version notes need to be reviewed before deploying to every server in the estate, as potentially the update is minor and can wait.
Enjoy.
Wednesday, 6 February 2013
Adventures with Powershell, turning it into BAU
Having tamed the security beast for the time being, I have now been able to get my version of the server monitoring tool in place and running reliably for a few days.
The biggest issue I have encountered though is that our service provider has supplied us with such an array of down servers that I haven't had a chance to roll out the process to any more machines.
I now have a process that breaks into multiple parts (with some documentation for whoever comes after me), split into several "creation" files to keep functionality clear and simple once a target SQL server has been chosen (tested with 2008R2 as the host against 2005 and 2008R2 targets - Standard, Enterprise & Express editions) :
- Tables Script - creates all the tables, indexes and a view to reduce my typing when looking at results. I have included in the process a "ProcessLog" table, I have used an identical design as I have in place on a live application on another box and one day I might well merge the two together (which has prompted me to create a to do list).
- Check State process creation - creates the requisite stored procedure on the host machine.
- Report State process - creates separate functionality for reporting the state retrieved from the check state process, I separated the two because originally this was configured to email results (HTML formatted) but in testing this got annoying when I had slow / non-functioning email so I added a parameter to present output locally; it was tidier to separate checking a state from reporting on states so allowing me to run the check process once and check the results repeatedly without waiting - handy for multiple users.
- Manage history - each time the state checking occurs it moves all the current data to a history table - that way the live data is uncluttered whilst we keep a record. The maintenance process takes two parameters, firstly how long to keep any data and secondly how long to keep "issues" - that way I can, for instance, say I'm only interested in a complete log of service state information for 7 days but I'd like to keep servers which are not reported as active for 3 months. Handy for pointing at in discussions with service providers.
- Scheduling process. Via my 1-click install process I generate two "DBA Maintenance Tasks" - one called "Daily" and one "Weekly" (I suspect you know how often they run). But this checking should ultimately run more frequently (that will depend on your environment & execution time, criticality / SLAs and how on the ball the server boys are); I am thinking ultimately once an hour for normal use (and I have added another to-do item: logging checking periods).
- A deployment record. Currently this is a simple Excel sheet that shows the date of the last code update (manually entered) and a list of servers with deployment dates (all manual) with conditional formatting to highlight out of date deployments (another to do item is to make this a tidier SQL process).
I also have a 1-click install process (I'll write about this one day) that deploys all of my maintenance routines to a specified machine. What I've done to that is add "monitoring server" to the list of variables configured (yes you'll need something reliable) and updated the weekly maintenance routine to add a step for the history management action if the 1-click install is running on the monitoring server (the 1-click install is also coded for redeployment of updated code). I also automatically add a new server to the monitoring list assuming a default checking frequency.
Next I'll go through the code that does the work.
Saturday, 2 February 2013
Adventures with Powershell, knocking down the security barrier....
I have now had my routine for checking service status running for a few days but am still discovering new circumstances which need consideration. I have accounted for the obvious "server uncontactable" circumstance which is part of the point of the process but I have since altered my code further to ensure there is a catch all for circumstances I have yet to identify.
One additional circumstance has raised its ugly head above the parapet. On testing one server was not responding via the script. Running the Powershell script directly against the server resulted in a perfect run from several different machines. But running the same script via xp_cmdshell caused a failure.
I finally had to raise the query on MSDN and promptly got the answer I needed. What I was unaware of is that, in spite of running interactively via SQL Server Management Studio, xp_cmdshell runs with the rights of the SQL Server Agent account, which in this instance did not have rights to the destination machine.
Simple fix. I would imagine that for production use a dedicated proxy account with limited rights should be put in place on all monitored boxes, but we're in action again.
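For what it's worth, for callers who are not sysadmins SQL Server does let you nominate the Windows account that xp_cmdshell runs under; a sketch with placeholder credentials :
-- placeholder domain account and password - substitute a suitably locked-down account of your own
EXEC sp_xp_cmdshell_proxy_account 'DOMAIN\MonitoringProxy', 'StrongPasswordHere';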
Wednesday, 30 January 2013
Adventures with Powershell, first steps...
I keep hearing about this mysterious product that is now bundled with SQL Server in various guises called Powershell.
I've made the occasional foray in its direction, only to find a DOS prompt with all the functionality that should be in a DOS prompt - in fact, something surprisingly like a Unix command line but in DOS. Of course, I'll probably find a horse's head next to me in the morning for suggesting such heresy.
Anyway, to date I have found no practical purpose for me or the environment that I manage, because everything it seems to offer I can do already via SQL Agent without having to learn arcane Powershell scripting and magic words.
But no, I have one nagging issue.
Prior to some recent data centre activity, we DBA types had configured a job that collected, on a daily basis, stats from each SQL server and stored them at a single point. This was entirely hand carved using T-SQL, SSIS & the SQL Agent and worked beautifully on SQL Server 2000, 2005 & 2008R2. The only flaw was that the first method coded (and, due to time constraints, never altered) worked in an entirely linear and fault-intolerant fashion - so if server A was uncontactable the SSIS step would fail and the whole process failed for the day. This would have been fine if it wasn't for service companies who seem to make it their mission to make networks about as unreliable and poorly performing as possible, to keep themselves in work while not having to tax themselves too hard to be forced into remedying the situation.
So, four years of living with an unreliable but generally working situation became the accepted norm due to massive volumes of actual work - why bother monitoring or checking when there are dozens of calls a day telling us how bad things are.
This cannot continue.
Enter the next service provider, who promises everything will be fabulous; they look after these babies in their sleep. Contract signed, and responsibility for managing services on Windows servers duly ensconced with said service provider, who has the tools to manage the situation.
Said tools, yours truly & colleagues, don't cost the service provider a penny but do provide a regular parade of helpdesk calls along the lines of "service x isn't working, the users are jumping on our heads, why didn't you notice as they've started wearing studs". This is of course good for the service desk because it keeps their volume of calls up whilst they can file under first time fix because the same said tools have had to resort to managing the services themselves.
This toolset is actually pretty good, because you can find extensions of the toolset sitting in other teams who merrily report directly to the biggest tools that server x is down again. Naturally they won't report it to the helpdesk, because they don't actually help: a change to the environment (bringing a server back up) is a change request, and investigating why it went down is complex enough to be handed to third line, but because it mentions "database" in the text somewhere they shunt it into the DBA queue, which puts us all in limbo because the DBA queue is not actively monitored since the same service provider decided that staff of the client in non application support roles can't use the ticketing system.
I'm feeling better now.
So the fundamental problem is that we need to know when servers or services are down before anyone from the user community has a chance to nip home for their stilettos. We cannot rely on the service company for anything. What we need is a means of checking server status and service status whilst minimising the need to connect to SQL boxes but still keeping control of the entire process within the SQL world so that we can stop anyone else scuppering us further.
Enter Powershell (I got there in the end).
There is an interesting article I came across recently that introduces something very close to what I need written by Jugal Shah on MSSQLTIPS.
The principle works nicely. Powershell appears rather wordy, so performance isn't great (30 seconds to check 4 servers) and it's not going to be run every few seconds, but anything around an hour will do fine for our purposes, and I suspect there must be some ways to reduce the volume of text and hence the run times.
What it doesn't do is put any control and coordination around the principle that we would expect in our environment, so next time I'll post our implementation which works in our environment.
Tuesday, 22 January 2013
Progress Updates
A colleague not too long ago introduced me to sys.dm_exec_requests, which has a percent_complete column. It has significantly changed our view of longer processes and lends a degree of confidence where before we would be sitting there wondering if it was time to reboot.
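A minimal sketch of the sort of query this enables, watching backup and restore progress :
select session_id, command, percent_complete,
estimated_completion_time / 60000.0 as estimated_minutes_remaining
from sys.dm_exec_requests
where command in ('BACKUP DATABASE', 'RESTORE DATABASE')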
I have, however, encountered a notable problem: on a reasonably sized database restore (this will depend on infrastructure, but I've seen it above 100GB) progress sits at 0% for what seems like an eternity.
In fact, the time before progress starts getting logged has seemed to be about how long it would take to create the files and space required and do almost all of the restore process.
I have now discovered more detail that might be useful next time I encounter the "problem" because it appears that recovery is split into 5 phases :
- Copy of data from the backup to database pages.
- Discovery
- Analysis
- Redo
- Undo
I think I will find that SQL is reporting 0% while the data is being copied back to the data files; given that this happens as a transaction it remains 0% until it completes, which explains the behaviour being seen.
It seems that the view sys.dm_tran_database_transactions will give further detail during the undo phase, where the database_transaction_log_bytes_reserved column will show decreasing volumes for the given database.
Further, where there are several transactions to undo, the database_transaction_next_undo_lsn column will provide additional detail.
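For reference, a minimal sketch of the sort of check I mean, pulling the two views above together - the filter on command and the database name are assumptions for illustration, not a definitive implementation :

-- How far along are any running backups or restores?
SELECT r.session_id,
       r.command,
       r.percent_complete,
       r.estimated_completion_time / 60000 AS est_minutes_remaining,
       r.start_time
FROM sys.dm_exec_requests AS r
WHERE r.command IN ('RESTORE DATABASE', 'BACKUP DATABASE');

-- During the undo phase, watch the reserved log bytes fall for the database in question
SELECT dt.database_id,
       dt.database_transaction_log_bytes_reserved,
       dt.database_transaction_next_undo_lsn
FROM sys.dm_tran_database_transactions AS dt
WHERE dt.database_id = DB_ID('RestoringDatabase');   -- substitute the real database name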
Monday, 14 January 2013
Memory issues
I've a SQL script that produces cumulative waits (sys.dm_os_wait_stats), the key items for review (same source, but giving each wait type's percentage of all waits and ignoring "benign" wait types), then reproduces the same information as a delta so current waits can be seen, and finally looks at current activities with waits so it is possible to see what is hurting (via sys.dm_os_waiting_tasks and sys.dm_exec_sessions / sys.dm_exec_requests).
This gives me a broad summary of what's going on and is a good indicator of where to look when there are problems; what's more, it generates a good viewpoint without taking an age to run, which is always good news as the only boxes it needs to be executed against are those with problems.
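As a flavour of it, here is a much cut-down sketch of the cumulative part - the list of "benign" wait types to ignore is only a sample, the real script carries a far longer list and also captures the deltas :

-- Cumulative waits with each type's share of the total, ignoring a few benign types
SELECT wait_type,
       wait_time_ms,
       signal_wait_time_ms,
       waiting_tasks_count,
       100.0 * wait_time_ms / SUM(wait_time_ms) OVER () AS pct_of_all_waits
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('LAZYWRITER_SLEEP', 'SQLTRACE_BUFFER_FLUSH', 'CHECKPOINT_QUEUE',
                        'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'WAITFOR',
                        'BROKER_TASK_STOP', 'SLEEP_TASK')
ORDER BY wait_time_ms DESC;

-- What is waiting right now, and for what
SELECT wt.session_id, wt.wait_type, wt.wait_duration_ms, r.command, s.login_name
FROM sys.dm_os_waiting_tasks AS wt
JOIN sys.dm_exec_requests AS r ON r.session_id = wt.session_id
JOIN sys.dm_exec_sessions AS s ON s.session_id = wt.session_id
WHERE s.is_user_process = 1;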
What I also have is an accumulated list of wait types and their meanings; I'm not sure where I picked up the initial list but as time has gone on I've added to it and clarified the meanings. One area I have been looking at lately is memory grants, because I had a server which continually has outstanding memory grants - within the last 12 months we've gone from 16Gb to 40Gb and I think the machine would still like more. The core database suffers a range of problems including the wrong fields / no fields being indexed, uniqueidentifiers, non unique primary keys and a failure to agree an archive policy prior to implementation, but that's a little out of scope here. What I do know is that the machine is now largely stable but could do with more memory (EDIT: 10 days on, performance problems have persisted and more RAM would definitely be good).
During the course of investigating I have collected further information about the RESOURCE_SEMAPHORE wait type, which links that wait to memory shortages and ties in with the outstanding grants issue.
In this article, MSSQL Tips : Resource_Semaphore waits, the requested_memory_kb column from sys.dm_exec_query_memory_grants is highlighted as showing how much memory has been requested, and the article then talks about joining sys.dm_exec_query_memory_grants to sys.dm_exec_sql_text(sql_handle) to look at the largest memory requests.
Having been really rather instructive, the article then fizzles out and concludes that the answer is to look at indexes, or the lack thereof.
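If it helps, a minimal sketch along the same lines as the article - the biggest current memory requests and the statement behind each one (just a sketch, the article's own queries go further) :

SELECT mg.session_id,
       mg.requested_memory_kb,
       mg.granted_memory_kb,
       mg.wait_time_ms,                  -- time spent queued waiting for the grant
       t.text AS statement_text
FROM sys.dm_exec_query_memory_grants AS mg
CROSS APPLY sys.dm_exec_sql_text(mg.sql_handle) AS t
ORDER BY mg.requested_memory_kb DESC;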
That is a good starting point and all too often basic indexes are the fix, but indexes can be a minefield; vendors are often very sensitive to changes being made, and if they discover changes have been made without agreement they will blame all future problems on that one thing - I had a vendor once offer to accept the idea of creating two indexes (which I had tested and which transformed their system's performance) but, having received the creation statements from me, want to charge £800 to allow us to run them (I would have been the one running them in the production instance). Further, changing indexes on a mature system can often lead to unexpected effects that are difficult to predict without testing the application in greater depth.
Looking further at the memory requested, I would suggest there is a case for considering other factors too - yes, they will be equally contentious with vendors, but adding indexes might be a lazy solution :
- Are joins of the correct type (can that outer be replaced with an inner) and are all join criteria specified?
- Are functions being executed against fields in the where clause that might impact index usage? Perhaps only in the future, but always worth checking, as comparisons against variables can sometimes be rewritten so the manipulation is applied to the variable instead of the column.
- Is the query loading tables and fields that add no value?
- Are there varchar fields involved? The query optimiser will usually estimate that a varchar takes half its defined length for memory requirements - have a look at "Estimated Row Size" in the query plan. If your varchar fields are frequently using a lot less, or a lot more, than half their defined size the row size estimates might be a long way out, especially for longer fields. You can test this for yourself: turn on the actual query plan and try :
select cast('hello' as varchar(30)) from dbo.AnyTable -- substitute any table you have to hand
Right click on the "compute scalar" item and have a look at the estimate row size (I see 29Bytes), now change the varchar(30) to text or varchar(max), nvarchar(30) or anything else and see what sizes you get. For varchar(max) I get an estimate of 4035B - in the table I picked I have 129 rows so the difference is (129 * 4035) - (129 * 29) - thats 504Kb difference on just 129 rows, it'll soon add up. - How many rows does the query plan report will be retrieved - I've just been looking at a query that said it was retrieving 233,131 rows which didn't match the 136 rows I could see in the result set. Looking further at the query and the context from Profiler I can see that the query is generating a parameter list for an SSRS report - still too many rows though but the "distinct" reveals why I see so few records. More examination clarifies that the number of records reported is because the developer wants to get "types" of something that are used and is achieving that by an inner join between the types table (225 rows) and the "items" table that has 15m rows - replacing the inner join with an exists reduces the impact somewhat - runtime reduced from 12 seconds to less than 1.
I'm not saying this is the be all and end all of query optimisation - it's a massive subject and I am grateful the optimiser does a brilliant job most of the time - I am simply trying to highlight other potential issues.
Sunday, 13 January 2013
3NF reminder
Normalisation can be a minefield for some, but I find that when I try to normalise data I don't follow 3 distinct steps to get to 3NF; I just end up with a single operation that weeds out all three things in one go.
I am conscious that the 3 steps become more relevant in unfamiliar situations and for newbies, but the descriptions I learned :
- Remove repeating groups
- Full functional dependency
- Transitive dependency
are pretty useless for succinctly describing the purpose - I recently tried to explain to a newbie what these mean and ended up with a less than brilliant explanation that made little sense to me, let alone someone starting out.
However, I recently came across this meaningful summary :
- No repeating elements or groups of elements
- No partial dependencies on a concatenated key
- No dependencies on non-key attributes
which explains everything in a memorable form: http://phlonx.com/resources/nf3/
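To hang those three rules on something concrete, here is a small made-up example (purely illustrative table names) - in the first table the customer's city depends on the customer, a non-key attribute, so the third rule is broken :

-- Breaks 3NF: CustomerCity depends on CustomerId, not on the key OrderId
CREATE TABLE dbo.CustomerOrders
(
    OrderId      int         NOT NULL PRIMARY KEY,
    CustomerId   int         NOT NULL,
    CustomerCity varchar(50) NOT NULL,
    OrderDate    date        NOT NULL
);

-- 3NF: move the transitively dependent attribute into its own table
CREATE TABLE dbo.Customers
(
    CustomerId   int         NOT NULL PRIMARY KEY,
    CustomerCity varchar(50) NOT NULL
);

CREATE TABLE dbo.Orders
(
    OrderId    int  NOT NULL PRIMARY KEY,
    CustomerId int  NOT NULL REFERENCES dbo.Customers (CustomerId),
    OrderDate  date NOT NULL
);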
Saturday, 12 January 2013
Configure Perfmon with counters loaded
Configuring a set of counters in perfmon is pretty easy, but saving them for reuse normally means having to create a "scheduled" trace, run it and then open the data. Here is a documented way to save your settings for easier access : http://www.sqlservergeeks.com/blogs/AmitBansal/sql-server-bi/726/sql-server-open-windows-performance-monitor-with-your-default-set-of-counters
Monday, 7 January 2013
Statistics Updates
In the relatively recent past I have been redeveloping index rebuild functionality that I set up for use by an employer.
We now have a weekly routine that rebuilds anything above 40% fragmentation and a daily routine for anything above 90%. I would like to lower those numbers, but our SAN crashes when too much disk activity hits it and the SAN boys are at a loss as to how to resolve that, so I need to manage the volume of activity a little - it's not a great compromise but it helps.
What we also do at the end of the index rebuild process is EXEC SP_UPDATESTATS to force an update of any statistics that need it. SP_UPDATESTATS only operates on stats that need updating (it still has to check them all) and usefully outputs a list of everything and the operation undertaken. The alternative, UPDATE STATISTICS, does an update regardless and so would significantly increase the overall load. I've just found yet more useful information on statistics updates on Kimberley's site, which talks about parameter sniffing much more succinctly than anything I have seen elsewhere.
The most recent change I have had included is that, for each table, the first action is to rebuild any non-unique clustered indexes, which are common in one system I use (and invariably easy to fix, as I've tried to explain) - this is because (certainly in SQL Server 2005 and above) rebuilding a non-unique clustered index rebuilds all the other indexes on that table too, so we check for the need to rebuild those non-unique clusters first and, if that happens, skip all other indexes on the given table to avoid duplicating the work.
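For what it's worth, a simplified sketch of the selection part of the routine - the threshold and scan mode are parameters in the real thing, and the small-index filter here is an assumption :

DECLARE @threshold float = 40.0;        -- 90.0 for the daily run

SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name                     AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
       ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > @threshold
  AND ips.page_count > 100              -- ignore tiny indexes
ORDER BY ips.avg_fragmentation_in_percent DESC;

-- then rebuild as required and finish with
EXEC SP_UPDATESTATS;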
Work outstanding :
- I have spotted how to identify the sample size set by the designer on statistics, so I would like to alter the code to reuse that sample size so that design decisions remain in place; we have become increasingly conscious of the value of appropriate sample sizes whilst wanting to avoid 100% (a sketch of the idea follows this list).
- Identify a means of deciding, per database / table, if automatic stats updates are allowable so reducing the risk of updates during operational hours - this might mean some tracing to identify if any of the problems we are encountering are caused by this activity.
- Review use of online index rebuilds and tempdb rebuilds - in testing these have worked really well.
- Review the use of reorganisation; at one point we were not reorganising because sometimes it took longer than a rebuild, but because of its page by page nature it can be stopped and works better alongside other activity, so it is possible we want to raise the percentage at which a rebuild becomes acceptable - I think that needs deciding server by server and needs better consideration.
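On the first item, a sketch of the idea, assuming a build new enough to expose sys.dm_db_stats_properties (2008 R2 SP2 / 2012 SP1 onwards) - it works out the sample rate each statistic was last gathered with so a rebuild could reuse it :

SELECT OBJECT_NAME(s.object_id) AS table_name,
       s.name                   AS stats_name,
       sp.rows,
       sp.rows_sampled,
       CAST(100.0 * sp.rows_sampled / NULLIF(sp.rows, 0) AS decimal(5,2)) AS sample_pct,
       sp.last_updated
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
ORDER BY sample_pct;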
Sunday, 6 January 2013
Clustered Indexes
I've been rummaging further on Kimberley's site; one area I've spent time trying to explain to developers & 3rd parties is how to use clustered indexes more effectively.
What I encounter is a lack of understanding of basic clusters; BOL does not reveal everything, but reading it carefully does give the key points. The main culprits are not having a clustered key where one is obvious and putting the cluster on the wrong field.
One provider I work with has a table with around 100 million rows and the cluster on a field that is NEVER used to reference the data. However, I am not allowed to modify the database in any way, so we copy the updates from the table every day, create our own version of the table and index it properly - reducing our nightly processing run from around 10 hours to 3 (I'm not claiming the process is perfect, but having the correct clustered index makes a notable difference)....if only indexed views could work inter-database [sigh].
However, Kimberly has provided more detail that I was unaware of which adds further thoughts to creating a clustered index.
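As a flavour of that workaround, a hedged sketch with invented names - simplified here to a full copy rather than just the daily changes :

-- Rebuild our own copy of the vendor table with a useful clustered key
IF OBJECT_ID('dbo.BigTable_Copy') IS NOT NULL
    DROP TABLE dbo.BigTable_Copy;

SELECT *
INTO dbo.BigTable_Copy
FROM VendorDb.dbo.BigTable;             -- hypothetical source database and table

CREATE CLUSTERED INDEX CIX_BigTable_Copy_AccountRef
    ON dbo.BigTable_Copy (AccountRef);  -- the column queries actually seek on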
Saturday, 5 January 2013
Duplicated Indexes
I've just started looking in more detail at duplicated indexes and have been trying to figure out a means of properly identifying duplicates - the awkward part, from my perspective, being that the order of included columns is irrelevant - but luckily someone has already done the work for me :
Kimberley L Tripp, Duplicated Indexes
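For what it's worth, a much simplified sketch of the sort of check I had in mind - it only flags indexes whose key columns match exactly and ignores included columns entirely, so it is no substitute for the script linked above :

WITH IndexKeys AS
(
    SELECT i.object_id,
           i.index_id,
           i.name,
           -- build an ordered, comma-separated list of the key columns
           key_cols = STUFF((SELECT ',' + c.name
                                    + CASE WHEN ic.is_descending_key = 1 THEN ' DESC' ELSE '' END
                             FROM sys.index_columns AS ic
                             JOIN sys.columns AS c
                                    ON c.object_id = ic.object_id AND c.column_id = ic.column_id
                             WHERE ic.object_id = i.object_id
                               AND ic.index_id = i.index_id
                               AND ic.is_included_column = 0
                             ORDER BY ic.key_ordinal
                             FOR XML PATH('')), 1, 1, '')
    FROM sys.indexes AS i
    WHERE i.index_id > 0
      AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
)
SELECT OBJECT_NAME(a.object_id) AS table_name,
       a.name AS index_name,
       b.name AS duplicate_of,
       a.key_cols
FROM IndexKeys AS a
JOIN IndexKeys AS b
       ON b.object_id = a.object_id
      AND b.key_cols = a.key_cols
      AND b.index_id < a.index_id;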
Thursday, 3 January 2013
Still here
Time to resurrect this as a store of useful information rather than having notes buried in paper notebooks.