Tuesday, March 20, 2012

Plagued by corruption

Since we have migrated our 2000 databases to SQL 2005 on new hardware (described below), we have experienced re-occurring corruption on the same tables and indexes even after the corruption had been resolved by dropping/re-creating the indexes manually. Typical events include:

1. Clients receiving errors, such as 8646
2. We see the errors in the event logs, run DBCC's to confirm the errors which then creates a dump file
3. We resolve by dropping/creating the index(es) affected
4. Re-run the DBCC checks and they come up clean.

A day or two later, the entire cycle will repeat against the same object/index. This problem exists on multiple different databases on multiple different servers, all with the same structure and hardware. Our SAN engineers say there are no issues they can see with the disks.

I would be very appreciative for any suggestions to better troubleshoot the issue, or, of course, a resolution would be magnificent!!

Here is as much information about our setup and errors experienced as I can think of to provide. Of course, if I am missing something just shout and I will include it. TIA.

We are running SQL 2005 enterprise x64 sp1 on Windows 2003 enterprise x64 sp1 as a 2-node active/passive cluster (native windows clustering). The hardware is HP Proliant DL585, 4 dual-core processors @. 2.8 GHz, 48 GB of RAM.

When clustering the servers, we pay special attention to ensure the NIC settings are as specified by published MS white papers both at the NIC level and as cluster configurations. I can provide those settings if necessary.

Other pertinent settings:
- We set the SQL Server's min/max setting to both be 40960 MB (40GB)
- MAXDOP to 2
- System DB's share two drives, one for data, one for log (all drives are RAID or 10)
- TempDB has 1 drive for data, one for log, and we modify TempDB to have as many data files as processors
- User DBs share multiple pairs of drives, each pair has a dedicated data drive and dedicated log drive
- Backups (litespeed) go to a dedicated drive
- Full-text catalogs go to a dedicated drive
- Maintenance jobs run every night and include: Reindex, update stats, backups and other miscellaneous jobs specific to our environment
- We set physical database files to grow by 200MB increments when over 2GB in size and we proactively grow each file overnight if the file's free space is < 10%Additional error and dump file info...

The typical app log entry for error 8646:

Unable to find index entry in index ID 1, of table 894626230, in database 'dbname'. The indicated index is corrupt or there is a problem with the current update plan. Run DBCC CHECKDB or DBCC CHECKTABLE. If the problem persists, contact product support.

Contents of the dump file:

This file is generated by Microsoft SQL Server
version 9.00.3042.00
upon detection of fatal unexpected error. Please return this file,
the query or program that produced the bugcheck, the database and
the error log, and any other pertinent information with a Service Request.


Computer type is AT/AT COMPATIBLE.
Current time is 07:32:39 11/07/07
8 Unknown CPU 9., 2813 Mhz processor (s).
Windows NT 5.2 Build 3790 CSD Service Pack 1.

Memory
MemoryLoad = 92%
Total Physical = 49149 MB
Available Physical = 3913 MB
Total Page File = 60149 MB
Available Page File = 15932 MB
Total Virtual = 8388607 MB
Available Virtual = 8338218 MB

DBCC RESULTS
-------
<DbccResults>
<Dbcc ID="0" Error="2511" Severity="16" State="1">Table error: Object ID 894626230, index ID 1, partition ID 7205
7596565782528, alloc unit ID 72057596808986624 (type In-row data). Keys out of order on page (1:487717), slots 3
and 4.</Dbcc>
<Dbcc ID="1" Error="2511" Severity="16" State="1">Table error: Object ID 894626230, index ID 1, partition ID 7205
7596565782528, alloc unit ID 72057596808986624 (type In-row data). Keys out of order on page (1:487717), slots 79
and 80.</Dbcc>

...and so on...|||and last but not least...

* Short Stack Dump

PSS @.0x000000045399FA80
--------

CSession @.0x000000045399E410
---------
m_spid = 54 m_cRef = 14 m_rgcRefType[0] = 1
m_rgcRefType[1] = 1 m_rgcRefType[2] = 11 m_rgcRefType[3] = 1
m_rgcRefType[4] = 0 m_rgcRefType[5] = 0 m_pmo = 0x000000045399E080
m_pstackBhfPool = 0x0000000000000000 m_dwLoginFlags = 0x83e0 m_fBackground = 0
m_fClientRequestConnReset = 0 m_fUserProc = -1 m_fConnReset = 0
m_fIsConnReset = 0 m_fInLogin = 0 m_fAuditLoginSent = 1
m_fAuditLoginFailedSent = 0 m_fReplRelease = 0 m_fKill = 0
m_ulLoginStamp = 54267395 m_eclClient = 5 m_protType = 5
m_hHttpToken = FFFFFFFFFFFFFFFF

m_pV7LoginRec
-------
0000000000000000: d60b0000 02000972 00100000 00000007 08190000 ......r.........

0000000000000014: 00000000 e0830000 2c010000 09040000 5e000c00 ......,......^...

0000000000000028: 00000000 00000000 76000700 84000000 00000000 ......v.........

000000000000003C: 84000400 8c000000 8c000000 00163582 4ea48c00 ...........5.N...

0000000000000050: 4a0b8c00 00000000 00000000 0000J..........

CPhysicalConnection @.0x000000045399E2E0
------------
m_pPhyConn->m_pmo = 0x000000045399E080
m_pPhyConn->m_pNetConn = 0x000000045399EB00
m_pPhyConn->m_pConnList = 0x000000045399E3E0
m_pPhyConn->m_pSess = 0x000000045399E410 m_pPhyConn->m_fTracked = -1
m_pPhyConn->m_cbPacketsize = 4096 m_pPhyConn->m_fMars = 0 m_pPhyConn->m_fKill = 0

CBatch @.0x000000045399EFD0
--------
m_pSess = 0x000000045399E410 m_pConn = 0x000000045399EEC0 m_cRef = 3
m_rgcRefType[0] = 1 m_rgcRefType[1] = 1 m_rgcRefType[2] = 1
m_rgcRefType[3] = 0 m_rgcRefType[4] = 0 m_pTask = 0x00000000038085C8

EXCEPT (null) @.0x00000000507BA138
----------
exc_number = 0 exc_severity = 0 exc_func = 0x0000000001873F50

Task @.0x00000000038085C8
--------
CPU Ticks used (ms) = 6134 Task State = 2
WAITINFO_INTERNAL: WaitResource = 0x0000000000000000 WAITINFO_INTERNAL: WaitType = 0x0
WAITINFO_INTERNAL: WaitSpinlock = 0x0000000000000000 SchedulerId = 0x2
ThreadId = 0x1cc0 m_state = 0 m_eAbortSev = 0

EC @.0x000000045399FA90
-------
spid = 54 ecid = 0 ec_stat = 0x0
ec_stat2 = 0x0 ec_atomic = 0x0 __fSubProc = 1
ec_dbccContext = 0x00000000507BEF50 __pSETLS = 0x000000045399EF30 __pSEParams = 0x000000045399F350
__pDbLocks = 0x000000045399FEE0

SEInternalTLS @.0x000000045399EF30
----------
m_flags = 0 m_TLSstatus = 3 m_owningTask = 0x00000000038085C8
m_activeHeapDatasetList = 0x000000045399EF30
m_activeIndexDatasetList = 0x000000045399EF40

SEParams @.0x000000045399F350
---------
m_lockTimeout = -1 m_isoLevel = 4096 m_logDontReplicate = 0
m_neverReplicate = 0 m_XactWorkspace = 0x0000000BA3DDB9D0
m_pSessionLocks = 0x0000000780BE6F10 m_pDbLocks = 0x000000045399FEE0
m_execStats = 0x0000000082256E40 m_pAllocFileLimit = 0x0000000000000000|||Do the machines have ECC RAM?|||Yes, all our servers have ECC RAM, which I believe is a req. for the HP servers.

No comments:

Post a Comment