Since we have migrated our 2000 databases to SQL 2005 on new hardware (described below), we have experienced re-occurring corruption on the same tables and indexes even after the corruption had been resolved by dropping/re-creating the indexes manually.  Typical events include:
1.  Clients receiving errors, such as 8646
2.  We see the errors in the event logs, run DBCC's to confirm the errors which then creates a dump file
3.  We resolve by dropping/creating the index(es) affected
4.  Re-run the DBCC checks and they come up clean.
A day or two later, the entire cycle will repeat against the same object/index.  This problem exists on multiple different databases on multiple different servers, all with the same structure and hardware.  Our SAN engineers say there are no issues they can see with the disks.
I would be very appreciative for any suggestions to better troubleshoot the issue, or, of course, a resolution would be magnificent!!
Here is as much information about our setup and errors experienced as I can think of to provide.  Of course, if I am missing something just shout and I will include it.  TIA.
We are running SQL 2005 enterprise x64 sp1 on Windows 2003 enterprise x64 sp1 as a 2-node active/passive cluster (native windows clustering).  The hardware is HP Proliant DL585, 4 dual-core processors @. 2.8 GHz, 48 GB of RAM.
When clustering the servers, we pay special attention to ensure the NIC settings are as specified by published MS white papers both at the NIC level and as cluster configurations.  I can provide those settings if necessary.
Other pertinent settings:
- We set the SQL Server's min/max setting to both be 40960 MB (40GB)
- MAXDOP to 2
- System DB's share two drives, one for data, one for log (all drives are RAID or 10)
- TempDB has 1 drive for data, one for log, and we modify TempDB to have as many data files as processors
- User DBs share multiple pairs of drives, each pair has a dedicated data drive and dedicated log drive
- Backups (litespeed) go to a dedicated drive
- Full-text catalogs go to a dedicated drive
- Maintenance jobs run every night and include: Reindex, update stats, backups and other miscellaneous jobs specific to our environment
- We set physical database files to grow by 200MB increments when over 2GB in size and we proactively grow each file overnight if the file's free space is < 10%Additional error and dump file info...
The typical app log entry for error 8646:
Unable to find index entry in index ID 1, of table 894626230, in database 'dbname'. The indicated index is corrupt or there is a problem with the current update plan. Run DBCC CHECKDB or DBCC CHECKTABLE. If the problem persists, contact product support.
Contents of the dump file:
This file is generated by Microsoft SQL Server                                                                   
version 9.00.3042.00                                                                                             
upon detection of fatal unexpected error. Please return this file,                                               
the query or program that produced the bugcheck, the database and                                                
the error log, and any other pertinent information with a Service Request.                                       
                                                                                                                 
                                                                                                                 
Computer type is AT/AT COMPATIBLE.                                                                               
Current time is 07:32:39 11/07/07                                                                     
8 Unknown CPU 9., 2813 Mhz processor (s).                                                                        
Windows NT 5.2 Build 3790 CSD Service Pack 1.                                                                    
                                     
Memory                               
MemoryLoad = 92%                     
Total Physical = 49149 MB            
Available Physical = 3913 MB         
Total Page File = 60149 MB           
Available Page File = 15932 MB       
Total Virtual = 8388607 MB           
Available Virtual = 8338218 MB       
DBCC RESULTS
-------
<DbccResults>                                                                                                    
<Dbcc ID="0" Error="2511" Severity="16" State="1">Table error: Object ID 894626230, index ID 1, partition ID 7205
7596565782528, alloc unit ID 72057596808986624 (type In-row data). Keys out of order on page (1:487717), slots 3 
and 4.</Dbcc>                                                                                                    
<Dbcc ID="1" Error="2511" Severity="16" State="1">Table error: Object ID 894626230, index ID 1, partition ID 7205
7596565782528, alloc unit ID 72057596808986624 (type In-row data). Keys out of order on page (1:487717), slots 79
 and 80.</Dbcc>                                                                                            
...and so on...|||and last but not least...
* Short Stack Dump                   
PSS @.0x000000045399FA80
--------
CSession @.0x000000045399E410
---------
m_spid = 54                          m_cRef = 14                          m_rgcRefType[0] = 1                    
m_rgcRefType[1] = 1                  m_rgcRefType[2] = 11                 m_rgcRefType[3] = 1                    
m_rgcRefType[4] = 0                  m_rgcRefType[5] = 0                  m_pmo = 0x000000045399E080             
m_pstackBhfPool = 0x0000000000000000 m_dwLoginFlags = 0x83e0              m_fBackground = 0                      
m_fClientRequestConnReset = 0        m_fUserProc = -1                     m_fConnReset = 0                       
m_fIsConnReset = 0                   m_fInLogin = 0                       m_fAuditLoginSent = 1                  
m_fAuditLoginFailedSent = 0          m_fReplRelease = 0                   m_fKill = 0                            
m_ulLoginStamp = 54267395            m_eclClient = 5                      m_protType = 5                         
m_hHttpToken = FFFFFFFFFFFFFFFF      
m_pV7LoginRec
-------
0000000000000000:   d60b0000 02000972 00100000 00000007 08190000 ......r.........                           
0000000000000014:   00000000 e0830000 2c010000 09040000 5e000c00 ......,......^...                           
0000000000000028:   00000000 00000000 76000700 84000000 00000000 ......v.........                           
000000000000003C:   84000400 8c000000 8c000000 00163582 4ea48c00 ...........5.N...                           
0000000000000050:   4a0b8c00 00000000 00000000 0000J..........                                 
CPhysicalConnection @.0x000000045399E2E0
------------
m_pPhyConn->m_pmo = 0x000000045399E080                                                                           
m_pPhyConn->m_pNetConn = 0x000000045399EB00                                                                      
m_pPhyConn->m_pConnList = 0x000000045399E3E0                                                                     
m_pPhyConn->m_pSess = 0x000000045399E410                                  m_pPhyConn->m_fTracked = -1            
m_pPhyConn->m_cbPacketsize = 4096    m_pPhyConn->m_fMars = 0              m_pPhyConn->m_fKill = 0                
CBatch @.0x000000045399EFD0
--------
m_pSess = 0x000000045399E410         m_pConn = 0x000000045399EEC0         m_cRef = 3                             
m_rgcRefType[0] = 1                  m_rgcRefType[1] = 1                  m_rgcRefType[2] = 1                    
m_rgcRefType[3] = 0                  m_rgcRefType[4] = 0                  m_pTask = 0x00000000038085C8           
EXCEPT (null) @.0x00000000507BA138
----------
exc_number = 0                       exc_severity = 0                     exc_func = 0x0000000001873F50          
Task @.0x00000000038085C8
--------
CPU Ticks used (ms) = 6134           Task State = 2                                                              
WAITINFO_INTERNAL: WaitResource = 0x0000000000000000                      WAITINFO_INTERNAL: WaitType = 0x0      
WAITINFO_INTERNAL: WaitSpinlock = 0x0000000000000000                      SchedulerId = 0x2                      
ThreadId = 0x1cc0                    m_state = 0                          m_eAbortSev = 0                        
EC @.0x000000045399FA90
-------
spid = 54                            ecid = 0                             ec_stat = 0x0                          
ec_stat2 = 0x0                       ec_atomic = 0x0                      __fSubProc = 1                         
ec_dbccContext = 0x00000000507BEF50  __pSETLS = 0x000000045399EF30        __pSEParams = 0x000000045399F350       
__pDbLocks = 0x000000045399FEE0      
SEInternalTLS @.0x000000045399EF30
----------
 m_flags = 0                          m_TLSstatus = 3                      m_owningTask = 0x00000000038085C8     
 m_activeHeapDatasetList = 0x000000045399EF30                                                                    
 m_activeIndexDatasetList = 0x000000045399EF40                            
SEParams @.0x000000045399F350
---------
 m_lockTimeout = -1                   m_isoLevel = 4096                    m_logDontReplicate = 0                
 m_neverReplicate = 0                 m_XactWorkspace = 0x0000000BA3DDB9D0                                       
 m_pSessionLocks = 0x0000000780BE6F10                                      m_pDbLocks = 0x000000045399FEE0       
 m_execStats = 0x0000000082256E40     m_pAllocFileLimit = 0x0000000000000000|||Do the machines have ECC RAM?|||Yes, all our servers have ECC RAM, which I believe is a req. for the HP servers.
Tuesday, March 20, 2012
Plagued by corruption
Labels:
below,
corruption,
database,
databases,
described,
experienced,
hardware,
microsoft,
migrated,
mysql,
oracle,
plagued,
re-occurring,
server,
sql
Subscribe to:
Post Comments (Atom)
 
No comments:
Post a Comment