Matthias
2008-05-16 11:07:00 UTC
Hello all,
Yesterday one of our cluster systems performed an unexpected cluster switch (failover).
System information:
HW: ProLiant DL585 G1 / 2x AMD Opteron 2.2 GHz / 16 GB RAM
OS: Microsoft Windows Server 2003 Enterprise x64 Edition
OS Version: 5.2.3790 Service Pack 2 Build 3790
HP ProLiant Support Pack 7.90
Attached to a SAN via FC
Main Software: SAP CRM 5.0 SP15 / on MS SQL Server 2005
Support Software: Data Protector / McAfee (Enterprise 8.0.0 Patch 15) / SNARE
3.0.0
MSCS configuration:
User LAN (teamed)
Server LAN (no teaming)
Private LAN (crossover)
Cluster Group / MSDTC Group / SAP Group / SQL Group
___________________________________________________________-
The cluster log:
0000098c.00000a64::2008/05/15-15:16:43.912 INFO [DM] DmpGetSnapShotCb:
DmpGetDatabase returned 0x00000000
0000098c.00000a64::2008/05/15-15:16:43.912 INFO [Qfs] QfsGetTempFileName
Q:\MSCS\, chkpt, 8011 => Q:\MSCS\chk1F4B.tmp, status 0
0000098c.00000a64::2008/05/15-15:16:43.912 INFO [LM] DmpGetSnapshotCb:
Checkpoint file name=Q:\MSCS\chk1F4B.tmp Seq#=8011
0000098c.00000a64::2008/05/15-15:16:43.912 INFO [Qfs] QfsMoveFileEx
Q:\MSCS\chk619F.tmp=>Q:\MSCS\chk1F4B.tmp
0000098c.00000a64::2008/05/15-15:16:43.912 WARN [LM] DmpGetSnapShotCb:
Failed to move the temp file to checkpoint file,
TempFileName=Q:\MSCS\chk619F.tmp, ChkPtFileName=Q:\MSCS\chk1F4B.tmp,
Error=0x00000005
0000098c.00000a64::2008/05/15-15:16:43.912 INFO [Qfs] QfsDeleteFile
Q:\MSCS\chk619F.tmp, status 0
0000098c.00000a64::2008/05/15-15:16:43.912 WARN [LM] LogCheckPoint: Callback
failed to return a checkpoint
0000098c.00000a64::2008/05/15-15:16:43.912 WARN [LM] LogpReset:: Callback
failed to return a checkpoint, error=5
0000098c.00000a64::2008/05/15-15:16:43.912 INFO [LM] LogClose : Entry
LogFile=0x02ad7df0
0000098c.00000a64::2008/05/15-15:16:43.912 INFO [LM] LogFlush :
pLog=0x02ad7df0 writing the 1024 bytes for active page at offset 0x00000400
0000098c.00000a64::2008/05/15-15:16:43.912 INFO [Qfs] WriteFile 99c (....)
1024, status 0 (0=>0)
0000098c.00000a64::2008/05/15-15:16:43.912 INFO [Qfs] QfsFlushBuffers 99c,
status 0
0000098c.00000a64::2008/05/15-15:16:43.912 INFO [Qfs] QfsCloseHandle 99c,
status 0
0000098c.00000a64::2008/05/15-15:16:43.912 INFO [LM] LogClose : Exit
returning success
0000098c.00000a64::2008/05/15-15:16:43.912 INFO [Qfs] QfsDeleteFile
Q:\MSCS\tqu619E.tmp, status 0
0000098c.00000a64::2008/05/15-15:16:43.912 INFO [LM] LogpReset exit,
returning 0x00000005
0000098c.00000a64::2008/05/15-15:16:43.912 INFO [LM] LogReset exit,
returning 0x00000005
0000098c.00000a64::2008/05/15-15:16:43.912 ERR [DM]DmpCheckpointTimerCb -
Failed to reset log, error=5
0000098c.00000a64::2008/05/15-15:16:44.005 ERR Cluster service suffered an
unexpected fatal error at line 2324 of source module
d:\nt\base\cluster\service\dm\dmlog.c. The error code was 5.
00000f58.00000f5c::2008/05/15-15:16:45.004 WARN [RM] Going away, Status = 1,
Shutdown = 0.
00000f58.00000f5c::2008/05/15-15:16:45.004 ERR [RM] Active Resource =
00000000
00000f58.00000f5c::2008/05/15-15:16:45.004 ERR [RM] Resource State is 1, ""
00000f58.00000f5c::2008/05/15-15:16:45.004 INFO [RM] Posting shutdown
notification.
00000f38.00000f3c::2008/05/15-15:16:45.004 WARN [RM] Going away, Status = 1,
Shutdown = 0.
00000f38.00000f3c::2008/05/15-15:16:45.004 ERR [RM] Active Resource =
00000000
00000f38.00000f3c::2008/05/15-15:16:45.004 ERR [RM] Resource State is 1, ""
00000f38.00000f3c::2008/05/15-15:16:45.004 INFO [RM] Posting shutdown
notification.
00000f18.00000f1c::2008/05/15-15:16:45.004 WARN [RM] Going away, Status = 1,
Shutdown = 0.
00000f18.00000f1c::2008/05/15-15:16:45.004 ERR [RM] Active Resource =
00000000
00000f18.00000f1c::2008/05/15-15:16:45.004 ERR [RM] Resource State is 1, ""
00000f18.00000f1c::2008/05/15-15:16:45.004 INFO [RM] Posting shutdown
notification.
00000b70.00000b74::2008/05/15-15:16:45.004 WARN [RM] Going away, Status = 1,
Shutdown = 0.
00000b70.00000b74::2008/05/15-15:16:45.004 ERR [RM] Active Resource =
00000000
00000b70.00000b74::2008/05/15-15:16:45.004 ERR [RM] Resource State is 1, ""
00000b70.00000b74::2008/05/15-15:16:45.004 INFO [RM] Posting shutdown
notification.
00000b70.00000b74::2008/05/15-15:16:45.004 INFO SAP Resource <SAP CPR 00
Instance>: ResourceControl request.
00000f18.00000f34::2008/05/15-15:16:45.019 INFO [RM] NotifyChanges shutting
down.
00000f38.00000f54::2008/05/15-15:16:45.019 INFO [RM] NotifyChanges shutting
down.
00000f58.00000f74::2008/05/15-15:16:45.019 INFO [RM] NotifyChanges shutting
down.
00000b70.00000f08::2008/05/15-15:16:45.035 INFO [RM] NotifyChanges shutting
down.
00000b70.00000f10::2008/05/15-15:16:45.050 INFO Physical Disk <Disk H:>:
[DiskArb] CompletionRoutine, status 0.
00000b70.00000f10::2008/05/15-15:16:45.050 INFO Physical Disk <Disk H:>:
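To make the excerpt above easier to triage, here is a minimal sketch that joins the wrapped lines back into records, keeps only the WARN and ERR entries, and decodes the error code. The record layout is inferred from the excerpt itself, and the names parse_records, problems, error_code and WIN32_ERRORS are hypothetical helpers, not part of any cluster tooling. The one fixed fact it relies on is that Win32 error code 5 is ERROR_ACCESS_DENIED.

```python
import re

# Record header as it appears in the excerpt above (layout inferred, not official):
# <process id>.<thread id>::<date>-<time> <LEVEL> <message>
HEADER = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<date>\d{4}/\d{2}/\d{2})-(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>INFO|WARN|ERR)\s+(?P<msg>.*)$"
)

# Matches "Error=0x00000005", "error=5" and "error code was 5".
ERROR_CODE = re.compile(r"[Ee]rror(?: code was |\s*=\s*)(0x[0-9a-fA-F]+|\d+)")

# Small lookup table for illustration only; 5 is ERROR_ACCESS_DENIED.
WIN32_ERRORS = {0: "ERROR_SUCCESS", 5: "ERROR_ACCESS_DENIED"}


def parse_records(lines):
    """Join wrapped continuation lines and return one dict per log record."""
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = HEADER.match(line)
        if match:
            records.append(match.groupdict())
        elif records and line.strip():
            # A line without a header continues the previous record.
            records[-1]["msg"] += " " + line.strip()
    return records


def problems(records):
    """Return only the WARN and ERR records."""
    return [r for r in records if r["level"] in ("WARN", "ERR")]


def error_code(msg):
    """Extract a decimal or 0x-prefixed error code from a message, if any."""
    match = ERROR_CODE.search(msg)
    if not match:
        return None
    text = match.group(1)
    return int(text, 16) if text.lower().startswith("0x") else int(text)


# Two records copied from the excerpt above, still wrapped as posted.
SAMPLE = r"""
0000098c.00000a64::2008/05/15-15:16:43.912 INFO [Qfs] QfsMoveFileEx
Q:\MSCS\chk619F.tmp=>Q:\MSCS\chk1F4B.tmp
0000098c.00000a64::2008/05/15-15:16:43.912 WARN [LM] DmpGetSnapShotCb:
Failed to move the temp file to checkpoint file,
TempFileName=Q:\MSCS\chk619F.tmp, ChkPtFileName=Q:\MSCS\chk1F4B.tmp,
Error=0x00000005
""".strip().splitlines()

if __name__ == "__main__":
    for rec in problems(parse_records(SAMPLE)):
        code = error_code(rec["msg"])
        print(rec["time"], rec["level"], code, WIN32_ERRORS.get(code, "unknown"))
```

Run against the full log, this should reduce the excerpt to the failed QfsMoveFileEx on Q:\MSCS and the errors that follow from it, all carrying the same access-denied code.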
There are also errors in the event log:
Event Type: Error
Event Source: ClusSvc
Event Category: Log Mgr
Event ID: 1016
Date: 15.05.2008
Time: 17:16:43
User: N/A
Computer: NODE1
Description:
Cluster service failed to obtain a checkpoint from the server cluster
database for log file Q:\MSCS\tqu619E.tmp.
Next:
Event Type: Error
Event Source: ClusSvc
Event Category: Database Mgr
Event ID: 1000
Date: 15.05.2008
Time: 17:16:43
User: N/A
Computer: NODE1
Description:
Cluster service suffered an unexpected fatal error at line 2324 of source
module d:\nt\base\cluster\service\dm\dmlog.c. The error code was 5.
A lot of:
Event Type: Warning
Event Source: Ftdisk
Event Category: Disk
Event ID: 57
Date: 15.05.2008
Time: 17:16:45
User: N/A
Computer: NODE1
Description:
The system failed to flush data to the transaction log. Corruption may occur.
And:
Event Type: Error
Event Source: Service Control Manager
Event Category: None
Event ID: 7031
Date: 15.05.2008
Time: 17:16:45
User: N/A
Computer: NODE1
Description:
The Cluster Service service terminated unexpectedly. It has done this 1
time(s). The following corrective action will be taken in 60000
milliseconds: Restart the service.
I found the KB http://support.microsoft.com/kb/321531/en-us but I cannot
believe that our virus scanner is the cause, because we exclude all recommended
drives and files (e.g. quorum drive, database drives, database log drives,
SQL executables, pagefile, C:\Windows\Cluster, ..\NTDS, ..ntfsr, ..SYSVOL,
*.chk, *.ebd, *.ldf, *.log, *.mdf, *.ndf, *.stm) from read and write scans.
Does anyone have an idea?
br, Matthias
____________________________________________
Matthias Schweifer - Austria