Jamie
2005-10-12 10:43:02 UTC
Our client has a 4-node cluster running on Windows 2003 Enterprise. Events 1122
and 1123 appear in the event log every few seconds. Everything appears to work
fine, but we are worried about these errors.
An engineer went to site to troubleshoot, and his notes are as follows:
Myself and Simon went in on Friday to dissolve teaming on the cluster nics
as recommended by HP. This had no effect on the errors being generated by
the cluster, so we decided to test a few scenarios to see if we could
pinpoint the error. After trying various configurations, we discovered that
we could eliminate the errors on the cluster by moving all of the nics (LAN,
Heartbeat and Backup LAN) off the Jewson production network and onto an
isolated switch. We reconfigured teaming on the LAN nics and we still had no
errors while the cluster was on a switch of its own – this is how we left it.
We found during testing on the production network that the errors being
generated by the cluster could be stopped by restarting the cluster service
on the first node that was booted. After this, no matter which nodes were
taken online/offline, no errors were generated. The moment that all nodes
were physically shut down and brought back up though, the errors would return
until the cluster service was restarted as described. This leads us to believe
that the issue may well be related to DNS or Active Directory. It would
appear that Cluster on Windows 2003 Server attempts to write information into
the AD when the first node boots up. If this fails completely (dnsapi errors
written into event viewer), as is the case on the isolated switch, then
cluster works perfectly. In the production environment (no dnsapi errors
logged), it would appear that the process is partially working, which is then
causing the errors to be generated.
As you know, Jewson are owned by Saint Gobain. We have had experience of
Saint Gobain networks before and they are usually considerably locked down.
This may be a potential issue when the cluster is looking for and attempting
to update domain controllers.
The next step is really to pass this information on to Microsoft to see if
they can tell you what the first node in the cluster attempts to do with
AD/DNS and why the errors would not be generated on the isolated switch.
Until we resolve the errors on the cluster, we do not recommend installing
SQL.
Hope someone can help!
Regards