Unable to configure machines to be hubs

Jul 16, 2014 at 10:01 PM
I have been exploring using the WspEventRouter in our data center, and have gotten to the qa phase.

When configured on my workstations, all seems fine. When trying to install this in our qa environment, I found that our qa servers could not act as hubs. When configured as nodes using a workstation as a hub, they seem to work fine. I could run the hubs on different machines, but am concerned this might affect me in production.

Things I looked at:
  • netstat -an | findstr 1300
    When workstation B is a hub and workstation A is a node, workstation A has an established connection to workstation B
    When server A is a hub and workstation A is a node, workstation A does NOT have a link
  • telnet on port 1300 (to verify the port is reachable)
    When workstation B is a hub and workstation A is a node, telnet on workstation A is able to connect to workstation B
    When server A is a hub and workstation A is a node, telnet on workstation A is able to connect to server A
    Note that there are firewalls on the servers; I think I put the right holes in it. I also tried disabling them.
Any advice on other things to check?

ps - congrats on your move...
Jul 16, 2014 at 10:31 PM
It sounds like either a firewall issue or a problem with how your data center network is configured. At Microsoft, they had to change some of the ACLs in the network config for things to work; otherwise the network routers would block the traffic.
Jul 17, 2014 at 2:01 PM
Keith - thanks for your very prompt response! This is our internal qa environment, so I don't think there are firewalls in place between my workstations and the servers, and I had tried disabling the os-level firewalls. However, you never know, so I asked our IT staff to validate. Thanks for that feedback.

I had thought I ruled out a firewall issue by successfully opening a socket to port 1300 from the client to the server. Maybe I will try a packet sniffer to see if anything jumps out.
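
The socket test described above can be scripted so it is easy to repeat from either machine. This is only a minimal sketch: the function name can_connect is made up for illustration, and a successful connect shows only that the TCP port is reachable, not that the WSP router on the other end accepts the session.

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened.

    This is the scripted equivalent of "telnet <host> <port>": it proves
    the port is reachable, nothing more.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, can_connect("server-a", 1300) would be run from the workstation, where "server-a" is a placeholder for the real server name.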

The servers are running on an ancient os (win2003). Unfortunately, this is because our prod servers still have that version. Since it was able to act as a node, I didn't think it could be a factor; opinion?
Jul 18, 2014 at 1:25 PM
Did you get this resolved? Wsp shouldn't have any problem running on WS 2003, I did that for years.

If things still aren't working, try making the workstation the hub and the server a node just to see if there is a problem going one direction but not the other.

Another possible issue could be DNS name resolution. Make sure each system can resolve the other's name.
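
One rough way to check name resolution on each system is a small script like the one below. It is a sketch only: the function name resolve is hypothetical, and it reports whatever the operating system's resolver returns, which may not be exactly the lookup path WSP itself uses.

```python
import socket

def resolve(name):
    """Return the sorted list of addresses that `name` resolves to.

    An empty list means the local resolver could not resolve the name.
    """
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

Running it on the hub with the node's name, and on the node with the hub's name, shows whether each side can resolve the other and which addresses it gets back.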

Are you using IPv4 or IPv6? I don't think that should be an issue, but you never know.
Jul 18, 2014 at 4:49 PM
Thanks for your followup. No, this isn't resolved. Sorry for this long response...

I have verified that no firewalls are in use except the os level firewall; and I have tried disabling it. I have also verified that DNS is working. IPv4 is being used.

The servers are also configured with multiple IP addresses. Might that be related?

I have been using my workstation as the hub, and in general I can get it working. I have put my hub's config at the bottom of this message (my workstation's machine name is "Barrett").

I have found it a bit flaky as I am working with it. For example, I rebooted my workstation (the only hub), and as probably expected the node stopped working. I wanted to restart WSP services to fix that. Or, when I make changes such as killing the firewall, I want to restart services. I have found restarting works best if I do the following:
  1. Shut down Server A (currently node). Wait a few seconds until the netstat shows it is not attached to the server.
    Note: I am using only one node to minimize effort now.
  2. Shut down Workstation A (currently hub). Wait until netstat shows it is not listening; then wait 10 more seconds.
  3. Start up Workstation A. Wait until netstat shows it is listening (takes a few seconds).
  4. Start up Server A. Wait until netstat shows it is connected to the hub.
However, about half the time (in fact, it might be every other time), Server A never establishes the connection. The following has always fixed it thus far: stopping the service on Server A, waiting 10 seconds, and then starting it again.
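
The "wait until netstat shows..." checks in the steps above can be sketched as a small parser over the text of "netstat -an". This is an illustration under assumptions: the function names are made up, and it expects the usual Windows column layout (Proto, Local Address, Foreign Address, State). It matches the port exactly, so a port such as 11300 is not picked up the way it would be by "findstr 1300".

```python
def port_states(netstat_output, port=1300):
    """Return (local, remote, state) for every TCP row that uses `port`."""
    suffix = ":" + str(port)
    rows = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows have four columns; UDP rows have no State column.
        if len(fields) >= 4 and fields[0] == "TCP":
            local, remote, state = fields[1], fields[2], fields[3]
            if local.endswith(suffix) or remote.endswith(suffix):
                rows.append((local, remote, state))
    return rows

def is_listening(netstat_output, port=1300):
    """True if something on this machine is listening on `port`."""
    suffix = ":" + str(port)
    return any(state == "LISTENING" and local.endswith(suffix)
               for local, _, state in port_states(netstat_output, port))

def is_connected(netstat_output, port=1300):
    """True if there is an established TCP connection involving `port`."""
    return any(state == "ESTABLISHED"
               for _, _, state in port_states(netstat_output, port))
```

Steps 2 and 3 correspond to waiting for is_listening to become False and then True on the hub; steps 1 and 4 correspond to is_connected on the node.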

Probably related: one time, I tried to add a logging entry to the hub. The config file was successfully pushed to the node, but the node then unestablished the connection to the hub.

Probably unrelated: I wasn't able to get logging working. It looked like the only thing I needed to change was the type attribute to reflect my event type's guid. When I published an event from the hub (which is where I had this config), I was expecting some files to appear in the c:\temp\WebEvents\ or c:\temp\WebEvents\log\ directory. Is the logging expected to be functional in the v3 codebase?

A couple of suggestions for logging: since my workstation is the publishing hub, it copied this to the node. At one point, I tried setting localOnly to false. That got copied too.

Assuming I understand how the attributes work, maybe the publishing should only push the logging out if that logging is set to localOnly? That way all nodes don't start logging everything.

And maybe when a secondary hub/node's config is updated from the publish, it keeps any localOnly=false entries? That way, if a single node is configured to log everything, that configuration doesn't get wiped.

<?xml version="1.0" encoding="utf-8"?>
    <section name="eventRouterSettings" type="Microsoft.WebSolutionsPlatform.Configuration.EventRouterSettings" />
    <section name="hubRoleSettings" type="Microsoft.WebSolutionsPlatform.Configuration.HubRoleSettings" />
    <section name="nodeRoleSettings" type="Microsoft.WebSolutionsPlatform.Configuration.HubRoleSettings" />
    <section name="groupSettings" type="Microsoft.WebSolutionsPlatform.Configuration.GroupSettings" />
    <section name="logSettings" type="Microsoft.WebSolutionsPlatform.Configuration.LogSettings" />

<eventRouterSettings role="Hub" group="SigmaCare QA" autoConfig="true" mgmtGuid="4683993D-CBBD-4798-A6B1-0ED2870864DD" cmdGuid="D1C61783-9646-4424-B508-D628D3D8C98C" publish="True" />

  <subscriptionManagement refreshIncrement="3" expirationIncrement="10" />
    <localPublish eventQueueName="WspEventQueue" eventQueueSize="102400000" averageEventSize="10240" />
    <outputCommunicationQueues maxQueueSize="200000000" maxTimeout="600" />
    <thisRouter nic="" port="1300" bufferSize="1024000" timeout="5000" />
    <peerRouter numConnections="2" port="1300" bufferSize="1024000" timeout="5000" />

  <subscriptionManagement refreshIncrement="3" expirationIncrement="10" />
    <localPublish eventQueueName="WspEventQueue" eventQueueSize="10240000" averageEventSize="10240" />
    <outputCommunicationQueues maxQueueSize="200000000" maxTimeout="600" />
    <parentRouter numConnections="1" port="1300" bufferSize="1024000" timeout="5000" />

  <group name="SigmaCare QA" useGroup="">
    <hub name="barrett" />

  <group name="Grp2" useGroup="SigmaCare QA">

  <group name="default" useGroup="SigmaCare QA">

    <!-- <event type="78422526-7B21-4559-8B9A-BC551B46AE34" localOnly="true" maxFileSize="2000000000" maxCopyInterval="60" createEmptyFiles="false" fieldTerminator="," rowTerminator="\n" tempFileDirectory="c:\temp\WebEvents\" copyToFileDirectory="c:\temp\WebEvents\log\" /> -->
Jul 18, 2014 at 5:06 PM
Is barrett the hub or node?

I don't recall if role is case sensitive so change the value to "hub".

When you want to stop the wsp service, don't use stop service. Instead, kill the process either with the kill command or via task manager. The issue is .Net thinks it has stopped all the threads in the process but it has no way to signal the unmanaged thread which is usually in a wait state. Once the wait has timed out, it returns to .Net and then the process ends. This is why I kill the process.
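
If the kill needs to be scripted, one possible sketch uses the Windows taskkill command with /F (force) and /IM (image name). The image name wspeventrouter.exe is an assumption here; check the actual process name in Task Manager first. The function names are made up for illustration.

```python
import subprocess

def taskkill_command(image_name):
    """Build a forced taskkill command for every process with this image name."""
    return ["taskkill", "/F", "/IM", image_name]

def kill_router(image_name="wspeventrouter.exe"):
    """Run taskkill (Windows only) and return True if it exited with code 0."""
    # The default image name is an assumption, not confirmed by the service setup.
    return subprocess.call(taskkill_command(image_name)) == 0
```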

Jul 18, 2014 at 5:59 PM
barrett is a hub (or was, see below). It is one of the workstations I was using for poc and has been stable as a hub.

After a quick test, I think killing the process rather than shutting down the service addresses the flakiness; thanks.

I updated the config to use lower-case hub. I then tried putting the server back as being a hub and my workstation as a node. I am getting a different response from the hub's netstat: I now get a TIME_WAIT on the connection from the node (now my workstation). I sometimes see a FIN_WAIT_2 before that.

In fact, one time I saw a blip of the connection being established from my workstation; see below. I haven't seen it since; I guess I am not running netstat quickly enough.

C:\Documents and Settings\bselfridge>netstat -an | findstr 1300

C:\Documents and Settings\bselfridge>netstat -an | findstr 1300

C:\Documents and Settings\bselfridge>netstat -an | findstr 1300
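
Since running netstat by hand is too slow to catch short-lived states, a polling loop along these lines might help. It is a sketch under assumptions: the function names are hypothetical, netstat is assumed to be on the PATH, and the sample count and interval are arbitrary defaults.

```python
import subprocess
import time

def run_netstat():
    """Return the text output of "netstat -an"."""
    return subprocess.check_output(["netstat", "-an"], universal_newlines=True)

def observe_states(port=1300, samples=50, interval=0.2, get_output=run_netstat):
    """Poll netstat and return each distinct TCP state seen on `port`, in order.

    `get_output` is injectable so the parsing can be tried on canned text.
    """
    suffix = ":" + str(port)
    seen = []
    for _ in range(samples):
        for line in get_output().splitlines():
            fields = line.split()
            if len(fields) >= 4 and fields[0] == "TCP":
                if fields[1].endswith(suffix) or fields[2].endswith(suffix):
                    if fields[3] not in seen:
                        seen.append(fields[3])
        time.sleep(interval)
    return seen
```

Starting the loop on the hub just before the node is started would show whether the connection ever reaches ESTABLISHED before it drops to FIN_WAIT_2 or TIME_WAIT.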
Jul 18, 2014 at 6:59 PM
Did you hear back from your IT guy regarding the network config? It really sounds like the network routers are configured to not allow traffic on this port number. If so, they need to change the config to open up port 1300 traffic in both directions.

Jul 18, 2014 at 7:07 PM
I did; they said there wasn't anything like that between my workstation and these servers.

I think I also validated that. With the server configured as a hub, I opened a socket to the server from my workstation via "telnet qaweb004 1300". netstat on qaweb004 shows that the connection is established.
Jul 18, 2014 at 9:54 PM
Could there be some antivirus-type tool running on the server/workstation that might shut down usage of the port? If not, I think you'll need to start wspeventrouter.exe in a debugger to see what's happening.