WSP 3.0 Groups

Aug 9, 2013 at 8:15 AM
Edited Aug 9, 2013 at 10:37 AM
Hi, Keithh,

We are using WSP now. Can you show us how to configure two groups of hubs?
(Each group contains two hubs, and one group connects to the other, just like what you described in your document.) We tried the configuration below, but it doesn't seem to work:
<eventRouterSettings role="hub" group="EventSystemQA2"

<groupSettings>
    <group name="EventSystemQA" useGroup="">
      <hub name="ServerA.net"/>
      <hub name="ServerC.net"/>
    </group>
    <group name="EventSystemQA2" useGroup="EventSystemQA">
      <hub name="ServerC.net"/>
    </group>
</groupSettings>
After digging deeper, I found that the router's role gets changed to node after the code below executes:
Dictionary<string, Hub> hubs = GetHubList(newConfigSettings, newConfigSettings.EventRouterSettings.Group);

if (hubs.ContainsKey(Router.LocalRouterName) == false)
{
    newConfigSettings.EventRouterSettings.Role = "node";
    EventLog.WriteEntry("WspEventRouter", "Role has been changed to Node since name was not found in Hub list", EventLogEntryType.Error);
}
Is this the correct behavior? Please show us how to configure two groups. Thanks in advance.

Regards
Coordinator
Aug 9, 2013 at 4:49 PM
Following is a sample config file. Servers 1 and 2 would specify in their config file group="DC1" and servers 3 and 4 would specify in their config file group="DC2". For any nodes, you would choose one of the two groups for them to be associated with and their role would be "node".
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
    <configSections>
        <section name="eventRouterSettings" type="Microsoft.WebSolutionsPlatform.Configuration.EventRouterSettings"/>
        <section name="hubRoleSettings" type="Microsoft.WebSolutionsPlatform.Configuration.HubRoleSettings"/>
        <section name="nodeRoleSettings" type="Microsoft.WebSolutionsPlatform.Configuration.HubRoleSettings"/>
        <section name="groupSettings" type="Microsoft.WebSolutionsPlatform.Configuration.GroupSettings"/>
        <section name="logSettings" type="Microsoft.WebSolutionsPlatform.Configuration.LogSettings"/>
    </configSections>

    <!-- role = {hub, node} -->
    <!-- group = <name> -->
    <!-- autoConfig = {true, false} -->
    <!-- mgmtGuid = <GUID> -->
    <!-- cmdGuid = <GUID> -->
    <!-- publish = {true, false}  **Only ONE of the hub servers should be configured to publish. This server will be master for the global config file. -->
    <!-- bootstrapUrl = <URL> **This will be called to retrieve the config file if mgmtGuid does not exist or if the file is corrupt. -->

    <eventRouterSettings role="hub" group="DC1" autoConfig="true" mgmtGuid="DA761E42-69DD-45b0-BDE7-C500A6A0DA0E" cmdGuid="345CF9E9-F206-4572-8451-0F519C739A7E" publish="false"/>

    <!-- refreshIncrement should be about 1/3 of what the expirationIncrement is. -->
    <!-- This setting needs to be consistent across all the machines in the eventing network. -->
    <!-- <subscriptionManagement refreshIncrement="3"  expirationIncrement="10"/> -->

    <!-- <localPublish eventQueueName="WspEventQueue" eventQueueSize="102400000" averageEventSize="10240"/> -->

    <!-- These settings control what should happen to an output queue when communications is lost to a parent or child.-->
    <!-- maxQueueSize is in bytes and maxTimeout is in seconds.-->
    <!-- When the maxQueueSize is reached or the maxTimeout is reached for a communication that has been lost, the queue is deleted.-->
    <!-- <outputCommunicationQueues maxQueueSize="200000000" maxTimeout="600"/> -->

    <!-- nic can be an alias which specifies a specific IP address or an IP address. -->
    <!-- port can be 0 if you don't want to have the router open a listening port to be a parent to other routers. -->
    <!-- <thisRouter nic="" port="1300" bufferSize="1024000" timeout="30000" /> -->

    <hubRoleSettings>
        <subscriptionManagement refreshIncrement="3"  expirationIncrement="10"/>
        <localPublish eventQueueName="WspEventQueue" eventQueueSize="102400000" averageEventSize="10240"/>
        <outputCommunicationQueues maxQueueSize="200000000" maxTimeout="600"/>
        <thisRouter nic="" port="1300" bufferSize="1024000" timeout="5000" />
        <peerRouter numConnections="2" port="1300" bufferSize="1024000" timeout="5000" />
    </hubRoleSettings>

    <nodeRoleSettings>
        <subscriptionManagement refreshIncrement="3"  expirationIncrement="10"/>
        <localPublish eventQueueName="WspEventQueue" eventQueueSize="10240000" averageEventSize="10240"/>
        <outputCommunicationQueues maxQueueSize="200000000" maxTimeout="600"/>
        <parentRouter numConnections="1" port="1300" bufferSize="1024000" timeout="5000" />
    </nodeRoleSettings>

    <groupSettings>
      <group name="DC1" useGroup="">
        <hub name="server1"/>
        <hub name="server2"/>
      </group>

      <group name="DC2" useGroup="">
        <hub name="server3"/>
        <hub name="server4"/>
      </group>

      <group name="default" useGroup="DC1">
      </group>
    </groupSettings>

    <logSettings>

        <!-- type specifies the EventType to be persisted.-->
        <!-- localOnly is a boolean which specifies whether only events published on this machine are persisted or if events from the entire network are persisted.-->
        <!-- maxFileSize specifies the maximum size in bytes that the persisted file should be before it is copied.-->
        <!-- maxCopyInterval specifies in seconds the longest time interval before the persisted file is copied.-->
        <!-- fieldTerminator specifies the character used between fields.-->
        <!-- rowTerminator specifies the character used at the end of each row written.-->
        <!-- tempFileDirectory is the local directory used for writing out the persisted event serializedEvent.-->
        <!-- copyToFileDirectory is the final destination of the persisted serializedEvent file. It can be local or remote using a UNC.-->

    <!-- <event type="78422526-7B21-4559-8B9A-BC551B46AE34" localOnly="true" maxFileSize="2000000000" maxCopyInterval="60" createEmptyFiles="false" fieldTerminator="," rowTerminator="\n" tempFileDirectory="c:\temp\WebEvents\" copyToFileDirectory="c:\temp\WebEvents\log\" /> -->

    </logSettings>

</configuration>
Aug 9, 2013 at 5:03 PM
Keithh,

Thanks for the quick answer. So how do the two groups get connected to each other? With the configuration you provided, the servers in DC1 will connect to the servers in DC2, right? Can you tell me what the use case is for specifying useGroup in the configuration?

Thanks
Coordinator
Aug 9, 2013 at 5:26 PM
Each hub connects to every hub in its group and to one hub in each of the other groups. So yes, the servers in DC1 will connect to one of the servers in DC2.

The purpose for useGroup is so you can create groups of nodes separate from groups of hubs. With useGroup you can then specify what hub group you want to have the node group connect to. So for instance, if you were in a multi-tenancy situation, you could specify a node group for each tenant and it would be easy to change the overall topology by changing the value of useGroup.
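
The connection rule described above (every hub in the same group, plus one hub in each other group) can be sketched as follows. This is only an illustration: `HubTopologySketch` and `PeersFor` are hypothetical names, not part of WSP, and picking the first hub of another group is an assumption, since the thread does not say how WSP chooses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not WSP code: lists the peers one hub would connect to
// under the rule "every hub in my group plus one hub in each other group".
public static class HubTopologySketch
{
    public static List<string> PeersFor(
        string localHub,
        IDictionary<string, List<string>> hubGroups)
    {
        var peers = new List<string>();
        foreach (var group in hubGroups)
        {
            if (group.Value.Contains(localHub))
            {
                // Same group: connect to every other hub in it.
                peers.AddRange(group.Value.Where(h => h != localHub));
            }
            else if (group.Value.Count > 0)
            {
                // Other group: a single hub is enough. Taking the first one
                // is an assumption made for this sketch.
                peers.Add(group.Value[0]);
            }
        }
        return peers;
    }
}
```

With the DC1/DC2 sample config, server1 would end up with two peers: server2 from its own group and one hub from DC2. Groups without hubs, such as the `default` node group, contribute no peers.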
Aug 12, 2013 at 7:36 AM
Edited Aug 12, 2013 at 8:12 AM
Keithh,

Following what you said, we configured the groups like this:
<group name="EventSystemQA" useGroup="">
      <hub name="Server1" />
      <hub name="Server2" />
</group>
<group name="EventSystemQA2" useGroup="">
      <hub name="Server3" />
      <hub name="Server4" />
</group>
And below are the results of netstat (netstat -a | findstr 1301):

Server1:
TCP 0.0.0.0:1301 Server1:0 LISTENING
TCP 150.110.105.76:1301 SERVER4:1816 ESTABLISHED
TCP 150.110.105.76:1301 SERVER2:64391 ESTABLISHED
TCP 150.110.105.76:1301 SERVER2:65481 ESTABLISHED
TCP 150.110.105.76:59021 SERVER2:1301 ESTABLISHED
TCP 150.110.105.76:59024 SERVER3:1301 ESTABLISHED

Server2:

TCP 0.0.0.0:1301 Server2:0 LISTENING
TCP 169.193.125.109:1301 SERVER1:59021 ESTABLISHED
TCP 169.193.125.109:64391 SERVER1:1301 ESTABLISHED
TCP 169.193.125.109:64399 SERVER3:1301 ESTABLISHED
TCP 169.193.125.109:65481 SERVER1:1301 ESTABLISHED

Server3:

TCP 0.0.0.0:1301 SERVER3:0 LISTENING
TCP 168.109.19.84:1301 SERVER4:1815 ESTABLISHED
TCP 168.109.19.84:1301 SERVER1:59024 ESTABLISHED
TCP 168.109.19.84:1301 SERVER2:64399 ESTABLISHED

Server4:

TCP SERVER4:1301 SERVER4:0 LISTENING
TCP SERVER4:1815 SERVER3:1301 ESTABLISHED
TCP SERVER4:1816 SERVER1:1301 ESTABLISHED

The question is:
We tried publishing a message on Server3 and subscribing to it on Server4, and surprisingly, we observed that the message was forwarded to Server4 twice, once by Server3 and once by Server1. Is this the correct behavior? Please advise how to resolve this issue.

Thanks.
Coordinator
Aug 12, 2013 at 2:40 PM
That's not the correct behavior. It seems something is still not configured correctly.

Send me all four config files to my email account and I'll look at them. Send to keithh@microsoft.com.

Keith
Aug 13, 2013 at 4:12 PM
Edited Aug 13, 2013 at 4:12 PM
Keithh,
Please check the email. Regards
Aug 14, 2013 at 7:35 AM
Edited Aug 14, 2013 at 8:06 AM
Keithh,

Judging from the code (Communicator, ConnectToPeers) and the netstat results, it seems Server4 connects to both Server1 and Server2. In that case, is the following possible: Server1 forwards the message to Server4 and also to Server2, and Server2 forwards the message again to Server4, which causes the duplicate? Does the router need to connect to all the hubs in the other group, or just one of them?
Coordinator
Aug 15, 2013 at 5:26 AM
I am able to repro the bug but I won't be able to work on fixing it until next week. Hopefully you can stick to using just one group until I have a new build. I will notify you next week when I upload the new build.

Keith
Aug 15, 2013 at 7:22 AM
Keith,

It's OK, we can wait for your fixes. Thanks for the investigation and improvement.
Coordinator
Aug 20, 2013 at 1:50 AM
I updated the msi today, it is now build 3.0.80.0. Please test and verify your issue is resolved. In addition, I changed how DateTime is serialized. I now include the DateTime.Kind property and return the correct kind during deserialization.
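
For readers wondering what carrying `DateTime.Kind` through serialization can look like, here is a minimal sketch. It is not WSP's actual wire format, which the thread does not show; the 9-byte layout (ticks, then kind) and the `DateTimeKindSketch` name are assumptions made for illustration.

```csharp
using System;
using System.IO;

// Hypothetical helper, not WSP code: round-trips a DateTime together with its Kind.
public static class DateTimeKindSketch
{
    public static byte[] Write(DateTime value)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(value.Ticks);       // 8 bytes: the instant
            writer.Write((byte)value.Kind);  // 1 byte: Unspecified, Utc or Local
            writer.Flush();
            return stream.ToArray();
        }
    }

    public static DateTime Read(byte[] data)
    {
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            long ticks = reader.ReadInt64();
            var kind = (DateTimeKind)reader.ReadByte();
            // Rebuild the value with the original Kind instead of Unspecified.
            return new DateTime(ticks, kind);
        }
    }
}
```

Note that `DateTime` equality compares ticks only, so a test for this kind of change has to check `Kind` explicitly.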
Coordinator
Aug 20, 2013 at 2:35 AM
BTW, the change I made is a breaking change. You cannot run version 3.0.80.0 with previous versions.
Aug 20, 2013 at 2:50 AM
Edited Aug 20, 2013 at 2:54 AM
Thanks, Keithh,

Let me download it and do some testing.
Sep 3, 2013 at 4:41 AM
Hi, Keithh,

Thanks for the fix. We are now able to run the router across machines, and no duplicate messages have been observed so far.
Nov 12, 2013 at 10:13 AM
Edited Nov 12, 2013 at 12:08 PM
Keithh,

How many hubs can we configure in one group?

Take the configuration below as an example:
<group name="EventSystemQA" useGroup="">
      <hub name="ServerA"/>
      <hub name="ServerB"/>
      <hub name="ServerC"/>
</group>
In this case, suppose there is a client named A that connects to ServerC to listen for messages with eventType = eventTypeA, and ServerC is connected to ServerB. Internally, all subscription events from A will be forwarded to ServerC. From the code below (Communicator.cs, line 2196), ServerB would decide that the subscription event should not be forwarded to ServerA, since it comes from the same group. So what happens if the message is published from ServerA? It seems the client cannot receive that message at all, since ServerA thinks ServerB has no interest in it. Am I right? Please help.

Message received by ServerB:
 OriginateRouter:A
 InRouter:ServerC
 Source:FromHub
if (socketInfo.Hub == false)
{
    // forward to all nodes
}
else
{
    if (string.Compare(socketInfo.Group, configSettings.EventRouterSettings.Group, true) == 0 ||
        element.Source == EventSource.FromPeer)
    {
        continue;  // don't forward to other hubs of same group or if source is from a peer
    }
    else
    {
        // forward to all peers
    }
}
Coordinator
Nov 12, 2013 at 5:44 PM
I don't think you're looking at the current code.

The hubs in a group always have a connection to every other hub in the same group. Also, each hub has a connection to one hub in each other group. So when ServerC receives the subscription from NodeA, it will send it on to ServerA and ServerB. If a node connected to ServerA or ServerB publishes one of the events, then the event will go from the node, to its hub, to ServerC, and then to NodeA. I think the code you show is for other conditions.
Nov 13, 2013 at 4:17 AM
Hi, Keithh,

I think you are right. Thanks for the responses.
Nov 15, 2013 at 1:52 PM
Edited Nov 15, 2013 at 1:54 PM
Hi, Keithh,

We observed the exception below, which crashed the process. We plan to migrate to WSP 3.0 in Production soon, which has 400 client users, so it's critical for us to figure out exactly what happened. Can you please comment on this? Is it because the content of the received message was invalid or corrupted? We have observed this exception only once so far and cannot reproduce it. Can you suggest what the root cause might be and what we can do to handle it?
Exception: 
    Logged: 11/13/2013 9:47:33 AM
    Application: WspEventRouter.exe 
    Framework Version: v4.0.30319
    Description: The process was terminated due to an unhandled exception.
    Exception Info: System.ArgumentException
    Stack:
    at System.Buffer.BlockCopy(System.Array, Int32, System.Array, Int32, Int32)
    at Microsoft.WebSolutionsPlatform.PubSubManager.WspEvent.ChangeInRouterName(Byte[], Byte[], System.String)
    at Microsoft.WebSolutionsPlatform.Router.Router+CommunicationHandler.ProcessReceive(System.Net.Sockets.SocketAsyncEventArgs)
    at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
    at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
    at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
    at System.Net.Sockets.SocketAsyncEventArgs.FinishOperationSuccess(System.Net.Sockets.SocketError, Int32, System.Net.Sockets.SocketFlags)
    at System.Net.Sockets.SocketAsyncEventArgs.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
    at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped
Coordinator
Nov 16, 2013 at 12:14 AM
I have not seen this exception before. I'm not sure where the data corruption is coming from. I would guess the corruption most likely came from the networking layer but I'm not sure how to identify the true bug. About 6 weeks ago I ran a test of 1.2B events over many days and no errors occurred. I think you will most likely have to log the error if it occurs again and just move on.
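
One way to "log the error and move on" rather than crash is to validate the offsets before calling `Buffer.BlockCopy`, which throws `ArgumentException` when an offset plus the count runs past either array. The sketch below is not WSP code: `TryBlockCopy` is a hypothetical helper, and how `ChangeInRouterName` computes its offsets is not shown in this thread.

```csharp
using System;

// Hypothetical helper, not WSP code: a bounds-checked wrapper around Buffer.BlockCopy.
public static class SafeCopySketch
{
    // Returns false instead of throwing when the requested range does not fit,
    // so the caller can log the malformed message and drop it.
    public static bool TryBlockCopy(byte[] src, int srcOffset, byte[] dst, int dstOffset, int count)
    {
        if (src == null || dst == null) return false;
        if (srcOffset < 0 || dstOffset < 0 || count < 0) return false;

        // Compare against the remaining length to avoid overflow in offset + count.
        if (count > src.Length - srcOffset) return false;
        if (count > dst.Length - dstOffset) return false;

        Buffer.BlockCopy(src, srcOffset, dst, dstOffset, count);
        return true;
    }
}
```

A caller on the receive path could treat a `false` result as a corrupted message, write an event log entry with the lengths involved, and continue processing the next message.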
Nov 20, 2013 at 7:02 AM
Edited Nov 20, 2013 at 7:03 AM
Keithh,

Thanks for the responses. We are running stress tests and hope we can reproduce it in the staging environment.

Here is another question:

In the Windows Event Viewer, we see lots of warnings logged like the ones pasted below. Will this kind of exception cause message loss or something else? So far, we haven't observed that.
System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'System.Net.Sockets.Socket'.
   at System.Net.Sockets.Socket.SendAsync(SocketAsyncEventArgs e)
   at Microsoft.WebSolutionsPlatform.Router.Router.CommunicationHandler.ProcessSend(SendState state)

System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'System.Net.Sockets.Socket'.
   at System.Net.Sockets.Socket.Shutdown(SocketShutdown how)
   at Microsoft.WebSolutionsPlatform.Router.Router.Communicator.CloseSocket(Socket socket, String clientRouterName)
Coordinator
Nov 20, 2013 at 2:02 PM
In the first message, the socket was closed and the sending thread tried using the closed socket. It gets the error and re-queues the event while communication is being re-established and then the events will be sent. No events should be lost.

In the second message, the socket is being closed. This is probably when you're shutting the service down. I wouldn't see this message as a problem.
Nov 26, 2013 at 1:25 PM
Edited Nov 27, 2013 at 9:21 AM
Hi, Keithh,

Thanks very much for the response.

Sorry to ask another question:

Say we have two hubs, HubA and HubB, and a client, C. At first, C connects to HubA, and C creates a queue for HubA for further communication. For some reason, the connection is reset and C connects to HubB. In this case, I noticed that the queue created for HubA was not cleaned up, and some events routed from HubB are still being queued into the queue created for HubA. I looked into the queue cleanup code and want to confirm with you whether my suspicion is right:
internal void CleanupQueues()
{
    int i;
    long currentTickTimeout;
    PerformanceCounter socketQueueCounter;
    List<string> removeRouters = new List<string>();
    List<KeyValuePair<string, Socket>> removeSockets = new List<KeyValuePair<string, Socket>>();

    lock (socketQueuesLock)
    {
        foreach (string routerName in commSockets.Keys)
After the connection is reset or aborted, I think HubA should already have been removed from commSockets. If that's right, the cleanup logic would never run for HubA, and the queue would stay there and keep its events until the process stops, occupying more and more memory.

Worse, if the connection switches back to HubA and lots of events have queued up in the queue created for it, those events will all be sent to HubA, which would put a lot of pressure on HubA.

Please help confirm whether I am right. We really did observe extreme memory growth in the WSP process on both the server and the client side. As you know, we added a logging function to WSP, and from the logs, we noticed a very large number of management events travelling between hub and node.

Thanks in advance.
Nov 29, 2013 at 1:38 AM
Keithh,

Can you provide your comments? Your help is really appreciated.
Dec 12, 2013 at 2:07 PM
Keithh,

Sorry to keep asking questions, but we really need your help.
SocketAsyncEventArgs sendEventArg = new SocketAsyncEventArgs();
sendEventArg.Completed += new EventHandler<SocketAsyncEventArgs>(sendEventArg_Completed);

sendEventArg.BufferList = buffersOut;
sendEventArg.UserToken = state;

SendState newState = SendState.Create(state.socketInfo);
state = newState;

try
{
    bool willRaiseEvent = currSocket.SendAsync(sendEventArg);
    if (!willRaiseEvent)
    {
        SendCompleted(sendEventArg);
    }

private static void SendCompleted(SocketAsyncEventArgs sendEventArg)
{
From our logs, we don't see Completed (SendCompleted) fire for a long time, so messages were not forwarded successfully to one particular machine during that time. Do you know in which situations the Completed event will not fire? Send failed, or data corrupted? Please provide your comments. We really need to get this resolved so we can move WSP 3.0 to our PRODUCTION.

Thanks very very much.
Coordinator
Dec 12, 2013 at 5:43 PM
Sorry for taking so long to reply. I am no longer at Microsoft. I was on vacation and then forgot to reply when I did return.

For the first question about hubs, when a connection is lost, the queue will stay around for 10 minutes or until it reaches a max size. This is controlled via config settings. So you should have seen the queue for HubA disappear after 10 minutes of no connection.

The Completed event will not be raised if SendAsync completes synchronously (it returns false), which is why the code calls SendCompleted directly in that case. If the SendAsync is unsuccessful, it should throw an exception and you should see the exception in the log.
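
The queue expiry rule described above matches the `outputCommunicationQueues` comment in the sample config (`maxQueueSize` in bytes, `maxTimeout` in seconds). A minimal sketch of that decision follows; `QueueExpirySketch.ShouldDelete` is a hypothetical name, and the real check inside `CleanupQueues` may be structured differently.

```csharp
using System;

// Hypothetical helper, not WSP code: decides whether the output queue of a
// lost connection should be deleted.
public static class QueueExpirySketch
{
    public static bool ShouldDelete(
        long queuedBytes,          // current size of the queue in bytes
        TimeSpan disconnectedFor,  // how long the connection has been lost
        long maxQueueSize,         // config: maxQueueSize (bytes)
        int maxTimeoutSeconds)     // config: maxTimeout (seconds)
    {
        // Either limit being reached is enough to drop the queue.
        return queuedBytes >= maxQueueSize
            || disconnectedFor.TotalSeconds >= maxTimeoutSeconds;
    }
}
```

With the sample values (`maxQueueSize="200000000"`, `maxTimeout="600"`), a queue for a lost hub connection would be dropped after 10 minutes, or earlier once it holds about 200 MB.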
Dec 13, 2013 at 2:03 AM
Edited Dec 13, 2013 at 2:17 AM
Keithh,

Thanks very much for your response. I hope you enjoyed your vacation.

For the first question, I think CleanupQueues might not work, because the key has already been removed from commSockets.Keys elsewhere after the connection went down, so the code below cannot select the queue that should be removed in the later steps. If you have some free time, can you help check the code?
lock (socketQueuesLock)
{
    foreach (string routerName in commSockets.Keys)
For the second question, we really haven't observed any exception related to it. From the logs, we only know that the message was dequeued from the client-specific queue, but we don't see Completed fire or the message get forwarded to the client. Someone discussed a similar issue on Stack Overflow: http://stackoverflow.com/questions/19824576/c-sharp-socketasynceventargs-stops-firing-completed-event. Do you think we are running into a similar issue?

Thanks again
Coordinator
Dec 13, 2013 at 2:24 PM
It looks like a bug in .NET or in how it interacts with the network driver. The link you provided is similar, and it looks like there are other articles about the same subject.

http://www.serverframework.com/asynchronousevents/2011/06/tcp-flow-control-and-asynchronous-writes.html

This link shows they were able to fix the problem by changing the settings of their network driver:

http://www.lenholgate.com/blog/2012/06/unexpected-causes-of-poor-datagram-send-performance.html

You might want to follow-up with Microsoft via MSDN.
Dec 14, 2013 at 9:10 AM
Keithh,
Thanks for your response.

Do you think WSP 3.0 needs to employ BufferManager/SocketAsyncEventArgsPool, as recommended by MSDN and others online, to improve the async socket operations? From reading the code, WSP 3.0 currently doesn't have that. It's very easy for us to reproduce the high memory issue when we publish a large volume of messages. When we observed the WSP process occupying 4 GB of memory, this is the information we collected with WinDbg + SOS:
!dumpheap -type System.Byte[]

Statistics:
             MT    Count    TotalSize Class Name
000007fef866d738        1           88 System.Collections.Generic.Dictionary`2[[System.Type, mscorlib],[System.Byte[], mscorlib]]
000007ff002d1728      236         5664 System.Collections.Generic.Dictionary`2+KeyCollection[[System.Byte, mscorlib],[System.Byte[], mscorlib]]
000007ff002d08c0      236        20768 System.Collections.Generic.Dictionary`2[[System.Byte, mscorlib],[System.Byte[], mscorlib]]
000007fef8630978   760046   3197403112 System.Byte[]
Total 760519 objects
Fragmented blocks larger than 0.5 MB:
            Addr     Size      Followed by
00000000df1104f0    0.5MB 00000000df196050 System.Byte[]
000000010e725368    0.7MB 000000010e7d2068 System.Byte[]
000000012ff20c18    0.7MB 000000012ffe00a8 System.Byte[]

!dumpheap -type System.Net.Sockets.SocketAsyncEventArgs

Statistics:
              MT    Count    TotalSize Class Name
000007fef7f184a0   246183     15755712 System.EventHandler`1[[System.Net.Sockets.SocketAsyncEventArgs, System]]
000007fef796a090   246183    110289984 System.Net.Sockets.SocketAsyncEventArgs
Total 492366 objects
Fragmented blocks larger than 0.5 MB:
From the stats above, SocketAsyncEventArgs + Byte[] had taken about 3 GB of memory (110289984 + 3197403112). Do you think we have run into the ineffective buffer management problem?

Thanks
Coordinator
Dec 14, 2013 at 6:40 PM
I don't recall ever running into this issue and I've often run things under extreme pressure with queues growing to 10+ gigabytes. So though using a buffer manager might alleviate heap fragmentation under these loads, it still seems there is something else going on. Is this only happening to one of your servers? Or do you have the connection issue to any server where the connection gets under high load?
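
For reference, the pooling idea raised in the question can be sketched as below. This is a simplified illustration in the spirit of the `SocketAsyncEventArgsPool` sample from MSDN, not WSP code and not a proposed fix: `SendArgsPoolSketch` is a hypothetical class, and pooling only reduces allocations per send; it would not by itself drain a backlog of unsent events.

```csharp
using System;
using System.Collections.Concurrent;
using System.Net.Sockets;

// Hypothetical helper, not WSP code: reuses SocketAsyncEventArgs instances
// instead of allocating a new one (and a new handler delegate) per send.
public sealed class SendArgsPoolSketch
{
    private readonly ConcurrentBag<SocketAsyncEventArgs> idle =
        new ConcurrentBag<SocketAsyncEventArgs>();
    private readonly EventHandler<SocketAsyncEventArgs> onCompleted;

    public SendArgsPoolSketch(EventHandler<SocketAsyncEventArgs> onCompleted)
    {
        this.onCompleted = onCompleted;
    }

    public int IdleCount { get { return idle.Count; } }

    public SocketAsyncEventArgs Rent()
    {
        SocketAsyncEventArgs args;
        if (!idle.TryTake(out args))
        {
            args = new SocketAsyncEventArgs();
            args.Completed += onCompleted;  // subscribed once per instance
        }
        return args;
    }

    public void Return(SocketAsyncEventArgs args)
    {
        // Drop references to the sent buffers and state so they can be collected.
        args.BufferList = null;
        args.UserToken = null;
        idle.Add(args);
    }
}
```

In such a design the completion handler (or the synchronous-completion path) would call `Return` once the send has finished, so each instance is reused for later sends.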
Dec 16, 2013 at 1:31 PM
Keithh,

Thanks again.

We are working on reproducing the issue with clear steps. I will let you know once I have them. So far, we can reproduce it on different servers when we publish a large volume of messages between hubs.
Dec 18, 2013 at 2:47 PM
Edited Dec 18, 2013 at 2:48 PM
Keithh,

We experienced some production issues with WSP 2.1, which caused some messages to be missed and had an impact on the business.
From the logs, for more than 10 minutes we don't see any entries related to the distribute handler thread, and all the messages published during that period were missed.
[Information] 12/16/2013 08:40:30.7929549 6020
[Message] [Information] 2013-12-16 08:40:30 @ .
[Message] DistributeHandler, Start, succesfully get the event from forward-queue, EventType: 3d7b4317-c051-4e1a-8379-b6e2d6c107f9, Original Router: MACHINEA, InRouterName: MACHINEA

[Information] 12/16/2013 08:40:30.7929549 32628
[Message] CommunicationHandler, OutHandler, successfully to send event from queue, Queue Name: , Event Type: 3d7b4317-c051-4e1a-8379-b6e2d6c107f9, OriginalRouter: MACHINEA, InRouter: MACHINEA, SocketError: Success,  going to send it out, EndpointInfo, Remote: 169.193.125.110:1300, Local: 150.110.203.67:65248

[Information] 12/16/2013 08:40:30.7929549 6020
[Message] DistributeHanlder, enqueue the event into corresponding threadQueue for subscription, managerment, command, Router: ServerA.nam.nsroot.net, EventType: 3d7b4317-c051-4e1a-8379-b6e2d6c107f9, Original Router: MACHINEA

[Information] 12/16/2013 08:40:30.7929549 32144
[Message] CommunicationHandler, OutHandler, successfully dequeue event from queue, Queue Name: , Event Type: 3d7b4317-c051-4e1a-8379-b6e2d6c107f9, OriginalRouter: MACHINEA, InRouter: MACHINEA, going to send it out, EndpointInfo: Remote: 169.193.125.110:1300, Local: 150.110.203.67:65227

[Information] 12/16/2013 08:40:30.7929549 6020 
[Message] DistributeHandler, Start, succesfully get the event from forward-queue, EventType: 3d7b4317-c051-4e1a-8379-b6e2d6c107f9, Original Router: MACHINEA, InRouterName: MACHINEA

[Information] 12/16/2013 08:40:30.7929549 32144
[Message] CommunicationHandler, OutHandler, successfully to send event from queue, Queue Name: , Event Type: 3d7b4317-c051-4e1a-8379-b6e2d6c107f9, OriginalRouter: MACHINEA, InRouter: MACHINEA, SocketError: Success,  going to send it out, EndpointInfo, Remote: 169.193.125.110:1300, Local: 150.110.203.67:65227
We don't see any logs related to the distribute thread (6020) until nearly 19 minutes later:
[Information] 12/16/2013 08:59:10.7865343 6020
[Message] DistributeHanlder, enqueue the event into corresponding threadQueue for subscription, managerment, command, Router: ServerA.nam.nsroot.net, EventType: 3d7b4317-c051-4e1a-8379-b6e2d6c107f9, Original Router: 
Do you think it is possible that the lock the thread tried to acquire was held by some other thread for that long, causing it to wait nearly 19 minutes?
if (element.EventType == Event.SubscriptionEvent || element.EventType == mgmtGroup || element.EventType == cmdGroup)
{
    lock (Communicator.threadQueuesLock)
    {
Or did the thread fail to dequeue anything from the forwarder queue for that long?

As for the SynchronizationQueue implementation, can you give us some hints on why the mutex is used? Is it used to ensure that only one thread can access it? What do you think about changing the code to use BlockingCollection and ConcurrentQueue provided by .NET 4, and employing more threads for the distribution work? Would we benefit from that?


Thank you for your kind help.
Coordinator
Dec 18, 2013 at 5:09 PM
You can't tell with 2.1 if queuing is an issue or locks are an issue or it could be the issue is with the client. If the client stopped functioning properly such that it quit sending subscription events then the subscription would timeout and the events would stop flowing. So I wouldn't make any changes until you know what the root cause is. With 3.0, you will have visibility via the perf counters to see what is happening with the queues in the app process.
Dec 23, 2013 at 7:40 AM
Keithh,

Thanks for the response.

I have another question for you. Thanks in advance.

What is the purpose of the lock in the code below? It seems there is no access to the global dictionary inside InitConnection, so why do we still need the lock?
public void AcceptConnection(Socket socket)
{
    PerformanceCounter threadSocketCounter;
    SynchronizationQueue<QueueElement> socketQueue;

    SocketInfo socketInfoIn = null;
    Thread commThread;

    try
    {
        lock (Communicator.socketQueuesLock)
        {
            socketInfoIn = InitConnection(socket);
        }
    }
Coordinator
Dec 24, 2013 at 7:29 PM
It does add to the Socket collection.
Dec 25, 2013 at 9:26 AM
Edited Dec 30, 2013 at 10:46 PM
Keithh,

Can you explain more about the code? I don't see the socket collection. For WSP 2.1 and WSP 3.0, I don't see anything that needs to be protected. Sorry if I missed it.
Dec 30, 2013 at 10:48 PM
Keithh,

Can you let me know your email address? We have made some code changes to try to fix our issues, and we want to send you our code for review to give us more confidence in what we changed.

Thanks in advance.
Coordinator
Dec 31, 2013 at 4:05 AM
In InitConnection you will find:

socketInfoOut.Sockets.Add(socket);

This needs to be locked.

Use kstevenham@hotmail.com

Keith
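
To illustrate why an `Add` on a shared collection needs a lock, here is a small sketch. It assumes `socketInfoOut.Sockets` behaves like a `List<T>`, which is not safe for concurrent writers; the thread does not show its actual type. `SocketListSketch` is a hypothetical class, and plain integers stand in for `Socket` objects.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper, not WSP code: guards a List<T> with a lock so that
// concurrent Add calls cannot corrupt it or lose entries.
public sealed class SocketListSketch
{
    private readonly object gate = new object();
    private readonly List<int> sockets = new List<int>();  // ints stand in for Socket objects

    public void Add(int socketId)
    {
        lock (gate) { sockets.Add(socketId); }
    }

    public int Count
    {
        get { lock (gate) { return sockets.Count; } }
    }

    // Adds 'total' entries from many threads at once and returns the final count.
    public static int AddConcurrently(int total)
    {
        var list = new SocketListSketch();
        Parallel.For(0, total, i => list.Add(i));
        return list.Count;
    }
}
```

With the lock in place every entry is kept. Without it, concurrent `Add` calls on a `List<T>` can lose entries or throw, which is the kind of problem the lock around `InitConnection` would guard against.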
Jan 2, 2014 at 3:52 PM
Hi, Keithh,

Thanks for the help.

Here is another question.

We used the same tool to run stress tests with the same volume, and we noticed that with WSP 3 it takes longer than with WSP 2 for the client to receive all the messages. The messages are published by ServerA, go to ServerB, and are then forwarded to the client. Is it because WSP 3 maintains only 1 connection between the servers, so it takes more time for all the messages to reach the other server? Can we reconfigure the value to 10 connections? Is there any impact after the change?
Coordinator
Jan 2, 2014 at 5:40 PM
You can change the number of connections in the config file. You can change the number to 10 if you like. Once the number is greater than 1, there is a greater probability of events arriving out of order, if that is a concern for you.