Software systems and computational methods
Reference:
Dimitrichenko, D.P. (2024). Analysis of the appropriate behavior of various types of automata in the conditions of the placement game. Software systems and computational methods, 4, 49–65. https://doi.org/10.7256/2454-0714.2024.4.72488
Analysis of the appropriate behavior of various types of automata in the conditions of the placement game
DOI: 10.7256/2454-0714.2024.4.72488
EDN: SQCVFI
Received: 27-11-2024
Published: 05-12-2024
Abstract: The object of research in this work is homogeneous collectives of automata possessing the property of purposeful (expedient) behavior. The subject of the study is a comparison of different designs of such automata under the conditions of the placement game. The aim of the study is to identify the best (or comparable) designs in order to reduce the time and computational cost of more complex machine-learning models based on the reinforcement-learning principle. In the collectives under consideration, automata perform actions in a given environment with varying degrees of effectiveness. Each automaton, in accordance with its design, responds to an input signal with another action. The effectiveness of an automaton is defined as the sum of positive signals (rewards) and negative signals (penalties) it receives over the period of time under consideration. This characteristic depends both on the automaton's design and on its memory depth. It is necessary to determine the simplest automaton designs that achieve optimal effectiveness in a given environment in the shortest time. The formalization of both the properties of the environment and the actions of the automata, as well as the processing of the results obtained, is carried out using the apparatus of game theory; the effectiveness values of the automata are represented as the cumulative winnings and losses of the automaton players. As a result of the research, the automaton designs that provide a given effectiveness of functioning with minimal memory depth (the least complex design) are presented.
The result obtained makes it possible to trace the influence of the inertial properties of automata, implemented as appropriate designs, on the effectiveness of functioning in a given environment formalized as a placement game. The automaton with linear tactics and the Krylov automaton form two extreme implementations of an automaton strategy for approaching the optimum: the first owing to its high speed of changing actions, the second owing to a long stay in states close to the optimal one. The field of application of the results is the further study of more complex dynamic environments using the simplest automaton designs, since synchronous collectives of automata are difficult to parallelize computationally, which leads to a significant growth of time and computational costs as the structure of the dynamic environment becomes more complex or the size of the optimization task increases. Keywords: homogeneous collective of automata, expedient behavior, rewards, penalties, reinforcement learning, automaton, memory depth, game theory, optimal strategy, placement game. This article is automatically translated. Introduction The organization of complex behavior in decentralized systems built from simple elements is studied in the branch of machine learning known as "collective behavior of automata" [1], which has wide applied significance, from automatic optimization, intelligent control, and robotic systems [2] to questions of psychology [3]. At the same time, the connection between the neural-network and automaton approaches is quite close [4, 5]. With the growth of the computing capabilities of multi-user and multithreaded systems, there is an objective need to analyze and manage [6, 7] sets of agents that act in a decentralized manner, in accordance with their own goals, and interact with each other and with a given environment.
At the same time, situations often arise in which the same action selected by several agents can physically be performed at a given moment by only one of them. For example, only one of the threads processing shared data can perform a read (or write) operation; all other contenders are blocked from accessing the data. Of two robots approaching an intersection along different paths, only one can cross the opposing highway: the robots cannot move simultaneously (there would be a collision), nor can both wait for the other to pass, since that would deadlock the intersection for an indefinite period, not only for the current robots but for those that follow. In the biological interpretation, jointly selected actions of agents correspond to areas of an ecological niche shared by several animals; the actual resource on a site is then divided (not necessarily equally) among all the animals present on it. A similar situation occurs when multiple applications actively compete for network traffic. These are the simplest cases illustrating the rules of the placement game: an action chosen by several agents can be performed at a given time (in our case, rewarded) by only one of them (one of the automata that chose this action). All of the above makes it possible to apply probability theory, game theory, and automata theory to the analysis of the behavior of multi-agent systems (a finite set of agents in an artificial environment) in order to obtain theoretically justified results confirmed by simulation methods.
The automaton approach used in this work provides the researcher with a formalized method (within the framework of discrete mathematics) that allows one to state the problem formally and to analyze the behavior of the agents, and of the given environment itself, in terms of an automaton model (a collective of automata): input, output, and internal alphabets, transition rules, and selection results. The purpose of this work is to obtain quantitative estimates of the influence of the inertial properties (manifested in various ways) of certain types of automata, and of different values of memory depth, on the effectiveness of a collective of automata in a stochastic environment organized according to the rules of the placement game. The novelty of the conducted research is the following:
One of the traditional methods of analyzing the behavior of automata is the apparatus of game theory [8]. This choice is due to the fact that game theory makes it easy to create formalizable systems of rules (the environment) and behavioral strategies (the available actions) that determine the nature of interaction between an individual automaton (or a set of automata) and the environment [9]. This makes it possible to form a closed Environment-Automaton loop, in which the actions of the automaton determine the reaction of the environment, which, in turn, through feedback, influences the automaton's choice of its next action. The first notable models of this type were created by Mikhail Lvovich Tsetlin [9], the founder of a whole line of research called "collective behavior of automata" [9-12]. He formulated the basic principles underlying such models and the ways to implement them. The formalization of the concept of purposeful behavior allowed M. L. Tsetlin to formulate the basic principle of organizing complex behavior in a set of decentralized systems forming collectives of automata. The main provisions of this approach are: 1) the principle of superposition; 2) the principle of commensurability; 3) the principle of universality of the automaton structure. The principle of superposition: any sufficiently complex behavior consists of a set of simple behavioral acts; the joint execution of such simple acts and their simplest interactions produce very complex behavioral processes. The principle of commensurability: the degree of expedient behavior of the analyzed cybernetic system is measured by the mathematical expectation of the set of penalties and rewards received by this system from the environment over the observed period of time t.
This value lies in a range with the following boundaries. The left boundary is the mathematical expectation of the penalties and rewards received by the simplest system, which randomly selects actions available in the given environment from the set D = {d1, ..., dm}, m >= 2. The right boundary is the mathematical expectation of the penalties and rewards received by a system that, at every moment of time t, t >> 1, always performs the action that is known to be optimal in this environment, di = d*, i = 1, ..., m. Obviously, the closer the expectation of the analyzed system's rewards is to the left boundary of this interval, the more closely its behavior corresponds to the "random selection" strategy; and the closer this value is to the right boundary, the closer its behavior is to that of "a system aware of the best possible strategy." Such a definition of expedient behavior does not depend on the structure of the analyzed system but is based only on the statistical result of its interaction with the environment. The principle of universality: the design of the analyzed system must not contain empirical data on the optimal (or non-optimal) actions in a given environment, i.e., the system identifies such actions in the course of its operation over some time t. The first automaton implementation of such a system was proposed by M. L. Tsetlin.
Automaton implementation Here is a formal description of the automaton. X = {x1, ..., xk} is the set of input signals coming from the environment to the input of the automaton (the input alphabet). D = {d1, ..., dm} is a finite set of actions available to the automaton (the output alphabet). S = {s1, ..., sn}, 2 <= m <= n, is a finite set of internal states of the automaton (the internal alphabet). The operation of the automaton at discrete moments of time t is unambiguously defined by two functions: the internal state transition function, st+1 = F(st, xt), together with the initial internal state s0 = S(t0) at time t = t0; and the output (action) function dt = G(st), which maps internal states to actions. General statement of the problem Let us define some environment E in which automata of the studied types will function, and let there be types of automata for which the degree of expedient behavior in the environment E can be established in the manner described above. It is required: - to determine the degree of influence of the memory depth q of each of the considered types of automata on the degree of expediency of its behavior in the environment E; - based on the results obtained, to determine the type of automaton design that reaches a given level of expediency with the lowest possible memory depth q = q*; - if there are several equivalent designs, to indicate the one that is simplest to implement.
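The formal definition above (transition function F, output function G, initial state s0) can be expressed as a minimal sketch; the class and function names below are illustrative, not taken from the article:

```python
# Sketch of the formal automaton definition: st+1 = F(st, xt), dt = G(st).
# Names (FiniteAutomaton, step) are illustrative assumptions.

class FiniteAutomaton:
    def __init__(self, F, G, s0):
        self.F = F      # transition function: (state, input signal) -> next state
        self.G = G      # output function: state -> action
        self.s = s0     # initial internal state s0 = S(t0)

    def step(self, x):
        """Receive input signal x, update the internal state, return the action."""
        self.s = self.F(self.s, x)
        return self.G(self.s)

    def action(self):
        return self.G(self.s)

# Example: a two-state automaton that keeps its state on "reward" (x = 1)
# and toggles its state (and hence its action) on "penalty" (x = 0).
toggle = FiniteAutomaton(F=lambda s, x: s if x == 1 else 1 - s,
                         G=lambda s: s, s0=0)
```

The example automaton is the q = 1 special case discussed later: reward preserves the state, penalty switches it.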
Biological background The results of experimental biology called for the formulation of a cybernetic "white box" with expedient behavior. The experiment was as follows [9]: an experimental animal was placed at the base of a T-shaped maze with a choice of one of two actions: "turn left" or "turn right." Under the experimental conditions, at the end of each of the two branches, favorable (or unfavorable) conditions, unknown to the animal, were realized with independent probabilities for each of the two events (branches). It was necessary to establish the ability of experimental animals to distinguish environmental properties of a probabilistic nature. It turned out that, despite selection errors at the beginning of the series of experiments, the animals subsequently correctly associated the turn they chose (the action they performed) with the one for which the probability of a penalty (the probability of getting into an unfavorable situation) was minimal.
Stationary environment The stationary environment E is a mathematical description of the conditions of the T-shaped maze. In the T-shaped maze, only two actions were available to the experimental animal: "turn left" and "turn right," each selected branch having its own probability of reward, Pl or Pr. A generalization of this situation is a stationary environment with m outcomes (actions). For each of the m actions, a probability is specified: of reward pi, or of penalty 1 - pi, i = 1, ..., m. Choosing the optimal action in such an environment reduces to determining the action with the maximum probability of reward (the minimum probability of penalty). At the same time, neither the experimental animal in the biological experiments nor the automaton corresponding to it in behavior knows the initial probability values; both are forced to obtain this information indirectly, through the reward and penalty signals received from the environment.
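A stationary environment E of this kind is straightforward to simulate. The sketch below, with illustrative names and probabilities, returns +1 (reward) with probability p[i] and -1 (penalty) otherwise for a chosen action i:

```python
import random

# Hedged sketch of a stationary environment E with m actions:
# action i is rewarded with probability p[i], penalized otherwise.
# Class name and probability values are illustrative assumptions.

class StationaryEnvironment:
    def __init__(self, p, rng=None):
        self.p = p                        # reward probability per action
        self.rng = rng or random.Random()

    def respond(self, action):
        """Return +1 (reward) or -1 (penalty) for the chosen action."""
        return 1 if self.rng.random() < self.p[action] else -1

# Two-branch "T-maze": the left branch rewards with probability 0.3,
# the right branch with probability 0.8; the optimal action is 1.
env = StationaryEnvironment([0.3, 0.8], rng=random.Random(42))
```

Neither the animal nor the automaton sees `p`; it only observes the stream of ±1 signals from `respond`.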
Types of automaton designs The first design of an automaton capable of behaving expediently in the stationary environment E described above is the automaton with linear tactics proposed by M. L. Tsetlin. The L1 automaton (with memory depth q = 1) consists of m states S = {s1, ..., sm}, and each state si is assigned an action di, i = 1, ..., m. The L1 automaton functions as follows. At some moment of time t = t*, being in state si, the L1 automaton performs the output action di, i = 1, ..., m. In response, the environment generates a signal that takes the value "reward" with probability pi (or "penalty" with probability 1 - pi). This signal is fed to the input of the L1 automaton. If the L1 automaton receives a "penalty" signal, it moves to state si+1 and therefore switches its external action from di to di+1. Upon receiving a "reward" signal, the L1 automaton remains in its current state si, and there is no change of action. Thus, at the next moment of time t* + 1, the L1 automaton again performs the action corresponding to its internal state, and the interaction of the automaton with the environment repeats as described above. By assigning q consecutive states to each of the m actions di, we obtain a sequence of linear automata with monotonically increasing memory depth q. Tsetlin showed that the sequence of such automata L1, L2, ..., Lq functioning in a stationary environment E is asymptotically optimal: the greater the memory depth of a linear automaton Lq, the longer such an automaton performs the optimal action (the one with the maximum probability of reward) and almost never leaves the states associated with it. A refinement of the automaton with linear tactics is the Krinsky automaton. It is characterized by the fact that, upon receiving a reward, it always moves to the deepest state corresponding to the action it is currently performing.
Upon receiving a penalty from the environment, this automaton reacts in the same way as the automaton with linear tactics, decreasing the state number of the current action by one. For this automaton, the theorem on asymptotic optimality in stationary environments has also been proved. The Robbins automaton is the next in terms of the degree of enhancement of the inertia property. It differs from the Krinsky automaton in that, when changing an action, the Robbins automaton immediately moves to the deepest state corresponding to the new action; otherwise, it does not differ from the Krinsky automaton. The theorem on asymptotic optimality in stationary environments also holds for it. Note that for the automaton with linear tactics, the Krinsky automaton, and the Robbins automaton with memory depth q = 1, the functioning algorithms coincide completely and the differences between them disappear, which will be illustrated in the analysis of their functioning. A somewhat different approach to enhancing inertia is applied in the design of the Krylov automaton, which combines elements of the behavior of both deterministic and stochastic automata. Upon receiving a reward, it functions in the same way as the automaton with linear tactics, moving toward the deepest state number corresponding to the action for which it received the reward from the environment E. Upon receiving a penalty, the Krylov automaton "flips a coin": on heads, the automaton does not change its state; on tails, it decreases the number of the current state corresponding to the penalized action. This automaton is also asymptotically optimal in stationary environments.
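The four update rules above can be collected into one sketch. The state is represented as a pair (action, depth), depth 1..q; on a penalty in the shallowest state the automaton switches cyclically to the next action. The cyclic switching order and all names are illustrative assumptions, not details fixed by the article:

```python
import random

# Hedged sketch of the four designs: linear tactics ("linear"),
# Krinsky, Robbins, and Krylov, for m actions and memory depth q.

class Automaton:
    def __init__(self, kind, m, q, rng=None):
        self.kind, self.m, self.q = kind, m, q
        self.action, self.depth = 0, 1
        self.rng = rng or random.Random()

    def update(self, signal):              # signal: +1 reward, -1 penalty
        if signal > 0:                     # --- reward ---
            if self.kind in ("krinsky", "robbins"):
                self.depth = self.q                     # jump to the deepest state
            else:                                       # linear tactics, Krylov
                self.depth = min(self.q, self.depth + 1)
            return
        # --- penalty ---
        if self.kind == "krylov" and self.rng.random() < 0.5:
            return                                      # "coin flip": keep the state
        if self.depth > 1:
            self.depth -= 1                             # retreat one state
        else:                                           # switch to the next action
            self.action = (self.action + 1) % self.m
            self.depth = self.q if self.kind == "robbins" else 1
```

With q = 1 the linear, Krinsky, and Robbins variants behave identically, matching the observation in the text above.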
Dynamic environments From the outset, game theory served as the formal language for describing both stationary and dynamic environments and the automata functioning in them. The actions of the automaton from the set D are placed in one-to-one correspondence with the strategies available to the player in the corresponding (given) game with a finite number of strategies. For example, the set of m actions available to the automaton, together with the probabilities of rewards (penalties) in a stationary environment E, is formulated in the language of game theory as a game against nature: 1) a single-column payoff matrix with m non-negative values is specified; 2) the elements of the matrix are normalized by the maximum payoff value and treated as the probabilities of reward from the stationary environment E; 3) the reward value is one, the penalty value is zero; 4) the actions of the automaton (player 1) correspond to the row numbers of the matrix; 5) nature (player 2) has chosen its strategy (the single column) and does not change it throughout the play; 6) the aim of the game is to maximize the automaton's winnings. Player 1 (the automaton), at each moment of time t, plays a round against nature and, depending on the selected action d* = di and the probability p* = pi corresponding to this action, receives as input a reward (or penalty) signal from the environment E, i = 1, ..., m. Since the optimal strategy of the game against nature is known and corresponds to choosing the action with the maximum price pi = pmax, this creates an opportunity to compare the effectiveness of an informed player, who knows the entire payoff matrix, with the actions of an automaton that does not know the contents of this matrix.
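Steps 1)-3) of the formulation above amount to normalizing a payoff column into reward probabilities. A minimal sketch, with illustrative payoff values:

```python
# Hedged sketch of steps 1)-3): a single-column payoff matrix with
# non-negative entries is normalized by its maximum value, and the
# normalized entries serve as reward probabilities of the stationary
# environment E.  The payoff values below are illustrative.

def payoffs_to_probabilities(column):
    """Normalize a payoff column to reward probabilities in [0, 1]."""
    peak = max(column)
    return [value / peak for value in column]

p = payoffs_to_probabilities([2.0, 5.0, 4.0])
# The optimal action is the row with the maximum price p_i = p_max.
best = max(range(len(p)), key=lambda i: p[i])
```

The informed player would always pick `best`; the automaton must discover it from reward/penalty signals alone.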
From the theorems on the asymptotic optimality of the types of automata discussed above, it follows that the greater the memory depth q of an automaton, the more precisely its cumulative actions correspond to the optimal strategy.
A game of two automata The case of a two-player zero-sum matrix game is an example of how one automaton is able to create an environment for another automaton. In this case, the row numbers correspond to the actions of the first automaton, and the column numbers of the payoff matrix to the actions of the second. If the first automaton receives a reward with probability p, then the second receives a penalty, and vice versa, implementing the principle of an antagonistic zero-sum game. The task of the first automaton is to maximize rewards; the task of the second is to minimize penalties. If the game admits a solution in pure strategies, the automata implement a stationary environment for each other, in which an increase in expediency is associated with an increase in the memory depth q. A solution in mixed strategies requires the automata to have an optimal memory depth that matches the properties of this dynamic environment and the type of automaton. The game of two automata is constructed similarly in the case of non-cooperative, non-antagonistic bimatrix games with a finite number of strategies.
The placement game An example of how a collective of automata forms a dynamic environment for an individual automaton is the placement game. Its biological prerequisite is the following task: 1) there is a finite number of sites (pastures or hunting areas) with varying degrees of food productivity, on which animals are placed in some way; 2) if two or more animals are currently on the same site, the food productivity of the site is divided equally among them. The animal's task is to maximize the amount of food it obtains. The placement game is easily formulated in the language of game theory as follows: 1) let m strategies D = d1, ..., dm be given, where the value of each strategy equals the mathematical expectation of the "reward/penalty" event when it is selected; 2) within the given strategies there operate n automata; 3) it is required to find conditions that ensure the maximum winnings for the entire set of automata. These conditions fall into three main groups: - the types of the studied automata, which determine the structure and features of their behavior; - the value of the memory depth q, which determines the degree of inertia of the automaton (the speed of switching between actions); - structural changes in the automata aimed at organizing various ways of interaction in the collective. We will analyze the types of automata discussed above: the automaton with linear tactics, the Krinsky automaton, the Robbins automaton, and the Krylov automaton. We will also analyze the effect of the memory depth q on the degree of inertia of the automata, i.e., on the speed of their switching between actions in the environment E. The interaction of the automata varies from complete isolation from each other to the introduction of a common-cash-register procedure external to the structure of the automata.
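One round of the placement game can be sketched as follows. Each automaton chooses a site; the environment draws a reward for each occupied site and, when several automata share a site, only one of them (chosen at random here) receives the reward while the others are penalized. This splitting rule is one illustrative reading of the description above; names and probabilities are assumptions:

```python
import random

# Hedged sketch of a placement-game round: an action chosen by several
# automata can be rewarded for only one of them.

def placement_round(choices, p, rng):
    """choices[k] is the site chosen by automaton k; returns a +1/-1 signal per automaton."""
    signals = [-1] * len(choices)
    sites = {}
    for k, site in enumerate(choices):
        sites.setdefault(site, []).append(k)
    for site, occupants in sites.items():
        if rng.random() < p[site]:             # the site pays out this round
            winner = rng.choice(occupants)     # only one occupant is rewarded
            signals[winner] = 1
    return signals

rng = random.Random(7)
# Automata 0 and 1 compete for site 0; automaton 2 is alone on site 2.
signals = placement_round([0, 0, 2], p=[1.0, 0.5, 1.0], rng=rng)
```

The isolated automaton on site 2 sees the ordinary stationary environment; the two automata sharing site 0 see the diluted, dynamic one.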
Estimation of the switching speed The speed of switching between the actions of an automaton determines the minimum time in which the automaton is able to try all available strategies. It is required to determine the time interval over which a specific type of automaton must operate in order to collect reliable statistics. Since all automata form a homogeneous collective, and the choice of action by each of the n automata is random, then, starting at some moment of time t = t*, the sums of rewards and penalties of the automata across the whole collective will take statistically similar values. Accordingly, as an indicator of such a state of the "collective of automata" system, it is advisable to take the coefficient of variation of the sums of rewards and penalties over the whole collective. Obviously, if the coefficient of variation does not increase, the collective of automata has reached a statistically equilibrium (non-improving) state. Computational experiments have shown that the best result is achieved when the operating time of a system of n automata is determined by the operating time of a similar reference automaton that does not belong to the collective under study, i.e., is not influenced by the collective but operates in the same environment. In this connection, let us compare the operating time of an individual automaton with that of an automaton in a collective. Statement: the average speed of switching between actions of any of the isolated automata in the placement game is not lower than the corresponding speed of an automaton operating in the equivalent stationary environment. Indeed, if the automaton, when choosing the current action d* = di, i = 1, ..., m, has no neighbors in the current round, its new state is determined only by the environment's response to this action, in accordance with the probability of reward p* = pi.
In this case, its rate of change of actions coincides with that of the automaton in the corresponding stationary environment E. If, when choosing the strategy d* = di, the automaton in question does have neighbors (one or more) on the i-th site, the situation splits into two cases. If the environment's reward signal goes to the automaton in question, then in the current round the presence of neighbors does not affect its speed of switching between actions. If, on the contrary, upon a reward signal from the environment, the division of food leaves the automaton in question with a penalty, then such a signal can only reduce (at least, not increase) the time it stays on the current site. Thus, the average speed of switching between actions of an isolated automaton in the placement game is not lower than the switching speed of the automaton in the corresponding stationary environment. In practice, the switching speed of an automaton in the collective is higher than that of the reference automaton.
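The equilibrium indicator described above, the coefficient of variation of the collective's cumulative payoff sums, can be sketched as follows; the window size and tolerance are illustrative assumptions:

```python
import statistics

# Hedged sketch: the collective is taken to be in a statistically
# equilibrium (non-improving) state when the coefficient of variation
# of the automata's payoff sums stops growing.

def coefficient_of_variation(totals):
    """Std. deviation of the payoff sums divided by the absolute mean."""
    mean = statistics.mean(totals)
    return statistics.pstdev(totals) / abs(mean) if mean else float("inf")

def is_equilibrium(history, window=3, tol=1e-3):
    """True if the coefficient of variation has not grown over the last `window` steps."""
    if len(history) < window + 1:
        return False
    recent = history[-window - 1:]
    return all(b <= a + tol for a, b in zip(recent, recent[1:]))
```

Here `history` would hold the coefficient of variation computed at successive moments of time for the whole collective.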
Conducting a computational experiment Ten sites are specified for the placement game as the actions available to the automata. The reward value is (+1), the penalty value is (-1). These values determine the probabilities of receiving rewards given in Table 1. Table 1. Payoff matrix
At the first stage of the series of computational experiments, it was necessary to determine the effectiveness values of the studied types of automata and statistically reliable time intervals for the formation of these values. As the reference value for such a time interval, the time (number of moves) taken by a fixed automaton with a given memory depth to pass sequentially through all the sites (from the worst to the best) was chosen. As the measure of effectiveness of a given automaton, it is advisable to take the average payoff per move, i.e., the sum of rewards and penalties received during a single pass through all the sites (while performing all available actions). In fact, this value equals the mathematical expectation of the automaton's rewards and penalties over the entire set of actions in the case of their single execution. Since the time an automaton spends on a particular site depends on the cumulative effect of the automaton's design, the current memory depth, and the sequence of penalties and rewards generated by the random number generator, it was decided to run a series consisting of a fixed number of independent passes. This allowed us to:
To conduct the series of computational experiments, it was necessary to create specialized software implementing the behavior algorithms of the studied types of automata and of the environment under the rules of the placement game; the software makes it possible to set the types of automata, their characteristics (the memory depth), various environment parameters, and groups of automata as sets of independent agents. A series of 10 independent experiments was conducted for each type of automaton, after which the data obtained were averaged. For each series of experiments, the random number generator was started from a fixed initial value, generating an identical pseudo-random stationary environment E for all types of automata. Let us denote the types of automata as follows: the automaton with linear tactics, ALT; the Krinsky automaton, AKRN; the Robbins automaton, ARBB; the Krylov automaton, AKRL.
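The experimental protocol above, a series of independent runs started from fixed seeds so that every automaton type sees an identical pseudo-random environment, with results averaged over the series, can be sketched as follows. The inner run function stands in for any of the four automaton types; all names, seeds, and probabilities are illustrative:

```python
import random

# Hedged sketch of the experimental set-up: 10 independent runs per
# automaton type, each seeded deterministically, results averaged.

def run_series(run, n_runs=10, base_seed=123):
    results = [run(random.Random(base_seed + i)) for i in range(n_runs)]
    return sum(results) / len(results)

def alt_q1_run(rng, p=(0.2, 0.9), steps=1000):
    """Average payoff per move of a q = 1 linear-tactics automaton:
    keep the action on reward (+1), switch on penalty (-1)."""
    action, total = 0, 0
    for _ in range(steps):
        rewarded = rng.random() < p[action]
        total += 1 if rewarded else -1
        if not rewarded:
            action = (action + 1) % len(p)
    return total / steps

avg = run_series(alt_q1_run)
```

Passing the same `base_seed` to the series for each automaton type reproduces the article's device of an identical pseudo-random environment across types.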
Table 2. Mathematical expectations of rewards and penalties
Considering the mathematical expectation achieved by each automaton as the result of executing some algorithm, we will use the average operating time of the automaton as an estimate of the speed with which this algorithm reaches that result. Table 3. Average operating time of the automata
To analyze the behavior of automata in the placement game, collectives of 4 automata were selected, and the sum of rewards and penalties over the whole collective was chosen as the criterion of optimal behavior, as the most flexible one. In addition, this optimality value reflects the average placement of the automata over the sites: with increasing memory, the automata tend to settle on the best sites, maximizing the sum of rewards and minimizing competition within the collective as a negative factor.
Table 4. Integral estimates of the effectiveness of functioning in the placement game.
The previous table shows the average time (number of moves) required for the collective to reach the stated result, since the operating time of a reference automaton of the same type serves as the operating time of the collective. The Krylov automaton [10] has a pronounced mechanism of behavioral inertia, which ensures both the longest (compared with the other automata) stay in the best state and the slowest approach to the best placement of the collective as a whole.
Conclusion The systems (collectives) discussed above, consisting of fairly simple automata, possess the property of expedient behavior. Such behavior is observed even with a small number of automata, and a variety of behaviors is achieved even with minor changes in the way input signals from the environment are processed. We have considered 4 types of automata and identified two automaton strategies: 1) owing to a high switching speed, visit all available sites (the automaton with linear tactics); 2) owing to high inertia, remain in the best state for a long time (the Krylov automaton). However, the second result is achieved at the cost of a significantly longer time for the collective to reach an optimal, statistically stable state, which, even at relatively small memory depths, leads to significant computational costs, as illustrated by the table of average operating times of the reference automata. The sums of rewards and penalties are numerically equal to the sums of the mathematical expectations of the best sites occupied by the automata; therefore, further complication of the automaton designs, i.e., an increase in memory depth, cannot lead to a significant increase in the effectiveness of the collective as a whole, since the difference between the best possible value and the achieved result is on average no more than 10-20 percent of the possible, while the (model and real) operating time increases significantly. The results obtained can serve as a basis for forming collectives of automata that are optimal in complexity and computational cost for solving more complex optimization problems. These tasks include:
References
1. Stefanyuk, V.L. (2004). Local organization of intelligent systems. Moscow: Fizmatlit.
2. Gaaze-Rapoport, M.G., & Pospelov, D.A. (2019). From amoeba to robot. Models of behavior. Moscow: Lenand.
3. Zhuravlev, A.L., Savchenko, T.N., & Golovina, G.M. (2010). Mathematical psychology: the school of V. Yu. Krylov. Series: Scientific schools of the Institute of Psychology of the Russian Academy of Sciences. Moscow: Publishing House of the Institute of Psychology of the Russian Academy of Sciences.
4. Zhdanov, A.A. (2024). Autonomous artificial intelligence. Moscow: Knowledge Laboratory.
5. Dimitrichenko, D.P. (2023). Optimization of a recurrent neural network using automata with a variable structure. Software Systems and Computational Methods, 4, 30-43. doi:10.7256/2454-0714.2023.4.69011 Retrieved from http://en.e-notabene.ru/itmag/article_69011.html
6. Karpov, V.E., Karpova, I.P., & Kulinich, A.A. (2019). Social communities of robots. Moscow: URSS.
7. Karpov, V.E., & Koroleva, M.N. (2022). On the issue of formalizing the ethics of behavior of a collaborative robot. Information and Mathematical Technologies in Science and Management, 4(28), 223-233.
8. Pospelov, D.A. (1966). Games and automata. Moscow: Energiya.
9. Tsetlin, M.L. (1969). Research in the theory of automata and modeling of biological systems. Moscow: Nauka.
10. Pospelov, D.A. (1970). Probabilistic automata. Moscow: Energiya.
11. Varshavsky, V.I. (1973). Collective behavior of automata. Moscow: Nauka.
12. Varshavsky, V.I., & Pospelov, D.A. (1984). The orchestra plays without a conductor: reflections on the evolution of some technical systems and their management. Moscow: Nauka.
First Peer Review
Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.
Second Peer Review