Qi ZHANG, Zongwu XIE, Baoshi CAO, Yang LIU
State Key Laboratory of Robotics and System, Harbin Institute of Technology, Harbin 150001, China
KEYWORDS Bolt assembly; Policy initialization; Policy iteration; Reinforcement learning (RL); Robotic assembly; Trajectory efficiency
Abstract Bolt assembly by robots is a vital and difficult task for replacing astronauts in extravehicular activities (EVA), but the trajectory efficiency of inserting the wrench into the hexagonal hole of the bolt still needs to be improved. In this paper, a policy iteration method based on reinforcement learning (RL) is proposed, by which the problem of trajectory efficiency improvement is formulated as an RL-based objective optimization problem. First, the projection relation between raw data and the state-action space is established, and a policy iteration initialization method is then designed based on this projection to provide the initial policy for iteration. Policy iteration based on the protective policy is applied to continuously evaluate and optimize the action-value function of all state-action pairs until convergence is obtained. To verify the feasibility and effectiveness of the proposed method, a noncontact demonstration experiment with human supervision is performed. Experimental results show that the initialization policy and the generated policy can be obtained by the policy iteration method within a limited number of demonstrations. A comparison between experiments with two different assembly tolerances shows that the convergent generated policy possesses higher trajectory efficiency than the conservative one. In addition, this method ensures safety during the training process and improves the utilization efficiency of demonstration data.
In-space assembly (ISA) technologies can effectively expand space structures, improve spacecraft performance, and reduce launch requirements.1–3 Bolt assembly is an indispensable on-orbit extravehicular activity (EVA) in the future work of the International Space Station (ISS)/Chinese Space Station (CSS) and large space telescope tasks. Currently, astronauts have to use an electric wrench tool to screw hexagonal bolts in the extravehicular scene. This activity is time-consuming and lacks operational flexibility and safety,4,5 so it is expected that space robots can replace astronauts in this activity.6–8 The space robot should autonomously make decisions for wrench insertion during assembly according to predetermined rules in the on-orbit extravehicular environment.
Previous studies have proposed many vision-based peg insertion methods from multiple routes. Edge-fitting and shadow-aided positioning can facilitate alignment and peg-in-hole assembly with a visual-servo system.9 An edge detection method was proposed for post-processing applications based on edge proportion statistics.10,11 With the aid of vision recognition, eccentric peg-in-hole assembly of the crankshaft and bearing was achieved.12 However, the software and hardware of the space robot data processing system restrict the processing capability of visual information, so insufficient visual information cannot satisfy real-time on-orbit visual servoing. With the obtained sparse visual information, it is difficult to distinguish the hexagonal contour features of the wrench and hole, but it is possible to recognize the approximately circular features to locate the hexagonal hole of the bolt.13 Force-based methods provide another way of insertion assembly by robots, and their core is to guide the assembly trajectory by contact force. However, analyzing the relationship between the contact force and the trajectory required for assembly is rather difficult,14,15 so several types of methods were designed from the perspectives of control, shape recognition and contact modeling.16,17 Force control methods were typically adopted to solve peg-in-hole problems.18–20 By analyzing the geometric relationship through force-torque information, a hole detection algorithm based on shape recognition tried to find the direction of a hole relative to a peg.21 A guidance algorithm was proposed to assemble parts of complex shapes, and the force required for assembly was decided by kinesthetic teaching with a Gaussian mixture model.22 Force information can help to maximize the regions of attraction (ROA), and the complex problem of peg-in-hole assembly planning was solved analytically.23,24 Besides, a static contact model was also used for the peg-in-hole task, and the relative pose was estimated by force-torque maps.25 To eliminate the interference of external influences, a noncontact robotic demonstration method with human supervision was used to collect the pure contact force, so as to analyze the contact model in the wrench insertion task.26
Even though the analytical contact model has been used in some cases, robotic insertion assembly efficiency requires further improvement. Reinforcement learning (RL) has shown its superiority in making optimal decisions, but robotic application of RL still faces various barriers. High-dimensional and unexplainable state definition is one of them. Deep reinforcement learning can handle high-dimensional image inputs with neural networks.27 The end-to-end approach utilizes images as both state and goal representation.28,29 High-dimensional image information is processed by a neural network with plenty of parameters, so the resulting ambiguous state expressions do not have clear physical meanings. Sample efficiency is also a major constraint for deploying RL in robotic tasks, as the interaction time is so long that experimenters cannot bear it. Some attempts have made progress in simulation, but application of RL in real-world robotic tasks is still challenging.30,31 Training times can be reduced by simultaneously executing the policy on multiple robots.32 Exploration sufficiency before RL policy iteration is another obstacle in robotic tasks, and the existing methods of exploration initialization and mild policy are impractical in robotic applications. The initial position cannot be set arbitrarily in the robotic assembly problem, which makes exploration initialization infeasible. The mild policy selects each action in any state with nonzero probability, but the frequency of selecting certain actions is not high enough during data acquisition. In addition, risk avoidance during policy training is also an important issue, because the robotic manipulator is continuously subjected to an unknown and dangerous interacting contact force. From another point of view, this contact force can also provide guidance for assembly. In a word, to apply force-based RL to the robotic wrench insertion task, we need to overcome the following three difficulties:
1. The state-action space design in RL requires consideration of the physical meaning of the contact state.
2. Sample efficiency cannot be neglected when sufficiently traversing the state-action space in policy iteration initialization.
3. The security issue should be considered, especially in policy iteration.
However, none of the methods mentioned above can meet these three key demands of force-based RL robotic manipulation at the same time. To cope with this problem, a novel policy iteration method including state-action definition, an initialization method and a protective policy is proposed in this paper to realize the robotic wrench insertion task. The ground validation experiment using the noncontact demonstration method with human supervision proves that the proposed policy iteration method can improve the trajectory efficiency of robotic assembly in a safe way.
The rest of the paper is organized as follows. Section 2 mainly illustrates the proposed policy iteration method. The experiment platform, execution and results are introduced in Section 3. Section 4 presents the conclusions and future work.
In this section, the process of policy iteration is elaborated. First, how to improve the trajectory efficiency of the insertion task is defined as an objective optimization problem. To acquire a policy with higher efficiency, it is necessary to design the RL state-action space by projecting from raw force-position data. Then, policy iteration initialization is conducted to traverse all state-action pairs with higher sample efficiency. Finally, safety during the training process of policy iteration is guaranteed by using the protective policy.
In this section, the robotic wrench insertion task is divided into several stages to illustrate the optimization objective, and then the trajectory efficiency is defined with a mathematical expression. Since the contour shapes of the wrench and the bolt hole are hexagonal, robotic wrench insertion into the bolt hole requires multiple steps. A relatively efficient sequence for the entire insertion assembly can be divided into three stages, with the wrench coordinate frame defined in Fig.1.
(1) rotating along the Z-axis (Rz+) till alignment is achieved in the Rz dimension.
(2) moving along the Z-axis (Pz+) till the end face of the wrench contacts the bottom of the hole.
Fig.1 Three stages of wrench insertion into hexagonal hole of bolt.
(3) adjusting the position (Px or Py) or orientation (Rx or Ry) of the robotic end-effector.
Stage (2) and Stage (3) can be exchanged according to real-time conditions as shown in Fig.1, which illustrates an intuitive stage presentation of wrench insertion into the hexagonal hole of the bolt.
For Stage (1), the end-effector wrench tool of the robotic manipulator only needs to be rotated along the direction in which the bolt can be tightened. When a lateral side of the wrench comes into contact with a lateral side of the hexagonal hole as shown in Fig.1, the wrench tool bears a torque opposite to the direction of rotation. This torque value can be measured by the 6-dimensional force/torque sensor, serving as the criterion for the end of Stage (1).
For Stage (2), the robotic end-effector wrench tool only needs to move along the direction of insertion. When the end face of the wrench tool fits with the bottom of the hexagonal bolt hole as shown in Fig.1, the wrench bears a positive pressure opposite to the moving direction. The pressure value measured by the 6-dimensional force/torque sensor can be utilized as the criterion for the end of Stage (2).
Since we only need to focus on whether the motion in one dimension is finished in the trajectories of Stage (1) and Stage (2), multi-dimensional decision making is unnecessary during trajectory planning in these stages. Moreover, in these two stages, it is unnecessary to consider trajectory efficiency improvement. In contrast, the motion decision in different dimensions of position and attitude must be considered in Stage (3). Thus, Stage (3) is particularly studied in this paper to improve the efficiency of the assembly trajectory.
For Stage (3), the necessary trajectory from an arbitrary point to the terminal point is the "sum" of the shortest position and orientation distances, and the actual trajectory is the "sum" of the actual position and orientation paths. The difference between these two concepts is illustrated in Fig.2.
Thus, the trajectory efficiency is defined as the ratio between the necessary trajectory (Rn and Pn) and the actual trajectory (Ra and Pa) as follows:
Fig.2 Comparison between necessary trajectory and actual trajectory.
This definition of trajectory efficiency can be adopted to evaluate and improve assembly efficiency of robotic wrench insertion task.
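For concreteness, the sketch below computes a trajectory-efficiency value in the spirit of this definition; since the exact form of Eq. (1) is not reproduced in this excerpt, the direct summation of the position and orientation components is an assumption made only for illustration.

```python
def trajectory_efficiency(r_necessary, p_necessary, r_actual, p_actual):
    """Ratio of the necessary trajectory to the actual trajectory.

    r_* are accumulated orientation path lengths and p_* are accumulated
    position path lengths. Summing the two components directly is an
    assumption made only for illustration; Eq. (1) defines the exact form.
    """
    actual = r_actual + p_actual
    return (r_necessary + p_necessary) / actual if actual > 0 else 1.0
```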
To improve the assembly trajectory efficiency by means of RL, the raw experiment data in the assembly task should be projected into the state-action space of the RL framework. Force information, instead of vision information, is utilized to guide the robotic assembly task, so the state space should be designed according to force data. The properly designed state-action space correlates with the contact model established in the 2-dimensional space,26 as shown in Fig.3. As the contact model is established in a 2-dimensional plane rather than the 3-dimensional space, the wrench is restricted to adjustment motion in this plane instead of directly moving in the ideal direction. In this sense, dimensionality reduction is achieved by choosing the position or orientation adjustment in a specific direction in Stage (3).
Fig.3 Relationship between force-torque direction and position-orientation deviation.
Two contact points in the simplified contact model result in contact forces. These two contact forces can be denoted as F1 and F2, and can be regarded as two parallel forces. In fact, F = F1 − F2 is the resultant of the contact forces measured by the force sensor, and M is the torque value measured by the force sensor. The direction of the resultant force and the torque/force ratio reflect the position deviation and orientation deviation.26 Thus, according to the pose deviation determined by the contact model shown in Fig.3, active trajectory planning is simplified to judging which decision is better, position or orientation adjustment, in this 2-dimensional plane.
In the actual trajectory decision-making process, only the plane with the larger horizontal contact force is considered, and the force-torque plane orthogonal to it is ignored in each step of adjustment (Fig.1). Therefore, the position (represented by P) or orientation (represented by R) of the robotic end-effector is adjusted only in this plane each time.
Then, the state and action can be designed according to Fig.3, as expressed in Eqs. (2)–(3).
After determining the principal plane for analysis, raw force-torque data are selected to construct the state variable, and the motion command can then be restricted to two options. Data dimensionality reduction is achieved accordingly. Specifically, the robot receives force information as the state variable, and the decision action is described as 6-D motion of the robotic end-effector. After the motion of the end-effector, the contact state changes accordingly. The force data and the motion data can be recorded in the format shown in Eq. (4).
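A minimal sketch of the per-step record implied by Eq. (4) is given below, assuming each row pairs the six force-torque readings at the moment a command is issued with the cumulative 6-D pose of the end-effector after the command; the field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DemoStep:
    """One recorded row of a demonstration (hypothetical field names)."""
    # 6-dimensional force/torque reading when the motion command is issued
    fx: float
    fy: float
    fz: float
    mx: float
    my: float
    mz: float
    # cumulative 6-D position/orientation of the end-effector after the command
    px: float
    py: float
    pz: float
    rx: float
    ry: float
    rz: float
```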
To improve the utilization efficiency of raw data, a data processing method is specially designed from the perspective of RL training, as shown in Eq. (5). The conditions of task completion are also illustrated.
Due to the high expense of acquiring robotic manipulation data in the real physical environment, we should make full use of the obtained demonstration data. According to the contact model and the RL model, the contact forces Fx and Fy in the horizontal directions are compared first. The direction of the greater force determines the principal analysis plane at this moment (ignoring the contact force in the other, orthogonal plane), as shown in Fig.4.
After selecting the analysis plane at any time, the raw force data (Fx or Fy, Mx or My) that can be utilized to generate the designed state are determined (Fig.4). Analyzing the contact conditions of all situations with the same simplified 2-dimensional contact model shown in Fig.3 significantly improves data utilization efficiency.
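The plane selection just described can be sketched as follows; the pairing of the force along one horizontal axis with the torque about the other axis is an assumption consistent with Fig.4, and the returned plane labels are hypothetical.

```python
def select_analysis_plane(fx, fy, mx, my):
    """Choose the force-torque plane with the larger horizontal contact force.

    Returns a plane label together with the force/torque pair used to
    build the state (pairing and labels assumed for illustration).
    """
    if abs(fx) >= abs(fy):
        return "XZ-plane", fx, my  # force along X paired with torque about Y (assumed)
    return "YZ-plane", fy, mx      # force along Y paired with torque about X (assumed)
```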
Obviously, the value of the horizontal force affects the motion decision. For the specific task in this paper, the force values are widely scattered, so the range of force values is divided into 5 segments with an interval of 1 N to reduce the number of intervals and simplify judgement.
According to the simplified contact model, both the horizontal force and the ratio between the torque and the horizontal force (in the same plane) reflect the position-orientation deviation. Specifically, this ratio may fall into 3 different interval ranges according to the distance between the end point of the tool and the force-torque sensor. Therefore, the defined two-dimensional state space, SF and SL, needs to be discussed by categorization. The detailed correspondence between raw force data and state is shown in Fig.5.
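A minimal sketch of the resulting two-dimensional discretization is shown below; the 1 N bin width and the five SF segments follow the text, while the two SL ratio boundaries are placeholders, since the actual interval boundaries are given in Fig.5.

```python
def discretize_state(f_horizontal, m_plane, ratio_bounds=(0.05, 0.15)):
    """Map the selected force/torque pair to a discrete (SF, SL) state.

    `ratio_bounds` are placeholder boundaries (in metres) for the three
    torque/force-ratio intervals; the real values come from Fig. 5.
    """
    f = abs(f_horizontal)
    # SF: 1 N wide bins clipped to five segments SF1 ... SF5
    sf = min(int(f // 1.0), 4) + 1
    # SL: the torque/force ratio estimates the lever arm of the contact point
    ratio = abs(m_plane) / f if f > 1e-6 else float("inf")
    if ratio < ratio_bounds[0]:
        sl = 1
    elif ratio < ratio_bounds[1]:
        sl = 2
    else:
        sl = 3
    return f"SF{sf}", f"SL{sl}"
```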
In the simplified contact model, the robot only needs to determine a unidirectional position or orientation motion at each moment. In fact, the robot end-effector command has 8 options in total in Stage (3). Thus, after establishing the state space, the relation between the motion command and the action space is established according to the plane determined by the horizontal contact force and the state variable, as shown in Fig.4. In previous work, the conservative policy has been proved to be feasible,26 and it can be defined under the framework of RL as in Table 1. AR and AP correspond to Rx or Ry and Px or Py, respectively, in Fig.4.
Fig.4 Restricted options of the actions according to principal analysis plane and state variables.
Fig.5 State definitions according to different situations.
Obviously, it can be deduced that SL3 represents one-side contact with only position deviation. Thus, the policy at state SL3 outputs the concrete action AP, and the subsequent policies in this paper all obey the same principle. The policy in Table 1 completes the task with the goal of keeping the contact force small throughout the whole process at the expense of efficiency, which explains the motivation of this work.
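Under this framework a tabular policy is simply a lookup from the (SF, SL) state to AR or AP. The sketch below is illustrative only: apart from the SL3 → AP rule stated above, the entries are placeholders rather than the actual contents of Table 1.

```python
# Illustrative tabular policy: keys are (SF, SL) states, values are actions.
# Only the "SL3 -> AP" rule comes from the text; other entries are placeholders.
conservative_policy = {
    ("SF1", "SL1"): "AR", ("SF1", "SL2"): "AR", ("SF1", "SL3"): "AP",
    ("SF2", "SL1"): "AR", ("SF2", "SL2"): "AR", ("SF2", "SL3"): "AP",
    # ... the remaining (SF, SL) pairs are filled in the same way from Table 1
}

def act(policy, state):
    """Look up the action prescribed by a tabular policy for a given state."""
    return policy[state]
```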
In the design of the state-action space, the state corresponds to raw force-torque data. After selecting certain force or torque components to constitute the state variables according to the larger horizontal force and the contact model, the 6-dimensional raw force-torque data are reduced to a 2-dimensional state variable. The action space corresponds to the motion command of the robotic end-effector. According to the current state and the contact model, the action space becomes a 1-dimensional variable with 2 options at each moment. This method of projection from raw data to the state-action space reflects the applicability and innovation of force-based active trajectory planning. The design of the state and action space in this paper greatly improves the efficiency of data utilization and reduces the difficulty of decision making. Four contact conditions map to the same state, so all data are covered. The action decision is simplified from one of eight options to one of two.
After designing the state-action space of RL according to the contact model, policy evaluation before policy iteration is necessary to compare the effectiveness of different actions in the same state for higher reward. However, some state-action pairs are never visited after the projection of experiment data collected by a specific policy. Thus, a policy iteration initialization method is proposed to solve the state-action space coverage problem by efficiently sampling the subsequent experiment data with specific rules. Common initialization methods include exploration initialization and the mild policy.
The method of exploration initialization requires that every state have a certain probability of being the start point. For the robotic wrench tool, the whole assembly process of insertion into the hexagonal hole of the bolt has been divided into 3 stages as illustrated in Section 2.1. Policy improvement is intensively studied in Stage (3), which means that the start point of Stage (3) is the uncertain end point of Stage (1) and Stage (2). Thus, the exploration initialization method is not feasible for the current task.
The mild policy is a method that can cover all actions, but the probability of selecting specific actions may be relatively low. For the robotic wrench insertion task, the probability of visiting certain states is already low. With low-probability actions chosen in low-probability states, it is difficult to traverse all state-action pairs. Theoretically, the mild policy requires more demonstration data, increasing the data acquisition cost.
Although the conservative policy can accomplish the task safely, it cannot guarantee accessibility to the whole state-action space. The number of times that previously existing demonstration data have traversed each state-action pair is counted and listed in a table, called the state-action table. The policy iteration initialization method in this paper requires the inclusion of all state-action pairs by a more direct and efficient method. The core is to develop an active trajectory adjustment policy based on the vacancies of the state-action table, until the data of all positions in the state-action table have been obtained for calculating and comparing action-values. With respect to the policy iteration initialization method, a robotic experiment is required to obtain the data under the demonstration condition, in which the input of different policies is allowed.
After accomplishing the projection from raw demonstration data to state-action variables, the collection of all previous demonstration experiment data generates a state-action table (e.g., Table 2). Every increment of the number in the table indicates that a trajectory has passed through this position one more time, and each time it passes by, the trajectory efficiency is calculated using this position as the starting point.
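A minimal sketch of how such a table can be accumulated from projected demonstration data follows; the trajectory format, a list of (state, action) pairs per demonstration, is assumed to come from the projection step described above.

```python
from collections import defaultdict

def build_state_action_table(trajectories):
    """Count visits to every (state, action) pair across all demonstrations.

    `trajectories` is a list of demonstrations, each a list of
    ((SF, SL), action) pairs produced by the projection step.
    """
    counts = defaultdict(int)
    for trajectory in trajectories:
        for state, action in trajectory:
            counts[(state, action)] += 1
    return counts

def vacant_pairs(counts, states, actions):
    """Vacant positions of the table; these guide the directive policy."""
    return [(s, a) for s in states for a in actions if counts[(s, a)] == 0]
```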
Table 1 Conservative policy under framework of reinforcement learning.
Thus, taking the actual situations of the robotic wrench insertion task into account, the policy initialization method can be designed as shown in Fig.6. To accomplish the whole process, several policy forms are defined according to the state-action table to obtain the initialization policy, which is a prerequisite for policy iteration.
Projection policy: it is completely opposite to the conservative policy.
Directive policy: it chooses the action that has never been chosen if the specific states are encountered by chance in the demonstration. For other states, the policy chooses the action with the smaller number in the state-action table. After adopting the projection policy, some positions in the state-action table still cannot be accessed due to the uncertainty of the demonstration process. The directive policy continues policy initialization after the projection policy has been executed twice, in order to protect the robotic manipulator and the manipulated object as far as possible.
Both the projection policy and the directive policy can be defined as table policies. Specifically, the amount of data in the state-action table determines the formulation of the table policy. However, risks exist in the table policy. Initialization during the demonstration with human supervision provides the opportunity for policy adjustment at any time. In the initialization process, the policy can be changed, which gives rise to the concept of the protective policy.
Protective policy: if the contact force increases continuously after two consecutive actions of the table policy, switch to the conservative policy for the next action. When the contact force decreases, the table policy can be executed again. If the contact force still increases after adopting the table policy twice, the conservative policy works to the end. Not only do the vacancies in the state-action table matter, but the human demonstrator should also make a real-time judgement about security according to the contact force. This requires that the demonstration platform be able to display the contact force on the monitor in real time (Fig.8(c)).
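The switching rule can be sketched as follows, assuming the "contact force increases" test compares a scalar contact-force magnitude over the last two table-policy actions; the final escalation ("works to the end") is left to the human supervisor in this sketch.

```python
def choose_action(state, force_history, table_policy, conservative_policy):
    """Protective switching sketch between the table policy and the conservative policy.

    `force_history` holds recent contact-force magnitudes, latest last.
    Falls back to the conservative policy when the force has risen over the
    last two table-policy actions, and returns to the table policy once it drops.
    """
    rising_twice = (len(force_history) >= 3
                    and force_history[-1] > force_history[-2] > force_history[-3])
    active_policy = conservative_policy if rising_twice else table_policy
    return active_policy[state]
```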
Once the demonstration experiment is performed, the demonstration data are classified according to the definition of the state-action space. The data amount in each position of the state-action table is accumulated, and the action-value function can be calculated as a mathematical expectation. According to the policy definitions above, the policy initialization demonstration can be divided into several steps.
In the first step, the policy initialization method starts with the conservative policy. It is obvious that more than half of the positions of the state-action table will be empty, as some state-action pairs are not accessed in the limited number of robotic demonstration experiments.
Fig.6 Policy initialization process of the following stage of policy iteration.
In the second step, demonstration experiments are conducted twice by the projection policy. In this step, the protective policy is involved in the demonstration to protect the robotic manipulator and the manipulated object.
In the third step, the directive policy is adopted because the projection policy is too aggressive. The protective policy assists in this step, till no position in the state-action table shows zero.
After policy iteration initialization, an initialization policy is obtained. Taking this policy as the start point, policy iteration generates a series of policies till convergence is realized. Directly executing these policies in the robotic task would inevitably risk damage to the robotic system. Therefore, the security issue should be considered and sufficiently guaranteed in the policy iteration process by using the protective policy.
Policy iteration includes policy evaluation and policy improvement. Generally, three policy evaluation methods (Monte Carlo (MC), Dynamic Programming (DP), and Temporal-Difference (TD)) are commonly used to calculate the state-value function or action-value function. The state transition probability, representing a complete and accurate model of the environment, must be known in the DP-based method. Biased estimation exists in the TD-based method. Conversely, the MC-based method gives unbiased estimation, but requires complete trajectories of data. For the robotic assembly task in the real physical environment, the state transition probability, which represents the probability of transitioning to the next state from the current state and action, cannot be described analytically. For the general demonstration experiment, it is indispensable to execute the experiment from beginning to end to obtain complete experiment data. Thus, the inherent drawback of the MC-based method does not affect the calculation of the action-value function, and the MC-based method avoids involving the state transition probability. Therefore, the Monte Carlo method is utilized to calculate the action-value function by Eq. (6).
Table 2 Number of times that previous demonstration data have traversed each state-action pair after accumulating four conservative policy demonstrations and two projection policy demonstrations.
Among them, the reward function is designed as shown in Eq. (7) according to the trajectory efficiency definition in Eq. (1), so as to improve trajectory efficiency in Stage (3) of the insertion assembly.
Then, the mathematical expectation of the action-value function can be calculated accordingly. As the state-action space is small enough to represent the approximated action-value function as a table, the action-values that reflect the average trajectory efficiency of an action in each given state are listed in the form of a Q-table (Table 4). The optimal policy can be acquired exactly by comparison within the Q-table, and it outputs the optimal action in any state by Eq. (8). The policy improvement procedure can be calculated accordingly.
RL methods can be classified into on-policy methods and off-policy methods according to whether the policy that generates data is identical to the policy being evaluated and improved. The off-policy method is used to guarantee sufficient exploration, but complex importance sampling is necessary. As the policy initialization method can achieve coverage of the state-action space, the on-policy method can satisfy the task of robotic wrench insertion. As a stochastic policy would generate unpredictable motions, the greedy policy is adopted in this paper to reduce the uncertainty caused by actions.
The greedy policy, as a deterministic policy, still faces the security issue because the next state is unknown. Solutions should be designed according to the state-action space and the policy initialization method. Based on the protective policy above, two concepts are defined below:
Generated policy: the final convergent policy of policy iteration. The generated policy should be capable of completing the assembly independently without the protective policy.
Intermediate policy: the policy produced during the policy iteration process instead of the initialization policy and final generated policy.
Different from the classical policy iteration method, when the intermediate policy in each iteration works in the real physical environment, the protective policy must take responsibility if necessary. The intermediate policy cannot guarantee the security of the robotic manipulator and the manipulated object. In addition, it may not complete the assembly trajectory on its own. The pseudo code of policy iteration is listed in Algorithm 1.
Therefore, the ability of a policy to complete the task without relying on the protective policy is a necessary condition for iterative convergence. If the output policy remains unchanged and does not rely on the protective policy, this policy can be regarded as the final generated policy.
Algorithm 1
1: Initialize π1 = initialization policy
2: while π(s) has not converged do
3:   Generate a demonstration using π(s); if danger occurs, add the protective policy
4:   for each (s, a) pair appearing in the sequence do
5:     G ← return of every visit to (s, a)
6:     Append G to Returns(s, a)
7:     Q(s, a) ← average(Returns(s, a))
8:   end for
9:   π(s) ← arg max_a Q(s, a)
10: end while
11: return π(s)
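For concreteness, a runnable sketch of Algorithm 1 is given below, assuming a wrapper `run_demonstration(policy)` that performs one supervised demonstration (with the protective policy applied if danger occurs) and returns the visited (state, action, return) triples; the every-visit Monte Carlo update and the greedy improvement mirror the pseudo code above.

```python
from collections import defaultdict

def policy_iteration(initial_policy, run_demonstration, states, actions,
                     max_iterations=20):
    """Every-visit Monte Carlo policy iteration (sketch of Algorithm 1).

    `run_demonstration(policy)` is assumed to execute one supervised
    demonstration, letting the protective policy intervene if necessary,
    and to return a list of (state, action, return_G) triples.
    """
    policy = dict(initial_policy)
    returns = defaultdict(list)   # Returns(s, a)
    q = defaultdict(float)        # Q(s, a)
    for _ in range(max_iterations):
        for state, action, g in run_demonstration(policy):
            returns[(state, action)].append(g)   # every-visit MC update
            q[(state, action)] = (sum(returns[(state, action)])
                                  / len(returns[(state, action)]))
        new_policy = {s: max(actions, key=lambda a: q[(s, a)]) for s in states}
        if new_policy == policy:
            # Convergence check; the paper additionally requires that the policy
            # completes the task without relying on the protective policy.
            return new_policy
        policy = new_policy
    return policy
```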
In this section, the experiment platform is introduced first to illustrate the environment for method verification. Then, the policy iteration initialization process and its results are presented; these results serve as the start point of the policy iteration process. The protection-based policy iteration process is then presented, and the results show that assembly trajectory efficiency can be improved while ensuring safety. After policy iteration, the policy that can complete the task independently, called the generated policy, is compared with the conservative one according to the criterion of trajectory efficiency.
The validation experiment platform for the policy iteration method needs to support the policy iteration process. Not only the application background of the space robot but also the demonstration requirements should be considered for this platform.
In the special space environment, the real-time performance and sufficiency of visual information cannot support visual servoing. The software and hardware of the space robot system cannot guarantee accurate modelling of the hexagonal contour edge.13 However, approximately circular features acquired by cameras can help the space robot locate the initial contact position.
Table 3 State-action table after directive policy demonstrations for 4 times.
Table 4 Action-values in specific states at the beginning of the policy iteration.
Thus, the wrench assembly experiment platform for the ground verification experiment in this paper makes it possible for the wrench to contact the bolt and align with it in the position dimension (Fig.7). Only the 6-dimensional force-torque sensor installed between the robotic manipulator and the end-effector is adopted to guide this active trajectory planning for assembly. In this paper, the JR3 6-axis force/torque sensor (Product Number: 67M25A3) is selected.
In the space environment, visual servoing is achieved by the eye-in-hand camera and a visual marker. The relative pose between the target load and the visual marker is designed as a known variable. Within a 300 mm range, the position measurement accuracy of the camera is better than 0.5 mm, and the orientation measurement accuracy is better than 0.3°. In the ground verification experiment, the teleoperation method is used to guide the robotic manipulator to the initial position, as shown in Fig.7. Considering factors such as the position and orientation deviation of the robotic manipulator, the initial conditions of the ground verification experiment should tolerate a larger initial position and orientation deviation between the robotic end-effector wrench tool and the hexagonal hole of the bolt.
The initial pose deviations between the wrench tool and the hexagonal hole are set as follows: the orientation deviation is 2.5°, and the position deviation is 1 mm. The generated policy after training should ensure safety, and the trajectory efficiency should be improved under this condition.
Due to uncertain contact, it is dangerous to directly execute an arbitrary autonomous policy when performing the assembly task. It is natural to protect the whole robotic system while acquiring a more efficient policy. Generally, dragging the robotic manipulator by the human demonstrator to acquire a trajectory is a common robot demonstration method. In the situation where the robot end-effector is constrained by the manipulated object, an additional external force reflected in the force-torque sensor would be generated due to the contact between the human demonstrator and the robot manipulator. To avoid these issues, the method of noncontact demonstration with human supervision is adopted26 to protect the whole experiment of policy improvement and to provide an interface for sending robot end-effector motion commands by mouse (Fig.8(a)). With this demonstration method, not only can an arbitrary policy be applied through the software interface, but it also becomes possible for the human demonstrator to supervise the assembly process visually. The real-time contact force detected by the force-torque sensor is also shown on the monitor for reference.
Fig.7 Robotic wrench insertion task on noncontact demonstration platform with human supervision from local and global views.
In addition, when the assembly strategy is determined, the robot end-effector motion command can be input according to the real-time force signal on the monitor. The software interface provides the opportunity for the demonstrator to classify the interval of force data. To ensure the purity and accuracy of the position information, only the position control method is adopted in this robotic demonstration method. To simulate different environments, adjustable-angle flat tongs and a rotation platform can be used to set 2-dimensional orientation deviations.
After the robotic wrench tool contacts the hexagonal hole of the socket bolt, the demonstration experiment begins, as shown in Fig.7. The software interface supports inputting an arbitrary policy by pressing the position or orientation command buttons (6 pairs) on the software interface in Fig.8(a). After choosing the analysis plane according to the real-time force information, the motion command is input by the mouse. After the corresponding robot endpoint motion is executed, the position or orientation values on the monitor are updated. When a command button is pressed, both the force-torque data at this moment and the updated cumulative position-orientation values are recorded in the same row of a text file. The human demonstrator operates the buttons according to the prescribed strategy. This demonstration software can be used to record the force-position data in the format of Eq. (4) till assembly is completed.
Fig.8 Core parts of demonstration software interface.
As the wrench tool enters the hole of the bolt by a certain length, any position or orientation adjustment will cause uncertain contact. Thus, the relatively large motion amplitude above can accelerate the whole process significantly. Besides, the clearance between the wrench and the hole can tolerate this motion amplitude.
In Stage (1), |Fx| and |Fy| are controlled to be less than 2 N, and Fz is controlled within the interval of [−5 N, −2 N] in Stage (2). In Stage (3), the assembly trajectory is judged to be finished when the maximum of |Mx| and |My| is smaller than 0.1 N·m.
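These thresholds translate directly into simple completion and safety checks; the sketch below assumes forces in N and torques in N·m as read from the force-torque sensor.

```python
def stage3_finished(mx, my, torque_limit=0.1):
    """Stage (3) termination: both bending torques fall below 0.1 N*m."""
    return max(abs(mx), abs(my)) < torque_limit

def force_within_limits(fx, fy, fz, stage):
    """Force bounds kept during Stage (1) and Stage (2)."""
    if stage == 1:
        return abs(fx) < 2.0 and abs(fy) < 2.0   # |Fx|, |Fy| < 2 N
    if stage == 2:
        return -5.0 <= fz <= -2.0                # Fz within [-5 N, -2 N]
    return True
```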
In the entire training process, policy improvement is completed under the demonstration condition. Making full use of the existing demonstrations also reduces the number of samples that need to be collected in the robotic experiment.
According to the definition of policy initialization, the initialization demonstration experiments aim to fill the state-action table. As the conservative policy in the first step can theoretically fill only half of the state-action table, demonstration experiments with it are conducted only 4 times. The angular position of the adjustable rotation platform is set to 210° or 300°, and the angle of the adjustable-angle flat tongs is 2.5° in the policy initialization process. The detailed process of initialization policy acquisition is depicted in Fig.9, where X indicates that the number in this position is non-zero.
To fill the positions of the state-action table that would never be accessed by the conservative policy, the projection policy is adopted twice in the second step, together with the protective policy for security. After classifying and accumulating the raw demonstration data according to the state-action definition, the state-action table is obtained (Table 2). The state-action table cannot be completely filled because the demonstration trajectories cannot guarantee that every state is accessible.
It can be seen in Table 2 that 4 of the 20 positions have not been accessed yet. Thus, in the third initialization step, the human demonstrator should focus on the states of these 4 state-action pairs and execute the directive policy in those states. After 4 more robotic wrench insertion demonstration experiments and data collection, the state-action table becomes Table 3.
So far, each position in the state-action table has been filled, which can be seen as a sign of completion of the policy initialization process, and policy iteration can follow. Although vacancies in the table can still be seen after some individual demonstrations, the data still contribute to the calculation of the action-values of other state-action pairs.
The policy initialization method proposed in this paper is suitable for policy iteration in the real physical environment of a robotic system. In the robotic assembly task, the initial state cannot be deliberately appointed, so data categories with only a small amount of past data should be fully exploited. When a policy outputs a specific action, the protective policy is prepared for protection. In the process of executing robotic tasks, the principal problem is not the small probability of selecting a certain action in a certain state, but the small probability of accessing some states. Therefore, the random policy or the epsilon-greedy policy is not as efficient as the directive policy.
When policy initialization finishes, the action-values corresponding to Table 3 and Eq. (6) can be calculated, as listed in Table 4, which is also named the Q-table. The bold characters highlight the relatively greater action-value for each two-dimensional state, and the parentheses show the amount of data used for the calculation.
Fig.9 Process of acquiring initialization policy.
To make a clear comparison of the action-values of different actions, the Q-table is depicted in Fig.11(a) to show the better action in each given 2-dimensional state. Then, the concrete policy iteration process is depicted in Fig.10 to acquire the generated policy.
Similarly, the robotic manipulator and the manipulated object still face security risks under the intermediate policy during policy iteration. Thus, the protective policy is required to cooperate with the intermediate policy demonstration. After adopting the policy acquired from Fig.11(a), the action-values are updated, as shown in Fig.11(b). The angular position of the adjustable rotation platform is set to 190° in the policy iteration process.
Policy iteration does not converge here. Therefore, the robotic assembly demonstration experiment continues with the latest intermediate policy together with the protective policy. After accumulating the latest demonstration data, the updated action-values are depicted in Fig.11(c).
Although the new intermediate policy given by Fig.11(c) stays the same as the previous policy, this intermediate policy cannot complete the task without the aid of the protective policy, so policy iteration needs to continue. However, the amount of data used to calculate the action-values in the Q-table increases, which benefits future calculations. After executing the intermediate policy corresponding to Fig.11(c) in the following demonstration experiment (with the protective policy), the data are obtained, and the action-values are updated as shown in Fig.11(d).
Fig.10 Policy iteration process for acquiring final generated policy with initialization policy as the start point.
The protective policy is not needed when adopting the policy produced by the latest action-values shown in Fig.11(d) in the following demonstration experiment. After processing the latest demonstration data and adding them to the existing data set, the action-values are updated, as shown in Fig.11(e).
The action-values from Fig.11(e) generate the same policy as those from Fig.11(d). To judge whether the policy has converged, the latest policy is applied again. The updated action-values are shown in Fig.11(f), and the corresponding policy can be judged to be the convergent policy, called the generated policy. This generated policy is expressed explicitly in Table 5.
As the intermediate policies during the iteration process require the assistance of the protective policy, these policies are not comparable. In the following part, the generated policy is compared with the conservative policy in terms of trajectory efficiency by additional experiments.
Fig.11 Action-values of all state-action pairs for 6 calculations in policy iteration.
Compared with direct policy training under the unsupervised condition, the protective policy with human supervision can protect the robot and the manipulated object. It is also a novel idea to introduce the protective policy to deal with the danger caused by unstable intermediate policies during policy iteration.
To improve trajectory efficiency, policy iteration is utilized, and a policy initialization method is proposed. The trajectory efficiencies of the two policies are compared in the demonstration mode. The angular position of the adjustable rotation platform is set to 100° in the policy comparison process, and the angle of the adjustable-angle flat tongs is 2°. To illustrate the influence of the manufacturing tolerances of the hexagonal bolts and wrench, two types of M8 bolts are selected in the experiment, and the distances between the opposite sides of the hexagonal holes of these two bolts are 6.07 mm and 6.03 mm, respectively. Besides, the distance between the opposite sides of the hexagonal wrench in this experiment is 5.98 mm, resulting in two assembly tolerances of 0.09 mm and 0.05 mm, respectively.
Under this initial condition, the main orientation deviation is in the Rx direction (compared with Ry). Both the generated policy and the conservative policy are executed in demonstration experiments 10 times each, and the trajectory efficiency of each experiment is calculated by Eq. (1). For the assembly tolerance of 0.09 mm, the average trajectory efficiency of the generated policy is 58.56%, while it is 55.25% for the conservative one. For the assembly tolerance of 0.05 mm, the average trajectory efficiency of the generated policy is 82.00%, while it is 62.91% for the conservative one. The results are compared and shown in Fig.12.
During the demonstration process, it is found in the experiments with both types of bolts that the first orientation adjustment has a great impact on trajectory efficiency. The trajectory efficiency is therefore further classified and compared according to the direction of the first orientation adjustment, differentiated by Rx and Ry, in Table 6 (for the assembly tolerance of 0.09 mm) and Table 7 (for the assembly tolerance of 0.05 mm).
According to the policy tables (Table 1 and Table 5), the two policies differ only in state [SF3, SL1]. If one of the policies generates a trajectory without passing through this state, the corresponding demonstration trajectory data are obviously invalid for the comparison.
Regardless of the direction of the first orientation adjustment, the trajectory efficiency of the generated policy is always higher than that of the conservative policy. Choosing first the direction with the larger orientation deviation leads to higher trajectory efficiency. This is because the orientation deviation is dominant relative to the position deviation during the assembly process. Every time the orientation is adjusted, the position deviation inevitably changes. Conversely, the position adjustment does not change the orientation deviation. Therefore, if the first adjustment is towards the direction with the greater orientation deviation, it results in higher trajectory efficiency.
Table 5 Expressions of generated policy.
Fig.12 Trajectory efficiency results for different assembly tolerance.
Table 6 Trajectory efficiency classified and compared according to the first orientation adjustment (assembly tolerance: 0.09 mm).
Table 7 Trajectory efficiency classified and compared according to the first orientation adjustment (assembly tolerance: 0.05 mm).
However, it is difficult to judge the principal orientation deviation direction under the current conditions. The first orientation adjustment decision depends on the initial position deviation, which in turn depends on the uncertain initial contact state. It is hard to guarantee the optimal first orientation adjustment with the current policy, but the policy initialization and iteration method surely improve the follow-up trajectory efficiency.
Besides, the comparison experiment shows the influence of manufacturing tolerance. For the bolt assembly with the smaller tolerance, the trajectory efficiency is higher for both policies. In our view, the small assembly tolerance restricts the position adjustment space. When the horizontal contact force falls in the range that calls for adjusting the orientation of the robotic end-effector, the small assembly tolerance leads to more accurate position alignment. Thus, higher trajectory efficiency can be realized with a small assembly tolerance between the hexagonal bolt hole and the wrench.
To improve the assembly efficiency of robots, the efficiency improvement of trajectory planning in wrench insertion is formulated as an RL-based objective optimization problem. To achieve this objective optimization through RL, a novel policy iteration method is proposed. For the start point of policy iteration, a policy iteration initialization method based on the vacancies of the state-action table is designed and applied. Compared with initialization dependent on exploring starts or a mild policy, this initialization method improves demonstration efficiency by finishing the state-action traverse within a limited number of demonstrations. In addition, policy iteration based on the protective policy can output the generated policy while avoiding unpredictable risks.
In robotic experiments, the conservative policy and the generated policy are compared under the condition of noncontact robot demonstration with human supervision. Over 10 experiments each, the average trajectory efficiencies of the generated policy are 58.56% for the 0.09 mm assembly tolerance and 82.00% for the 0.05 mm assembly tolerance, which are higher than those of the conservative policy (55.25% for the 0.09 mm assembly tolerance and 62.91% for the 0.05 mm assembly tolerance). Although the trajectory efficiency is affected by the first orientation decision, the generated policy shows a significant improvement in trajectory efficiency compared with the conservative one for both assembly tolerances in this paper.
In the future, the state-action space could be refined in the targeted region to further improve trajectory efficiency, and the influence of the first orientation adjustment decision should be eliminated. In addition, this method guarantees the feasibility of inserting the wrench into the bolt by the robotic manipulator and makes it possible to replace astronauts in extravehicular activities.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This study was supported by the National Natural Science Foundation of China (No. 91848202) and the Special Foundation (Pre-Station) of China Postdoctoral Science (No. 2021TQ0089).