
    A policy iteration method for improving robot assembly trajectory efficiency

    2023-04-22 02:06:18
    Chinese Journal of Aeronautics, 2023, Issue 3

    Qi ZHANG, Zongwu XIE, Baoshi CAO, Yang LIU

    State Key Laboratory of Robotics and System, Harbin Institute of Technology, Harbin 150001, China

    KEYWORDS: Bolt assembly; Policy initialization; Policy iteration; Reinforcement learning (RL); Robotic assembly; Trajectory efficiency

    Abstract Bolt assembly by robots is a vital and difficult task for replacing astronauts in extravehicular activities (EVA), but the trajectory efficiency of inserting the wrench into the hexagonal hole of the bolt still needs to be improved. In this paper, a policy iteration method based on reinforcement learning (RL) is proposed, by which the problem of trajectory efficiency improvement is formulated as an RL-based objective optimization problem. Firstly, the projection relation between raw data and the state-action space is established, and a policy iteration initialization method is then designed based on this projection to provide the initialization policy for iteration. Policy iteration based on the protective policy is applied to continuously evaluate and optimize the action-value function of all state-action pairs until convergence is obtained. To verify the feasibility and effectiveness of the proposed method, a noncontact demonstration experiment with human supervision is performed. Experimental results show that the initialization policy and the generated policy can be obtained by the policy iteration method within a limited number of demonstrations. A comparison between experiments with two different assembly tolerances shows that the convergent generated policy possesses higher trajectory efficiency than the conservative one. In addition, this method ensures safety during the training process and improves the utilization efficiency of demonstration data.

    1.Introduction

    In-space assembly (ISA) technologies can effectively expand space structures, improve spacecraft performance, and reduce launch requirements.1–3 Bolt assembly is an indispensable on-orbit extravehicular activity (EVA) in the future work of the International Space Station (ISS)/Chinese Space Station (CSS) and large space telescope tasks. Currently, astronauts have to use an electric wrench tool to screw hexagonal bolts in the extravehicular scene. This activity is time consuming and lacking in operational flexibility and safety,4,5 so it is expected that space robots can replace astronauts in this activity.6–8 The space robot should autonomously make decisions for wrench insertion during assembly according to predetermined rules in the on-orbit extravehicular environment.

    Previous studies have proposed many peg insertion methods based on vision, following multiple routes. Edge-fitting and shadow-aided positioning can facilitate alignment and peg-in-hole assembly with a visual-servo system.9 An edge detection method was proposed for post-processing applications based on edge proportion statistics.10,11 With the aid of vision recognition, eccentric peg-in-hole assembly of a crankshaft and bearing was achieved.12 However, the software and hardware of the space robot data processing system restrict the processing capability of visual information, so the insufficient visual information cannot satisfy real-time on-orbit visual servoing. With the obtained sparse visual information, it is difficult to distinguish the hexagonal contour features of the wrench and hole, but it is possible to recognize the approximate circular features to locate the hexagonal hole of the bolt.13 Force-based methods provide another way of insertion assembly by robots, and the core is to guide the assembly trajectory by the contact force. However, analyzing the relationship between the contact force and the trajectory required for assembly is rather difficult,14,15 so several types of methods were designed from the perspectives of control, shape recognition and contact modelling.16,17 Force control methods were typically adopted to solve peg-in-hole problems.18–20 By analyzing the geometric relationship through force-torque information, a hole detection algorithm tried to find the direction of a hole relative to a peg based on shape recognition.21 A guidance algorithm was proposed to assemble parts of complex shapes, and the force required for assembly was decided by kinesthetic teaching with a Gaussian mixture model.22 Force information can help to maximize the regions of attraction (ROA), and the complex problem of peg-in-hole assembly planning was solved analytically.23,24 Besides, a static contact model was also used for the peg-in-hole task, and the relative pose was estimated by force-torque maps.25 To eliminate the interference of external influences, a noncontact robotic demonstration method with human supervision was used to collect the pure contact force, so as to analyze the contact model in the wrench insertion task.26

    Even though the analytical contact model has been used in some cases, the efficiency of robotic insertion assembly needs to be further improved. Reinforcement learning (RL) has shown its superiority in making optimal decisions, but robotic application of RL still faces various barriers. High-dimensional and unexplainable state definition is one of them. Deep reinforcement learning can handle high-dimensional image inputs with a neural network.27 The end-to-end approach utilizes images as both the state and the goal representation.28,29 High-dimensional image information is processed by a neural network with plenty of parameters, so the ambiguous state expressions do not have clear physical meanings. Sample efficiency is also a major constraint for deploying RL in robotic tasks, as the interaction time is so long that experimenters cannot bear it. Some attempts have made progress in simulation, but applying RL in real-world robotic tasks is still challenging.30,31 Training time can be reduced by simultaneously executing the policy on multiple robots.32 Exploration sufficiency before RL policy iteration is another obstacle in robotic tasks, and the existing methods of exploration initialization or mild policy are impractical in robotic applications. The initial position cannot be set arbitrarily in the robotic assembly problem, which makes exploration initialization infeasible. A mild policy selects each action in any state with nonzero probability, but the frequency of selecting certain actions is not high enough during data acquisition. In addition, risk avoidance during policy training is also an important issue, because the robotic manipulator is continuously subjected to an unknown, potentially dangerous interacting contact force. From another point of view, this contact force can also provide guidance for assembly. In a word, to apply force-based RL to the robotic wrench insertion task, we need to overcome three difficulties as follows:

    1. The state-action space design in RL requires consideration of the physical meaning of the contact state.

    2. Sample efficiency cannot be neglected when sufficiently traversing the state-action space during policy iteration initialization.

    3. The security issue should be especially considered in policy iteration.

    However, none of the methods mentioned above can meet these three key demands of force-based RL robotic manipulation at the same time. To cope with this problem, a novel policy iteration method including state-action definition, an initialization method and a protective policy is proposed in this paper to realize the robotic wrench insertion task. A ground validation experiment using the noncontact demonstration method with human supervision proves that the proposed policy iteration method can improve the trajectory efficiency of robotic assembly in a safe way.

    The rest of the paper is organized as follows. Section 2 mainly illustrates the proposed policy iteration method. The experiment platform, execution and results are introduced in Section 3. Section 4 presents the conclusions and future work.

    2.Policy iteration method

    In this section, the process of policy iteration is elaborated. First, improving the trajectory efficiency of the insertion task is defined as an objective optimization problem. To acquire a policy with higher efficiency, it is necessary to design the RL state-action space by projecting from raw force-position data. Then, policy iteration initialization is conducted to traverse all state-action pairs with higher sample efficiency. Finally, safety during the training process of policy iteration is guaranteed by using the protective policy.

    2.1.Optimization objective and definition of trajectory efficiency

    In this section, the robotic wrench insertion task is divided into several stages to illustrate the optimization objective, and then the trajectory efficiency is defined with a mathematical expression. Since the contour shapes of the wrench and the bolt hole are hexagonal, robotic wrench insertion into the bolt hole requires multiple steps. An efficient sequence for the entire insertion assembly can be divided into three stages, with the wrench coordinate frame defined in Fig.1:

    (1) rotating along the Z-axis (Rz+) till alignment is achieved in the Rz dimension.

    (2) moving along the Z-axis (Pz+) till the end face of the wrench contacts the bottom of the hole.

    Fig.1 Three stages of wrench insertion into hexagonal hole of bolt.

    (3) adjusting the position (Px or Py) or orientation (Rx or Ry) of the robotic end-effector.

    Stage (2) and Stage (3) can be exchanged according to real-time conditions as shown in Fig.1, which gives an intuitive presentation of the stages of wrench insertion into the hexagonal hole of the bolt.

    For Stage (1), the end-effector wrench tool of the robotic manipulator only needs to be rotated along the direction in which the bolt can be tightened. When the lateral side of the wrench comes into contact with one side of the hexagonal hole as shown in Fig.1, the wrench tool bears a torque opposite to the direction of rotation. This torque value can be measured by the 6-dimensional force/torque sensor and serves as the criterion for the end of Stage (1).

    For Stage (2), the robotic end-effector wrench tool only needs to move along the direction of insertion. When the end face of the wrench tool fits against the bottom of the hexagonal bolt hole as shown in Fig.1, the wrench bears a positive pressure opposite to the moving direction. The pressure value measured by the 6-dimensional force/torque sensor can be utilized as the criterion for the end of Stage (2).

    In the trajectories of Stage (1) and Stage (2), we only need to monitor whether the motion is finished in one dimension, which makes multi-dimensional decisions unnecessary during trajectory planning. Moreover, in these two stages, it is unnecessary to consider trajectory efficiency improvement. In contrast, motion decisions in different dimensions of position and attitude must be considered in Stage (3). Thus, Stage (3) is particularly studied in this paper to improve the efficiency of the assembly trajectory.

    For Stage (3), the necessary trajectory from an arbitrary point to the terminal point is the "sum" of the shortest position and orientation distances, and the actual trajectory is the "sum" of the actual position and orientation paths. The difference between these two concepts is illustrated in Fig.2.

    Thus, trajectory efficiency is defined as the ratio between the necessary trajectory (Rn and Pn) and the actual trajectory (Ra and Pa) as follows:
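    The original form of Eq. (1) is not reproduced in this extraction. A plausible form consistent with the definition above, assuming a weighting factor λ that places the rotational and translational distances on a common scale, is

        \eta = \frac{\lambda R_n + P_n}{\lambda R_a + P_a}    (hypothetical reconstruction of Eq. (1))

    where Rn and Pn are the necessary orientation and position distances, and Ra and Pa are the actual ones.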

    Fig.2 Comparison between necessary trajectory and actual trajectory.

    This definition of trajectory efficiency can be adopted to evaluate and improve the assembly efficiency of the robotic wrench insertion task.

    2.2.RL state-action space design for data projection

    To improve assembly trajectory efficiency by means of RL, the raw experiment data in the assembly task should be projected into the state-action space of the RL framework. Force rather than vision information is utilized to guide the robotic assembly task, so the state space should be designed according to the force data. The properly designed state-action space correlates with the contact model established in 2-dimensional space,26 as shown in Fig.3. As the contact model is established in a 2-dimensional plane rather than 3-dimensional space, the wrench restricts its adjustment motion to this plane instead of moving directly in the ideal direction. In this sense, dimensionality reduction is achieved by choosing the position or orientation adjustment in a specific direction in Stage (3).

    Fig.3 Relationship between force-torque direction and position-orientation deviation.

    The two contact points in the simplified contact model produce contact forces. These two contact forces, denoted as F1 and F2, can be regarded as two parallel forces. In fact, F = F1 − F2 is the resultant of the contact forces measured by the force sensor, and M is the torque value measured by the force sensor. The direction of the resultant force and the torque-to-force ratio reflect the position deviation and orientation deviation.26 Thus, according to the pose deviation determined by the contact model shown in Fig.3, active trajectory planning is simplified to judging which decision is better, a position or an orientation adjustment, in this 2-dimensional plane.

    In the actual trajectory decision-making process, only the plane with the larger horizontal contact force is considered, and the force-torque plane orthogonal to it is ignored in each adjustment step (Fig.1). Therefore, the position (represented by P) or orientation (represented by R) of the robotic end-effector is adjusted only in this plane each time.

    Then, the state and action can be designed according to Fig.3, as expressed in Eqs. (2)–(3).
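    The exact expressions of Eqs. (2)–(3) are not reproduced in this extraction. Based on the descriptions that follow (five 1 N force segments, three torque-to-force ratio intervals, and two motion options per step), the state and action presumably take a form such as

        s = (S_F, S_L),  S_F \in \{SF1, ..., SF5\},  S_L \in \{SL1, SL2, SL3\}
        a \in \{AP, AR\}

    where S_F discretizes the horizontal contact force, S_L discretizes the torque-to-force ratio, AP denotes a position adjustment and AR an orientation adjustment in the selected plane.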

    After determining the principal plane for analysis, the raw force-torque data are selected to construct the state variable, and the motion command can then also be restricted to two options. Data dimensionality reduction is achieved accordingly. Specifically, the robot receives force information as the state variable, and the decision action is described as a 6-D motion of the robotic end-effector. After the motion of the end-effector, the contact state changes accordingly. The force data and the motion data can be recorded in the format shown in Eq.(4).
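    Eq. (4) is not reproduced here. Judging from the description of the recording software in Section 3.1, where the force-torque data at the moment a command button is pressed and the updated cumulative position-orientation values are written to the same row of a text file, each record presumably resembles

        (F_x, F_y, F_z, M_x, M_y, M_z, a_t, P_x, P_y, P_z, R_x, R_y, R_z)

    with a_t the motion command issued at that step; the exact ordering of the fields is an assumption.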

    To improve the utilization efficiency of the raw data, a data processing method is specially designed from the perspective of RL training, as shown in Eq.(5). The conditions of task completion are also illustrated there.

    Due to the high expense of acquiring robotic manipulation data in the real physical environment, we should make full use of the obtained demonstration data. According to the contact model and the RL model, the contact forces Fx and Fy in the horizontal directions are compared first. The direction of the greater force determines the principal analysis plane at that moment (ignoring the contact force in the orthogonal plane), as shown in Fig.4.

    After selecting the analysis plane at any time, the raw force data (Fx or Fy, Mx or My) that can be utilized to generate the designed state are determined (Fig.4). Analyzing the contact conditions of all situations with the same simplified 2-dimensional contact model shown in Fig.3 significantly improves data utilization efficiency.

    Obviously, the value of the horizontal force affects the motion decision. For the specific task in this paper, the force values are widely scattered, so the range of force values is divided into 5 segments with an interval of 1 N to reduce the number of intervals and simplify the judgement.

    According to the simplified contact model, both the horizontal force and the ratio between the torque and the horizontal force (in the same plane) reflect the position-orientation deviation. Specifically, this ratio may fall in 3 different interval ranges according to the distance between the end point of the tool and the force-torque sensor. Therefore, the defined two-dimensional state space, SF and SL, needs to be discussed by categorization. The detailed correspondence between raw force data and state is shown in Fig.5.
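    The exact mapping of Eq.(5) is not reproduced here. The Python sketch below illustrates the projection idea under stated assumptions: the 1 N force segments follow the paper, while the ratio thresholds and the pairing of force and torque axes are hypothetical placeholders, not values from the original.

        import numpy as np

        # Hypothetical sketch of the projection from raw force-torque data to the
        # discrete state (SF, SL). ratio_bounds and the axis pairing are assumptions.
        def project_state(F, M, force_bins=(1.0, 2.0, 3.0, 4.0), ratio_bounds=(0.05, 0.15)):
            Fx, Fy, Fz = F
            Mx, My, Mz = M
            # Choose the principal analysis plane by the larger horizontal force.
            if abs(Fx) >= abs(Fy):
                f, m = abs(Fx), abs(My)   # X-Z plane: Fx assumed to pair with My
            else:
                f, m = abs(Fy), abs(Mx)   # Y-Z plane: Fy assumed to pair with Mx
            # SF: discretize the horizontal force into 5 segments of 1 N each.
            sf = int(np.digitize(f, force_bins)) + 1          # SF1 ... SF5
            # SL: discretize the torque-to-force ratio into 3 intervals.
            ratio = m / max(f, 1e-6)
            sl = int(np.digitize(ratio, ratio_bounds)) + 1    # SL1 ... SL3
            return sf, sl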

    In the simplified contact model, the robot only needs to determine a unidirectional position or orientation motion at each moment. In fact, the robot end-effector command has 8 options in total in Stage (3). Thus, after establishing the state space, the relation between the motion command and the action space is established according to the plane determined by the horizontal contact force and the state variable, as shown in Fig.4. In previous work, a conservative policy has been proved to be available,26 which can be defined under the RL framework as in Table 1. AR and AP correspond to Rx or Ry and Px or Py respectively in Fig.4.

    Fig.4 Restricted options of the actions according to principal analysis plane and state variables.

    Fig.5 State definitions according to different situations.

    Obviously, it can be deduced that SL3 represents one-side contact with only position deviation. Thus, the policy at state SL3 outputs the concrete action AP, and the subsequent policies in this paper all obey the same principle. The policy in Table 1 is used to complete the task with the goal of keeping the contact force small during the whole process, at the expense of efficiency, which motivates the work in this paper.

    In the design of the state-action space, the state corresponds to the raw force-torque data. After selecting certain force or torque components to constitute the state variables according to the larger horizontal force and the contact model, the 6-dimensional raw force-torque data are reduced to a 2-dimensional state variable. The action space corresponds to the motion command of the robotic end-effector. According to the current state and the contact model, the action space becomes a 1-dimensional variable with 2 options at each moment. This method of projection from raw data to the state-action space reflects the applicability and innovation of force-based active trajectory planning. The design of the state and action space in this paper greatly improves the efficiency of data utilization and reduces the difficulty of decision making. The four contact conditions share the same state definition and cover all the data, and the action decision is simplified from one of eight options to one of two.

    2.3.Policy iteration initialization method

    After designing the state-action space of RL according to the contact model, policy evaluation before policy iteration is necessary for comparing the effectiveness of different actions in the same state for higher reward. However, some state-action pairs are never visited after projecting the experiment data collected by a specific policy. Thus, a policy iteration initialization method is proposed to solve the state-action space coverage problem by efficiently sampling the subsequent experiment data with specific rules. Common initialization methods include exploration initialization and the mild policy.

    The method of exploration initialization requires that every state have a certain probability of being the start point. For the robotic wrench tool, the whole assembly process of insertion into the hexagonal hole of the bolt has been divided into 3 stages as illustrated in Section 2.1. Policy improvement is studied intensively in Stage (3), which means that the start point of Stage (3) is the uncertain end point of Stage (1) and Stage (2). Thus, the exploration initialization method is not feasible for the current task.

    The mild policy is a method that can cover all the actions, but the probability of selecting specific actions may be relatively low. For the robotic wrench insertion task, the probability of visiting specific states is already low. Choosing low-probability actions in low-probability states makes it difficult to traverse all state-action pairs. Theoretically, the mild policy requires more demonstration data, increasing the data acquisition cost.

    Although the conservative policy can accomplish the task safely, it cannot guarantee accessibility to the whole state-action space. The number of times that the previously existing demonstration data have traversed each state-action pair is counted and listed in a table, called the state-action table. The policy iteration initialization method in this paper requires the inclusion of all state-action pairs by a more direct and efficient method. The core is to develop an active trajectory adjustment policy based on the vacancies in the state-action table, until the data of all positions in the state-action table have been obtained for calculating and comparing action-values. For the policy iteration initialization method, the robotic experiment is required to obtain the data in the demonstration condition, in which the input of different policies is allowed.

    After accomplishing the projection from raw demonstration data to state-action variables, the collection of all previous demonstration experiment data generates a state-action table (e.g., Table 2). Every increment to a number in the table indicates that the trajectory has passed through this position one more time, and each time it passes by, the trajectory efficiency is calculated using this position as the starting point.
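    As an illustration only, the state-action table and per-visit returns might be accumulated as in the minimal Python sketch below; the episode layout (a list of state, action, return triples) is an assumed interface, not the authors' implementation.

        from collections import defaultdict

        # Accumulate visit counts and per-visit returns for each (state, action) pair.
        def update_state_action_table(episodes, table=None, returns=None):
            table = table if table is not None else defaultdict(int)
            returns = returns if returns is not None else defaultdict(list)
            for episode in episodes:
                for (state, action, g) in episode:   # g: efficiency from this step onward
                    table[(state, action)] += 1
                    returns[(state, action)].append(g)
            return table, returns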

    Table 1 Conservative policy under framework of reinforcement learning.

    Thus, taking the actual situations of the robotic wrench insertion task into account, the policy initialization method can be designed as shown in Fig.6. To accomplish the whole process, several policy forms are defined according to the state-action table to obtain the initialization policy, which is required as a prerequisite for policy iteration.

    Projection policy: it is completely opposite to the conservative policy.

    Directive policy: it chooses the action that has never been chosen if the corresponding states are encountered by chance in the demonstration. For other states, the policy chooses the action with the smaller count in the state-action table. After adopting the projection policy, some positions in the state-action table still cannot be accessed because of the uncertainty of the demonstration process. The directive policy continues the policy initialization after the projection policy has been executed twice, in order to protect the robotic manipulator and the manipulated object as far as possible.

    Both the projection policy and the directive policy can be defined as table policies. Specifically, the amount of data in the state-action table determines the formulation of the table policy. However, risks exist in the table policy. Initialization during the demonstration with human supervision provides the opportunity to adjust the policy at any time. In the initialization process, the policy can be changed, which gives rise to the concept of the protective policy.

    Protective policy: if the contact force increases continuously after two consecutive actions of the table policy, switch to the conservative policy for the next action. When the contact force decreases, the table policy can be executed again. If the contact force still increases after adopting the table policy twice, the conservative policy will work to the end. Not only the vacancies in the state-action table matter, but the human demonstrator should also make a real-time judgement about security based on the contact force. This requires the demonstration platform to be able to display the contact force on the monitor in real time (Fig.8(c)).
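    For clarity, the switching rule just described can be sketched as below. The policy interfaces (table_policy, conservative_policy) and the use of a scalar contact-force norm are assumptions; the rule itself follows the text: switch after the force rises over two consecutive actions, return to the table policy once the force decreases, and stay conservative to the end after a second trigger.

        # Minimal sketch of the protective switching rule (assumed interfaces).
        class ProtectivePolicy:
            def __init__(self, table_policy, conservative_policy):
                self.table_policy = table_policy
                self.conservative_policy = conservative_policy
                self.prev_force = None
                self.rise_count = 0        # consecutive contact-force increases
                self.switch_events = 0     # how many times protection was triggered
                self.use_conservative = False

            def act(self, state, contact_force_norm):
                if self.prev_force is not None:
                    if contact_force_norm > self.prev_force:
                        self.rise_count += 1
                    else:
                        self.rise_count = 0
                        if self.switch_events < 2:
                            self.use_conservative = False   # force decreased: resume table policy
                self.prev_force = contact_force_norm

                if self.rise_count >= 2 and not self.use_conservative:
                    self.switch_events += 1
                    self.use_conservative = True            # protect with the conservative policy
                if self.switch_events >= 2:
                    self.use_conservative = True            # second trigger: conservative to the end

                policy = self.conservative_policy if self.use_conservative else self.table_policy
                return policy(state)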

    Once the demonstration experiment is performed, the demonstration data are classified according to the definition of the state-action space. The amount of data in each position of the state-action table is accumulated, and the action-value function can be calculated as a mathematical expectation. According to the policy definitions above, the policy initialization demonstration can be divided into several steps.

    In the first step, the policy initialization method starts with the conservative policy. Obviously, more than half of the positions in the state-action table will remain empty, as some state-action pairs are not accessed in the limited number of robotic demonstration experiments.

    Fig.6 Policy initialization process of the following stage of policy iteration.

    In the second step, demonstration experiments are conducted twice with the projection policy. In this step, the protective policy is involved in the demonstration to protect the robotic manipulator and the manipulated object.

    In the third step, the directive policy is adopted because the projection policy is too aggressive. The protective policy assists in this step, until no position in the state-action table remains zero.

    2.4.Policy iteration based on protective policy

    After policy iteration initialization, an initialization policy is obtained. Taking this policy as the start point, policy iteration produces a series of policies until convergence is realized. Directly executing these policies in the robotic task would inevitably cause damage to the robotic system. Therefore, the security issue should be considered and sufficiently guaranteed in the policy iteration process by using the protective policy.

    Policy iteration includes policy evaluation and policy improvement. Generally, three policy evaluation methods (Monte Carlo, Dynamic Programming, and Temporal-Difference) are commonly used to calculate the state-value function or action-value function. The state transition probability, representing a complete and accurate model of the environment, should be known in the DP-based method. Biased estimation exists in the TD-based method. Conversely, the MC-based method provides unbiased estimation but requires complete episodes of data. For the robotic assembly task in the real physical environment, the state transition probability, which represents the probability of transition to the next state from the current state and current action, cannot be described analytically. For the general demonstration experiment, it is indispensable to execute the experiment from beginning to end to obtain complete experiment data. Thus, the inherent requirements of the MC-based method do not affect the calculation of the action-value function, and the MC-based method avoids involving the state transition probability. Therefore, the Monte Carlo method is utilized to calculate the action-value function by Eq.(6).
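    Eq. (6) is not reproduced in this extraction. Consistent with lines 5–7 of Algorithm 1 below, the every-visit Monte Carlo estimate of the action-value is

        Q(s,a) \approx \frac{1}{N(s,a)} \sum_{i=1}^{N(s,a)} G_i(s,a)

    where N(s,a) is the number of recorded visits to the pair (s,a) and G_i(s,a) is the return observed at the i-th visit.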

    Table 2 Number of times that previous demonstration data have traversed each state-action pair after accumulating four conservative policy demonstrations and two projection policy demonstrations.

    Among them, the reward function is designed as shown in Eq.(7), according to the trajectory efficiency definition in Eq.(1), to improve trajectory efficiency in Stage (3) of the insertion assembly.
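    Eq. (7) itself is not reproduced here. Given that the action-values are later said to reflect the average trajectory efficiency of an action in a given state, one plausible reading is that the return collected from each visited state-action pair equals the trajectory efficiency of the remaining trajectory, i.e.

        G_t = \eta_t    (assumed relation, not the paper's exact Eq. (7))

    where η_t is the trajectory efficiency of Eq. (1) evaluated from step t to the end of Stage (3).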

    Then, the mathematical expectation of the action-value function can be calculated accordingly. As the state-action space is small enough to represent the approximated action-value function as a table, the action-values, which reflect the average trajectory efficiency of an action in each given state, are listed in the form of a Q-table (Table 4). The optimal policy can be acquired exactly by comparison over the Q-table, and it outputs the optimal action in any state by Eq.(8). The policy improvement procedure can be calculated accordingly.
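    Consistent with line 9 of Algorithm 1 below, the policy improvement of Eq. (8) takes the greedy form

        \pi(s) = \arg\max_{a} Q(s,a)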

    RL methods can be classified into on-policy and off-policy methods according to whether the policy that generates the data is identical to the policy being evaluated and improved. The off-policy method is used to guarantee sufficient exploration, and complex importance sampling is then necessary. As the policy initialization method can achieve coverage of the state-action space, the on-policy method satisfies the needs of the robotic wrench insertion task. As a stochastic policy would generate unpredictable motions, the greedy policy is adopted in this paper to reduce the uncertainty caused by actions.

    The greedy policy, as a deterministic policy, still faces the security issue because the next state is unknown. Solutions should be designed according to the state-action space and the policy initialization method. Based on the protective policy above, two concepts are defined below:

    Generated policy: the final convergent policy of policy iteration. The generated policy should be capable of completing the assembly independently, without the protective policy.

    Intermediate policy: a policy produced during the policy iteration process, excluding the initialization policy and the final generated policy.

    Different from the classical policy iteration method, when the intermediate policy of each iteration works in the real physical environment, the protective policy must take over if necessary. The intermediate policy cannot guarantee the security of the robotic manipulator and the manipulated object. In addition, it may not be able to complete the assembly trajectory on its own. The pseudo code of policy iteration is listed in Algorithm 1.

    Therefore, the ability of the policy to complete the task without relying on the protective policy is a necessary condition for iteration convergence. If the output policy remains unchanged and does not rely on the protective policy, this policy can be regarded as the final generated policy.

    Algorithm 1
    1: Initialize π1, π1 = initialization policy
    2: while π(s) is non-convergent do
    3:   Generate demo using π(s); if danger occurs, add the protective policy
    4:   for each (s,a) pair appearing in the sequence
    5:     G ← return (reward) of every visit to (s,a)
    6:     Append G to Returns(s,a)
    7:     Q(s,a) ← average(Returns(s,a))
    8:   end
    9:   π(s) ← arg maxa Q(s,a)
    10: end
    11: return π(s)
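    For reference, a compact Python sketch of Algorithm 1 follows. The demonstration interface run_episode and its episode format are assumptions rather than the authors' implementation; the every-visit Monte Carlo update and the greedy improvement follow the pseudo code above, and full convergence additionally requires that the policy no longer relies on the protective policy.

        from collections import defaultdict

        def policy_iteration(initial_policy, run_episode, max_iterations=20):
            # run_episode(policy) is assumed to execute one supervised demonstration
            # (invoking the protective policy when danger occurs) and return a list
            # of (state, action, return) triples, where the return of a visit is the
            # trajectory efficiency from that step onward.
            returns = defaultdict(list)
            q = {}
            policy = dict(initial_policy)

            for _ in range(max_iterations):
                episode = run_episode(policy)
                for state, action, g in episode:
                    returns[(state, action)].append(g)
                    q[(state, action)] = sum(returns[(state, action)]) / len(returns[(state, action)])
                # Greedy policy improvement: pick the action with the larger action-value.
                new_policy = {}
                for state in {s for (s, _) in q}:
                    actions = [a for (s, a) in q if s == state]
                    new_policy[state] = max(actions, key=lambda a: q[(state, a)])
                if new_policy == policy:     # policy unchanged: treated as convergence here
                    policy = new_policy
                    break
                policy = new_policy
            return policy, q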

    3.Experiment and results

    In this section, the experiment platform is introduced first to illustrate the environment for method verification. Then, the policy iteration initialization process and results are presented; the results of policy iteration initialization are utilized as the start point of the policy iteration process. Next, the protection-based policy iteration process and results show that the assembly trajectory efficiency can be improved safely. After policy iteration, the policy that can complete the task independently, called the generated policy, is compared with the conservative one according to the criterion of trajectory efficiency.

    3.1.Experiment platform for validating policy iteration method

    The validation experiment platform for the policy iteration method needs to support the policy iteration process. Not only the application background of the space robot but also the demonstration requirements should be considered for this platform.

    In the special space environment, the real-time performance and sufficiency of visual information cannot support visual servoing. The software and hardware of the space robot system cannot guarantee accurate modelling of the hexagonal contour edge.13 However, the approximate circular features acquired by cameras can help the space robot locate the initial contact position.

    Table 3 State-action table after directive policy demonstrations for 4 times.

    Table 4 Action-values in specific states at the beginning of the policy iteration.

    Thus, the wrench assembly experiment platform for the ground verification experiment in this paper makes it possible for the wrench to contact the bolt and align with it in the position dimension (Fig.7). Only the 6-dimensional force-torque sensor installed between the robotic manipulator and the end-effector is adopted to guide the active trajectory planning for assembly. In this paper, the JR3 6-axis Force/Torque sensor (Product Number: 67M25A3) is selected.

    In the space environment, visual servoing is achieved by the eye-in-hand camera and a visual marker. The relative pose between the target load and the visual marker is designed as a known variable. Within a 300 mm range, the position measurement accuracy of the camera can be better than 0.5 mm, and the orientation measurement accuracy can be better than 0.3°. In the ground verification experiment, the teleoperation method is used to guide the robotic manipulator to the initial position, as shown in Fig.7. Considering factors such as the position and orientation deviation of the robotic manipulator, the initial conditions of the ground verification experiment should tolerate a larger initial position and orientation deviation between the robotic end-effector wrench tool and the hexagonal hole of the bolt.

    The initial pose deviations between the wrench tool and the hexagonal hole are set as follows: the orientation deviation is 2.5°, and the position deviation is 1 mm. The generated policy after training should ensure safety, and the trajectory efficiency can be improved under this condition.

    The uncertain contact makes it dangerous to directly execute an arbitrary autonomous strategy when performing the assembly task, so it is natural to protect the whole robotic system while acquiring a more efficient policy. Generally, dragging the robotic manipulator by the human demonstrator to acquire a trajectory is a common robot demonstration method. In the situation where the robot end-effector is constrained by the manipulated object, an additional external force reflected in the force-torque sensor would be generated by the contact between the human demonstrator and the robot manipulator. To avoid these issues, the method of noncontact demonstration with human supervision is adopted26 to protect the whole experiment of policy improvement and to provide an interface for sending robot end-effector motion commands with the mouse (Fig.8(a)). With this demonstration method, not only can an arbitrary policy be applied through the software interface, but it also becomes possible for the human demonstrator to supervise the assembly process by eye. The real-time contact force detected by the force-torque sensor is also shown on the monitor for reference.

    Fig.7 Robotic wrench insertion task on noncontact demonstration platform with human supervision from local and global views.

    In addition, when the assembly strategy is determined, the robot end-effector motion command can be input according to the real-time force signal on the monitor. The software interface allows the demonstrator to classify the interval of the force data. To ensure the purity and accuracy of the position information, only the position control method is adopted in this robotic demonstration. To simulate different environments, adjustable-angle flat tongs and a rotation platform can be used to set 2-dimensional orientation deviations.

    After the robotic wrench tool contacts the hexagonal hole of the socket bolt, the demonstration experiment begins, as shown in Fig.7. The software interface supports inputting an arbitrary policy by pressing the position or orientation command buttons (6 pairs) on the software interface in Fig.8(a). After choosing the analysis plane according to the real-time force information, the motion command is input with the mouse. After the corresponding robot endpoint motion is executed, the position and orientation values on the monitor are updated. When a command button is pressed, both the force-torque data at that moment and the updated cumulative position-orientation values are recorded in the same row of a text file. The human demonstrator operates the buttons according to the currently adopted strategy. This demonstration software can be used to record the force-position data in the format of Eq.(4) until the assembly is completed.

    Fig.8 Core parts of demonstration software interface.

    As the wrench tool enters the hole of the bolt over a certain length, any position or orientation adjustment will cause uncertain contact. Thus, the relatively large motion amplitude mentioned above can accelerate the whole process significantly. Besides, the clearance between the wrench and the hole can tolerate this motion amplitude.

    In Stage (1), |Fx| and |Fy| are controlled to be less than 2 N, and Fz is controlled within the interval of [-5 N, -2 N] in Stage (2). In Stage (3), the assembly trajectory is judged to be finished when the maximum of |Mx| and |My| is smaller than 0.1 N·m.
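    A minimal sketch of these force checks, using only the thresholds quoted above (the sign conventions follow the text; everything else is an assumption), might look as follows.

        # Force limits monitored in Stage (1) and Stage (2), and the Stage (3)
        # completion criterion, as quoted in the text.
        def stage1_force_ok(Fx, Fy):
            return abs(Fx) < 2.0 and abs(Fy) < 2.0      # horizontal forces below 2 N

        def stage2_force_ok(Fz):
            return -5.0 <= Fz <= -2.0                   # axial force kept in [-5 N, -2 N]

        def stage3_done(Mx, My):
            return max(abs(Mx), abs(My)) < 0.1          # residual torques below 0.1 N*m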

    In the entire training process, policy improvement is completed under the demonstration condition. Making full use of the existing demonstrations also reduces the number of samples that need to be collected in the robotic experiment.

    3.2.Policy iteration initialization process and results

    According to the definition of policy initialization, the initialization demonstration experiments aim to fill the state-action table. As the conservative policy in the first step can theoretically only fill up half of the state-action table, its demonstration experiments are conducted only 4 times. The angular position of the adjustable rotation platform is set to 210° or 300°, and the angular position of the adjustable-angle flat tongs is 2.5° in the policy initialization process. The detailed process of initialization policy acquisition is depicted in Fig.9, where X represents that the number in this position is non-zero.

    To fill the positions that would never be accessed by the conservative policy, the projection policy is adopted twice in the second step, together with the protective policy for security. After classifying and accumulating the raw demonstration data according to the state-action definition, the state-action table is obtained (Table 2). The state-action table cannot be totally filled because the demonstration trajectories cannot guarantee that every state is accessible.

    It can be seen in Table 2 that 4 of the 20 positions have not been accessed yet. Thus, in the third initialization step, the human demonstrator focuses on the states of these 4 state-action pairs and executes the directive policy in those states. After 4 more robotic wrench insertion demonstration experiments and data collection, the state-action table becomes Table 3.

    So far, each position in the state-action table has been filled, which can be regarded as a sign of completion of the policy initialization process, and policy iteration can follow. Although vacancies in the table can still be seen after some demonstrations, the data still contribute to the calculation of the action-values of other state-action pairs.

    The policy initialization method proposed in this paper is suitable for policy iteration in the real physical environment of a robotic system. In the robotic assembly task, the initial state cannot be deliberately appointed, so the categories of data that appear only in small amounts in the past data set should be fully exploited. When a policy outputs a specific action, the protective policy is prepared for protection. In the process of executing robotic tasks, the principal problem is not the small probability of selecting a certain action in a certain state, but the small probability of accessing some states. Therefore, the random policy or the epsilon-greedy policy is not as efficient as the directive policy.

    3.3.Policy iteration process and results

    When policy initialization finishes, the action-values corresponding to Table 3 and Eq.(6) can be calculated, as listed in Table 4, which is also called the Q-table. The bold characters highlight the relatively greater action-value for each given two-dimensional state, and the parentheses show the amount of data used for the calculation.

    Fig.9 Process of acquiring initialization policy.

    To make a clear comparison of the action-values for different actions, the Q-table is depicted in Fig.11(a) to show the better action in each given 2-dimensional state. Then, the concrete policy iteration process is depicted in Fig.10 to acquire the generated policy.

    Similarly, the robotic manipulator and the manipulated object still face security risks under the intermediate policy during policy iteration. Thus, the protective policy is required to cooperate with the intermediate policy demonstration. After adopting the policy acquired from Fig.11(a), the action-values are updated, as shown in Fig.11(b). The angular position of the adjustable rotation platform is set to 190° in the policy iteration process.

    Policy iteration does not converge at this point. Therefore, the robotic assembly demonstration experiment continues with the latest intermediate policy together with the protective policy. Accumulating the latest demonstration data, the updated action-values are depicted in Fig.11(c).

    Although the new intermediate policy given by Fig.11(c) remains the same as the previous policy, this intermediate policy cannot complete the task without the aid of the protective policy, so policy iteration needs to continue. However, the amount of data used for calculating the action-values in the Q-table increases, which benefits subsequent calculations. After executing the intermediate policy corresponding to Fig.11(c) in the following demonstration experiment (with the protective policy), the data are obtained, and the action-values are updated as shown in Fig.11(d).

    Fig.10 Policy iteration process for acquiring final generated policy with initialization policy as the start point.

    The protective policy is no longer necessary when adopting the policy produced by the latest action-values shown in Fig.11(d) in the following demonstration experiment. After processing the latest demonstration data and adding them to the existing data set, the action-values are updated, as shown in Fig.11(e).

    The action-values from Fig.11(e) generate the same policy as those from Fig.11(d). To judge whether the policy has converged, the latest policy is applied again. The updated action-values are shown in Fig.11(f), and the corresponding policy can be judged as the convergent policy, called the generated policy. This generated policy is expressed explicitly in Table 5.

    As the intermediate policies during the iteration process require the assistance of the protective policy, these policies are not comparable. In the following part, the generated policy is compared with the conservative policy in terms of trajectory efficiency through additional experiments.

    Fig.11 Action-values of all state-action pairs for 6 calculations in policy iteration.

    Compared with direct policy training under an unsupervised condition, the protective policy with human supervision can protect the robot and the manipulated object. Introducing the protective policy to deal with the danger caused by the unstable intermediate policies during policy iteration is also a novel idea.

    3.4.Comparison between generated policy and conservative policy

    To improve trajectory efficiency, policy iteration is utilized, and a policy initialization method is proposed. The trajectory efficiencies are compared in the demonstration mode by adopting the two policies. The angular position of the adjustable rotation platform is set to 100° in the policy comparison process, and the angular position of the adjustable-angle flat tongs is 2°. To illustrate the influence of the manufacturing tolerances of the hexagonal bolts and wrench, two types of M8 bolts are selected in the experiment; the distances across the opposite sides of their hexagonal holes are 6.07 mm and 6.03 mm, respectively. Besides, the distance across the opposite sides of the hexagonal wrench in this experiment is 5.98 mm, resulting in two assembly tolerances of 0.09 mm and 0.05 mm, respectively.

    Under this initial condition, the main orientation deviation is in the Rx direction (compared with Ry). Both the generated policy and the conservative policy execute demonstration experiments 10 times each, and the trajectory efficiency of each experiment is calculated by Eq.(1). For the assembly tolerance of 0.09 mm, the average trajectory efficiency of the generated policy is 58.56%, while it is 55.25% for the conservative one. For the assembly tolerance of 0.05 mm, the average trajectory efficiency of the generated policy is 82.00%, while it is 62.91% for the conservative one. The results are compared and shown in Fig.12.

    During the demonstration process, it is found in the experiments with both types of bolts that the first adjustment of the orientation direction has a great impact on trajectory efficiency. The trajectory efficiency is therefore further classified and compared according to the direction of the first orientation adjustment, differentiated by Rx and Ry in Table 6 (for the assembly tolerance of 0.09 mm) and Table 7 (for the assembly tolerance of 0.05 mm).

    According to the policy tables (Table 1 and Table 5), the two policies differ only in state [SF3, SL1]. If one of the policies generates a trajectory without passing through this state, that demonstration trajectory is obviously invalid for the comparison.

    Regardless of the direction of the first orientation adjustment, the trajectory efficiency of the generated policy is always higher than that of the conservative policy. Choosing first the direction with the larger orientation deviation leads to higher trajectory efficiency. This is because the orientation deviation is dominant relative to the position deviation during the assembly process. Every time the orientation is adjusted, the position deviation inevitably changes; conversely, a position adjustment does not change the orientation deviation. Therefore, if the first adjustment is made in the direction with the greater orientation deviation, it results in higher trajectory efficiency.

    Table 5 Expressions of generated policy.

    Fig.12 Trajectory efficiency results for different assembly tolerance.

    Table 6 Trajectory efficiency classified and compared according to the first orientation adjustment (assembly tolerance: 0.09 mm).

    Table 7 Trajectory efficiency classified and compared according to the first orientation adjustment (assembly tolerance: 0.05 mm).

    However, it is difficult to judge the principal orientation deviation direction under the current conditions. The first orientation adjustment decision depends on the initial position deviation, which in turn depends on the uncertain initial contact state. It is hard to guarantee an optimal first orientation adjustment with the current policy, but the policy initialization and iteration method clearly improve the follow-up trajectory efficiency.

    Besides, the comparison experiment shows the influence of the manufacturing tolerance. For the bolt assembly with the smaller tolerance, the trajectory efficiency is higher under both policies. In our view, a small assembly tolerance restricts the position adjustment space: when the horizontal contact force falls in the region that calls for an orientation adjustment of the robotic end-effector, the small assembly tolerance leads to more accurate position alignment. Thus, higher trajectory efficiency can be achieved with a small assembly tolerance between the hexagonal bolt hole and the wrench.

    4.Conclusion

    To improve the assembly efficiency of robots, the efficiency improvement of trajectory planning in wrench insertion is formulated as an objective optimization problem based on RL. To achieve this objective optimization through RL, a novel policy iteration method is proposed. For the start point of policy iteration, a policy iteration initialization method based on the vacancies of the state-action table is designed and applied. Compared with initialization modes dependent on exploring starts or a mild policy, this initialization method improves demonstration efficiency by finishing the state-action traverse in a limited number of demonstrations. In addition, policy iteration based on the protective policy can output the generated policy while avoiding unpredictable risks.

    In the robotic experiments, the conservative policy and the generated policy are compared under the condition of noncontact robot demonstration with human supervision. Over each set of 10 experiments, the average trajectory efficiencies of the generated policy are 58.56% for the 0.09 mm assembly tolerance and 82.00% for the 0.05 mm assembly tolerance, which are higher than those of the conservative policy (55.25% for the 0.09 mm assembly tolerance and 62.91% for the 0.05 mm assembly tolerance). Although the trajectory efficiency is affected by the first orientation decision, the generated policy shows a clear improvement in trajectory efficiency compared with the conservative one for both assembly tolerances in this paper.

    In the future, the state-action space could be refined in the targeted region to further improve trajectory efficiency, and the influence of the first orientation adjustment decision should be eliminated. In addition, this method guarantees the feasibility of wrench insertion into the bolt by the robotic manipulator and makes it possible for robots to replace astronauts in extravehicular activities.

    Declaration of Competing Interest

    The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

    Acknowledgements

    This study was supported by the National Natural Science Foundation of China (No. 91848202) and the Special Foundation (Pre-Station) of China Postdoctoral Science (No. 2021TQ0089).
